| Term 
 
        |       clustering provides a means of what? |  | Definition 
 
        |       finding structure in a collection of unlabeled data |  | 
        |  | 
        
        | Term 
 | Definition 
 
        | determine intrinsic groupings, for:   1. classification   2. simplification   3. creating populations of types for downstream analysis |  | 
        |  | 
        
        | Term 
 
        |       two kinds of metrics used to calculate similarity (give an example of each) |  | Definition 
 
        |     1. statistical (e.g., Pearson correlation)   2. geometric (e.g., Euclidean distance) |  | 
        |  | 
        
        | Term 
 
        |       Pearson Correlation tells you what two things about correlation?
 |  | Definition 
 
        | the degree (strength) and the direction of the correlation   it measures the linear dependence between two variables X and Y, giving a value between −1 and +1 inclusive. |  | 
        |  | 
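
The two kinds of similarity metric on the cards above can be compared on the same data. The following is a minimal sketch in plain Python, not taken from the original notes; the function names `pearson` and `euclidean` and the sample vectors are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (value in [-1, +1])."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: co-variation of x and y around their means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: product of the two spreads (undefined if either is constant).
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def euclidean(x, y):
    """Straight-line (geometric) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical profiles: same shape at different scales, and a reversed shape.
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [10.0, 20.0, 30.0, 40.0]
profile_c = [4.0, 3.0, 2.0, 1.0]

print(pearson(profile_a, profile_b), euclidean(profile_a, profile_b))
print(pearson(profile_a, profile_c), euclidean(profile_a, profile_c))
```

Here `profile_a` and `profile_b` are perfectly correlated yet far apart geometrically, which is why the choice of metric changes which items end up grouped together.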
        
        | Term 
 
        |       distance metrics: if you care most about the most deviant conditions, which method should you use to measure distance? |  | Definition 
 | 
        |  | 
        
        | Term 
 | Definition 
 
        |       method of cluster analysis in which n objects are grouped into k clusters, with each object belonging to the cluster with the nearest mean |  | 
        |  | 
        
        | Term 
 
        |       k-means clustering: advantages and disadvantages
 |  | Definition 
 
        |   advantages: simplicity; speed; can be used on large datasets   disadvantages: can give different results on each run, because the starting centroids are chosen at random |  | 
        |  | 
        
        | Term 
 
        |       in k-means clustering, how is each cluster primarily defined? |  | Definition 
 | 
        |  | 
        
        | Term 
 | Definition 
 
        |       mean over all cluster members for each dimension |  | 
        |  | 
        
        | Term 
 
        |       steps to k-means clustering |  | Definition 
 
        | 1. pick K random points and define them as the cluster centroids   2. add each point to the cluster with the closest centroid (associate every observation with the nearest mean)   3. recalculate the means   4. repeat steps 2 and 3 until you reach the maximum number of iterations OR until the centroids no longer move when you recalculate them |  | 
        |  | 
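
The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the `seed` parameter, the choice to draw the K starting centroids from the data points themselves, and the sample data are all assumptions.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])),
    )

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick K random points as the initial centroids (assumed: drawn from the data).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):  # Step 4: stop at the iteration cap...
        # Step 2: assign every observation to the nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Step 3: recalculate each centroid as the per-dimension mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        # ...or as soon as the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Changing `seed` changes the starting centroids. On data less cleanly separated than this, that can change the final clusters, which is the run-to-run variability listed among the disadvantages.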
        
        | Term 
 
        |       why are outliers a problem in k-means clustering? |  | Definition 
 
        |       Every case is forced to join a cluster no matter how atypical or remote it might be, with the result that the classification can be substantially distorted. 
 |  | 
        |  | 
        
        | Term 
 | Definition 
 
        |       find successive clusters using previously established clusters |  | 
        |  | 
        
        | Term 
 
        |       two main classes of clustering:  |  | Definition 
 | 
        |  | 
        
        | Term 
 | Definition 
 
        | k-means clustering   repeat a single procedure until it can no longer be applied, then see what results it produces.   the procedure tries to maximize the return based on local conditions, in the hope that this leads to a good outcome for the global problem.   in some cases such a strategy is guaranteed to give an optimal solution; in others it gives a compromise that is an acceptable approximation. |  | 
        |  | 
        
        | Term 
 
        |   three ways to measure the distance between two groups in agglomerative hierarchical clustering |  | Definition 
 
        |   1. average linkage   2. single linkage   3. complete linkage |  | 
        |  | 
        
        | Term 
 | Definition 
 
        |       way to calculate the distance between two groups in agglomerative hierarchical clustering:   the minimum distance between any pair of points, one taken from each cluster |  | 
        |  | 
        
        | Term 
 | Definition 
 
        |         way to calculate the distance between two groups in agglomerative hierarchical clustering:   the maximum distance between any pair of points, one taken from each cluster |  | 
        |  | 
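
The minimum and maximum rules on the two cards above can be written directly from their definitions. The sketch below is an illustration under stated assumptions: the function names and sample groups are hypothetical, Euclidean distance is assumed as the point-to-point metric, and average linkage (listed earlier but not defined in these notes) is taken in its usual sense, the mean of all between-group pairwise distances.

```python
import math
from itertools import product

def pairwise_distances(group_a, group_b):
    """All distances between one point from group_a and one point from group_b."""
    return [math.dist(p, q) for p, q in product(group_a, group_b)]

def min_pair_distance(group_a, group_b):
    """Minimum distance between any pair of points, one from each cluster."""
    return min(pairwise_distances(group_a, group_b))

def max_pair_distance(group_a, group_b):
    """Maximum distance between any pair of points, one from each cluster."""
    return max(pairwise_distances(group_a, group_b))

def average_pair_distance(group_a, group_b):
    """Mean of all between-cluster pairwise distances (assumed definition)."""
    distances = pairwise_distances(group_a, group_b)
    return sum(distances) / len(distances)

# Hypothetical clusters.
left = [(0, 0), (0, 1)]
right = [(3, 0), (4, 0)]

print(min_pair_distance(left, right))
print(average_pair_distance(left, right))
print(max_pair_distance(left, right))
```

The average always lies between the minimum and the maximum, so the three rules can rank the same pair of clusters as closer or farther apart depending on which one is used.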
        
        | Term 
 
        |         how do you denote clusters in a dendrogram? |  | Definition 
 | 
        |  | 
        
        | Term 
 
        |         silhouette: what do s(i) values of 1, 0, and −1 tell you? |  | Definition 
 
        |   method of interpreting and validating clusters of data   provides a succinct graphical representation of how well each object lies within its cluster.   s(i) = 1: the datum is in the appropriate cluster   s(i) = 0: the datum lies between two clusters   s(i) = −1: the datum should be in the neighboring cluster |  | 
        |  | 
        
        | Term 
 
        |   what can computing the average silhouette value s(i) of a cluster and of the entire dataset tell you? |  | Definition 
 
        |   average s(i) of a cluster: how tightly grouped all the data in the cluster are   average s(i) of entire data set: measure of how appropriately the data has been clustered |  | 
        |  | 
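
These notes do not give the formula for s(i). The sketch below assumes the standard silhouette definition: a(i) is the mean distance from point i to the other members of its own cluster, b(i) is the smallest mean distance from i to the members of any other cluster, and s(i) = (b(i) − a(i)) / max(a(i), b(i)). The function names and sample data are hypothetical; Euclidean distance and at least two clusters are assumed.

```python
import math

def silhouette_values(points, labels):
    """Return s(i) for every point, assuming at least two distinct labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # common convention for a single-member cluster
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): mean distance to the nearest neighboring cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        values.append((b - a) / max(a, b))
    return values

def average_silhouette(points, labels):
    """Average s(i) over the entire dataset."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)

# Hypothetical data: two tight pairs, labeled sensibly and then with one point misassigned.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 0, 1, 0]

print(silhouette_values(pts, good_labels), average_silhouette(pts, good_labels))
print(silhouette_values(pts, bad_labels), average_silhouette(pts, bad_labels))
```

Comparing the dataset-wide average across several candidate values of k is one way to use this to choose the number of clusters.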
        
        | Term 
 
        |       silhouette plots and averages are a powerful tool for determining what? |  | Definition 
 
        |       the natural number of clusters within a dataset. |  | 
        |  | 
        
        | Term 
 
        |       PAM (partitioning around medoids) |  | Definition 
 
        |   a more robust variant of k-means   computes medoids instead of centroids as cluster centers   unlike k-means, it uses the data to define the number of clusters (often via silhouette)     |  | 
        |  | 
        
        | Term 
 
        |       model based clustering two characteristics |  | Definition 
 
        |       strategy for determining the number of clusters and cluster membership   two characteristics:   1. fits Gaussian distributions to the data   2. uses BIC to pick the number of distributions |  | 
        |  | 
        
        | Term 
 
        |         Bayesian information criterion (BIC) |  | Definition 
 
        |     helps determine the best clustering method along with the number of clusters |  | 
        |  |
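
These notes do not state the formula behind BIC. For reference, the usual definition is given below; the symbols are not from the original notes. Here $k$ is the number of free parameters in the fitted model, $n$ is the number of data points, and $\hat{L}$ is the maximized likelihood.

```latex
\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})
```

Under this sign convention a lower BIC is better: the $k \ln(n)$ term penalizes models with more parameters (for example, more Gaussian components), while the likelihood term rewards a closer fit. Some software reports the negated quantity, in which case a higher value is better, so check the convention before comparing numbers.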