Term
clustering provides a means of what? |
|
Definition
finding structure in a collection of unlabeled data |
|
|
Term
|
Definition
determine intrinsic groupings
1. classification
2. simplification
3. to create populations of types for downstream analysis |
|
|
Term
two kinds of metrics to calculate similarity
give example of each |
|
Definition
statistical
(Pearson correlation)
geometric
(Euclidean distance) |
|
|
Term
Pearson Correlation
tells you what two things about correlation?
|
|
Definition
tells you the degree (strength) and direction (sign) of the correlation
measures the linear dependence between two variables X and Y, giving a value between −1 and +1 inclusive |
|
|
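To make the difference between a statistical and a geometric metric concrete, here is a minimal sketch in plain Python. The function names and the sample vectors are illustrative assumptions, not part of the original cards. Pearson correlation compares the shape of two profiles, while Euclidean distance compares their absolute values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and the two sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def euclidean(x, y):
    """Straight-line (geometric) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical profiles: y_up has the same shape as x at a larger scale,
# y_down has the opposite trend but similar magnitude.
x = [1.0, 2.0, 3.0, 4.0]
y_up = [10.0, 20.0, 30.0, 40.0]
y_down = [4.0, 3.0, 2.0, 1.0]

r_up = pearson(x, y_up)        # perfect positive linear relationship
r_down = pearson(x, y_down)    # perfect negative linear relationship
d_up = euclidean(x, y_up)
d_down = euclidean(x, y_down)
```

In this example the two metrics disagree: Pearson rates `y_up` as the closest match to `x`, whereas Euclidean distance rates `y_down` as closer, which is why the choice of metric matters before clustering.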
Term
distance metrics
if you want to emphasize the most deviant conditions, what method do you use to measure distance? |
|
Definition
|
|
Term
k-means clustering |
Definition
method of cluster analysis in which n objects are partitioned into k clusters, with each object belonging to the cluster with the nearest mean |
|
|
Term
k-means clustering
advantages and disadvantages
|
|
Definition
advantages:
simplicity
speed
can use on large datasets
disadvantages:
can give different results on each run, because the starting centroids are picked at random |
|
|
Term
in k-means clustering, how is each cluster primarily defined? |
|
Definition
|
|
Term
centroid |
Definition
mean over all cluster members for each dimension |
|
|
Term
steps to k-means clustering |
|
Definition
1. pick K random points and define them as the cluster centroids
2. assign each point to the cluster with the closest centroid (associate every observation with the nearest mean)
3. recalculate the means (centroids)
4. repeat steps 2 and 3 until you reach the maximum number of iterations OR the centroids no longer move when recalculated |
|
|
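The four steps on this card can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `seed` parameter, the choice of squared Euclidean distance, and the sample points are all assumptions made for the example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k random data points as the initial centroids
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every point to the cluster with the nearest centroid
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: recompute each centroid as the per-dimension mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid
                new_centroids.append(centroids[c])
        # Step 4: stop early once the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

Changing `seed` changes the starting centroids. On well-separated data like this the final grouping is the same, but on harder data different seeds can converge to different clusterings.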
Term
why are outliers a problem in k-means clustering? |
|
Definition
Every case is forced to join a cluster no matter how atypical or remote it might be, with the result that the classification can be substantially distorted.
|
|
|
Term
|
Definition
find successive clusters using previously established clusters |
|
|
Term
two main classes of clustering: |
|
Definition
|
|
Term
|
Definition
k-means clustering
repeat a single procedure until it can no longer be applied, then see what results it produces.
at each step the procedure tries to maximize the return based on local conditions, in the hope that this leads to a good outcome for the global problem.
in some cases such a strategy is guaranteed to give an optimal solution; in others it yields a compromise that is an acceptable approximation. |
|
|
Term
how to measure distance between two groups in agglomerative hierarchical clustering
3 ways |
|
Definition
average linkage
single linkage
complete linkage |
|
|
Term
single linkage |
Definition
way to calculate the distance between two groups in agglomerative hierarchical clustering
minimum distance between any pair of points, one taken from each cluster |
|
|
Term
complete linkage |
Definition
way to calculate the distance between two groups in agglomerative hierarchical clustering
maximum distance between any pair of points, one taken from each cluster |
|
|
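The three linkage rules differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance and two small hypothetical clusters (the function names are illustrative):

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """All distances between a point of cluster_a and a point of cluster_b."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    # Minimum distance over all cross-cluster pairs
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    # Maximum distance over all cross-cluster pairs
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    # Mean distance over all cross-cluster pairs
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

# Hypothetical clusters on a line; cross-cluster distances are 4, 6, 3 and 5
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

d_single = single_linkage(cluster_a, cluster_b)      # 3.0
d_complete = complete_linkage(cluster_a, cluster_b)  # 6.0
d_average = average_linkage(cluster_a, cluster_b)    # 4.5
```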
Term
how do you denote clusters in a dendrogram? |
|
Definition
|
|
Term
Silhouette
what s(i) values of 1, -1, and 0 tell you |
|
Definition
method for interpreting and validating clusters of data
provides a succinct graphical representation of how well each object lies within its cluster
s(i) near 1: the datum is in the appropriate cluster
s(i) near 0: the datum lies between two clusters
s(i) near -1: the datum should be in the neighboring cluster |
|
|
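The cards do not give the formula, so the sketch below assumes the standard definition s(i) = (b - a) / max(a, b), where a is the mean distance from point i to the other members of its own cluster and b is the smallest mean distance from point i to the members of any other cluster. The function names, the singleton convention, and the sample data are assumptions for illustration.

```python
import math

def silhouette(i, points, labels):
    """Silhouette value s(i) of point i under the given cluster labels."""
    own = labels[i]
    same = [math.dist(points[i], p)
            for j, p in enumerate(points) if labels[j] == own and j != i]
    if not same:
        return 0.0  # assumed convention: a singleton cluster gets s(i) = 0
    a = sum(same) / len(same)
    # b: mean distance to the nearest neighboring cluster
    b = min(
        sum(math.dist(points[i], p)
            for p, lab in zip(points, labels) if lab == other) / labels.count(other)
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)

def mean_silhouette(points, labels):
    """Average s(i) over the whole dataset."""
    return sum(silhouette(i, points, labels) for i in range(len(points))) / len(points)

# Hypothetical 1-D data with two obvious groups
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good_labels = [0, 0, 1, 1]   # sensible clustering
bad_labels = [0, 0, 0, 1]    # point 10.0 is put in the wrong cluster

s_good = silhouette(0, points, good_labels)   # close to 1
s_bad = silhouette(2, points, bad_labels)     # close to -1
avg_good = mean_silhouette(points, good_labels)
avg_bad = mean_silhouette(points, bad_labels)
```

Comparing `avg_good` with `avg_bad` shows how the dataset-wide average drops when a point is assigned to the wrong cluster.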
Term
what can computing the average silhouette value s(i) of a cluster and of the entire dataset tell you? |
|
Definition
average s(i) of a cluster: how tightly grouped all the data in the cluster are
average s(i) of entire data set: measure of how appropriately the data has been clustered |
|
|
Term
silhouette plots and averages are a powerful tool for determining what? |
|
Definition
the natural number of clusters within a dataset. |
|
|
Term
PAM
partitioning around medoids |
|
Definition
a more robust alternative to k-means
uses medoids (actual data points) instead of centroids as cluster centers
the # of clusters is usually chosen from the data (often using the silhouette)
|
|
|
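One way to see why medoids are more robust than centroids is to compute both for a cluster that contains an outlier. This is a small sketch, not the full PAM algorithm; the helper names and the data are hypothetical, and Euclidean distance is assumed.

```python
import math

def centroid(points):
    """Per-dimension mean of the cluster members (need not be a data point)."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

def medoid(points):
    """The member with the smallest total distance to all other members."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Hypothetical cluster: four tightly grouped points plus one remote outlier
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (50.0, 50.0)]

center_mean = centroid(cluster)   # dragged toward the outlier
center_medoid = medoid(cluster)   # stays inside the tight group
```

Here the centroid lands near (11.2, 11.2), far from every real observation, while the medoid is one of the four grouped points.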
Term
model based clustering
two characteristics |
|
Definition
strategy for determining the # of clusters and cluster membership
two characteristics:
1. fits Gaussian distributions to the data
2. uses BIC to pick the # of distributions |
|
|
Term
Bayesian information criterion
BIC |
|
Definition
model-selection criterion that trades goodness of fit against model complexity; helps determine the best clustering model along with the # of clusters |
|
|
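For reference, the usual textbook form of BIC is given below. The symbols are assumptions introduced here, not taken from the cards: k is the number of free parameters in the model, n is the number of observations, and \hat{L} is the maximized likelihood of the fitted model.

```latex
\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})
```

In this form a lower BIC indicates a better balance of fit and complexity. Some software reports the quantity with the opposite sign, in which case the largest value is preferred, so it is worth checking the convention before comparing models.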