Shared Flashcard Set

Details

Title

Lecture eleven

Description

Biocore: Clustering

Total Cards

Subject

Biology

Level

Graduate

Created

12/13/2009

Term

clustering provides a means of what?

Definition

finding structure in a collection of unlabeled data

Term

reasons for clustering

Definition

determine intrinsic groupings

1. classification

2. simplification

3. to create populations of types for downstream analysis

Term

two kinds of metrics to calculate similarity

give example of each

Definition

statistical

(Pearson correlation)

geometric

(Euclidean distance)

Term

Pearson Correlation

tells you what two things about correlation?

Definition

tells you degree and direction of correlation

measure of the correlation (linear dependence) between two variables X and Y, giving a value between +1 and −1 inclusive.

used as a measure of the strength of linear dependence between two variables.
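
As a rough illustration of "degree and direction", the sketch below computes Pearson's r by hand for two made-up expression profiles. The function name `pearson` and the sample values are assumptions for illustration, not part of the lecture.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles, invented for this example.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises as gene_a rises
gene_c = [8.0, 6.0, 4.0, 2.0]   # falls as gene_a rises

r_up = pearson(gene_a, gene_b)    # close to +1: strong, positive
r_down = pearson(gene_a, gene_c)  # close to -1: strong, negative
```

The sign gives the direction of the linear relationship and the magnitude gives its strength.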

Term

distance metrics

if paying attention to most deviant conditions, use what method to measure distance

Definition

Chebyshev distance
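
A small sketch of why Chebyshev distance tracks the most deviant condition: it is the largest absolute difference over all dimensions, whereas Euclidean distance combines every dimension. The function names and the two points are made up for illustration.

```python
import math

def euclidean(p, q):
    """Straight-line distance: every dimension contributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def chebyshev(p, q):
    """Largest absolute difference in any single dimension."""
    return max(abs(a - b) for a, b in zip(p, q))

# Hypothetical measurements over three conditions; the third differs most.
p = [0.0, 0.0, 0.0]
q = [1.0, 1.0, 5.0]

d_euc = euclidean(p, q)   # sqrt(1 + 1 + 25)
d_cheb = chebyshev(p, q)  # determined only by the third condition
```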

Term

k-means clustering

Definition

method of cluster analysis in which n objects are grouped into k clusters, with each object belonging to the cluster with the nearest mean

Term

k-means clustering

advantages and disadvantages

Definition

advantages:

simplicity

speed

can use on large datasets

disadvantages:

can give different results on each run, because the starting centroids are picked at random

Term

in k-means clustering, how is each cluster primarily defined?

Definition

via its centroid

Term

centroid

Definition

mean over all cluster members for each dimension
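
A minimal sketch of that definition, assuming each cluster member is a list of coordinates; the helper name `centroid` and the sample points are illustrative.

```python
def centroid(points):
    """Mean of the cluster members, taken separately for each dimension."""
    n = len(points)
    # zip(*points) regroups the coordinates by dimension.
    return [sum(dim) / n for dim in zip(*points)]

members = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
center = centroid(members)
```

Note that the centroid need not coincide with any actual data point.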

Term

steps to k-means clustering

Definition

1. pick K random points and define as cluster centroids

2. add each point to the cluster with the closest centroid (associate every observation with the nearest mean)

3. recalculate the means

4. repeat until you have reached the maximum number of iterations OR until the centroids no longer move when you recalculate them
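
The four steps above can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the choice of Euclidean distance, the handling of empty clusters, and the toy data are all assumptions.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k random data points and use them as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every point to the cluster with the closest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: recalculate each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    [sum(dim) / len(members) for dim in zip(*members)]
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: stop early once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated groups of three points each.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
          [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
centroids, labels = kmeans(points, k=2)
```

Changing `seed` changes the starting centroids, which is why separate runs on less cleanly separated data can end in different clusterings.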

Term

why are outliers a problem in k-means clustering?

Definition

Every case is forced to join a cluster no matter how atypical or remote it might be, with the result that the classification can be substantially distorted.

Term

hierarchical clustering

Definition

find successive clusters using previously established clusters

Term

two main classes of hierarchical clustering:

Definition

agglomerative

divisive

Term

greedy algorithm

example

Definition

k-means clustering

perform a single procedure over and over until it can't be done any more and see what kind of results it will produce.

the procedure tries to maximize the return based on examining local conditions, with the hope that the outcome will lead to a desired outcome for the global problem.

In some cases such a strategy is guaranteed to offer optimal solutions, and in some other cases it may provide a compromise that produces acceptable approximations.

Term

how to measure distance between two groups in agglomerative hierarchical clustering

3 ways

Definition

average linkage

single linkage

complete linkage

Term

single linkage

Definition

way to calculate the distance between two groups in agglomerative hierarchical clustering

minimum distance between any two points, one taken from each cluster

Term

complete linkage

Definition

way to calculate the distance between two groups in agglomerative hierarchical clustering

maximum distance between any two points, one taken from each cluster
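
The three linkage rules can be compared side by side in a short sketch. The function names and the two toy clusters are assumptions; average linkage is taken here in its usual sense, the mean of all between-cluster pairwise distances.

```python
import math

def pairwise(cluster_a, cluster_b):
    """All distances between a point in cluster_a and a point in cluster_b."""
    return [math.dist(p, q) for p in cluster_a for q in cluster_b]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))   # closest pair

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))   # farthest pair

def average_linkage(cluster_a, cluster_b):
    d = pairwise(cluster_a, cluster_b)
    return sum(d) / len(d)                       # mean over all pairs

a = [[0.0, 0.0], [1.0, 0.0]]
b = [[3.0, 0.0], [5.0, 0.0]]

d_single = single_linkage(a, b)
d_complete = complete_linkage(a, b)
d_average = average_linkage(a, b)
```

For these clusters the pairwise distances are 3, 5, 2 and 4, so the three rules give different answers for the same pair of groups.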

Term

how do you denote clusters in a dendrogram?

Definition

horizontal cuts

Term

Silhouette

what s(i) values of 1, -1, and 0 tell you

Definition

method of interpretation and validation of clusters of data

provides a succinct graphical representation of how well each object lies within its cluster.

value near 1 means the datum is in an appropriate cluster

value near 0 means the datum lies between two clusters

value near -1 means the datum should be in the neighboring cluster
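
For reference, the standard silhouette value is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from datum i to the other members of its own cluster and b(i) is the lowest mean distance from i to the members of any other cluster. The sketch below assumes Euclidean distance and the common convention that a datum alone in its cluster gets s(i) = 0; the function name and toy data are illustrative.

```python
import math

def silhouette(i, points, labels):
    """Silhouette value s(i) for the point at index i."""
    own = labels[i]
    same = [math.dist(points[i], p)
            for j, p in enumerate(points) if labels[j] == own and j != i]
    if not same:
        return 0.0  # convention for a cluster with a single member
    a = sum(same) / len(same)
    # b: mean distance to the nearest other cluster.
    b = min(
        sum(math.dist(points[i], p)
            for p, lab in zip(points, labels) if lab == other)
        / labels.count(other)
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]

good = silhouette(1, points, [0, 0, 1, 1])  # point 1 grouped with its neighbor
bad = silhouette(1, points, [0, 1, 1, 1])   # point 1 grouped with the far pair
```

With the sensible labeling the value is close to 1; with point 1 placed in the distant cluster it is close to -1.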

Term

what can computing the average silhouette distance of a cluster and of the entire dataset tell you?

Definition

average s(i) of a cluster: how tightly grouped all the data in the cluster are

average s(i) of entire data set: measure of how appropriately the data has been clustered

Term

silhouette plots and averages are a powerful tool for determining what?

Definition

the natural number of clusters within a dataset.

Term

PAM

partitioning around medoids

Definition

more robust k-means

computes medoids instead of centroids as cluster centers

the number of clusters is usually chosen from the data (often via the average silhouette), rather than fixed in advance as in k-means
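
To illustrate the difference between a medoid and a centroid: a medoid is an actual cluster member, the one with the smallest total distance to all other members, so a single outlier pulls it far less than it pulls the mean. The sketch below shows only that one idea, not the full PAM procedure; the function name and the data are assumptions.

```python
import math

def medoid(points):
    """Cluster member with the smallest summed distance to all members."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Three nearby points plus one outlier.
cluster = [[1.0, 1.0], [2.0, 1.0], [1.5, 2.0], [20.0, 20.0]]

center_medoid = medoid(cluster)
# For comparison: the centroid (per-dimension mean) is dragged toward the outlier.
center_mean = [sum(dim) / len(cluster) for dim in zip(*cluster)]
```

Here the medoid stays among the three nearby points, while the mean lands at a location where no data point exists.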

Term

model based clustering

two characteristics

Definition

strategy for determination of # of clusters and cluster membership

two characteristics:

1. fits Gaussian distributions to the data

2. uses BIC to pick # of distributions

Term

Bayesian information criterion

BIC

Definition

helps determine the best clustering method along with the # of clusters
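
For reference, one common way of writing the criterion is given below. The symbols are notation chosen here, not taken from the lecture: m is the number of free parameters in the model, n the number of observations, and \hat{L} the maximized likelihood.

```latex
\mathrm{BIC} = m \ln(n) - 2 \ln(\hat{L})
```

In this form a lower BIC is preferred; the m ln(n) term penalizes models with more parameters, such as those with more Gaussian components. Some software reports the negated quantity, in which case a higher value is preferred.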
