Shared Flashcard Set

Details

Title

Test 2

Subject

Business

Level

Graduate

Created

03/21/2014

Term

What is the most general of the automatic clustering techniques

Definition

K-means clustering

Term

Cluster detection is a ___ data mining technique? Why?

Definition

undirected
It finds patterns in the data without a target variable.

Term

define a similarity cluster

Definition

grouping records into clusters based on their similarity.
E.g., credit card customers who:
maintain a high balance
use the card often
use the card once in a while for a large purchase

Term

what is the cluster centroid

Definition

the average position of cluster members in each dimension

Term

___ is the most commonly used clustering algorithm

Definition

K-means.

Term

what does the "K" in K means clustering refer to?

Definition

the algorithm looks for a fixed number of clusters. K is specified by the user.

Term

the best assignment of cluster centers could be defined as:

Definition

one that minimizes the sum of the distance from every data point to its nearest cluster center

Term

k means uses an algorithm that alternates between two steps, ___ and ___

Definition

assignment and update
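
The two alternating steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the random choice of initial seeds, the sample points, and the rule for empty clusters are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points (tuples of floats) into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: seeds are k randomly chosen records
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # assumption: keep an empty cluster's center
        if new_centers == centers:
            break  # no center moved, so the assignments are stable
        centers = new_centers
    return centers, labels


# Hypothetical data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Because K is fixed by the user, the loop never creates or removes clusters; it only moves the K centers.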

Term

in K-means clustering, the cluster centers can be used to define a ___: describe this.

Definition

Voronoi diagram: a diagram whose lines mark the points that are equidistant from the two nearest seeds (cluster centers).

Term

how does clustering ID outliers?

Definition

A record that is beyond a threshold distance from the cluster.

Term

what are the two factors used when interpreting clusters?

Definition

1. what do cluster members have in common?
2. what distinguishes each cluster from the others?

Term

What are the cluster characteristics?

Definition

Diameter: maximum distance between two records in one cluster
Variance: sum of the squared distance from the centroid of cluster members
Silhouette: measure of cluster dispersion or goodness
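
Diameter and variance, as defined on this card, can be computed directly from the cluster members. The sketch below assumes Euclidean distance and uses made-up points; the silhouette is left out because the card does not give a formula for it.

```python
import math
from itertools import combinations


def centroid(members):
    """Average position of the members in each dimension."""
    return tuple(sum(dim) / len(members) for dim in zip(*members))


def diameter(members):
    """Maximum distance between any two records in the cluster."""
    return max(math.dist(a, b) for a, b in combinations(members, 2))


def variance(members):
    """Sum of squared distances from each member to the centroid (the card's definition)."""
    center = centroid(members)
    return sum(math.dist(p, center) ** 2 for p in members)


# Hypothetical cluster: the four corners of a 2 x 2 square.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(centroid(cluster), diameter(cluster), variance(cluster))
```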

Term

What can you do with a silhouette score?

Definition

choose an appropriate value of K
compare clusters produced by different random seeds
remove strong clusters for further analysis

Term

When clustering, you should impose a ___. Why

Definition

maximum cluster diameter. Otherwise the cluster keeps growing.

Term

what are the related variations of k means

Definition

k medians
k medoids
k modes

Term

what does k median clustering do?

Definition

looks for the set of centroids that minimizes the sum of distances from cluster members to cluster centroids.

(it tightens the grouping)

Term

k medians is less sensitive to ___

Definition

outliers
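
A quick way to see why: a single extreme record drags the mean a long way but barely moves the median. The balances below are made-up values for illustration.

```python
from statistics import mean, median

# Hypothetical balances; the last value is an outlier.
balances = [10.0, 12.0, 11.0, 13.0, 500.0]

print(mean(balances))    # pulled far toward the outlier
print(median(balances))  # stays with the bulk of the data
```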

Term

what does k medoids clustering do?

Definition

like k-means, it alternates assignment and update, but each new center is the cluster member that best represents its cluster (the medoid), so a center is always an actual record.

Term

define scaling and weighting

Definition

scaling: adjusts the values of variables to take into account that different variables are measured in different units/ranges

weighting: is encoding the information that one variable is more or less important than the others

Term

scaling variables is calculated by ___

Definition

z score
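
A minimal sketch of z-score scaling, assuming the population standard deviation is used (the card does not say which variant). The income and age values are invented; after scaling, both variables end up on the same unit-free range.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]


# Hypothetical variables measured in very different units.
incomes = [30000.0, 45000.0, 60000.0, 75000.0, 90000.0]
ages = [25.0, 35.0, 45.0, 55.0, 65.0]

print(z_scores(incomes))
print(z_scores(ages))
```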

Term

what are the 3 types of clustering and their description?

Definition

K-means: Start with fixed number of clusters, and make clusters based on a criterion.

divisive: start with one cluster and keep splitting until some stopping rule is triggered

hierarchical (agglomerative): starts with every record in its own cluster and merges them step by step

Term

what implements hierarchical clustering?

Definition

Ward's method

Term

how does hierarchical clustering work?

Definition

starts with every record in its own cluster and gradually merges them, forming larger groupings. Continues until all records are in one cluster.
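
The merge loop can be sketched as follows. Note the swap: this example uses single linkage (distance between the closest pair of members) because it is the shortest rule to write down; it is not Ward's method. The function name `agglomerate` and the sample points are assumptions.

```python
import math


def agglomerate(points):
    """Repeatedly merge the two closest clusters; returns the list of merges."""
    clusters = [[p] for p in points]  # every record starts in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical one-dimensional records.
records = [(1.0,), (2.0,), (6.0,), (7.5,), (20.0,)]
for left, right, distance in agglomerate(records):
    print(left, "+", right, "at distance", distance)
```

The merge distances grow as the loop proceeds, which is what lets you cut the hierarchy at a chosen level instead of running it down to a single cluster.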

Term

what is principal component analysis used for?

Definition

finding an optimal way of combining variables into a small number of subsets.

Term

principal components are sensitive to units, therefore you should ___

Definition

standardize the inputs

Term

What are the strengths of K-Nearest Neighbors (KNN)?

Definition

Simple and effective

doesn't make underlying distributional assumptions

fast training phase

Term

what are the weaknesses of K-nearest neighbors(KNN)?

Definition

doesn't produce a model, so no insights in relationships

slow classification phase

memory intensive

nominal and missing data need additional processing

Term

Unlike K-means clustering, with K-Nearest Neighbors there is a ___

Definition

target variable

Term

[image]

Fill in the chart for each predicted/actual variable and per arrow.

Definition

[image]

Term

What are the arguments in the kNN classification syntax?

Definition

train: a data frame containing numeric training data

test: a data frame containing numeric test data

class: factor vector with the class for each row in the training data

k: an integer indicating the number of nearest neighbors
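
To make the four arguments concrete, here is a small Python sketch of the same idea. It is not the function the card describes: the name `knn`, the use of Euclidean distance, and the majority vote are assumptions, and the class argument is called `labels` because `class` is a reserved word in Python.

```python
import math
from collections import Counter


def knn(train, test, labels, k):
    """Predict a class for each test row by majority vote among its k nearest training rows."""
    predictions = []
    for row in test:
        # Rank training rows by Euclidean distance to the test row, keep the k closest.
        nearest = sorted(
            range(len(train)), key=lambda i: math.dist(row, train[i])
        )[:k]
        votes = Counter(labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Hypothetical numeric training data with one class label per row.
train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["low", "low", "low", "high", "high", "high"]
test = [(1.1, 1.0), (5.1, 5.0)]

print(knn(train, test, labels, k=3))
```

Nothing is fitted ahead of time: all the work happens when a test row is classified, which matches the fast training phase and slow classification phase listed on the earlier cards.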

Term

in an artificial neural network, network topology describes:

Definition

the number of layers

number of nodes in each layer

if info is allowed to travel backward

Term

in an artificial neural network, training algorithm specifies:

Definition

how connection weights are set in order to inhibit or excite neurons in proportion to the input signal

Term

What type of data mining can artificial neural networks be used for?

Definition

classification

numeric prediction

unsupervised pattern recognition

Term

what activation functions are often used in artificial neural networks?

Definition

sigmoid activation function (output ranges from 0-1)

Radial basis function

linear activation function (results in a network similar to linear regression model)
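
The three functions can be written out as below. This is a sketch: the Gaussian form of the radial basis function and its `center` and `width` parameters are assumptions, since the card does not specify them.

```python
import math


def sigmoid(x):
    """Squashes any input into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(x):
    """Passes the input through unchanged."""
    return x


def gaussian_rbf(x, center=0.0, width=1.0):
    """Gaussian radial basis function; center and width are illustrative parameters."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))


for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), linear(x), gaussian_rbf(x))
```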

Term

what are some strengths of artificial neural networks?

Definition

adapted to classification or numeric prediction problems

among most accurate modeling approaches

makes few assumptions about data's underlying relationships

Term

what are some weaknesses of artificial neural networks

Definition

computationally intensive and slow to train

easy to over- and underfit training data

difficult or impossible to interpret

Term

what is the goal of a support vector machine?

Definition

to create a flat boundary called a hyperplane, which leads to partitions of data on either side

Term

what is the maximum margin hyperplane?

Definition

a hyperplane that creates the greatest separation between two classes

support vectors are the points from each class that are closest to the maximum margin hyperplane (each class must have at least one support vector, but can have more)

Term

What is a key feature of support vector machines

Definition

the support vectors provide a very compact way to store a classification model

Term

with support vector machines with non-linear kernels, what must you do with the data?

Definition

standardize variables

convert nominal variables to dummies and ordinal to scale.

Term

what are some commonly used kernels?

Definition

linear

polynomial

sigmoid

Gaussian RBF kernel

Term

What is market basket analysis. Give an example of when it could be used

Definition

set of association rules that specify patterns of relationships among items in transactional data.

How many times peanut butter and jelly were purchased at the same time as bread

Term

how does market basket analysis form a set? What does the set mean?

Definition

with brackets {peanut butter, jelly}

it means that the item set appears in the data with some regularity

Term

what are association rules in market basket analysis used for?

Definition

They are used for unsupervised knowledge discovery in large databases, NOT for prediction.

Term

what are the strengths of market basket analysis?

Definition

good for large amounts of transactional data

results in rules that are easy to understand

good for discovering unexpected knowledge in databases

Term

what are the weaknesses of market basket analysis?

Definition

not good with small datasets

takes effort to separate the insight from the common sense

easy to draw spurious conclusions from random patterns

Term

what is the end result of the Apriori algorithm?

Definition

It reduces the association rule search space. It does this by relying on the Apriori property: all subsets of a frequent itemset must also be frequent, so any itemset with an infrequent subset can be skipped.
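
The pruning idea can be sketched in Python as follows. This is a simplified illustration rather than the full Apriori algorithm as implemented in any library; the function name `frequent_itemsets`, the baskets, and the 0.4 support threshold are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset that meets min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    level = {frozenset([item]) for b in baskets for item in b}
    result = {}
    size = 1
    while level:
        kept = {s: support(s) for s in level}
        kept = {s: v for s, v in kept.items() if v >= min_support}
        result.update(kept)
        size += 1
        # Join frequent sets into larger candidates, then prune any candidate
        # that has an infrequent subset: it cannot be frequent itself.
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        level = {
            c for c in candidates
            if all(frozenset(sub) in kept for sub in combinations(c, size - 1))
        }
    return result


# Hypothetical transactions.
baskets = [
    ["bread", "peanut butter", "jelly"],
    ["bread", "peanut butter"],
    ["bread", "milk"],
    ["milk", "jelly"],
    ["bread", "peanut butter", "jelly"],
]
for itemset, value in frequent_itemsets(baskets, min_support=0.4).items():
    print(sorted(itemset), value)
```

Here {bread, milk} falls below the threshold, so no larger itemset containing both bread and milk is ever counted.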

Term

how do you calculate support of an item in market basket analysis?

Definition

Support(x) = count(x)/n
x = number of transactions the itemset appears in
n = total transactions

Term

in market basket analysis, what is the definition of confidence

Definition

it is a measurement of a rule's predictive power or accuracy.

Term

for market basket analysis, describe the following arules functions:

inspect()
itemFrequency()
itemFrequencyPlot()
image()

Definition

inspect() = looks at the contents of the sparse matrix

itemFrequency() = lets you see the proportion of transactions that contain an item

itemFrequencyPlot() = produces a bar chart depicting the proportion of transactions that contain certain items (shows support)

image() = visualizes the sparse matrix, which helps with identification of potential data issues

Term

with market basket analysis, what are the issues with low and high confidence?

Definition

low = leads to many unreliable rules

high = leads to obvious or inevitable rules (smoke detector purchased with batteries)

Term

in market basket analysis, what is lift?

Definition

a measure of how much more likely one item is to be purchased relative to its typical purchase rate, given that you know another item has been purchased.

Term

in market basket analysis, what does a lift greater than 1 mean?

Definition

it suggests that items are found together more often than chance occurrence (it isn't by chance they are bought together)

Term

Define the following
True positive (TP)
True Negative (TN)
False positive (FP)
false negative (FN)

Definition

TP = correctly classified as the class of interest

TN = correctly classified as not the class of interest

FP = incorrectly classified as the class of interest

FN = incorrectly classified as not the class of interest

Term

sensitivity is known as ___

Definition

true positive rate

Term

specificity is known as ___

Definition

true negative rate

Term

in evaluating a model, what is a Type 1 error

Definition

When you predicted/expected a positive, but the actual value is negative (a false positive). Known as crying wolf.

Term

in evaluating a model, what is a Type 2 error

Definition

When you predicted/expected a negative, but the actual value is positive (a false negative). AKA missed detection.

Term

What is a kappa statistic.
Describe the agreement rates

Definition

it adjusts accuracy by accounting for the possibility of a correct prediction by chance alone.

Max value = 1
.8-1.0 = very good agreement
.6-.8 = good agreement
.4-.6 = moderate agreement
.2-.4 = fair agreement
less than .2 = poor agreement
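
For a two-class confusion matrix, the statistic can be computed as sketched below. The card gives no formula, so this uses the standard Cohen's kappa, (observed agreement - expected agreement) / (1 - expected agreement); the counts in the example are invented.

```python
def kappa(tp, tn, fp, fn):
    """Cohen's kappa for a 2 x 2 confusion matrix."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n  # plain accuracy
    # Agreement expected by chance: both say positive, or both say negative.
    expected = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    return (observed - expected) / (1 - expected)


# Hypothetical counts: 85% accuracy, but half of that is expected by chance.
print(kappa(tp=40, tn=45, fp=5, fn=10))
```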

Term

what is an F score?

Definition

a measure of model performance that combines precision and recall into a single number
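
The card does not state the formula; a common choice, assumed here, is the harmonic mean of precision and recall (often called the F1 score). The counts are invented for illustration.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is the same quantity as sensitivity
    return 2 * precision * recall / (precision + recall)


print(f_score(tp=40, fp=5, fn=10))
```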

Term

what is the goal of text analysis

Definition

Term

with text analytics, what is a tag.

Definition

it refers to information associated with a text document. It is info ABOUT the document, but not part of the document itself (metadata).

Term

During text analysis, how does the bag-of-words approach function?

Definition

It examines each word individually and without context (not tied into another word, or part of a sentence)
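
In code, a bag of words is little more than a word count. The sketch below assumes a simple lowercase, letters-only tokenizer; the sentence is made up.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count each word on its own, ignoring order and context."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


bag = bag_of_words("The food was good, but the service was not good.")
print(bag)
```

The count for "good" is 2 even though one occurrence follows "not", which shows the context that this approach throws away.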

Term

what are the commonly used ways of text analysis.

Definition

bag of words

natural language processing

Term

What is the end product of text analytics?

Definition

word cloud

Term

What is sentiment analysis?

Definition

it is text analytics with a purpose.

use of text measures to learn about the past and make predictions about the future.

opinion mining

Term

What are some design text measures (that work)?

Definition

list based

item-weighted

models for text classification

training and test regimen in evaluation

Term

with sentiment analysis, what are the two lists that are computed?

Definition

Positive: the % of words in the review that match up with the positive word list

Negative: the % of words in the review that match up with the negative word list
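
The two percentages can be computed as sketched here. The word lists are tiny hypothetical stand-ins for real sentiment lists, and the tokenizer is an assumption; the function expects a review with at least one word.

```python
import re

POSITIVE = {"good", "great", "excellent", "love"}  # hypothetical positive word list
NEGATIVE = {"bad", "poor", "terrible", "hate"}     # hypothetical negative word list


def sentiment_percentages(review):
    """Return (positive %, negative %) of the words in a non-empty review."""
    words = re.findall(r"[a-z']+", review.lower())
    positive = 100.0 * sum(w in POSITIVE for w in words) / len(words)
    negative = 100.0 * sum(w in NEGATIVE for w in words) / len(words)
    return positive, negative


review = "Great plot and excellent acting but a truly terrible ending"
print(sentiment_percentages(review))
```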

Term

In sentiment analysis, there are six measures and modeling techniques. What are they?

Definition

Simple difference: difference scores (positive minus negative scores)

regression difference: use linear regression to determine the weights for combining the positive and negative scores into a predictor of ratings.

word/item analysis: use original 50 words and training data to ID positive/negative leaning words. Then +1/-1 accordingly.

Logistic regression: stepwise logistic regression to select useful predictors from the set of 50 sentiment words.

Support vector machines: effective technique in text classification problems with large numbers of explanatory variables

Random forests: ensemble method that uses thousands of tree structured classifiers to arrive at a single prediction

Term

How do you calculate support when targeting a single item during transactions?

Definition

Support(x)= count(x)/N
X = targeted Item that appeared in purchases
N = total transactions

Term

How do you calculate confidence when targeting items purchased together during transactions?

Definition

Confidence(x,y) = support(x,y)/support(x)
support(x,y) = support of the targeted items x and y purchased together
support(x) = support of the individual item x

Term

How do you calculate lift for market basket analysis?

Definition

Lift(x,y) = Confidence(x,y)/support(y)
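
The support, confidence, and lift formulas from these cards translate directly into code. The sketch below uses invented baskets, and the function names are assumptions chosen to mirror the formulas.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Confidence(x, y) = support(x, y) / support(x)."""
    return support(transactions, [x, y]) / support(transactions, [x])


def lift(transactions, x, y):
    """Lift(x, y) = confidence(x, y) / support(y)."""
    return confidence(transactions, x, y) / support(transactions, [y])


# Hypothetical transactions.
baskets = [
    ["bread", "peanut butter", "jelly"],
    ["bread", "peanut butter"],
    ["bread", "milk"],
    ["milk", "jelly"],
    ["bread", "peanut butter", "jelly"],
]

print(support(baskets, ["peanut butter"]))
print(confidence(baskets, "peanut butter", "jelly"))
print(lift(baskets, "peanut butter", "jelly"))
```

In this made-up data the lift for peanut butter and jelly comes out above 1, meaning the two appear together more often than their individual purchase rates would suggest.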

Term

how do you calculate specificity?

Definition

(TN)/(TN+FP)

Term

how do you calculate sensitivity?

Definition

(TP)/(TP+FN)

Term

How do you calculate precision?

Definition

(TP)/(TP+FP)

Term

How do you calculate accuracy?

Definition

(TP+TN)/(TP+TN+FP+FN)

Term

How do you calculate error rate?

Definition

1-Accuracy
Accuracy = (TP+TN)/(TP+TN+FP+FN)
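
The formulas on the last five cards can be collected into one helper. The function name `confusion_metrics` and the counts are assumptions made for illustration.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute the evaluation measures from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
    }


# Hypothetical counts.
for name, value in confusion_metrics(tp=40, tn=45, fp=5, fn=10).items():
    print(name, round(value, 3))
```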
