Term
What is the most general of the automatic clustering techniques |
|
Definition
|
|
Term
Cluster detection is a ___ data mining technique? Why? |
|
Definition
Undirected. It finds patterns in the data without a target variable. |
|
|
Term
define a similarity cluster |
|
Definition
Grouping records based on their similarity. E.g., credit card customers who:
maintain a high balance
use the card often
use it once in a while for a large purchase |
|
|
Term
what is the cluster centroid |
|
Definition
the average position of cluster members in each dimension |
|
|
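The centroid definition above can be sketched in a few lines of Python. This is only an illustration; the function name `centroid` and the sample points are made up for the example.

```python
def centroid(points):
    """Return the average position of the points in each dimension."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

# Three hypothetical cluster members in two dimensions.
members = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
center = centroid(members)
print(center)
```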
Term
___ is the most commonly used clustering algorithm |
|
Definition
K-means |
|
|
Term
what does the "K" in K means clustering refer to? |
|
Definition
the algorithm looks for a fixed number of clusters. K is specified by the user. |
|
|
Term
the best assignment of cluster centers could be defined as: |
|
Definition
one that minimizes the sum of the distances from every data point to its nearest cluster center (standard K-means minimizes the sum of the squared distances) |
|
|
Term
k means uses an algorithm that alternates between two steps, ___ and ___ |
|
Definition
assignment and update |
|
|
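The two alternating steps of K-means are usually called assignment (attach each record to its nearest center) and update (move each center to the mean of its records). A minimal Python sketch follows, assuming Euclidean distance and user-supplied starting centers; the function names and sample data are hypothetical.

```python
import math

def assign(points, centers):
    """Assignment step: index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Update step: move each center to the mean of its assigned points."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centers.append(old)  # an empty cluster keeps its old center
            continue
        new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(len(old))))
    return new_centers

def k_means(points, centers, max_iter=100):
    """Alternate update and assignment until the labels stop changing."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, labels = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])  # K = 2
print(labels, centers)
```

Real implementations typically pick the starting centers at random, which is why different random seeds can produce different clusters.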
Term
in K-means clustering, the cluster centers can be used to define a ___: describe this. |
|
Definition
Voronoi diagram: a diagram whose lines mark the points that are equidistant from the two nearest seeds (cluster centers). |
|
|
Term
how does clustering ID outliers? |
|
Definition
A record is treated as an outlier when it lies beyond a threshold distance from the cluster. |
|
|
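The threshold idea above can be sketched as follows. This assumes Euclidean distance measured to the nearest cluster center; the function name, the centers, and the threshold of 5.0 are all made up for the illustration.

```python
import math

def find_outliers(points, centers, threshold):
    """Return the points whose nearest cluster center is farther than threshold."""
    return [p for p in points
            if min(math.dist(p, c) for c in centers) > threshold]

centers = [(1.5, 1.5), (10.0, 10.0)]
points = [(1.0, 1.0), (2.0, 2.0), (10.0, 10.0), (30.0, 30.0)]
outliers = find_outliers(points, centers, threshold=5.0)
print(outliers)
```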
Term
what are the two factors used when interpreting clusters? |
|
Definition
1. what do cluster members have in common? 2. what distinguishes each cluster from the others? |
|
|
Term
What are the cluster characteristics? |
|
Definition
Diameter: maximum distance between two records in one cluster
Variance: sum of the squared distances of cluster members from the centroid
Silhouette: measure of cluster dispersion or goodness |
|
|
Term
What can you do with a silhouette score? |
|
Definition
choose an appropriate value of K
compare clusters produced by different random seeds
remove strong clusters for further analysis |
|
|
Term
When clustering, you should impose a ___. Why |
|
Definition
maximum cluster diameter. Otherwise the cluster keeps growing. |
|
|
Term
what are the related variations of k means |
|
Definition
k medians
k medoids
k modes |
|
|
Term
what does k median clustering do? |
|
Definition
looks for the set of centroids that minimizes the sum of distances from cluster members to cluster centroids.
(it tightens the grouping) |
|
|
Term
k medians is less sensitive to ___ |
|
Definition
outliers |
|
|
Term
what does k medoids clustering do? |
|
Definition
during each assignment and update cycle, the new center of each cluster is its best representative: an actual record from the cluster (the medoid) rather than a computed average. |
|
|
Term
define scaling and weighting |
|
Definition
scaling: adjusts the values of variables to take into account that different variables are measured in different units/ranges
weighting: encodes the information that one variable is more or less important than the others |
|
|
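To make the distinction concrete, here is a small Python sketch. It assumes z-score standardization as the scaling method (one common choice, not necessarily the one the course intends) and a made-up weight of 2.0; the variable names are hypothetical.

```python
import statistics

def z_scores(values):
    """Scale values to mean 0 and standard deviation 1 (population SD)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

incomes = [30000.0, 50000.0, 70000.0]  # measured in dollars
ages = [20.0, 40.0, 60.0]              # measured in years

# Scaling: after standardization both variables are on the same footing.
scaled_income = z_scores(incomes)
scaled_age = z_scores(ages)

# Weighting: multiply a scaled variable to make it count more in distances.
weighted_age = [2.0 * z for z in scaled_age]
print(scaled_income, scaled_age, weighted_age)
```

Without scaling, income would dominate any distance calculation simply because its numbers are larger.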
Term
scaling variables is calculated by ___ |
|
Definition
|
|
Term
what are the 3 types of clustering and their description? |
|
Definition
K-means: start with a fixed number of clusters and assign records to clusters based on a criterion
divisive: start with one cluster and keep splitting until some stopping rule is triggered
hierarchical (agglomerative): start with every record in its own cluster and merge |
|
|
Term
what implements hierarchical clustering? |
|
Definition
|
|
Term
how does hierarchical clustering work? |
|
Definition
Starts with every record in its own cluster and gradually merges them, forming larger groupings. Continues until all records are in one cluster |
|
|
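The merging process can be sketched in Python as below. The card does not say how the distance between two clusters is measured, so this example assumes single linkage (the distance between the closest pair of members); the function name and data are hypothetical.

```python
import math

def agglomerate(points):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [[p] for p in points]  # every record starts in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = agglomerate([(0.0,), (1.0,), (5.0,), (6.5,)])
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```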
Term
what is principal component analysis used for? |
|
Definition
finding an optimal way of combining variables into a small number of new variables (the principal components). |
|
|
Term
principal components are sensitive to units, therefore you should ___ |
|
Definition
standardize the variables first |
|
|
Term
What are the strengths of K-Nearest Neighbors (KNN)? |
|
Definition
Simple and effective
makes no underlying distributional assumptions
fast training phase |
|
|
Term
what are the weaknesses of K-nearest neighbors(KNN)? |
|
Definition
doesn't produce a model, so no insights in relationships
slow classification phase
memory intensive
nominal and missing data need additional processing |
|
|
Term
Unlike K-means clustering, with K-Nearest Neighbors there is a ___ |
|
Definition
|
|
Term
[image]
Fill in the chart for each predicted/actual variable and per arrow. |
|
Definition
|
|
Term
What are the arguments in the kNN classification syntax? |
|
Definition
train: a data frame containing numeric training data
test: a data frame containing numeric test data
class: factor vector with the class for each row in the training data
k: an integer indicating the number of nearest neighbors |
|
|
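To show how those four arguments work together, here is a plain-Python sketch of nearest-neighbor classification. It is not the library function the card describes; `knn_predict` is a hypothetical name, the argument `cl` stands in for the card's class vector, and Euclidean distance with a simple majority vote is assumed.

```python
import math
from collections import Counter

def knn_predict(train, test, cl, k):
    """Classify each test row by majority vote among its k nearest training rows."""
    predictions = []
    for row in test:
        # Indices of the k training rows closest to this test row.
        nearest = sorted(range(len(train)),
                         key=lambda i: math.dist(row, train[i]))[:k]
        votes = Counter(cl[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
cl = ["low", "low", "low", "high", "high", "high"]  # class of each training row
test = [(1.2, 1.1), (8.7, 8.8)]
predicted = knn_predict(train, test, cl, k=3)
print(predicted)
```

Note that all the work happens at prediction time, which matches the "fast training phase, slow classification phase" trade-off listed on the strengths and weaknesses cards.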
Term
in an artificial neural network, network topology describes: |
|
Definition
the number of layers
number of nodes in each layer
whether information is allowed to travel backward |
|
|
Term
in an artificial neural network, training algorithm specifies: |
|
Definition
how connection weights are set in order to inhibit or excite neurons in proportion to the input signal |
|
|
Term
What type of data mining can artificial neural networks be used for? |
|
Definition
classification
numeric prediction
unsupervised pattern recognition |
|
|
Term
what activation functions are often used in artificial neural networks? |
|
Definition
sigmoid activation function (output ranges from 0-1)
Radial basis function
linear activation function (results in a network similar to linear regression model) |
|
|
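The three activation functions can be written out in Python as a quick reference. The Gaussian form used here for the radial basis function is an assumption (it is a common choice, but the card does not specify one).

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any input into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(x):
    """Linear activation: passes the input through unchanged."""
    return x

def gaussian_rbf(x):
    """Gaussian radial basis function: peaks at 1 when x is 0 (assumed form)."""
    return math.exp(-x * x)

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), linear(x), gaussian_rbf(x))
```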
Term
what are some strengths of artificial neural networks? |
|
Definition
can be adapted to classification or numeric prediction problems
among most accurate modeling approaches
makes few assumptions about data's underlying relationships |
|
|
Term
what are some weaknesses of artificial neural networks |
|
Definition
computationally intensive and slow to train
easy to over- and underfit training data
difficult or impossible to interpret |
|
|
Term
what is the goal of a support vector machine? |
|
Definition
to create a flat boundary called a hyperplane, which leads to partitions of data on either side |
|
|
Term
what is the maximum margin hyperplane? |
|
Definition
a hyperplane that creates the greatest separation between two classes
support vectors are the points from each class that are closest to the maximum margin hyperplane (each class must have at least one support vector, but can have more) |
|
|
Term
What is a key feature of support vector machines |
|
Definition
the support vectors provide a very compact way to store a classification model |
|
|
Term
with support vector machines with non-linear kernels, what must you do with the data? |
|
Definition
standardize variables
convert nominal variables to dummies and ordinal to scale. |
|
|
Term
what are some commonly used kernels? |
|
Definition
linear
polynomial
sigmoid
Gaussian RBF |
|
|
Term
What is market basket analysis. Give an example of when it could be used |
|
Definition
An analysis that produces a set of association rules specifying patterns of relationships among items in transactional data.
Example: how many times peanut butter and jelly were purchased at the same time as bread |
|
|
Term
how does market basket analysis form a set? What does the set mean? |
|
Definition
with brackets {peanut butter, jelly}
it means that the item set appears in the data with some regularity |
|
|
Term
what are association rules in market basket analysis used for? |
|
Definition
They are used for unsupervised knowledge discovery in large databases, NOT for prediction. |
|
|
Term
what are the strengths of market basket analysis? |
|
Definition
good for large amounts of transactional data
results in rules that are easy to understand
good for discovering unexpected knowledge in databases |
|
|
Term
what are the weaknesses of market basket analysis? |
|
Definition
not good with small datasets
takes effort to separate the insight from the common sense
easy to draw spurious conclusions from random patterns |
|
|
Term
what is the end result of the Apriori algorithm? |
|
Definition
It reduces the association rule search space. It does this by using the Apriori property: all subsets of a frequent itemset must also be frequent |
|
|
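A small Python sketch of how this property shrinks the search: pairs are only built from items that are frequent on their own, so pairs containing a rare item are never counted. This is a simplified two-level illustration, not a full Apriori implementation; the function names, transactions, and the 0.6 threshold are made up.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_pairs(transactions, min_support):
    """Find frequent pairs, building candidates from frequent single items only."""
    items = sorted({item for t in transactions for item in t})
    frequent_items = [i for i in items
                      if support({i}, transactions) >= min_support]
    # Pruning: a pair containing an infrequent item cannot be frequent.
    candidates = [frozenset(pair) for pair in combinations(frequent_items, 2)]
    return [c for c in candidates if support(c, transactions) >= min_support]

transactions = [
    {"bread", "peanut butter", "jelly"},
    {"bread", "peanut butter"},
    {"bread", "milk"},
    {"peanut butter", "jelly"},
    {"bread", "peanut butter", "jelly", "eggs"},
]
pairs = frequent_pairs(transactions, min_support=0.6)
print(pairs)
```

Here milk and eggs are dropped at the first level, so only three candidate pairs are counted instead of ten.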
Term
how do you calculate support of an item in market basket analysis? |
|
Definition
Support(x) = count(x)/N
count(x) = number of transactions the itemset x appears in
N = total number of transactions |
|
|
Term
in market basket analysis, what is the definition of confidence |
|
Definition
it is a measurement of a rule's predictive power or accuracy: the proportion of transactions containing x that also contain y. |
|
|
Term
for market basket analysis, describe the following arules functions:
inspect() itemFrequency() itemFrequencyPlot() image() |
|
Definition
inspect() = looks at the contents of the sparse matrix
itemFrequency() = lets you see the proportion of transactions that contain an item
itemFrequencyPlot() = allows you to produce a bar chart depicting the proportion of transactions with a certain item (shows support)
image() = helps with identification of potential data issues |
|
|
Term
with market basket analysis, what are the issues with low and high confidence? |
|
Definition
low = leads to many unreliable rules
high = leads to obvious or inevitable rules (smoke detector purchased with batteries) |
|
|
Term
in market basket analysis, what is lift? |
|
Definition
a measure of how much more likely one item is to be purchased relative to its typical purchase rate, given that you know another item has been purchased. |
|
|
Term
in market basket analysis, what does a lift greater than 1 mean? |
|
Definition
it suggests that items are found together more often than chance occurrence (it isn't by chance they are bought together) |
|
|
Term
Define the following: true positive (TP), true negative (TN), false positive (FP), false negative (FN) |
|
Definition
TP = correctly classified as the class of interest
TN = correctly classified as not the class of interest
FP = incorrectly classified as the class of interest
FN = incorrectly classified as not the class of interest |
|
|
Term
sensitivity is known as ___ |
|
Definition
the true positive rate: the proportion of actual positives that are correctly classified (also called recall) |
|
|
Term
specificity is known as ___ |
|
Definition
the true negative rate: the proportion of actual negatives that are correctly classified |
|
|
Term
in evaluating a model, what is a Type 1 error |
|
Definition
A false positive: you predicted/expected a positive, but the actual value is negative. Known as crying wolf. |
|
|
Term
in evaluating a model, what is a Type 2 error |
|
Definition
A false negative: you predicted/expected a negative, but the actual value is positive. AKA missed detection |
|
|
Term
What is a kappa statistic. Describe the agreement rates |
|
Definition
it adjusts accuracy by accounting for the possibility of a correct prediction by chance alone.
Max value = 1
0.8-1.0 = very good agreement
0.6-0.8 = good agreement
0.4-0.6 = moderate agreement
0.2-0.4 = fair agreement
less than 0.2 = poor agreement |
|
|
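The chance adjustment can be worked through in Python. This sketch assumes the usual two-class form, kappa = (Pr(a) - Pr(e)) / (1 - Pr(e)), where Pr(a) is the observed agreement (accuracy) and Pr(e) is the agreement expected by chance; the counts are made up.

```python
def kappa(tp, tn, fp, fn):
    """Cohen's kappa for a two-class confusion matrix."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n  # Pr(a): plain accuracy
    # Pr(e): chance that predicted and actual agree if they were independent.
    expected = (((tp + fn) / n) * ((tp + fp) / n)
                + ((tn + fp) / n) * ((tn + fn) / n))
    return (observed - expected) / (1 - expected)

# Hypothetical counts: accuracy is 0.85, but half of that is expected by chance.
k = kappa(tp=40, tn=45, fp=5, fn=10)
print(round(k, 3))
```

On the scale above, this example falls in the "good agreement" band.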
Term
what is the F-measure (F1 score)? |
|
Definition
a measure of model performance that combines precision and recall into a single number |
|
|
Term
what is the goal of text analysis |
|
Definition
|
|
Term
with text analytics, what is a tag. |
|
Definition
it refers to information associated with a text document. It is info ABOUT the document, but not part of the document itself (metadata). |
|
|
Term
During text analysis, how does the bag-of-words approach function? |
|
Definition
It examines each word individually and without context (not tied into another word, or part of a sentence) |
|
|
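A minimal Python sketch of the bag-of-words idea: the text is reduced to word counts and word order is thrown away. The tokenization here (split on whitespace, strip basic punctuation, lowercase) is a simplifying assumption, and the function name is hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Count each word individually, ignoring order and context."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    return Counter(w for w in words if w)

bag = bag_of_words("The food was good. The service was not good!")
print(bag)
```

Because context is ignored, "not good" contributes one count of "not" and one more count of "good", which shows the main limitation of the approach.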
Term
what are the commonly used ways of text analysis. |
|
Definition
bag of words
natural language processing |
|
|
Term
What is the end product of text analytics? |
|
Definition
|
|
Term
What is sentiment analysis? |
|
Definition
it is text analytics with a purpose.
use of text measures to learn about the past and make predictions about the future.
opinion mining |
|
|
Term
What are some design text measures (that work)? |
|
Definition
list based
item-weighted
models for text classification
training and test regimen in evaluation |
|
|
Term
with sentiment analysis, what are the two lists that are computed? |
|
Definition
Positive: the % of words in the review that match up with the positive word list
Negative: the % of words in the review that match up with the negative word list |
|
|
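The two list-based scores can be computed as sketched below. The word lists and the review are invented for the example and are far shorter than any real sentiment lexicon; the last value returned is the simple difference (positive minus negative).

```python
# Hypothetical word lists for illustration only.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "boring", "hate"}

def sentiment_scores(review):
    """Return (% positive words, % negative words, simple difference)."""
    words = review.lower().split()
    pos = 100 * sum(w in POSITIVE for w in words) / len(words)
    neg = 100 * sum(w in NEGATIVE for w in words) / len(words)
    return pos, neg, pos - neg

pos, neg, diff = sentiment_scores("great cast and a good story but a boring ending")
print(pos, neg, diff)
```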
Term
In sentiment analysis, there are six measures and modeling techniques. What are they? |
|
Definition
Simple difference: difference scores (positive minus negative scores)
regression difference: use linear regression to determine the weights for combining positive and negative scores into a predicted rating.
word/item analysis: use original 50 words and training data to ID positive/negative leaning words. Then +1/-1 accordingly.
Logistic regression: stepwise logistic regression to select useful predictors from the set of 50 sentiment words.
Support vector machines: effective technique in text classification problems with large numbers of explanatory variables
Random forests: ensemble method that uses thousands of tree structured classifiers to arrive at a single prediction |
|
|
Term
How do you calculate support when targeting a single item during transactions? |
|
Definition
Support(x) = count(x)/N
x = targeted item
count(x) = number of transactions in which x appeared
N = total number of transactions |
|
|
Term
How do you calculate confidence when targeting items purchased together during transactions? |
|
Definition
Confidence(x,y) = support(x,y)/support(x)
support(x,y) = support of the targeted items x and y purchased together
support(x) = support of the individual item x |
|
|
Term
How do you calculate lift for market basket analysis? |
|
Definition
Lift(x,y) = Confidence(x,y)/support(y) |
|
|
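The three market basket formulas can be checked with a short Python sketch. The transactions are made up and the function names are hypothetical; each function follows the formula on its card.

```python
def support(items, transactions):
    """count(items) / N: share of transactions containing all the items."""
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(x, y, transactions):
    """support(x, y) / support(x)."""
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    """confidence(x, y) / support(y)."""
    return confidence(x, y, transactions) / support(y, transactions)

transactions = [
    {"bread", "peanut butter", "jelly"},
    {"bread", "peanut butter"},
    {"bread", "milk"},
    {"peanut butter", "jelly"},
    {"bread", "peanut butter", "jelly", "eggs"},
]
x, y = {"peanut butter"}, {"jelly"}
s = support(x, transactions)
c = confidence(x, y, transactions)
l = lift(x, y, transactions)
print(s, c, l)
```

In this toy data the lift is above 1, so jelly shows up with peanut butter more often than its overall purchase rate would suggest.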
Term
how do you calculate specificity? |
|
Definition
Specificity = TN/(TN+FP) |
|
|
Term
how do you calculate sensitivity? |
|
Definition
Sensitivity = TP/(TP+FN) |
|
|
Term
How do you calculate precision? |
|
Definition
Precision = TP/(TP+FP) |
|
|
Term
How do you calculate accuracy? |
|
Definition
Accuracy = (TP+TN)/(TP+TN+FP+FN) |
|
|
Term
How do you calculate error rate? |
|
Definition
Error rate = 1 - Accuracy
Accuracy = (TP+TN)/(TP+TN+FP+FN) |
|
|
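The evaluation measures covered in these cards all come from the four confusion-matrix counts. The Python sketch below uses the standard formulas with made-up counts; the function name `evaluate` is hypothetical, and the F-measure shown is the balanced form 2 * precision * recall / (precision + recall).

```python
def evaluate(tp, tn, fp, fn):
    """Compute common evaluation measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }

metrics = evaluate(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(name, round(value, 3))
```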