Term
Data mining |
|
Definition
the process of extracting patterns from data; an increasingly important tool for turning data into information |
|
|
Term
Data mining: the confluence of disciplines |
|
Definition
statistics, visualization, algorithms, pattern recognition, machine learning, database technology |
|
|
Term
data mining and business intelligence |
|
Definition
decision making ^ data presentation (visualization) ^ data mining (info discovery) ^ data exploration (statistical summary, querying, reporting) ^ data preprocessing/integration ^ data sources (paper, files, web docs, scientific experiments) |
|
|
Term
Prediction methods |
|
Definition
use some variables to predict unknown or future values of other variables |
|
|
Term
Description methods |
|
Definition
find human-interpretable patterns that describe the data |
|
|
Term
Classification |
|
Definition
predictive. goal: previously unseen records should be assigned a class as accurately as possible |
|
|
Term
|
Definition
|
|
Term
Classification Applications |
|
Definition
direct marketing: reduce the cost of mailing by targeting a set of consumers likely to buy a new product
fraud detection: predict fraudulent cases in credit card transactions (e.g., American Express used this approach)
customer attrition/churn: predict whether a customer is likely to be lost to a competitor
sky survey cataloging: predict the class of sky objects, especially visually faint ones, based on telescopic survey images |
|
|
Term
Set |
|
Definition
an arbitrary collection of objects (A = {a, b, c})
union, intersection, subset, and difference are in notes week 2 |
|
|
Term
Cartesian product |
|
Definition
a set of all possible pairs (x,y) |
|
|
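A small sketch of the set operations named in the cards above, using Python's built-in `set` type. The second set `B` is made up for illustration, and reading "all possible pairs (x, y)" as pairs with x from one set and y from another is an assumption.

```python
from itertools import product

A = {"a", "b", "c"}
B = {"b", "c", "d"}  # hypothetical second set, not from the notes

union = A | B                # elements in A or B
intersection = A & B         # elements in both A and B
difference = A - B           # elements in A but not in B
is_subset = {"a", "b"} <= A  # True when every element is also in A

# every ordered pair (x, y) with x taken from A and y taken from B
pairs = set(product(A, B))
```

With three elements in each set, `pairs` holds 3 * 3 = 9 ordered pairs.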
Term
|
Definition
a function in MATLAB must be in this form:
function [output_parameter_list] = function_name(input_parameter_list) |
|
|
Term
Clustering |
|
Definition
descriptive. given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
-data points in one cluster are more similar to one another
-data points in separate clusters are less similar to one another |
|
|
Term
Clustering applications |
|
Definition
market segmentation: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
document clustering: to find groups of docs that are similar to each other based on the important terms appearing in them. |
|
|
Term
Association rule discovery |
|
Definition
descriptive. Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict occurrences of an item based on occurrences of other items. |
|
|
Term
Association rule discovery applications |
|
Definition
marketing and sales promotion: find out what to put on sale to boost sales of another item
supermarket shelf management: identify items that are bought together by sufficiently many customers
inventory management: a consumer appliance repair company wants to anticipate the nature of repairs on its customer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households. |
|
|
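The shelf-management idea above (items bought together by sufficiently many customers) can be sketched by counting how often each pair of items shows up in the same transaction. This is only a pair-counting illustration, not a full rule-mining algorithm; the function name `frequent_pairs` and the `min_count` threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_count):
    """Return item pairs that co-occur in at least min_count transactions."""
    counts = Counter()
    for items in transactions:
        # sort so that each pair is always stored in the same order
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}
```

A pair that passes the threshold is a candidate for a rule such as "customers who buy the first item also tend to buy the second".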
Term
Sequential pattern discovery |
|
Definition
descriptive. given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
-rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints |
|
|
Term
Regression |
|
Definition
predict the value of a continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency |
|
|
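For the linear case mentioned above, a minimal sketch with one predictor variable, fitted by ordinary least squares. Choosing least squares as the fitting method is an assumption (the card does not name one), and `fit_line` and `predict` are illustrative names.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Predict the value of the dependent variable for a new x."""
    return slope * x + intercept
```

Once the line is fitted on known records, `predict` estimates the unknown value for a new record.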
Term
Deviation/Anomaly detection |
|
Definition
detect significant deviations from normal behavior |
|
|
Term
Deviation/Anomaly detection Applications |
|
Definition
credit card fraud and network intrusion detection. |
|
|
Term
Data |
|
Definition
a collection of facts from which conclusions may be drawn
a collection of data objects and their attributes |
|
|
Term
Attribute |
|
Definition
a property or characteristic of an object (columns)
a collection of attributes describes a data point |
|
|
Term
Attribute values |
|
Definition
numbers or symbols assigned to an attribute |
|
|
Term
|
Definition
the way an attribute is measured may not match the attribute's properties |
|
|
Term
Types of attributes and examples |
|
Definition
Nominal: ID numbers, eye color, zip codes
Ordinal: rankings, grades, height in (tall, medium, short)
Interval: calendar dates, temperatures
Ratio: length, time, counts |
|
|
Term
properties of attribute values |
|
Definition
distinctness: equal or not equal to
order: < >
addition: + -
multiplication: * / |
|
|
Term
Nominal definition and properties |
|
Definition
provides only enough info to distinguish one object from another
distinctness |
|
|
Term
Ordinal definition and properties |
|
Definition
provides enough info to order objects
Distinctness and order |
|
|
Term
Interval definition and properties |
|
Definition
the differences between values are meaningful (i.e. a unit of value exists)
distinctness, order, and addition |
|
|
Term
Ratio definition and properties |
|
Definition
both differences and ratios are meaningful
Distinctness, order, addition, multiplication |
|
|
Term
Discrete attribute |
|
Definition
has only a finite or countably infinite set of values (zip codes, counts, set of words in a collection of docs)
-often represented as integer variables |
|
|
Term
Continuous attribute |
|
Definition
has real numbers as attribute values (temp, height, weight)
-continuous attributes are typically represented as floating-point variables |
|
|
Term
Types of data sets |
|
Definition
Record: data matrix, doc data, transaction data
Graph: WWW, molecular structures
Ordered: spatial data, temporal data, sequential data, genetic sequence data. |
|
|
Term
Characteristics of structured data |
|
Definition
Dimensionality: curse of dimensionality (as the number of dimensions grows, the data become increasingly sparse and hard to analyze)
Sparsity: only presence counts
Resolution: patterns depend on the scale
Attribute and class imbalance: small number of nonzero elements (related to sparsity) |
|
|
Term
Record data |
|
Definition
data that consists of a collection of records, each of which consists of a fixed set of attributes |
|
|
Term
Data matrix |
|
Definition
if data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute. This data can be represented in an m-by-n matrix where m=rows (one for each object) and n=columns (one for each attribute) |
|
|
Term
Document data |
|
Definition
each document becomes a term vector
-each term is a component (attribute) of the vector
-the value of the component is the number of times the corresponding term occurs in the document |
|
|
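A minimal sketch of turning a document into a term vector as described above: one component per vocabulary term, holding the number of times that term occurs. Splitting on whitespace and lowercasing are simplifying assumptions, and `term_vector` is an illustrative name.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(document.lower().split())
    # Counter returns 0 for terms that never occur
    return [counts[term] for term in vocabulary]
```

For example, `term_vector("data mining finds patterns in data", ["data", "mining", "patterns", "season"])` gives `[2, 1, 1, 0]`.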
Term
Transaction data |
|
Definition
a special type of record data where each record (transaction) involves a set of items (ex, a grocery store purchase per customer) |
|
|
Term
|
Definition
|
|
Term
Load data from excel files |
|
Definition
|
|
Term
Load data from text files |
|
Definition
use textscan and related functions |
|
|
Term
|
Definition
|
|
Term
Reading custom file types |
|
Definition
use fopen, fclose, fgetl, fgets for text files
use fread, fwrite, fseek, ftell for binary files |
|
|
Term
Data quality problems (examples) |
|
Definition
-noise and outliers
-missing values
-duplicate data |
|
|
Term
Noise |
|
Definition
refers to modification of original values (e.g., static on a television screen) |
|
|
Term
Outliers |
|
Definition
data objects with characteristics that are considerably different than most of the other data objects in the data set |
|
|
Term
Missing values |
|
Definition
reasons for missing values: information not collected or attributes may not be applicable to all cases
Handling missing values: eliminate data objects or estimate missing values |
|
|
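The two handling strategies above can be sketched on a small table where `None` marks a missing value. Estimating with the attribute's mean is only one possible estimate and is an assumption here; both function names are illustrative.

```python
def drop_missing(rows):
    """Eliminate data objects (rows) that have any missing value."""
    return [row for row in rows if all(v is not None for v in row)]

def impute_with_mean(rows):
    """Estimate each missing value with the mean of its attribute (column)."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [row[j] for row in rows if row[j] is not None]
        # a column with no observed values cannot be estimated; keep None
        means.append(sum(present) / len(present) if present else None)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]
```

Dropping rows is simple but can discard a lot of data; estimating keeps every object at the cost of introducing guessed values.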
Term
Duplicate data |
|
Definition
data set may include objects that are duplicates (this is a major issue when merging data from heterogeneous sources) |
|
|
Term
Data cleaning |
|
Definition
process of dealing with noise and duplicate data |
|
|
Term
Similarity |
|
Definition
-numerical measure of how alike two data points are
-is higher when objects are more alike
-often falls in the range [0,1] |
|
|
Term
Dissimilarity |
|
Definition
-numerical measure of how different two data points are
-lower when objects are more alike
-minimum dissimilarity is often 0
-upper limit varies |
|
|
Term
Proximity |
|
Definition
refers to both similarity and dissimilarity |
|
|
Term
Euclidean distance |
|
Definition
the r = 2 case of the Minkowski distance:
$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q. |
|
|
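The Euclidean distance above translates directly into code: square the difference in each dimension, sum, and take the square root. This is a minimal sketch; the function name and the length check are additions for illustration.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of attributes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))
```

For the points (0, 0) and (3, 4) this gives 5.0, the familiar 3-4-5 right triangle.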
Term
Minkowski distance |
|
Definition
Minkowski distance is a generalization of Euclidean distance
$$\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$$ |
|
|
Term
|
Definition
|
|
Term
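Supremum (L_max) distance |
|
Definition
r → ∞
$$\mathrm{dist}(p, q) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r} = \max_{k} |p_k - q_k|$$ |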
|
|
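A sketch of the Minkowski family with the parameter r exposed. Passing `math.inf` for r is handled as a special case, since in the limit the distance reduces to the largest single-attribute difference. The function name is illustrative.

```python
import math

def minkowski_distance(p, q, r):
    """Minkowski distance of order r; r = math.inf gives the supremum distance."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if r == math.inf:
        # limit case: only the largest difference matters
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)
```

With p = (0, 0) and q = (3, 4), r = 1 gives 7, r = 2 gives 5, and r = infinity gives 4, so the distance shrinks toward the largest coordinate difference as r grows.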
Term
Simple matching coefficients |
|
Definition
number of matches / number of attributes |
|
|
Term
Jaccard coefficient |
|
Definition
number of 1-1 matches / number of attributes that are not both zero |
|
|
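Both coefficients above can be computed for two binary vectors by counting matching positions. This is a sketch with illustrative function names; returning 0.0 when both vectors are all zeros is an assumption, since the ratio is undefined in that case.

```python
def simple_matching(p, q):
    """Fraction of attributes on which two binary vectors agree (0-0 or 1-1)."""
    matches = sum(1 for a, b in zip(p, q) if a == b)
    return matches / len(p)

def jaccard(p, q):
    """1-1 matches divided by the attributes that are not both zero."""
    both_one = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    not_both_zero = sum(1 for a, b in zip(p, q) if a == 1 or b == 1)
    # assumption: treat the undefined all-zero case as 0.0
    return both_one / not_both_zero if not_both_zero else 0.0
```

For sparse data the two can disagree sharply: vectors that are mostly zeros agree on many 0-0 positions, which raises the simple matching coefficient but is ignored by the second one.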
Term
Cosine similarity |
|
Definition
$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$ |
|
|
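A minimal sketch of the cosine formula above for two vectors of equal length, such as term-count vectors of two documents. The function name is illustrative, and the code assumes neither vector is all zeros.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)
```

Because the result depends only on the angle between the vectors, a document and a copy of it with every count doubled have a similarity of 1.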
Term
Summary statistics |
|
Definition
numbers that summarize properties of the data.
-properties include: frequency, location, and spread |
|
|
Term
Frequency |
|
Definition
frequency of an attribute value is the percentage of time the value occurs in the data set |
|
|
Term
Mode |
|
Definition
the mode of an attribute is the most frequent attribute value |
|
|
Term
Mean |
|
Definition
the most common measure of the location of a set of m points (sensitive to outliers) |
|
|
Term
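Median |
|
Definition
middle number in ordered data, where $x_{(i)}$ is the ith smallest of the m values:
$$\mathrm{median}(x) = \begin{cases} x_{(r+1)} & \text{if } m = 2r + 1 \text{ (odd)} \\ \tfrac{1}{2}\left(x_{(r)} + x_{(r+1)}\right) & \text{if } m = 2r \text{ (even)} \end{cases}$$ |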
|
|
Term
Range |
|
Definition
the difference between the max and min |
|
|
Term
Variance / standard deviation |
|
Definition
the most common measure of the spread of a set of points (see notes for equation) |
|
|
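The frequency, location, and spread measures in the cards above can be computed with Python's standard `statistics` module. This is a sketch; `summarize` is an illustrative name, and using the sample variance (dividing by m - 1) is an assumption, since the notes leave the equation out.

```python
import statistics
from collections import Counter

def summarize(values):
    """Compute frequency, location, and spread summaries for one attribute."""
    counts = Counter(values)
    m = len(values)
    return {
        "frequency": {v: c / m for v, c in counts.items()},  # fraction of the data set
        "mode": statistics.mode(values),        # most frequent value
        "mean": statistics.mean(values),        # sensitive to outliers
        "median": statistics.median(values),    # middle value of the ordered data
        "range": max(values) - min(values),     # max minus min
        "variance": statistics.variance(values),  # sample variance
    }
```

For the values 2, 4, 4, 5, 10 the mean is 5 but the median is 4, which shows how the single large value pulls the mean upward.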
Term
Correlation |
|
Definition
measures the linear relationship between objects
to compute:
$$p'_k = \frac{p_k - \mathrm{mean}(p)}{\mathrm{stdev}(p)}, \qquad q'_k = \frac{q_k - \mathrm{mean}(q)}{\mathrm{stdev}(q)}$$
$$\mathrm{correlation}(p, q) = \frac{p' \cdot q'}{n - 1}$$ |
|
|
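The two-step recipe above (standardize each object, then take the dot product and divide by n - 1) can be sketched as follows. Using the sample standard deviation is an assumption that matches the n - 1 divisor; the function name is illustrative.

```python
import statistics

def correlation(p, q):
    """Correlation of p and q via standardized values and a dot product."""
    n = len(p)
    mean_p, mean_q = statistics.mean(p), statistics.mean(q)
    std_p, std_q = statistics.stdev(p), statistics.stdev(q)  # sample std (n - 1)
    p_std = [(pk - mean_p) / std_p for pk in p]
    q_std = [(qk - mean_q) / std_q for qk in q]
    return sum(a * b for a, b in zip(p_std, q_std)) / (n - 1)
```

A perfect increasing linear relationship gives 1, a perfect decreasing one gives -1.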
Term
K-nearest neighbor algorithm (KNN) |
|
Definition
a simple algorithm that stores all available data points (examples) and classifies new data points based on a similarity measure |
|
|
Term
|
Definition
-belongs to the class of "lazy" algorithms: there is no process of learning a model; examples are simply stored as the data is collected
-the difficulty comes at the classification stage: we need to calculate n distances and find the best K data points |
|
|
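A minimal sketch of the classification stage described above: compute the distance from the new point to every stored example, keep the K closest, and take a majority vote. Euclidean distance and majority voting are assumptions (the cards only say "a similarity measure"), and `knn_classify` is an illustrative name.

```python
import math
from collections import Counter

def knn_classify(examples, new_point, k):
    """Classify new_point by majority vote among its k nearest stored examples.

    examples is a list of (point, label) pairs; there is no training step.
    """
    # compute all n distances, then keep the k smallest
    by_distance = sorted(examples, key=lambda ex: math.dist(ex[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]
```

All the work happens at query time, which is why the cost of classifying one point grows with the number of stored examples.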
Term
Data exploration |
|
Definition
a preliminary exploration of data to better understand its characteristics |
|
|
Term
Visualization |
|
Definition
the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
one of the most powerful and appealing techniques of data exploration |
|
|
Term
Representation |
|
Definition
the mapping of info to a visual format |
|
|
Term
Arrangement |
|
Definition
the placement of visual elements within a display |
|
|
Term
Selection |
|
Definition
the elimination or the de-emphasis of certain objects and attributes |
|
|
Term
Histogram |
|
Definition
-usually shows the distribution of values of a single variable
-divide the values into bins and show a bar plot of the number of objects in each bin
-the height of each bar indicates the number of objects
-its shape depends on the number of bins |
|
|
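The binning step above can be sketched without any plotting: divide the value range into equal-width bins and count the objects in each. Equal-width bins are an assumption, and `histogram_counts` is an illustrative name.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything goes in the first bin
        else:
            # the maximum value belongs to the last bin, not a bin past the end
            index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts
```

The same eight values 1, 2, 2, 3, 5, 8, 9, 10 give counts `[4, 1, 3]` with three bins but `[5, 3]` with two, so the apparent shape changes with the number of bins.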
Term
Box plot |
|
Definition
-invented by J. Tukey
-another way of displaying the distribution of data |
|
|
Term
Scatter plot |
|
Definition
-attribute values determine the position
-2-dimensional scatter plots are most common
-additional attributes can often be displayed using the size, shape, and color of the markers that represent the objects |
|
|
Term
Contour plot |
|
Definition
-useful when a continuous attribute is measured on a spatial grid
-they partition the plane into regions of similar values
-the contour lines that form the boundaries of these regions connect points with equal values |
|
|
Term
Matrix plot |
|
Definition
-can plot the data matrix
-can be useful when objects are sorted according to class |
|
|
Term
|
Definition
Parallel coordinates: used to plot the attribute values of high-dimensional data. Instead of using perpendicular axes, use a set of parallel axes.
Star plots: similar approach to parallel coords but axes radiate from a central point. The line connecting the values of an object is a polygon.
Chernoff faces: created by Herman Chernoff, associates each attribute with a characteristic of a face. The values of each attribute determine the appearance of the corresponding facial characteristic. |
|
|
Term
Multi-dimensional measure of data quality |
|
Definition
accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility |
|
|
Term
Data cleaning |
|
Definition
fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies |
|
|
Term
Data integration |
|
Definition
integration of multiple databases or files |
|
|
Term
Data transformation |
|
Definition
normalization and aggregation |
|
|
Term
Data reduction |
|
Definition
obtains reduced representation in volume but produces the same or similar analytical results |
|
|
Term
Data discretization |
|
Definition
part of data reduction but with particular importance, esp for numerical data |
|
|
Term
|
Definition
fill in missing values, identify outliers and smooth out noisy data, correct inconsistent data, resolve redundancy caused by data integration |
|
|
Term
Attribute transformation |
|
Definition
a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values |
|
|
Term
Types of sampling: simple random sampling |
|
Definition
there's an equal probability of selecting any particular item |
|
|
Term
Types of sampling: sampling without replacement |
|
Definition
as each item is selected, it's removed from the population |
|
|
Term
Types of sampling: sampling with replacement |
|
Definition
objects not removed from the population as they are selected for the sample. (the object can be picked up more than once) |
|
|
Term
Types of sampling: stratified sampling |
|
Definition
split the data into several partitions; then draw random samples from each partition |
|
|
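The sampling types in the cards above can be sketched with Python's standard `random` module. The function names, the `key` function used to form partitions, and the fixed per-partition sample size are assumptions for illustration.

```python
import random

def sample_without_replacement(population, n, rng):
    """Each selected item is removed, so no item can appear twice."""
    return rng.sample(population, n)

def sample_with_replacement(population, n, rng):
    """Items stay in the population, so the same item can be picked again."""
    return rng.choices(population, k=n)

def stratified_sample(population, key, per_stratum, rng):
    """Split the data into partitions by key, then sample from each partition."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        # never ask for more items than the partition holds
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes a run repeatable. Stratified sampling is useful when one group is rare, because plain random sampling might miss it entirely.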
Term
Feature subset selection |
|
Definition
another way to reduce dimensionality of data
removes redundant and irrelevant features |
|
|
Term
Feature subset selection techniques |
|
Definition
brute force approach: try all possible feature subsets as input to data mining algorithm
embedded approaches: feature selection occurs naturally as part of the data mining algorithm
filter approaches: features are selected before the data mining algorithm is run
wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes |
|
|
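The brute force approach above can be sketched by enumerating every non-empty feature subset and keeping the best-scoring one. The `score` argument stands in for running the data mining algorithm on a subset and measuring its quality; it is a hypothetical callback, not part of the notes.

```python
from itertools import combinations

def all_feature_subsets(features):
    """Yield every non-empty subset of the features, smallest subsets first."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

def brute_force_select(features, score):
    """Return the subset with the highest score (score is a hypothetical callback)."""
    return max(all_feature_subsets(features), key=score)
```

With n features there are 2^n - 1 non-empty subsets, so this is only practical for a small number of features.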
Term
Feature creation |
|
Definition
create new attributes that can capture the important info in a data set much more efficiently than the original attributes |
|
|
Term
|
Definition
mapping data to a new space |
|
|