Term
| Data mining |
Definition
| the process of extracting patterns from data; an increasingly important tool for turning data into information. |
|
|
Term
| Data mining: the confluence of disciplines |
|
Definition
| statistics, visualization, algorithms, pattern recognition, machine learning, database technology |
|
|
Term
| data mining and business intelligence |
|
Definition
layers of the pyramid, top to bottom: decision making ^ data presentation (visualization) ^ data mining (info discovery) ^ data exploration (statistical summary, querying, reporting) ^ data preprocessing/integration ^ data sources (paper, files, web docs, scientific experiments) |
|
|
Term
| Prediction methods |
Definition
| use some variables to predict unknown or future values of other variables |
|
|
Term
| Description methods |
Definition
| find human-interpretable patterns that describe the data |
|
|
Term
| Classification |
Definition
| predictive. goal: previously unseen records should be assigned a class as accurately as possible |
|
|
Term
|
Definition
|
|
Term
| Classification Applications |
|
Definition
direct marketing: reduce the cost of mailing by targeting a set of consumers likely to buy a new product.
fraud detection: predict fraudulent cases in credit card transactions (e.g., American Express used this approach).
customer attrition/churn: predict whether a customer is likely to be lost to a competitor.
sky survey cataloging: predict the class of sky objects, especially visually faint ones, based on telescopic survey images. |
|
|
Term
| Set |
Definition
an arbitrary collection of objects (e.g., A = {a, b, c})
union, intersection, subset, and difference are in the week 2 notes |
|
|
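The four set operations named on this card can be tried out directly. The snippet below is a minimal Python sketch for illustration only (the course notes themselves use MATLAB), and the sets A and B are made-up examples.

```python
# Illustrative sets; the element names are arbitrary.
A = {"a", "b", "c"}
B = {"b", "c", "d"}

union = A | B                 # elements in A or in B (or in both)
intersection = A & B          # elements in both A and B
difference = A - B            # elements in A that are not in B
is_subset = {"a", "b"} <= A   # True when every element on the left is in A

print(union, intersection, difference, is_subset)
```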
Term
| Cartesian product |
Definition
| the set of all possible ordered pairs (x, y), with x drawn from one set and y from another |
|
|
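Reading this card as the Cartesian product of two sets, a short Python sketch can list every pair. The sets X and Y are hypothetical examples, not taken from the notes.

```python
from itertools import product

X = [1, 2]
Y = ["a", "b"]

# Every ordered pair (x, y) with x taken from X and y taken from Y.
pairs = set(product(X, Y))
print(sorted(pairs))
```

The number of pairs is len(X) * len(Y).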
Term
|
Definition
a function in MATLAB must be declared in this form:
function [output_parameter_list] = function_name(input_parameter_list) |
|
|
Term
| Clustering |
Definition
descriptive. given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
-data points in one cluster are more similar to one another
-data points in separate clusters are less similar to one another |
|
|
Term
| Clustering applications |
Definition
market segmentation: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
document clustering: to find groups of docs that are similar to each other based on the important terms appearing in them. |
|
|
Term
| Association rule discovery |
|
Definition
| descriptive. Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict occurrences of an item based on occurrences of other items. |
|
|
Term
| Association rule discovery applications |
|
Definition
marketing and sales promotion: find out what to put on sale to boost sales of another item
supermarket shelf management: identify items that are bought together by sufficiently many customers
inventory management: a consumer appliance repair company wants to anticipate the nature of repairs on its customer products and keep its service vehicles equipped with the right parts, reducing the number of visits to consumer households. |
|
|
Term
| Sequential pattern discovery |
|
Definition
descriptive. given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
-rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints |
|
|
Term
| Regression |
Definition
| predict the value of a continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency |
|
|
Term
| Deviation/Anomaly detection |
|
Definition
| detect significant deviations from normal behavior |
|
|
Term
| Deviation/Anomaly detection Applications |
|
Definition
| credit card fraud and network intrusion detection. |
|
|
Term
| Data |
Definition
a collection of facts from which conclusions may be drawn
a collection of data objects and their attributes |
|
|
Term
| Attribute |
Definition
a property or characteristic of an object (columns)
a collection of attributes describes a data point |
|
|
Term
| Attribute values |
Definition
| numbers or symbols assigned to an attribute |
|
|
Term
|
Definition
| the way an attribute is measured may not match the attribute's properties |
|
|
Term
| Types of attributes and examples |
|
Definition
Nominal: ID numbers, eye color, zip codes
Ordinal: rankings, grades, height in (tall, medium, short)
Interval: calendar dates, temperatures
Ratio: length, time, counts |
|
|
Term
| properties of attribute values |
|
Definition
distinctness: equal or not equal to (= ≠)
order: < >
addition: + -
multiplication: * / |
|
|
Term
| Nominal definition and properties |
|
Definition
provides only enough info to distinguish one object from another
distinctness |
|
|
Term
| Ordinal definition and properties |
|
Definition
provides enough info to order objects
Distinctness and order |
|
|
Term
| Interval definition and properties |
|
Definition
the differences between values are meaningful (i.e. a unit of value exists)
distinctness, order, and addition |
|
|
Term
| Ratio definition and properties |
|
Definition
both differences and ratios are meaningful
Distinctness, order, addition, multiplication |
|
|
Term
| Discrete attribute |
Definition
has only a finite or countably infinite set of values (zip codes, counts, the set of words in a collection of docs)
-often represented as integer variables |
|
|
Term
| Continuous attribute |
Definition
has real numbers as attribute values (temp, height, weight)
-continuous attributes are typically represented as floating-point variables |
|
|
Term
| Types of data sets |
Definition
Record: data matrix, doc data, transaction data
Graph: WWW, molecular structures
Ordered: spatial data, temporal data, sequential data, genetic sequence data. |
|
|
Term
| Characteristics of structured data |
|
Definition
Dimensionality: curse of dimensionality (as the number of dimensions grows, the data become increasingly sparse in the space they occupy and harder to analyze)
Sparsity: only presence counts
Resolution: patterns depend on the scale
Attribute and class imbalance: small number of nonzero elements (related to sparsity) |
|
|
Term
| Record data |
Definition
| data that consists of a collection of records, each of which consists of a fixed set of attributes |
|
|
Term
| Data matrix |
Definition
| if data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute. This data can be represented in an m-by-n matrix where m=rows (one for each object) and n=columns (one for each attribute) |
|
|
Term
| Document data |
Definition
each document becomes a term vector
-each term is a component (attribute) of the vector
-the value of the component is the number of times the corresponding term occurs in the document |
|
|
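A minimal Python sketch of turning one document into a term vector, as described on this card. The vocabulary, the sample sentence, and the helper name `term_vector` are all hypothetical, and splitting on whitespace is a simplifying assumption (real tokenization is more involved).

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Return one count per vocabulary term, in vocabulary order."""
    # Assumption: lowercase the text and split on whitespace.
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

vocabulary = ["data", "mining", "pattern"]   # hypothetical vocabulary
doc = "Data mining finds pattern after pattern in data"
vector = term_vector(doc, vocabulary)
print(vector)
```

Each position of the vector corresponds to one term, so documents that share a vocabulary can be stacked as the rows of a data matrix.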
Term
| Transaction data |
Definition
| a special type of record data where each record (transaction) involves a set of items (e.g., the items a customer buys in one grocery store trip) |
|
|
Term
|
Definition
|
|
Term
| Load data from excel files |
|
Definition
|
|
Term
| Load data from text files |
|
Definition
| use textscan and related functions |
|
|
Term
|
Definition
|
|
Term
| Reading custom file types |
|
Definition
use fopen, fclose, fgetl, fgets for text files
use fread, fwrite, fseek, ftell for binary files |
|
|
Term
| Data quality problems (examples) |
|
Definition
-noise and outliers
-missing values
-duplicate data |
|
|
Term
| Noise |
Definition
| refers to modification of original values (e.g., static on a television screen) |
|
|
Term
| Outliers |
Definition
| data objects with characteristics that are considerably different than most of the other data objects in the data set |
|
|
Term
| Missing values |
Definition
reasons for missing values: information not collected or attributes may not be applicable to all cases
Handling missing values: eliminate data objects or estimate missing values |
|
|
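The two handling strategies on this card (eliminate data objects, or estimate the missing values) can be sketched in Python as follows. Representing a missing value as `None` and estimating it with the mean of the observed values are assumptions made for this illustration; the card does not prescribe a particular estimator.

```python
def drop_missing(records):
    """Eliminate data objects (rows) that contain a missing value (None)."""
    return [row for row in records if None not in row]

def impute_mean(values):
    """Estimate missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

rows = [[1, 2], [None, 3], [4, 5]]
print(drop_missing(rows))
print(impute_mean([1.0, None, 3.0]))
```

Dropping rows is simple but discards information; estimating keeps every object at the cost of introducing values that were never measured.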
Term
| Duplicate data |
Definition
| a data set may include objects that are duplicates (this is a major issue when merging data from heterogeneous sources) |
|
|
Term
| Data cleaning |
Definition
| process of dealing with noise and duplicate data |
|
|
Term
| Similarity |
Definition
-numerical measure of how alike two data points are
-is higher when objects are more alike
-often falls in the range [0,1] |
|
|
Term
| Dissimilarity |
Definition
-numerical measure of how different two data points are
-lower when objects are more alike
-minimum dissimilarity is often 0
-upper limit varies |
|
|
Term
| Proximity |
Definition
| refers to both similarity and dissimilarity |
|
|
Term
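| Euclidean distance |
Definition
Minkowski distance with r = 2:
$\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} (p_k - q_k)^2 \right)^{1/2}$
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q. |
|
|
Term
| Minkowski distance |
Definition
a generalization of Euclidean distance with a parameter r:
$\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$ |
|
|
Term
|
Definition
|
|
Term
| Supremum distance |
Definition
Minkowski distance in the limit r → ∞, which reduces to the largest difference over any single attribute:
$\mathrm{dist}(p, q) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r} = \max_{k} |p_k - q_k|$ |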
|
|
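The distance cards above all belong to one family, so a single function can cover them. This is a minimal Python sketch, not the course's own code: the function name `minkowski` is made up, and handling r = infinity by taking the maximum difference directly is a shortcut for the limit.

```python
import math

def minkowski(p, q, r):
    """Minkowski distance between equal-length sequences p and q.

    r = 1 gives the city-block distance, r = 2 the Euclidean distance,
    and r = math.inf the supremum (largest per-attribute) distance.
    """
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

p, q = (0, 0), (3, 4)
print(minkowski(p, q, 1), minkowski(p, q, 2), minkowski(p, q, math.inf))
```

For the same pair of points the distance never increases as r grows, which the three printed values illustrate.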
Term
| Simple matching coefficients |
|
Definition
| SMC = number of matching attribute values / total number of attributes |
|
|
Term
| Jaccard coefficient |
Definition
| J = number of 1-1 matches / number of attributes whose values are not both zero |
|
|
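To see how the two coefficients differ on binary data, here is a minimal Python sketch. The function names and the example vectors are made up, and returning 0.0 when both vectors are all zeros is an assumption, since the ratio is undefined in that case.

```python
def smc(x, y):
    """Simple matching coefficient: matching positions / all positions."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches / positions that are not both zero."""
    both_one = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    not_both_zero = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumption: report 0.0 when every position is 0 in both vectors.
    return both_one / not_both_zero if not_both_zero else 0.0

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
print(smc(x, y), jaccard(x, y))
```

SMC counts 0-0 agreements as matches while Jaccard ignores them, which matters for sparse data where most values are 0.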
Term
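| Cosine similarity |
Definition
| $\cos(d_1, d_2) = \dfrac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$, where $\cdot$ is the vector dot product and $\|d\|$ is the length of vector d |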
|
|
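A minimal Python sketch of the cosine formula on this card, applied to two made-up term-count vectors. The function name is hypothetical, and the sketch assumes neither vector is all zeros (otherwise the denominator is 0).

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))
```

Because both vectors are divided by their lengths, a long document and a short one with the same term proportions get a similarity of 1.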
Term
| Summary statistics |
Definition
numbers that summarize properties of the data.
-properties include: frequency, location, and spread |
|
|
Term
| Frequency |
Definition
| the frequency of an attribute value is the percentage of time the value occurs in the data set |
|
|
Term
| Mode |
Definition
| the mode of an attribute is the most frequent attribute value |
|
|
Term
| Mean |
Definition
| the most common measure of the location of a set of m points (sensitive to outliers) |
|
|
Term
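| Median |
Definition
the middle value of the ordered data $x_{(1)} \le \dots \le x_{(m)}$.
if m is odd (m = 2r+1), median = $x_{(r+1)}$; if m is even (m = 2r), median = $\frac{1}{2}\left(x_{(r)} + x_{(r+1)}\right)$ |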
|
|
Term
| Range |
Definition
| the difference between the max and min |
|
|
Term
| Variance / standard deviation |
Definition
| the most common measure of the spread of a set of points (see notes for the equation) |
|
|
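The location and spread measures on the preceding cards can be computed by hand in a few lines. This is a minimal Python sketch with made-up helper names. Dividing the variance by m - 1 (the sample variance) is an assumption, since the card defers the equation to the notes.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if m is even."""
    x = sorted(values)
    m = len(x)
    r = m // 2
    if m % 2 == 1:                    # m = 2r + 1
        return x[r]                   # 0-based index r is the (r+1)-th value
    return (x[r - 1] + x[r]) / 2      # m = 2r: average the r-th and (r+1)-th values

def value_range(values):
    """Difference between the maximum and the minimum."""
    return max(values) - min(values)

def sample_variance(values):
    """Average squared deviation from the mean, using an m - 1 denominator (assumed)."""
    m = len(values)
    mean = sum(values) / m
    return sum((v - mean) ** 2 for v in values) / (m - 1)

data = [1, 2, 3, 4, 5]
print(median(data), value_range(data), sample_variance(data))
```

Replacing the 5 with 500 leaves the median at 3 but shifts the mean and inflates the variance, which is the sensitivity to outliers noted on the mean card.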
Term
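| Correlation |
Definition
measures the linear relationship between objects
to compute, standardize p and q and take their dot product, where n is the number of attributes:
$p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)$, $q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)$
$\mathrm{correlation}(p, q) = \dfrac{p' \cdot q'}{n - 1}$ |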
|
|
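The standardize-then-dot-product recipe on this card translates directly into code. This is a minimal Python sketch. It assumes the standard deviation uses an n - 1 denominator, which is what makes the final division by n - 1 produce a value in [-1, 1], and that neither vector is constant.

```python
import math

def correlation(p, q):
    """Pearson correlation via standardized values, following the card's recipe."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    # Sample standard deviations (n - 1 denominator, assumed).
    std_p = math.sqrt(sum((a - mean_p) ** 2 for a in p) / (n - 1))
    std_q = math.sqrt(sum((b - mean_q) ** 2 for b in q) / (n - 1))
    p_std = [(a - mean_p) / std_p for a in p]
    q_std = [(b - mean_q) / std_q for b in q]
    return sum(a * b for a, b in zip(p_std, q_std)) / (n - 1)

print(correlation([1, 2, 3], [2, 4, 6]))   # perfectly increasing linear relationship
print(correlation([1, 2, 3], [6, 4, 2]))   # perfectly decreasing linear relationship
```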
Term
| K-nearest neighbor algorithm (KNN) |
|
Definition
| a simple algorithm that stores all available data points (examples) and classifies new data points based on a similarity measure |
|
|
Term
|
Definition
-belongs to the class of "lazy" algorithms: there is no process of learning a model; examples are simply stored as the data is collected
-the difficulty comes at the classification stage: we need to calculate n distances and find the best K data points |
|
|
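The "store everything, work at classification time" behaviour described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course's implementation: Euclidean distance is used as the (dis)similarity measure, the label is chosen by majority vote among the K nearest points, and the training points are made up.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Return the majority label among the k stored points closest to `query`.

    train: list of (point, label) pairs; there is no training step, the
    pairs are simply kept and all n distances are computed per query.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
print(knn_classify(train, (0.2, 0.1), k=3))
print(knn_classify(train, (5.0, 5.2), k=3))
```

Sorting all n distances keeps the sketch short; for large data sets the per-query cost is the difficulty the card refers to.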
Term
| Data exploration |
Definition
| a preliminary exploration of data to better understand its characteristics |
|
|
Term
| Visualization |
Definition
the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
one of the most powerful and appealing techniques of data exploration |
|
|
Term
| Representation |
Definition
| the mapping of info to a visual format |
|
|
Term
| Arrangement |
Definition
| the placement of visual elements within a display |
|
|
Term
| Selection |
Definition
| the elimination or the de-emphasis of certain objects and attributes |
|
|
Term
| Histogram |
Definition
-usually shows the distribution of values of a single variable
-divide the values into bins and show a bar plot of the number of objects in each bin
-the height of each bar indicates the number of objects
-its shape depends on the number of bins |
|
|
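The binning step behind a histogram, and the way its shape changes with the number of bins, can be sketched without any plotting. This is a minimal Python illustration: equal-width bins are an assumption, the function name is made up, and the code assumes the values are not all identical (otherwise the bin width would be 0).

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of `num_bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins      # assumes hi > lo
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin instead of a bin of its own.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

values = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(histogram_counts(values, 2))
print(histogram_counts(values, 4))
```

Each count is the height of one bar; the same data give a different picture with 2 bins than with 4.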
Term
| Box plot |
Definition
-invented by J. Tukey
-another way of displaying the distribution of data |
|
|
Term
| Scatter plot |
Definition
-attribute values determine the position
-2-dimensional scatter plots are the most common
-additional attributes can often be displayed by using the size, shape, and color of the markers that represent the objects |
|
|
Term
| Contour plot |
Definition
-useful when a continuous attribute is measured on a spatial grid
-they partition the plane into regions of similar values
-the contour lines that form the boundaries of these regions connect points with equal values |
|
|
Term
| Matrix plot |
Definition
-can plot the data matrix
-can be useful when objects are sorted according to class |
|
|
Term
|
Definition
Parallel coordinates: used to plot the attribute values of high-dimensional data. Instead of using perpendicular axes, use a set of parallel axes.
Star plots: similar approach to parallel coords but axes radiate from a central point. The line connecting the values of an object is a polygon.
Chernoff faces: created by Herman Chernoff; each attribute is associated with a characteristic of a face, and the value of each attribute determines the appearance of the corresponding facial characteristic. |
|
|
Term
| Multi-dimensional measure of data quality |
|
Definition
accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility |
|
|
Term
| Data cleaning |
Definition
| fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies |
|
|
Term
| Data integration |
Definition
| integration of multiple databases or files |
|
|
Term
| Data transformation |
Definition
| normalization and aggregation |
|
|
Term
| Data reduction |
Definition
| obtains reduced representation in volume but produces the same or similar analytical results |
|
|
Term
| Data discretization |
Definition
| part of data reduction but with particular importance, esp for numerical data |
|
|
Term
|
Definition
| fill in missing values, identify outliers and smooth out noisy data, correct inconsistent data, resolve redundancy caused by data integration |
|
|
Term
| Attribute transformation |
Definition
| a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values |
|
|
Term
| Types of sampling: simple random sampling |
|
Definition
| there's an equal probability of selecting any particular item |
|
|
Term
| Types of sampling: sampling without replacement |
|
Definition
| as each item is selected, it's removed from the population |
|
|
Term
| Types of sampling: sampling with replacement |
|
Definition
| objects are not removed from the population as they are selected for the sample (the same object can be picked more than once) |
|
|
Term
| Types of sampling: stratified sampling |
|
Definition
| split the data into several partitions; then draw random samples from each partition |
|
|
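The sampling schemes on the preceding cards map onto a few calls in Python's standard `random` module. This is a minimal sketch with made-up function names; drawing the same number of items from every partition in the stratified case is an assumption (proportional allocation is another common choice).

```python
import random

def sample_without_replacement(population, n, rng):
    """Each selected item is removed from the pool, so no item repeats."""
    return rng.sample(population, n)

def sample_with_replacement(population, n, rng):
    """Items stay in the pool, so the same item can be picked more than once."""
    return [rng.choice(population) for _ in range(n)]

def stratified_sample(partitions, n_per_partition, rng):
    """Draw a simple random sample from each partition (stratum)."""
    return {name: rng.sample(items, n_per_partition)
            for name, items in partitions.items()}

rng = random.Random(0)   # fixed seed so the run is repeatable
print(sample_without_replacement(list(range(10)), 5, rng))
print(sample_with_replacement([1, 2], 6, rng))
print(stratified_sample({"x": [1, 2, 3], "y": [4, 5, 6]}, 2, rng))
```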
Term
| Feature subset selection |
Definition
another way to reduce the dimensionality of data
removes redundant and irrelevant features |
|
|
Term
| Feature subset selection techniques |
|
Definition
brute force approach: try all possible feature subsets as input to the data mining algorithm
embedded approaches: feature selection occurs naturally as part of the data mining algorithm
filter approaches: features are selected before the data mining algorithm is run
wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes |
|
|
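The brute-force approach has to visit every feature subset, and enumerating them shows how quickly that grows: n features have 2^n - 1 non-empty subsets. This is a minimal Python sketch with made-up feature names. It only lists the subsets and does not run any data mining algorithm on them.

```python
from itertools import combinations

def all_feature_subsets(features):
    """Yield every non-empty subset of `features`, smallest subsets first."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset

features = ["age", "income", "zip_code"]   # hypothetical attributes
subsets = list(all_feature_subsets(features))
print(len(subsets))
print(subsets)
```

With 20 features the same loop would already produce over a million subsets, which is why the filter, wrapper, and embedded approaches are used in practice.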
Term
| Feature creation |
Definition
| create new attributes that can capture the important info in a data set much more efficiently than the original attributes |
|
|
Term
|
Definition
| mapping data to a new space |
|
|