Term
3 factors behind the sudden popularity of data mining |
|
Definition
a. It is now cheaper to store and process data, and improved hardware makes it possible to collect and accumulate more data
b. Increased database capacity and the availability of data analysis tools made companies realize that they have untapped data and the tools to analyze it
c. Consolidating data in a data warehouse, at the customer level and from various sources, makes it possible to analyze it from a more complete view |
|
|
Term
6 examples of applications of data mining |
|
Definition
a. Identify successful therapies for illnesses and discover new drugs
b. Reduce fraudulent behavior (Insurance Claims, Credit Card Usage)
c. Identify customer buying patterns
d. Reclaim profitable customers
e. Aid in market-basket analysis
f. Better target customers/clients |
|
|
Term
_____ is used to describe knowledge discovery in databases. |
|
Definition
Data Mining |
|
|
Term
Data Mining uses ___, ___, and other techniques to extract and identify useful information and subsequent knowledge from large databases. |
|
Definition
|
|
Term
Data mining is also referred to as:
1. _____
2. _____
3. _____
4. _____
5. _____ |
|
Definition
1. Knowledge extraction
2. Data archaeology
3. Data exploration
4. Data dredging
5. Information harvesting |
|
|
Term
Data mining finds ___ and defines them in terms of mathematical rules. Those rules can then be used for prediction or association in an attempt to aid in decision making. |
|
Definition
|
|
Term
Data mining algorithms fall into four broad categories:
1. ____ - find the commonly co-occurring groupings of things
2. ____ - tell the nature of future occurrences of certain events based on what has happened in the past
3. ____ - Identify natural groupings of things based on their known characteristics
4. ____ - discover time-ordered events |
|
Definition
1. Associations
2. Predictions
3. Clusters
4. Sequential Relationships |
|
|
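A minimal sketch of the "associations" category above: count how often pairs of items occur together in the same transaction. The basket data and the function name `pair_counts` are made up for illustration and are not from the course material.

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical shopping baskets (illustrative data only)
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]

counts = pair_counts(baskets)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Pairs with the highest counts are the candidate "commonly co-occurring groupings".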
Term
Two other data mining procedures
are __ __
and __ __ __ |
|
Definition
Data Visualization
Time Series Forecasting |
|
|
Term
__ __ are the most common of all data mining approaches |
|
Definition
Classification Procedures |
|
|
Term
Classification involves identifying patterns of data as belonging to a certain ____. Examples:
a. Credit Approval
b. Store Location
c. Target Marketing
d. Fraud Detection
e. Telecommunications
f. Route or Segmentation Decisions |
|
Definition
|
|
Term
The Basic Idea:
1. Define the ___
2. Use the data to develop a __ model
3. Use that model to predict unknown outcomes for __ __ |
|
Definition
Data
Mathematical
Future Observations |
|
|
Term
If the outcome (Y) is categorical, and the predictors (Xs) are either categorical or numeric, you would use a _____ |
|
Definition
|
|
Term
If the outcome (Y) is categorical, and the predictors (Xs) are all numeric and have normal distributions and equal variances, then you would use ___ ___ ___ |
|
Definition
linear discriminant analysis |
|
|
Term
If the outcome (Y) is continuous numeric, and the predictors (Xs) are numeric with normal distributions and equal variances, then you would use
____ ____ ____ |
|
Definition
|
|
Term
Organizations must use a standardized approach for conducting a ____ project. |
|
Definition
|
|
Term
Some proposed industry-standard models for data mining are:
1. _____ - one of the most popular non-proprietary standard methodologies for data mining
2. ____ - Ordinarily used in manufacturing, service delivery, management, and other business activities that rely on quality control and on eliminating defects and waste.
3. ___ - developed by the SAS Institute. |
|
Definition
1. CRISP-DM
2. Six Sigma
3. SEMMA |
|
|
Term
CRISP-DM stands for: ___
There are 6 steps of the CRISP-DM Model:
1. ____
2. ____
3. ____
4. ____
5. ____
6. ____
|
|
Definition
Cross-Industry Standard Process for Data Mining
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment |
|
|
Term
DMAIC (Six Sigma) stands for: ___ |
|
Definition
Define, Measure, Analyze, Improve, Control |
|
|
Term
SEMMA stands for: ___ |
|
Definition
Sample, Explore, Modify, Model, Assess |
|
|
Term
____ places observations (rows, customers, students, etc.) into groups such that the members share similar characteristics but the groups themselves are highly different |
|
Definition
Cluster Analysis |
|
|
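A minimal sketch of grouping observations by similarity. It uses a tiny one-dimensional k-means, which is one common clustering technique chosen here for illustration; the cards do not name a specific algorithm. The spending values and starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then re-average."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical weekly spend per customer (illustrative data only)
spend = [12, 15, 14, 80, 85, 90]
centers, groups = kmeans_1d(spend, centers=[0, 100])
print(centers)
print(groups)
```

The groups are not known in advance; they emerge from the data, which is the contrast with classification.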
Term
Cluster Analysis is different from ___ analysis in that the groups are unknown and created in cluster analysis, whereas the groups are distinct and known when conducting a __ analysis. |
|
Definition
Classification
Classification |
|
|
Term
Market Segmentation is a common application of __ __ |
|
Definition
|
|
Term
Market Segmentation is used to understand the ________ |
|
Definition
buyer behavior of customers |
|
|
Term
Market Segmentation is used to help retailers in targeting similar groups of customers for defining the _______ |
|
Definition
appropriate advertising campaign |
|
|
Term
Association Analysis is aimed at establishing relationships between ___ |
|
Definition
items (variables, columns) |
|
|
Term
The goal of ___ is to group variables that are similar. |
|
Definition
|
|
Term
A common application of __ analysis is Market Basket Analysis |
|
Definition
Association |
|
|
Term
____ - the semiautomatic process of extracting patterns from large amounts of unstructured data sources |
|
Definition
Text Mining |
|
|
Term
Some of the most popular text mining analyses discussed in class are:
a. ____
b. ____
c. ____
d. ____ |
|
Definition
Summarization
Categorization/Classification
Clustering
Concept Linking (Association) |
|
|
Term
The most basic form of text mining used for summarization is ____ |
|
Definition
|
|
Term
The ___-___ matrix is used for Categorization/Classification, Clustering, and Concept Linking. |
|
Definition
Term-Document |
|
|
Term
_____ maps unstructured information (in the form of a document of words) into a structured format (in the form of a feature/term vector) or a concept. |
|
Definition
|
|
Term
A __ vector, or __ vector, is a weighted list of words which defines a concept that describes unstructured information (document of words) |
|
Definition
Feature
Term |
|
|
Term
Steps to creating a feature vector:
1. Eliminate ___
2. Replace words with their _ or _
3. Consider __ and __
4. Calculate the __ of the remaining terms |
|
Definition
articles (the, and, other, etc)
stems/roots
Synonyms and Phrases
Weights |
|
|
Term
To get the "TF" factor (term frequency), divide ___ by ___ |
|
Definition
Frequency
Total words left over |
|
|
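A minimal sketch of the feature-vector steps and the TF formula from the cards above. The stop-word list and stem table are tiny hypothetical stand-ins, and step 3 (synonyms and phrases) is left out to keep the example short.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list and stem table (assumptions)
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}
STEMS = {"mining": "mine", "mines": "mine", "mined": "mine", "patterns": "pattern"}

def feature_vector(text):
    """Return {term: TF}, where TF = term frequency / total words left over."""
    words = re.findall(r"[a-z]+", text.lower())
    # Step 1: drop stop words; step 2: replace each remaining word with its stem
    kept = [STEMS.get(w, w) for w in words if w not in STOP_WORDS]
    total = len(kept)
    if total == 0:
        return {}
    # Step 4: weight each remaining term by its term frequency
    return {term: count / total for term, count in Counter(kept).items()}

vector = feature_vector("The mining of data finds patterns in the data")
print(vector)
```

Here "data" appears twice among the five words left over, so it gets the largest weight.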
Term
A ___-___ matrix is created where the ROWS represent the documents and the COLUMNS represent the terms (excluding top terms), and the frequencies represent the number of times a term appears in a particular document |
|
Definition
|
|
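A minimal sketch of the matrix described above, with one row per document and one column per term. The three documents are hypothetical, and the sketch assumes stop words were already removed.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({word for words in tokenized for word in words})
    matrix = []
    for words in tokenized:
        counts = Counter(words)
        # Each cell is the number of times the term appears in this document
        matrix.append([counts[t] for t in terms])
    return terms, matrix

# Hypothetical corpus (illustrative data only)
docs = ["data mining finds patterns", "text mining mines text", "web data"]
terms, matrix = term_document_matrix(docs)
print(terms)
for row in matrix:
    print(row)
```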
Term
The text mining process can be defined in _ consecutive tasks.
1. Establish the ___
2. Create the _____
3. Extract the _____ |
|
Definition
1. Establish the corpus
2. Create the term-document matrix
3. Extract the knowledge |
|
|
Term
The largest data/text repository is ___ |
|
Definition
|
|
Term
Examples of information found on the web:
a. Whose __ __ is linked to which other pages
b. How many people have ___ on their own websites that point to other websites
c. How a particular site is ___
d. Tracking __ to a site, __ on a search engine, __ on e-commerce sites |
|
Definition
home page
hyperlinks
organized
visitors, searches, transactions |
|
|
Term
___ - the discovering of relationships from web data |
|
Definition
Web Mining |
|
|
Term
The 3 areas of web mining:
1. Web __ mining
2. Web __ mining
3. Web __ mining |
|
Definition
1. Content
2. Structure
3. Usage |
|
|
Term
Web __ mining extracts and uses content found within web pages. |
|
Definition
Content |
|
|
Term
Web __ mining extracts useful information from the analysis of links found in web documents |
|
Definition
Structure |
|
|
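A minimal sketch of analyzing the links in a web document, using Python's standard `html.parser` module. The class name `LinkCollector` and the sample page are hypothetical; a real analysis would go on to examine which pages link to which.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content (illustrative only)
page = '<p>See <a href="https://example.com/a">A</a> and <a href="/b">B</a>.</p>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)
```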
Term
Web __ mining extracts and uses information that is generated through web page visits, traffic, transactions, etc. |
|
Definition
Usage |
|
|
Term
Web content mining is similar to ___ mining |
|
Definition
Text |
|
|
Term
Web usage mining uses ____ data, which provides a trail of the user's activity and shows the user's browsing patterns: which sites are visited, pages accessed, time spent per page/site, etc. |
|
Definition
|
|
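A minimal sketch of reading a user's trail of activity: the pages accessed in order and the time spent per page. The record layout (seconds since session start, page) and the data are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical trail for one user: (seconds since session start, page)
clicks = [(0, "/home"), (30, "/products"), (95, "/cart"), (120, "/home"), (150, "/checkout")]

def time_per_page(clicks):
    """Sum the seconds between each click and the next one, per page.

    The last click has no following click, so its duration is unknown and skipped.
    """
    ordered = sorted(clicks)
    totals = defaultdict(int)
    for (start, page), (end, _next_page) in zip(ordered, ordered[1:]):
        totals[page] += end - start
    return dict(totals)

durations = time_per_page(clicks)
path = [page for _, page in sorted(clicks)]
print(path)
print(durations)
```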
Term
Formulas:
1. Which predictor is best (given alpha)?
Compare the p-value of each predictor (radio/newspaper) to the given alpha:
p-value < alpha = good predictor
p-value > alpha = bad predictor
2. Predict weekly sales (make sure Adj R-sq is between 0 and 1):
Y = intercept coeff + (coeff * X1) + (coeff * X2)
where Y = predicted weekly sales and X = money spent per week on advertising
3. Calculate the LCF (linear classification function), once per class:
LCF1 or LCF0 = constant + (coeff * X1) + (coeff * X2) + (coeff * X3)
where:
- coeff is found under classification analysis; the column to use depends on whether you are computing LCF0 or LCF1 (use column 1 for LCF1, column 0 for LCF0)
- Xn = given values
|
|
Definition
|
|
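A minimal sketch of the three formulas above. All coefficients, constants, and inputs are hypothetical numbers, not values from any real output. The final step, assigning the observation to the class with the larger LCF score, is the usual rule for classification functions and is an assumption here, since the card does not state it.

```python
def is_good_predictor(p_value, alpha):
    """Formula 1: a predictor is 'good' when its p-value is below alpha."""
    return p_value < alpha

def linear_score(constant, coefficients, xs):
    """Formulas 2 and 3 share one shape: constant + sum of (coeff * X)."""
    return constant + sum(c * x for c, x in zip(coefficients, xs))

# Formula 1: hypothetical p-values for two advertising types, alpha = 0.05
radio_ok = is_good_predictor(0.01, 0.05)
newspaper_ok = is_good_predictor(0.40, 0.05)

# Formula 2: hypothetical intercept, coefficients, and weekly ad spend (X1, X2)
predicted_sales = linear_score(5.0, [2.0, 0.5], [10, 20])

# Formula 3: one LCF per class, each with its own constant and coefficient column
xs = [6, 2, 6]
lcf0 = linear_score(-10.0, [1.0, 2.0, 0.5], xs)
lcf1 = linear_score(-15.0, [2.0, 2.5, 0.5], xs)
predicted_class = 1 if lcf1 > lcf0 else 0  # assumed rule: larger score wins

print(radio_ok, newspaper_ok, predicted_sales, lcf0, lcf1, predicted_class)
```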