Shared Flashcard Set

Details

Title

ISDS 2001 Test 4

Description

ISDS 2001 Catanzaro Test 4 Ch 4-5 Data Mining

Total Cards

Subject

Computer Science

Level

Undergraduate 2

Created

04/03/2012

Term

3 factors behind the sudden popularity in data mining

Definition

a. It is now cheaper to store and process data, and improved hardware provides the ability to collect and accumulate more data

b. Increased database capacities and the availability of data analysis tools made companies realize they have untapped data and the tools to analyze it

c. Consolidating data at the customer level and from various sources in a data warehouse gives the ability to analyze from a more complete view

Term

6 examples of applications of data mining

Definition

a. Identify successful therapies for illnesses and discover new drugs

b. Reduce fraudulent behavior (insurance claims, credit card usage)

c. Identify customer buying patterns

d. Reclaim profitable customers

e. Aid in market-basket analysis

f. Better target customers/clients

Term

_____ is used to describe knowledge discovery in databases.

Definition

Data mining

Term

Data Mining uses ___, ___, and other techniques to extract and identify useful information and subsequent knowledge from large databases.

Definition

Statistical

Mathematical

Term

Data mining is also referred to as:

1. _____

2. _____

3. _____

4. _____

5. _____

Definition

1. Knowledge extraction

2. Data archaeology

3. Data exploration

4. Data dredging

5. Information harvesting

Term

Data mining finds ___ and defines them in terms of mathematical rules. Those rules can then be used for prediction or association in an attempt to aid in decision making.

Definition

patterns

Term

Data mining algorithms fall into four broad categories:

1. ____ - find the commonly co-occurring groupings of things

2. ____ - tell the nature of future occurrences of certain events based on what has happened in the past

3. ____ - Identify natural groupings of things based on their known characteristics

4. ____ - discover time-ordered events

Definition

1. Associations

2. Predictions

3. Clusters

4. Sequential Relationships

Term

Two other data mining procedures are __ __ and __ __ __

Definition

Data Visualization

Time Series Forecasting

Term

__ __ are the most common of all data mining approaches

Definition

Classification Procedures

Term

Classification involves identifying patterns of data as belonging to a certain ____. Examples:

a. Credit Approval

b. Store Location

c. Target Marketing

d. Fraud Detection

e. Telecommunications

f. Route or Segmentation Decisions

Definition

Category

Term

The Basic Idea:

1. Define the ___

2. Use the data to develop a __ model

3. Use that model to predict unknown outcomes for __ __

Definition

Data

Mathematical

Future Observations

Term

If the outcome (Y) is categorical, and the predictors (Xs) are either categorical or numeric, you would use a _____

Definition

decision tree

Term

If the outcome (Y) is categorical, and the predictors (Xs) are all numeric and have normal distributions and equal variances, then you would use ___ ___ ___

Definition

linear discriminant analysis

Term

If the outcome (Y) is continuous numeric, and the predictors (Xs) are numeric with normal distributions and equal variances, then you would use ____ ____

Definition

linear regression
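
The three "which technique" cards above can be read as one decision rule. The sketch below is a study aid only: the function name `choose_method` and its parameters are made up for illustration, and preferring linear discriminant analysis over a decision tree whenever its assumptions hold is an assumption of this sketch, since the cards do not rank the two.

```python
def choose_method(outcome, predictors, normal_equal_variance=False):
    """Return the technique the cards pair with this kind of data.

    outcome: "categorical" or "numeric" (the type of Y)
    predictors: set of predictor types drawn from {"categorical", "numeric"}
    normal_equal_variance: True if the numeric Xs are roughly normal
        with equal variances
    """
    all_numeric = set(predictors) == {"numeric"}
    if outcome == "categorical":
        # Assumption of this sketch: prefer LDA when its conditions are met.
        if all_numeric and normal_equal_variance:
            return "linear discriminant analysis"
        return "decision tree"
    if outcome == "numeric" and all_numeric and normal_equal_variance:
        return "linear regression"
    return "not covered by these cards"


print(choose_method("categorical", {"categorical", "numeric"}))
print(choose_method("categorical", {"numeric"}, normal_equal_variance=True))
print(choose_method("numeric", {"numeric"}, normal_equal_variance=True))
```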

Term

Organizations must use a standardized approach for conducting a ____ project.

Definition

data mining

Term

Some proposed industry-standard models for data mining are:

1. _____ - one of the most popular non-proprietary standard methodologies for data mining

2. ____ - Ordinarily used in manufacturing, service delivery, management, and other business activities that rely on eliminating defects, waste, & quality control.

3. ___ - developed by the SAS institute.

Definition

CRISP-DM

DMAIC

SEMMA

Term

CRISP-DM stands for: ___

There are 6 steps of the CRISP-DM Model:

1. ____

2. ____

3. ____

4. ____

5. ____

6. ____

Definition

Cross-Industry Standard Process for Data Mining

1. Business Understanding

2. Data Understanding

3. Data Preparation

4. Modeling

5. Evaluation

6. Deployment

Term

DMAIC stands for: ___

Definition

Define, Measure, Analyze, Improve, Control

Term

SEMMA stands for: ____

Definition

Sample, Explore, Modify, Model, Assess

Term

____ places observations (rows, customers, students, etc.) into groups such that the members share similar characteristics but the groups themselves are highly different

Definition

Clustering Analysis
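
To make "groups that are created rather than known in advance" concrete, here is a minimal sketch that uses k-means on one-dimensional data. K-means is only one possible clustering technique and is not named on the cards; the spending figures and starting centers are hypothetical.

```python
def k_means_1d(values, centers, rounds=10):
    """Assign each value to its nearest center, then move every center
    to the mean of its group. Repeat for a fixed number of rounds."""
    groups = [[] for _ in centers]
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty group keeps its old center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


spend = [12, 15, 14, 80, 85, 90]  # hypothetical weekly spend per customer
centers, groups = k_means_1d(spend, centers=[0, 100])
print(groups)
print(centers)
```

No group labels are supplied; the two groups emerge from the data, with members close to each other and the group centers far apart.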

Term

Cluster Analysis is different from ___ analysis in that the groups are unknown and created in cluster analysis, whereas the groups are distinct and known when conducting a __ analysis.

Definition

Classification

Classification

Term

Market Segmentation is a common application of __ __

Definition

Cluster Analysis

Term

Market Segmentation is used to understand the ________

Definition

buyer behavior of customers

Term

Market Segmentation is used to help retailers in targeting similar groups of customers for defining the _______

Definition

appropriate advertising campaign

Term

Association Analysis is aimed at establishing relationships between ___

Definition

items (variables, columns)

Term

The goal of ___ is to group variables that are similar.

Definition

Association

Term

A common application of __ analysis is Market Basket Analysis

Definition

Association
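
A minimal sketch of the counting behind market basket analysis: for every pair of items, count how many baskets contain both. The baskets are hypothetical, and real association analysis goes further than raw pair counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets; each set is one customer's transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for items in transactions:
        # Sorting gives every pair one canonical (alphabetical) order.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
for pair, n in counts.most_common(3):
    print(pair, n, "of", len(baskets), "baskets")
```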

Term

____ - the semiautomatic process of extracting patterns from large amounts of unstructured data sources

Definition

Text Mining

Term

Some of the most popular text mining analyses discussed in class are:

a. ____

b. ____

c. ____

d. ____

Definition

Summarization

Categorization/Classification

Clustering

Concept Linking (Association)

Term

The most basic form of text mining used for summarization is ____

Definition

Term Extraction

Term

The ___-___ matrix is used for Categorization/Classification, Clustering, and Concept Linking.

Definition

Term-Document

Term

_____ maps unstructured information (in the form of a document of words) into a structured format (in the form of a feature/term vector) or a concept.

Definition

Text Mining

Term

A __ vector, or __ vector, is a weighted list of words which defines a concept that describes unstructured information (document of words)

Definition

Feature (term) vector

Term

Steps to creating a feature vector:

1. Eliminate ___

2. Replace words with their _ or _

3. Consider __ and __

4. Calculate the __ of the remaining terms

Definition

articles (the, and, other, etc)

stems/roots

Synonyms and Phrases

Weights

Term

To get the "TF" factor (term frequency), divide ___ by ___

Definition

Frequency

Total words left over
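
The four feature-vector steps and the TF calculation can be sketched together. Everything named here is an illustrative assumption: the stop list, the synonym map, and the suffix-stripping `stem` function are toy stand-ins for real resources, and phrase handling (part of step 3) is left out.

```python
from collections import Counter

# Toy resources; all three are illustrative assumptions, not a real lexicon.
STOP_WORDS = {"the", "a", "an", "and", "other", "of", "to", "in", "is"}
SYNONYMS = {"buyer": "customer", "client": "customer"}
SUFFIXES = ("ing", "es", "s")  # crude stand-in for a real stemmer


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def feature_vector(text):
    """Map raw text to {term: TF weight} following the four steps."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]  # step 1
    words = [stem(w) for w in words]                          # step 2
    words = [SYNONYMS.get(w, w) for w in words]               # step 3 (synonyms only)
    counts = Counter(words)
    total = len(words)                                        # total words left over
    return {term: n / total for term, n in counts.items()}   # step 4: TF


vector = feature_vector("The buyers and the clients target the customer.")
print(vector)
```

Here "buyers" and "clients" are stemmed and then mapped to "customer", so three of the four remaining words count toward the same term.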

Term

A ___-___ matrix is created where the ROWS represent the documents and the COLUMNS represent the terms (excluding top terms), and the frequencies represent the number of times a term appears in a particular document

Definition

Term-Document Matrix
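
A minimal sketch of building such a matrix, with documents as rows and terms as columns as the card describes. The corpus is hypothetical and its words are assumed to be already cleaned.

```python
from collections import Counter

# Hypothetical corpus; the terms are assumed to be already cleaned.
corpus = {
    "doc1": ["data", "mining", "data"],
    "doc2": ["text", "mining"],
    "doc3": ["web", "data", "text", "text"],
}


def term_document_matrix(docs):
    """Return (terms, rows): one row of term counts per document."""
    terms = sorted({t for words in docs.values() for t in words})
    rows = {}
    for name, words in docs.items():
        counts = Counter(words)
        rows[name] = [counts[t] for t in terms]  # missing terms count as 0
    return terms, rows


terms, rows = term_document_matrix(corpus)
print("terms", terms)
for name, row in rows.items():
    print(name, row)
```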

Term

The text mining process can be defined in _ consecutive tasks.

1. Establish the ___

2. Create the _____

3. Extract the _____

Definition

3

1. Establish the corpus

2. Create the term-document matrix

3. Extract the knowledge

Term

The largest data/text repository is ___

Definition

the web

Term

Examples of information found on the web:

a. Whose __ __ is linked to which other pages

b. How many people have ___ on their own website to other websites

c. How a particular site is ___

d. Tracking __ to a site, __ on a search engine, __ on e-commerce sites

Definition

home page

hyperlinks

organized

visitors, searches, transactions

Term

___ - the discovering of relationships from web data

Definition

Web Mining

Term

The 3 areas of web mining:

1. Web __ mining

2. Web __ mining

3. Web __ mining

Definition

Content

Structure

Usage

Term

Web __ mining extracts and uses content found within web pages.

Definition

Content

Term

Web __ mining extracts useful information from the analysis of links found in web documents

Definition

Structure
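
One simple form of link analysis is counting how many other pages link to each page. The sketch below does only that; the link graph and page names are hypothetical.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
links = {
    "a.example/home": ["b.example/home", "c.example/home"],
    "b.example/home": ["c.example/home"],
    "c.example/home": ["a.example/home"],
    "d.example/home": ["c.example/home"],
}


def inbound_counts(graph):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in graph.items():
        for target in set(targets):  # ignore repeated links from one page
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


inbound = inbound_counts(links)
print(inbound.most_common())
```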

Term

Web __ mining extracts and uses information that is generated through web page visits, traffic, transactions, etc.

Definition

Usage

Term

Web content mining is similar to ___ mining

Definition

text

Term

Web usage mining uses ____ data, which provides a trail of the user's activity and shows the user's browsing patterns: which sites are visited, pages accessed, time spent per page/site, etc.

Definition

Clickstream
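
As a rough illustration of what can be read from clickstream data, the sketch below totals the time spent per page for one session. The record layout (seconds since session start, page) and the sample clicks are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical clickstream for one user: (seconds since session start, page).
clicks = [
    (0, "/home"),
    (12, "/products"),
    (47, "/products/42"),
    (95, "/cart"),
    (120, "/home"),
]


def time_per_page(events):
    """Sum the seconds between each click and the next one, per page.

    The last click has no following event, so its dwell time is unknown
    and it is left out.
    """
    totals = defaultdict(int)
    for (start, page), (end, _next_page) in zip(events, events[1:]):
        totals[page] += end - start
    return dict(totals)


dwell = time_per_page(clicks)
print(dwell)
```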

Term

Formulas:

1. Which predictor is best (given alpha)?

Compare the p-value of each predictor (e.g., radio or newspaper) to the given alpha.
p-value < alpha = good predictor
p-value > alpha = bad predictor

2. Predict weekly sales (make sure Adj R-sq is between 0 and 1)
Y = intercept coeff + (coeff1 * X1) + (coeff2 * X2)
where
Y = predicted weekly sales
X1, X2 = money spent per week on each type of advertising

3. Calculate LCF (linear classification function)
LCF1 or LCF0 = constant + (coeff1 * X1) + (coeff2 * X2) + (coeff3 * X3)
where
- The coefficients are listed under classification analysis. Use column 1 when calculating LCF1 and column 0 when calculating LCF0.
- Xn = given values

Definition
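
The three formulas on this card can be sketched in Python. All numbers below (alpha, coefficients, p-values, spending) are hypothetical, and the function names are made up for illustration. Assigning an observation to the class with the larger LCF score is the usual way these functions are used, although the card itself does not state it.

```python
ALPHA = 0.05  # hypothetical significance level

# Hypothetical regression output: predictor -> (coefficient, p-value).
INTERCEPT = 10.0
predictors = {"radio": (2.5, 0.01), "newspaper": (0.5, 0.30)}


def significant(table, alpha):
    """Formula 1: keep the predictors whose p-value is below alpha."""
    return [name for name, (_coef, p) in table.items() if p < alpha]


def predict_sales(intercept, table, spend):
    """Formula 2: Y = intercept + sum of (coefficient * X)."""
    return intercept + sum(coef * spend[name] for name, (coef, _p) in table.items())


def lcf(constant, coeffs, xs):
    """Formula 3: linear classification function score for one class."""
    return constant + sum(c * x for c, x in zip(coeffs, xs))


def classify(xs, class0, class1):
    """Assign the class whose LCF score is larger (ties go to class 0)."""
    score0 = lcf(class0[0], class0[1], xs)
    score1 = lcf(class1[0], class1[1], xs)
    return 1 if score1 > score0 else 0


good = significant(predictors, ALPHA)
sales = predict_sales(INTERCEPT, predictors, {"radio": 4.0, "newspaper": 5.0})
# Each class is (constant, [coefficients]), read from its own column.
label = classify([4.0, 3.0], class0=(-3.0, [0.5, 1.0]), class1=(-8.0, [1.5, 2.0]))
print(good, sales, label)
```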
