Shared Flashcard Set

Details

Title

i211 midterm exam

Description

advanced comp programming

Subject

Computer Science

Level

Undergraduate 3

Created

02/23/2010


Term

data mining

Definition

the process of extracting patterns from data. an increasingly important tool used to turn data into information.

Term

Data mining: the confluence of disciplines

Definition

statistics, visualization, algorithms, pattern recognition, machine learning, database technology

Term

data mining and business intelligence

Definition

decision making
^
data presentation (visualization)
^
data mining (info discovery)
^
data exploration (statistical summary, querying, reporting)
^
data preprocessing/integration
^
data sources (paper, files, web docs, scientific experiments)

Term

Predictive methods

Definition

use some variables to predict unknown or future values of other variables

Term

Descriptive methods

Definition

find human-interpretable patterns that describe the data

Term

Classification

Definition

predictive. goal: previously unseen records should be assigned a class as accurately as possible

Term

training set

Definition

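a collection of records used to build the classification model. each record contains a set of attributes, one of which is the class.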

Term

Classification Applications

Definition

direct marketing- reduce cost of mailing by targeting a set of consumers likely to buy a new product.

fraud detection- predict fraudulent cases in credit card transactions (e.g., American Express used this approach)

customer attrition/churn- predict whether a customer is likely to be lost to a competitor

Sky survey cataloging- to predict the class of sky objects, esp visually faint ones, based on the telescopic survey images

Term

set

Definition

an arbitrary collection of objects
(A= {a,b,c})

union, intersection, subset, and difference are in the week 2 notes
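
The four set operations named on this card can be tried out in a few lines. This is only a minimal sketch using Python's built-in set type rather than the course's week 2 notes; the sets A and B are made-up examples.

```python
# Hypothetical example sets, not taken from the course notes.
A = {"a", "b", "c"}
B = {"b", "c", "d"}

union = A | B                  # elements in A or in B
intersection = A & B           # elements in both A and B
difference = A - B             # elements in A but not in B
is_subset = {"a", "b"} <= A    # True when every element is also in A
```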

Term

Cartesian product AxB

Definition

the set of all possible ordered pairs (x,y) where x is in A and y is in B
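
As a small illustration, the Cartesian product can be built with the standard library. The sets below are made-up examples, and Python is used here only as a convenient notation.

```python
from itertools import product

# Hypothetical example sets.
A = {1, 2}
B = {"x", "y"}

# Every ordered pair (x, y) with x taken from A and y taken from B.
AxB = set(product(A, B))
```

The product has |A| * |B| pairs, so here it contains four.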

Term

Matlab function

Definition

a function in Matlab must be in this form:

function [output_parameter_list] = function_name(input_parameter_list)

Term

clustering

Definition

descriptive. given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
-data points in one cluster are more similar to one another
-data points in separate clusters are less similar to one another

Term

Clustering Applications

Definition

market segmentation: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

document clustering: to find groups of docs that are similar to each other based on the important terms appearing in them.

Term

Association rule discovery

Definition

descriptive. Given a set of records each of which contain some number of items from a given collection, produce dependency rules which will predict occurrences of an item based on occurrences of other items.

Term

Association rule discovery applications

Definition

marketing and sales promotion: find out what to put on sale to boost sales for another item

supermarket shelf management: to identify items that are bought together by sufficiently many customers

inventory management: a consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.

Term

Sequential pattern discovery

Definition

descriptive. given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

-rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints

Term

regression

Definition

predictive. predict the value of a continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency

Term

Deviation/Anomaly detection

Definition

detect significant deviations from normal behavior

Term

Deviation/Anomaly detection Applications

Definition

credit card fraud and network intrusion detection.

Term

data

Definition

a collection of facts from which conclusions may be drawn

a collection of data objects and their attributes

Term

attribute

Definition

a property or characteristic of an object (columns)

a collection of attributes describes a data point

Term

attribute values

Definition

numbers or symbols assigned to an attribute

Term

Measure of length

Definition

the way you measure an attribute may not match the attribute's properties

Term

Types of attributes and examples

Definition

Nominal: ID numbers, eye color, zip codes

Ordinal: rankings, grades, height in {tall, medium, short}

Interval: calendar dates, temperatures

Ratio: length, time, counts

Term

properties of attribute values

Definition

distinctness: equal or not equal to

order: < >

addition: + -

multiplication: * /

Term

Nominal definition and properties

Definition

provides only enough info to distinguish one object from another

Distinctness

Term

Ordinal definition and properties

Definition

provide enough info to order objects

Distinctness and order

Term

Interval definition and properties

Definition

the differences between values are meaningful (i.e. a unit of value exists)

distinctness, order, and addition

Term

Ratio definition and properties

Definition

both differences and ratios are meaningful

Distinctness, order, addition, multiplication

Term

Discrete attribute

Definition

has only a finite or countably infinite set of values (zip codes, counts, set of words in a collection of docs)

-often represented as integer variables

Term

Continuous attribute

Definition

has real numbers as attribute values (temp, height, weight)

-continuous attributes are typically represented as floating-point variables

Term

Types of data sets

Definition

Record: data matrix, doc data, transaction data

Graph: WWW, molecular structures

Ordered: spatial data, temporal data, sequential data, genetic sequence data.

Term

Characteristics of structured data

Definition

Dimensionality: curse of dimensionality (as the number of dimensions grows, the data becomes increasingly sparse in the space it occupies and hard to analyze)

Sparsity: only presence counts

Resolution: patterns depend on the scale

Attribute and Class imbalance: small number of nonzero elements (related to sparsity)

Term

Record data

Definition

data that consists of a collection of records, each of which consists of a fixed set of attributes

Term

Data matrix

Definition

if data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute. This data can be represented in an m-by-n matrix where m=rows (one for each object) and n=columns (one for each attribute)

Term

Document data

Definition

each document becomes a term vector
-each term is a component (attribute) of the vector
-the value of the component is the number of times the corresponding term occurs in the document
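
A term vector as described on this card can be sketched in a few lines. The helper name `term_vector`, the vocabulary, and the sample sentence are illustrative assumptions, and splitting on whitespace is a deliberately naive way to find terms.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    # Naive tokenization (an assumption): lower-case, split on whitespace.
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document.
vocabulary = ["data", "mining", "pattern"]
vector = term_vector("Data mining turns data into information", vocabulary)
```

Each position in `vector` corresponds to one term of the vocabulary, so every document in a collection maps to a vector of the same length.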

Term

Transaction data

Definition

a special type of record data where each record (transaction) involves a set of items (ex, a grocery store purchase per customer)

Term

Load data from websites

Definition

use the urlread function

Term

Load data from excel files

Definition

use xlsread function

Term

Load data from text files

Definition

use textscan and related functions

Term

Load data from CSV files

Definition

use csvread function

Term

Reading custom file types

Definition

use fopen, fclose, fgetl, fgets for text files
use fread, fwrite, fseek, ftell for binary files

Term

Data quality problems (examples)

Definition

-noise and outliers
-missing values
-duplicate data

Term

Noise

Definition

refers to modification of original values (e.g., static on a television screen)

Term

Outliers

Definition

data objects with characteristics that are considerably different than most of the other data objects in the data set

Term

Missing values

Definition

reasons for missing values: information not collected or attributes may not be applicable to all cases

Handling missing values: eliminate data objects or estimate missing values
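
The two handling strategies on this card might look like the sketch below. Marking a missing value with `None` and estimating it with the mean of the present values are assumptions made for illustration; the card does not say which estimate the course uses.

```python
def drop_missing(values):
    """Strategy 1: eliminate the entries whose value is missing."""
    return [v for v in values if v is not None]

def fill_with_mean(values):
    """Strategy 2: estimate each missing value (here: mean of the present ones)."""
    present = drop_missing(values)
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

# Hypothetical attribute column with one missing entry.
heights = [170.0, None, 180.0, 160.0]
dropped = drop_missing(heights)
filled = fill_with_mean(heights)
```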

Term

Duplicate data

Definition

data set may include objects that are duplicates (this is major issue when merging data from heterogeneous sources)

Term

Data cleaning

Definition

process of dealing with noise and duplicate data

Term

Similarity

Definition

-numerical measure of how alike two data points are
-is higher when objects are more alike
-often falls in the range [0,1]

Term

Dissimilarity

Definition

-numerical measure of how different two data points are
-lower when objects are more alike
-minimum dissimilarity is often 0
-upper limit varies

Term

Proximity

Definition

refers to both similarity and dissimilarity

Term

Euclidean distance

Definition

the Minkowski distance with r=2

$$ dist(p,q) = \left( \sum_{k=1}^{n} (p_k - q_k)^2 \right)^{1/2} $$

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Term

Minkowski distance

Definition

minkowski distance is a generalization of euclidean distance

$$ dist(p,q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r} $$

Term

Manhattan distance

Definition

r=1

$$ dist(p,q) = \sum_{k=1}^{n} |p_k - q_k| $$

Term

Lmax distance

Definition

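r goes to infinity; the distance becomes the largest difference over all attributes

$$ dist(p,q) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r} = \max_{k} |p_k - q_k| $$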
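
The Euclidean, Manhattan, and Lmax cards are all the Minkowski distance with r=2, r=1, and r going to infinity. A minimal sketch of that family follows; the function name and the sample points are illustrative assumptions.

```python
import math

def minkowski(p, q, r):
    """Minkowski distance between equal-length points p and q."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(r):
        # Limit case: the largest single-attribute difference (Lmax).
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

# Hypothetical two-dimensional points.
p, q = (0, 0), (3, 4)
manhattan = minkowski(p, q, 1)
euclidean = minkowski(p, q, 2)
lmax = minkowski(p, q, math.inf)
```

For these two points the three distances are 7, 5, and 4, which shows how a larger r puts more weight on the largest difference.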

Term

Simple matching coefficients

Definition

SMC = number of matching attribute values / number of attributes

Term

Jaccard coefficients

Definition

J = number of 1-1 matches / number of attributes whose values are not both zero
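
The difference between the simple matching and Jaccard coefficients is easiest to see on sparse binary vectors. The sketch below is illustrative: the function names and the two vectors are assumptions, and returning 0.0 when both vectors are all zeros is a convention chosen here, not something the cards specify.

```python
def smc(x, y):
    """Simple matching coefficient: matches / number of attributes."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches / attributes that are not both zero."""
    both_one = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    not_both_zero = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    if not_both_zero == 0:
        return 0.0  # assumed convention for two all-zero vectors
    return both_one / not_both_zero

# Hypothetical binary vectors with very few 1s.
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc_xy = smc(x, y)
jaccard_xy = jaccard(x, y)
```

Here the many shared zeros make the SMC look high (0.7) even though the vectors have no 1 in common, while the Jaccard coefficient is 0.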

Term

Cosine similarity

Definition

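$$ \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} $$

where d_1 \cdot d_2 is the dot product of the two vectors and \|d\| is the length of vector d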
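
A small sketch of the cosine formula, assuming the two documents are given as term-count vectors of equal length with at least one nonzero entry each. The function name and the vectors are made up for illustration.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)  # assumes neither vector is all zeros

# Hypothetical term-count vectors for two documents.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
similarity = cosine_similarity(d1, d2)
```

The dot product is 5 and the lengths are sqrt(42) and sqrt(6), so the similarity comes out at roughly 0.315.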

Term

Summary statistics

Definition

numbers that summarize properties of the data.

-properties include: frequency, location, and spread

Term

Frequency

Definition

frequency of an attribute value is the percentage of time the value occurs in the data set

Term

Mode

Definition

the mode of an attribute is the most frequent attribute value

Term

Mean

Definition

the most common measure of the location of a set of m points (sensitive to outliers)

Term

Median

Definition

middle value in the ordered data x(1) <= x(2) <= ... <= x(m).

if m is odd, m=2r+1: median = x(r+1)
if m is even, m=2r: median = (x(r) + x(r+1)) / 2

Term

Range

Definition

the difference between the max and min

Term

STDEV

Definition

standard deviation. the most common measure of the spread of a set of points (see notes for the equation)
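
The summary statistics from the frequency, mode, mean, median, range, and STDEV cards can be computed with Python's standard `statistics` module. This is a sketch on made-up values; since the equation from the notes is not reproduced on the card, the sample standard deviation (dividing by m-1) is an assumption.

```python
import statistics

# Hypothetical attribute values.
values = [2, 4, 4, 5, 10]

frequency_of_4 = values.count(4) / len(values)   # fraction of the time 4 occurs
mode = statistics.mode(values)                   # most frequent value
mean = statistics.mean(values)                   # location, sensitive to outliers
median = statistics.median(values)               # middle value of the ordered data
value_range = max(values) - min(values)          # max minus min
stdev = statistics.stdev(values)                 # sample standard deviation (assumed)
```

The outlier 10 pulls the mean (5) above the median (4), which is the sensitivity the Mean card mentions.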

Term

Correlation

Definition

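measures the linear relationship between the attributes of two objects

to compute, standardize p and q and take their dot product:

$$ p'_k = \frac{p_k - \mathrm{mean}(p)}{\mathrm{stdev}(p)} $$
$$ q'_k = \frac{q_k - \mathrm{mean}(q)}{\mathrm{stdev}(q)} $$

$$ \mathrm{correlation}(p,q) = \frac{p' \cdot q'}{n-1} $$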
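
The computation on this card can be sketched as follows. It assumes stdev is the sample standard deviation (the one that divides by n-1), which is what makes the result land in [-1, 1]; the function name and test vectors are illustrative.

```python
import statistics

def correlation(p, q):
    """Standardize p and q, then divide their dot product by n - 1."""
    n = len(p)
    mean_p, std_p = statistics.mean(p), statistics.stdev(p)
    mean_q, std_q = statistics.mean(q), statistics.stdev(q)
    p_std = [(x - mean_p) / std_p for x in p]
    q_std = [(x - mean_q) / std_q for x in q]
    return sum(a * b for a, b in zip(p_std, q_std)) / (n - 1)

# Hypothetical objects: q rises with p, r falls as p rises.
p = [1, 2, 3]
q = [2, 4, 6]
r = [6, 4, 2]
positive = correlation(p, q)
negative = correlation(p, r)
```

A perfectly increasing linear relationship gives 1 and a perfectly decreasing one gives -1.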

Term

K-nearest neighbor algorithm (KNN)

Definition

a simple algorithm that stores all available data points (examples) and classifies new data points based on a similarity measure

Term

Properties of KNN

Definition

-belongs to the class of "lazy" algorithms. no process of learning a model. examples are simply stored as the data is collected
-difficulty comes at classification stage. we need to calculate n distances and find best K data points
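
A minimal sketch of the classification stage, assuming Euclidean distance as the (dis)similarity measure and a majority vote among the K nearest examples. Both choices, the function name, and the data are assumptions; the cards only say that a similarity measure is used.

```python
import math
from collections import Counter

def knn_classify(examples, new_point, k):
    """Classify new_point by majority vote among its k nearest stored examples."""
    # "Lazy": nothing is learned up front; all n distances are computed here.
    by_distance = sorted(examples, key=lambda ex: math.dist(ex[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical stored examples as (point, class label) pairs.
examples = [
    ((1, 1), "a"), ((1, 2), "a"),
    ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b"),
]
label_near_a = knn_classify(examples, (2, 1), 3)
label_near_b = knn_classify(examples, (5, 4), 3)
```

Sorting every stored example for each query is the cost the second property describes; it grows with the number of stored points n.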

Term

Data exploration

Definition

a preliminary exploration of data to better understand its characteristics

Term

Visualization

Definition

the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

one of the most powerful and appealing techniques of data exploration

Term

Representation

Definition

the mapping of info to a visual format

Term

Arrangement

Definition

the placement of visual elements within a display

Term

Selection

Definition

the elimination or the de-emphasis of certain objects and attributes

Term

Histogram

Definition

-usually shows the distribution of values of a single variable
-divide the values into bins and show a bar plot of the number of objects in each bin
-the height of each bar indicates the number of objects
-shape of it depends on the number of bins
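
The binning step behind a histogram can be sketched without any plotting library. Equal-width bins between the minimum and maximum are an assumption here, as are the function name and the sample values.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins   # assumes high > low
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical values of a single variable.
values = [1, 2, 2, 3, 5, 8, 9, 10]
three_bins = histogram_counts(values, 3)
two_bins = histogram_counts(values, 2)
```

With three bins the counts are 4, 1, 3 and show a gap in the middle; with two bins they are 5 and 3 and the gap disappears, which is why the shape depends on the number of bins.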

Term

Box plots

Definition

-invented by J. Tukey
-another way of displaying the distribution of data

Term

Scatter plots

Definition

-attribute values determine the position
-2 dimensional scatter plots most common
-often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects

Term

Contour plots

Definition

-useful when a continuous attribute is measured on a spatial grid
-they partition the plane into regions of similar values
- the contour lines that form the boundaries of these regions connect points with equal values

Term

Matrix plots

Definition

-can plot the data matrix
-can be useful when objects are sorted according to class

Term

Visualization techniques

Definition

Parallel coordinates: used to plot the attribute values of high-dimensional data. Instead of using perpendicular axes, use a set of parallel axes.

Star plots: similar approach to parallel coords but axes radiate from a central point. The line connecting the values of an object is a polygon.

Chernoff faces: created by Herman Chernoff, associates each attribute with a characteristic of a face. The values of each attribute determine the appearance of the corresponding facial characteristic.

Term

Multi-dimensional measure of data quality

Definition

accuracy
completeness
consistency
timeliness
believability
value added
interpretability
accessibility

Term

data cleaning

Definition

fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Term

data integration

Definition

integration of multiple databases or files

Term

data transformation

Definition

normalization and aggregation
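
The card names the two operations without saying which variants the course uses. As one possible reading, the sketch below shows min-max normalization to the range [0, 1] and aggregation by summing amounts per key; the function names and the sample data are assumptions.

```python
from collections import defaultdict

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]  # assumes high > low

def aggregate_totals(records):
    """Combine (key, amount) records into one total per key."""
    totals = defaultdict(float)
    for key, amount in records:
        totals[key] += amount
    return dict(totals)

# Hypothetical data.
normalized = min_max_normalize([10, 20, 30])
totals = aggregate_totals([("store1", 5.0), ("store2", 2.0), ("store1", 3.0)])
```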

Term

data reduction

Definition

obtains reduced representation in volume but produces the same or similar analytical results

Term

data discretization

Definition

part of data reduction but with particular importance, esp for numerical data

Term

Data cleaning tasks

Definition

fill in missing values, identify outliers and smooth out noisy data, correct inconsistent data, resolve redundancy caused by data integration

Term

Attribute transformation

Definition

a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values

Term

Types of sampling: simple random sampling

Definition

there's an equal probability of selecting any particular item

Term

Types of sampling: sampling without replacement

Definition

as each item is selected, it's removed from the population

Term

Types of sampling: sampling with replacement

Definition

objects not removed from the population as they are selected for the sample. (the object can be picked up more than once)

Term

Types of sampling: stratified sampling

Definition

split the data into several partitions; then draw random samples from each partition
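
The sampling types from the last four cards can be sketched with Python's standard `random` module. The population, the partition names, and the helper `stratified_sample` are made-up illustrations, and the fixed seed is only there to make the run repeatable.

```python
import random

rng = random.Random(0)          # seeded generator (arbitrary seed)
population = list(range(10))    # hypothetical population of 10 items

# Without replacement: a selected item cannot be picked again.
without_replacement = rng.sample(population, 5)

# With replacement: the same item may be picked more than once.
with_replacement = rng.choices(population, k=5)

def stratified_sample(partitions, per_partition, rng):
    """Draw a simple random sample of the same size from every partition."""
    return {name: rng.sample(items, per_partition)
            for name, items in partitions.items()}

# Hypothetical partitions of the population.
partitions = {"low": [0, 1, 2, 3, 4], "high": [5, 6, 7, 8, 9]}
strata = stratified_sample(partitions, 2, rng)
```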

Term

feature subset selection

Definition

another way to reduce dimensionality of data

removes redundant and irrelevant features

Term

Feature subset selection techniques

Definition

brute force approach: try all possible feature subsets as input to data mining algorithm

embedded approaches: feature selection occurs naturally as part of the data mining algorithm

filter approaches: features are selected before data mining algorithm is run

wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes
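
To see why the brute force approach rarely scales, it helps to enumerate the candidate subsets. The sketch below only lists the non-empty subsets and does not run any data mining algorithm on them; the function name and the feature names are hypothetical.

```python
from itertools import combinations

def all_feature_subsets(features):
    """List every non-empty subset of the given features."""
    subsets = []
    for size in range(1, len(features) + 1):
        subsets.extend(combinations(features, size))
    return subsets

# Hypothetical feature names.
features = ["age", "income", "zip"]
subsets = all_feature_subsets(features)
```

With n features there are 2^n - 1 non-empty subsets (7 here), so the number of candidates doubles with every added feature.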

Term

Feature creation

Definition

create new attributes that can capture the important info in a data set much more efficiently than the original attributes

Term

methodologies

Definition

mapping data to a new space

Flashcard Machine - create, study and share online flash cards
