Shared Flashcard Set

Details

Stats Test #2
Vocab, Methods, and anything worth remembering
112
Mathematics
Undergraduate 1
04/17/2011

Cards

Term
Basic Context for Data (5-6 questions to ask)
Definition
Who, What, When, Where, Why and How?
Term
Categorical Variable
Definition
A variable whose values name categories rather than measure a quantity that can be summed or otherwise manipulated arithmetically. The categories may be coded with numbers
Term
Quantitative Variable
Definition
A variable measured in units, recording the amount or count of something or some occurrence.
Term
Identifier Variables
Definition
A number assigned to each individual case for sorting purposes
Term
Frequency Table/ Relative Frequency Table
Definition
A table listing each category with its total count (frequency table), or with each count expressed as a proportion or percent of the whole (relative frequency table)
Term
Bar Chart
Definition
Displays distribution of a categorical variable. NOT a quantitative variable
Term
Contingency Table
Definition
A table that breaks counts down by two categorical variables at once, showing how the totals split into their parts. The margins hold the totals
Term
Area Principle
Definition
When graphing data, make sure each category has an area proportional to the value it represents
Term
Simpson's Paradox
Definition
Unfair averaging across groups that differ in conditions or size, which can make the combined result reverse the trend seen within each group
Term
Histogram
Definition
Only for quantitative data. Looks like a bar chart (which is only for categorical data), except that there is no space between bars unless there is a gap in the data. Good for illustrating a distribution
Term
Stem and Leaf Displays (and Dotplots)
Definition
Writing the first digit on one side of the table, then listing one following digit for each case in that range. Dotplots replace digits with dots
Term
Three things to mention when describing distribution
Definition
Shape: How many modes? Is it symmetric or skewed? Any outliers?
Center: Median or mean
Spread: Standard deviation or interquartile range
Term
Unimodal /Bimodal/ Multimodal
Definition
With one hump/ 2 humps/ more than 2 humps
Term
Uniform (Shape)
Definition
A distribution that is roughly flat, with no clear mode or trend
Term
Skew
Definition
When one tail (the thinner end of the distribution) stretches out farther than the other, the distribution is said to be skewed toward that tail
Term
Interquartile Range (IQR)
Definition
The upper quartile (75th percentile) minus the lower quartile (25th percentile)
Term
Variance
Definition
The sum of the squared differences between each y value and the mean, divided by (n-1)

It is the quantity you take the square root of to find the standard deviation
Term
Standard Deviation
Definition
Take the square root of:
(Sum of the squared differences between each y and the mean)/ (n-1)
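A minimal sketch of the variance and standard deviation cards, assuming a small made-up data set (the values and function names are illustrative, not from the course):

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)

def sample_sd(values):
    """The standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(values))

# Made-up data, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
variance = sample_variance(data)
sd = sample_sd(data)
print(variance, sd)
```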
Term
Drawing Boxplots
Definition
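Draw a box from the lower quartile to the upper quartile, with a line at the median (not the mean). Extend whiskers to the most extreme data values within 1.5 times the IQR of the quartiles, and plot any points beyond that individually as outliers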
Term
Z-Score (Standardized Value)
Definition
(y-the mean of y)/ standard deviation. Written z(x) or z(y)
Term
How does standardizing data change data
Definition
Shape: Does not change
Center: Makes the mean 0
Spread: The standard deviation becomes 1
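The effect of standardizing can be checked numerically. This is a sketch on made-up values; `z_scores` is a hypothetical helper name:

```python
import statistics

def z_scores(values):
    """Standardize each value: subtract the mean, divide by the SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator
    return [(y - mean) / sd for y in values]

# Made-up data, for illustration only.
data = [12.0, 15.0, 9.0, 20.0, 14.0]
z = z_scores(data)
z_mean = statistics.mean(z)   # should be 0, up to rounding error
z_sd = statistics.stdev(z)    # should be 1, up to rounding error
print(z_mean, z_sd)
```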
Term
Nearly Normal Condition
Definition
If the shape of the data's distribution is unimodal and symmetric, you can apply a Normal model. Make a picture (histogram or normal probability plot) to check
Term
The 68-95-99.7 Rule
Definition
Within 1 SD on either side of the mean is about 68% of the data, within 2 SDs is about 95%, and within 3 SDs is about 99.7%
Term
Finding Normal Percentiles
Definition
Calculate the z-score, then find its first 2 digits (ones and tenths) down the left side of the table and its hundredths digit across the top of the table to read off the corresponding normal percentile
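In place of a printed table, Python's standard library `statistics.NormalDist` can do the lookup. A sketch, with hypothetical helper names, that also reproduces the 68-95-99.7 figures:

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

def normal_percentile(y, mean, sd):
    """Fraction of a Normal model lying below y."""
    z = (y - mean) / sd
    return standard_normal.cdf(z)

def share_within(k):
    """Fraction of a Normal model within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```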
Term
Normal Probability Plot
Definition
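The y axis shows the data values (the x axis of the corresponding histogram, ex. mpg) and the x axis shows the z-score expected for each data point under a Normal model. Nearly Normal data should follow a straight diagonal line rising from left to right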
Term
Things to look for in Scatterplots
Definition
Direction: Is it positive or negative
Form: Is it linear? Curved?
Strength: How much does it scatter?
Outliers: Any points that stand apart from the overall pattern and could distort the analysis
Term
Predictor/ Explanatory Variable
Definition
The variable on the x axis, which is believed to inform or predict the y value
Term
Response Variable
Definition
The variable on the y axis and the variable of interest. This is the variable whose variation (standard deviation, etc.) we try to explain
Term
Correlation (r)
Definition
Measures the strength of the linear association between two quantitative variables.

r= (the sum of z(x) times z(y)) / (n-1)
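A sketch of the correlation formula, computed directly from z-scores on made-up data (the function name is illustrative):

```python
import statistics

def correlation(xs, ys):
    """r = sum of z(x) * z(y), divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Perfectly linear made-up data, so r should come out as 1 and -1.
print(correlation([1, 2, 3, 4], [3, 5, 7, 9]))
print(correlation([1, 2, 3, 4], [9, 7, 5, 3]))
```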
Term
Correlation Conditions
Definition
Quantitative Variables Condition: Make sure data isn't categorical
Straight Enough Condition: It is subjective, but make sure the data isn't clearly non-linear
Outlier Condition: Make sure outliers are not present, as they can distort the correlation dramatically

Check these conditions with a scatter plot
Term
Lurking Variable
Definition
A hidden variable, not among those being studied, that affects both variables of interest. It explains why a correlation can be misleading and does not prove causation
Term
Kendall's tau
Definition
Designed to assess how close the relationship between two variables is to being monotone. A monotone relationship is one that consistently increases or decreases, not necessarily linearly. A value of -1 means consistently decreasing; 1 means consistently increasing. It is a nonparametric measure
Term
Spearman's Rho
Definition
A correlation computed on ranks: replace each x value and each y value with its rank (1, 2, 3, etc.), then find the correlation of the ranks. It is less sensitive to outliers, also falls between -1 and 1, and is a nonparametric measure.
Term
Residual
Definition
The difference between the observed y value of a point and the y value predicted by the linear regression (also referred to as y(hat)): residual = y - y(hat).
Term
Line of Best Fit
Definition
Also known as the least squares line: the line that minimizes the sum of the squared residuals
Term
Linear Regression equation
Definition
y(hat)= b0+ b1(x)
Term
b1 (The slope of linear regression) equation
Definition
r (sy/sx)
or
the correlation times (standard deviation of y/ standard deviation of x)
Term
b0 (y intercept)
Definition
y(avg) - b1*x(avg)
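The slope and intercept cards combine into a short calculation. A sketch on made-up data; `least_squares_line` is a hypothetical helper name:

```python
import statistics

def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    # Correlation r, written as the sum of products of deviations.
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up data lying exactly on y = 1 + 2x.
b0, b1 = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```

A residual for any point is then y - (b0 + b1*x).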
Term
R^2 value
Definition
Gives the fraction (between 0 and 1) of the data's variation accounted for by the model
Term
Does the Plot Thicken? Condition
Definition
When you plot the residuals against the predicted values, the spread should be about the same throughout, with no discernible pattern. If the plot thickens or shows a pattern, your model isn't ideal
Term
Inverting the Regression
Definition
You can't simply rearrange a regression line equation to predict x from y unless the correlation is exactly 1 or -1. You must apply the b1 and b0 formulas again with the roles of x and y swapped
Term
Leverage
Definition
The extent to which a point can influence the regression. Points whose x values lie far from the mean of x have high leverage
Term
Subsets
Definition
Distinguishable traits of the data that can allow you to fit different regression lines to different segments of information (male/female etc...)
Term
Goals of Re-expression
Definition
1. Make the distribution of a variable more symmetric
2. Make the spread of several groups (as seen in side-by-side boxplots) more alike, even if their centers differ (often achieved with logs)
3. Make the form of a scatterplot more nearly linear
4. Make the scatter in a scatterplot spread out evenly rather than thickening at one end
Term
Ladder of Powers: 2
Definition
Squaring the values (y^2). Try it for unimodal, left-skewed histograms
Term
Ladder of Powers: "0" aka Logs
Definition
This is the go-to. Logs are undefined for negative or 0 values, so add a small constant to all data values if needed. Try logging y, then logging x, and if all else fails log both.
Term
Ladder of Powers: -1/2
Definition
The negative reciprocal square root (-1/sqrt(y)) preserves the direction of relationships. A last resort
Term
Ladder of Powers:-1
Definition
The reciprocal, taken as positive (1/y) or negative (-1/y) depending on which direction you want the data to go. Ratios of 2 quantities benefit the most.
Term
Sample Strategies and Ideals to keep in mind:
Definition
1: Examine a Part of the Whole: Try to avoid bias by representing all parts of the population in proportion to their share of the whole
2: Randomize: When in doubt, select at random so that nothing about how the sample was chosen could be associated with what you are measuring
3: It's the Sample Size: The fraction of the population sampled doesn't matter, just the actual sample size (2,000 is a good number).
Term
Census
Definition
A sample that includes the entire population; often quite inefficient
Term
Parameter v. Statistics
Definition
Parameters are real information about the world that we are trying to get at, often in vain.
Statistics are anything we calculate from data
Term
Simple Random Sample (SRS)
Definition
A sample drawn so that every possible combination of n individuals has an equal chance of being selected. The standard against which all other sampling methods are compared
Term
Sampling Frame
Definition
The list of individuals from which the sample is drawn
Term
Stratified Random Sampling
Definition
Dividing the population into distinct groups (strata) and taking a simple random sample within each stratum.
Term
Cluster Sampling
Definition
Splitting the population into clusters that each resemble the population as a whole, then sampling one or more entire clusters. If the clusters don't represent the population as a whole, the sample will be biased. Can also be one stage of a multistage sample
Term
Systematic Sample
Definition
When you select individuals by a fixed, systematic rule rather than at random. For example, selecting every 20th person in a population.
Term
Pilot
Definition
A trial run of a survey before it is employed in a larger group at higher cost. Gives you a chance to recognize flaws in your design
Term
Sampling Technique Errors
Definition
Voluntary Response Sample: Because it is self-selected, it is inherently biased
Convenience Sampling: Includes only those who are easy to reach, so it does not usually yield unbiased information
Term
Mistakes Which Can Arise
Definition
Nonrespondents: It's always a good investment to limit the number of nonrespondents, because leaving them out can shift the results
Response Bias: Anything in the survey that influences responses (the wording of a question, the environment it's taken in)
Term
Observational Studies
Definition
When people or subjects are viewed in their natural environments. Often retrospective studies
Term
Prospective v. Retrospective Studies
Definition
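Prospective studies select individuals in advance and follow them for a given amount of time, collecting data as events unfold. Retrospective studies look back at records of events that have already happened. Prospective studies are generally favored over retrospective ones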
Term
Experiment
Definition
When you attempt to isolate very simple variables through random assignment of treatments to subjects. Active manipulation by researchers.
Term
The 4 Principles of Experimental Design
Definition
1. Control: Control sources of variation other than the factors we are testing
2. Randomize: Random assignment equalizes the effects of unforeseen or uncontrollable sources of variation
3. Replicate: Apply each treatment to several subjects, and repeat the experiment in slightly altered situations to show the results are not a fluke
4. Block: Sometimes attributes of the subjects affect the outcome of an experiment, so group similar subjects into blocks and randomize within each block
Term
Blinding
Definition
Limiting how knowledge of the treatment assignments can influence the experiment, by keeping them secret from the subjects and from the researchers. An experiment is "double blind" when both the subjects and those who evaluate the results are unaware of who received which treatment.
Term
Matching
Definition
Pairing subjects because they are similar in ways not under study
Term
Discrete v. Continuous Random Variables
Definition
Discrete random variables take values from a set of outcomes that can be listed, while continuous random variables can take any value in an interval, so their outcomes cannot be listed
Term
Expected Value of a discrete random variable
Definition
Multiply each possible outcome by its probability and add them all together
Term
Calculating variance
Definition
For each outcome, take the difference between the outcome and the expected value (mean), square it, and multiply by the likelihood of that outcome. Then add up these terms for all the different outcomes

For an insurance policy, if the average cost is $20 per policy, with a payout of 10,000 and a likelihood of having to pay out of 1/1,000, then that outcome contributes

(10,000-20)^2*(1/1,000)

and the terms for every other outcome are added in the same way
Term
Calculate S.D. (given variance)
Definition
SD(X)= sq. rt. [Var(X)]
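A sketch of the expected value, variance, and SD cards together. The insurance card does not give the full list of outcomes, so the three-outcome policy below is an assumption, chosen only so that the mean comes out to $20:

```python
import math

def expected_value(distribution):
    """Sum of outcome * probability over all outcomes."""
    return sum(outcome * prob for outcome, prob in distribution)

def variance(distribution):
    """Sum of (outcome - mean)^2 * probability over all outcomes."""
    mu = expected_value(distribution)
    return sum((outcome - mu) ** 2 * prob for outcome, prob in distribution)

# ASSUMED distribution (not from the card): (payout, probability) pairs.
policy = [(10_000, 1 / 1000), (5_000, 2 / 1000), (0, 997 / 1000)]

mean = expected_value(policy)
var = variance(policy)
sd = math.sqrt(var)
print(mean, var, sd)
```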
Term
Adding/subtracting rules for SD and variance
Definition
-The variance of the sum of two independent random variables is the sum of their individual variances. Standard deviations do NOT add.

-If random variables are independent, the variance of their sum or difference is always the sum of the variances.
Term
Adding/subtracting means
Definition
The mean of the sum/ difference of two random variables is the sum/difference of their means
Term
Calculating z-score
Definition
The difference between the observed (or theoretically observed) value and the expected value, divided by the S.D. Then use technology or a z-table
Term
Definition of and Calculating Covariance
Definition
Measures how X and Y vary together. When two variables are positively associated (i.e. X tends to be above its mean when Y is above its mean), they have positive covariance.

Cov(X,Y)= E((X-u)(Y-v)), where u= E(X) and v= E(Y)

In other words, the expected value of the product of X's deviation from its mean and Y's deviation from its mean.
Term
Geometric probability model for Bernoulli trials
Definition
p= probability of success
X= number of trials until the first success

P(X=x)= p*(1-p)^(x-1)

In other words, the probability that the first success comes on trial x equals the probability of success times the probability of failure raised to the power x-1.
Term
Expected number of trials for geometric probability model for Bernoulli trials
Definition
E(X)=1/p

1 divided by the probability of success gives the number of trials you would expect to run, on average, to get the first success.
Term
Standard deviation geometric probability model for Bernoulli trials
Definition
S.D.= sq. rt. [(1-p)/(p^2)]
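The three geometric model cards can be sketched as follows (function names are illustrative):

```python
import math

def geometric_pmf(x, p):
    """P(first success on trial x) = p * (1 - p)^(x - 1)."""
    return p * (1 - p) ** (x - 1)

def geometric_expected_trials(p):
    """E(X) = 1 / p."""
    return 1 / p

def geometric_sd(p):
    """SD(X) = sqrt((1 - p) / p^2)."""
    return math.sqrt((1 - p) / p ** 2)

# Example: a fair coin, first head on the third flip.
print(geometric_pmf(3, 0.5))
print(geometric_expected_trials(0.25), geometric_sd(0.5))
```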
Term
Does X+X+X=3X
Definition
No. Random variables are not algebraically interchangeable: X(1)+X(2)+X(3), the sum of three separate instances, cannot be simplified to 3X. Insuring 3 people for 10,000 each is not the same as insuring one person for 30,000
Term
If shifting a data set by a constant, describe effect on s.d., variance and mean
Definition
The mean shifts by the same constant that was added or subtracted.

Variance and Standard Deviation are completely unaffected by addition/subtraction of a constant
Term
Multiplying data by a constant
Definition
Multiplying data by a constant multiplies the mean by that constant

The variance of the data is multiplied by the square of the constant.

If we multiply X by a, then E(a*X)= a*E(X)
Var(a*X)= a^2*Var(X)
Term
Probability of certain outcome given
-x successes and
-n trials
Definition
(The number of possible ways to get x successes in n trials) * p^x * (1-p)^(n-x)

The probability is the number of possible arrangements, times the probability of success raised to the number of successes, times the probability of failure raised to the number of failures.
Term
Standard deviation of a binomial model
Definition
square root (n*p*(1-p))

or the square root of the number of trials times the probability of success times the probability of failure
Term
Estimate binomial probability for large sample size using the normal method
Definition
The difference between the observed (or necessary) number of successes and the mean, divided by the standard deviation

z= (observed - n*p)/ sq. rt. [np(1-p)]
Term
To estimate the probability you will get your first success on a certain trial, use...
Definition
The Geometric model

P(x)= (p)(1-p)^(x-1)
Term
To estimate the probability you'll get a certain number of successes in a specified number of independent trials, use...
Definition
the Binomial model

nCx ("n choose x")= n!/[(x!)(n-x)!]
P(x)= nCx*(p^x)*(1-p)^(n-x)

Number of possible arrangements * probability of success raised to the number of successes * probability of failure raised to the number of failures
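A sketch of the binomial calculation using `math.comb` from the standard library for "n choose x" (the function names are illustrative):

```python
import math

def binomial_pmf(x, n, p):
    """P(exactly x successes in n trials) = nCx * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_sd(n, p):
    """SD of the number of successes = sqrt(n * p * (1 - p))."""
    return math.sqrt(n * p * (1 - p))

# Example: exactly 2 heads in 4 fair coin flips.
print(binomial_pmf(2, 4, 0.5))
print(binomial_sd(100, 0.5))
```

Summing `binomial_pmf` over every x from 0 to n should give 1, which is a handy sanity check.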
Term
To estimate probability involving (large) quantitative variables, use
Definition
The normal model

z= (observed - expected mean)/ SD, where for a count of successes SD= sq. rt. [np(1-p)]
Term
Sampling distribution model
Definition
allows us to quantify variation between samples and talk about how likely it is that we'd observe a sample proportion in any particular interval
Term
SD for a proportion
Definition
sq. rt. [p(1-p)/n]

square root of (probability of success * probability of failure / number of cases)
Term
Assumptions and Conditions for normal model usage in proportions
Definition
Independence assumption: the sampled values are independent of each other
Sample size assumption: n must be large enough
Randomization condition: the sample was drawn at random (or, in an experiment, subjects were randomly assigned to treatments)
10% condition: sample size must be no larger than 10% of the population
Success/Failure condition: the sample has to be big enough to expect at least 10 successes and 10 failures
Term
Central Limit Theorem
Definition
The mean of a random sample is a random variable whose sampling distribution can be approximated by a normal model. The larger the sample, the better the approximation will be.
Term
Sampling distribution model for a mean (CLT)
Definition
If you take samples of size n from a population, the sample mean varies less from sample to sample than individual values do: the standard deviation of the sample mean is always smaller than the standard deviation of the individual data points. It is given by
SD(y bar, the sample mean)= SD(population)/ sq. rt. [n], where n is the sample size
Term
Z score calculation for sampling distribution for the mean
Definition
Book def: [y(bar)-mu]/SD(y bar)

[observed sample mean (the value in question) - parameter (what we are given as true)]/ standard deviation of the sample mean (SD of population/ sq. rt. [n])
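A sketch of the z-score for a sample mean, assuming the population mean and SD are known. The numbers are made up and the function names are illustrative:

```python
import math
from statistics import NormalDist

def sd_of_sample_mean(sigma, n):
    """SD(y bar) = population SD / sqrt(n)."""
    return sigma / math.sqrt(n)

def sample_mean_z(y_bar, mu, sigma, n):
    """z = (y bar - mu) / SD(y bar)."""
    return (y_bar - mu) / sd_of_sample_mean(sigma, n)

# Made-up example: population mean 50, SD 10, sample of 25 with mean 52.
z = sample_mean_z(52, 50, 10, 25)
# Probability of a sample mean at least this large under the Normal model.
upper_tail = 1 - NormalDist().cdf(z)
print(z, upper_tail)
```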
Term
Standard deviation of a sampling distribution (for a proportion)
Definition
sq. rt. [(p)(1-p)/n]
Term
Estimating the standard deviation of a sampling distribution if parameter is unknown
Definition
It is called the standard error (SE), found with the same formula, substituting p(hat) for p.
Term
Given p and n, find margin of error w/ 95% confidence interval
Definition
Calculate SE = sq. rt. [(p*(1-p))/n]

Multiply the standard error by z* (1.96 for 95% confidence) to get the margin of error
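The two steps above, sketched as a single function (the name and the example numbers are illustrative):

```python
import math

def margin_of_error(p, n, z_star=1.96):
    """ME = z* times SE, where SE = sqrt(p * (1 - p) / n)."""
    standard_error = math.sqrt(p * (1 - p) / n)
    return z_star * standard_error

# Made-up example: p = 0.5 with 100 cases, at 95% confidence.
print(margin_of_error(0.5, 100))
```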
Term
To find sample size to get the confidence interval for a proportion you want
Definition
Use p=.5 (the most cautious choice) and the margin of error you want (often 0.03), then work backwards from ME= z* x sq. rt. [p(1-p)/n] to solve for n
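Working backwards gives n = (z*)^2 * p(1-p) / ME^2, rounded up. A sketch, assuming 95% confidence (z* = 1.96) unless told otherwise; the function name is illustrative:

```python
import math

def required_sample_size(margin_of_error, p=0.5, z_star=1.96):
    """Smallest n whose margin of error is no larger than the target."""
    n_exact = z_star ** 2 * p * (1 - p) / margin_of_error ** 2
    return math.ceil(n_exact)  # always round up to a whole case

print(required_sample_size(0.03))
print(required_sample_size(0.05))
```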
Term
Null v. Alternative Hypotheses
Definition
We assume the null hypothesis is true; the alternative hypothesis is what we consider plausible should the null be rejected
Term
Conditions for Hypothesis testing (4)
Definition
Independence Assumption
Randomization Condition
10% Condition
10 Success/Failure Condition
Term
Calculating Margin of Error
Definition
ME= z* x SE(p)

Where SE= sq rt [(p)(1-p)/n]
Term
Errors in Hypothesis Testing (2)
Definition
Type 1: The null hypothesis is true, but we mistakenly reject it
Type 2: The null hypothesis is false, but we fail to reject it
Term
Power (of a test)
Definition
The probability that a test correctly rejects a false null hypothesis. If B (beta) is the probability that a test fails to reject a false null hypothesis (Type 2 error), 1-B is the power of the test.
Term
Effect size
Definition
The distance between the null hypothesis value and the truth is the effect size. This can be estimated with the observed mean
Term
Assumptions and Conditions for comparing proportions
Definition
Independence Assumption
Randomization Condition
10% condition if sampled w/o replacement
Success/failure condition
Independent Groups Assumption: The two groups being compared must be independent of each other
Term
Two-proportion z-interval
Definition
[p(1)-p(2)] +/- (z*)(SE[p(1)-p(2)])

SE[p(1)-p(2)]= sq rt[p(1)(1-p(1))/n(1) + p(2)(1-p(2))/n(2)], using each group's own proportion and its own n
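A sketch of the two-proportion z-interval, assuming 95% confidence by default and made-up counts (the function name is illustrative):

```python
import math

def two_proportion_interval(x1, n1, x2, n2, z_star=1.96):
    """Interval for p1 - p2 from success counts x1, x2 and sizes n1, n2."""
    p1, p2 = x1 / n1, x2 / n2
    # Each group contributes its own p(1 - p)/n term to the SE.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se

# Made-up example: 60 of 100 successes vs. 50 of 100.
low, high = two_proportion_interval(60, 100, 50, 100)
print(low, high)
```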
Term
Pooling proportions (not means!)
Definition
Add the # of successes in both groups and divide by the sum of the trials to get p(pooled). When calculating SE, use

SE(pooled)= sq rt[p(pooled)(1-p(pooled))/n(1) + p(pooled)(1-p(pooled))/n(2)]

Although the p and q values are pooled, the n value is NOT: each group keeps its own n.
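A sketch of pooling in practice, here used to form a z statistic for the difference between two proportions. The counts are made up and the function names are illustrative:

```python
import math

def pooled_proportion(x1, n1, x2, n2):
    """Total successes divided by total trials."""
    return (x1 + x2) / (n1 + n2)

def pooled_z(x1, n1, x2, n2):
    """z = (p1 - p2) / pooled SE; p is pooled, but each n stays separate."""
    p = pooled_proportion(x1, n1, x2, n2)
    se = math.sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)
    return (x1 / n1 - x2 / n2) / se

# Made-up example: 60 of 100 successes vs. 50 of 100.
print(pooled_proportion(60, 100, 50, 100), pooled_z(60, 100, 50, 100))
```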
Term
Calculating degrees of freedom
Definition
df= (n-1)
Term
One-sample t-interval for the mean
Definition
y(bar) +/- (t*)(SE(y bar))
SE(y bar)= s/ sq rt[n], where s is the sample standard deviation and t* comes from the t model with n-1 degrees of freedom
Term
Getting the standardized sample mean t
Definition
t= [y(bar)-mu]/ SE(y bar)

In other words, the mean from the data minus the parameter mean, divided by the estimated standard deviation of the sample mean (s/ sq. rt. [n])
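A sketch of the t statistic and its degrees of freedom on made-up data. Looking up t* or a P-value still needs a t table or software, which this sketch leaves out:

```python
import math
import statistics

def t_statistic(values, mu_0):
    """Return (t, df) with t = (y bar - mu_0) / (s / sqrt(n)) and df = n - 1."""
    n = len(values)
    y_bar = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (y_bar - mu_0) / standard_error, n - 1

# Made-up data tested against a hypothesized mean of 10.
t, df = t_statistic([10, 12, 14], 10)
print(t, df)
```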
Term
Assumptions and conditions for t test
Definition
Independence assumption
Randomization condition
10% condition
Nearly normal condition: unimodal, symmetric distribution. Check with a histogram or normal probability plot
Term
Bonus assumptions for Counts
Definition
Counted Data Condition: The values in each cell must be counts, not %s or anything else
Expected Cell Frequency Condition: The expected count in each cell of the table must be at least 5
Term
Basic 4 Assumptions and Conditions
+ 3 test specific assumptions/ conditions
Definition
1. Independence
2. Randomization
3. 10% condition
4. Success/failure condition
AND
5. Independence of groups- The two groups we are comparing have to be independent of each other
6. Nearly normal condition
7. Paired data assumption