Term
Basic Context for Data (5-6 questions to ask) |
|
Definition
Who, What, When, Where, Why and How? |
|
|
Term
Categorical Variable |
Definition
When data answers questions but does not represent a summable or otherwise manipulable quantity. Can still be represented by a number (#) |
|
|
Term
Quantitative Variable |
Definition
Whenever a variable is in units representing exact amounts of something or some occurrence. |
|
|
Term
Identifier Variable |
Definition
A number assigned to each individual case for sorting purposes |
|
|
Term
Frequency Table/ Relative Frequency Table |
|
Definition
A table listing each category with its total count, or one which represents the proportion of each count as a percent |
|
|
Term
Bar Chart |
Definition
Displays distribution of a categorical variable. NOT a quantitative variable |
|
|
Term
Contingency Table |
Definition
A table which cross-classifies two categorical variables and breaks down the totals into their component parts. The margins show the totals |
|
|
Term
Area Principle |
Definition
When graphing data, make sure each category has an area which is proportional to its total in the group |
|
|
Term
Simpson's Paradox |
Definition
Unfair averaging over different groups that do not share the same conditions or sizes |
|
|
Term
Histogram |
Definition
Only for quantitative data. Looks like a bar graph (which is only for categorical data) except that there is no space between bars unless there is a gap in the data. Good for illustrating distribution |
|
|
Term
Stem and Leaf Displays (and Dotplots) |
|
Definition
Writing the first digit(s) on one side of the display, then listing one following digit for each case in that range. Dotplots replace the digits with dots |
|
|
Term
Three things to mention when describing distribution |
|
Definition
Shape: how many modes are in the data set, symmetry, outliers? Center: median/mean. Spread: average variation (standard deviation)/interquartile range |
|
|
Term
Unimodal /Bimodal/ Multimodal |
|
Definition
With one hump/2 humps/more than 2 humps |
|
|
Term
Uniform |
Definition
Data which is fairly flat and consistent, with no modes or trend |
|
|
Term
Skewed |
Definition
When there is a tail (the thinner end of the distribution) one way or the other, the distribution is said to be this |
|
|
Term
Interquartile Range (IQR) |
|
Definition
The upper quartile (75th percentile) minus the lower quartile (25th percentile) |
|
|
Term
Variance |
Definition
The sum of the squared differences between each y value and the mean, divided by (n-1): s^2 = [sum of (y - y(bar))^2]/(n-1)
It is the quantity just before you take the square root to find the standard deviation |
|
|
Term
Standard Deviation |
Definition
Take the square root of: [the sum of the squared differences between y and the mean]/(n-1) |
|
|
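A minimal Python sketch of the two formulas above, on a hypothetical data list; the stdlib statistics module uses the same n-1 denominator, so it can serve as a check:

    import statistics

    data = [2.0, 4.0, 4.0, 5.0, 7.0]   # hypothetical sample
    mean = sum(data) / len(data)
    # Variance: sum of squared differences from the mean, divided by (n - 1)
    variance = sum((y - mean) ** 2 for y in data) / (len(data) - 1)
    sd = variance ** 0.5               # standard deviation: square root of variance

    assert abs(variance - statistics.variance(data)) < 1e-9
    assert abs(sd - statistics.stdev(data)) < 1e-9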
Term
Boxplot |
Definition
Make a box spanning the lower and upper quartiles, marking the median. Add whiskers reaching up to 1.5 times the IQR beyond the quartiles, and plot outliers individually |
|
|
Term
Z-Score (Standardized Value) |
|
Definition
(y - the mean of y)/standard deviation. Written z(x) or z(y) |
|
|
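A short Python sketch of standardizing (hypothetical values), confirming the next card's claim that z-scores have mean 0 and standard deviation 1:

    import statistics

    def z_scores(data):
        # z = (y - mean of y) / standard deviation
        mean = statistics.mean(data)
        sd = statistics.stdev(data)
        return [(y - mean) / sd for y in data]

    z = z_scores([12, 15, 9, 18, 21])      # hypothetical values
    print(round(statistics.mean(z), 10))   # 0.0: the center becomes 0
    print(round(statistics.stdev(z), 10))  # 1.0: the spread becomes 1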
Term
How does standardizing change the data? |
|
Definition
Shape: Does not change Center: Makes the mean 0 Spread: The standard deviation becomes 1 |
|
|
Term
Nearly Normal Condition |
Definition
If the shape of the data's distribution is unimodal and symmetric, then you can apply the Normal model. Make a picture to check |
|
|
Term
The 68-95-99.7 Rule |
Definition
Within 1 sd of the mean (0 after standardizing) lies 68% of the data, within 2 lies 95%, and within 3 lies 99.7% |
|
|
Term
Finding Normal Percentiles |
|
Definition
Calculate the z-score, then find its first two digits in the left column of the table and match the final digit along the top to find the corresponding normal percentile |
|
|
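The table lookup can be reproduced in Python; a sketch using the stdlib NormalDist (Python 3.8+), with hypothetical numbers:

    from statistics import NormalDist

    z = (32 - 25) / 5                 # hypothetical value 32, mean 25, SD 5 -> z = 1.4
    percentile = NormalDist().cdf(z)  # area to the left of z under the Normal model
    print(round(percentile, 4))       # 0.9192, matching the table row 1.4, column .00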
Term
Normal Probability Plot |
Definition
The y-axis is the variable of the corresponding histogram (ex. mpg) and the x-axis is each data point's z-score. If the data are nearly Normal, it should be a straight diagonal line |
|
|
Term
Things to look for in Scatterplots |
|
Definition
Direction: Is it positive or negative Form: Is it linear? Curved? Strength: How much does it scatter? Outliers: Anything that significantly skews the data |
|
|
Term
Predictor/ Explanatory Variable |
|
Definition
The x-axis variable, which is believed to inform or predict the y value |
|
|
Term
Response Variable |
Definition
The y-axis variable and the variable of interest. This is the variable used in the St. dev., etc. |
|
|
Term
Correlation (r) |
Definition
Measures the strength of the linear association between two quantitative variables.
r = [the sum of z(x) times z(y)]/(n-1) |
|
|
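A Python sketch of the r formula on this card, on hypothetical x and y lists: sum the products of z-scores and divide by n-1:

    import statistics

    def correlation(x, y):
        # r = sum of z(x) * z(y) / (n - 1)
        mx, my = statistics.mean(x), statistics.mean(y)
        sx, sy = statistics.stdev(x), statistics.stdev(y)
        return sum(((a - mx) / sx) * ((b - my) / sy)
                   for a, b in zip(x, y)) / (len(x) - 1)

    print(round(correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]), 4))  # 0.7746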
Term
Conditions for Correlation |
Definition
Quantitative Variables Condition: make sure the data isn't categorical. Straight Enough Condition: it is subjective, but make sure the data isn't clearly non-linear. Outlier Condition: make sure outliers are not present, as they can distort the correlation dramatically.
Check these conditions with a scatterplot |
|
|
Term
Lurking Variable |
Definition
The explanation of why correlation is misleading and does not prove causation |
|
|
Term
Kendall's Tau |
Definition
Designed to assess how close the relationship between two variables is to being monotone. A monotone relationship is one that consistently increases or decreases, not necessarily linearly. A value of -1 means constantly decreasing, 1 means constantly increasing. It is a nonparametric measure |
|
|
Term
Spearman's Rho |
Definition
It is less sensitive to outliers. Gives a rank (starting with 1, 2, 3, etc.) to each value. Also between -1 and 1. It is a nonparametric measure |
|
|
Term
|
Definition
The difference between the observed y value of a point and the y value predicted by the linear regression (referred to as y(hat)) |
|
|
Term
|
Definition
Also known as the least squares line |
|
|
Term
Linear Regression equation |
|
Definition
y(hat) = b(0) + b(1)x |
|
Term
b1 (The slope of linear regression) equation |
|
Definition
r(s(y)/s(x)), or the correlation times (standard deviation of y/standard deviation of x) |
|
|
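A Python sketch tying the slope formula above to the intercept formula on the next card, using hypothetical data:

    import statistics

    x = [1, 2, 3, 4, 5]   # hypothetical predictor values
    y = [2, 4, 5, 4, 5]   # hypothetical response values

    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    r = sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (len(x) - 1)

    b1 = r * (sy / sx)    # slope: r times (SD of y / SD of x)
    b0 = my - b1 * mx     # intercept: line passes through (x-bar, y-bar)
    print(f"y(hat) = {b0:.1f} + {b1:.1f}x")  # y(hat) = 2.2 + 0.6x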
Term
b0 (The intercept of linear regression) equation |
Definition
b(0) = y(bar) - b(1)x(bar); the regression line always passes through (x(bar), y(bar)) |
|
Term
R-squared (R^2) |
Definition
Gives a positive fraction of the data's variation accounted for by the model |
|
|
Term
Does the Plot Thicken? Condition |
|
Definition
When you plot the residuals against the model's predicted values, there should be no discernible pattern. If there is, your model isn't ideal |
|
|
Term
|
Definition
You can't simply rearrange regression line equations (to predict x from y) unless the correlation is 1.0. You must redo the b1 and b0 formulas |
|
|
Term
Influence |
Definition
The extent to which a point influences analysis |
|
|
Term
Subsets |
Definition
Distinguishable traits of the data that can allow you to fit different regression lines to different segments of information (male/female etc...) |
|
|
Term
Goals of Re-expression (4) |
Definition
1. Make the distribution of a variable more symmetric 2. Make the spread of several groups (as seen in side-by-side boxplots) more alike, even if their centers differ (often achieved with logs) 3. Make the form of a scatterplot more nearly linear 4. Make the scatter in a scatterplot spread out evenly rather than thickening at one end |
|
|
Term
Ladder of Powers: "2" aka Squares |
Definition
Try for unimodal, left skewed histograms |
|
|
Term
Ladder of Powers: "0" aka Logs |
|
Definition
This is the go-to re-expression. You can't take the log of negative or 0 values, so add a small constant to all data to avoid mistakes. Try logging y, then logging x, and if all else fails log both. |
|
|
Term
Ladder of Powers: "-1/2" aka Negative Reciprocal Square Root |
Definition
The negative reciprocal square root preserves the direction of relationships. Your last resort |
|
|
Term
Ladder of Powers: "-1" aka the Reciprocal |
Definition
Positive or negative, depending on which way you want the data to go. Ratios of 2 quantities benefit the most. |
|
|
Term
Sample Strategies and Ideals to keep in mind: |
|
Definition
1: Examine a Part of the Whole: try to avoid bias by representing all parts of the population in proportion to their share of the whole. 2: Randomize: when in doubt, make sure nothing systematic is associated with what you sample. 3: It's the Sample Size: the fraction of the population doesn't matter, just the actual sample size (2,000 is a good number). |
|
|
Term
Census |
Definition
A sample consisting of the entire population; often quite inefficient |
|
|
Term
Parameter v. Statistic |
Definition
Parameters are real facts about the world that we are trying to get at, often in vain. Statistics are anything we calculate from sample data |
|
|
Term
Simple Random Sample (SRS) |
|
Definition
A method by which every possible combination of individuals has an equal chance of being selected. The basis for comparison with all other statistical sampling methods |
|
|
Term
Sampling Frame |
Definition
The list of individuals from which the sample is drawn |
|
|
Term
Stratified Random Sampling |
|
Definition
Dividing the population into distinct strata, then using a simple random sample within each stratum. |
|
|
Term
Cluster Sampling |
Definition
Taking a representative cluster of the population which expresses the population as a whole. If it doesn't represent the population as a whole, it will be biased. Can also be one stage of a multistage sample |
|
|
Term
Systematic Sampling |
Definition
When you use a nonrandom but systematic sample of individuals; for example, selecting every 20th person in a population. |
|
|
Term
Pilot Study |
Definition
A trial run of a survey before it is employed in a larger group at higher cost. Gives you a chance to recognize flaws in your design |
|
|
Term
Sampling Technique Errors |
|
Definition
Voluntary Response Sample: because it is self-selecting, it is inherently biased. Convenience Sampling: does not usually produce unbiased information |
|
|
Term
|
Definition
Nonrespondents: it's always a good investment to limit the number of nonrespondents, because their absence can shift the data. Response Bias: anything in the survey which influences responses (wording of a question, the environment it's taken in) |
|
|
Term
Observational Study |
Definition
When people or subjects are observed in their natural environments. Often retrospective studies |
|
|
Term
Prospective v. Retrospective Studies |
|
Definition
Prospective studies follow randomly selected individuals and watch them for a given amount of time; retrospective studies look back at subjects' existing histories. Prospective studies are generally favored over retrospective options |
|
|
Term
Experiment |
Definition
When you attempt to isolate very simple variables through random assignment of treatments to subjects. Active manipulation by researchers. |
|
|
Term
The 4 Principles of Experimental Design |
|
Definition
1. Control: control sources of variation other than what we are testing. 2. Randomize: equalizes the effects of unforeseen or uncontrollable sources of variation. 3. Replicate: results have to be replicated in slightly altered situations to show no bias. 4. Block: sometimes attributes affect the outcome of an experiment, so grouping similar subjects into blocks is more accurate |
|
|
Term
Blinding |
Definition
Limiting the extent to which knowledge can influence the experiment, by keeping key treatment information secret from the subject and from the researcher. An experiment is "double-blind" when even those who interpret the data are unaware of its identity. |
|
|
Term
Matching |
Definition
Pairing subjects because they are similar in ways not under study |
|
|
Term
Discrete v. Continuous Random Variables |
|
Definition
Discrete random variables take values from a set of outcomes which can be listed, while the possible outcomes of continuous random variables cannot be listed; they are infinite |
|
|
Term
Expected Value of a discrete random variable |
|
Definition
Multiply each possible outcome by its probability and add them all together |
|
|
Term
Variance of a discrete random variable |
Definition
The difference between each observed value and the expected value (mean), squared and multiplied by the likelihood of it happening, plus the same process for all the other outcomes.
For an insurance policy, if the average cost is $20 per policy, with a payout of $10,000 and a likelihood of having to pay out of 1/1,000, the payout outcome contributes
(10,000 - 20)^2 * (1/1,000) |
|
|
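A Python sketch of the insurance example, using a simplified hypothetical two-outcome payout table (so the numbers differ from the card's $20 average):

    # Hypothetical policy: pays $10,000 with probability 1/1,000, else $0
    outcomes = [(10_000, 1 / 1_000), (0, 999 / 1_000)]

    mean = sum(x * p for x, p in outcomes)                    # expected value: 10.0
    variance = sum((x - mean) ** 2 * p for x, p in outcomes)  # sum over all outcomes
    print(mean, round(variance ** 0.5, 2))                    # 10.0 316.07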
Term
Calculate S.D. (given variance) |
|
Definition
S.D. = sq. rt. [variance] |
|
Term
Adding/subtracting rules for SD and variance |
|
Definition
-The variance of the sum of two independent random variables is the sum of their individual variances, NOT of their S.D.s
-If random variables are independent, the variance of their sum or difference is always the sum of the variances. |
|
|
Term
Adding/subtracting rules for means |
Definition
The mean of the sum/ difference of two random variables is the sum/difference of their means |
|
|
Term
|
Definition
The difference between the expected value and the observed (or theoretically observed) value, divided by the S.D.; then use z-score technology/table |
|
|
Term
Definition of and Calculating Covariance |
|
Definition
Measures how X and Y vary together. When two things are correlated (i.e., X above its mean when Y is above its mean) they will have positive covariance.
Cov(X, Y) = E[(X - u)(Y - v)], where u and v are the means of X and Y.
In other words, the expected product of each variable's deviation from its own mean. |
|
|
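A Python sketch of the covariance formula on hypothetical paired lists, here using the sample n-1 convention to match the deck's other formulas:

    import statistics

    def covariance(x, y):
        # Average product of deviations from the two means
        mx, my = statistics.mean(x), statistics.mean(y)
        return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

    print(covariance([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))  # 1.5: they rise together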
Term
Geometric probability model for Bernoulli trials |
|
Definition
p = probability of success; X = number of trials until the first success
P(x) = p*(1-p)^(x-1)
In other words, the probability that the first success comes on trial x equals the individual probability of success times the probability of failure raised to the (x-1) power. |
|
|
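A Python sketch of the geometric model (hypothetical p), also showing the expected number of trials E(X) = 1/p from the next card:

    def geometric_pmf(x, p):
        # P(first success on trial x) = (1 - p)^(x - 1) * p
        return (1 - p) ** (x - 1) * p

    p = 0.2                               # hypothetical success probability
    print(round(geometric_pmf(3, p), 3))  # 0.128: fail, fail, then succeed
    print(1 / p)                          # 5.0 expected trials until first success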
Term
Expected number of trials for geometric probability model for Bernoulli trials |
|
Definition
E(X)=1/p
1/(the probability of an occurrence) equals how many times you would expect to have to run the experiment before a success. |
|
|
Term
Standard deviation of the geometric probability model for Bernoulli trials |
|
Definition
sq. rt. [(1-p)/p^2] (equivalently, sq. rt. [q]/p, where q = 1-p) |
|
Term
Sums of random instances: X(1)+X(2) is not 2X |
Definition
Random instances labeled X are not algebraically collapsible: X(1)+X(2)+X(3) cannot be simplified to 3X (the means match, but Var(X(1)+X(2)+X(3)) = 3Var(X), while Var(3X) = 9Var(X)). Insuring 3 people for 10,000 each is not the same as insuring one for 30,000 |
|
|
Term
If shifting a data set by a constant, describe effect on s.d., variance and mean |
|
Definition
The mean shifts the same way the constant does: E(X +/- c) = E(X) +/- c.
Variance and standard deviation are completely unaffected by addition/subtraction of a constant |
|
|
Term
Multiplying data by a constant |
|
Definition
Multiplying data by a constant multiplies the mean by that same constant.
The variance is multiplied by the square of the constant.
If we multiply X by a, then E(aX) = a*E(X) and Var(aX) = a^2*Var(X) |
|
|
Term
Probability of a certain outcome given x successes in n trials |
|
Definition
[The number of possible outcomes giving x successes in n trials] * p^x * (1-p)^(n-x)
The probability is the number of possible orderings times the probability of individual success raised to the number of successes times the probability of failure raised to the number of failures. |
|
|
Term
Standard deviation of a binomial model |
|
Definition
sq. rt. [n*p*(1-p)]
or the square root of the number of trials times the probability of success times the probability of failure |
|
|
Term
Estimate binomial probability for large sample size using the normal method |
|
Definition
The difference between the observed (or required) number of successes and the mean, over the standard deviation:
z = (observed - np)/sq. rt. [np(1-p)] |
|
|
Term
To estimate the probability you will get your first success on a certain trial, use... |
|
Definition
the Geometric model
P(x) = (p)(1-p)^(x-1) |
|
|
Term
To estimate the probability you'll get a certain number of successes in a specified number of independent trials, use... |
|
Definition
the Binomial method
nCx ("n choose x") * p^x * (1-p)^(n-x), where nCx = n!/[(x!)(n-x)!]
That is, the number of possible orderings times the probability of success raised to the number of successes times the probability of failure raised to the number of needed failures |
|
|
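A Python sketch of the binomial formula; math.comb (Python 3.8+) computes nCx directly:

    from math import comb

    def binomial_pmf(x, n, p):
        # P(exactly x successes in n trials) = nCx * p^x * (1 - p)^(n - x)
        return comb(n, x) * p ** x * (1 - p) ** (n - x)

    # Hypothetical check: exactly 2 heads in 4 fair coin flips
    print(binomial_pmf(2, 4, 0.5))        # 0.375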
Term
To estimate probability involving (large) quantitative variables, use |
|
Definition
The Normal model
z = (observed - expected mean)/SD, where for a binomial count SD = sq. rt. [np(1-p)] |
|
|
Term
Sampling distribution model |
|
Definition
Allows us to quantify variation between samples and talk about how likely it is that we'd observe a sample proportion in any particular interval |
|
|
Term
Standard deviation of a sample proportion, SD(p(hat)) |
Definition
sq. rt. [p(1-p)/n]
The square root of the probability of success times the probability of failure, divided by the number of cases |
|
|
Term
Assumptions and Conditions for normal model usage in proportions |
|
Definition
Independence Assumption: each sample value is independent. Sample Size Assumption: enough "n"s. Randomization Condition: subjects randomly assigned to treatments. 10% Condition: the sample size must be no larger than 10% of the population. Success/Failure Condition: the sample size has to be big enough to expect at least 10 successes and 10 failures |
|
|
Term
Central Limit Theorem (CLT) |
Definition
The mean of a random sample is a random variable whose sampling distribution can be approximated by a normal model. The larger the sample, the better the approximation will be. |
|
|
Term
Sampling distribution model for a mean (CLT) |
|
Definition
If you take samples of size n from a known population, the standard deviation of the sample mean is smaller than the standard deviation of individual observations. It is represented universally as SD(y bar) = SD(population)/sq. rt. [n], where n is the sample size |
|
|
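A small Python simulation sketch (hypothetical uniform(0, 1) population) illustrating SD(y bar) = SD(population)/sq. rt. [n]:

    import random
    import statistics

    random.seed(1)
    n = 25
    means = [statistics.mean(random.random() for _ in range(n))
             for _ in range(10_000)]      # many sample means, each from n draws

    pop_sd = (1 / 12) ** 0.5              # SD of a uniform(0, 1) population
    print(round(statistics.stdev(means), 4))  # close to the prediction below
    print(round(pop_sd / n ** 0.5, 4))        # 0.0577 = sigma / sq. rt. [n]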
Term
Z score calculation for sampling distribution for the mean |
|
Definition
Book def: z = [y(bar) - mu]/SD(y bar)
That is, the value we are testing (given in the question) minus the parameter (what we are given as true), divided by the new standard deviation (SD of population/sq. rt. [n]) |
|
|
Term
Standard deviation of a sampling distribution |
|
Definition
For a proportion: SD(p(hat)) = sq. rt. [p(1-p)/n]. For a mean: SD(y(bar)) = sigma/sq. rt. [n] |
|
Term
Estimating the standard deviation of a sampling distribution if parameter is unknown |
|
Definition
It is called the Standard Error, found with the same formula, substituting p(hat) for p. |
|
|
Term
Given p and n, find margin of error w/ 95% confidence interval |
|
Definition
Calculate SE = sq. rt. [p(1-p)/n]
Multiply the standard error by z* (1.96) to get the margin of error |
|
|
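A Python sketch of the 95% margin-of-error calculation (hypothetical p and n):

    p_hat = 0.48                           # hypothetical sample proportion
    n = 1_000                              # hypothetical sample size

    se = (p_hat * (1 - p_hat) / n) ** 0.5  # SE = sq. rt. [p(1 - p)/n]
    me = 1.96 * se                         # multiply by z* for 95% confidence
    print(f"{p_hat} +/- {me:.3f}")         # 0.48 +/- 0.031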
Term
To find the sample size needed for a desired confidence interval for a proportion |
|
Definition
Use p=.5 and the Margin of error you want (often 0.03) and work backwards until you solve for n |
|
|
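Working backwards for n can also be done directly; a Python sketch assuming the conservative p = 0.5:

    import math

    def sample_size(me, z_star=1.96, p=0.5):
        # Smallest n with margin of error <= me
        return math.ceil((z_star / me) ** 2 * p * (1 - p))

    print(sample_size(0.03))              # 1068 respondents for a 3% margin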
Term
Null v. Alternative Hypotheses |
|
Definition
We assume the null hypothesis is true; the alternative hypothesis is something we consider plausible should the null be overturned |
|
|
Term
Conditions for Hypothesis testing (4) |
|
Definition
Independence Assumption, Randomization Condition, 10% Condition, Success/Failure Condition (at least 10 of each) |
|
|
Term
Calculating Margin of Error |
|
Definition
ME = z* x SE(p(hat))
where SE = sq. rt. [p(1-p)/n] |
|
|
Term
Errors in Hypothesis Testing (2) |
|
Definition
Type 1: The null hypothesis is true, but we mistakenly reject it Type 2: The null hypothesis is false, but we fail to reject it |
|
|
Term
Power of a Test |
Definition
The probability that a test correctly rejects a false null hypothesis. If B is the probability that a test fails to reject a false null (a Type 2 error), 1-B is the power of the test. |
|
|
Term
Effect Size |
Definition
The distance between the null hypothesis value and the truth is the effect size. This can be estimated with the observed mean |
|
|
Term
Assumptions and Conditions for comparing proportions |
|
Definition
Independence Assumption, Randomization Condition, 10% Condition (if sampled w/o replacement), Success/Failure Condition, Independent Groups Assumption: the two groups being compared must be independent of each other |
|
|
Term
Two-proportion z-interval |
|
Definition
p(1)-p(2) +/- (z*)(SE {p(1)-p(2)})
SE {p(1)-p(2)} = sq. rt. [p(1)(1-p(1))/n(1) + p(2)(1-p(2))/n(2)] |
|
|
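A Python sketch of the two-proportion z-interval (hypothetical counts in two groups):

    # Hypothetical data: 52/200 successes in group 1, 38/200 in group 2
    p1, n1 = 52 / 200, 200
    p2, n2 = 38 / 200, 200

    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    diff = p1 - p2
    z_star = 1.96                         # 95% confidence
    print(f"({diff - z_star * se:.3f}, {diff + z_star * se:.3f})")  # (-0.012, 0.152)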
Term
Pooling proportions (not means!) |
|
Definition
Add the numbers of successes and divide by the sum of the trials; but when calculating SE,
SE {p(pooled)} = sq. rt. [p(1-p)/n(1) + p(1-p)/n(2)], noting that although the p and q values are pooled, the n values are NOT, and remain distinct in both terms. |
|
|
Term
Calculating degrees of freedom |
|
Definition
For a one-sample t procedure: df = n - 1 |
|
Term
One-sample t-interval for the mean |
|
Definition
y(bar) +/- (t*)(SE), where SE = s/sq. rt. [n], estimated from the data |
|
|
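A Python sketch of the one-sample t-interval (hypothetical measurements; scipy is assumed available for the t* critical value):

    import statistics
    from scipy import stats               # assumed dependency

    data = [98.6, 99.1, 97.8, 98.4, 99.0, 98.2]   # hypothetical sample
    n = len(data)
    se = statistics.stdev(data) / n ** 0.5        # SE = s / sq. rt. [n]

    t_star = stats.t.ppf(0.975, df=n - 1)         # 95% interval, df = n - 1
    mean = statistics.mean(data)
    print(f"{mean:.2f} +/- {t_star * se:.2f}")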
Term
Getting the standardized sample mean t |
|
Definition
t = [y(bar) - mu]/SE(y bar)
In other words, the mean from the data minus the parameter mean, divided by the estimated standard error (s/sq. rt. [n]) |
|
|
Term
Assumptions and conditions for t test |
|
Definition
Independence assumption, Randomization condition, 10% condition, Nearly Normal condition: unimodal, symmetric distribution. You can check with histograms or a normal probability plot |
|
|
Term
Bonus assumptions for Counts |
|
Definition
Counted Data Condition: there must be counts in each cell, not %s or anything else. Expected Cell Frequency Condition: there must be an expected count of at least 5 in each cell of the table |
|
|
Term
Basic 4 Assumptions and Conditions + 3 test specific assumptions/ conditions |
|
Definition
1. Independence 2. Randomization 3. 10% condition 4. Success/failure condition AND 5. Independence of groups: the two groups we are comparing have to be independent of each other 6. Nearly Normal condition 7. Paired data assumption |
|
|