Term
Observational unit |
|
Definition
basic unit/individual that we are describing in the study |
|
|
Term
Variable |
|
Definition
data we are recording for each observational unit |
|
|
Term
Categorical variable |
|
Definition
categorical (not-numbers) |
|
|
Term
Quantitative variable |
|
Definition
numerical (numbers) |
|
|
Term
Observational study |
|
Definition
the investigator simply records what is/has happened |
|
|
Term
Experiment |
|
Definition
the investigator imposes a treatment on the observational units |
|
|
Term
Sample |
|
Definition
The observational units on which we have data (***if we gave someone a questionnaire but they didn’t return it, they don’t count!***) |
|
|
Term
Sampling frame |
|
Definition
All observational units who had a chance of being selected in the sample |
|
|
Term
Population |
|
Definition
The group of observational units we are ultimately trying to describe.
The population will depend on the question being asked
Sometimes the sample/sampling frame/population can be the same group. That is called a census. |
|
|
Term
Parameter |
|
Definition
truth about the population
*** Will almost always be in % unless the question explicitly asks for a number***
We usually do not know the true number, but we can describe it in words (% of UW students who live on campus) |
|
|
Term
Statistic |
|
Definition
describes the sample
May not be given in % format, but should be converted to match the format of the parameter
We will usually be able to calculate this from the data given |
|
|
Term
Parameter vs. Statistic |
|
Definition
Parameter: describes the population; fixed, will not change; true value may be unknown
Statistic: describes the sample; will vary when different samples are taken; can be computed from the information given |
|
|
Term
Probability sample |
|
Definition
any type of design in which randomization is used to pick the observational units; includes the simple random sample (SRS) |
|
|
Term
Convenience sample |
|
Definition
the investigator selects which observational units will be in the sample
**almost always biased** |
|
|
Term
Voluntary response sample |
|
Definition
the observational units choose whether they want to be in the sample or not
**almost always biased** |
|
|
Term
Bias vs. Variability |
|
Definition
Think of variability as how spread out my estimates are
Think of bias as how far away my estimates are from the truth
They are not traded off against each other: both can be high, both can be low, or one can be high while the other is low |
|
|
Term
Sources of Variability (4) |
|
Definition
Random sampling error (sampling variability). ***This is the only variability accounted for by the margin of error.*** Any additional bias or variability caused by poor survey design will add extra variability
Shortcut method for the 95% confidence interval: p-hat ± 1/√n, where n = sample size
Confidence statement: we are 95% confident that the true parameter lies within the confidence interval. ***95% of the time that I follow this same procedure and construct a confidence interval, it will cover the true parameter***
When the sample size increases we can be more sure about our estimate, so we do not need as large a margin of error |
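
A minimal Python sketch of the shortcut interval; the sample size and sample proportion are made-up numbers for illustration:

```python
# Shortcut 95% confidence interval: p-hat +/- 1/sqrt(n)
import math

n = 400        # hypothetical sample size
p_hat = 0.55   # hypothetical sample proportion

margin_of_error = 1 / math.sqrt(n)           # 1/sqrt(400) = 0.05
low, high = p_hat - margin_of_error, p_hat + margin_of_error
print(f"95% CI: ({low:.3f}, {high:.3f})")    # 95% CI: (0.500, 0.600)
```

The shortcut works because 1/√n equals 2√[ p(1-p)/n ] when p = 0.5, the value of p that makes the margin of error largest, so it is a conservative margin for any p.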
|
|
Term
Sources of Bias (2; the second with 4 types) |
|
Definition
Undercoverage: when the sampling frame does not accurately reflect the population (ex. random digit dialing won’t include people without phones)
Non-sampling errors:
o Response error: people don’t answer truthfully (ex. how many times have you cheated on a test?)
o Non-response: when people don’t respond because they can’t be contacted or don’t cooperate
o Processing errors: typos when recording data
o Question wording: confusing questions, or questions which make a certain response more likely (leading questions) |
|
|
Term
Explanatory variable |
|
Definition
a variable that may cause a change in the response variable; the cause, usually the X variable |
|
|
Term
Response variable |
|
Definition
measures the outcome of an experiment; the effect, usually the Y variable |
|
|
Term
Treatment |
|
Definition
specific condition that is applied in an experiment; often the explanatory variable or mix of explanatory variables |
|
|
Term
Lurking variable |
|
Definition
a variable that may have an effect on the response variable but is not measured |
|
|
Term
Confounding |
|
Definition
When two variables have effects on the response variable that cannot be distinguished from each other |
|
|
Term
Statistically significant |
|
Definition
***The result we found would rarely occur simply by chance*** |
|
|
Term
Placebo effect |
|
Definition
The benefit derived from the psychological effect of receiving a treatment |
|
|
Term
Double-blind experiment |
|
Definition
Both the clinicians and subjects are “blind” to whether they are in the control or treatment group |
|
|
Term
Randomization |
|
Definition
using impersonal chance to assign subjects to either the treatment or control group |
|
|
Term
Shape |
|
Definition
Is the distribution skewed or symmetric? Is there one mode or multiple modes? |
|
|
Term
Spread |
|
Definition
Where do most of the observations lie? What are the highest/lowest values? |
|
|
Term
Center |
|
Definition
What is the center point of the distribution? (mean, median or mode) |
|
|
Term
Numerical Descriptions: Mean |
|
Definition
Add up all observations and then divide the total by the number of observations. Highly affected by outliers; changes when you add to or multiply the data |
|
|
Term
Numerical Descriptions: Median |
|
Definition
Midpoint of the distribution. Sort all your observations and choose the middle observation, or average the middle two if there are an even number of observations. Not affected by outliers as much; changes when you add to or multiply the data |
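
A minimal plain-Python sketch of the sort-and-pick-the-middle procedure; the data values are made up:

```python
def median(values):
    s = sorted(values)                  # sort all observations
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                   # odd n: the middle observation
    return (s[mid - 1] + s[mid]) / 2    # even n: average the middle two

print(median([5, 1, 9, 3]))   # 4.0
print(median([5, 1, 9]))      # 5
```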
|
|
Term
Numerical Descriptions: Mode |
|
Definition
the most frequently occurring value in the distribution; a distribution can have one mode or several |
|
|
Term
Numerical Descriptions: Percentiles |
|
Definition
The cth percentile of a distribution is defined so that (at least) c% of the observations are at or below it and (at least) (100-c)% of the observations are at or above it |
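
A small sketch checking this definition against made-up data; the candidate value is a hypothetical 25th percentile:

```python
# The 25th percentile should have at least 25% of observations at or
# below it and at least 75% at or above it.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15]
c = 25
candidate = 4   # hypothetical 25th percentile of this data

pct_at_or_below = 100 * sum(x <= candidate for x in data) / len(data)  # 30.0
pct_at_or_above = 100 * sum(x >= candidate for x in data) / len(data)  # 90.0
print(pct_at_or_below >= c and pct_at_or_above >= 100 - c)             # True
```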
|
|
Term
Numerical Descriptions: Five Number summary |
|
Definition
Minimum, 1st quartile (25th percentile), Median, 3rd quartile (75th percentile), Maximum |
|
|
Term
Numerical Descriptions: Standard Deviation |
|
Definition
a measure of how spread out the data are. For roughly normal data, 68% of all observations lie within +/- 1 SD of the mean, 95% within 2 SD, 99.7% within 3 SD; changes when you add to or multiply the data
To compute:
o First find x-bar (the mean)
o Then add up (x - x-bar)² for each observation
o Divide that total by n - 1
o Take the square root of that ratio |
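
The four steps above as a minimal plain-Python sketch; the data values are made up:

```python
import math

data = [4, 8, 6, 5, 3, 7]

xbar = sum(data) / len(data)             # step 1: find x-bar (the mean)
ss = sum((x - xbar) ** 2 for x in data)  # step 2: add up (x - x-bar)^2
ratio = ss / (len(data) - 1)             # step 3: divide the total by n - 1
sd = math.sqrt(ratio)                    # step 4: take the square root
print(round(sd, 3))                      # 1.871
```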
|
|
Term
Numerical Descriptions: Quartiles |
|
Definition
• At least 25% of observations are ≤ 1st Quartile, and at least 75% of observations are ≥ 1st quartile
• At least 75% of observations are ≤ 3rd Quartile, and at least 25% of observations are ≥ 3rd quartile
• Interquartile range = 3rd quartile – 1st quartile
• Changes when you add to or multiply the data |
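
A minimal sketch computing the quartiles, IQR, and five-number summary with Python’s standard library; the data are made up, and different quartile conventions can give slightly different cut points:

```python
import statistics

data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15]

# quantiles() with n=4 returns the three quartile cut points
q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
print(min(data), q1, med, q3, max(data))   # 2 4.25 7.5 9.75 15
print(q3 - q1)                             # IQR = 5.5
```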
|
|
Term
Scatterplot |
|
Definition
plots two variables on same graph. Each point is one individual observation |
|
|
Term
Correlation |
|
Definition
measures the “strength” of the relationship between two variables
• Always between -1 and 1
• Positive correlation means positive association (as one increases, so does the other); a negative value means negative association (as one increases, the other decreases)
• ***Correlation does not imply causation!!***
• The relationship must be linear (or football shaped), with no outliers, for correlation to be a valid measurement of association |
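
A minimal plain-Python sketch of r as the sum of products of deviations scaled by (n - 1) and both SDs; the (x, y) pairs are made up:

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))  # SD of x
sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))  # SD of y

r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
print(round(r, 3))   # 0.853 -- positive association, always in [-1, 1]
```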
|
|
Term
Ecological correlation |
|
Definition
correlations based on averages or rates. Usually overstates the correlation |
|
|
Term
Regression standard deviation |
|
Definition
Regression SD is the “average size of error”: regression SD = √(1 - r²) × s_y, where s_y is the SD of the Y variable. ***Only use this when you are making a prediction involving prior information*** (think about the quizzes: when we picked a random student and guessed their quiz 2 score, we used the quiz 2 average and SD, but when we knew their quiz 1 score, we used the regression SD) |
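
A minimal sketch of the formula; the correlation and quiz 2 SD are made-up stand-ins for the quiz example:

```python
import math

r = 0.6     # hypothetical correlation between quiz 1 and quiz 2 scores
s_y = 10    # hypothetical SD of quiz 2 scores

regression_sd = math.sqrt(1 - r ** 2) * s_y   # sqrt(1 - 0.36) * 10
print(round(regression_sd, 2))   # 8.0 -- smaller than s_y, since quiz 1 helps
```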
|
|
Term
Regression to the mean |
|
Definition
observations that are extreme in the X-direction tend not to be as extreme in the Y-direction |
|
|
Term
Probability rules |
|
Definition
P(A) must be between 0 and 1
The total probability of all possible outcomes must add up to 1
P(A not happening) = 1 - P(A) |
|
|
Term
Normal curve |
|
Definition
Symmetric, bell shaped
Only need to know mean and standard deviation to define the whole curve
68% of all observations lie within +/- 1 sd of the mean, 95% within 2 sd, 99.7% within 3 sd
The standard score is the number of standard deviations an observation is away from the mean: standard score = (obs - mean) / SD
Once we have the standard score, we can look up P(X < standard score) in Table B of the book |
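
A minimal sketch of the standard-score lookup; it uses the exact normal CDF via math.erf in place of Table B, and the observation, mean, and SD are made-up numbers:

```python
import math

def normal_cdf(z):
    """P(Z < z) for a standard normal Z."""
    return (1 + math.erf(z / math.sqrt(2))) / 2

obs, mean, sd = 130, 100, 15     # hypothetical values
z = (obs - mean) / sd            # standard score = 2.0
print(round(normal_cdf(z), 4))   # P(X < obs) ~ 0.9772
```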
|
|
Term
Central Limit Theorem pt. 1 |
|
Definition
As we take larger and larger samples, the sum or average (not product or ratio) will begin to look like a normal curve |
|
|
Term
Central Limit Theorem pt. 2 |
|
Definition
If we take a sample proportion many times, the distribution of the p-hats will be approximately normal with mean = p and standard deviation √ [ p (1-p) / n ] |
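
A minimal simulation sketch of this card: draw many samples of size n, record p-hat each time, and compare the results with p and √[ p(1-p)/n ]; the values of p and n are made up:

```python
import math
import random

random.seed(0)
p, n = 0.3, 100
p_hats = [sum(random.random() < p for _ in range(n)) / n
          for _ in range(10_000)]

mean_sim = sum(p_hats) / len(p_hats)
var_sim = sum((x - mean_sim) ** 2 for x in p_hats) / (len(p_hats) - 1)
print(round(mean_sim, 3))                    # close to p = 0.3
print(round(math.sqrt(var_sim), 4))          # close to the theoretical SD
print(round(math.sqrt(p * (1 - p) / n), 4))  # theoretical SD = 0.0458
```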
|
|
Term
Central Limit Theorem pt. 3 |
|
Definition
We would expect 95% of all p-hats to be within 2 sd of the mean, or p ± 2√ [ p (1-p) / n ] |
|
|
Term
Central Limit Theorem pt. 4 |
|
Definition
When we don’t know the true p but have an estimate p-hat, we substitute p-hat into the formula: p-hat ± 2√ [ p-hat (1 - p-hat) / n ] |
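
A minimal sketch of the resulting interval; the counts are made up:

```python
import math

successes, n = 220, 400
p_hat = successes / n                       # 0.55
se = math.sqrt(p_hat * (1 - p_hat) / n)     # estimated SD of p-hat
low, high = p_hat - 2 * se, p_hat + 2 * se
print(f"95% CI: ({low:.3f}, {high:.3f})")   # 95% CI: (0.500, 0.600)
```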
|
|
Term
Tests of significance |
|
Definition
The basic idea is that we will reject the null hypothesis if our observation would be very unlikely to happen if the null hypothesis were true
Null hypothesis: the status quo, or the no-change option
Alternative hypothesis: usually what we are trying to prove |
|
|
Term
Calculating test of significance |
|
Definition
Assume the null hypothesis is true, and calculate how likely our sample result would be:
1. Determine the mean and standard deviation of the “null distribution” (the distribution when the null is true): mean = p and SD = √ [ p (1-p) / n ]
2. Find the standard score of p-hat: (p-hat - p) / √ [ p (1-p) / n ]
3. Look up the value in the table. ***You may need to subtract the value from 1 depending on whether you want the area to the left or to the right of the standard score***
This is the p-value: the probability that something as extreme or more extreme than our current observation would occur when the null is true
If the p-value is less than .05, reject the null hypothesis |
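
A minimal sketch of the whole procedure, using the erf-based normal CDF instead of the book’s table; the null value and sample counts are made up, and this version takes the area to the right (testing whether p > 0.5):

```python
import math

def normal_cdf(z):
    """P(Z < z) for a standard normal Z."""
    return (1 + math.erf(z / math.sqrt(2))) / 2

p_null = 0.5                  # hypothetical null hypothesis value of p
successes, n = 230, 400
p_hat = successes / n         # 0.575

sd_null = math.sqrt(p_null * (1 - p_null) / n)   # SD of the null distribution
z = (p_hat - p_null) / sd_null                   # standard score = 3.0
p_value = 1 - normal_cdf(z)                      # area to the right of z
print(round(z, 2), round(p_value, 4))            # 3.0 0.0013
print("reject the null" if p_value < 0.05 else "fail to reject the null")
```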
|
|