Term
|
Definition
| the most basic statistical model; a single independent variable and a single dependent variable: X → Y |
|
|
Term
|
Definition
| a single dependent variable and several independent variables |
|
|
Term
| fit line |
Definition
| A fit line is a line drawn through a scatterplot such that the sum of the squared vertical distances between each dot and the line is smaller than if the line were drawn any other way. |
|
|
Term
| Pearson's correlation (with regression) |
|
Definition
| a mathematical way of representing the information seen in the scatterplot. Conceptually, a correlation (denoted mathematically by the letter r) is a ratio of the degree to which X and Y vary together divided by the degree to which X and Y vary separately. Mathematically, Pearson’s correlation is based on the squared deviations of X and Y from the mean, with the mean visually represented by the fit line. A Pearson’s correlation, which is a number between -1 and 1, indicates (1) the direction of the relationship between X and Y, (2) the form of the relationship, and (3) the strength of the relationship. |
|
|
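Not part of the original card: a minimal Python sketch, with made-up data, of the ratio idea behind r — how X and Y vary together divided by how they vary separately.

```python
import numpy as np

# Hypothetical example data (not from the card)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# r as "shared variation / separate variation":
# covariance of X and Y divided by the product of their standard deviations
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov_xy / (x.std() * y.std())

# Same quantity from numpy's built-in correlation matrix
r_builtin = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 4), round(r_builtin, 4))  # both near +1: strong positive relationship
```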
Term
| slope |
Definition
| the indicator of how steep the line is. The slope is calculated as the amount that the line rises or falls for every one-unit change on the X-axis. |
|
|
Term
| Y-intercept |
Definition
| the point at which the line crosses the Y-axis. |
|
|
Term
| regression line |
Definition
| the line of best fit in regression; the formula is Ŷ = a + bX for a regression line based on a sample and Ŷ = α + βX for a population. We use the symbol Ŷ to mean the predicted value of Y. |
|
|
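A short Python sketch, with hypothetical sample data, of estimating a and b for Ŷ = a + bX by least squares; it also computes r², previewing the coefficient of determination card below.

```python
import numpy as np

# Hypothetical sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.8, 3.1, 4.2, 4.9, 6.1])

# Least-squares estimates of slope (b) and intercept (a) for Y-hat = a + bX
b, a = np.polyfit(x, y, deg=1)  # polyfit returns the highest-degree coefficient first

y_hat = a + b * x                                                 # predicted values on the fit line
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)   # coefficient of determination

print(f"Y-hat = {a:.2f} + {b:.2f}X, r^2 = {r2:.3f}")
```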
Term
| interpretation of the regression line |
|
Definition
| for every one-unit increase in X, Y increases/decreases by b (or β) units of Y. |
|
|
Term
| coefficient of determination |
|
Definition
| (r²) indicates the proportion of variance in Y that is explained by X. |
|
|
Term
| adjusted R² |
Definition
| R² is equal to the population R² plus (but never minus) random error. The inability to adjust downward for random error creates a bias that inflates R². Thus, adjusted R² adjusts for this bias (Tabachnick and Fidell 1996). In practical terms, the inflation of R² is a problem when sample sizes are small or when there are many independent variables in a model; a formula sketch follows this card. |
|
|
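A hedged sketch of the standard adjusted-R² formula the card alludes to, with n as the sample size and k as the number of independent variables (these symbols are not spelled out on the card).

```python
def adjusted_r2(r2, n, k):
    """Standard adjustment: shrinks R^2 more when n is small or k (the number of
    independent variables) is large -- the situations the card identifies."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Inflation matters most with small samples and many predictors
print(adjusted_r2(0.40, n=30, k=8))    # noticeably lower than .40
print(adjusted_r2(0.40, n=3000, k=8))  # barely changes
```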
Term
| interpretation of b in linear regression when the dependent variable is dichotomous |
|
Definition
| the regression coefficient indicates how much higher (or lower, in the case of a negative coefficient) Y is in the 1 category of X than in the 0 category of X. Note that this interpretation holds only if the dichotomous variable is coded as a dummy (0 and 1) and not for any other coding. |
|
|
Term
| interpretation of indicator variables in linear regression |
|
Definition
| those in the included group have b units of Y more (or less) than those in the reference group. |
|
|
Term
| F-test and t-test in multivariate models |
|
Definition
| In a multivariate model the F-test indicates whether or not the overall model is a good one; If the model is good (i.e., the p-value of the F-test is less than .05), then it is necessary to examine the p-values associated with the t-tests for each variable, as it is possible that some variables are significant predictors of the dependent variable, while other variables are not. |
|
|
Term
| standardized scales based on z-scores |
|
Definition
| To gain a more easily interpretable coefficient, scholars will often standardize their scales based on z-scores. A scale standardized in this manner will have a mean of 0 and a standard deviation of 1. Regression coefficients predicting standardized scale values are in units of standard deviations. |
|
|
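A minimal sketch, assuming some hypothetical raw scale scores, of standardizing a scale to z-scores so it has a mean of 0 and a standard deviation of 1.

```python
import numpy as np

scale = np.array([12.0, 15.0, 9.0, 20.0, 14.0, 18.0])  # hypothetical raw scale scores

z = (scale - scale.mean()) / scale.std(ddof=1)  # z-scores: deviations in standard-deviation units

print(round(z.mean(), 10), round(z.std(ddof=1), 10))  # 0.0 and 1.0 by construction
```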
Term
| natural log of a scale (and its interpretation) |
|
Definition
| Taking the natural log of a scale can also be useful, since regression coefficients based on natural logs can be easily converted to percentage changes. If you take the natural log of the dependent variable, the interpretation of the regression coefficient (b) is that every one-unit increase in X is associated with a 100(b) percent increase/decrease in Y. If you take the natural log of the independent variable, the interpretation is that a one percent increase in X results in a b/100 unit change in Y. The interpretation of a regression coefficient when both X and Y have been log transformed is that a one percent change in X results in a b percent change in Y; a worked conversion follows this card. |
|
|
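The conversions can be checked numerically; a sketch with hypothetical coefficient values (the 100(b) reading is the usual approximation, and 100(e^b − 1) is the exact percent change when Y is logged).

```python
import numpy as np

b = 0.08  # hypothetical coefficient from a regression with ln(Y) as the outcome

# Logged dependent variable: a one-unit increase in X ...
approx_pct = 100 * b                  # ~ 8 percent change in Y (the rule of thumb on the card)
exact_pct = 100 * (np.exp(b) - 1)     # exact percent change, ~ 8.33 percent

# Logged independent variable: a one percent increase in X ...
b_logx = 2.5                          # hypothetical coefficient on ln(X)
unit_change_in_y = b_logx / 100       # ~ 0.025 unit change in Y

print(approx_pct, round(exact_pct, 2), unit_change_in_y)
```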
Term
| spurious relationship |
Definition
| a mathematical relationship in which the occurrences have no causal connection, yet it may be inferred that they do, due to an unseen, third factor |
|
|
Term
| multivariate regression model |
|
Definition
| simultaneously examines the impact of several independent variables on a single interval-level dependent variable; not strictly linear regression, but the underlying assumption that each independent variable has a linear relationship to the dependent variable remains. |
|
|
Term
| interpretation of regression coefficients in multiple regression models |
|
Definition
| the coefficient for X1 is interpreted as the effect of a one-unit change in X1 on Y when X2 is controlled. |
|
|
Term
| Standardized regression coefficients |
|
Definition
| To compare the relative strength of coefficients, it is necessary to standardize the regression coefficients so that they are in the same metric. Standardized regression coefficients are often called betas, and they can be easily calculated in Stata by adding the beta subcommand to the regression command. Standardized regression coefficients range from -1 to 1, with values closer to the absolute value of 1 indicating greater influence on the model and values closer to zero indicating lesser influence on the model. Thus, a variable with a standardized regression coefficient of -.68 has more influence on the dependent variable than a variable with a beta of .06. |
|
|
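The card names Stata's beta option; as a hedged cross-check in Python (hypothetical data), the same quantity can be computed by hand as b·(sd of X / sd of Y), or by regressing z-scored variables.

```python
import numpy as np

# Hypothetical data: one predictor, one outcome
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 11.0])
y = np.array([10.0, 14.0, 15.0, 21.0, 24.0, 29.0])

b, a = np.polyfit(x, y, 1)                 # unstandardized slope
beta = b * x.std(ddof=1) / y.std(ddof=1)   # standardized coefficient: b scaled by sd(X)/sd(Y)

# Equivalent: slope from the regression of z-scored Y on z-scored X
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta_z, _ = np.polyfit(zx, zy, 1)

print(round(beta, 4), round(beta_z, 4))    # identical; equals Pearson's r in the bivariate case
```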
Term
| Anscombe quartet |
Definition
| The Anscombe quartet graphs point to the importance of exploring your underlying regression assumptions before beginning any research. In 1973, F.J. Anscombe created four sets of artificial data to demonstrate how nearly identical regression equations can be derived from four very divergent sets of data. These sets of data are referred to collectively as the Anscombe quartet, and each XY regression equation is Ŷ=3 + .5X. |
|
|
Term
| Seven underlying assumptions of linear regression |
|
Definition
| Conceptual: (1) errors in variables, (2) errors in model specification. Mathematical: (3) random distribution, (4) normality, (5) linearity, (6) homoskedasticity, (7) independence |
|
|
Term
| Errors in model specification |
|
Definition
| arise when you create a regression model that does not include all of the relevant variables or includes irrelevant variables. If we fail to control for factors that are likely to impact our predictors and our dependent variable, then any results we produce from our research are likely to be suspect. |
|
|
Term
|
Definition
| (measurement errors) occur when we improperly measure an outcome or predictor. It is necessary to obtain measures that come very close to reflecting the situations that you have conceptualized. |
|
|
Term
|
Definition
| (errors in variables) refers to the need to measure what we think we are measuring. |
|
|
Term
| postestimation techniques |
|
Definition
| There are five mathematical assumptions that underlie OLS regression. All OLS regression models and the data used in these models are assumed to meet the standards of (1) random distribution, (2) normality, (3) linearity, (4) homoskedasticity, and (5) independence. |
|
|
Term
| random distribution |
Definition
| the assumption that the cases from which the data are collected were randomly sampled and that there is some variation in the responses. There are two situations in which data may not be randomly sampled: design effects and missing data. |
|
|
Term
| design effects |
Definition
| one situation in which data may not be randomly sampled with which researchers using secondary data should be concerned; one in which the sampling design is more complex than a simple random sample. |
|
|
Term
| missing data |
Definition
| one situation in which data may not be randomly sampled; cases were randomly sampled, but the resultant data are biased due to data collection factors, e.g., if many respondents refuse to answer a question (or do not know the answer), then responses on that item may not be random, even if the respondents are randomly sampled. |
|
|
Term
| normality |
Definition
| assumes that all results fall within a couple of standard deviations from the means; Both outliers and leverage create influence in the model. When you have influence, it is unlikely that your model meets the assumptions of normality in OLS regression. |
|
|
Term
| outlier |
Definition
| a data point with a large residual, meaning that the distance between the data point and the line of best fit is large; because it is defined in reference to the line of best fit, it can only be identified in relation to a dependent variable |
|
|
Term
| leverage |
Definition
| occurs when an observation is far from its own mean (rather than from a regression line) |
|
|
Term
| four conditions under which data might be classified as missing |
|
Definition
| (1) Respondents might refuse to answer a question. (2) Respondents may want to answer a question, but do not know the answer. (3) Respondents may answer a question, but for a variety of reasons, their responses may be unusable; a respondent who is 35 years old, for example, cannot have 40 years of completed education. (4) Finally, respondents may not be asked a question, either because a question does not apply to them (e.g., age at first birth for a respondent with no children) or because data were collected from a random subsample of the larger sample. |
|
|
Term
| three rules for dealing with missing data |
|
Definition
| First, if fewer than five percent of the data are missing, do nothing. Second, if 5 to 15 percent of the cases on a variable are missing, replace the missing values with values derived from imputation. Third, if more than 15 percent of cases are missing, do not use the variable. |
|
|
Term
| imputation |
Definition
| refers to any of a number of mathematical means of determining information for cases on which you have incomplete data; should only be used if the variable for which you have missing cases is continuous or discrete. |
|
|
Term
| multiple imputation |
Definition
| a Monte Carlo technique in which the missing values are replaced by simulated versions based on other variables you specify (Rubin 1987). A conceptual way of thinking about this is that missing values are replaced by values that are typical for a respondent who shares a set of characteristics with other respondents. |
|
|
Term
| leverage versus residual squared (lvr2) plot |
|
Definition
| a tool used to get a visual idea of the points of influence and how far they lie from the rest of the data. |
|
|
Term
| Cook's D |
Definition
| an index that ranges from 0 to positive infinity used to measure points of influence. The basic idea is that the higher Cook’s D, the more influential the data point. How much influence is too much influence? One statistical convention is that any point greater than 4/n is too high. |
|
|
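A sketch of computing Cook's D and flagging points above the 4/n convention, using statsmodels on made-up data; the injected influential point is purely illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 + 0.5 * x + rng.normal(scale=0.5, size=50)
y[0] += 6  # hypothetical influential point

model = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = model.get_influence().cooks_distance[0]  # one Cook's D value per observation

threshold = 4 / len(y)                             # the conventional cutoff from the card
flagged = np.where(cooks_d > threshold)[0]
print("observations exceeding 4/n:", flagged)
```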
Term
| linearity |
Definition
| the assumption that the relationship between X and Y is best predicted by a straight line. |
|
|
Term
| residual versus fitted (rvf) plot |
|
Definition
| tool to measure linearity; an image in which standardized residuals are plotted against standardized predicted values of Y. Most produce a graph that looks fairly rectangular with concentrations of scores down the middle. |
|
|
Term
| 3 conditions that you don't want in your rvfplot |
|
Definition
| 1. Clustering of data points at the top or bottom of the rectangle. This means non-normality. Removing points of influence using Cook’s D should have eliminated that problem. 2. A curved, rather than rectangular, shape to the graph. This means that the X, Y relationship is not linear. 3. Clustering of data points to the left or to the right. This indicates heteroskedasticity. |
|
|
Term
|
Definition
| the assumption that errors (residuals) are identically distributed; violations of this assumption (heteroskedasticity) are problematic because although your regression coefficients will not be affected, your standard errors will be. This means that p-values are likely to be inaccurate, so you may run the risk of concluding that X and Y are significantly related when they are not. |
|
|
Term
|
Definition
| tested for “normal i.i.d.” This means that the researcher has determined that their data are normally, identically, and independently distributed. |
|
|
Term
| Cook-Weisberg test |
Definition
| a rather simple test for heterogeneity of variance; the null hypothesis of the Cook-Weisberg test is that variances are equal (homoskedasticity) and the alternative hypothesis is that variances are unequal (heteroskedasticity). You want p > .05. |
|
|
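A hedged sketch on hypothetical data using the closely related Breusch-Pagan score test in statsmodels (the Cook-Weisberg test is a variant of the same score test); the null hypothesis is again equal variances.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 1 + 0.3 * x + rng.normal(scale=1.0, size=200)  # hypothetical homoskedastic errors

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Null hypothesis: equal error variances (homoskedasticity)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(round(lm_pvalue, 3))  # p > .05 -> no evidence of heteroskedasticity
```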
Term
|
Definition
| assumption of independence is violated when you have two or more independent variables in your model that are highly correlated. If these variables are correlated, they are measuring the same underlying concept, which means that the residuals are likely to be associated. |
|
|
Term
| tolerance |
Definition
| a collinearity diagnostic; the degree to which independent variables are correlated to a degree that we can tolerate (or that does not violate the assumption of independence). Tolerance is calculated based on the variance inflation factor (or VIF). |
|
|
Term
|
Definition
| Meyers, et al. (need year) argue that it is fine to include two predictors if their tolerance (calculated as 1/VIF) is greater than .01. Stevens (2002) argues that variables are problematic if the VIFs are greater than 10. Chatterjee, et al. (2006) argue that variables are likely multicollinear if the mean VIF is greater than 1. |
|
|
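A sketch of computing VIF and tolerance (1/VIF) for each predictor with statsmodels, using made-up data in which two predictors are deliberately near-duplicates.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.2, size=300)   # hypothetical predictor highly correlated with x1
x3 = rng.normal(size=300)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF (and tolerance = 1/VIF) for each predictor, skipping the constant
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1 / vif:.3f}")
```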
Term
|
Definition
| standard errors that have been corrected for the inflation that occurs because of heteroskedastic data. |
|
|
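A hedged sketch of requesting heteroskedasticity-robust (Huber-White) standard errors in statsmodels; the HC1 variant and the simulated data are assumptions, not details from the card.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y = 2 + 0.4 * x + rng.normal(scale=0.3 * x, size=200)  # hypothetical heteroskedastic errors

X = sm.add_constant(x)
ordinary = sm.OLS(y, X).fit()                 # classical standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")     # Huber-White robust standard errors

print(ordinary.bse)  # standard errors under the homoskedasticity assumption
print(robust.bse)    # corrected (typically larger) standard errors
```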
Term
|
Definition
| means to draw boundaries around a subsample, including some respondents while excluding others. |
|
|
Term
|
Definition
| when data collection techniques include stratification, clustering, and unequal probability of selection |
|
|
Term
|
Definition
| indicates the proportion of each sample drawn from each stratum. Strata are mutually exclusive groups from which samples are drawn. |
|
|
Term
|
Definition
| indicates the primary sampling unit (PSU) from which the data are drawn. Clustering undermines the assumption of independence in the model because respondents drawn from the same cluster are likely to be more alike than respondents from different clusters. |
|
|
Term
| sampling weight (or probability) variable |
|
Definition
| indicates the number of subjects in the population represented by each participant. Mathematically, it is the inverse of the probability of being selected. |
|
|
Term
| population number of PSUs per stratum variable |
|
Definition
| a variable that is necessary if your sample is drawn without replacement (WOR). Sampling with replacement (WR) means that a sampling unit is returned to the population after selection. |
|
|
Term
| Huber or White estimators |
|
Definition
| linearized standard errors that are adjusted for violations of regression assumptions |
|
|
Term
| model building |
Definition
| the process by which we run a series of nested models in order to examine changes in coefficients |
|
|
Term
| full model |
Definition
| a regression equation that includes the variable (or variables) of interest along with all pertinent controls. |
|
|
Term
| reduced model |
Definition
| any subset of this model (including our bivariate model) |
|
|
Term
|
Definition
| The b we obtain when we run the bivariate model |
|
|
Term
|
Definition
| means that there is a relationship between X1 and Y, but that relationship is indirect. |
|
|
Term
|
Definition
| arises when the XY relationship is not very linear or when sample sizes are very small |
|
|
Term
| interaction effect |
Definition
| the slope between X1 and Y changes at different levels of X2 |
|
|
Term
| In model building- adding X2 to the model has no (or not much) effect on b1. Why? |
|
Definition
| This means that X1 leads to Y and X2 also leads to Y (assuming that relationship is significant), but that there is no relationship between X1 and X2. If the coefficients associated with both X1 and X2 are significant, we can conclude that they both show a direct effect on Y. |
|
|
Term
| In model building- adding X2 to the model results in the p-values associated with b1 becoming non-significant. Why? |
|
Definition
| One reason is that the relationship between X1 and Y is spurious. In other words, it looks like X1 is related to Y, but in fact X1 is related to X2 and X2 is related to Y, so it only looks like X1 and Y are related. The second reason that the relationship between X1 and Y may be rendered non-significant with the inclusion of X2 is that the relationship between X1 and Y is mediated by X2. |
|
|
Term
| In model building- adding X2 in your model, b1 remains significant, but the coefficient gets smaller. Why? |
|
Definition
| This means that some of the relationship between X1 and Y is mediated by X2, but not all of it. In short, X1 has both a direct and indirect effect on Y. In fact, in a type of model building called path analysis and its more sophisticated variant, structural equation modeling, the purpose is to examine the total effects of X1 on Y. Total effects are the sum of the direct and indirect effects. |
|
|
Term
| In model building- adding X2 to your model results in b1 increasing, increasing in statistical significance, and/or changing signs. Why? |
|
Definition
| One possibility is instability in the model. A more likely possibility is that X1 and X2 have an interaction effect. |
|
|
Term
|
Definition
| One way to determine which reduced models to include is to add each control to the model, one by one |
|
|
Term
| which regression to use when Y is ratio or interval |
|
Definition
| OLS regression (but NOT appropriate for count data or for ratio or interval data with a large number of cases that will be excluded) |
|
|
Term
| which regression to use when Y is dichotomous |
|
Definition
| logistic regression |
|
Term
| which regression to use when Y is ordinal |
|
Definition
| ordered logistic regression (ordered logit) |
|
Term
| which regression to use when Y is categorical and not ordinal |
|
Definition
| multinomial logistic regression |
|
|
Term
| which regression to use when Y is count data |
|
Definition
| binomial or Poisson regression |
|
|
Term
| which regression to use when Y is ratio or interval with a large number of cases that will be excluded |
|
Definition
|
|
Term
| regression model for OLS regression |
|
Definition
| Y = a + b1X1 + b2X2...+ bnXn + e |
|
|
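A minimal sketch of fitting Y = a + b1X1 + b2X2 + e on hypothetical data with statsmodels; the output includes the overall F-test and per-coefficient t-tests discussed earlier.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 0.5 * df["x1"] - 0.3 * df["x2"] + rng.normal(size=200)  # hypothetical data

# Y = a + b1*X1 + b2*X2 + e
fit = smf.ols("y ~ x1 + x2", data=df).fit()
print(fit.params)    # a (Intercept), b1, b2
print(fit.f_pvalue)  # F-test: is the overall model a good one?
print(fit.pvalues)   # t-test p-value for each coefficient
```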
Term
| regression model for logistic regression |
|
Definition
| logged odds of Y = a + b1X1 + b2X2...+ bnXn + e |
|
|
Term
| regression model for ordered logit |
|
Definition
| S = b1X1 + b2X2...+ bnXn + e (no constant) |
|
|
Term
| Interpreting coefficients for OLS regression |
|
Definition
| every one-unit increase in X is associated with a b-unit increase/decrease in Y |
|
|
Term
| Interpreting coefficients for logistic regression |
|
Definition
| first, convert the logged odds to an odds ratio (odds ratio = e^b). Every one-unit increase in X is associated with a 100 × (odds ratio − 1) percent increase (or, when the odds ratio is below 1, a 100 × (1 − odds ratio) percent decrease) in the odds of being in the 1 category of Y; a conversion sketch follows this card. |
|
|
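A small sketch of the conversion, using a hypothetical logged-odds coefficient.

```python
import numpy as np

b = -0.35                      # hypothetical logged-odds coefficient for X
odds_ratio = np.exp(b)         # convert logged odds to an odds ratio (~0.70)

pct_change = 100 * (odds_ratio - 1)  # percent change in the odds per one-unit increase in X
print(round(odds_ratio, 3), round(pct_change, 1))  # 0.705, -29.5 -> about a 30 percent decrease
```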
Term
| Interpreting coefficients for ordered logit |
|
Definition
| First, convert to an odds ratio. Every one-unit increase in X is associated with a 100 × (odds ratio − 1) percent increase (or 100 × (1 − odds ratio) percent decrease) in the odds of going up/down one point on the scale that is measured by Y |
|
|
Term
| Interpreting coefficients for multinomial logistic |
|
Definition
| First, convert to relative risk ratios. Every one-unit increase in X is associated with a 100 × (relative risk ratio − 1) percent increase (or 100 × (1 − relative risk ratio) percent decrease) in the relative risk of being in that category of Y instead of the reference category of Y |
|
|