Shared Flashcard Set

Details

Regression basics
for Social Sciences
80
Sociology
Graduate
01/04/2010

Cards

Term
bivariate model
Definition
the most basic statistical model; single independent variable and a single dependent variable
X -> Y
Term
multivariate model
Definition
a single dependent variable and several independent variables
Term
line of best fit
Definition
A fit line is a line drawn through a scatterplot such that the sum of the squared vertical distances between each dot and the line is smaller than it would be if the line were drawn any other way.
Term
Pearson's correlation (with regression)
Definition
a mathematical way of representing the information seen in the scatterplot. Conceptually, a correlation (denoted mathematically by the letter r) is a ratio of the degree to which X and Y vary together divided by the degree to which X and Y vary separately. Mathematically, Pearson’s correlation is based on the deviations of X and Y from their means: the sum of the cross-products of the deviations divided by the square root of the product of the summed squared deviations of X and of Y. The fit line passes through the point defined by the two means. A Pearson’s correlation, which is a number between -1 and 1, indicates (1) the direction of the relationship between X and Y, (2) the form of the relationship (r captures only linear form), and (3) the strength of the relationship.
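The "vary together divided by vary separately" idea can be sketched in a few lines of Python. This is only an illustration: the function name and the hours/score data are hypothetical and do not come from the flashcards.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: co-variation of X and Y divided by their separate variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: how much X and Y vary together (cross-products of deviations).
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: how much X and Y each vary around their own mean.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(ss_x * ss_y)

# Hypothetical data: hours studied (X) and exam score (Y).
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
r = pearson_r(hours, score)
print(round(r, 3))
```

A value near 1 or -1 signals a strong linear relationship; the sign gives the direction.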
Term
slope
Definition
the indicator of how steep the line is. The slope is calculated as the amount that the line rises or falls for every one-unit change along the X-axis.
Term
Y-intercept
Definition
the point at which the line crosses the Y-axis (the value of Y when X is 0).
Term
regression line
Definition
the line of best fit in regression; formula is Ŷ=a + bX for a regression line based on a sample and Ŷ=α + βX for a population. We use the symbol Ŷ to mean the predicted value of Y.
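A minimal sketch of how a and b in Ŷ = a + bX can be computed for a bivariate sample, using the standard least-squares formulas (b is the co-variation of X and Y over the variation of X, and a makes the line pass through the two means). The function names and the data are hypothetical. The last lines check the "line of best fit" property: nudging either a or b makes the sum of squared vertical distances larger.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y-hat = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx            # slope
    a = mean_y - b * mean_x    # Y-intercept
    return a, b

def sum_squared_errors(xs, ys, a, b):
    """Sum of squared vertical distances between each point and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical sample.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
best = sum_squared_errors(xs, ys, a, b)

# Any other line does worse than the fitted one.
print(best < sum_squared_errors(xs, ys, a + 0.1, b))
print(best < sum_squared_errors(xs, ys, a, b + 0.1))
```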
Term
interpretation of the regression line
Definition
for every one-unit increase in X, predicted Y increases/decreases by b (sample) or β (population) units of Y.
Term
coefficient of determination
Definition
(r²) It indicates the proportion of variance in Y explained by X.
Term
adjusted R²
Definition
The sample R² is equal to the population R² plus (but never minus) random error. The inability to adjust downward for random error creates a bias that inflates R². Thus, adjusted R² corrects for this bias (Tabachnick and Fidell 1996). In practical terms, the inflation of R² is a problem when sample sizes are small or when there are many independent variables in a model.
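The card does not give a formula, so the sketch below assumes the commonly used adjustment, 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the sample size and k the number of independent variables. The function name and the example numbers are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n cases and k independent variables.

    Assumes the common formula 1 - (1 - R^2) * (n - 1) / (n - k - 1).
    """
    if n - k - 1 <= 0:
        raise ValueError("need more cases than parameters (n > k + 1)")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R-squared and number of predictors, different sample sizes.
small_sample = adjusted_r2(0.30, n=25, k=5)
large_sample = adjusted_r2(0.30, n=2500, k=5)
print(round(small_sample, 3), round(large_sample, 3))
```

The downward adjustment is large in the small sample and nearly invisible in the large one, which matches the point about small samples and many predictors.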
Term
interpretation of b in linear regression when the independent variable is dichotomous
Definition
the regression coefficient indicates how much higher (or lower, in the case of a negative coefficient) Y is in the 1 category of X than in the 0 category of X. Note that this interpretation holds only if the dichotomous variable is coded as a dummy (0 and 1) and not for any other coding.
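A small check of the dummy-variable interpretation, under the assumption of a bivariate model: with X coded 0/1, the least-squares slope equals the difference between the two group means and the intercept equals the mean of the 0 group. The wage and union data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y-hat = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    return mean_y - b * mean_x, b

# Hypothetical data: hourly wage (Y) and union membership dummy (X).
union = [0, 0, 0, 1, 1, 1]
wage = [10, 12, 11, 15, 17, 16]

a, b = fit_line(union, wage)

group0 = [w for u, w in zip(union, wage) if u == 0]
group1 = [w for u, w in zip(union, wage) if u == 1]
mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)

# b is the gap between the group means; a is the mean of the 0 category.
print(b, mean1 - mean0, a, mean0)
```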
Term
interpretation of indicator variables in linear regression
Definition
those in the included group have b units of Y more/less than those in the reference group.
Term
F-test and t-test in multivariate models
Definition
In a multivariate model the F-test indicates whether or not the overall model is a good one. If the model is good (i.e., the p-value of the F-test is less than .05), then it is necessary to examine the p-values associated with the t-tests for each variable, as it is possible that some variables are significant predictors of the dependent variable while other variables are not.
Term
standardized scales based on z-scores
Definition
To gain a more easily interpretable coefficient, scholars will often standardize their scales based on z-scores. A scale standardized in this manner will have a mean of 0 and a standard deviation of 1. Regression coefficients predicting standardized scale values are in units of standard deviations.
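As a sketch, standardizing a scale means subtracting its mean from each score and dividing by its standard deviation. The function name and the scale scores are hypothetical, and the sample standard deviation (n - 1 in the denominator) is assumed.

```python
import statistics

def standardize(values):
    """Convert raw scores to z-scores (mean 0, standard deviation 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

# Hypothetical scores on a 0-40 attitude scale.
scale = [12, 18, 25, 31, 22, 9, 27]
z = standardize(scale)
print(round(statistics.mean(z), 6), round(statistics.stdev(z), 6))
```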
Term
natural log of a scale (and its interpretation)
Definition
Taking the natural log of the scale can also be useful, since regression coefficients based on natural logs can be easily converted to percentage changes. If you take the natural log of the dependent variable, the interpretation of the regression coefficient (b) is that every one-unit increase in X is associated with approximately a 100(b) percent increase/decrease in Y (the approximation is good when b is small; the exact figure is 100(e^b - 1) percent). If you take the natural log of the independent variable, the interpretation is that a one percent increase in X is associated with approximately a b/100 unit change in Y. When both X and Y have been log transformed, a one percent change in X is associated with approximately a b percent change in Y.
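The three rules of thumb are approximations. A short sketch, with hypothetical function names and coefficients, compares each one with the exact conversion it stands in for.

```python
import math

def pct_change_logged_y(b):
    """Exact percent change in Y per one-unit increase in X when Y is logged."""
    return 100 * (math.exp(b) - 1)

def unit_change_logged_x(b, pct=1.0):
    """Exact unit change in Y for a pct-percent increase in X when X is logged."""
    return b * math.log(1 + pct / 100)

def pct_change_log_log(b, pct=1.0):
    """Exact percent change in Y for a pct-percent increase in X, both logged."""
    return 100 * ((1 + pct / 100) ** b - 1)

# Hypothetical coefficients.
print(pct_change_logged_y(0.05))   # rule of thumb says 100 * 0.05 = 5 percent
print(unit_change_logged_x(2.0))   # rule of thumb says 2.0 / 100 = 0.02 units
print(pct_change_log_log(0.8))     # rule of thumb says 0.8 percent
```

For small coefficients the shortcuts are close; for a logged dependent variable with a large b (say 0.5) the exact conversion is noticeably bigger than 100(b).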
Term
spurious
Definition
a mathematical relationship in which the occurrences have no causal connection, yet it may be inferred that they do, due to an unseen third factor
Term
multivariate regression model
Definition
simultaneously examine the impact of several independent variables on a single interval-level dependent variable; not strictly linear regression, but the underlying assumption that each independent variable has a linear relationship to the dependent variable remains.
Term
interpretation of regression coefficients in multiple regression models
Definition
interpret the coefficient for X1 as the effect of a one-unit change in X1 on Y when X2 is controlled (held constant).
Term
Standardized regression coefficients
Definition
To compare the relative strength of coefficients, it is necessary to standardize the regression coefficients so that they are in the same metric. Standardized regression coefficients are often called betas, and they can be easily calculated in Stata by adding the beta option to the regression command. Standardized regression coefficients usually range from -1 to 1 (in a bivariate model the beta equals r), with values closer to an absolute value of 1 indicating greater influence on the model and values closer to zero indicating lesser influence on the model. Thus, a variable with a standardized regression coefficient of -.68 has more influence on the dependent variable than a variable with a beta of .06.
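Outside Stata, a beta can be sketched by rescaling the unstandardized coefficient: beta = b × (SD of X / SD of Y). The example below is a bivariate illustration with hypothetical data and function names. It shows that b changes when X is re-expressed in different units, while the beta does not.

```python
import statistics

def slope(xs, ys):
    """Unstandardized least-squares slope b for a bivariate model."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return s_xy / s_xx

def standardized_beta(xs, ys):
    """Beta: the slope expressed in standard deviation units."""
    return slope(xs, ys) * statistics.stdev(xs) / statistics.stdev(ys)

# Hypothetical data: income measured in dollars and in thousands of dollars.
income_dollars = [20000, 35000, 50000, 65000, 80000]
income_thousands = [v / 1000 for v in income_dollars]
happiness = [3, 4, 4, 6, 7]

b_dollars = slope(income_dollars, happiness)
b_thousands = slope(income_thousands, happiness)
beta_dollars = standardized_beta(income_dollars, happiness)
beta_thousands = standardized_beta(income_thousands, happiness)
print(b_dollars, b_thousands, beta_dollars, beta_thousands)
```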
Term
Anscombe quartet
Definition
The Anscombe quartet graphs point to the importance of exploring your underlying regression assumptions before interpreting any results. In 1973, F.J. Anscombe created four sets of artificial data to demonstrate how nearly identical regression equations can be derived from four very divergent sets of data. These sets of data are referred to collectively as the Anscombe quartet, and each XY regression equation is approximately Ŷ=3 + .5X.
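The claim can be checked directly. The sketch below fits a least-squares line to each of the four data sets as they are usually reproduced from Anscombe (1973); the helper name is hypothetical. Plotting the four sets (not shown here) is what reveals how different they are.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y-hat = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    return mean_y - b * mean_x, b

# Sets I-III share the same X values; set IV has its own.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (a, b) in fits.items():
    print(f"Set {name}: Y-hat = {a:.2f} + {b:.2f}X")
```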
Term
Seven underlying assumptions of linear regression
Definition
conceptual
1) errors in variables
2) errors in model specification

mathematical
3) random distribution
4) normality
5) Linearity
6) Homoskedasticity
7) Independence
Term
Errors in model specification
Definition
arise when you create a regression model that does not include all of the relevant variables or includes irrelevant variables. If we fail to control for factors that are likely to impact our predictors and our dependent variable, then any results we produce from our research are likely to be suspect.
Term
Errors in variables
Definition
(measurement errors) occur when we improperly measure an outcome or predictor. It is necessary to obtain measures that come very close to reflecting the situations that you have conceptualized.
Term
Measurement validity
Definition
(errors in variables) refers to the need to measure what we think we are measuring.
Term
postestimation techniques
Definition
checks run after a model has been estimated to see whether it meets its assumptions. There are five mathematical assumptions that underlie OLS regression. All OLS regression models and the data used in these models are assumed to meet the standards of (1) random distribution, (2) normality, (3) linearity, (4) homoskedasticity, and (5) independence.
Term
Random distribution
Definition
the assumption that the cases from which the data are collected were randomly sampled and that there is some variation in the responses. There are two situations in which data may not be randomly sampled: design effects and missing data.
Term
design effects
Definition
one situation in which data may not be randomly sampled with which researchers using secondary data should be concerned; one in which the sampling design is more complex than a simple random sample.
Term
missing data
Definition
one situation in which data may not be randomly sampled; cases were randomly sampled, but the resultant data are biased due to data collection factors. For example, if many respondents refuse to answer a question (or do not know the answer), then responses on that item may not be random, even if the respondents are randomly sampled.
Term
Normality
Definition
assumes that the errors (residuals) are normally distributed, so that nearly all results fall within a couple of standard deviations of the mean. Both outliers and leverage create influence in the model. When you have influence, it is unlikely that your model meets the assumption of normality in OLS regression.
Term
outliers
Definition
a data point with a large residual, meaning that the distance between the data point and the line of best fit is large. Because it is defined with reference to the line of best fit, an outlier can only be identified in relation to a dependent variable.
Term
point of leverage
Definition
occurs when an observation is far from its own mean (rather than from a regression line)
Term
four conditions under which data might be classified as missing
Definition
Respondents might refuse to answer a question or respondents may want to answer a question, but do not know the answer. Respondents may answer a question, but for a variety of reasons, their responses may be unusable. A respondent who is 35 years old, for example, cannot have 40 years of completed information. Finally, respondents may not be asked a question either because a question does not apply to them (e.g., age at first birth for a respondent with no children) or because data were collected from a random subsample of the larger sample.
Term
three rules for dealing with missing data
Definition
First, if fewer than five percent of the data are missing, do nothing. Second, if 5 to 15 percent of the cases on a variable are missing, replace the missing values with values derived from imputation. Third, if more than 15 percent of cases are missing, do not use the variable.
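The three rules can be written as a small decision function. This is only a sketch of the card's rule of thumb; the function name, return labels, and counts are hypothetical.

```python
def missing_data_rule(n_missing, n_total):
    """Apply the three-part rule of thumb to one variable."""
    pct_missing = 100 * n_missing / n_total
    if pct_missing < 5:
        return "do nothing"
    if pct_missing <= 15:
        return "impute missing values"
    return "do not use the variable"

# Hypothetical variables from a survey of 1,000 respondents.
print(missing_data_rule(30, 1000))    # 3 percent missing
print(missing_data_rule(100, 1000))   # 10 percent missing
print(missing_data_rule(200, 1000))   # 20 percent missing
```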
Term
imputation
Definition
refers to any of a number of mathematical means of determining information for cases on which you have incomplete data; should only be used if the variable for which you have missing cases is continuous or discrete.
Term
multiple imputation
Definition
a Monte Carlo technique in which the missing values are replaced by simulated versions based on other variables you specify (Rubin 1987). A conceptual way of thinking about this is that missing values are replaced by values that are typical for a respondent who shares a set of characteristics with other respondents.
Term
leverage versus residual squared (lvr2) plot
Definition
a tool used to get a visual idea of the points of influence and how far they lie from the rest of the data.
Term
Cook’s D
Definition
an index that ranges from 0 to positive infinity used to measure points of influence. The basic idea is that the higher Cook’s D, the more influential the data point. How much influence is too much influence? One statistical convention is that any point greater than 4/n is too high.
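A sketch of Cook's D for a bivariate model, assuming the usual formula D = (e² / (p × MSE)) × h / (1 - h)², where e is the residual, h the leverage, p the number of estimated parameters (2 here), and MSE the mean squared error. The function name and data are hypothetical; the last case is deliberately far from the rest on X and far from the line.

```python
def cooks_distances(xs, ys):
    """Cook's D for each case in a bivariate least-squares regression."""
    n = len(xs)
    p = 2  # parameters estimated: intercept and slope
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / s_xx
    a = mean_y - b * mean_x
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    distances = []
    for x, e in zip(xs, residuals):
        h = 1 / n + (x - mean_x) ** 2 / s_xx  # leverage of this case
        distances.append((e ** 2 / (p * mse)) * h / (1 - h) ** 2)
    return distances

# Hypothetical data; the tenth case is unusual on both X and Y.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
ys = [2, 4, 6, 8, 10, 12, 14, 16, 18, 10]

d = cooks_distances(xs, ys)
cutoff = 4 / len(xs)  # the 4/n convention
flagged = [i for i, value in enumerate(d) if value > cutoff]
print(flagged)
```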
Term
Linearity
Definition
the assumption that the relationship between X and Y is best predicted by a straight line.
Term
residual versus fitted (rvf) plot
Definition
a tool for checking linearity; an image in which standardized residuals are plotted against standardized predicted values of Y. Most models produce a graph that looks fairly rectangular with concentrations of scores down the middle.
Term
3 conditions that you don't want in your rvfplot
Definition
1. Clustering of data points at the top or bottom of the rectangle. This means non-normality. Removing points of influence using Cook’s D should have eliminated that problem.
2. A curved, rather than rectangular, shape to the graph. This means that the X, Y relationship is not linear.
3. Clustering of data points to the left or to the right. This indicates heteroskedasticity.
Term
Homoskedasticity
Definition
when errors (residuals) are identically distributed, that is, have constant variance. Its violation, heteroskedasticity, is problematic because although your regression coefficients will not be affected, your standard errors will be. This means that p-values are likely to be inaccurate, so you may run the risk of concluding that X and Y are significantly related when they are not.
Term
Gauss-Markov assumptions
Definition
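a model that meets these assumptions is described as having been tested for “normal i.i.d.” This means that the researcher has determined that the errors are normally, identically, and independently distributed. (Strictly, the Gauss-Markov theorem itself does not require normality; normality is an additional assumption used for significance testing.)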
Term
Cook-Weisberg test
Definition
a rather simple test for heterogeneity of variance; the null hypothesis of the Cook-Weisberg test is that variances are equal (homoskedasticity) and the alternative hypothesis is that variances are unequal (heteroskedasticity). You want p > .05, so that the null hypothesis of equal variances is not rejected.
Term
Independence
Definition
assumption of independence is violated when you have two or more independent variables in your model that are highly correlated. If these variables are correlated, they are measuring the same underlying concept, which means that the residuals are likely to be associated.
Term
tolerance
Definition
collinearity diagnosis; the degree to which independent variables are correlated to a degree that we can tolerate (or that does not violate the assumption of independence). Tolerance is calculated from the variance inflation factor (VIF) as 1/VIF.
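A sketch of the tolerance and VIF arithmetic for the simplest case, a model with exactly two predictors. It assumes the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others; with two predictors that R² is just their squared correlation. The function names and data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two lists of equal length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(ss_x * ss_y)

def tolerance_and_vif(x1, x2):
    """Tolerance and VIF for a model with exactly two predictors."""
    r_squared = pearson_r(x1, x2) ** 2
    tolerance = 1 - r_squared
    return tolerance, 1 / tolerance

# Hypothetical predictors that correlate at r = 0.8.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
tolerance, vif = tolerance_and_vif(x1, x2)
print(round(tolerance, 3), round(vif, 3))
```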
Term
interpreting VIFs
Definition
Meyers et al. (need year) argue that it is fine to include two predictors if their tolerance (calculated as 1/VIF) is greater than .01. Stevens (2002) argues that variables are problematic if the VIFs are greater than 10. Chatterjee et al. (2006) argue that variables are likely multicollinear if the mean VIF is greater than 1.
Term
Robust standard errors
Definition
standard errors that have been corrected for the distortion that occurs because of heteroskedastic data.
Term
delimit
Definition
means to draw boundaries around a subsample, including some respondents while excluding others.
Term
design effects
Definition
when data collection techniques include stratification, clustering, and unequal probability of selection
Term
stratification variable
Definition
indicates the proportion of each sample drawn from each stratum. Strata are mutually exclusive groups from which samples are drawn.
Term
cluster variable
Definition
indicates the primary sampling unit (PSU) from which the data are drawn. Clustering undermines the assumption of independence in the model because respondents drawn from the same cluster are likely to be more alike than respondents from different clusters.
Term
sampling weight (or probability) variable
Definition
indicates the number of subjects in the population represented by each participant. Mathematically, it is the inverse of the probability of being selected.
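A minimal sketch of the inverse-probability idea, with hypothetical selection probabilities and scores: a respondent selected with probability .01 stands in for 100 people, and a weighted mean lets each respondent count that many times.

```python
def sampling_weight(selection_probability):
    """Number of population members one sampled respondent represents."""
    if not 0 < selection_probability <= 1:
        raise ValueError("selection probability must be in (0, 1]")
    return 1 / selection_probability

def weighted_mean(values, weights):
    """Mean of values with each case counted according to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical sample: the last two respondents come from an oversampled group.
probabilities = [0.01, 0.01, 0.05, 0.05]
scores = [10, 20, 30, 40]

weights = [sampling_weight(p) for p in probabilities]
print(sum(scores) / len(scores))       # unweighted mean
print(weighted_mean(scores, weights))  # weighted mean
```

The weighted mean is pulled toward the first two respondents because each of them represents more of the population.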
Term
population number of PSUs per stratum variable
Definition
a variable that is necessary if your sample is drawn without replacement (WOR). Sampling with replacement (WR) means that a sampling unit is returned to the population after selection.
Term
Huber or White estimators
Definition
linearized standard errors that are adjusted for violations of regression assumptions
Term
Model building
Definition
the process by which we run a series of nested models in order to examine changes in coefficients
Term
full model
Definition
a regression equation that includes the variable (or variables) of interest along with all pertinent controls.
Term
a reduced model
Definition
any subset of the full model (including our bivariate model)
Term
main effect
Definition
The b we obtain when we run the bivariate model
Term
intervening variable
Definition
means that there is a relationship between X1 and Y, but that relationship is indirect.
Term
instability in the model
Definition
arises when the XY relationship is not very linear or when sample sizes are very small
Term
interaction effect
Definition
the slope between X1 and Y changes at different levels of X2
Term
In model building- adding X2 to the model has no (or not much) effect on b1. Why?
Definition
This means that X1 leads to Y and X2 also leads to Y (assuming that relationship is significant), but that there is no relationship between X1 and X2. If the coefficients associated with both X1 and X2 are significant, we can conclude that they both show a direct effect on Y.
Term
In model building- adding X2 to the model results in the p-values associated with b1 becoming non-significant. Why?
Definition
One reason is that the relationship between X1 and Y is spurious. In other words, it looks like X1 is related to Y, but in fact, X1 is related to X2 and X2 is related to Y, and it only looks like X1 and Y are related.
The second reason that the relationship between X1 and Y may be rendered non-significant with the inclusion of X2 is that the relationship between X1 and Y is mediated by X2.
Term
In model building- adding X2 in your model, b1 remains significant, but the coefficient gets smaller. Why?
Definition
This means that some of the relationship between X1 and Y is mediated by X2, but not all of it. In short, X1 has both a direct and indirect effect on Y. In fact, in a type of model building called path analysis and its more sophisticated variant, structural equation modeling, the purpose is to examine the total effects of X1 on Y. Total effects are the sum of the direct and indirect effects.
Term
In model building- adding X2 to your model results in b1 increasing, increasing in statistical significance, and/or changing signs. Why?
Definition
One possibility is instability in the model.
A more likely possibility is that X1 and X2 have an interaction effect.
Term
vector
Definition
a group of variables
Term
Stepwise regression
Definition
One way to determine which reduced models to include in your analysis; add each control to the model, one by one
Term
which regression to use when Y is ratio or interval
Definition
OLS regression (but NOT appropriate for count data or for ratio or interval data with a large number of cases that will be excluded)
Term
which regression to use when Y is dichotomous
Definition
logistic regression
Term
which regression to use when Y is ordinal
Definition
Ordered logit
Term
which regression to use when Y is categorical and not ordinal
Definition
multinomial logistic regression
Term
which regression to use when Y is count data
Definition
negative binomial or Poisson regression
Term
which regression to use when Y is ratio or interval with a large number of cases that will be excluded
Definition
truncated regression
Term
regression model for OLS regression
Definition
Y = a + b1X1 + b2X2 + ... + bnXn + e
Term
regression model for logistic regression
Definition
logged odds of Y = a + b1X1 + b2X2 + ... + bnXn
(no separate error term; the randomness is in whether Y is 0 or 1)
Term
regression model for ordered logit
Definition
S = b1X1 + b2X2 + ... + bnXn + e
(no constant)
Term
Interpreting coefficients for OLS regression
Definition
every one-unit increase in X is associated with a b unit increase/decrease in Y
Term
Interpreting coefficients for logistic regression
Definition
first, convert logged odds to odds ratios by exponentiating the coefficient (OR = e^b). Every one-unit increase in X is associated with a 100(OR - 1) percent increase/decrease in the odds of being in the 1 category of Y
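The conversion from a logistic coefficient to a percent change in the odds can be sketched as follows. The function names and coefficients are hypothetical.

```python
import math

def odds_ratio(b):
    """Convert a logged-odds coefficient to an odds ratio."""
    return math.exp(b)

def percent_change_in_odds(b):
    """Percent change in the odds of Y = 1 per one-unit increase in X."""
    return 100 * (odds_ratio(b) - 1)

# Hypothetical coefficients from a logistic regression.
b_positive = 0.405
b_negative = -0.693

print(round(odds_ratio(b_positive), 3), round(percent_change_in_odds(b_positive), 1))
print(round(odds_ratio(b_negative), 3), round(percent_change_in_odds(b_negative), 1))
```

An odds ratio above 1 gives a percent increase in the odds, an odds ratio below 1 gives a percent decrease, and a coefficient of 0 (odds ratio of 1) means no change.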
Term
Interpreting coefficients for ordered logit
Definition
First, convert to odds ratios (OR = e^b). Every one-unit increase in X is associated with a 100(OR - 1) percent increase/decrease in the odds of going up one point on the scale that is measured by Y
Term
Interpreting coefficients for multinomial logistic
Definition
First, convert to relative risk ratios (RRR = e^b). Every one-unit increase in X is associated with a 100(RRR - 1) percent increase/decrease in the relative risk of being in that category of Y instead of the reference category of Y