Term
|
Definition
| the most basic statistical model; a single independent variable and a single dependent variable: X → Y |
|
|
Term
|
Definition
| a single dependent variable and several independent variables |
|
|
Term
| fit line |
Definition
| A fit line is a line drawn through a scatterplot such that the sum of the squared vertical distances between each dot and the line is smaller than if the line were drawn any other way. |
|
|
Term
| Pearson's correlation (with regression) |
|
Definition
| a mathematical way of representing the information seen in the scatterplot. Conceptually, a correlation (denoted mathematically by the letter r) is a ratio of the degree to which X and Y vary together divided by the degree to which X and Y vary separately. Mathematically, Pearson’s correlation is based on the squared deviations of X and Y from the mean, with the mean visually represented by the fit line. A Pearson’s correlation, which is a number between -1 and 1, indicates (1) the direction of the relationship between X and Y, (2) the form of the relationship, and (3) the strength of the relationship. |
|
|
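Not part of the original card: a minimal Python sketch, with made-up data, of the ratio idea behind r — how X and Y vary together divided by how they vary separately.

```python
import numpy as np

# Hypothetical example data (not from the card)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# r as "shared variation / separate variation":
# covariance of X and Y divided by the product of their standard deviations
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov_xy / (x.std() * y.std())

# Same quantity from numpy's built-in correlation matrix
r_builtin = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 4), round(r_builtin, 4))  # both near +1: strong positive relationship
```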
Term
| slope |
Definition
| the indicator of how steep the line is. The slope is calculated as the amount that the line rises or falls for every one-unit change on the X-axis. |
|
|
Term
| Y-intercept |
Definition
| the point at which the line crosses the Y-axis. |
|
|
Term
| regression line |
Definition
| the line of best fit in regression; the formula is Ŷ = a + bX for a regression line based on a sample and Ŷ = α + βX for a population. We use the symbol Ŷ to mean the predicted value of Y. |
|
|
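A short Python sketch, with hypothetical sample data, of estimating a and b for Ŷ = a + bX by least squares; it also computes r², previewing the coefficient of determination card below.

```python
import numpy as np

# Hypothetical sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.8, 3.1, 4.2, 4.9, 6.1])

# Least-squares estimates of slope (b) and intercept (a) for Y-hat = a + bX
b, a = np.polyfit(x, y, deg=1)  # polyfit returns the highest-degree coefficient first

y_hat = a + b * x                                                 # predicted values on the fit line
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)   # coefficient of determination

print(f"Y-hat = {a:.2f} + {b:.2f}X, r^2 = {r2:.3f}")
```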
Term
| interpretation of the regression line |
|
Definition
| for every one-unit increase in X, Y increases/decreases by b (or β) units of Y. |
|
|
Term
| coefficient of determination |
|
Definition
| (r²) indicates the proportion of variance in Y that is explained by X. |
|
|
Term
| adjusted R² |
Definition
| R² is equal to the population R² plus (but never minus) random error. The inability to adjust downward for random error creates a bias that inflates R². Thus, adjusted R² adjusts for this bias (Tabachnick and Fidell 1996). In practical terms, the inflation of R² is a problem when sample sizes are small or when there are many independent variables in a model; a formula sketch follows this card. |
|
|
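A hedged sketch of the standard adjusted-R² formula the card alludes to, with n as the sample size and k as the number of independent variables (these symbols are not spelled out on the card).

```python
def adjusted_r2(r2, n, k):
    """Standard adjustment: shrinks R^2 more when n is small or k (the number of
    independent variables) is large -- the situations the card identifies."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Inflation matters most with small samples and many predictors
print(adjusted_r2(0.40, n=30, k=8))    # noticeably lower than .40
print(adjusted_r2(0.40, n=3000, k=8))  # barely changes
```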
Term
| interpretation of b in linear regression when the dependent variable is dichotomous |
|
Definition
| the regression coefficient indicates how much higher (or lower, in the case of a negative coefficient) Y is in the 1 category of X than in the 0 category of X. Note that this interpretation holds only if the dichotomous variable is coded as a dummy (0 and 1) and not for any other coding. |
|
|
Term
| interpretation of indicator variables in linear regression |
|
Definition
| those in the included group have b units of Y more (or less) than those in the reference group. |
|
|
Term
| F-test and t-test in multivariate models |
|
Definition
| In a multivariate model the F-test indicates whether or not the overall model is a good one; If the model is good (i.e., the p-value of the F-test is less than .05), then it is necessary to examine the p-values associated with the t-tests for each variable, as it is possible that some variables are significant predictors of the dependent variable, while other variables are not. |
|
|
Term
| standardized scales based on z-scores |
|
Definition
| To gain a more easily interpretable coefficient, scholars will often standardize their scales based on z-scores. A scale standardized in this manner will have a mean of 0 and a standard deviation of 1. Regression coefficients predicting standardized scale values are in units of standard deviations. |
|
|
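A minimal sketch, assuming some hypothetical raw scale scores, of standardizing a scale to z-scores so it has a mean of 0 and a standard deviation of 1.

```python
import numpy as np

scale = np.array([12.0, 15.0, 9.0, 20.0, 14.0, 18.0])  # hypothetical raw scale scores

z = (scale - scale.mean()) / scale.std(ddof=1)  # z-scores: deviations in standard-deviation units

print(round(z.mean(), 10), round(z.std(ddof=1), 10))  # 0.0 and 1.0 by construction
```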
Term
| natural log of a scale (and its interpretation) |
|
Definition
| Taking the natural log of a scale can also be useful, since regression coefficients based on natural logs can be easily converted to percentage changes. If you take the natural log of the dependent variable, the interpretation of the regression coefficient (b) is that every one-unit increase in X is associated with a 100(b) percent increase/decrease in Y. If you take the natural log of the independent variable, the interpretation is that a one percent increase in X results in a b/100 unit change in Y. The interpretation of a regression coefficient when both X and Y have been log transformed is that a one percent change in X results in a b percent change in Y; a worked conversion follows this card. |
|
|
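The conversions can be checked numerically; a sketch with hypothetical coefficient values (the 100(b) reading is the usual approximation, and 100(e^b − 1) is the exact percent change when Y is logged).

```python
import numpy as np

b = 0.08  # hypothetical coefficient from a regression with ln(Y) as the outcome

# Logged dependent variable: a one-unit increase in X ...
approx_pct = 100 * b                  # ~ 8 percent change in Y (the rule of thumb on the card)
exact_pct = 100 * (np.exp(b) - 1)     # exact percent change, ~ 8.33 percent

# Logged independent variable: a one percent increase in X ...
b_logx = 2.5                          # hypothetical coefficient on ln(X)
unit_change_in_y = b_logx / 100       # ~ 0.025 unit change in Y

print(approx_pct, round(exact_pct, 2), unit_change_in_y)
```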
Term
| spurious relationship |
Definition
| a mathematical relationship in which the occurrences have no causal connection, yet it may be inferred that they do, due to an unseen, third factor |
|
|
Term
| multivariate regression model |
|
Definition
| simultaneously examines the impact of several independent variables on a single interval-level dependent variable; not strictly linear regression, but the underlying assumption that each independent variable has a linear relationship to the dependent variable remains. |
|
|
Term
| interpretation of regression coefficients in multiple regression models |
|
Definition
| the coefficient for X1 is interpreted as the effect of a one-unit change in X1 on Y when X2 is controlled. |
|
|
Term
| Standardized regression coefficients |
|
Definition
| To compare the relative strength of coefficients, it is necessary to standardize the regression coefficients so that they are in the same metric. Standardized regression coefficients are often called betas, and they can be easily calculated in Stata by adding the beta subcommand to the regression command. Standardized regression coefficients range from -1 to 1, with values closer to the absolute value of 1 indicating greater influence on the model and values closer to zero indicating lesser influence on the model. Thus, a variable with a standardized regression coefficient of -.68 has more influence on the dependent variable than a variable with a beta of .06. |
|
|
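The card names Stata's beta option; as a hedged cross-check in Python (hypothetical data), the same quantity can be computed by hand as b·(sd of X / sd of Y), or by regressing z-scored variables.

```python
import numpy as np

# Hypothetical data: one predictor, one outcome
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 11.0])
y = np.array([10.0, 14.0, 15.0, 21.0, 24.0, 29.0])

b, a = np.polyfit(x, y, 1)                 # unstandardized slope
beta = b * x.std(ddof=1) / y.std(ddof=1)   # standardized coefficient: b scaled by sd(X)/sd(Y)

# Equivalent: slope from the regression of z-scored Y on z-scored X
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta_z, _ = np.polyfit(zx, zy, 1)

print(round(beta, 4), round(beta_z, 4))    # identical; equals Pearson's r in the bivariate case
```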
Term
| Anscombe quartet |
Definition
| The Anscombe quartet graphs point to the importance of exploring your underlying regression assumptions before beginning any research. In 1973, F.J. Anscombe created four sets of artificial data to demonstrate how nearly identical regression equations can be derived from four very divergent sets of data. These sets of data are referred to collectively as the Anscombe quartet, and each XY regression equation is Ŷ=3 + .5X. |
|
|
Term
| Seven underlying assumptions of linear regression |
|
Definition
| Conceptual: (1) errors in variables, (2) errors in model specification. Mathematical: (3) random distribution, (4) normality, (5) linearity, (6) homoskedasticity, (7) independence |
|
|
Term
| Errors in model specification |
|
Definition
| arise when you create a regression model that does not include all of the relevant variables or includes irrelevant variables. If we fail to control for factors that are likely to impact our predictors and our dependent variable, then any results we produce from our research are likely to be suspect. |
|
|
Term
|
Definition
| (measurement errors) occur when we improperly measure an outcome or predictor. It is necessary to obtain measures that come very close to reflecting the situations that you have conceptualized. |
|
|
Term
|
Definition
| (errors in variables) refers to the need to measure what we think we are measuring. |
|
|
Term
| postestimation techniques |
|
Definition
| There are five mathematical assumptions that underlie OLS regression. All OLS regression models and the data used in these models are assumed to meet the standards of (1) random distribution, (2) normality, (3) linearity, (4) homoskedasticity, and (5) independence. |
|
|
Term
| random distribution |
Definition
| the assumption that the cases from which the data are collected were randomly sampled and that there is some variation in the responses. There are two situations in which data may not be randomly sampled: design effects and missing data. |
|
|
Term
| design effects |
Definition
| one situation in which data may not be randomly sampled with which researchers using secondary data should be concerned; one in which the sampling design is more complex than a simple random sample. |
|
|
Term
| missing data |
Definition
| one situation in which data may not be randomly sampled; cases were randomly sampled, but the resultant data are biased due to data collection factors, e.g., if many respondents refuse to answer a question (or do not know the answer), then responses on that item may not be random, even if the respondents are randomly sampled. |
|
|
Term
| normality |
Definition
| assumes that all results fall within a couple of standard deviations from the means; Both outliers and leverage create influence in the model. When you have influence, it is unlikely that your model meets the assumptions of normality in OLS regression. |
|
|
Term
| outlier |
Definition
| a data point with a large residual, meaning that the distance between the data point and the line of best fit is large; because it is defined in reference to the line of best fit, it can only be identified in relation to a dependent variable |
|
|
Term
| leverage |
Definition
| occurs when an observation is far from its own mean (rather than from a regression line) |
|
|
Term
| four conditions under which data might be classified as missing |
|
Definition
| (1) Respondents might refuse to answer a question. (2) Respondents may want to answer a question, but do not know the answer. (3) Respondents may answer a question, but for a variety of reasons, their responses may be unusable; a respondent who is 35 years old, for example, cannot have 40 years of completed education. (4) Finally, respondents may not be asked a question, either because a question does not apply to them (e.g., age at first birth for a respondent with no children) or because data were collected from a random subsample of the larger sample. |
|
|
Term
| three rules for dealing with missing data |
|
Definition
| First, if fewer than five percent of the data are missing, do nothing. Second, if 5 to 15 percent of the cases on a variable are missing, replace the missing values with values derived from imputation. Third, if more than 15 percent of cases are missing, do not use the variable. |
|
|
Term
| imputation |
Definition
| refers to any of a number of mathematical means of determining information for cases on which you have incomplete data; should only be used if the variable for which you have missing cases is continuous or discrete. |
|
|
Term
| multiple imputation |
Definition
| a Monte Carlo technique in which the missing values are replaced by simulated versions based on other variables you specify (Rubin 1987). A conceptual way of thinking about this is that missing values are replaced by values that are typical for a respondent who shares a set of characteristics with other respondents. |
|
|
Term
| leverage versus residual squared (lvr2) plot |
|
Definition
| a tool used to get a visual idea of the points of influence and how far they lie from the rest of the data. |
|
|
Term
| Cook's D |
Definition
| an index that ranges from 0 to positive infinity used to measure points of influence. The basic idea is that the higher Cook’s D, the more influential the data point. How much influence is too much influence? One statistical convention is that any point greater than 4/n is too high. |
|
|
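A sketch of computing Cook's D and flagging points above the 4/n convention, using statsmodels on made-up data; the injected influential point is purely illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 + 0.5 * x + rng.normal(scale=0.5, size=50)
y[0] += 6  # hypothetical influential point

model = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = model.get_influence().cooks_distance[0]  # one Cook's D value per observation

threshold = 4 / len(y)                             # the conventional cutoff from the card
flagged = np.where(cooks_d > threshold)[0]
print("observations exceeding 4/n:", flagged)
```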
Term
| linearity |
Definition
| the assumption that the relationship between X and Y is best predicted by a straight line. |
|
|
Term
| residual versus fitted (rvf) plot |
|
Definition
| tool to measure linearity; an image in which standardized residuals are plotted against standardized predicted values of Y. Most produce a graph that looks fairly rectangular with concentrations of scores down the middle. |
|
|
Term
| 3 conditions that you don't want in your rvfplot |
|
Definition
| 1. Clustering of data points at the top or bottom of the rectangle. This means non-normality. Removing points of influence using Cook’s D should have eliminated that problem. 2. A curved, rather than rectangular, shape to the graph. This means that the X, Y relationship is not linear. 3. Clustering of data points to the left or to the right. This indicates heteroskedasticity. |
|
|
Term
|
Definition
| the assumption that errors (residuals) are identically distributed; violations of this assumption (heteroskedasticity) are problematic because although your regression coefficients will not be affected, your standard errors will be. This means that p-values are likely to be inaccurate, so you may run the risk of concluding that X and Y are significantly related when they are not. |
|
|
Term
|
Definition
| tested for “normal i.i.d.” This means that the researcher has determined that their data are normally, identically, and independently distributed. |
|
|
Term
| Cook-Weisberg test |
Definition
| a rather simple test for heterogeneity of variance; the null hypothesis of the Cook-Weisberg test is that variances are equal (homoskedasticity) and the alternative hypothesis is that variances are unequal (heteroskedasticity). You want p > .05. |
|
|
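A hedged sketch on hypothetical data using the closely related Breusch-Pagan score test in statsmodels (the Cook-Weisberg test is a variant of the same score test); the null hypothesis is again equal variances.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 1 + 0.3 * x + rng.normal(scale=1.0, size=200)  # hypothetical homoskedastic errors

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Null hypothesis: equal error variances (homoskedasticity)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(round(lm_pvalue, 3))  # p > .05 -> no evidence of heteroskedasticity
```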
Term
|
Definition
| assumption of independence is violated when you have two or more independent variables in your model that are highly correlated. If these variables are correlated, they are measuring the same underlying concept, which means that the residuals are likely to be associated. |
|
|
Term
| tolerance |
Definition
| a collinearity diagnostic; the degree to which independent variables are correlated to a degree that we can tolerate (or that does not violate the assumption of independence). Tolerance is calculated based on the variance inflation factor (or VIF). |
|
|
Term
|
Definition
| Meyers, et al. (need year) argue that it is fine to include two predictors if their tolerance (calculated as 1/VIF) is greater than .01. Stevens (2002) argues that variables are problematic if the VIFs are greater than 10. Chatterjee, et al. (2006) argue that variables are likely multicollinear if the mean VIF is greater than 1. |
|
|
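A sketch of computing VIF and tolerance (1/VIF) for each predictor with statsmodels, using made-up data in which two predictors are deliberately near-duplicates.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.2, size=300)   # hypothetical predictor highly correlated with x1
x3 = rng.normal(size=300)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF (and tolerance = 1/VIF) for each predictor, skipping the constant
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1 / vif:.3f}")
```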
Term
|
Definition
| standard errors that have been corrected for the inflation that occurs because of heteroskedastic data. |
|
|
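A hedged sketch of requesting heteroskedasticity-robust (Huber-White) standard errors in statsmodels; the HC1 variant and the simulated data are assumptions, not details from the card.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y = 2 + 0.4 * x + rng.normal(scale=0.3 * x, size=200)  # hypothetical heteroskedastic errors

X = sm.add_constant(x)
ordinary = sm.OLS(y, X).fit()                 # classical standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")     # Huber-White robust standard errors

print(ordinary.bse)  # standard errors under the homoskedasticity assumption
print(robust.bse)    # corrected (typically larger) standard errors
```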
Term
|
Definition
| means to draw boundaries around a subsample, including some respondents while excluding others. |
|
|
Term
|
Definition
| when data collection techniques include stratification, clustering, and unequal probability of selection |
|
|
Term
|
Definition
| indicates the proportion of each sample drawn from each stratum. Strata are mutually exclusive groups from which samples are drawn. |
|
|
Term
|
Definition
| indicates the primary sampling unit (PSU) from which the data are drawn. Clustering undermines the assumption of independence in the model because respondents drawn from the same cluster are likely to be more alike than respondents from different clusters. |
|
|
Term
| sampling weight (or probability) variable |
|
Definition
| indicates the number of subjects in the population represented by each participant. Mathematically, it is the inverse of the probability of being selected. |
|
|
Term
| population number of PSUs per stratum variable |
|
Definition
| a variable that is necessary if your sample is drawn without replacement (WOR). Sampling with replacement (WR) means that a sampling unit is returned to the population after selection. |
|
|
Term
| Huber or White estimators |
|
Definition
| linearized standard errors that are adjusted for violations of regression assumptions |
|
|
Term
| model building |
Definition
| the process by which we run a series of nested models in order to examine changes in coefficients |
|
|
Term
| full model |
Definition
| a regression equation that includes the variable (or variables) of interest along with all pertinent controls. |
|
|
Term
| reduced model |
Definition
| any subset of this model (including our bivariate model) |
|
|
Term
|
Definition
| The b we obtain when we run the bivariate model |
|
|
Term
|
Definition
| means that there is a relationship between X1 and Y, but that relationship is indirect. |
|
|
Term
|
Definition
| arises when the XY relationship is not very linear or when sample sizes are very small |
|
|
Term
| interaction effect |
Definition
| the slope between X1 and Y changes at different levels of X2 |
|
|
Term
| In model building- adding X2 to the model has no (or not much) effect on b1. Why? |
|
Definition
| This means that X1 leads to Y and X2 also leads to Y (assuming that relationship is significant), but that there is no relationship between X1 and X2. If the coefficients associated with both X1 and X2 are significant, we can conclude that they both show a direct effect on Y. |
|
|
Term
| In model building- adding X2 to the model results in the p-values associated with b1 becoming non-significant. Why? |
|
Definition
| One reason is that the relationship between X1 and Y is spurious. In other words, it looks like X1 is related to Y, but in fact X1 is related to X2 and X2 is related to Y, so it only looks like X1 and Y are related. The second reason that the relationship between X1 and Y may be rendered non-significant with the inclusion of X2 is that the relationship between X1 and Y is mediated by X2. |
|
|
Term
| In model building- adding X2 in your model, b1 remains significant, but the coefficient gets smaller. Why? |
|
Definition
| This means that some of the relationship between X1 and Y is mediated by X2, but not all of it. In short, X1 has both a direct and indirect effect on Y. In fact, in a type of model building called path analysis and its more sophisticated variant, structural equation modeling, the purpose is to examine the total effects of X1 on Y. Total effects are the sum of the direct and indirect effects. |
|
|
Term
| In model building- adding X2 to your model results in b1 increasing, increasing in statistical significance, and/or changing signs. Why? |
|
Definition
| One possibility is instability in the model. A more likely possibility is that X1 and X2 have an interaction effect. |
|
|
Term
|
Definition
| One way to determine which reduced models to include is to add each control to the model, one by one |
|
|
Term
| which regression to use when Y is ratio or interval |
|
Definition
| OLS regression (but NOT appropriate for count data or for ratio or interval data with a large number of cases that will be excluded) |
|
|
Term
| which regression to use when Y is dichotomous |
|
Definition
| logistic regression |
|
Term
| which regression to use when Y is ordinal |
|
Definition
| ordered logistic regression (ordered logit) |
|
Term
| which regression to use when Y is categorical and not ordinal |
|
Definition
| multinomial logistic regression |
|
|
Term
| which regression to use when Y is count data |
|
Definition
| binomial or Poisson regression |
|
|
Term
| which regression to use when Y is ratio or interval with a large number of cases that will be excluded |
|
Definition
|
|
Term
| regression model for OLS regression |
|
Definition
| Y = a + b1X1 + b2X2...+ bnXn + e |
|
|
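A minimal sketch of fitting Y = a + b1X1 + b2X2 + e on hypothetical data with statsmodels; the output includes the overall F-test and per-coefficient t-tests discussed earlier.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 0.5 * df["x1"] - 0.3 * df["x2"] + rng.normal(size=200)  # hypothetical data

# Y = a + b1*X1 + b2*X2 + e
fit = smf.ols("y ~ x1 + x2", data=df).fit()
print(fit.params)    # a (Intercept), b1, b2
print(fit.f_pvalue)  # F-test: is the overall model a good one?
print(fit.pvalues)   # t-test p-value for each coefficient
```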
Term
| regression model for logistic regression |
|
Definition
| logged odds of Y = a + b1X1 + b2X2...+ bnXn + e |
|
|
Term
| regression model for ordered logit |
|
Definition
| S = b1X1 + b2X2...+ bnXn + e (no constant) |
|
|
Term
| Interpreting coefficients for OLS regression |
|
Definition
| every one-unit increase in X is associated with a b-unit increase/decrease in Y |
|
|
Term
| Interpreting coefficients for logistic regression |
|
Definition
| first, convert the logged odds to an odds ratio (odds ratio = e^b). Every one-unit increase in X is associated with a 100 × (odds ratio − 1) percent increase (or, when the odds ratio is below 1, a 100 × (1 − odds ratio) percent decrease) in the odds of being in the 1 category of Y; a conversion sketch follows this card. |
|
|
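A small sketch of the conversion, using a hypothetical logged-odds coefficient.

```python
import numpy as np

b = -0.35                      # hypothetical logged-odds coefficient for X
odds_ratio = np.exp(b)         # convert logged odds to an odds ratio (~0.70)

pct_change = 100 * (odds_ratio - 1)  # percent change in the odds per one-unit increase in X
print(round(odds_ratio, 3), round(pct_change, 1))  # 0.705, -29.5 -> about a 30 percent decrease
```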
Term
| Interpreting coefficients for ordered logit |
|
Definition
| First, convert to an odds ratio. Every one-unit increase in X is associated with a 100 × (odds ratio − 1) percent increase (or 100 × (1 − odds ratio) percent decrease) in the odds of going up/down one point on the scale that is measured by Y |
|
|
Term
| Interpreting coefficients for multinomial logistic |
|
Definition
| First, convert to relative risk ratios. Every one-unit increase in X is associated with a 100 × (relative risk ratio − 1) percent increase (or 100 × (1 − relative risk ratio) percent decrease) in the relative risk of being in that category of Y instead of the reference category of Y |
|
|