Term
Measurement |
Definition
Assigning numbers to phenomena such that the numbers preserve the relations among the phenomena. |
|
|
Term
Standards for Educational and Psychological Tests (Know three levels; focus on 1 level) |
|
Definition
_______ for tests and test users are classified on three levels: essential, very desirable, and desirable. Essentials:
a. A general knowledge of measurement theory.
b. Explicit and well-formulated goals for test use.
c. A consideration of alternative and supplemental measurement procedures, especially for examinees coming from very different backgrounds.
d. A careful review of the development of tests being considered for use.
e. The competence to administer each test used for decision-making purposes.
f. Established conditions for test administration that enable all examinees to do their best.
g. A rationale for decisions based on test scores.
h. A consideration of alternative interpretations of scores.
i. The correct use of norms and the avoidance of scores that have technical problems—for example, age or grade scores (see Chapter 7).
j. Periodic reviews of the effectiveness of tests and decision-making procedures for institutional testing.
*Other issues in the _____ call for professional judgment; tests can be useful tools, but they can also be dangerous if misused.* These issues revolve around technical competency and wise professional judgment. |
|
|
|
Term
Four basic problems in measurement theory |
|
Definition
1. Uniqueness 2. Representation 3. Meaningfulness 4. Scaling |
|
|
Term
Uniqueness |
Definition
How many constraints do we have in assigning numbers to a phenomenon (the relations between attributes must match the relations between numbers, which can be an ordinal constraint)? Interval constraint: movie x > movie y > movie z with equal spacing, so m(x) – m(y) = m(y) – m(z), and therefore 2m(y) = m(x) + m(z) (equal intervals rule). |
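A minimal sketch (hypothetical ratings; Python, which the card does not use) expressing the ordinal and equal-intervals constraints as executable checks:

```python
# Hypothetical movie ratings m(x), m(y), m(z) chosen to satisfy the constraints.
m = {"x": 9, "y": 6, "z": 3}

assert m["x"] > m["y"] > m["z"]             # ordinal constraint
assert m["x"] - m["y"] == m["y"] - m["z"]   # equal-intervals constraint
assert 2 * m["y"] == m["x"] + m["z"]        # equivalent form from the card
print("constraints satisfied")
```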
|
|
Term
Representation |
Definition
Can you measure all of the important attributes of the phenomenon you are interested in? What are the minimum attributes that we need in order to have some measurement of this phenomenon? Homomorphism: Can you match the numbers to the attributes? Isomorphism: Is there a 1:1 correspondence between the elements of each set? *If an attribute has more than 1 dimension, isomorphism cannot be achieved* *Major mistake that is made: Failing to achieve unidimensionality* |
|
|
Term
Meaningfulness |
Definition
What inferences can be validly made? Example 1: Movies and tickets, Ticket 1 (movie F with a rating of 2 and movie G with a rating of 4) or Ticket 2 (movie H with a rating of 5); would you choose T1 or T2? Probably T1 (2 + 4 = 6 beats 5, an inference that treats the ratings as additive). Example 2: Job satisfaction survey, worker A has a 4.5 for job satisfaction and worker B has a 2.7 for job satisfaction; can you accept or use interval constraints? No! |
|
|
Term
Scaling |
Definition
How do you go about actually writing or creating a scale? *True scale ≠ response scale; response scales = rating scales (e.g., on a scale of 1 to 5… T/F… strongly agree vs. disagree).* *A true scale is the complete process by which you decide which attributes get assigned to which numbers.* |
|
|
Term
Four types of scales |
Definition
1. Nominal 2. Ordinal 3. Interval 4. Ratio |
|
|
Term
Nominal scales |
Definition
Scales that have unique numbers assigned to distinct objects or types of objects. These numbers can be changed in any way as long as the numbers assigned to the distinct object remain different. |
|
|
Term
Ordinal scales |
Definition
Scales that have larger numbers assigned to objects with more of the property being measured than to objects with less of that property. |
|
|
Term
Interval scales |
Definition
Scales that have numbers that allow us to calculate and interpret ratios of intervals between pairs of objects. |
|
|
Term
Ratio scales |
Definition
Scales that have numbers that allow us to calculate and interpret ratios of the scale values (have a true zero). |
|
|
Term
Four rules for assigning numbers to scales |
Definition
1. Distinctiveness 2. Ordering in magnitude 3. Equal intervals rule 4. Absolute zero rule |
|
|
Term
Distinctiveness |
Definition
If attributes are different, then the numbers assigned to them should be different. |
|
|
Term
Ordering in magnitude |
Definition
Objects with more of the attribute get bigger numbers. |
|
|
Term
Equal intervals rule |
Definition
Differences between numbers should be proportional to the differences between attributes. |
|
|
Term
Absolute zero rule |
Definition
If there is an absence of the attribute, you assign a zero. |
|
|
Term
Linear regression |
Definition
Allows us to determine the equation for the best prediction line, the ________ line; its accuracy can be determined by the correlation coefficient (adheres to the principle of least squares). |
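A minimal sketch (hypothetical data; uses NumPy, which the card does not mention) of fitting the least-squares prediction line and reporting its accuracy via r:

```python
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 8.0])   # hypothetical predictor scores
y = np.array([3.0, 5.0, 4.0, 8.0, 9.0])   # hypothetical criterion scores

b, a = np.polyfit(x, y, 1)                # least-squares slope and intercept
r = np.corrcoef(x, y)[0, 1]               # correlation coefficient
print(f"y-hat = {a:.2f} + {b:.2f}x, r = {r:.2f}")
```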
|
|
Term
Normal distribution |
Definition
A placement of scores that is a bell-shaped curve, with many scores near the middle and fewer scores tapering off to the extremes; it is important to note that a variable must be continuous for a normal distribution to exist. |
|
|
Term
Correlation coefficient |
Definition
A measure of the strength of relationship between two variables. **Know 15 characteristics about these!** |
|
|
Term
15 Characteristics of Correlations |
|
Definition
1. “Linear, linear, linear”: Figure 2.9 (curvilinear relationship); you have to examine the scatterplot; most human characteristics are linearly related; the single best statistical tool that you have is your eyeball… always eyeball the scatterplot.
2. “Not causal, not causal, not causal (no matter what the structural equation modeling people say)”: correlation does not equal causation.
3. Linear transformations of x or y have (potentially) two effects (i.e., changing the sign vs. changing the magnitude). E.g., interview (x), job performance (y): you may want to conduct a linear transformation of x or y or both; the transformation does not change the magnitude of the relationship, but it may change the sign of the correlation. One of the most common mistakes in testing is that researchers do not triple-check the direction of the scales they use, so the sign of the correlation is backwards from what they think it is (even published papers have this error). Example: job performance measured by supervisor ratings on a 1-to-10 scale where 1 represents really poor and 10 really great job performance; you find a positive correlation for structured interviews; later you get data from another company, plug it in, and find a negative correlation; the problem was that the new data were in a different format (e.g., rankings) and the order of the scale was reversed. This mistake happens most often when authors read other articles that don’t specify the direction of the measures. Example: introversion/extraversion personality tests; if you are writing an intro/extra test, you can choose your own scoring, and authors don’t always order their scales consistently, so researchers conducting literature reviews can’t tell what the direction of the measures meant. Don’t make this common mistake! If you look at the correlation between the amount of antibiotics administered and the mortality rate, there is a positive correlation; it seems like mortality is caused by antibiotics, but this is not true: antibiotics are usually prescribed to severely ill patients, along with other factors. Rather than reporting the mortality rate, describe the survival rate. Presentation of the data matters.
4. Correlations deal with marginal distributions: the distribution of x without regard to y; the distribution of y without regard to x. Quick review: variability = differences in scores (e.g., interview scores); common measures of variability: standard deviation, standard error, range, variance.
5. Correlations deal with conditional distributions: distributions of y for a given x (or of x for a given y). If you put a regression line through a cloud of (x, y) points and you have a large sample, you can calculate the mean squared deviation of the actual y’s from the predicted y’s: s²y·x (symbolic notation for the conditional variance). For one particular x score there are many possible y scores.
6. Standard error of estimate: in a perfect world, if the job interview predicted job performance perfectly, the conditional variance would be zero, the SEE would be zero, and predictions would be perfect; the lower the SEE the better (tests are used to measure people and predict their behaviors, like interview skills predicting job performance)! If the y’s are close together, low SEE; if the y’s are spread apart, high SEE (which is NOT good). Two reasons for the importance of the SEE: (1) it is the classic index of test score accuracy (you can decide whether to buy a test from the SEE); (2) it is used to construct confidence intervals around test scores (see the sketch after this list). Very important when you are comparing individuals’ data (e.g., SAT scores, GRE scores); the SEE is the true measure of the closeness of individuals’ scores and determines the accuracy of test scores. Item stats (a way to shut TAs down)… One little note: a common mistake is to confuse the SEE with the standard error of the mean (the standard deviation of the sampling distribution); the sampling distribution is important. *Be careful with regard to the way you describe correlations (e.g., correlation does not equal causation)!*
7. Variance explained (r²): an index of predictability, NOT causality (if I have an interview score for person y, can I predict how well they will perform on the job?); the accuracy of the test. Example: the correlation between interview and job performance is .4, so the variance explained is .16. Precise definition of .16: sixteen percent of the variance in job performance (not everybody performs at the same level) can be predicted from a person’s interview score. *Often in academia r, not r², is used, because r’s sign is not obscured the way it is in r² (caution: be careful about the direction of the relationship for your own and other published tests).* Reverse-scored items (Fred is wary of this method); equation 2.25 (in the textbook).
8. Correlation is directly related to linear regression: quick and dirty review of linear regression (drawing a graph, a scatterplot, and attempting to fit a line (or curve) to the dots). How do we determine where to put the line? The principle of least squares (we minimize the squared deviations to get rid of the sign, i.e., so they don’t add up to zero); another method of fitting a line to points on a scatterplot is maximum likelihood.
9. Linear regression functional equation: how are the x and y in the correlation related? Y = a + bx. Functional equation: ŷ = rxy(Sy/Sx)(xi – x̄) + ȳ. The S’s in this equation are the ratio of the two standard deviations (they are a correction factor). If you know how far someone is from the average x, then you know how far they are from the average y, and you get the regression equation (review).
10. Confidence intervals of correlations (alpha level): if we construct a confidence interval, then with some percentage level of confidence we are sure that the interval contains the person’s actual score. Example: when you take a test, you get a score. Suppose a person is brainwashed and doesn’t remember the test. If the test is taken a second time, you will not get the same score. Why? Preparation, sleep, etc. *When discussing error, reliability, etc., you have to go back to the basics to get to the root of low scores, predictions, etc. (e.g., test-taking conditions). Are you putting people in test-taking situations where they will be resistant to distractions, or in situations where small disruptions can have large effects on test scores? This is a very important point.* Confidence intervals capture all of the distractions that enhance or detract from people’s test scores; they are an index of how much you have to worry about extraneous variables. Confidence interval equation: ŷ ± zc·Sy·x. “95% is not written anywhere in stone on the planet” - Frank Schmidt. You have to give an explanation as to why you chose that number. S²y·x = conditional variance and Sy·x = SEE.
11. Confidence intervals are great, but here is the catch: they depend on three assumptions: a. The relation between x and y is linear (humans have relationships, variables have “relations”); be careful on his tests: look at the data on a graph. b. The conditional distribution of y is normal (this assumption is often violated). c. The conditional distribution of y is the same for every single x (i.e., homoscedasticity). When you use confidence intervals, check these three assumptions; if they are not met, CIs will be misleading.
12. Correlations are influenced by two main factors: (1) range restriction (an attenuator) and (2) range enhancement (artificially inflates the correlation coefficient). Example of range restriction: you are working in a business giving a test to applicants to determine who should and shouldn’t be hired; if you validate the test using only people who have already been hired, that is range restriction; the applicant population is very big, but the test sample is very limited. Example of range enhancement: you are doing educational research on gifted kids and you go to a gifted classroom, but the kids got there in different ways (some recommended by teachers, some based on tests, some based on class availability), so the sample’s range is wider than that of your target population; hence the correlation will be bigger than it ought to be. Range enhancement happens more rarely but occurs in education, sociology, etc.
13. Correlations are also influenced by lack of reliability, i.e., measurement error (these terms mean the same thing); lack of reliability can occur in x (the predictor) or y (the criterion). Lack of reliability always attenuates the correlation. *The observed correlation is normally an underestimate.* Example: you are looking at the correlation between SAT scores and how good people will be as teachers (teacher rating); if the reliability of the SAT is .88 and the reliability of the teacher rating is .70, the observed correlation is going to be about .78 of the actual correlation; you will underestimate the correlation by about 22%.
14. Correlations are influenced by sampling error (an attenuator); the observed correlation will not be the true correlation if you are trying to extrapolate to the general population.
15. Correlations are affected by moderator variables (a moderator is a third variable that affects the direction and strength of the relation between two variables); look at Figures 2.15 and 2.16 on page 35 of the textbook (Figure 2.16 is about how different combinations of the groups change the direction and strength of the relation between the variables). |
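A minimal sketch tying together points 6, 10, and 13 above. The interview/performance numbers are hypothetical; only the .88/.70 reliabilities come from the card:

```python
import numpy as np

x = np.array([3.0, 5.0, 6.0, 8.0, 9.0, 11.0])   # hypothetical interview scores
y = np.array([4.0, 5.0, 7.0, 7.0, 10.0, 11.0])  # hypothetical performance ratings

r = np.corrcoef(x, y)[0, 1]
see = np.std(y, ddof=1) * np.sqrt(1 - r**2)     # standard error of estimate (point 6)
print(f"r = {r:.2f}, r^2 = {r**2:.2f}, SEE = {see:.2f}")

# 95% CI around a predicted y (assumes linearity, normal conditional
# distributions, and homoscedasticity; see point 11)
b, a = np.polyfit(x, y, 1)
y_hat = a + b * 7.0
print(f"y-hat at x=7: {y_hat:.2f} +/- {1.96 * see:.2f}")

# Attenuation (point 13): observed r = true r * sqrt(rxx * ryy)
print(f"attenuation factor = {np.sqrt(0.88 * 0.70):.2f}")  # ~ .78 of true r
```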
|
|
Term
The Pearson Product Moment Correlation
|
|
Definition
The covariance divided by the product of the standard deviations OR the average product of the standard (Z) scores for the two variables (x and y). |
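A minimal sketch (hypothetical data) verifying that the card's two definitions give the same r; population formulas (ddof = 0) keep the two forms exactly equal:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 8.0])
y = np.array([2.0, 1.0, 5.0, 4.0, 9.0])

cov = ((x - x.mean()) * (y - y.mean())).mean()
r1 = cov / (x.std() * y.std())          # covariance over product of SDs

zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
r2 = (zx * zy).mean()                   # average product of z-scores

assert np.isclose(r1, r2)
print(f"r = {r1:.3f}")
```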
|
|
Term
Alternative Measures of Association |
|
Definition
1. Two dichotomous variables: phi coefficient. 2. Two artificially dichotomized variables: tetrachoric correlation coefficient. 3. One true dichotomous variable and one continuous variable: point-biserial correlation. 4. One artificially dichotomized variable and one continuous variable: biserial correlation. 5. Two ordinal variables (or one dichotomous and one ordinal): Spearman’s rho or Kendall’s tau (rho runs slightly higher; tau is more amenable to statistical calculations and to tied ranks).
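A minimal sketch of a few of these coefficients using SciPy on hypothetical data. Phi is just Pearson r applied to two dichotomies; the tetrachoric and biserial coefficients are omitted because SciPy has no standard function for them:

```python
import numpy as np
from scipy import stats

binary_a = np.array([0, 1, 1, 0, 1, 0, 1, 1])
binary_b = np.array([0, 1, 0, 0, 1, 1, 1, 1])
scores = np.array([2.0, 7.5, 6.0, 3.0, 8.0, 4.0, 7.0, 9.0])
ranks = np.array([1, 6, 4, 2, 8, 3, 5, 7])

phi = stats.pearsonr(binary_a, binary_b)[0]        # phi coefficient
r_pb = stats.pointbiserialr(binary_a, scores)[0]   # point-biserial
rho = stats.spearmanr(scores, ranks)[0]            # Spearman's rho
tau = stats.kendalltau(scores, ranks)[0]           # Kendall's tau
print(f"phi={phi:.2f}, r_pb={r_pb:.2f}, rho={rho:.2f}, tau={tau:.2f}")
```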
|
|
|
Term
Classical Test Theory (CTT) |
Definition
A subset of generalizability (G) theory; it is a simple, quite useful model that describes how errors of measurement can influence observed scores. The model assumes certain conditions to be true; if these assumptions are reasonable, then the conclusions derived from the model are reasonable; if not, the model leads to faulty conclusions… |
|
|
Term
Assumptions of Classical Test Theory |
Definition
1. X = T + e (observed score = true score + unsystematic error). “True” score has a very unfortunate name because it’s not really the “true score”; e.g., a true typing-ability score doesn’t reflect one’s true typing ability. In reality, the true score is the mean of the (hypothetical repeated) test scores. (See the sketch below.)
2. Expected value of X = T (true score). If you could repeatedly administer the test over and over, the expectancy is the mean of the frequency distribution. Implications: (1) theoretically, the population is the same individual taking the same test at the same time; (2) the relationship between the test score and the real trait is validity.
3. Error scores and true scores are uncorrelated. How could they be correlated? Example: a teacher puts noisy, troublesome students up front and good students in the back; as a result, the good students do worse on the test because they can’t hear the teacher’s instructions (so big error is correlated with big ability).
4. Errors on different tests are uncorrelated. How could they be correlated? The Big 5 personality measure is five tests; this assumption can be false when test scores are affected by the environment, such as fatigue, practice, examinees’ mood, or stereotype threat. How do you fix it? Have a homogeneous testing environment and use common sense. Homogeneous testing environment across three domains: testing conditions, examinees, time. A discussion of stereotype threat… Social facilitation studies (1897): social facilitation is the fact that people’s autonomic nervous system arousal goes up in the presence of other people versus being alone, which can be either good or bad; a little bit of anxiety or nervousness is a good thing, but too much is a bad thing (e.g., stage fright)… the Yerkes-Dodson curve. How far down does the social facilitation phenomenon go? Down to cockroaches (hmm…).
5. Error scores on one test are uncorrelated with true scores on another test. How could they be correlated? A moderating variable… the g factor: if you have a test that is highly g-loaded (requires a lot of general intelligence), it’s possible that the true score on this test is correlated with error scores on other g-related tests. How do you fix it? Make validity as high as you possibly can. **An error of measurement is unsystematic error.** The problem is that some error can be systematic.
6. Parallel tests: one’s true score on one version of the test must equal one’s true score on the other version (within some level of variability), AND the error variance on one test has to equal the error variance on the other test; this holds within a given population. **Note that this means the true-score variances and error-score variances will be the same for both tests, but the observed scores will not be the same for both tests.**
7. τ-equivalent tests: true scores are equivalent, but error variances don’t have to be. |
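A minimal simulation (hypothetical means and SDs) of the model's skeleton: X = T + e, with error uncorrelated with true scores (assumption 3). Strictly, E[X] = T refers to one person retested many times; across many simulated examinees the same machinery shows the error washing out:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                            # simulated examinees
true = rng.normal(50, 10, n)           # true scores T
error = rng.normal(0, 5, n)            # unsystematic error e
observed = true + error                # X = T + e

print(f"mean(X) = {observed.mean():.2f} vs mean(T) = {true.mean():.2f}")
print(f"corr(T, e) = {np.corrcoef(true, error)[0, 1]:.3f}")  # ~ 0 (assumption 3)
```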
|
|
Term
Universe (generalizability theory) |
Definition
Describes the conditions under which examinees can be observed or tested that produce results equivalent to some specified degree. |
|
|
Term
Universe score |
Definition
Defined to be the expected value of an examinee’s observed scores over all admissible observations; this is directly analogous to the true score used in CTT. |
|
|
Term
Sources of error in measurements
|
|
Definition
The assumption that every examinee’s standard error of measurement is the same (not the case in the real world); some examinees’ latent traits or true-score values are estimated more accurately than those of other examinees, and the degree of this accuracy can be estimated. |
|
|
Term
Defining the universe |
Definition
When a test publisher carefully defines his or her universe in terms of the factors that could affect the test score. Includes everything: number of items, training of the test administrator, testing conditions, etc. |
|
|
Term
G-study (generalizability study) |
Definition
Reliability study; the main goal is to specify the degree to which the test results are equivalent when obtained under different testing conditions. Is this test working the same way in different circumstances (today vs. tomorrow, juniors vs. sophomores, Clemson vs. Anderson)? |
|
|
Term
D-study (decision study) |
Definition
Validity study. The measurement instrument produces data to be used in making decisions or reaching conclusions. A _-study is basically a form of validity study: what are you using the test scores for? What decision are you making with the test scores (another term within generalizability theory; e.g., how do I decide who should go to therapy and who shouldn’t)? |
|
|
Term
Coefficient of generalizability
|
|
Definition
The ratio of universe-score variance to observed-score variance and is the counterpart of the reliability coefficient used in classical true score theory. |
|
|
Term
Reliability & Consistency in Measurement |
|
Definition
7 Different Definitions of _______:
1. Consistency of measurement (is your instrument measuring the same way every time you use it? Do you want it to be ___ across time, people, testing conditions? At the very least, specify the conditions you are looking for ____ in).
2. Correlation between observed scores on parallel tests (r = correlation).
3. Proportion of variance in X explained by a linear relation with X′, where X is a score from a test and X′ is a score from a parallel test.
4. Ratio of true-score variance to observed-score variance.
5. Squared correlation between true scores and observed scores.
6. Reliability coefficient = 1 − r² between observed scores and error.
7. 1 − the ratio of error variance to observed-score variance (we don’t want error variance to be really big compared to the observed-score variance; we want to keep the errors low).
A desired level of _____ is around or above .80 (.70 minimum, but in reality .80 on up)… *Make sure that your test score can be interpreted by various persons in the same way.* (Definitions 4 and 5 are checked in the sketch below.) |
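A minimal simulation (hypothetical variances) checking that definitions 4 and 5 above agree:

```python
import numpy as np

rng = np.random.default_rng(1)
true = rng.normal(0, 1, 200_000)                 # true scores T
observed = true + rng.normal(0, 0.5, 200_000)    # X = T + e

ratio = true.var() / observed.var()              # definition 4
r_sq = np.corrcoef(true, observed)[0, 1] ** 2    # definition 5
print(f"var ratio = {ratio:.3f}, r^2(T, X) = {r_sq:.3f}")  # both ~ .80
```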
|
|
Term
Methods of Estimating Reliability |
|
Definition
1. Test-retest 2. Parallel/alternate forms 3. Internal Consistency
|
|
|
Term
Test-retest reliability |
Definition
You give people a test, wait, and re-administer the test. Problems with this method: carryover effects (boredom: one does worse on test 2 because he/she has already read the items; memory, practice, etc.); know thy construct! What should be the appropriate time interval? An interval short enough that the person doesn’t change, but long enough that carryover fades… |
|
|
Term
Parallel/Alternate Forms Reliability
|
|
Definition
The idea here is whether or not the two tests are equivalent (not used often because of the time and effort it takes to construct a _______ form of a test). |
|
|
Term
Internal consistency reliability |
Definition
Are the test items correlated with each other (under the assumption that the test has multiple items)? Reliability of a certain sort (this question is being measured the same as that question), but it doesn’t refer to or acknowledge dimensionality; for this to work, items have to be based on the same dimension/measure the same thing. (One common error test writers make is not being careful about the dimensionality of the constructs being tested: we measure 2 or 3 things when we thought we were measuring one thing.) |
|
|
Term
2 Types of Internal Consistency Methods |
|
Definition
1. Split-half 2. Spearman-Brown’s formula |
|
|
Term
Split-half reliability |
Definition
Similar to parallel or alternate forms: split the test into two halves and correlate them, treating each half as a shorter alternate form...
|
|
|
Term
Spearman-Brown Prophecy Formula (why better than split-half)
|
|
Definition
You have to make a correction for the fact that you have shortened the test, and reliability is related to the number of items; the ______ formula solves this issue (what would the reliability be if the test were longer?). When you add more items, true-score variability increases more than the error does. There is a catch that untrained people miss: the formula assumes the items you add actually work, i.e., are equivalent to the existing ones. |
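A minimal implementation of the prophecy formula; the .70 split-half correlation below is hypothetical:

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability of a test k times as long as the original
    (assumes the added items are equivalent to the existing ones)."""
    return (k * r) / (1 + (k - 1) * r)

# Correct a split-half correlation of .70 back to full test length (k = 2).
print(f"{spearman_brown(0.70, 2):.3f}")  # ~ .824
```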
|
|
Term
Problem with Internal Consistency Methods/best way to establish internal consistency |
|
Definition
Internal consistency doesn’t work with speeded tests (time limits), because an item left unanswered on one half may be answered on the other. The best way to calculate internal consistency reliability is Cronbach’s alpha… |
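Since the card names Cronbach's alpha, here is a minimal implementation on hypothetical item data: alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores).

```python
import numpy as np

items = np.array([   # rows = examinees, columns = items (hypothetical)
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
], dtype=float)

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1).sum()      # sum of item variances
total_var = items.sum(axis=1).var(ddof=1)        # variance of total scores
alpha = (k / (k - 1)) * (1 - item_vars / total_var)
print(f"alpha = {alpha:.3f}")
```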
|
|
Term
Assumptions of CTT Standard Errors
|
|
Definition
CTT and reliability; don’t forget that standard errors of measurement carry the assumption that every examinee’s standard error of measurement is the same; unfortunately, out in the real world, this assumption can be violated (e.g., examinees who are erratic and moody). What happens when assumptions are violated? When there is a decent sample size, we fudge the assumption. Suppose we compare two people’s observed scores (Bob, 490 and Ann, 505) and want to figure out if they are really different; we have to come up with confidence intervals for their scores (in a perfect world, we would know Bob’s sampling distribution and Ann’s sampling distribution); in reality, we have to use the overall sampling-distribution estimate, because we do not have each person’s distribution of scores in the real world… |
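A minimal sketch of the Bob/Ann comparison; the SD of 100 and reliability of .90 are hypothetical, chosen only to illustrate confidence intervals built from a shared standard error of measurement (SEM = SD × sqrt(1 − reliability)):

```python
import math

sd, reliability = 100, 0.90
sem = sd * math.sqrt(1 - reliability)   # one SEM "fudged" for all examinees

for name, score in [("Bob", 490), ("Ann", 505)]:
    lo, hi = score - 1.96 * sem, score + 1.96 * sem
    print(f"{name}: {score} (95% CI {lo:.0f} to {hi:.0f})")
# The intervals overlap heavily, so the 15-point gap is not a real difference.
```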
|
|
Term
Alternative Model to CTT? |
|
Definition
The binomial-error model. |
|
Term
Assumptions of Binomial-Error Model |
|
Definition
1. Observed scores are the sum of n test items (test items have to be independent and binary: right or wrong).
2. Your true score (defined differently than in CTT) is your probability of getting a given item right. This implies two things: first, the model operates at the item level, not at the test-score level; second, the true score is expressed as a probability (a knowledgeable person has a higher probability of getting an item right than a less knowledgeable person). So at first it was “what grade did you get,” but the real score lies at the level of the test items. (The reason people don’t do IRT, and why CTT persists, is that IRT requires lots of items and lots of people compared to CTT; if it didn’t, we could use it more and improve tests.)
3. The aforementioned probability is assumed to be the same for all items (a 60% chance of getting the first item right, a 60% chance of getting the second item right, and so on).
Jumping ahead for a second: don’t confuse unidimensionality with composite scores (a lot of tests are used for making decisions regarding selection, grade-school advancement, etc.); you often are measuring different things (e.g., a professor’s performance: teaching, research, and service), but you should be sure what constructs the different measures are measuring.
One note about the binomial-error model: it doesn’t give you confidence intervals directly. Instead, for an examinee with an observed score, you ask: what is the probability of getting a certain score on the test if their true score is a certain value (what’s the probability of Ann getting score x if her true score is y)? You have to make a lot of judgment calls. (This calculation is sketched below.)
A term we need to know (regarding the binomial-error model): zeta (ζ), which means the true score (see assumption 2).
There is also a “compound binomial error” model out there (used when you are uncomfortable assuming the standard error of measurement is the same from person to person). |
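A minimal sketch of that core calculation using scipy.stats.binom: the probability of each observed score X on an n-item test given a true score zeta. The 10 items and zeta = .6 are hypothetical:

```python
from scipy.stats import binom

n_items, zeta = 10, 0.6   # zeta = true score expressed as a probability
for x in range(n_items + 1):
    p = binom.pmf(x, n_items, zeta)   # P(X = x | zeta)
    print(f"P(X = {x:2d}) = {p:.3f}")
```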
|
|
Term
Face validity |
Definition
Does it look like it measures what it is supposed to measure? |
|
|
Term
Logical (aka Sampling) Validity
|
|
Definition
Related to content validity; the idea is: have you carefully defined the domain of your content, and have you carefully sampled from that domain? Have the items been sampled from your construct? |
|
|
Term
Content validity |
Definition
Does content of measurement match content of construct (establishing correspondence between measurement and construct)? |
|
|
Term
Criterion-Oriented Validity (& two types)
|
|
Definition
Does your measurement predict (correlate with) other indicators of the construct? Establishing the correlation between your test and some related phenomenon (or some other measure of your construct). Two types: predictive and concurrent validity. |
|
|
Term
5 Steps in creating C-RV Study |
|
Definition
1. Determine Feasibility 2. Develop the Criterion 3. Choose a sample 4. Plan your C-RV study. 5. Plan your data analysis
|
|
|
Term
Construct validity |
Definition
Are you measuring what you think you are measuring? Establishing correspondence between tests and psychological functions and structures. |
|
|
Term
Types of Construct Validity
|
|
Definition
1. Factorial validity: Do you have a unidimensional construct? If not, how many dimensions are in your construct? E.g., human emotion (2 dimensions of human emotion)… Note: it is really easy to come up with a hypothesized effect of “emotion,” but we use this construct very sloppily… KNOW THY CONSTRUCT!!! 2. Convergent validity 3. Discriminant (divergent) validity |
|
|
Term
14 Threats to Construct Validity
|
|
Definition
Look at this set of notecards; go to threats!
|
|
|
Term
Experimental validity equals?
|
|
Definition
Equals internal and external validity. |
|
|
Term
Internal validity |
Definition
The ability to reject other rival hypotheses. |
|
|
Term
External validity |
Definition
The extent to which you can generalize your experiment’s findings to other situations. |
|
|
Term
Issues in criterion-related validity
|
|
Definition
1. Statistical errors 2. Construct validity problems 3. Cross-validation & shrinkage 4. Criterion deficiency & criterion contamination |
|
|
Term
Statistical errors |
Definition
There are several (apparently 7)
|
|
|
Term
Estimated population validity |
Definition
A statistical method to estimate your real r, given the shrinkage that is going on. Formula for the Estimated Population Validity (Fred will provide the formula for the test). WHEN CREATING TESTS IN REAL LIFE, you have to conduct several calculations, which will bump your original observed r up and down, to determine the true r between measure and construct. |
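The card doesn't give Fred's formula, so as an illustration here is the Wherry adjustment, one common estimate of validity after shrinkage (an assumption on my part, not necessarily the course's formula):

```python
import math

def wherry_adjusted_r(r: float, n: int, k: int) -> float:
    """Shrunken validity estimate from observed r, sample size n, and
    k predictors (Wherry adjustment; not necessarily the course formula)."""
    r2_adj = 1 - (1 - r**2) * (n - 1) / (n - k - 1)
    return math.sqrt(max(r2_adj, 0.0))

# A hypothetical observed r of .40 from n = 60 people and 3 predictors:
print(f"{wherry_adjusted_r(0.40, n=60, k=3):.3f}")  # ~ .34 after shrinkage
```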
|
|
Term
Shrinkage |
Definition
If you ran the exact same c-r validity study on another sample of people (who are the same), you wouldn’t get the same r; the second r will ALWAYS be lower (occurs in c-rv studies). |
|
|
Term
Criterion Deficiency and Contamination |
|
Definition
- When the measure does not fully measure the construct. - Any situation where the criterion is measuring something outside of the construct. (Let’s say Fred was developing a test for Goodrich plant operators (he actually did this); one of the problems Fred had was how to define the criterion of job performance for these operators. Was their job performance based on bales of rubber produced (often it was)? Was the number of bales produced solely under the control of the worker? No, several other factors influence “bales of rubber produced”: machine breakdowns, people, supplies available, etc.)
|
|
|
Term
|
Definition
- The MTMM is composed of convergent & divergent validity - ... |
|
|
Term
|
Definition
The idea is that if you measure one construct using two completely different kinds of measures, they should come to the same conclusion. The idea that two measures of the same thing should correlate with each other. |
|
|
Term
|
Definition
If you use the same types of measures but they are measuring two different constructs, they should not reach the same conclusion. The notion that a measure of A can be discriminated from a measure of B, when B is thought to be different from A; discriminant validity correlations should be lower than convergent validity correlations. |
|
|