Term
Base Rate |
Definition
proportion of people in population who can successfully do the job |
|
|
Term
Selection Ratio (SR) |
Definition
proportion of persons hired/admitted (SR= #selected / #applicants) |
|
|
Term
The accuracy of selection decisions based on test scores depends on 3 factors: |
|
Definition
Base Rate, Selection Ratio, Test Validity |
|
|
Term
There are four possible outcomes of every selection decision. These can be arranged in a table. |
|
Definition
True Positive (TP) False Positive (FP) False Negative (FN) True Negative (TN) |
|
|
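A minimal sketch (not from the source deck) of how the four outcomes combine into the quantities defined above; the counts are made-up illustration values.

```python
# Four selection outcomes arranged as a 2x2 table (illustrative counts only).
tp, fp = 40, 10   # selected & successful, selected & unsuccessful
fn, tn = 20, 30   # rejected & would have succeeded, rejected & would have failed

n = tp + fp + fn + tn
base_rate = (tp + fn) / n          # proportion of people who can do the job
selection_ratio = (tp + fp) / n    # proportion of applicants selected
hit_rate = (tp + tn) / n           # proportion of correct decisions
success_rate_among_hired = tp / (tp + fp)

print(base_rate, selection_ratio, hit_rate, success_rate_among_hired)
```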
Term
When the base rate is high |
|
Definition
When the base rate is high, there will be more TP and FN. |
|
|
Term
When the base rate is low |
|
Definition
When the base rate is low, there will be more TN and FP. |
|
Term
Taylor-Russell Tables |
Definition
--The relationship of validity coefficients to the practical effectiveness of tests in selection.
--Taylor and Russell catalogued expected hit rates (TP) for different base rates, selection ratios, and validity coefficients.
--These tables are based on the premise that an organization wants to maximize True Positive decisions; other outcomes are not considered.
--The tables assume bivariate normality, which can be violated if a test has floor or ceiling effects. Unless this violation is severe, the tables are reasonably accurate. |
|
|
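A hedged Monte Carlo sketch of the Taylor-Russell idea under the bivariate-normality premise described above; the validity, base rate, and selection ratio values are assumptions for illustration, not entries from the published tables.

```python
# Estimate the proportion of successful employees among those hired,
# given a validity coefficient, base rate, and selection ratio.
import numpy as np

rng = np.random.default_rng(0)
validity, base_rate, selection_ratio = 0.40, 0.50, 0.20  # illustrative values
n = 200_000

# Bivariate-normal test score (x) and job performance (y), correlated at the validity level
x = rng.standard_normal(n)
y = validity * x + np.sqrt(1 - validity**2) * rng.standard_normal(n)

hired = x >= np.quantile(x, 1 - selection_ratio)    # top of the score distribution
successful = y >= np.quantile(y, 1 - base_rate)     # performance above the success cutoff

print("TP rate among hired:", successful[hired].mean())  # rises as validity increases
```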
Term
Test reliability can be increased in two ways: |
|
Definition
--by adding items that correlate positively with the other items --by removing items that are problematic (too wordy, tricky/confusing, too hard/easy) |
|
|
Term
Multiple-choice items: 1. Body of a multiple-choice item 2. The choices that follow 3. Incorrect response options |
|
Definition
1. Stem 2. Response Options 3. Distractors |
|
|
Term
2 possibilities for scoring multiple-choice items |
|
Definition
If response options can be ordered to reflect different degrees of correctness, a 3 might be assigned to the right answer, 2 for the next best, 1 for the next, and 0 for a completely incorrect answer. This scheme awards points for partial knowledge (polytomous scoring).
On the other hand, it is more common to simply assign 1 for choosing the correct response and 0 for all other responses (dichotomous scoring). |
|
|
Term
According to Murphy, a "perfect test item" has two characteristics |
|
Definition
--all people who know the answer will choose the correct response --those who do not know the answer will choose randomly among the distractors, which implies that some respondents will guess correctly and that each incorrect response option will be equally popular |
|
|
Term
Distractors that are rarely chosen |
|
Definition
decrease the difficulty of an item |
|
|
Term
Test items may be scored DICHOTOMOUSLY |
|
Definition
**two possible scores for each item EX. math items have a right and a wrong answer --assign 0 if the wrong answer is chosen --assign 1 if the right answer is chosen |
|
|
Term
Survey and multiple choice items may be scored POLYTOMOUSLY |
|
Definition
**3 or more possible scores per item EX. attitude surveys do not have right and wrong answers (e.g., -2 = Strongly Disagree, -1 = Disagree, etc.) |
|
|
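A small sketch contrasting dichotomous and polytomous scoring; the item keys, partial-credit weights, and Likert weights are made up for illustration.

```python
# Dichotomous scoring: 1 for the keyed answer, 0 otherwise.
answer_key = {"q1": "B", "q2": "D"}
# Polytomous scoring: points reflect degrees of correctness or attitude strength.
partial_credit = {"q3": {"A": 3, "C": 2, "D": 1}}
likert_weights = {"SD": -2, "D": -1, "N": 0, "A": 1, "SA": 2}

responses = {"q1": "B", "q2": "A", "q3": "C", "q4": "SA"}

score = 0
score += sum(int(responses[q] == key) for q, key in answer_key.items())
score += sum(partial_credit[q].get(responses[q], 0) for q in partial_credit)
score += likert_weights[responses["q4"]]
print(score)  # 1 + 0 + 2 + 2 = 5
```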
Term
Traditional methods of item analysis |
|
Definition
judge the quality of items with respect to the intended sample of test takers. Two psychometric properties are important:
--how difficult/easy the item is for the target group of examinees
--how well the item discriminates among persons having different levels of ability
Good test items are of moderate difficulty and discriminate well among examinees. |
|
|
Term
To examine the difficulty and discriminating power of items, we often consider 3 basic statistics: |
|
Definition
** P-values ** Item-total correlations ** Inter-item correlations |
|
|
Term
P-values |
Definition
the proportion of persons correctly answering or endorsing an item. A high p-value means the item is easy (too high if > .9). |
|
|
Term
Item-total correlations |
Definition
correlation of responses to individual test items with the total test score (items with correlations greater than .3 are ones you would want to keep!) |
|
|
Term
Inter-item correlations |
Definition
(the correlation of items with each other) --in general, a reliable test can be created by adding items that correlate positively with each other, even if the correlations are small (e.g., 0.2) --inter-item correlations are LARGE when test content is homogeneous and SMALL when it is heterogeneous |
|
|
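A sketch of the three item-analysis statistics (plus coefficient alpha) computed on a toy response matrix; the 0/1 data are invented for illustration.

```python
# Item analysis: p-values, corrected item-total correlations,
# inter-item correlations, and coefficient alpha.
import numpy as np

X = np.array([  # rows = examinees, columns = items, 1 = correct/endorsed
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
], dtype=float)

p_values = X.mean(axis=0)            # proportion answering each item correctly
total = X.sum(axis=1)

# Corrected item-total correlation: correlate each item with the total of the *other* items
item_total = [np.corrcoef(X[:, j], total - X[:, j])[0, 1] for j in range(X.shape[1])]

inter_item = np.corrcoef(X, rowvar=False)   # inter-item correlation matrix

k = X.shape[1]                               # coefficient alpha (assumes homogeneity)
alpha = k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum() / total.var(ddof=1))

print(p_values, item_total, alpha)
```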
Term
Test reliability is influenced |
|
Definition
by the variance of total test scores |
|
|
Term
One way to increase test score variance is to |
|
Definition
select items having p-values near 0.5 |
|
|
Term
Items considered bad or (not useful for the target sample) have the following general properties: |
|
Definition
*p-values less than 0.1 or greater than 0.9 *negative or very low (<.1) inter-item correlations *negative or low item-total correlations (<0.3) |
|
|
Term
P-values less than 0.1 or greater than 0.9 |
|
Definition
Items having p-values less than 0.1 or greater than 0.9 contribute little to test variance. They don't differentiate among examinees of high and low ability, so they can be dropped without loss of measurement precision. |
|
|
Term
Negative or very low (<.1) inter-item correlations |
|
Definition
Negative or very low (<.1) inter-item correlations suggest that the test is measuring more than 1 construct. Removing items having negative or very low inter-item correlations will increase internal consistency reliability (recall, coefficient alpha assumes homogeneity) |
|
|
Term
Negative or low item-total correlations (<.3) |
|
Definition
Items having negative correlations with the total test score must be dropped: people who did well on the test did poorly on the item, indicating a possible problem with its content. Items having low item-total correlations contribute little to measurement precision and are also candidates for revision or removal. |
|
|
Term
When test content is heterogeneous (broad) |
Definition
inter-item correlations will be small (hetero) |
|
|
Term
When test content is homogeneous (narrow) |
Definition
inter-item correlations will be larger (homo)
*if a test is broad, more items are needed to achieve acceptable levels of reliability (.7 or more) |
|
|
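A sketch of the Spearman-Brown prophecy formula, one common way to estimate how much longer a test must become to reach an acceptable reliability such as .70; the starting reliability of .50 is an assumed example.

```python
# Spearman-Brown: reliability of a test lengthened by a given factor,
# and the lengthening factor needed to reach a target reliability.
def lengthened_reliability(r, factor):
    return factor * r / (1 + (factor - 1) * r)

def lengthening_needed(r_current, r_target=0.70):
    return r_target * (1 - r_current) / (r_current * (1 - r_target))

print(lengthened_reliability(0.50, 2))   # doubling a .50 test -> ~.67
print(lengthening_needed(0.50))          # ~2.33x as many items to reach .70
```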
Term
Item Response Theory (IRT) |
|
Definition
*a relatively new and powerful methodology for examining the properties of test items
*items can be compared using parameters that reflect difficulty, discrimination, and the effects of guessing
*the Greek letter theta is used to represent an examinee's trait level (score): ability, skill, or standing on the construct measured by the test. Scores are on a standard-normal metric, ranging from about -3 to +3
*the quality of items is examined using item response functions (IRFs), which graphically illustrate the relationship between trait level and the probability of a correct response (plotted, these functions are S-shaped) |
|
|
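A sketch of a three-parameter logistic (3PL) item response function, the kind of S-shaped curve described above; the a, b, and c parameter values are made-up examples.

```python
# Probability of a correct response as a function of trait level theta.
import numpy as np

def irf(theta, a, b, c):
    """a = discrimination, b = difficulty, c = guessing (lower asymptote)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)            # trait levels on the standard-normal metric
print(irf(theta, a=1.2, b=0.0, c=0.20))  # S-shaped curve rising from ~.20 toward 1
```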
Term
Computerized adaptive testing (applications of IRT methods) |
|
Definition
*create tests tailored to examinee ability *administer only items that provide high information about the examinee, thus reducing the error in a person's score *adaptive tests require only about half as many items as nonadaptive tests to obtain a similar level of accuracy |
|
|
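A sketch of the item-selection step in adaptive testing: administer the item with the most Fisher information at the current ability estimate. The 2PL item parameters and the ability estimate are assumptions for illustration.

```python
# Pick the next item by maximum information at the current theta estimate.
import numpy as np

items = [  # (discrimination a, difficulty b)
    (0.8, -1.5), (1.2, -0.5), (1.5, 0.0), (1.0, 0.8), (1.8, 1.5),
]

def info(theta, a, b):
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)            # Fisher information for a 2PL item

theta_hat = 0.3                          # current ability estimate
next_item = max(range(len(items)), key=lambda j: info(theta_hat, *items[j]))
print("administer item", next_item)
```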
Term
Detecting biased items (applications of IRT methods) |
|
Definition
*by comparing IRFs across groups, one can determine whether a test item exhibits psychometric bias (a.k.a. differential item functioning, DIF) *an item is said to be biased if its IRFs differ across groups of examinees (e.g., men and women) after a process called "linking" |
|
|
Term
|
Definition
tailored examinations (can be applied to groups or to individuals) |
|
|
Term
|
Definition
*Selecting item types *Item Writing *Item Content *Item response alternatives (response format) |
|
|
Term
|
Definition
*constructed response (short answer/essay): demonstrations of skill *low-fidelity simulation (describe how something should be produced) *high-fidelity simulation (actually develop the product or do the task) |
|
|
Term
|
Definition
*the first step in test construction is to generate a pool of items *generally need 2-3 times as many items in the pool as you desire in the final version of the test *items will be selected based on both content and psychometric properties |
|
|
Term
Guidelines for item writing |
|
Definition
AVOID *Long items *Double negatives *Double-barreled statements (mixing different concepts; do not ask two things in one item) *Sexist, racist, or offensive language *Slang that may go out of date quickly *Big, complicated, or esoteric words (EX. the word HOT can have different meanings) DO select an appropriate reading level for the target group (e.g., 5th grade) |
|
|
Term
Item content (approaches to scale development) |
Definition
*generally there are two approaches to scale development: rational and empirical *often a hybrid (mix) of these two approaches is used; call it the rational-empirical method |
|
|
Term
Rational Scales (item content) |
|
Definition
*create items based on a theory of behavior; some underlying thought, belief, or rationale is used as the basis for selecting items. Items are chosen on theoretical grounds. *Advantage- can use theory to make predictions about behavior; good face validity *Disadvantage- items tend to be transparent (i.e., it is clear what they are measuring), so responses are subject to conscious (faking) or unconscious (self-deception) distortion |
|
|
Term
Empirical Scales (item content) |
Definition
*generate a broad range of items, not tied to any theory *compute the correlation between item responses and some criterion variable *select and retain items that predict well (i.e., have the highest correlation with the external criterion) and those that differentiate among members of different groups; for example, select items that best differentiate between schizophrenics and "normal" individuals *items are scored by empirical keying (aka criterion keying) *Advantage- ---------.....--------- *Disadvantage- lower face validity |
|
|
Term
Item Response Alternatives (response format) EXAMPLES |
|
Definition
-the response format refers to the manner in which responses will be collected from the examinees. EX: True-False, Multiple Choice, Free Response, Auditory Response, Likert Type, Forced Choice -MC is popular because it can be scored objectively, but it is difficult to write good distractors -Free Response yields rich information, but requires subjective judgment to score, so one must often examine inter-rater agreement |
|
|
Term
Self-report measures and response sets |
Definition
-Psychologists frequently use self-report measures; *some questions are perceived as too invasive or personal *sometimes persons are concerned about confidentiality, so they consciously distort their responses (fake good, fake bad, respond randomly) -Test developers try to control these effects, which are called RESPONSE SETS; *use scales designed to detect unusual responses *use warnings that unusual responses can be detected and that verifiable information will be examined for accuracy |
|
|
Term
Examples of response sets |
|
Definition
Social Desirability: the idea that persons tend to answer in ways that present themselves in the best light (fake good) or worst light (fake bad), rather than answer honestly -Intentionally distorting one's responses is known as FAKING or DISSIMULATION -Faking is a big issue in noncognitive assessment (personality, biographical data, worker diaries, etc.) -There is sharp disagreement about the ramifications of faking -Can you correct for faking after a measure has been administered? (research suggests no) -Can you prevent faking by strategic construction of items or tests? (maybe) |
|
|
Term
Random Responding.... How can you try to detect it? |
|
Definition
occurs when examinees fail to attend to the content of items because they are unmotivated, in a hurry, or unwilling to cooperate -Try to detect it by using scales containing a mix of negatively and positively worded items and by applying mathematical models |
|
|
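A sketch of one simple consistency check based on mixing positively and negatively worded items; the item pairings, scale range, and cutoff are assumptions for illustration, not a method specified in the source.

```python
# Flag possible random/careless responding: after reverse-scoring the negatively
# worded item of each pair, an attentive respondent's two answers should agree.
MAX_SCORE = 5                            # 1-5 Likert scale
pairs = [(0, 1), (2, 3)]                 # (positively worded, negatively worded) item indices

def inconsistency(responses):
    diffs = [abs(responses[p] - (MAX_SCORE + 1 - responses[n])) for p, n in pairs]
    return sum(diffs) / len(diffs)       # 0 = perfectly consistent

careful = [5, 1, 4, 2]                   # agrees with positive, disagrees with negative
careless = [5, 5, 1, 1]                  # agrees with both a statement and its opposite
print(inconsistency(careful), inconsistency(careless))  # flag large values, e.g. > 2
```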
Term
|
Definition
*a tendency to answer in a certain way; a characteristic you bring to the test -Acquiescence: tendency to agree with items without attending to their content -Criticalness: tendency to disagree with items without attending to their content -Dealing with response styles: response styles are often elicited by items that are ambiguous or confusing. Try to detect them by using negatively and positively worded items and perhaps by including statements that would be clearly false or true; EX, it would be odd if a person agreed with the statement, "I've never drunk water." |
|
|
Term
|
Definition
once you do something bad, that's it. |
|
|
Term
|
Definition
if the rater likes you, they will score you high on everything |
|
|
Term
Normative Scales |
Definition
-allow for inter-individual (between person) comparisons -compare each person's score to those of a normative group -give indication of amount or level of trait exhibited **can compare scores across people |
|
|
Term
Ipsative Scales |
Definition
allow only for intra-individual (within person) comparisons -use a forced-choice format (paired comparison) where examinee must express a preference between two alternatives (think about the Carrots or Broccoli example) -With forced choice like the carrots and broccoli example, you don't know how strong the liking of any vegetable is. -Thus this test's scores cannot be used for inter-individual comparison as in job selection. This is the "challenge" for developing "fake-resistant" personality tests. |
|
|
Term
Normative Scales vs Ipsative Scales |
|
Definition
Normative -can be used for inter-individual comparisons -provide information about absolute standing on trait(s) assessed Ipsative -can only be used for intra-individual comparisons -provide information about relative standing on traits assessed |
|
|
Term
Norming psychological tests |
|
Definition
must choose samples that represent the target population: *good comparison groups provide a "representative" sample (demographic characteristics) *typically have several norm groups for each test; local norms preferred |
|
|
Term
Steps in developing norms |
|
Definition
-Defining the target population: *decide on the composition of the normative group based on the intended use of the test EX, LSAT, MCAT, ACT -Selecting the sample: *obtain samples that are a cross-section of the population *regional samples (rural/urban) *geographical -Standardization: *administer the test the same way to all individuals *standardization decreases error variance by keeping conditions uniform across administrations *use anchor items to equate scores from different test forms |
|
|
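A sketch of scoring against a norm group: converting a raw score to a z-score and a percentile rank relative to a normative sample; the norm data are made up for illustration.

```python
# Compare an individual's raw score to a normative group.
import numpy as np

norm_group = np.array([12, 15, 18, 20, 21, 22, 24, 25, 27, 30])  # raw scores
raw = 24

z = (raw - norm_group.mean()) / norm_group.std(ddof=1)
percentile = (norm_group < raw).mean() * 100   # percent of norm group scoring below

print(z, percentile)
```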
Term
Test publication and revision (WRITING THE MANUAL) |
|
Definition
*state the purpose of the test and directions for administration and scoring, and describe test development and validity evidence EX, describe validation samples, reliability, convergent and discriminant validity with other measures *the manual must be revised with each new form or amendment |
|
|
Term
Test publication and revision (REVISING THE TEST) |
|
Definition
should be revised when: *language is outdated *security is compromised *content has been disclosed *there are changes to content, format, medium of administration, or scoring |
|
|
Term
|
Definition
every 5 years or so you have to go back and update your manual because it's out of date *some tests are not as urgent to change/update as others. EX, a personality test is pretty stable |
|
|