Term
Validity |
Definition
The degree to which evidence and theory support the interpretations of test scores entailed by uses of tests. |
|
|
Term
What is the most fundamental consideration in developing and evaluating tests? |
|
Definition
Validity |
|
Term
Tests are not “____” or “_____”, interpretations are ____ or ____ |
|
Definition
Tests are not “valid” or “invalid”, interpretations are valid or invalid |
|
|
Term
Threats to Validity |
Definition
Construct Underrepresentation: Degree to which test fails to capture important aspects of the construct.
Reading comprehension test might ignore common types of reading material or contain an insufficient sample of reading passages
Test of anxiety might only measure physiological anxiety & ignore cognitive, emotional, or situational components
Construct-Irrelevant Variance: Degree to which test scores are affected by processes that are extraneous to the intended construct
Reading comprehension test elicits emotional reactions to test content
Response requires writing skill, which is confounded with reading comprehension
Flynn effect: using an IQ test whose norms are out of date |
|
|
Term
Standards for Educational and Psychological Tests (AERA, APA, NCME, 1999) VALIDITY |
|
Definition
Evidence Based on Test Content (formerly content validity)
Evidence Based on Response Processes (formerly construct validity)
Evidence Based on Internal Structure (formerly construct validity)
Evidence Based on Relations to Other Variables (formerly criterion-related validity)
Evidence Based on Consequences of Testing (new)
|
|
|
Term
EVIDENCE BASED ON TEST CONTENT |
|
Definition
Based on content specifications, an issue for achievement tests
Includes both logical and/or empirical analyses of the adequacy with which the test represents the content domain
Relevance of the content domain to the proposed interpretation of test scores
Can come from expert judgments regarding relevance
When tests are used for individual decisions (e.g., graduation or retention), content should be limited to what students have had the opportunity to learn
Evidence about content can be used to address questions about differences in meaning/interpretation of test scores across relevant subgroups
Most frequently used for achievement tests
Can be somewhat subjective; not based on statistics |
|
|
Term
EVIDENCE BASED ON RESPONSE PROCESSES |
|
Definition
Allows for determination of fit between construct & nature of performance or response processes engaged in by examinee
Speed
Short-term memory
Long-term memory
Latency
Evidence usually comes from analyses of individual responses
Studies on response processes can be directed toward observers, judges, or raters of performance
Test scores can be affected by processes irrelevant to construct being measured (observer bias, observer drift, halo effects, errors of leniency, errors of stringency) |
|
|
Term
EVIDENCE BASED ON INTERNAL STRUCTURE |
|
Definition
Indicates degree to which test items measure the construct to be interpreted
Structure can be unidimensional or multidimensional
Typically based on factor analysis
Can be exploratory or confirmatory
Differential item functioning (DIF) can be used to assess internal structure of tests for subgroups
Ethnicity
Sex
Age |
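The eigenvalue logic behind the factor-analytic check of internal structure can be sketched in a few lines of numpy; the 4-item inter-item correlation matrix below is hypothetical, invented for illustration:

```python
import numpy as np

# Hypothetical inter-item correlation matrix for a 4-item scale.
R = np.array([
    [1.0, 0.6, 0.5, 0.6],
    [0.6, 1.0, 0.6, 0.5],
    [0.5, 0.6, 1.0, 0.6],
    [0.6, 0.5, 0.6, 1.0],
])

# Eigenvalues of R, sorted largest first. One dominant eigenvalue with
# the rest small suggests a unidimensional structure; Kaiser's rule
# (retain eigenvalues > 1) is a rough exploratory heuristic.
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
n_factors = int(np.sum(eigvals > 1.0))
print(eigvals.round(3), n_factors)
```

A confirmatory analysis would instead fit a specified factor model and test its fit; this sketch only shows the exploratory eigenvalue screen.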
|
|
Term
EVIDENCE BASED ON RELATIONS TO OTHER VARIABLES |
|
Definition
May include measures of some criterion the test is expected to predict (nomological network)
May include relationships to other tests of the same construct (nomological network)
May include prediction of some categorical variable (e.g., group membership)
Convergent-Discriminant Validity & Multitrait-Multimethod Matrix (MTMM)
Test-Criterion Relationships (concurrent, predictive, postdictive)
Recall the standard error of estimate (SEE)
Validity Generalization
Meta-Analytic Methods (most often used to assess validity generalization) |
|
|
Term
CONVERGENT VALIDITY: Campbell & Fiske (1959) |
|
Definition
Relatively high correlations among the same traits measured by multiple methods |
|
|
Term
DISCRIMINANT VALIDITY: Campbell & Fiske (1959) |
|
Definition
Low correlations among multiple traits measured by the same method and low correlations among different traits measured by multiple methods |
|
|
Term
MTMM Campbell & Fiske (1959) |
|
Definition
Reliability diagonal (monotrait-monomethod correlations)
Heterotrait/monomethod correlations
Validity diagonal (monotrait/heteromethod correlations)
Heterotrait/heteromethod correlations
Requires more than one way of measuring each trait
Validity is indicated by correlations between different methods measuring the same trait |
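Campbell & Fiske's four categories of MTMM entries can be made concrete by classifying each cell programmatically; the 2-trait (anxiety A, depression D) by 2-method (self-report 1, clinician rating 2) correlation matrix below is hypothetical:

```python
import numpy as np

# Hypothetical MTMM correlation matrix, row/column order: A1, D1, A2, D2.
labels = [("A", 1), ("D", 1), ("A", 2), ("D", 2)]
R = np.array([
    [0.89, 0.45, 0.62, 0.30],
    [0.45, 0.87, 0.28, 0.58],
    [0.62, 0.28, 0.91, 0.40],
    [0.30, 0.58, 0.40, 0.85],
])

reliability = [R[i, i] for i in range(4)]  # monotrait-monomethod diagonal
validity = []   # monotrait-heteromethod (validity diagonal)
het_mono = []   # heterotrait-monomethod
het_het = []    # heterotrait-heteromethod
for i in range(4):
    for j in range(i + 1, 4):
        same_trait = labels[i][0] == labels[j][0]
        same_method = labels[i][1] == labels[j][1]
        if same_trait and not same_method:
            validity.append(R[i, j])
        elif not same_trait and same_method:
            het_mono.append(R[i, j])
        else:
            het_het.append(R[i, j])

# Convergent validity: validity diagonal should be high; discriminant
# validity: it should exceed the heterotrait correlations.
print(min(validity), max(het_mono + het_het))
```

In this invented matrix the pattern Campbell & Fiske require holds: the validity diagonal (.62, .58) exceeds every heterotrait correlation.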
|
|
Term
RELIABILITY AND VALIDITY IN MTMM |
|
Definition
Reliability is agreement between 2 efforts to measure the same trait using maximally similar methods
Validity is represented by agreement between efforts to measure the same trait using maximally different methods
A test can be invalidated by too-high correlations with measures of different traits (falls closer to reliability than validity)
Measures of the same trait by 2 or more methods should correlate higher with each other than with measures of different traits (otherwise method variance dominates)
Lack of convergent validity suggests: (1) neither method measures the trait, (2) one method does not measure the trait, or (3) the trait is not a functional unity (response processes involved are specific to nontrait attributes of each test) |
|
|
Term
SOME ISSUES IN PREDICTIVE VALIDITY |
|
Definition
Standard Error of Estimate (errors in prediction)
Effects of Reliability on Prediction
Effects of Attenuation on Prediction
Multiple Regression and Prediction
Predictive Accuracy (Discriminant Function Analysis/Logistic Regression) |
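The standard error of estimate listed above follows directly from the criterion SD and the validity coefficient. A minimal sketch with assumed illustrative values (criterion SD = 15, r = .60):

```python
import math

def standard_error_of_estimate(s_y: float, r_xy: float) -> float:
    """SEE = s_y * sqrt(1 - r_xy^2): typical size of prediction errors
    around the regression line."""
    return s_y * math.sqrt(1 - r_xy ** 2)

see = standard_error_of_estimate(15.0, 0.60)
# A rough 95% prediction band around a predicted score is +/- 1.96 * SEE.
band = 1.96 * see
print(round(see, 2), round(band, 2))
```

Note how weak the reduction in error is: even with r = .60, the SEE (12.0) is 80% of the criterion SD, which is why attenuation and reliability matter so much for prediction.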
|
|
Term
EVIDENCE BASED ON CONSEQUENCES OF TESTING |
|
Definition
Deals with intended and unintended consequences of test use and interpretation
Must distinguish between validity evidence versus social policy decisions
Differential consequences important in employment selection, promotion, placement of children in SPED, and choice of criterion scores for pass or fail (e.g., graduation)
Consequences of testing may influence decisions about test use, but they do not detract from the validity of test interpretations
Griggs v. Duke Power
Larry P. v. Riles
NFL’s use of the Wonderlic Test
Tests used in decisions or placements that result in poorer outcomes for certain subgroups versus other subgroups can be considered biased |
|
|
Term
MESSICK’S MODEL OF VALIDITY |
|
Definition
A 2 x 2 crosstab: evidential basis vs. consequential basis (rows) by test interpretation vs. test use (columns) |
|
Term
EXPLANATION OF 4 FACETS OF VALIDITY |
|
Definition
Evidential basis of test interpretation: construct validity
Evidential basis of test use: construct validity + relevance/utility
Consequential basis of test interpretation: construct validity + value implications
Consequential basis of test use: construct validity + relevance/utility + social consequences |
|
Term
DECISION RELIABILITY |
Definition
Decision reliability refers to consistency of decision-making process
Decision reliability studies are conducted using Generalizability Theory
Dependability of behavioral measures
Considers multiple sources of error in test scores
Studied in ANOVA designs
Factors Affecting Decision Reliability
Reliability of test scores used to make decisions
Selection rate (% of population identified by diagnostic procedure)
As correlation between 2 measures decreases, decision reliability decreases and vice-versa |
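Decision consistency can be illustrated with raw agreement and Cohen's kappa between two pass/fail decision sets (kappa is one common index of decision agreement; the specific decisions below are invented for illustration):

```python
# Two hypothetical identification decisions (e.g., two administrations
# or two raters); 1 = identified, 0 = not identified.
decisions_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
decisions_2 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]

n = len(decisions_1)
# Raw agreement: proportion of cases where the two decisions match.
observed = sum(a == b for a, b in zip(decisions_1, decisions_2)) / n

# Chance agreement from the marginal selection rates of each decision.
p1 = sum(decisions_1) / n
p2 = sum(decisions_2) / n
expected = p1 * p2 + (1 - p1) * (1 - p2)

# Cohen's kappa: agreement corrected for chance.
kappa = (observed - expected) / (1 - expected)
print(round(observed, 2), round(kappa, 2))
```

With both selection rates at .50, chance agreement is .50, so the observed agreement of .80 yields kappa = .60; lowering the correlation between the two decision sets would lower both numbers, matching the last point above.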
|
|
Term
DECISION VALIDITY |
Definition
Appropriateness of using assessment information for a specific decision-making purpose
Requires consideration of meaning of information used to make decisions and consequences of using it
Diagnostic Accuracy assumes “true” diagnostic status is known |
|
|
Term
4 Measures to Consider in diagnostic accuracy: |
|
Definition
Sensitivity: Diagnostic status present on criterion & present on predictor variable
Specificity: Diagnostic status absent on criterion & absent on predictor variable
Base Rate: Proportion of persons in sample identified by criterion
Selection Rate: Proportion of people identified by predictor |
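The four measures can be computed directly from a 2 x 2 criterion-by-predictor table; the counts below are hypothetical:

```python
# Hypothetical 2x2 table: criterion = true diagnostic status,
# predictor = test decision.
tp, fn = 40, 10   # criterion-positive cases: test positive / test negative
fp, tn = 20, 130  # criterion-negative cases: test positive / test negative
n = tp + fn + fp + tn

sensitivity = tp / (tp + fn)      # status present on criterion & predictor
specificity = tn / (tn + fp)      # status absent on criterion & predictor
base_rate = (tp + fn) / n         # proportion identified by criterion
selection_rate = (tp + fp) / n    # proportion identified by predictor
print(sensitivity, specificity, base_rate, selection_rate)
```

Note that when the selection rate (.30 here) exceeds the base rate (.25), the predictor necessarily produces false positives even at high sensitivity.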
|
|