Term
What do measured characteristics represent?

Definition
• Characteristics measured represent the quantity of an attribute

Term
What type of characteristics can be measured by psychological inference?

Definition
Characteristics may be overt (directly observable behavior) or covert (intelligence, self-esteem, working memory)

Term
What is the major problem in measurement?

Definition
The major problem in measurement is that one often has no basis for assuming that a numerical score accurately reflects the quantity of interest

Term
What does assessment often involve?

Definition
Assessment often involves measurement of constructs

Term
What is a true score?

Definition
– Average value an examinee obtains over an infinite number of observed scores

Term
What do errors of measurement reflect in true scores?

Definition
Errors of measurement reflect the discrepancy between observed scores & the true score
– Standard error of measurement (SEM) is the standard deviation of an infinite number of observed scores

Term
What is the formula for estimating true scores?

Definition
True score can be estimated: Xtrue = rxx(X – M) + M
• .90(110 – 100) + 100 = 109
• .90(90 – 100) + 100 = 91
• What explains the above? Regression toward the mean: estimated true scores are pulled toward the mean whenever reliability is less than perfect (see the sketch below)

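The estimation formula is simple enough to script. A minimal Python sketch (the function and argument names are illustrative, not from the source):

```python
# Hedged sketch of CTT true-score estimation: Xtrue = rxx * (X - M) + M.
def estimate_true_score(rxx: float, observed: float, mean: float) -> float:
    """Regress an observed score toward the mean in proportion to reliability."""
    return rxx * (observed - mean) + mean

print(estimate_true_score(0.90, 110, 100))  # 109.0
print(estimate_true_score(0.90, 90, 100))   # 91.0
```
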
Term
Can an individual have more than one true score?

Definition
Yes – within CTT, true scores are test-specific, so an individual can have a different true score on each test of the same construct (e.g., WISC, Stanford-Binet, Woodcock-Johnson); biological/absolute true scores are the exception (only 1)

Term
Tell me about Biological or Absolute True Scores

Definition
– An absolute true score exists independent of the measurement process used
– Errors of measurement still occur
– Individuals have only 1 absolute true score (e.g., DNA, blood pressure, cholesterol level, cancer, pregnancy)
• Different lab tests may yield different results, but one would never average tests to estimate an absolute true score

Term
Ways to measure response strength in operant behavior

Definition
– Frequency
– Duration
– Latency
– Interresponse time

Term
Sources of Error for operant behavior

Definition
– Less than perfect IOA
– Can we generalize reliability to all other observers?
– Is there a standard error of measurement?

Term
Threats to Measurement of operant behavior

Definition
– Observer bias
– Observer drift
– Code complexity

Term
What does accuracy refer to in operant behavior measurement?

Definition
Degree to which measurement reflects the true value

Term
What are the common types of test scores?

Definition
• Raw scores
• Composite scores
• Percentile ranks (1-99)
• Stanines (M = 5, SD = 2)
• Normal curve equivalents (M = 50, SD = 21.06)
• Standard scores
– z scores (M = 0, SD = 1)
– T scores (M = 50, SD = 10)
– Scale scores (M = 10, SD = 3)
– DIQ scores (M = 100, SD = 15)
• Equivalence of standard scores
– z of 1 = T of 60 = DIQ of 115 = SS of 13 = NCE of 71.06 (see the sketch below)

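Each standard score is just a linear rescaling of z (score = M + SD × z), so the equivalence line above can be checked in a few lines of Python (a sketch; the dictionary and helper names are illustrative):

```python
# Hedged sketch: standard scores as linear rescalings of z (score = M + SD*z).
SCALES = {"T": (50, 10), "DIQ": (100, 15), "scale": (10, 3), "NCE": (50, 21.06)}

def from_z(z: float) -> dict:
    """Convert a z score into each standard-score metric."""
    return {name: m + sd * z for name, (m, sd) in SCALES.items()}

print(from_z(1.0))  # z of 1 -> T 60, DIQ 115, scale score 13, NCE 71.06
```
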
Term
What types of norms are used in testing?

Definition
• Age norms
• Grade norms
• Gender norms
• Special group norms
• Percentile rank norms
• Standard score norms
• National norms
• State norms
• District norms
• School norms

Term
How long has Classical Test Theory (CTT) been around?

Definition
Over a century – CTT traces to Spearman's work on reliability and measurement error in the early 1900s

Term
What are test scores made of in CTT?

Definition
• Test scores are made up of a true score component & an error score component
• X = True score + Error score
• True score is hypothetical (never actually known but can be estimated)
• True score is the average score on an infinite number of tests (model of parallel tests)
• Standard deviation of these tests is the standard error of measurement

Term
What reflects reliability of the test in CTT?

Definition
The degree of error in a test reflects the reliability of the test

Term
How is reliability defined in CTT?

Definition
Reliability is defined as true score variance relative to total variance in scores
• R = True score variance / Total variance (R = 75/100 = .75)
• Or R = 1 – Error variance / Total variance (R = 1 – .25 = .75)
• Or True score variance = Total variance × R (100 × .75 = 75; see the sketch below)

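The three forms are algebraically equivalent, which a few lines of Python make concrete (variable names are illustrative):

```python
# Hedged sketch: the three equivalent CTT reliability identities.
total_var, true_var = 100.0, 75.0
error_var = total_var - true_var       # error variance is the remainder

print(true_var / total_var)            # R = true/total -> 0.75
print(1 - error_var / total_var)       # R = 1 - error/total -> 0.75
print(total_var * 0.75)                # true variance = total * R -> 75.0
```
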
Term
Fundamental Assumptions in CTT

Definition
• Individuals possess stable traits or characteristics (true scores) that persist through time
• Errors of measurement are completely random (due entirely to unsystematic variation in test scores)
• Fallible scores are the result of the addition of true & error scores: Xobtained = Xtrue + Xerror (simulated in the sketch below)
• Based on the above logic, an individual's true score is exactly the same on all parallel tests
• Also (based on the above logic), an individual's fallible score will vary from one parallel test to another (based on differences in reliability)
• Also (and confusingly), an individual can have more than 1 true score (WISC-Stanford Binet-Woodcock Johnson)

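The additive model is easy to see in a small simulation. A hedged sketch (the SEM value and sample size are invented): averaging many parallel-test scores converges on the true score, as the model of parallel tests claims.

```python
# Hedged sketch: simulate X = T + E over many parallel tests; the mean of
# the fallible scores converges on the true score.
import random

random.seed(1)
true_score = 100.0
observed = [true_score + random.gauss(0, 4) for _ in range(10_000)]  # SEM ~ 4
print(round(sum(observed) / len(observed), 1))  # close to 100.0
```
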
Term
Major Sources of Error in Testing

Definition
• Errors associated with the specific situation
• Errors associated with different occasions (situation-centered & person-centered sources of variation)
• Errors associated with different test contents (2 or more measures thought to be parallel may not be parallel)
• Errors associated with subjective scoring systems (ratings, essay examinations, oral examinations, etc.)

Term
What Makes Tests Parallel?

Definition
• If they have the same mean
• If they have the same standard deviation
• If they correlate the same with a set of true scores
• If all their variance that is not explainable by true scores is pure random error

Term
Methods for Estimating Reliability of Tests

Definition
• Correlations between scores on repetitions of the same test (coefficient of stability)
• Correlations among scores on parallel forms of a test (coefficient of equivalence)
• Correlations between repetitions & parallel forms (coefficient of stability & equivalence)
• Correlations between comparable halves of a test (split-half reliability)
• Intercorrelations among all components of a test (inter-item correlation)

Term
Indices of Reliability and Error

Definition
• Reliability Index
• Reliability Coefficient
• Coefficient Alpha
• Spearman-Brown Formula
• Standard Error of Measurement
• Standard Error of Estimate
• Correction for Attenuation

Term
Reliability Index

Definition
• Correlation between true scores & fallible scores

Term
Reliability Coefficient

Definition
Correlation between scores on parallel test forms, or how well scores on 1 parallel test can predict scores on another parallel test

Term
Coefficient Alpha

Definition
Reflects the average inter-item correlation, or the reliability of an item sample in a content domain

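Coefficient alpha is conventionally computed as α = k/(k − 1) × (1 − Σ item variances / total-score variance). A minimal sketch, with an invented 3-item by 4-examinee score table:

```python
# Hedged sketch: Cronbach's alpha from an items-by-persons score table.
# The toy data are invented for illustration only.
import statistics

def cronbach_alpha(items: list[list[float]]) -> float:
    """items[i][p] = score of person p on item i."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]  # total score per examinee
    item_var = sum(statistics.variance(item) for item in items)
    return k / (k - 1) * (1 - item_var / statistics.variance(totals))

scores = [[1, 0, 1, 1], [1, 1, 1, 0], [1, 0, 1, 1]]
print(round(cronbach_alpha(scores), 2))
```
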
Term
Spearman-Brown Formula

Definition
Demonstrates the relationship between test length & measurement error

Term
Standard Error of Measurement

Definition
Extent to which an individual's scores vary over a series of parallel tests

Term
Standard Error of Estimate

Definition
Degree of measurement error in prediction from 1 variable to another

Term
Correction for Attenuation

Definition
Extent to which unreliability in test scores diminishes the correlation between 2 or more sets of test scores

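The classic disattenuation formula is r_true = r_xy / √(rxx × ryy). A minimal sketch (the numbers are invented):

```python
# Hedged sketch: correction for attenuation, r_true = r_xy / sqrt(rxx * ryy).
import math

def disattenuate(r_xy: float, rxx: float, ryy: float) -> float:
    """Estimate the true-score correlation from an observed correlation
    and the reliabilities of the two measures."""
    return r_xy / math.sqrt(rxx * ryy)

print(round(disattenuate(0.40, 0.80, 0.70), 2))  # observed .40 rises to ~.53
```
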
Term
Differences between Standards

Definition
• Standard Deviation (dispersion in a set of test scores)
• Standard Error of Measurement (error in a single test score)
• Standard Error of Estimate (error in prediction from 1 test score to another)
• Standard Error of the Mean (sampling error in an average test score)
• Standard Error of Correlation (sampling error in a correlation of test scores)

Term
The Importance of Spearman-Brown

Definition
• Major way of making tests more reliable is to make them longer
• Major source of error in tests is content error
• Longer tests have less content error than shorter tests
• CTT assumes subject error is minimized because of large samples
• Spearman-Brown prophecy formula: rkk = k·r11 / [1 + (k – 1)·r11]
• What would be the reliability of a test that was increased by a factor of k? A 20-item test has a reliability of .70 and is increased to 60 items (k = 3):
• rkk = 3(.70) / [1 + (3 – 1)(.70)] = 2.1/2.4 = .88
• Spearman-Brown can also be used to estimate reliability when shortening a test, using the same formula: a 100-item test with r = .95 shortened to a 50-item test (k = .50):
• rkk = .50(.95) / [1 + (.50 – 1)(.95)]
• rkk = .475/.525 = .90
• You can also rearrange the formula to estimate the length factor required to obtain a desired level of reliability:
• k = rkk(1 – r11) / [r11(1 – rkk)]
• k = .80(1 – .50) / [(.50)(1 – .80)] = .40/.10 = 4 (see the sketch below)

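Both uses of the formula, plus the rearrangement for required length, fit in a short Python sketch (function names are illustrative):

```python
# Hedged sketch: Spearman-Brown prophecy formula and its rearrangement.
def spearman_brown(r11: float, k: float) -> float:
    """Projected reliability when test length changes by factor k."""
    return k * r11 / (1 + (k - 1) * r11)

def length_factor(r11: float, target: float) -> float:
    """Length factor k needed to reach a target reliability."""
    return target * (1 - r11) / (r11 * (1 - target))

print(round(spearman_brown(0.70, 3), 3))    # 20 -> 60 items: 0.875, i.e. ~.88
print(round(spearman_brown(0.95, 0.5), 3))  # 100 -> 50 items: 0.905, i.e. ~.90
print(round(length_factor(0.50, 0.80), 2))  # 4.0 -> need 4x the items
```
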
Term
Importance of Standard Error of Measurement

Definition
• SEM reflects the degree of measurement error around an obtained score
• SEM is the standard deviation of an infinite number of parallel tests
• SEM reflects the degree of confidence one has in a test score
• SEM is a function of test reliability and variability in test performance
• SEM places confidence intervals around test scores
• As reliability of a test decreases, SEM approaches the SD of the test
• +/- 1 SEM = 68% CI; +/- 2 SEM = 95% CI; +/- 3 SEM = 99% CI (see the sketch below)
• r = .93 & SD = 15, X = 76, SEM = 4 points
• 68% CI: 72 - 80
• 95% CI: 68 - 84
• 99% CI: 64 - 88
• You can also place a SEM around the estimated true score
• Xtrue = rxx(X – M) + M
• Xtrue = .93(76 – 100) + 100 = 78

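SEM = SD × √(1 − r), and the CI bands follow directly. A small sketch (names are illustrative):

```python
# Hedged sketch: SEM = SD * sqrt(1 - r) and the usual CI bands.
import math

def sem(sd: float, r: float) -> float:
    return sd * math.sqrt(1 - r)

def ci(score: float, sem_: float, n_sems: int) -> tuple[float, float]:
    """Confidence band of +/- n_sems SEMs around a score."""
    return score - n_sems * sem_, score + n_sems * sem_

s = sem(15, 0.93)                 # ~3.97, i.e. about 4 points
for n, level in [(1, 68), (2, 95), (3, 99)]:
    lo, hi = ci(76, round(s), n)
    print(f"{level}% CI: {lo:.0f} - {hi:.0f}")   # 72-80, 68-84, 64-88
```
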
Term
Reliability Coefficient & SEM:
Internal consistency reliability (α)

Definition
– Invariably higher than test-retest
– Produces smaller SEMs
– Based on average inter-item correlation
– Reflects precision of a test score on a given day
– IQ score on Monday

Term
Reliability Coefficient & SEM:
Test-retest reliability (r)

Definition
– Invariably lower than α
– Produces larger SEMs
– Based on correlation over time (Time 1/Time 2)
– Reflects precision of test scores on any given day
– IQ score in September vs IQ score in January

Term
Standard Error of Measurement: A Practical Example

Definition
• Forrest is administered a WAIS-IV and obtains a FSIQ of 71
• Diagnosis of MR requires a score 2 SDs below the mean (70)
• Based on Forrest's IQ of 71, is he MR?
• Internal consistency: rxx of FSIQ = .97, SD of FSIQ = 15, SEM = 2 points
– 68% CI: 68-74; 95% CI: 65-76; 99% CI: 62-79
• Test-retest: r1,2 = .95, SD of FSIQ = 15, SEM = 3 points
– 68% CI: 68-74; 95% CI: 65-77; 99% CI: 62-80
• Conclusion? What should we do with Forrest?

Term
Generalizability Theory: An Alternative to CTT

Definition
• Extends the notion of measurement error beyond CTT
• Offers a way of assessing multiple sources of error (facets) concurrently
• CTT lumps all sources of error into one estimate (cannot separate them)

Term
Generalizability Theory:
Dependability

Definition
– Accuracy of generalizing from a person's observed score on a test or measure to the average score that person would have received under all conditions of measurement
• A single score obtained on one occasion, on one form of a test, with a single administrator is not fully dependable (multiple sources of error)

Term
Multifaceted Measurement Error

Definition
– Persons
– Occasions
– Items
– Raters
– Settings

Term
G-theory Studies use what?

Definition
• G theory studies are investigated in ANOVA designs
• ANOVA designs separate sources of variation in test scores
• ANOVA designs can be:
– Crossed
– Nested
• ANOVA designs can be:
– Fixed effects
– Random effects

Term
G Studies vs. D Studies

Definition
– G studies collect information
– D studies use the above information to make the best decision

Term
G Study

Definition
– Anticipates multiple uses of measurement
– Provides as much information as possible about sources of error
– Incorporates this information into proper test interpretation

Term
D Study

Definition
– Makes use of information in the G study to design the best application
– Specifies which facets are to be considered
– Specifies proper interpretation
– Estimates dependability based on increasing conditions (facets) of measurement

Term
G & D Studies: Systematic Direct Observations

Definition
• 14 students from a 5th grade classroom
• DV: on-task/off-task behavior
• 5 observers (4-hour training session)
• SDOs collected twice a day for 10 consecutive days using momentary time sampling
• IOA = 90%
• G study variance components:
– Persons (62%)
– Time (1%)
– Setting (0%)
– Person x Time (0%)
– Person x Setting (13%)
– Time x Setting (0%)
– Person x Time x Setting (24%) (see the sketch below)

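For a fully crossed persons × time × settings design, the standard relative G coefficient is Eρ² = σ²p / (σ²p + σ²pt/nt + σ²ps/ns + σ²pts,e/(nt·ns)). A hedged sketch applying that formula to the proportions above (the study's own D-study computations may have differed):

```python
# Hedged sketch: relative G coefficient for a fully crossed
# persons x time x settings design, using the variance proportions above.
# The study's own D-study computations may have differed.
def g_coefficient(var_p, var_pt, var_ps, var_pts_e, n_t=1, n_s=1):
    """E(rho^2) = var_p / (var_p + var_pt/n_t + var_ps/n_s + var_pts_e/(n_t*n_s))."""
    rel_error = var_pt / n_t + var_ps / n_s + var_pts_e / (n_t * n_s)
    return var_p / (var_p + rel_error)

# one occasion, one setting: roughly the reported G of .62
print(round(g_coefficient(0.62, 0.00, 0.13, 0.24), 2))          # 0.63
# averaging over 10 occasions shrinks occasion-linked error
print(round(g_coefficient(0.62, 0.00, 0.13, 0.24, n_t=10), 2))  # ~0.80
```
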
Term
Decision Studies and # of observations

Definition
• Decision Study 1
– 1 observation per day for 10 days
– G = .46
• Decision Study 2
– 1 observation per day for 3 days
– G = .25
• Decision Study 3
– 2 observations per day for 20 days
– G = .72
• Decision Study 4
– 4 observations per day for 20 days
– G = .83

Term
D Studies: What Does It All Mean?

Definition
• IOA was 90% but G reliability = .62 (Interpretation?)
• Adequate reliability only obtained if observations are collected 4 times per day for 4 weeks (40 days)
• 2400 minutes or 40 hours of observation (What would JT say?)
• IOA is a proxy for accuracy of measurement
• No incontrovertible index with which to compare observed scores
• SDOs should not be used in isolation (other methods needed)
• SDOs certainly not a "gold standard" measurement method

Term
G Study: Behavior Rating Scales

Definition
• G study of BASC & TRF (Achenbach) Externalizing Behavior
• 6 teacher pairs, Grades 1-5, rated 61 students
• α = .90-.97; r1,2 = .70-.90; rinterrater = .60-.76
• Dependability coefficients:
– Externalizing Composites = .68
– Aggression = .59
– Oppositional Defiant = .58
– Conduct Problems = .47
• Dependability coefficients weaker than bivariate correlations
• Dependability coefficients all in moderate range
• Considering multiple sources of error attenuates dependability
• Should not rely solely on rating scales in assessment
• Rating scales certainly not a "gold standard" method

Term
G & D Studies: Direct Behavior Rating

Definition
• Academic engagement measured via:
– SDO
– DBR
• Data collected over 10 consecutive school days
• DBR
– Teacher cued to start observation
– At end of period, teachers rated student behavior
– 100 mm line divided into 11 equal gradients (never-sometimes-always)
• SDO
– Momentary time sampling
– 15-s interval
• Design:
– Raters (Methods) x Observation Periods (Days) x Persons
– [p x (r:m) x (o:d)]

Term
G & D Studies: Direct Behavior Rating Results

Definition
• 12 persons x 4 raters x 10 days x 3 rating periods
• 1440 total ratings
• ϕ = .77
• D Studies
– 1 observation/day SDO
• 1 day: .50
• 5 days: .83
• 10 days: .91
• 15 days: .93
• 20 days: .98
• 100 days: .99
– 1 observation/day DBR
• 1 day: .48
• 5 days: .82
• 10 days: .91
• 15 days: .93
• 20 days: .97
• 100 days: .99

Term
Implications for Practice from G-studies

Definition
• More dependable estimates obtained via SDO
• SDO records behavior every 15 s vs. every 15 min (DBR)
• Sufficient reliability with SDO after 3 sessions vs. 20 DBR ratings
• Quick decisions best made using SDO
• Findings inconsistent with Hintz & Matthews study
• DBRs less intrusive & can measure low-frequency behaviors
• DBRs less time consuming & require less training effort