Quality control and the impact of variation and prediction errors on item family design

UNCG Author/Contributor (non-UNCG co-authors, if there are any, appear on document)
Shonai Someshwar (Creator)
Institution
The University of North Carolina at Greensboro (UNCG)
Web Site: http://library.uncg.edu/
Advisor
Richard Luecht

Abstract: This two-part study examined the impact of variation within item families and of errors associated with predicted item difficulty parameters on examinee test scores. Part A served as an extension of Shu et al.’s (2010) study, addressing how much variation can exist within item families before it begins to negatively affect scores. Part A also evaluated the impact on examinee scores of two calibration strategies: CS1 (calibrating task model families) and CS2 (calibrating individual items). Part B attempted to verify Bejar’s (1983) proposition that an explained variance of 80 percent must be reached before predicted item difficulties can substitute for empirical estimates obtained from pretesting. Both parts relied on a simulation approach to generate item families of differing quality and predicted item difficulties across different degrees of explained variance. Quality control (QC) statistics were used to assess variation in IRT statistics and its impact on examinee scores. The results from Part A suggested that CS1 and CS2 were appropriate for the low-variation (< 0.2s) and high-variation (0.2s to 0.5s) conditions, respectively. Although within task model family variations of 0.2s and 0.5s produced increasing trends in bias and RMSE under CS1 for the moderate and high conditions, this variation ultimately did not result in significant score differences between the two calibration strategies, especially for longer tests. The findings from Part B showed that IRT models are robust enough to withstand the error introduced by poorly predicted difficulty parameters used to score examinees. While the estimated scores remained relatively unaffected, the residual-based fit statistics (for the probability of an examinee endorsing an item given the estimated scores and predicted item parameters) revealed larger errors as the correlations between the true and predicted item difficulties decreased. Results from the person-fit analysis revealed that misfit is more likely to occur under the lower-correlation conditions. Overall, the results from both Part A and Part B showed that developing a QC system for modern item and test development approaches is feasible and even necessary. [This abstract may have been edited to remove characters that will not display in this system. Please see the PDF for the full abstract.]
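For readers unfamiliar with the simulation logic summarized above, the following is a minimal illustrative sketch (not the dissertation's actual code) of how within-family variation in Rasch item difficulties might be generated and how bias and RMSE of ability estimates could be compared when scoring with family-level versus item-level difficulties, loosely mirroring the CS1/CS2 contrast. Sample sizes, test length, the Rasch model, and the Newton-Raphson scoring routine are all assumptions made for illustration only.

```python
# Illustrative sketch: within-family difficulty variation under a Rasch model,
# scored with family-mean difficulties (CS1-like) vs. sibling difficulties (CS2-like).
import numpy as np

rng = np.random.default_rng(7)

def simulate(family_sd, n_examinees=1000, n_families=40):
    theta = rng.normal(0, 1, n_examinees)                      # true abilities
    family_b = rng.normal(0, 1, n_families)                    # family mean difficulties
    item_b = family_b + rng.normal(0, family_sd, n_families)   # one sibling per family
    p = 1 / (1 + np.exp(-(theta[:, None] - item_b[None, :])))  # Rasch probabilities
    x = rng.binomial(1, p)                                     # scored (0/1) responses
    return theta, family_b, item_b, x

def ml_theta(x, b, n_iter=25):
    """Newton-Raphson ML estimate of theta for a Rasch model with known b."""
    t = np.zeros(x.shape[0])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(t[:, None] - b[None, :])))
        grad = (x - p).sum(axis=1)          # first derivative of the log-likelihood
        info = (p * (1 - p)).sum(axis=1)    # test information
        t = np.clip(t + grad / info, -4, 4) # clip to handle perfect response patterns
    return t

for family_sd in (0.2, 0.5):
    theta, family_b, item_b, x = simulate(family_sd)
    t_family = ml_theta(x, family_b)        # score with family-level difficulties
    t_item = ml_theta(x, item_b)            # score with item-level difficulties
    for label, est in (("family-level b", t_family), ("item-level b", t_item)):
        bias = np.mean(est - theta)
        rmse = np.sqrt(np.mean((est - theta) ** 2))
        print(f"SD={family_sd:.1f}  {label:15s}  bias={bias:+.3f}  RMSE={rmse:.3f}")
```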

Additional Information

Publication
Dissertation
Language: English
Date: 2024
Keywords
Assessment Engineering, Item Difficulty Prediction, Item Families, Quality Control
Subjects
Educational tests and measurements -- Data processing
Item response theory
Prediction theory
