Conditions affecting the accuracy of classical equating methods for small samples under the NEAT design: a simulation study

UNCG Author/Contributor (non-UNCG co-authors, if there are any, appear on document)
Devdass Sunnassee (Creator)
The University of North Carolina at Greensboro (UNCG)
Richard Luecht

Abstract: Small-sample equating remains a largely unexplored area of research. This study attempts to fill some of the research gaps via a large-scale, IRT-based simulation study that evaluates the performance of seven small-sample equating methods under various test-characteristic and sampling conditions. The equating methods considered are typically applied to non-equivalent groups with anchor test (NEAT) designs using observed scores, where common items are used to link two or more test forms: (1) the identity method (IDEN); (2) the circle-arc method (CARC); (3) the chained linear method (CLIN); (4) the smoothed chained equipercentile method (SCEE); (5) the smoothed frequency estimation method (SFRE); (6) the Tucker method (TLIN); and (7) the Levine observed-score method (LLIN). The simulation study design includes 60 test-characteristic conditions, spanning various test lengths and levels of test difficulty and measurement precision, and 20 sampling conditions related to sample size and the magnitude of ability differences between the samples under the NEAT design. The IRT-based simulations provide a powerful way to evaluate equating errors in an absolute sense, even though IRT-based equating is not considered in this comparative study. The ultimate purpose of this study is to establish a set of guidelines that may help testing practitioners better understand which methods of small-sample equating work best under particular conditions, as well as when small-sample equating may not be appropriate.
The findings suggest that caution is needed when equating small samples under the NEAT design when any of six conditions occurs: (1) the sample size for either the base test form or any alternate form is 50 or smaller; (2) the difference in average ability between the groups is larger than 0.1 standard deviation units; (3) the alternate forms differ in mean item difficulty from the base form by more than a quarter of a standard deviation unit; (4) the average item discrimination of any alternate test form is considerably lower than that of the base form; (5) the test forms being equated have too few items (30 or fewer); and (6) the base form's average item discrimination is relatively low. Outside of these rather extreme conditions, the simulation results suggest that small-sample equating is indeed feasible. The relative ordering of the seven small-sample equating methods in terms of accuracy (mean bias), from best to worst, is: LLIN, CLIN, SCEE, TLIN, SFRE, CARC, and IDEN. However, all of the methods produce comparable results when the equating samples are similar in average ability. The variability of the equating errors was also used to rank-order the seven equating methods, producing the following sequence: SFRE, SCEE, CLIN, TLIN, LLIN, CARC, and IDEN. Interestingly, the IDEN and, to a lesser extent, the CARC methods are consistently the most accurate and stable when the equated forms are equal in difficulty (i.e., when no equating is needed). However, these two methods tend to produce very biased scores for longer tests. Other results were more idiosyncratic in nature and are addressed in detail in Chapter IV.
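To make the NEAT linking idea concrete, the sketch below illustrates one of the seven methods compared here, the chained linear (CLIN) method: form X scores are linearly linked to the anchor-test scale using the group that took X, and the anchor is then linearly linked to the form Y scale using the group that took Y, with the two links composed. This is a minimal illustrative sketch based on the standard mean/standard-deviation linear link; the function names, variable names, and data are hypothetical and are not taken from the study.

```python
import statistics

def linear_link(score, m_from, s_from, m_to, s_to):
    """Linearly map a score by matching means and standard deviations."""
    return m_to + (s_to / s_from) * (score - m_from)

def chained_linear_equate(x, form_x_grp1, anchor_grp1, anchor_grp2, form_y_grp2):
    """Chained linear (CLIN) equating under a NEAT design.

    Group 1 takes form X plus the anchor; group 2 takes form Y plus
    the anchor. Link X to the anchor scale in group 1, then link the
    anchor to Y in group 2, and compose the two linear links.
    """
    mX, sX = statistics.mean(form_x_grp1), statistics.stdev(form_x_grp1)
    mV1, sV1 = statistics.mean(anchor_grp1), statistics.stdev(anchor_grp1)
    mV2, sV2 = statistics.mean(anchor_grp2), statistics.stdev(anchor_grp2)
    mY, sY = statistics.mean(form_y_grp2), statistics.stdev(form_y_grp2)
    v = linear_link(x, mX, sX, mV1, sV1)      # form X -> anchor (group 1)
    return linear_link(v, mV2, sV2, mY, sY)   # anchor -> form Y (group 2)

# Toy data (hypothetical): a form X score of 10 maps onto the form Y scale.
y_equiv = chained_linear_equate(
    10,
    form_x_grp1=[8, 10, 12],   # mean 10, sd 2
    anchor_grp1=[4, 5, 6],     # mean 5,  sd 1
    anchor_grp2=[5, 6, 7],     # mean 6,  sd 1
    form_y_grp2=[10, 12, 14],  # mean 12, sd 2
)
```

With these toy statistics, the group-2 anchor mean is one anchor standard deviation higher than the group-1 anchor mean, so a form X score at the group-1 mean lands one Y standard deviation below the group-2 form Y mean.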

Additional Information

Language: English
Date: 2011
Examinations -- Scoring -- Statistical methods
Educational tests and measurements -- Statistical methods
