Data collection design for equivalent groups equating: using a matrix stratification framework for mixed-format assessment

UNCG Author/Contributor (non-UNCG co-authors, if there are any, appear on document)
Kinge Keka Mbella (Creator)
The University of North Carolina at Greensboro (UNCG)
Richard Luecht

Abstract: Mixed-format assessments are increasingly used in large-scale standardized testing to measure a continuum of skills ranging from basic recall to higher-order thinking. These assessments usually combine (a) multiple-choice items, which can be scored efficiently, have stable psychometric properties, and measure a broader range of concepts; and (b) constructed-response items, which measure higher-order thinking skills but are associated with lower psychometric quality and higher administration and scoring costs. Combining such item types in a single test form complicates the use of psychometric procedures, particularly test equating, which is a vital component of standardized assessment. Currently there is very little research that examines the robustness of existing equating methodologies for tests that employ a mixed format. The purpose of this dissertation was twofold. The first goal was to present evidence on the use of a predictive stratification framework, based on an already available covariate, to create equivalent groups. The second goal was to present supporting evidence on appropriate data collection designs for mixed-format test equating. Data from an AP Chemistry test and an AP Spanish Language test, covering a three-year period, were obtained. Two categorical covariates were created based on average AP score and school size from previous years. A 5 × 5 cross-tabulated stratified cluster sampling matrix was created from the two new categorical variables and used to evaluate the accuracy and precision of mixed-format observed-score equipercentile equating. Six research conditions were investigated using a re-sampling framework as follows: (a) two random stratified cluster groups equating designs, (b) two test form conditions, (c) four sampling rates, (d) two AP test subjects, (e) two sampling frame conditions, and (f) three equating designs.
There were two major findings summarized from the 500 bootstrap replications in each design condition. First, the random stratified cluster group equating design had the most conditions with total equating error less than 0.1 standard deviation units of the raw score scale. Second, Model 1, in which the equating function was estimated using a smaller sample and the larger sampling frame, was more accurate than Model 2, in which the equating function was based on two equivalent samples from the stratified matrix. An unanticipated but interesting finding was that equating estimates from AP Spanish were more accurate than those from AP Chemistry, despite the fact that the disattenuated correlation coefficient between the multiple-choice and constructed-response sections was higher (unity) in AP Chemistry than in AP Spanish.
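
The 5 × 5 stratified cluster sampling matrix described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's procedure: it assumes that schools are the sampled clusters, that each covariate is cut into five strata by rank-based quintiles, and that the same sampling rate is applied in every cell. The function names, field names (`mean_score`, `size`), and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def quintile_bins(values):
    """Assign each value a stratum 0..4 by its rank (assumed quintile cut)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * 5 // len(values)
    return bins


def stratified_cluster_sample(schools, rate, seed=0):
    """Draw roughly `rate` of the schools from every non-empty matrix cell."""
    rng = random.Random(seed)
    score_bin = quintile_bins([s["mean_score"] for s in schools])
    size_bin = quintile_bins([s["size"] for s in schools])

    # Cross-tabulate the two categorical covariates into a 5 x 5 matrix.
    cells = defaultdict(list)
    for school, r, c in zip(schools, score_bin, size_bin):
        cells[(r, c)].append(school)

    sample = []
    for key in sorted(cells):
        members = cells[key]
        k = max(1, round(rate * len(members)))  # at least one school per cell
        sample.extend(rng.sample(members, k))
    return sample, cells


# Toy sampling frame: 200 hypothetical schools with made-up covariates.
toy = random.Random(1)
schools = [
    {"id": i, "mean_score": toy.uniform(1, 5), "size": toy.randint(5, 400)}
    for i in range(200)
]
sample, cells = stratified_cluster_sample(schools, rate=0.25)
print(len(cells), "non-empty cells;", len(sample), "schools sampled")
```

Drawing two such samples with different seeds would give two groups that are balanced on both covariates, which is the general idea behind forming equivalent groups from the matrix before estimating an equating function.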

Additional Information

Language: English
Date: 2012
Advanced Placement classes, standardized testing, Equipercentile, standardized assessments
Educational tests and measurements -- Evaluation
Scaling (Social sciences)
Advanced placement programs (Education) -- Evaluation
