REID (Research and Evaluation in Education)


Keywords

estimation, ability, item parameter, Mathematics test, 3PLM/GRM model, MCM/GPCM model

The main purpose of the study was to investigate whether scoring with the combined MCM/GPCM model is superior to scoring with the combined 3PLM/GRM model for mixed-item-format Mathematics tests. To this end, the impact of the two scoring models was examined across test length, sample size, and the proportion of multiple-choice (M-C) items in the mixed-format test, with respect to: (1) estimation of ability and item parameters, (2) optimization of the test information function (TIF), (3) standard error rates, and (4) model fit to the data. The investigation used simulated data generated under a 2 × 3 × 3 × 3 fixed-effects factorial design with 5 replications, yielding 270 data sets. The data were analyzed by fixed-effects MANOVA on the Root Mean Square Error (RMSE) of the ability estimates and on the RMSE and Root Mean Square Deviation (RMSD) of the item parameters, in order to identify significant main effects at α = .05; the interaction effects were incorporated into the error term for statistical testing. The -2LL statistic was also used to evaluate model fit to the data sets. The results of the study show that the combined MCM/GPCM model provides more accurate estimation than the combined 3PLM/GRM model. In addition, the test information given by the combined MCM/GPCM model is three times higher than that of the 3PLM/GRM model, although the test information does not support a firm conclusion about which sample size and M-C item proportion at each test length yields the optimal test information. Finally, the differences in fit statistics between the two scoring models favor the MCM/GPCM model over the 3PLM/GRM model.
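The simulated-data design described above can be sketched in a few lines. The factor labels below are hypothetical (the abstract specifies only the numbers of levels: 2 scoring-model combinations, 3 test lengths, 3 sample sizes, and 3 M-C item proportions), and the `rmse` helper simply mirrors the RMSE criterion used as the dependent variable in the MANOVA:

```python
from itertools import product
from math import sqrt

def rmse(estimates, true_values):
    """Root Mean Square Error between estimated and true parameter values."""
    n = len(estimates)
    return sqrt(sum((e - t) ** 2 for e, t in zip(estimates, true_values)) / n)

# Hypothetical level labels for the 2 x 3 x 3 x 3 fixed-effects design;
# only the level counts come from the abstract.
models = ["3PLM/GRM", "MCM/GPCM"]
test_lengths = ["short", "medium", "long"]
sample_sizes = ["small", "medium", "large"]
mc_proportions = ["low", "medium", "high"]
replications = 5

# Each design cell is replicated 5 times, giving the 270 data sets.
cells = list(product(models, test_lengths, sample_sizes, mc_proportions))
datasets = len(cells) * replications
print(len(cells), datasets)  # 54 design cells, 270 simulated data sets
```

Crossing the four factors gives 2 × 3 × 3 × 3 = 54 cells, and the 5 replications per cell reproduce the 270 data sets reported in the abstract.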
