Jurnal Penelitian dan Evaluasi Pendidikan


Sugeng Sugeng


Keywords: vertical equating, partial credit model, anchor test, calibration, RMSD, RMSE


This study aims to determine the minimum sample size, the effect of test length, the minimum anchor test length, and the equating method for vertical equating under the partial credit model of junior secondary school (SMP) Mathematics items, using the common-item nonequivalent groups design. Data generation varied grade level across sample size (300; 600; 1000), test length (10; 20), and ability distribution (N(0,1), N(1,1)), with 50 replications, using the WinGen2 program. Vertical equating involved (a) anchor test lengths of 2, 3, 4, 5, and 8 items for the 20-item test; and (b) anchor test lengths of 2, 3, 4, and 5 items for the 10-item test. Equating accuracy was evaluated using RMSD and RMSE. The results show that: (1) vertical equating with a sample of 300 yields fairly small mean RMSD and RMSE in all conditions; (2) equating accuracy increases as test length increases; (3) with an anchor test length in the range of 25% to 30% for polytomous items, vertical equating under the partial credit model requires a minimum anchor test length of 5 items for a 20-item test and 3 items for a 10-item test; and (4) the Mean/Mean method tends to be the most accurate in IRT vertical equating of Mathematics test items under the partial credit model, followed by Stocking-Lord, Mean/Sigma, and Haebara.
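As a rough illustration of two of the moment-based linking methods named in the abstract, the sketch below computes Mean/Sigma and Mean/Mean transformation coefficients from anchor-item parameters and then an RMSD between the transformed and target anchor difficulties. It assumes the standard linear scale transformation b* = A·b + B. The function names, the toy parameter values, and the RMSD definition used here (root mean squared difference over anchor items) are illustrative assumptions, not the study's actual procedure or data.

```python
import math
from statistics import mean, pstdev


def mean_sigma(b_source, b_target):
    """Mean/Sigma coefficients from anchor-item difficulties (assumed form)."""
    A = pstdev(b_target) / pstdev(b_source)
    B = mean(b_target) - A * mean(b_source)
    return A, B


def mean_mean(a_source, a_target, b_source, b_target):
    """Mean/Mean coefficients; uses anchor discriminations for the slope."""
    A = mean(a_source) / mean(a_target)
    B = mean(b_target) - A * mean(b_source)
    return A, B


def rmsd(x, y):
    """Root mean squared difference between two equal-length sequences."""
    return math.sqrt(mean([(xi - yi) ** 2 for xi, yi in zip(x, y)]))


# Hypothetical anchor-item parameters on the source and target scales.
b_source = [-1.0, 0.0, 1.0]
b_target = [-0.7, 0.5, 1.7]   # constructed as 1.2 * b_source + 0.5
a_source = [1.0, 1.0, 1.0]    # fixed slopes, as in a Rasch-family model
a_target = [1.0, 1.0, 1.0]

A_ms, B_ms = mean_sigma(b_source, b_target)
A_mm, B_mm = mean_mean(a_source, a_target, b_source, b_target)

# Place source difficulties on the target scale and compare with the target.
rmsd_ms = rmsd([A_ms * b + B_ms for b in b_source], b_target)
rmsd_mm = rmsd([A_mm * b + B_mm for b in b_source], b_target)
```

With all discriminations fixed at 1, as in this toy setup, the Mean/Mean slope is exactly 1, so only the intercept B shifts the scale. Mean/Sigma can also rescale the spread of the difficulties.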
