Jurnal Penelitian dan Evaluasi Pendidikan

Test fairness must be considered when developing a test instrument. An instrument should not be biased against any group of test takers, which means that its items should not function differently for male and female test takers of comparable ability. This study examines the extent to which the items in an English proficiency test function differently across gender, a property known as differential item functioning (DIF). Fifty reading items were tested individually for gender DIF using Rasch model analysis in ConQuest. Six items were flagged for DIF: three were basic comprehension items and three were vocabulary questions. Possible ways of dealing with DIF items are also discussed.
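To make the idea of gender DIF under the Rasch model concrete, the following is a minimal illustrative sketch, not the analysis run in ConQuest for this study. It assumes the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between person ability and item difficulty (both in logits). The function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def dif_contrast(difficulty_group_a, difficulty_group_b):
    """Difference in an item's difficulty (logits) between two groups.

    A contrast of 0 means the item is equally difficult for both groups
    once ability is taken into account, i.e. no DIF.
    """
    return difficulty_group_a - difficulty_group_b


# Hypothetical values: two test takers with the same ability, and one item
# whose estimated difficulty differs by gender group.
ability = 0.5
item_difficulty_male = 0.0
item_difficulty_female = 0.6

p_male = rasch_probability(ability, item_difficulty_male)
p_female = rasch_probability(ability, item_difficulty_female)
contrast = dif_contrast(item_difficulty_female, item_difficulty_male)

print(f"P(correct | male)   = {p_male:.3f}")
print(f"P(correct | female) = {p_female:.3f}")
print(f"DIF contrast        = {contrast:.2f} logits")
```

In this sketch the two test takers have identical ability, yet their chances of answering the item correctly differ because the item's difficulty is group-specific. In practice, whether a contrast is large enough to flag an item depends on its standard error and on the criterion adopted by the analyst.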
