Keywords
AIG; Bloom’s taxonomy; ChatGPT; Gemini; LLM; Physics items
Document Type
Article
Abstract
High-quality items are essential for producing reliable and valid assessments, offering valuable insights for decision-making processes. As the demand for items with strong psychometric properties grows for both summative and formative assessments, automatic item generation (AIG) has gained prominence. Research highlights the potential of large language models (LLMs) in the AIG process: generative AI tools such as ChatGPT have had a positive impact on educational assessment and are recognized for their ability to generate various item types across different languages and subjects. This study addresses a research gap by exploring how AI-generated items in secondary/high school physics align with an educational taxonomy. It draws on Bloom's taxonomy, a well-known framework for designing and categorizing assessment items across cognitive levels, from lower to higher order, and focuses on a preliminary assessment of LLMs' ability to generate physics items that match the Application level of Bloom's taxonomy. Two leading LLMs, ChatGPT (GPT-4) and Gemini, were chosen for their strong performance in creating high-quality educational content. The research used various prompts to generate items at different cognitive levels of Bloom's taxonomy. These items were assessed against multiple criteria: clarity, accuracy, absence of misleading content, appropriate complexity, correct language use, alignment with the intended level of Bloom's taxonomy, solvability, and assurance of a single correct answer. The findings indicated that both ChatGPT and Gemini were capable of generating physics assessment items, though their effectiveness varied with the prompting method used. Instructional prompts in particular produced excellent outputs from both models, yielding items that were clear, precise, and consistently aligned with the Application level of Bloom's taxonomy.
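To make the prompting procedure concrete, the sketch below shows how an instructional prompt of the kind the abstract describes might be issued to one of the two models through the OpenAI Python client. This is a minimal illustration only: the client choice, model name, and prompt wording are assumptions, not the authors' actual materials; the article itself documents the exact prompts and evaluation rubric.

```python
# Illustrative sketch only. The study's exact prompts, model versions, and
# rubric are reported in the article; everything below is assumed for
# demonstration purposes.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# An "instructional" prompt names the subject, the target Bloom's taxonomy
# level, and the quality constraints the generated item must satisfy.
prompt = (
    "You are a secondary school physics teacher. Write one multiple-choice "
    "item on Newton's second law at the Application level of Bloom's "
    "taxonomy. The item must be clear, solvable from the given data, have "
    "exactly one correct answer, and offer four options labeled A-D. "
    "Indicate the correct option."
)

response = client.chat.completions.create(
    model="gpt-4",  # one of the two models compared in the study
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```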
Page Range
168-185
Issue
2
Volume
10
Digital Object Identifier (DOI)
10.21831/reid.v10i2.76864
Recommended Citation
Omopekunola, M. O., & Kardanova, E. Y. (2024). Automatic generation of physics items with Large Language Models (LLMs). REID (Research and Evaluation in Education), 10(2), 168–185. https://doi.org/10.21831/reid.v10i2.76864
References
Abduljabbar, D. A., & Omar, N. (2015). Exam questions classification based on Bloom’s taxonomy cognitive level using classifiers combination. Journal of Theoretical and Applied Information Technology, 78(3), 447–455.
Adams, N. E. (2015). Bloom’s taxonomy of cognitive learning objectives. Journal of the Medical Library Association: JMLA, 103(3), 152–153. https://doi.org/10.3163/1536-5050.103.3.010
Agarwal, P. K. (2019). Retrieval practice & Bloom’s taxonomy: Do students need fact knowledge before higher order learning? Journal of Educational Psychology, 111(2), 189–209. https://doi.org/10.1037/edu0000282
Agarwal, R., Singh, A., Zhang, L. M., Bohnet, B., Chan, S., Zhang, B., Anand, A., Abbas, Z., Nova, A., Co-Reyes, J. D., Chu, E., Behbahani, F., Faust, A., & Larochelle, H. (2024). Many-shot in-context learning. arXiv. https://doi.org/10.48550/arXiv.2404.11018
Alsubait, T., Parsia, B., & Sattler, U. (2015). Generating multiple choice questions from ontologies: How far can we go? In P. Lambrix, E. Hyvönen, E. Blomqvist, V. Presutti, G. Qi, U. Sattler, Y. Ding, & C. Ghidini (Eds.), Knowledge engineering and knowledge management (EKAW 2014): Lecture notes in computer science (vol. 8982, pp. 66–79). Springer. https://doi.org/10.1007/978-3-319-17966-7_7
Archibald, S., Coggshall, J. G., Croft, A., & Goe, L. (2011). High-quality professional development for all teachers: Effectively allocating resources [Research & policy brief]. National Comprehensive Center for Teacher Quality. https://files.eric.ed.gov/fulltext/ED520732.pdf
Arendasy, M., & Sommer, M. (2007). Using psychometric technology in educational assessment: The case of a schema-based isomorphic approach to the automatic generation of quantitative reasoning items. Learning and Individual Differences, 17(4), 366–383. https://doi.org/10.1016/j.lindif.2007.03.005
Attali, Y. (2018). Automatic item generation unleashed: An evaluation of a large-scale deployment of item models. In C. P. Rosé, R. Martínez-Maldonado, H. U. Hoppe, R. Luckin, M. Mavrikis, K. Porayska-Pomsta, B. McLaren, & B. du Boulay (Eds.), Artificial intelligence in education: The 19th International Conference, AIED 2018 (pp. 17–29). Springer. https://doi.org/10.1007/978-3-319-93843-1_2
Barnum, C. M. (2020). Usability testing essentials: Ready, set... test! (2nd ed.). Morgan Kaufmann.
Bejar, I. I. (2002). Generative testing: From conception to implementation. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 199–217). Lawrence Erlbaum Associates Publishers.
Bezirhan, U., & von Davier, M. (2023). Automated reading passage generation with OpenAI’s large language model. Computers and Education: Artificial Intelligence, 5, 1–13. https://doi.org/10.1016/j.caeai.2023.100161
Bhandari, S., Liu, Y., Kwak, Y., & Pardos, Z. A. (2024). Evaluating the psychometric properties of ChatGPT-generated questions. Computers and Education: Artificial Intelligence, 7, 1–9. https://doi.org/10.1016/j.caeai.2024.100284
Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives: The classification of educational goals (Handbook I: Cognitive domain). Longmans.
Borji, A. (2023). Stochastic parrots or intelligent systems? A perspective on true depth of understanding in LLMs. SSRN Electronic Journal, 1–10. https://doi.org/10.2139/ssrn.4507038
Bozkurt, A., & Sharma, R. C. (2023). Challenging the status quo and exploring the new boundaries in the age of algorithms: Reimagining the role of generative AI in distance education and online learning. Asian Journal of Distance Education, 18(1), 1–8. https://doi.org/10.5281/zenodo.7755273
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., ... Amodei, D. (2020). Language models are few-shot learners. arXiv. https://doi.org/10.48550/arXiv.2005.14165
Buick, J. M. (2011). Physics assessment and the development of a taxonomy. European Journal of Physics Education, 2(1), 7–15. https://files.eric.ed.gov/fulltext/EJ1053836.pdf
Bulut, O., Beiting-Parrish, M., Casabianca, J. M., Slater, S. C., Jiao, H., Song, D., Ormerod, C. M., Fabiyi, D. G., Ivan, R., Walsh, C., Rios, O., Wilson, J., Yildirim-Erbasli, S. N., Wongvorachan, T., Liu, J. X., Tan, B., & Morilova, P. (2024). The rise of artificial intelligence in educational measurement: Opportunities and ethical challenges. arXiv. https://doi.org/10.48550/arXiv.2406.18900
Burns, M. K., Riley-Tillman, T. C., & Rathvon, N. (2017). Effective school interventions: Evidence-based strategies for improving student outcomes (3rd ed.). Guilford Press.
Chang, W. C., & Chung, M. S. (2009). Automatic applying Bloom’s taxonomy to classify and analysis the cognition level of English question items. Proceedings of the 2009 Joint Conferences on Pervasive Computing (JCPC), 727–734. https://doi.org/10.1109/JCPC.2009.5420087
Collins, A., Brown, J. S., & Newman, S. E. (1989). Cognitive apprenticeship: Teaching the crafts of reading, writing, and mathematics. In L. B. Resnick (Ed.), Knowing, learning, and instruction: Essays in honor of Robert Glaser (pp. 453–494). Routledge.
Crowe, A., Dirks, C., & Wenderoth, M. P. (2008). Biology in Bloom: Implementing Bloom’s taxonomy to enhance student learning in biology. CBE—Life Sciences Education, 7(4), 368–381. https://doi.org/10.1187/cbe.08-05-0024
Dao, X. Q., & Le, N. B. (2023). LLMs performance on Vietnamese high school biology examination. International Journal of Modern Education and Computer Science, 15(6), 14–30. https://doi.org/10.5815/ijmecs.2023.06.02
Darling-Hammond, L. (2015). Getting teacher evaluation right: What really matters for effectiveness and improvement. Teachers College Press.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv. https://doi.org/10.48550/arXiv.1810.04805
Doughty, J., Wan, Z., Bompelli, A., Qayum, J., Wang, T., Zhang, J., Zheng, Y., Doyle, A., Sridhar, P., Agarwal, A., Bogart, C., Keylor, E., Kultur, C., Savelka, J., & Sakr, M. (2024). A comparative study of AI-generated (GPT-4) and human-crafted MCQs in programming education. Proceedings of the 26th Australasian Computing Education Conference, 114–123. https://doi.org/10.1145/3636243.3636256
Embretson, S. E. (2005). Measuring human intelligence with artificial intelligence: Adaptive item generation. In R. J. Sternberg & J. E. Pretz (Eds.), Cognition and intelligence: Identifying the mechanisms of the mind (pp. 251–267). Cambridge University Press.
Embretson, S., & Yang, X. (2007). Automatic item generation and cognitive psychology. In C. R. Rao, & S. Sinharay (Eds.), Handbook of statistics: Psychometrics (vol. 26, pp. 747–768). Elsevier. https://doi.org/10.1016/S0169-7161(06)26023-1
Feng, S., Park, C. Y., Liu, Y., & Tsvetkov, Y. (2023). From pretraining data to language models to downstream tasks: Tracking the trails of political biases leading to unfair NLP models. arXiv. https://doi.org/10.48550/arXiv.2305.08283
Gao, T., Fisch, A., & Chen, D. (2020). Making pre-trained language models better few-shot learners. arXiv. https://doi.org/10.48550/arXiv.2012.15723
Gierl, M. J., & Haladyna, T. M. (2012). Automatic item generation: An introduction. In M. J. Gierl & T. M. Haladyna (Eds.), Automatic item generation: Theory and practice (pp. 3–12). Routledge.
Glas, C. A., & van der Linden, W. J. (2003). Computerized adaptive testing with item cloning. Applied Psychological Measurement, 27(4), 247–261. https://doi.org/10.1177/0146621603027004001
Gorin, J. S. (2006). Test design with cognition in mind. Educational Measurement: Issues and Practice, 25(4), 21–35. https://doi.org/10.1111/j.1745-3992.2006.00076.x
Gregorcic, B., & Pendrill, A. (2023). ChatGPT and the frustrated Socrates. Physics Education, 58(3), 1–9. https://doi.org/10.1088/1361-6552/acc299
Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy, J., & Wang, H. (2023). Large language models for software engineering: A systematic literature review. arXiv. https://doi.org/10.48550/arXiv.2308.10620
Ingwersen, P. (1996). Cognitive perspectives of information retrieval interaction: Elements of a cognitive IR theory. Journal of Documentation, 52(1), 3–50. https://doi.org/10.1108/eb026960
Irvine, J. (2021). Taxonomies in education: Overview, comparison, and future directions. Journal of Education and Development, 5(2), 1–25. https://doi.org/10.20849/jed.v5i2.898
Islam, R., & Ahmed, I. (2024). Gemini-the most powerful LLM: Myth or truth. Proceedings of the 2024 5th Information Communication Technologies Conference (ICTC 2024), 303–308. https://doi.org/10.1109/ICTC61510.2024.10602253
Karabenick, S. A., Woolley, M. E., Friedel, J. M., Ammon, B. V., Blazevski, J., Bonney, C. R., de Groot, E., Gilbert, M. C., Musu, L., Kempler, T. M., & Kelly, K. L. (2007). Cognitive processing of self-report items in educational research: Do they think what we mean? Educational Psychologist, 42(3), 139–151. https://doi.org/10.1080/00461520701416231
Kong, S. C. (2014). Developing information literacy and critical thinking skills through domain knowledge learning in digital classrooms: An experience of practicing flipped classroom strategy. Computers & Education, 78, 160–173. https://doi.org/10.1016/j.compedu.2014.05.009
Krathwohl, D. R. (2002). A revision of Bloom’s taxonomy: An overview. Theory into Practice, 41(4), 212–218. https://doi.org/10.1207/s15430421tip4104_2
Krathwohl, D. R., & Anderson, L. W. (2010). Merlin C. Wittrock and the revision of Bloom’s taxonomy. Educational Psychologist, 45(1), 64–65. https://doi.org/10.1080/00461520903433562
Küchemann, S., Steinert, S., Revenga, N., Schweinberger, M., Dinc, Y., Avila, K. E., & Kuhn, J. (2023). Can ChatGPT support prospective teachers in physics task development? Physical Review Physics Education Research, 19(2), 1–14. https://doi.org/10.1103/PhysRevPhysEducRes.19.020128
Kurdi, G., Leo, J., Parsia, B., Sattler, U., & Al-Emari, S. (2020). A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education, 30, 121–204. https://doi.org/10.1007/s40593-019-00186-y
Laverghetta, A., Jr., & Licato, J. (2023). Generating better items for cognitive assessments using large language models. Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), 414–428. https://doi.org/10.18653/v1/2023.bea-1.34
Li, J., Tang, T., Zhao, W. X., Nie, J. Y., & Wen, J. R. (2024). Pre-trained language models for text generation: A survey. ACM Computing Surveys, 56(9), 1–39. https://doi.org/10.1145/3649449
Li, S. (2021). Measuring cognitive engagement: An overview of measurement instruments and techniques. International Journal of Psychology and Educational Studies, 8(3), 63–76. https://doi.org/10.52380/ijpes.2021.8.3.239
Liu, Y., Han, T., Ma, S., Zhang, J., Yang, Y., Tian, J., He, H., Li, A., He, M., Liu, Z., Wu, Z., Zhao, L., Zhu, D., Li, X., Qiang, N., Shen, D., Liu, T., & Ge, B. (2023). Summary of ChatGPT-related research and perspective towards the future of large language models. Meta-Radiology, 1(2), 1–14. https://doi.org/10.1016/j.metrad.2023.100017
Lorenzo, C. M. (2024). Integrating large language models for real-world problem modelling: A comparative study. Proceedings of the 18th International Technology, Education and Development (INTED 2024) Conference, 3262–3272. https://doi.org/10.21125/inted.2024.0871
Marvin, G., Hellen, N., Jjingo, D., & Nakatumba-Nabende, J. (2024). Prompt engineering in large language models. In I. J. Jacob, S. Piramuthu, & P. Falkowski-Gilski (Eds.), Proceedings of the International Conference on Data Intelligence and Cognitive Informatics (ICDICI 2023) (pp. 387–402). Springer. https://doi.org/10.1007/978-981-99-7962-2_30
Miao, J., Thongprayoon, C., Suppadungsuk, S., Krisanapan, P., Radhakrishnan, Y., & Cheungpasitporn, W. (2024). Chain of thought utilization in large language models and application in nephrology. Medicina, 60(1), 1–19. https://doi.org/10.3390/medicina60010148
Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., & Gao, J. (2024). Large language models: A survey. arXiv. https://doi.org/10.48550/arXiv.2402.06196
Miri, B., David, B. C., & Uri, Z. (2007). Purposely teaching for the promotion of higher-order thinking skills: A case of critical thinking. Research in Science Education, 37, 353–369. https://doi.org/10.1007/s11165-006-9029-2
Mishra, S., Khashabi, D., Baral, C., Choi, Y., & Hajishirzi, H. (2021). Reframing instructional prompts to GPTk’s language. arXiv. https://doi.org/10.48550/arXiv.2109.07830
Mohammed, M., & Omar, N. (2020). Question classification based on Bloom’s taxonomy cognitive domain using modified TF-IDF and word2vec. PLoS ONE, 15(3), 1–21. https://doi.org/10.1371/journal.pone.0230442
Motlhabane, A. (2017). Unpacking the South African physics-examination questions according to Bloom’s revised taxonomy. Journal of Baltic Science Education, 16(6), 919–931.
Mystakidis, S., Fragkaki, M., & Filippousis, G. (2021). Ready teacher one: Virtual and augmented reality online professional development for K-12 school teachers. Computers, 10(10), 134. https://doi.org/10.3390/computers10100134
Offerijns, J., Verberne, S., & Verhoef, T. (2020). Better distractions: Transformer-based distractor generation and multiple choice question filtering. arXiv. https://doi.org/10.48550/arXiv.2010.09598
Perikos, I., Kardakis, S., & Hatzilygeroudis, I. (2021). Sentiment analysis using novel and interpretable architectures of Hidden Markov models. Knowledge-Based Systems, 229, 1–18. https://doi.org/10.1016/j.knosys.2021.107332
Polat, F., Tiddi, I., & Groth, P. (2024). Testing prompt engineering methods for knowledge extraction from text. Semantic Web. https://www.semantic-web-journal.net/system/files/swj3719.pdf
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI.
Rangapur, A., & Rangapur, A. (2024). The battle of LLMs: A comparative study in conversational QA tasks. arXiv. https://doi.org/10.48550/arXiv.2405.18344
Roumeliotis, K. I., & Tselikas, N. D. (2023). ChatGPT and Open-AI models: A preliminary review. Future Internet, 15(6), 1–24. https://doi.org/10.3390/fi15060192
Santos, R. P. D. (2023). Enhancing physics learning with ChatGPT, Bing Chat, and Bard as agents-to-think-with: A comparative case study. arXiv. https://doi.org/10.48550/arXiv.2306.00724
Scully, D. (2017). Constructing multiple-choice items to measure higher-order thinking. Practical Assessment, Research, and Evaluation, 22(1), 1–13. https://doi.org/10.7275/swgt-rj52
Shen, Y., Heacock, L., Elias, J., Hentel, K. D., Reig, B., Shih, G., & Moy, L. (2023). ChatGPT and other large language models are double-edged swords. Radiology, 307(2), 1–4. https://doi.org/10.1148/radiol.230163
Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). Consistency models. arXiv. https://doi.org/10.48550/arXiv.2303.01469
Tabrizi, S., & Rideout, G. (2017). Active learning: Using Bloom’s taxonomy to support critical pedagogy. International Journal for Cross-Disciplinary Subjects in Education, 8(3), 3202–3209.
Tan, B., Armoush, N., Mazzullo, E., Bulut, O., & Gierl, M. (2024). A review of automatic item generation techniques leveraging large language models. EdArXiv Preprints. https://doi.org/10.35542/osf.io/6d8tj
Tomlinson, C. A. (2017). How to differentiate instruction in academically diverse classrooms. ASCD.
Tomlinson, C. A. (2023). The parallel curriculum model: A design to develop potential & challenge high-ability learners. In J. S. Renzulli, E. J. Gubbins, K. S. McMillen, R. D. Eckert, & C. A. Little (Eds.), Systems and models for developing programs for the gifted and talented (pp. 571–598). Routledge. https://doi.org/10.4324/9781003419426
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv. https://doi.org/10.48550/arXiv.1706.03762
Veal, W. R., & MaKinster, J. G. (1999). Pedagogical content knowledge taxonomies. The Electronic Journal for Research in Science & Mathematics Education, 3(4). https://ejrsme.icrsme.com/article/view/7615
Wang, H., Guo, B., Wu, W., Liu, S., & Yu, Z. (2021). Towards information-rich, logical dialogue systems with knowledge-enhanced neural models. Neurocomputing, 465, 248–264. https://doi.org/10.1016/j.neucom.2021.08.131
Yahya, A. A., Toukal, Z., & Osman, A. (2012). Bloom’s taxonomy–based classification for item bank questions using support vector machines. In W. Ding, H. Jiang, M. Ali, & M. Li (Eds.), Modern advances in intelligent systems and tools: Studies in computational intelligence (vol. 431, pp. 135–140). Springer. https://doi.org/10.1007/978-3-642-30732-4_17
Yu, W., Zhu, C., Li, Z., Hu, Z., Wang, Q., Ji, H., & Jiang, M. (2022). A survey of knowledge-enhanced text generation. ACM Computing Surveys, 54(11s), 1–38. https://doi.org/10.1145/3512467
Zhang, M., & Li, J. (2021). A commentary of GPT-3 in MIT Technology Review 2021. Fundamental Research, 1(6), 831–833. https://doi.org/10.1016/j.fmre.2021.11.011
Zhong, Q., Ding, L., Liu, J., Du, B., & Tao, D. (2023). Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv. https://doi.org/10.48550/arXiv.2302.10198