Keywords
AIG; Bloom’s taxonomy; ChatGPT; Gemini; LLM; Physics items
Document Type
Article
Abstract
High-quality items are essential for producing reliable and valid assessments, offering valuable insights for decision-making processes. As the demand for items with strong psychometric properties grows for both summative and formative assessments, automatic item generation (AIG) has gained prominence. Research highlights the potential of large language models (LLMs) in the AIG process: generative AI tools such as ChatGPT have had a positive impact on educational assessment and are recognized for their ability to generate various item types across different languages and subjects. This study addresses a research gap by exploring how AI-generated items in secondary/high school physics align with an educational taxonomy. It draws on Bloom's taxonomy, a well-known framework for designing and categorizing assessment items across cognitive levels, from lower to higher order, and focuses on a preliminary assessment of LLMs' ability to generate physics items that match the Application level of Bloom's taxonomy. Two leading LLMs, ChatGPT (GPT-4) and Gemini, were chosen for their strong performance in creating high-quality educational content. The research used various prompts to generate items at different cognitive levels of Bloom's taxonomy. These items were assessed against multiple criteria: clarity, accuracy, absence of misleading content, appropriate complexity, correct language use, alignment with the intended level of Bloom's taxonomy, solvability, and assurance of a single correct answer. The findings indicated that both ChatGPT and Gemini were capable of generating physics assessment items, though their effectiveness varied with the prompting method used. Instructional prompts in particular produced excellent outputs from both models, yielding items that were clear, precise, and consistently aligned with the Application level of Bloom's taxonomy.
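To make the prompting procedure concrete, the sketch below shows how an instructional prompt of the kind the abstract describes might be issued to one of the two models through the OpenAI Python client. This is a minimal illustration only: the client choice, model name, and prompt wording are assumptions, not the authors' actual materials; the article itself documents the exact prompts and evaluation rubric.

```python
# Illustrative sketch only. The study's exact prompts, model versions, and
# rubric are reported in the article; everything below is assumed for
# demonstration purposes.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# An "instructional" prompt names the subject, the target Bloom's taxonomy
# level, and the quality constraints the generated item must satisfy.
prompt = (
    "You are a secondary school physics teacher. Write one multiple-choice "
    "item on Newton's second law at the Application level of Bloom's "
    "taxonomy. The item must be clear, solvable from the given data, have "
    "exactly one correct answer, and offer four options labeled A-D. "
    "Indicate the correct option."
)

response = client.chat.completions.create(
    model="gpt-4",  # one of the two models compared in the study
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```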
Page Range
168-185
Issue
2
Volume
10
Digital Object Identifier (DOI)
10.21831/reid.v10i2.76864
Recommended Citation
Omopekunola, M. O., & Kardanova, E. Y. (2024). Automatic generation of physics items with Large Language Models (LLMs). REID (Research and Evaluation in Education), 10(2), 168–185. https://doi.org/10.21831/reid.v10i2.76864
References
Abduljabbar, D. A., & Omar, N. (2015). Exam questions classification based on Bloom’s taxonomy cognitive level using classifiers combination. Journal of Theoretical and Applied Information Technology, 78(3), 447–455.
Adams, N. E. (2015). Bloom’s taxonomy of cognitive learning objectives. Journal of the Medical Library Association: JMLA, 103(3), 152–153. https://doi.org/10.3163/1536-5050.103.3.010
Agarwal, P. K. (2019). Retrieval practice & Bloom’s taxonomy: Do students need fact knowledge before higher order learning? Journal of Educational Psychology, 111(2), 189–209. https://doi.org/10.1037/edu0000282
Agarwal, R., Singh, A., Zhang, L. M., Bohnet, B., Chan, S., Zhang, B., Anand, A., Abbas, Z., Nova, A., Co-Reyes, J. D., Chu, E., Behbahani, F., Faust, A., & Larochelle, H. (2024). Many-shot in-context learning. arXiv. https://doi.org/10.48550/arXiv.2404.11018
Alsubait, T., Parsia, B., & Sattler, U. (2015). Generating multiple choice questions from ontologies: How far can we go? In P. Lambrix, E. Hyvönen, E. Blomqvist, V. Presutti, G. Qi, U. Sattler, Y. Ding, & C. Ghidini (Eds.), Knowledge engineering and knowledge management (EKAW 2014): Lecture notes in computer science (vol. 8982, pp. 66–79). Springer. https://doi.org/10.1007/978-3-319-17966-7_7
Archibald, S., Coggshall, J. G., Croft, A., & Goe, L. (2011). High-quality professional development for all teachers: Effectively allocating resources [Research & policy brief]. National Comprehensive Center for Teacher Quality. https://files.eric.ed.gov/fulltext/ED520732.pdf
Arendasy, M., & Sommer, M. (2007). Using psychometric technology in educational assessment: The case of a schema-based isomorphic approach to the automatic generation of quantitative reasoning items. Learning and Individual Differences, 17(4), 366–383. https://doi.org/10.1016/j.lindif.2007.03.005
Attali, Y. (2018). Automatic item generation unleashed: An evaluation of a large-scale deployment of item models. In C. P. Rosé, R. Martínez-Maldonado, H. U. Hoppe, R. Luckin, M. Mavrikis, K. Porayska-Pomsta, B. McLaren, & B. du Boulay (Eds.), Artificial intelligence in education: The 19th International Conference, AIED 2018 (pp. 17–29). Springer. https://doi.org/10.1007/978-3-319-93843-1_2
Barnum, C. M. (2020). Usability testing essentials: Ready, set... test! (2nd ed.). Morgan Kaufmann.
Bejar, I. I. (2002). Generative testing: From conception to implementation. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 199–217). Lawrence Erlbaum Associates Publishers.
Bezirhan, U., & von Davier, M. (2023). Automated reading passage generation with OpenAI’s large language model. Computers and Education: Artificial Intelligence, 5, 1–13. https://doi.org/10.1016/j.caeai.2023.100161
Bhandari, S., Liu, Y., Kwak, Y., & Pardos, Z. A. (2024). Evaluating the psychometric properties of ChatGPT-generated questions. Computers and Education: Artificial Intelligence, 7, 1–9. https://doi.org/10.1016/j.caeai.2024.100284
Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives: The classification of educational goals (Handbook I: Cognitive domain). Longmans.
Borji, A. (2023). Stochastic parrots or intelligent systems? A perspective on true depth of understanding in LLMs. SSRN Electronic Journal, 1–10. https://doi.org/10.2139/ssrn.4507038
Bozkurt, A., & Sharma, R. C. (2023). Challenging the status quo and exploring the new boundaries in the age of algorithms: Reimagining the role of generative AI in distance education and online learning. Asian Journal of Distance Education, 18(1), 1–8. https://doi.org/10.5281/zenodo.7755273
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., ... Amodei, D. (2020). Language models are few-shot learners. arXiv. https://doi.org/10.48550/arXiv.2005.14165
Buick, J. M. (2011). Physics assessment and the development of a taxonomy. European Journal of Physics Education, 2(1), 7–15. https://files.eric.ed.gov/fulltext/EJ1053836.pdf
Bulut, O., Beiting-Parrish, M., Casabianca, J. M., Slater, S. C., Jiao, H., Song, D., Ormerod, C. M., Fabiyi, D. G., Ivan, R., Walsh, C., Rios, O., Wilson, J., Yildirim-Erbasli, S. N., Wongvorachan, T., Liu, J. X., Tan, B., & Morilova, P. (2024). The rise of artificial intelligence in educational measurement: Opportunities and ethical challenges. arXiv. https://doi.org/10.48550/arXiv.2406.18900
Burns, M. K., Riley-Tillman, T. C., & Rathvon, N. (2017). Effective school interventions: Evidence-based strategies for improving student outcomes (3rd ed.). Guilford Press.
Chang, W. C., & Chung, M. S. (2009). Automatic applying Bloom’s taxonomy to classify and analysis the cognition level of English question items. Proceedings of the 2009 Joint Conferences on Pervasive Computing (JCPC), 727–734. https://doi.org/10.1109/JCPC.2009.5420087
Collins, A., Brown, J. S., & Newman, S. E. (1989). Cognitive apprenticeship: Teaching the crafts of reading, writing, and mathematics. In L. B. Resnick (Ed.), Knowing, learning, and instruction: Essays in honor of Robert Glaser (pp. 453–494). Routledge.
Crowe, A., Dirks, C., & Wenderoth, M. P. (2008). Biology in Bloom: Implementing Bloom’s taxonomy to enhance student learning in biology. CBE—Life Sciences Education, 7(4), 368–381. https://doi.org/10.1187/cbe.08-05-0024
Dao, X. Q., & Le, N. B. (2023). LLMs performance on Vietnamese high school biology examination. International Journal of Modern Education and Computer Science, 15(6), 14–30. https://doi.org/10.5815/ijmecs.2023.06.02
Darling-Hammond, L. (2015). Getting teacher evaluation right: What really matters for effectiveness and improvement. Teachers College Press.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv. https://doi.org/10.48550/arXiv.1810.04805
Doughty, J., Wan, Z., Bompelli, A., Qayum, J., Wang, T., Zhang, J., Zheng, Y., Doyle, A., Sridhar, P., Agarwal, A., Bogart, C., Keylor, E., Kultur, C., Savelka, J., & Sakr, M. (2024). A comparative study of AI-generated (GPT-4) and human-crafted MCQs in programming education. Proceedings of the 26th Australasian Computing Education Conference, 114–123. https://doi.org/10.1145/3636243.3636256
Embretson, S. E. (2005). Measuring human intelligence with artificial intelligence: Adaptive item generation. In R. J. Sternberg & J. E. Pretz (Eds.), Cognition and intelligence: Identifying the mechanisms of the mind (pp. 251–267). Cambridge University Press.
Embretson, S., & Yang, X. (2007). Automatic item generation and cognitive psychology. In C. R. Rao, & S. Sinharay (Eds.), Handbook of statistics: Psychometrics (vol. 26, pp. 747–768). Elsevier. https://doi.org/10.1016/S0169-7161(06)26023-1
Feng, S., Park, C. Y., Liu, Y., & Tsvetkov, Y. (2023). From pretraining data to language models to downstream tasks: Tracking the trails of political biases leading to unfair NLP models. arXiv. https://doi.org/10.48550/arXiv.2305.08283
Gao, T., Fisch, A., & Chen, D. (2020). Making pre-trained language models better few-shot learners. arXiv. https://doi.org/10.48550/arXiv.2012.15723
Gierl, M. J., & Haladyna, T. M. (2012). Automatic item generation: An introduction. In M. J. Gierl & T. M. Haladyna (Eds.), Automatic item generation: Theory and practice (pp. 3–12). Routledge.
Glas, C. A., & van der Linden, W. J. (2003). Computerized adaptive testing with item cloning. Applied Psychological Measurement, 27(4), 247–261. https://doi.org/10.1177/0146621603027004001
Gorin, J. S. (2006). Test design with cognition in mind. Educational Measurement: Issues and Practice, 25(4), 21–35. https://doi.org/10.1111/j.1745-3992.2006.00076.x
Gregorcic, B., & Pendrill, A. (2023). ChatGPT and the frustrated Socrates. Physics Education, 58(3), 1–9. https://doi.org/10.1088/1361-6552/acc299
Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy, J., & Wang, H. (2023). Large language models for software engineering: A systematic literature review. arXiv. https://doi.org/10.48550/arXiv.2308.10620
Ingwersen, P. (1996). Cognitive perspectives of information retrieval interaction: Elements of a cognitive IR theory. Journal of Documentation, 52(1), 3–50. https://doi.org/10.1108/eb026960
Irvine, J. (2021). Taxonomies in education: Overview, comparison, and future directions. Journal of Education and Development, 5(2), 1–25. https://doi.org/10.20849/jed.v5i2.898
Islam, R., & Ahmed, I. (2024). Gemini-the most powerful LLM: Myth or truth. Proceedings of the 2024 5th Information Communication Technologies Conference (ICTC 2024), 303–308. https://doi.org/10.1109/ICTC61510.2024.10602253
Karabenick, S. A., Woolley, M. E., Friedel, J. M., Ammon, B. V., Blazevski, J., Bonney, C. R., de Groot, E., Gilbert, M. C., Musu, L., Kempler, T. M., & Kelly, K. L. (2007). Cognitive processing of self-report items in educational research: Do they think what we mean? Educational Psychologist, 42(3), 139–151. https://doi.org/10.1080/00461520701416231
Kong, S. C. (2014). Developing information literacy and critical thinking skills through domain knowledge learning in digital classrooms: An experience of practicing flipped classroom strategy. Computers & Education, 78, 160–173. https://doi.org/10.1016/j.compedu.2014.05.009
Krathwohl, D. R. (2002). A revision of Bloom’s taxonomy: An overview. Theory into Practice, 41(4), 212–218. https://doi.org/10.1207/s15430421tip4104_2
Krathwohl, D. R., & Anderson, L. W. (2010). Merlin C. Wittrock and the revision of Bloom’s taxonomy. Educational Psychologist, 45(1), 64–65. https://doi.org/10.1080/00461520903433562
Küchemann, S., Steinert, S., Revenga, N., Schweinberger, M., Dinc, Y., Avila, K. E., & Kuhn, J. (2023). Can ChatGPT support prospective teachers in physics task development? Physical Review Physics Education Research, 19(2), 1–14. https://doi.org/10.1103/PhysRevPhysEducRes.19.020128
Kurdi, G., Leo, J., Parsia, B., Sattler, U., & Al-Emari, S. (2020). A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education, 30, 121–204. https://doi.org/10.1007/s40593-019-00186-y
Laverghetta, A., Jr., & Licato, J. (2023). Generating better items for cognitive assessments using large language models. Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), 414–428. https://doi.org/10.18653/v1/2023.bea-1.34
Li, J., Tang, T., Zhao, W. X., Nie, J. Y., & Wen, J. R. (2024). Pre-trained language models for text generation: A survey. ACM Computing Surveys, 56(9), 1–39. https://doi.org/10.1145/3649449
Li, S. (2021). Measuring cognitive engagement: An overview of measurement instruments and techniques. International Journal of Psychology and Educational Studies, 8(3), 63–76. https://doi.org/10.52380/ijpes.2021.8.3.239
Liu, Y., Han, T., Ma, S., Zhang, J., Yang, Y., Tian, J., He, H., Li, A., He, M., Liu, Z., Wu, Z., Zhao, L., Zhu, D., Li, X., Qiang, N., Shen, D., Liu, T., & Ge, B. (2023). Summary of ChatGPT-related research and perspective towards the future of large language models. Meta-Radiology, 1(2), 1–14. https://doi.org/10.1016/j.metrad.2023.100017
Lorenzo, C. M. (2024). Integrating large language models for real-world problem modelling: A comparative study. Proceedings of the 18th International Technology, Education and Development (INTED 2024) Conference, 3262–3272. https://doi.org/10.21125/inted.2024.0871
Marvin, G., Hellen, N., Jjingo, D., & Nakatumba-Nabende, J. (2024). Prompt engineering in large language models. In I. J. Jacob, S. Piramuthu, & P. Falkowski-Gilski (Eds.), Proceedings of the International Conference on Data Intelligence and Cognitive Informatics (ICDICI 2023) (pp. 387–402). Springer. https://doi.org/10.1007/978-981-99-7962-2_30
Miao, J., Thongprayoon, C., Suppadungsuk, S., Krisanapan, P., Radhakrishnan, Y., & Cheungpasitporn, W. (2024). Chain of thought utilization in large language models and application in nephrology. Medicina, 60(1), 1–19. https://doi.org/10.3390/medicina60010148
Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., & Gao, J. (2024). Large language models: A survey. arXiv. https://doi.org/10.48550/arXiv.2402.06196
Miri, B., David, B. C., & Uri, Z. (2007). Purposely teaching for the promotion of higher-order thinking skills: A case of critical thinking. Research in Science Education, 37, 353–369. https://doi.org/10.1007/s11165-006-9029-2
Mishra, S., Khashabi, D., Baral, C., Choi, Y., & Hajishirzi, H. (2021). Reframing instructional prompts to GPTk’s language. arXiv. https://doi.org/10.48550/arXiv.2109.07830
Mohammed, M., & Omar, N. (2020). Question classification based on Bloom’s taxonomy cognitive domain using modified TF-IDF and word2vec. PLoS ONE, 15(3), 1–21. https://doi.org/10.1371/journal.pone.0230442
Motlhabane, A. (2017). Unpacking the South African physics-examination questions according to Bloom’s revised taxonomy. Journal of Baltic Science Education, 16(6), 919–931.
Mystakidis, S., Fragkaki, M., & Filippousis, G. (2021). Ready teacher one: Virtual and augmented reality online professional development for K-12 school teachers. Computers, 10(10), 134. https://doi.org/10.3390/computers10100134
Offerijns, J., Verberne, S., & Verhoef, T. (2020). Better distractions: Transformer-based distractor generation and multiple choice question filtering. arXiv. https://doi.org/10.48550/arXiv.2010.09598
Perikos, I., Kardakis, S., & Hatzilygeroudis, I. (2021). Sentiment analysis using novel and interpretable architectures of Hidden Markov models. Knowledge-Based Systems, 229, 1–18. https://doi.org/10.1016/j.knosys.2021.107332
Polat, F., Tiddi, I., & Groth, P. (2024). Testing prompt engineering methods for knowledge extraction from text. Semantic Web. https://www.semantic-web-journal.net/system/files/swj3719.pdf
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI.
Rangapur, A., & Rangapur, A. (2024). The battle of LLMs: A comparative study in conversational QA tasks. arXiv. https://doi.org/10.48550/arXiv.2405.18344
Roumeliotis, K. I., & Tselikas, N. D. (2023). ChatGPT and Open-AI models: A preliminary review. Future Internet, 15(6), 1–24. https://doi.org/10.3390/fi15060192
Santos, R. P. D. (2023). Enhancing physics learning with ChatGPT, Bing Chat, and Bard as agents-to-think-with: A comparative case study. arXiv. https://doi.org/10.48550/arXiv.2306.00724
Scully, D. (2017). Constructing multiple-choice items to measure higher-order thinking. Practical Assessment, Research, and Evaluation, 22(1), 1–13. https://doi.org/10.7275/swgt-rj52
Shen, Y., Heacock, L., Elias, J., Hentel, K. D., Reig, B., Shih, G., & Moy, L. (2023). ChatGPT and other large language models are double-edged swords. Radiology, 307(2), 1–4. https://doi.org/10.1148/radiol.230163
Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). Consistency models. arXiv. https://doi.org/10.48550/arXiv.2303.01469
Tabrizi, S., & Rideout, G. (2017). Active learning: Using Bloom’s taxonomy to support critical pedagogy. International Journal for Cross-Disciplinary Subjects in Education, 8(3), 3202–3209.
Tan, B., Armoush, N., Mazzullo, E., Bulut, O., & Gierl, M. (2024). A review of automatic item generation techniques leveraging large language models. EdArXiv Preprints. https://doi.org/10.35542/osf.io/6d8tj
Tomlinson, C. A. (2017). How to differentiate instruction in academically diverse classrooms. ASCD.
Tomlinson, C. A. (2023). The parallel curriculum model: A design to develop potential & challenge high-ability learners. In J. S. Renzulli, E. J. Gubbins, K. S. McMillen, R. D. Eckert, & C. A. Little (Eds.), Systems and models for developing programs for the gifted and talented (pp. 571–598). Routledge. https://doi.org/10.4324/9781003419426
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv. https://doi.org/10.48550/arXiv.1706.03762
Veal, W. R., & MaKinster, J. G. (1999). Pedagogical content knowledge taxonomies. The Electronic Journal for Research in Science & Mathematics Education, 3(4). https://ejrsme.icrsme.com/article/view/7615
Wang, H., Guo, B., Wu, W., Liu, S., & Yu, Z. (2021). Towards information-rich, logical dialogue systems with knowledge-enhanced neural models. Neurocomputing, 465, 248–264. https://doi.org/10.1016/j.neucom.2021.08.131
Yahya, A. A., Toukal, Z., & Osman, A. (2012). Bloom’s taxonomy–based classification for item bank questions using support vector machines. In W. Ding, H. Jiang, M. Ali, & M. Li (Eds.), Modern advances in intelligent systems and tools: Studies in computational intelligence (vol. 431, pp. 135–140). Springer. https://doi.org/10.1007/978-3-642-30732-4_17
Yu, W., Zhu, C., Li, Z., Hu, Z., Wang, Q., Ji, H., & Jiang, M. (2022). A survey of knowledge-enhanced text generation. ACM Computing Surveys, 54(11s), 1–38. https://doi.org/10.1145/3512467
Zhang, M., & Li, J. (2021). A commentary of GPT-3 in MIT Technology Review 2021. Fundamental Research, 1(6), 831–833. https://doi.org/10.1016/j.fmre.2021.11.011
Zhong, Q., Ding, L., Liu, J., Du, B., & Tao, D. (2023). Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv. https://doi.org/10.48550/arXiv.2302.10198