A multidimensional item response theory approach in the item analysis of Arabic language tests in madrasah aliyah

Authors

  • Joko Subando, Lembaga Pendidikan Pengembangan Agama Islam (LPPAI) Solo, Indonesia

DOI:

https://doi.org/10.21831/pep.v29i2.90877

Keywords:

multidimensional; item response theory; analysis; Arabic language

Abstract

This study evaluates the quality of Arabic test items in madrasah assessments using a quantitative approach based on Multidimensional Item Response Theory (MIRT). The sample comprised 321 twelfth-grade students from MAN 1 Surakarta, purposively selected because the institution implements systematic and independent assessments. Data were obtained from student responses to the final Arabic examination in the 2022/2023 academic year. Exploratory Factor Analysis (EFA) was first conducted to identify the dimensional structure of the test, using the criteria KMO > 0.60 and a significant Bartlett’s Test of Sphericity (p < 0.05). Factor extraction was determined by eigenvalues > 1 and supported by scree plot inspection. Model fit was subsequently examined using a MIRT 2-parameter logistic (2PL) model in R, with the evaluation indicators RMSEA < 0.06, CFI > 0.90, and TLI > 0.90. Item parameters included discrimination (d) and difficulty (b), where discrimination was classified as: < 0.00 (unacceptable); 0.00–0.34 (very low); 0.35–0.64 (low); 0.65–1.34 (moderate); ≥ 1.35 (high). The findings reveal substantial variability in item performance. Most items demonstrated acceptable discrimination; however, 16 items had negative discrimination, indicating weaknesses in content representation and item construction. A few items (items 1, 3, 7, 10, and 22) showed high discrimination and were highly informative. Difficulty levels were dominated by easy items, which limited the test’s ability to distinguish between medium- and high-ability examinees. The study recommends revising misfitting items, adding items with moderate difficulty and d > 0.65, and enhancing validity through Confirmatory Factor Analysis and bias detection using DIF analysis.
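The discrimination bands quoted in the abstract can be sketched as a small helper function. This is an illustrative Python rendering only (the study itself used R for the MIRT 2PL analysis); the function name and threshold encoding are assumptions based on the cut-offs stated above.

```python
def classify_discrimination(d: float) -> str:
    """Map a discrimination estimate to the bands given in the abstract:
    < 0.00 unacceptable; 0.00-0.34 very low; 0.35-0.64 low;
    0.65-1.34 moderate; >= 1.35 high."""
    if d < 0.0:
        return "unacceptable"
    if d < 0.35:
        return "very low"
    if d < 0.65:
        return "low"
    if d < 1.35:
        return "moderate"
    return "high"


# Example: items with negative discrimination are flagged as unacceptable,
# matching the 16 misfitting items reported in the findings.
print(classify_discrimination(-0.12))  # unacceptable
print(classify_discrimination(1.40))   # high
```

Under this classification, the study's recommendation to add items with d > 0.65 corresponds to targeting the "moderate" and "high" bands.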


Published

2025-12-31

How to Cite

Subando, J. (2025). A multidimensional item response theory approach in the item analysis of Arabic language tests in madrasah aliyah. Jurnal Penelitian Dan Evaluasi Pendidikan, 29(2). https://doi.org/10.21831/pep.v29i2.90877

Section

Articles
