A multidimensional item response theory approach in the item analysis of Arabic language tests in madrasah aliyah

Authors

  • Joko Subando Lembaga Pendidikan Pengembangan Agama Islam (LPPAI) Solo, Indonesia

DOI:

https://doi.org/10.21831/pep.v29i2.90877

Keywords:

Arabic language, item analysis, item response theory, multidimensional item response theory

Abstract

This study evaluates the quality of Arabic test items in madrasah assessments using a quantitative approach based on Multidimensional Item Response Theory (MIRT). The sample comprised 321 twelfth-grade students from MAN 1 Surakarta, purposively selected because the institution implements systematic and independent assessments. Data were obtained from student responses to the final Arabic examination in the 2022/2023 academic year. Exploratory Factor Analysis (EFA) was first conducted to identify the dimensional structure of the test, using the criteria KMO > 0.60 and a significant Bartlett’s Test of Sphericity (p < 0.05). Factor extraction was determined by eigenvalues > 1 and supported by scree plot inspection. Model fit was subsequently examined using a MIRT 2-parameter logistic (2PL) model in R, with evaluation indicators RMSEA < 0.06, CFI > 0.90, and TLI > 0.90. Item parameters included discrimination (d) and difficulty (b), where discrimination was classified as: < 0.00 (unacceptable); 0.00–0.34 (very low); 0.35–0.64 (low); 0.65–1.34 (moderate); ≥ 1.35 (high). Findings show substantial variability in item performance. Most items demonstrated acceptable discrimination; however, 16 items had negative discrimination, indicating weaknesses in content representation and item construction. Five items (1, 3, 7, 10, and 22) showed high discrimination and are highly informative. Difficulty levels were dominated by easy items, limiting the test’s ability to distinguish medium- to high-ability examinees. The study recommends revising misfitting items, adding items with moderate difficulty and d > 0.65, and enhancing validity through Confirmatory Factor Analysis and bias detection using DIF analysis.
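The 2PL model and the discrimination bands quoted in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the study's actual analysis (which was carried out in R on multidimensional data): it shows only the unidimensional 2PL response function and the classification thresholds listed above, using the conventional symbol a for discrimination (labeled d in the abstract). Function names are illustrative.

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model:
    P(theta) = 1 / (1 + exp(-a * (theta - b))),
    where theta is examinee ability, a is discrimination, b is difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def classify_discrimination(a):
    """Discrimination bands as reported in the abstract:
    < 0.00 unacceptable; 0.00-0.34 very low; 0.35-0.64 low;
    0.65-1.34 moderate; >= 1.35 high."""
    if a < 0.0:
        return "unacceptable"
    if a < 0.35:
        return "very low"
    if a < 0.65:
        return "low"
    if a < 1.35:
        return "moderate"
    return "high"

# An item with negative discrimination is answered correctly *more* often
# by low-ability examinees -- the flaw flagged for 16 items in this study.
assert p_2pl(2.0, -0.5, 0.0) < p_2pl(-2.0, -0.5, 0.0)
```

For an average-ability examinee (theta = 0) facing an item of matching difficulty (b = 0), the model gives a 0.5 probability of success regardless of a; discrimination instead governs how steeply that probability rises with ability.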


Published

2025-12-31

How to Cite

Subando, J. (2025). A multidimensional item response theory approach in the item analysis of Arabic language tests in madrasah aliyah. Jurnal Penelitian dan Evaluasi Pendidikan, 29(2), 271–284. https://doi.org/10.21831/pep.v29i2.90877

Section

Articles
