A multidimensional item response theory approach in the item analysis of Arabic language tests in madrasah aliyah
DOI: https://doi.org/10.21831/pep.v29i2.90877
Keywords: Arabic language, item analysis, item response theory, multidimensional item response theory
Abstract
This study evaluates the quality of Arabic test items in madrasah assessments using a quantitative approach based on Multidimensional Item Response Theory (MIRT). The sample comprised 321 twelfth-grade students from MAN 1 Surakarta, purposively selected because the institution implements systematic and independent assessments. Data were obtained from student responses to the final Arabic examination of the 2022/2023 academic year. Exploratory Factor Analysis (EFA) was first conducted to identify the dimensional structure of the test, using the criteria KMO > 0.60 and a significant Bartlett’s Test of Sphericity (p < 0.05); factor extraction was guided by eigenvalues > 1 and supported by scree plot inspection. A multidimensional two-parameter logistic (2PL) model was then fitted in R, and model fit was evaluated against RMSEA < 0.06, CFI > 0.90, and TLI > 0.90. Item parameters included discrimination (d) and difficulty (b); discrimination was classified as < 0.00 (unacceptable), 0.00–0.34 (very low), 0.35–0.64 (low), 0.65–1.34 (moderate), and ≥ 1.35 (high). Findings show substantial variability in item performance. Most items demonstrated acceptable discrimination, but 16 items had negative discrimination, indicating weaknesses in content representation and item construction. A few items (items 1, 3, 7, 10, and 22) showed high discrimination and were highly informative. Difficulty levels were dominated by easy items, limiting the test’s ability to distinguish among medium- and high-ability examinees. The study recommends revising misfitting items, adding items with moderate difficulty and d > 0.65, and strengthening validity evidence through Confirmatory Factor Analysis and bias detection using DIF analysis.
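
The abstract outlines an EFA-then-MIRT workflow carried out in R. The sketch below shows how such a pipeline might look using the psych and mirt packages; it is a minimal, hypothetical illustration rather than the authors' actual code. The data object named responses (a 0/1-scored item matrix), the two-factor specification, and the cut-off values echoed in the comments are assumptions drawn from the abstract.

# Minimal sketch of the analysis pipeline described in the abstract (assumptions noted above)
library(psych)   # KMO, Bartlett's test, scree plot
library(mirt)    # multidimensional 2PL model

# responses: hypothetical data frame of dichotomously scored (0/1) items, one column per item
R <- cor(responses)

# Dimensionality diagnostics: KMO > 0.60 and a significant Bartlett's test (p < .05)
KMO(R)
cortest.bartlett(R, n = nrow(responses))

# Factor extraction guided by eigenvalues > 1 and scree plot inspection
eigen(R)$values
scree(responses, factors = TRUE, pc = FALSE)

# Exploratory MIRT 2PL model; the number of factors (assumed here to be 2) follows the EFA result
fit <- mirt(responses, model = 2, itemtype = "2PL")

# Global fit via the M2 statistic, read against RMSEA < 0.06, CFI > 0.90, TLI > 0.90
M2(fit)

# Item parameters: slopes (a1, a2) and intercepts (d); flag items with negative slopes for revision
coef(fit, simplify = TRUE)$items
itemfit(fit)

In this parameterization, mirt reports multidimensional slopes and an intercept for each 2PL item; a multidimensional difficulty index can be derived from these values if a classical difficulty-style interpretation is preferred.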