Score conversion methods with a modern test theory approach: Ability, difficulty, and guessing-justice methods
DOI: https://doi.org/10.21831/reid.v11i2.67484

Keywords: item response theory, 1-PL, R program, Rasch model

Abstract
The one-parameter logistic (1-PL) model is widely used in Item Response Theory (IRT) to estimate student ability; however, ability-based scoring disregards item difficulty and guessing behavior, which can bias proficiency interpretations. This study evaluates three scoring alternatives derived from IRT: an ability-based conversion, a difficulty-weighted conversion, and a proposed guessing-justice method. Dichotomous responses from 400 students were analyzed using the Rasch (1-PL) model in the R environment with the ltm package. The 1-PL specification was retained to support a parsimonious and interpretable calibration framework consistent with the comparative scoring purpose of the study. Rasch estimation produced item difficulty values ranging from −1.03 to 0.18 and identified 268 unique response patterns. Ability-based scoring yielded only eight score distinctions, demonstrating limited discriminatory capacity. In contrast, the guessing-justice method produced a substantially more differentiated distribution, with approximately 70 percent of patterns consistent with knowledge-based responding and 30 percent indicative of guessing. The findings indicate that scoring models incorporating item difficulty and guessing behavior provide a more equitable and accurate representation of student proficiency than traditional ability-based conversions. The proposed approach offers a practical and implementable alternative for classroom assessment and can be applied using widely accessible spreadsheet software such as Microsoft Excel.
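For readers who want to see what the calibration step described above looks like in practice, the following is a minimal R sketch, assuming a hypothetical input file responses.csv of dichotomous (0/1) item responses, one row per student. It fits a Rasch (1-PL) model with the ltm package and extracts item difficulties and pattern-level ability estimates; it is an illustration, not the authors' script, and the difficulty-weighted score at the end is only one possible weighting, since the article's exact conversion formulas are not reproduced in the abstract.

library(ltm)

# Hypothetical input: one row per student, one 0/1 column per item
resp <- read.csv("responses.csv")

# Rasch calibration: fix the common discrimination parameter at 1
fit <- rasch(resp, constraint = cbind(ncol(resp) + 1, 1))

coef(fit)                    # item difficulty estimates (Dffclt column)

# Ability (theta) estimates for each observed response pattern;
# the number of rows corresponds to the unique patterns in the data
scores <- factor.scores(fit)
head(scores$score.dat)
nrow(scores$score.dat)

# Illustrative difficulty-weighted raw score (an assumption, not the
# article's formula): weight each correct answer by its shifted difficulty
b <- coef(fit)[, "Dffclt"]
weighted <- as.matrix(resp) %*% (b - min(b) + 1)

The same difficulty and ability estimates can then feed a guessing-justice conversion; as the abstract notes, the final scoring step can also be carried out in spreadsheet software such as Microsoft Excel.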