Score conversion methods with modern test theory approach: Ability, difficulty, and guessing justice methods

Authors

  • Siti Nurjanah, Universitas Negeri Yogyakarta, Indonesia, https://orcid.org/0009-0006-0727-0830
  • Muhammad Iqbal, Universitas Negeri Yogyakarta, Indonesia
  • Siti Nurul Sajdah, Universitas Negeri Yogyakarta, Indonesia
  • Yohana Veronica Feibe Sinambela, Universitas Negeri Yogyakarta, Indonesia
  • Shaufi Ramadhani, Universitas Negeri Yogyakarta, Indonesia

DOI:

https://doi.org/10.21831/reid.v11i2.67484

Keywords:

item response theory, 1-PL, R program, Rasch model

Abstract

The one-parameter logistic (1-PL) model is widely used in Item Response Theory (IRT) to estimate student ability; however, ability-based scoring disregards item difficulty and guessing behavior, which can bias proficiency interpretations. This study evaluates three scoring alternatives derived from IRT: an ability-based conversion, a difficulty-weighted conversion, and a proposed guessing-justice method. Dichotomous responses from 400 students were analyzed using the Rasch (1-PL) model in the R environment with the ltm package. The 1-PL specification was retained to support a parsimonious and interpretable calibration framework consistent with the comparative scoring purpose of the study. Rasch estimation produced item difficulty values ranging from −1.03 to 0.18 and identified 268 unique response patterns. Ability-based scoring yielded only eight score distinctions, demonstrating limited discriminatory capacity. In contrast, the guessing-justice method produced a substantially more differentiated distribution, with approximately 70 percent of patterns consistent with knowledge-based responding and 30 percent indicative of guessing. The findings indicate that scoring models incorporating item difficulty and guessing behavior provide a more equitable and accurate representation of student proficiency than traditional ability-based conversions. The proposed approach offers a practical, readily implementable alternative for classroom assessment and can be applied using widely accessible spreadsheet software such as Microsoft Excel.
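
To illustrate the calibration step described above, the sketch below fits a Rasch (1-PL) model to a dichotomous response matrix with the ltm package in R. It is a minimal, hedged example rather than the authors' analysis script: the data frame name responses is hypothetical, and the constraint fixing the common discrimination parameter at 1 follows the standard Rasch usage documented for ltm::rasch.

    # Minimal sketch: Rasch (1-PL) calibration of dichotomous (0/1) responses with ltm.
    # `responses` is a hypothetical 400-by-k data frame of item scores.
    library(ltm)

    # Fix the common discrimination parameter (row index p + 1 in the constraint
    # matrix) at 1 to obtain a strict Rasch calibration.
    fit <- rasch(responses, constraint = cbind(ncol(responses) + 1, 1))

    # Item difficulty estimates on the IRT scale.
    coef(fit)

    # Empirical Bayes ability (theta) estimates, one row per observed response
    # pattern, together with the pattern frequencies.
    scores <- factor.scores(fit, method = "EB")
    head(scores$score.dat)

The per-pattern ability estimates can then be exported (for example with write.csv) and rescaled or reweighted in a spreadsheet, consistent with the spreadsheet-based workflow the abstract proposes for classroom use.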

Author Biography

Siti Nurul Sajdah, Universitas Negeri Yogyakarta

Master's student in Educational Research and Evaluation

Published

2025-12-18

How to Cite

Nurjanah, S., Iqbal, M., Sajdah, S. N., Sinambela, Y. V. F., & Ramadhani, S. (2025). Score conversion methods with modern test theory approach: Ability, difficulty, and guessing justice methods. REID (Research and Evaluation in Education), 11(2), 183–198. https://doi.org/10.21831/reid.v11i2.67484

Section

Articles
