Exploring the accuracy of school-based English test items for grade XI students of senior high schools

Martin Iryayo, University of Rwanda - College of Education, Rwanda
Agus Widyantoro, Department of English Education, Universitas Negeri Yogyakarta, Indonesia


This study set out to (1) explore the accuracy of school-based English test items developed by English teachers and (2) examine the relationship between the content covered by the teachers and the students’ level of success. The research used a quantitative approach. The data sources were all grade XI students’ answers to the English test for the second semester of the 2016/2017 academic year and their English teachers’ responses to a questionnaire. In this cross-sectional survey, 241 grade XI students and six English teachers were selected using the total population sampling technique. The data were analyzed primarily with an IRT model, using BILOG-MG 3.0 and WINSTEPS 3.7. The findings indicate that (1) the test is valid, (2) it is reliable, (3) the majority of the items are moderately difficult, (4) more than half of the items have the power to discriminate among examinees, (5) some items have fully effective distractors, and (6) the test gives much information at a theta of -.40, which means that the test is difficult for the grade XI students. Moreover, there is a wide gap between the content covered and the level of success.


Keywords: CTT; discrimination power; distractor; information function; IRT; theta; total population sampling



DOI: https://doi.org/10.21831/reid.v4i1.19971



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


ISSN 2460-6995 (Online)
