PERFORMANCE DIFFERENCES BY GENDER IN ENGLISH READING TEST

Test fairness is an aspect that needs to be considered when developing a test instrument. It is highly recommended that the instrument be free of bias, which requires ensuring that items do not function differently for male and female test-takers. This study aims to examine the extent to which the items in an English proficiency test function differently across gender. Fifty reading items were examined and analyzed using a statistical method for detecting DIF: the items were individually tested for gender DIF using Rasch model analysis with the analysis tool ConQuest. The results showed that six items were flagged for DIF, three of which were basic comprehension items and the other three vocabulary items. Some possible ways of dealing with DIF items are also discussed.


INTRODUCTION
It is common for a single form of a test to contain many different types of items to measure skills, knowledge, or abilities. When the test items are administered to examinees, there is a potential for items to function differently in some contexts in ways that favor one group of examinees. The presence of item DIF, or potential bias with regard to different demographic characteristics such as gender, social class, and ethnicity, should be examined to avoid bias and to promote test fairness (Huff, 2000; Kunnan, 2007; Le, 2006; Wu, Tam, & Jen, 2016). One way to ensure test fairness for the test takers is by understanding possible gender differences in the test items. The primary purpose of this study is to examine test items that behave differently for different gender groups. From the test-taker responses, the study investigates the way items function differently for individuals or groups of test takers who have similar abilities (Kunnan, 1990).
According to Wu, Tam, and Jen (2016), the term Differential Item Functioning, or DIF, refers to an item that functions differently for different groups or in different contexts. DIF could be a factor that affects test performance in favor of a particular group; the groups might be defined by gender, cultural background, geography, or ethnicity. An item that exhibits gender DIF, for example, is one on which one group, say male students, performs considerably better than the other group of the same ability level. As DIF might affect test performance in favor of one particular group (Takala & Kaftandjieva, 2000), it is essential to treat the items detected for DIF to ensure test validity and fairness for the groups (Lin & Wu, 2003).
A number of studies investigating DIF have been well documented in several areas, such as mathematics (Kan & Bulut, 2014; Ong, Williams, & Lamprianou, 2015) and second/foreign language testing (Kunnan, 1990; Pae, 2012; Zumbo, 2003). In one study, Kim and Jang (2009) investigated a number of reading items on the Ontario Secondary School Literacy Test (OSSLT) that functioned differentially for L1 students and ELL students. The findings showed that vocabulary knowledge items favored L1 students, while items involving grammatical knowledge or integrating reading and writing skills favored ELL students. Kunnan (1990) examined the ESL Placement Examination (ESLPE) at the University of California, Los Angeles (UCLA) to identify DIF among four native language groups and two gender groups. Applying the one-parameter Rasch model for Item Response Theory (IRT) to a sample of 884 non-native speaking students at UCLA, the results showed that some items displayed DIF in the native language groups (thirteen items) and in the gender group analysis (twenty-three items). In the gender analysis, 20 items favored the male group; these items were found in the test sections of listening (seven items), reading (four items), grammar (three items), vocabulary (four items), and writing error detection (two items). The source of the DIF in listening and reading was passages related to business, culture, and engineering disciplines, which favored the male group, while the potential source of DIF for the vocabulary items was the test-takers' major field.
Another study of gender DIF was conducted by Takala and Kaftandjieva (2000), who analyzed 40 multiple-choice English vocabulary items administered to 182 males and 293 females at the intermediate level of the Finnish Foreign Language Certification Examination. The results showed that some items advantaged females (five items) and others males (six items); these items were excluded from the item bank because they produced biased estimates of person parameters. In addition, a study by Pae (2012) investigated the potential causes of gender DIF on a high-stakes national test over a long period of time.
Although considerable attention has been paid to examining DIF in various language proficiency tests, little is known about how skill items function differently across gender groups. Thus, this study aims to investigate the extent to which the items in PTESOL function differently for male and female test takers. In the end, this study provides an evaluation of the test for better validity.

Data
The data used in this study were item-level responses taken from the PTESOL test. The PTESOL is an English proficiency test designed to provide information on examinees' abilities in English. The test is developed by the Language Center of Universitas Pendidikan Indonesia (UPI), and its development is guided by a test specification. The test was administered to senior high school students and serves as part of English proficiency evidence for university admission or employment. The test consists of three sections, namely listening comprehension, structure and written expression, and reading comprehension. The English reading comprehension section was considered for this study. The items are based on a multiple-choice format with four options. The items are categorized into four categories: reading for main idea, reading for basic comprehension, inferencing, and vocabulary knowledge, as presented in Table 1. For this study, the data were in the form of students' responses on the reading comprehension subtest. The responses of 1,067 year 3 senior high school students (411 males and 656 females) were used, and the data were taken from the 2016 test administration in three public schools in different cities (Bandung, Cimahi, and Garut) in Indonesia.

Table 1. Reading Subskill Categories

Subskill | Description | Items | n
Reading for Main Idea (MI) | … | … | …
Basic Comprehension (BC) | … | …, 4, 6, 7, 9, 10, 11, 12, 16, 19, 21, 23, 25, 26, 29, 31, 32, 35, 39, 41, 46, 48, 49, 50 | 24
Inferencing (INF) | Ability to draw inferences about explicitly stated information by carefully attending to an author's purpose, attitude, tone, etc. | 15, 17, 18, 28, 34, 40, 43 | 7
Vocabulary Knowledge (VOC) | Ability to comprehend meanings of words and phrases used in the context of the test | 3, 5, 8, 13, 14, 20, 24, 27, 30, 36, 37, 38, 44, 45, 47 | 15

Analysis
The statistical method for detecting DIF used in this study was item response theory (IRT). The 50 reading comprehension items were individually tested for gender DIF using Rasch model analysis with the analysis tool ConQuest (Adams & Wu, 2010a).
ConQuest was selected because it is a powerful tool for examining DIF, particularly for modeling interactions between item and gender. It models the probability of a correct response to an item using an item main effect, a gender main effect, and an interaction between item and gender (Adams & Wu, 2010b).
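The item-by-gender interaction model described above can be sketched in simplified form. The parameterization below (ability minus the sum of item, gender, and interaction effects inside a logistic function) is an illustrative assumption for exposition, not ConQuest's exact internal parameterization:

```python
import math

def p_correct(theta, item_delta, gender_gamma, interaction):
    """Probability of a correct response under a Rasch-style model with
    an item main effect, a gender main effect, and an item-by-gender
    interaction (all values in logits)."""
    logit = theta - (item_delta + gender_gamma + interaction)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical values: the same item is effectively harder for one group
# when the interaction term shifts its difficulty upward for that group.
p_ref = p_correct(theta=0.0, item_delta=0.0, gender_gamma=0.05, interaction=-0.10)
p_focal = p_correct(theta=0.0, item_delta=0.0, gender_gamma=-0.05, interaction=0.10)
```

When the interaction term is zero for every item, the model reduces to a Rasch model with a uniform gender effect; a nonzero interaction is what signals DIF on that item.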
The first step of the analysis was to examine fit statistics, which provide information about how well the pattern of the observed responses matches the modeled expectation (Lee-Ellis, 2009). The fit statistics cover both person and item fit. Person fit examines how far an examinee's response pattern departs from the specified response model, since aberrant patterns can create spuriously high or spuriously low test scores (Karabatsos, 2003; Reise, 1990). The upper limit of the infit mean square (IMS) range for a person follows the cut-off value of around 1.60 proposed by Curtis and Boman (2007). Meanwhile, the acceptable range of item fit statistics is from 0.7 to 1.4 (Curtis & Boman, 2007).
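The person infit mean square used for this screening can be sketched as the information-weighted ratio of squared residuals to model variance; this is a simplified illustration of the standard infit statistic, with made-up response probabilities:

```python
def infit_mean_square(responses, probs):
    """Information-weighted (infit) mean square for one person: the sum of
    squared residuals divided by the sum of binomial variances across items.
    Values near 1.0 indicate good fit to the model."""
    num = 0.0
    den = 0.0
    for x, p in zip(responses, probs):
        num += (x - p) ** 2      # squared residual for this item
        den += p * (1.0 - p)     # modeled variance (item information)
    return num / den

# A person whose responses track the model probabilities closely:
probs = [0.9, 0.8, 0.6, 0.4, 0.2]
good = infit_mean_square([1, 1, 1, 0, 0], probs)
# An erratic pattern (missing easy items, passing hard ones) inflates IMS:
bad = infit_mean_square([0, 0, 0, 1, 1], probs)
```

Under the cut-off of 1.60 used in this study, the first pattern would be retained while the second would be flagged as misfitting.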
To identify DIF, flagged items were indicated by the chi-square value and the absolute DIF value taken from the logit difference between the two groups. Scholars propose different cut-off values of logit difference for detecting DIF. Le (2006) suggests that items are flagged for DIF if the chi-square DIF test is significant at the 0.01 level and the absolute DIF value is greater than 0.25 logits. Meanwhile, Wu, Tam, and Jen (2016) suggest 0.5 logits as a cut-off value, and Bond and Fox (2015) likewise propose a difference of 0.5 logits for high-stakes tests. For the purpose of this analysis, the cut-off value of 0.5 was used to detect DIF.
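The two-part flagging rule just described can be sketched as follows. The Wald-type chi-square construction and the standard-error value in the example are illustrative assumptions; the study's actual chi-square values come from the ConQuest output:

```python
CHI2_CRIT_01 = 6.635  # chi-square critical value, df = 1, alpha = 0.01
DIF_CUTOFF = 0.5      # logit-difference cut-off (Wu, Tam, & Jen, 2016)

def flag_dif(delta_male, delta_female, se_diff):
    """Flag an item for gender DIF: the logit difference between the two
    groups must be both statistically significant (chi-square at the 0.01
    level) and larger than 0.5 logits in absolute value."""
    dif = delta_male - delta_female
    chi_sq = (dif / se_diff) ** 2   # Wald-type chi-square, df = 1
    return abs(dif) > DIF_CUTOFF and chi_sq > CHI2_CRIT_01

# Hypothetical items: a large, significant difference is flagged,
# while a small difference is not.
flag_dif(1.0, -1.16, 0.15)   # magnitude 2.16: flagged
flag_dif(0.1, -0.1, 0.15)    # magnitude 0.20: not flagged
```

Requiring both conditions guards against flagging differences that are statistically significant only because of a large sample, yet too small in logits to matter practically.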

FINDINGS AND DISCUSSION
As one of the advantages of using IRT is the ability to detect and identify misfitting persons (Dodeen, 2003), the person fit statistics were calculated first. In a test, unusual responses from examinees are sometimes found, and such responses do not fit the testing model. According to Dodeen (2003), misfit is a source of inaccuracy in estimating an individual's ability and decreases test validity. Using ConQuest (Adams & Wu, 2010a) to examine examinees' responses and identify those who misfit the model, with the cut-off value of 1.6, 104 misfitting persons were identified and removed from the analysis.
The second term yielded by the ConQuest analysis showed the gender differences in ability estimates. One of the estimate values carries a negative sign, indicating the group that performed more poorly; according to Adams and Wu (2010b), the negative sign is used for the gender term in the item response model to show poorer performance. The results show that the estimate for male students is -0.05 and that for female students is 0.05, with a standard error of 0.03, indicating that male students performed more poorly than female students: male students scored 0.10 logits lower. However, the ratio of the parameter estimate (0.05) to its standard error (0.03) is 1.67, lower than 2, which indicates that the difference is not statistically significant.
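The significance check above amounts to simple arithmetic; the sketch below assumes the reported 1.67 is the estimate-to-standard-error ratio compared against the conventional threshold of 2:

```python
# Gender main-effect check: male and female ability estimates (logits)
# and the standard error reported in the analysis.
estimate_male, estimate_female, se = -0.05, 0.05, 0.03

difference = estimate_female - estimate_male   # 0.10 logits
ratio = abs(estimate_male) / se                # roughly 1.67
significant = ratio > 2                        # not significant
```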
The third term from the ConQuest analysis gives information about the interaction between the item and gender facets (see Figure 1). Some items were found to be easier for male students than female students, and vice versa. Item 1, for example, had an estimate of 0.014, indicating that 0.014 must be added to the difficulty of this item for male students and -0.014 for female students; this item is therefore relatively easier for female students than male students. Another example is item 5, with an estimate of -0.050 for males and 0.050 for females, indicating that male students found it easier than female students did. Based on the table of parameter estimates (see Table 2), there are twenty-four items (items 5, 6, 9, 11, 14, 16, 19, 21-28, 30, 33, 34, 36, 40, 42, 46, 47, and 49) that are relatively easier for male students, and twenty-six items (items 1-4, 7, 8, 10, 12, 13, 15, 17, 18, 20, 29, 31, 32, 35, 37-39, 41, 43, 44, 45, 48, and 50) that are relatively easier for female students.
After misfitting persons were removed from the analysis, the DIF value was calculated. The effect of the existing DIF is determined by the magnitude of the DIF, as indicated by the difference between the two estimate values. The corresponding chi-square test was also obtained from the DIF estimates and their standard errors. If the chi-square is significant at the 0.01 level (Le, 2006) and the absolute DIF value is greater than 0.50 (Bond & Fox, 2015; Wu et al., 2016), an item is flagged for DIF. Based on the analysis, the chi-square (142.24, df = 49) is significant, indicating the existence of DIF, and six of the 50 items are flagged for DIF (items 7, 13, 20, 25, 45, and 46). Using a magnitude value of more than 0.64 as moderate to large DIF, as proposed by Boone, Staver, and Yale (2014), three items (items 7, 25, and 46) were detected to have moderate to large DIF, with magnitudes of 0.66, 2.16, and 0.97 respectively.
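The relation between the interaction term and the DIF magnitude described above can be sketched as follows. The overall difficulty of 0.0 for item 1 is a hypothetical placeholder (the analysis reports only the interaction estimate), but the sum-to-zero constraint on the two interaction terms mirrors the +0.014/-0.014 pattern in the text:

```python
def group_difficulties(delta, interaction_male):
    """Item difficulty for each gender group: the item-by-gender interaction
    is added for males and subtracted for females (the two interaction terms
    sum to zero). The DIF magnitude is the gap between the two group
    difficulties, i.e. twice the absolute interaction."""
    d_male = delta + interaction_male
    d_female = delta - interaction_male
    return d_male, d_female, abs(d_male - d_female)

# Item 1 from the analysis above: interaction +0.014 for males makes the
# item slightly harder for them, hence relatively easier for females.
d_m, d_f, magnitude = group_difficulties(delta=0.0, interaction_male=0.014)
```

This also shows why item 1, with a magnitude of only 0.028 logits, falls far below the 0.5-logit cut-off, whereas the flagged items exceed it.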

Figure 1. Wright Map of Item-Gender Interaction
Items 7, 25, and 46 are categorized as basic comprehension items, while items 13, 20, and 45 are vocabulary items. Of these six items, as Figure 1 shows, two items (items 25 and 46) display large magnitudes of 2.16 and 0.97 respectively, both favoring male students. Both items are in the category of basic comprehension, particularly stated and unstated detail questions. The reading passage of item 25 discusses the topic of entrepreneurship, and the correct response to the item is information explicitly stated in the passage. Meanwhile, item 46 discusses foreign aid, and students are expected to answer a question about an unstated detail. The potential source of this DIF may have been test takers' unfamiliarity with the context of the passages.
Table 3 presents the results of the gender DIF analysis by reading subskills. In terms of reading subskills, it is generally found that the test items slightly favored female students more than male students. The main idea questions, however, tended to be easier for male students than female students. This finding contrasts with the study of Pae (2012), which found that main idea items tended to be easier for female students.

Table 3. Gender DIF Analysis by Reading Subskills

Subskill | Items | Total | Easier for males | Easier for females
Reading for Main Idea (MI) | … | 4 | 3 | 1
Basic Comprehension (BC) | …, 4, 6, 7, 9, 10, 11, 12, 16, 19, 21, 23, 25, 26, 29, 31, 32, 35, 39, 41, 46, 48, 49, 50 | 24 | 11 | 13
Inferencing (INF) | 15, 17, 18, 28, 34, 40, 43 | 7 | 3 | 4
Vocabulary Knowledge (VOC) | 3, 5, 8, 13, 14, 20, 24, 27, 30, 36, 37, 38, 44, 45, 47 | 15 | 7 | 8
Total | | 50 | 24 | 26
To deal with items detected for DIF, several approaches could be taken. Wu, Tam, and Jen (2016) suggest three possible ways of dealing with DIF items: removing the items, splitting DIF items for different groups, or leaving the items in the test. DIF items need to be removed, as they further suggest, when many test items are available for selection into a final test. However, consideration should be given to factors that influence DIF detection, such as sample size: the larger the sample used to detect DIF, the higher the possibility of flagging more items. In addition, items deemed to have DIF should be examined further to obtain substantive reasoning and to judge which items exhibit actual DIF. Le (2006) argues that items flagged for DIF need not necessarily be deleted from future tests, but they do need to be carefully reviewed. The DIF results provide information about the items that function differently in the test, and this information is valuable for item writers in improving the item writing process (Zenisky, Hambleton, & Robin, 2003). As a practical approach, Wu, Tam, and Jen (2016) suggest using statistical analysis to identify DIF items and examining item content to investigate theoretical explanations. The magnitude of the DIF should also be considered: they further argue that items with large DIF, showing more than 2 logits of difference in item difficulty, should be deleted.

CONCLUSION
Based on the DIF analysis of individual items in the reading subtest of PTESOL, it is found that the subtest did not demonstrate much gender DIF. Six out of fifty items in the subtest were flagged for gender DIF; two items favored male students and four favored female students. Reflecting on these findings, it is necessary to consider how they should be addressed in testing. Although DIF is not equivalent to bias, the findings of the analysis are indicative of potential sources of DIF, and the results can be used as guidelines for item writers to address what may be problematic in an item and what topics should be considered when developing test items.