MODIFIED ROBUST Z METHOD FOR EQUATING AND DETECTING ITEM PARAMETER DRIFT

This study aimed to: (1) revise the criterion used in the Robust Z method for detecting item parameter drift (IPD), (2) identify the strengths and weaknesses of the modified Robust Z method, and (3) investigate the effect of IPD on examinees' classification consistency using empirical data. Two types of data were used. The simulated data consisted of the responses of 20,000 students to 40 dichotomous items, generated by manipulating six variables: (1) ability distribution, (2) difference in ability between groups, (3) type of drifting, (4) magnitude of drifting, (5) anchor test length, and (6) number of drifting items. The empirical data consisted of the responses of 4,187,444 students who took the 2011 UN SD/MI, which was administered in 41 test forms for Indonesian language, mathematics, and science. The modified Robust Z method was used to detect IPD, and the IRT true score equating method was used to analyze classification consistency. The results show that: (1) the criterion of a 0.5-point raw score TCC difference leads to 100% consistency in passing classification, (2) the modified Robust Z method is accurate in detecting b and ab drifting when the anchor test length is at least 25%, and (3) IPD occurring in the empirical data affected the passing status of more than 2,000 students.


Introduction
The use of multiple test forms that are considered parallel has become widespread. Multiple test forms are used to protect test security and to make it harder for examinees to copy from one another. Another reason for designing parallel test forms is to minimize the chance of practicing the test. If an examinee can take the test twice or more, reusing the same test form would expose the items repeatedly, and the examinee could recall and practice them.
Although the forms are designed to be parallel, it is very hard to make multiple test forms perfectly parallel. Different items have different levels of difficulty, even when they are written from the same item specification. Differences in item difficulty raise fairness issues. An easier test form advantages the examinees who took it, while examinees who took a more difficult form obtain lower scores for reasons other than lower ability. Thus, comparing scores between groups who took different test forms leads to biased results.
The Non-Equivalent Anchor Test (NEAT) design is a way of designing parallel test forms so that differences in form difficulty and differences in group ability can both be adjusted for. The adjustment is determined by the anchor items. An example of a national test that uses the NEAT design is the National Exam (NE) for elementary schools (ES) and Madrasah Ibtidaiyah/MI (Islamic-based elementary schools), commonly known as NE ES/MI (UN SD/MI). NE ES/MI items are constructed by provincial item writing teams. All provinces use the same test specification and item indicators. Each province therefore has its own test, which differs from those of other provinces. To maintain the function of the test as a national measurement tool, 25% of the items were removed and replaced by national anchor items. The national anchor items were placed in the same order and preserved exactly the same content, format, and even layout. No changes to the national anchor items were allowed, and all provinces had to ensure that the anchor items were identical.
The anchor items play a very important role. The accuracy of the estimated difficulty of each test form and the accuracy of examinees' ability estimates depend on the quality of the anchor items. The scores on the anchor items define the difference in ability between groups: a group that obtains a higher score on the anchor items is considered to have higher ability. Based on the properties of the anchor items, the difference in difficulty between test forms can be determined and used for score adjustment (Cook & Eignor, 1991). Given this importance, the anchor item parameters should satisfy the measurement invariance assumption, which holds that parameter values may vary only within the bounds of sampling error. In practice, rather than being stable, anchor item parameters not uncommonly shift across subsamples, test administrations, or locations. These shifts are known as item parameter drift (IPD) and may bias ability estimation.
Keller and Wells (2009, p. 6) investigated the impact of drifting anchor items (IPD) on the accuracy of examinees' ability estimation. The study found that the difference in ability between groups determines the magnitude of the impact of IPD. Even a single anchor item with moderate drift could bias the ability estimates.
The Robust Z method (Huynh & Meyer, 2010) is a method for detecting drifting items and for estimating the linking constants A and B used in the scaling process. It applies a simple algorithm, yet yields linking constants close to those of the Stocking-Lord method. The weaknesses of the Robust Z method are its over-sensitivity and the absence of clear cut-off criteria (Arce & Lau, 2011). The method often flags non-drifting anchor items as drifting. Its criteria are based on the probability of occurrence in a hypothetical distribution, so flagging an item as showing statistically significant IPD does not always mean that the impact of the drifting anchors is practically significant.
Given this problem with the criteria, a modification of the Robust Z method is necessary. The modification aims to detect IPD that is practically meaningful. Only anchor items that cause a practically significant impact are excluded from the scaling process. The modification thus supports the decision to either retain or refine the anchor items. An example of a practically meaningful impact is a change in an examinee's classification decision, from passing to failing or from failing to passing.
This study aimed to: (1) revise the criterion employed in the Robust Z method so that the detection of item parameter drift (IPD) is tied to a practically meaningful criterion, (2) identify the strengths and weaknesses of the modified Robust Z method under various conditions, and (3) investigate the effect of IPD on examinees' classification consistency in a real-life situation by applying the modified Robust Z method to empirical data.

Type of Research
This research is categorized as a descriptive study. It describes the strengths of the modified Robust Z method compared to the original version. It also describes the weaknesses of the modified Robust Z method and identifies the test characteristics that are likely to produce 'practically meaningful' IPD. The impact of IPD on examinees' classification in a real-life situation is also described, illustrated by analyzing empirical data using the modified Robust Z method.

Time and Location
The research took place at Yogyakarta State University, Indonesia, at the Center for Educational Assessment, and in a province that held an item writing workshop for constructing the NE ES/MI in the 2013 academic year. The research was conducted over 11 months, from March 2013 to February 2014.

Population and Sample
The population of this study was all students enrolled as examinees of the NE ES/MI in the 2011 academic year who took the main tests in all provinces of Indonesia. The main tests are defined as the tests administered on the main schedule of the NE ES/MI. Students who took a repeat session or a make-up session were excluded from the population. Under this definition, the total number of students in the population was 4,187,444.
Sample selection was based on the result of a cheating validation process. A school was considered a cheating school if, for at least one item, at least 90% of its students gave the identical incorrect response. All student responses from a school identified as cheating were excluded from the database. This validation process eliminated about 40% of the responses. The numbers of responses remaining in the database were 2,509,646 for the bahasa Indonesia test, 2,509,517 for the mathematics test, and 2,509,751 for the science test.
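The screening rule can be illustrated with a minimal sketch. The function name, the data layout (one answer string per student, one character per item), and the example responses are hypothetical; only the 90% threshold and the "identical incorrect response" rule come from the text.

```python
from collections import Counter

def is_flagged_school(responses, key, threshold=0.90):
    """Return True if, for at least one item, at least `threshold` of the
    school's students chose the same incorrect option.

    responses: list of answer strings, one per student (hypothetical layout)
    key: string of correct options, one character per item
    """
    n_students = len(responses)
    if n_students == 0:
        return False
    for item, correct in enumerate(key):
        # Count only incorrect options chosen on this item.
        wrong = Counter(r[item] for r in responses if r[item] != correct)
        if wrong and wrong.most_common(1)[0][1] / n_students >= threshold:
            return True
    return False

# Nine of ten students share the same wrong answer on the last item.
suspect = ["ABCD"] * 9 + ["ABCA"]
print(is_flagged_school(suspect, key="ABCA"))
```

In this sketch every response from a flagged school would then be dropped before calibration.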

Technical Steps on Modifying Robust Z Method
To improve the criteria of the Robust Z method, the principle of the Difference That Matters (DTM), proposed by Brennan (2008, p. 108) in a discussion of 'population invariance', was used. Under this principle, an item is considered drifting not only on the basis of statistical significance but also on the basis of the impact that the drift causes. How large the impact must be is determined by the researchers, who set the practical impact that is considered meaningful. In this study, the practical impact used to determine whether a drifting item was meaningful was a change in classification consistency. If the detected drifting items changed the equated test scores enough to cause any examinee to be classified differently, the items were considered practically meaningful IPD. Practically meaningful IPD should be excluded from the scaling process; otherwise, the classification decision may disadvantage both the examinee and the test user.
The Robust Z method consists of several steps that ultimately yield the linking constants A and B. These constants were then used in the scaling process to transform the anchor and non-anchor item parameters of a focal test form onto the scale of the reference test form. The transformed item parameters were used to plot the Test Characteristic Curve (TCC). The point-to-point linking between the TCC of the focal test and the TCC of the transformed focal test became the conversion table for equating test scores. The equated test score was then used to decide whether an examinee passed or failed the test.
To evaluate the impact of IPD in the modified Robust Z method, the formula of Wyse and Reckase (2011) was adapted. The formula was used to assess the difference between the TCC total and the TCC refinement. The TCC total is the TCC of the transformed focal test when all anchor items are used in the scaling process. The TCC refinement is the TCC of the transformed focal test when only non-drifting anchor items are used in the scaling process. If the difference between the two TCCs is small, the impact of IPD on classification consistency can be disregarded. If the difference is large, the IPD is practically meaningful and the drifting items should be excluded from the scaling process. In this study, a cut-off value of 0.5 'raw score' points was used as the maximum allowable difference between the two TCCs. This cut-off ensured 100% classification consistency.
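The comparison of the two TCCs can be sketched as follows. This is an illustration under stated assumptions, not the formula adapted from Wyse and Reckase (2011): it assumes a 2PL model with scaling constant D = 1.7, the usual linear transformation a* = a / A and b* = A·b + B, and a theta grid from -4 to 4. All function names are hypothetical; only the 0.5-point cut-off comes from the text.

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def rescale(items, A, B):
    """Put focal-form (a, b) pairs on the reference scale: a/A, A*b + B."""
    return [(a / A, A * b + B) for a, b in items]

def tcc(theta, items):
    """Test characteristic curve: expected raw score at ability theta."""
    return sum(p_2pl(theta, a, b) for a, b in items)

def max_tcc_difference(items, link_total, link_cv, lo=-4.0, hi=4.0, steps=161):
    """Largest absolute gap between the TCC scaled with all anchors
    (link_total) and the TCC scaled with non-drifting anchors (link_cv)."""
    items_tot = rescale(items, *link_total)
    items_cv = rescale(items, *link_cv)
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(abs(tcc(t, items_tot) - tcc(t, items_cv)) for t in grid)

def is_practically_meaningful(items, link_total, link_cv, dtm=0.5):
    """Flag the drift when the TCC gap exceeds the DTM cut-off."""
    return max_tcc_difference(items, link_total, link_cv) > dtm

# Hypothetical 40-item focal form and two sets of (A, B) linking constants.
form = [(1.0, -2.0 + 0.1 * i) for i in range(40)]
print(is_practically_meaningful(form, (1.0, 0.0), (1.0, 0.02)))
```

A grid search is used here for simplicity; a finer grid or a numerical optimizer would give a more precise maximum.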
Equations (1), (2), (3), and (4) are the formulas used in modifying the criteria of the Robust Z method. Equations (4a) and (4b) were used to calculate the linking constants A and B under two conditions: without refining IPD items (A_tot and B_tot) and with IPD items refined (A_cv and B_cv). Both sets of linking constants were used to scale the anchor and non-anchor item parameters. The two sets of linking constants also lead to two TCCs: the TCC without refinement (ΣPi_total) and the TCC with IPD items refined (ΣPi_cv). The maximum absolute difference between the two TCCs was then compared to the DTM cut-off value to determine whether the IPD was practically meaningful.
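As an illustration of how linking constants with and without refinement might be obtained, the sketch below uses a general robust z statistic, (value − median) / (0.74 × IQR), applied first to differences in log a and then to differences in b. This is an assumption about the general form of the procedure, not a reproduction of Equations (1) to (4). The critical value of 1.645, the direction of the differences, and all function names are likewise hypothetical choices.

```python
import math
import statistics

def robust_z(values):
    """Robust z: deviation from the median scaled by 0.74 * IQR."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust z is undefined")
    return [(v - med) / (0.74 * iqr) for v in values]

def linking_constants(ref_items, focal_items, critical=1.645):
    """Return (A, B, flagged) from matched anchor (a, b) pairs.

    Assumed convention: a_ref = a_focal / A and b_ref = A * b_focal + B.
    Items with |robust z| >= critical on either step are flagged.
    """
    # Step 1: slope A from the log-a differences of non-flagged anchors.
    log_a_diff = [math.log(af / ar)
                  for (ar, _), (af, _) in zip(ref_items, focal_items)]
    z_a = robust_z(log_a_diff)
    A = math.exp(statistics.mean(
        d for d, z in zip(log_a_diff, z_a) if abs(z) < critical))
    # Step 2: intercept B from the b differences of non-flagged anchors.
    b_diff = [br - A * bf for (_, br), (_, bf) in zip(ref_items, focal_items)]
    z_b = robust_z(b_diff)
    B = statistics.mean(d for d, z in zip(b_diff, z_b) if abs(z) < critical)
    flagged = [i for i, (za, zb) in enumerate(zip(z_a, z_b))
               if abs(za) >= critical or abs(zb) >= critical]
    return A, B, flagged
```

Setting `critical` to infinity keeps every anchor and would correspond to the "without refinement" constants, while a finite value gives the refined ones.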

Data, Instrument, and Data Collection
The empirical data used in this research were collected through documentation. The 2011 NE ES/MI data were copied from the database of the Center for Educational Assessment, so the data used were secondary data. The collected data were raw responses to the 41 test forms of the bahasa Indonesia test, 41 test forms of the mathematics test, and 41 test forms of the science test. The answer key of each test form was also collected to complement the raw response data sets.
The instruments used in this research were analysis software. Five software packages were used in this study. The analysis started by determining the Item Response Theory (IRT) model to be used. To find the most suitable model, curves of raw score against the proportion of students in each score group who responded correctly to particular items were plotted manually. Figure 1 is an example of such a curve for the anchor items of the mathematics test.
After the IRT model was chosen, simulation data were generated using the WinGen software (Han, 2007). Each generated dataset represented the responses of 20,000 examinees to 40 dichotomous items. Six variables were manipulated: (1) the percentage of anchor items relative to the total number of items (15%, 25%, and 40%); (2) the percentage of drifting items relative to the total number of anchor items (15%, 30%, and 45%); (3) the magnitude of drifting, for two kinds of drifting: a-parameter drifting (no drifting, moderate drifting of 0.3, and large drifting of 0.7) and b-parameter drifting (no drifting, moderate drifting of 0.5, and large drifting of 0.8); (4) the direction of IPD (symmetric in two directions, or in one direction); (5) the shape of the ability distribution (normal and negatively skewed); and (6) the comparison of ability distributions between groups (similar and different ability distributions).
In total, there were 188 conditions. Each condition was replicated 50 times for both the reference and the focal groups, which resulted in the analysis of 18,800 datasets. The percentage of manipulated drifting items detected as IPD was named the power rate, the percentage of non-manipulated items detected as IPD was named the type I error rate, and the percentage of TCC differences larger than the cut-off value was named the DTM rate. The expected result of the simulation study is a combination of a high power rate and a low type I error rate.
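The three rates can be computed from replication results along the following lines. This is a minimal sketch: the function name, the input layout (one list of flagged anchor indices and one maximum TCC difference per replication), and the example values are hypothetical.

```python
def simulation_rates(flagged_per_rep, tcc_diff_per_rep, drifting, anchors, dtm=0.5):
    """Return (power rate, type I error rate, DTM rate), all in percent.

    flagged_per_rep: list of iterables of anchor indices flagged as IPD
    tcc_diff_per_rep: list of maximum TCC differences, one per replication
    drifting: indices of anchors whose parameters were manipulated
    anchors: indices of all anchor items
    """
    drifting = set(drifting)
    stable = set(anchors) - drifting
    n_rep = len(flagged_per_rep)
    hits = sum(len(set(f) & drifting) for f in flagged_per_rep)
    false_alarms = sum(len(set(f) & stable) for f in flagged_per_rep)
    # A rate is undefined (nan) when its item set is empty.
    power = 100.0 * hits / (n_rep * len(drifting)) if drifting else float("nan")
    type1 = 100.0 * false_alarms / (n_rep * len(stable)) if stable else float("nan")
    dtm_rate = 100.0 * sum(d > dtm for d in tcc_diff_per_rep) / n_rep
    return power, type1, dtm_rate

# Four hypothetical replications with ten anchors, two of them drifting.
rates = simulation_rates(
    flagged_per_rep=[[0, 1], [0], [0, 1, 5], [1]],
    tcc_diff_per_rep=[0.2, 0.6, 0.7, 0.1],
    drifting={0, 1},
    anchors=range(10),
)
print(rates)  # (75.0, 3.125, 50.0)
```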
The analysis of the empirical data started with the calibration of the national anchor items using national responses. The parameters estimated from the national responses were then used as references for calibrating the non-anchor items in each province. This method of calibrating provincial items is known as fixed item parameter calibration. The similarity of the mean and standard deviation between the non-anchor test and the anchor test was used to select the reference test form for the equating process. After the reference test form was selected, score equating could be conducted for each provincial main test form. For each provincial main test form, there were two equating processes: one using all anchor items regardless of drifting and one using only non-drifting anchor items. Based on the two equating processes, each examinee was classified twice. The classification consistency analysis categorized examinees into four groups: (1) passing and keep passing, (2) passing then failing, (3) failing then passing, and (4) failing and keep failing.
For each group, the proportion of examinees relative to the total number of examinees was calculated. Classification consistency is the sum of the proportions of examinees in the 'passing and keep passing' and 'failing and keep failing' groups. The analysis of the empirical data also determined how often each anchor item was detected as IPD across the 41 test forms. This frequency was named the IPD rate. An anchor item with a high IPD rate needs a detailed analysis of the source of drifting.
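The four-group classification and the consistency index can be sketched as follows. The function name and the example scores are hypothetical. The sketch assumes that the first classification uses the scores equated with all anchor items and the second uses the scores equated with non-drifting anchors only, and it takes the passing score of 4.00 used in the empirical analysis as its default cut score.

```python
def classification_consistency(scores_total, scores_refined, cut=4.00):
    """Return (proportions of the four groups, classification consistency).

    scores_total: equated scores using all anchor items
    scores_refined: equated scores using non-drifting anchor items only
    """
    if not scores_total or len(scores_total) != len(scores_refined):
        raise ValueError("score lists must be non-empty and of equal length")
    counts = {"pass_pass": 0, "pass_fail": 0, "fail_pass": 0, "fail_fail": 0}
    for s_tot, s_cv in zip(scores_total, scores_refined):
        first = "pass" if s_tot >= cut else "fail"
        second = "pass" if s_cv >= cut else "fail"
        counts[f"{first}_{second}"] += 1
    n = len(scores_total)
    proportions = {group: count / n for group, count in counts.items()}
    # Consistency: examinees whose status is the same under both scalings.
    consistency = proportions["pass_pass"] + proportions["fail_fail"]
    return proportions, consistency

# Four hypothetical examinees, one in each group.
props, consistency = classification_consistency(
    [5.0, 4.2, 3.9, 3.0], [5.1, 3.8, 4.0, 2.5])
print(props, consistency)
```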

Results of Simulation Study
The power rate by type of ability distribution is presented in Figure 2. The pattern of the power rate for the normal distribution is similar to the pattern for the skewed distribution. Across levels of drifting magnitude, the type of ability distribution does not produce different results. This indicates that the modified Robust Z method performs similarly under the two types of ability distribution. The modified Robust Z method is also accurate when the ability of examinees in one group differs from that of the other group.
Figure 3 and Figure 4 show the power rate and type I error rate of IPD detection for the interaction between the number of anchor items and the difference in ability between groups. Figure 3 shows that the modified Robust Z method is accurate when the proportion of anchor items is 40% and the groups differ in ability. A power rate of 100% means that the modified Robust Z method detected the manipulated drifting items in all replications. A type I error rate close to 0% means that items are almost never incorrectly detected as IPD.
The results presented in Figure 5 show that using 40% anchor items can minimize the impact of IPD on classification consistency. The DTM rate for the condition with 40% anchor items is close to 0%, for both a-drift and b-drift and for both moderate and large drifting magnitudes. This suggests that designing multiple test forms with 40% anchor items guards against the impact of any IPD that may arise. Although the anchor test may contain IPD, its impact on classification consistency can at least be minimized.
The IPD detection rate across different proportions of drifting items reveals a weakness of the modified Robust Z method, as presented in Table 1. Table 1 shows that the power rate of the modified Robust Z method is less than 20% when the number of drifting items is 40% of the total number of anchor items. This finding indicates that the modified Robust Z method is not powerful in detecting IPD when the proportion of drifting items in the anchor test is large. A large proportion of drifting items causes the anchor items to be distributed evenly around the fitted regression line, hiding the fact that many items are drifting. Overall, everything appears normal and there are no outliers in the distribution, so the modified Robust Z method fails to identify which anchors are drifting and which are not.
Table 1 also shows that the modified Robust Z method remains accurate in detecting many drifting items as long as the direction of drifting is symmetric. A symmetric direction means that some items drift toward greater difficulty while others drift toward lower difficulty. When the number of drifting items is 40% of the anchor test length, the power rate for one-directional drift is 9.5%, while the power rate for symmetric drift increases dramatically to 100%.
Figure 6, Figure 7, and Figure 8 illustrate the power rate, type I error rate, and DTM rate when the IPD is in one direction and when it is symmetric in two opposite directions. The results show that the modified Robust Z method performs better in assessing the impact of IPD at the test level rather than only at the level of individual items. The practical impact on classification consistency is identified by the modified Robust Z method as an aggregate over items at the test level. Even when the number of drifting items is large, if the items drift in opposite directions the effects cancel out and the practical impact can be disregarded. The simulation study results show that the modified Robust Z method improves the ... parameter calibration) to select the best reference for the test form. IPD detection was carried out using the modified Robust Z method on the 41 test forms for each subject. Table 5 presents the IPD rate for each anchor item. The results show that in the bahasa Indonesia test, there is one anchor item which was detected as IPD, more than 60% anchor items which was detected in more than 85%, while the science test has 2 anchor items detected as IPD in more than 85% of provinces. The simulation study showed that the modified Robust Z method detects IPD accurately. Hence, an IPD rate of 85% in the empirical data means that the item is truly drifting.
The anchor items detected as drifting were then taken into account when performing the scaling process. The impact of the drifting items determined whether the drift was practically meaningful. In the empirical data analysis, an examinee was considered to pass the test if the score in each subject was at least 4.00. Scoring was conducted twice: with refinement and without refinement. For each subject, each examinee therefore has two passing statuses. Table 6, Table 7, and Table 8 present the proportions of examinee statuses based on the two scoring processes: Table 6 for bahasa Indonesia, Table 7 for mathematics, and Table 8 for science.
Table 6 summarizes the analysis of passing status for the bahasa Indonesia test. For eleven of the 41 test forms, IPD does not make the difference between the TCCs larger than the DTM cut-off value. Careful examination of these eleven test forms showed that when the difference is less than the DTM cut-off value, classification consistency is 100%: no examinee changes passing status between the two scaling conditions. This indicates that the cut-off criterion of 0.5 raw score points guarantees 100% classification consistency.
Table 7 shows that even a single drifting item with large magnitude, such as mat 40, has a large impact on classification consistency. The DTM rate for the mathematics test is very close to 100%. The proportion of inconsistent classifications at the national level is also very large, about 25.58%. Given the size of the Indonesian student population, this corresponds to 621,600 students, which is a very large number and a significant result. These 621,600 students represent the student population in the eastern part of Indonesia.
The smallest percentage of inconsistent classification, which is presented in Table 8 ... Close attention must be paid to the classification consistency results. The table shows that the inconsistent classifications are mostly cases classified as passing when the true status is failing. This means that thousands or even hundreds of thousands of students are classified as passing the test when their competencies are in fact still below the standard. This inconsistency has a large influence because the NE score is then used as a selection tool for entering secondary schools. The learning process therefore cannot begin from the right starting point. The students in ... detailed analysis of the source of drifting. The expected results from the empirical data are a high percentage of classification consistency and a low IPD rate.

Figure 2. Power Rate Graph of Type of Ability Distribution across Different Levels of IPD Magnitude

Figure 3. Power Rate Graph of Interaction between Type of Distribution and Ability Differences among Groups, across Number of Anchor Items and Type of IPD

Figure 6. Power Rate Graph of Interaction of Anchor Test Length Condition, Number of IPD Condition, Ability Distribution, and IPD Direction

Table 1. Power Rate, Type I Error Rate, and DTM Rate Based on Anchor Test Length, Number of Drifting Items, and IPD Direction

Table 2. Anchor Item Parameters for Bahasa Indonesia Test

Table 3. Anchor Item Parameters for Mathematics Test

Table 4. Anchor Item Parameters for Science Test

Table 5. IPD Rate of Each Anchor Item over 41 Test Forms

Table 6. Percentage of Classification Consistency of Passing Status Based on Bahasa Indonesia Test

Table 7. Percentage of Classification Consistency of Passing Status Based on Mathematics Test

Table 8. Percentage of Classification Consistency of Passing Status Based on Science Test