Alternative item selection strategies for improving test security in the computerized adaptive testing algorithm

One of the ability estimation methods widely applied in the Computerized Adaptive Testing (CAT) algorithm is maximum likelihood estimation (MLE). However, MLE cannot find a solution for a test taker's ability when the responses do not have a pattern, that is, when the test taker has a score of 0 or a perfect score. In such cases, the ability is usually estimated using the step-size model. However, the step-size model often results in item exposure, where certain items appear more often than others. This threatens the security of the test because items that appear often are easier to recognize. This study offers an alternative strategy that modifies the step-size model and randomizes the selection among the largest calculated values of the information function. The results show that the alternative item selection strategy makes more varied items appear, which improves the security of tests on the CAT.


Introduction
The development of item response theory (IRT) and of faster, higher-capacity computer technology has made computerized adaptive testing (CAT) possible (Haryanto, 2013, pp. 49-50). It is called "computerized" testing because the testing process no longer uses paper and pencil but a computer device. It is called "adaptive" testing because the items that appear are chosen and adjusted to the ability of each test taker individually. CAT is a test in which the items presented to a test taker are determined based on that test taker's answers (Winarno, 2013, p. 577). The efficiency of CAT compared to conventional testing models has been supported by several studies. Eignor et al. concluded that, at the same level of measurement precision, adaptive tests required less than half the test length of a computer-based test (CBT) (Eignor, Stocking, Way, & Steffen, 1993; Grist, 1989, p. 2; Rudner, 1998, p. 2). McBride and Martin concluded that, to achieve the same level of reliability, conventional testing required 2.57 times more items than adaptive testing (McBride & Martin, 1983).
The method widely used to estimate the ability of test takers is maximum likelihood estimation (MLE). Its disadvantage is that it cannot find a solution when a test taker has an extreme score, where all answers are incorrect or all are correct. To overcome this problem, the step-size method is generally employed. However, the combination of MLE and the step-size model often leads to item exposure, which is the frequent appearance of certain items given to test takers. Although CAT is more efficient and reliable, its security is not guaranteed because certain items appear repeatedly. These items are easily recognized because they appear frequently, especially at the beginning of the item sequence. Therefore, the conventional CAT algorithm needs to be modified to minimize the appearance of these easily recognized items. The procedures commonly used in developing conventional CAT algorithms are elaborated as follows (Thissen, 1990).

Starting CAT
CAT generally starts with the selection of an item of moderate difficulty (Mills, 1999, p. 123; Santoso, 2010, p. 70; Vispoel, 1999). A test taker who answers incorrectly is then given an easier item. Conversely, a test taker who answers correctly is given a harder item.

Estimating the Ability of the Test-Takers
The method commonly used to estimate the ability of test takers is MLE (Baker, 1992; Birnbaum, 1968). The maximum likelihood estimate of ability is calculated using the Newton-Raphson iterative procedure (Hambleton & Swaminathan, 1985, p. 83). The procedure first subtracts the ratio of the first derivative to the second derivative of the log-likelihood from the initial value of θ, which yields a new θ. The procedure is then repeated with the new θ, and a new derivative ratio is calculated. The estimated value of θ at iteration (m + 1) can be expressed using the iterative relation presented in Formula (1). The error value is a correction factor formulated in Formula (2), where u equals 1 if the student's answer is correct and 0 if it is incorrect. P is the probability that a test taker answers the item correctly, which is obtained from Formula (3).
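Formulas (1) to (3) are not reproduced in this text. For orientation, one standard way to write the Newton-Raphson update for the 1PL model is sketched below. The logistic form without a scaling constant, the item index i, and the symbol h_m for the correction term are assumptions made here and are not necessarily the authors' notation.

```latex
\begin{align}
\theta_{m+1} &= \theta_m - h_m \\
h_m &= \left.\frac{\partial \ln L / \partial \theta}{\partial^2 \ln L / \partial \theta^2}\right|_{\theta = \theta_m}
     = \frac{\sum_i \bigl(u_i - P_i(\theta_m)\bigr)}{-\sum_i P_i(\theta_m)\bigl(1 - P_i(\theta_m)\bigr)} \\
P_i(\theta) &= \frac{1}{1 + e^{-(\theta - b_i)}}
\end{align}
```

When every u_i is 1 (or every u_i is 0), the numerator never reaches zero for any finite θ, which is why the iteration has no solution for extreme scores.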
The iteration process is stopped when the absolute error value falls below ε, where ε is a very small limiting number. In this study, an ε value of 0.0001 was used.
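The iterative estimation described above can be sketched in a few lines. This is a minimal illustration rather than the software used in the study: it assumes the 1PL logistic model without a scaling constant, and the function names `p_correct` and `mle_theta` are hypothetical.

```python
import math

def p_correct(theta, b):
    """1PL probability of a correct response (plain logistic form is an assumption)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def mle_theta(responses, difficulties, theta0=0.0, eps=1e-4, max_iter=50):
    """Newton-Raphson MLE of ability from 0/1 responses and item difficulties b."""
    if len(set(responses)) < 2:
        # All-correct or all-incorrect patterns have no finite maximum.
        raise ValueError("MLE needs both correct and incorrect responses")
    theta = theta0
    for _ in range(max_iter):
        probs = [p_correct(theta, b) for b in difficulties]
        first = sum(u - p for u, p in zip(responses, probs))   # first derivative
        second = -sum(p * (1.0 - p) for p in probs)            # second derivative
        h = first / second                                     # correction term
        theta -= h
        if abs(h) < eps:                                       # stopping rule
            break
    return theta
```

The explicit check for extreme response patterns mirrors the limitation discussed in the text: without it, the loop would push θ toward plus or minus infinity.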
One problem with applying the MLE method in adaptive testing is that it cannot find a solution when a test taker has an extreme score, which is either a score of 0 or a perfect score. To estimate the ability of test takers whose responses do not yet have a pattern (that is, all responses so far are correct or all are incorrect), the step-size method can be used (Dodd, 1990). With this method, the test taker's ability level is raised or lowered by a certain constant, for example a step size of 0.5, for as long as the responses do not have a pattern.
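A minimal sketch of the step-size rule, assuming a starting estimate of 0 and the step of 0.5 mentioned above. The helper names `has_pattern` and `step_size_update` are hypothetical.

```python
def has_pattern(responses):
    """True once the 0/1 responses contain both a correct and an incorrect answer."""
    return 0 in responses and 1 in responses

def step_size_update(theta, correct, step=0.5):
    """Raise the estimate by `step` after a correct answer, lower it otherwise."""
    return theta + step if correct else theta - step
```

A CAT loop would call `step_size_update` while `has_pattern` is false and switch to MLE as soon as it becomes true.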

Selection of the Next Item
After the test taker's ability has been estimated, the CAT algorithm selects the next item. Lord recommended the maximum item information procedure for this selection (Lord, 1977). This method is reported to yield a highly accurate estimation of the ability of test takers (Eignor et al., 1993). The item with the greatest information function value at the test taker's current ability estimate is selected and presented. The item information function is calculated at each ability level with the equation in Formula (4) (Hambleton, Swaminathan, & Rogers, 1991, p. 107). Formula (4) shows that the information value depends only on the item parameters (for example, the values of b, a, and c for the 3PL model) and the level of ability (θ). Thus, for each ability level (θ), the information function contribution of each item in the question bank can be calculated.
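The maximum information rule can be sketched as follows. The information function below uses the common 3PL form D²a²(Q/P)((P − c)/(1 − c))², which reduces to P(1 − P) for the 1PL case with a = 1, c = 0, and D = 1. The default D = 1 is an assumption (D = 1.7 is also widely used), and the bank layout and function names are hypothetical.

```python
import math

def item_information(theta, b, a=1.0, c=0.0, D=1.0):
    """Item information at ability theta for a 3PL item (1PL when a=1, c=0)."""
    p_star = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * p_star
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def select_max_information(theta, bank, administered=()):
    """Return the id of the unused item with the largest information at theta.

    `bank` maps item id -> difficulty b (1PL items are assumed here).
    """
    candidates = [i for i in bank if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, bank[i]))
```

For example, with a hypothetical bank `{"A": -1.0, "B": 0.4, "C": 1.2}` and θ = 0.5, item "B" is chosen first because its difficulty lies closest to the current estimate.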
The test information function is the sum of the information functions of the items in the test, as written in Formula (5). It indicates how accurately the set of items estimates ability at different levels: the greater the information at a given ability level, the more accurately that ability is estimated by the test. The standard error of measurement (SEM) is expressed by the equation in Formula (6) (Hambleton & Swaminathan, 1985, p. 95).

Termination of CAT
CAT termination uses two criteria: equal measurement precision and a fixed number of items. The equal measurement precision criterion aims to produce test scores with the same level of measurement error for each test taker. The standard error of measurement is limited to 0.30, which is equivalent to a reliability of 0.91 on conventional tests (Thissen, 1990). Under this criterion, the number of items that test takers must work on can vary. However, to guard against a test process that does not converge, the criterion of a fixed number of items is also used in the termination rules by limiting the maximum number of items that appear, for example, to 20 items.
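The relationship between test information, SEM, and the two stopping criteria can be sketched as below. It assumes the usual relation SEM = 1/sqrt(test information). The reliability figure follows from 1 − 0.30² = 0.91 when the ability scale has unit variance. The function names are hypothetical.

```python
import math

def standard_error(item_infos):
    """SEM from the information values of the items administered so far."""
    total = sum(item_infos)          # test information = sum of item information
    if total <= 0:
        return math.inf
    return 1.0 / math.sqrt(total)

def should_stop(item_infos, sem_target=0.30, max_items=20):
    """Stop when the SEM target is reached or the item limit is hit."""
    return standard_error(item_infos) <= sem_target or len(item_infos) >= max_items
```

The item limit acts as a safety net: a test taker whose SEM never reaches the target still finishes after `max_items` items.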

Giving Score to the Ability of the Test-Takers
The score of the ability estimation of the test-taker derives from the conversion of the value θ that is obtained by Formula (7).
In this study, the assessment results of two CAT models were compared: the conventional CAT model (which selects the item with the largest information value I(θ)) and the alternative CAT model (which takes several of the largest I(θ) values and then chooses one of them at random). In addition, the alternative CAT model applied the step-size method with response time as an additional variable while the test takers' responses did not yet have a pattern.
The assumption underlying the response time variable is that test takers with a high level of ability can answer the items correctly in a shorter time than those with a lower level of ability. Martinez compared groups of test takers who took a test using CBT and found that the groups that spent the shortest average time responding to the initial test item obtained a higher average score (Martinez, 2009). Higgins showed that in CBT, the higher the item difficulty index, the more time test takers needed to answer and review the items (Higgins, 2009). These findings suggest that the time a test taker needs to answer items correctly is correlated with the estimate of the test taker's ability level.

Method
This study used a Research and Development (R&D) approach. The study began with the development of a question bank of 265 items based on the 1-parameter logistic item response theory (1PL IRT) model. The difficulty parameters of the 265 items were obtained by validating, with the BILOG-MG software, the responses collected in a CBT administration. The bank originally contained 290 items before validation. A summary of the validation statistics of the question bank developed and used in this study is presented in Table 1. In the 1PL IRT model, the probability that a person with a certain ability (θ) answers an item correctly depends only on the difficulty level of the item (b). In this study, the ability of test takers was estimated using the MLE and step-size methods.
Next, two adaptive test designs developed were the conventional and the alternative CAT model. In this study, the development of CAT software referred to the incremental model (Pressman, 2001, pp. 35-36).
In the conventional CAT model, the first item is chosen randomly from the items of moderate difficulty, with b values from -0.5 to 0.5. The ability level is estimated using the MLE method. However, while the test taker's responses do not yet have a pattern, the ability is estimated using the step-size method with a value of 0.5. The next item selected is the item that has the greatest information function value at the current ability estimate.
The alternative CAT model follows the same principles as the conventional CAT model. The difference lies in the selection of the second and subsequent items, which randomizes among the largest values of the information function in a 5-4-3-2-1 pattern. Under this pattern, the second item is selected randomly from the five items with the largest information function, the third item from the four items with the largest information function, the fourth item from the three items with the largest information function, and the fifth item from the two items with the largest information function. For the sixth and subsequent items, the selection reverts to the maximum information function criterion, that is, to the conventional CAT model.
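The 5-4-3-2-1 rule can be sketched as a shrinking candidate pool. This is an illustration under stated assumptions, not the study's software: items are 1PL, `bank` is a hypothetical mapping from item id to difficulty b, and the function names are invented for the example.

```python
import math
import random

def info_1pl(theta, b):
    """1PL item information P(1 - P), assuming the plain logistic form."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def pool_size(position):
    """Candidate pool for the item at 1-based `position`: 5, 4, 3, 2, then 1."""
    if position < 2:
        raise ValueError("the first item is drawn from the moderate-difficulty range")
    return max(1, 7 - position)

def select_randomized(theta, bank, administered, position, rng=random):
    """Pick one item at random among the top-information unused items."""
    candidates = [i for i in bank if i not in administered]
    ranked = sorted(candidates, key=lambda i: info_1pl(theta, bank[i]), reverse=True)
    return rng.choice(ranked[:pool_size(position)])
```

From the sixth position onward the pool holds a single item, so the function behaves exactly like conventional maximum information selection.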
To estimate the ability of test takers in the alternative CAT model while their responses do not yet have a pattern, a step-size method with the addition of the response time variable is used. The test taker's initial ability level is set at θ0, and the step-size interval is a constant k (in this study, k = 0.5). If the test taker answers incorrectly, the estimated ability level becomes θ0 - k, that is, 0 - 0.5 = -0.5. If the test taker answers correctly, the estimated ability level becomes θ0 + x k, that is, 0.5 x, where x is a positive constant multiplier whose value depends on the category of the response time for the correct answer. Table 2 shows a simulation procedure for estimating the test taker's ability level with a step-size interval combined with the response time factor. Test takers were given 300 seconds to respond to each item. If there is no response after 300 seconds, the response is declared incorrect and an easier item is displayed. In this study, the test is terminated when the SEM value has reached 0.30. An SEM value of 0.30 is equivalent to a reliability of 0.91 in conventional tests such as paper and pencil tests (Thissen, 1990).
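A sketch of the response-time step-size rule is given below. Only one multiplier is stated in the text (1.3 for correct answers given in 31 to 60 seconds, from the worked example later in the paper); Table 2 is not reproduced here, so the other rows of `MULTIPLIERS` are hypothetical placeholders, as are the function names.

```python
TIME_LIMIT = 300  # seconds allowed per item

# (upper bound in seconds, multiplier x). Only the 31-60 s -> 1.3 row comes from
# the text; the 1.5 and 1.0 rows are hypothetical placeholders for Table 2.
MULTIPLIERS = [(30, 1.5), (60, 1.3), (TIME_LIMIT, 1.0)]

def time_multiplier(seconds, table=MULTIPLIERS):
    """Look up the multiplier x for a correct answer given in `seconds`."""
    for upper, x in table:
        if seconds <= upper:
            return x
    raise ValueError("response time exceeds the time limit")

def step_with_time(theta, correct, seconds, k=0.5):
    """Step-size update in which faster correct answers earn a larger step."""
    if seconds > TIME_LIMIT or not correct:
        return theta - k              # timeout counts as an incorrect answer
    return theta + k * time_multiplier(seconds)
```

Keeping the multiplier table as data makes it easy to substitute the actual categories once they are known.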

Findings and Discussion
Before the answers have a pattern, the conventional CAT model uses the step-size method with an interval of 0.5. This means that if the test taker always responds with the correct answer, the second and subsequent items are the items with the largest information function value at ability levels (θ) of 0.5, 1, 1.5, 2, 2.5, and 3, respectively. For test takers who always respond with an incorrect answer, the second and subsequent items are the items with the largest information function value at θ of -0.5, -1, -1.5, -2, -2.5, and -3, respectively. The results obtained with the conventional CAT model are summarized in Table 3.
The results showed that item numbers 209, 164, 113, 044, 237, 275, 081, 002, 091, and 115 appeared more often than other items. Items that appear often degrade the security of the test in the conventional CAT model because test takers may already recognize them.
Testing of the conventional CAT model showed that 128 items had a moderate difficulty index, indicated by a difficulty index value (b) ranging from -0.5 to +0.5. This meant that the first item was drawn at random from 128 candidate items. This was in accordance with the criterion applied in the conventional CAT design, namely that the initially selected items were items of moderate difficulty (-0.5 to +0.5).
After the first item was displayed and answered by the test taker, the second item was selected using the step-size method. If the student answered the first item correctly, the second item displayed was the item with maximum information at θ = 0.5. If the student answered incorrectly, the second item displayed was the item with maximum information at θ = -0.5. Thus, in the conventional CAT, the second item could only be one of two items. In this study, the second item presented was item number 275 (if the answer was correct) or item number 209 (if the answer was incorrect). The frequent appearance of items 275 and 209 threatened the security of the CAT because test takers could become familiar with them.
Another case that often arises is that the student's answers continue to have no pattern, so the step-size method remains in use. For example, if students kept answering correctly, the items that appeared were those with maximum information at θ = 0.5, 1.0, 1.5, 2.0, and 2.5: item number 275 as the second item, item number 081 as the third, item number 002 as the fourth, item number 091 as the fifth, and item number 115 as the sixth.
However, if students always answered incorrectly, the items that appeared were those with maximum information at θ = -0.5, -1.0, -1.5, -2.0, and -2.5: item number 209 as the second item, item number 164 as the third, item number 113 as the fourth, item number 044 as the fifth, and item number 237 as the sixth. In the conventional CAT model, if the responses of different test takers follow the same pattern, the items that appear are also the same. This is what makes the security level of the conventional CAT model suboptimal.
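The determinism described above can be illustrated with a short sketch. It assumes 1PL items, for which the most informative item is the one whose difficulty b is closest to the current θ. The function name and any bank passed to it are hypothetical and do not reproduce the study's item parameters.

```python
def conventional_sequence(bank, all_correct, n_items=5, step=0.5):
    """Items shown after the first one while responses stay all-correct or all-incorrect.

    `bank` maps item id -> difficulty b.
    """
    sign = 1 if all_correct else -1
    shown, theta = [], 0.0
    for _ in range(n_items):
        theta += sign * step          # step-size update, no randomness involved
        remaining = [i for i in bank if i not in shown]
        # Under 1PL, maximum information means b closest to theta.
        shown.append(min(remaining, key=lambda i: abs(bank[i] - theta)))
    return shown
```

Because nothing in the loop depends on the individual test taker, every test taker with the same extreme response pattern receives the same item sequence.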
Once the students' responses had a pattern (that is, they included both correct and incorrect answers), the items that appeared next were quite varied, because the first item was already drawn from a relatively large pool (128 items). However, when the maximum information function value is used to find the item that matches the estimated ability level, it is quite possible that many items are never presented because they never have the maximum information value at any ability level.
The alternative solution proposed was to use a step-size method based on the student's response time for correct answers. Student responses were grouped into categories based on the time spent answering the questions correctly. In the step-size method based on response time, the step-size value was multiplied by an additional constant determined by the response time. The faster the students answered correctly, the greater the constant multiplier became.
An additional solution proposed was to randomize among the largest information function values. Whereas the conventional CAT model determined the next item from the single maximum value of the information function, the alternative CAT model determined the next item by choosing randomly among the largest information function values in groups of 5-4-3-2-1. One example of the results of testing the alternative CAT model is presented in Table 4.
From Table 4, the calculation procedure for the alternative CAT model can be observed. The table shows that the items that appear in the alternative CAT model are more varied than those in the conventional CAT model. The algorithmic procedure in the alternative CAT model can be explained as follows.

The First Item that Appeared was Item Number 239 with b = -0.416
The first item was taken randomly from the items of moderate difficulty, whose b values range from -0.5 to 0.5; item number 239 fulfilled this criterion. The student's answer was correct (response value 1). Because the answers did not yet have a pattern, the ability level was estimated using the step-size method with a step of 0.5. The time spent on the first item was 34 seconds, which falls in the fast category (between 31 and 60 seconds) with a multiplier factor of 1.3. Thus, the value of θ was 0.5 x 1.3 = 0.65.

The Second Item that Appeared was Item Number 182
The second item was drawn from the five items with the largest information function values at θ = 0.65, following the 5-4-3-2-1 randomization principle. From these five candidates (see Table 5), item number 182 was selected randomly. The second item was answered correctly (response value 1). Because the answers still did not have a pattern, the ability level was again estimated using the step-size method with a step of 0.5. The item was answered in 40 seconds, which falls in the fast category (between 31 and 60 seconds) with a multiplier factor of 1.3. Thus, the value of θ = 0.65 + (0.5 x 1.3) = 1.3.

The Third Item that Appeared was Item Number 192
The third item was drawn from the four items with the largest information function values at θ = 1.3, following the 5-4-3-2-1 principle. From these four candidates (see Table 6), item number 192 was randomly selected. The third item was answered incorrectly (response value 0). Because the answers now had a pattern (both correct and incorrect responses), the ability level was estimated using MLE. The value of θ obtained was 1.1809.

The Fourth Item that Appeared was Item Number 042 with b = 1.181
The fourth item was drawn from the three items with the largest information function values at θ = 1.1809, following the 5-4-3-2-1 principle. From these three candidates (see Table 7), item number 042 was randomly selected. The fourth item was answered incorrectly (response value 0). Because the answers had a pattern, the ability level was estimated using MLE. The value of θ obtained was 0.8579.

The Fifth Item that Appeared was Item Number 132 with b = 0.861
The fifth item was drawn from the two items with the largest information function values at θ = 0.8579, following the 5-4-3-2-1 principle. From these two candidates (see Table 8), item number 132 was randomly selected. The fifth item was answered correctly (response value 1). Because the answers had a pattern, the ability level was estimated using MLE. The value of θ obtained was 1.3204.

The Sixth Item that Appeared was Item Number 192
The sixth item was the single item with the largest information function value at θ = 1.32, following the 5-4-3-2-1 principle. With only one candidate (see Table 9), item number 192 was selected. The sixth item was answered incorrectly (response value 0). Because the answers had a pattern, the ability level was estimated using MLE. The value of θ obtained was 1.1161. The subsequent items (the seventh to the fifteenth) were selected in the same way, by taking the item with the largest information function at the current value of θ.
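The two step-size updates at the start of the walkthrough can be checked by hand. The worked arithmetic below assumes the starting estimate θ0 = 0, the step k = 0.5, and the multiplier x = 1.3 for the fast category, all as stated in the text; the subscripts on θ are introduced here for illustration only.

```latex
\begin{align}
\theta_1 &= \theta_0 + x\,k = 0 + 1.3 \times 0.5 = 0.65 \\
\theta_2 &= \theta_1 + x\,k = 0.65 + 1.3 \times 0.5 = 1.30
\end{align}
```

From the third item onward the responses are mixed, so the estimates come from MLE and cannot be reproduced without the item difficulties listed in the tables.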
The fifteenth item became the last item because the termination criterion had been met (SEM = 0.30). The final ability estimate was converted to a numerical score of 85. The alternative CAT model was thus shown to overcome a fundamental shortcoming of the conventional CAT model, namely the frequent appearance of certain items. Table 3 shows that in the conventional CAT model the same items appear repeatedly, especially in the initial stages of CAT execution. In contrast, Table 4 shows many variations in the items that can appear in the alternative CAT model, even when the patterns of students' answers are the same. This variety reduces the level of item exposure on the CAT and therefore makes the CAT more secure. The items that appeared in the alternative CAT model had difficulty indices that were not much different from those of the items in the conventional CAT model, so the alternative model did not increase the test length or reduce the efficiency of estimating the ability of the test takers.
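Item exposure can be quantified as the proportion of test takers who were shown a given item. The sketch below illustrates that calculation on made-up administration records; it is not the analysis performed in the study, and the function name is hypothetical.

```python
from collections import Counter

def exposure_rates(administrations):
    """Proportion of test takers who saw each item.

    `administrations` is a list of item-id lists, one list per test taker.
    """
    n = len(administrations)
    counts = Counter(item for test in administrations for item in set(test))
    return {item: count / n for item, count in counts.items()}
```

Comparing these rates between the conventional and the alternative model on the same simulated test takers would show whether the most exposed items become less dominant.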

Conclusion
From the results of this study, it can be concluded that the alternative CAT model was able to decrease the level of item exposure on the CAT, thereby increasing the security of the CAT without increasing the test length or reducing the efficiency of the CAT. The strategy adopted by the alternative CAT model was to combine the step-size method based on response time with randomization among the largest information function values following the 5-4-3-2-1 criterion, while applying maximum likelihood estimation (MLE) to estimate the ability level of the test takers. The strategy was able to present more varied items to test takers with the same response patterns, while keeping the item difficulty indices close to those of the conventional model.