COMPARING ITEM PARAMETER ESTIMATES AND FIT STATISTICS OF THE RASCH MODEL FROM THREE DIFFERENT TRADITIONS

The Rasch model has a long history of application in the social and behavioral sciences, including educational measurement. Under certain circumstances the Rasch model is treated as a special case of item response theory (IRT), while IRT is in turn equivalent to item factor analysis (IFA), itself a special case of structural equation modeling (SEM); a third 'tradition' regards the Rasch measurement model as belonging to neither. In this study, a simulation was conducted to show how these perspectives inter-relate: the Rasch model as a constrained version of the 2-parameter logistic (2-PL) IRT model, the Rasch model as item factor analysis, and the Rasch measurement model were compared using the Mplus, IRTPRO, and WINSTEPS programs, each of which comes from its own 'tradition'. The results indicate that the Rasch model, IFA as a special case of SEM, and the Rasch measurement model are mathematically equivalent, but that, because of different philosophical perspectives, people may understand this equivalence differently. Given these findings, it is hoped that confusion and misunderstanding among the three traditions can be overcome.


INTRODUCTION
An item response theory (IRT) model contains entities (observable variables and person-proficiency variables) and relationships (link functions) around which models are structured and through which probability-based inference is carried out (Mislevy, 2018). There are two traditions in IRT modeling: a data-based tradition and a model-based tradition. In the data-based tradition, different models within the IRT family are explored to find the best-fitting model for the available data. By contrast, in the model-based tradition, a model with appealing mathematical properties is selected first, and tests are designed to fit the model (Maydeu-Olivares & Montaño, 2013). One example of the model-based tradition is the Rasch measurement model (Rasch, 1960; Wright, 1968).
In the field of educational measurement in recent years, there has been an increasing need to use the Rasch model as a tool for analyzing assessment data in Indonesia, including large-scale assessment at the national level (i.e., the Indonesia National Assessment Program). Since the Rasch model was introduced in Indonesia by the late Bruce H. Choppin in 1975 (Nasoetion et al., 1976), this method has had a long history of use in educational measurement in Indonesia, and a number of studies by Indonesian scholars have since applied it.

The Rasch Model and 1-PL IRT
The Rasch model (Rasch, 1960) is a mathematical formulation linking the probability of the outcome when a single person attempts a single item to the characteristics of the person and the item. It is thus one of the family of latent-trait models for the measurement of achievement, and is arguably the least complicated member of this family (Choppin, 1983). In its simplest form it can be written as shown in Equation (1):

$$P(X_j = 1 \mid \theta, \delta_j) = \frac{\exp(\theta - \delta_j)}{1 + \exp(\theta - \delta_j)} \quad (1)$$

The Rasch model attempts to specify the relationship between individuals' underlying trait levels and the probability of endorsing an item using item and person characteristics. The structure of the Rasch model allows algebraic separation of the ability and item parameters, where $P(X_j = 1 \mid \theta, \delta_j)$ is the probability of a response of 1, θ is the person location, and δj is item j's location. Expressed in words, Equation (1) says that the probability of a response of 1 on item j is a function of the distance between a randomly selected person located at θ and the item located at δj (Embretson & Reise, 2000).
The Rasch model is also known as a special case of the logistic model in which the parameter for discriminating power is assumed to be the same for all items and is absorbed in the unit of scale of the ability estimate. Where the assumption of uniform discriminating power is appropriate, this model has the advantage of greater computational simplicity, chiefly because the test score (number right) is a sufficient statistic for the estimation of latent ability. This contrasts with Birnbaum's logistic model, where a weighted sum of the dichotomous item scores is the sufficient statistic (Bock & Wood, 1971). The Rasch model has three fundamental assumptions: (1) unidimensionality of the latent trait, (2) parallel item characteristic curves (ICCs), and (3) local independence (Mair, 2018).
The purpose of Rasch model analysis from the measurement perspective is four-fold: (1) to scale persons and items on a common interval scale in (2) a single measurement dimension, where (3) item calibrations are independent of the distribution of persons, and (4) person measures are independent of the distribution of items. Collectively, these four properties are often encompassed under the term "objective measurement" (Karabatsos, 2000). Besides being considered a 'measurement model,' Equation (1) is mathematically the same as the 1-PL IRT model as a 'statistical model'; only philosophical differences separate the two. The 1-PL model is shown in Equation (2):

$$P(X_j = 1 \mid \theta, \alpha, \delta_j) = \frac{\exp[\alpha(\theta - \delta_j)]}{1 + \exp[\alpha(\theta - \delta_j)]} \quad (2)$$

Both the 1-PL and Rasch models require that items have a constant value for α, but allow the items to differ in their locations. For the Rasch model, this constant is 1.0, whereas for the 1-PL model the constant α does not have to equal 1.0. Mathematically, the 1-PL and Rasch models are equivalent: the values from one model can be transformed into the other by appropriate rescaling. The Rasch model sets α to 1.0, and this constant value is absorbed into the metric used in defining the continuum (de Ayala, 2009; Embretson & Reise, 2000). As a statistical model, the probability of answering correctly or endorsing a particular response category is graphically depicted by an item characteristic curve (ICC). The ICC reflects the nonlinear (logit) regression of a response probability on the latent trait. An item difficulty conveys the level of the latent trait (θ) at which there is a 50% chance of a positive response on the item; for example, if δ = 0.75, there is a probability of 0.50 that a person with a latent trait level of 0.75 will respond positively to the item (Brown, 2015).
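As a check on this interpretation, the Rasch response probability of Equation (1) can be computed directly. Below is a minimal Python sketch; the function name is ours, for illustration only:

```python
import math

def rasch_prob(theta, delta):
    """Rasch model, Equation (1): probability of a response of 1
    given person location theta and item location delta."""
    return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

# Brown's (2015) example: when theta equals the item difficulty
# (theta = delta = 0.75), the response probability is exactly 0.50.
print(rasch_prob(0.75, 0.75))  # 0.5
```

The probability depends only on the distance θ − δ, which is the algebraic separation of person and item parameters noted above.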
As a measurement model, to support inference from observation, the model must produce linear measures, overcome missing data, give estimates of precision, have devices for detecting misfit, and allow the parameters of the person and the instrument to be separable. Only the Rasch measurement models solve all of these problems (Wright & Mok, 2004), and this is what gives the Rasch measurement model its distinct character in applications of Equation (1).

Full Information Item Factor Analysis (Categorical CFA) and IRT: Interrelations
Factor analysis contributed to the conceptual synthesis of latent variable and measurement models in SEM. Confirmatory factor analysis (CFA) is used to study the relationships between a set of observed variables and a set of continuous latent variables; when the observed variables are categorical, CFA is also referred to as item response theory (IRT) analysis (Fox, 2010). What is now known as IRT originated as an effort to overcome the limitations of the factor model when applied to test items: test items are most often categorical, whereas the factor model was designed for continuous data. Unfortunately, over the years IRT and FA have developed somewhat independently of one another (Maydeu-Olivares, 2005), but some argue that IRT is fundamentally a special case of SEM and that both statistical approaches rely on the idea that latent variables are the critical level of analysis, with multiple measured indicators as the means by which a construct is assessed. IRT and SEM have, however, melded considerably in recent years (Little, 2018).
From the factor analysis and Mplus users' point of view, popular IRT models, such as the one-parameter and two-parameter IRT models, are the measurement modeling part of SEM and are special cases of factor analysis with categorical, ordinal data; thus, those who mainly use Mplus for factor analysis with categorical, ordinal data might wonder how well Mplus estimates those IRT models. In addition, because Mplus provides several different estimation options, users may be curious about the comparative performance of the various estimation options embedded in Mplus for estimating item response theory models. The same curiosity about the performance of SEM software for IRT model estimation may also exist among item response theory software users and researchers (Paek et al., 2018).
There are three closely related uses of item response theory (IRT) and factor analysis (FA) models in applied social and behavioral science research. First, IRT and FA models are used to better understand the psychometric structure underlying a set of items. Second, IRT or FA procedures are used to construct tests that meet some targeted criterion regarding reliability, validity, or test length. The third and often most common goal relies directly on the first two and involves using the final IRT or FA model structure to obtain maximally valid and reliable scale scores to be used in subsequent statistical or graphical analysis; such scores are sometimes referred to as factor scores (Curran et al., 2016). A critical difference between IRT and factor-analytic approaches is how the data are treated. While factor-analytic methods examine covariances (or relationships) between the individual items, IRT models examine the overall response patterns across all of the items (Embretson & Reise, 2000).
As a consequence of evaluating item response patterns, the parameter estimates obtained provide insight into how the items function. This type of information can be particularly useful during the process of developing a survey. In addition, factor-analytic approaches posit a linear relationship between the factor score and the item response. This contrasts with the IRT approach, which posits a nonlinear relationship between latent traits and item responses (Depaoli et al., 2018).
As cited in Cai (2013), the IFA model is based on Thurstone's common factor model, as a factor analysis of categorical item-level data. For the i-th person's response to the j-th item, a p-factor model is assumed for the underlying response process variate $y_{ij}^*$ such that

$$y_{ij}^* = \sum_{k=1}^{p} \lambda_{jk}\,\xi_{ik} + \varepsilon_{ij},$$

where the $\xi_{ik}$ denote the normally distributed latent common factors with mean zero and unit variance, $\lambda_{jk}$ is the factor loading, and $\varepsilon_{ij}$ is normally distributed with mean zero and unique variance $\psi_j^2 = 1 - \sum_{k=1}^{p} \lambda_{jk}^2$ so that $y_{ij}^*$ has unit variance. The common factors and unique factors are uncorrelated. The observed 0 and 1 response $y_{ij}$ is related to $y_{ij}^*$ via a threshold parameter $\tau_j$, such that $y_{ij} = 1$ is observed if $y_{ij}^* \ge \tau_j$ and $y_{ij} = 0$ otherwise. In terms of the item parameters, Bock and Aitkin (1981) used the parameterization shown in Equation (3):

$$c_j = -\frac{\tau_j}{\psi_j}, \qquad a_j = \frac{\lambda_j}{\psi_j}, \quad (3)$$

where $c_j$ is the item intercept and $a_j$ is called an item slope. The $c_j$ and $a_j$ are also known as the unstandardized parameters, whereas the $\tau_j$ and $\lambda_j$ are the standardized parameters. In practice, maximum likelihood estimation of the item factor analysis model often involves a logistic substitution. That is, the probability of endorsement or a correct response is shown in Equation (4):

$$P(y_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp[-D(a_j\theta_i + c_j)]}, \quad (4)$$

where D is a scaling constant (1.7) such that the logistic function becomes nearly identical in shape to the normal ogive function (Cai, 2013). Item discrimination parameters (a) are analogous to factor loadings in CFA because they represent the relationship between the latent trait and the item responses. Similarly, the item thresholds in CFA correspond to the item difficulty parameters (b) estimated in IRT. Using the CFA parameterization, an IRT difficulty parameter can be directly calculated as shown by Equation (5) (Brown, 2015):

$$b_j = \frac{\tau_j}{\lambda_j}, \quad (5)$$

where $\tau_j$ is the CFA item threshold and $\lambda_j$ is the CFA factor loading. With this background, the parallels between CFA as a special case of SEM and IRT should become clear.
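The conversions in Equations (3) and (5) can be sketched in a few lines of Python. The helper name `ifa_to_irt` is our own, and the formulas assume the unidimensional, unit-variance parameterization described above:

```python
import math

def ifa_to_irt(loading, threshold):
    """Convert standardized IFA parameters (factor loading lambda,
    threshold tau) to IRT parameters, following Equations (3) and (5)."""
    psi = math.sqrt(1.0 - loading ** 2)  # unique standard deviation
    a = loading / psi                    # slope, Equation (3)
    c = -threshold / psi                 # intercept, Equation (3)
    b = threshold / loading              # difficulty, Equation (5)
    return a, c, b
```

Note that b = -c/a, so the intercept/slope and difficulty parameterizations agree.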

Equating Coefficients from Dichotomous Rasch Model to 1-PL IRT Model
Bastari (2000) shows that coefficients from different forms of a scale can be equated using a linear transformation, so that the 1-PL estimates from IRTPRO can be compared directly with the dichotomous Rasch model estimates from WINSTEPS. Short of simply noting the linear agreement of estimates across the two metrics, we cannot directly compare estimates from two different metrics because of differences in their origins and units. One simple approach is based on the means and standard deviations of the item locations. In this approach, the transformation coefficient ζ is obtained by taking the ratio of the target to initial metric item location standard deviations (de Ayala, 2009), as presented in Equation (6):

$$\zeta = \frac{s_{\delta^*}}{s_{\delta}}, \quad (6)$$

where $s_{\delta^*}$ is the standard deviation of the item locations on the target metric and $s_{\delta}$ is the standard deviation of the item locations on the initial metric. Once ζ is determined, the other transformation coefficient κ is obtained by Equation (7):

$$\kappa = \bar{\delta}^* - \zeta\,\bar{\delta}, \quad (7)$$

where $\bar{\delta}^*$ is the mean of the item locations on the target metric and $\bar{\delta}$ is the mean of the item locations on the initial metric. Transforming the location estimate for item j to the target metric then yields Equation (8):

$$\hat{\delta}_j^* = \zeta\,\hat{\delta}_j + \kappa. \quad (8)$$
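Under these definitions the mean/sigma coefficients are straightforward to compute. A small Python sketch (the function names are illustrative):

```python
from statistics import mean, stdev

def equating_coefficients(initial, target):
    """Equations (6) and (7): zeta is the ratio of target to initial
    item-location standard deviations; kappa aligns the means."""
    zeta = stdev(target) / stdev(initial)
    kappa = mean(target) - zeta * mean(initial)
    return zeta, kappa

def to_target_metric(delta, zeta, kappa):
    """Equation (8): transform one item location to the target metric."""
    return zeta * delta + kappa
```

Applying `to_target_metric` to every initial-metric item location reproduces the target metric's mean and standard deviation by construction.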

Design and Data Generation
To demonstrate the equivalence of categorical CFA (item factor analysis, IFA) as a special case of SEM, "traditional" IRT, and the Rasch measurement model, item response data were generated using Monte Carlo (MC) simulation. The MC method involves generating a sampling distribution of a compound statistic by using point estimates of its component statistics, along with the asymptotic covariance matrix of these estimates and assumptions about how the component statistics are distributed (Preacher & Selig, 2012). The model used in this simulation study had 20 observed variables with a single sample-size condition of N = 1,000, generated from the 1-PL IRT model using Mplus 8.4. Person and item parameters were generated using informative prior distributions: beta(a = 2, b = 4) for the difficulty parameters and beta(a = 3, b = 4) for person ability. We used 1,000 replications, and these 1,000 datasets were analyzed in a head-to-head comparison of the three "traditions".
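The generation step can be illustrated with a plain Python sketch. This is an assumed re-implementation for illustration, not the actual Mplus MONTECARLO syntax, and the rescaling of the beta draws onto a logit range is our own assumption:

```python
import math
import random

def generate_1pl_data(n_persons=1000, n_items=20, seed=2024):
    """Generate dichotomous responses from a 1-PL model with
    beta-distributed item difficulties and person abilities."""
    rng = random.Random(seed)
    # beta(2, 4) for difficulties, beta(3, 4) for abilities, rescaled
    # from (0, 1) onto a (-3, 3) logit range (illustrative choice).
    deltas = [6.0 * rng.betavariate(2, 4) - 3.0 for _ in range(n_items)]
    thetas = [6.0 * rng.betavariate(3, 4) - 3.0 for _ in range(n_persons)]
    data = [[int(rng.random() < 1.0 / (1.0 + math.exp(-(t - d))))
             for d in deltas] for t in thetas]
    return data, thetas, deltas
```

Repeating this call with 1,000 different seeds would yield the 1,000 replicate datasets analyzed below.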

Analyses
For the purpose of comparison, three software packages were used to estimate the parameters across the 1,000 replications: Mplus 8.4 with the weighted least squares mean- and variance-adjusted estimator (WLSMV), IRTPRO 4.2 with the marginal maximum likelihood estimator (MML), and WINSTEPS 4.2.0 with the joint maximum likelihood estimator (JMLE). Four parameter types were studied: (a) IFA factor loadings, (b) IFA thresholds, (c) IRT (Rasch) difficulty parameters, and (d) IRT (or Rasch) discrimination parameters. Because the primary concern of this study was the similarity and conversion between IFA, IRT, and "Rasch measurement" rather than differences among individual parameters within a type, results were averaged across the individual parameters for the four parameter types. We also show the transformation from the IFA parameters to the 1-parameter normal ogive (1-PNO) IRT model, and the transformation of the Rasch model parameters to the 1-parameter logistic (1-PL) IRT model.

FINDINGS AND DISCUSSION
Using Mplus, the IFA and 1-PNO models were estimated; IRTPRO was used to estimate the 1-PL model; and WINSTEPS was used to estimate the dichotomous Rasch model. Table 1 shows the item parameter estimates from the three programs. From Table 1 it can be seen that, in the Mplus results, the difficulty parameter in the Rasch model is a conversion of the IFA parameters. For example, item 11 has an IFA factor loading of 0.542 and an IFA threshold of 0.849, so applying Equation (5) gives b = 0.849/0.542 ≈ 1.566. The same calculation applies to the other items. This shows that IFA and IRT are indeed the same thing, an approach used by various studies that have recognized this equivalence (Muthén, 1988; Muthén et al., 1991; Takane & de Leeuw, 1987). The results of converting the FA parameters to IRT can be compared with the parameter estimates from IRTPRO, which are almost the same, with only minimal differences. The main difference between FA and IRT, however, is the absence in FA of the term "calibration" as commonly found in applications of IRT. These findings are in line with the view that IRT models are similar to factor-analytic models in that both provide information about dimensionality and model fit, i.e., how well a scoring option reflects the data (Kamata & Bauer, 2008).
The estimates produced by WINSTEPS are different. As can be seen in Table 1, there is a "discrimination" column in the WINSTEPS output even though Rasch models assert that all items exhibit the model-specified item discrimination. Empirically, however, item discriminations vary. During the estimation phase of WINSTEPS, all item discriminations are asserted to be equal, of value 1.0, and to fit the Rasch model. Empirical item discriminations are never exactly equal, however, so WINSTEPS can also report a post-hoc estimate of those discriminations (as a type of fit statistic). The amount of departure of a discrimination from 1.0 indicates the degree to which that item misfits the Rasch model (Linacre, 2018; Masters, 1988).
The differences are twofold: first, the mean of the item difficulties is set equal to 0; second, there is no assumption that the person ability distribution takes a particular parametric form, such as a normal distribution (Paek & Cole, 2020). Nevertheless, the WINSTEPS estimates can be transformed onto the 1-PL metric with the equating coefficients of Equations (6)-(8). Given that the 1-PL and Rasch model standard deviations are 0.970 and 1.107, applying Equation (6) gives ζ = 0.970/1.107 ≈ 0.876. Because the respective initial and target metric means are 0.000 (the mean item difficulty in the Rasch model) and 0.270 (the 1-PL mean item difficulty), Equation (7) gives κ = 0.270 − 0.876 × 0.000 = 0.270.

As an example, we used the item 11 difficulty from the Rasch model estimated by WINSTEPS (Table 1). Transforming this difficulty estimate to the IRT 1-PL metric with Equation (8) yields 1.580, the same as the IRTPRO 1-PL difficulty estimate for item 11 (see Table 1). The same procedure can be applied to all of the items; the Rasch model estimates can also be transformed to the 1-PNO metric using the procedure in Linacre (2018). The three approaches also use different item fit indices, as can be seen in the comparison in Table 2.
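Using the summary statistics reported above, the equating coefficients can be reproduced directly; this is a sketch, with the constants taken from the text's description of Table 1:

```python
# Mean/sigma equating from the reported summary statistics:
# target (1-PL) SD = 0.970, initial (Rasch) SD = 1.107,
# target mean = 0.270, initial mean = 0.000.
zeta = 0.970 / 1.107          # Equation (6), approx. 0.876
kappa = 0.270 - zeta * 0.000  # Equation (7), = 0.270

def to_1pl_metric(delta_rasch):
    """Equation (8): map a WINSTEPS Rasch difficulty onto the 1-PL metric."""
    return zeta * delta_rasch + kappa
```

Applying `to_1pl_metric` to each WINSTEPS difficulty places all 20 items on the IRTPRO 1-PL metric for direct comparison.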
It can be seen that the item-level fit information has its own character in each approach: Mplus reports the z-value along with the p-value, as commonly found in factor analysis; IRTPRO, as IRT-based software, has a specific item-level fit index, the S-χ² statistic; and WINSTEPS has its own item fit indices developed from the perspective of the Rasch measurement model, namely Infit and Outfit (Wright & Stone, 1979).
From a statistical modeling perspective, item fit refers to whether an item of a questionnaire belongs with the questionnaire. One method of evaluating the fit of items in IRT models is the generalized S-χ² statistic (Orlando & Thissen, 2000, 2003). The S-χ² is similar to a Pearson χ², but instead it cross-tabulates the response categories for an item against the total score of the subscale for the corresponding item (Depaoli et al., 2018); only specific IRT programs such as IRTPRO provide these statistics. These three approaches to analyzing the Rasch model thus show various differences that need to be understood; when a researcher fails to understand them, the analysis can lead to very misleading conclusions. The same comparison was carried out on the overall model fit indices, as shown in Table 3. Table 3 shows that each program reports different fit information and uses different estimation methods. Model fit is typically examined using a variety of measures that convey different aspects of how well the model fits the data. Several measures can be used in this context, some of them specific to IRT-based inquiries. One such measure is the M2 statistic (Maydeu-Olivares & Joe, 2006), a limited-information fit measure that outperforms full-information fit statistics (such as the Pearson χ²) when sample sizes are relatively small (Cai & Hansen, 2013).

The M2 statistic indicates an adequately fitting model when the p-value is greater than .05. Another measure is the root mean square error of approximation (RMSEA), which indicates an adequately fitting model when its confidence interval covers or falls below .05 (MacCallum et al., 1996).
With the same data, Mplus produces a significant Pearson χ² (p < .05), which would mean the model does not fit the data, and IRTPRO shows a significant M2, which also indicates misfit; the RMSEA values below .05 from both Mplus and IRTPRO, however, indicate that the model fits. WINSTEPS differs from Mplus and IRTPRO: the data must fit the model, and if the data and the model disagree, it is the data that must be changed, not the model (Linacre, 2010). WINSTEPS nevertheless also provides global fit information through the log-likelihood statistic, where a significant result (p < .05) indicates that misfit in the data has a significant effect, although these statistics are rarely reported. From these differences we can get an overview of the philosophical differences among the Rasch measurement model, IRT, and CFA.
From the IRT perspective, the Rasch model is a constrained version of the 2-PL model in which item discrimination is equal for all items (though the value is not necessarily 1), while the Rasch measurement model is a measurement model that fixes item discrimination to 1. The fundamental difference that must be understood is that when the Rasch model is used as a 1-PL model and the model does not fit, we modify the model; from the perspective of the Rasch measurement model, when the data do not fit the Rasch model, the misfitting data, for example misfitting persons or items, are removed from the analysis (Linacre, 2010). This conceptual difference should not confuse researchers in Indonesia. Before using these methods, researchers are advised to consult the literature explaining these conceptual differences (e.g., de Ayala, 2009; Embretson & Reise, 2000; Linacre, 2010) so that confusion can be minimized.
The earliest and most recent dissertations of Indonesian scholars applying the Rasch model illustrate these traditions. Under the guidance of Bengt O. Muthén, Umar (1987) used the Rasch model as a special case of SEM; this approach can be seen in his work on CFA with binary indicators, in which the CFA parameters are transformed into IRT parameters. Thus, Umar did not use statistics such as Infit and Outfit, which derive from a different tradition. Under the guidance of Benjamin D. Wright, Hayat (1992) used the Rasch measurement model, developing an item bank with that approach. Under the guidance of Ronald K. Hambleton, Bastari (2000) used IRT derived from Lord (1952), where the Rasch model is a constrained version of the 3-PL model, as a statistical model. Finally, under the guidance of Mark Wilson, Wihardini (2016) explained the use of generalized Rasch models for multidimensional data. In sum, Umar (1987), Bastari (2000), and Wihardini (2016) used the Rasch model from the data-based tradition, while Hayat (1992) used the Rasch measurement model from the model-based tradition.

CONCLUSION
The results of this study show that IRT can be proven to be a special case of SEM, in that IRT and factor analysis of categorical data (item factor analysis) are mathematically equivalent. Although the Rasch model can be seen both as a measurement model and as a statistical model, it remains one model: its position as a special case of IRT is not wrong, even though the perspective of the Rasch model as a measurement model, the model-based tradition, also provides benefits, especially regarding criterion referencing. It is therefore important for researchers to understand the philosophical differences that separate the three "traditions", whether data-based, model-based, or using scale transformations from item factor analysis, including their item fit statistics, goodness-of-fit indices, whether the data are fit to the model or the model to the data, their different estimation methods, and the software produced by each tradition.