K-Nearest Neighbor (K-NN) algorithm with Euclidean and Manhattan in classification of student graduation

ABSTRACT


Introduction
The student graduation rate is one of the indicators of the success of higher education. To achieve a good graduation rate, universities must plan the learning process so that students can graduate on time [1]. Developments in information technology allow universities to process data rapidly and accurately [2]. One application of information technology is the prediction of student graduation [3]-[6]. Such predictions can be made using student data from the first year, and they can help universities evaluate and improve the learning system so that they produce qualified graduates who finish on time [7].
Student graduation can be predicted by treating it as a classification problem. One algorithm that can be used for classification is K-Nearest Neighbor (K-NN) with Euclidean and Manhattan Distance. For predicting student study time, the Euclidean Distance method achieved an accuracy of 85.71% at K=10 [8]. For predicting graduation based on Tryout scores, the Manhattan Distance method achieved an accuracy of 97.30% at K=3 [9]. For predicting student graduation time, K-NN with Euclidean Distance and Manhattan Distance achieved an accuracy of 82.26% [10]. K-NN with Euclidean Distance achieved an accuracy of 83% at K=10 [11]. Moreover, for predicting qualification in the National Examination, the Euclidean Distance method achieved an accuracy of 88.42% at K=7 [4]. This is in line with a study at SMA Negeri 12 Tangerang, in which the Euclidean Distance method achieved an accuracy of 89.126% at K=7 for the same task.
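The two distance measures and the K-NN voting rule discussed above can be illustrated with a minimal Python sketch. It is not the implementation used in the cited studies or in this research; the function names (`euclidean`, `manhattan`, `knn_predict`) and the toy records are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Manhattan distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k, distance):
    """Return the majority label among the k training points nearest to query."""
    # Rank training indices by their distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: distance(train_X[i], query))
    # Count the labels of the k nearest neighbours and take the most common one.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy records: (grade point semester one, grade point semester two).
train_X = [(3.6, 3.7), (3.4, 3.5), (3.5, 3.2), (2.1, 2.4), (2.5, 2.0), (2.2, 2.3)]
train_y = ["timely", "timely", "timely", "untimely", "untimely", "untimely"]

query = (3.3, 3.4)
pred_euclidean = knn_predict(train_X, train_y, query, k=3, distance=euclidean)
pred_manhattan = knn_predict(train_X, train_y, query, k=3, distance=manhattan)
print(pred_euclidean, pred_manhattan)
```

Only the `distance` argument differs between the two calls, which mirrors how the two K-NN variants differ solely in the distance measure.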
Based on these findings, it can be inferred that accuracy depends strongly on the K value and the problem being solved. This research aims to build a student graduation prediction system using K-NN with Euclidean and Manhattan Distance. The variables used as input are gender, major, number of credits in semesters one, two, and three, grade point in semesters one, two, and three, age, and graduation status (timely/untimely). This research is expected to help universities formulate policies that ensure students graduate on time [12].
In related work, the highest accuracy of 98.5 percent was attained at k = 3 in prediction tests on 60 records of students from 2015-2016; the accuracy of the K-Nearest Neighbor algorithm also improved when more samples and training data were used [13]. In another study, the scores of 240 students were used to test the algorithm's performance. All 240 students had graduated, and the clusters were labeled according to their graduation dates. The outcome was 7 clusters with a silhouette value of 0.2416, each designated by a range of graduation times. The variance within each cluster is attributable to students whose scores are mostly similar but whose graduation times differ; other factors influencing the range of graduation times within a cluster include academic leave and extension of the thesis completion period. An average prediction accuracy of 99.58 was obtained by dividing the 240 records into 5 folds [14]. In a further study on predicting student graduation, the author carried out tests on 667 training data; in the first test, with k = 1, the maximum accuracy of 88.16 percent was obtained [15].

Data Set
This study used data on students of the Informatics Department at the University of Technology Yogyakarta from the 2014 and 2015 academic years. The attributes and data types are presented in Table 1. The data cover 543 students: 444 male and 99 female. Of these, 83 students graduated on time and 460 did not. 380 students were used as training data and 163 as testing data.
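The counts above can be checked with a few lines of arithmetic. The sketch below uses only the figures reported in this section; the variable names are illustrative. It shows that the two classes are strongly imbalanced, which is worth keeping in mind when reading accuracy values.

```python
# Counts reported for the data set.
total = 543
timely, untimely = 83, 460
train, test = 380, 163

# The class counts and the split sizes both add up to the full data set.
assert timely + untimely == total
assert train + test == total

timely_share = timely / total   # fraction of students who graduated on time
train_share = train / total     # fraction of records used for training

print(f"timely share:   {timely_share:.1%}")   # 15.3%
print(f"training share: {train_share:.1%}")    # 70.0%
```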

Research Procedure
To achieve the research objectives, the following steps were carried out: (1) The data were divided into two groups. The first group (group A) was used as training data for developing the model: 380 students, of whom 57 graduated on time and 323 did not. The second group (group B) was used as testing data: 163 students, of whom 26 graduated on time and 137 did not.
(2) A model was developed using group A and tested using group B. RapidMiner software was used to develop and test the model. Two types of models were used, namely K-NN with Euclidean Distance and K-NN with Manhattan Distance. K-NN with Euclidean was configured by setting the K-NN Numerical Measure menu, as shown in Fig. 1 and Fig. 2. The resulting model was tested with the data split parameter set to 70% for training data and 30% for testing data, as presented in Fig. 3.
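The 70%/30% division into group A and group B can be sketched outside RapidMiner as follows. This is a minimal illustration, not the RapidMiner operator itself: the function name `split_70_30`, the random shuffling, and the fixed seed are assumptions made here, since the text does not state how records were assigned to each group.

```python
import random

def split_70_30(records, train_ratio=0.7, seed=0):
    """Shuffle a copy of the records and split it into training and testing lists."""
    shuffled = records[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # assumption: random assignment, fixed seed
    cut = round(len(shuffled) * train_ratio)   # number of training records
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 543 student records (only the indices matter for this sketch).
records = list(range(543))
group_a, group_b = split_70_30(records)
print(len(group_a), len(group_b))  # 380 163
```

With 543 records, a 70% share rounds to 380 training records and leaves 163 for testing, which matches the group sizes reported above.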

Results and Discussion
After all steps of the research procedure were carried out, the results shown in Table 2 were obtained. Based on Table 2, both the Euclidean and the Manhattan Distance method reached their highest accuracy for classifying student graduation, 85.28%, at K=7. These results indicate that the choice between Euclidean and Manhattan Distance in the K-NN algorithm did not affect classification accuracy. They also suggest that, for the distribution of this student data, the two distance measures separate the groups in the same way. In addition, Table 3 shows that increasing the value of K does not always increase accuracy; accuracy peaks at K=7.
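To make the accuracy figure concrete, the sketch below defines accuracy as the fraction of correct predictions and converts 85.28% back into a count. It assumes that accuracy was computed over the 163 testing records; the function name `accuracy` and the toy labels are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy example: three of four predictions are correct.
toy_true = ["timely", "untimely", "untimely", "untimely"]
toy_pred = ["untimely", "untimely", "untimely", "untimely"]
toy_accuracy = accuracy(toy_true, toy_pred)   # 0.75

# Assumption: the reported accuracy is computed over the 163 testing records.
test_size = 163
correct_at_best_k = round(0.8528 * test_size)
print(f"{correct_at_best_k}/{test_size} = {correct_at_best_k / test_size:.2%}")  # 139/163 = 85.28%
```

Under this assumption, an accuracy of 85.28% corresponds to 139 correctly classified students out of 163.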

Conclusion
In this research, the choice between the Euclidean and Manhattan methods did not affect prediction accuracy. The highest prediction accuracy for both methods, 85.28%, was obtained at K=7. The difference in distance calculation between the Euclidean and Manhattan Distance methods did not affect the classification results. Furthermore, increasing the K value did not always improve accuracy; accuracy was highest at K=7.