Clustering and Detecting Network Intrusion Based on Fuzzy Algorithms

Manar Y. Kashmola, Bayda I. Khaleel
College of Computer Sciences and Mathematics, University of Mosul
Received on: 29/06/2011. Accepted on: 02/11/2011

ABSTRACT
Clustering (cluster analysis) has been widely used in data analysis and pattern recognition. Several algorithms exist for clustering large or streaming data sets; their aim is to organize a collection of data items into clusters such that items within a cluster are more similar to each other than to items in other clusters. Three fuzzy clustering algorithms (Fuzzy C-Means, Possibilistic C-Means and Gustafson-Kessel) were applied to the KDD Cup 99 data set to classify it into 23 classes according to the subtype of attack. The same data set was also classified into 5 classes according to the type of attack. To evaluate the performance of the system, we compute the classification rate, detection rate and false alarm rate on this data set. The experiments yield a classification rate of 100%, which has not been obtained in any previous work.

Keywords: Network intrusion detection, Fuzzy C-Means (FCM), Possibilistic C-Means (PCM) and Gustafson-Kessel (GK) algorithms, KDD Cup 99 data set.


1-Introduction
An intrusion detection system (IDS) is a component of the information security framework. Its main goal is to differentiate between normal activities of the system and behavior that can be classified as suspicious or intrusive. The goal of intrusion detection is to build a system that automatically scans network activity and detects intrusion attacks. Once an attack is detected, the system administrator is informed and can take appropriate action to deal with the intrusion [12]. Intrusion detection techniques can be categorized into misuse detection and anomaly detection.
-Misuse detection uses the patterns of well-known attacks or vulnerable spots in the system to identify intrusions [4]. It is based on knowledge of system vulnerabilities and known attack patterns, and is concerned with finding intruders who attempt to break into a system by exploiting some known vulnerability. Ideally, a system security administrator should be aware of all the known vulnerabilities and eliminate them [3].
-Anomaly detection attempts to determine whether deviations from established normal usage patterns can be flagged as intrusions.
There are three types of intrusion detection systems: Host-based Intrusion Detection System (HIDS), Network-based Intrusion Detection System (NIDS), and a combination of both types (Hybrid Intrusion Detection System). A HIDS usually observes logs or system calls on a single host; it places its reference monitor in the kernel/user layer and watches for anomalies in the system call patterns. A NIDS typically monitors traffic flows and network packets on a network segment, and thus observes multiple hosts simultaneously; it performs traffic analysis on a local area network [12] [4].
This research is organized as follows: section 2 reviews previous work, section 3 discusses the fuzzy clustering algorithms, section 4 describes the KDD (Knowledge Discovery and Data Mining) data set used in this research, section 5 covers data preprocessing, section 6 describes the experiments and the results obtained, and section 7 gives the conclusions.

2-Previous Work
Several approaches based on clustering algorithms have been employed for intrusion detection. Jawhar and Mehrotra [7] used fuzzy c-means clustering to classify a data set of (22133) records into 2 classes; the classification result in the training stage is 99.9%. Siddiqui [14] used a parallel backpropagation (BP) neural network and a parallel fuzzy ARTMAP. For parallel BP, the detection rate is 98.36% in the training stage and 81.73% in the testing stage, with a false alarm rate of 1.28%. For parallel fuzzy ARTMAP, the detection rate is 80.14% in the training stage and 80.52% in the testing stage, with a false alarm rate of 19.48%.

3-1 Fuzzy C-Means Algorithm
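The Fuzzy C-Means (FCM) algorithm was introduced by Bezdek [12]. Fuzzy c-means is based on the Euclidean distance function [9]. It is a data clustering technique in which each data point belongs to a cluster to some degree that is specified by a membership grade [18]. Let $X = \{x_1, \ldots, x_k, \ldots, x_n\}$ be the set of n objects and $V = \{v_1, \ldots, v_i, \ldots, v_c\}$ be the set of c centroids, where $x_k \in \mathbb{R}^{p}$ and $v_i \in \mathbb{R}^{p}$, with p the number of features [10]. FCM partitions X into c clusters by minimizing the objective function:

$$J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^{m} \, d_{ki}^{2} \qquad (1)$$

where $d_{ki}$ is given by

$$d_{ki} = \lVert x_k - v_i \rVert \qquad (2)$$

$u_{ik}$ is the membership of object $x_k$ in cluster i, c is the number of clusters in X, and m > 1 is a weighting exponent [6]. The cluster centers are then evaluated by using the following equation:

$$v_i = \frac{\sum_{k=1}^{n} (u_{ik})^{m} \, x_k}{\sum_{k=1}^{n} (u_{ik})^{m}} \qquad (3)$$

and the membership matrix $U = [u_{ik}]$ is updated by the following equation:

$$u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( d_{ki} / d_{kj} \right)^{2/(m-1)}} \qquad (4)$$

The parameter m is a weighting exponent on each fuzzy membership and determines the amount of fuzziness of the resulting classification [13]. The membership values of the data items in the clusters lie in the range 0 to 1 [16].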
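The alternating updates of the cluster centers and the membership matrix can be sketched as follows. This is a minimal illustration in pure Python, not the implementation used in this research: the function name `fcm`, the random initialisation, the stopping tolerance `tol` and the default `m = 2` are all assumptions made for the example.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def fcm(points, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Fuzzy C-Means sketch (assumed defaults: m=2, random initial memberships).

    Returns (centers, u), where u[i][k] is the membership of point k in cluster i.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, normalised so each point's memberships sum to 1.
    u = [[rng.random() for _ in range(n)] for _ in range(c)]
    for k in range(n):
        col = sum(u[i][k] for i in range(c))
        for i in range(c):
            u[i][k] /= col

    centers = []
    for _ in range(max_iter):
        # Center update: mean of the points weighted by u_ik^m.
        centers = []
        for i in range(c):
            w = [u[i][k] ** m for k in range(n)]
            total = sum(w)
            centers.append([sum(w[k] * points[k][d] for k in range(n)) / total
                            for d in range(dim)])

        # Membership update from the ratios of distances to all centers.
        shift = 0.0
        for k in range(n):
            # Clamp distances away from zero to avoid division by zero.
            dist = [max(euclidean(points[k], centers[i]), 1e-12) for i in range(c)]
            for i in range(c):
                new = 1.0 / sum((dist[i] / dist[j]) ** (2.0 / (m - 1.0))
                                for j in range(c))
                shift = max(shift, abs(new - u[i][k]))
                u[i][k] = new
        if shift < tol:  # memberships have stopped changing
            break
    return centers, u


if __name__ == "__main__":
    toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, u = fcm(toy, c=2)
    print(centers)
```

On the toy data the two groups of three points end up with their highest membership in different clusters. A hard label can be obtained by taking, for each point, the cluster with the largest membership.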

3-2 Possibilistic C-Means Algorithm
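The Possibilistic C-Means (PCM) algorithm was proposed by Krishnapuram and Keller [11]. It is based on a modification of the objective function of FCM. With $u_{ik}$ the membership of object $x_k$ in cluster i and $d_{ki}$ the distance between $x_k$ and the center $v_i$, the objective function is:

$$J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^{m} \, d_{ki}^{2} + \sum_{i=1}^{c} \eta_i \sum_{k=1}^{n} (1 - u_{ik})^{m} \qquad (5)$$

and the membership is updated by the following equation:

$$u_{ik} = \frac{1}{1 + \left( d_{ki}^{2} / \eta_i \right)^{1/(m-1)}} \qquad (6)$$

where $\eta_i$ is a suitable positive number [8].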
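The possibilistic membership update depends only on the distance of a point to one cluster and on that cluster's positive constant, so it can be written as a one-line function. The sketch below is illustrative: the function name `pcm_typicality` and the default `m = 2` are assumptions, and the choice of the positive constant (`eta`) is left to the caller.

```python
def pcm_typicality(sq_dist, eta, m=2.0):
    """Possibilistic membership of a point in one cluster.

    sq_dist: squared distance between the point and the cluster center.
    eta:     positive constant of the cluster (chosen by the caller).
    m:       weighting exponent, must be greater than 1.
    """
    if eta <= 0 or m <= 1:
        raise ValueError("eta must be positive and m must exceed 1")
    return 1.0 / (1.0 + (sq_dist / eta) ** (1.0 / (m - 1.0)))


if __name__ == "__main__":
    # Memberships of one point in two clusters; unlike FCM they need not sum to 1.
    print(pcm_typicality(1.0, eta=1.0), pcm_typicality(4.0, eta=1.0))
```

A point whose squared distance equals `eta` receives membership 0.5, which is why `eta` is often read as a bandwidth for the cluster.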

3-3 Gustafson-Kessel Algorithm
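The Gustafson-Kessel (GK) algorithm is an extension of the fuzzy c-means algorithm in which each cluster i has its own norm-inducing matrix $A_i$. With $u_{ik}$ the membership of object $x_k$ in cluster i and $v_i$ the cluster center, the objective function is:

$$J_m(U, V, A) = \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^{m} \, (x_k - v_i)^{T} A_i \, (x_k - v_i) \qquad (7)$$

The matrix $A_i$ is calculated from the following equation:

$$A_i = \left[ \rho_i \det(F_i) \right]^{1/p} F_i^{-1} \qquad (8)$$

and $F_i$ is calculated as follows:

$$F_i = \frac{\sum_{k=1}^{n} (u_{ik})^{m} \, (x_k - v_i)(x_k - v_i)^{T}}{\sum_{k=1}^{n} (u_{ik})^{m}} \qquad (9)$$

where $F_i$ is the fuzzy covariance matrix, p is the number of features, and $\rho_i$ is the cluster volume, which is usually set to 1 [1].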
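To make the role of the fuzzy covariance matrix concrete, the sketch below computes it and the resulting GK squared distance for the special case of two features, where the determinant and inverse have closed forms. The restriction to two features and the function names are assumptions made to keep the example in pure Python; a general implementation would use a linear algebra library.

```python
def fuzzy_covariance_2d(points, center, memberships, m=2.0):
    """Fuzzy covariance matrix of one cluster for 2-feature points."""
    weights = [u ** m for u in memberships]
    total = sum(weights)
    f = [[0.0, 0.0], [0.0, 0.0]]
    for (x0, x1), w in zip(points, weights):
        d = (x0 - center[0], x1 - center[1])
        for r in range(2):
            for s in range(2):
                f[r][s] += w * d[r] * d[s]  # weighted outer product
    return [[f[r][s] / total for s in range(2)] for r in range(2)]


def gk_sq_distance_2d(x, center, f, rho=1.0):
    """Squared GK distance (x - v)^T A (x - v) with A = (rho * det F)^(1/2) * F^-1."""
    (a, b), (c, d) = f
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    scale = (rho * det) ** 0.5  # exponent 1/p with p = 2 features
    # Closed-form 2x2 inverse, multiplied by the volume-normalising scale.
    a_mat = [[scale * d / det, -scale * b / det],
             [-scale * c / det, scale * a / det]]
    dx = (x[0] - center[0], x[1] - center[1])
    return (dx[0] * (a_mat[0][0] * dx[0] + a_mat[0][1] * dx[1])
            + dx[1] * (a_mat[1][0] * dx[0] + a_mat[1][1] * dx[1]))


if __name__ == "__main__":
    pts = [(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
    cov = fuzzy_covariance_2d(pts, (0.0, 0.0), [1.0] * 4)
    print(cov, gk_sq_distance_2d((0.0, 2.0), (0.0, 0.0), cov))
```

With an identity covariance matrix and a cluster volume of 1 the GK distance reduces to the squared Euclidean distance. With an elongated cluster, points along the long axis are treated as closer than points at the same Euclidean distance along the short axis.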

4-KDD Dataset
The KDD'99 (Knowledge Discovery and Data Mining) data set has been the most widely used data set for intrusion detection. The network data is distributed by MIT Lincoln Lab for the Defense Advanced Research Projects Agency (DARPA). The KDD Cup 99 data set includes a set of 41 features derived for each connection and a label that specifies the status of the connection record as either normal or a specific attack type. The features take continuous, discrete, and symbolic forms. The data set encompasses different attack types, each grouped into one of four categories [17]:
-DoS (Denial of Service): making some computing or memory resources too busy to serve legitimate users, thereby denying them access.
-Probe: host and port scans as precursors to other attacks. An attacker scans a network to gather information or find known vulnerabilities, e.g., portsweep.
-U2R (User to Root): unauthorized access to local super user (root) privileges by exploiting the system's susceptibility, e.g., buffer_overflow.
-R2L (Remote to Local): unauthorized access from a remote machine by exploiting the machine's vulnerabilities, e.g., imap.
The training data set is the 10% data file, which contains (494020) connection records. The testing data set is the corrected file, which contains (311029) connection records. Table (1) shows the numbers of normal and attack connection records in the data sets used in the training and testing stages [17][2].

5-Data Preprocessing
The training and testing data were taken from DARPA. The data consist of symbolic and numeric values, and all symbolic values were transformed into numeric values [15]. In this research the KDD data set (10_percent kdd) is used in the training stage and (corrected kdd) in the testing stage; each contains 41 features (numeric and symbolic). Each symbolic feature, namely the three types of protocol (tcp, udp, icmp), the 68 types of service and the 11 types of flag, is assigned a value from [1..n], where n is the number of distinct values of that feature. All input data of the 10% KDD data set are then normalized.
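The two preprocessing steps can be sketched as follows. The mapping of symbols to integers in order of first appearance and the use of min-max scaling to [0, 1] are assumptions made for illustration; the text above does not specify the order of the codes or the normalization formula, and the function names are hypothetical.

```python
def encode_symbolic(values):
    """Map each distinct symbol to an integer in 1..n (order of first appearance)."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
        encoded.append(codes[value])
    return encoded, codes


def min_max_normalize(column):
    """Scale a numeric column to [0, 1]; a constant column maps to all zeros."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]
    return [(x - low) / (high - low) for x in column]


if __name__ == "__main__":
    protocol = ["tcp", "udp", "tcp", "icmp"]
    encoded, codes = encode_symbolic(protocol)
    print(codes)
    print(min_max_normalize(encoded))
```

The same code table built on the training file would be reused when encoding the testing file, so that a given symbol always maps to the same number.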

6-Experiments And Results
Two indicators were used to measure the accuracy of the methods: detection rate and false alarm rate. The detection rate (DR) is the percentage of true intrusions that have been successfully detected. The false alarm rate is defined as the number of normal instances incorrectly labeled as intrusions divided by the total number of normal instances [4].
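The two indicators can be computed from true and predicted labels as sketched below. The example assumes that an attack counts as detected when it receives any label other than normal, that both rates are reported as percentages, and that the normal class is named "normal"; these choices and the function name are assumptions for illustration.

```python
def detection_and_false_alarm(true_labels, predicted_labels, normal="normal"):
    """Return (detection rate, false alarm rate) as percentages."""
    attacks = detected = normals = false_alarms = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == normal:
            normals += 1
            if predicted != normal:
                false_alarms += 1  # normal record flagged as an intrusion
        else:
            attacks += 1
            if predicted != normal:
                detected += 1  # intrusion flagged as some kind of attack
    detection_rate = 100.0 * detected / attacks if attacks else 0.0
    false_alarm_rate = 100.0 * false_alarms / normals if normals else 0.0
    return detection_rate, false_alarm_rate


if __name__ == "__main__":
    truth = ["normal", "normal", "normal", "normal", "dos", "dos", "probe", "r2l"]
    guess = ["normal", "normal", "normal", "dos", "dos", "normal", "probe", "r2l"]
    print(detection_and_false_alarm(truth, guess))
```

In the toy example three of the four attacks are flagged and one of the four normal records is wrongly flagged.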

6-1 Experiment 1
-First stage: we applied the three fuzzy clustering algorithms FCM, PCM, and GK to the 10% KDD data set, which contains (494020) records. In the first experiment, these algorithms classify the data set into 23 classes (clusters): one for normal traffic and the remaining classes for the subtypes of attacks. Table (3) shows the results of the first experiment using FCM, PCM and GK clustering for 23 classes. Figure (1) shows the number of iterations required by each algorithm, and Figure (2) shows the time taken by each algorithm for 23 classes.
As shown in Table (3), PCM classified the data set faster than the other two algorithms: it required fewer iterations and less time, whereas FCM required more iterations than both GK and PCM. The classification rate [5] of the three fuzzy clustering algorithms was calculated by equation (10).
Table (5) shows the results of testing the data with the GK algorithm, with the detection rate for each attack and for normal behavior, and Table (6) shows the results of the testing stage after applying the PCM algorithm, with the detection rate for each attack and for normal behavior. Finally, Table (7) compares the three fuzzy clustering algorithms FCM, GK and PCM for 23 classes: the overall detection rate is (91.659%) for FCM, (83.021%) for GK and (94.284%) for PCM. Figures (3), (4) and (5) show the detection rate, false alarm rate and time, respectively, for the three clustering algorithms.

6-2 Experiment 2
-First stage: the same data set of (494020) records was used, after preprocessing, in the training stage to classify it into 5 classes. Table (8) shows the results of the experiment for FCM, GK and PCM, and Table (9) shows the results after applying the three fuzzy clustering algorithms FCM, PCM and GK to classify the data set into 5 classes. As shown in Table (9), PCM classified the data set faster than the other two algorithms: it required fewer iterations and less time, whereas FCM required more iterations than both GK and PCM.
-Second stage: the "corrected KDD" file, which consists of (311029) records, was used in the testing stage with the three fuzzy clustering algorithms FCM, GK, and PCM. Table (10) shows the results of testing the "corrected KDD" file with FCM, with the detection rate for each attack type and for normal behavior; normal behavior obtained the highest detection rate (97.813%). With the GK algorithm, the highest detection rate is (98.498%), again for normal behavior, as shown in Table (11). With Possibilistic C-Means (PCM), normal behavior obtained the highest detection rate (99.972%), as shown in Table (12). Table (13) compares the three fuzzy clustering algorithms FCM, GK and PCM for 5 classes: the overall detection rate is (98.543%) for FCM with a false alarm rate of (2.236%), (80.836%) for GK with a false alarm rate of (1.502%), and (99.955%) for PCM with a false alarm rate of (0.116%). Finally, Table (14) compares the results of the three fuzzy clustering algorithms FCM, GK and PCM with previous work.

7-Conclusions
In this research, three fuzzy clustering algorithms were applied to classify intrusions into 23 classes and to classify the same data set into 5 classes. In the first experiment, the three algorithms classify the 10% KDD data set into 23 classes, one for normal and the others for subtypes of attacks, and detect these attacks. In the first stage of the first experiment, the classification rate obtained with FCM, GK and PCM is 100% for all three algorithms. In the second stage of the first experiment, PCM obtained the highest detection rate (94.284%) and the lowest false alarm rate (1.310%). FCM obtained a detection rate of (91.659%) and a false alarm rate of (42.792%), and GK obtained the lowest detection rate (83.021%) and the highest false alarm rate (44.848%).
In the second experiment, the three fuzzy clustering algorithms classify the 10% KDD data set into 5 classes, one for normal and the others for types of attacks, and detect these attacks. In the first stage of the second experiment, we obtained a 100% classification rate for the three fuzzy clustering algorithms FCM, GK and PCM. In the second stage, PCM obtained the highest detection rate (99.955%) and the lowest false alarm rate (0.116%). FCM obtained a detection rate of (98.543%) but the highest false alarm rate (2.236%), and GK obtained the lowest detection rate (80.836%) with a false alarm rate of (1.502%). PCM therefore gives the best performance, followed by FCM and then GK.
Applying the three clustering algorithms (FCM, GK, PCM) to the KDD 99 data set gave the following:
• When the PCM algorithm is applied to this data set to classify it into 23 and 5 classes, it reaches a high classification rate (100%) in fewer iterations and less time than the other algorithms.
• The PCM algorithm also obtained a higher detection rate and a lower false alarm rate than the other algorithms.