Practical Comparison between Genetic Algorithm and Clonal Selection Theory on KDD data set

This paper compares two models in the field of intrusion detection: the common genetic algorithm and the newer clonal selection theory. Genetic algorithms (GAs) are a model of genetic evolution, while clonal selection theory (CST) is one of the models of the natural immune system (NIS). The two models come from different fields of artificial intelligence (AI), but they share a portion of their operations and objectives. The comparison is carried out by applying both models to some records of the Knowledge Discovery and Data Mining tools competition data, known as the KDD data set (its records describe the packets entering a computer system from the Internet), in order to produce a population (in the case of GA) or antibodies (in the case of CST) that can recognize abnormal records.


Introduction
The Internet has created a need among users for security components to protect themselves. Certain techniques, such as firewalls and encryption, are used to secure important data. A firewall acts as a defense to protect sensitive data, but it merely reduces exposure rather than monitoring or eliminating vulnerabilities in computer systems. Any encrypted message can in theory be decrypted, and encryption places an extra burden on hosts and applications. Moreover, any new security technique might itself have design flaws. It is therefore important to have a detecting and monitoring system to protect important data. For this reason, methods for detecting intruders in computer networks have drawn the attention of many researchers in recent years [1]. An Intrusion Detection System (IDS) is an important component of the computer and information security framework. Its main goal is to differentiate between normal activities of the system and behaviors that can be classified as suspicious or intrusive.
There are two main approaches to the design of IDSs: misuse detection and anomaly detection. In a misuse detection based IDS, intrusions are detected by looking for activities that correspond to known signatures of intrusions and vulnerabilities. It works by comparing network traffic, system call sequences, or other features against known attack patterns. Anomaly detection based IDSs, on the other hand, detect attacks by observing deviations from the normal behavior of the system.

Input data (the KDD Cup 99 Data)
Four Samples of Connection Records Corresponding to the Attack Types.

Evolutionary Computation
Evolution is an optimization process where the aim is to improve the ability of an organism (or system) to survive in dynamically changing and competitive environments [2] [3].
Evolutionary computation (EC) refers to computer-based problem solving systems that use computational models of evolutionary processes, such as natural selection, survival of the fittest and reproduction, as the fundamental components of such computational systems.
Evolution via natural selection of a randomly chosen population of individuals can be thought of as a search through the space of possible chromosome values. In that sense, an evolutionary algorithm (EA) is a stochastic search for an optimal solution to a given problem. The evolutionary search process is influenced by the main components of an EA: the encoding of solutions as chromosomes, the fitness function, the initialization of the population, the selection operators, and the reproduction operators. Algorithm (1) shows how these components are combined to form a generic EA [2].
The steps of an EA are applied iteratively until some stopping condition is satisfied. Each iteration of an EA is referred to as a generation.

Algorithm 1. Generic Evolutionary Algorithm
Let $t = 0$ be the generation counter;
Create and initialize an $n_x$-dimensional population, $C(0)$, to consist of $n_s$ individuals;
while stopping condition(s) not true do
    Evaluate the fitness, $f(x_i(t))$, of each individual, $x_i(t)$;
    Perform reproduction to create offspring;
    Select the new population, $C(t + 1)$;
    Advance to the new generation, i.e. $t = t + 1$;
end

The different ways in which the EA components are implemented result in different EC paradigms [2]:
• Genetic algorithms (GAs), which model genetic evolution.
• Genetic programming (GP), which is based on genetic algorithms, but individuals are programs (represented as trees).
• Evolutionary programming (EP), which is derived from the simulation of adaptive behavior in evolution (i.e. phenotypic evolution).
• Evolution strategies (ESs), which are geared toward modeling the strategic parameters that control variation in evolution, i.e. the evolution of evolution.
• Differential evolution (DE), which is similar to genetic algorithms, differing in the reproduction mechanism used.
• Cultural evolution (CE), which models the evolution of the culture of a population and how the culture influences the genetic and phenotypic evolution of individuals.
• Co-evolution (CoE), where initially "dumb" individuals evolve through cooperation, or in competition with one another, acquiring the necessary characteristics to survive.
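The loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this paper: the function name `evolve`, the reproduction scheme (copy a parent from the better half and perturb one gene), the survivor selection (keep the fittest of parents plus offspring), and all parameter values are assumptions made for the example.

```python
import random

def evolve(fitness, n_x, n_s, max_generations, p_mut=0.1, seed=0):
    """Generic EA loop in the spirit of Algorithm 1 (maximizes `fitness`)."""
    rng = random.Random(seed)
    # Create and initialize an n_x-dimensional population C(0) of n_s individuals.
    population = [[rng.random() for _ in range(n_x)] for _ in range(n_s)]
    t = 0
    while t < max_generations:  # stopping condition: generation limit
        # Evaluate the fitness of each individual and rank them.
        scored = sorted(population, key=fitness, reverse=True)
        # Reproduction (assumed scheme): copy a parent from the better half
        # and, with probability p_mut, perturb one gene inside [0, 1].
        parents = scored[: max(1, n_s // 2)]
        offspring = []
        for _ in range(n_s):
            child = list(rng.choice(parents))
            if rng.random() < p_mut:
                j = rng.randrange(n_x)
                child[j] = min(1.0, max(0.0, child[j] + rng.gauss(0.0, 0.1)))
            offspring.append(child)
        # Select the new population C(t + 1): the n_s fittest of parents + offspring.
        population = sorted(scored + offspring, key=fitness, reverse=True)[:n_s]
        t += 1  # advance to the new generation
    return max(population, key=fitness)
```

Calling `evolve(sum, n_x=3, n_s=10, max_generations=20)` returns the fittest individual found when the fitness is simply the sum of the genes. Because the survivor selection keeps the best of parents and offspring, the best fitness never decreases from one generation to the next.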

Genetic Algorithms
Genetic algorithms (GAs) are possibly the first algorithmic models developed to simulate genetic systems. GAs model genetic evolution, where the characteristics of individuals are expressed using genotypes. The main driving operators of a GA are selection (to model survival of the fittest) and recombination through application of a crossover operator (to model reproduction). This section discusses in detail the GA used in this research and its evolution operators. The GA follows the general algorithm given in Algorithm (1), but its components are chosen specifically to solve the intrusion detection problem on the KDD data set [2].
• A real value representation was used.
• Stochastic universal sampling selection was used to select parents for recombination.
The Positive Selection and Negative Selection steps, which come from AIS, were added here because they are necessary for intrusion detection applications.

Real value representation
Our data consist of fields of different types, both characters and numbers. To unify them, we convert characters to numbers and then apply a normalization process to obtain values in the range [0, 1].
Data transformations such as normalization may improve the accuracy and efficiency of artificial intelligence algorithms. Such methods provide better results if the data to be analyzed have been normalized, that is, scaled to a specific range such as [0, 1] [2].
Min-max normalization: Min-max normalization performs a linear transformation on the original data values. Suppose that $min_X$ and $max_X$ are the minimum and maximum of feature $X$, and that the interval $[min_X, max_X]$ is to be mapped into a new interval $[new\_min_X, new\_max_X]$. Every value $v$ from the original interval is then mapped into a value $new\_v$ using the following formula [1]:
$$new\_v = \frac{v - min_X}{max_X - min_X}\,(new\_max_X - new\_min_X) + new\_min_X$$
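The two preprocessing steps described above can be sketched as follows. The helper names `encode_symbols` and `min_max_normalize` are hypothetical, and the way characters are converted to numbers (alphabetical order of the distinct symbols) is an assumption, since the paper does not state the mapping it used.

```python
def encode_symbols(column):
    """Map each distinct symbol to an integer code (assumed: alphabetical order)."""
    codes = {symbol: i for i, symbol in enumerate(sorted(set(column)))}
    return [codes[symbol] for symbol in column]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]

# Example: a symbolic field such as the protocol type of a connection record.
protocol = ["tcp", "udp", "icmp", "tcp"]
normalized = min_max_normalize(encode_symbols(protocol))
```

With the default target interval [0, 1], the formula reduces to (v - min) / (max - min), so the smallest value of a feature maps to 0 and the largest to 1.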

Proportional Selection (Stochastic Universal sampling)
Selection operators are characterized by their selective pressure, also referred to as the takeover time, which relates to the time required to produce a uniform population. It is defined as the speed at which the best solution will occupy the entire population by repeated application of the selection operator alone. An operator with a high selective pressure decreases diversity in the population more rapidly than operators with a low selective pressure, which may lead to premature convergence to suboptimal solutions. A high selective pressure limits the exploration abilities of the population [2].
Two popular sampling methods used in proportional selection are roulette wheel sampling and stochastic universal sampling. In roulette wheel selection it may happen that the best individual is not selected to produce offspring during a given generation. To prevent this problem, stochastic universal sampling (refer to Algorithm 2) is used to determine, for each individual, the number of offspring, $\lambda_i$, to be produced by that individual, with only one call to the algorithm.
Because selection is directly proportional to fitness, it is possible that strong individuals may dominate in producing offspring, thereby limiting the diversity of the new population. In other words, proportional selection has a high selective pressure [2][4].
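A minimal sketch of stochastic universal sampling is given below, assuming non-negative fitness values where larger is better. The function name and its parameters are chosen for this example only. A single random offset places equally spaced pointers over the cumulative fitness, and the number of pointers falling on an individual gives its offspring count $\lambda_i$.

```python
import random

def stochastic_universal_sampling(fitnesses, n_offspring, rng=random):
    """Return the offspring count for each individual from one random draw."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("fitness values must sum to a positive number")
    step = total / n_offspring        # distance between consecutive pointers
    start = rng.random() * step       # the single random number, in [0, step)
    counts = [0] * len(fitnesses)
    i = 0
    cumulative = fitnesses[0]
    for k in range(n_offspring):
        pointer = start + k * step
        # Advance to the individual whose fitness segment contains the pointer.
        while cumulative <= pointer and i < len(fitnesses) - 1:
            i += 1
            cumulative += fitnesses[i]
        counts[i] += 1
    return counts
```

For fitness values 1, 2, 3 and 4 and ten offspring, the pointers are one unit apart, so the individuals receive 1, 2, 3 and 4 offspring respectively, whatever the random offset. In particular, the best individual cannot be skipped, which is the weakness of roulette wheel sampling noted above.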

Crossover (Uniform crossover)
Crossover operators can be divided into three main categories based on the arity (i.e. the number of parents used) of the operator. This results in three main classes of crossover operators:
• asexual, where an offspring is generated from one parent.
• sexual, where two parents are used to produce one or two offspring.
• multi-recombination, where more than two parents are used to produce one or more offspring.
Crossover operators are further categorized based on the representation scheme used. For example, binary-specific operators have been developed for binary string representations, and other operators are specific to floating-point representations.
Parents are selected using the selection scheme discussed in the previous section. Here, however, the binary crossover operator is applied to the parents' feature values instead of to bits (0 and 1). Recombination is applied probabilistically: each pair (or group) of parents has a probability, $p_c$, of producing offspring. Usually, a high crossover probability (also referred to as the crossover rate) is used. Most of the crossover operators for binary representations are sexual, being applied to two selected parents. If $x_1(t)$ and $x_2(t)$ denote the two selected parents, then the recombination process is summarized in Algorithm (3). In this algorithm, $m(t)$ is a mask that specifies which bits of the parents should be swapped to generate the offspring, $\tilde{x}_1(t)$ and $\tilde{x}_2(t)$. Several crossover operators have been developed to compute the mask: one-point crossover, two-point crossover, and uniform crossover [2].
Uniform crossover: The $n_x$-dimensional mask is created randomly, as summarized in Algorithm (3). Here, $p_x$ is the bit-swapping probability. If $p_x = 0.5$, then each bit has an equal chance of being swapped. Uniform crossover is illustrated in Figure (1).
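The mask-based recombination described above can be sketched as follows for parents stored as lists of feature values. The function names and the default values of $p_c$ and $p_x$ are assumptions for illustration, not settings reported in this paper.

```python
import random

def uniform_mask(n_x, p_x=0.5, rng=random):
    """Build the mask: position j is set to 1 with probability p_x."""
    return [1 if rng.random() <= p_x else 0 for _ in range(n_x)]

def uniform_crossover(parent1, parent2, p_c=0.9, p_x=0.5, rng=random):
    """Swap the masked positions of two parents with crossover probability p_c."""
    child1, child2 = list(parent1), list(parent2)
    if rng.random() <= p_c:
        mask = uniform_mask(len(parent1), p_x, rng)
        for j, m_j in enumerate(mask):
            if m_j == 1:
                # Swap feature j between the two offspring.
                child1[j], child2[j] = parent2[j], parent1[j]
    return child1, child2
```

Because whole feature values are swapped rather than bits, every offspring value is taken unchanged from one of the two parents, so values already normalized to [0, 1] stay in that range.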

Algorithm 3. Uniform Crossover Mask Calculation
Initialize the mask: $m_j(t) = 0$, for all $j = 1, \ldots, n_x$;
for $j = 1$ to $n_x$ do
    if $U(0, 1) \le p_x$ then
        $m_j(t) = 1$;
    end
end

Mutating real-valued attribute strings (vectors) has the same essence as mutating other types of strings, i.e., a change is made in one or more of the attributes, but it has to respect the upper and lower limits of each attribute (vector coordinate).
In inductive mutation, a random number is generated and added to a given attribute. A common mutation operator for real-valued vectors in evolutionary algorithms is Gaussian mutation. Gaussian mutation alters all the attributes of a string according to the following expression:
$$m' = m + \alpha(D)\,N(0, \sigma) \qquad (1)$$
where $m = (m_1, m_2, \ldots, m_L)$ is the attribute string, $m'$ is its mutated version, and $N(0, \sigma)$ is a vector of independent Gaussian random variables with zero mean and standard deviation $\sigma$. The function $\alpha(D)$ accounts for affinity-proportional mutation in AIS and is therefore omitted here in evolutionary computation [5].
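Equation (1) can be sketched as below. Clipping to the attribute limits is one simple way to respect the upper and lower bounds and is an assumption of this example, as are the function name and the default bounds [0, 1] (chosen because the data are normalized to that range). Setting `alpha` to 1 corresponds to the evolutionary case in which $\alpha(D)$ is omitted.

```python
import random

def gaussian_mutation(m, sigma, lower=0.0, upper=1.0, alpha=1.0, rng=random):
    """Return m' = m + alpha * N(0, sigma), clipped to [lower, upper]."""
    mutated = []
    for m_j in m:
        value = m_j + alpha * rng.gauss(0.0, sigma)
        # Respect the limits of the attribute (assumed strategy: clipping).
        mutated.append(min(upper, max(lower, value)))
    return mutated
```

A small `sigma` yields offspring close to the parent, while a larger value explores more of the attribute range.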

Replacement Strategy
A replacement strategy decides whether offspring will replace parents, and which parents to replace.
Two main classes of GAs are identified based on the replacement strategy used, namely generational genetic algorithms (GGAs) and steady state genetic algorithms (SSGAs), also referred to as incremental GAs. For GGAs, the replacement strategy replaces all parents with their offspring after all offspring have been created and mutated. This results in no overlap between the current population and the new population (assuming that elitism is not used). For SSGAs, a decision is made immediately after an offspring is created and mutated as to whether the parent or the offspring survives to the next generation. Thus, there exists an overlap between the current and new populations.
The amount of overlap between the current and new populations is referred to as the generation gap. GGAs have a zero generation gap, while SSGAs generally have large generation gaps [2][4].
A number of replacement strategies have been developed for SSGAs: replace worst, replace random, kill tournament, replace oldest, conservative selection, elitist, and parent-offspring.
• Replace worst was used here: the offspring replaces the worst individual of the current population.
The flowchart in Figure 2 summarizes the preceding steps of our application of the GA to the intrusion detection problem on the KDD data set.
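The replace worst strategy can be sketched in a few lines. The function name and the parallel `fitnesses` list are assumptions of this example, and higher fitness is assumed to be better.

```python
def replace_worst(population, fitnesses, offspring, offspring_fitness):
    """Overwrite the least fit individual with the offspring; return its index."""
    worst = min(range(len(population)), key=fitnesses.__getitem__)
    population[worst] = offspring
    fitnesses[worst] = offspring_fitness
    return worst
```

Note that, as described above, the replacement is unconditional: the offspring enters the population even if it is weaker than the individual it replaces. A variant that only accepts fitter offspring would add one comparison before the assignment.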

AIS - Learning the Antigen Structure
Learning in the immune system is based on increasing the population size of those lymphocytes that frequently recognize antigens. Learning by the immune system is done by a process known as affinity maturation. Affinity maturation can be broken down into two smaller processes, namely a cloning process and a somatic hyper-mutation process. The cloning process is more generally known as clonal selection, which is the proliferation of the lymphocytes that recognize the antigens.
The interaction of a lymphocyte with an antigen leads to an activation of the lymphocyte, whereupon the cell proliferates and grows into a clone. When an antigen stimulates a lymphocyte, the lymphocyte not only secretes antibodies to bind to the antigen but also generates mutated clones of itself in an attempt to achieve a higher binding affinity with the detected antigen. The latter process is known as somatic hyper-mutation. Thus, through repetitive exposure to the antigen, the immune system learns and adapts to the shape of the frequently encountered antigen and moves from random receptor creation to a repertoire that represents the antigens more precisely. Lymphocytes in a clone produce antibodies in the case of a B-cell and secrete growth factors (lymphokines) in the case of a helper T-cell (HTC) [2].
Since antigens determine or select the lymphocytes that need to be cloned, the process is called clonal selection. The fittest clones are those that produce antibodies that bind to the antigen best (with the highest affinity). Since the total number of lymphocytes in the immune system is regulated, an increase in the size of some clones decreases the size of other clones. This leads to the immune system forgetting previously learned antigens. When a familiar antigen is detected, the immune system responds with larger cloning sizes. This response is referred to as the secondary immune response. Learning is also based on decreasing the population size of those lymphocytes that seldom or never detect any antigens. These lymphocytes are removed from the immune system. For the affinity maturation process to be successful, the receptor molecule repository needs to be as complete and diverse as possible in order to recognize any foreign shape [2][3].

Clonal Selection Theory Models
The process of clonal selection in the natural immune system was discussed in the previous section. Clonal selection in AIS is the selection of a set of artificial lymphocytes (ALCs) with the highest calculated affinity with a non-self pattern. The selected ALCs are then cloned and mutated in an attempt to achieve a higher binding affinity with the presented non-self pattern. The mutated clones compete with the existing set of ALCs, based on the calculated affinity between the mutated clones and the non-self pattern, for survival to be exposed to the next non-self pattern [2].
In clonal selection algorithms, each antibody and antigen is represented by a set of attributes $\{x_1, x_2, \ldots, x_n\}$. Thus, antibodies and antigens may be represented either as n-dimensional points in a metric space such as Euclidean space or by a binary encoding of the attributes; however, other representations are also used. The antigenic affinity of each antibody is typically defined based on a metric, usually the Euclidean distance. Also, some operators are defined to introduce genetic variation to the antibodies based on their antigenic affinities. First, a cloning operator is defined to make exact copies (clones) of those antibodies having higher antigenic affinities; the higher the antigenic affinity, the higher the number of clones an antibody can generate. Then some genetic variation is introduced to these antibodies (through a mutation operator) to allow them to better match the antigens [3].
Clonal selection algorithms are developed based on the clonal selection theory proposed nearly 50 years ago. The main immunological elements used are:
• Maintenance of a specific memory set.
• Selection and cloning of most stimulated antibodies.
• Removal of poorly stimulated or nonstimulated antibodies.
• Generation and maintenance of a diverse set of antibodies.

CLONALG
The selection of a lymphocyte by a detected antigen for clonal proliferation inspired the modeling of CLONALG. CLONALG is an algorithm that performs machine-learning and pattern recognition tasks. All patterns are presented as binary strings [2].
The affinity between an ALC and a non-self pattern is measured as the Hamming distance between the ALC and the non-self pattern. The Hamming distance gives an indication of the similarity between two patterns, i.e. a lower Hamming distance between an ALC and a non-self pattern implies a stronger affinity.
All patterns in the training set are seen as non-self patterns. Algorithm (4) summarizes CLONALG for pattern recognition tasks. The different parts of the algorithm are explained next [2].
The set of ALCs, $C$, is initialized with $n_a$ randomly generated ALCs. The ALC set is split into a memory set of ALCs, $M$, and the remaining set of ALCs, $R$, which are not in $M$. Thus, $C = M \cup R$ and $|C| = |M| + |R|$ (i.e. $n_a = n_m + n_r$). The assumption in CLONALG is that there is one memory ALC for each of the patterns that needs to be recognized in $D_T$.
Each training pattern, $z_p$, at random position, $p$, in $D_T$, is presented to $C$. The affinity between $z_p$ and each ALC in $C$ is calculated. A subset of the $n_h$ highest affinity ALCs is selected from $C$ as subset $H$. The $n_h$ selected ALCs are then sorted in ascending order of affinity with $z_p$. Each ALC in the sorted $H$ is cloned proportionally to its calculated affinity with $z_p$ and added to set $W$. The number of clones, $n_{ci}$, generated for an ALC, $x_i$, at position $i$ in the sorted set $H$, is defined as
$$n_{ci} = \mathrm{round}\left(\frac{\beta \times n_h}{i}\right)$$
where $\beta$ is a multiplying factor and round returns the closest integer. The ALCs in the cloned set, $W$, are mutated with a mutation rate that is inversely proportional to the calculated affinity, i.e. a higher affinity implies a lower rate of mutation. The mutated clones in $W$ are added to a set of mutated clones, $W'$. The affinity between the mutated clones in $W'$ and the selected training pattern, $z_p$, is calculated.
The ALC with the highest calculated affinity in $W'$, $x'$, replaces the ALC at position $p$ in set $M$ if the affinity of $x'$ is higher than the affinity of the ALC in set $M$. Randomly generated ALCs replace $n_l$ of the lowest affinity ALCs in $R$. The learning process repeats until the maximum number of generations, $t_{max}$, has been reached. A modified version of CLONALG has been applied to multi-modal function optimization.
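The CLONALG steps described above can be sketched as follows for binary patterns. This is a hedged illustration rather than the implementation evaluated in this paper. The names `hamming` and `clonalg` and all default parameter values are assumptions. Two simplifications are also assumed: the selected ALCs are ranked best first, so that rank 1 (the highest affinity) receives the most clones, and the per-bit mutation rate is simply the normalized Hamming distance, so that a higher affinity implies a lower rate. Patterns are presented in order rather than at random positions.

```python
import random

def hamming(a, b):
    """Hamming distance: number of positions at which two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def clonalg(patterns, n_r=10, n_h=5, n_l=2, beta=1.0, t_max=30, seed=0):
    """Return the memory set M, one ALC per training pattern."""
    rng = random.Random(seed)
    n_bits = len(patterns[0])

    def random_alc():
        return [rng.randint(0, 1) for _ in range(n_bits)]

    memory = [random_alc() for _ in patterns]        # M, with n_m = |D_T|
    remaining = [random_alc() for _ in range(n_r)]   # R
    for _ in range(t_max):
        for p, z in enumerate(patterns):
            pool = memory + remaining                # C = M ∪ R
            # Highest affinity means lowest Hamming distance to z.
            selected = sorted(pool, key=lambda c: hamming(c, z))[:n_h]
            mutated = []
            for i, alc in enumerate(selected, start=1):
                n_clones = max(1, round(beta * n_h / i))
                rate = hamming(alc, z) / n_bits      # assumed mutation rate
                for _ in range(n_clones):
                    mutated.append(
                        [1 - b if rng.random() < rate else b for b in alc]
                    )
            best = min(mutated, key=lambda c: hamming(c, z))
            # Replace the memory ALC only if the mutated clone is better.
            if hamming(best, z) < hamming(memory[p], z):
                memory[p] = best
            # Replace the n_l lowest affinity ALCs in R with random ALCs.
            remaining.sort(key=lambda c: hamming(c, z))
            remaining[n_r - n_l:] = [random_alc() for _ in range(n_l)]
    return memory
```

Since a memory ALC is only replaced by a clone with a strictly smaller Hamming distance, the distance between each memory ALC and its pattern can never increase during training.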

Affinity Proportional Mutation rates
Here in CLONALG we also applied the somatic mutation for real values discussed in the evolutionary sections. From the viewpoint of evolution, however, a remarkable characteristic of the affinity maturation process is its controlled nature: the hypermutation rate applied to every immune cell receptor depends on its antigenic affinity. By computationally simulating this process, one can produce powerful algorithms that perform a search akin to local search around each candidate solution. The mutations borrowed from evolutionary algorithms in equation (1) do not account for this important aspect of mutation in the immune system: it is inversely proportional to the antigenic affinity [5].
In this case, one can evaluate the relative affinity of each candidate solution by scaling (normalizing) the affinities. The inverse of an exponential function can be used to establish a relationship between the hypermutation rate $\alpha(\cdot)$ and the normalized affinity $D^*$, as described in equation (2). In some cases it might be interesting to re-scale $\alpha$ to an interval such as [0, 1].
$$\alpha(D^*) = \exp(-\rho D^*) \qquad (2)$$
where $\rho$ is a parameter that controls the smoothness of the inverse exponential, and $D^*$ is the normalized affinity, which can be determined by $D^* = D / D_{max}$.
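Equations (1) and (2) can be combined in a short sketch. The function names, the value of $\rho$, the value of $\sigma$, and the clipping to [0, 1] are assumptions made for this example. Here `affinity` is taken as a quantity where larger means a better match, so that a high affinity yields a small $\alpha$ and hence a small mutation step.

```python
import math
import random

def hypermutation_scale(affinity, max_affinity, rho=5.0):
    """alpha(D*) = exp(-rho * D*), with D* = D / D_max."""
    d_star = affinity / max_affinity
    return math.exp(-rho * d_star)

def affinity_proportional_mutation(m, affinity, max_affinity, sigma=0.1,
                                   rho=5.0, lower=0.0, upper=1.0, rng=random):
    """Apply m' = m + alpha(D*) * N(0, sigma), clipped to [lower, upper]."""
    alpha = hypermutation_scale(affinity, max_affinity, rho)
    return [min(upper, max(lower, m_j + alpha * rng.gauss(0.0, sigma)))
            for m_j in m]
```

With $\rho = 5$, an antibody with the maximum affinity ($D^* = 1$) is mutated with a scale of $\exp(-5) \approx 0.0067$, whereas one with zero affinity is mutated with the full scale of 1.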

Shape-Space and Affinity
The shape-space (or representation space) concept is used to represent antibody and antigen binding (see Figure 2). Accordingly, antigens and antibodies are characterized by their physicochemical binding properties, which are represented as coordinate points in such a space, typically a Euclidean space (Figure 4). Binding properties include geometric shape, hydrophobicity, charge, etc. In computational models, the notion of affinity between antibodies and antigens is defined based on a distance measure between points in the shape-space. Specifically, a small distance between an antibody and an antigen represents high affinity between them. It should be noted that in some cases the coordinates are not given explicitly, but the distance between antibodies and antigens is provided [3]. In Figure 5, the big outer circle $V$, the crosses (X), and the small inner circles $V_\varepsilon$ represent the shape-space, the antigens, and the affinity (coverage) of the antibodies, respectively [3]. Thus, $\varepsilon$ specifies a recognition threshold: if the distance between an antibody and an antigen (X) is less than $\varepsilon$ (i.e., the antigen lies inside the affinity region of the antibody), then the antigen is said to match (bind) the antibody.

Real-Valued Vector Matching Rules (GA & CLONALG)
Distance measures, which express the amount of difference between two objects, have been used to define matching rules in real-valued vector representations [3].
Euclidean distance: The Euclidean distance is defined as
$$d(x, y) = \sqrt{\sum_i (x_i - y_i)^2} = \|x - y\| \qquad (3)$$
The Euclidean distance can be modified when the dimensions do not all have equal weights by multiplying each component of the vectors by a specific weight. Other distance measures can be used to define real-valued matching rules in a similar way to the Euclidean distance. The choice of distance measure mainly relies on the type of data and on domain knowledge of the specific application [3].
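Equation (3) and a threshold-based matching rule can be sketched as follows. The function names and the recognition threshold `epsilon` passed to `matches` are assumptions for illustration; the optional weights multiply each component of both vectors, as described above.

```python
import math

def euclidean_distance(x, y, weights=None):
    """d(x, y) = sqrt(sum_i (w_i * (x_i - y_i))^2); all w_i default to 1."""
    if weights is None:
        weights = [1.0] * len(x)
    return math.sqrt(sum((w * (a - b)) ** 2 for w, a, b in zip(weights, x, y)))

def matches(detector, record, epsilon, weights=None):
    """A detector matches a record if their distance is below epsilon."""
    return euclidean_distance(detector, record, weights) < epsilon
```

For example, the distance between the points (0, 0) and (3, 4) is 5, so a detector at the origin matches that record for a threshold of 6 but not for a threshold of 5.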

Stopping Conditions (GA & CLONALG)
The evolutionary operators are iteratively applied in an EA until a stopping condition is satisfied. The simplest stopping condition is to limit the number of generations that the EA is allowed to execute or, alternatively, to place a limit on the number of fitness function evaluations. This limit should not be too small, otherwise the EA will not have sufficient time to explore the search space [2].
In addition to a limit on execution time, a convergence criterion is usually used to detect if the population has converged. Convergence is loosely …

The main components of an EA are:
• an encoding of solutions to the problem as a chromosome;
• a function to evaluate the fitness, or survival strength, of individuals;
• initialization of the initial population;
• selection operators; and
• reproduction operators.

The remaining components of the GA used in this research are:
• Uniform crossover was used as the primary method to produce offspring.
• Somatic mutation for real values.
• Fitness evaluation, see section 8.
• Positive Selection, see section 10.2.
• Replace worst.
• Stopping condition, see section 9.
• Negative Selection, see section 10.1; this step is performed on the detectors once, after the generation cycles are complete.
The two steps Positive Selection and Negative Selection are from AIS.
$$n_{ci} = \mathrm{round}\left(\frac{\beta \times n_h}{i}\right)$$
where $\beta$ is a multiplying factor and round returns the closest integer.

Algorithm 4. CLONALG Algorithm for Pattern Recognition
$t = t_{max}$;
Determine the antigen patterns as training set $D_T$;
Initialize a set of $n_a$ randomly generated ALCs as population $C$;
Select a subset of $n_m = |D_T|$ memory ALCs, as population $M \subseteq C$;
Select a subset of $n_a - n_m$ ALCs, as population $R \subseteq C$;
while $t > 0$ do

Figure 4: Antigens and antibodies are represented as points in an N-dimensional (Euclidean) space.

Al-Rafidain Journal of Computer Sciences and Mathematics, Proceedings of the Third Scientific Conference on Information Technology 2010, 29-30/Nov./2010, College of Computer Sciences and Mathematics, University of Mosul
The KDD Cup 99 data [1] is the data set of the Third International Knowledge Discovery and Data Mining Tools Competition (KDD Cup 99), which was held in conjunction with the Fifth International Conference on Knowledge Discovery and Data Mining. The KDD Cup 1999 data set is used for benchmarking intrusion detection problems. The data set was collected over a period of nine weeks on a local area network. The record types are grouped into five categories: Normal, Probing, Denial of Service (DoS), User to Root (U2R), and Remote to Local (R2L). The KDD Cup 99 data set is divided into training and testing record sets. The total number of connection records in the training data set is about 5 million. This is too large for our purpose, so only the concise training data set of KDD Cup 99, known as 10% KDD Cup 99, and the test data set, known as the (correct) data set, were employed here [1].

Algorithm 4 (continued). Body of the loop "while $t > 0$ do":
for each antigen pattern $z_p \in D_T$ do
    Calculate the affinity between $z_p$ and each of the ALCs in $C$;
    Select $n_h$ of the highest affinity ALCs with $z_p$ from $C$ as subset $H$;
    Sort the ALCs of $H$ in ascending order, according to the ALCs' affinity;
    Generate $W$ as the set of clones for each ALC in $H$;
    Generate $W'$ as the set of mutated clones for each ALC in $W$;
    Calculate the affinity between $z_p$ and each of the ALCs in $W'$;
    Select the ALC with the highest affinity in $W'$ as $x'$;
    Insert $x'$ in $M$ at position $p$;
    Replace $n_l$ of the lowest affinity ALCs in $R$ with randomly generated ALCs;