Improving Gene Expression Programming Method

In this work the Gene Expression Programming (GEP) algorithm is investigated thoroughly and its major deficiencies are pointed out. Several enhancements are introduced to address these deficiencies; they produced higher success rates and avoided the malfunctioning situations found in GEP. The deficiencies, or weak points, are: choosing the best parameter settings, using only one linking function, the gene flattening problem, illegal operations in genes, and the lack of function biasing. The suggested enhancement features are the Multi-Population feature, the Emergency Mutation feature, and the Component Biasing feature. Tests are carried out using two different symbolic regression problems.


Introduction
Gene Expression Programming (GEP) was introduced by Ferreira in 2001 [5]. The great insight of GEP consisted in the invention of chromosomes capable of representing any expression tree. For that a new language (Karva) was created so that the information of GEP chromosomes could be read and expressed. The structural and functional organization of genes always guarantees the production of valid programs, no matter how much or how profoundly the chromosomes are modified.
Gene expression programming (GEP) is, like genetic algorithms (GAs) and genetic programming (GP), a genetic algorithm as it uses populations of individuals, selects them according to fitness, and introduces genetic variation using one or more genetic operators. GAs, with their simple genome and limited structural and functional diversity, resemble a primitive RNA World, whereas GP, with its structural and functional diversity, resembles a hypothetical Protein World. Only when molecules capable of replication joined molecules with catalytic activity, forming an indivisible whole, was it possible to create more complex systems and, ultimately, the first cell. Since then, the genome and phenome mutually presume one another and neither can function without the other. Similarly, the chromosomes and expression trees of GEP mutually presume one another and neither exists without the other. [5] The advantages of a system like GEP are clear from nature, but the most important should be emphasized: First, the chromosomes are simple entities: linear, compact, relatively small, easy to genetically manipulate (replicate, mutate, recombine, transpose, etc.). Second, expression trees are exclusively the expression of the respective chromosomes; they are the entities upon which selection acts and, according to fitness, they are selected to reproduce with modification. During reproduction it is their chromosomes, not the ETs, which are reproduced with modification and transmitted to the next generation.
GEP is a rapidly growing field and has recently been applied in many research areas, such as Hydraulic Data Mining [4] and Classifier Conditions [16].

GEP Method
2.1 The Structure of the Chromosome
The phenotype of GEP individuals consists of the same kind of diagram representations used by GP. However, these complex entities are encoded in simpler, linear structures of fixed length (chromosomes). Thus, the main parts in GEP are two entities: the chromosomes and the expression trees (ETs), the latter being the expression of the genetic information encoded in the former. The process of translating the chromosomes to ETs implies a kind of code and a set of rules. The genetic code is very simple: a one-to-one relationship between the symbols of the chromosome and the functions or terminals they represent. The rules are also simple: they determine the spatial organization of the functions and terminals in the ETs and the type of interaction between sub-ETs in multigenic systems [6]. Given a GEP individual (genotype) in Karva language, the phenotype can easily be represented by an ET as in Figure (1).

Figure (1) An Expression, the Corresponding GEP Chromosome, and the Expression Tree
Expression: $((a-b)*(c+d))^2$
GEP chromosome:
01234567
S*-+abcd

Genes are composed of a head and a tail. The head contains both function (nonterminal) and terminal symbols. The tail contains only terminal symbols. For each problem the head length (h) is chosen by the user. Given the maximum arity n, that is, the number of arguments of the function with the most arguments, the tail length (t) is evaluated by:
$t = h(n-1) + 1$ …(1)
In this way, if n = 2 and h = 4, then t = 5 and the total length of the gene is 9. So despite their fixed length, GEP genes have the potential to code for ETs of different sizes and shapes, the simplest being composed of only one node (the first element is a terminal) and the biggest being composed of as many nodes as the gene length (all head elements are functions of maximum arity).

Figure (2) Multigenic Chromosomal Structure in GEP Method
GEP chromosomes are usually composed of more than one gene of equal length, as in Figure (2) [12]. For each problem or run, the number of genes, as well as the head length, is chosen a priori. Each gene codes for a sub-ET, and the sub-ETs interact with one another through a linking function, forming a more complex multi-subunit ET.
Multigenic chromosomes were introduced because the first symbol in a gene may happen to be a terminal, in which case a single-gene chromosome cannot represent a complex expression. As an indirect consequence, if the first symbol of a gene is a terminal then the rest of the gene is unused.
Breadth-first parsing is used in the translation of tree programs into genes, where usually the gene is not entirely used for phenotypic transcription. If the first symbol in the gene is a terminal, the expression tree consists of a single node. If all symbols in the head are non-terminals the expression tree uses all the symbols of the gene.
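The breadth-first reading described above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, and reading the symbols 'S' and 'Q' as "square" and "square root" is an assumption made for the example. The gene used in the usage note is the chromosome S*-+abcd from Figure (1).

```python
import math

# Arity of each function symbol; any other symbol is treated as a terminal.
# Reading 'S' as "square" and 'Q' as "square root" is an assumption of this sketch.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "S": 1, "Q": 1}
OPS = {
    "+": lambda x, y: x + y,
    "-": lambda x, y: x - y,
    "*": lambda x, y: x * y,
    "/": lambda x, y: x / y,
    "S": lambda x: x * x,
    "Q": math.sqrt,
}


def expressed_length(gene):
    """Return how many leading symbols of the gene the expression tree uses."""
    open_slots, used = 1, 0  # one open slot for the root
    while open_slots > 0:
        # A symbol fills one slot and opens as many slots as its arity.
        open_slots += ARITY.get(gene[used], 0) - 1
        used += 1
    return used


def evaluate(gene, env):
    """Evaluate a Karva-encoded gene; env maps terminal symbols to values."""
    k = expressed_length(gene)
    # In breadth-first order, the arguments of node i are the next unused positions.
    children, next_free = [], 1
    for i in range(k):
        n = ARITY.get(gene[i], 0)
        children.append(range(next_free, next_free + n))
        next_free += n
    values = [None] * k
    # Children always sit after their parent, so evaluate from the end backwards.
    for i in reversed(range(k)):
        sym = gene[i]
        args = [values[c] for c in children[i]]
        values[i] = OPS[sym](*args) if sym in OPS else env[sym]
    return values[0]
```

With a = 5, b = 2, c = 1 and d = 3, `evaluate("S*-+abcd", ...)` computes ((5-2)*(1+3))^2. A gene whose first symbol is a terminal has an expressed length of 1, which is the single-node case mentioned in the text.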
Genes may be linked by a function symbol in order to obtain a fully functional chromosome. The linking functions for algebraic expressions are addition and multiplication. A single type of function is used for linking multiple genes. If all of the functions {+, -, *, /} are used as linking operators, the complexity of the problem grows substantially, since the problem of determining how to mix these operators with the genes is as hard as the initial problem. [13]

GEP Algorithm
The flowchart of the Gene Expression Algorithm is shown in Figure (3). The process begins with the random generation of the chromosomes of each individual in the initial population. Then chromosomes are expressed and the fitness of each individual is evaluated.
Individuals are then selected according to fitness to reproduce with modification, leaving progeny with new traits. The individuals of this new generation are, in their turn, subjected to the same developmental process. The process is repeated for a certain number of generations or until a solution has been found. Reproduction here includes not only replication but also the action of genetic operators capable of creating genetic diversity. During replication, the genome is rigorously copied and transmitted to the next generation. The operators randomly select the chromosomes to be modified. Thus, in GEP, a chromosome might be modified by one or several operators at a time or not be modified at all. [5]

Reproduction in GEP
According to fitness and the luck of the roulette, individuals are selected to reproduce with modification, creating the necessary genetic diversity that allows adaptation in the long run. Except for replication, where the genomes of all the selected individuals are rigorously copied, all the remaining operators randomly pick chromosomes to be subjected to a certain modification. However, except for mutation, each operator is not allowed to modify a chromosome more than once. Furthermore, in GEP, a chromosome might be chosen by one or several genetic operators. Thus, the modifications of several genetic operators accumulate during reproduction, producing offspring very different from the parents.

Replication
Although vital, replication is the most uninteresting operator: alone it contributes nothing to genetic diversification. According to fitness and the luck of the roulette, chromosomes are faithfully copied into the next generation. The fitter the individual the higher the probability of leaving more offspring. Thus, during replication the genomes of the selected individuals are copied as many times as the outcome of the roulette. The roulette is spun as many times as there are individuals in the population, maintaining always the same population size.
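Replication as described here amounts to fitness-proportionate sampling with replacement. The following Python sketch assumes non-negative fitness values; the function name and the uniform fallback for a population whose fitness values are all zero are illustrative choices, not part of the method as described.

```python
import random


def roulette_replicate(population, fitnesses, rng=random):
    """Spin the roulette once per individual and return the copied population."""
    total = sum(fitnesses)
    if total <= 0:
        # Assumption of this sketch: with no positive fitness, sample uniformly.
        return [rng.choice(population) for _ in population]
    # Each spin picks an individual with probability proportional to its fitness.
    return rng.choices(population, weights=fitnesses, k=len(population))
```

The population size is preserved because the wheel is spun exactly `len(population)` times, and fitter individuals tend to receive more copies.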

Mutation
Mutations can occur anywhere in the chromosome. However, the structural organization of chromosomes must remain intact. In the heads any symbol can change into another (function or terminal); in the tails terminals can only change into terminals. This way, the structural organization of chromosomes is maintained, and all the new individuals produced by mutation are structurally correct programs. Typically, a mutation rate (pm) equivalent to 2 point mutations per chromosome is used. For example, if a mutation occurred in a 3-genic chromosome, it might change the element in position 0 in gene 1 to 'Q'; the element in position 3 in gene 2 to 'Q'; and the element in position 1 in gene 3 to 'b'.
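The head/tail rule for mutation can be sketched as follows. This is a minimal Python illustration: the function and terminal sets, the parameter names, and the choice to allow a symbol to mutate into itself are assumptions of the sketch.

```python
import random

FUNCTIONS = "+-*/Q"  # assumed function set for illustration
TERMINALS = "ab"     # assumed terminal set for illustration


def mutate(chromosome, head_len, gene_len, pm, rng=random):
    """Point-mutate each position with probability pm, respecting head and tail."""
    out = []
    for pos, sym in enumerate(chromosome):
        if rng.random() < pm:
            in_head = (pos % gene_len) < head_len
            # Heads may receive any symbol; tails may only receive terminals.
            pool = FUNCTIONS + TERMINALS if in_head else TERMINALS
            sym = rng.choice(pool)
        out.append(sym)
    return "".join(out)
```

A rate equivalent to 2 point mutations per chromosome corresponds to `pm = 2 / len(chromosome)` in this sketch.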

Transposition and Insertion Sequence Elements
The transposable elements of GEP are fragments of the genome that can be activated and jump to another place in the chromosome. In GEP there are three kinds of transposable elements: 1) Short fragments with a function or terminal in the first position that transpose to the head of genes except to the root (insertion sequence elements or IS elements); 2) Short fragments with a function in the first position that transpose to the root of genes (root IS elements or RIS elements); 3) Entire genes that transpose to the beginning of chromosomes.

Transposition of IS elements
Any random sequence in the genome might become an IS element. A copy of the transposon is made and inserted at any position in the head of a gene, except at the start position. Typically, a transposition rate (pis) of 0.1 and a set of three IS elements of different length are used. The chromosome, IS element, target site, and length of the transposon are all randomly chosen. Suppose that the sequence 'bba' in gene 2 (positions 12-14) was chosen to be an IS element in the chromosome below:
012345678901234567890012345678901234567890
*-+*a-+a*bbabbaabababQ**+abQbb*aabbaaaabba
If the target site is bond 6 in gene 1 (between positions 5 and 6), a cut is made at bond 6 and the block 'bba' is copied into the site of insertion. The head symbols downstream of the insertion site shift, and those pushed past the end of the head are deleted, so the gene length is maintained, obtaining:
012345678901234567890012345678901234567890
*-+*a-bba+babbaabababQ**+abQbb*aabbaaaabba
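The IS transposition example above can be reproduced with a short Python sketch. The function name and parameters are hypothetical, genes are indexed from 0 in the code, and the head length of 10 is inferred from the gene length of 21 under the assumption of a maximum arity of 2. The random choices (chromosome, transposon, target site) are left to the caller.

```python
def is_transpose(chromosome, head_len, gene_len,
                 src_start, src_len, target_gene, target_pos):
    """Copy an IS element into the head of target_gene at head position target_pos."""
    if not 1 <= target_pos < head_len:
        raise ValueError("IS elements may not be inserted at the root of a gene")
    transposon = chromosome[src_start:src_start + src_len]
    g0 = target_gene * gene_len
    head = chromosome[g0:g0 + head_len]
    # Insert the copy, then truncate so the head keeps its original length.
    new_head = (head[:target_pos] + transposon + head[target_pos:])[:head_len]
    return chromosome[:g0] + new_head + chromosome[g0 + head_len:]
```

For the chromosome in the text, the transposon 'bba' starts at absolute position 21 + 12 = 33, and the target is head position 6 of the first gene. The tail of the target gene is untouched, so the result remains a structurally valid chromosome.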

Root transposition
All RIS elements start with a function, and thus are chosen from the heads. For that, a point is randomly chosen in the head and the gene is scanned downstream until a function is found. This function becomes the start position of the RIS element. If no function is found, the operator does nothing. Typically a root transposition rate (pris) of 0.1 and a set of three RIS elements of different sizes are used. This operator randomly chooses the chromosomes, the gene to be modified, the RIS element, and its length. For example, if the sequence '+bb' in gene 2 is chosen as an RIS element, a copy of the transposon is made into the root of the gene; the original head contents shift downstream and the symbols pushed past the end of the head are deleted, so the gene length is maintained.
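Root transposition can be sketched in the same style. In this Python illustration the function set, the function name, and the decision to cap the transposon at the end of the gene are assumptions; the scan start and the element length would be chosen at random by the caller.

```python
FUNCTIONS = set("+-*/Q")  # assumed function set for illustration


def ris_transpose(chromosome, head_len, gene_len, gene, scan_from, length):
    """Copy an RIS element (starting at a function) to the root of the given gene."""
    g0 = gene * gene_len
    head = chromosome[g0:g0 + head_len]
    # Scan downstream from the chosen point until a function is found.
    start = next((i for i in range(scan_from, head_len) if head[i] in FUNCTIONS), None)
    if start is None:
        return chromosome  # no function found: the operator does nothing
    end = min(g0 + start + length, g0 + gene_len)
    transposon = chromosome[g0 + start:end]
    # The old head shifts downstream and is truncated to the head length.
    new_head = (transposon + head)[:head_len]
    return chromosome[:g0] + new_head + chromosome[g0 + head_len:]
```

For a hypothetical single gene `*a-*abbab` with head length 4, scanning from position 1 finds '-' at position 2, and a length-3 element `-*a` moved to the root yields the head `-*a*` with the tail unchanged.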

Gene Transposition
Here an entire gene functions as a transposon and transposes itself to the beginning of the chromosome. In contrast to the other forms of transposition, in gene transposition the transposon (the gene) is deleted in the place of origin. This way, the chromosome's length is maintained. The chromosome to undergo gene transposition is randomly chosen, and one of its genes (except the first) is randomly chosen to transpose. Consider the following chromosome:
012345678012345678012345678
*a-*abbab-QQ/aaabbQ+abababb
If gene 2 is chosen to undergo gene transposition, the following chromosome is obtained:
012345678012345678012345678
-QQ/aaabb*a-*abbabQ+abababb
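The gene transposition example above can be checked with a few lines of Python. The function name is hypothetical and genes are indexed from 0 in the code, so "gene 2" in the text is index 1.

```python
def gene_transpose(chromosome, gene_len, gene):
    """Move the gene at the given index (not the first) to the front of the chromosome."""
    if gene == 0:
        return chromosome  # the first gene is never chosen to transpose
    genes = [chromosome[i:i + gene_len] for i in range(0, len(chromosome), gene_len)]
    moved = genes.pop(gene)  # the gene is deleted at its place of origin
    return moved + "".join(genes)
```

Because the gene is removed from its original position before being placed at the front, the chromosome length is unchanged.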

Recombination
In GEP there are three kinds of recombination: 1-point, 2-point, and gene recombination. In all cases, two parent chromosomes are randomly chosen and paired to exchange some material between them.

One-point recombination
In 1-point recombination, the chromosomes cross over at a randomly chosen point to form two children chromosomes. For example, if bond 3 in gene 1 (between positions 2 and 3) is randomly chosen as the crossover point, the paired parent chromosomes are cut at this bond and exchange the material downstream of the crossover point, forming the offspring.

Two-point recombination
The chromosomes are paired and the two points of recombination are randomly chosen. The material between the recombination points is afterwards exchanged between the two chromosomes, forming two new children chromosomes. For example, if bond 7 in gene 1 (between positions 6 and 7) and bond 3 in gene 2 (between positions 2 and 3) are chosen as the crossover points, the paired parent chromosomes are cut at these bonds and exchange the material between the crossover points, forming the offspring.

Gene recombination
In gene recombination an entire gene is exchanged during crossover. The exchanged genes are randomly chosen and occupy the same position in the parent chromosomes. For example, if gene 2 is chosen to be exchanged, each offspring consists of one parent's chromosome with its gene 2 replaced by the gene 2 of the other parent.
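The three recombination operators of this section can be sketched on string chromosomes as follows. The function names are hypothetical, crossover points are given as absolute string positions chosen by the caller, and the parent chromosomes in the usage note are made up for illustration (two genes of length 9 with head length 4).

```python
def one_point(p1, p2, point):
    """Exchange everything downstream of the crossover point."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]


def two_point(p1, p2, a, b):
    """Exchange the material between the two crossover points."""
    a, b = sorted((a, b))
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]


def gene_recombine(p1, p2, gene_len, gene):
    """Exchange the whole gene at the given index (a two-point cut on gene borders)."""
    return two_point(p1, p2, gene * gene_len, (gene + 1) * gene_len)
```

With the hypothetical parents `*a-*abbab-QQ/aaabb` and `+b*aabbbaQ+abababb`, `one_point(p1, p2, 3)` cuts both at bond 3 of the first gene. Since both parents are cut at the same positions, head symbols are only ever exchanged for head symbols and tail symbols for tail symbols, so the offspring remain structurally valid.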

GEP Malfunctioning Conditions
The GEP method was thoroughly investigated in this work because it is considered the most appropriate approach among the various methods introduced so far in this field. This investigation revealed five main issues that reduce the performance of GEP [1]. These are described in the following sections.

The Choice of the Best Environmental Parameter Settings
This is a problem shared among all EAs: deciding on the parameter settings with which an algorithm produces the best possible results. Defining an EA requires choosing its components, such as genetic operators, selection mechanisms for selecting parents, and initial populations. Each of these may have parameters, such as mutation probability or population size. The values of these parameters largely determine whether the algorithm will find a near-optimum solution and whether it will find one efficiently. Choosing the right parameters, however, is time-consuming, and considerable effort has gone into developing good heuristics for it. [3] Early attempts concentrated on finding, experimentally, parameter values that were good for a number of numeric test problems. Later, meta-algorithms were used to optimize the values of these parameters. Eiben et al. [3] distinguished two major forms of setting parameter values: parameter tuning (the common approach, which amounts to finding good values for the parameters before the run and then running the algorithm with these values, which remain fixed during the run) and parameter control (which amounts to starting a run with initial parameter values that are changed during the run). They also argue that any static set of parameters, with values fixed during a run, seems to be inappropriate, which makes parameter control the natural alternative.

The Use of Different Linking Functions
Given a set of functions to be used in evolution, one function should be used to link existing genes. This choice varies depending on the function set, the types of functions included in the sets, and the rules to be evolved.
Using a single linking function throughout the entire evolution process is neither appropriate nor advantageous to the system. However, attempting to use varied linking functions within one population causes the complexity of the problem to grow substantially, since the problem of determining how to mix these operators with the genes is as hard as the initial problem, as mentioned earlier in section (2.1).

The Problem of Gene Flattening
Another issue noticed in GEP is gene flattening in chromosomes. Flat genes are genes whose heads contain only terminal symbols; they may appear as a product of applying IS insertion (of the transposition operator) coupled with mutations that change functions into terminals. The problem arises because nothing prevents the operators from destroying the functionality of a gene by eliminating the functions from its head.
Restricting the operator from inserting the chosen sequence at the beginning of the head is not enough. In the worst case, the first symbol in the existing head might itself be a terminal, which destroys any hope of saving the gene through other operators. Repeated occurrence of this event can increase the rate of flat genes in the chromosome. Even when the first symbol in the head is not a terminal, such a process can reduce the efficiency of the gene by increasing the number of terminals in the head, thus producing poorly functioning genes that weaken the chromosomes in the population.

The Problem of Illegal Operations in Genes
While evaluating a gene, it is very likely that a function will encounter operands that give illegal results when evaluated, such as division by zero or the square root of a negative value. This usually terminates the evaluation process, which excludes the contribution of the gene, and the whole chromosome is assigned the worst fitness measure agreed upon. This causes the loss of significant chances for introducing fit individuals into the population: a chromosome is assigned a poor fitness value because of illegal operands in only one of its genes, although its other genes may have valuable contributions to offer.

Improving GEP using Biased Components
Some EC algorithms try to increase efficiency and performance of the evolutionary process by giving a higher rate of occurrence to some elements from the function or terminal set that makes up the contents of genes in the chromosome. This feature was employed in Multi Expression Programming. In such a procedure, certain components, like addition or multiplication operators, are usually assigned a higher chance of being introduced in the genes of the chromosome than other operators. The idea is about focusing on the terms that are more vital in the construction of a rule, and thus allowing evolution to adapt more rapidly towards forming desired rules or programs.

Suggested Solutions
In an attempt to improve the performance of GEP, new characteristics are introduced. The first is the Multi-population feature, which is used to ensure better exploitation of the properties possessed by the method. This feature is inspired by nature, where many environments contain multiple populations, as ecosystems, that evolve simultaneously and concurrently under certain resources or environmental circumstances. Some of these circumstances are shared among the evolving ecosystems, while others are local and vary from one population to another. Whether an aspect is shared or local usually depends on the environmental needs of each individual population. Another important consideration when choosing to localize or globalize an aspect of a population is the overall performance of the resulting system.
Introducing this new feature involves decomposing the existing large population into a number of smaller distinct entities, each having its own set of parameters, thus forming several diverse environments that evolve independently and simultaneously. In GEP some settings must be maintained globally across all populations, while others need to be differentiated locally to overcome certain malfunctioning phenomena. Useful issues that can be examined using this feature are:
1- Introducing various environments to enhance evolution: this is done by replacing one large population with uniform evolutionary features by multiple small ones with various environmental features.
2- Finding parameter sets: this helps to find the appropriate set of parameters for a system, instead of trying to find them by hand tuning.
3- Evaluating genetic dynamics: the probabilities of some operators are varied across a multipopulation collection while the others are fixed and made global to the whole environment. This is very useful in the study of dynamics.
4- Evaluating environmental settings: population size, number of generations, chromosomal length and number of genes can each be evaluated using multipopulation collections. This enables the study of the impact that these settings have on the behavior of the system.
This feature is used in the following section to address the first and second problems. As for the third and fourth, a monitoring process is added to detect the occurrence of flat genes or illegal operations in the population, which are then avoided using emergency mutations. As for component biasing, GEP can be improved by giving more weight to one or more solution components. The choice of which component to bias depends on the type of rule or program to be evolved.

Symbolic Regression
5.1 Problem Description
The symbolic regression problem can be stated as finding a function in a symbolic form that fits a given finite sample of data [9]. The advantage of symbolic regression over standard regression methods is that in symbolic regression, the search process works simultaneously on both the model specification problem and the problem of fitting coefficients. Symbolic regression would thus appear to be a particularly valuable tool for the analysis of experimental data where the specification of the strategic function used is often difficult, and may even vary over time. [2] The system is given a set of input and output pairs, and must determine the function that maps one onto the other. Symbolic regression tries to reconstruct a mathematical function using only a set of data samples. This data can be pairs of independent and dependent variables that are samples of a possibly unknown function. As an aspect of Data Mining, symbolic regression is inherently computationally intensive because of the lack of a model solution in general. [14] The problem, in its essence, is an optimization problem; a search is conducted, in the space of all possible expressions, for the individual that best fits the data. In his work, Freitas [8] showed how the requirements of data mining and knowledge discovery influence the design of EAs, in particular how individual representation, operators and fitness functions have to be adapted for extracting high-level knowledge from data. Data mining is more or less the same as symbolic regression, but the emphasis is not on a complete description of the data but on extracting salient nuggets of information from potentially large data sources (e.g. databases). [11] GP possesses certain advantages that make it suitable for application in data mining, such as a convenient structure for rule generation. Furthermore, it lends itself to process parallelism to improve computational efficiency. [10]
The object of the search is a symbolic description of a model, not just a set of coefficients in a pre-specified model. This contrasts sharply with other methods of regression, including feed-forward ANN, where a specific model is assumed and often only the complexity of this model can be varied. [15] Genetic programming and its variants are in principle capable of expressing functional forms, given a sufficiently expressive function set; they are capable of expressing a linear relationship or a non-linear relationship. With Genetic programming and its variants, the object of search is a composition of the input variables, coefficients and primitive functions such that the error of the function with respect to the desired output is minimized.

Fitness Measure
One important application of GEP is symbolic regression, where the goal is to find an expression that performs well for all fitness cases within a certain error of the correct value. Mathematically, this can be expressed by the equation:
$f_i = \sum_{j=1}^{C_t} (M - E)$ …(2)
where M is the range of selection, and E is the absolute error between the number generated by the ET and the target value, as follows:
$E = |C_{(i,j)} - T_j|$ …(3)
where $C_{(i,j)}$ is the value returned by the individual chromosome i for fitness case j and $T_j$ is the target value for fitness case j (for all j of the fitness cases). The precision for the absolute error is usually very small, for instance 0.01. For example, for a set of 10 fitness cases and M = 100, $f_{max} = 1000$ if all the values are within 0.01 of the correct value, as follows:
$f_i = f_{max} = C_t \cdot M$ …(4)
where $C_t$ is the total number of fitness cases. If, for all j, $|C_{(i,j)} - T_j|$ (the precision) is less than or equal to 0.01, then the precision is taken to be zero. So, the fitness measure $f_i$ of an individual program i is given by:
$f_i = \sum_{j=1}^{C_t} (M - |C_{(i,j)} - T_j|)$ …(5)
The advantage of this kind of fitness function is that the system can find the optimal solution for itself. [7]
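The fitness measure can be illustrated with a short Python sketch. It assumes that the fitness is the sum over all fitness cases of M minus the absolute error, which is consistent with f max = Ct * M when every error is zero; the function name and default values are illustrative.

```python
def fitness(outputs, targets, M=100.0, precision=0.01):
    """Sum over fitness cases of (M - absolute error), with small errors set to zero."""
    total = 0.0
    for c, t in zip(outputs, targets):
        error = abs(c - t)
        if error <= precision:
            error = 0.0  # within the precision: counted as an exact hit
        total += M - error
    return total
```

For 10 fitness cases that are all within 0.01 of their targets and M = 100, this returns 1000, the maximum fitness quoted in the text.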

Tests and Results
Experiments carried out in this section are implemented using the Symbolic Regression problem. Due to its simplicity and common use in most applications, it has almost become a benchmark problem for assessing systems that employ learning and training in evolution. As a standard benchmark problem it is very useful in making comparisons more practicable. Each test applies 100 runs of randomly generated populations to evaluate the success rates of the approach. In the following tests two equations are used to determine the efficiency of the improvements carried out, as indicated in the tables of comparisons:
$Y = a^4 + a^3 + a^2 + a$ …(6)
$Y = 3a^2 + 2a + 1$ …(7)
Fitness cases (training set) are chosen to be those used by all methods proposed so far, in order to facilitate comparisons. Training cases are given in Tables (1) and (2); parameter settings are given in Table (3).

Improvements Related to Parameter Setting
Applying the Multi-population feature enables the system to use different settings for each population and can therefore reduce the parameter-setting problem discussed earlier. With P populations, each of size S, and G generations, the test is done using 3 populations, with the settings in Table (4). Results are shown in Table (5).

Improvements Related to Linking Function
This is another case that can make use of the multi-population feature, in investigating the effect that linking functions have on fitness calculations.
First, different populations were introduced, each having its own local linking function. Results showed that the '*', '-', and '/' functions were not able to enhance the rate of successful runs; the rate went down for all functions except the '+' function.
Second, different linking functions were applied to link genes. With 3 genes, the proposal links the first and second genes with one linking function, and links the result with the third gene using another one. Results showed that this was also not helpful in increasing success rates.
These results are to be expected: the rules evolved in the tests rely heavily on addition, so any other linking function is not appropriate in this case. The function to be evolved is a summation of multiple terms. The Multi-population feature made it possible to study the application of various linking functions to the system, and to identify the population that gave the best results.

Improvement Related to Flat Genes
Flat genes are avoided by imposing some monitoring process on the application of the IS operator, so that, when the number of functions in the head is zero, an emergency mutation is forced after that IS operation to ensure the existence of a function in the head of that modified gene. Results are shown in Table (6).
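The monitoring step for flat genes can be sketched as follows. The text does not state which head position receives the emergency mutation; this Python illustration assumes the root (position 0), so that the restored function is guaranteed to be expressed. The function set and names are also assumptions.

```python
import random

FUNCTIONS = "+-*/"  # assumed function set for illustration


def repair_flat_gene(gene, head_len, rng=random):
    """Force an emergency mutation if the head of the gene contains no function."""
    head = gene[:head_len]
    if any(sym in FUNCTIONS for sym in head):
        return gene  # not flat: leave the gene untouched
    # Assumption of this sketch: mutate the root so the function is expressed.
    return rng.choice(FUNCTIONS) + gene[1:]
```

In a full implementation this check would run immediately after each IS transposition, which is where the text places the monitoring process.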

Improvement Related to Illegal operations in genes
The problem of illegal operations in genes is treated by adding a very simple mechanism to the fitness calculation, called emergency mutation. When an invalid operation is about to cause the termination of the fitness calculation, the offending function is simply mutated in place to any of the other remaining functions in the function set. Using this mechanism, the gene is saved from complete loss and can be presented again in the population with an appropriate fitness value. The result of applying this idea to GEP is shown in Table (7).
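The emergency mutation applied during evaluation can be sketched for a single function node. This Python illustration is restricted to binary arithmetic functions, with division by zero as the only illegal operation; the function name and the random order in which replacement functions are tried are assumptions.

```python
import random

BINARY = {
    "+": lambda x, y: x + y,
    "-": lambda x, y: x - y,
    "*": lambda x, y: x * y,
    "/": lambda x, y: x / y,
}


def apply_with_emergency_mutation(sym, x, y, rng=random):
    """Apply sym to (x, y); on an illegal operation, mutate sym to another function.

    Returns the symbol finally used and the value it produced, so the caller
    can write the mutated symbol back into the gene.
    """
    others = [s for s in BINARY if s != sym]
    for candidate in [sym] + rng.sample(others, len(others)):
        try:
            return candidate, BINARY[candidate](x, y)
        except ZeroDivisionError:
            continue  # illegal here too: try another function
    raise ArithmeticError("no function in the set is legal for these operands")
```

Returning the replacement symbol lets the evaluation continue and the gene keep its contribution to the fitness of the chromosome, instead of the whole chromosome receiving the worst fitness.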

Improvement Related to Biased Components
Biased GEP was tested by biasing different components and monitoring the effect of that biasing on the process of evolution and the rate of success; results are shown in Table (8). As Table (8) shows, in the first case biasing the multiplication operator influenced the rate of success considerably, mainly because the rule depends heavily on this function. In the second case, biasing the addition operator was more successful than biasing the others, as the evolved rule depends more on addition than on multiplication, subtraction or division.
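Component biasing can be sketched as weighted sampling of symbols whenever a head position is generated or mutated. The weights, symbol set, and function names in this Python illustration are hypothetical and are not the settings used in the tests.

```python
import random


def biased_symbol(weights, rng=random):
    """Draw one symbol with probability proportional to its weight."""
    symbols = list(weights)
    return rng.choices(symbols, weights=[weights[s] for s in symbols], k=1)[0]


def random_head(head_len, weights, rng=random):
    """Generate a gene head in which biased components occur more often."""
    return "".join(biased_symbol(weights, rng) for _ in range(head_len))
```

For instance, the hypothetical weights `{"*": 3, "+": 1, "-": 1, "/": 1, "a": 2}` make multiplication three times as likely as each of the other operators, which corresponds to biasing the multiplication operator.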

Conclusions
Many linear variants of Genetic programming are presented in the literature. Of these, GEP was investigated thoroughly, as it has the fewest limitations among these methods. Like any other method, GEP has some points of weakness that reduce its efficiency. These points were investigated and addressed with solutions that handled them in an efficient manner. The weak points were the choice of the best parameter settings for evolution, the use of different linking functions, the problem of gene flattening, the illegal operations that occur in the genes of the chromosome, and the lack of function biasing.
The enhancement procedures suggested were able to eliminate these problems and increase the efficiency of the method. They comprised the Multi-population feature, the Emergency Mutation feature, and the Component Biasing feature. Tests and results showed that success rates improved clearly towards higher values in all cases.