A Binary Integer Programming model for computing DNA Sequence Alignment

DNA Sequence Alignment is an important problem in computational biology: it is used to compare genomes, to find genes, and to determine the evolutionary linkage of different biological sequences. Dynamic Programming is discussed and applied to solve this problem. This paper computes a DNA Sequence Alignment by first formulating a Binary Integer Programming model that computes the string sequence in the Edit Distance Problem, and then re-formulating this model so that it is suitable for computing the alignment. The model shows that the field of Operations Research can contribute efficiently to solving problems concerning the molecule of life. The suggested model is applied to an example of the Edit Distance Problem and then, after re-formulation, to an example of the DNA Sequence Alignment Problem.


Edit Distance Problem:
The Edit Distance (or Levenshtein distance [10]) between two strings is defined as the minimal number of edit operations that must be performed, character by character, to transform one string into the other. The edit operations are Replacement (R), Matching (M), Deletion (D) and Insertion (I), and a string over the alphabet (R, M, D and I) that describes such a transformation is called an edit transcript of the two strings. [6] For example, to transform the string abcde into the string bcfeg, we can delete a, then replace d by f, and finally insert g, which yields the string bcfeg. The same transformation can be written as the following alignment, with the edit transcript in the first row:

    D M M R M I
    a b c d e -
    - b c f e g

The Edit Distance Problem (EDP) is to compute the edit distance between two given strings along with an optimal edit transcript that describes the transformation.
The symbol "-" denotes a gap, which arises only from the deletion and insertion operations. The minimal number of edit operations is the number of columns in which the characters differ. The aim of finding the minimal edit distance between two strings is to search for the strings that are most similar to each other.
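The relation between an edit transcript, the two aligned rows and the operation count can be checked with a short sketch. The function name `apply_transcript` and its return format are assumptions made for illustration; they do not come from the paper.

```python
def apply_transcript(source, target, transcript):
    """Replay an edit transcript over the letters R, M, D, I.

    Returns the two aligned rows (gaps written as '-') and the number of
    columns in which the rows differ. Hypothetical helper for illustration.
    """
    top, bottom = [], []
    i = j = 0  # positions of the next unread character in source and target
    for op in transcript:
        if op in "MR":            # match or replacement consumes one character of each
            top.append(source[i])
            bottom.append(target[j])
            i += 1
            j += 1
        elif op == "D":           # deletion: character of source faces a gap
            top.append(source[i])
            bottom.append("-")
            i += 1
        elif op == "I":           # insertion: gap faces a character of target
            top.append("-")
            bottom.append(target[j])
            j += 1
        else:
            raise ValueError(f"unknown edit operation: {op!r}")
    if i != len(source) or j != len(target):
        raise ValueError("transcript does not consume both strings completely")
    differing = sum(1 for a, b in zip(top, bottom) if a != b)
    return "".join(top), "".join(bottom), differing


# The abcde -> bcfeg example: delete a, replace d by f, insert g.
print(apply_transcript("abcde", "bcfeg", "DMMRMI"))
# -> ('abcde-', '-bcfeg', 3)
```

The count of differing columns (3) equals the number of R, D and I letters in the transcript, as stated in the text.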

Edit Distance Formula:
To find the value of the edit distance between two strings we use the following general description of the problem. Let S and T be two strings of length m and n, respectively; the edit distance is then defined recursively over the prefixes of S and T (see [2], [7], [9] and [11]).

If we remove the last 3 columns and the last 2 rows from table (2.1) of the optimal edit distance, the remaining columns and rows represent an optimal edit distance for the remaining substrings. This property is called the Prefix property. Here, the minimal edit distance of the two substrings VINTN and WRIT is 4.

Recovering the string itself:
The Dynamic Programming Problem (DPP) described how to compute the edit distance between two strings; we now explain how the optimal string itself is recovered. The key idea is to retrace the optimal paths of the Dynamic Programming backwards, rediscovering the path of opt (where opt means the optimal edit distance), which at each step depends on three possibilities. By eqs. (3), (4), (5) and (6), the optimal paths of the strings above are recovered.

According to the recovered optimal paths, we obtain the final transformations that express the similarity of the strings VINTNER and WRITERS. The first path transforming VINTNER into WRITERS is better than the other two because it has the fewest gaps (two) and the shortest length (8). [13]

Binary Integer Programming model for Edit Distance:
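As a minimal sketch of the Dynamic Programming computation, the table below is filled with the standard Levenshtein recurrence under unit costs (each replacement, deletion and insertion costs 1, a match costs 0). The unit costs, the function name `edit_distance_table` and the variable names are assumptions; the paper's own equations are not reproduced here.

```python
def edit_distance_table(s, t):
    """Fill the (m+1) x (n+1) table of edit distances between prefixes.

    d[i][j] is the edit distance between the first i characters of s and
    the first j characters of t, assuming unit cost for R, D and I.
    """
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete the first i characters of s
    for j in range(n + 1):
        d[0][j] = j                      # insert the first j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # match or replacement
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match / replacement
            )
    return d


table = edit_distance_table("VINTNER", "WRITERS")
print(table[7][7])  # distance between the full strings -> 5
print(table[5][4])  # distance between the prefixes VINTN and WRIT -> 4
```

Because every entry of the table is itself an optimal distance between two prefixes, reading `table[5][4]` gives the distance of VINTN and WRIT without any further computation.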

Binary Integer Programming Problem:
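A Binary Integer Programming Problem (BIPP) is a kind of Integer Programming Problem (IPP) in which every decision variable takes the value 0 or 1. The goal of the problem is to find a binary vector that optimizes the objective function while satisfying the constraints.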
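To make the definition concrete, the sketch below solves a tiny binary program by enumerating every 0/1 vector. It assumes the common form "minimise c·x subject to A x <= b, x binary"; this form, the function name `solve_binary_program` and the numerical data are illustrative assumptions, and full enumeration is only practical for a handful of variables.

```python
from itertools import product


def solve_binary_program(c, A, b):
    """Minimise c.x subject to A x <= b with x in {0, 1}^n by enumeration.

    Returns (best_x, best_value), or (None, None) if no vector is feasible.
    """
    best_x, best_value = None, None
    for x in product((0, 1), repeat=len(c)):
        feasible = all(
            sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i
            for row, b_i in zip(A, b)
        )
        if not feasible:
            continue
        value = sum(c_j * x_j for c_j, x_j in zip(c, x))
        if best_value is None or value < best_value:
            best_x, best_value = x, value
    return best_x, best_value


# Illustrative data: minimise -3*x1 - 2*x2 - 4*x3
# subject to x1 + x2 + x3 <= 2 and x1 + x3 <= 1.
c = [-3, -2, -4]
A = [[1, 1, 1], [1, 0, 1]]
b = [2, 1]
print(solve_binary_program(c, A, b))  # -> ((0, 1, 1), -6)
```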

Formulating a BIP model for computing Edit Distance:
Let S and T be two strings of length m and n, respectively. We can formulate a binary integer programming model (BIPM) that measures the similarity of S and T by finding the minimum distance between them, which depends on the number of similar characters, such that:

Proposed algorithm to solve BIPM:
Step 1 (Starting): We choose the variable to be carried by the first node (N0).

This algorithm yields a tree of nodes containing the optimal solution of the Edit Distance Problem, which represents the minimum number of edit operations needed to transform the string S into the string T, and it finds at least one of the optimal transformation paths.
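The idea of one binary variable per pair of similar characters can be illustrated with a small stand-in model. It is not the paper's model or its branching algorithm: it is a generic formulation, assumed here for illustration, that maximises the number of matched character pairs subject to the condition that chosen pairs keep the same order in both strings, solved by plain enumeration. The names `match_pairs`, `compatible` and `max_matches_binary` are hypothetical.

```python
from itertools import product


def match_pairs(s, t):
    """All index pairs (i, j) with s[i] == t[j]; one binary variable per pair."""
    return [(i, j) for i in range(len(s)) for j in range(len(t)) if s[i] == t[j]]


def compatible(p, q):
    """Two matched pairs may both be selected only if they do not cross
    and do not reuse a character of either string."""
    (i1, j1), (i2, j2) = p, q
    return (i1 < i2 and j1 < j2) or (i1 > i2 and j1 > j2)


def max_matches_binary(s, t):
    """Enumerate all 0/1 assignments and keep the largest compatible set."""
    pairs = match_pairs(s, t)
    best = []
    for x in product((0, 1), repeat=len(pairs)):
        chosen = [p for p, bit in zip(pairs, x) if bit]
        ok = all(
            compatible(p, q)
            for a, p in enumerate(chosen)
            for q in chosen[a + 1:]
        )
        if ok and len(chosen) > len(best):
            best = chosen
    return len(best), best


# Zero-based positions of the matched characters I, T, E, R.
print(max_matches_binary("VINTNER", "WRITERS"))
# -> (4, [(1, 2), (3, 3), (5, 4), (6, 5)])
```

The enumeration grows exponentially with the number of matching pairs, so this sketch is only suitable for short strings such as the ones in the examples.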

Solving Example 2.4 according to BIPM:
We have the two strings S = VINTNER and T = WRITERS, with m = n = 7. By step 1 of the algorithm, the first branched node (N0) carries the variable $x_{S_7 T_7}$, and by steps 2 and 3 we continue branching the nodes, such that:

Figure (3.1)
Step 4 of the algorithm occurs at node (N11): although I is not similar to R, the node branches to only one node (N13).
Following the path of nodes N15-N14-N13-N12-N9-N8-N5-N4-N1-N0 gives the value of the objective function and two transformations of VINTNER into WRITERS (obtained by first fixing the similar characters in nodes N1, N4, N8 and N12). Following the path of nodes N13-N11-N9-N8-N5-N4-N1-N0 gives the value of the objective function and two further transformations of VINTNER into WRITERS (obtained by first fixing the similar characters in nodes N1, N4 and N8).

A sequence alignment inserts gaps in S and in T so as to line up each character in one sequence with either a character or a gap in the other sequence, which results in two sequences of equal length. An alignment between two input sequences S and T over the alphabet Σ = {A,C,G,T} expresses an equivalence relationship between the pair of sequences by generating two sequences S' and T' of equal length, obtained by inserting gaps into S and T.
An optimal alignment is one that minimizes the number of inserted gaps while simultaneously minimizing the number of replacement operations (e.g. a C aligned with a T).
Gaps are an important concept in biological applications because a run of gaps in a DNA sequence may represent a significant biological characteristic. Gaps usually incur a penalty on the potential alignment score between two sequences. If we remove the gaps from S' and T', we restore S and T.
To illustrate the idea of an alignment, consider the sequences S = TAAGAAC and T = TGAC. With a scoring function of one for a match and zero for gaps and replacements, an optimal alignment S', T' of these sequences has a score of four. Other alignments can share this maximal score, while sub-optimal alignments have lower scores (for instance, 3).
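The scoring rule can be sketched as a column-by-column sum. The two gapped rows used below are illustrative alignments chosen for this sketch, not necessarily the ones shown in the original figures, and the function name `alignment_score` is an assumption.

```python
def alignment_score(s_aligned, t_aligned, match=1, replacement=0, gap=0):
    """Score two gapped sequences of equal length, column by column."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(s_aligned, t_aligned):
        if a == "-" or b == "-":
            score += gap            # a character facing a gap
        elif a == b:
            score += match          # identical nucleotides
        else:
            score += replacement    # different nucleotides
    return score


# Assumed example alignments of S = TAAGAAC with T = TGAC.
print(alignment_score("TAAGAAC", "T--GA-C"))  # -> 4
print(alignment_score("TAAGAAC", "TGA---C"))  # -> 3

# Removing the gaps restores the original sequence T.
print("T--GA-C".replace("-", ""))             # -> TGAC
```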

Dot Plot Problem:
The dot plot is one of the earliest methods for comparing two DNA sequences. It plots the regions of similarity between them and can be drawn by hand.
A dot plot is created as a table by setting one DNA sequence on the vertical axis and the other on the horizontal axis; the dots mark the matches between nucleotides in the two sequences. [5]
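A dot plot is simple to build programmatically. In the sketch below the function names `dot_plot` and `render`, the text rendering with `*` and `.`, and the choice of which sequence goes on which axis are assumptions made for illustration.

```python
def dot_plot(vertical, horizontal):
    """Boolean table: entry [r][c] is True when vertical[r] == horizontal[c]."""
    return [[v == h for h in horizontal] for v in vertical]


def render(vertical, horizontal):
    """Text rendering with '*' for a dot and '.' for an empty cell."""
    grid = dot_plot(vertical, horizontal)
    lines = ["  " + " ".join(horizontal)]
    for v, row in zip(vertical, grid):
        lines.append(v + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("AGCATAGGA", "AGCTAGAGA"))
```

Runs of `*` along a diagonal correspond to regions where the two sequences agree; a shift from one diagonal to a neighbouring one corresponds to a gap.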

Table (3.1): dot plot of AGCATAGGA against AGCTAGAGA (header row: A G C T A G A G A).
In table (3.1), the sequence AGCATAGGA is matched against the sequence AGCTAGAGA. The regions of similarity appear as strings of diagonal dots in the dot plot, so we can easily compute the similarity by assigning the value (2) to each dot in the three diagonal strings and the value (-1) to each of the two gaps.

DNA Sequence Alignment Formula:
The DNA Sequence Alignment Problem seeks an alignment represented by an optimal path between the point $E(0, 0)$ and the point $E(m, n)$, and Dynamic Programming is used to solve it. Given two strings S and T of length m and n, respectively, our goal is to compute the optimal sequence alignment of S and T, where the values $E(i, j)$ are defined by the Dynamic Programming recurrence.

Example:
Table (4.2) gives the transformation of the DNA sequence AACTGGTACC into TTCACGGCA using Dynamic Programming.

We choose arbitrary edit distance calculations from table (2.1), which follow eq. (9) and depend on three possibilities. By eqs. (10), (11), (12) and (13), the optimal paths are then recovered.

Thus, the similarity of the two sequences is given by the recovered optimal paths. All the paths that transform the DNA sequence AACTGGTACC into TTCACGGCA are optimal because they have the same number of gaps (five) and the same length (12). See ([1] & [11]).
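The table filling and the backward recovery of one optimal path can be sketched with the standard global alignment recurrence (Needleman-Wunsch). The scoring parameters below are assumptions and are not taken from the paper's equations, so the recovered alignment need not coincide with the paths reported for table (4.2).

```python
def global_alignment(s, t, match=1, replacement=0, gap=0):
    """Return (score, s_aligned, t_aligned) for one optimal global alignment."""
    m, n = len(s), len(t)

    def pair_score(i, j):
        return match if s[i - 1] == t[j - 1] else replacement

    e = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        e[i][0] = i * gap
    for j in range(1, n + 1):
        e[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            e[i][j] = max(
                e[i - 1][j - 1] + pair_score(i, j),  # align s[i-1] with t[j-1]
                e[i - 1][j] + gap,                   # s[i-1] faces a gap
                e[i][j - 1] + gap,                   # t[j-1] faces a gap
            )

    # Retrace one optimal path backwards from (m, n) to (0, 0).
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and e[i][j] == e[i - 1][j - 1] + pair_score(i, j):
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and e[i][j] == e[i - 1][j] + gap:
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return e[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, s_row, t_row = global_alignment("AACTGGTACC", "TTCACGGCA", match=2, replacement=-1, gap=-1)
print(score)
print(s_row)
print(t_row)
```

With `match=0`, `replacement=-1` and `gap=-1` the score is the negative of the unit-cost edit distance, which ties this recurrence back to the Edit Distance Problem. When several paths are optimal, the retracing above returns only one of them; the order of the three tests decides which.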

Solving Example 4.4 according to BIPM:
Before computing the optimal sequence alignment of the DNA sequences AACTGGTACC and TTCACGGCA, we need to re-formulate the binary integer programming model in equations (7.1) and (7.4) so that it is suitable for finding that alignment, as follows:

Conclusions:
From the binary integer programming model proposed for computing DNA Sequence Alignment, some important conclusions were reached:
I. The model presented obtains good and effective results for the Edit Distance and DNA Sequence Alignment Problems, matching the results obtained by applying Dynamic Programming.
II. The binary model is solved by an exact algorithm that computes the problems and obtains most of the optimal paths, with the least number of gaps and the shortest length, within a very reasonable computational time when solved by hand, compared with Dynamic Programming.
III. The binary model proved efficient at solving a wide range of real-life problems and gives very good solutions; one of these problems is computing the string sequence and its applications in molecular biology.