Parallel Newtonian Optimization without Hessian Approximation

ABSTRACT The purpose of this paper is to introduce parallel algorithms, based on the Newton method, for solving the non-linear unconstrained optimization problem on (MIMD) parallel computers. The Newton linear system is solved in parallel by the Gaussian elimination method rather than by forming the inverse Hessian matrix; this avoids the errors caused by evaluating the inverse matrix, exploits the available computing power, and reduces run time.


1-Introduction:
This paper is concerned with finding a local minimum x* of the unconstrained optimization problem

min f(x)   (1)

where x ∈ R^n and the objective f: R^n → R is twice continuously differentiable.
Methods for unconstrained optimization are generally iterative: the user typically provides an initial estimate of x*, and possibly some additional information, and a sequence of iterates {x_i} is then generated according to some algorithm. Usually the algorithm is such that the sequence of function values f_i is monotonically decreasing (f_i denotes f(x_i)). Because of the practical applications of the unconstrained optimization problem, a considerable amount of effort has been expended on the development of efficient sequential algorithms for its solution. Recent advances in parallel computing technology have made it possible to solve the optimization problem more effectively and with greater computing power, so it is natural that there is increased interest in designing parallel algorithms for various types of applications. In this paper we discuss a parallel numerical algorithm, namely the Newton algorithm, for the general unconstrained optimization problem using multiprocessor (MIMD) parallel machines, which consist of a number of fully programmable processors capable of simultaneously performing entirely different operations on different data, where each processor has its own local memory.
Concurrent computation can be done at different levels. Our focus is on (MIMD) computers, in which a number of processors communicate and cooperate to solve a common computational problem. Multiprocessor parallel computation involves three key ingredients: hardware, software (programming language, operating system and compiler) and parallel algorithms (see [8] or [11]).
A popular class of methods for solving problem (1) is the Quasi-Newton (QN) methods.
Assume that at the ith iteration an approximation point x_i and an n×n matrix H_i are available. The methods then proceed by generating a sequence of approximation points via the equation

x_{i+1} = x_i + α_i d_i,   (2)

where α_i is the step-size, which is calculated to satisfy certain line search conditions, and d_i is an n-dimensional real vector representing the search direction. For QN methods, d_i is defined by

d_i = -H_i g_i,   (3)

where g_i is the gradient vector of f(x) evaluated at the point x = x_i and H_i is an approximation of the inverse Hessian [∇²f(x_i)]^{-1}, which is corrected or updated from iteration to iteration. In general H_i is symmetric and positive definite. There are different choices of H_i; writing s_i = x_{i+1} - x_i and y_i = g_{i+1} - g_i, we list here some of the most popular forms (see [5], [6] or [4]).

H_{i+1} = H_i + (s_i - H_i y_i)(s_i - H_i y_i)^T / ((s_i - H_i y_i)^T y_i)   (4)

is called the rank-one formula; the other forms are BFGS

H_{i+1} = H_i + (1 + y_i^T H_i y_i / (s_i^T y_i)) (s_i s_i^T)/(s_i^T y_i) - (s_i y_i^T H_i + H_i y_i s_i^T)/(s_i^T y_i)   (5)

and DFP

H_{i+1} = H_i + (s_i s_i^T)/(s_i^T y_i) - (H_i y_i y_i^T H_i)/(y_i^T H_i y_i).   (6)

The QN methods mentioned above have two disadvantages: first, QN algorithms require a line search, which is expensive in practice; second, error accumulates in the approximate Hessian matrix H_i evaluated at each iteration.
To overcome these disadvantages we use a parallel Newton method that neither approximates the Hessian matrix nor requires a line search.

2-Newton Method
Newton method uses first and second derivatives. The idea behind this method is as follows: given a starting point x0, we construct a quadratic approximation to the objective function that matches the first and second derivative values at that point. We then minimize the approximate (quadratic) function instead of the original objective function, use the minimizer of the approximate function as the starting point of the next step, and repeat the procedure iteratively. If the objective function is quadratic, then the approximation is exact and the method yields the true minimizer in one step. If, on the other hand, the objective function is not quadratic, then the approximation provides only an estimate of the position of the true minimizer.
We can obtain a quadratic approximation to the given twice continuously differentiable function f(x) using the Taylor series expansion of f about the current point x_i, neglecting terms of order three and higher. For simplicity we let Δx_i = x - x_i, g_i = ∇f(x_i) and G_i = ∇²f(x_i), so that

f(x) ≈ f(x_i) + g_i^T Δx_i + (1/2) Δx_i^T G_i Δx_i.   (7)

Setting the gradient of this quadratic model to zero gives the linear system

G_i Δx_i = -g_i,   (8)

and the new point is

x_{i+1} = x_i + Δx_i.   (9)

It can then be proved that the sequence {x_i} converges and that the order of convergence is two (see [13]). These local convergence properties represent the ideal local behavior which other algorithms aim to emulate as far as possible (see [6]). In fact, superlinear convergence of any algorithm is obtained if and only if the step Δx_i is asymptotically equal to the step given by solving (8). This fundamental result, due to Dennis and Moré (1974), emphasizes the importance of the Newton step for local convergence.
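As an illustration of the Newton iteration described above, the following sketch (assuming NumPy; the test function is our own, not from the paper) solves the Newton equation directly by a linear solve rather than forming an inverse Hessian:

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton iteration: solve G_i * dx = -g_i (equation (8)) directly,
    without forming the inverse Hessian, then step x <- x + dx (9)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = np.linalg.solve(hess(x), -g)  # Gaussian elimination, no inverse
        x = x + dx
    return x

# Hypothetical test function: f(x) = cosh(x1) + cosh(x2 - 1), whose unique
# minimizer is (0, 1); its Hessian is positive definite everywhere.
shift = np.array([0.0, 1.0])
grad = lambda x: np.sinh(x - shift)
hess = lambda x: np.diag(np.cosh(x - shift))
x_star = newton_minimize(grad, hess, [1.0, 2.0])
```

Note that `np.linalg.solve` is exactly the "direct method" route advocated in this paper: it factorizes the Hessian rather than inverting it.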
Two quite different classes of methods for solving the linear system (8) are of interest: direct methods and iterative methods. In a direct method, the system is transformed to a system of simpler form, e.g. triangular or diagonal, which can be solved in an elementary way. The most important direct method is Gaussian elimination, and we will use it to solve the linear system (8).

3-Gaussian Elimination Method
A fundamental tool in Gaussian elimination is the set of elementary row operations, which can be performed on the system without changing its set of solutions. These operations are: 1-adding a multiple of the ith equation to the jth equation; 2-interchanging two equations.
The idea behind Gaussian elimination is to use such elementary operations to eliminate the unknowns in the system (8) in a systematic way, so that at the end an equivalent upper triangular system is produced, which is then solved by back substitution. If the pivot element at the current stage is non-zero, we can eliminate the corresponding unknown from all of the remaining equations; after m-1 steps (m ≤ n) of Gaussian elimination, the matrix G has been reduced to a form whose first m-1 columns are upper triangular. If a pivot is zero, the elimination cannot proceed in this form. But this case does not occur in our problem, as stated in (8): since G is a square, positive definite matrix, all of the pivots encountered during the elimination are non-zero. Secondly, a zero pivot in exact arithmetic will almost invariably be polluted by rounding errors in such a way that it equals some small non-zero number; unfortunately, there is no general rule for deciding when a pivot should be taken to be zero, and what tolerance to use in such a test depends on the context. This question, and the stability of the Gaussian method in the absence of these two difficulties, is treated in [Laxxus, 2000].

Gaussian Elimination Algorithm
Given a matrix
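The algorithm listing did not survive in this copy. As a reference point, a minimal sequential sketch (our own, assuming NumPy and non-zero pivots, as guaranteed here by positive definiteness) might look like:

```python
import numpy as np

def gauss_solve(G, b):
    """Solve G x = b by Gaussian elimination plus back substitution.
    No pivoting: all pivots are assumed non-zero, which holds when
    G is positive definite, as in the Newton system (8)."""
    A = np.hstack([np.asarray(G, float), np.asarray(b, float).reshape(-1, 1)])
    n = len(b)
    for k in range(n - 1):                      # forward elimination
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]               # multiplier for row i
            A[i, k:] -= m * A[k, k:]            # row_i <- row_i - m * row_k
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):              # back substitution
        x[i] = (A[i, -1] - A[i, i + 1:n] @ x[i + 1:]) / A[i, i]
    return x

G = np.array([[4.0, 1.0], [1.0, 3.0]])          # positive definite example
b = np.array([1.0, 2.0])
x = gauss_solve(G, b)
```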

4-Parallel algorithms for finding the Newton direction:
In order for the algorithm to be as efficient as possible, we must be able to exploit the memory hierarchy of the architecture we are running on. The memory hierarchy refers to the different kinds of memory inside a computer, ranging from very fast, expensive and therefore small memory at the top of the hierarchy down to slow, cheap and very large memory at the bottom (for more detail see [13]). Useful floating-point operations can only be done at the top of the hierarchy, in the registers. Since an entire large matrix cannot fit inside the registers, it must be moved up and down through the hierarchy; this takes time and, for maximum performance, should be optimized for each architecture.
The Basic Linear Algebra Subprograms (BLAS) are high-performance routines for performing basic vector and matrix operations. The BLAS are specifically machine-optimized to exploit the memory hierarchy. Level 1 BLAS performs vector-vector operations, Level 2 BLAS performs matrix-vector operations and Level 3 BLAS performs matrix-matrix operations (see [7]).
In order to speed up Gaussian elimination, we wish to use as high a level of BLAS as possible.
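One way to raise the BLAS level of Gaussian elimination is to phrase each elimination stage as a single rank-1 (Level 2 style) update instead of two nested scalar loops. A sketch, assuming NumPy (whose array operations dispatch to optimized kernels):

```python
import numpy as np

def eliminate_stage(A, k):
    """One stage of Gaussian elimination written as a rank-1 update:
    A[k+1:, k:] -= outer(m, A[k, k:]) with m = A[k+1:, k] / A[k, k].
    The outer-product update is one Level 2 BLAS-style operation."""
    m = A[k + 1:, k] / A[k, k]                  # multipliers for stage k
    A[k + 1:, k:] -= np.outer(m, A[k, k:])      # single rank-1 update
    return A

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
eliminate_stage(A, 0)   # zeros out column 0 below the pivot
```

Blocked variants that group several stages into a matrix-matrix product reach Level 3 BLAS, which is the goal stated above.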
There are different ways to parallelize the Gaussian elimination method. In this paper, we consider two types of parallel Gaussian elimination, the first is based on the rows of the matrix G and the second uses columns of G.

4-1: Parallel Gaussian Elimination Based on Rows (PGER)
The most popular form of Gaussian elimination is defined by subtracting multiples of a row of the matrix G from the other rows in order to reduce (8) to an upper triangular system, which is then solved directly by back substitution. We assume that the computing system consists of n (where n is the dimension of the problem) identical independent processors, and that the ith row of G is assigned to processor i. At the first stage, the first row of G is sent to all processors, and the elimination of the first elements of rows 2 to n can then be done in parallel in the processors.
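The row distribution can be simulated on a shared-memory machine; the following sketch (an illustration of the scheme, not the paper's implementation) uses one thread task per owning "processor" at each stage:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def pger_solve(G, b):
    """Simulated PGER: 'processor' i owns row i of the augmented matrix.
    At stage k the pivot row is broadcast and rows k+1..n-1 are
    eliminated concurrently, one task per owning processor."""
    A = np.hstack([np.asarray(G, float), np.asarray(b, float).reshape(-1, 1)])
    n = len(b)
    with ThreadPoolExecutor(max_workers=n) as pool:
        for k in range(n - 1):
            pivot = A[k].copy()                      # broadcast pivot row k
            def eliminate(i, k=k, pivot=pivot):      # task for processor i
                A[i] -= (A[i, k] / pivot[k]) * pivot
            list(pool.map(eliminate, range(k + 1, n)))
    x = np.zeros(n)                                  # sequential back subst.
    for i in range(n - 1, -1, -1):
        x[i] = (A[i, -1] - A[i, i + 1:n] @ x[i + 1:]) / A[i, i]
    return x

G = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = pger_solve(G, b)
```

Each task writes only its own row, so the concurrent updates within a stage are independent, mirroring the per-processor work in Figure (1).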

Figure (1) tasks of processors in PGER
This approach has two major drawbacks (see [9]): (i) there is a considerable communication of data between the processors at each stage; (ii) the number of active processors decreases by one at each stage.

4-2: Parallel Gaussian Elimination Based on Columns (PGEC)
Assume that the matrix G^(1) = G and the right-hand side are given, and that the computing system consists of n+1 identical, fully communicating processors, with the jth column C_j of the matrix G^(1) assigned to processor j and the right-hand side column to the remaining processor. At each stage the multipliers computed at the pivot column are applied to the other columns, and the process is repeated until the matrix G is transformed into an upper triangular matrix, which we denote G^(k) after k steps; back substitution is then used to find the vector Δx. The new point is found from equation (9). Again, in this algorithm the number of active processors decreases by one at each stage. Its main advantage is that data transfer between the processors is reduced: we need only transfer elements instead of whole vectors, which reduces the time required to complete the computations.
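A sequential sketch of the column-owned organization (details assumed where the original listing was lost; each inner loop body is the work one column owner would do independently):

```python
import numpy as np

def pgec_solve(G, b):
    """Column-based elimination in the spirit of PGEC: 'processor' j owns
    column j, and an extra processor owns the right-hand side. At stage k
    the owner of the pivot column computes the multipliers, which every
    other column owner applies to its own column."""
    A = np.asarray(G, float).copy()
    rhs = np.asarray(b, float).copy()
    n = len(rhs)
    for k in range(n - 1):
        m = A[k + 1:, k] / A[k, k]       # computed by owner of column k
        for j in range(k + 1, n):        # each column owner, independently
            A[k + 1:, j] -= m * A[k, j]
        rhs[k + 1:] -= m * rhs[k]        # done by the right-hand-side owner
        A[k + 1:, k] = 0.0               # pivot column is now eliminated
    x = np.zeros(n)                      # back substitution
    for i in range(n - 1, -1, -1):
        x[i] = (rhs[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

G = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = pgec_solve(G, b)
```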

Figure (2) tasks of processors in PGEC

Numerical Example:
We give an example from [1] to illustrate the tasks of the processors in the algorithms PGER and PGEC. We need 4 processors to solve the problem given in (10), as shown in figure (4).

5-Numerical experiments:
Both the amount of time required to complete the calculations and the subsequent round-off error depend on the number of floating-point arithmetic operations needed to solve the problem. The amount of time required to perform a multiplication or a division on a computer is approximately the same and is considerably greater than that required to perform an addition or subtraction (see [3]). The total number of arithmetic operations for Gaussian elimination with back substitution depends on the size n as follows: multiplications/divisions: n³/3 + n² - n/3; additions/subtractions: n³/3 + n²/2 - 5n/6. Hence, if all the operations are done on a sequential computer, the accumulated error will be large; on the other hand, distributing the operations over different processors reduces the amount of error.
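The standard counts for Gaussian elimination with back substitution, n³/3 + n² - n/3 multiplications/divisions and n³/3 + n²/2 - 5n/6 additions/subtractions (see e.g. [3]), can be checked by counting the operations stage by stage; a small sketch (assuming the usual convention that the entry below each pivot is set to zero without computation):

```python
def count_ops(n):
    """Count multiplications/divisions and additions/subtractions for
    Gaussian elimination with back substitution on an n x n system."""
    mul_div = add_sub = 0
    for k in range(n - 1):           # forward elimination, stage k
        rows = n - 1 - k             # rows still to be eliminated
        mul_div += rows              # one division per multiplier
        mul_div += rows * (n - k)    # multiplies (incl. right-hand side)
        add_sub += rows * (n - k)    # matching subtractions
    for i in range(n):               # back substitution, row i
        mul_div += (n - 1 - i) + 1   # multiplies plus the final division
        add_sub += n - 1 - i         # subtractions in the running sum
    return mul_div, add_sub
```

For example, count_ops(3) gives (17, 11), matching 27/3 + 9 - 1 = 17 and 27/3 + 9/2 - 15/6 = 11.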
As for the amount of time, if we assume t1 is the time required to complete the calculations on a sequential computer, we expect the time required to perform the same calculations with the proposed parallel algorithms to be considerably smaller. In the algorithms PGER and PGEC there is no need for a line search or for matrix inversion; all operations are done using ordinary arithmetic operations. Also, only two function evaluations are needed, since the Gaussian method gives the exact solution, as shown in the given example.

Conclusion:
In this paper we proposed two parallel algorithms for solving the unconstrained optimization problem. These methods can be extended to any non-homogeneous linear system with a positive definite matrix. As shown in the numerical example, they require neither matrix-vector multiplication nor matrix inversion.