
Gauss–Newton algorithm

The Gauss–Newton algorithm is used to solve non-linear least squares problems, which is equivalent to minimizing a sum of squared function values. It is an extension of Newton's method for finding a minimum of a non-linear function. Since a sum of squares must be nonnegative, the algorithm can be viewed as using Newton's method to iteratively approximate zeroes of the components of the sum, and thus minimizing the sum. In this sense, the algorithm is also an effective method for solving overdetermined systems of equations. It has the advantage that second derivatives, which can be challenging to compute, are not required.[1]

Fitting of a noisy curve by an asymmetrical peak model, using the Gauss–Newton algorithm with variable damping factor α.
Top: Raw data and model.
Bottom: Evolution of the normalised sum of the squares of the errors.

Non-linear least squares problems arise, for instance, in non-linear regression, where parameters in a model are sought such that the model is in good agreement with available observations.

The method is named after the mathematicians Carl Friedrich Gauss and Isaac Newton, and first appeared in Gauss' 1809 work Theoria motus corporum coelestium in sectionibus conicis solem ambientum.[2]

Description

Given $m$ functions $\mathbf{r} = (r_1, \ldots, r_m)$ (often called residuals) of $n$ variables $\boldsymbol\beta = (\beta_1, \ldots, \beta_n)$, with $m \geq n$, the Gauss–Newton algorithm iteratively finds the value of the variables that minimize the sum of squares[3]

$$S(\boldsymbol\beta) = \sum_{i=1}^m r_i(\boldsymbol\beta)^2.$$

Starting with an initial guess $\boldsymbol\beta^{(0)}$ for the minimum, the method proceeds by the iterations

$$\boldsymbol\beta^{(s+1)} = \boldsymbol\beta^{(s)} - \left(\mathbf{J_r}^\mathsf{T} \mathbf{J_r}\right)^{-1} \mathbf{J_r}^\mathsf{T} \mathbf{r}\!\left(\boldsymbol\beta^{(s)}\right),$$

where, if r and β are column vectors, the entries of the Jacobian matrix are

$$\left(\mathbf{J_r}\right)_{ij} = \frac{\partial r_i\!\left(\boldsymbol\beta^{(s)}\right)}{\partial \beta_j},$$

and the symbol $^\mathsf{T}$ denotes the matrix transpose.

At each iteration, the update $\Delta = \boldsymbol\beta^{(s+1)} - \boldsymbol\beta^{(s)}$ can be found by rearranging the previous equation in the following two steps:

  • $\Delta = -\left(\mathbf{J_r}^\mathsf{T} \mathbf{J_r}\right)^{-1} \mathbf{J_r}^\mathsf{T} \mathbf{r}\!\left(\boldsymbol\beta^{(s)}\right)$
  • $\mathbf{J_r}^\mathsf{T} \mathbf{J_r}\, \Delta = -\mathbf{J_r}^\mathsf{T} \mathbf{r}\!\left(\boldsymbol\beta^{(s)}\right)$

With the substitutions $A = \mathbf{J_r}^\mathsf{T} \mathbf{J_r}$, $\mathbf{b} = -\mathbf{J_r}^\mathsf{T} \mathbf{r}\!\left(\boldsymbol\beta^{(s)}\right)$, and $\mathbf{x} = \Delta$, this turns into the conventional matrix equation of the form $A\mathbf{x} = \mathbf{b}$, which can then be solved by a variety of methods (see Notes).
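
As a concrete illustration of the two-step update above, the following minimal Python/NumPy sketch forms and solves the normal equations at each iteration. The function and parameter names (gauss_newton, residuals, jacobian, beta0, n_iter) are illustrative choices, not part of any particular library.

```python
import numpy as np

def gauss_newton(residuals, jacobian, beta0, n_iter=10):
    """Plain Gauss-Newton: repeatedly solve the normal equations
    (J^T J) delta = -J^T r and update beta <- beta + delta.
    `residuals(beta)` returns the vector r; `jacobian(beta)` returns J_r."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        r = residuals(beta)
        J = jacobian(beta)
        A = J.T @ J          # A = J_r^T J_r
        b = -J.T @ r         # b = -J_r^T r(beta)
        beta = beta + np.linalg.solve(A, b)
    return beta
```

In practice the linear solve is usually done via a QR or Cholesky factorization rather than by forming and inverting $A$ explicitly (see Notes).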

If m = n, the iteration simplifies to

$$\boldsymbol\beta^{(s+1)} = \boldsymbol\beta^{(s)} - \mathbf{J_r}^{-1} \mathbf{r}\!\left(\boldsymbol\beta^{(s)}\right),$$

which is a direct generalization of Newton's method in one dimension.

In data fitting, where the goal is to find the parameters $\boldsymbol\beta$ such that a given model function $f(x, \boldsymbol\beta)$ best fits some data points $(x_i, y_i)$, the functions $r_i$ are the residuals:

$$r_i(\boldsymbol\beta) = y_i - f\!\left(x_i, \boldsymbol\beta\right).$$

Then, the Gauss–Newton method can be expressed in terms of the Jacobian $\mathbf{J_f} = -\mathbf{J_r}$ of the function $f$ as

$$\boldsymbol\beta^{(s+1)} = \boldsymbol\beta^{(s)} + \left(\mathbf{J_f}^\mathsf{T} \mathbf{J_f}\right)^{-1} \mathbf{J_f}^\mathsf{T} \mathbf{r}\!\left(\boldsymbol\beta^{(s)}\right).$$

Note that $\left(\mathbf{J_f}^\mathsf{T} \mathbf{J_f}\right)^{-1} \mathbf{J_f}^\mathsf{T}$ is the left pseudoinverse of $\mathbf{J_f}$.
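
Assuming the same data-fitting setup, the update can be computed directly with a pseudoinverse, as in the rough sketch below. Here numpy.linalg.pinv returns the Moore–Penrose pseudoinverse, which coincides with the left pseudoinverse above when $\mathbf{J_f}$ has full column rank; the names gauss_newton_fit and jac_f are hypothetical.

```python
import numpy as np

def gauss_newton_fit(f, jac_f, x, y, beta0, n_iter=10):
    """Data-fitting form of Gauss-Newton: r = y - f(x, beta) and
    beta <- beta + pinv(J_f) @ r, with J_f the Jacobian of the model f."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        r = y - f(x, beta)       # residual vector
        Jf = jac_f(x, beta)      # m-by-n Jacobian of the model
        beta = beta + np.linalg.pinv(Jf) @ r
    return beta
```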

Notes

The assumption m ≥ n in the algorithm statement is necessary, as otherwise the matrix $\mathbf{J_r}^\mathsf{T} \mathbf{J_r}$ is not invertible and the normal equations cannot be solved (at least uniquely).

The Gauss–Newton algorithm can be derived by linearly approximating the vector of functions ri. Using Taylor's theorem, we can write at every iteration:

$$\mathbf{r}(\boldsymbol\beta) \approx \mathbf{r}\!\left(\boldsymbol\beta^{(s)}\right) + \mathbf{J_r}\!\left(\boldsymbol\beta^{(s)}\right) \Delta,$$

with $\Delta = \boldsymbol\beta - \boldsymbol\beta^{(s)}$. The task of finding $\Delta$ minimizing the sum of squares of the right-hand side, i.e.,

$$\min_{\Delta} \left\| \mathbf{r}\!\left(\boldsymbol\beta^{(s)}\right) + \mathbf{J_r}\!\left(\boldsymbol\beta^{(s)}\right) \Delta \right\|_2^2,$$

is a linear least-squares problem, which can be solved explicitly, yielding the normal equations in the algorithm.

The normal equations are n simultaneous linear equations in the unknown increments $\Delta$. They may be solved in one step, using Cholesky decomposition, or, better, the QR factorization of $\mathbf{J_r}$. For large systems, an iterative method, such as the conjugate gradient method, may be more efficient. If there is a linear dependence between columns of $\mathbf{J_r}$, the iterations will fail, as $\mathbf{J_r}^\mathsf{T} \mathbf{J_r}$ becomes singular.
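
As a sketch of the QR route mentioned above, a single increment can be computed by factoring $\mathbf{J_r}$ instead of forming $\mathbf{J_r}^\mathsf{T} \mathbf{J_r}$, which tends to be better conditioned. The helper name gn_step_qr is hypothetical.

```python
import numpy as np

def gn_step_qr(J, r):
    """One Gauss-Newton increment from the (reduced) QR factorization of J:
    solves min_delta || r + J delta ||_2 via R delta = -Q^T r."""
    Q, R = np.linalg.qr(J)               # J = Q R, Q has orthonormal columns
    return np.linalg.solve(R, -Q.T @ r)

# np.linalg.lstsq(J, -r, rcond=None)[0] is an equivalent library one-liner.
```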

When $\mathbf{r}$ is complex, $\mathbf{r}: \mathbb{C}^n \to \mathbb{C}$, the conjugate form should be used: $\left(\overline{\mathbf{J_r}}^\mathsf{T} \mathbf{J_r}\right)^{-1} \overline{\mathbf{J_r}}^\mathsf{T}$.

Example

 
Calculated curve obtained with $\hat\beta_1 = 0.362$ and $\hat\beta_2 = 0.556$ (in blue) versus the observed data (in red)

In this example, the Gauss–Newton algorithm will be used to fit a model to some data by minimizing the sum of squares of errors between the data and model's predictions.

In a biology experiment studying the relation between substrate concentration [S] and reaction rate in an enzyme-mediated reaction, the data in the following table were obtained.

i 1 2 3 4 5 6 7
[S] 0.038 0.194 0.425 0.626 1.253 2.500 3.740
Rate 0.050 0.127 0.094 0.2122 0.2729 0.2665 0.3317

It is desired to find a curve (model function) of the form

$$\text{rate} = \frac{V_\text{max} \cdot [S]}{K_M + [S]}$$

that best fits the data in the least-squares sense, with the parameters $V_\text{max}$ and $K_M$ to be determined.

Denote by $x_i$ and $y_i$ the values of [S] and rate respectively, with $i = 1, \dots, 7$. Let $\beta_1 = V_\text{max}$ and $\beta_2 = K_M$. We will find $\beta_1$ and $\beta_2$ such that the sum of squares of the residuals

$$r_i = y_i - \frac{\beta_1 x_i}{\beta_2 + x_i}, \quad i = 1, \dots, 7,$$

is minimized.

The Jacobian $\mathbf{J_r}$ of the vector of residuals $r_i$ with respect to the unknowns $\beta_j$ is a $7 \times 2$ matrix with the $i$-th row having the entries

$$\frac{\partial r_i}{\partial \beta_1} = -\frac{x_i}{\beta_2 + x_i}, \qquad \frac{\partial r_i}{\partial \beta_2} = \frac{\beta_1 x_i}{\left(\beta_2 + x_i\right)^2}.$$

Starting with the initial estimates of $\beta_1 = 0.9$ and $\beta_2 = 0.2$, after five iterations of the Gauss–Newton algorithm, the optimal values $\hat\beta_1 = 0.362$ and $\hat\beta_2 = 0.556$ are obtained. The sum of squares of residuals decreased from the initial value of 1.445 to 0.00784 after the fifth iteration. The plot in the figure on the right shows the curve determined by the model for the optimal parameters together with the observed data.
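
For readers who want to replay this example, a short Python/NumPy script along the following lines reproduces the iteration; the data, model, Jacobian entries, and initial estimates are taken from the text above, while the loop structure and variable names are illustrative.

```python
import numpy as np

# Data from the table above: substrate concentration [S] and reaction rate.
x = np.array([0.038, 0.194, 0.425, 0.626, 1.253, 2.500, 3.740])
y = np.array([0.050, 0.127, 0.094, 0.2122, 0.2729, 0.2665, 0.3317])

def residuals(beta):
    b1, b2 = beta
    return y - b1 * x / (b2 + x)

def jacobian(beta):
    b1, b2 = beta
    d1 = -x / (b2 + x)                  # dr_i / dbeta_1
    d2 = b1 * x / (b2 + x) ** 2         # dr_i / dbeta_2
    return np.column_stack((d1, d2))

beta = np.array([0.9, 0.2])             # initial estimates from the text
for _ in range(5):
    r = residuals(beta)
    J = jacobian(beta)
    beta += np.linalg.solve(J.T @ J, -J.T @ r)

print(beta)                              # should approach roughly (0.362, 0.556)
print(np.sum(residuals(beta) ** 2))      # should be near 0.00784
```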

Convergence properties

The Gauss–Newton iteration is guaranteed to converge toward a local minimum point $\hat\beta$ under 4 conditions:[4] the functions $r_1, \ldots, r_m$ are twice continuously differentiable in an open convex set $D \ni \hat\beta$, the Jacobian $\mathbf{J_r}(\hat\beta)$ is of full column rank, the initial iterate $\beta^{(0)}$ is near $\hat\beta$, and the local minimum value $S(\hat\beta)$ is small. The convergence is quadratic if $S(\hat\beta) = 0$.

It can be shown[5] that the increment Δ is a descent direction for S, and, if the algorithm converges, then the limit is a stationary point of S. For a large minimum value $S(\hat\beta)$, however, convergence is not guaranteed, not even local convergence as in Newton's method, or convergence under the usual Wolfe conditions.[6]

The rate of convergence of the Gauss–Newton algorithm can approach quadratic.[7] The algorithm may converge slowly or not at all if the initial guess is far from the minimum or the matrix $\mathbf{J_r}^\mathsf{T} \mathbf{J_r}$ is ill-conditioned. For example, consider the problem with $m = 2$ equations and $n = 1$ variable, given by

$$\begin{aligned} r_1(\beta) &= \beta + 1, \\ r_2(\beta) &= \lambda \beta^2 + \beta - 1. \end{aligned}$$

The optimum is at $\beta = 0$. (Actually the optimum is at $\beta = -1$ for $\lambda = 2$, because $S(0) = 1^2 + (-1)^2 = 2$, but $S(-1) = 0$.) If $\lambda = 0$, then the problem is in fact linear and the method finds the optimum in one iteration. If |λ| < 1, then the method converges linearly and the error decreases asymptotically with a factor |λ| at every iteration. However, if |λ| > 1, then the method does not even converge locally.[8]
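
This behaviour can be observed numerically with a small script such as the one below; the starting point 0.5 and the iteration count are arbitrary choices for illustration.

```python
import numpy as np

def gn_scalar_example(lam, beta0=0.5, n_iter=8):
    """Gauss-Newton on r1 = beta + 1, r2 = lam*beta^2 + beta - 1 (m = 2, n = 1).
    Near beta = 0 the error should shrink by roughly a factor |lam| per step
    when |lam| < 1, and grow when |lam| > 1."""
    beta = beta0
    for k in range(n_iter):
        r = np.array([beta + 1.0, lam * beta**2 + beta - 1.0])
        J = np.array([[1.0], [2.0 * lam * beta + 1.0]])   # 2x1 Jacobian
        delta = np.linalg.solve(J.T @ J, -J.T @ r)
        beta = beta + delta[0]
        print(k, beta)

gn_scalar_example(0.5)    # expected: linear convergence toward 0, factor ~0.5
gn_scalar_example(1.5)    # expected: no local convergence
```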

Solving overdetermined systems of equations

The Gauss–Newton iteration

$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - J\!\left(\mathbf{x}^{(k)}\right)^\dagger \mathbf{f}\!\left(\mathbf{x}^{(k)}\right), \quad k = 0, 1, \ldots$$

is an effective method for solving overdetermined systems of equations in the form of $\mathbf{f}(\mathbf{x}) = \mathbf{0}$ with

$$\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{bmatrix}$$

and $m > n$, where $J(\mathbf{x})^\dagger$ is the Moore–Penrose inverse (also known as the pseudoinverse) of the Jacobian matrix $J(\mathbf{x})$ of $\mathbf{f}(\mathbf{x})$. It can be considered an extension of Newton's method and enjoys the same local quadratic convergence[4] toward isolated regular solutions.

If the solution doesn't exist but the initial iterate $\mathbf{x}^{(0)}$ is near a point $\hat{\mathbf{x}} = (\hat{x}_1, \ldots, \hat{x}_n)$ at which the sum of squares $\sum_{i=1}^m f_i(x_1, \ldots, x_n)^2 \equiv \|\mathbf{f}(\mathbf{x})\|_2^2$ reaches a small local minimum, the Gauss–Newton iteration linearly converges to $\hat{\mathbf{x}}$. The point $\hat{\mathbf{x}}$ is often called a least-squares solution of the overdetermined system.
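
A sketch of this pseudoinverse-based iteration is given below. The three-equation, two-unknown system (a circle, a parabola, and a horizontal line that nearly intersect) is a made-up example, chosen only to show convergence to a least-squares point of an inconsistent system.

```python
import numpy as np

def gn_overdetermined(f, jac, x0, n_iter=20):
    """Gauss-Newton iteration x <- x - pinv(J(x)) @ f(x) for an
    overdetermined system f(x) = 0 with m > n equations."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - np.linalg.pinv(jac(x)) @ f(x)
    return x

# Hypothetical system: three equations, two unknowns.
f = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0,   # unit circle
                        x[1] - x[0]**2,            # parabola
                        x[1] - 0.6])               # horizontal line
jac = lambda x: np.array([[2*x[0], 2*x[1]],
                          [-2*x[0], 1.0],
                          [0.0, 1.0]])
print(gn_overdetermined(f, jac, [0.8, 0.7]))   # approaches a least-squares point
```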

Derivation from Newton's method

In what follows, the Gauss–Newton algorithm will be derived from Newton's method for function optimization via an approximation. As a consequence, the rate of convergence of the Gauss–Newton algorithm can be quadratic under certain regularity conditions. In general (under weaker conditions), the convergence rate is linear.[9]

The recurrence relation for Newton's method for minimizing a function S of parameters $\boldsymbol\beta$ is

$$\boldsymbol\beta^{(s+1)} = \boldsymbol\beta^{(s)} - \mathbf{H}^{-1} \mathbf{g},$$

where g denotes the gradient vector of S, and H denotes the Hessian matrix of S.

Since $S = \sum_{i=1}^m r_i^2$, the gradient is given by

$$g_j = 2 \sum_{i=1}^m r_i \frac{\partial r_i}{\partial \beta_j}.$$

Elements of the Hessian are calculated by differentiating the gradient elements, $g_j$, with respect to $\beta_k$:

$$H_{jk} = 2 \sum_{i=1}^m \left( \frac{\partial r_i}{\partial \beta_j} \frac{\partial r_i}{\partial \beta_k} + r_i \frac{\partial^2 r_i}{\partial \beta_j \partial \beta_k} \right).$$

The Gauss–Newton method is obtained by ignoring the second-order derivative terms (the second term in this expression). That is, the Hessian is approximated by

$$H_{jk} \approx 2 \sum_{i=1}^m J_{ij} J_{ik},$$

where $J_{ij} = \partial r_i / \partial \beta_j$ are entries of the Jacobian $\mathbf{J_r}$. Note that when the exact Hessian is evaluated near an exact fit we have near-zero $r_i$, so the second term becomes near-zero as well, which justifies the approximation. The gradient and the approximate Hessian can be written in matrix notation as

$$\mathbf{g} = 2 \mathbf{J_r}^\mathsf{T} \mathbf{r}, \qquad \mathbf{H} \approx 2 \mathbf{J_r}^\mathsf{T} \mathbf{J_r}.$$

These expressions are substituted into the recurrence relation above to obtain the operational equations

$$\boldsymbol\beta^{(s+1)} = \boldsymbol\beta^{(s)} + \Delta, \qquad \Delta = -\left(\mathbf{J_r}^\mathsf{T} \mathbf{J_r}\right)^{-1} \mathbf{J_r}^\mathsf{T} \mathbf{r}.$$

Convergence of the Gauss–Newton method is not guaranteed in all instances. The approximation

$$\left| r_i \frac{\partial^2 r_i}{\partial \beta_j \partial \beta_k} \right| \ll \left| \frac{\partial r_i}{\partial \beta_j} \frac{\partial r_i}{\partial \beta_k} \right|$$

that needs to hold to be able to ignore the second-order derivative terms may be valid in two cases, for which convergence is to be expected:[10]

  1. The function values $r_i$ are small in magnitude, at least around the minimum.
  2. The functions are only "mildly" nonlinear, so that $\partial^2 r_i / \partial \beta_j \partial \beta_k$ is relatively small in magnitude.
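
The size of the dropped term can be checked numerically on a toy problem. In the sketch below, the single residual $r(\boldsymbol\beta) = \beta_1 \beta_2 - 1$ is an arbitrary illustrative choice; the gap between the exact Hessian and the Gauss–Newton approximation is proportional to the residual and shrinks near an exact fit.

```python
import numpy as np

def hessians(beta):
    """Exact Hessian of S = r^2 versus the Gauss-Newton approximation 2 J^T J
    for the single residual r(b) = b1*b2 - 1."""
    b1, b2 = beta
    r = b1 * b2 - 1.0
    J = np.array([[b2, b1]])                       # first derivatives of r
    H_gn = 2.0 * J.T @ J                           # Gauss-Newton term
    d2r = np.array([[0.0, 1.0], [1.0, 0.0]])       # second derivatives of r
    H_exact = H_gn + 2.0 * r * d2r
    return H_gn, H_exact

for beta in ([2.0, 2.0], [1.0, 1.001]):            # far from vs. near a zero residual
    H_gn, H_exact = hessians(beta)
    print(beta, np.abs(H_exact - H_gn).max())      # gap shrinks with the residual
```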

Improved versions

With the Gauss–Newton method the sum of squares of the residuals S may not decrease at every iteration. However, since Δ is a descent direction, unless $S\!\left(\boldsymbol\beta^{(s)}\right)$ is a stationary point, it holds that $S\!\left(\boldsymbol\beta^{(s)} + \alpha\Delta\right) < S\!\left(\boldsymbol\beta^{(s)}\right)$ for all sufficiently small $\alpha > 0$. Thus, if divergence occurs, one solution is to employ a fraction $\alpha$ of the increment vector Δ in the updating formula:

$$\boldsymbol\beta^{(s+1)} = \boldsymbol\beta^{(s)} + \alpha\Delta.$$

In other words, the increment vector is too long, but it still points "downhill", so going just a part of the way will decrease the objective function S. An optimal value for $\alpha$ can be found by using a line search algorithm; that is, the magnitude of $\alpha$ is determined by finding the value that minimizes S, usually using a direct search method in the interval $0 < \alpha < 1$ or a backtracking line search such as the Armijo line search. Typically, $\alpha$ should be chosen such that it satisfies the Wolfe conditions or the Goldstein conditions.[11]
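
A damped Gauss–Newton iteration with a simple backtracking (Armijo-type) line search might look as follows. The sufficient-decrease constant 1e-4, the halving factor, and the function names are conventional but arbitrary illustrative choices.

```python
import numpy as np

def damped_gauss_newton(residuals, jacobian, beta0, n_iter=50, c=1e-4):
    """Gauss-Newton with backtracking: halve alpha until S decreases enough."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        r = residuals(beta)
        J = jacobian(beta)
        delta = np.linalg.solve(J.T @ J, -J.T @ r)
        g = 2.0 * J.T @ r                  # gradient of S
        S = r @ r
        alpha = 1.0
        while True:                        # Armijo sufficient-decrease test
            r_new = residuals(beta + alpha * delta)
            if r_new @ r_new <= S + c * alpha * (g @ delta) or alpha < 1e-12:
                break
            alpha *= 0.5
        beta = beta + alpha * delta
    return beta
```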

In cases where the direction of the shift vector is such that the optimal fraction α is close to zero, an alternative method for handling divergence is the use of the Levenberg–Marquardt algorithm, a trust region method.[3] The normal equations are modified in such a way that the increment vector is rotated towards the direction of steepest descent,

$$\left(\mathbf{J}^\mathsf{T}\mathbf{J} + \lambda D\right)\Delta = -\mathbf{J}^\mathsf{T}\mathbf{r},$$

where D is a positive diagonal matrix. Note that when D is the identity matrix I and $\lambda \to \infty$, then $\lambda\Delta = \lambda\left(\mathbf{J}^\mathsf{T}\mathbf{J} + \lambda\mathbf{I}\right)^{-1}\left(-\mathbf{J}^\mathsf{T}\mathbf{r}\right) = \left(\mathbf{I} - \mathbf{J}^\mathsf{T}\mathbf{J}/\lambda + \cdots\right)\left(-\mathbf{J}^\mathsf{T}\mathbf{r}\right) \to -\mathbf{J}^\mathsf{T}\mathbf{r}$; therefore the direction of Δ approaches the direction of the negative gradient $-\mathbf{J}^\mathsf{T}\mathbf{r}$.

The so-called Marquardt parameter $\lambda$ may also be optimized by a line search, but this is inefficient, as the shift vector must be recalculated every time $\lambda$ is changed. A more efficient strategy is this: when divergence occurs, increase the Marquardt parameter until there is a decrease in S. Then retain the value from one iteration to the next, but decrease it if possible until a cut-off value is reached, at which point the Marquardt parameter can be set to zero; the minimization of S then becomes a standard Gauss–Newton minimization.
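
One common (though not the only) way to implement this adaptive strategy is sketched below, using $D = \operatorname{diag}(\mathbf{J}^\mathsf{T}\mathbf{J})$ as the scaling matrix and multiplying or dividing the Marquardt parameter by fixed factors; the specific factors and floor value are illustrative and differ in detail from the schedule described above.

```python
import numpy as np

def levenberg_marquardt(residuals, jacobian, beta0, n_iter=50, lam=1e-3):
    """Levenberg-Marquardt sketch: solve (J^T J + lam*D) delta = -J^T r,
    raising lam when a step fails to reduce S and lowering it when it succeeds."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        r = residuals(beta)
        J = jacobian(beta)
        A = J.T @ J
        g = J.T @ r
        D = np.diag(np.diag(A))            # Marquardt's diagonal scaling
        S = r @ r
        while True:
            delta = np.linalg.solve(A + lam * D, -g)
            r_new = residuals(beta + delta)
            if r_new @ r_new < S:          # accept step, relax the damping
                beta = beta + delta
                lam = max(lam * 0.1, 1e-12)
                break
            lam *= 10.0                    # reject step, damp more strongly
            if lam > 1e12:                 # give up on this iteration
                break
    return beta
```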

Large-scale optimization

For large-scale optimization, the Gauss–Newton method is of special interest because it is often (though certainly not always) true that the matrix $\mathbf{J_r}$ is more sparse than the approximate Hessian $\mathbf{J_r}^\mathsf{T} \mathbf{J_r}$. In such cases, the step calculation itself will typically need to be done with an approximate iterative method appropriate for large and sparse problems, such as the conjugate gradient method.

In order to make this kind of approach work, one needs at least an efficient method for computing the product

$$\mathbf{J_r}^\mathsf{T} \mathbf{J_r} \mathbf{p}$$

for some vector p. With sparse matrix storage, it is in general practical to store the rows of $\mathbf{J_r}$ in a compressed form (e.g., without zero entries), making a direct computation of the above product tricky due to the transposition. However, if one defines $\mathbf{c}_i$ as row i of the matrix $\mathbf{J_r}$, the following simple relation holds:

$$\mathbf{J_r}^\mathsf{T} \mathbf{J_r} \mathbf{p} = \sum_i \mathbf{c}_i \left(\mathbf{c}_i \cdot \mathbf{p}\right),$$

so that every row contributes additively and independently to the product. In addition to respecting a practical sparse storage structure, this expression is well suited for parallel computations. Note that every row $\mathbf{c}_i$ is the gradient of the corresponding residual $r_i$; with this in mind, the formula above emphasizes the fact that residuals contribute to the problem independently of each other.
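
The row-wise accumulation can be coded directly against a compressed row representation, as in this sketch; the (col_indices, values) storage format and the small example Jacobian are hypothetical stand-ins for whatever sparse structure an application actually uses.

```python
import numpy as np

def jtjp(rows, p, n):
    """Compute J_r^T J_r p from compressed rows of J_r without forming J_r^T J_r.
    `rows` is a list of (col_indices, values) pairs, one per residual;
    each row c_i contributes c_i * (c_i . p) independently."""
    out = np.zeros(n)
    for idx, vals in rows:
        dot = vals @ p[idx]          # c_i . p using only the stored entries
        out[idx] += vals * dot       # accumulate c_i * (c_i . p)
    return out

# Hypothetical 3x3 Jacobian stored row by row without its zero entries.
rows = [
    (np.array([0, 2]), np.array([1.0, -2.0])),
    (np.array([1]),    np.array([3.0])),
    (np.array([0, 1]), np.array([0.5, 1.5])),
]
print(jtjp(rows, np.array([1.0, 2.0, 3.0]), n=3))
```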

Related algorithms

In a quasi-Newton method, such as that due to Davidon, Fletcher and Powell or Broyden–Fletcher–Goldfarb–Shanno (BFGS method), an estimate of the full Hessian $\partial^2 S / \partial \beta_j \partial \beta_k$ is built up numerically using first derivatives $\partial r_i / \partial \beta_j$ only, so that after n refinement cycles the method closely approximates Newton's method in performance. Note that quasi-Newton methods can minimize general real-valued functions, whereas Gauss–Newton, Levenberg–Marquardt, etc. fit only nonlinear least-squares problems.

Another method for solving minimization problems using only first derivatives is gradient descent. However, this method does not take into account the second derivatives even approximately. Consequently, it is highly inefficient for many functions, especially if the parameters have strong interactions.

Notes

  1. ^ Mittelhammer, Ron C.; Miller, Douglas J.; Judge, George G. (2000). Econometric Foundations. Cambridge: Cambridge University Press. pp. 197–198. ISBN 0-521-62394-4.
  2. ^ Floudas, Christodoulos A.; Pardalos, Panos M. (2008). Encyclopedia of Optimization. Springer. p. 1130. ISBN 9780387747583.
  3. ^ a b Björck (1996)
  4. ^ a b J.E. Dennis, Jr. and R.B. Schnabel (1983). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM 1996 reproduction of Prentice-Hall 1983 edition. p. 222.
  5. ^ Björck (1996), p. 260.
  6. ^ Mascarenhas (2013), "The divergence of the BFGS and Gauss Newton Methods", Mathematical Programming, 147 (1): 253–276, arXiv:1309.7922, doi:10.1007/s10107-013-0720-6, S2CID 14700106
  7. ^ Björck (1996), p. 341, 342.
  8. ^ Fletcher (1987), p. 113.
  9. ^ "Archived copy" (PDF). Archived from the original (PDF) on 2016-08-04. Retrieved 2014-04-25.
  10. ^ Nocedal (1999), p. 259.
  11. ^ Nocedal, Jorge. (1999). Numerical optimization. Wright, Stephen J., 1960-. New York: Springer. ISBN 0387227423. OCLC 54849297.

References

  • Björck, Å. (1996). Numerical Methods for Least Squares Problems. Philadelphia: SIAM. ISBN 0-89871-360-9.
  • Fletcher, Roger (1987). Practical Methods of Optimization (2nd ed.). New York: John Wiley & Sons. ISBN 978-0-471-91547-8.
  • Nocedal, Jorge; Wright, Stephen (1999). Numerical Optimization. New York: Springer. ISBN 0-387-98793-2.

External links

  • Probability, Statistics and Estimation The algorithm is detailed and applied to the biology experiment discussed as an example in this article (page 84 with the uncertainties on the estimated values).

Implementations

  • Artelys Knitro is a non-linear solver with an implementation of the Gauss–Newton method. It is written in C and has interfaces to C++/C#/Java/Python/MATLAB/R.
