Methods for Local MinimizationThe essence of most methods, as described above, is in the local model employed in equation (1) that is used to determine the next step. The FindMinimum function in Mathematica has five essentially different ways of choosing this model, controlled by the method option. These methods are similarly used by FindMaximum and FindFit. Basic method choices for FindMinimum.With Method->Automatic, Mathematica uses the quasi-Newton method unless the problem is structurally a sum of squares, in which case the Levenberg-Marquardt variant of the Gauss-Newton method is used. When given two starting conditions in each variable, the principal axis method is used. Newton's MethodOne significant advantage Mathematica provides is that it can symbolically compute derivatives. This means that when you specify Method->"Newton" and the function is explicitly differentiable, the symbolic derivative will be computed automatically. On the other hand, if the function is not in a form that can be explicitly differentiated, Mathematica will use finite difference approximations to compute the Hessian, using structural information to minimize the number of evaluations required, or you can specify a Mathematica expression, which will give the Hessian with numerical values of the variables. In this example, FindMinimum computes the Hessian symbolically and substitutes numerical values for x and y when needed. In[1]:=  |
Out[1]=
|
This defines a function that is only intended to evaluate for numerical values of the variables. In[2]:=  |
The derivative of this function cannot be found symbolically since the function has been defined only to evaluate with numerical values of the variables. This shows the steps taken by FindMinimum when it has to use finite differences to compute the gradient and Hessian. In[3]:=  |
Out[3]=
|
When the gradient and Hessian are both computed using finite differences, the error in the Hessian may be quite large and it may be better to use a different method. In this case, FindMinimum does find the minimum quite accurately, but cannot be sure because of inadequate derivative information. Also, the number of function and gradient evaluations is much greater than in the example with the symbolic derivatives computed automatically because extra evaluations are required to approximate the gradient and Hessian, respectively. If it is possible to supply the gradient (or the function is such that it can be computed automatically), the method will typically work much better. You can give the gradient using the Gradient option. A later section describes in some detail the options you can use for specifying derivatives. This defines a function that returns the gradient for numerical values of x and y. In[4]:=  |
Out[4]=
|
This tells FindMinimum to use the supplied gradient. The Hessian is computed using finite differences of the gradient. In[5]:=  |
Out[5]=
|
If you can provide a program which gives the Hessian, you can provide this also. Because the Hessian is only used by Newton's method, it is given as a method option of Newton. This defines a function which returns the Hessian for numerical values of x and y. In[6]:=  |
Out[6]=
|
This tells FindMinimum to use the supplied gradient and Hessian. In[7]:=  |
Out[7]=
|
In principle, Newton's method uses the Hessian computed either by evaluating the symbolic derivative or by using finite differences. However, the convergence for the method computed this way depends on the function being convex, in which case the Hessian is always positive definite. However, it is common that a search will start at a location where this condition is violated. Here is an example where the search starts near a local maximum. In[8]:=  |
Out[8]=
|
When sufficiently near a local maximum, the Hessian is actually negative definite. This computes the eigenvalues of the Hessian near the local maximum. In[9]:=  |
Out[9]=
|
If you were to only apply the Newton step formula in cases where the Hessian is not positive definite, it is possible to get a step direction that does not lead to a decrease in the function value. This computes the directional derivative for the direction found by solving . Since it is positive, moving in this direction will locally increase the function value. In[10]:=  |
Out[10]=
|
It is crucial for the convergence of line-search methods that the direction be computed using a positive definite quadratic model since the search process and convergence results derived from it depend on a direction with sufficient descent. Mathematica modifies the Hessian by a diagonal matrix with entries large enough so that is positive definite. Such methods are sometimes referred to as modified Newton methods. The modification to is done during the process of computing a Cholesky decomposition somewhat along the lines described in [GMW81], both for dense and sparse Hessians. The modification is only done if is not positive definite. This decomposition method is accessible through LinearSolve if you want to use it independently. This computes the step using where is determined as the Cholesky factors of the Hessian are being computed. In[11]:=  |
Out[11]=
|
The computed step is in a descent direction. In[12]:=  |
Out[12]=
|
Besides the robustness of the (modified) Newton method, another key aspect is its convergence rate. Once a search is close enough to a quadratic minimum, the convergence is said to be q-quadratic, which means that if is the local minimum point, then for some constant > 0. At machine precision, this does not always make a substantial difference since commonly, that most of the steps are spent getting near to the local minimum. However, if you want a root to extremely high precision, Newton's method is usually the best choice because of the rapid convergence. This computes a very high-precision solution using Newton's method. The precision is adaptively increased from machine precision (the precision of the starting point) to the maximal working precision of 100000 digits. Reap is used with Sow to save the steps taken. Counters are used to keep track of and print the number of function evaluations and steps used. In[13]:=  |

Out[13]=
|
When the option WorkingPrecision->prec is used, the default for the AccuracyGoal and PrecisionGoal is prec/2. Thus, this example should find the minimum to at least 50000 digits. This computes a symbolic solution for the position of the minimum which the search approaches. In[14]:=  |
Out[14]=
|
This computes the norm of the distance from the search points at the ends of each step to the exact minimum. In[15]:=  |
Out[15]=
|
The reason that more function evaluations were required than the number of steps is that Mathematica adaptively increases the precision from the precision of the initial value to the requested maximum WorkingPrecision. The sequence of precisions used is chosen so that as few computations are done at the most expensive final precision as possible under the assumption that the points are converging to the minimum. Sometimes when Mathematica changes precision, it is necessary to reevaluate the function at the higher precision. This shows a table with the precision of each of the points with the norm of their errors. In[16]:=  |
Out[16]//TableForm=
|
Note that typically the precision is roughly double the scale of the error. For Newton's method this is appropriate since when the step is computed, the scale of the error will effectively double according to the quadratic convergence. FindMinimum always starts with the precision of the starting values you gave it. Thus, if you do not want it to use adaptive precision control, you can start with values, which are exact or have at least the maximum WorkingPrecision. This computes the solution using only precision 100000 throughout the computation. (Warning: this takes a very long time to complete.) In[17]:=  |

Out[17]=
|
However, even though this may use fewer function evaluations, they are all done at the highest precision, so typically adaptive precision saves a lot of time. For example, the previous command without adaptive precision takes more than 50 times as long as when starting from machine precision. With Newton's method, both line search and trust region step control are implemented. The default, which is used in the examples above is the line search. However, any of them may be done with the trust region approach. The approach typically requires more numerical linear algebra per step, but, because steps are better controlled, may converge in fewer iterations. This uses the unconstrained problems package to set up the classic Rosenbrock function, which has a narrow curved valley. In[18]:=  |
Out[18]=
|
This shows the steps taken by FindMinimum with a trust region Newton method for a Rosenbrock's function. In[19]:=  |
Out[19]=
|
This shows the steps taken by FindMinimum with a line search Newton method for the same function. In[20]:=  |
Out[20]=
|
You can see from the comparison of the two plots that the trust region method has kept the steps within better control as the search follows the valley and consequently converges with fewer function evaluations. The following table below summarizes the options you can use with Newton's method. Method options for Method->"Newton". Quasi-Newton MethodsThere are many variants of quasi-Newton methods. In all of them, the idea is to base the quadratic model in equation (1) on an approximation of the Hessian matrix built up from the function and gradient values from some or all of the steps previously taken. This shows a plot of the steps taken by the quasi-Newton method. The path is much less direct than for Newton's method. The quasi-Newton method is used by default by FindMinimum for problems that are not sums of squares. In[21]:=  |
Out[21]=
|
The first thing to notice about the path taken above is that it starts out in the wrong direction. This direction is chosen because at the first step all the method has to go by is the gradient, and so it takes the direction of steepest descent. However, in subsequent steps, it incorporates information from the values of the function and gradient at the steps taken to build up an approximate model of the Hessian. The methods used by Mathematica are the Broyden Fletcher Goldfarb and Shanno (BFGS) updates and, for large systems, the limited memory BFGS (L-BFGS) methods, in which the model is not stored explicitly, but rather is calculated by gradients and step directions stored from past steps. The BFGS method is implemented such that instead of forming the model Hessian at each step, Cholesky factors such that = are computed so that only operations are needed to solve the system [DS96] for a problem with variables. For large scale sparse problems, the BFGS method can be problematic because, in general, the Cholesky factors (or the Hessian approximation or its inverse) are dense, so the memory and operations requirements become prohibitive compared to algorithms that take advantage of sparseness. The L-BFGS algorithm [NW99] forms an approximation to the inverse Hessian based on a the last past steps, which are stored. The Hessian approximation may not be as complete, but the memory and order of operations are limited to for a problem with variables. In Mathematica 5, for problems over 250 variables, the algorithm is switched automatically to L-BFGS. You can control this with the method option "StepMemory"-> . With = , the full BFGS method will always be used. Choosing an appropriate value of is a trade-off between speed of convergence and the work done per step. With <3, you are most likely better off using a conjugate gradient algorithm. This shows the same aforementioned example function with the minimum computed using L-BFGS with = 5. In[22]:=  |
Out[22]=
|
Quasi-Newton methods are chosen as the default in Mathematica because they are typically quite fast and do not require computation of the Hessian matrix, which can be quite expensive both in terms of the symbolic computation and numerical evaluation. With an adequate line search, they can be shown to converge superlinearly [NW99] to a local minimum where the Hessian is positive definite. This means that or, in other words, that the steps keep getting smaller. However, as mentioned in the section on Newton methods, for very high precision, this does not compare to the q-quadratic convergence rate of Newton's method. This shows the number of steps and function evaluations required to find the minimum to high precision for the same problem shown above with Newton's method. In[23]:=  |

Out[23]=
|
Even though the WorkingPrecision was a tenth that used for the example with Newton's method, many more steps were taken than with Newton's method. However, one can see that the convergence is still superlinear since the ratio of the errors are clearly going to zero. This makes a plot showing the ratios of the errors in the computation. The ratios of the errors are shown on a logarithmic scale so that the trend can clearly be seen over a large range of magnitudes. In[24]:=  |
The following table summarizes the options you can use with quasi-Newton methods. Method options for Method->"QuasiNewton". Gauss-Newton MethodsFor minimization problems for which the objective function is a sum of squares, it is often advantageous to use the special structure of the problem. Time and effort can be saved by computing the residual function , and its derivative, the Jacobian . The Gauss-Newton method is an elegant way to do this. Rather than using the complete second order Hessian matrix for the quadratic model, the Gauss-Newton method uses in (1) such that the step is computed from the formula where Note that this is an approximation to the full Hessian, which is . In the zero residual case, where is the minimum, or when r varies nearly as a linear function near the minimum point, the approximation to the Hessian is quite good and the quadratic convergence of Newton's method is commonly observed. Objective functions, which are sums of squares, are quite common, and, in fact, this is the form of the objective function when FindFit is used with the default value of the NormFunction option. One way to view the Gauss-Newton method is in terms of least squares problems. Solving the Gauss-Newton step is the same as solving a linear least squares problems, so applying a Gauss-Newton method is in effect applying a sequence of linear least squares fits to a nonlinear function. With this view, it makes sense that this method is particularly appropriate for the sort of nonlinear fitting that FindFit does. This uses the unconstrained problems package to set up the classic Rosenbrock function, which has a narrow curved valley. In[26]:=  |
Out[26]=
|
When Mathematica encounters a problem that is expressly a sum of squares, such as the Rosenbrock example, or a function that is the dot product of a vector with itself, the Gauss-Newton method will be used automatically. This shows the steps taken by FindMinimum with a the Gauss-Newton method for trust region Newton Rosenbrock's function. In[27]:=  |
Out[27]=
|
If you compare this to the same example done with Newton's method, you can see that it was done with fewer steps and evaluations because the Gauss-Newton method is taking advantage of the special structure of the problem. The convergence rate near the minimum is just as good as for Newton's method because the residual is zero at the minimum. The Levenberg-Marquardt method is a Gauss-Newton method with trust region step control (though it was originally proposed before the general notion of trust regions had been developed). You can request this method specifically by using the FindMinimum option Method->"LevenbergMarquardt" or equivalently Method->"GaussNewton". Sometimes, it is awkward to express a function so that it will explicitly be a sum of squares or a dot product of a vector with itself. In these cases, it is possible to use the "Residual" method option to specify the residual directly. Similarly, you can specify the derivative of the residual with the "Jacobian" method option. Note that when the residual is specified through the "Residual" method option, it is not checked for consistency with the first argument of FindMinimum. The values returned will depend on the value given through the option. This finds the minimum of Rosenbrock's function using the specification of the residual. In[28]:=  |
Out[28]=
| Method options for Method->"LevenbergMarquardt".Another natural way of setting up sums of squares problems in Mathematica is with FindFit, which computes nonlinear fits to data. A simple example follows. Here is a model function. In[29]:=  |
Here is some data generated by the function with some random perturbations added. In[30]:=  |
This finds a nonlinear least squares fit to the model function. In[31]:=  |
Out[31]=
|
This shows the fit model with the data. In[32]:=  |
In the example, FindFit internally constructs a residual function and Jacobian, which are in turn used by the Gauss-Newton method to find the minimum of the sum of squares, or the nonlinear least squares fit. Of course, FindFit can be used with other methods, but because a residual function that evaluates rapidly can be constructed, it is often faster than the other methods. Nonlinear Conjugate Gradient MethodsThe basis for a nonlinear conjugate gradient method is to effectively apply the linear conjugate gradient method, where the residual is replaced by the gradient. A model quadratic function is never explicitly formed, so it is always combined with a line search method. The first nonlinear conjugate gradient method was proposed by Fletcher and Reeves as follows. Given a step direction , use the line search to find such that . Then compute It is essential that the line search for choosing satisfy the strong Wolfe conditions; this is necessary to ensure that the directions are descent directions [NW99]. An alternate method, which generally (but not always) works better in practice, is that of Polak and Ribiere, where equation (2) is replaced with In formula (3), it is possible that can become negative, in which case Mathematica uses the algorithm modified by using . In Mathematica, the default conjugate gradient method is Polak-Ribiere, but the Fletcher-Reeves method can be chosen by using the method option
Method->{"ConjugateGradient", Method->"FletcherReeves"} The advantage of conjugate gradient methods are that they use relatively little memory for large scale problems and require no numerical linear algebra, so each step is quite fast. The disadvantage is that they typically converge much more slowly than Newton or quasi-Newton methods. Also, steps are typically poorly scaled for length, so the line search algorithm may require more iterations each time to find an acceptable step. This shows a plot of the steps taken by the nonlinear conjugate gradient method. The path is much less direct than for Newton's method. In[33]:=  |
Out[33]=
|
One issue that arises with nonlinear conjugate gradient methods is when to restart them. As the search moves, the nature of the local quadratic approximation to the function may change substantially. The local convergence of the method depends on that of the linear conjugate gradient method, where the quadratic function is constant. With a constant quadratic function for n variables and an exact line search, the linear algorithm will converge in n or fewer iterations. By restarting (taking a steepest descent step with ) every so often, it is possible to eliminate information from previous points, which may not be relevant to the local quadratic model at the current search point. If you look carefully at the example, you can see where the method was restarted and a steepest descent step was taken. One option is to simply restart after every k iterations, where k n. You can specify this using the method option "RestartIterations"->k. An alternative is to restart when consecutive gradients are not sufficiently orthogonal according to the test
with a threshold between 0 and 1. You can specify this using the method option "RestartThreshold"-> . The table summarizes the options you can use with the conjugate gradient methods. Method options for Method->"ConjugateGradient".It should be noted that the default method for FindMinimum in Mathematica Version 4 was a conjugate gradient method with a near exact line search. This has been maintained for legacy reasons and can be accessed by using the FindMinimum option Method->Gradient. Typically, this will use more function and gradient evaluations than the newer Method->"ConjugateGradient", which itself often uses far more than the methods that Mathematica currently uses as defaults. Principal Axis MethodAll of the methods previously described use derivatives. When Mathematica cannot compute symbolic derivatives, finite differences will be used. Computing derivatives with finite differences can impose a significant cost in some cases and certainly affects the reliability of derivatives, ultimately having an effect on how good an approximation to the minimum is achievable. For functions where symbolic derivatives are not available, an alternative is to use a derivative-free algorithm, where an approximate model is built up using only values from function evaluations. Mathematica uses the principal axis method of Brent [Br02] as a derivative-free algorithm. For an n-variable problem, take a set of search directions and a point . Take to be the point that minimizes f along the direction from (i.e., do a line search from ), then replace with . At the end, replace with . Ideally, the new should be linearly independent, so that a new iteration could be undertaken, but in practice, they are not. Brent's algorithm involves using the singular value decomposition (SVD) on the matrix U = to realign them to the principal directions for the local quadratic model. (An eigendecomposition could be used, but Brent shows that the SVD is more efficient.) With the new set of obtained, another iteration can be done. Two distinct starting conditions in each are required for this method because these are used to define the magnitudes of the vectors . In fact, whenever you specify two starting conditions in each variable, FindMinimum, FindMaximum, and FindFit will use the principal axis algorithm by default. This shows the search path and function evaluations for FindMinimum to find a local minimum of the function . In[34]:=  |
Out[34]=
|
The basics of the search algorithm can be seen quite well from the plot since the derivative-free line search algorithm requires a substantial number of function evaluations. First a line search is done in the x-direction, then from that point, a line search is done in the y-direction, determining the step direction. Once the step is taken, the vectors are realigned appropriately to the principal directions of the local quadratic approximation and the next step is similarly computed. The algorithm is efficient in terms of convergence rate; it has quadratic convergence in terms of steps. However, in terms of function evaluations, it is quite expensive because of the derivative free line search required. Note that since the directions given to the line search (especially at the beginning) are not necessarily descent directions, the line search has to be able to search in both directions. For problems with many variables, the individual linear searches in all of the directions become very expensive, so this method is typically better suited to problems without too many variables.
|