Unconstrained Optimization and Nonlinear Systems
Goals of this chapter
• Introduce and apply effective methods for solving unconstrained optimization problems.
• Introduce algorithms that are popular in applications and software libraries.
• Evaluate different algorithms and understand tradeoffs according to user convenience, speed of convergence, robustness and sparsity.
• Introduce gradient descent, Newton’s method, BFGS, Gauss-Newton and derivative-free methods.
• Study methods for solving large problems.
Optimization seems to be almost a primal urge among scientists and engineers, and there is a vast number of applications giving rise to the mathematical problems and the numerical methods described in this chapter and the next one.
There are several types of optimization problems. Our prototype here is the minimization of a scalar function f in n variables x = (x1, x2, . . . , xn)T . We write this as
    min_x f(x),
and require that x be in Rn. Assuming that the objective function f is at least continuous, we call this an unconstrained minimization problem.
A necessary condition for a differentiable function f (x) to have an unconstrained minimum at a certain point in Rn is that all its first derivatives, i.e., the gradient vector, vanish there. This generalizes the setting and results discussed in Section 3.5 and is further justified in Section 9.2 below. We must require
    ∂f/∂x1 = 0,  ∂f/∂x2 = 0,  ...,  ∂f/∂xn = 0.
In general, these are n nonlinear equations in n unknowns.
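For a concrete (hypothetical) illustration of this condition, take n = 2 and f(x1, x2) = (x1 − 1)^2 + x1 x2 + x2^2. The two equations 2(x1 − 1) + x2 = 0 and x1 + 2x2 = 0 have the unique solution x∗ = (4/3, −2/3)^T, and a finite-difference check in Python (a sketch of ours, not from the text) confirms that the gradient vanishes there:

```python
def f(x1, x2):
    # hypothetical convex objective: (x1 - 1)^2 + x1*x2 + x2^2
    return (x1 - 1.0) ** 2 + x1 * x2 + x2 ** 2

def grad_fd(func, x1, x2, h=1e-6):
    """Central finite-difference approximation of the gradient of func."""
    return ((func(x1 + h, x2) - func(x1 - h, x2)) / (2 * h),
            (func(x1, x2 + h) - func(x1, x2 - h)) / (2 * h))

g1, g2 = grad_fd(f, 4.0 / 3.0, -2.0 / 3.0)  # both components vanish here
```

Since the Hessian of this particular f is constant and positive definite, the point where the gradient vanishes is indeed the unique minimizer.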
314 Chapter 9: Optimization and Nonlinear Systems
Thus, we first consider in Section 9.1 the problem of solving systems of nonlinear equations. In fact, we do it in a more general context, without a minimization problem necessarily lurking in the background. This provides a generalization of Chapter 3 to problems in several variables, and it gives access to solving nonlinear differential equations as well as other nonlinear problems.
Returning to the main topic, the problem of minimizing a scalar, sufficiently smooth function in several variables without constraints is considered in the rest of the chapter. We start with the important background Section 9.2, where conditions for optimality are derived and the concept of convexity is defined and discussed. Over regions where the objective function is convex there is precisely one minimum point and no other extremal points. More generally, there may be minima, maxima and saddle points, and we quickly discuss the potentially difficult question of finding a global minimum before turning our attention to the task of finding a local one (in this context, recall Section 3.5 again).
Basic numerical methods are considered in Section 9.3. We attempt to find a sequence of iterates such that the objective function decreases meaningfully (i.e., without “crawling”) from each iterate, xk, to the next, xk+1. It is possible to show that, under reasonable assumptions that include a good starting guess x0, such a method would converge to a local minimum. The basic gradient descent and Newton’s methods are introduced, and line search algorithms that guarantee meaningful decrease in the objective function f under suitable conditions are discussed.
Quasi-Newton methods, which give rise to the most popular algorithm in unconstrained op- timization software, are considered in Section 9.4.
A special method for nonlinear least squares problems, highly popular in applications, is derived in Section 9.5.
The methods in Sections 9.3–9.5 are considered under the assumption that the norm of the matrix of 2nd partial derivatives of f(x), called the Hessian matrix, is bounded in the region of interest. Consequently, the faster methods attempt to exploit 2nd order information, approximating the action of the Hessian in some way. However, there is a class of interesting and popular applications where the gradient of the objective function exists and is bounded in norm, but not more: the gradient may not even be differentiable. Section 9.6 considers methods for such convex problems. These methods essentially attempt to improve on the gradient descent method in various ways.
All of the methods in Sections 9.3–9.6 are considered under the assumption that the gradient of the objective function is available. Methods for the case where this does not hold are briefly discussed in Section 9.7. The gradient may not be available either because f is not differentiable or because it is just too complicated or messy for a modest user to evaluate without error. There is a significant difference between these sources of difficulty.
If the optimization problem is large, i.e., n is large in the sense discussed in previous chapters (especially Chapter 7), then additional considerations often dominate and new methods arise as a result. Some popular methods that are built to handle such problems are discussed in Section 9.8.
In Chapter 10 we consider the additional challenges brought about when the array of variables of the function to be minimized is subjected to constraints, i.e., x is restricted to be in some set Ω that is strictly contained in Rn. This usually makes the problem harder to solve.
9.1 Newton’s method for nonlinear systems
Note: If you are focused just on optimization, then it is possible to skip this section on a first pass and start your exploration in Section 9.2.
Let us consider a system of n nonlinear equations in n unknowns, written as

    g(x) = 0,

where

    x = (x1, x2, ..., xn)^T,    g(x) = (g1(x), g2(x), ..., gn(x))^T.

In component form the system reads

    gi(x1, x2, ..., xn) = 0,    i = 1, 2, ..., n.
This is a generalization of the case n = 1, considered in Chapter 3. We will not assume in this section that the problem necessarily arises from the minimization of any scalar function. Our purpose is to present the basic Newton method in a general context.
Example 9.1. (E)
Let g1(x1, x2) = x1^2 − 2x1 − x2 + 1 and g2(x1, x2) = x1^2 + x2^2 − 1. The equation g1(x1, x2) = 0 describes a parabola, while g2(x1, x2) = 0 describes a circle; see Figure 9.1.
Figure 9.1. A parabola meets a circle.
There are two roots to the system of equations
x21 −2×1 −x2 +1=0, x21 +x2 −1=0.
These are the two intersection points, x∗(1) = (1, 0)T and x∗(2) = (0, 1)T .
The more general picture
We will assume throughout that g has bounded derivatives at least up to order two.
Similarly to the case of one nonlinear equation considered in Chapter 3 (see in particular Figure 2.1), a system of nonlinear equations may have any number of solutions in general; indeed the number of solutions is not directly related to n. Also as before, we will seek an iterative method for solving nonlinear equations: starting from an initial guess x0, such a method generates a sequence of iterates x1, x2, ..., xk, ..., hopefully converging to a solution x∗ satisfying
g(x∗) = 0.
Systems of nonlinear equations are generally much more complicated to solve than one nonlinear equation in one unknown, although some of the techniques developed in Chapter 3 do extend to the present family of problems. Unfortunately, no simple method such as bisection is available here. We will soon see an extension of Newton’s method and variants, which provide a powerful tool for local convergence. That is, if we start with x0 already close in some sense to x∗, then under reasonable conditions the method will converge rapidly. However, obtaining convergence for cases where the starting point x0 is remote can be really problematic, due to the lack of a tool like the bisection method that would allow getting closer to a solution safely.
Theorem: Taylor Expansion for Vector Functions.

Let x = (x1, x2, ..., xn)^T, g = (g1, g2, ..., gm)^T, and assume that g(x) has bounded derivatives up to order at least two. Then for a direction vector p = (p1, p2, ..., pn)^T, the Taylor expansion for each function gi in each coordinate xj yields

    gi(x + p) = gi(x) + Σ_{j=1}^{n} (∂gi/∂xj) pj + O(∥p∥^2),    i = 1, ..., m.

Thus we have

    g(x + p) = g(x) + J(x)p + O(∥p∥^2),

where J(x) is the Jacobian matrix of first derivatives of g at x, given by

           [ ∂g1/∂x1   ∂g1/∂x2   ...   ∂g1/∂xn ]
    J(x) = [ ∂g2/∂x1   ∂g2/∂x2   ...   ∂g2/∂xn ]
           [    ...       ...    ...      ...  ]
           [ ∂gm/∂x1   ∂gm/∂x2   ...   ∂gm/∂xn ]
To start, we need an extension of the scalar Taylor theorem to the multivariate case. The corresponding theorem is given on this page. Note that for the present purposes, n = m, so the Jacobian matrix is square. More generally, it is a rectangular matrix function of x. Observe that for a system there is no evaluation of the next, remainder term at some intermediate point ξ. Moreover, the precise form of the term involving the second derivatives of g is often a monster that we luckily need not look at too often in our applications.
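The O(∥p∥^2) remainder can be observed numerically. The Python sketch below (ours; it uses the g of Example 9.1) halves the step p and checks that the linearization error ∥g(x + p) − g(x) − J(x)p∥ drops by roughly a factor of 4:

```python
import math

# g and its Jacobian for Example 9.1 (parabola and circle)
def g(x1, x2):
    return (x1**2 - 2*x1 - x2 + 1, x1**2 + x2**2 - 1)

def J(x1, x2):
    return ((2*x1 - 2, -1.0), (2*x1, 2*x2))

def lin_err(x, p):
    """|| g(x + p) - g(x) - J(x) p ||_2: the Taylor remainder."""
    gx, Jx = g(*x), J(*x)
    gxp = g(x[0] + p[0], x[1] + p[1])
    r = [gxp[i] - gx[i] - (Jx[i][0] * p[0] + Jx[i][1] * p[1]) for i in range(2)]
    return math.hypot(*r)

x, p = (0.5, 0.5), (0.1, 0.1)
e1 = lin_err(x, p)
e2 = lin_err(x, (p[0] / 2, p[1] / 2))
ratio = e1 / e2   # about 4, since the remainder is O(||p||^2)
```

For this particular g each component is quadratic, so the remainder is exactly quadratic in the step length and the ratio comes out as 4 up to roundoff.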
It is helpful to think of x as a point in Rn and of p = (p1,p2,…,pn)T as a direction vector. We then move from a point x, in the direction p, to the point x + p. See Figure 9.2.
Figure 9.2. The point x, the direction p, and the point x + p.

Deriving Newton’s method
Let us return to the system of nonlinear equations g(x) = 0. Newton’s method for solving such a system is derived in a fashion similar to the scalar case. Starting from an initial guess x0, a sequence of iterates x1, x2, ..., xk, ... is generated, where the next iterate xk+1 is obtained upon approximating the Taylor expansion of g about xk by its linear terms.
Thus, at x = xk, if we knew p = p∗ = x∗ − xk, then for small p∗

    0 = g(xk + p∗) ≈ g(xk) + J(xk)p∗.

Of course, we do not know p∗, because we do not know x∗, but we use the above to define p = pk by requiring

    g(xk) + J(xk)pk = 0.

This yields the algorithm defined on the current page.
Algorithm: Newton’s Method for Systems.
for k = 0, 1, ..., until convergence
    solve J(xk) pk = −g(xk) for pk
    set xk+1 = xk + pk.
The following MATLAB function implements this algorithm.
function [x,k] = newtons(g,x,tol,nmax)
%
% function [x,k] = newtons(g,x,tol,nmax)
%
% This function returns in x a column vector x_k such that
%   || x_k - x_{k-1} || < tol * (1 + ||x_k||)
% and in k the number of iterations (Jacobian evaluations) required.
% On entry, x contains an initial guess.
% If k equals nmax then no convergence has been reached.
% The values ||g(x_k)|| are printed as the iteration proceeds.
% This option can be easily turned off.

% initialize
fprintf('k      ||g(x_k)|| \n')
format long g

% Newton iteration
for k = 1:nmax
  [gx,Jx] = feval(g,x);
  fprintf('%d   %e \n', k-1, norm(gx))
  p = -Jx \ gx;          % solve J(x_k) p_k = -g(x_k)
  x = x + p;             % set x_{k+1} = x_k + p_k
  if norm(p) < tol*(1+norm(x))
    gx = feval(g,x);
    fprintf('%d   %e \n', k, norm(gx))
    return
  end
end
The stopping criterion in newtons can be easily modified to include a test on ∥g(xk)∥ as well.
Example 9.2. (I)
For the problem defined in Example 9.1 there are two solutions (roots), x∗(1) = (1, 0)^T and x∗(2) = (0, 1)^T. The Jacobian matrix for this problem is

    J(x) = [ 2x1 − 2    −1
             2x1        2x2 ].
Note that the Jacobian matrix is not necessarily nonsingular for just any choice of x1 and x2. For example, starting the iteration from x0 = 0 is troublesome because J(0) is singular. But the Jacobian is nonsingular at the two roots and at a neighborhood of each of these roots.
Starting Newton’s method from the two initial guesses x0^(1) = (1, 1)^T and x0^(2) = (−1, 1)^T
yields quick convergence to the two roots, respectively. The progress of ∥g(xk)∥ is recorded in Table 9.1. Note how rapidly the iteration converges once the residual ∥g(xk)∥ becomes small enough.
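The same experiment is easy to reproduce outside MATLAB. Here is a small Python sketch of Newton’s method specialized to this 2 × 2 system (the linear solve J(xk)pk = −g(xk) is done by Cramer’s rule, which is sensible only for such tiny systems; function names are ours):

```python
import math

def newton2(g, J, x, tol=1e-10, nmax=20):
    """Newton's method for a 2x2 system: g(x) -> (g1, g2),
    J(x) -> ((J11, J12), (J21, J22)). Returns (root, iterations)."""
    for k in range(nmax):
        g1, g2 = g(x)
        (a, b), (c, d) = J(x)
        det = a * d - b * c
        # solve J p = -g by Cramer's rule
        p = ((-g1 * d + g2 * b) / det, (-a * g2 + c * g1) / det)
        x = (x[0] + p[0], x[1] + p[1])
        if math.hypot(*p) < tol * (1 + math.hypot(*x)):
            return x, k + 1
    return x, nmax

# the system of Example 9.1
g = lambda x: (x[0]**2 - 2*x[0] - x[1] + 1, x[0]**2 + x[1]**2 - 1)
J = lambda x: ((2*x[0] - 2, -1.0), (2*x[0], 2*x[1]))

root1, k1 = newton2(g, J, (1.0, 1.0))    # first guess
root2, k2 = newton2(g, J, (-1.0, 1.0))   # second guess
```

Started from (1, 1)^T and (−1, 1)^T, the iteration converges to the roots (1, 0)^T and (0, 1)^T, respectively, in a handful of iterations, consistent with Table 9.1.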
We show next that if at a neighborhood of an isolated root x∗ the Jacobian matrix J(x) has a bounded inverse and continuous derivatives, then Newton’s method converges locally quadratically, i.e., there is a constant M such that

    ∥x∗ − xk+1∥ ≤ M ∥x∗ − xk∥^2,

provided ∥x∗ − xk∥ is already small enough. This quadratic convergence order is nicely demonstrated in Table 9.1.
Assumptions Let g be Lipschitz continuously differentiable in an open convex set D ⊂ Rn,
i.e., there is a constant γ such that
    ∥J(x) − J(y)∥ ≤ γ ∥x − y∥,   ∀ x, y ∈ D.
Table 9.1. Convergence of Newton’s method to the two roots of Example 9.1. The first guess, x0 = (1, 1)^T, leads to the root x∗ = (1, 0)^T, while the second guess (or starting point), x0 = (−1, 1)^T, leads to the root x∗ = (0, 1)^T.

    k    ∥g(xk)∥ first guess    ∥g(xk)∥ second guess
    0    1.414214e+00           3.162278e+00
    1    1.274755e+00           7.218033e-01
    2    2.658915e-01           1.072159e-01
    3    3.129973e-02           4.561589e-03
    4    3.402956e-04           9.556657e-06
    5    7.094460e-08           4.157885e-11
    6    1.884111e-15
Note: Below we prove that Newton’s method converges locally quadratically under appropriate assumptions. If you are not interested in proofs then you can skip straight to Example 9.3. But if you can stomach that, then it is worthwhile to understand this one, because it is typical and fundamental.
Suppose further that there is a nondegenerate root x∗ ∈ D, i.e., g(x∗) = 0 and J(x∗) is nonsingular.
Quadratic convergence proof  By Taylor series for vector functions we have

    g(x + p) = g(x) + ∫_0^1 J(x + tp) p dt.

(Note again that, unlike for the scalar case n = 1, we cannot simply write g(x + p) = g(x) + J(ξ)p at some nearby point ξ.)

Here we set x = xk, p = x∗ − xk, write g(xk) = gk, Jk = J(xk), and obtain

    0 = g(x∗) = gk + ∫_0^1 J(xk + t(x∗ − xk)) (x∗ − xk) dt.

Adding and subtracting Jk(x∗ − xk) we get

    gk + Jk(x∗ − xk) = ∫_0^1 [Jk − J(xk + t(x∗ − xk))] (x∗ − xk) dt.

Taking norms, using Lipschitz continuity, and integrating with respect to t, we obtain

    ∥gk + Jk(x∗ − xk)∥ ≤ (γ/2) ∥x∗ − xk∥^2.

Now, by our continuity assumptions there is a ball B ⊂ D around x∗ and a constant β such that

    ∥J^{−1}(x)∥ ≤ β,   ∀ x ∈ B.

For xk+1 = xk − Jk^{−1} gk we have

    x∗ − xk+1 = x∗ − xk + Jk^{−1} gk = Jk^{−1} [gk + Jk(x∗ − xk)],

and taking norms,

    ∥x∗ − xk+1∥ ≤ ∥Jk^{−1}∥ ∥gk + Jk(x∗ − xk)∥ ≤ (βγ/2) ∥x∗ − xk∥^2.

This establishes quadratic convergence with M = βγ/2, because we can make the ball B small enough that, starting with x0 ∈ B, the iterates all stay in that ball.
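The constant M can be watched in action even in the scalar case n = 1. For the hypothetical example g(x) = x^2 − 2 (not from the text), Newton’s iteration converges to √2, and the error ratios ∥x∗ − xk+1∥ / ∥x∗ − xk∥^2 settle near |g″(x∗)| / (2|g′(x∗)|) = 1/(2√2) ≈ 0.3536:

```python
import math

root = math.sqrt(2.0)
x = 1.5
errs = [abs(x - root)]
for _ in range(3):
    x = x - (x * x - 2.0) / (2.0 * x)  # Newton step for g(x) = x^2 - 2
    errs.append(abs(x - root))

# successive ratios e_{k+1} / e_k^2 approach 1/(2*sqrt(2)) ~ 0.3536
ratios = [errs[i + 1] / errs[i] ** 2 for i in range(3)]
```

After three steps from x0 = 1.5 the error is already near the rounding level, which is why only a few ratios can be computed in double precision.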
Example 9.3. (A)
Nonlinear systems of equations are of course not restricted to two components. In fact, n can easily become large in applications. Let us show a simple example that gives rise to a larger set of nonlinear equations.
Extending Example 4.12 (see page 127), suppose we are now required to find a function v(t) that satisfies the nonlinear boundary value ODE
    v″(t) + e^{v(t)} = 0,   0 < t < 1,   v(0) = v(1) = 0

(where v″(t) = d^2 v / dt^2), and we do this approximately by applying a finite difference discretization.
Such methods and problems are in general the subject of the more advanced Section 17.8, but all we do here is extend the method in Example 4.12 directly, and concentrate on the resulting nonlinear system of algebraic equations.
Thus we subdivide the interval [0, 1] into n + 1 equal subintervals and set ti = ih, i = 0,1,...,n + 1, where (n + 1)h = 1. We then look for an approximate solution vi ≈ v(ti), i = 1, . . . , n, using the boundary conditions to set v0 = vn+1 = 0.
The discretization of the differential equation is given by
    (v_{i+1} − 2v_i + v_{i−1}) / h^2 + e^{v_i} = 0,   i = 1, 2, ..., n.

This is a system of nonlinear equations, g(x) = 0. Here x ← v and g_i(v) = (v_{i+1} − 2v_i + v_{i−1}) / h^2 + e^{v_i}.
The Jacobian matrix has as its (i, j)th element ∂g_i/∂v_j. We obtain the tridiagonal matrix

    J(v) = (1/h^2) ×
        [ −2 + h^2 e^{v_1}      1                                          ]
        [  1      −2 + h^2 e^{v_2}      1                                  ]
        [           ...        ...        ...                              ]
        [              1      −2 + h^2 e^{v_{n−1}}      1                  ]
        [                          1      −2 + h^2 e^{v_n}                 ]
All we need to fire up Newton’s method is an initial guess v0. Let us choose the corresponding mesh values of αt(1 − t), so that

    v0 = α (t1(1 − t1), ..., tn(1 − tn))^T,
and try different values for the parameter α.
Setting tol = 1.e-8 and h = .04 (i.e., n = 24), the results are as follows. For a zero initial guess, α = 0, convergence is reached within 4 iterations to a solution denoted v∗1 in Figure 9.3. For α = 10 the same solution is reached in 6 iterations. For α = 20 another solution, denoted v∗2 in Figure 9.3, is reached within 6 iterations. Figure 9.3 depicts this solution, too.
For α = 50 the algorithm diverges. Unfortunately, divergence is not a rare occurrence when using Newton’s method without due care.
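For readers who want to repeat this experiment, here is a self-contained Python sketch (our translation, not the text’s code; MATLAB’s backslash is replaced by a simple tridiagonal solve, and the names and defaults are our choices):

```python
import math

def solve_tridiag(sub, diag, sup, rhs):
    """Thomas algorithm for a tridiagonal system; sub[0] and sup[-1] unused."""
    n = len(diag)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = sup[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - sub[i] * cp[i - 1]
        cp[i] = sup[i] / m
        dp[i] = (rhs[i] - sub[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def bvp_newton(n=24, alpha=0.0, tol=1e-8, nmax=50):
    """Newton's method for v'' + exp(v) = 0, v(0) = v(1) = 0, discretized
    on n interior mesh points; initial guess alpha * t * (1 - t)."""
    h = 1.0 / (n + 1)
    t = [(i + 1) * h for i in range(n)]
    v = [alpha * ti * (1.0 - ti) for ti in t]
    for k in range(nmax):
        # residual g_i(v) = (v_{i+1} - 2 v_i + v_{i-1})/h^2 + exp(v_i)
        g = []
        for i in range(n):
            vm = v[i - 1] if i > 0 else 0.0
            vp = v[i + 1] if i < n - 1 else 0.0
            g.append((vp - 2.0 * v[i] + vm) / h ** 2 + math.exp(v[i]))
        # tridiagonal Jacobian: off-diagonals 1/h^2, diagonal -2/h^2 + exp(v_i)
        off = [1.0 / h ** 2] * n
        diag = [-2.0 / h ** 2 + math.exp(vi) for vi in v]
        p = solve_tridiag(off, diag, off, [-gi for gi in g])
        v = [vi + pi for vi, pi in zip(v, p)]
        if max(abs(pi) for pi in p) < tol * (1.0 + max(abs(vi) for vi in v)):
            return v, k + 1
    return v, nmax
```

With alpha = 0 this converges in a few iterations to the lower, positive and symmetric solution, whose maximum value is roughly 0.14.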
Figure 9.3. Two solutions for the boundary value ODE of Example 9.3.
Modifying Newton’s method
In Section 3.4 we have discussed ways to deal with various deficiencies of Newton’s method, including the fact that it may diverge when the initial guess is not sufficiently close to the solution and the requirement that the user specify what corresponds to the Jacobian matrix. Here, in addition, there is the need to solve a potentially large system of equations at each iteration k, so techniques from Chapters 5 and 7 are directly relevant.
We discuss such issues and more in this chapter, but let’s do it in the context of unconstrained optimization, considered next.
Before moving on, note that the general problem considered in this section can be artificially cast as that of minimizing some objective function f(x) that reaches its minimum at x∗. For instance, define f(x) = ∥g(x)∥, or let f(x) = ∥Jˆ−1g(x)∥ for some constant and nonsingular matrix Jˆ that somehow represents the Jacobian matrix. We do not necessarily recommend solving nonlinear equations through such a minimization setting, but it is possible to use f as a merit function, to judge or measure improvement in a sequence of iterates. For instance, the next iterate xk+1 of the same Newton method described on page 317 may be considered “better” than the current xk if ∥g(xk+1)∥ < ∥g(xk)∥.
Specific exercises for this section: Exercises 9.1–9.9.
9.2 Unconstrained optimization: optimality and convexity
In this section and the following ones we consider numerical methods for the unconstrained minimization of a function in n variables. Thus, the prototype problem reads
    min_{x ∈ R^n} f(x).
Here f : Rn → R. That is to say, the argument x = (x1,x2,...,xn)T is a vector as in Section 9.1, but f takes on scalar values.
Example 9.4. (E)
Here is a very simple example. For n = 2, x = (x1, x2)^T, we specify the function

    f(x) = x1^2 + x2^4 + 1.

This function obviously has the minimum value of 1 at x∗ = (0, 0)^T = 0; see Figure 9.4.
Figure 9.4. The function x1^2 + x2^4 + 1 has a unique minimum at the origin (0, 0) and no maximum. Upon flipping it, the resulting function −(x1^2 + x2^4 + 1) would have a unique maximum at the origin (0, 0) and no minimum.
In general, finding a maximum for f(x) is the same as finding a minimum for −f(x). Note that there is no finite argument at which the function in this example attains a maximum. There- fore, the function defined by
    f(x) = −x1^2 − x2^4 − 1
does not attain a minimum value anywhere, even though it has a maximum at x = 0.
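A brute-force numerical sanity check of this example (a sketch of ours; we simply sample f on a grid around the origin):

```python
def f(x1, x2):
    return x1 ** 2 + x2 ** 4 + 1.0  # the objective of Example 9.4

# sample f on a grid around the origin: the smallest value is f(0, 0) = 1
vals = [f(0.1 * i, 0.1 * j) for i in range(-20, 21) for j in range(-20, 21)]
fmin = min(vals)

# f grows without bound, so -f = -(x1^2 + x2^4 + 1) attains no minimum
f_far = f(1000.0, 0.0)
```

Of course, sampling on a grid proves nothing in general; here it merely illustrates the stated behavior of this particular f.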
Local vs global minimum
A point x∗ is said to be a local minimizer, or a local minimum point, if there is an open set containing x∗ such that f(x∗) ≤ f(x) for any point x in that set.
For realistic, nontrivial problems we typically cannot find minimum points by inspection. Indeed, it is often unclear in practice how many local minima a function f(x) has and, if it has more than one local minimum, how to efficiently find the global minimum, which has the overall smallest value for f . In Section 2.6 we already saw examples of scalar functions f (x) that have several local minima. In Exercise 3.28 we even boldly asked you to write a program to find the global minimizer(s) by first finding all the local minimizers and then finding among them those
with minimal objective value. However, for nontrivial problems in several variables writing such an efficient and general program can be extremely difficult if not practically impossible. And yet, such problems oft