
Lecture Notes to Accompany
Scientific Computing
An Introductory Survey
Second Edition
by Michael T. Heath
Chapter 6
Optimization
Copyright © 2001. Reproduction permitted only for noncommercial, educational use in conjunction with the book.
1

Optimization
Given function f: R^n → R, and set S ⊆ R^n, find x* ∈ S such that f(x*) ≤ f(x) for all x ∈ S
x* called minimizer or minimum of f
Suffices to consider only minimization, since maximum of f is minimum of −f
Objective function f usually differentiable, may be linear or nonlinear
Constraint set S defined by system of equations and inequalities that may be linear or nonlinear
Points x ∈ S called feasible points
If S = R^n, problem is unconstrained
2

Optimization Problems
General continuous optimization problem:
min f(x) subject to g(x) = o and h(x) ≤ o,
where f: R^n → R, g: R^n → R^m, h: R^n → R^p
Linear programming: f, g, and h all linear
Nonlinear programming: nonlinear objective or nonlinear constraints, or both
3

Examples: Optimization Problems
Minimize weight of structure subject to constraint on its strength, or maximize its strength subject to constraint on its weight
Minimize cost of diet subject to nutritional constraints
Minimize surface area of cylinder subject to constraint on its volume:
min_{x1, x2} f(x1, x2) = 2πx1(x1 + x2)
subject to g(x1, x2) = πx1²x2 − V = 0,
where x1 and x2 are radius and height of cylinder, and V is required volume
4
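The cylinder problem above can be sketched numerically by using the volume constraint to eliminate x2 and minimizing the resulting function of x1 alone (a minimal Python sketch; the value V = 1 and the grid-search resolution are illustrative assumptions, not from the notes):

```python
import math

V = 1.0  # assumed required volume (illustration only)

def area(x1):
    """Surface area 2*pi*x1*(x1 + x2), with x2 eliminated via pi*x1^2*x2 = V."""
    x2 = V / (math.pi * x1 * x1)
    return 2 * math.pi * x1 * (x1 + x2)

# crude grid search over the radius x1
r_best = min((area(r), r) for r in (i / 10000 for i in range(1, 20000)))[1]
```

The result agrees with the analytic solution x1 = (V/(2π))^(1/3), for which the height equals the diameter.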

Local vs Global Optimization
x* ∈ S is global minimum if f(x*) ≤ f(x) for all x ∈ S
x* ∈ S is local minimum if f(x*) ≤ f(x) for all feasible x in some neighborhood of x*
[Figure: curve with a local minimum and the global minimum marked]
5

Global Optimization
Finding, or even verifying, global minimum is difficult, in general
Most optimization methods designed to find local minimum, which may or may not be global minimum
If global minimum desired, can try several widely separated starting points and see if all produce same result
For some problems, such as linear programming, global optimization is tractable
6

Existence of Minimum
If f is continuous on closed and bounded set S ⊆ R^n, then f has global minimum on S
If S is not closed or is unbounded, then f may have no local or global minimum on S
Continuous function f on unbounded set S ⊆ R^n is coercive if
lim_{‖x‖→∞} f(x) = +∞,
i.e., f(x) must be large whenever ‖x‖ is large
If f coercive on closed, unbounded set S ⊆ R^n, then f has global minimum on S
7

Level Sets
Level set for function f: S ⊆ R^n → R is set of all points in S for which f has some given constant value (also called contour line)
For given γ ∈ R, sublevel set is
L_γ = {x ∈ S : f(x) ≤ γ}
If continuous function f on S ⊆ R^n has nonempty sublevel set that is closed and bounded, then f has global minimum on S
If S is unbounded, then f is coercive on S if, and only if, all its sublevel sets are bounded
8

Uniqueness of Minimum
Set S ⊆ R^n is convex if it contains line segment between any two of its points
Function f: S ⊆ R^n → R is convex on convex set S if its graph along any line segment in S lies on or below chord connecting function values at endpoints of segment
Any local minimum of convex function f on convex set S ⊆ R^n is global minimum of f on S
Any local minimum of strictly convex function f on convex set S ⊆ R^n is unique global minimum of f on S
9

First-Order Optimality Condition
For function of one variable, find extremum by differentiating function and setting derivative to zero
Generalization to function of n variables is to find critical point, i.e., solution of nonlinear system
∇f(x) = o,
where ∇f(x) is gradient vector of f, whose ith component is ∂f(x)/∂xi
For continuously differentiable f: S ⊆ R^n → R, any interior point x* of S at which f has local minimum must be critical point of f
But not all critical points are minima: can also be maximum or saddle point
10

Second-Order Optimality Condition
For twice continuously differentiable f: S ⊆ R^n → R, can distinguish among critical points by considering Hessian matrix Hf(x) defined by
{Hf(x)}ij = ∂²f(x)/∂xi∂xj,
which is symmetric
At critical point x*, if Hf(x*) is
• positive definite, then x* is minimum of f
• negative definite, then x* is maximum of f
• indefinite, then x* is saddle point of f
• singular, then various pathological situations possible
11
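For n = 2 this classification can be carried out with the determinant test, which is equivalent to checking the signs of the two eigenvalues (a small Python sketch; the function name is mine):

```python
def classify_critical_point(H):
    """Classify a critical point from the symmetric 2x2 Hessian H = [[a, b], [b, c]].
    det(H) > 0: both eigenvalues share the sign of a; det(H) < 0: indefinite."""
    (a, b), (_, c) = H
    det = a * c - b * b
    if det > 0:
        return "minimum" if a > 0 else "maximum"
    if det < 0:
        return "saddle point"
    return "singular: no conclusion"

# z = x^2 + y^2 has Hessian diag(2, 2); z = x^2 - y^2 has Hessian diag(2, -2)
```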

z = x² + y² has a minimum

z = x² − y² has a saddle point

Constrained Optimality
If problem is constrained, need be concerned only with feasible directions
For equality-constrained problem
min f(x) subject to g(x) = o,
where f: R^n → R and g: R^n → R^m, with m ≤ n, necessary condition for feasible point x* to be solution is that negative gradient of f lie in space spanned by constraint normals, i.e.,
−∇f(x*) = Jgᵀ(x*) λ*,
where Jg is Jacobian matrix of g, and λ* is vector of Lagrange multipliers
This condition says we cannot reduce objective
function without violating constraints
12

Constrained Optimality, continued
Lagrangian function L: R^(n+m) → R, defined by
L(x, λ) = f(x) + λᵀ g(x),
and its gradient and Hessian given by
∇L(x, λ) = [ ∇f(x) + Jgᵀ(x)λ ]
           [ g(x)            ]
and
HL(x, λ) = [ B(x, λ)   Jgᵀ(x) ]
           [ Jg(x)     O      ],
where
B(x, λ) = Hf(x) + Σ_{i=1}^m λi Hgi(x)
Together, necessary condition and feasibility imply critical point of Lagrangian function,
∇L(x, λ) = [ ∇f(x) + Jgᵀ(x)λ ] = o
           [ g(x)            ]
13

Constrained Optimality, continued
Hessian of Lagrangian is symmetric, but not positive definite, so critical point of L is saddle point rather than minimum or maximum
Critical point (x*, λ*) of L is constrained minimum of f if B(x*, λ*) is positive definite on null space of Jg(x*)
If columns of Z form basis for null space, then test projected Hessian ZᵀBZ for positive definiteness
If inequalities present, then KKT optimality conditions also require nonnegativity of Lagrange multipliers corresponding to inequalities, and complementarity condition
14

Sensitivity and Conditioning
Although problems of minimizing function and solving equation are closely related, their sensitivities differ
In one dimension, absolute condition number of root x* of equation f(x) = 0 is 1/|f′(x*)|, so if |f(x̂)| ≤ ε, then |x̂ − x*| may be as large as ε/|f′(x*)|
For minimizing f, Taylor series expansion
f(x̂) = f(x* + h)
     = f(x*) + f′(x*)h + ½ f″(x*)h² + O(h³)
shows that, since f′(x*) = 0, if |f(x̂) − f(x*)| ≤ ε, then |x̂ − x*| may be as large as √(2ε/|f″(x*)|)
Thus, based on function values alone, minima can be computed to only about half precision
15
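This loss of about half the digits is easy to observe in Python; here f(x) = −cos(x) has minimizer x* = 0 with f″(x*) = 1, and perturbations below √ε_mach ≈ 10⁻⁸ leave the double-precision function value unchanged (the specific test points are my choices):

```python
import math

def f(x):
    return -math.cos(x)   # minimum at x* = 0, with f''(x*) = 1

# perturbation well below sqrt(machine epsilon): value indistinguishable from f(0)
same = (f(1e-9) == f(0.0))
# perturbation above sqrt(machine epsilon): the difference is visible
diff = (f(1e-7) != f(0.0))
```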

Unimodality
For minimizing function of one variable, need “bracket” for solution analogous to sign change for nonlinear equation
Real-valued function f is unimodal on interval [a, b] if there is unique x* ∈ [a, b] such that f(x*) is minimum of f on [a, b], and f is strictly decreasing for x ≤ x*, strictly increasing for x* ≤ x
This property enables discarding portions of interval based on sample function values, analogous to interval bisection
16

Golden Section Search
Suppose f is unimodal on [a, b], and let x1 and x2 be two points within interval, with x1 < x2
Evaluating and comparing f(x1) and f(x2), can discard either (x2, b] or [a, x1), with minimum known to lie in remaining subinterval
To repeat process, need to compute only one new function evaluation
To reduce length of interval by fixed fraction at each iteration, each new pair of points must have same relationship with respect to new interval that previous pair had with respect to previous interval
17

Golden Section Search, continued
To accomplish this, choose relative positions of two points as τ and 1 − τ, where τ² = 1 − τ, so τ = (√5 − 1)/2 ≈ 0.618 and 1 − τ ≈ 0.382
Whichever subinterval is retained, its length will be τ relative to previous interval, and interior point retained will be at position either τ or 1 − τ relative to new interval
Need to compute only one new function value, at complementary point, to continue iteration
This choice of sample points called golden section search
Golden section search is safe but convergence rate only linear, with constant C ≈ 0.618
18

Golden Section Search, continued
τ = (√5 − 1)/2
x1 = a + (1 − τ)(b − a); f1 = f(x1)
x2 = a + τ(b − a); f2 = f(x2)
while ((b − a) > tol) do
    if (f1 > f2) then
        a = x1
        x1 = x2
        f1 = f2
        x2 = a + τ(b − a)
        f2 = f(x2)
    else
        b = x2
        x2 = x1
        f2 = f1
        x1 = a + (1 − τ)(b − a)
        f1 = f(x1)
    end
end
[Figure: successive brackets a, x1, x2, b shrinking toward the minimum]
19
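The pseudocode above translates directly into Python (a sketch; the function name and tolerance are mine):

```python
import math

def golden_section_search(f, a, b, tol=1e-6):
    """Minimize unimodal f on [a, b]; one new f-evaluation per iteration."""
    tau = (math.sqrt(5) - 1) / 2              # ~0.618
    x1 = a + (1 - tau) * (b - a); f1 = f(x1)
    x2 = a + tau * (b - a);       f2 = f(x2)
    while (b - a) > tol:
        if f1 > f2:                            # minimum lies in (x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + tau * (b - a); f2 = f(x2)
        else:                                  # minimum lies in [a, x2)
            b, x2, f2 = x2, x1, f1
            x1 = a + (1 - tau) * (b - a); f1 = f(x1)
    return (a + b) / 2

# example used in the notes: f(x) = 0.5 - x*exp(-x^2) on [0, 2], minimizer 1/sqrt(2)
xstar = golden_section_search(lambda x: 0.5 - x * math.exp(-x * x), 0.0, 2.0)
```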

Example: Golden Section Search
Use golden section search to minimize
f(x) = 0.5 − x exp(−x²)

  x1      f1      x2      f2
0.764   0.074   1.236   0.232
0.472   0.122   0.764   0.074
0.764   0.074   0.944   0.113
0.652   0.074   0.764   0.074
0.584   0.085   0.652   0.074
0.652   0.074   0.695   0.071
0.695   0.071   0.721   0.071
0.679   0.072   0.695   0.071
0.695   0.071   0.705   0.071
0.705   0.071   0.711   0.071

[Figure: f plotted on [0, 2] with final bracket around x1, x2]
20

Successive Parabolic Interpolation
Fit quadratic polynomial to three function values
Take minimum of quadratic to be new approximation to minimum of function
New point replaces oldest of three previous points and process repeated until convergence
Convergence rate of successive parabolic interpolation is superlinear, with r ≈ 1.324
[Figure: parabola fitted through xk−2, xk−1, xk, with its minimum giving next iterate xk+1]
21
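A minimal, unsafeguarded Python sketch of the method (the vertex formula is the standard three-point parabola minimizer; names and tolerances are mine):

```python
import math

def parabolic_min(f, x0, x1, x2, tol=1e-6, maxit=50):
    """Successive parabolic interpolation for 1-D minimization (no safeguards)."""
    xs = [x0, x1, x2]
    fs = [f(v) for v in xs]
    for _ in range(maxit):
        a, b, c = xs
        fa, fb, fc = fs
        # vertex of the parabola through (a, fa), (b, fb), (c, fc)
        num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
        den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
        if den == 0:
            break
        x_new = b - 0.5 * num / den
        if abs(x_new - b) < tol:
            return x_new
        xs = [xs[1], xs[2], x_new]       # new point replaces oldest point
        fs = [fs[1], fs[2], f(x_new)]
    return xs[-1]

# starting points from the notes' example for f(x) = 0.5 - x*exp(-x^2)
xmin = parabolic_min(lambda x: 0.5 - x * math.exp(-x * x), 0.0, 0.6, 1.2)
```

The first computed point is 0.754, matching the table in the example that follows.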

Example: Successive Parabolic Interpolation
Use successive parabolic interpolation to minimize
f(x) = 0.5 − x exp(−x²)
xk f(xk)
0.000 0.500
0.600 0.081
1.200 0.216
0.754 0.073
0.721 0.071
0.692 0.071
0.707 0.071
[Figure: successive parabolas through points x0, x1, x3, x2 converging to the minimum]
22

Newton’s Method
Another local quadratic approximation is truncated Taylor series
f(x + h) ≈ f(x) + f′(x)h + (f″(x)/2) h²
By differentiation, minimum of this quadratic function of h is given by h = −f′(x)/f″(x)
Suggests iteration scheme
xk+1 = xk − f′(xk)/f″(xk),
which is Newton's method for solving nonlinear equation f′(x) = 0
Newton’s method for finding minimum nor- mally has quadratic convergence rate, but must be started close enough to solution
23

Example: Newton’s Method
Use Newton's method to minimize
f(x) = 0.5 − x exp(−x²)
First and second derivatives of f are given by
f′(x) = (2x² − 1) exp(−x²)
and
f″(x) = 2x(3 − 2x²) exp(−x²),
so Newton iteration for zero of f′ given by
xk+1 = xk − (2xk² − 1)/(2xk(3 − 2xk²))
Using starting guess x0 = 1, obtain

  xk      f(xk)
1.000    0.132
0.500    0.111
0.700    0.071
0.707    0.071
24
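The iteration from this example runs in a few lines of Python (a sketch; note the exponentials cancel in the Newton step):

```python
def newton_min(x, tol=1e-12, maxit=50):
    """Newton's method for f(x) = 0.5 - x*exp(-x^2): iterate on f'(x) = 0.
    Step is f'(x)/f''(x) = (2x^2 - 1) / (2x(3 - 2x^2)); exp(-x^2) cancels."""
    for _ in range(maxit):
        step = (2 * x * x - 1) / (2 * x * (3 - 2 * x * x))
        x -= step
        if abs(step) < tol:
            break
    return x

xstar = newton_min(1.0)   # iterates 1.000, 0.500, 0.700, 0.707, ... as in the table
```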

Safeguarded Methods
As with nonlinear equations in one dimension, slow-but-sure and fast-but-risky optimization methods can be combined to provide both safety and efficiency
Most library routines for one-dimensional optimization based on hybrid approach
Popular combination is golden section search and successive parabolic interpolation, for which no derivatives required
25

Direct Search Methods
Direct search methods for multidimensional optimization make no use of function values other than comparing them
For minimizing function f of n variables, Nelder-Mead method begins with n+1 starting points, forming simplex in R^n
Then move to new point along straight line from current point having highest function value through centroid of points
New point replaces worst point, and process repeated
Direct search methods useful for nonsmooth functions or for small n, but expensive for larger n
26
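A bare-bones sketch of such a simplex method in Python, patterned on the Nelder-Mead moves just described (reflection of the worst vertex through the centroid of the others, plus simple expansion, contraction, and shrink steps; this omits the full method's tie-breaking details):

```python
def nelder_mead(f, simplex, iters=500):
    """Minimal Nelder-Mead for n = 2: reflect worst vertex through the
    centroid of the other two, with simple expansion/contraction/shrink rules."""
    pts = [list(p) for p in simplex]
    for _ in range(iters):
        pts.sort(key=f)
        best, worst = pts[0], pts[2]
        c = [(pts[0][i] + pts[1][i]) / 2 for i in range(2)]        # centroid (excl. worst)
        refl = [c[i] + (c[i] - worst[i]) for i in range(2)]        # reflection
        if f(refl) < f(best):
            exp = [c[i] + 2 * (c[i] - worst[i]) for i in range(2)] # expansion
            pts[2] = exp if f(exp) < f(refl) else refl
        elif f(refl) < f(pts[1]):
            pts[2] = refl
        else:
            contr = [c[i] + 0.5 * (worst[i] - c[i]) for i in range(2)]  # contraction
            if f(contr) < f(worst):
                pts[2] = contr
            else:                                                  # shrink toward best
                pts = [best] + [[(p[i] + best[i]) / 2 for i in range(2)] for p in pts[1:]]
    pts.sort(key=f)
    return pts[0]

# minimize f(x) = 0.5*x1^2 + 2.5*x2^2 from a simplex near (5, 1)
fq = lambda p: 0.5 * p[0] ** 2 + 2.5 * p[1] ** 2
x = nelder_mead(fq, [(5.0, 1.0), (6.0, 1.0), (5.0, 2.0)])
```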

[Figure slides: contour maps]

Steepest Descent Method
Let f: R^n → R be real-valued function of n real variables
At any point x where gradient vector is nonzero, negative gradient, −∇f(x), points downhill toward lower values of function f
In fact, −∇f(x) is locally direction of steepest descent: function decreases more rapidly along direction of negative gradient than along any other
Steepest descent method: starting from initial guess x0, successive approximate solutions given by
xk+1 = xk − αk ∇f(xk),
where αk is line search parameter that determines how far to go in given direction
27

[Figure slide: steepest descent]

Steepest Descent, continued
Given descent direction, such as negative gradient, determining appropriate value for αk at each iteration is one-dimensional minimization problem
min_{αk} f(xk − αk ∇f(xk))
that can be solved by methods already discussed
Steepest descent method is very reliable: can always make progress provided gradient is nonzero
But method is myopic in its view of function's behavior, and resulting iterates can zigzag back and forth, making very slow progress toward solution
In general, convergence rate of steepest descent is only linear, with constant factor that can be arbitrarily close to 1
28
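For a quadratic f(x) = ½ xᵀAx the exact line search parameter has closed form αk = gᵀg / gᵀAg, which lets the method be sketched in a few lines of Python for the example on the next slide (helper name is mine):

```python
def steepest_descent(x, iters=50):
    """Steepest descent with exact line search for f(x) = 0.5*x1^2 + 2.5*x2^2."""
    A = (1.0, 5.0)                       # diagonal of the Hessian
    for _ in range(iters):
        g = (A[0] * x[0], A[1] * x[1])   # gradient
        gg = g[0] * g[0] + g[1] * g[1]
        gAg = A[0] * g[0] * g[0] + A[1] * g[1] * g[1]
        if gAg == 0.0:
            break
        alpha = gg / gAg                 # exact minimizer along -g
        x = (x[0] - alpha * g[0], x[1] - alpha * g[1])
    return x

x = steepest_descent((5.0, 1.0))   # first step: alpha = 1/3, giving (3.333, -0.667)
```

Each iterate shrinks by only the factor 2/3, illustrating the slow linear convergence.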

Example: Steepest Descent
Use steepest descent method to minimize
f(x) = 0.5x1² + 2.5x2²
Gradient is given by
∇f(x) = [ x1  ]
        [ 5x2 ]
Taking x0 = [5; 1], we have ∇f(x0) = [5; 5]
Performing line search along negative gradient direction, i.e.,
min_{α0} f(x0 − α0 ∇f(x0)),
exact minimum along line is given by α0 = 1/3, so next approximation is
x1 = [ 3.333 ]
     [ −0.667 ]
29

Example Continued
     xk            f(xk)      ∇f(xk)
 5.000   1.000   15.000    5.000    5.000
 3.333  −0.667    6.667    3.333   −3.333
 2.222   0.444    2.963    2.222    2.222
 1.481  −0.296    1.317    1.481   −1.481
 0.988   0.198    0.585    0.988    0.988
 0.658  −0.132    0.260    0.658   −0.658
 0.439   0.088    0.116    0.439    0.439
 0.293  −0.059    0.051    0.293   −0.293
 0.195   0.039    0.023    0.195    0.195
 0.130  −0.026    0.010    0.130   −0.130

[Figure: zigzag path of steepest descent iterates on contours of f]
30

Newton’s Method
Broader view can be obtained by local quadratic approximation, which is equivalent to Newton's method
In multidimensional optimization, we seek zero of gradient, so Newton iteration has form
xk+1 = xk − Hf⁻¹(xk) ∇f(xk),
where Hf(x) is Hessian matrix of second partial derivatives of f,
{Hf(x)}ij = ∂²f(x)/∂xi∂xj
31

Newton’s Method, continued
Do not explicitly invert Hessian matrix, but instead solve linear system
Hf(xk) sk = −∇f(xk)
for sk, then take as next iterate
xk+1 = xk + sk
Convergence rate of Newton's method for minimization is normally quadratic
As usual, Newton’s method is unreliable unless started close enough to solution
32

Example: Newton’s Method
Use Newton's method to minimize
f(x) = 0.5x1² + 2.5x2²
Gradient and Hessian are given by
∇f(x) = [ x1  ]   and   Hf(x) = [ 1  0 ]
        [ 5x2 ]                 [ 0  5 ]
Taking x0 = [5; 1], we have ∇f(x0) = [5; 5]
Linear system for Newton step is
[ 1  0 ] s0 = [ −5 ],
[ 0  5 ]      [ −5 ]
so next approximate solution is
x1 = x0 + s0 = [ 5 ] + [ −5 ] = [ 0 ],
               [ 1 ]   [ −1 ]   [ 0 ]
which is exact solution for this problem, as expected for quadratic function
33
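For this example the "solve, don't invert" advice is trivial to follow, since the Hessian is diagonal (a sketch with the problem's exact Hessian hard-coded; the function name is mine):

```python
def newton_step(x):
    """One Newton step for f(x) = 0.5*x1^2 + 2.5*x2^2, whose Hessian is diag(1, 5)."""
    g = [x[0], 5.0 * x[1]]             # gradient
    s = [-g[0] / 1.0, -g[1] / 5.0]     # solve Hf * s = -grad f (diagonal solve)
    return [x[0] + s[0], x[1] + s[1]]

x1 = newton_step([5.0, 1.0])   # reaches the exact solution [0, 0] in one step
```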

Newton’s Method, continued
Line search parameter unnecessary with Newton's method, since quadratic model determines length, as well as direction, of step to next approximate solution
When started far from solution, however, it may still be advisable to perform line search along direction of Newton step sk to make method more robust (damped Newton)
Once iterates are near solution, then αk = 1 should suffice for subsequent iterations
34

Newton’s Method, continued
If objective function f has continuous second partial derivatives, then Hessian matrix Hf is symmetric, and near minimum it is positive definite
Thus, linear system for step to next iterate can be solved in only about half of work required for LU factorization
Far from minimum, Hf(xk) may not be positive definite, so Newton step sk may not even be descent direction for function, i.e., we may not have
∇f(xk)ᵀ sk < 0
In this case, alternative descent direction can be computed, such as negative gradient or direction of negative curvature, and line search performed
35

Trust Region Methods
Alternative to line search is trust region method, in which approximate solution is constrained to lie within region where quadratic model is sufficiently accurate
If current trust radius is binding, minimizing quadratic model function subject to this constraint may modify direction as well as length of Newton step
Accuracy of quadratic model is assessed by comparing actual decrease in objective function with that predicted by quadratic model, and trust radius is increased or decreased accordingly
36

Trust Region Methods, continued
[Figure: Newton step vs. trust-region step, showing trust radius, contours of quadratic model, and negative gradient direction]
37

Quasi-Newton Methods
Newton's method costs O(n³) arithmetic and O(n²) scalar function evaluations per iteration for dense problem
Many variants of Newton's method improve reliability and reduce overhead
Quasi-Newton methods have form
xk+1 = xk − αk Bk⁻¹ ∇f(xk),
where αk is line search parameter and Bk is approximation to Hessian matrix
Many quasi-Newton methods are more robust than Newton's method, superlinearly convergent, and have lower overhead per iteration, which often more than offsets their slower convergence rate
38

Secant Updating Methods
Could use Broyden's method to seek zero of gradient, but this would not preserve symmetry of Hessian matrix
Several secant updating formulas have been developed for minimization that not only preserve symmetry in approximate Hessian matrix, but also preserve positive definiteness
Symmetry reduces amount of work required by about half, while positive definiteness guarantees that quasi-Newton step will be descent direction
39

BFGS Method
One of most effective secant updating methods for minimization is BFGS:
x0 = initial guess
B0 = initial Hessian approximation
for k = 0, 1, 2, ...
    Solve Bk sk = −∇f(xk) for sk
    xk+1 = xk + sk
    yk = ∇f(xk+1) − ∇f(xk)
    Bk+1 = Bk + (yk ykᵀ)/(ykᵀ sk) − (Bk sk skᵀ Bk)/(skᵀ Bk sk)
end
40

BFGS Method, continued
In practice, factorization of Bk is updated rather than Bk itself, so linear system for sk can be solved at cost of O(n²) rather than O(n³) work
Unlike Newton's method for minimization, no second derivatives are required
Can start with B0 = I, so initial step is along negative gradient, and then second derivative information is gradually built up in approximate Hessian matrix over successive iterations
BFGS normally has superlinear convergence rate, even though approximate Hessian does not necessarily converge to true Hessian
Line search can be used to enhance effectiveness of method
41

Example: BFGS Method
Use BFGS to minimize
f(x) = 0.5x1² + 2.5x2²
Gradient is given by
∇f(x) = [ x1  ]
        [ 5x2 ]
Taking x0 = [5  1]ᵀ and B0 = I, initial step is negative gradient, so
x1 = x0 + s0 = [ 5 ] + [ −5 ] = [  0 ]
               [ 1 ]   [ −5 ]   [ −4 ]
Updating approximate Hessian using BFGS formula, obtain
B1 = [ 0.667   0.333 ]
     [ 0.333   4.667 ]
Then new step is computed and process is repeated
42

Example: BFGS Method
     xk            f(xk)      ∇f(xk)
 5.000   1.000   15.000    5.000    5.000
 0.000  −4.000   40.000    0.000  −20.000
−2.222   0.444    2.963   −2.222    2.222
 0.816   0.082    0.350    0.816    0.408
−0.009  −0.015    0.001   −0.009   −0.077
−0.001   0.001    0.000   −0.001    0.005
Increase in function value can be avoided by using line search, which generally enhances convergence
For quadratic objective function, BFGS with exact line search finds exact solution in at most n iterations, where n is dimension of problem
43

Conjugate Gradient Method
Another method that does not require explicit second derivatives, and does not even store approximation to Hessian matrix, is conjugate gradient (CG) method
CG generates sequence of conjugate search directions, implicitly accumulating information about Hessian matrix
For quadratic objective function, CG is theoretically exact after at most n iterations, where n is dimension of problem
CG is effective
for general unconstrained minimization as well
44

Conjugate Gradient Method, continued
x0 = initial guess
g0 = ∇f(x0)
s0 = −g0
for k = 0, 1, 2, ...
    Choose αk to minimize f(xk + αk sk)
    xk+1 = xk + αk sk
    gk+1 = ∇f(xk+1)
    βk+1 = (gk+1ᵀ gk+1)/(gkᵀ gk)
    sk+1 = −gk+1 + βk+1 sk
end
Alternative formula for βk+1 is
βk+1 = ((gk+1 − gk)ᵀ gk+1)/(gkᵀ gk)
45

Example: Conjugate Gradient Method
Use CG method to minimize
f(x) = 0.5x1² + 2.5x2²
Gradient is given by
∇f(x) = [ x1  ]
        [ 5x2 ]
Taking x0 = [5  1]ᵀ, initial search direction is negative gradient,
s0 = −g0 = −∇f(x0) = [ −5 ]
                     [ −5 ]
Exact minimum along line is given by α0 = 1/3, so next approximation is x1 = [3.333  −0.667]ᵀ, and we compute new gradient,
g1 = ∇f(x1) = [  3.333 ]
              [ −3.333 ]
46

Example Continued
So far there is no difference from steepest descent method
At this point, however, rather than search along new negative gradient, we compute instead
β1 = (g1ᵀ g1)/(g0ᵀ g0) = 0.444,
which gives as next search direction
s1 = −g1 + β1 s0 = [ −3.333 ] + 0.444 [ −5 ] = [ −5.556 ]
                   [  3.333 ]         [ −5 ]   [  1.111 ]
Minimum along this direction is given by α1 = 0.6, which gives exact solution at origin, as expected for quadratic function
47

Truncated Newton Methods
Another way to reduce work in Newton-like methods is to solve linear system for Newton step by iterative method
Small number of iterations may suffice to produce step as useful as true Newton step, especially far from overall solution, where true Newton step may be unreliable anyway
Good choice for linear iterative solver is CG method, which gives step intermediate between steepest descent and Newton-like step
Since only matrix-vector products are required, explicit formation of Hessian matrix can be avoided by using finite difference of gradient along given vector
48

Nonlinear Least Squares
Given data (ti, yi), find vector x of parameters that gives "best fit" in least squares sense to model function f(t, x)
If we define components of residual function
ri(x) = yi − f(ti, x),  i = 1, ..., m,
then we want to minimize
φ(x) = ½ rᵀ(x) r(x)
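As a concrete sketch of minimizing such a sum of squares, here is the Gauss-Newton iteration (developed on the following slides) for the exponential-model example f(t, x) = x1 exp(x2 t) with the data used later in these notes, solving the 2×2 normal equations directly (pure Python; names are mine):

```python
import math

t = [0.0, 1.0, 2.0, 3.0]
y = [2.0, 0.7, 0.3, 0.1]

def gauss_newton(x1, x2, iters=10):
    """Gauss-Newton for model f(t,x) = x1*exp(x2*t), residuals r_i = y_i - f(t_i, x)."""
    for _ in range(iters):
        r = [yi - x1 * math.exp(x2 * ti) for ti, yi in zip(t, y)]
        # Jacobian of r: dr_i/dx1 = -exp(x2*t_i), dr_i/dx2 = -x1*t_i*exp(x2*t_i)
        J = [(-math.exp(x2 * ti), -x1 * ti * math.exp(x2 * ti)) for ti in t]
        # normal equations J^T J s = -J^T r, solved as a 2x2 system
        a = sum(j1 * j1 for j1, _ in J)
        b = sum(j1 * j2 for j1, j2 in J)
        c = sum(j2 * j2 for _, j2 in J)
        g1 = -sum(j1 * ri for (j1, _), ri in zip(J, r))
        g2 = -sum(j2 * ri for (_, j2), ri in zip(J, r))
        det = a * c - b * b
        x1, x2 = x1 + (c * g1 - b * g2) / det, x2 + (a * g2 - b * g1) / det
    return x1, x2

p1, p2 = gauss_newton(1.0, 0.0)   # first step is s0 = (0.69, -0.61)
```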
Gradient vector of φ is
∇φ(x) = Jᵀ(x) r(x),
and Hessian matrix is
Hφ(x) = Jᵀ(x) J(x) + Σ_{i=1}^m ri(x) Hi(x),
where J(x) is Jacobian of r(x), and Hi(x) is Hessian of ri(x)
49

Nonlinear Least Squares, continued
Linear system for Newton step is
( Jᵀ(xk) J(xk) + Σ_{i=1}^m ri(xk) Hi(xk) ) sk = −Jᵀ(xk) r(xk)
m Hessian matrices Hi are usually inconvenient and expensive to compute
Moreover, in Hφ each Hi is multiplied by residual component ri, which is small at solution if fit of model function to data is good
50

Gauss-Newton Method
This motivates Gauss-Newton method for nonlinear least squares, in which second-order term is dropped and linear system
Jᵀ(xk) J(xk) sk = −Jᵀ(xk) r(xk)
is solved for approximate Newton step sk at each iteration
This is system of normal equations for linear least squares problem
J(xk) sk ≅ −r(xk),
which can be solved more reliably by orthogonal factorization
Next approximate solution is then given by
xk+1 = xk + sk,
and process is repeated until convergence
51

Example: Gauss-Newton Method
Use Gauss-Newton method to fit nonlinear model function
f(t, x) = x1 exp(x2 t)
to data
t   0.0   1.0   2.0   3.0
y   2.0   0.7   0.3   0.1
For this model function, entries of Jacobian matrix of residual function r are given by
{J(x)}i,1 = ∂ri(x)/∂x1 = −exp(x2 ti),
{J(x)}i,2 = ∂ri(x)/∂x2 = −x1 ti exp(x2 ti)
52

Example Continued
If we take x0 = [1  0]ᵀ, then Gauss-Newton step s0 is given by linear least squares problem
[ −1   0 ]        [ −1  ]
[ −1  −1 ] s0 ≅  [ 0.3 ]
[ −1  −2 ]        [ 0.7 ]
[ −1  −3 ]        [ 0.9 ]
whose solution is
s0 = [  0.69 ]
     [ −0.61 ]
Then next approximate solution is given by x1 = x0 + s0, and process is repeated until convergence

     xk           ‖r(xk)‖₂²
1.000    0.000    2.390
1.690   −0.610    0.212
1.975   −0.930    0.007
1.994   −1.004    0.002
1.995   −1.009    0.002
1.995   −1.010    0.002
53

Gauss-Newton Method, continued
Gauss-Newton method replaces nonlinear least squares problem by sequence of linear least squares problems whose solutions converge to solution of original nonlinear problem
If residual at solution is large, then second-order term omitted from Hessian is not negligible, and Gauss-Newton method may converge slowly or fail to converge
In such "large-residual" cases, it may be best to use general nonlinear minimization method that takes into account true full Hessian matrix
54

Levenberg-Marquardt Method
Levenberg-Marquardt method is another useful alternative when Gauss-Newton approximation is inadequate or yields rank deficient linear least squares subproblem
In this method, linear system at each iteration is of form
(Jᵀ(xk) J(xk) + μk I) sk = −Jᵀ(xk) r(xk),
where μk is scalar parameter chosen by some strategy
Corresponding linear least squares problem is
[ J(xk)  ]        [ −r(xk) ]
[ √μk I ] sk ≅  [    o    ]
With suitable strategy for choosing μk, this method can be very robust in practice, and it forms basis for several effective software packages
55

Equality-Constrained Optimization
For equality-constrained minimization problem
min f(x) subject to g(x) = o,
where f: R^n → R and g: R^n → R^m, with m ≤ n, we seek critical point of Lagrangian
Applying Newton's method to nonlinear system
∇L(x, λ) = [ ∇f(x) + Jgᵀ(x)λ ] = o,
           [ g(x)            ]
obtain linear system
[ B(x, λ)   Jgᵀ(x) ] [ s ]     [ ∇f(x) + Jgᵀ(x)λ ]
[ Jg(x)     O      ] [ δ ] = − [ g(x)            ]
for Newton step (s, δ) in (x, λ) at each iteration
56

Sequential Quadratic Programming
Foregoing block 2 × 2 linear system equivalent to quadratic programming problem, so this approach known as sequential quadratic programming
Types of solution methods include:
• Direct solution methods, in which entire block 2 × 2 system is solved directly
• Range space methods, based on block elimination in block 2 × 2 linear system
• Null space methods, based on orthogonal factorization of matrix of constraint normals, Jgᵀ(x)
57

Merit Function
Once Newton step (s, δ) determined, need merit function to measure progress toward overall solution for use in line search or trust region
Popular choices include penalty function
φρ(x) = f(x) + ½ ρ g(x)ᵀ g(x)
and augmented Lagrangian function
Lρ(x, λ) = f(x) + λᵀ g(x) + ½ ρ g(x)ᵀ g(x),
where parameter ρ > 0 determines weighting of optimality vs feasibility
Given starting guess x0, good starting guess for λ0 can be obtained from least squares problem
Jgᵀ(x0) λ0 ≅ −∇f(x0)
58

Inequality-Constrained Optimization
Methods just outlined for equality constraints can be extended to handle inequality constraints by using active set strategy
Inequality constraints are provisionally divided into those that are satisfied already (and can therefore be temporarily disregarded) and those that are violated (and are therefore temporarily treated as equality constraints)
This division of constraints is revised as iterations proceed until eventually correct constraints are identified that are binding at solution
59

Penalty Methods
Merit function can also be used to convert equality-constrained problem into sequence of unconstrained problems
If x*ρ is solution to
min_x φρ(x) = f(x) + ½ ρ g(x)ᵀ g(x),
then under appropriate conditions
lim_{ρ→∞} x*ρ = x*
This enables use of unconstrained optimization methods, but problem becomes increasingly ill-conditioned for large ρ, so solve sequence of problems with increasing values of ρ
60
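A sketch of the penalty approach on the equality-constrained example solved later in these notes (min 0.5x1² + 2.5x2² subject to x1 − x2 = 1): for each ρ the unconstrained minimizer of φρ satisfies a 2×2 linear system, and it approaches the constrained solution (5/6, −1/6) as ρ grows (function name and the sample value ρ = 10⁶ are mine):

```python
def penalty_min(rho):
    """Minimize 0.5*x1^2 + 2.5*x2^2 + 0.5*rho*(x1 - x2 - 1)^2 exactly.
    Setting the gradient to zero gives the linear system
      (1 + rho)*x1 -     rho*x2 =  rho
         -rho*x1 + (5 + rho)*x2 = -rho"""
    a11, a12, b1 = 1 + rho, -rho, rho
    a21, a22, b2 = -rho, 5 + rho, -rho
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - b1 * a21) / det

x1, x2 = penalty_min(1e6)   # close to the constrained minimizer (0.833, -0.167)
```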

Barrier Methods
For inequality-constrained problems, another alternative is barrier function, such as
φμ(x) = f(x) − μ Σ_{i=1}^p 1/hi(x)
or
φμ(x) = f(x) − μ Σ_{i=1}^p log(−hi(x)),
which increasingly penalize feasible points as they approach boundary of feasible region
Again, solutions of unconstrained problem approach x* as μ → 0, but problems increasingly ill-conditioned, so solve sequence of problems with decreasing values of μ
Barrier functions are basis for interior point methods for linear programming
61

Example: Constrained Optimization
Consider quadratic programming problem
min_x f(x) = 0.5x1² + 2.5x2²,
subject to
g(x) = x1 − x2 − 1 = 0,
with Lagrangian function given by
L(x, λ) = f(x) + λ g(x)
        = 0.5x1² + 2.5x2² + λ(x1 − x2 − 1)
Since
∇f(x) = [ x1  ]   and   Jg(x) = [ 1  −1 ],
        [ 5x2 ]
we have
∇xL(x, λ) = ∇f(x) + Jgᵀ(x)λ = [ x1  ] + λ [  1 ]
                              [ 5x2 ]     [ −1 ]
62

Example Continued
So system to be solved for critical point of Lagrangian is
x1 + λ = 0,
5x2 − λ = 0,
x1 − x2 = 1,
which in this case is linear system
[ 1   0   1 ] [ x1 ]   [ 0 ]
[ 0   5  −1 ] [ x2 ] = [ 0 ]
[ 1  −1   0 ] [ λ  ]   [ 1 ]
Solving this system, we obtain solution
x1 = 0.833,  x2 = −0.167,  λ = −0.833
63
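The 3×3 system above can be solved in a few lines of Python (the Gaussian-elimination helper is mine, for illustration; any linear solver would do):

```python
def gauss_solve(A, b):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))   # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                 # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

# KKT system for min 0.5*x1^2 + 2.5*x2^2 subject to x1 - x2 = 1
sol = gauss_solve([[1, 0, 1], [0, 5, -1], [1, -1, 0]], [0, 0, 1])
# sol is [x1, x2, lambda], close to [0.833, -0.167, -0.833]
```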

Example Continued
[Figure: contours of 0.5x1² + 2.5x2² and constraint line x1 − x2 = 1, with constrained minimum marked]
64

Linear Programming
One of most important and common constrained optimization problems is linear programming
One standard form for such problems is
min f(x) = cᵀx subject to Ax = b and x ≥ o,
where m