Lecture Notes to Accompany
Scientific Computing
An Introductory Survey
Second Edition
by Michael T. Heath
Chapter 6
Optimization
Copyright © 2001. Reproduction permitted only for noncommercial, educational use in conjunction with the book.
1
Optimization
Given function f: Rⁿ → R, and set S ⊆ Rⁿ, find x* ∈ S such that f(x*) ≤ f(x) for all x ∈ S
x* called minimizer or minimum of f
Suffices to consider only minimization, since maximum of f is minimum of −f
Objective function f usually differentiable, may be linear or nonlinear
Constraint set S defined by system of equations and inequalities that may be linear or nonlinear
Points x ∈ S called feasible points
If S = Rⁿ, problem is unconstrained
2
Optimization Problems
General continuous optimization problem:
min f(x) subject to g(x) = o and h(x) ≤ o,
where f: Rⁿ → R, g: Rⁿ → Rᵐ, h: Rⁿ → Rᵖ
Linear programming: f, g, and h all linear
Nonlinear programming: nonlinear objective or nonlinear constraints, or both
3
Examples: Optimization Problems
Minimize weight of structure subject to constraint on its strength, or maximize its strength subject to constraint on its weight
Minimize cost of diet subject to nutritional constraints
Minimize surface area of cylinder subject to constraint on its volume:
min_{x1,x2} f(x1, x2) = 2π x1 (x1 + x2)
subject to g(x1, x2) = π x1² x2 − V = 0,
where x1 and x2 are radius and height of cylinder, and V is required volume
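As a concrete illustration, a minimal Python sketch (assuming SciPy's SLSQP solver and an arbitrary required volume V = 1) of solving this constrained problem numerically:

import numpy as np
from scipy.optimize import minimize

V = 1.0  # required volume (assumed value for illustration)

def surface(x):            # f(x1, x2) = 2*pi*x1*(x1 + x2)
    r, h = x
    return 2 * np.pi * r * (r + h)

def volume_constraint(x):  # g(x1, x2) = pi*x1^2*x2 - V = 0
    r, h = x
    return np.pi * r**2 * h - V

res = minimize(surface, x0=[1.0, 1.0], method="SLSQP",
               constraints=[{"type": "eq", "fun": volume_constraint}])
print(res.x)   # optimal radius and height; analytically the optimum has h = 2*r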
4
Local vs Global Optimization
x* ∈ S is global minimum if f(x*) ≤ f(x) for all x ∈ S
x* ∈ S is local minimum if f(x*) ≤ f(x) for all feasible x in some neighborhood of x*
[Figure: curve with a local minimum and the global minimum labeled]
5
Global Optimization
Finding, or even verifying, global minimum is difficult, in general
Most optimization methods designed to find local minimum, which may or may not be global minimum
If global minimum desired, can try several widely separated starting points and see if all produce same result
For some problems, such as linear programming, global optimization is tractable
6
Existence of Minimum
If f is continuous on closed and bounded set S ⊆ Rⁿ, then f has global minimum on S
If S is not closed or is unbounded, then f may have no local or global minimum on S
Continuous function f on unbounded set S ⊆ Rⁿ is coercive if
lim_{‖x‖→∞} f(x) = +∞,
i.e., f(x) must be large whenever ‖x‖ is large
If f coercive on closed, unbounded set S ⊆ Rⁿ, then f has global minimum on S
7
Level Sets
Level set for function f: S ⊆ Rⁿ → R is set of all points in S for which f has some given constant value (also called contour line)
For given γ ∈ R, sublevel set is
Lγ = {x ∈ S : f(x) ≤ γ}
If continuous function f on S ⊆ Rⁿ has nonempty sublevel set that is closed and bounded, then f has global minimum on S
If S is unbounded, then f is coercive on S if, and only if, all its sublevel sets are bounded
8
Uniqueness of Minimum
Set S ⊆ Rⁿ is convex if it contains line segment between any two of its points
Function f: S ⊆ Rⁿ → R is convex on convex set S if its graph along any line segment in S lies on or below chord connecting function values at endpoints of segment
Any local minimum of convex function f on convex set S ⊆ Rⁿ is global minimum of f on S
Any local minimum of strictly convex function f on convex set S ⊆ Rⁿ is unique global minimum of f on S
9
First-Order Optimality Condition
For function of one variable, find extremum by differentiating function and setting derivative to zero
Generalization to function of n variables is to find critical point, i.e., solution of nonlinear system
∇f(x) = o,
where ∇f(x) is gradient vector of f, whose ith component is ∂f(x)/∂xi
For continuously differentiable f: S ⊆ Rⁿ → R, any interior point x* of S at which f has local minimum must be critical point of f
But not all critical points are minima: can also be maximum or saddle point
10
Second-Order Optimality Condition
For twice continuously differentiable f: S ⊆ Rⁿ → R, can distinguish among critical points by considering Hessian matrix Hf(x) defined by
{Hf(x)}ij = ∂²f(x) / ∂xi∂xj,
which is symmetric
At critical point x*, if Hf(x*) is
• positive definite, then x* is minimum of f
• negative definite, then x* is maximum of f
• indefinite, then x* is saddle point of f
• singular, then various pathological situations possible
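A minimal NumPy sketch of this classification, checking the signs of the Hessian's eigenvalues at a critical point (the example matrix is the Hessian of z = x² − y², which has a saddle point at the origin):

import numpy as np

def classify_critical_point(hessian, tol=1e-12):
    """Classify a critical point from the eigenvalues of the (symmetric) Hessian."""
    eigvals = np.linalg.eigvalsh(hessian)
    if np.any(np.abs(eigvals) <= tol):
        return "singular (inconclusive)"
    if np.all(eigvals > 0):
        return "minimum (positive definite)"
    if np.all(eigvals < 0):
        return "maximum (negative definite)"
    return "saddle point (indefinite)"

# Hessian of z = x^2 - y^2 at its critical point (0, 0)
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))   # saddle point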
11
[Figure: surface z = x² + y² has a minimum]
[Figure: surface z = x² − y² has a saddle point]
Constrained Optimality
If problem is constrained, need be concerned only with feasible directions
For equality-constrained problem
min f(x) subject to g(x) = o,
where f: Rⁿ → R and g: Rⁿ → Rᵐ, with m ≤ n, necessary condition for feasible point x* to be solution is that negative gradient of f lie in space spanned by constraint normals, i.e.,
−∇f(x*) = Jg^T(x*) λ,
where Jg is Jacobian matrix of g, and λ is vector of Lagrange multipliers
This condition says we cannot reduce objective function without violating constraints
12
Constrained Optimality, continued
Lagrangian function L: Rⁿ⁺ᵐ → R, defined by
L(x, λ) = f(x) + λ^T g(x),
and its gradient and Hessian given by
∇L(x, λ) = [ ∇f(x) + Jg^T(x) λ ;  g(x) ]
and
HL(x, λ) = [ B(x, λ)   Jg^T(x) ;  Jg(x)   O ],
where
B(x, λ) = Hf(x) + Σ_{i=1}^{m} λi Hgi(x)
Together, necessary condition and feasibility imply critical point of Lagrangian function,
∇L(x, λ) = [ ∇f(x) + Jg^T(x) λ ;  g(x) ] = o
13
Constrained Optimality, continued
Hessian of Lagrangian is symmetric, but not positive definite, so critical point of L is saddle point rather than minimum or maximum
Critical point (x*, λ*) of L is constrained minimum of f if B(x*, λ*) is positive definite on null space of Jg(x*)
If columns of Z form basis for null space, then test projected Hessian Z^T B Z for positive definiteness
If inequalities present, then KKT optimality conditions also require nonnegativity of Lagrange multipliers corresponding to inequalities, and complementarity condition
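A minimal NumPy sketch of the projected Hessian test: compute an orthonormal null-space basis Z of Jg from the SVD and check the eigenvalues of Z^T B Z (the matrices B and Jg below are hypothetical illustrative data):

import numpy as np

def projected_hessian_is_spd(B, Jg, tol=1e-12):
    """Check whether B is positive definite on the null space of Jg."""
    # Orthonormal null-space basis: rows of Vt beyond rank(Jg), transposed to columns
    _, s, Vt = np.linalg.svd(Jg)
    rank = int(np.sum(s > tol * max(Jg.shape) * (s[0] if s.size else 1.0)))
    Z = Vt[rank:].T
    if Z.shape[1] == 0:                         # trivial null space: condition holds vacuously
        return True
    eigvals = np.linalg.eigvalsh(Z.T @ B @ Z)   # projected Hessian Z^T B Z
    return bool(np.all(eigvals > tol))

# Hypothetical data: B indefinite, but positive definite on null(Jg) = span{(1, 0)}
B = np.array([[2.0, 0.0], [0.0, -1.0]])
Jg = np.array([[0.0, 1.0]])
print(projected_hessian_is_spd(B, Jg))   # True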
14
Sensitivity and Conditioning
Although problems of minimizing function and solving equation are closely related, their sensitivities differ
In one dimension, absolute condition number of root x* of equation f(x) = 0 is 1/|f′(x*)|, so if |f(x̂)| ≤ ε, then |x̂ − x*| may be as large as ε/|f′(x*)|
For minimizing f, Taylor series expansion
f(x̂) = f(x* + h) = f(x*) + f′(x*)h + (1/2) f″(x*)h² + O(h³)
shows that, since f′(x*) = 0, if |f(x̂) − f(x*)| ≤ ε, then |x̂ − x*| may be as large as √(2ε/|f″(x*)|)
Thus, based on function values alone, minima can be computed to only about half precision
15
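A small Python illustration of this effect, using the example function f(x) = 0.5 − x exp(−x²) that appears later in these notes, whose minimizer is x* = 1/√2: perturbations of x on the order of √ε change f by only about rounding level.

import numpy as np

f = lambda x: 0.5 - x * np.exp(-x**2)
xstar = 1.0 / np.sqrt(2.0)                 # true minimizer of f
for h in [1e-4, 1e-6, 1e-8]:
    print(f"h = {h:.0e},  f(x*+h) - f(x*) = {f(xstar + h) - f(xstar):.2e}")
# The difference shrinks like h^2; for h near 1e-8 it is comparable to rounding
# error in evaluating f, so function values alone cannot locate x* much more
# accurately than about sqrt(machine epsilon).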
Unimodality
For minimizing function of one variable, need “bracket” for solution analogous to sign change for nonlinear equation
Real-valued function f is unimodal on interval [a, b] if there is unique x* ∈ [a, b] such that f(x*) is minimum of f on [a, b], and f is strictly decreasing for x ≤ x*, strictly increasing for x* ≤ x
This property enables discarding portions of interval based on sample function values, analogous to interval bisection
16
Golden Section Search
Suppose f is unimodal on [a, b], and let x1 and x2 be two points within interval, with x1 < x2
Evaluating and comparing f(x1) and f(x2), can discard either (x2, b] or [a, x1), with minimum known to lie in remaining subinterval
To repeat process, need to compute only one new function evaluation
To reduce length of interval by fixed fraction at each iteration, each new pair of points must have same relationship with respect to new interval that previous pair had with respect to previous interval
17
Golden Section Search, continued
To accomplish this, choose relative positions of two points as τ and 1 − τ, where τ² = 1 − τ, so τ = (√5 − 1)/2 ≈ 0.618 and 1 − τ ≈ 0.382
Whichever subinterval is retained, its length will be τ relative to previous interval, and interior point retained will be at position either τ or 1 − τ relative to new interval
Need to compute only one new function value, at complementary point, to continue iteration
This choice of sample points called golden section search
Golden section search is safe but convergence rate only linear, with constant C ≈ 0.618
18
Golden Section Search, continued
τ = (√5 − 1)/2
x1 = a + (1 − τ)(b − a);  f1 = f(x1)
x2 = a + τ(b − a);  f2 = f(x2)
while ((b − a) > tol) do
    if (f1 > f2) then
        a = x1
        x1 = x2
        f1 = f2
        x2 = a + τ(b − a)
        f2 = f(x2)
    else
        b = x2
        x2 = x1
        f2 = f1
        x1 = a + (1 − τ)(b − a)
        f1 = f(x1)
    end
end
[Figure: successive bracketing intervals [a, b] with interior points x1, x2, each new interval a fraction τ of the previous one]
19
Example: Golden Section Search
Use golden section search to minimize f(x) = 0.5 − x exp(−x²)
  x1      f1      x2      f2
0.764   0.074   1.236   0.232
0.472   0.122   0.764   0.074
0.764   0.074   0.944   0.113
0.652   0.074   0.764   0.074
0.584   0.085   0.652   0.074
0.652   0.074   0.695   0.071
0.695   0.071   0.721   0.071
0.679   0.072   0.695   0.071
0.695   0.071   0.705   0.071
0.705   0.071   0.711   0.071
[Figure: plot of f(x) = 0.5 − x exp(−x²) on [0, 2] with sample points x1, x2]
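A minimal Python version of the golden section search pseudocode above, applied to this example; it prints the bracketing points x1, f1, x2, f2 as in the table:

import math

def golden_section_search(f, a, b, tol=1e-3):
    """Golden section search for the minimum of a unimodal f on [a, b]."""
    tau = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = a + (1 - tau) * (b - a); f1 = f(x1)
    x2 = a + tau * (b - a);       f2 = f(x2)
    while (b - a) > tol:
        print(f"{x1:.3f}  {f1:.3f}  {x2:.3f}  {f2:.3f}")
        if f1 > f2:                        # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + tau * (b - a); f2 = f(x2)
        else:                              # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = a + (1 - tau) * (b - a); f1 = f(x1)
    return (a + b) / 2.0

f = lambda x: 0.5 - x * math.exp(-x**2)
print(golden_section_search(f, 0.0, 2.0))   # approx 0.707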
20
Successive Parabolic Interpolation
Fit quadratic polynomial to three function values
Take minimum of quadratic to be new approximation to minimum of function
New point replaces oldest of three previous points and process repeated until convergence
Convergence rate of successive parabolic interpolation is superlinear, with r ≈ 1.324
[Figure: parabola fitted through x_{k−2}, x_{k−1}, x_k, with its minimum giving the next iterate x_{k+1}]
21
Example: Successive Parabolic Interpolation
Use successive parabolic interpolation to minimize f(x) = 0.5 − x exp(−x²)
xk f(xk)
0.000 0.500
0.600 0.081
1.200 0.216
0.754 0.073
0.721 0.071
0.692 0.071
0.707 0.071
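A minimal Python sketch of successive parabolic interpolation (using numpy.polyfit for the quadratic fit), started from the same three points x0 = 0, x1 = 0.6, x2 = 1.2 as in the table:

import numpy as np

def parabolic_interpolation(f, x0, x1, x2, niter=4):
    """Successive parabolic interpolation: new point replaces the oldest point."""
    xs = [x0, x1, x2]                               # kept in order oldest -> newest
    for _ in range(niter):
        p = np.polyfit(xs, [f(x) for x in xs], 2)   # quadratic through the 3 points
        x_new = -p[1] / (2.0 * p[0])                # vertex of the fitted parabola
        xs = xs[1:] + [x_new]                       # discard oldest point
        print(f"{x_new:.3f}  {f(x_new):.3f}")
    return xs[-1]

f = lambda x: 0.5 - x * np.exp(-x**2)
parabolic_interpolation(f, 0.0, 0.6, 1.2)   # iterates approach 0.707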
[Figure: plot of f(x) = 0.5 − x exp(−x²) showing points x0, x1, x3, x2 and the fitted parabola]
22
Newton’s Method
Another local quadratic approximation is truncated Taylor series
f(x + h) ≈ f(x) + f′(x)h + (f″(x)/2) h²
By differentiation, minimum of this quadratic function of h is given by h = −f′(x)/f″(x)
Suggests iteration scheme
xk+1 = xk − f′(xk)/f″(xk),
which is Newton's method for solving nonlinear equation f′(x) = 0
Newton's method for finding minimum normally has quadratic convergence rate, but must be started close enough to solution
23
Example: Newton’s Method
Use Newton's method to minimize f(x) = 0.5 − x exp(−x²)
First and second derivatives of f are given by
f′(x) = (2x² − 1) exp(−x²)
and
f″(x) = 2x(3 − 2x²) exp(−x²),
so Newton iteration for zero of f′ given by
xk+1 = xk − (2xk² − 1) / (2xk(3 − 2xk²))
Using starting guess x0 = 1, obtain
  xk      f(xk)
1.000    0.132
0.500    0.111
0.700    0.071
0.707    0.071
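A minimal Python sketch of this Newton iteration, reproducing the iterates in the table:

import math

def newton_1d(x, niter=3):
    """Newton's method for minimizing f(x) = 0.5 - x*exp(-x^2) via f'(x) = 0."""
    f = lambda t: 0.5 - t * math.exp(-t**2)
    for _ in range(niter):
        print(f"{x:.3f}  {f(x):.3f}")
        fp  = (2 * x**2 - 1) * math.exp(-x**2)          # f'(x)
        fpp = 2 * x * (3 - 2 * x**2) * math.exp(-x**2)  # f''(x)
        x = x - fp / fpp
    print(f"{x:.3f}  {f(x):.3f}")
    return x

newton_1d(1.0)   # 1.000, 0.500, 0.700, 0.707 -> converges to 1/sqrt(2)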
24
Safeguarded Methods
As with nonlinear equations in one dimension, slow-but-sure and fast-but-risky optimization methods can be combined to provide both safety and efficiency
Most library routines for one-dimensional optimization based on hybrid approach
Popular combination is golden section search and successive parabolic interpolation, for which no derivatives required
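For example, SciPy's scalar minimizer uses Brent's method, which combines golden section search with successive parabolic interpolation; a brief sketch on the example function used earlier:

import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: 0.5 - x * np.exp(-x**2)

# Brent's method: golden section search safeguarding parabolic interpolation
res = minimize_scalar(f, bracket=(0.0, 2.0), method="brent")
print(res.x)   # should be approximately 0.7071, i.e. 1/sqrt(2)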
25
Direct Search Methods
Direct search methods for multidimensional optimization make no use of function values other than comparing them
For minimizing function f of n variables, Nelder-Mead method begins with n+1 starting points, forming simplex in Rⁿ
Then move to new point along straight line from current point having highest function value through centroid of points
New point replaces worst point, and process repeated
Direct search methods useful for nonsmooth functions or for small n, but expensive for larger n
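For instance, Nelder-Mead is available as a derivative-free option in SciPy; a minimal sketch on the quadratic used in the later examples:

import numpy as np
from scipy.optimize import minimize

# f(x) = 0.5*x1^2 + 2.5*x2^2, same quadratic as in the steepest descent example
f = lambda x: 0.5 * x[0]**2 + 2.5 * x[1]**2

res = minimize(f, x0=[5.0, 1.0], method="Nelder-Mead")
print(res.x, res.nfev)   # near (0, 0); note the relatively large number of f evaluations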
26
[Figures: contour maps]
Steepest Descent Method
Let f: Rⁿ → R be real-valued function of n real variables
At any point x where gradient vector is nonzero, negative gradient, −∇f(x), points downhill toward lower values of function f
In fact, −∇f(x) is locally direction of steepest descent: function decreases more rapidly along direction of negative gradient than along any other
Steepest descent method: starting from initial guess x0, successive approximate solutions given by
xk+1 = xk − αk ∇f(xk),
where αk is line search parameter that determines how far to go in given direction
27
Steepest Descent, continued
Given descent direction, such as negative gradient, determining appropriate value for αk at each iteration is one-dimensional minimization problem
min_{αk} f(xk − αk ∇f(xk))
that can be solved by methods already discussed
Steepest descent method is very reliable: can always make progress provided gradient is nonzero
But method is myopic in its view of function’s behavior, and resulting iterates can zigzag back and forth, making very slow progress toward solution
In general, convergence rate of steepest descent is only linear, with constant factor that can be arbitrarily close to 1
28
Example: Steepest Descent
Use steepest descent method to minimize
f(x) = 0.5 x1² + 2.5 x2²
Gradient is given by
∇f(x) = [ x1 ;  5 x2 ]
Taking x0 = [ 5 ; 1 ], we have ∇f(x0) = [ 5 ; 5 ]
Performing line search along negative gradient direction, i.e.,
min_{α0} f(x0 − α0 ∇f(x0)),
exact minimum along line is given by α0 = 1/3, so next approximation is
x1 = [ 3.333 ; −0.667 ]
29
Example Continued
      xk              f(xk)       ∇f(xk)
 5.000   1.000       15.000      5.000   5.000
 3.333  -0.667        6.667      3.333  -3.333
 2.222   0.444        2.963      2.222   2.222
 1.481  -0.296        1.317      1.481  -1.481
 0.988   0.198        0.585      0.988   0.988
 0.658  -0.132        0.260      0.658  -0.658
 0.439   0.088        0.116      0.439   0.439
 0.293  -0.059        0.051      0.293  -0.293
 0.195   0.039        0.023      0.195   0.195
 0.130  -0.026        0.010      0.130  -0.130
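A minimal NumPy sketch reproducing these iterates; for a quadratic f(x) = (1/2) x^T A x, the exact line search step along the negative gradient g is αk = (g^T g)/(g^T A g):

import numpy as np

A = np.diag([1.0, 5.0])          # f(x) = 0.5*x1^2 + 2.5*x2^2 = 0.5 * x^T A x
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([5.0, 1.0])
for k in range(10):
    g = grad(x)
    alpha = (g @ g) / (g @ A @ g)    # exact line search along -g for a quadratic
    print(f"{x[0]:7.3f} {x[1]:7.3f}   {f(x):7.3f}   {g[0]:7.3f} {g[1]:7.3f}")
    x = x - alpha * g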
[Figure: contour plot of f with zigzagging steepest descent iterates from x0 = (5, 1)]
30
Newton’s Method
Broader view can be obtained by local quadratic approximation, which is equivalent to Newton's method
In multidimensional optimization, we seek zero of gradient, so Newton iteration has form
xk+1 = xk − Hf(xk)⁻¹ ∇f(xk),
where Hf(x) is Hessian matrix of second partial derivatives of f,
{Hf(x)}ij = ∂²f(x) / ∂xi∂xj
31
Newton’s Method, continued
Do not explicitly invert Hessian matrix, but instead solve linear system
Hf(xk) sk = −∇f(xk)
for sk, then take as next iterate
xk+1 = xk + sk
Convergence rate of Newton's method for minimization is normally quadratic
As usual, Newton’s method is unreliable unless started close enough to solution
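A minimal NumPy sketch of one Newton step, solving the linear system rather than inverting the Hessian (using the quadratic from the next example):

import numpy as np

def newton_step(x, grad, hess):
    """One Newton step for minimization: solve H s = -g instead of forming H^(-1)."""
    g = grad(x)
    H = hess(x)
    s = np.linalg.solve(H, -g)
    return x + s

# f(x) = 0.5*x1^2 + 2.5*x2^2
grad = lambda x: np.array([x[0], 5.0 * x[1]])
hess = lambda x: np.diag([1.0, 5.0])
print(newton_step(np.array([5.0, 1.0]), grad, hess))   # [0, 0]: exact for a quadratic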
32
Example: Newton’s Method
Use Newton’s method to minimize
f (x) = 0.5×21 + 2.5×2 Gradient and Hessian are given by
rf(x)=” x1 # and Hf(x)=”1 0# 5×2 05
Taking x0 = “51#, we have rf(x0) = “5# Linear system for Newton step is
“1 0#s0=”