Design optimization algorithms and tools
Intro to gradient-based
optimization
ME 564/SYS 564
Wed Sep 26, 2018
Steven Hoffenson
Goal of Week 5: To learn the optimality conditions for unconstrained problems,
be able to solve problems with them, and know two derivative-based algorithms
1
Recap: How to optimize
1. Formulate the problem
a) Define system boundaries
b) Develop analytical models
c) Explore/reduce the problem space
d) Formalize optimization problem
2. Solve the problem
a) Choose the right approach/algorithm
b) Solve (by hand, code, or software)
c) Interpret the results
d) Iterate if needed
2
x_{k+1} = x_k − H(x_k)^(−1) ∇f(x_k)
(Weeks 1-2, 4, 9-12)
(Weeks 3, 5-8, 12)
TODAY
Recap: Week 3
• Linear programs are special cases
• All functions monotonic
• Solutions must lie on boundary of design space
• Simplex algorithm is efficient
• Derivative-free algorithms for nonlinear problems
are straightforward and robust, but may take
longer and often converge on local optima
• Coordinate search
• Nelder-Mead
• Space-filling DIRECT
• Simulated Annealing
3
Recap: DOEs and surrogate modeling
• Surrogate modeling is fitting a mathematical
function to your data to speed up evaluations and
optimization
• This involves four general steps:
• Gather data (e.g., using a DOE)
• Choose a function structure (e.g., linear, polynomial,
kriging, ANN)
• Fit a function to the data
• Assess fitness
• Watch out for outliers, underfitting, and overfitting
4
Unconstrained gradient-
based methods
“Unconstrained” means we are talking about
problems that have interior optima, not optima that
lie on a constraint (like what monotonicity analysis
can help with)
5
Basic nonlinear problem
How do we solve this?
6
How do we know it’s a minimizer?
1-variable optimality conditions
1. First-order necessary condition
If f(x) is differentiable and x* is a local minimum, then ∂f/∂x (x*) = 0. (zero slope)
Note: this is not sufficient – x* could be a local maximum or saddle point (rather than a minimum)
2. Second-order sufficient condition
If ∂f/∂x (x*) = 0 and ∂²f/∂x² (x*) > 0, then x* is a local minimum. (zero slope and increasing slope)
7
Example
min f(x) = x + e^(−x)
First-order necessary condition:
df/dx = 1 − e^(−x)
df/dx (x*) = 1 − e^(−x*) = 0
1 = e^(−x*)
ln(1) = −x*
x* = 0
This is a stationary point.
Second-order sufficient condition:
d²f/dx² = e^(−x)
d²f/dx² (x*) = e^(−x*)
d²f/dx² (0) = e^0 = 1
Since this is > 0, it is a minimum.
Confirm with plot: [plot of f(x) = x + e^(−x), showing the minimum at x* = 0]
8
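Alongside the plot, a quick numerical check is possible. Here is a minimal Python sketch (not from the slides) that verifies the FONC and SOSC for f(x) = x + e^(−x) at x* = 0; the finite-difference step h is an assumed illustrative value.

```python
import math

def f(x):
    return x + math.exp(-x)

# Central finite differences approximate the first and second derivatives.
h = 1e-5

def dfdx(x):
    return (f(x + h) - f(x - h)) / (2 * h)

def d2fdx2(x):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

x_star = 0.0
print("f'(x*)  =", dfdx(x_star))    # ~0     -> FONC satisfied (stationary point)
print("f''(x*) =", d2fdx2(x_star))  # ~1 > 0 -> SOSC satisfied (local minimum)
```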
Global vs. local optima
Stationary points, where ∂f/∂x (x*) = 0, can be minima, maxima, or saddle points
[Plot annotations: global maximum on interval [0, 3.25]; global minimum on interval [0, 3.25]; local minimum; saddle point]
9
Multiple variables
Gradient: ∇f(x) ≜ (∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn)
Hessian: H(x) ≜ the n×n matrix of second derivatives,
H(x) = [∂²f/∂x1²  ⋯  ∂²f/(∂x1∂xn); ⋮  ⋱  ⋮; ∂²f/(∂xn∂x1)  ⋯  ∂²f/∂xn²]
10
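These definitions are easy to approximate numerically. Below is a small Python sketch (my own illustration, not from the slides) that builds the gradient vector and Hessian matrix of an arbitrary f by central finite differences; the example function and step sizes are assumptions for demonstration.

```python
import numpy as np

def gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient vector of f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hessian(f, x, h=1e-4):
    """Central-difference approximation of the Hessian matrix of f at x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        ei = np.zeros(n)
        ei[i] = h
        for j in range(n):
            ej = np.zeros(n)
            ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)
    return H

# Illustrative function: f(x) = x1^2 + 3*x1*x2 + 2*x2^2
f = lambda x: x[0]**2 + 3*x[0]*x[1] + 2*x[1]**2
print(gradient(f, [1.0, 1.0]))  # ~[5, 7]
print(hessian(f, [1.0, 1.0]))   # ~[[2, 3], [3, 4]]
```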
Multi-variable optimality conditions
1. First-order necessary condition
If 𝑓(𝐱) is differentiable and 𝐱∗ is
a local minimum, then 𝛁𝑓(𝐱∗) = 𝟎.
2. Second-order sufficient condition
If 𝛁𝑓(𝐱∗) = 𝟎 and 𝐇 𝐱∗ is positive-definite,
then 𝐱∗ is a local minimum
11
Recall: Points that satisfy the
necessary condition are called
“stationary points,” and they
are not all minima!
Hessian properties
A Hessian is positive-definite if ∂x^T H(x*) ∂x > 0 for all ∂x ≠ 0.
A matrix is positive-definite if and only if any of these hold:
1. All of its eigenvalues are positive.
2. All determinants of its leading principal minors are positive.
3. All the pivots are positive when the matrix is reduced to
row-echelon form.
Other matrix classification definitions:

Replace "positive-definite" with | Replace > with | Nature of x*
negative-definite | < | local maximum
positive-semidefinite | ≥ | probable valley
negative-semidefinite | ≤ | probable ridge
indefinite | both + and − | saddle point

Note: ∂x = x − x*
Note: The 3 "tricks" above only apply to positive-definiteness and cannot all be used to prove negative- or semi-definiteness
12

Sometimes we can solve problems using the optimality conditions
1. Apply the First-Order Necessary Condition (FONC)
Find stationary points x* where ∇f(x*) ≜ (∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn) = (0, 0, …, 0)
2. Apply the Second-Order Sufficient Condition (SOSC)
Test each stationary point x* for a positive-definite Hessian: ∂x^T H(x*) ∂x > 0 for all ∂x ≠ 0
If a point 𝐱∗ meets both conditions, it is a local minimum!
13
Example (4.6 in book)
Find the minimum of the following function:
f(x) = 2x1 + x1^(−2) + 2x2 + x2^(−2)

FONC:
∇f(x) = (2 − 2x1^(−3), 2 − 2x2^(−3)) = (0, 0)
Stationary point: x* = (x1, x2) = (1, 1)

SOSC:
H(x) = [6x1^(−4)  0; 0  6x2^(−4)]
H(x*) = [6  0; 0  6]
∂x = (∂x1, ∂x2) = x − x* = (x1 − 1, x2 − 1)
∂x^T H(x*) ∂x = 6(∂x1)² + 6(∂x2)² > 0, so H(x*) is positive definite

Therefore, [1; 1] is a minimum!
14
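As a cross-check of this example (my own sketch, not part of the slides), SymPy can apply the FONC and SOSC symbolically: solve ∇f = 0 for the stationary point, then test the Hessian there.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', positive=True)  # positive assumption keeps the real root x1 = x2 = 1
f = 2*x1 + x1**-2 + 2*x2 + x2**-2

grad = [sp.diff(f, v) for v in (x1, x2)]
stationary = sp.solve(grad, (x1, x2), dict=True)   # FONC: set the gradient to zero
print(stationary)                                  # [{x1: 1, x2: 1}]

H = sp.hessian(f, (x1, x2))
H_star = H.subs(stationary[0])
print(H_star)                        # Matrix([[6, 0], [0, 6]])
print(H_star.is_positive_definite)   # True -> SOSC holds, so (1, 1) is a local minimum
```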
Eigenvalues to test SOSC
Rather than checking the quadratic term
∂x^T H(x*) ∂x > 0, we can check the eigenvalues of H(x*):

Eigenvalues of H(x*) | Hessian matrix | Nature of x*
All positive (> 0) | Positive definite | Local min
All negative (< 0) | Negative definite | Local max
All nonnegative (≥ 0) | Positive semidefinite | Probable valley
All nonpositive (≤ 0) | Negative semidefinite | Probable ridge
Mixed signs (both + and −) | Indefinite | Saddle point
15

Example
Recall our example: H(x) = [6x1^(−4)  0; 0  6x2^(−4)] and H(x*) = [6  0; 0  6]
Eigenvalues at x*: λ1 = 6 and λ2 = 6 → positive definite at x* → local minimum
Eigenvalues of the general Hessian: λ1 = 6x1^(−4) and λ2 = 6x2^(−4) → positive definite everywhere
16

Determinants to test SOSC
If your matrix is 2x2, you can simply check the determinant of H(x*):

Determinant of H(x*) | Hessian matrix | Nature of x*
Positive and h11 > 0 | Positive definite | Local min
Positive and h11 < 0 | Negative definite | Local max
Zero and h11 > 0 | Positive semidefinite | Probable valley
Zero and h11 < 0 | Negative semidefinite | Probable ridge
Negative | Indefinite | Saddle point

Note: h11 is the first element (upper-left value) of H
Another note: h11 is the first leading principal minor of H, and the determinant of the full matrix is the second leading principal minor. This follows option 2 from Slide 12.
17

Example
Recall our example: H(x*) = [6  0; 0  6]
det H(x*) = 36 > 0
h11 = 6 > 0
Positive definite at x* → Local minimum
18
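The eigenvalue table is easy to encode. Here is a small helper (my own illustration, not from the slides) using NumPy; the tolerance tol for treating an eigenvalue as zero is an assumed value.

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify a stationary point from the eigenvalues of its (symmetric) Hessian."""
    eig = np.linalg.eigvalsh(H)   # symmetric matrix -> real eigenvalues
    if np.all(eig > tol):
        return "positive definite -> local min"
    if np.all(eig < -tol):
        return "negative definite -> local max"
    if np.all(eig >= -tol):
        return "positive semidefinite -> probable valley"
    if np.all(eig <= tol):
        return "negative semidefinite -> probable ridge"
    return "indefinite -> saddle point"

print(classify_stationary_point(np.array([[6.0, 0.0], [0.0, 6.0]])))   # local min
print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, -3.0]])))  # saddle point
```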
Interior vs. boundary optima
Interior optimum
19
Boundary optimum
Note that the necessary and sufficient conditions
usually do not apply to boundary optima
Local approximation
A review of Taylor series expansion for approximating
function behavior (in a local neighborhood)
20
Taylor Series Approximation
Taylor series approximation for a single-variable function:
f(x) = f(x0) + [f′(x0)/1!](x − x0) + [f″(x0)/2!](x − x0)² + [f⁽³⁾(x0)/3!](x − x0)³ + ⋯ (higher-order terms)
The approximation is valid only in the neighborhood of x0
21
Taylor Series Approximation
f(x) = e^x
Linear approximation: f(x) ≈ f(x0) + f′(x0)(x − x0)
For x0 = 0: f(x) ≈ 1 + x
22
Taylor Series Approximation
f(x) = e^x
Quadratic approximation: f(x) ≈ f(x0) + f′(x0)(x − x0) + ½ f″(x0)(x − x0)²
For x0 = 0: f(x) ≈ 1 + x + ½ x²
23
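To see how these approximations behave away from x0 = 0, here is a short Python sketch (the sample points are my own choice, not from the slides):

```python
import math

f = lambda x: math.exp(x)
linear = lambda x: 1 + x                 # 1st-order Taylor approximation about x0 = 0
quadratic = lambda x: 1 + x + x**2 / 2   # 2nd-order Taylor approximation about x0 = 0

for x in (0.1, 0.5, 1.0, 2.0):
    print(f"x = {x:4.1f}   e^x = {f(x):7.4f}   linear = {linear(x):7.4f}   quadratic = {quadratic(x):7.4f}")
```

Both approximations are accurate near x0 and degrade as x moves away, with the quadratic one holding up longer.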
Taylor Series Approximation
In higher dimensions:
f(x) ≈ f(x0) + ∇f(x0)(x − x0) + ½ (x − x0)^T H(x0) (x − x0) + ⋯
Recall:
Gradient vector: ∇f(x) = (∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn)
Hessian matrix: H(x) = [∂²f/∂x1²  …  ∂²f/(∂x1∂xn); ⋮  ⋱  ⋮; ∂²f/(∂xn∂x1)  …  ∂²f/∂xn²]
24
Taylor Series Approximation
f(x) ≈ f(x0) + ∇f(x0)(x − x0) + ½ (x − x0)^T H(x0)(x − x0)
f(x) − f(x0) ≈ ∇f(x0)(x − x0) + ½ (x − x0)^T H(x0)(x − x0)
Writing ∂f = f(x) − f(x0) (the perturbation in f) and ∂x = x − x0 (the perturbation in x):
∂f ≈ ∇f(x0) ∂x + ½ ∂x^T H(x0) ∂x
We will use this to develop our first two algorithms!
25
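A quick numerical check of this perturbation relation (my own sketch, not from the slides), using the Example 4.6 function with its analytical gradient and Hessian; the base point x0 and perturbation ∂x are assumed values.

```python
import numpy as np

f = lambda x: 2*x[0] + x[0]**-2 + 2*x[1] + x[1]**-2
grad = lambda x: np.array([2 - 2*x[0]**-3, 2 - 2*x[1]**-3])
hess = lambda x: np.diag([6*x[0]**-4, 6*x[1]**-4])

x0 = np.array([1.5, 2.0])
dx = np.array([0.05, -0.05])   # small perturbation in x

true_df = f(x0 + dx) - f(x0)                            # actual perturbation in f
approx_df = grad(x0) @ dx + 0.5 * dx @ hess(x0) @ dx    # 2nd-order Taylor estimate
print(true_df, approx_df)      # the two values nearly agree for small perturbations
```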
Basic gradient-based
algorithms
1st-order: Gradient descent
2nd-order: Newton’s method
26
1st-order algorithm: Gradient descent
Starting at a point 𝐱0, we want to find a direction
that will lower the objective value
1. Using 1st-order terms only from the Taylor approximation: ∂f ≈ ∇f(x0) ∂x
2. We want ∂f < 0. If we choose ∂x = −∇f(x0), then we know ∂f ≈ −‖∇f(x0)‖² < 0
This is our direction of descent!
27

Gradient Method
∂x_k = −(∇f(x_k))^T = −g_k
∂x_k = x_{k+1} − x_k
x_{k+1} = x_k − g_k
[Plot annotations: f(x) contours with the descent direction; the arrow shown is a gradient (not a constraint)]
28

Gradient descent algorithm
Local optimization algorithm for interior optima (see the Python sketch at the end of these notes):
1. Begin with a feasible point x0
2. Find the gradient at that point, ∇f(x0)
3. Move in the direction of the negative gradient to find an improved x1: x1 = x0 − ∇f(x0)
4. Continue to iterate until you stop improving: x_k = x_{k−1} − ∇f(x_{k−1})
29

Gradient descent
Slight modification: add a scale factor to avoid jumping past the minimum:
x_k = x_{k−1} − α ∇f(x_{k−1})
Optimizing the step size α gives us:
x_k = x_{k−1} − [∇f^T(x_{k−1}) ∇f(x_{k−1}) / (∇f^T(x_{k−1}) H(x_{k−1}) ∇f(x_{k−1}))] ∇f(x_{k−1})
provided the Hessian is positive-definite
(p. 155 of the 2nd edition or p. 189 of the 3rd edition explains step size optimization)
30

Gradient descent example
[Worked example: problem statement, gradient & Hessian, and algorithm iterations shown on the slide]
31

Summary
• The optimality conditions can be used to prove an interior optimum
• The First-Order Necessary Condition identifies stationary points
• The Second-Order Sufficiency Condition identifies the nature (minima, maxima, saddle) of stationary points
• Taylor series approximation is used to generate derivative-based local optimization directions
• The gradient descent algorithm uses 1st-order info
• Newton’s method (algorithm) uses 2nd-order info, which we didn’t get to today…
32

Acknowledgements
• Much of this material came from Chapter 4 of the textbook, Principles of Optimal Design
• Some of the slides and examples came from Drs. Emrah Bayrak, Alex Burnap, and Namwoo Kang at the University of Michigan
33

Question
Who would be interested in a HW1 review session with Amineh sometime on Friday (9/28) or Monday (10/1)?
ANSWER: There is interest! Amineh will host a session to review the HW1 solutions on Monday at 11am. Location is TBD, and an announcement will go out on Canvas soon.
34

Another problem set-up and monotonicity example
Designing an air tank
[Figure: air tank with dimensions r, h, l, s]
35

Solution
Air tank problem set-up: minimize the volume of metal
36

Example: Air tank
Add constraints:
• minimum inner volume
• ASME code ratio limit
• ASME code ratio limit
• allow space to attach nozzles
• space limitation
Okay, what’s next?
37

Example: Air tank
Monotonicity analysis:
MP1: Every increasing variable (in the objective) is bounded below by at least one non-increasing active constraint
Okay, what’s next?
38

Example: Air tank
Remove critical constraints:
• Solve each for a variable
• Plug into objective and constraints
Infimum at V* = 130.1(10³)π, where l* = ∞: not well bounded
39
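As a companion to the gradient descent algorithm and the optimal step size formula above (Slides 29-30), here is a minimal Python sketch applied to the Example 4.6 objective from Slide 14; the starting point, tolerance, and iteration cap are assumed values for illustration, not from the slides.

```python
import numpy as np

# Example 4.6 objective with its analytical gradient and Hessian (Slide 14)
f = lambda x: 2*x[0] + x[0]**-2 + 2*x[1] + x[1]**-2
grad = lambda x: np.array([2 - 2*x[0]**-3, 2 - 2*x[1]**-3])
hess = lambda x: np.diag([6*x[0]**-4, 6*x[1]**-4])

def gradient_descent(x0, tol=1e-8, max_iter=200):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # stop once the FONC is (nearly) satisfied
            break
        H = hess(x)
        alpha = (g @ g) / (g @ H @ g)    # optimal step size (Slide 30); needs H positive-definite
        x = x - alpha * g                # move in the negative gradient direction
    return x, k

x_star, iters = gradient_descent([0.5, 0.7])
print(x_star, iters)   # converges to approximately [1, 1], the minimum found earlier
```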