Numerical Optimisation: Line search methods
Marta M. Betcke
m.betcke@ucl.ac.uk,
Kiko Rullan
f.rullan@cs.ucl.ac.uk
Department of Computer Science,
Centre for Medical Image Computing,
Centre for Inverse Problems
University College London
Lecture 2 & 3
Descent direction
A descent direction is a vector p ∈ Rn along which the function f decreases.
From Taylor's theorem,

    f(xk + αp) = f(xk) + αpᵀ∇f(xk) + ½α²pᵀ∇²f(xk + tp)p,  t ∈ (0, α),
               = f(xk) + αpᵀ∇f(xk) + O(α²).

Thus if pᵀ∇f(xk) < 0, then f(xk + αp) < f(xk) for all α > 0 small enough. In terms of the angle θ between p and ∇f(xk),

    pᵀ∇f(xk) = ‖p‖‖∇f(xk)‖ cos θ < 0  ⇔  |θ| > π/2,

i.e. a descent direction makes an obtuse angle with the gradient.
Steepest descent direction
The steepest descent direction p solves

    min_p pᵀ∇f(xk)  subject to ‖p‖ = 1.

Since

    min_{‖p‖=1} pᵀ∇f(xk) = min_{‖p‖=1} ‖p‖‖∇f(xk)‖ cos θ = −‖∇f(xk)‖,

the minimum is attained for cos θ = −1, i.e. p = −∇f(xk)/‖∇f(xk)‖, where θ is the angle between p and ∇f(xk).
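As a small illustration (my own sketch, not from the slides; the test function is assumed), the steepest descent direction is just the normalised negative gradient, and it trivially satisfies the descent criterion pᵀ∇f(xk) < 0:

# Sketch: steepest descent direction for an assumed test function
# f(x) = 0.5*||x||^2 + sin(x[0]); verify the descent criterion.
import numpy as np

def grad_f(x):
    g = x.copy()
    g[0] += np.cos(x[0])
    return g

xk = np.array([1.0, -2.0])
g = grad_f(xk)
p = -g / np.linalg.norm(g)            # steepest descent direction, ||p|| = 1

print(p @ g)                          # equals -||grad f(xk)|| < 0
print(p @ g / np.linalg.norm(g))      # cos(theta) = -1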
Newton direction
Consider the second order Taylor polynomial

    f(xk + p) ≈ f(xk) + pᵀ∇f(xk) + ½ pᵀ∇²f(xk)p =: m2(p)

and assume ∇²f(xk) is positive definite.

The Newton direction minimises the second order Taylor polynomial m2. Setting

    m2′(p) = ∇²f(xk)p + ∇f(xk) = 0

yields

    p = −(∇²f(xk))⁻¹∇f(xk).
The Newton direction is reliable when m2(p) is a close approximation to f(xk + p), i.e. when ∇²f(xk + tp), t ∈ (0, 1), and ∇²f(xk) are close. This is the case if ∇²f is sufficiently smooth, and then the approximation error is of order O(‖p‖³).
For the Newton direction,

    pᵀ∇f(xk) = −pᵀ∇²f(xk)p ≤ −σ‖p‖²

for some σ > 0. Thus unless ∇f(xk) = 0 (and hence p = 0), pᵀ∇f(xk) < 0 and p is a descent direction.
The step length 1 is optimal when f(xk + p) = m2(p) (i.e. for a quadratic f), thus the step length 1 is used unless it does not produce a satisfactory reduction of f.
If ∇²f(xk) is not positive definite, the Newton direction may not be defined: if ∇²f(xk) is singular, (∇²f(xk))⁻¹ does not exist. Otherwise the direction exists but may not be a descent direction, which can be remedied (see Hessian modifications later).
Fast local convergence (quadratic) close to the solution.
Computing the Hessian is expensive.
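A brief sketch (my own illustration, with an assumed strictly convex test function) of computing the Newton direction by solving the linear system ∇²f(xk)p = −∇f(xk) rather than forming the inverse explicitly:

# Sketch: Newton direction for the assumed test function f(x) = sum(exp(x) - x).
import numpy as np

def grad_f(x):
    return np.exp(x) - 1.0

def hess_f(x):
    return np.diag(np.exp(x))

xk = np.array([0.5, -1.0, 2.0])
g, H = grad_f(xk), hess_f(xk)

# Attempting a Cholesky factorisation certifies positive definiteness
# (it raises LinAlgError otherwise); the direction is obtained by a solve,
# never by forming the inverse.
np.linalg.cholesky(H)
p = -np.linalg.solve(H, g)            # Newton direction; natural step length 1

print(p @ g)                          # negative: p is a descent direction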
Quasi-Newton direction
Use a symmetric positive definite (s.p.d.) approximation Bk to the Hessian ∇²f(xk) in the Newton step,

    p = −Bk⁻¹∇f(xk),   ∇f(xk)ᵀBk⁻¹∇f(xk) > 0,

such that superlinear convergence is retained.
Bk is updated in each step taking into account the additional
information gathered during that step. The updates make use of
the fact that changes in gradient provide information about the
second derivative of f along the search direction.
Secant equation
∇f (xk + p) = ∇f (xk) + Bkp
This equation is underdetermined; different quasi-Newton methods differ in the way they solve it.
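As an illustration of one common way to resolve the underdetermined secant equation (the BFGS update, not derived on these slides; below s = xk+1 − xk and y = ∇f(xk+1) − ∇f(xk) are assumed to satisfy yᵀs > 0):

# Sketch of the BFGS update: one common quasi-Newton choice, shown only to
# illustrate how a symmetric B_new satisfying the secant equation B_new s = y
# can be built from the previous approximation B.
import numpy as np

def bfgs_update(B, s, y):
    """Return B_new with B_new @ s = y; symmetric, and p.d. if y @ s > 0."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Tiny check on synthetic data with a curvature-safe pair (y @ s > 0).
rng = np.random.default_rng(0)
B = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)
B_new = bfgs_update(B, s, y)
print(np.allclose(B_new @ s, y))      # True: the secant equation holds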
Line search
Given the search direction p, the optimal reduction of f amounts to minimising the function of one variable

    φ(α) := f(xk + αp),  α > 0.

This is in general too expensive (even finding a local minimiser of φ is), hence inexact line search is of interest.

The choice of step length is important: too small steps mean slow convergence, while too large steps may not reduce the objective function f.
Conditions for decrease
Simple condition: require f(xk + αp) < f(xk).

Consider a sequence with f(xk) = 5/k, k = 1, 2, . . . . This sequence is decreasing but its limiting value is 0, while the minimum of a convex function can be smaller than 0.

Figure: Nocedal Wright Fig 3.2

The decrease is insufficient to converge to the minimum of a convex function. Hence we need conditions for sufficient decrease.

Sufficient decrease condition

Armijo condition

    f(xk + αp) ≤ f(xk) + c1αpᵀ∇f(xk) =: ℓ(α),

for some c1 ∈ (0, 1). [Typically small, c1 = 10⁻⁴]

ℓ(α) is a linear function with negative slope c1pᵀ∇f(xk) < 0 and

    ℓ(α) = f(xk) + c1αpᵀ∇f(xk) > f(xk) + αpᵀ∇f(xk) = φ(0) + αφ′(0).

From Taylor's theorem, φ(α) = φ(0) + αφ′(0) + ½α²φ″(ξ), ξ ∈ (0, α), thus for sufficiently small α > 0, ℓ(α) > φ(α).
Figure: Nocedal Wright Fig 3.3
Curvature condition
Armijo condition is satisfied for all sufficiently small α, so we need
another condition to avoid very small steps.
Curvature condition
    pᵀ∇f(xk + αp) ≥ c2 pᵀ∇f(xk),   c2 ∈ (c1, 1),

i.e. φ′(α) ≥ c2 φ′(0).
Ensures that we progress far enough along a good direction p.
If φ′(α) is strongly negative, there is a good prospect of
significant decrease along p.
If φ′(α) is slightly negative (or even positive) we have a
prospect of little decrease and hence we terminate the line
search.
Typically c2 = 0.9 for a Newton or quasi-Newton direction, and c2 = 0.1 for a nonlinear conjugate gradient direction.
Curvature condition
    pᵀ∇f(xk + αp) ≥ c2 pᵀ∇f(xk),   c2 ∈ (c1, 1),   i.e. φ′(α) ≥ c2 φ′(0).
Figure: Nocedal Wright Fig 3.4
Wolfe conditions
The sufficient decrease (Armijo rule) and curvature conditions
together are called Wolfe conditions
f (xk + αp) ≤ f (xk) + c1αpT∇f (xk),
pT∇f (xk + αp) ≥ c2pT∇f (xk),
for 0 < c1 < c2 < 1.
The Wolfe conditions can admit points far away from stationary points of φ, hence the strong Wolfe conditions additionally disallow “too positive” values of φ′(α):
f (xk + αp) ≤ f (xk) + c1αpT∇f (xk),
|pT∇f (xk + αp)| ≤ c2|pT∇f (xk)|,
for 0 < c1 < c2 < 1.
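A small helper (an illustrative sketch; the function names, parameters and test problem are mine) that tests whether a candidate step length satisfies the Wolfe or strong Wolfe conditions:

# Sketch: check the Wolfe / strong Wolfe conditions for a candidate step alpha
# along a descent direction p at x.
import numpy as np

def wolfe_conditions(f, grad, x, p, alpha, c1=1e-4, c2=0.9, strong=False):
    phi0, dphi0 = f(x), p @ grad(x)                    # phi(0), phi'(0) < 0
    phia, dphia = f(x + alpha * p), p @ grad(x + alpha * p)
    armijo = phia <= phi0 + c1 * alpha * dphi0
    curvature = abs(dphia) <= c2 * abs(dphi0) if strong else dphia >= c2 * dphi0
    return armijo and curvature

# Example on the assumed quadratic f(x) = 0.5*x@x with a steepest descent step.
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([1.0, 1.0])
print(wolfe_conditions(f, grad, x, -grad(x), alpha=1.0, strong=True))   # True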
Wolfe conditions: existence
Let f : Rn → R be continuously differentiable and p be a descent
direction at xk . If f is bounded below along the ray
{xk + αp | α > 0}, then there exists an interval of step lengths
satisfying both the Wolfe conditions and strong Wolfe conditions.
Proof:
For α > 0, φ(α) = f(xk + αp) is bounded below, while ℓ(α) = f(xk) + c1αpᵀ∇f(xk) is unbounded below since c1pᵀ∇f(xk) < 0; moreover for small α, ℓ(α) > φ(α) as c1 < 1. Thus ℓ(α) has to intersect φ(α) at least once. Let α′ be the smallest such value,

    φ(α′) = f(xk + α′p) = f(xk) + α′c1pᵀ∇f(xk) = ℓ(α′).

Then the sufficient decrease condition holds for all α ≤ α′.

Furthermore, by the mean value theorem ∃α″ ∈ (0, α′):

    f(xk + α′p) − f(xk) = α′pᵀ∇f(xk + α″p),

and combining this with the previous equation we obtain

    pᵀ∇f(xk + α″p) = c1pᵀ∇f(xk) > c2pᵀ∇f(xk)
since c1 < c2 and pᵀ∇f(xk) < 0. Therefore α″ satisfies the Wolfe conditions. As the curvature condition holds strictly at α″, by continuity of ∇f the inequality (and hence the Wolfe conditions) also holds in an interval containing α″. Furthermore, as all terms in the last equation are negative, the strong Wolfe conditions hold on the same interval.
The Wolfe conditions are scale-invariant in the sense that they are unaffected by scaling of the objective function or an affine change of variables.
They can be used in most line search methods and are particularly
important for quasi-Newton methods.
Goldstein conditions
f (xk) + (1− c)αpT∇f (xk) ≤ f (xk + αp) ≤ f (xk) + cαpT∇f (xk)
with 0 < c < 1/2.
The second inequality is the sufficient decrease condition. The first inequality controls the step length from below. A disadvantage w.r.t. the Wolfe conditions is that the first inequality can exclude all minimisers of φ.
Figure: Nocedal Wright Fig 3.6
Goldstein conditions are often used in Newton-type methods, but
not well suited for quasi-Newton methods with positive definite
Hessian approximation.
Backtracking: sufficient decrease avoiding too small steps
Backtracking line search
1: Choose ᾱ > 0, ρ ∈ (0, 1), c ∈ (0, 1)
2: Set α = ᾱ
3: repeat
4: α = ρα
5: until f (xk + αp) ≤ f (xk) + cαpT∇f (xk)
Terminates in a finite number of steps: α will eventually become small enough to satisfy the sufficient decrease condition.
Prevents too short step lengths: the accepted α is within a factor ρ of the previously tried value, α/ρ, which was rejected for violating the sufficient decrease condition, i.e. for being too long.
ρ can vary in [ρmin, ρmax] ⊂ (0, 1) between iterations.
In Newton and quasi-Newton methods ᾱ = 1, but different
values can be appropriate for other algorithms.
Well suited for Newton methods, less appropriate for
quasi-Newton and conjugate gradient methods.
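A direct implementation of the backtracking algorithm above (a sketch; the test function is assumed, and unlike the repeat-until loop above it also accepts ᾱ itself if that already satisfies the condition):

# Sketch: backtracking (Armijo) line search following the algorithm above.
import numpy as np

def backtracking(f, grad_fk, xk, p, alpha_bar=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the sufficient decrease condition holds."""
    alpha, fk, slope = alpha_bar, f(xk), c * (p @ grad_fk)    # slope < 0
    while f(xk + alpha * p) > fk + alpha * slope:
        alpha *= rho
    return alpha

# Example: assumed test function f(x) = x1^4 + x2^2, steepest descent direction.
f = lambda x: x[0]**4 + x[1]**2
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
xk = np.array([1.0, 1.0])
p = -grad(xk)
print(backtracking(f, grad(xk), xk, p))                       # e.g. 0.5 here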
Convergence of line search methods [Zoutendijk]
Consider an iteration
xk+1 = xk + αkpk , k = 0, 1, . . . ,
where pk is a descent direction and αk satisfies the Wolfe conditions.
Let f be bounded below in Rn and continuously differentiable in an
open set M containing the level set {x : f (x) ≤ f (x0)}.
If ∇f is Lipschitz continuous on M, i.e.

    ∃L > 0 : ‖∇f(x) − ∇f(x̄)‖ ≤ L‖x − x̄‖,  ∀x, x̄ ∈ M,

then

    Σ_{k≥0} cos²θk ‖∇f(xk)‖² < ∞,

where θk = ∠(pk, −∇f(xk)).
Convergence of line search methods [Zoutendijk]
Subtracting pkᵀ∇f(xk) from both sides of the curvature condition

    pkᵀ∇f(xk + αkpk) ≥ c2 pkᵀ∇f(xk),   where xk + αkpk = xk+1,

we obtain

    pkᵀ(∇f(xk+1) − ∇f(xk)) ≥ (c2 − 1) pkᵀ∇f(xk).
On the other hand, the Cauchy–Schwarz inequality and the Lipschitz condition imply

    pkᵀ(∇f(xk+1) − ∇f(xk)) ≤ ‖∇f(xk+1) − ∇f(xk)‖‖pk‖ ≤ αkL‖pk‖².
Combining the two inequalities we obtain a lower bound on the step length,

    αk ≥ ((c2 − 1)/L) · pkᵀ∇f(xk)/‖pk‖².
Substituting this inequality into the sufficient decrease condition yields

    f(xk+1) ≤ f(xk) + c1 ((c2 − 1)/L) · (pkᵀ∇f(xk))²/‖pk‖².

With cos θk = −pkᵀ∇f(xk)/(‖∇f(xk)‖‖pk‖) this becomes

    f(xk+1) ≤ f(xk) − c cos²θk ‖∇f(xk)‖²,

where c = c1(1 − c2)/L.

Summing over all indices up to k we obtain

    f(xk+1) ≤ f(x0) − c Σ_{j=0}^{k} cos²θj ‖∇f(xj)‖²,

and since f is bounded from below,

    Σ_{j=0}^{k} cos²θj ‖∇f(xj)‖² ≤ (f(x0) − f(xk+1))/c < C,

where C > 0 is some positive constant. Taking the limit k → ∞,

    Σ_{k=0}^{∞} cos²θk ‖∇f(xk)‖² < ∞.
Global convergence
The Goldstein or strong Wolfe conditions also imply the Zoutendijk condition

    Σ_{k≥0} cos²θk ‖∇f(xk)‖² < ∞.

The Zoutendijk condition implies

    cos²θk ‖∇f(xk)‖² → 0,

which can be used to derive global convergence results for line search algorithms.
If the method ensures that cos θk ≥ δ > 0 for all k, i.e. θk is bounded away from π/2, it follows that

    lim_{k→∞} ‖∇f(xk)‖ = 0.
This is the strongest global convergence result that can be obtained for such an iteration (convergence to a stationary point)
without additional assumptions.
In particular, the steepest descent (pk = −∇f (xk)) produces a
gradient sequence which converges to 0 if it uses a line search
satisfying Wolfe or Goldstein conditions.
For some algorithms, e.g. nonlinear conjugate gradient methods, only a weaker result can be obtained,

    lim inf_{k→∞} ‖∇f(xk)‖ = 0,

i.e. only a subsequence of the gradient norms ‖∇f(xkj)‖ converges to 0 rather than the whole sequence.
These limits can be proved by contradiction:
Suppose that ‖∇f(xk)‖ ≥ γ for some γ > 0 and all k sufficiently large. Then from cos²θk ‖∇f(xk)‖² → 0 we conclude that cos θk → 0, i.e. the entire sequence {cos θk} converges to 0.
Thus to show the weak convergence result it is enough to show
that a subsequence {cos θkj} is bounded away from 0.
Consider any algorithm which
(i) decreases the objective function in each iteration,
(ii) takes a steepest descent step every mth iteration, with step length satisfying the Wolfe or Goldstein conditions.
Since cos θk = 1 for the steepest descent steps, this provides a subsequence bounded away from 0. The algorithm can do something better in the remaining m − 1 iterates, while the occasional steepest descent step guarantees the overall (weak) global convergence.
Rate of convergence
Unfortunately, rapid convergence sometimes conflicts with global
convergence.
Example: Steepest descent is globally convergent (with appropriate
step sizes) but it can be very slow in practice. On the other hand,
while Newton iteration converges rapidly when we are close to the
solution, the Newton step may not even be a descent direction far
away from the solution.
The challenge: design algorithms with good global convergence
properties and rapid convergence rate.
Steepest descent
Steepest descent with exact line search for a strictly convex quadratic function

    f(x) = ½ xᵀQx − bᵀx,

where Q is symmetric positive definite.
Figure: Nocedal Wright Fig 3.7
Steepest descent
Characteristic zig-zag behaviour due to the elongated shape of the elliptical level sets. If the level sets were circles instead, steepest descent would need only one step.
Convergence rate of steepest descent with exact line search:

    ‖xk+1 − x⋆‖²_Q ≤ ((λn − λ1)/(λn + λ1))² ‖xk − x⋆‖²_Q,

where 0 < λ1 ≤ λ2 ≤ · · · ≤ λn are the eigenvalues of Q and ‖x‖²_Q = xᵀQx. Since ½‖x − x⋆‖²_Q = f(x) − f(x⋆), the same bound holds for the error in the objective.
Note: for a strictly convex quadratic function we obtain the objective function and (for free) the iterate convergence rates!
The objective function convergence rate is essentially the same for steepest descent with exact line search applied to a twice continuously differentiable nonlinear function satisfying the sufficient conditions at x⋆.
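A quick numerical check of this bound (my own sketch, with an assumed Q and b; for a quadratic the exact step length along −∇f(xk) is αk = ∇f(xk)ᵀ∇f(xk)/(∇f(xk)ᵀQ∇f(xk))):

# Sketch: steepest descent with exact line search on f(x) = 0.5 x'Qx - b'x;
# the Q-norm error contraction per step is compared with the bound above.
import numpy as np

Q = np.diag([1.0, 10.0])                 # assumed eigenvalues lambda_1, lambda_n
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(Q, b)           # unique minimiser
bound = ((10.0 - 1.0) / (10.0 + 1.0))**2

q_norm2 = lambda e: e @ Q @ e            # ||e||_Q^2
x = np.array([5.0, -3.0])
for k in range(5):
    g = Q @ x - b                        # gradient of f at x
    alpha = (g @ g) / (g @ Q @ g)        # exact minimiser of phi along -g
    x_new = x - alpha * g
    ratio = q_norm2(x_new - x_star) / q_norm2(x - x_star)
    print(f"k={k}: contraction {ratio:.4f} <= bound {bound:.4f}")
    x = x_new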
Local convergence rate: Newton methods
Let f : Rn → R be twice continuously differentiable with Lipschitz continuous Hessian in a neighbourhood of the solution x⋆ satisfying the sufficient conditions. Note that the Hessian ∇²f is then positive definite also in a vicinity of x⋆.

The iterates xk computed by the Newton method (note step length 1)

    xk+1 = xk − (∇²f(xk))⁻¹∇f(xk)

converge locally quadratically, i.e. for a starting point x0 sufficiently close to x⋆.

The sequence of gradient norms ‖∇f(xk)‖ also converges quadratically to 0.

Local convergence: note that away from the solution ∇²f(xk) may not be positive definite and hence pk may not be a descent direction. Global convergence with Hessian modification is discussed later.
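A small illustration (my own, on an assumed strictly convex test function) of the locally quadratic decay of ‖∇f(xk)‖ under pure Newton steps with step length 1:

# Sketch: pure Newton iteration (step length 1) on the assumed function
# f(x) = sum(exp(x_i) - x_i), minimised at x = 0; the gradient norm roughly
# squares from one iteration to the next (quadratic convergence).
import numpy as np

grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))

x = np.full(3, 1.0)                      # start reasonably close to x* = 0
for k in range(6):
    g = grad(x)
    print(f"k={k}: ||grad f(xk)|| = {np.linalg.norm(g):.3e}")
    x = x - np.linalg.solve(hess(x), g)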
Convergence rates: Newton-type methods
Let f : Rn → R be twice continuously differentiable.
Let {xk} be a sequence generated by a descent method

    xk+1 = xk + αkpk

with step lengths satisfying the Wolfe conditions with c1 ≤ 1/2.

If the sequence {xk} converges to a point x⋆ satisfying the sufficient conditions and the search direction satisfies

    lim_{k→∞} ‖∇f(xk) + ∇²f(xk)pk‖ / ‖pk‖ = 0,

then for all k > k0 the step length αk = 1 is admissible, and for that choice of αk = 1, k > k0, the sequence {xk} converges to x⋆ superlinearly.

Note: once close enough to the solution that ∇²f(xk) becomes s.p.d. and the Newton direction pk = −(∇²f(xk))⁻¹∇f(xk) is used, the limit condition is trivially satisfied (the numerator vanishes) and for αk = 1 we recover local quadratic convergence.
Convergence rates: quasi-Newton methods
Let f : Rn → R be twice continuously differentiable.
Let {xk} be a sequence generated by a quasi-Newton method
(note step length 1, Bk s.p.d.)

    xk+1 = xk − Bk⁻¹∇f(xk),   i.e. pk = −Bk⁻¹∇f(xk).

Assume the sequence {xk} converges to a point x⋆ satisfying the sufficient conditions. Then {xk} converges superlinearly if and only if

    lim_{k→∞} ‖(Bk − ∇²f(x⋆))pk‖ / ‖pk‖ = 0.

Note: the superlinear convergence rate can be attained even if the sequence {Bk} does not converge to ∇²f(x⋆). It suffices that Bk becomes an increasingly accurate approximation to ∇²f(x⋆) along the search direction pk. Quasi-Newton methods exploit this when constructing Bk.
Hessian modifications
Away from the solution, the Hessian may not be positive definite and the Newton direction may not be a descent direction. The general solution is to consider positive definite approximations

    Bk = ∇²f(xk) + Ek,

where Ek is chosen to ensure that Bk is sufficiently positive definite.
Global convergence results can be established for the Newton method with Hessian modification and steps satisfying the Wolfe, Goldstein or Armijo backtracking conditions, provided that the modified Hessians have uniformly bounded condition numbers,

    κ(Bk) = ‖Bk‖‖Bk⁻¹‖ ≤ C for some C > 0 and all k,

whenever the sequence of Hessians {∇²f(xk)} is bounded.
Hessian modifications
Eigenvalue decomposition

    ∇²f(xk) = QΛQᵀ = Σ_{i=1}^{n} λi qi qiᵀ.

Example:

    ∇²f(xk) = diag(10, 3, −1),  ∇f(xk) = (1, −3, 2)ᵀ,
    Q = I,  Λ = diag(λ1, λ2, λ3).

The Newton step: pk = (−0.1, 1, 2)ᵀ. As pkᵀ∇f(xk) > 0, it is not a descent direction.
Eigenvalue modifications (not practical):
Replace all negative eigenvalues with δ = √u = 10⁻⁸, where u = 10⁻¹⁶ is the machine precision:

    Bk = Σ_{i=1}^{2} λi qi qiᵀ + δ q3 q3ᵀ = diag(10, 3, 10⁻⁸).
Bk is s.p.d. and the curvature along q1, q2 is preserved; however, the direction is dominated by q3:

    pk = −Bk⁻¹∇fk = −Σ_{i=1}^{2} (1/λi) qi qiᵀ∇fk − (1/δ) q3 q3ᵀ∇fk ≈ −(2 × 10⁸) q3.

pk is a descent direction but its length is very large, not in line with the local validity of the Newton approximation. Thus pk may be ineffective.

One can adapt the choice of δ to avoid excessive lengths, or even take δ = 0, which eliminates the direction q3 altogether.
Let A be symmetric with eigenvalue decomposition A = QΛQᵀ.

The correction matrix ∆A of minimum Frobenius norm that ensures λmin(A + ∆A) ≥ δ is given by

    ∆A = Q diag(τi) Qᵀ,   with τi = 0 if λi ≥ δ, and τi = δ − λi if λi < δ,

and the modified matrix is

    A + ∆A = Q(Λ + diag(τi))Qᵀ.

Here the Frobenius norm is defined as ‖A‖²_F = Σ_{i,j=1}^{n} a²_ij, which for symmetric A equals Σ_{i=1}^{n} λ²_i.

The correction matrix ∆A of minimum Euclidean norm that satisfies λmin(A + ∆A) ≥ δ is given by

    ∆A = τI,   with τ = max(0, δ − λmin(A)),

and the modified matrix has the form A + τI.
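Both corrections in code (my own sketch on a small assumed matrix, via a full eigendecomposition, which is conceptual rather than something one would do for large problems):

# Sketch: minimum Frobenius-norm and minimum Euclidean-norm modifications that
# push all eigenvalues of a symmetric A up to at least delta.
import numpy as np

def modify_frobenius(A, delta):
    lam, Q = np.linalg.eigh(A)
    tau = np.where(lam >= delta, 0.0, delta - lam)
    return Q @ np.diag(lam + tau) @ Q.T           # A + Q diag(tau) Q^T

def modify_euclidean(A, delta):
    tau = max(0.0, delta - np.linalg.eigvalsh(A)[0])
    return A + tau * np.eye(A.shape[0])

A = np.diag([10.0, 3.0, -1.0])                    # the indefinite example above
delta = 1e-8
print(np.linalg.eigvalsh(modify_frobenius(A, delta)))   # approx [1e-8, 3, 10]
print(np.linalg.eigvalsh(modify_euclidean(A, delta)))   # approx [1e-8, 4, 11]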
Cholesky factorisation of A + τ I
Simple idea:
If min_i a_ii ≤ 0, set τ0 = −min_i a_ii + β for some small β > 0 (e.g. 10⁻³); otherwise set τ0 = 0.
Attempt the Cholesky algorithm to obtain LLᵀ = A + τk I.
If not successful, increase τk+1 = max(2τk, β) and reattempt.
Drawback: possibly multiple failed attempts to factorise.
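This simple idea translates directly into code (a sketch; the failed factorisation is detected via the exception raised by np.linalg.cholesky):

# Sketch: Cholesky with added multiple of the identity, increasing tau until
# A + tau*I admits a factorisation.
import numpy as np

def cholesky_with_added_identity(A, beta=1e-3):
    n = A.shape[0]
    min_diag = np.min(np.diag(A))
    tau = 0.0 if min_diag > 0 else -min_diag + beta
    while True:
        try:
            L = np.linalg.cholesky(A + tau * np.eye(n))
            return L, tau
        except np.linalg.LinAlgError:          # A + tau*I not (numerically) s.p.d.
            tau = max(2 * tau, beta)

A = np.array([[2.0, 3.0],
              [3.0, 1.0]])                     # indefinite 2x2 example
L, tau = cholesky_with_added_identity(A)
print(tau, np.allclose(L @ L.T, A + tau * np.eye(2)))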
Cholesky decomposition
Figure: Nocedal Wright Ex. 3.1
Cholesky decomposition of indefinite matrix
For A indefinite:
The factorisation A = LDLᵀ may not exist.
Even if it does exist, the algorithm can be unstable, i.e. the elements of L and D can become arbitrarily large.
A posteriori modification of D to force its elements to be positive may break down or result in a matrix very different from A.
Instead, modify A during the factorisation so that the elements of D are sufficiently positive and the elements of L and D are not too large.
Modified Cholesky decomposition
Choose δ, β > 0. While computing the jth column of L and D, ensure

    dj ≥ δ,   |mij| ≤ β,   i = j + 1, j + 2, . . . , n,

where mij = lij √dj.

To satisfy these bounds we only need to change how dj is computed, from dj = cjj to

    dj = max( |cjj|, θj²/β², δ ),   with θj = max_{j<i≤n} |cij|,

where cij are the intermediate quantities of the standard LDLᵀ algorithm.
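A sketch of the resulting modified LDLᵀ factorisation (my own implementation, without the symmetric permutations mentioned on the next slide; δ and β are assumed default values, and the cij are computed by the standard LDLᵀ recurrences):

# Sketch of a modified LDL^T factorisation (Gill-Murray type, no permutations):
# d_j is increased where necessary so that d_j >= delta and |l_ij sqrt(d_j)| <= beta.
import numpy as np

def modified_ldl(A, delta=1e-8, beta=10.0):
    n = A.shape[0]
    L, d = np.eye(n), np.zeros(n)
    for j in range(n):
        # Standard LDL^T recurrences for column j (the c_ij quantities):
        # c_jj = a_jj - sum_s d_s l_js^2,  c_ij = a_ij - sum_s d_s l_is l_js.
        cjj = A[j, j] - np.sum(d[:j] * L[j, :j]**2)
        cij = A[j+1:, j] - (L[j+1:, :j] * d[:j]) @ L[j, :j]
        theta = np.max(np.abs(cij)) if j < n - 1 else 0.0
        d[j] = max(abs(cjj), (theta / beta)**2, delta)    # modified choice of d_j
        L[j+1:, j] = cij / d[j]
    return L, d

A = np.array([[ 4.0,  2.0, 1.0],
              [ 2.0, -3.0, 0.0],
              [ 1.0,  0.0, 1.0]])                         # indefinite example
L, d = modified_ldl(A)
E = L @ np.diag(d) @ L.T - A                              # the implicit (diagonal) modification
print(d)
print(np.diag(E))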
Modified Cholesky decomposition
Properties:
Modifies the Hessian during factorisation where necessary.
The modified Cholesky factors exist and are bounded relative to the norm of the actual Hessian.
It does not modify the Hessian if it is sufficiently positive definite.

This is the basis for the modified Cholesky factorisation which also introduces symmetric row and column permutations to reduce the size of the modification,

    PAPᵀ + E = LDLᵀ = MMᵀ,

where E is a nonnegative diagonal matrix that is zero if A is sufficiently positive definite.

It has been shown that the matrices obtained by applying this modified Cholesky algorithm to the exact Hessian ∇²f(xk) have bounded condition numbers, hence some global convergence results can be obtained.
Step length selection
How to find a step length satisfying one of the termination conditions, e.g. Wolfe etc., for

    φ(α) = f(xk + αpk),

where pk is a descent direction, i.e. φ′(0) = pkᵀ∇f(xk) < 0?

If f is a convex quadratic function, f(x) = ½xᵀQx − bᵀx, the minimiser of φ along the ray xk + αpk can be calculated analytically:

    αk = − pkᵀ∇f(xk) / (pkᵀQpk).

For general nonlinear functions an iterative approach is necessary.
Line search algorithms can be classified according to the
information they use:
Methods using only function evaluations can be very inefficient as
they need to continue iterating until a very small interval has been
found.
Methods using gradient information can determine whether the current step length satisfies e.g. the Wolfe or Goldstein conditions, which require gradient evaluations.
Typically they consist of two phases: a bracketing phase, which finds an interval containing acceptable step lengths, and a selection phase, which locates the final step within that interval.
Line search via interpolation
Aim: find a step length α that satisfies the sufficient decrease condition without being too small. [similarity to backtracking]

Sufficient decrease condition:

    φ(αk) ≤ φ(0) + c1αkφ′(0).

We want to compute as few derivatives ∇f(x) as possible.

Initial guess α0: check the sufficient decrease condition

    φ(α0) ≤ φ(0) + c1α0φ′(0).

If satisfied, terminate the search. Otherwise, [0, α0] contains acceptable step lengths.
Construct a quadratic approximation φq(α) to φ by interpolating the available information φq(0) = φ(0), φq′(0) = φ′(0) and φq(α0) = φ(α0), which yields

    φq(α) = ((φ(α0) − φ(0) − α0φ′(0))/α0²) α² + φ′(0)α + φ(0).

The new trial value α1 is defined as the minimiser of φq, i.e.

    α1 = − φ′(0)α0² / (2(φ(α0) − φ(0) − φ′(0)α0)).
If the sufficient decrease condition is satisfied at α1, terminate the search. Otherwise, construct a cubic interpolating the four available values φc(0) = φ(0), φc′(0) = φ′(0), φc(α0) = φ(α0) and φc(α1) = φ(α1), which yields

    φc(α) = aα³ + bα² + αφ′(0) + φ(0),

    [a; b] = 1/(α0²α1²(α1 − α0)) · [α0²  −α1²; −α0³  α1³] · [φ(α1) − φ(0) − φ′(0)α1; φ(α0) − φ(0) − φ′(0)α0]

(rows of the 2×2 matrix and of the vectors separated by semicolons).
By differentiating φc we find its minimiser α2 ∈ [0, α1],

    α2 = (−b + √(b² − 3aφ′(0))) / (3a).
If necessary, repeat the cubic interpolation with φc(0) = φ(0), φc′(0) = φ′(0) and the two most recent values φc(αk−1) = φ(αk−1) and φc(αk) = φ(αk), until αk+1 satisfies the sufficient decrease condition.

Safeguard: if any αi is either too close to αi−1 or much smaller than αi−1, reset αi = αi−1/2.

If derivatives can be computed along with the function values at little additional cost, we can also devise a variant interpolating φ and φ′ at the two most recent values.
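A sketch of the quadratic-interpolation stage with the safeguard (my own code; the cubic stage is omitted for brevity and the safeguard thresholds are assumed):

# Sketch: sufficient-decrease line search using the quadratic interpolation
# step above, with the safeguard alpha_i = alpha_{i-1}/2 (cubic stage omitted).
import numpy as np

def quad_interp_linesearch(phi, phi0, dphi0, alpha0=1.0, c1=1e-4, max_iter=20):
    alpha = alpha0
    for _ in range(max_iter):
        phia = phi(alpha)
        if phia <= phi0 + c1 * alpha * dphi0:       # sufficient decrease holds
            return alpha
        # Minimiser of the quadratic interpolating phi(0), phi'(0), phi(alpha).
        alpha_new = -dphi0 * alpha**2 / (2.0 * (phia - phi0 - dphi0 * alpha))
        # Safeguard: reject trial values too close to, or much smaller than, alpha.
        if not (0.1 * alpha <= alpha_new <= 0.9 * alpha):
            alpha_new = alpha / 2.0
        alpha = alpha_new
    return alpha

# Example along a steepest descent direction of an assumed quartic.
f = lambda x: x[0]**4 + x[1]**2
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
xk = np.array([1.0, 1.0]); p = -grad(xk)
phi = lambda a: f(xk + a * p)
print(quad_interp_linesearch(phi, f(xk), p @ grad(xk)))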
Initial step length
For Newton and quasi-Newton methods use α0 = 1. This ensures that the unit step length will be taken whenever it satisfies the termination conditions, allowing for quick convergence.

For methods which do not produce well scaled search directions, like steepest descent or conjugate gradient, it is important to use the available information to make the initial guess, e.g.:
Require the first order change in the function at iterate xk to be the same as that obtained at the previous step, i.e. α0 pkᵀ∇f(xk) = αk−1 pk−1ᵀ∇f(xk−1), which gives

    α0 = αk−1 pk−1ᵀ∇f(xk−1) / (pkᵀ∇f(xk)).

Interpolate a quadratic to the data f(xk−1), f(xk) and pk−1ᵀ∇f(xk−1), and define α0 to be its minimiser,

    α0 = 2(f(xk) − f(xk−1)) / φ′(0).

It can be shown that if xk → x⋆ superlinearly, then this ratio converges to 1. If we adjust by setting α0 = min(1, 1.01α0), the unit step length will eventually always be tried and accepted, and superlinear convergence will be observed.
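In code form (a small sketch; the function names and sample numbers are mine), the two heuristics read:

# Sketch: the two initial-step-length heuristics above.
import numpy as np

def alpha0_first_order(alpha_prev, g_prev, p_prev, g, p):
    # Match the first order change of the previous step.
    return alpha_prev * (p_prev @ g_prev) / (p @ g)

def alpha0_interpolation(f_prev, f_curr, dphi0):
    # Minimiser of the interpolating quadratic, with the min(1, 1.01*alpha0)
    # adjustment from the remark above.
    return min(1.0, 1.01 * 2.0 * (f_curr - f_prev) / dphi0)

print(alpha0_first_order(0.1, np.array([4.0, 2.0]), np.array([-4.0, -2.0]),
                         np.array([1.0, -0.5]), np.array([-1.0, 0.5])))   # 1.6
print(alpha0_interpolation(f_prev=2.0, f_curr=1.5, dphi0=-4.0))           # 0.2525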