
Numerical Optimisation:
Line search methods

Marta M. Betcke
m.betcke@ucl.ac.uk,

Kiko Rullan
f.rullan@cs.ucl.ac.uk

Department of Computer Science,
Centre for Medical Image Computing,

Centre for Inverse Problems
University College London

Lecture 2 & 3


Descent direction

A descent direction is a vector p ∈ Rn along which the function f decreases.

From Taylor’s theorem

f(xk + αp) = f(xk) + αpT∇f(xk) + (1/2)α²pT∇2f(xk + tp)p,  t ∈ (0, α)
           = f(xk) + αpT∇f(xk) + O(α²).

Thus, for α > 0 small enough, f(xk + αp) < f(xk) whenever

pT∇f(xk) = ‖p‖‖∇f(xk)‖ cos θ < 0  ⇔  |θ| > π/2,

where θ is the angle between p and ∇f(xk).


Steepest descent direction

The steepest descent direction p solves

min pT∇f(xk)  subject to ‖p‖ = 1.

Since

min‖p‖=1 pT∇f(xk) = min‖p‖=1 ‖p‖‖∇f(xk)‖ cos θ = −‖∇f(xk)‖,

the minimum is attained for cos θ = −1, i.e. p = −∇f(xk)/‖∇f(xk)‖, where θ is the angle between p and ∇f(xk).


Newton direction

Consider the second order Taylor polynomial

f(xk + p) ≈ f(xk) + pT∇f(xk) + (1/2)pT∇2f(xk)p =: m2(p)

and assume ∇2f(xk) is positive definite.

The Newton direction minimises the second order Taylor polynomial m2. Setting

m2′(p) = ∇2f(xk)p + ∇f(xk) = 0

yields

p = −(∇2f(xk))−1∇f(xk).

The Newton direction is reliable when m2(p) is a close approximation to f(xk + p), i.e. when ∇2f(xk + tp), t ∈ (0, 1), and ∇2f(xk) are close. This is the case if ∇2f is sufficiently smooth, and then the difference is of order O(‖p‖³).
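A minimal numpy sketch of this computation, solving the linear system ∇2f(xk)p = −∇f(xk) rather than forming the inverse; grad_f and hess_f are hypothetical names for user-supplied callables returning the gradient and Hessian.

import numpy as np

def newton_direction(grad_f, hess_f, x):
    """Newton direction p solving hess_f(x) p = -grad_f(x)."""
    g = grad_f(x)                    # gradient at x, shape (n,)
    H = hess_f(x)                    # Hessian at x, shape (n, n)
    p = np.linalg.solve(H, -g)       # avoid forming the explicit inverse
    # p is guaranteed to be a descent direction only if H is positive definite
    if g @ p >= 0:
        raise ValueError("not a descent direction: Hessian not positive definite here")
    return p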


For the Newton direction (with ∇2f(xk) positive definite),

pT∇f(xk) = −pT∇2f(xk)p ≤ −σ‖p‖²

for some σ > 0. Thus unless ∇f(xk) = 0 (and hence p = 0), pT∇f(xk) < 0 and p is a descent direction.

The step length 1 is optimal for f(xk + p) = m2(p), thus 1 is used unless it does not produce a satisfactory reduction of f.

If ∇2f(xk) is not positive definite, the Newton direction may not be defined: if ∇2f(xk) is singular, ∇2f(xk)−1 does not exist. Otherwise, p may not be a descent direction, which can be remedied.

Fast local convergence (quadratic) close to the solution, but computing the Hessian is expensive.

Quasi-Newton direction

Use a symmetric positive definite (s.p.d.) approximation Bk to the Hessian ∇2f(xk) in the Newton step,

p = −Bk−1∇f(xk),  with ∇f(xk)TBk−1∇f(xk) > 0,

such that superlinear convergence is retained.

Bk is updated in each step taking into account the additional
information gathered during that step. The updates make use of
the fact that changes in gradient provide information about the
second derivative of f along the search direction.

Secant equation

∇f (xk + p) = ∇f (xk) + Bkp

This equation is underdetermined; different quasi-Newton methods differ in the way they solve it.


Line search

Given the search direction p, the optimal reduction of f amounts to minimising the function of one variable

φ(α) := f (xk + αp), α > 0.

This is in general too expensive (even for a local minimiser), hence inexact line search is of interest.

The choice of step length is important: too small steps mean slow convergence, while too large steps may not lead to a reduction of the objective function f.


Conditions for decrease

Simple condition: require f(xk + αp) < f(xk).

Consider a sequence with f(xk) = 5/k, k = 1, 2, . . .. This sequence is decreasing, but its limiting value is 0, while the minimum of a convex function can be smaller than 0.

Figure: Nocedal Wright Fig 3.2

Such decrease is insufficient to converge to the minimum of a convex function. Hence we need conditions for sufficient decrease.

Sufficient decrease condition

Armijo condition:

f(xk + αp) ≤ f(xk) + c1αpT∇f(xk) =: ℓ(α), for some c1 ∈ (0, 1). [Typically small, c1 = 10⁻⁴]

ℓ(α) is a linear function with negative slope c1pT∇f(xk) < 0, and

ℓ(α) = f(xk) + c1αpT∇f(xk) > f(xk) + αpT∇f(xk) = φ(0) + αφ′(0).

From Taylor's theorem, φ(α) = φ(0) + αφ′(0) + (1/2)α²φ′′(ξ), ξ ∈ (0, α); thus for sufficiently small α > 0, ℓ(α) > φ(α).

Figure: Nocedal Wright Fig 3.3

Curvature condition

The Armijo condition is satisfied for all sufficiently small α, so we need another condition to rule out unacceptably short steps.

Curvature condition

pT∇f(xk + αp) ≥ c2 pT∇f(xk),  c2 ∈ (c1, 1),

i.e. φ′(α) ≥ c2φ′(0).

This ensures that we progress far enough along a good direction p.

If φ′(α) is strongly negative, there is a good prospect of significant further decrease along p.

If φ′(α) is only slightly negative (or even positive), there is little prospect of further decrease, and hence we terminate the line search.

Typically c2 = 0.9 for a Newton or quasi-Newton direction and c2 = 0.1 for nonlinear conjugate gradient methods.



Figure: Nocedal Wright Fig 3.4


Wolfe conditions

The sufficient decrease (Armijo rule) and curvature conditions
together are called Wolfe conditions

f (xk + αp) ≤ f (xk) + c1αpT∇f (xk),
pT∇f (xk + αp) ≥ c2pT∇f (xk),

for 0 < c1 < c2 < 1.

The Wolfe conditions may admit points far away from stationary points, hence the strong Wolfe conditions disallow "too positive" values of φ′(α):

f(xk + αp) ≤ f(xk) + c1αpT∇f(xk),
|pT∇f(xk + αp)| ≤ c2|pT∇f(xk)|,

for 0 < c1 < c2 < 1.

Wolfe conditions: existence

Let f : Rn → R be continuously differentiable and p a descent direction at xk. If f is bounded below along the ray {xk + αp | α > 0}, then there exists an interval of step lengths satisfying both the Wolfe conditions and the strong Wolfe conditions.
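Before the existence proof, a small sketch of checking the Wolfe and strong Wolfe conditions at a trial step length; f and grad_f are hypothetical user-supplied callables for the objective and its gradient.

def wolfe_conditions(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """Return (wolfe, strong_wolfe) flags for the trial step length alpha,
    where p is a descent direction at x and 0 < c1 < c2 < 1."""
    phi0 = f(x)
    dphi0 = grad_f(x) @ p                   # phi'(0) = p^T grad f(x) < 0
    phi_a = f(x + alpha * p)                # phi(alpha)
    dphi_a = grad_f(x + alpha * p) @ p      # phi'(alpha)

    armijo = phi_a <= phi0 + c1 * alpha * dphi0       # sufficient decrease
    curvature = dphi_a >= c2 * dphi0                  # Wolfe curvature condition
    strong_curvature = abs(dphi_a) <= c2 * abs(dphi0) # strong Wolfe curvature

    return armijo and curvature, armijo and strong_curvature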

Proof:
For α > 0, φ(α) = f(xk + αp) is bounded below, while ℓ(α) = f(xk) + αc1pT∇f(xk) is unbounded below since c1pT∇f(xk) < 0, and for small α, ℓ(α) > φ(α) since c1 < 1. Thus the graph of ℓ has to intersect the graph of φ at least once. Let α′ be the smallest such value,

φ(α′) = f(xk + α′p) = f(xk) + α′c1pT∇f(xk) = ℓ(α′).

Then the sufficient decrease condition holds for all α ≤ α′.

Furthermore, by the mean value theorem there exists α′′ ∈ (0, α′) such that

f(xk + α′p) − f(xk) = α′pT∇f(xk + α′′p),

and combining with the equation above we obtain

pT∇f(xk + α′′p) = c1pT∇f(xk) > c2pT∇f(xk),

since c1 < c2 and pT∇f(xk) < 0. Therewith α′′ satisfies the Wolfe conditions. As the inequality in the curvature condition holds strictly at α′′, by continuity of ∇f the inequality (and hence the Wolfe conditions) also holds in an interval containing α′′. Furthermore, as all terms in the last equation are negative, the strong Wolfe conditions hold in the same interval.

The Wolfe conditions are scale invariant in the sense that they are unaffected by scaling of the function or an affine change of variables. They can be used in most line search methods and are particularly important for quasi-Newton methods.

Goldstein conditions

f(xk) + (1 − c)αpT∇f(xk) ≤ f(xk + αp) ≤ f(xk) + cαpT∇f(xk)

with 0 < c < 1/2.

The second inequality is the sufficient decrease condition. The first inequality controls the step length from below. A disadvantage w.r.t. the Wolfe conditions is that the first inequality can exclude all minimisers of φ.

Figure: Nocedal Wright Fig 3.6

Goldstein conditions are often used in Newton-type methods, but are not well suited for quasi-Newton methods with positive definite Hessian approximations.

Backtracking: sufficient decrease avoiding too small steps

Backtracking line search
1: Choose ᾱ > 0, ρ ∈ (0, 1), c ∈ (0, 1)
2: Set α = ᾱ
3: repeat
4: α = ρα
5: until f (xk + αp) ≤ f (xk) + cαpT∇f (xk)

Terminates in a finite number of steps: α eventually becomes small enough to satisfy the sufficient decrease condition.

Prevents too short step lengths: the accepted α is within a factor ρ of the previous trial value, α/ρ, which was rejected for violating the sufficient decrease condition, i.e. for being too long.

ρ can vary in [ρmin, ρmax] ⊂ (0, 1) between iterations.
In Newton and quasi-Newton methods ᾱ = 1, but different
values can be appropriate for other algorithms.

Well suited for Newton methods, less appropriate for
quasi-Newton and conjugate gradient methods.
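A direct Python sketch of the backtracking pseudocode above; f and grad_f are hypothetical callables, and here ᾱ itself is tested before any shrinking.

def backtracking_line_search(f, grad_f, x, p, alpha_bar=1.0, rho=0.5, c=1e-4,
                             max_iter=50):
    """Shrink alpha by the factor rho until the sufficient decrease
    (Armijo) condition holds; p must be a descent direction at x."""
    phi0 = f(x)
    dphi0 = grad_f(x) @ p            # p^T grad f(x) < 0
    alpha = alpha_bar
    for _ in range(max_iter):        # safeguard against non-descent directions
        if f(x + alpha * p) <= phi0 + c * alpha * dphi0:
            return alpha
        alpha *= rho                 # rejected: step too long, shrink it
    return alpha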


Convergence of line search methods [Zoutendijk]

Consider an iteration

xk+1 = xk + αkpk , k = 0, 1, . . . ,

where pk is a descent direction and αk satisfies the Wolfe conditions.
Let f be bounded below in Rn and continuously differentiable in an
open set M containing the level set {x : f (x) ≤ f (x0)}.
If ∇f is Lipschitz continuous on M i.e.

∃L > 0 : ‖∇f (x)−∇f (x̄)‖ ≤ L‖x − x̄‖, ∀x , x̄ ∈M

then

∑k≥0 cos² θk ‖∇f(xk)‖² < ∞,  where θk = ∠(pk, −∇f(xk)).

Proof:
Subtracting pkT∇f(xk) from both sides of the curvature condition

pkT∇f(xk + αkpk) ≥ c2 pkT∇f(xk),  with xk+1 = xk + αkpk,

we obtain

pkT(∇f(xk+1) − ∇f(xk)) ≥ (c2 − 1) pkT∇f(xk).

On the other hand, the Lipschitz condition implies

pkT(∇f(xk+1) − ∇f(xk)) ≤ ‖∇f(xk+1) − ∇f(xk)‖‖pk‖ ≤ αkL‖pk‖².

Combining the two inequalities, we obtain a lower bound on the step size,

αk ≥ (c2 − 1)/L · pkT∇f(xk)/‖pk‖².

Substituting this inequality into the sufficient decrease condition gives

f(xk+1) ≤ f(xk) + c1(c2 − 1)/L · (pkT∇f(xk))²/‖pk‖²,

which with cos θk = −pkT∇f(xk)/(‖∇f(xk)‖‖pk‖) yields

f(xk+1) ≤ f(xk) − c cos² θk ‖∇f(xk)‖²,  where c = c1(1 − c2)/L.

Summing over all indices up to k we obtain

f(xk+1) ≤ f(x0) − c ∑j=0..k cos² θj ‖∇f(xj)‖²,

and since f is bounded from below,

∑j=0..k cos² θj ‖∇f(xj)‖² ≤ (f(x0) − f(xk+1))/c < C,

where C > 0 is some positive constant. Taking the limit,

∑k=0..∞ cos² θk ‖∇f(xk)‖² < ∞.

Global convergence

The Goldstein or strong Wolfe conditions also imply the Zoutendijk condition

∑k≥0 cos² θk ‖∇f(xk)‖² < ∞.

The Zoutendijk condition implies cos² θk ‖∇f(xk)‖² → 0, which can be used to derive global convergence results for line search algorithms.

If the method ensures that cos θk ≥ δ > 0 for all k, i.e. θk is bounded away from π/2, it follows that

limk→∞ ‖∇f(xk)‖ = 0.

This is the strongest global convergence result that can be obtained for such an iteration (convergence to a stationary point) without additional assumptions.


In particular, the steepest descent (pk = −∇f (xk)) produces a
gradient sequence which converges to 0 if it uses a line search
satisfying Wolfe or Goldstein conditions.

For some algorithms, e.g. nonlinear conjugate gradient methods, only a weaker result can be obtained,

lim infk→∞ ‖∇f(xk)‖ = 0,

i.e. only a subsequence of gradient norms ‖∇f(xkj)‖ converges to 0 rather than the whole sequence.


These limits can be proved by contradiction:
Suppose that ‖∇f(xk)‖ ≥ γ for some γ > 0 and all k sufficiently large. Then cos² θk‖∇f(xk)‖² → 0 implies cos θk → 0, i.e. the entire sequence {cos θk} converges to 0. Thus, to obtain a contradiction and hence the weak convergence result, it is enough to show that a subsequence {cos θkj} is bounded away from 0.

Consider any algorithm which
(i) decreases the objective function in each iteration,
(ii) every mth iteration is a steepest descent step with step length
satisfying the Wolfe or Goldstein conditions.
Since cos θk = 1 for the steepest descent steps, this provides a subsequence bounded away from 0. The algorithm can do something better in the remaining m − 1 iterates, while the occasional steepest descent step guarantees the overall (weak) global convergence.


Rate of convergence

Unfortunately, rapid convergence sometimes conflicts with global
convergence.

Example: Steepest descent is globally convergent (with appropriate
step sizes) but it can be very slow in practice. On the other hand,
while Newton iteration converges rapidly when we are close to the
solution, the Newton step may not even be a descent direction far
away from the solution.

The challenge: design algorithms with good global convergence
properties and rapid convergence rate.


Steepest descent

Consider steepest descent with exact line search for a strictly convex quadratic function

f(x) = (1/2)xTQx − bTx,

where Q is symmetric positive definite.

Figure: Nocedal Wright Fig 3.7


Steepest descent

Characteristic zig-zag behaviour due to the elongated shape of the elliptical level sets. If the level sets were circles instead, steepest descent would need only one step.
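A small numerical sketch of this behaviour on a hypothetical ill-conditioned quadratic, using the analytically optimal step for a quadratic, αk = −pkT∇f(xk)/(pkTQpk) (see the step length selection section below):

import numpy as np

# Hypothetical example: f(x) = 0.5 x^T Q x - b^T x with elongated level sets
Q = np.diag([1.0, 20.0])            # eigenvalues lambda_1 = 1, lambda_n = 20
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(Q, b)      # exact minimiser

x = np.array([5.0, 2.0])
errs = []                           # ||x_k - x*||_Q^2 along the iteration
for k in range(30):
    g = Q @ x - b                   # gradient of the quadratic
    p = -g                          # steepest descent direction
    alpha = (g @ g) / (p @ Q @ p)   # exact line search step
    x = x + alpha * p
    errs.append((x - x_star) @ Q @ (x - x_star))

# observed contraction per step vs the bound ((lambda_n - lambda_1)/(lambda_n + lambda_1))^2
print(errs[-1] / errs[-2], ((20.0 - 1.0) / (20.0 + 1.0)) ** 2)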

Convergence rate of steepest descent with exact line search:

‖xk+1 − x*‖Q² ≤ ((λn − λ1)/(λn + λ1))² ‖xk − x*‖Q²,

where 0 < λ1 ≤ λ2 ≤ · · · ≤ λn are the eigenvalues of Q, ‖x‖Q² = xTQx and (1/2)‖x − x*‖Q² = f(x) − f(x*).

Note: for a strictly convex quadratic function we obtain objective function and (for free) iterate convergence rates! The objective function convergence rate is essentially the same for steepest descent with exact line search applied to a twice continuously differentiable nonlinear function satisfying the sufficient conditions at x*.

Local convergence rate: Newton methods

Let f : Rn → R be twice continuously differentiable with Lipschitz continuous Hessian in a neighbourhood of the solution x* satisfying the sufficient conditions. Note that the Hessian ∇2f is then positive definite also in the vicinity of the solution x*.

The iterates xk computed by the Newton method (note step length 1),

xk+1 = xk − (∇2f(xk))−1∇f(xk),

converge locally quadratically, i.e. for a starting point x0 sufficiently close to x*. The sequence of gradient norms ‖∇f(xk)‖ also converges quadratically to 0.

Local convergence: note that away from the solution ∇2f(xk) may not be positive definite and hence pk may not be a descent direction. Global convergence with Hessian modification is discussed later.

Convergence rates: Newton-type methods

Let f : Rn → R be twice continuously differentiable. Let {xk} be a sequence generated by a descent method xk+1 = xk + αkpk with step sizes satisfying the Wolfe conditions with c1 ≤ 1/2. If the sequence {xk} converges to a point x* satisfying the sufficient conditions and the search direction satisfies

limk→∞ ‖∇f(xk) + ∇2f(xk)pk‖ / ‖pk‖ = 0,

then for all k > k0 the step length αk = 1 is admissible, and for that choice of αk = 1, k > k0, the sequence {xk} converges to x* superlinearly.

Note: for the Newton direction, once close enough to the solution so that ∇2f(xk) becomes s.p.d., the limit above is trivially satisfied, and for αk = 1 we recover local quadratic convergence.


Convergence rates: quasi-Newton methods

Let f : Rn → R be twice continuously differentiable.
Let {xk} be a sequence generated by a quasi-Newton method
(note step length 1, Bk s.p.d.),

xk+1 = xk − Bk−1∇f(xk) = xk + pk,  pk = −Bk−1∇f(xk).

Assume the sequence {xk} converges to a point x* satisfying the sufficient conditions. Then {xk} converges superlinearly if and only if

limk→∞ ‖(Bk − ∇2f(x*))pk‖ / ‖pk‖ = 0.

Note: the superlinear convergence rate can be attained even if the sequence {Bk} does not converge to ∇2f(x*). It suffices that Bk becomes an increasingly accurate approximation to ∇2f(x*) along the search direction pk; quasi-Newton updates exploit this in the construction of Bk.


Hessian modifications

Away from the solution, the Hessian may not be positive definite,
and the Newton direction may not be a descent direction. The
general solution is to consider positive definite approximations.

Bk = ∇2f (xk) + Ek , where Ek is chosen to ensure that Bk is
sufficiently positive definite.

Global convergence results can be established for the Newton method with Hessian modification and steps satisfying the Wolfe, Goldstein or Armijo backtracking conditions, provided that

κ(Bk) = ‖Bk‖‖Bk−1‖ ≤ C for some C > 0 and all k,

whenever the sequence of Hessians {∇2f(xk)} is bounded.


Hessian modifications

Eigenvalue decomposition

∇2f(xk) = QΛQT = ∑i=1..n λi qi qiT.

Example:

∇2f(xk) = diag(10, 3, −1),  ∇f(xk) = (1, −3, 2)T,

Q = I,  Λ = diag(λ1, λ2, λ3) = diag(10, 3, −1).

The Newton step: pk = −(∇2f(xk))−1∇f(xk) = (−0.1, 1, 2)T.

As pkT∇f(xk) = 0.9 > 0, it is not a descent direction.


Eigenvalue modifications (not practical):
Replace all negative eigenvalues with δ = √u = 10⁻⁸, where u = 10⁻¹⁶ is the machine precision:

Bk = ∑i=1..2 λi qi qiT + δ q3 q3T = diag(10, 3, 10⁻⁸).

Bk is s.p.d. and the curvature along q1, q2 is preserved, however the direction is dominated by q3:

pk = −Bk−1∇fk = −∑i=1..2 (1/λi) qi qiT∇fk − (1/δ) q3 q3T∇fk ≈ −(2 × 10⁸) q3.

pk is a descent direction, but its length is very large, which is not in line with the local validity of the Newton approximation; thus pk may be ineffective.

Adapt the choice of δ to avoid excessive lengths, possibly even δ = 0, which eliminates the direction q3.
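A numpy sketch of this eigenvalue modification (raising eigenvalues below δ up to δ, which coincides with the minimum Frobenius norm correction described next), reproducing the example above:

import numpy as np

def spectral_modification(A, delta=1e-8):
    """Raise eigenvalues of the symmetric matrix A that are below delta
    up to delta (illustrative only, not a practical strategy)."""
    lam, Q = np.linalg.eigh(A)               # A = Q diag(lam) Q^T
    return Q @ np.diag(np.maximum(lam, delta)) @ Q.T

A = np.diag([10.0, 3.0, -1.0])               # indefinite Hessian from the example
g = np.array([1.0, -3.0, 2.0])               # gradient from the example
B = spectral_modification(A)                 # diag(10, 3, 1e-8)
p = -np.linalg.solve(B, g)                   # ~ -(2e8) q_3: descent, but huge
print(g @ p < 0, np.linalg.norm(p))          # True, ~2e8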


Let A be symmetric, A = QΛQT.

The correction matrix ∆A of minimum Frobenius norm that ensures λmin(A + ∆A) ≥ δ is given by

∆A = Q diag(τi) QT,  with τi = 0 if λi ≥ δ, and τi = δ − λi if λi < δ,

and the modified matrix is A + ∆A = Q(Λ + diag(τi))QT. Here the Frobenius norm is defined by ‖A‖F² = ∑i,j=1..n aij² = ∑i=1..n λi².

The correction matrix ∆A of minimum Euclidean norm that satisfies λmin(A + ∆A) ≥ δ is given by

∆A = τ I,  with τ = max(0, δ − λmin(A)),

and the modified matrix has the form A + τ I.

Cholesky factorisation of A + τ I

Simple idea:

If mini aii ≤ 0, set τ0 = −mini aii + β for some small β > 0 (e.g. β = 10⁻³), otherwise set τ0 = 0.

Attempt the Cholesky algorithm to obtain LLT = A + τk I

If not successful, increase τk+1 = max(2τk , β) and reattempt.

Drawback: possibly multiple failed attempts to factorise.
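A sketch of this simple strategy, using numpy's Cholesky factorisation and treating its failure as the indefiniteness test (this is only the trial-and-error variant above, not the modified Cholesky factorisation discussed next):

import numpy as np

def cholesky_added_multiple_of_identity(A, beta=1e-3, max_tries=100):
    """Attempt Cholesky of A + tau*I, increasing tau until it succeeds."""
    diag_min = np.min(np.diag(A))
    tau = 0.0 if diag_min > 0 else -diag_min + beta
    for _ in range(max_tries):
        try:
            L = np.linalg.cholesky(A + tau * np.eye(A.shape[0]))
            return L, tau                        # L L^T = A + tau*I
        except np.linalg.LinAlgError:            # A + tau*I not positive definite
            tau = max(2.0 * tau, beta)
    raise RuntimeError("no suitable tau found")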


Cholesky decomposition

Figure: Nocedal Wright Ex. 3.1


Cholesky decomposition of indefinite matrix

For A indefinite:

The factorisation A = LDLT may not exist.

Even if it does exist, the algorithm can be unstable
i.e. elements of L and D can become arbitrarily large.

A posteriori modification of D to force the elements to be positive may break down or result in a matrix very different from A.

Instead, modify A during the factorisation so that the elements of D are sufficiently positive and the elements of L and D are not too large.


Modified Cholesky decomposition

Choose δ, β > 0. While computing the jth column of L, D, ensure

dj ≥ δ,  |mij| ≤ β,  i = j + 1, j + 2, . . . , n,

where mij = lij√dj.

To satisfy these bounds we only need to change how dj is computed, from dj = cjj to

dj = max(|cjj|, θj²/β², δ),  with θj = maxj<i≤n |cij|.


Modified Cholesky decomposition

Properties:

Modifies the Hessian during the factorisation where necessary.

The modified Cholesky factors exist and are bounded relative
to the norm of the actual Hessian.

It does not modify the Hessian if it is sufficiently positive definite.

This is the basis for the modified Cholesky factorisation which also
introduces symmetric row and column permutations to reduce the
size of the modification.

PAPT + E = LDLT = MMT,

where E is a nonnegative diagonal matrix that is zero if A is
sufficiently positive definite.
It has been shown that the matrices obtained by applying this modified Cholesky algorithm to the exact Hessian ∇2f(xk) have bounded condition numbers, hence some global convergence results can be obtained.


Step length selection

How to find a step length satisfying one of the termination conditions, e.g. the Wolfe conditions, for

φ(α) = f(xk + αpk),

where pk is a descent direction, i.e. φ′(0) = pkT∇f(xk) < 0.

If f is a convex quadratic function, f(x) = (1/2)xTQx − bTx, it has a global minimiser along the ray xk + αpk which can be calculated analytically,

αk = −pkT∇f(xk) / pkTQpk.

For general nonlinear functions an iterative approach is necessary.

Line search algorithms can be classified according to the information they use:

Methods using only function evaluations can be very inefficient, as they need to continue iterating until a very small interval has been found.

Methods using gradient information can determine whether the current step length satisfies e.g. the Wolfe or Goldstein conditions, which require gradients to evaluate. Typically they consist of two phases: a bracketing phase, which finds an interval containing acceptable step lengths, and a selection phase, which locates the final step within that interval.

Line search via interpolation

Aim: find a step length α that satisfies the sufficient decrease condition without being too small. [similarity to backtracking]

Sufficient decrease condition: φ(αk) ≤ φ(0) + c1αkφ′(0). We want to compute as few derivatives ∇f(x) as possible.

Initial guess α0: check the sufficient decrease condition φ(α0) ≤ φ(0) + c1α0φ′(0). If satisfied, terminate the search. Otherwise, the interval [0, α0] contains acceptable step lengths.

Quadratic approximation φq(α) to φ interpolating the available information φq(0) = φ(0), φq′(0) = φ′(0) and φq(α0) = φ(α0) yields

φq(α) = [(φ(α0) − φ(0) − α0φ′(0))/α0²] α² + φ′(0)α + φ(0).

The new trial value α1 is defined as the minimiser of φq, i.e.

α1 = −φ′(0)α0² / [2(φ(α0) − φ(0) − φ′(0)α0)].

If the sufficient decrease condition is satisfied, terminate the search. Otherwise, construct a cubic interpolating the four available values φc(0) = φ(0), φc′(0) = φ′(0), φc(α0) = φ(α0) and φc(α1) = φ(α1), which yields

φc(α) = aα³ + bα² + αφ′(0) + φ(0),

[a; b] = 1/(α0²α1²(α1 − α0)) [α0², −α1²; −α0³, α1³] [φ(α1) − φ(0) − φ′(0)α1; φ(α0) − φ(0) − φ′(0)α0].

By differentiating φc we find its minimiser α2 ∈ [0, α1],

α2 = (−b + √(b² − 3aφ′(0))) / (3a).

If necessary, repeat the cubic interpolation with φc(0) = φ(0), φc′(0) = φ′(0) and the two most recent values φc(αk−1) = φ(αk−1) and φc(αk) = φ(αk), until αk+1 satisfies the sufficient decrease condition.

Safeguard: if any αi is either too close to αi−1 or much smaller than αi−1, reset αi = αi−1/2.

If derivatives can be computed along with the function values at little additional cost, we can also devise a variant interpolating φ, φ′ at the two most recent values.

Initial step length

For Newton and quasi-Newton methods α0 = 1. This ensures that the unit step length will be taken whenever it satisfies the termination conditions, allowing for quick convergence.

For methods which do not produce well scaled search directions, like steepest descent or conjugate gradient methods, it is important to use the available information to make the initial guess, e.g.:

Require that the first order change in the function at iterate xk is the same as that obtained at the previous step, i.e. α0pkT∇f(xk) = αk−1pk−1T∇f(xk−1), giving

α0 = αk−1 pk−1T∇f(xk−1) / pkT∇f(xk).

Interpolate a quadratic to the data f(xk−1), f(xk) and pk−1T∇f(xk−1) and define α0 to be its minimiser,

α0 = 2(f(xk) − f(xk−1)) / φ′(0).

It can be shown that if xk → x* superlinearly, then the ratio in this expression converges to 1. If we adjust by setting α0 = min(1, 1.01α0), the unit step length will eventually always be tried and accepted, and the superlinear convergence will be observed.
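To close, a Python sketch of the interpolation procedure described above: a quadratic step first, then safeguarded cubic steps, terminating on sufficient decrease. Here phi and dphi0 = φ′(0) are assumed given; this is a simplified illustration rather than a full bracketing/selection line search.

import numpy as np

def interpolation_line_search(phi, dphi0, alpha0=1.0, c1=1e-4, max_iter=20):
    """Step length via quadratic, then cubic, interpolation of phi(alpha),
    accepted once phi(alpha) <= phi(0) + c1*alpha*dphi0 (dphi0 = phi'(0) < 0)."""
    phi0 = phi(0.0)

    def sufficient(a, pa):
        return pa <= phi0 + c1 * a * dphi0

    a_prev, phi_prev = alpha0, phi(alpha0)
    if sufficient(a_prev, phi_prev):
        return a_prev

    # minimiser of the quadratic interpolating phi(0), phi'(0), phi(alpha0)
    a_cur = -dphi0 * a_prev**2 / (2.0 * (phi_prev - phi0 - dphi0 * a_prev))
    phi_cur = phi(a_cur)

    for _ in range(max_iter):
        if sufficient(a_cur, phi_cur):
            return a_cur
        # cubic interpolating phi(0), phi'(0), phi(a_prev), phi(a_cur)
        r_prev = phi_prev - phi0 - dphi0 * a_prev
        r_cur = phi_cur - phi0 - dphi0 * a_cur
        d = a_prev**2 * a_cur**2 * (a_cur - a_prev)
        a_coef = (a_prev**2 * r_cur - a_cur**2 * r_prev) / d
        b_coef = (-a_prev**3 * r_cur + a_cur**3 * r_prev) / d
        if abs(a_coef) < 1e-14:                  # nearly quadratic: just halve
            a_new = a_cur / 2.0
        else:
            a_new = (-b_coef + np.sqrt(max(b_coef**2 - 3*a_coef*dphi0, 0.0))) / (3*a_coef)
        # safeguard: too close to, or much smaller than, the previous trial
        if not np.isfinite(a_new) or a_new > 0.96 * a_cur or a_new < 0.1 * a_cur:
            a_new = a_cur / 2.0
        a_prev, phi_prev = a_cur, phi_cur
        a_cur, phi_cur = a_new, phi(a_new)
    return a_cur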