Numerical Optimisation: Quasi-Newton methods
Marta M. Betcke
m.betcke@ucl.ac.uk,
Kiko Rullan
f.rullan@cs.ucl.ac.uk
Department of Computer Science,
Centre for Medical Image Computing,
Centre for Inverse Problems
University College London
Lecture 7 & 8
Quasi-Newton
First idea by William C. Davidon in the mid-1950s, who was
frustrated by the performance of coordinate descent.
Quickly picked up by Fletcher and Powell, who demonstrated
that the new algorithm was much faster and more reliable
than existing methods.
Davidon’s original paper was not accepted for publication.
More than 30 years later it appeared in the first issue of the
SIAM Journal on Optimization in 1991.
Like steepest descent, quasi-Newton methods only require
the gradient of the objective function at each iterate.
By measuring changes in the gradient, they build a model of the
objective function which is good enough to produce
superlinear convergence.
As the Hessian is not required, quasi-Newton methods can be
more efficient than Newton's method, which may take a long time
to evaluate the Hessian and solve for the Newton direction.
Quasi-Newton
Quadratic model of the objective function at xk :
\[
m_k(p) = f_k + \nabla f_k^T p + \tfrac{1}{2} p^T B_k p,
\]
where $B_k \in \mathbb{R}^{n \times n}$ is symmetric positive definite and will be
updated during the iteration.
The minimiser of $m_k$ can be written explicitly as
\[
p_k = -B_k^{-1} \nabla f_k.
\]
$p_k$ is used as a search direction and the next iterate becomes
$x_{k+1} = x_k + \alpha_k p_k$.
The step length $\alpha_k$ is chosen to satisfy the Wolfe conditions.
The iteration is similar to line-search Newton, with the key
difference that the approximation $B_k$ is used in place of the true Hessian.
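As a small illustration (not from the slides), a NumPy sketch of how the search direction $p_k = -B_k^{-1}\nabla f_k$ would be computed in practice, solving a linear system rather than forming the inverse explicitly; the function name and interface are illustrative only:

```python
import numpy as np

def quasi_newton_direction(B_k, grad_k):
    """Minimiser of the quadratic model m_k: solve B_k p = -grad f_k.

    Assumes B_k is the current symmetric positive definite Hessian
    approximation; solving the system avoids forming B_k^{-1} explicitly.
    """
    return np.linalg.solve(B_k, -grad_k)
```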
Bk update
Davidon proposed to update Bk in each iteration instead of
computing it anew.
Question: When we have computed the new iterate $x_{k+1}$ and construct
the new model
\[
m_{k+1}(p) = f_{k+1} + \nabla f_{k+1}^T p + \tfrac{1}{2} p^T B_{k+1} p,
\]
what requirements should we impose on Bk+1 based on the
knowledge gathered in the last step?
Require: the gradient of $m_{k+1}$ should match the gradient of $f$ at the
last two iterates $x_k$, $x_{k+1}$.
i) At $x_{k+1}$, i.e. $p = 0$: $\nabla m_{k+1}(0) = \nabla f_{k+1}$ is satisfied automatically.
ii) At $x_k = x_{k+1} - \alpha_k p_k$:
\[
\nabla m_{k+1}(-\alpha_k p_k) = \nabla f_{k+1} - \alpha_k B_{k+1} p_k = \nabla f_k.
\]
By rearranging ii) we obtain
\[
B_{k+1} \alpha_k p_k = \nabla f_{k+1} - \nabla f_k.
\]
Define the vectors
\[
s_k = x_{k+1} - x_k = \alpha_k p_k, \qquad y_k = \nabla f_{k+1} - \nabla f_k;
\]
then ii) becomes the secant equation
\[
B_{k+1} s_k = y_k.
\]
As $B_{k+1}$ is symmetric positive definite, this is only possible if the
curvature condition holds,
\[
s_k^T y_k > 0,
\]
which is easily seen by multiplying the secant equation by $s_k^T$
from the left.
If $f$ is strongly convex, $s_k^T y_k > 0$ is satisfied for any $x_k$, $x_{k+1}$.
However, for nonconvex functions in general this condition will
have to be enforced explicitly by imposing restrictions on the line
search.
$s_k^T y_k > 0$ is guaranteed if we impose the Wolfe or strong Wolfe
conditions: from the second Wolfe condition $s_k^T \nabla f_{k+1} \ge c_2 s_k^T \nabla f_k$, $c_1 < c_2 < 1$, it
follows that
\[
s_k^T y_k \ge (c_2 - 1)\,\alpha_k p_k^T \nabla f_k > 0,
\]
since $c_2 < 1$ and $p_k$ is a descent direction, and hence the curvature
condition holds.
Davidon Fletcher Powell (DFP)
When $s_k^T y_k > 0$, the secant equation always has a solution $B_{k+1}$.
In fact, the secant equation is heavily underdetermined: a
symmetric matrix has $n(n+1)/2$ degrees of freedom, while the secant
equation imposes $n$ conditions and positive definiteness $n$ inequalities.
Extra conditions are needed to obtain a unique solution: we look for $B_{k+1}$ close
to $B_k$ in a certain sense.
DFP update:
\[
B_{k+1} = (I - \rho_k y_k s_k^T)\, B_k\, (I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T \qquad \text{(DFP B)}
\]
with $\rho_k = 1/(y_k^T s_k)$.
The inverse $H_k = B_k^{-1}$ can be obtained with the
Sherman-Morrison-Woodbury formula:
\[
H_{k+1} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{y_k^T s_k}. \qquad \text{(DFP H)}
\]
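A minimal NumPy sketch of the (DFP H) update, assuming $s_k^T y_k > 0$ holds (the function name and interface are illustrative only):

```python
import numpy as np

def dfp_update_H(H, s, y):
    """DFP update (DFP H) of the inverse Hessian approximation H.

    s = x_{k+1} - x_k, y = grad f_{k+1} - grad f_k; requires s^T y > 0.
    """
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)
```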
Sherman-Morrison-Woodbury formula
Figure: Nocedal & Wright (A.28)
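For reference, a standard statement of the formula referred to above (the figure is not reproduced here; the precise form and numbering in Nocedal & Wright may differ slightly): if $A \in \mathbb{R}^{n \times n}$ is nonsingular, $U, V \in \mathbb{R}^{n \times p}$, and $A + U V^T$ is nonsingular, then
\[
(A + U V^T)^{-1} = A^{-1} - A^{-1} U \left(I + V^T A^{-1} U\right)^{-1} V^T A^{-1}.
\]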
Broyden Fletcher Goldfarb Shanno (BFGS)
Apply the same argument directly to the inverse of the Hessian,
$H_k$. The updated approximation $H_{k+1}$ must be symmetric and
positive definite and must satisfy the secant equation
\[
H_{k+1} y_k = s_k.
\]
BFGS update:
\[
H_{k+1} = (I - \rho_k s_k y_k^T)\, H_k\, (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T \qquad \text{(BFGS)}
\]
with $\rho_k = 1/(y_k^T s_k)$.
How to choose $H_0$? It depends on the situation and the information
available about the problem, e.g. start with the inverse of an approximate Hessian
computed by finite differences at $x_0$. Otherwise, we can set $H_0$ to the
identity or to a diagonal matrix that reflects the scaling of the variables.
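A minimal NumPy sketch of the (BFGS) update of $H_k$, assuming the curvature condition $s_k^T y_k > 0$ holds (names are illustrative):

```python
import numpy as np

def bfgs_update_H(H, s, y):
    """BFGS update (BFGS) of the inverse Hessian approximation H.

    s = x_{k+1} - x_k, y = grad f_{k+1} - grad f_k; s^T y > 0 is
    guaranteed when the step satisfies the Wolfe conditions.
    """
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)   # I - rho * s y^T
    return V @ H @ V.T + rho * np.outer(s, s)
```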
BFGS
1: Given x0, inverse Hessian approximation H0, tolerance ε > 0
2: Set k = 0
3: while ‖∇fk‖ > ε do
4: Compute search direction
pk = −Hk∇fk
5: xk+1 = xk + αkpk where αk is computed with a line search
procedure satisfying Wolfe conditions
6: Define sk = xk+1 − xk and yk = ∇fk+1 −∇fk
7: Compute Hk+1 using (BFGS)
8: k = k + 1
9: end while
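A compact sketch of the loop above in NumPy/SciPy, reusing the `bfgs_update_H` helper sketched earlier and SciPy's strong-Wolfe line search; the fallback step when the line search fails and the curvature-condition guard are ad-hoc choices, not part of the slides:

```python
import numpy as np
from scipy.optimize import line_search  # strong Wolfe line search

def bfgs(f, grad_f, x0, H0=None, eps=1e-6, max_iter=200):
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x)) if H0 is None else H0.copy()
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        p = -H @ g                                   # search direction
        alpha = line_search(f, grad_f, x, p, gfk=g, c1=1e-4, c2=0.9)[0]
        if alpha is None:                            # line search failed
            alpha = 1e-3
        x_new = x + alpha * p
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-12:                            # curvature condition
            H = bfgs_update_H(H, s, y)
        x, g = x_new, g_new
    return x
```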
The complexity of each iteration is $O(n^2)$ plus the cost of function
and gradient evaluations.
There are no $O(n^3)$ operations such as linear system solves or
matrix-matrix multiplications.
The algorithm is robust and the rate of convergence is
superlinear. In many cases it outperforms Newton's method,
which, while converging quadratically, has a higher cost
per iteration (Hessian computation and linear solve).
There is also a BFGS version that works with the Hessian approximation $B_k$ rather
than $H_k$. The update for $B_k$ is obtained by applying the
Sherman-Morrison-Woodbury formula to (BFGS):
\[
B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}. \qquad \text{(BFGS B)}
\]
An $O(n^2)$ implementation can be achieved based on updating the
$LDL^T$ factors of $B_k$ (with possible diagonal modification for
stability), but no computational advantage is observed over the
above algorithm using (BFGS) to update $H_k$.
The positive definiteness of $H_k$ is not explicitly enforced, but if
$H_k$ is positive definite then so is $H_{k+1}$.
What happens if at some iteration $H_k$ becomes a poor
approximation to the true inverse Hessian? E.g. if $s_k^T y_k$ is tiny
(but positive), then the elements of $H_{k+1}$ become very large.
It turns out that BFGS has effective self-correcting properties,
and $H_k$ tends to recover within a few steps. The self-correcting
properties hold only when an adequate line search is performed.
In particular, the Wolfe conditions ensure that the gradients are
sampled at points which allow the model $m_k$ to capture the
curvature information.
On the other hand, the DFP method is less effective at correcting
itself.
DFP and BFGS are dual in the sense that they can be
obtained by switching s ↔ y ,B ↔ H.
Implementation
$\alpha_k = 1$ should always be tried first, because this step length
will eventually be accepted (under certain conditions), thereby
producing superlinear convergence.
Computational evidence suggests that it is more economical
(in terms of function evaluations) to perform a fairly inaccurate
line search.
$c_1 = 10^{-4}$, $c_2 = 0.9$ are commonly used with the Wolfe conditions.
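A tiny self-contained illustration of these constants with SciPy's Wolfe line search on $f(x) = \|x\|^2$ (purely illustrative; the slides do not prescribe a particular routine):

```python
import numpy as np
from scipy.optimize import line_search

f = lambda x: x @ x
grad_f = lambda x: 2.0 * x
x_k = np.array([1.0, -2.0])
p_k = -grad_f(x_k)                                  # descent direction
# alpha = 1 is tried first internally; c1, c2 as on the slide
alpha = line_search(f, grad_f, x_k, p_k, c1=1e-4, c2=0.9)[0]
```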
Heuristic for scaling H0
The choice $H_0 = \beta I$ is popular, but there is no good strategy for
estimating $\beta$.
If $\beta$ is too large, the first step $p_0 = -\beta g_0$ is too long and the line
search may require many iterations to find a suitable step length
$\alpha_0$.
Heuristic: estimate $\beta$ after the first step has been computed (using
$H_0 = I$ amounts to a steepest descent step) but before the first
update of $H_0$ (in step 7), and change the provisional value by setting
\[
H_0 = \frac{y_k^T s_k}{y_k^T y_k}\, I.
\]
This scaling attempts to approximate an eigenvalue of the inverse
Hessian: from Taylor's theorem
\[
y_k = \bar G_k \alpha_k p_k = \bar G_k s_k,
\]
i.e. the secant equation is satisfied for the average Hessian
\[
\bar G_k = \int_0^1 \nabla^2 f(x_k + \tau \alpha_k p_k)\, d\tau.
\]
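A one-line NumPy version of this heuristic, assuming $s_0$, $y_0$ from the first step are available (the function name is illustrative):

```python
import numpy as np

def scaled_initial_H(s0, y0):
    """Rescale H_0 = (y0^T s0 / y0^T y0) * I after the first step.

    The scalar approximates an eigenvalue of the inverse (average) Hessian
    along the first step, before the first quasi-Newton update is applied.
    """
    beta = (y0 @ s0) / (y0 @ y0)
    return beta * np.eye(len(s0))
```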
Symmetric rank-1 (SR-1) update
Both BFGS and DFP methods perform a rank-2 update while
preserving symmetry and positive definiteness.
Question: Does a rank-1 update exist such that the secant
equation is satisfied and the symmetry and definiteness are
preserved?
Rank-1 update:
\[
B_{k+1} = B_k + \sigma v v^T, \qquad \sigma \in \{+1, -1\},
\]
where $v$ is chosen such that $B_{k+1}$ satisfies the secant equation
$y_k = B_{k+1} s_k$.
Substituting the explicit rank-1 form into the secant equation,
\[
y_k = B_k s_k + \underbrace{(\sigma v^T s_k)}_{:= \delta^{-1},\ \delta \neq 0}\, v,
\]
we see that $v$ must be of the form $v = \delta(y_k - B_k s_k)$.
Substituting $v = \delta(y_k - B_k s_k)$ back into the secant equation we
obtain
\[
y_k - B_k s_k = \sigma \delta^2 \big[s_k^T (y_k - B_k s_k)\big](y_k - B_k s_k),
\]
which is satisfied if and only if
\[
\sigma = \operatorname{sign}\big[s_k^T(y_k - B_k s_k)\big], \qquad \delta = \pm \big|s_k^T (y_k - B_k s_k)\big|^{-1/2}.
\]
Hence, the only symmetric rank-1 update satisfying the secant
equation is
\[
B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}. \qquad \text{(SR-1)}
\]
Applying the Sherman-Morrison-Woodbury formula we obtain the
inverse Hessian update
\[
H_{k+1} = H_k + \frac{(s_k - H_k y_k)(s_k - H_k y_k)^T}{(s_k - H_k y_k)^T y_k}. \qquad \text{(SR-1)}
\]
The SR-1 update does not preserve positive definiteness. This is a
drawback for line search methods, but can be an asset for trust-region
methods, as it allows indefinite Hessian approximations to be generated.
SR-1 breakdown
The main drawback of SR-1 is that $(y_k - B_k s_k)^T s_k$ (and similarly for the
$H_k$ form) can become 0 even for a convex quadratic function, i.e. there may
be steps where there is no symmetric rank-1 update which satisfies
the secant equation.
Three cases:
$(y_k - B_k s_k)^T s_k \neq 0$: a unique symmetric rank-1 update
satisfying the secant equation exists.
$y_k = B_k s_k$: the only update is $B_{k+1} = B_k$.
$(y_k - B_k s_k)^T s_k = 0$ and $y_k \neq B_k s_k$: there is no symmetric
rank-1 update satisfying the secant equation.
Remedy: skipping, i.e. apply the update only if
\[
|(y_k - B_k s_k)^T s_k| \ge r \|s_k\| \|y_k - B_k s_k\|,
\]
where $r \in (0,1)$ is a small number (typically $r = 10^{-8}$); otherwise
set $B_{k+1} = B_k$.
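A minimal NumPy sketch of the (SR-1) update of $B_k$ with the skipping safeguard above (names and the default $r$ are illustrative):

```python
import numpy as np

def sr1_update_B(B, s, y, r=1e-8):
    """SR-1 update of the Hessian approximation B with skipping.

    Applies the update only if |(y - Bs)^T s| >= r * ||s|| * ||y - Bs||;
    otherwise returns B unchanged (B_{k+1} = B_k).
    """
    v = y - B @ s
    if abs(v @ s) >= r * np.linalg.norm(s) * np.linalg.norm(v):
        return B + np.outer(v, v) / (v @ s)
    return B
```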
SR-1 applicability
This simple safeguard adequately prevents the breakdown.
Recall: for the BFGS update, skipping is not recommended if the
curvature condition $s_k^T y_k > 0$ fails, because this can occur often,
e.g. by taking too small a step when the line search does not
impose the Wolfe conditions. For SR-1, $s_k^T (y_k - B_k s_k) \approx 0$
occurs infrequently, as it requires near orthogonality of $s_k$ and
$y_k - B_k s_k$; moreover it implies that $s_k^T \bar G_k s_k \approx s_k^T B_k s_k$,
where $\bar G_k$ is the average Hessian over the last step, meaning
that the curvature approximation along $s_k$ is essentially
already correct.
The Hessian approximations generated by SR-1 are good,
often better than those by BFGS.
SR-1 is useful when the curvature condition $y_k^T s_k > 0$ cannot be imposed,
e.g. in constrained problems or for partially separable functions, where
indefinite Hessian approximations are desirable as they reflect
the indefiniteness of the true Hessian.
SR-1 trust-region method
1: Given $x_0$, $B_0$, $\Delta_0$, $\eta \in (0, 10^{-3})$, $r \in (0, 1)$ and $\varepsilon > 0$
2: Set k = 0
3: while ‖∇fk‖ > ε do
4: $s_k = \arg\min_s\ s^T \nabla f_k + \frac{1}{2} s^T B_k s$, subject to $\|s\| \le \Delta_k$
5: $y_k = \nabla f(x_k + s_k) - \nabla f_k$
6: $\rho_k = \big(f_k - f(x_k + s_k)\big) \big/ \big(-(s_k^T \nabla f_k + \frac{1}{2} s_k^T B_k s_k)\big)$
7: if ρk > η then
8: xk+1 = xk + sk
9: else
10: xk+1 = xk (failed step)
11: end if
12: Update $\Delta_k$ depending on $\rho_k$, $\|s_k\|$ (as in trust-region methods)
13: if |(yk − Bksk)Tsk | ≥ r‖sk‖‖yk − Bksk‖ then
14: Update Bk+1 using (SR-1) (even if xk+1 = xk to improve bad
approximation along sk)
15: else
16: Bk+1 = Bk
17: end if
18: k = k + 1
19: end while
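A rough NumPy sketch of the loop above, reusing `sr1_update_B`; the trust-region subproblem in step 4 is only approximated here by the Cauchy point and the radius update is a simple heuristic, so this is a toy stand-in rather than the algorithm as stated:

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Crude subproblem solver: minimise the model along -g within ||s|| <= delta."""
    gnorm = np.linalg.norm(g)
    gBg = g @ B @ g
    tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
    return -tau * (delta / gnorm) * g

def sr1_trust_region(f, grad_f, x0, delta0=1.0, eta=1e-4, r=1e-8,
                     eps=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    B, delta = np.eye(len(x)), delta0
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        s = cauchy_point(g, B, delta)
        y = grad_f(x + s) - g
        pred = -(s @ g + 0.5 * s @ B @ s)            # predicted reduction
        rho = (f(x) - f(x + s)) / pred if pred > 0 else -1.0
        if rho > eta:
            x = x + s                                 # accept the step
        if rho < 0.25:                                # simple radius update
            delta *= 0.5
        elif rho > 0.75 and np.isclose(np.linalg.norm(s), delta):
            delta *= 2.0
        B = sr1_update_B(B, s, y, r)                  # update even if rejected
    return x
```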
Theorem: Hessian approximation for quadratic function
Let $f : \mathbb{R}^n \to \mathbb{R}$ be a strongly convex quadratic function
$f(x) = b^T x + \frac{1}{2} x^T A x$ with $A$ symmetric positive definite. For any
starting point $x_0$ and any symmetric initial matrix $H_0$, the iterates
\[
x_{k+1} = x_k + p_k, \qquad p_k = -H_k \nabla f_k,
\]
where $H_k$ is updated with (SR-1), converge to the minimiser in at
most $n$ steps, provided that $(s_k - H_k y_k)^T y_k \neq 0$ for all $k$. After $n$
steps, if the search directions $p_k$ are linearly independent,
$H_n = A^{-1}$.
Proof idea: show by induction that the secant equation $H_k y_j = s_j$
is satisfied for all $j = 1, \dots, k-1$, i.e. for all previous steps (not merely
the last one, $j = k-1$). Use that for such a quadratic function $y_i = A s_i$.
For SR-1, $H_k y_j = s_j$, $j = 1, \dots, k-1$, holds regardless of how the line
search is performed. In contrast, for BFGS it can only be shown
under the assumption that the line search is exact.
Theorem: Hessian approximation for general function
Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable with Hessian
bounded and Lipschitz continuous in a neighbourhood of a point
$x^\star \in \mathbb{R}^n$, and let $\{x_k\}$ be a sequence of iterates such that $x_k \to x^\star$.
Suppose that
\[
|(y_k - B_k s_k)^T s_k| \ge r \|s_k\| \|y_k - B_k s_k\|
\]
holds for all $k$ and some $r \in (0, 1)$, and that the steps $s_k$ are
uniformly linearly independent (the steps do not tend to fall in a subspace of
dimension less than $n$).
Then the matrices $B_k$ generated by the update (SR-1) satisfy
\[
\lim_{k \to \infty} \|B_k - \nabla^2 f(x^\star)\| = 0.
\]
The Broyden class
The Broyden class is a family of updates of the form
\[
B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k} + \tau_k (s_k^T B_k s_k)\, v_k v_k^T, \qquad \text{(Broyden)}
\]
where $\tau_k$ is a scalar parameter and
\[
v_k = \frac{y_k}{y_k^T s_k} - \frac{B_k s_k}{s_k^T B_k s_k}.
\]
For $\tau_k = 0$ we recover BFGS and for $\tau_k = 1$ we recover DFP.
Hence we can write (Broyden) as a linear combination of the two:
\[
B_{k+1} = (1 - \tau_k) B_{k+1}^{\mathrm{BFGS}} + \tau_k B_{k+1}^{\mathrm{DFP}}.
\]
Since both BFGS and DFP satisfy the secant equation, so does the
whole Broyden class.
Since the BFGS and DFP updates preserve positive definiteness of the
Hessian approximation when $s_k^T y_k > 0$, so does the restricted Broyden class,
which is obtained by restricting $0 \le \tau_k \le 1$.
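A NumPy sketch of the (Broyden) update, assuming $s_k^T y_k > 0$; $\tau = 0$ reproduces the BFGS $B$-update and $\tau = 1$ the DFP one (names are illustrative):

```python
import numpy as np

def broyden_update_B(B, s, y, tau):
    """Broyden-class update (Broyden); tau=0 gives BFGS, tau=1 gives DFP."""
    Bs, ys = B @ s, y @ s
    sBs = s @ Bs
    v = y / ys - Bs / sBs
    return (B - np.outer(Bs, Bs) / sBs + np.outer(y, y) / ys
            + tau * sBs * np.outer(v, v))
```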
Theorem: monotonicity of eigenvalue approximation
Let $f : \mathbb{R}^n \to \mathbb{R}$ be the strongly convex quadratic function
$f(x) = b^T x + \frac{1}{2} x^T A x$ with $A$ symmetric positive definite. Let $B_0$ be
any symmetric positive definite matrix and $x_0$ be any starting point for the
iteration
\[
x_{k+1} = x_k + p_k, \qquad p_k = -B_k^{-1} \nabla f_k,
\]
where $B_k$ is updated with (Broyden) with $\tau_k \in [0, 1]$.
Denote by $\lambda_1^k \le \lambda_2^k \le \dots \le \lambda_n^k$ the eigenvalues of
$A^{1/2} B_k^{-1} A^{1/2}$.
Then for all $k$ we have
\[
\min\{\lambda_i^k, 1\} \le \lambda_i^{k+1} \le \max\{\lambda_i^k, 1\}, \qquad i = 1, \dots, n.
\]
The interlacing property does not hold if $\tau_k \notin [0, 1]$.
Consequence: the eigenvalues $\lambda_i^k$ converge monotonically (but
not strictly monotonically) to 1, which is the value of all eigenvalues when
$B_k = A$. Significantly, the result holds even if the line search is not
exact.
So do the best updates belong to the restricted Broyden class?
We recover the SR-1 formula for
\[
\tau_k = \frac{s_k^T y_k}{s_k^T y_k - s_k^T B_k s_k},
\]
which does not belong to the restricted Broyden class, as $\tau_k$ may
fall outside of $[0, 1]$.
It can be shown that for $B_0$ symmetric positive definite, if $s_k^T y_k > 0$
and $\tau_k > \tau_k^c$ for all $k$, then all $B_k$ generated by (Broyden) remain
symmetric and positive definite. Here
\[
\tau_k^c = (1 - \mu_k)^{-1} \le 0, \qquad \mu_k = \frac{(y_k^T B_k^{-1} y_k)(s_k^T B_k s_k)}{(y_k^T s_k)^2} \ge 1.
\]
When the line search is exact, all methods in the Broyden class
with $\tau_k \ge \tau_k^c$ generate the same sequence of iterates, even for
nonlinear functions, because the search directions differ only in length,
which is compensated by the exact line search.
Thm: Properties of Broyden class for quadratic function
Let $f : \mathbb{R}^n \to \mathbb{R}$ be the strongly convex quadratic function
$f(x) = b^T x + \frac{1}{2} x^T A x$ with $A$ symmetric positive definite. Let $x_0$
be any starting point and $B_0$ any symmetric positive definite
matrix. Assume that $\alpha_k$ is the exact step length and $\tau_k \ge \tau_k^c$ for
all $k$. Then the following hold:
(i) The iterates are independent of $\tau_k$ and converge to the
solution in at most $n$ iterations.
(ii) The secant equation is satisfied for all previous search
directions: $B_k s_j = y_j$, $j = 1, \dots, k-1$.
(iii) If $B_0 = I$, then the sequence of iterates $\{x_k\}$ is identical to
that generated by the conjugate gradient method; in
particular the search directions $s_k$ are conjugate:
$s_i^T A s_j = 0$, $i \neq j$.
(iv) If $n$ iterations are performed, we have $B_n = A$.
Few comments …
The theorem can be slightly generalised to hold if the Hessian
approximation remains nonsingular but not necessarily positive
definite, i.e. $\tau_k$ could be smaller than $\tau_k^c$ provided the chosen
value does not produce a singular updated matrix.
(iii) can be generalised to $B_0 \neq I$; then the Broyden class
method is identical to the preconditioned conjugate gradient
method with preconditioner $B_0$.
The theorem is mainly of theoretical interest, as the inexact
line search used in practice significantly alters the performance
of the methods. This type of analysis, however, guided much
of the development in quasi-Newton methods.
Global convergence
For a general nonlinear objective function, there is no global
convergence result for quasi-Newton methods, i.e. convergence to a
stationary point from any starting point and any suitable Hessian
approximation.
Theorem: [BFGS global convergence]
Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable and $x_0$ a
starting point for which the level set $\mathcal{L} = \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$
is convex and there exist two positive constants $m, M$ such that
\[
m \|z\|^2 \le z^T \nabla^2 f(x)\, z \le M \|z\|^2, \qquad \forall z \in \mathbb{R}^n,\ x \in \mathcal{L}.
\]
Then for any symmetric positive definite matrix $B_0$ the sequence
$\{x_k\}$ generated by the BFGS algorithm (with $\varepsilon = 0$) converges to the
minimiser $x^\star$ of $f$.
This result can be generalised to the restricted Broyden class with
$\tau_k \in [0, 1)$, i.e. to all members except the DFP method.
Theorem: Superlinear local convergence of BFGS
Let f : Rn → R be twice continuously differentiable and the
sequence of iterates generated by BFGS algorithm converge to
x? ∈ Rn such that the Hessian ∇2f is Lipschitz continuous at x?
\[
\|\nabla^2 f(x) - \nabla^2 f(x^\star)\| \le L \|x - x^\star\|, \qquad \forall x \in \mathcal{N}(x^\star),\ 0 < L < \infty,
\]
and that it holds
\[
\sum_{k=1}^{\infty} \|x_k - x^\star\| < \infty.
\]
Then $x_k$ converges to $x^\star$ at a superlinear rate.

Theorem: SR-1 trust region convergence
Let $\{x_k\}$ be the sequence of iterates generated by the SR-1 trust
region method. Suppose the following conditions hold:
the sequence $\{x_k\}$ does not terminate, but remains in a closed,
bounded, convex set $D$ on which $f$ is twice continuously
differentiable and in which $f$ has a unique stationary point $x^\star$;
$\nabla^2 f(x^\star)$ is positive definite and $\nabla^2 f(x)$ is Lipschitz continuous
in $\mathcal{N}(x^\star)$;
the sequence $\{B_k\}$ is bounded in norm;
$|(y_k - B_k s_k)^T s_k| \ge r \|s_k\| \|y_k - B_k s_k\|$, $r \in (0, 1)$, $\forall k$.
Then for the sequence $\{x_k\}$ we have $\lim_{k \to \infty} x_k = x^\star$ and
\[
\lim_{k \to \infty} \frac{\|x_{k+n+1} - x^\star\|}{\|x_k - x^\star\|} = 0
\]
($(n+1)$-step superlinear rate).

Remarks:
The SR-1 update does not maintain positive definiteness of $B_k$. In
practice $B_k$ can be indefinite at any iteration (the trust region
bound may continue to be active for arbitrarily large $k$), but it can
be shown that (asymptotically) $B_k$ remains positive definite most of
the time, regardless of whether the initial approximation $B_0$ was
positive definite or not.
The theorem does not require an exact solution of the trust region
subproblem.