Numerical Optimisation: Quasi-Newton methods
Marta M. Betcke
m.betcke@ucl.ac.uk,
Kiko Rullan
f.rullan@cs.ucl.ac.uk
Department of Computer Science,
Centre for Medical Image Computing,
Centre for Inverse Problems
University College London
Lecture 7 & 8
Quasi-Newton
The first idea is due to William C. Davidon in the mid 1950s, who was frustrated by the performance of coordinate descent.
It was quickly picked up by Fletcher and Powell, who demonstrated that the new algorithm was much faster and more reliable than existing methods.
Davidon’s original paper was not accepted for publication.
More than 30 years later it appeared in the first issue of the
SIAM Journal on Optimization in 1991.
Like steepest descent, quasi-Newton methods only require the gradient of the objective function at each iterate.
By measuring the changes in the gradients, they build a model of the objective function which is good enough to produce superlinear convergence.
As the Hessian is not required, quasi-Newton methods can be more efficient than Newton's method, which takes a long time to evaluate the Hessian and solve for the Newton direction.
Broyden Fletcher Goldfarb Shanno (BFGS)
Quadratic model of the objective function at xk :
mk(p) = fk + ∇fk^T p + (1/2) p^T Bk p,
where Bk ∈ R^{n×n} is symmetric positive definite and will be updated during the iterations.
The minimiser of mk can be written explicitly as
pk = −Bk^{-1} ∇fk.
pk is used as a search direction and the next iterate becomes
xk+1 = xk + αkpk .
The step length αk is chosen to satisfy the Wolfe conditions.
The iteration is similar to line search Newton, with the key difference that Bk is only an approximation to the Hessian.
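As a minimal sketch (not part of the slides; the quadratic test problem and all names are illustrative assumptions), the search direction can be computed by solving Bk pk = −∇fk with NumPy, and one can check that pk minimises the quadratic model mk:

import numpy as np

# Hypothetical quadratic test problem: f(x) = 0.5 x^T A x - b^T x
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x_k = np.array([2.0, 2.0])
B_k = np.eye(2)                      # current SPD Hessian approximation

g_k = grad(x_k)
p_k = np.linalg.solve(B_k, -g_k)     # p_k = -B_k^{-1} grad f_k, without forming the inverse

# Quadratic model m_k(p) = f_k + grad f_k^T p + 0.5 p^T B_k p
m_k = lambda p: f(x_k) + g_k @ p + 0.5 * p @ B_k @ p
assert m_k(p_k) <= m_k(np.zeros(2))  # the model decreases along p_k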
Bk update
Davidon proposed to update Bk in each iteration instead of
computing it anew.
Question: When we have computed the new iterate xk+1 and construct the new model
mk+1(p) = fk+1 + ∇fk+1^T p + (1/2) p^T Bk+1 p,
what requirements should we impose on Bk+1, based on the knowledge gathered in the last step?
Require: gradient of mk+1 should match the gradient of f at the
last two iterates xk , xk+1.
i) At xk+1 (p = 0): ∇mk+1(0) = ∇fk+1 is satisfied automatically.
ii) At xk = xk+1 − αkpk:
∇mk+1(−αkpk) = ∇fk+1 − αk Bk+1 pk = ∇fk.
By rearranging ii) we obtain
Bk+1αkpk = ∇fk+1 −∇fk .
Define the vectors
sk = xk+1 − xk = αk pk,    yk = ∇fk+1 − ∇fk;
then ii) becomes the secant equation
Bk+1 sk = yk.
As Bk+1 is symmetric positive definite, this is only possible if the curvature condition holds:
sk^T yk > 0,
which can easily be seen by multiplying the secant equation by sk^T from the left.
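A small numeric check (an illustrative assumption, not from the slides): for a convex quadratic f(x) = (1/2) x^T A x we have yk = A sk, so the true Hessian itself satisfies the secant equation and sk^T yk = sk^T A sk > 0 automatically:

import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])       # SPD Hessian of f(x) = 0.5 x^T A x
grad = lambda x: A @ x

x_k, x_next = np.array([1.0, 0.0]), np.array([0.2, -0.1])
s_k = x_next - x_k
y_k = grad(x_next) - grad(x_k)               # equals A @ s_k for a quadratic

assert np.allclose(y_k, A @ s_k)             # B = A satisfies the secant equation
assert s_k @ y_k > 0                         # curvature condition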
If f is strongly convex, sk^T yk > 0 is satisfied for any xk, xk+1.
However, for nonconvex functions this condition will in general have to be enforced explicitly by imposing restrictions on the line search.
sk^T yk > 0 is guaranteed if we impose the Wolfe or strong Wolfe conditions:
From the 2nd Wolfe (curvature) condition sk^T ∇fk+1 ≥ c2 sk^T ∇fk, with c1 < c2 < 1, it follows that
sk^T yk ≥ (c2 − 1) αk pk^T ∇fk > 0,
since c2 < 1 and pk is a descent direction; hence the curvature condition holds.
Davidon Fletcher Powell (DFP)
When sk^T yk > 0, the secant equation always has a solution Bk+1.
In fact, the secant equation is heavily underdetermined: a symmetric matrix has n(n + 1)/2 degrees of freedom, while the secant equation imposes n conditions and positive definiteness n inequalities.
Extra conditions to obtain a unique solution: we look for Bk+1 close to Bk in a certain sense.
DFP update:
Bk+1 = (I − ρk yk sk^T) Bk (I − ρk sk yk^T) + ρk yk yk^T,    (DFP B)
with ρk = 1/(yk^T sk).
The inverse Hk = Bk^{-1} can be obtained with the Sherman-Morrison-Woodbury formula:
Hk+1 = Hk − (Hk yk yk^T Hk)/(yk^T Hk yk) + (sk sk^T)/(yk^T sk).    (DFP H)
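A sketch of the (DFP H) update in NumPy (a hypothetical helper, assuming sk^T yk > 0; it is easy to check that the result satisfies the secant equation Hk+1 yk = sk):

import numpy as np

def dfp_update_H(H, s, y):
    # DFP update of the inverse Hessian approximation, formula (DFP H)
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)

# Tiny check on synthetic data (s^T y > 0 by construction)
s = np.array([0.3, -0.2]); y = np.array([1.0, 0.5])
H1 = dfp_update_H(np.eye(2), s, y)
assert np.allclose(H1 @ y, s)                # secant equation H_{k+1} y = s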
Sherman-Morrison-Woodbury formula
Figure: the Sherman-Morrison-Woodbury formula, Nocedal & Wright (A.28).
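For reference (a restatement of the standard identity from memory, not copied from the figure; the exact notation in Nocedal & Wright may differ): if A ∈ R^{n×n} is nonsingular and U, V ∈ R^{n×p}, then, provided the relevant inverses exist,
(A + U V^T)^{-1} = A^{-1} − A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1},
and in the rank-1 case
(A + a b^T)^{-1} = A^{-1} − (A^{-1} a b^T A^{-1}) / (1 + b^T A^{-1} a).
This is how the (DFP B)/(DFP H) and (BFGS B)/(BFGS) pairs of updates are related.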
BFGS
Applying the same argument directly to the inverse of the Hessian Hk: the updated approximation Hk+1 must be symmetric positive definite and must satisfy the secant equation
Hk+1 yk = sk.
BFGS update:
Hk+1 = (I − ρk sk yk^T) Hk (I − ρk yk sk^T) + ρk sk sk^T,    (BFGS)
with ρk = 1/(yk^T sk).
How to choose H0? It depends on the situation and on information about the problem, e.g. we may start with the inverse of an approximate Hessian computed by finite differences at x0. Otherwise, we can set it to the identity or to a diagonal matrix reflecting the scaling of the variables.
BFGS
1: Given x0, inverse Hessian approximation H0, tolerance ε > 0
2: Set k = 0
3: while ‖∇fk‖ > ε do
4: Compute search direction
pk = −Hk∇fk
5: xk+1 = xk + αkpk where αk is computed with a line search
procedure satisfying Wolfe conditions
6: Define sk = xk+1 − xk and yk = ∇fk+1 −∇fk
7: Compute Hk+1 using (BFGS)
8: k = k + 1
9: end while
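A compact NumPy sketch of the algorithm above (illustrative only; it assumes scipy.optimize.line_search is available as a Wolfe-conditions line search and falls back to a unit step if the search fails; the quadratic test problem at the end is likewise an assumption):

import numpy as np
from scipy.optimize import line_search

def bfgs(f, grad, x0, H0=None, tol=1e-6, max_iter=200):
    # Line-search BFGS with the inverse-Hessian update (BFGS)
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n) if H0 is None else H0.copy()
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        p = -H @ g                                    # search direction p_k = -H_k grad f_k
        alpha = line_search(f, grad, x, p, gfk=g)[0]  # Wolfe line search
        if alpha is None:
            alpha = 1.0                               # crude fallback if the search fails
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-12:                             # curvature condition s^T y > 0
            rho = 1.0 / (s @ y)
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Usage on a convex quadratic: the minimiser satisfies A x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]]); b = np.array([1.0, -1.0])
x_min = bfgs(lambda x: 0.5 * x @ A @ x - b @ x, lambda x: A @ x - b, np.zeros(2))
assert np.allclose(A @ x_min, b, atol=1e-5)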
Complexity of each iteration is O(n²) plus the cost of function and gradient evaluations.
There are no O(n³) operations such as linear system solves or matrix-matrix multiplications.
The algorithm is robust and the rate of convergence is superlinear. In many cases it outperforms Newton's method which, while converging quadratically, has a higher complexity per iteration (Hessian computation and solve).
There is a BFGS version which works with the Hessian approximation Bk rather than Hk. The update for Bk is obtained by applying the Sherman-Morrison-Woodbury formula to (BFGS):
Bk+1 = Bk − (Bk sk sk^T Bk)/(sk^T Bk sk) + (yk yk^T)/(yk^T sk).    (BFGS B)
An O(n²) implementation can be achieved based on updating the LDL^T factors of Bk (with a possible diagonal modification for stability), but no computational advantage is observed over (BFGS).
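For completeness, a sketch of the direct Hessian update (BFGS B) as a hypothetical NumPy helper (in practice one would update a factorisation of Bk and solve Bk pk = −∇fk rather than invert Bk):

import numpy as np

def bfgs_update_B(B, s, y):
    # Direct Hessian update (BFGS B); assumes the curvature condition s^T y > 0
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

s = np.array([0.5, 0.1]); y = np.array([1.0, 0.4])
B1 = bfgs_update_B(np.eye(2), s, y)
assert np.allclose(B1 @ s, y)                # secant equation B_{k+1} s = y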
The positive definiteness of Hk is not explicitly enforced, but if Hk is positive definite then so is Hk+1.
What happens if at some iteration Hk becomes a poor approximation to the true inverse Hessian? E.g. if sk^T yk is tiny (but positive), the elements of Hk+1 become very large. It turns out that BFGS has effective self-correcting properties, and Hk tends to recover within a few steps. The self-correcting properties only hold when an adequate line search is performed. In particular, the Wolfe conditions ensure that the gradients are sampled at points which allow the model mk to capture the curvature information.
On the other hand, the DFP method is less effective in correcting itself.
DFP and BFGS are dual in the sense that they can be
obtained by switching s ↔ y ,B ↔ H.
Implementation
αk = 1 should always be tried first, because this step length will eventually be accepted (under certain conditions), thereby producing superlinear convergence.
Computational evidence suggests that it is more economical (in terms of function evaluations) to perform a fairly inaccurate line search.
c1 = 10^{-4}, c2 = 0.9 are commonly used in the Wolfe conditions.
Heuristic for scaling H0
The choice H0 = βI is popular, but there is no good general strategy for estimating β.
If β is too large, the first step p0 = −β∇f0 is too long and the line search may require many iterations to find a suitable step length α0.
Heuristic: estimate β after the first step has been computed but before the first update of H0 (in step 7), and change the provisional value by setting
H0 = (sk^T yk)/(yk^T yk) I.
This scaling attempts to approximate an eigenvalue of the inverse Hessian (from Taylor's theorem, the secant equation is satisfied by the average Hessian Ḡk = ∫₀¹ ∇²f(xk + τ αk pk) dτ, i.e. yk = Ḡk αk pk = Ḡk sk).
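A sketch of this heuristic as a hypothetical helper (rescale the provisional H0 after the first step, before the first (BFGS) update):

import numpy as np

def rescale_H0(s0, y0):
    # Heuristic scaling H0 = (s0^T y0)/(y0^T y0) I, applied before the first update;
    # the factor approximates the reciprocal of an eigenvalue of the average Hessian.
    return (s0 @ y0) / (y0 @ y0) * np.eye(s0.size)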
Symmetric rank-1 (SR-1) update
Both BFGS and DFP methods perform a rank-2 update while
preserving symmetry and positive definiteness.
Question: Does a rank-1 update exist such that the secant
equation is satisfied and the symmetry and definiteness are
preserved?
Rank-1 update:
Bk+1 = Bk + σ v v^T,    σ ∈ {+1, −1},
and v is chosen such that Bk+1 satisfies the secant equation
yk = Bk+1 sk.
Substituting the explicit rank-1 form into the secant equation gives
yk = Bk sk + (σ v^T sk) v.
Since σ v^T sk is a scalar, v must be a multiple of yk − Bk sk, i.e. v = δ(yk − Bk sk) for some scalar δ.
Substituting v = δ(yk − Bk sk) back into the secant equation we obtain
yk − Bk sk = σ δ² [sk^T (yk − Bk sk)] (yk − Bk sk),
which is satisfied if and only if
σ = sign[sk^T (yk − Bk sk)],    δ = ±|sk^T (yk − Bk sk)|^{-1/2}.
Hence, the only symmetric rank-1 update satisfying the secant equation is
Bk+1 = Bk + [(yk − Bk sk)(yk − Bk sk)^T] / [(yk − Bk sk)^T sk].    (SR-1)
Applying the Sherman-Morrison-Woodbury formula, the corresponding inverse Hessian update is
Hk+1 = Hk + [(sk − Hk yk)(sk − Hk yk)^T] / [(sk − Hk yk)^T yk].    (SR-1)
The SR-1 update does not preserve positive definiteness. This is a drawback for line search methods, but it can be an asset for trust-region methods, as it allows indefinite Hessian approximations to be generated.
SR-1 breakdown
The main drawback of SR-1 is that (sk − Hk yk)^T yk can become 0 even for a convex quadratic function, i.e. there may be steps where no symmetric rank-1 update satisfies the secant equation.
Three cases:
(yk − Bk sk)^T sk ≠ 0: a unique symmetric rank-1 update satisfying the secant equation exists.
yk = Bk sk: the only update is Bk+1 = Bk.
(yk − Bk sk)^T sk = 0 and yk ≠ Bk sk: there is no symmetric rank-1 update satisfying the secant equation.
Remedy: skipping, i.e. apply the update only if
|(yk − Bk sk)^T sk| ≥ r ‖sk‖ ‖yk − Bk sk‖,
where r ∈ (0, 1) is a small number (typically r = 10^{-8}); otherwise set Bk+1 = Bk.
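A sketch of the (SR-1) update of Bk with this skipping safeguard (a hypothetical helper; r = 1e-8 as a typical value, and the yk = Bk sk case is also handled by skipping):

import numpy as np

def sr1_update_B(B, s, y, r=1e-8):
    # SR-1 update of the Hessian approximation, skipped when the denominator is unsafe
    v = y - B @ s                                   # y_k - B_k s_k
    denom = v @ s
    if abs(denom) <= r * np.linalg.norm(s) * np.linalg.norm(v):
        return B                                    # skip: keep B_{k+1} = B_k
    return B + np.outer(v, v) / denom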
SR-1 applicability
This simple safeguard adequately prevents the breakdown.
Recall: for the BFGS update, skipping is not recommended when the curvature condition sk^T yk > 0 fails, because this can occur often, e.g. when too small a step is taken because the line search does not impose the Wolfe conditions. For SR-1, sk^T (yk − Bk sk) ≈ 0 occurs infrequently, since it requires near orthogonality of sk and yk − Bk sk; moreover, it implies that sk^T Ḡk sk ≈ sk^T Bk sk, where Ḡk is the average Hessian over the last step, meaning that the curvature approximation along sk is essentially already correct.
The Hessian approximations generated by SR-1 are good, often better than those generated by BFGS.
When the curvature condition yk^T sk > 0 cannot be imposed, e.g. for constrained problems or partially separable functions, indefinite Hessian approximations are desirable as they reflect the indefiniteness of the true Hessian.
SR-1 trust-region method
1: Given x0, B0, ∆0, η ∈ (0, 10^{-3}), r ∈ (0, 1) and ε > 0
2: Set k = 0
3: while ‖∇fk‖ > ε do
4: sk = arg min_s s^T ∇fk + (1/2) s^T Bk s, subject to ‖s‖ ≤ ∆k
5: yk = ∇f(xk + sk) − ∇fk
6: ρk = (fk − f(xk + sk)) / (−(sk^T ∇fk + (1/2) sk^T Bk sk))
7: if ρk > η then
8: xk+1 = xk + sk
9: else
10: xk+1 = xk (failed step)
11: end if
12: Update ∆k depending on ρk and ‖sk‖ (as in trust-region methods)
13: if |(yk − Bk sk)^T sk| ≥ r ‖sk‖ ‖yk − Bk sk‖ then
14: Update Bk+1 using (SR-1) (even if xk+1 = xk, to improve a poor approximation along sk)
15: else
16: Bk+1 = Bk
17: end if
18: k = k + 1
19: end while
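A simplified NumPy sketch of this loop (illustrative assumptions: the trust-region subproblem in step 4 is only solved approximately by a Cauchy-point step, and the radius update in step 12 is a common textbook rule, not necessarily the one intended on the slide):

import numpy as np

def cauchy_point(g, B, delta):
    # Approximate subproblem minimiser along -g within the trust region
    gBg = g @ B @ g
    gnorm = np.linalg.norm(g)
    tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
    return -tau * delta / gnorm * g

def sr1_trust_region(f, grad, x0, delta=1.0, eta=1e-4, r=1e-8, tol=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        s = cauchy_point(g, B, delta)
        y = grad(x + s) - g
        pred = -(s @ g + 0.5 * s @ B @ s)            # predicted reduction, > 0 for g != 0
        rho = (f(x) - f(x + s)) / pred
        if rho > eta:
            x = x + s                                # accept the step
        if rho < 0.25:                               # simple radius update (an assumption)
            delta *= 0.5
        elif rho > 0.75 and np.isclose(np.linalg.norm(s), delta):
            delta *= 2.0
        v = y - B @ s                                # SR-1 update with the skipping safeguard,
        if abs(v @ s) > r * np.linalg.norm(s) * np.linalg.norm(v):   # even after a failed step
            B = B + np.outer(v, v) / (v @ s)
    return x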
Theorem: Hessian approximation for quadratic function
Let f : R^n → R be the strongly convex quadratic function f(x) = b^T x + (1/2) x^T A x with A symmetric positive definite. For any starting point x0 and any symmetric initial matrix H0, the iterates
xk+1 = xk + pk,    pk = −Hk ∇fk,
where Hk is updated with (SR-1), converge to the minimiser in at most n steps, provided that (sk − Hk yk)^T yk ≠ 0 for all k. After n steps, if the search directions pk are linearly independent, Hn = A^{-1}.
Proof idea: show inductively that Hk yj = sj for all j = 1, . . . , k − 1, i.e. Hk satisfies the secant equations for all directions taken so far.
For a quadratic function the secant equation is satisfied for all previous directions, regardless of how the line search is performed. In contrast, for BFGS this can only be shown under the assumption that the line search is exact.
Theorem: Hessian approximation for general function
Let f : R^n → R be twice continuously differentiable with the Hessian bounded and Lipschitz continuous in a neighbourhood of a point x⋆ ∈ R^n, and let {xk} be a sequence of iterates such that xk → x⋆. Suppose that
|(yk − Bk sk)^T sk| ≥ r ‖sk‖ ‖yk − Bk sk‖
holds for all k and some r ∈ (0, 1), and that the steps sk are uniformly linearly independent (the steps do not tend to fall in a subspace of dimension less than n).
Then the matrices Bk generated by the update (SR-1) satisfy
lim_{k→∞} ‖Bk − ∇²f(x⋆)‖ = 0.
The Broyden class
Broyden class is a family of updates of the form
Bk+1 = Bk − (Bk sk sk^T Bk)/(sk^T Bk sk) + (yk yk^T)/(yk^T sk) + τk (sk^T Bk sk) vk vk^T,    (Broyden)
where τk is a scalar parameter and
vk = yk/(yk^T sk) − Bk sk/(sk^T Bk sk).
For τk = 0 we recover BFGS and for τk = 1 we recover DFP.
Hence we can write (Broyden) as a linear combination of the two:
Bk+1 = (1 − τk) Bk+1^{BFGS} + τk Bk+1^{DFP}.
Since both BFGS and DFP satisfy the secant equation, so does the whole Broyden class.
Since the BFGS and DFP updates preserve positive definiteness of the Hessian approximation when sk^T yk > 0, so does the restricted Broyden class, which is obtained by restricting 0 ≤ τk ≤ 1.
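A sketch of the (Broyden) update through the equivalent linear combination (hypothetical helpers, redefined here so the snippet is self-contained; both endpoint updates satisfy the secant equation, hence so does every member of the class):

import numpy as np

def bfgs_update_B(B, s, y):
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def dfp_update_B(B, s, y):
    rho = 1.0 / (y @ s)
    I = np.eye(B.shape[0])
    return (I - rho * np.outer(y, s)) @ B @ (I - rho * np.outer(s, y)) + rho * np.outer(y, y)

def broyden_update_B(B, s, y, tau):
    # Broyden class: B_{k+1} = (1 - tau) * BFGS update + tau * DFP update
    return (1.0 - tau) * bfgs_update_B(B, s, y) + tau * dfp_update_B(B, s, y)

s = np.array([0.4, 0.1]); y = np.array([1.0, 0.3])
for tau in (0.0, 0.5, 1.0):
    assert np.allclose(broyden_update_B(np.eye(2), s, y, tau) @ s, y)   # secant equation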
Theorem: monotonicity of eigenvalue approximation
Let f : R^n → R be the strongly convex quadratic function f(x) = b^T x + (1/2) x^T A x with A symmetric positive definite. Let B0 be any symmetric positive definite matrix and x0 be any starting point for the iteration
xk+1 = xk + pk,    pk = −Bk^{-1} ∇fk,
where Bk is updated with (Broyden) with τk ∈ [0, 1].
Denote by λ1^k ≤ λ2^k ≤ · · · ≤ λn^k the eigenvalues of A^{1/2} Bk^{-1} A^{1/2}.
Then for all k we have
min{λi^k, 1} ≤ λi^{k+1} ≤ max{λi^k, 1},    i = 1, . . . , n.
This interlacing property does not hold if τk ∉ [0, 1].
Consequence: the eigenvalues λi^k converge monotonically (but not strictly monotonically) to 1, which are the eigenvalues obtained when Bk = A. Significantly, the result holds even if the line search is not exact.
So do the best updates belong to the restricted Broyden class?
We recover the SR-1 formula for
τk = (sk^T yk)/(sk^T yk − sk^T Bk sk),
which does not belong to the restricted Broyden class, as τk may fall outside of [0, 1].
It can be shown that for B0 symmetric positive definite, if for all k sk^T yk > 0 and τk > τk^c, then all Bk generated by (Broyden) remain symmetric and positive definite. Here
τk^c = (1 − µk)^{-1} ≤ 0,    µk = (yk^T Bk^{-1} yk)(sk^T Bk sk)/(yk^T sk)² ≥ 1.
When the line search is exact, all methods in the Broyden class with τk ≥ τk^c generate the same sequence of iterates, even for nonlinear functions, because the search directions differ only in length, and this is compensated for by the exact line search.
Thm: Properties of Broyden class for quadratic function
Let f : R^n → R be the strongly convex quadratic function f(x) = b^T x + (1/2) x^T A x with A symmetric positive definite. Let x0 be any starting point and B0 any symmetric positive definite matrix. Assume that αk is the exact step length and τk ≥ τk^c for all k. Then it holds:
(i) The iterates are independent of τk and converge to the
solution in at most n iterations.
(ii) The secant equation is satisfied for all previous search
directions
Bksj = yj , j = 1, . . . , k − 1.
(iii) If B0 = I , then the sequence of iterates {xk} is identical to
that generated by the conjugate gradient method, in
particular the search directions sk are conjugate
si^T A sj = 0, i ≠ j.
(iv) If n iterations are performed, we have Bn = A.
Few comments …
The theorem can be slightly generalised: it holds as long as the Hessian approximation remains nonsingular but not necessarily positive definite, i.e. τk could be smaller than τk^c, provided the chosen value does not produce a singular updated matrix.
(iii) can be generalised to B0 ≠ I; then the Broyden class method is identical to the preconditioned conjugate gradient method with preconditioner B0.
The theorem is mainly of theoretical interest, as the inexact line search used in practice significantly alters the performance of the methods. This type of analysis, however, guided much of the development of quasi-Newton methods.
Global convergence
For a general nonlinear objective function, there is no global convergence result for quasi-Newton methods, i.e. convergence to a stationary point from any starting point and any suitable initial Hessian approximation.
Theorem: [BFGS global convergence]
Let f : R^n → R be twice continuously differentiable and x0 be a starting point for which the level set L = {x ∈ R^n : f(x) ≤ f(x0)} is convex and there exist two positive constants m, M such that
m‖z‖² ≤ z^T ∇²f(x) z ≤ M‖z‖²,    ∀z ∈ R^n, x ∈ L.
Then for any symmetric positive definite matrix B0 the sequence {xk} generated by the BFGS algorithm (with ε = 0) converges to the minimiser x⋆ of f.
This result can be generalised to the restricted Broyden class with τk ∈ [0, 1), i.e. to all but the DFP method.
Theorem: Superlinear convergence BFGS
Let f : R^n → R be twice continuously differentiable and let the sequence of iterates generated by the BFGS algorithm converge to x⋆ ∈ R^n. Suppose that the Hessian ∇²f is Lipschitz continuous at x⋆, i.e.
‖∇²f(x) − ∇²f(x⋆)‖ ≤ L‖x − x⋆‖,    ∀x ∈ N(x⋆), 0 < L < ∞,
and that
∑_{k=1}^∞ ‖xk − x⋆‖ < ∞.
Then xk converges to x⋆ at a superlinear rate.

Theorem: SR-1 trust region convergence
Let {xk} be the sequence of iterates generated by the SR-1 trust-region method. Suppose the following conditions hold:
the sequence {xk} does not terminate, but remains in a closed bounded convex set D on which f is twice continuously differentiable and in which f has a unique stationary point x⋆;
∇²f(x⋆) is positive definite and ∇²f(x) is Lipschitz continuous in N(x⋆);
the sequence {Bk} is bounded in norm;
|(yk − Bk sk)^T sk| ≥ r ‖sk‖ ‖yk − Bk sk‖, r ∈ (0, 1), ∀k.
Then for the sequence {xk} we have lim_{k→∞} xk = x⋆ and
lim_{k→∞} ‖x_{k+n+1} − x⋆‖ / ‖xk − x⋆‖ = 0    ((n+1)-step superlinear rate).
Remark: the SR-1 update does not maintain positive definiteness of Bk. In practice Bk can be indefinite at any iteration (the trust-region bound may continue to be active for arbitrarily large k), but it can be shown that (asymptotically) Bk remains positive definite most of the time.