Numerical Optimisation: Large scale methods
Marta M. Betcke
m.betcke@ucl.ac.uk,
Kiko Rullan
f.rullan@cs.ucl.ac.uk
Department of Computer Science,
Centre for Medical Image Computing,
Centre for Inverse Problems
University College London
Lecture 9
Issues arising from large scale problems
Hessian solve: line search and trust region methods require a factorisation of the Hessian. At large scale this is infeasible and the solve has to be performed using large scale techniques such as sparse factorisations or iterative methods.
Hessian computation and storage: the Hessian approximations generated by quasi-Newton methods are usually dense even if the true Hessian is sparse. Limited-memory variants have been developed, in which the Hessian approximation can be stored using only a few vectors (at the price of slower convergence). Alternatively, approximate Hessians preserving sparsity can be used.
Special structure: properties of the objective function such as partial separability, i.e. the function can be decomposed into a sum of simpler functions, each depending only on a small subspace of Rn, can be exploited.
Inexact Newton methods
Solve the Newton step system
∇2fk pk^iN = −∇fk    (A := ∇2fk, b := −∇fk)
using an iterative method: CG, or Lanczos with a modification to handle negative curvature.
The implementation can be matrix-free, i.e. the Hessian does not need to be computed or stored explicitly; we only require a routine that evaluates the Hessian-vector product.
Question: How does the inexact solve affect the local convergence of the Newton method?
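
As an illustration of such a routine, here is a minimal matrix-free Hessian-vector product based on a forward difference of the gradient; the function name, signature and step-size heuristic are assumptions of this sketch, not part of the lecture.

import numpy as np

def hessian_vector_product(grad, x, v, eps=None):
    """Approximate H(x) @ v using a forward difference of the gradient.

    grad : callable returning the gradient of f at a point
    x, v : current iterate and direction
    eps  : finite-difference step; a common heuristic scales sqrt(machine
           epsilon) with the norms of x and v (an assumption of the sketch)
    """
    if eps is None:
        eps = np.sqrt(np.finfo(float).eps) * (1.0 + np.linalg.norm(x)) / (np.linalg.norm(v) + 1e-16)
    return (grad(x + eps * v) - grad(x)) / eps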
Most termination rules for iterative methods are based on the residual
rk = ∇2fk pk^iN + ∇fk,
where pk^iN is the inexact Newton step.
Usually we terminate CG when
‖rk‖ ≤ ηk ‖∇fk‖,    (iN-STOP)
where {ηk} is a sequence with 0 < ηk < 1.
For the moment we assume that a step of length αk = 1 is taken, i.e. globalisation strategies do not interfere with the inexact Newton step.

Theorem: local convergence
Suppose ∇2f(x) exists and is continuous in a neighbourhood of a minimiser x⋆, with ∇2f(x⋆) positive definite. Consider the inexact Newton iteration with step length αk = 1, xk+1 = xk + pk, with a starting point x0 sufficiently close to x⋆, terminated with the stopping rule (iN-STOP) with ηk ≤ η for some constant η ∈ [0, 1).
Then the sequence {xk} converges to x⋆ and satisfies
‖∇2f(x⋆)(xk+1 − x⋆)‖ ≤ η̂ ‖∇2f(x⋆)(xk − x⋆)‖,
for some constant η̂ with η < η̂ < 1.

Remark: This result provides convergence for any {ηk} bounded away from 1.

Proof idea convergence (superlinear):
Continuity of ∇2f(x) in a neighbourhood N(x⋆) of x⋆ implies
∇f(xk) = ∇2f(x⋆)(xk − x⋆) + o(‖xk − x⋆‖),
thus we show instead ‖∇f(xk+1)‖ ≤ η̂ ‖∇f(xk)‖.
Continuity and positive definiteness of ∇2f(x) in N(x⋆) imply that there exists L > 0 such that ‖∇2f(xk)⁻¹‖ ≤ L for all xk ∈ N(x⋆), and hence
‖pk‖ ≤ L(‖∇f(xk)‖ + ‖rk‖) ≤ 2L‖∇f(xk)‖.
From Taylor's theorem and continuity of ∇2f(x) in N(x⋆) we have
∇f(xk+1) = ∇f(xk) + ∇2f(xk) pk + ∫₀¹ [∇2f(xk + t pk) − ∇2f(xk)] pk dt
         = ∇f(xk) + ∇2f(xk) pk + o(‖pk‖)
         = ∇f(xk) − (∇f(xk) − rk) + o(‖∇f(xk)‖)
         = rk + o(‖∇f(xk)‖),
so that
‖∇f(xk+1)‖ ≤ ηk ‖∇f(xk)‖ + o(‖∇f(xk)‖) ≤ (ηk + o(1)) ‖∇f(xk)‖,
and with ηk = o(1) we obtain ‖∇f(xk+1)‖ ≤ o(‖∇f(xk)‖).
Proof idea convergence (quadratic):
Continuity of ∇2f(x) in a neighbourhood N(x⋆) of x⋆ implies
∇f(xk) = ∇2f(x⋆)(xk − x⋆) + o(‖xk − x⋆‖),
thus we can again work with the gradient norms ‖∇f(xk)‖.
Continuity and positive definiteness of ∇2f(x) in N(x⋆) imply that there exists L > 0 such that ‖∇2f(xk)⁻¹‖ ≤ L for all xk ∈ N(x⋆), and hence
‖pk‖ ≤ L(‖∇f(xk)‖ + ‖rk‖) ≤ 2L‖∇f(xk)‖.
From Taylor's theorem and Lipschitz continuity of ∇2f(x) in N(x⋆),
∇f(xk+1) = ∇f(xk) + ∇2f(xk) pk + ∫₀¹ [∇2f(xk + t pk) − ∇2f(xk)] pk dt
         = ∇f(xk) + ∇2f(xk) pk + O(‖pk‖²)
         = ∇f(xk) − (∇f(xk) − rk) + O(‖∇f(xk)‖²)
         = rk + O(‖∇f(xk)‖²),
so that with ηk = O(‖∇f(xk)‖)
‖∇f(xk+1)‖ ≤ ηk ‖∇f(xk)‖ + O(‖∇f(xk)‖²) ≤ O(‖∇f(xk)‖²).
Theorem: superlinear (quadratic) convergence
Suppose ∇2f(x) exists and is continuous in a neighbourhood of a minimiser x⋆, with ∇2f(x⋆) positive definite.
Let the sequence {xk} generated by the inexact Newton iteration with step length αk = 1, xk+1 = xk + pk, with stopping rule (iN-STOP), ηk ≤ η for some constant η ∈ [0, 1), and a starting point x0 sufficiently close to x⋆, converge to x⋆.
Then the rate of convergence is superlinear if ηk → 0.
If in addition ∇2f(x) is Lipschitz continuous for x ∈ N(x⋆) and ηk = O(‖∇fk‖), then the convergence is quadratic.

Remark: To obtain superlinear convergence we can set e.g. ηk = min(0.5, √‖∇fk‖). The choice ηk = min(0.5, ‖∇fk‖) would yield quadratic convergence.
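
A minimal helper implementing these two forcing-term choices (the function name is an assumption of the sketch):

import numpy as np

def forcing_term(grad_norm, superlinear=True):
    # eta_k = min(0.5, sqrt(||grad f_k||))  -> superlinear local rate
    # eta_k = min(0.5, ||grad f_k||)        -> quadratic local rate
    return min(0.5, np.sqrt(grad_norm)) if superlinear else min(0.5, grad_norm)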
Line search Newton CG
Also called the truncated Newton method. The key differences to the standard line search Newton method:
Solve the Newton step system with CG, with initial guess 0 and the termination criterion (iN-STOP) with a suitable choice of ηk, e.g. ηk = min(0.5, √‖∇fk‖) for superlinear convergence. Note that once we are close enough to the solution, the stopping tolerance decreases in each outer (line search) iteration.
The inner CG iteration can be preconditioned.
Away from the solution x⋆ the Hessian may not be positive definite. Therefore, we terminate CG whenever a direction of non-positive curvature is generated, i.e. dj^T ∇2fk dj ≤ 0. This guarantees that the produced search direction is a descent direction, and it preserves the fast convergence rate of the pure Newton method provided αk = 1 is used whenever it satisfies the acceptance criteria.
Weakness: performance deteriorates when the Hessian is nearly singular.
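
The following is a minimal sketch of the direction computation described above (inner CG with z0 = 0, the (iN-STOP) test and the non-positive-curvature safeguard). Function names and the forcing-term choice are assumptions; the outer line search is not shown.

import numpy as np

def newton_cg_direction(grad_k, hess_vec, max_iter=None):
    """Truncated-Newton (line search Newton-CG) search direction: a sketch.

    grad_k   : gradient at the current iterate x_k
    hess_vec : callable v -> (Hessian at x_k) @ v, e.g. matrix-free
    Returns a descent direction p_k.
    """
    n = grad_k.size
    max_iter = max_iter or 2 * n
    eta_k = min(0.5, np.sqrt(np.linalg.norm(grad_k)))   # forcing term (superlinear)
    tol = eta_k * np.linalg.norm(grad_k)                # (iN-STOP) threshold

    z = np.zeros(n)          # CG iterate, z_0 = 0
    r = grad_k.copy()        # residual of  H p = -grad f_k,  r_0 = grad f_k
    d = -r                   # first CG direction
    for j in range(max_iter):
        Hd = hess_vec(d)
        dHd = d @ Hd
        if dHd <= 0:                       # non-positive curvature detected
            return -grad_k if j == 0 else z
        alpha = (r @ r) / dHd
        z = z + alpha * d
        r_new = r + alpha * Hd
        if np.linalg.norm(r_new) <= tol:   # inexact-Newton stopping rule
            return z
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
    return z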
Trust region Newton CG
Use a special CG variant to solve the quadratic trust region model problem
min_{p∈Rn} mk(p) = fk + p^T ∇fk + (1/2) p^T ∇2fk p,    subject to ‖p‖ ≤ ∆k.
Modifications:
Use the termination criterion (iN-STOP) as in the line search variant, with a suitable choice of ηk, e.g. ηk = min(0.5, √‖∇fk‖) for superlinear convergence.
If CG generates a direction of non-positive curvature, i.e. dj^T ∇2fk dj ≤ 0, stop and return pk = zj + τ dj, where τ is chosen so that pk minimises mk along dj and satisfies ‖pk‖ = ∆k.
If the next iterate violates the trust region constraint, i.e. ‖zj+1‖ ≥ ∆k, stop and return pk = zj + τ dj with τ ≥ 0 chosen so that ‖pk‖ = ∆k.
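
A minimal sketch of this CG variant (often called CG-Steihaug) with both modifications; the helper names and the residual test via a forcing term ηk are assumptions of the sketch.

import numpy as np

def boundary_roots(z, d, delta):
    """Both roots tau of ||z + tau d|| = delta (quadratic in tau)."""
    a, b, c = d @ d, 2.0 * (z @ d), z @ z - delta ** 2
    disc = np.sqrt(b * b - 4 * a * c)
    return (-b - disc) / (2 * a), (-b + disc) / (2 * a)

def steihaug_cg(grad_k, hess_vec, delta, eta_k, max_iter=None):
    """Approximately minimise m_k(p) over ||p|| <= delta: a sketch.

    grad_k : gradient at x_k;  hess_vec : callable v -> B_k @ v;
    eta_k  : forcing term for the (iN-STOP)-style residual test."""
    model = lambda p: grad_k @ p + 0.5 * (p @ hess_vec(p))   # m_k(p) - f_k
    n = grad_k.size
    max_iter = max_iter or 2 * n
    tol = eta_k * np.linalg.norm(grad_k)
    z, r = np.zeros(n), grad_k.copy()     # z_0 = 0, r_0 = grad f_k
    d = -r
    if np.linalg.norm(r) <= tol:
        return z
    for _ in range(max_iter):
        Bd = hess_vec(d)
        dBd = d @ Bd
        if dBd <= 0:
            # non-positive curvature: return the boundary point minimising m_k along d
            return min((z + t * d for t in boundary_roots(z, d, delta)), key=model)
        alpha = (r @ r) / dBd
        z_next = z + alpha * d
        if np.linalg.norm(z_next) >= delta:
            # step would leave the trust region: stop on the boundary (tau >= 0)
            return z + boundary_roots(z, d, delta)[1] * d
        r_next = r + alpha * Bd
        if np.linalg.norm(r_next) <= tol:
            return z_next
        d = -r_next + ((r_next @ r_next) / (r @ r)) * d
        z, r = z_next, r_next
    return z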
The initialisation z0 = 0 is crucial:
Whenever the initial residual satisfies ‖r0‖ = ‖∇fk‖ ≥ εk (so CG does not stop immediately with pk = 0), the algorithm terminates at a point pk for which mk(pk) ≤ mk(pk^C), that is, the reduction in the model is at least that achieved by the Cauchy point.
If d0^T Bk d0 = ∇fk^T Bk ∇fk ≤ 0, the first branch (non-positive curvature) is triggered and the algorithm returns the Cauchy point pk = −(∆k/‖∇fk‖) ∇fk.
Otherwise, the algorithm defines
z1 = α0 d0 = (r0^T r0)/(d0^T Bk d0) d0 = −(∇fk^T ∇fk)/(∇fk^T Bk ∇fk) ∇fk.
If ‖z1‖ < ∆k then z1 is exactly the Cauchy point; subsequent steps ensure that the final pk satisfies mk(pk) ≤ mk(z1). When ‖z1‖ ≥ ∆k, the second branch (trust region boundary) is triggered and the algorithm terminates at the Cauchy point.
Therefore, it is globally convergent.
The norms of the CG iterates grow monotonically, ‖zj+1‖ > ‖zj‖ > · · · > ‖z1‖ > 0, as a consequence of the initialisation z0 = 0. Thus we can stop as soon as the boundary of the trust region is reached, because no further iterate giving a lower value of mk would lie inside the trust region.
Preconditioning can be used, but it requires a change of the trust region definition; the problem can be reformulated in the standard form in terms of the variable p̂ = Dp with modified ĝk = D^{−T} ∇fk and B̂k = D^{−T} (∇2fk) D^{−1}. Of particular interest is the incomplete Cholesky factorisation (Algorithm 7.3 in Nocedal and Wright).
A limitation of the algorithm is that it accepts any direction of negative curvature, even if this direction gives only an insignificant reduction in the model. To improve performance, CG can be replaced by the Lanczos method (which can be seen as a generalisation of CG that works for indefinite systems, albeit at a higher computational cost), for which techniques from the exact trust region method can be applied to compute a direction that moves quickly away from stationary points which are not minimisers.
Limited memory quasi-Newton methods
Recall the BFGS formula for the inverse Hessian approximation,
Hk+1 = (I − (sk yk^T)/(yk^T sk)) Hk (I − (yk sk^T)/(yk^T sk)) + (sk sk^T)/(yk^T sk),    (BFGS)
with sk = xk+1 − xk = αk pk and yk = ∇fk+1 − ∇fk. The application of the BFGS approximation Hk can be implemented efficiently by storing the list of vector pairs (sk, yk).
The limited memory version is obtained by restricting the total number of vector pairs used to construct the approximation to the last m ≪ n. After the m-th update, the oldest pair in the list is discarded to make space for the new pair.
The same strategy can be applied to the other quasi-Newton schemes (including updating Bk for use with e.g. trust region methods, rather than line search methods which require Hk).
Application: large problems with non-sparse Hessians.
Convergence: often only a linear convergence rate.
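
In the line search setting, the product Hk ∇fk with the limited-memory approximation is usually computed with the standard two-loop recursion. Below is a minimal sketch; the function name, the list-based storage and the initial scaling Hk^0 = γ I are assumptions of the sketch.

import numpy as np

def lbfgs_direction(grad_k, s_list, y_list):
    """Two-loop recursion: apply the limited-memory BFGS inverse Hessian
    approximation to -grad_k, using the stored pairs (s_i, y_i), oldest first."""
    q = grad_k.copy()
    alphas = []
    for s, y in reversed(list(zip(s_list, y_list))):   # newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    if s_list:
        s, y = s_list[-1], y_list[-1]
        gamma = (s @ y) / (y @ y)      # usual heuristic scaling H_k^0 = gamma * I
    else:
        gamma = 1.0
    r = gamma * q
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):   # oldest pair first
        rho = 1.0 / (y @ s)
        beta = rho * (y @ r)
        r = r + s * (a - beta)
    return -r    # search direction p_k = -H_k grad f_k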
Theoretical connection to CG methods
Consider the memoryless BFGS update
Hk+1 = (I − (sk yk^T)/(yk^T sk)) (I − (yk sk^T)/(yk^T sk)) + (sk sk^T)/(yk^T sk),
i.e. the previous approximation is reset to the identity, Hk = I.
If the memoryless BFGS update is applied in conjunction with an exact line search, i.e. pk^T ∇fk+1 = 0 for all k, we obtain
pk+1 = −Hk+1 ∇fk+1 = −∇fk+1 + (yk^T ∇fk+1)/(yk^T pk) pk,
which is exactly the nonlinear CG direction with the Hestenes-Stiefel formula; it reduces to the Polak-Ribière formula when ∇fk+1^T pk = 0:
β^HS_{k+1} = ∇fk+1^T (∇fk+1 − ∇fk) / (pk^T (∇fk+1 − ∇fk)),    β^PR_{k+1} = ∇fk+1^T (∇fk+1 − ∇fk) / (∇fk^T ∇fk).
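
A quick numerical check of this identity on an assumed quadratic test function (all names below are illustrative): with an exact line search, the memoryless BFGS direction coincides with the Hestenes-Stiefel CG direction.

import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)   # SPD Hessian
b = rng.standard_normal(n)
grad = lambda x: A @ x - b                                      # gradient of a quadratic

x = rng.standard_normal(n)
p = -grad(x)                                  # first direction: steepest descent
alpha = -(grad(x) @ p) / (p @ (A @ p))        # exact line search for a quadratic
x_new = x + alpha * p

s, y = x_new - x, grad(x_new) - grad(x)
g_new = grad(x_new)

# memoryless BFGS: apply the update above with H_k = I
I = np.eye(n)
H = (I - np.outer(s, y) / (y @ s)) @ (I - np.outer(y, s) / (y @ s)) + np.outer(s, s) / (y @ s)
p_bfgs = -H @ g_new

# Hestenes-Stiefel CG direction
beta_hs = (g_new @ y) / (p @ y)
p_hs = -g_new + beta_hs * p

print(np.allclose(p_bfgs, p_hs))   # True up to rounding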
Compact representation of BFGS update
Let B0 be symmetric positive definite and assume that the vector pairs {(si, yi)}, i = 0, . . . , k−1, satisfy si^T yi > 0. Applying k BFGS updates with these vector pairs to B0 yields

Bk = B0 − W M^{−1} W^T,    W = [B0 Sk  Yk],    M = [ Sk^T B0 Sk    Lk
                                                      Lk^T        −Dk ],

where Sk = [s0, . . . , sk−1], Yk = [y0, . . . , yk−1], Dk = diag(s0^T y0, . . . , sk−1^T yk−1), and Lk is strictly lower triangular with (Lk)_{i,j} = s_{i−1}^T y_{j−1} for i > j and zero otherwise.
In the limited memory version we replace the columns of Sk, Yk and the corresponding entries of Dk, Lk cyclically, keeping the m most recent pairs.
Since the dimension 2m of the middle matrix is small, its factorisation cost is negligible.
Cost of an update: 2mn + O(m³).
Cost of a product Bk v: (4m + 1)n + O(m³) (for B0 = δk I); see the sketch below.
This approximation can be used in trust region methods for unconstrained problems, but also in methods for constrained optimisation.
A similar compact representation can be derived for Hk.
A compact representation can also be derived for SR-1:
Bk = B0 + (Yk − B0 Sk)(Dk + Lk + Lk^T − Sk^T B0 Sk)^{−1} (Yk − B0 Sk)^T,
with Sk, Yk, Dk, Lk as before. The formula for the inverse Hk can be obtained by swapping B ↔ H and s ↔ y; however, limited memory SR-1 can be less effective than BFGS.
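
A minimal sketch of the product Bk v via the compact representation with B0 = δ I. Recomputing the small inner-product matrices on the fly is a simplifying assumption; in practice they would be updated incrementally to achieve the quoted cost.

import numpy as np

def compact_bfgs_matvec(v, S, Y, delta):
    """B_k @ v via the compact BFGS representation with B_0 = delta * I.

    S, Y : n x m arrays whose columns are the m stored pairs (s_i, y_i),
           oldest first. No safeguards for s_i^T y_i <= 0 (sketch only).
    """
    SY = S.T @ Y                       # m x m matrix of s_i^T y_j
    D = np.diag(np.diag(SY))           # D_k = diag(s_i^T y_i)
    L = np.tril(SY, k=-1)              # L_k: strictly lower triangular part
    # middle matrix M of the compact representation, with B_0 = delta * I
    M = np.block([[delta * (S.T @ S), L],
                  [L.T,              -D]])
    W = np.hstack([delta * S, Y])      # W = [B_0 S_k  Y_k]
    return delta * v - W @ np.linalg.solve(M, W.T @ v)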
Sparse quasi-Newton updates
We require the quasi-Newton approximation Bk to the Hessian to have the same (or a similar) sparsity pattern as the true Hessian.
Suppose that we know which components of the Hessian are non-zero,
Ω = {(i, j) : [∇2f(x)]ij ≠ 0 for some point x in the domain of f},
and suppose that the current approximation Bk mirrors this sparsity structure. Such a sparse update can be obtained as the solution of the following quadratic program:
min_B ‖B − Bk‖_F² = Σ_{(i,j)∈Ω} [Bij − (Bk)ij]²,
subject to B sk = yk,  B = B^T,  Bij = 0 for all (i, j) ∉ Ω.
It can be shown that the solution of this problem can be obtained by solving an n × n linear system with sparsity pattern Ω. The resulting Bk+1 is not guaranteed to be positive definite, but it can be used within a trust region method.
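
For illustration, a small dense sketch of this quadratic program solved directly via its KKT system, parametrising the free entries of the upper triangle. The function name, the pattern representation and the least-squares handling of redundant constraints are assumptions; as noted above, an efficient solver would instead work with the n × n sparse system.

import numpy as np

def sparse_quasi_newton_update(B_k, s, y, pattern):
    """Sparsity-preserving symmetric update (dense sketch for small n).

    Solves: min ||B - B_k||_F^2  s.t.  B s = y,  B = B^T,  B_ij = 0 outside pattern.
    pattern : set of (i, j) index pairs with i <= j allowed to be non-zero.
    """
    n = len(s)
    free = sorted(pattern)                                        # free upper-triangular entries
    p = len(free)
    w = np.array([1.0 if i == j else 2.0 for i, j in free])       # Frobenius weights
    x0 = np.array([B_k[i, j] for i, j in free])

    # linear map A x = B s expressed in the free variables
    A = np.zeros((n, p))
    for col, (i, j) in enumerate(free):
        A[i, col] += s[j]
        if i != j:
            A[j, col] += s[i]

    # KKT system of the equality-constrained least-squares problem
    K = np.block([[np.diag(2.0 * w), A.T],
                  [A, np.zeros((n, n))]])
    rhs = np.concatenate([2.0 * w * x0, y])
    sol = np.linalg.lstsq(K, rhs, rcond=None)[0]   # lstsq in case constraints are redundant

    B = np.zeros((n, n))
    for (i, j), val in zip(free, sol[:p]):
        B[i, j] = B[j, i] = val
    return B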
Unfortunately, this approach has several drawbacks: it is not scale invariant under linear transformations and its practical performance is disappointing. The fundamental weakness is that closeness in the Frobenius norm is an inadequate model, and the produced approximations can be poor.
An alternative approach is to relax the secant equation, requiring it to be approximately satisfied at the m last steps (as opposed to holding strictly at the last step), and to solve
min_B ‖B Sk − Yk‖_F²,
subject to B = B^T,  Bij = 0 for all (i, j) ∉ Ω,
with Sk, Yk containing the last m vectors si, yi, respectively.
This convex optimisation problem has a solution but it is not easy
to compute. Furthermore, it can produce singular and poorly
conditioned Hessian approximations. Even though it frequently
outperforms the previous approach, its performance is still not
impressive for large scale problems.
Partially separable functions
An unconstrained optimisation problem is separable if the objective function f : Rn → R can be decomposed into a sum of independent functions, e.g.
f(x) = f1(x1, x3) + f2(x2, x4, x6) + f3(x5).
The optimal value can then be found by optimising each function independently, which is in general much less expensive.
In many large scale problems the objective function f : Rn → R is not separable, but it can still be written as a sum of simpler component functions. Each such component has the property that it only changes in a small number of directions, while along the remaining directions it stays constant. We call such functions partially separable.
All functions which have a sparse Hessian are partially separable,
but there are many partially separable functions with dense
Hessians. Partial separability allows for economical representation
and effective quasi-Newton updating.
Consider an objective function f : Rn → R,
f(x) = Σ_{i=1}^ℓ fi(x),
where each fi depends only on a few components of x. For such fi, the gradient and Hessian contain only a few non-zeros.
For the function f we have, by linearity of differentiation,
∇f(x) = Σ_{i=1}^ℓ ∇fi(x),    ∇2f(x) = Σ_{i=1}^ℓ ∇2fi(x),
thus we can maintain a quasi-Newton approximation to each individual component Hessian ∇2fi(x) instead of approximating the entire Hessian ∇2f(x).
Example: partially separable approximation
Consider a partially separable objective function
f(x) = (x1 − x3²)² + (x2 − x4²)² + (x3 − x2²)² + (x4 − x1²)²
     =: f1(x) + f2(x) + f3(x) + f4(x).
Each fi depends on two components only, and all have the same form. Denote
x^[1] = [ x1
          x3 ],    U1 = [ 1 0 0 0
                          0 0 1 0 ],    x^[1] = U1 x,    φ1(z1, z2) = (z1 − z2²)².
Then f1(x) = φ1(U1 x) and, using the chain rule, we obtain
∇f1(x) = U1^T ∇φ1(U1 x),    ∇2f1(x) = U1^T ∇2φ1(U1 x) U1.
For the Hessians ∇2φ1 and ∇2f1 we have
∇2φ1(U1 x) = [  2        −4x3
               −4x3   12x3² − 4x1 ],

∇2f1(x) = [  2      0   −4x3           0
             0      0    0             0
            −4x3    0   12x3² − 4x1    0
             0      0    0             0 ].
Idea: maintain a quasi-Newton approximation B^[1] to the 2 × 2 Hessian ∇2φ1 and lift it up to ∇2f1.
After a step from xk to xk+1, set
s^[1]_k = x^[1]_{k+1} − x^[1]_k,    y^[1]_k = ∇φ1(x^[1]_{k+1}) − ∇φ1(x^[1]_k),
and use BFGS or SR-1 updating to obtain the new approximation B^[1]_{k+1} of the small, dense Hessian ∇2φ1, which is lifted back via
∇2f1(x) ≈ U1^T B^[1]_{k+1} U1.
We do the same for all component functions and obtain
∇2f ≈ B = Σ_{i=1}^ℓ Ui^T B^[i] Ui.
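
A minimal sketch of this mechanism for the four-element example above: exact element gradients and Hessians of φ(z1, z2) = (z1 − z2²)² are lifted, with index sets playing the role of the Ui; the small matrices B^[i] could equally be quasi-Newton approximations. Function names and the index-set representation of Ui are assumptions of the sketch.

import numpy as np

# index sets of the four element functions:
# f_i(x) = phi(x[idx[i][0]], x[idx[i][1]]) with phi(z1, z2) = (z1 - z2**2)**2
idx = [(0, 2), (1, 3), (2, 1), (3, 0)]      # 0-based (x1,x3), (x2,x4), (x3,x2), (x4,x1)

def phi_grad(z1, z2):
    r = z1 - z2 ** 2
    return np.array([2 * r, -4 * z2 * r])

def phi_hess(z1, z2):
    return np.array([[2.0, -4 * z2],
                     [-4 * z2, 12 * z2 ** 2 - 4 * z1]])

def assemble(x, element_hessians=None):
    """Gradient and (approximate) Hessian of f = sum_i phi(U_i x).

    If element_hessians is given (a list of 2x2 matrices B^[i], e.g. BFGS or
    SR-1 approximations of the element Hessians), they are lifted in place
    of the exact 2x2 Hessians; otherwise the exact ones are used."""
    n = len(x)
    g = np.zeros(n)
    B = np.zeros((n, n))
    for i, (a, b) in enumerate(idx):
        z1, z2 = x[a], x[b]
        ge = phi_grad(z1, z2)
        Be = phi_hess(z1, z2) if element_hessians is None else element_hessians[i]
        g[[a, b]] += ge                       # U_i^T grad phi_i
        B[np.ix_([a, b], [a, b])] += Be       # U_i^T B^[i] U_i
    return g, B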
The approximate Hessian may be used in a trust region algorithm, obtaining an approximate solution of Bk pk = −∇fk.
Bk does not need to be assembled explicitly: the conjugate gradient method can be used, and the products Bk v can be evaluated directly using the matrices Ui and B^[i] (see the sketch after this list).
This approach is particularly useful when the number of variables is large and each component function depends on only very few of them. Each component Hessian can then be approximated much faster (a small problem requires only a few directions), and the resulting approximation of the full Hessian is usually much better than one obtained by a quasi-Newton method applied to the problem while ignoring the partially separable structure (a large Hessian requires many directions to capture its curvature).
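
A minimal matrix-free product Bk v built from the element matrices (the representation of Ui by index tuples follows the sketch above and is an assumption); the resulting callable can be passed as hess_vec to CG routines such as the ones sketched earlier.

import numpy as np

def partially_separable_matvec(v, idx, element_B):
    """B @ v = sum_i U_i^T B^[i] (U_i v) without assembling B.

    idx       : list of index tuples defining each U_i (as in the example above)
    element_B : list of small quasi-Newton matrices B^[i]
    """
    out = np.zeros_like(v)
    for rows, Be in zip(idx, element_B):
        rows = list(rows)
        out[rows] += Be @ v[rows]
    return out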
It is not always possible to update the element Hessian B^[i] with BFGS, as the curvature condition (s^[i])^T y^[i] > 0 may not be satisfied even if the full Hessian is at least positive semidefinite. This can be overcome by applying the SR-1 update to the component Hessians, which has proved effective in practice.
A limitation of this quasi-Newton approach is the cost of computing the step, which is comparable to the cost of a Newton step; it may thus be beneficial to actually take the Newton step.
Another problem is the difficulty of identifying the partially
separable structure of a function. The performance of
quasi-Newton methods is satisfactory provided that we find
the finest partially separable decomposition.