
Numerical Optimisation:
Large scale methods

Marta M. Betcke
m.betcke@ucl.ac.uk,

Kiko Rullan
f.rullan@cs.ucl.ac.uk

Department of Computer Science,
Centre for Medical Image Computing,

Centre for Inverse Problems
University College London

Lecture 9


Issues arising from large scale

Hessian solve: line search and trust region methods require factorisation of the Hessian. For large scale problems this is infeasible and the solve has to be performed using large scale techniques such as sparse factorisations or iterative methods.

Hessian computation and storage: Hessian approximations generated by quasi-Newton methods are usually dense even if the true Hessian is sparse. Limited-memory variants have been developed, where the Hessian approximation can be stored using only a few vectors (at the cost of slower convergence). Alternatively, approximate Hessians preserving sparsity can be used.

Special structural properties of the objective function can be exploited, such as partial separability, i.e. the function can be decomposed into a sum of simpler functions, each depending only on a small subspace of $\mathbb{R}^n$.


Inexact Newton methods

Solve the Newton step system

$$\underbrace{\nabla^2 f_k}_{=:A}\, p_k^N = \underbrace{-\nabla f_k}_{=:b}$$

using an iterative method: CG or Lanczos, with a modification to handle negative curvature.

The implementation can be matrix free, i.e. the Hessian does not need to be computed or stored explicitly; we only require a routine which evaluates Hessian-vector products.
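
As an illustration, here is a minimal Python sketch (an addition, not part of the slides) of such a matrix-free routine, using the forward-difference approximation $\nabla^2 f(x)v \approx (\nabla f(x+\varepsilon v) - \nabla f(x))/\varepsilon$; the names `grad` and `hessian_vector_product` are placeholders.

```python
import numpy as np

def hessian_vector_product(grad, x, v, eps=1e-6):
    """Approximate (nabla^2 f(x)) @ v without forming the Hessian,
    using a forward difference of gradients:
        nabla^2 f(x) v  ~  (grad(x + eps*v) - grad(x)) / eps.
    Only two gradient evaluations are needed per product.
    """
    return (grad(x + eps * v) - grad(x)) / eps

# Quick check on f(x) = 0.5 x^T A x, whose Hessian is exactly A.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
grad = lambda x: A @ x
x0 = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])
print(hessian_vector_product(grad, x0, v))   # close to A @ v = [ 1.  -2.5]
```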

Question: how does the inexact solve impact the local convergence of the Newton method?


Most of the termination rules for iterative methods are based on
the residual

$$r_k = \nabla^2 f_k\, p_k^N + \nabla f_k,$$

where $p_k^N$ is the inexact Newton step.

Usually we terminate CG when

$$\|r_k\| \le \eta_k \|\nabla f_k\|, \qquad \text{(iN-STOP)}$$

where $\{\eta_k\}$ is some sequence with $0 < \eta_k < 1$.

For the moment we assume that a step of length $\alpha_k = 1$ is taken, i.e. globalisation strategies do not interfere with the inexact Newton step.

Theorem: local convergence

Suppose $\nabla^2 f(x)$ exists and is continuous in a neighbourhood of a minimiser $x^\star$, with $\nabla^2 f(x^\star)$ positive definite. Consider the inexact Newton iteration with step length $\alpha_k = 1$, $x_{k+1} = x_k + p_k$, with a starting point $x_0$ sufficiently close to $x^\star$, terminated with the stopping criterion (iN-STOP) with $\eta_k \le \eta$ for some constant $\eta \in [0, 1)$.
Then the sequence $\{x_k\}$ converges to $x^\star$ and satisfies

$$\|\nabla^2 f(x^\star)(x_{k+1} - x^\star)\| \le \hat{\eta}\, \|\nabla^2 f(x^\star)(x_k - x^\star)\|,$$

for some constant $\hat{\eta}$ with $\eta < \hat{\eta} < 1$.

Remark: this result provides convergence for $\{\eta_k\}$ bounded away from 1.

Proof idea convergence (superlinear):

Continuity of $\nabla^2 f(x)$ in a neighbourhood $\mathcal{N}(x^\star)$ of $x^\star$ implies
$$\nabla f(x_k) = \nabla^2 f(x^\star)(x_k - x^\star) + o(\|x_k - x^\star\|),$$
thus we show instead $\|\nabla f(x_{k+1})\| \le \hat{\eta}\, \|\nabla f(x_k)\|$.

Continuity and positive definiteness of $\nabla^2 f(x)$ in $\mathcal{N}(x^\star)$ imply that there exists $L > 0$ such that $\|\nabla^2 f(x_k)^{-1}\| \le L$ for all $x_k \in \mathcal{N}(x^\star)$, and hence

$$\|p_k\| \le L(\|\nabla f(x_k)\| + \|r_k\|) \le 2L\|\nabla f(x_k)\|.$$

From Taylor's theorem and continuity of $\nabla^2 f(x)$ in $\mathcal{N}(x^\star)$ we have

$$\begin{aligned}
\nabla f(x_{k+1}) &= \nabla f(x_k) + \nabla^2 f(x_k)p_k + \int_0^1 \big[\nabla^2 f(x_k + t p_k) - \nabla^2 f(x_k)\big] p_k \, dt \\
&= \nabla f(x_k) + \nabla^2 f(x_k)p_k + o(\|p_k\|) \\
&= \nabla f(x_k) - (\nabla f(x_k) - r_k) + o(\|\nabla f(x_k)\|) = r_k + o(\|\nabla f(x_k)\|),
\end{aligned}$$

so that
$$\|\nabla f(x_{k+1})\| \le \eta_k \|\nabla f(x_k)\| + o(\|\nabla f(x_k)\|) \le (\eta_k + o(1))\|\nabla f(x_k)\|,$$
which with $\eta_k = o(1)$ gives $\|\nabla f(x_{k+1})\| \le o(\|\nabla f(x_k)\|)$.


Proof idea convergence (quadratic):

Continuity of $\nabla^2 f(x)$ in a neighbourhood $\mathcal{N}(x^\star)$ of $x^\star$ implies
$$\nabla f(x_k) = \nabla^2 f(x^\star)(x_k - x^\star) + o(\|x_k - x^\star\|),$$
thus we show instead $\|\nabla f(x_{k+1})\| = O(\|\nabla f(x_k)\|^2)$.

Continuity and positive definiteness of $\nabla^2 f(x)$ in $\mathcal{N}(x^\star)$ imply that there exists $L > 0$ such that $\|\nabla^2 f(x_k)^{-1}\| \le L$ for all $x_k \in \mathcal{N}(x^\star)$, and hence

$$\|p_k\| \le L(\|\nabla f(x_k)\| + \|r_k\|) \le 2L\|\nabla f(x_k)\|.$$

From Taylor's theorem and Lipschitz continuity of $\nabla^2 f(x)$ in $\mathcal{N}(x^\star)$

$$\begin{aligned}
\nabla f(x_{k+1}) &= \nabla f(x_k) + \nabla^2 f(x_k)p_k + \int_0^1 \big[\nabla^2 f(x_k + t p_k) - \nabla^2 f(x_k)\big] p_k \, dt \\
&= \nabla f(x_k) + \nabla^2 f(x_k)p_k + O(\|p_k\|^2) \\
&= \nabla f(x_k) - (\nabla f(x_k) - r_k) + O(\|\nabla f(x_k)\|^2) = r_k + O(\|\nabla f(x_k)\|^2),
\end{aligned}$$

and with $\eta_k = O(\|\nabla f(x_k)\|)$
$$\|\nabla f(x_{k+1})\| \le \eta_k \|\nabla f(x_k)\| + O(\|\nabla f(x_k)\|^2) \le O(\|\nabla f(x_k)\|^2).$$


Theorem: superlinear (quadratic) convergence

Suppose $\nabla^2 f(x)$ exists and is continuous in a neighbourhood of a minimiser $x^\star$, with $\nabla^2 f(x^\star)$ positive definite.
Let the sequence $\{x_k\}$ generated by the inexact Newton iteration with step length $\alpha_k = 1$, $x_{k+1} = x_k + p_k$, with stopping criterion (iN-STOP), $\eta_k \le \eta$ for some constant $\eta \in [0, 1)$ and a starting point $x_0$ sufficiently close to $x^\star$, converge to $x^\star$.

Then the rate of convergence is superlinear if $\eta_k \to 0$.

If in addition $\nabla^2 f(x)$ is Lipschitz continuous for $x \in \mathcal{N}(x^\star)$ and $\eta_k = O(\|\nabla f_k\|)$, then the convergence is quadratic.

Remark: to obtain superlinear convergence we can set e.g. $\eta_k = \min(0.5, \sqrt{\|\nabla f_k\|})$. The choice $\eta_k = \min(0.5, \|\nabla f_k\|)$ would yield quadratic convergence.


Line search Newton CG

Also called the truncated Newton method. The key differences to the standard line search Newton method:

Solve the Newton step system with CG, with initial guess $0$ and the termination criterion (iN-STOP) with a suitable choice of $\eta_k$, e.g. $\eta_k = \min(0.5, \sqrt{\|\nabla f_k\|})$ for superlinear convergence (a sketch of the resulting method is given below).

Note that, close enough to the solution, the stopping tolerance decreases in each outer (line search) iteration.

The inner CG iteration can be preconditioned.

Away from the solution $x^\star$ the Hessian may not be positive definite. Therefore, we terminate CG whenever a direction of non-positive curvature is generated, i.e. $d_j^T \nabla^2 f_k d_j \le 0$. This guarantees that the produced search direction is a descent direction and preserves the fast pure Newton convergence rate, provided $\alpha_k = 1$ is used whenever it satisfies the acceptance criteria.

Weakness: performance when the Hessian is nearly singular.
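
The following is a minimal Python sketch of a line search Newton-CG method along the lines described above; it is not the course's reference implementation, and `f`, `grad` and `hessvec` are user-supplied placeholders. It uses the forcing sequence $\eta_k = \min(0.5, \sqrt{\|\nabla f_k\|})$ and falls back to steepest descent if non-positive curvature is met in the first inner iteration.

```python
import numpy as np

def newton_cg(f, grad, hessvec, x0, tol=1e-8, max_outer=100):
    """Sketch of a line search (truncated) Newton-CG method.

    hessvec(x, v) should return the Hessian-vector product
    nabla^2 f(x) @ v (possibly computed matrix-free).
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_outer):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            break
        eta = min(0.5, np.sqrt(gnorm))      # forcing term -> superlinear rate

        # Inner CG on nabla^2 f_k p = -grad f_k, started from z = 0.
        z = np.zeros_like(x)
        r, d = g.copy(), -g.copy()
        p = -g                              # fallback: steepest descent direction
        for _ in range(2 * x.size):
            Bd = hessvec(x, d)
            dBd = d @ Bd
            if dBd <= 0:                    # non-positive curvature encountered
                p = -g if np.allclose(z, 0.0) else z
                break
            alpha_cg = (r @ r) / dBd
            z = z + alpha_cg * d
            r_new = r + alpha_cg * Bd
            p = z
            if np.linalg.norm(r_new) <= eta * gnorm:    # (iN-STOP)
                break
            d = -r_new + ((r_new @ r_new) / (r @ r)) * d
            r = r_new

        # Backtracking (sufficient decrease) line search, starting from alpha = 1.
        alpha, c1 = 1.0, 1e-4
        while alpha > 1e-12 and f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= 0.5
        x = x + alpha * p
    return x
```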


Trust region Newton CG

Use a special CG variant to solve the quadratic trust region model
problem

$$\min_{p\in\mathbb{R}^n} \; m_k(p) = f_k + p^T \nabla f_k + \tfrac{1}{2}\, p^T \nabla^2 f_k\, p, \qquad \text{subject to } \|p\| \le \Delta_k.$$

Modifications:

Use the termination criterion (iN-STOP) as in the line search variant with a suitable choice of $\eta_k$, e.g. $\eta_k = \min(0.5, \sqrt{\|\nabla f_k\|})$ for superlinear convergence.

If CG generates a direction of non-positive curvature, i.e. $d_j^T \nabla^2 f_k d_j \le 0$, stop and return $p_k = z_j + \tau d_j$ which minimises $m_k(p_k)$ along $d_j$ and satisfies $\|p_k\| = \Delta_k$.

If the current iterate violates the trust region constraint, i.e. $\|z_{j+1}\| \ge \Delta_k$, stop and return $p_k = z_j + \tau d_j$, $\tau \ge 0$, which satisfies $\|p_k\| = \Delta_k$.

(A sketch of this CG variant is given below.)
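
A minimal sketch of such a Steihaug-type CG solver for the subproblem, assuming the caller supplies the gradient vector and a Hessian-vector product routine (the names are placeholders, not from the lecture):

```python
import numpy as np

def cg_steihaug(g, hessvec, delta, eta):
    """Sketch of a Steihaug-type CG solver for the trust region subproblem
    min_p m_k(p) s.t. ||p|| <= delta (notation as on the slides).

    g is the gradient nabla f_k; hessvec(v) returns nabla^2 f_k @ v.
    """
    z = np.zeros_like(g)
    r = g.copy()                 # residual of nabla^2 f_k p = -nabla f_k at z = 0
    d = -r
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return z
    for _ in range(2 * g.size):
        Bd = hessvec(d)
        dBd = d @ Bd
        if dBd <= 0:
            # Non-positive curvature: move to the trust region boundary along d.
            return z + tau_to_boundary(z, d, delta) * d
        alpha = (r @ r) / dBd
        z_next = z + alpha * d
        if np.linalg.norm(z_next) >= delta:
            # The step would leave the trust region: stop on the boundary.
            return z + tau_to_boundary(z, d, delta) * d
        r_next = r + alpha * Bd
        if np.linalg.norm(r_next) <= eta * gnorm:        # (iN-STOP)
            return z_next
        d = -r_next + ((r_next @ r_next) / (r @ r)) * d
        z, r = z_next, r_next
    return z

def tau_to_boundary(z, d, delta):
    """Positive root tau of ||z + tau d|| = delta."""
    a, b, c = d @ d, 2.0 * (z @ d), z @ z - delta ** 2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
```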



The initialisation $z_0 = 0$ is crucial:

Whenever $\|\nabla f_k\| \ge \varepsilon_k$ (the inner CG tolerance), the algorithm terminates at a point $p_k$ for which $m_k(p_k) \le m_k(p_k^C)$, that is, the reduction in the model is at least that of the Cauchy point.

If $d_0^T B_k d_0 = \nabla f_k^T B_k \nabla f_k \le 0$, the first "if" (the non-positive curvature check) is activated and the algorithm returns the Cauchy point $p = -(\Delta_k/\|\nabla f_k\|)\nabla f_k$. Otherwise, the algorithm defines

$$z_1 = \alpha_0 d_0 = \frac{r_0^T r_0}{d_0^T B_k d_0}\, d_0 = -\frac{\nabla f_k^T \nabla f_k}{\nabla f_k^T B_k \nabla f_k}\, \nabla f_k.$$

If $\|z_1\| < \Delta_k$, then $z_1$ is exactly the Cauchy point and subsequent steps ensure that the final $p_k$ satisfies $m_k(p_k) \le m_k(z_1)$. When $\|z_1\| \ge \Delta_k$, the second "if" (the trust region boundary check) is activated and the algorithm terminates at the Cauchy point. Therefore, the method is globally convergent.

$\|z_{j+1}\| > \|z_j\| > \cdots > \|z_1\|$ as a consequence of the initialisation $z_0 = 0$. Thus we can stop as soon as the boundary of the trust region has been reached, because no further iterate giving a lower value of $m_k$ will lie inside the trust region.


Preconditioning can be used, but it requires a change of the trust region definition; the problem can be reformulated in standard form in terms of the variable $\hat{p} = Dp$ with the modified quantities $\hat{g}_k = D^{-T}\nabla f_k$ and $\hat{B}_k = D^{-T}(\nabla^2 f_k)D^{-1}$. Of particular interest is the incomplete Cholesky factorisation (Algorithm 7.3 in Nocedal and Wright). (A small numerical check of this change of variables is given below.)

The limitation of the algorithm is that it accepts any direction of negative curvature, even if this direction gives an insignificant reduction in the model. To improve performance, CG can be replaced by the Lanczos method (which can be seen as a generalisation of CG that works for indefinite systems, albeit at a higher computational cost), for which techniques from the exact trust region solution can be applied to compute a direction that quickly moves away from stationary points which are not minimisers.
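
As a small sanity check (an addition, not from the slides), the following Python snippet verifies numerically that the change of variables $\hat{p} = Dp$ with $\hat{g}_k = D^{-T}\nabla f_k$ and $\hat{B}_k = D^{-T}\nabla^2 f_k D^{-1}$ leaves the model value unchanged, while the constraint $\|Dp\| \le \Delta_k$ becomes the standard form $\|\hat{p}\| \le \Delta_k$. The matrices below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
g = rng.standard_normal(n)
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)              # a symmetric model Hessian
D = np.diag(rng.uniform(0.5, 2.0, n))    # e.g. from an (incomplete) Cholesky factor
p = rng.standard_normal(n)

p_hat = D @ p
g_hat = np.linalg.solve(D.T, g)                          # D^{-T} g
B_hat = np.linalg.solve(D.T, B) @ np.linalg.inv(D)       # D^{-T} B D^{-1}

m_val = g @ p + 0.5 * p @ B @ p
m_hat = g_hat @ p_hat + 0.5 * p_hat @ B_hat @ p_hat
print(np.isclose(m_val, m_hat))                                   # True: same model value
print(np.isclose(np.linalg.norm(D @ p), np.linalg.norm(p_hat)))   # constraint norms agree
```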


Limited memory quasi-Newton methods

Recall the BFGS formula

$$H_{k+1} = \left(I - \frac{s_k y_k^T}{y_k^T s_k}\right) H_k \left(I - \frac{y_k s_k^T}{y_k^T s_k}\right) + \frac{s_k s_k^T}{y_k^T s_k} \qquad \text{(BFGS)}$$

with $s_k = x_{k+1} - x_k = \alpha_k p_k$, $y_k = \nabla f_{k+1} - \nabla f_k$. Application of the BFGS (inverse) Hessian approximation can be implemented efficiently by storing the list of vector pairs $(s_k, y_k)$.

The limited memory version is obtained by restricting the total number of vector pairs used to construct the Hessian approximation to the last $m \ll n$. After the $m$th update, the oldest pair in the list is discarded to make space for the new pair. (A sketch of the standard two-loop recursion for applying $H_k$ from the stored pairs is given below.)

The same strategy can be applied to the other quasi-Newton schemes (including updating $B_k$, for use with e.g. trust region methods, rather than $H_k$, which is required by line search methods).

Application: large, non-sparse Hessians.
Convergence: often linear convergence rate.
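
For concreteness, here is a minimal sketch of the standard two-loop recursion that applies the limited-memory $H_k$ to a gradient using only the stored pairs; the function and variable names are illustrative.

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Two-loop recursion: returns p = -H_k g using only the m stored pairs
    (s_i, y_i), oldest first, without ever forming H_k. The initial matrix is
    the usual scaling H_k^0 = gamma_k I with
    gamma_k = s_{k-1}^T y_{k-1} / (y_{k-1}^T y_{k-1}).
    """
    q = g.copy()
    alphas, rhos = [], []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q = q - a * y
        rhos.append(rho)
        alphas.append(a)
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1]) if s_list else 1.0
    r = gamma * q
    for (s, y), rho, a in zip(zip(s_list, y_list), reversed(rhos), reversed(alphas)):
        b = rho * (y @ r)
        r = r + (a - b) * s
    return -r

# Usage: after each accepted step append s_k = x_{k+1} - x_k and
# y_k = grad_{k+1} - grad_k to the lists, dropping the oldest pair once
# more than m pairs are stored.
```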


Theoretical connection to CG methods

Consider the memoryless BFGS

$$H_{k+1} = \left(I - \frac{s_k y_k^T}{y_k^T s_k}\right)\left(I - \frac{y_k s_k^T}{y_k^T s_k}\right) + \frac{s_k s_k^T}{y_k^T s_k}$$

i.e. the previous inverse Hessian approximation is reset to the identity, $H_k = I$.
If the memoryless BFGS is applied in conjunction with an exact line search, i.e. $p_k^T \nabla f_{k+1} = 0$ for all $k$, we obtain

$$p_{k+1} = -H_{k+1}\nabla f_{k+1} = -\nabla f_{k+1} + \frac{y_k^T \nabla f_{k+1}}{y_k^T p_k}\, p_k,$$

which is exactly the CG direction with the Hestenes–Stiefel formula for $\beta$, which in turn reduces to the Polak–Ribière formula when $\nabla f_{k+1}^T p_k = 0$:

$$\beta^{HS}_{k+1} = \frac{\nabla f_{k+1}^T(\nabla f_{k+1} - \nabla f_k)}{p_k^T(\nabla f_{k+1} - \nabla f_k)}, \qquad \beta^{PR}_{k+1} = \frac{\nabla f_{k+1}^T(\nabla f_{k+1} - \nabla f_k)}{\nabla f_k^T \nabla f_k}.$$
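
A small numerical check (an addition to the slides) of this equivalence on a strictly convex quadratic, where the exact line search step length is available in closed form; the problem data are synthetic.

```python
import numpy as np

# On a strictly convex quadratic with exact line search, the memoryless
# BFGS direction coincides with the Hestenes-Stiefel CG direction.
rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)             # SPD Hessian of f(x) = 0.5 x^T A x - b^T x
b = rng.standard_normal(n)
grad = lambda x: A @ x - b

x = rng.standard_normal(n)
g = grad(x)
p = -g                                   # first search direction
alpha = (g @ g) / (p @ A @ p)            # exact line search step along p
x_new = x + alpha * p
g_new = grad(x_new)
s, y = x_new - x, g_new - g

# Memoryless BFGS: H_{k+1} built from the single pair (s, y) with H_k = I.
I = np.eye(n)
H = (I - np.outer(s, y) / (y @ s)) @ (I - np.outer(y, s) / (y @ s)) + np.outer(s, s) / (y @ s)
p_bfgs = -H @ g_new

# Hestenes-Stiefel CG direction.
beta_hs = (g_new @ (g_new - g)) / (p @ (g_new - g))
p_hs = -g_new + beta_hs * p

print(np.allclose(p_bfgs, p_hs))         # True
```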


Compact representation of BFGS update

Let $B_0$ be symmetric positive definite and assume that the vector pairs $\{s_i, y_i\}_{i=0}^{k-1}$ satisfy $s_i^T y_i > 0$. Applying $k$ BFGS updates with these vector pairs to $B_0$ yields (Nocedal & Wright, Theorem 7.4)

$$B_k = B_0 - \begin{bmatrix} B_0 S_k & Y_k \end{bmatrix} \begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1} \begin{bmatrix} S_k^T B_0 \\ Y_k^T \end{bmatrix},$$

where $S_k = [s_0, \dots, s_{k-1}]$, $Y_k = [y_0, \dots, y_{k-1}]$, $D_k = \mathrm{diag}(s_0^T y_0, \dots, s_{k-1}^T y_{k-1})$ and $(L_k)_{ij} = s_{i-1}^T y_{j-1}$ for $i > j$, zero otherwise.


In the limited memory version we replace the columns or diagonal entries in these matrices cyclically (keeping the last $m$ columns).

Since the dimension of the middle matrix is small ($2m \times 2m$), the factorisation cost is negligible.

Cost of an update: $2mn + O(m^3)$.
Cost of a product $B_k v$: $(4m + 1)n + O(m^3)$ (for $B_0 = \delta_k I$).

This approximation can be used in trust region methods for unconstrained problems, but also in methods for constrained optimisation.

A similar compact representation can be derived for $H_k$.

A compact representation can also be derived for SR-1:

$$B_k = B_0 + (Y_k - B_0 S_k)(D_k + L_k + L_k^T - S_k^T B_0 S_k)^{-1}(Y_k - B_0 S_k)^T$$

with $S_k, Y_k, D_k, L_k$ as before. The inverse formula for $H_k$ can be obtained by swapping $B \leftrightarrow H$, $s \leftrightarrow y$; however, limited memory SR-1 can be less effective than BFGS.
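
The compact representation can be verified numerically against explicit BFGS updates; the following sketch (an addition, using synthetic curvature pairs chosen so that $s_i^T y_i > 0$) does this for a small example.

```python
import numpy as np

# Verify the compact representation of m BFGS updates against the explicit
# update formula, for synthetic curvature pairs with s_i^T y_i > 0.
rng = np.random.default_rng(2)
n, m = 8, 4
B0 = np.eye(n)
S = rng.standard_normal((n, m))
A = np.diag(rng.uniform(1.0, 5.0, n))   # pretend y_i = A s_i, so s_i^T y_i > 0
Y = A @ S

# m explicit BFGS updates of B.
B = B0.copy()
for i in range(m):
    s, y = S[:, i], Y[:, i]
    B = B - np.outer(B @ s, B @ s) / (s @ B @ s) + np.outer(y, y) / (y @ s)

# Compact representation (as reconstructed above).
SY = S.T @ Y
L = np.tril(SY, k=-1)                   # (L)_{ij} = s_i^T y_j for i > j, else 0
D = np.diag(np.diag(SY))                # D = diag(s_i^T y_i)
W = np.hstack([B0 @ S, Y])
Mmid = np.block([[S.T @ B0 @ S, L], [L.T, -D]])
B_compact = B0 - W @ np.linalg.solve(Mmid, W.T)

print(np.allclose(B, B_compact))        # True
```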


Sparse quasi-Newton updates

We require the quasi-Newton approximation $B_k$ to the Hessian to have the same (or similar) sparsity pattern as the true Hessian. Suppose that we know which components of the Hessian are non-zero,

$$\Omega = \big\{ (i,j) : [\nabla^2 f(x)]_{ij} \ne 0 \text{ for some point } x \text{ in the domain of } f \big\},$$

and suppose that the current approximation $B_k$ mirrors this sparsity structure. Such a sparse update can be obtained as the solution of the following quadratic program

$$\min_B \; \|B - B_k\|_F^2 = \sum_{(i,j)\in\Omega} \big[B_{ij} - (B_k)_{ij}\big]^2, \qquad \text{subject to } B s_k = y_k,\; B = B^T,\; B_{ij} = 0 \;\; \forall (i,j) \notin \Omega.$$

It can be shown that the solution of this problem can be obtained by solving an $n \times n$ linear system with sparsity pattern $\Omega$. $B_{k+1}$ is not guaranteed to be positive definite. The new $B_{k+1}$ can be used within a trust region method.


Unfortunately, this approach has several drawbacks: it is not scale invariant under linear transformations of the variables and its practical performance is disappointing. The fundamental weakness is that closeness in the Frobenius norm is an inadequate model and the produced approximations can be poor.

An alternative approach is to relax the secant equation, requiring that it is approximately satisfied over the last $m$ steps (as opposed to holding strictly for the last step), and to solve

$$\min_B \; \|B S_k - Y_k\|_F^2, \qquad \text{subject to } B = B^T,\; B_{ij} = 0 \;\; \forall (i,j) \notin \Omega,$$

with $S_k, Y_k$ containing the last $m$ vectors $s_i$, $y_i$, respectively.

This convex optimisation problem has a solution but it is not easy
to compute. Furthermore, it can produce singular and poorly
conditioned Hessian approximations. Even though it frequently
outperforms the previous approach, its performance is still not
impressive for large scale problems.


Partially separable functions

An unconstrained optimisation problem is separable if the objective function $f: \mathbb{R}^n \to \mathbb{R}$ can be decomposed into a sum of independent functions, e.g.

$$f(x) = f_1(x_1, x_3) + f_2(x_2, x_4, x_6) + f_3(x_5).$$

The optimal value can then be found by optimising each function independently, which is in general much less expensive.

In many large scale problems the objective function $f: \mathbb{R}^n \to \mathbb{R}$ is not separable, but it can still be written as a sum of simpler component functions. Each such component has the property that it only changes in a small number of directions, while along the remaining directions it stays constant. We call such functions partially separable.

All functions which have a sparse Hessian are partially separable,
but there are many partially separable functions with dense
Hessians. Partial separability allows for economical representation
and effective quasi-Newton updating.


Consider an objective function $f: \mathbb{R}^n \to \mathbb{R}$,

$$f(x) = \sum_{i=1}^{\ell} f_i(x),$$

where each $f_i$ depends only on a few components of $x$. For such $f_i$, the gradient and Hessian contain only a few non-zeros.

For the function $f$ we have, by linearity of differentiation,

$$\nabla f(x) = \sum_{i=1}^{\ell} \nabla f_i(x), \qquad \nabla^2 f(x) = \sum_{i=1}^{\ell} \nabla^2 f_i(x),$$

thus we can maintain a quasi-Newton approximation to each individual component Hessian $\nabla^2 f_i(x)$ instead of approximating the entire Hessian $\nabla^2 f(x)$.


Example: partially separable approximation

Consider a partially separable objective function

$$f(x) = \underbrace{(x_1 - x_3^2)^2}_{f_1(x)} + \underbrace{(x_2 - x_4^2)^2}_{f_2(x)} + \underbrace{(x_3 - x_2^2)^2}_{f_3(x)} + \underbrace{(x_4 - x_1^2)^2}_{f_4(x)}.$$

Each $f_i$ depends on only two components and all have the same form.

Denote

$$x^{[1]} = \begin{bmatrix} x_1 \\ x_3 \end{bmatrix}, \qquad U_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \qquad x^{[1]} = U_1 x, \qquad \phi_1(z_1, z_2) = (z_1 - z_2^2)^2.$$

Then $f_1(x) = \phi_1(U_1 x)$ and using the chain rule we obtain

$$\nabla f_1(x) = U_1^T \nabla\phi_1(U_1 x), \qquad \nabla^2 f_1(x) = U_1^T \nabla^2\phi_1(U_1 x)\, U_1.$$


For the Hessians $\nabla^2\phi_1$ and $\nabla^2 f_1$ we have

$$\nabla^2\phi_1(U_1 x) = \begin{bmatrix} 2 & -4x_3 \\ -4x_3 & 12x_3^2 - 4x_1 \end{bmatrix}, \qquad \nabla^2 f_1(x) = \begin{bmatrix} 2 & 0 & -4x_3 & 0 \\ 0 & 0 & 0 & 0 \\ -4x_3 & 0 & 12x_3^2 - 4x_1 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$

Idea: maintain a quasi-Newton approximation $B^{[1]}$ to the $2\times 2$ Hessian $\nabla^2\phi_1$ and lift it up to $\nabla^2 f_1$.
After a step from $x_k$ to $x_{k+1}$ we set

$$s_k^{[1]} = x_{k+1}^{[1]} - x_k^{[1]}, \qquad y_k^{[1]} = \nabla\phi_1(x_{k+1}^{[1]}) - \nabla\phi_1(x_k^{[1]}),$$

and we use BFGS or SR-1 updating to obtain the new approximation $B_{k+1}^{[1]}$ of the small, dense Hessian $\nabla^2\phi_1$, which we lift back using

$$\nabla^2 f_1(x) \approx U_1^T B_{k+1}^{[1]} U_1.$$

We do the same for all component functions and obtain

$$\nabla^2 f \approx B = \sum_{i=1}^{\ell} U_i^T B^{[i]} U_i.$$


The approximate Hessian may be used in a trust region algorithm, obtaining an approximate solution of

$$B_k p_k = -\nabla f_k.$$

$B_k$ does not need to be assembled explicitly: the conjugate gradient method can be used, and the products $B_k v$ can be evaluated directly using the matrices $U_i$ and $B^{[i]}$ (a sketch of such a product is given below).

This approach is particularly useful when the number of variables is large and each component function depends on only a few of them. Each small component Hessian can then be approximated much faster by the iterative (quasi-Newton) updating, since a small problem requires only a few curvature directions, and the resulting full Hessian approximation is usually much better than one obtained by a quasi-Newton method applied to the problem while ignoring the partially separable structure (a large Hessian requires many directions to capture the curvature).
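
A minimal sketch of such a matrix-free product $Bv = \sum_i U_i^T B^{[i]} U_i v$, where the $U_i$ are never formed and each component contributes through an index gather/scatter; the data layout `components` is hypothetical.

```python
import numpy as np

def partially_separable_hessvec(v, components):
    """Matrix-free product B @ v for B = sum_i U_i^T B_i U_i (a sketch).

    components is a list of (idx_i, B_i) pairs: idx_i holds the indices of the
    variables component i depends on (the rows selected by U_i) and B_i is the
    small dense quasi-Newton approximation of nabla^2 phi_i.
    """
    out = np.zeros_like(v)
    for idx, B_i in components:
        out[idx] += B_i @ v[idx]     # U_i^T ( B_i ( U_i v ) ) via gather/scatter
    return out

# Toy usage with the four components of the example slide (B_i = I here, so
# the result is 2*v because every variable appears in exactly two components).
components = [
    (np.array([0, 2]), np.eye(2)),   # f_1 depends on (x_1, x_3)
    (np.array([1, 3]), np.eye(2)),   # f_2 depends on (x_2, x_4)
    (np.array([2, 1]), np.eye(2)),   # f_3 depends on (x_3, x_2)
    (np.array([3, 0]), np.eye(2)),   # f_4 depends on (x_4, x_1)
]
v = np.arange(1.0, 5.0)
print(partially_separable_hessvec(v, components))   # [2. 4. 6. 8.]
```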


It is not always possible for BFGS to update a partial Hessian $B^{[i]}$, as the curvature condition $(s^{[i]})^T y^{[i]} > 0$ may not be satisfied even if the full Hessian is positive semidefinite. This can be overcome by applying the SR-1 update to the component Hessians, which has proved effective in practice.

A limitation of this quasi-Newton approach is the cost of computing the step, which is comparable to the cost of the Newton step; thus it may be beneficial to simply take the Newton step.

Another difficulty is identifying the partially separable structure of a function. The performance of quasi-Newton methods is satisfactory provided that we find the finest partially separable decomposition.
