
Numerical Optimisation:
Conjugate gradient methods

Marta M. Betcke
m.betcke@ucl.ac.uk,

Kiko Rullan
f.rullan@cs.ucl.ac.uk

Department of Computer Science,
Centre for Medical Image Computing,

Centre for Inverse Problems
University College London

Lecture 5 & 6


Conjugate gradient: CG

The linear CG method was proposed by Hestenes and Stiefel in
1952 as a direct method for the solution of linear systems of
equations with a positive definite matrix. (It was used to solve a
system of 106 difference equations on the Zuse computer at ETH,
with a sufficiently accurate answer obtained in 90 iterations, each
taking approximately 2h 20 minutes.)

In 1950, the Lanczos iteration (including orthogonality of the basis
and a 3-term recurrence) was applied to eigenvalue problems.

A renaissance came in the early 1970s: work by John Reid brought
out the connection to iterative methods. Game changer: the
performance of CG is determined by the distribution of the
eigenvalues of the matrix (preconditioning).

It is listed among the top 10 algorithms of the 20th century.

The nonlinear conjugate gradient method was proposed by Fletcher
and Reeves in 1964.


Linear CG

Solution of the linear system

Ax = b,

where A is a symmetric positive definite matrix, is equivalent to the
quadratic optimisation problem

min_x φ(x) = (1/2) x^T A x − b^T x .

Both have the same unique solution. In fact

∇φ(x) = Ax − b =: r(x),

thus the linear system is the 1st order necessary condition (which is
also sufficient for the strictly convex function φ).
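To make the equivalence concrete, here is a minimal numpy check (the matrix and right-hand side are made up for illustration): the solution of Ax = b satisfies the optimality condition ∇φ(x) = 0.

import numpy as np

# Illustrative example (A and b are arbitrary): the solution of A x = b
# is a stationary point of phi(x) = 0.5 x^T A x - b^T x.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)      # symmetric positive definite
b = rng.standard_normal(5)

x_lin = np.linalg.solve(A, b)    # solve the linear system directly

def grad_phi(x):                 # gradient of the quadratic: r(x) = A x - b
    return A @ x - b

print(np.linalg.norm(grad_phi(x_lin)))  # ~0: x_lin satisfies the optimality condition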


Conjugate directions

A set of non-zero vectors {p_1, p_2, . . . , p_ℓ} is said to be conjugate
with respect to the symmetric positive definite matrix A if

p_i^T A p_j = 0,   i ≠ j,   i, j = 1, . . . , ℓ.

Conjugate directions are linearly independent.

Conjugacy enables us to minimise φ in n steps by successively
minimising it along the individual directions in the set.
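As a quick numerical illustration (with a made-up matrix), the eigenvectors of a symmetric positive definite A form a conjugate set with respect to A:

import numpy as np

# Eigenvectors of a symmetric positive definite matrix are A-conjugate.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)

_, V = np.linalg.eigh(A)         # columns of V are orthonormal eigenvectors
P = V.T @ A @ V                  # entries p_i^T A p_j for all pairs (i, j)
print(np.allclose(P - np.diag(np.diag(P)), 0))  # True: off-diagonal entries vanish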


Conjugate direction method

Given a starting point x_0 and the set of conjugate directions
{p_0, p_1, . . . , p_{n−1}}, let us generate the sequence {x_k} via

x_{k+1} = x_k + α_k p_k ,

where α_k is the one-dimensional minimiser of the quadratic
function along p_k, i.e. of φ(x_k + α p_k):

α_k = − (r_k^T p_k) / (p_k^T A p_k) .

For any x_0 ∈ R^n the sequence converges to the solution x^* in at
most n steps.
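A minimal sketch of the conjugate direction method, assuming a set of A-conjugate directions is already available (here, for illustration only, the eigenvectors of A):

import numpy as np

def conjugate_direction_method(A, b, x0, P):
    """Minimise phi(x) = 0.5 x^T A x - b^T x along the conjugate directions
    given as the columns of P. Converges in at most n steps."""
    x = x0.copy()
    for k in range(P.shape[1]):
        p = P[:, k]
        r = A @ x - b                      # gradient of phi at the current iterate
        alpha = -(r @ p) / (p @ (A @ p))   # exact 1D minimiser along p
        x = x + alpha * p
    return x

# Illustration with made-up data; the eigenvectors of A are A-conjugate.
rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)
b = rng.standard_normal(6)
_, P = np.linalg.eigh(A)
x = conjugate_direction_method(A, b, np.zeros(6), P)
print(np.allclose(A @ x, b))               # True: solution reached in n steps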


Proof: Because span{p_0, p_1, . . . , p_{n−1}} = R^n,

x^* − x_0 = σ_0 p_0 + σ_1 p_1 + · · · + σ_{n−1} p_{n−1}.

Multiplying from the left by p_k^T A and using the conjugacy property
we obtain σ_k as

σ_k = ( p_k^T A (x^* − x_0) ) / ( p_k^T A p_k ),   k = 0, . . . , n − 1.

On the other hand, in the kth iteration the method generates the
approximation

x_k = x_0 + α_0 p_0 + α_1 p_1 + · · · + α_{k−1} p_{k−1}.

Multiplying from the left by p_k^T A and using the conjugacy property
we have p_k^T A (x_k − x_0) = 0 and

p_k^T A (x^* − x_0) = p_k^T A (x^* − x_k) = p_k^T (b − A x_k) = − p_k^T r_k .

Substituting into the expression for σ_k gives
σ_k = − (p_k^T r_k) / (p_k^T A p_k) = α_k for k = 0, . . . , n − 1.

Theorem: [Expanding subspace minimisation]

For any starting point x_0, for the sequence {x_k} generated by the
conjugate direction method it holds that

r_k^T p_i = 0,   i = 0, 1, . . . , k − 1,

and x_k is the minimiser of φ(x) = (1/2) x^T A x − b^T x over the set

{x : x = x_0 + span{p_0, p_1, . . . , p_{k−1}}} .


Proof: Let us define

h(σ) = φ(x_0 + σ_0 p_0 + · · · + σ_{k−1} p_{k−1}),

where σ = (σ_0, σ_1, . . . , σ_{k−1})^T. Since h(σ) is a strictly convex
quadratic, it has a unique minimiser σ^* that satisfies

∂h(σ^*)/∂σ_i = 0,   i = 0, . . . , k − 1.

Using the chain rule we obtain

∇φ(x_0 + σ_0^* p_0 + · · · + σ_{k−1}^* p_{k−1})^T p_i = 0,   i = 0, 1, . . . , k − 1.

Recall that r(x) = ∇φ(x), thus for the minimiser
x̃ = x_0 + σ_0^* p_0 + · · · + σ_{k−1}^* p_{k−1} over x_0 + span{p_0, p_1, . . . , p_{k−1}}
it follows that r(x̃)^T p_i = 0, as claimed.

By induction:
For k = 1, from x_1 = x_0 + α_0 p_0 being a minimiser of φ along p_0 it
follows that r_1^T p_0 = 0.


Suppose that r_{k−1}^T p_i = 0 for i = 0, 1, . . . , k − 2. Then

r_k = A x_k − b = A(x_{k−1} + α_{k−1} p_{k−1}) − b = r_{k−1} + α_{k−1} A p_{k−1},

and

p_{k−1}^T r_k = p_{k−1}^T r_{k−1} + α_{k−1} p_{k−1}^T A p_{k−1} = 0

by the definition of α_{k−1} = − (p_{k−1}^T r_{k−1}) / (p_{k−1}^T A p_{k−1}).

For any other p_i, i = 0, 1, . . . , k − 2, we have

p_i^T r_k = p_i^T r_{k−1} + α_{k−1} p_i^T A p_{k−1} = 0,

where the first term vanishes because of the induction hypothesis
and the second because of the conjugacy of p_i. Thus we have
shown r_k^T p_i = 0 for i = 0, 1, . . . , k − 1 and the proof is complete.


Conjugate gradient vs conjugate direction

So far the discussion was valid for any set of conjugate
directions.

An example is the set of eigenvectors of a symmetric positive
definite matrix A, which are orthogonal and conjugate w.r.t. A.
Computation of the full set of eigenvectors is expensive. Similarly,
the Gram–Schmidt orthogonalisation process could be adapted to
produce conjugate directions, however it is again expensive as
it requires storing all the directions to orthogonalise against.

The conjugate gradient (CG) method has a very special property:
it can compute a new vector p_k using only the previous vector
p_{k−1}, i.e. it does not need to know the vectors p_0, p_1, . . . , p_{k−2},
while p_k is automatically conjugate to these vectors. This
makes CG particularly cheap in terms of computation and
memory.


Conjugate gradient

In CG each new direction is chosen as

p_k = −r_k + β_k p_{k−1},

where

β_k = ( r_k^T A p_{k−1} ) / ( p_{k−1}^T A p_{k−1} )

follows from requiring that p_{k−1} and p_k be conjugate,
i.e. p_{k−1}^T A p_k = 0.

We initialise p0 with the steepest descent direction at x0.

As in the conjugate direction method, we perform successive one
dimensional minimisation along each of the search directions.
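For completeness, a short derivation of β_k: substituting the update into the conjugacy requirement and using the symmetry of A gives

0 = p_{k−1}^T A p_k = p_{k−1}^T A (−r_k + β_k p_{k−1}) = −r_k^T A p_{k−1} + β_k p_{k−1}^T A p_{k−1}
  ⇒  β_k = (r_k^T A p_{k−1}) / (p_{k−1}^T A p_{k−1}).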


CG: preliminary version

Given x_0
Set r_0 = A x_0 − b, p_0 = −r_0, k = 0
while r_k ≠ 0 do
  α_k = − (r_k^T p_k) / (p_k^T A p_k)
  x_{k+1} = x_k + α_k p_k
  r_{k+1} = A x_{k+1} − b
  β_{k+1} = ( r_{k+1}^T A p_k ) / ( p_k^T A p_k )
  p_{k+1} = −r_{k+1} + β_{k+1} p_k
  k = k + 1
end while
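The practical form of CG on the next slide uses equivalent, cheaper expressions for α_k and β_{k+1}. The equivalence is a short derivation, filling in the step between the two versions. Since p_k = −r_k + β_k p_{k−1} and r_k^T p_{k−1} = 0 (expanding subspace theorem), we have r_k^T p_k = −r_k^T r_k, so

α_k = − (r_k^T p_k) / (p_k^T A p_k) = (r_k^T r_k) / (p_k^T A p_k).

Moreover r_{k+1} = r_k + α_k A p_k gives A p_k = (r_{k+1} − r_k)/α_k, and together with the residual orthogonality r_{k+1}^T r_k = 0 (property (1) on a later slide) and α_k p_k^T A p_k = r_k^T r_k,

β_{k+1} = (r_{k+1}^T A p_k) / (p_k^T A p_k) = (r_{k+1}^T r_{k+1}) / (α_k p_k^T A p_k) = (r_{k+1}^T r_{k+1}) / (r_k^T r_k).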


CG

Given x_0
Set r_0 = A x_0 − b, p_0 = −r_0, k = 0
while r_k ≠ 0 do
  α_k = (r_k^T r_k) / (p_k^T A p_k)
  x_{k+1} = x_k + α_k p_k
  r_{k+1} = r_k + α_k A p_k
  β_{k+1} = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
  p_{k+1} = −r_{k+1} + β_{k+1} p_k
  k = k + 1
end while
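A minimal numpy sketch of the algorithm above; a tolerance-based stopping test replaces the exact condition r_k = 0, which is never met in floating point (the test matrix below is made up):

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, maxiter=None):
    """Linear CG for A x = b with A symmetric positive definite."""
    x = x0.copy()
    r = A @ x - b                      # r_0 = A x_0 - b
    p = -r                             # p_0 = -r_0
    rs = r @ r
    if maxiter is None:
        maxiter = len(b)               # in exact arithmetic: at most n steps
    for _ in range(maxiter):
        if np.sqrt(rs) <= tol:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)          # alpha_k = r^T r / p^T A p
        x = x + alpha * p
        r = r + alpha * Ap             # r_{k+1} = r_k + alpha_k A p_k
        rs_new = r @ r
        beta = rs_new / rs             # beta_{k+1} = r_{k+1}^T r_{k+1} / r_k^T r_k
        p = -r + beta * p
        rs = rs_new
    return x

# Illustration with made-up data.
rng = np.random.default_rng(3)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = conjugate_gradient(A, b, np.zeros(50))
print(np.linalg.norm(A @ x - b))       # small residual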


Theorem:

For the kth iterate of the conjugate gradient method, x_k ≠ x^*, the
following hold:

r_k^T r_i = 0,   i = 0, 1, . . . , k − 1                                  (1)
span{r_0, r_1, . . . , r_k} = span{r_0, A r_0, . . . , A^k r_0} =: K_k(A, r_0)   (2)
span{p_0, p_1, . . . , p_k} = span{r_0, A r_0, . . . , A^k r_0}           (3)
p_k^T A p_i = 0,   i = 0, 1, . . . , k − 1.                                (4)

Therefore, the sequence {x_k} converges to x^* in at most n steps.

The proof of this theorem relies on p0 = −r0. The result does not
hold for other choices of p0.

Note that the gradients rk are actually orthogonal, while the
directions pk are conjugate, thus the name of conjugate gradients
is actually a misnomer.


Rate of convergence

From the properties of the (k + 1)st iterate we have

x_{k+1} = x_0 + α_0 p_0 + · · · + α_k p_k                        (5)
        = x_0 + γ_0 r_0 + γ_1 A r_0 + · · · + γ_k A^k r_0        (6)

for some γ_i, i = 0, . . . , k.
Let P_k denote the kth degree polynomial

P_k(λ) = γ_0 + γ_1 λ + · · · + γ_k λ^k,

then
x_{k+1} = x_0 + P_k(A) r_0.

Recall that CG minimises the quadratic function φ over
x_0 + span{p_0, . . . , p_k}, which is the same as x_0 + K_k(A, r_0), i.e.

arg min φ(x) = arg min [ φ(x) − φ(x^*) ]
             = arg min (1/2) (x − x^*)^T A (x − x^*) = arg min (1/2) ‖x − x^*‖_A^2 ,

where ‖z‖_A^2 := z^T A z denotes the A-norm (energy norm).


CG vs steepest descent with optimal step length

Figure: Wiki: Conjugate gradient method


Thus CG computes the minimising polynomial over all polynomials
of degree k:

min_{P_k} ‖x_0 + P_k(A) r_0 − x^*‖_A .

Observe that similar expressions hold for the error

x_k − x^* = x_0 + P_{k−1}(A) r_0 − x^* = x_0 − x^* + P_{k−1}(A) A (x_0 − x^*)
          = [ I + A P_{k−1}(A) ] (x_0 − x^*),

where we used r_0 = A(x_0 − x^*), and for the residual

r_k = A x_k − b = A (x_k − x^*) = A (x_0 + P_{k−1}(A) r_0 − x^*)
    = A (x_0 − x^*) + A P_{k−1}(A) r_0 = [ I + A P_{k−1}(A) ] r_0 .


Let the eigenvalue decomposition of the symmetric positive
definite matrix A be

A = V^T Λ V = Σ_{i=1}^n λ_i v_i v_i^T ,

with 0 < λ_1 ≤ λ_2 ≤ · · · ≤ λ_n and v_i, i = 1, . . . , n, the corresponding
orthogonal eigenvectors. Since the v_i form a basis of R^n, any vector
can be expanded in it, in particular

x_0 − x^* = Σ_{i=1}^n ξ_i v_i .

Notice that any eigenvector v_i of A is also an eigenvector of P_k(A)
with the corresponding eigenvalue P_k(λ_i):

P_k(A) v_i = P_k(λ_i) v_i ,   i = 1, . . . , n.

Hence

x_{k+1} − x^* = Σ_{i=1}^n [1 + λ_i P_k(λ_i)] ξ_i v_i

and

‖x_{k+1} − x^*‖_A^2 = Σ_{i=1}^n λ_i [1 + λ_i P_k(λ_i)]^2 ξ_i^2 .

Since P_k is optimal w.r.t. this norm we have

‖x_{k+1} − x^*‖_A^2 = min_{P_k} Σ_{i=1}^n λ_i [1 + λ_i P_k(λ_i)]^2 ξ_i^2
                    ≤ min_{P_k} max_{1≤i≤n} [1 + λ_i P_k(λ_i)]^2 ( Σ_{i=1}^n λ_i ξ_i^2 )
                    = min_{P_k} max_{1≤i≤n} [1 + λ_i P_k(λ_i)]^2 ‖x_0 − x^*‖_A^2 .

Theorem:
If A has only r distinct eigenvalues, then CG will converge to the
solution in at most r iterations.

Proof: Suppose the eigenvalues take on the r distinct values
τ_1 < · · · < τ_r and define the polynomial

Q_r(λ) = ( (−1)^r / (τ_1 τ_2 · · · τ_r) ) (λ − τ_1) · · · (λ − τ_r),

and note that Q_r(λ_i) = 0, i = 1, . . . , n, and Q_r(0) = 1. Then
P̄_{r−1}(λ) = (Q_r(λ) − 1)/λ is of degree r − 1 and we have

0 ≤ min_{P_{r−1}} max_{1≤i≤n} [1 + λ_i P_{r−1}(λ_i)]^2
  ≤ max_{1≤i≤n} [1 + λ_i P̄_{r−1}(λ_i)]^2 = max_{1≤i≤n} Q_r^2(λ_i) = 0,

thus ‖x_r − x^*‖_A^2 = 0 and hence x_r = x^*.

Convergence rate

Theorem:
If A has eigenvalues λ_1 ≤ λ_2 ≤ · · · ≤ λ_n, we have that

‖x_{k+1} − x^*‖_A^2 ≤ ( (λ_{n−k} − λ_1) / (λ_{n−k} + λ_1) )^2 ‖x_0 − x^*‖_A^2 .

Proof idea: Choose the polynomial P̄_k such that Q_{k+1}(λ) = 1 + λ P̄_k(λ)
has roots at the k largest eigenvalues λ_n, λ_{n−1}, . . . , λ_{n−k+1} and at the
midpoint between λ_{n−k} and λ_1. It can be shown that the maximum value
attained by Q_{k+1} on the remaining eigenvalues λ_1, . . . , λ_{n−k} is
(λ_{n−k} − λ_1)/(λ_{n−k} + λ_1).

Theorem:
In terms of the condition number κ(A) = ‖A‖_2 ‖A^{−1}‖_2 = λ_n/λ_1, we
have that

‖x_k − x^*‖_A ≤ 2 ( (√κ(A) − 1) / (√κ(A) + 1) )^k ‖x_0 − x^*‖_A .

Preconditioning

We can accelerate CG through transformations which cluster the
eigenvalues. This process is known as preconditioning.

We perform a change of variables x̂ = C x. Then the quadratic function
φ in terms of x̂ reads

φ̂(x̂) = (1/2) x̂^T (C^{−T} A C^{−1}) x̂ − (C^{−T} b)^T x̂ .

Minimising φ̂ is equivalent to solving the transformed linear system

(C^{−T} A C^{−1}) x̂ = C^{−T} b,

and the convergence rate of CG depends on the eigenvalues of
C^{−T} A C^{−1}.

It is not necessary to carry out the transformations explicitly. We can
apply CG to φ̂ in terms of x̂ and then invert the transformations to
re-express all the equations in terms of the original variable x. In fact,
the preconditioned CG algorithm does not use the factorisation
M = C^T C explicitly, only M. If we set M = I we recover the
unpreconditioned CG algorithm.

The properties of CG generalise; in particular for PCG it holds that

r_i^T M^{−1} r_j = 0,   ∀ i ≠ j .

Preconditioned CG (PCG)

Given x_0, preconditioner M
Set r_0 = A x_0 − b
Solve M y_0 = r_0
p_0 = −y_0, k = 0
while r_k ≠ 0 do
  α_k = (r_k^T y_k) / (p_k^T A p_k)
  x_{k+1} = x_k + α_k p_k
  r_{k+1} = r_k + α_k A p_k
  Solve M y_{k+1} = r_{k+1}
  β_{k+1} = (r_{k+1}^T y_{k+1}) / (r_k^T y_k)
  p_{k+1} = −y_{k+1} + β_{k+1} p_k
  k = k + 1
end while

Nonlinear conjugate gradient

Recall that CG can be interpreted as a minimiser of the quadratic
convex function

φ(x) = (1/2) x^T A x − x^T b.

Can the algorithm for φ be generalised to a nonlinear function f ?

Recall that the step length α_k minimises φ along p_k. For general f:
compute α_k using a line search,

α_k = arg min_α f(x_k + α p_k).

For the quadratic, r = A x − b = ∇φ(x); for a general function f, r → ∇f.
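Returning briefly to the linear setting, here is a minimal numpy sketch of the PCG loop above, assuming the preconditioner is supplied as a callable M_solve that applies M^{−1} (a Jacobi/diagonal preconditioner in the made-up illustration below):

import numpy as np

def preconditioned_cg(A, b, x0, M_solve, tol=1e-10, maxiter=None):
    """Preconditioned CG; M_solve(r) should return the solution y of M y = r."""
    x = x0.copy()
    r = A @ x - b
    y = M_solve(r)                     # solve M y_0 = r_0
    p = -y
    ry = r @ y
    if maxiter is None:
        maxiter = len(b)
    for _ in range(maxiter):
        if np.linalg.norm(r) <= tol:
            break
        Ap = A @ p
        alpha = ry / (p @ Ap)
        x = x + alpha * p
        r = r + alpha * Ap
        y = M_solve(r)                 # solve M y_{k+1} = r_{k+1}
        ry_new = r @ y
        beta = ry_new / ry
        p = -y + beta * p
        ry = ry_new
    return x

# Illustration with a Jacobi (diagonal) preconditioner on made-up data.
rng = np.random.default_rng(4)
B = rng.standard_normal((80, 80))
A = B @ B.T + np.diag(np.linspace(1, 100, 80))   # SPD with a spread-out diagonal
b = rng.standard_normal(80)
d = np.diag(A)
x = preconditioned_cg(A, b, np.zeros(80), lambda r: r / d)
print(np.linalg.norm(A @ x - b))       # small residual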
Fletcher–Reeves (FR)

Given x_0
Evaluate f_0 = f(x_0), ∇f_0 = ∇f(x_0)
Set p_0 = −∇f_0, k = 0
while ∇f_k ≠ 0 do
  Compute α_k using a line search, α_k = arg min_α f(x_k + α p_k)
  x_{k+1} = x_k + α_k p_k
  β_{k+1} = (∇f_{k+1}^T ∇f_{k+1}) / (∇f_k^T ∇f_k)
  p_{k+1} = −∇f_{k+1} + β_{k+1} p_k
  k = k + 1
end while

Descent direction

Is p_k a descent direction, i.e. is

∇f_k^T p_k = −∇f_k^T ∇f_k + β_k ∇f_k^T p_{k−1} < 0 ?

If α_{k−1} is a local minimiser along p_{k−1}, then ∇f_k^T p_{k−1} = 0 and

∇f_k^T p_k = −∇f_k^T ∇f_k < 0,

thus p_k is a descent direction.

If the line search is not exact, due to the second term β_k ∇f_k^T p_{k−1},
p_k may fail to be a descent direction. This can be avoided by requiring
that the step length α_k satisfies the strong Wolfe conditions

f(x_k + α_k p_k) ≤ f(x_k) + c_1 α_k ∇f_k^T p_k ,
|∇f(x_k + α_k p_k)^T p_k| ≤ −c_2 ∇f_k^T p_k ,

with 0 < c_1 < c_2 < 1/2.

Lemma [1]
Let f be twice continuously differentiable, and the level set
{x : f(x) ≤ f(x_0)} be bounded. If the step length α_k in the FR algorithm
satisfies the strong Wolfe conditions with 0 < c_2 < 1/2, then the method
generates descent directions p_k that satisfy

−1/(1 − c_2) ≤ ∇f_k^T p_k / ‖∇f_k‖^2 ≤ (2c_2 − 1)/(1 − c_2),   k = 1, 2, . . . .   (7)

Proof: First note that the upper bound (2c_2 − 1)/(1 − c_2) monotonically
increases for c_2 ∈ (0, 1/2) and −1 < (2c_2 − 1)/(1 − c_2) < 0. Thus
Lemma [1] implies that p_k is a descent direction, ∇f_k^T p_k < 0. The
inequalities can be shown by induction using the form of the update and
the second strong Wolfe condition.

Induction: k = 0: p_0 = −∇f_0, hence ∇f_0^T p_0 / ‖∇f_0‖^2 = −1 and (7) holds.

Assume (7) holds for some k ≥ 1. From β^FR_{k+1} we have

∇f_{k+1}^T p_{k+1} / ‖∇f_{k+1}‖^2 = −1 + β^FR_{k+1} ∇f_{k+1}^T p_k / ‖∇f_{k+1}‖^2
                                  = −1 + ∇f_{k+1}^T p_k / ‖∇f_k‖^2 .

Plugging the curvature Wolfe condition |∇f_{k+1}^T p_k| ≤ −c_2 ∇f_k^T p_k into
the last equation (note ∇f_k^T p_k < 0 by the induction hypothesis) we obtain

−1 + c_2 ∇f_k^T p_k / ‖∇f_k‖^2 ≤ ∇f_{k+1}^T p_{k+1} / ‖∇f_{k+1}‖^2 ≤ −1 − c_2 ∇f_k^T p_k / ‖∇f_k‖^2 .

Substituting the lower bound for ∇f_k^T p_k / ‖∇f_k‖^2 from the induction
hypothesis we obtain (7) for k + 1:

−1 − c_2/(1 − c_2) ≤ ∇f_{k+1}^T p_{k+1} / ‖∇f_{k+1}‖^2 ≤ −1 + c_2/(1 − c_2),

and these bounds coincide with −1/(1 − c_2) and (2c_2 − 1)/(1 − c_2) in (7).
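A compact numpy sketch of the Fletcher–Reeves algorithm above; for simplicity it uses a backtracking (Armijo) line search instead of a strong-Wolfe one, so the descent guarantee of Lemma [1] is not enforced and a steepest-descent reset guards against non-descent directions (the quadratic test function is only an illustration):

import numpy as np

def fletcher_reeves(f, grad, x0, tol=1e-6, maxiter=1000):
    """Nonlinear CG with the Fletcher-Reeves beta and a simple Armijo backtracking
    line search (a strong-Wolfe line search should be used to guarantee descent)."""
    x = x0.copy()
    g = grad(x)
    p = -g
    for _ in range(maxiter):
        if np.linalg.norm(g) <= tol:
            break
        slope = g @ p
        if slope >= 0:                     # not a descent direction: reset to steepest descent
            p, slope = -g, -(g @ g)
        # Backtracking (Armijo) line search along p.
        alpha, c1, rho = 1.0, 1e-4, 0.5
        fx = f(x)
        while f(x + alpha * p) > fx + c1 * alpha * slope:
            alpha *= rho
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves beta_{k+1}
        p = -g_new + beta * p
        g = g_new
    return x

# Illustration: minimise a strongly convex quadratic (made-up data).
rng = np.random.default_rng(5)
B = rng.standard_normal((10, 10))
A = B @ B.T + 10 * np.eye(10)
b = rng.standard_normal(10)
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x = fletcher_reeves(f, grad, np.zeros(10))
print(np.linalg.norm(grad(x)))             # ~0 at the minimiser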
Betcke Numerical Optimisation Other choices of βk Hestenes - Stiefel (similar to PR in both theory and practical performance): Consecutive directions are conjugate wrt average Hessian Ḡk = ∫ 1 0 ∇2f (xk + ταkpk)dτ . From Taylor’s theorem we have ∇fk+1 = ∇fk + αk Ḡkpk . Solving pTk+1Ḡkpk = 0 where pk+1 = −∇fk+1 + βk+1pk for βk+1 yields βHSk+1 = ∇f Tk+1(∇fk+1 −∇fk) (∇fk+1 −∇fk)Tpk (9) Two competitive with PR choices which guarantee pk to be descent direction under (standard) Wolfe conditions on αk : βk+1 = ‖∇fk+1‖2 (∇fk+1 −∇fk)Tpk (10) βk+1 = ( yk − 2pk ‖yk‖2 yTk pk )T ∇fk+1 yTk pk with yk = ∇fk+1−∇fk . (11) M.M. Betcke Numerical Optimisation Restarts Set βk = 0 in every nth step i.e. take steepest descent step. Restarting serves to refresh the algorithm erasing old information that may be not beneficial. Such restarting leads to n step quadratic convergence ‖xk+n − x?‖ = O‖xk − x?‖2. Consider function which is strongly convex quadratic close to the solution x? but non-quadratic elsewhere. Once close to the solution the restart will allow the method to behave like linear conjugate gradients, in particular with finite termination within n steps from the restart (recall that the finite termination property for linear CG only holds if initiated with p0 = −∇f0). In practice, conjugate gradient methods are usually used when n is large, hence n steps are never taken. Observe that the gradients are mutually orthogonal when f is a quadratic function. Restart when two consecutive gradients are far from orthogonal |∇f T k ∇fk+1| ‖∇fk‖2 ≥ ν, with ν typically 0.1. M.M. Betcke Numerical Optimisation When for some search direction pk , cos θk ≈ 0 and the subsequent step is small, substituting ∇fk+1 ≈ ∇fk into βPRk+1 results in βPRk+1 ≈ 0 and the next direction pk+1 ≈ −∇fk+1 the steepest descent direction. Therefore the PR algorithm essentially performs a restart after it encounters a bad direction. The same argument applies to HS, and PR+. FR algorithm requires some restart. Hybrid FR-PR: Global convergence can be guaranteed if |βk | ≤ βFRk for all k ≥ 2. This suggest following strategy βk =   −βFRk , β PR k < −β FR k βPRk , |β PR k | ≤ β FR k βFRk , β PR k > β

FR
k

(12)
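A tiny helper implementing the hybrid rule (12), given the current and previous gradients (function and variable names are illustrative); note that (12) simply clips β^PR to the interval [−β^FR, β^FR]:

import numpy as np

def beta_fr_pr(g_new, g_old):
    """Hybrid FR-PR choice of beta, rule (12): clip beta_PR to [-beta_FR, beta_FR]."""
    beta_fr = (g_new @ g_new) / (g_old @ g_old)
    beta_pr = (g_new @ (g_new - g_old)) / (g_old @ g_old)
    return float(np.clip(beta_pr, -beta_fr, beta_fr))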


Global convergence – assumptions

Assumptions:

i) The level set L = {x : f(x) ≤ f(x_0)} is bounded.
ii) In some open neighbourhood N of L, the objective function f

is Lipschitz continuously differentiable.

These assumptions imply that there is a constant γ such that

‖∇f (x)‖ ≤ γ, ∀x ∈ L.


Global convergence – restarted CG method

From Zoutendijk's lemma it follows that any line search iteration
x_{k+1} = x_k + α_k p_k, where p_k is a descent direction and the step
length α_k satisfies the Wolfe conditions, gives the limit

Σ_{k=0}^∞ cos² θ_k ‖∇f_k‖² < ∞.

Similarly to the global convergence for line search methods, global
convergence for restarted conjugate gradient algorithms, which
periodically set β_k = 0 (hence cos θ_{k+1} = 1), can be proven in a
subsequence sense:

lim inf_{k→∞} ‖∇f_k‖ = 0.

Global convergence – unrestarted FR method

Theorem: [Al-Baali]
Suppose that the assumptions i) and ii) hold and the FR algorithm is
implemented with a line search that satisfies the strong Wolfe conditions
with 0 < c_1 < c_2 < 1/2. Then

lim inf_{k→∞} ‖∇f_k‖ = 0.

Proof: By contradiction (assume ‖∇f_k‖ ≥ γ > 0 for all k). Substitute into
Zoutendijk's result and use the definition of p_k and the upper bound in
Lemma [1] recursively to show that the series assumed to converge is
bounded below by a (divergent) harmonic series, hence a contradiction.

This global convergence result can be extended to any method
satisfying |β_k| ≤ β^FR_k for all k ≥ 2.


If constants c_4, c_5 > 0 exist such that

cos θ_k ≥ c_4 ‖∇f_k‖/‖p_k‖,   ‖∇f_k‖/‖p_k‖ ≥ c_5 > 0,   k = 1, 2, . . . ,

it follows from Zoutendijk's result that

lim_{k→∞} ‖∇f_k‖ = 0.

This result can be established for PR for f strongly convex and an
exact line search.

For general nonconvex functions it is not possible, even though PR
performs better in practice than FR. The PR method can cycle
infinitely even if an ideal line search is used, i.e. a line search which
returns the α_k that is the first positive stationary point of f(x_k + α p_k).
The example relies on β_k < 0, which motivated the modification
β_k^+ = max{0, β_k}.