Numerical Optimisation: Least squares

Marta M. Betcke
m.betcke@ucl.ac.uk,

Kiko Rullan
f.rullan@cs.ucl.ac.uk

Department of Computer Science,
Centre for Medical Image Computing,

Centre for Inverse Problems
University College London

Lecture 10 & 11


Least squares problem

Least squares is a problem where the objective function has the following special form

    f(x) = 1/2 ∑_{j=1}^m r_j^2(x),

where each r_j is a smooth function from R^n → R. We refer to each of the r_j as a residual, and we assume that m ≥ n.

Least squares problems are ubiquitous in applications, where the
discrepancy between the model and the observed behaviour is
minimised.


Let’s assemble the individual components r_j into the residual vector r : R^n → R^m,

    r(x) = (r_1(x), r_2(x), . . . , r_m(x))^T.

Using this vector, f becomes f(x) = 1/2 ‖r(x)‖_2^2. The derivatives of f can be calculated with help of the Jacobian

    J(x) = [ ∂r_j/∂x_i ]_{j=1,...,m; i=1,...,n} = [ ∇r_1(x)^T; ∇r_2(x)^T; . . . ; ∇r_m(x)^T ],

whose rows are the transposed gradients of the residuals. Then

    ∇f(x) = ∑_{j=1}^m r_j(x) ∇r_j(x) = J(x)^T r(x),

    ∇²f(x) = ∑_{j=1}^m ∇r_j(x) ∇r_j(x)^T + ∑_{j=1}^m r_j(x) ∇²r_j(x)
           = J(x)^T J(x) + ∑_{j=1}^m r_j(x) ∇²r_j(x).


Example

Model of concentration of drug in bloodstream:

    φ(x; t) = x_1 + t x_2 + t^2 x_3 + x_4 exp(−x_5 t).

Find a set of parameters x such that the model best matches the data by solving

    min_x 1/2 ∑_{j=1}^m (φ(x; t_j) − y_j)^2,    with r_j(x) := φ(x; t_j) − y_j.
Figure: Nocedal Wright Fig 10.1 (left), Fig 10.2 (right)
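
As an illustration of such a fit, the sketch below generates synthetic data from assumed "true" parameters and calls SciPy's generic nonlinear least-squares solver on the residual vector r(x); the data, parameter values and noise level are made up for this example and are not part of the lecture material.

# Hypothetical illustration: fit phi(x; t) = x1 + x2 t + x3 t^2 + x4 exp(-x5 t)
# to synthetic noisy data. Data and "true" parameters are made up for the sketch.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 40)                      # measurement times t_j
x_true = np.array([1.0, 0.5, -0.05, 2.0, 0.8])      # assumed ground truth

def phi(x, t):
    return x[0] + x[1]*t + x[2]*t**2 + x[3]*np.exp(-x[4]*t)

y = phi(x_true, t) + 0.05*rng.standard_normal(t.size)   # noisy observations y_j

def residuals(x):                                    # r_j(x) = phi(x; t_j) - y_j
    return phi(x, t) - y

x0 = np.ones(5)                                      # crude initial guess
sol = least_squares(residuals, x0)                   # minimises 0.5*||r(x)||^2
print(sol.x)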


Bayesian perspective

Bayes’ theorem:

    π(x|y) = π(y|x) π(x) / π(y) ∝ π(y|x) π(x).

Denote the discrepancy between the model and the measurement at data point t_j as

    ε_j = φ(x; t_j) − y_j.

Assume that the ε_j are independent and identically distributed with variance σ² and probability density function g_σ(·). The likelihood of a particular set of observations y_j, j = 1, 2, . . . , m, given the parameter vector x is

    π(y|x) = ∏_{j=1}^m g_σ(ε_j) = ∏_{j=1}^m g_σ(φ(x; t_j) − y_j).


The maximum a posteriori probability (MAP) estimate vs the maximum likelihood estimate:

    x_MAP = arg max_x π(x|y) = arg max_x π(y|x) π(x).

If g_σ is a normal distribution,

    g_σ(ε) = 1/(√(2π) σ) exp(−ε²/(2σ²)),

and x is uniformly distributed, i.e. π(x) = const, the MAP estimate reads

    x_MAP = arg max_x 1/(√(2π) σ)^m exp( − 1/(2σ²) ∑_{j=1}^m (φ(x; t_j) − y_j)² )
          = arg min_x ∑_{j=1}^m (φ(x; t_j) − y_j)².


Linear least squares

If φ(x; t) is a linear function of x, the least squares problem becomes linear:

The residuals r_j(x) = φ(x; t_j) − y_j are linear, and in vectorized notation

    r(x) = Jx − y,

where

    the vector of measurements y = (y_1, y_2, . . . , y_m)^T = −r(0),

    the matrix J with rows J_{j,:} defined by φ(x; t_j) = J_{j,:} x

are both independent of x.


The linear least squares objective has the form

    f(x) = 1/2 ‖Jx − y‖².

The gradient and Hessian are

    ∇f(x) = J^T(Jx − y),    ∇²f(x) = J^T J.

Note: f(x) is convex, i.e. any stationary point ∇f(x^*) = 0 is a global minimiser.

Normal equations:

    ∇f(x^*) = J^T(Jx^* − y) = 0  ⇔  J^T J x^* = J^T y.


Roadmap: solution of linear least squares

Solve the normal equations J^T J x^* = J^T y:

+ If m ≫ n, computing J^T J explicitly results in a smaller matrix that is easier to store than J. The system can then be solved by e.g. Cholesky decomposition.
– Forming J^T J squares the condition number.
+ For regularised ill-posed problems the squaring of the condition number may not be an issue.
– If J is rank deficient (or ill conditioned), the Cholesky decomposition will require pivoting.

Solve the least squares problem x^* = arg min_x ‖Jx − y‖²:

+ If J is of moderate size, direct methods such as QR or SVD decomposition can be used.
+ If J is large and sparse, or only given in operator form x → Jx, use iterative methods like CGLS or LSQR.
+ Does not square the condition number.
+ In particular SVD and iterative methods, e.g. LSQR, can easily deal with ill-conditioning.
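
A rough numerical sketch of the two routes (the random J and y below are stand-ins, not part of the lecture material): the same problem is solved via the normal equations with a Cholesky factorisation and via a QR-based dense solver. For well-conditioned J both agree; for ill-conditioned J the second route is the safer choice.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
m, n = 200, 5
J = rng.standard_normal((m, n))          # stand-in for the Jacobian/model matrix
y = rng.standard_normal(m)               # stand-in for the measurements

# Route 1: normal equations J^T J x = J^T y via Cholesky (squares the condition number)
c, low = cho_factor(J.T @ J)
x_ne = cho_solve((c, low), J.T @ y)

# Route 2: QR-based least squares solver applied to J directly
x_qr, *_ = np.linalg.lstsq(J, y, rcond=None)

print(np.allclose(x_ne, x_qr))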


QR factorization

Let

    J P = Q [R; 0] = [Q_1, Q_2] [R; 0] = Q_1 R,

where

    P ∈ R^{n×n} is a column permutation matrix (hence orthogonal),
    Q ∈ R^{m×m} is an orthogonal matrix, with Q_1 ∈ R^{m×n}, Q_2 ∈ R^{m×(m−n)},
    R ∈ R^{n×n} is upper triangular with positive diagonal.

Recall: multiplication with an orthogonal matrix preserves ‖·‖_2. Thus

    ‖Jx − y‖_2^2 = ‖Q^T(J P P^T x − y)‖_2^2 = ‖(Q^T J P) P^T x − Q^T y‖_2^2
                 = ‖R P^T x − Q_1^T y‖_2^2 + ‖Q_2^T y‖_2^2,

using P P^T = I and Q^T J P = [R; 0].

Solution: x^* = P R^{−1} Q_1^T y. In practice we perform backsubstitution on R z = Q_1^T y and permute to obtain x^* = P z.
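
A minimal sketch of this procedure with SciPy's pivoted economic QR; the random test matrix is an assumption for illustration only.

import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(2)
m, n = 200, 5
J = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Economic pivoted QR: J[:, piv] = Q1 @ R, with Q1 (m x n) and R upper triangular
Q1, R, piv = qr(J, mode='economic', pivoting=True)

z = solve_triangular(R, Q1.T @ y)        # backsubstitution  R z = Q1^T y
x = np.empty(n)
x[piv] = z                               # undo the column permutation: x = P z

print(np.allclose(J @ x, J @ np.linalg.lstsq(J, y, rcond=None)[0]))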

Singular value decomposition

Let

    J = U [S; 0] V^T = [U_1, U_2] [S; 0] V^T = U_1 S V^T,

where

    U ∈ R^{m×m} and V ∈ R^{n×n} are orthogonal matrices, with U_1 ∈ R^{m×n}, U_2 ∈ R^{m×(m−n)},
    S ∈ R^{n×n} is a diagonal matrix with elements σ_1 ≥ σ_2 ≥ · · · ≥ σ_n > 0.

Then

    ‖Jx − y‖_2^2 = ‖U^T(J V V^T x − y)‖_2^2 = ‖(U^T J V) V^T x − U^T y‖_2^2
                 = ‖S V^T x − U_1^T y‖_2^2 + ‖U_2^T y‖_2^2,

using V V^T = I and U^T J V = [S; 0].

Solution: x^* = V S^{−1} U_1^T y = ∑_{i=1}^n (u_i^T y / σ_i) v_i. If some σ_i are small, they would unduly amplify the noise and can be omitted from the sum.

Picard condition: |u_i^T y| should decay faster than σ_i.
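
A minimal sketch of the (truncated) SVD solution in NumPy; the test matrix and the truncation tolerance are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
m, n = 200, 5
J = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Thin SVD: J = U1 @ diag(s) @ Vt, with s sorted in decreasing order
U1, s, Vt = np.linalg.svd(J, full_matrices=False)

# Truncated SVD solution: drop components with sigma_i below a tolerance,
# which suppresses amplification of the noise by tiny singular values
tol = 1e-10 * s[0]                        # illustrative truncation level (assumption)
keep = s > tol
x = Vt[keep].T @ ((U1[:, keep].T @ y) / s[keep])
print(x)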


LSQR [Paige, Saunders ’82]

LSQR applied to

    min_f ‖Af − g‖_2^2 + τ ‖f‖_2^2 = min_f ‖ [A; √τ I] f − [g; 0] ‖_2^2,

where τ is an optional damping parameter, is analytically equivalent to CG applied to the normal equations. However, it avoids forming them (hence no condition number squaring) by using the Golub–Kahan bidiagonalization [Golub, Kahan ’65].

G-K bidiagonalisation yields the projected least squares problem

    min_{y_i} ‖ [B_i; √τ I] y_i − β_1 e_1 ‖,    (P-LS)

which is then solved using QR decomposition, yielding the approximation to the solution of the original problem, f_i = V_i y_i.
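
SciPy's lsqr implements exactly this damped formulation: it minimises ‖Af − g‖² + damp²‖f‖², so damp = √τ. A hedged sketch with made-up A, g and τ:

import numpy as np
from scipy.sparse.linalg import lsqr, aslinearoperator

rng = np.random.default_rng(4)
m, n = 500, 80
A = rng.standard_normal((m, n))
g = rng.standard_normal(m)
tau = 1e-2                                   # damping parameter (illustrative value)

# lsqr minimises ||A f - g||^2 + damp^2 ||f||^2, so damp = sqrt(tau);
# A can be a dense/sparse matrix or a LinearOperator supplying A @ v and A.T @ u
result = lsqr(aslinearoperator(A), g, damp=np.sqrt(tau), atol=1e-8, btol=1e-8)
f = result[0]                                # first entry is the solution iterate
print(result[1], result[2])                  # istop flag and iteration count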


Golub-Kahan bidiagonalization

G-K bidiagonalization with starting vector g for min_f ‖g − Af‖:

    U_{i+1} (β_1 e_1) = g,
    A V_i = U_{i+1} B_i,
    A^T U_{i+1} = V_i B_i^T + α_{i+1} v_{i+1} e_{i+1}^T,    (1)

where e_i is the i-th canonical basis vector, and α_i ≥ 0, β_i ≥ 0 are chosen such that ‖u_i‖ = ‖v_i‖ = 1 and U_i^T U_i = I, V_i^T V_i = I. Here B_i ∈ R^{(i+1)×i} is the lower bidiagonal matrix with diagonal entries α_1, . . . , α_i and subdiagonal entries β_2, . . . , β_{i+1}, and

    U_i = [u_1, u_2, . . . , u_i],    V_i = [v_1, v_2, . . . , v_i].
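
A minimal sketch of the bidiagonalization recurrences above (plain, without the reorthogonalization a robust implementation would add); the test matrix at the end is an assumption for illustration.

import numpy as np

def golub_kahan(A, g, k):
    """First k steps of Golub-Kahan lower bidiagonalization of A with start vector g.

    Returns U (m x (k+1)), V (n x k) with orthonormal columns and the
    (k+1) x k lower-bidiagonal matrix B such that A @ V ~= U @ B.
    Sketch only: no reorthogonalization, so orthogonality degrades for large k."""
    m, n = A.shape
    U = np.zeros((m, k + 1)); V = np.zeros((n, k)); B = np.zeros((k + 1, k))

    beta = np.linalg.norm(g)                  # beta_1 u_1 = g
    U[:, 0] = g / beta
    v = A.T @ U[:, 0]
    for i in range(k):
        if i > 0:
            v = A.T @ U[:, i] - B[i, i - 1] * V[:, i - 1]      # A^T u_i - beta_i v_{i-1}
        alpha = np.linalg.norm(v); V[:, i] = v / alpha; B[i, i] = alpha
        u = A @ V[:, i] - alpha * U[:, i]                      # A v_i - alpha_i u_i
        beta = np.linalg.norm(u); U[:, i + 1] = u / beta; B[i + 1, i] = beta
    return U, V, B

rng = np.random.default_rng(6)
A = rng.standard_normal((30, 10)); g = rng.standard_normal(30)
U, V, B = golub_kahan(A, g, k=5)
print(np.allclose(A @ V, U @ B))              # checks A V_i = U_{i+1} B_i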


Preconditioned LSQR

Preconditioned LSQR solves

    f̂ = arg min_{f̂} ‖g − A L^{−1} f̂‖.

The corresponding normal equation is exactly the split preconditioned normal equation

    L^{−T} A^T A L^{−1} f̂ = L^{−T} A^T g,    f = L^{−1} f̂.    (2)

Similarly as for CG, the LSQR algorithm can be formulated without the need to provide a factorization L^T L of M; this is the MLSQR algorithm [Arridge, B, Harhanen ’14].
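
A minimal sketch of this split/right-preconditioned formulation using SciPy's lsqr on the operator A L^{-1} via a LinearOperator. The matrices, the made-up SPD matrix M and its factor L are illustrative assumptions, and this is plain preconditioned LSQR rather than the MLSQR algorithm on the next slide, which avoids factorising M.

import numpy as np
from scipy.linalg import solve_triangular
from scipy.sparse.linalg import LinearOperator, lsqr

rng = np.random.default_rng(5)
m, n = 300, 60
A = rng.standard_normal((m, n))
g = rng.standard_normal(m)

# Illustrative preconditioner: a made-up SPD matrix M with M = L^T L, L upper triangular
M = np.eye(n) + 0.1 * np.diag(np.arange(n))
L = np.linalg.cholesky(M).T               # numpy returns C with M = C C^T; L = C^T

op = LinearOperator(
    (m, n),
    matvec=lambda v: A @ solve_triangular(L, v),                   # v -> A L^{-1} v
    rmatvec=lambda u: solve_triangular(L.T, A.T @ u, lower=True),  # u -> L^{-T} A^T u
)
f_hat = lsqr(op, g)[0]                     # solve min || g - A L^{-1} f_hat ||
f = solve_triangular(L, f_hat)             # map back: f = L^{-1} f_hat
print(np.linalg.norm(g - A @ f))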


MLSQR [Arridge, B, Harhanen ’14]

1: Initialization:
2:   β_1 u_1 = g
3:   p̃ = A^T u_1
4:   ṽ_1 = M^{−1} p̃,  α_1 = (ṽ_1, p̃)^{1/2},  ṽ_1 = ṽ_1/α_1
5:   w̃_1 = ṽ_1,  f_0 = 0,  φ̄_1 = β_1,  ρ̄_1 = α_1
6: for i = 1, 2, . . . do
7:   Bidiagonalization:
8:     β_{i+1} u_{i+1} = A ṽ_i − α_i u_i
9:     p̃ = A^T u_{i+1} − β_{i+1} p̃
10:    ṽ_{i+1} = M^{−1} p̃,  α_{i+1} = (ṽ_{i+1}, p̃)^{1/2}
11:    p̃ = p̃/α_{i+1},  ṽ_{i+1} = ṽ_{i+1}/α_{i+1}
12:  Orthogonal transformation: smart QR, see [Paige, Saunders ’82]
13:  Update:
14:    f_i = f_{i−1} + (φ_i/ρ_i) w̃_i
15:    w̃_{i+1} = ṽ_{i+1} − (θ_{i+1}/ρ_i) w̃_i
16:  Break if stopping criterion satisfied
17: end for


LSQR with explicit regularization (τ ≠ 0)

In the preconditioned formulation, Tikhonov (explicit) regularization amounts to damping. For a fixed value of τ, damping can easily be incorporated in LSQR at the cost of doubling the number of Givens rotations [Paige, Saunders ’82].

Due to the shift invariance of Krylov spaces, the V_i are the same for any τ. If the V_i are stored (expensive for many/long vectors), the projected least squares problem (P-LS) can be solved efficiently for multiple values of τ.

Solving (P-LS) with a variable τ is discussed in [Bjorck ’88] using the singular value decomposition of the bidiagonal matrix B_i (there is no efficient SVD update, even though B_i simply expands by a row and a column in each iteration). These quantities can be obtained at a cost of O(i²) at the i-th iteration. For larger i, the algorithm described in [Elden ’77], which computes the least squares solution of (P-LS) at a cost of O(i) per value of τ, is the preferable option.


Stopping LSQR / MLSQR

[Paige, Saunders ’82] discusses three stopping criteria:

S1: ‖r̄_i‖ ≤ BTOL ‖g‖ + ATOL ‖Ā‖ ‖f_i‖  (consistent systems),
S2: ‖Ā^T r̄_i‖ / (‖Ā‖ ‖r̄_i‖) ≤ ATOL  (inconsistent systems),
S3: cond(Ā) ≥ CONLIM  (both),

where r̄_i := ḡ − Ā f_i with Ā = [A; √τ L], ḡ = [g; 0].

[Arridge, B, Harhanen ’14] uses the Morozov discrepancy principle (suitable for ill-posed problems):

S4: ‖r_i‖ ≤ η δ,  η > 1,

where r_i := g − A f_i, δ is the (estimated) noise level and η > 1 prevents overregularization.

+ If τ = 0, then r_i = r̄_i and the sequence ‖r_i‖ = ‖r̄_i‖ is monotonically decreasing. Moreover, if initialised with f_0 = 0, ‖f_i‖ is strictly monotonically growing (relevant for the damped problem).
+ Priorconditioning does not alter the residual.
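
SciPy's lsqr exposes S1-S3 directly through its atol, btol and conlim arguments. The sketch below sets them and then emulates the discrepancy-principle stop S4 by re-running with increasing iteration limits; this restarting is only an illustration of the criterion (each call rebuilds the Krylov basis) and is not the implementation of [Arridge, B, Harhanen ’14]. Problem data and noise level are made up.

import numpy as np
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(6)
m, n = 400, 100
A = rng.standard_normal((m, n))
f_true = rng.standard_normal(n)
noise = 0.05 * rng.standard_normal(m)
g = A @ f_true + noise
delta = np.linalg.norm(noise)              # (estimated) noise level, here known exactly
eta = 1.1                                  # safety factor > 1 against overregularization

# S1-S3: the Paige-Saunders criteria are exposed directly by SciPy's lsqr
f = lsqr(A, g, atol=1e-8, btol=1e-8, conlim=1e8)[0]

# S4 (Morozov): stop at the first iteration count with ||g - A f_i|| <= eta * delta
for it in range(1, 200):
    f_i = lsqr(A, g, atol=0, btol=0, conlim=0, iter_lim=it)[0]
    if np.linalg.norm(g - A @ f_i) <= eta * delta:
        print("discrepancy principle satisfied after", it, "iterations")
        break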


Gauss-Newton (GN) method

Gauss-Newton (GN) can be viewed as a modified Newton method
with line search.

Recall the specific form of gradient and Hessian of least squares
problems

    ∇f(x) = J(x)^T r(x),    ∇²f(x) = J(x)^T J(x) + ∑_{j=1}^m r_j(x) ∇²r_j(x).

Substituting into the Newton equation

    ∇²f(x_k) p_k = −∇f(x_k)

and using the approximation ∇²f(x) ≈ J(x)^T J(x), we obtain the GN equation

    J_k^T J_k p_k^GN = −J_k^T r_k,    with J_k := J(x_k), r_k := r(x_k).

Implementations of GN usually perform a line search along p_k^GN, requiring the step length to satisfy e.g. the Armijo or Wolfe conditions.


GN does not require computation of the individual Hessians ∇²r_j, j = 1, . . . , m. If the Jacobian J_k has been computed when evaluating the gradient, no other derivatives are needed.

Frequently the first term J_k^T J_k dominates the second term in the Hessian, i.e. the |r_j(x)| ‖∇²r_j(x)‖ are much smaller than the eigenvalues of J^T J. This happens if either the residuals r_j or the ‖∇²r_j‖ are small. Then J_k^T J_k is a good approximation to ∇²f_k and the convergence rate of GN is close to that of Newton.

If J_k has full rank and ∇f_k ≠ 0, p_k^GN is a descent direction for f and hence suitable for line search:

    (p_k^GN)^T ∇f_k = (p_k^GN)^T J_k^T r_k = −(p_k^GN)^T J_k^T J_k p_k^GN = −‖J_k p_k^GN‖² ≤ 0.

The final inequality is strict unless J_k p_k^GN = 0, in which case, by the GN equation and J_k being full rank, we have 0 = J_k^T r_k = ∇f_k and x_k is a stationary point.


Interpretation of GN step

The GN equation

    J_k^T J_k p_k^GN = −J_k^T r_k

is exactly the normal equation for the linear least squares problem

    min_p 1/2 ‖J_k p + r_k‖².

Hence, we can find the GN direction p_k^GN by solving this linear least squares problem using any of the techniques discussed before.

We can also view the GN equation as obtained from a linear model for the vector function, r(x_k + p) ≈ r_k + J_k p,

    f(x_k + p) = 1/2 ‖r(x_k + p)‖² ≈ 1/2 ‖J_k p + r_k‖²,

and p_k^GN = arg min_p 1/2 ‖J_k p + r_k‖².
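
A bare-bones sketch of a GN loop built on this observation, with each step computed by a dense least squares solve; the exponential test model, the data and the use of a plain unit step (no Armijo/Wolfe line search) are simplifying assumptions.

import numpy as np

def gauss_newton(residual, jacobian, x0, iters=20):
    """Plain Gauss-Newton: each step solves the linear LS problem
       min_p ||J_k p + r_k||, here via an SVD-based dense solver."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual(x)
        J = jacobian(x)
        p, *_ = np.linalg.lstsq(J, -r, rcond=None)   # GN direction p_k^GN
        x = x + p                                     # unit step; no line search here
    return x

# Illustrative example (assumed model): fit y ~ a * exp(b * t)
rng = np.random.default_rng(7)
t = np.linspace(0, 2, 30)
y = 2.0 * np.exp(-1.3 * t) + 0.01 * rng.standard_normal(t.size)
res = lambda x: x[0] * np.exp(x[1] * t) - y
jac = lambda x: np.column_stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)])
print(gauss_newton(res, jac, x0=[1.0, -1.0]))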


Global convergence of GN

The global convergence is a consequence of the convergence
theorem for line search methods [Zoutendijk].

To satisfy the conditions of Zoutendijk’s theorem, we need to make the following assumptions:

each r_j is Lipschitz continuously differentiable in a neighbourhood M of the bounded level set L = {x : f(x) ≤ f(x_0)},

J(x) satisfies the uniform full-rank condition: there exists γ > 0 such that

    ‖J(x) z‖ ≥ γ‖z‖,    ∀x ∈ M.

Then for the iterates x_k generated by the GN method with step lengths satisfying the Wolfe conditions, we have

    lim_{k→∞} ∇f(x_k) = lim_{k→∞} J_k^T r_k = 0.


Similarly as for the line search analysis, we check that the angle θ_k = ∠(p_k^GN, −∇f_k) is uniformly bounded away from π/2:

    cos θ_k = − (p_k^GN)^T ∇f_k / (‖p_k^GN‖ ‖∇f_k‖) = ‖J_k p_k^GN‖² / (‖p_k^GN‖ ‖J_k^T J_k p_k^GN‖)
            ≥ γ² ‖p_k^GN‖² / (β² ‖p_k^GN‖²) = γ²/β² > 0,

where ‖J(x)‖ ≤ β < ∞ for all x ∈ L is a consequence of the boundedness of the level set L and the Lipschitz continuous differentiability of the r_j, j = 1, . . . , m. Then from ∑_{k≥0} cos²θ_k ‖∇f_k‖² < ∞ in Zoutendijk’s theorem it follows that ∇f_k → 0.

If J_k for some k is rank deficient, the matrix J_k^T J_k in the GN equation is singular and the system has infinitely many solutions; moreover, cos θ_k is not necessarily bounded away from 0.

Convergence rate of GN

The convergence of GN can be rapid if J_k^T J_k dominates the second term in the Hessian. Similarly as when showing the convergence rate of the Newton iteration, if x_k is sufficiently close to x^* and J(x) satisfies the uniform full-rank condition, we have for a unit step in the GN direction

    x_k + p_k^GN − x^* = x_k − x^* − [J^T J(x_k)]^{−1} ∇f(x_k)
                       = [J^T J(x_k)]^{−1} [ J^T J(x_k)(x_k − x^*) + ∇f(x^*) − ∇f(x_k) ],

using ∇f(x^*) = 0 and the shorthand J^T J(x_k) := J(x_k)^T J(x_k). Using H(x) to denote the second term in the Hessian, it follows from Taylor’s theorem that

    ∇f(x_k) − ∇f(x^*) = ∫_0^1 J^T J(x^* + t(x_k − x^*)) (x_k − x^*) dt + ∫_0^1 H(x^* + t(x_k − x^*)) (x_k − x^*) dt.

Putting everything together and assuming Lipschitz continuity of H(·) near x^* (the Lipschitz continuous differentiability of the r_j gives Lipschitz continuity of J^T r(x)),

    ‖x_k + p_k^GN − x^*‖ ≤ ∫_0^1 ‖[J^T J(x_k)]^{−1} H(x^* + t(x_k − x^*))‖ ‖x_k − x^*‖ dt
                          + ∫_0^1 ‖[J^T J(x_k)]^{−1} ( J^T J(x^* + t(x_k − x^*)) − J^T J(x_k) )‖ ‖x_k − x^*‖ dt
                          ≈ ‖[J^T J(x^*)]^{−1} H(x^*)‖ ‖x_k − x^*‖ + O(‖x_k − x^*‖²),

where ‖[J^T J(x_k)]^{−1}‖ ≤ γ^{−2} and ‖J^T J(x^* + t(x_k − x^*)) − J^T J(x_k)‖ = O(‖x_k − x^*‖) (the latter would need Lipschitz continuity of J^T J(x)).

Hence, if ‖[J^T J(x^*)]^{−1} H(x^*)‖ ≪ 1, we can expect GN to converge quickly towards the solution x^*. When H(x^*) = 0, the convergence is quadratic (as for Newton).

When the Jacobian J(x) is large and sparse, the exact solve of the GN equation can be replaced by an inexact solve, as in inexact Newton methods, but with the true Hessian ∇²f(x_k) replaced by J(x_k)^T J(x_k). The positive semidefiniteness of J(x_k)^T J(x_k) simplifies the algorithms. Instead of (preconditioned) CG, (preconditioned) LSQR should be used.

Levenberg-Marquardt (LM) method

Levenberg-Marquardt (LM) makes use of the same Hessian approximation as GN but within the framework of trust region methods. Trust region methods can cope with a (nearly) rank-deficient Hessian, which is a weakness of GN.

The constrained model problem to be solved at each iteration is

    min_p 1/2 ‖J_k p + r_k‖²,    subject to ‖p‖ ≤ Δ_k,    (CM-LM)

where Δ_k > 0 is the trust region radius.

Note: The least squares term corresponds to the quadratic model

    m_k(p) = 1/2 ‖r_k‖² + p^T J_k^T r_k + 1/2 p^T J_k^T J_k p.


Solution of the constrained model problem

The solution of the constrained model problem (CM-LM) is an immediate consequence of the general result for trust region methods [More, Sorensen]:

If the solution p_k^GN of the GN equation lies strictly inside the trust region, i.e. ‖p_k^GN‖ < Δ_k, then p_k^LM = p_k^GN solves (CM-LM). Otherwise, there is a λ > 0 such that the solution p_k^LM of (CM-LM) satisfies ‖p_k^LM‖ = Δ_k and

    (J_k^T J_k + λI) p_k^LM = −J_k^T r_k.

Note: The last equation is the normal equation for the linear least squares problem

    min_p 1/2 ‖ [J_k; √λ I] p + [r_k; 0] ‖²,

which gives us a way of solving (CM-LM) without computing J_k^T J_k.
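
A sketch of a single LM step for a given λ, solved through the augmented least squares form so that J_k^T J_k is never formed; the value of λ and the toy J_k, r_k are assumptions, and a full LM implementation would additionally adapt λ (or Δ_k).

import numpy as np

def lm_step(J, r, lam):
    """Solve (J^T J + lam I) p = -J^T r via the equivalent augmented LS problem
       min_p || [J; sqrt(lam) I] p + [r; 0] ||, avoiding the product J^T J."""
    m, n = J.shape
    J_aug = np.vstack([J, np.sqrt(lam) * np.eye(n)])
    r_aug = np.concatenate([r, np.zeros(n)])
    p, *_ = np.linalg.lstsq(J_aug, -r_aug, rcond=None)
    return p

# Toy data (assumption) just to exercise the function
rng = np.random.default_rng(8)
J = rng.standard_normal((50, 4)); r = rng.standard_normal(50)
p = lm_step(J, r, lam=1e-2)
print(np.allclose((J.T @ J + 1e-2 * np.eye(4)) @ p, -J.T @ r))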


Global convergence LM

Global convergence is a consequence of the corresponding trust
region global convergence theorem.

To satisfy the conditions of that theorem we make the following
assumptions:

η ∈ (0, 1/4) (for strong convergence),

each r_j is Lipschitz continuously differentiable in a neighbourhood M of the bounded level set L = {x : f(x) ≤ f(x_0)},

the approximate solution p_k of (CM-LM) satisfies

    m_k(0) − m_k(p_k) ≥ c_1 ‖J_k^T r_k‖ min( Δ_k, ‖J_k^T r_k‖ / ‖J_k^T J_k‖ )

for some constant c_1 > 0, and in addition ‖p_k‖ ≤ γ Δ_k for some constant γ ≥ 1.

We then have that

    lim_{k→∞} ∇f_k = lim_{k→∞} J_k^T r_k = 0.


As for trust region methods, there is no need to evaluate the right hand side of the decrease condition; it is sufficient to ensure a reduction at least as large as that of the Cauchy point, which can be calculated inexpensively. If the iterative CG-Steihaug approach is used, this is guaranteed.

The local convergence of LM is similar to that of GN. Near a solution x^* at which the first term J(x^*)^T J(x^*) of the Hessian ∇²f(x^*) dominates the second term, the trust region becomes inactive and the algorithm takes GN steps, giving rapid local convergence.


References

[Paige, Saunders ’82] Link to website with papers and codes:
https://web.stanford.edu/group/SOL/software/lsqr/

[Bjorck ’96] A. Bjorck, “Numerical Methods for Least Squares Problems”, SIAM, 1996.

[Golub, Kahan ’65] G. H. Golub and W. Kahan, “Calculating the singular values and pseudoinverse of a matrix”, SIAM J. Numer. Anal., 2, 1965.

[Elden ’77] L. Elden, “Algorithms for the regularization of ill-conditioned least squares problems”, BIT, 17, 1977.

[Arridge, B, Harhanen ’14] S. R. Arridge, M. M. Betcke and L. Harhanen, “Iterated preconditioned LSQR method for inverse problems on unstructured grids”, Inverse Problems, 30(7), 2014.

[Hansen ’98] P. C. Hansen, “Rank-Deficient and Discrete Ill-Posed Problems”, SIAM, 1998.
