Numerical Optimisation: Least Squares

Marta M. Betcke (m.betcke@ucl.ac.uk)
Kiko Rullan (f.rullan@cs.ucl.ac.uk)

Department of Computer Science,
Centre for Medical Image Computing,
Centre for Inverse Problems
University College London

Lecture 10 & 11
Least squares problem

Least squares is a problem where the objective function has the following special form
\[
f(x) = \frac{1}{2} \sum_{j=1}^{m} r_j^2(x),
\]
where each r_j is a smooth function from R^n to R. We refer to each of the r_j as a residual, and we assume that m ≥ n.

Least squares problems are ubiquitous in applications, where the discrepancy between the model and the observed behaviour is minimised.
Let's assemble the individual components r_j into the residual vector r : R^n → R^m,
\[
r(x) = (r_1(x), r_2(x), \ldots, r_m(x))^T.
\]
Using this vector, f becomes f(x) = \frac{1}{2}\|r(x)\|_2^2. The derivatives of f can be calculated with help of the Jacobian
\[
J(x) = \left[ \frac{\partial r_j}{\partial x_i} \right]_{j=1,\ldots,m;\; i=1,\ldots,n}
= \begin{bmatrix} \nabla r_1(x)^T \\ \nabla r_2(x)^T \\ \vdots \\ \nabla r_m(x)^T \end{bmatrix},
\]
\[
\nabla f(x) = \sum_{j=1}^{m} r_j(x) \nabla r_j(x) = J(x)^T r(x),
\]
\[
\nabla^2 f(x) = \sum_{j=1}^{m} \nabla r_j(x) \nabla r_j(x)^T + \sum_{j=1}^{m} r_j(x) \nabla^2 r_j(x)
= J(x)^T J(x) + \sum_{j=1}^{m} r_j(x) \nabla^2 r_j(x).
\]
Example

Model of concentration of drug in bloodstream:
\[
\phi(x; t) = x_1 + t x_2 + t^2 x_3 + x_4 \exp(-x_5 t).
\]
Find a set of parameters x such that the model best matches the data by solving
\[
\min_x \; \frac{1}{2} \sum_{j=1}^{m} \big( \underbrace{\phi(x; t_j) - y_j}_{=: r_j(x)} \big)^2.
\]

Figure: Nocedal & Wright, Fig. 10.1 (left), Fig. 10.2 (right)
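As an illustration, here is a minimal Python sketch of fitting this model with scipy.optimize.least_squares (a trust-region nonlinear least squares solver of the Gauss-Newton/Levenberg-Marquardt type discussed later in this lecture). The data, true parameters and starting guess below are hypothetical, synthetic values used purely for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

# Model: phi(x; t) = x1 + t*x2 + t^2*x3 + x4*exp(-x5*t)
def phi(x, t):
    return x[0] + t * x[1] + t**2 * x[2] + x[3] * np.exp(-x[4] * t)

# Residuals r_j(x) = phi(x; t_j) - y_j
def residuals(x, t, y):
    return phi(x, t) - y

# Hypothetical synthetic data (true parameters chosen only for illustration)
rng = np.random.default_rng(0)
x_true = np.array([1.0, 0.5, -0.05, 2.0, 1.5])
t = np.linspace(0.0, 10.0, 50)
y = phi(x_true, t) + 0.05 * rng.standard_normal(t.size)

# Solve the nonlinear least squares problem min_x 0.5 * sum_j r_j(x)^2
x0 = np.ones(5)                      # hypothetical starting guess
sol = least_squares(residuals, x0, args=(t, y))
print(sol.x)                         # fitted parameters
```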
Bayesian perspective

Bayes' theorem:
\[
\pi(x|y) = \frac{\pi(y|x)\,\pi(x)}{\pi(y)} \propto \pi(y|x)\,\pi(x).
\]
Denote the discrepancy between the model and the measurement at data point t_j as
\[
\epsilon_j = \phi(x; t_j) - y_j.
\]
Assume that the \epsilon_j are independent and identically distributed with variance \sigma^2 and probability density function g_\sigma(\cdot). The likelihood of a particular set of observations y_j, j = 1, 2, \ldots, m, given the parameter vector x is
\[
\pi(y|x) = \prod_{j=1}^{m} g_\sigma(\epsilon_j) = \prod_{j=1}^{m} g_\sigma(\phi(x; t_j) - y_j).
\]
The maximum a posteriori probability (MAP) estimate vs the maximum likelihood estimate:
\[
x_{\rm MAP} = \arg\max_x \pi(x|y) = \arg\max_x \pi(y|x)\,\pi(x).
\]
If g_\sigma is a normal distribution,
\[
g_\sigma(\epsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\epsilon^2}{2\sigma^2} \right),
\]
and x is uniformly distributed, i.e. \pi(x) = const, the MAP estimate reads
\[
x_{\rm MAP} = \arg\max_x \frac{1}{(\sqrt{2\pi}\,\sigma)^m} \exp\left( -\frac{1}{2\sigma^2} \sum_{j=1}^{m} (\phi(x; t_j) - y_j)^2 \right)
= \arg\min_x \sum_{j=1}^{m} (\phi(x; t_j) - y_j)^2.
\]
Linear least squares

If \phi(x; t) is a linear function of x, the least squares problem becomes linear:
the residuals r_j(x) = \phi(x; t_j) - y_j are linear and in vectorized notation
\[
r(x) = Jx - y,
\]
where
the vector of measurements y = (y_1, y_2, \ldots, y_m)^T = -r(0),
the matrix J with rows J_{j,:} such that \phi(x; t_j) = J_{j,:} x,
are both independent of x.
The linear least squares problem has the form
\[
f(x) = \frac{1}{2} \|Jx - y\|^2.
\]
The gradient and Hessian are
\[
\nabla f(x) = J^T(Jx - y), \qquad \nabla^2 f(x) = J^T J.
\]
Note: f(x) is convex, i.e. any stationary point \nabla f(x^\star) = 0 is a global minimiser.

Normal equations
\[
\nabla f(x^\star) = J^T(Jx^\star - y) = 0 \quad \Leftrightarrow \quad J^T J x^\star = J^T y.
\]
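A minimal sketch (assuming a dense, full column-rank J, with hypothetical data) of the normal-equation route via a Cholesky factorisation in SciPy, compared against a solver that works on J directly; both options are weighed on the roadmap below.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
m, n = 100, 5
J = rng.standard_normal((m, n))          # hypothetical full-rank Jacobian/design matrix
y = rng.standard_normal(m)               # hypothetical measurements

# Normal equations J^T J x = J^T y, solved via Cholesky of J^T J
c, low = cho_factor(J.T @ J)
x_ne = cho_solve((c, low), J.T @ y)

# Reference solution that avoids forming J^T J (SVD-based)
x_ls, *_ = np.linalg.lstsq(J, y, rcond=None)
print(np.allclose(x_ne, x_ls))
```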
Roadmap: solution of linear least squares

Solve the normal equations J^T J x^\star = J^T y
+ If m ≫ n, computing J^T J explicitly results in a smaller matrix, easier to store than J. The system can then be solved by e.g. Cholesky decomposition.
– Forming J^T J squares the condition number.
+ For regularised ill-posed problems, squaring of the condition number may not be an issue.
– If J is rank deficient (or ill conditioned), the Cholesky decomposition will require pivoting.

Solve the least squares problem x^\star = arg min_x ‖Jx − y‖_2
+ If J is of moderate size you can use direct methods like the QR decomposition or the SVD.
+ If J is large and sparse or given in operator form, i.e. x → Jx, use iterative methods like CGLS or LSQR.
+ Does not square the condition number.
+ In particular the SVD and iterative methods, e.g. LSQR, can easily deal with ill-conditioning.
QR factorization

Let
\[
JP = Q \begin{bmatrix} R \\ 0 \end{bmatrix}
= [\underbrace{Q_1}_{\in R^{m\times n}}\; \underbrace{Q_2}_{\in R^{m\times(m-n)}}] \begin{bmatrix} R \\ 0 \end{bmatrix}
= Q_1 R,
\]
where
P ∈ R^{n×n} is a column permutation matrix (hence orthogonal),
Q ∈ R^{m×m} is an orthogonal matrix,
R ∈ R^{n×n} is upper triangular with positive diagonal.

Recall: Multiplication with an orthogonal matrix preserves ‖·‖_2.
\[
\|Jx - y\|_2^2 = \|Q^T(J \overbrace{PP^T}^{=I} x - y)\|_2^2
= \|(\overbrace{Q^T J P}^{=[R,\,0]^T})\, P^T x - Q^T y\|_2^2
= \|R P^T x - Q_1^T y\|_2^2 + \|Q_2^T y\|_2^2.
\]
Solution: x^\star = P R^{-1} Q_1^T y. In practice we perform back substitution on R z = Q_1^T y and permute to obtain x^\star = P z.
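A minimal sketch of this QR route in Python using scipy.linalg.qr with column pivoting (which returns the permutation as an index array rather than a matrix P); it assumes J has full column rank and uses hypothetical data.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(2)
m, n = 100, 5
J = rng.standard_normal((m, n))          # hypothetical full column-rank matrix
y = rng.standard_normal(m)

# Economic QR with column pivoting: J[:, perm] = Q1 @ R
Q1, R, perm = qr(J, mode='economic', pivoting=True)

# Back substitution on R z = Q1^T y, then undo the permutation: x[perm] = z
z = solve_triangular(R, Q1.T @ y, lower=False)
x = np.empty(n)
x[perm] = z

print(np.allclose(x, np.linalg.lstsq(J, y, rcond=None)[0]))
```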
Singular value decomposition

Let
\[
J = U \begin{bmatrix} S \\ 0 \end{bmatrix} V^T
= [\underbrace{U_1}_{\in R^{m\times n}}\; \underbrace{U_2}_{\in R^{m\times(m-n)}}] \begin{bmatrix} S \\ 0 \end{bmatrix} V^T
= U_1 S V^T,
\]
where
U ∈ R^{m×m}, V ∈ R^{n×n} are orthogonal matrices,
S ∈ R^{n×n} is a diagonal matrix with elements σ_1 ≥ σ_2 ≥ · · · ≥ σ_n > 0.
\[
\|Jx - y\|_2^2 = \|U^T(J \overbrace{VV^T}^{=I} x - y)\|_2^2
= \|(\overbrace{U^T J V}^{=[S,\,0]^T})\, V^T x - U^T y\|_2^2
= \|S V^T x - U_1^T y\|_2^2 + \|U_2^T y\|_2^2.
\]
Solution:
\[
x^\star = V S^{-1} U_1^T y = \sum_{i=1}^{n} \frac{u_i^T y}{\sigma_i} v_i.
\]
If some σ_i are small, they would unduly amplify the noise and can be omitted from the sum.

Picard condition: |u_i^T y| should decay faster than σ_i.
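A minimal sketch of the truncated SVD solution in NumPy; the truncation tolerance and the test matrix below are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 5
J = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Thin SVD: J = U1 @ diag(s) @ Vt, with U1 of size m x n
U1, s, Vt = np.linalg.svd(J, full_matrices=False)

# Truncated SVD solution: keep only singular values above a tolerance
tol = 1e-8 * s[0]                    # hypothetical truncation level
keep = s > tol
x = Vt[keep].T @ ((U1[:, keep].T @ y) / s[keep])

print(np.allclose(x, np.linalg.lstsq(J, y, rcond=None)[0]))
```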
LSQR [Paige, Saunders ’82]

LSQR applied to
\[
\min_f \|Af - g\|_2^2 + \tau \|f\|_2^2
= \min_f \left\| \begin{bmatrix} A \\ \sqrt{\tau}\, I \end{bmatrix} f - \begin{bmatrix} g \\ 0 \end{bmatrix} \right\|_2^2,
\]
where τ is an optional damping parameter, is analytically equivalent to CG applied to the normal equations. However, it avoids forming them (hence no condition number squaring) using the Golub–Kahan bidiagonalization [Golub, Kahan ’65].

G-K bidiagonalisation yields the projected least squares problem
\[
\min_{y_i} \left\| \begin{bmatrix} B_i \\ \sqrt{\tau}\, I \end{bmatrix} y_i - \beta_1 e_1 \right\|, \tag{P-LS}
\]
which is then solved using a QR decomposition, yielding the approximation to the solution of the original problem, f_i = V_i y_i.
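A minimal usage sketch of SciPy's implementation, scipy.sparse.linalg.lsqr, which follows [Paige, Saunders '82] and exposes the damping through its damp argument (damp = √τ in the notation above); the sparse operator and damping value below are hypothetical.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(4)
m, n = 2000, 500
A = sparse_random(m, n, density=0.01, random_state=rng, format='csr')  # hypothetical sparse operator
g = rng.standard_normal(m)

tau = 1e-2                                      # hypothetical damping parameter
# Solves min_f ||A f - g||_2^2 + tau ||f||_2^2  (damp = sqrt(tau))
f, istop, itn, r1norm = lsqr(A, g, damp=np.sqrt(tau), atol=1e-8, btol=1e-8)[:4]
print(istop, itn, r1norm)
```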
Golub-Kahan bidiagonalization

G-K bidiagonalization with starting vector g for min_f ‖g − Af‖:
\[
U_{i+1}(\beta_1 e_1) = g, \qquad
A V_i = U_{i+1} B_i, \qquad
A^T U_{i+1} = V_i B_i^T + \alpha_{i+1} v_{i+1} e_{i+1}^T,
\]
with e_i the ith canonical basis vector, and α_i ≥ 0, β_i ≥ 0 chosen such that ‖u_i‖ = ‖v_i‖ = 1 and U_i^T U_i = I, V_i^T V_i = I,
\[
B_i = \begin{bmatrix}
\alpha_1 & & & \\
\beta_2 & \alpha_2 & & \\
& \beta_3 & \ddots & \\
& & \ddots & \alpha_i \\
& & & \beta_{i+1}
\end{bmatrix}, \qquad
U_i = [u_1, u_2, \ldots, u_i], \qquad
V_i = [v_1, v_2, \ldots, v_i]. \tag{1}
\]
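A minimal dense NumPy sketch of this recurrence (Bidiag 1 of [Paige, Saunders '82]); no reorthogonalization is performed, so in floating point the columns of U and V gradually lose orthogonality, but the relation A V_i = U_{i+1} B_i holds to rounding error by construction. The test matrix is hypothetical.

```python
import numpy as np

def golub_kahan(A, g, k):
    """Run k steps of Golub-Kahan bidiagonalization for min_f ||g - A f||.
    Returns U (m x (k+1)), V (n x k) and the lower bidiagonal B ((k+1) x k)."""
    m, n = A.shape
    U = np.zeros((m, k + 1))
    V = np.zeros((n, k))
    B = np.zeros((k + 1, k))

    beta = np.linalg.norm(g)
    U[:, 0] = g / beta                        # beta_1 u_1 = g
    v = A.T @ U[:, 0]
    for i in range(k):
        alpha = np.linalg.norm(v)             # alpha_{i+1} v_{i+1} = A^T u_{i+1} - beta_{i+1} v_i
        V[:, i] = v / alpha
        B[i, i] = alpha

        u = A @ V[:, i] - alpha * U[:, i]     # beta_{i+1} u_{i+1} = A v_i - alpha_i u_i
        beta = np.linalg.norm(u)
        U[:, i + 1] = u / beta
        B[i + 1, i] = beta

        v = A.T @ U[:, i + 1] - beta * V[:, i]
    return U, V, B

# Quick check of the relation A V_i = U_{i+1} B_i on a hypothetical matrix
rng = np.random.default_rng(5)
A = rng.standard_normal((50, 20))
g = rng.standard_normal(50)
U, V, B = golub_kahan(A, g, 10)
print(np.allclose(A @ V, U @ B))
```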
Preconditioned LSQR

Preconditioned LSQR:
\[
\hat f = \arg\min \|g - A L^{-1} \hat f\|.
\]
The corresponding normal equation is exactly the split preconditioned normal equation
\[
L^{-T} A^T A L^{-1} \hat f = L^{-T} A^T g, \qquad f = L^{-1} \hat f. \tag{2}
\]
As for CG, the LSQR algorithm can be formulated without the need to provide a factorization L^T L of M; this is the MLSQR algorithm [Arridge, B, Harhanen ’14].
MLSQR [Arridge, B, Harhanen ’14]

1: Initialization:
2:   β_1 u_1 = g
3:   p̃ = A^T u_1
4:   ṽ_1 = M^{-1} p̃,  α_1 = (ṽ_1, p̃)^{1/2},  ṽ_1 = ṽ_1/α_1
5:   w̃_1 = ṽ_1,  f_0 = 0,  φ̄_1 = β_1,  ρ̄_1 = α_1
6: for i = 1, 2, . . . do
7:   Bidiagonalization:
8:     β_{i+1} u_{i+1} = A ṽ_i − α_i u_i
9:     p̃ = A^T u_{i+1} − β_{i+1} p̃
10:    ṽ_{i+1} = M^{-1} p̃,  α_{i+1} = (ṽ_{i+1}, p̃)^{1/2}
11:    p̃ = p̃/α_{i+1},  ṽ_{i+1} = ṽ_{i+1}/α_{i+1}
12:   Orthogonal transformation: smart QR, see [Paige, Saunders ’82]
13:   Update:
14:    f_i = f_{i−1} + (φ_i/ρ_i) w̃_i
15:    w̃_{i+1} = ṽ_{i+1} − (θ_{i+1}/ρ_i) w̃_i
16:   Break if stopping criterion satisfied
17: end for
LSQR with explicit regularization (τ ≠ 0)

In the preconditioned formulation, Tikhonov (explicit) regularization amounts to damping. For a fixed value of τ, damping can easily be incorporated in LSQR at the cost of doubling the number of Givens rotations [Paige, Saunders ’82].

Due to the shift invariance of Krylov spaces, the V_i are the same for any τ. If the V_i are stored (expensive for many/long vectors), the projected least squares problem (P-LS) can be efficiently solved for multiple values of τ.

Solving (P-LS) with a variable τ is discussed in [Bjorck ’88] using the singular value decomposition of the bidiagonal matrix B_i (there is no efficient SVD update even though B_i simply expands by a row and a column in each iteration). The required quantities can be obtained at a cost of O(i^2) at the ith iteration.

For larger i, the algorithm described in [Elden ’77], which computes the least squares solution of (P-LS) at a cost of O(i) per value of τ, is the preferable option.
Stopping LSQR / MLSQR

[Paige, Saunders ’82] discusses three stopping criteria:
S1: ‖r̄_i‖ ≤ BTOL‖g‖ + ATOL‖Ā‖‖f_i‖ (consistent systems),
S2: ‖Ā^T r̄_i‖ / (‖Ā‖‖r̄_i‖) ≤ ATOL (inconsistent systems),
S3: cond(Ā) ≥ CONLIM (both),
where r̄_i := ḡ − Ā f_i with
\[
\bar A = \begin{bmatrix} A \\ \sqrt{\tau}\, L \end{bmatrix}, \qquad
\bar g = \begin{bmatrix} g \\ 0 \end{bmatrix}.
\]

[Arridge, B, Harhanen ’14] uses the Morozov discrepancy principle (suitable for ill-posed problems)
S4: ‖r_i‖ ≤ ηδ, η > 1,
where r_i := g − A f_i, δ is the (estimated) noise level and η > 1 prevents overregularization.

+ If τ = 0, r_i = r̄_i and the sequence ‖r_i‖ = ‖r̄_i‖ is monotonically decreasing. Moreover, if initialised with f_0, ‖f_i‖ is strictly monotonically growing (relevant for the damped problem).
+ Priorconditioning does not alter the residual.
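A minimal sketch of how S1, S2 and S4 might be checked in a hand-rolled loop; check_stopping and its arguments are hypothetical names introduced here, and in SciPy's lsqr the criteria S1-S3 are already controlled via its atol, btol and conlim arguments.

```python
import numpy as np

def check_stopping(A_bar, g_bar, f_i, r_i, ATOL=1e-8, BTOL=1e-8,
                   eta=1.2, delta=None):
    """Return which of the stopping criteria S1, S2, S4 are satisfied.
    A_bar, g_bar: (possibly damped) system; f_i: current iterate;
    r_i: residual g - A f_i of the undamped problem (for S4)."""
    r_bar = g_bar - A_bar @ f_i
    A_norm = np.linalg.norm(A_bar, 2)     # in practice estimated within LSQR

    s1 = np.linalg.norm(r_bar) <= BTOL * np.linalg.norm(g_bar) \
         + ATOL * A_norm * np.linalg.norm(f_i)
    s2 = np.linalg.norm(A_bar.T @ r_bar) \
         <= ATOL * A_norm * np.linalg.norm(r_bar)
    s4 = (delta is not None) and np.linalg.norm(r_i) <= eta * delta
    return s1, s2, s4
```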
Gauss-Newton (GN) method

Gauss-Newton (GN) can be viewed as a modified Newton method with line search.

Recall the specific form of the gradient and Hessian of least squares problems:
\[
\nabla f(x) = J(x)^T r(x), \qquad
\nabla^2 f(x) = J(x)^T J(x) + \sum_{j=1}^{m} r_j(x) \nabla^2 r_j(x).
\]
Substituting into the Newton equation
\[
\nabla^2 f(x_k) p_k = -\nabla f(x_k)
\]
and using the approximation \nabla^2 f(x) \approx J(x)^T J(x) we obtain
\[
\underbrace{J(x_k)^T}_{=: J_k^T} J(x_k)\, p_k^{GN} = -J(x_k)^T \underbrace{r(x_k)}_{=: r_k}.
\]
Implementations of GN usually perform a line search along p_k^{GN}, requiring the step length to satisfy e.g. the Armijo or Wolfe conditions.
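A minimal sketch of such a Gauss-Newton iteration with an Armijo backtracking line search; the GN direction is computed from the equivalent linear least squares problem min_p ‖J_k p + r_k‖ (see the interpretation of the GN step below) rather than by forming J_k^T J_k. The helper names and the test problem (the drug-concentration model with synthetic data) are illustrative assumptions.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, max_iter=50, tol=1e-8,
                 c1=1e-4, rho=0.5):
    """Gauss-Newton with Armijo backtracking line search (dense, small-scale sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = residual(x)
        J = jacobian(x)
        grad = J.T @ r                                   # gradient J^T r
        if np.linalg.norm(grad) <= tol:
            break
        # GN direction: p = argmin_p ||J p + r||  (avoids forming J^T J)
        p, *_ = np.linalg.lstsq(J, -r, rcond=None)
        # Armijo backtracking on f(x) = 0.5 ||r(x)||^2
        f = 0.5 * r @ r
        alpha = 1.0
        while (0.5 * np.linalg.norm(residual(x + alpha * p))**2
               > f + c1 * alpha * (grad @ p)) and alpha > 1e-12:
            alpha *= rho
        x = x + alpha * p
    return x

# Usage on the exponential model from the earlier example (hypothetical data)
def phi(x, t):
    return x[0] + t * x[1] + t**2 * x[2] + x[3] * np.exp(-x[4] * t)

rng = np.random.default_rng(6)
t = np.linspace(0.0, 10.0, 50)
x_true = np.array([1.0, 0.5, -0.05, 2.0, 1.5])
y = phi(x_true, t) + 0.01 * rng.standard_normal(t.size)

res = lambda x: phi(x, t) - y
jac = lambda x: np.column_stack([np.ones_like(t), t, t**2,
                                 np.exp(-x[4] * t), -x[3] * t * np.exp(-x[4] * t)])
print(gauss_newton(res, jac, x0=np.array([0.5, 0.5, 0.0, 1.0, 1.0])))
```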
GN does not require computation of the individual Hessians ∇²r_j, j = 1, . . . , m. If the Jacobian J_k has been computed when evaluating the gradient, no other derivatives are needed.

Frequently the first term J_k^T J_k dominates the second term in the Hessian, i.e. the |r_j(x)|‖∇²r_j(x)‖ are much smaller than the eigenvalues of J^T J. This happens if either the residuals r_j or the ‖∇²r_j‖ are small. Then J_k^T J_k is a good approximation to ∇²f_k and the convergence rate of GN is close to that of Newton.

If J_k has full rank and ∇f_k ≠ 0, p_k^{GN} is a descent direction for f and hence suitable for line search:
\[
(p_k^{GN})^T \nabla f_k = (p_k^{GN})^T J_k^T r_k = -(p_k^{GN})^T J_k^T J_k p_k^{GN} = -\|J_k p_k^{GN}\|^2 \le 0.
\]
The final inequality is strict unless J_k p_k^{GN} = 0, in which case by the GN equation and J_k being full rank we have 0 = J_k^T r_k = ∇f_k, i.e. x_k is a stationary point.
Interpretation of GN step

The GN equation
\[
J_k^T J_k p_k^{GN} = -J_k^T r_k
\]
is exactly the normal equation for the linear least squares problem
\[
\min_p \; \frac{1}{2} \|J_k p + r_k\|^2.
\]
Hence, we can find the GN direction p_k^{GN} by solving this linear least squares problem using any of the techniques discussed before.

We can view the GN equation as obtained from a linear model for the vector function, r(x_k + p) ≈ r_k + J_k p,
\[
f(x_k + p) = \frac{1}{2} \|r(x_k + p)\|^2 \approx \frac{1}{2} \|J_k p + r_k\|^2,
\]
and p_k^{GN} = \arg\min_p \frac{1}{2} \|J_k p + r_k\|^2.
Global convergence of GN

The global convergence is a consequence of the convergence theorem for line search methods [Zoutendijk].

To satisfy the conditions of Zoutendijk's theorem, we need to make the following assumptions:
each r_j is Lipschitz continuously differentiable in a neighbourhood M of the bounded level set L = {x : f(x) ≤ f(x_0)},
J(x) satisfies the uniform full-rank condition: there is a γ > 0 such that
\[
\|J(x) z\| \ge \gamma \|z\|, \qquad \forall x \in \mathcal{M}.
\]
Then for the iterates x_k generated by the GN method with step lengths satisfying the Wolfe conditions, we have
\[
\lim_{k \to \infty} J_k^T r_k = \lim_{k \to \infty} \nabla f(x_k) = 0.
\]
As in the line search analysis, we check that the angle θ_k = ∠(p_k^{GN}, −∇f_k) is uniformly bounded away from π/2:
\[
\cos\theta_k = -\frac{(p_k^{GN})^T \nabla f_k}{\|p_k^{GN}\|\,\|\nabla f_k\|}
= \frac{\|J_k p_k^{GN}\|^2}{\|p_k^{GN}\|\,\|J_k^T J_k p_k^{GN}\|}
\ge \frac{\gamma^2 \|p_k^{GN}\|^2}{\beta^2 \|p_k^{GN}\|^2}
= \frac{\gamma^2}{\beta^2} > 0,
\]
where ‖J_k(x)‖ ≤ β < ∞ for all x ∈ L is a consequence of the boundedness of the level set L and the Lipschitz continuous differentiability of r_j, j = 1, . . . , m. Then from \sum_{k \ge 0} \cos^2\theta_k \|\nabla f_k\|^2 < \infty in Zoutendijk's theorem it follows that ∇f_k → 0.

If J_k is rank deficient for some k, the matrix J_k^T J_k in the GN equation is singular and the system has infinitely many solutions; however, cos θ_k is then not necessarily bounded away from 0.

Convergence rate of GN

The convergence of GN can be rapid if J_k^T J_k dominates the second term in the Hessian.

Similarly to the proof of the convergence rate of the Newton iteration, if x_k is sufficiently close to x^\star and J(x) satisfies the uniform full-rank condition, we have for a unit step in the GN direction (writing J^T J(x_k) := J(x_k)^T J(x_k))
\[
x_k + p_k^{GN} - x^\star = x_k - x^\star - [J^T J(x_k)]^{-1} \nabla f(x_k)
= [J^T J(x_k)]^{-1} \Big[ J^T J(x_k)(x_k - x^\star) + \underbrace{\nabla f(x^\star)}_{=0} - \nabla f(x_k) \Big].
\]
Using H(x) to denote the second term in the Hessian, it follows from Taylor's theorem that
\[
\nabla f(x_k) - \nabla f(x^\star)
= \int_0^1 J^T J(x^\star + t(x_k - x^\star))(x_k - x^\star)\, dt
+ \int_0^1 H(x^\star + t(x_k - x^\star))(x_k - x^\star)\, dt.
\]

Putting everything together, assuming Lipschitz continuity of H(·) near x^\star and using that Lipschitz continuous differentiability of the r_j implies Lipschitz continuity of J^T r(x) (and, as needed below, of J^T J(x)),
\[
\|x_k + p_k^{GN} - x^\star\|
\le \int_0^1 \|[J^T J(x_k)]^{-1} H(x^\star + t(x_k - x^\star))\|\, \|x_k - x^\star\|\, dt
+ \int_0^1 \|\underbrace{[J^T J(x_k)]^{-1}}_{\|\cdot\|\le \gamma^{-2}} \underbrace{\big(J^T J(x^\star + t(x_k - x^\star)) - J^T J(x_k)\big)}_{\|\cdot\| = O(\|x_k - x^\star\|)}\|\, \|x_k - x^\star\|\, dt
\approx \|[J^T J(x^\star)]^{-1} H(x^\star)\|\, \|x_k - x^\star\| + O(\|x_k - x^\star\|^2).
\]
Hence, if \|[J^T J(x^\star)]^{-1} H(x^\star)\| \ll 1, we can expect GN to converge quickly towards the solution x^\star. When H(x^\star) = 0, the convergence is quadratic (Newton).

When the Jacobian J(x) is large and sparse, the exact solve of the GN equation can be replaced by an inexact solve as in inexact Newton methods, but with the true Hessian ∇²f(x_k) replaced by J(x_k)^T J(x_k). The positive semidefiniteness of J(x_k)^T J(x_k) simplifies the algorithms. Instead of (preconditioned) CG, (preconditioned) LSQR should be used.

Levenberg-Marquardt (LM) method

Levenberg-Marquardt (LM) makes use of the same Hessian approximation as GN but within the framework of trust region methods. Trust region methods can cope with a (nearly) rank-deficient Hessian, which is a weakness of GN.

The constrained model problem to be solved at each iteration is
\[
\min_p \; \frac{1}{2}\|J_k p + r_k\|^2, \quad \text{subject to } \|p\| \le \Delta_k, \tag{CM-LM}
\]
where Δ_k > 0 is the trust region radius.
Note: The least squares term corresponds to the quadratic model
\[
m_k(p) = \frac{1}{2}\|r_k\|^2 + p^T J_k^T r_k + \frac{1}{2} p^T J_k^T J_k p.
\]
Solution of the constrained model problem

The solution of the constrained model problem (CM-LM) is an immediate consequence of the general result for trust region methods [More, Sorensen]:

If the solution p_k^{GN} of the GN equation lies strictly inside the trust region, i.e. ‖p_k^{GN}‖ < Δ_k, then p_k^{LM} = p_k^{GN} solves (CM-LM).

Otherwise, there is a λ > 0 such that the solution p_k^{LM} of (CM-LM) satisfies ‖p_k^{LM}‖ = Δ_k and
\[
(J_k^T J_k + \lambda I)\, p_k^{LM} = -J_k^T r_k.
\]

Note: The last equation is the normal equation of the linear least squares problem
\[
\min_p \; \frac{1}{2} \left\| \begin{bmatrix} J_k \\ \sqrt{\lambda}\, I \end{bmatrix} p + \begin{bmatrix} r_k \\ 0 \end{bmatrix} \right\|^2,
\]
which gives us a way of solving (CM-LM) without computing J_k^T J_k.
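A minimal sketch of computing the LM step for a given λ via this augmented linear least squares problem, with a consistency check against the damped normal equations; lm_step and the test data are hypothetical, and the choice of λ matching the trust-region radius Δ_k (e.g. by root finding on ‖p(λ)‖ = Δ_k) is not shown.

```python
import numpy as np

def lm_step(J, r, lam):
    """Levenberg-Marquardt step: solve min_p || [J; sqrt(lam) I] p + [r; 0] ||."""
    m, n = J.shape
    J_aug = np.vstack([J, np.sqrt(lam) * np.eye(n)])
    r_aug = np.concatenate([r, np.zeros(n)])
    p, *_ = np.linalg.lstsq(J_aug, -r_aug, rcond=None)
    return p

# Sanity check against the damped normal equations (J^T J + lam I) p = -J^T r
rng = np.random.default_rng(7)
J = rng.standard_normal((30, 4))
r = rng.standard_normal(30)
lam = 0.1
p1 = lm_step(J, r, lam)
p2 = np.linalg.solve(J.T @ J + lam * np.eye(4), -J.T @ r)
print(np.allclose(p1, p2))
```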
Global convergence LM

Global convergence is a consequence of the corresponding trust region global convergence theorem.

To satisfy the conditions of that theorem we make the following assumptions:
η ∈ (0, 1/4) (for strong convergence),
each r_j is Lipschitz continuously differentiable in a neighbourhood M of the bounded level set L = {x : f(x) ≤ f(x_0)},
the approximate solution p_k of (CM-LM) satisfies
\[
m_k(0) - m_k(p_k) \ge c_1 \|J_k^T r_k\| \min\left( \Delta_k, \frac{\|J_k^T r_k\|}{\|J_k^T J_k\|} \right)
\]
for some constant c_1 > 0, and in addition ‖p_k‖ ≤ γΔ_k for some constant γ ≥ 1.

We then have that
\[
\lim_{k \to \infty} \nabla f_k = \lim_{k \to \infty} J_k^T r_k = 0.
\]
As for trust region methods, there is no need to evaluate the right-hand side of the decrease condition; it is sufficient to ensure at least the reduction achieved by the Cauchy point, which can be calculated inexpensively. If the iterative CG-Steihaug approach is used, this is guaranteed.

The local convergence of LM is similar to that of GN. Near a solution x^\star at which the first term J(x^\star)^T J(x^\star) of the Hessian ∇²f(x^\star) dominates the second term, the trust region becomes inactive and the algorithm takes GN steps, giving rapid local convergence.
References

[Paige, Saunders ’82] Website with papers and codes: https://web.stanford.edu/group/SOL/software/lsqr/

[Bjorck ’96] A. Bjorck, "Numerical Methods for Least Squares Problems", SIAM, 1996.

[Golub, Kahan ’65] G. H. Golub and W. Kahan, "Calculating the singular values and pseudoinverse of a matrix", SIAM J. Numer. Anal., 2, 1965.

[Elden ’77] L. Elden, "Algorithms for the regularization of ill-conditioned least squares problems", BIT, 17, 1977.

[Arridge, B, Harhanen ’14] S. R. Arridge, M. M. Betcke and L. Harhanen, "Iterated preconditioned LSQR method for inverse problems on unstructured grids", Inverse Problems, 30(7), 2014.

[Hansen ’98] P. C. Hansen, "Rank-Deficient and Discrete Ill-Posed Problems", SIAM, 1998.