Numerical Methods & Scientific Computing: lecture notes
Data fitting
Week 9: aim to cover
Nonlinear least squares, steepest descent, Gauss-Newton (Lecture 17)
QR factorization; Nonlinear model fitting (Lab 9)
Levenberg-Marquardt algorithm (as trust region method) (Lecture 18)
Fitting nonlinear models
Recall, we could fit linear models
\[ y \sim \sum_k x_k \phi_k(t) \]
to data $\{(T_i, Y_i)\}$, $i = 1 \ldots m$, by finding the least squares solution to an overdetermined linear system
\[ Ax = b \]
where $x$ are the parameters or coefficients in the model, $A_{ij} = \phi_j(T_i)$ and $b_i = Y_i$.
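As a quick illustration (my own sketch, not part of the original notes), the linear case can be set up directly in NumPy: the design matrix is built from assumed basis functions $\phi_1(t) = 1$, $\phi_2(t) = t$ and synthetic data, and the coefficients come from a least squares solver.

    import numpy as np

    # Synthetic data (T_i, Y_i) and basis functions phi_j -- illustrative
    # assumptions only, not values taken from the notes.
    rng = np.random.default_rng(0)
    T = np.linspace(0.0, 1.0, 20)
    Y = 1.0 + 2.0 * T + 0.1 * rng.standard_normal(T.size)

    phis = [lambda t: np.ones_like(t), lambda t: t]   # phi_1(t) = 1, phi_2(t) = t

    A = np.column_stack([phi(T) for phi in phis])     # A_ij = phi_j(T_i), an m x n matrix
    b = Y                                             # b_i = Y_i

    x, *_ = np.linalg.lstsq(A, b, rcond=None)         # least squares solution of Ax = b
    print(x)                                          # fitted coefficients, close to (1, 2)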
If the model is nonlinear in (some of) the parameters, we instead have a nonlinear least squares problem.
Example: fitting exponentials
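As a sketch of what such a model looks like in code (the exact example from the lecture is not reproduced here, and the single-exponential form $y \approx x_1 e^{x_2 t}$ is my own assumption), note that $x_2$ enters nonlinearly, so the fit cannot be posed as $Ax = b$:

    import numpy as np

    def residual(x, T, Y):
        # r_i(x) = Y_i - x_1 * exp(x_2 * T_i); the model form is an
        # illustrative assumption, not the lecture's exact example.
        return Y - x[0] * np.exp(x[1] * T)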
The cost function
In nonlinear models the parameters $x_j$ appear in the residual of the model at each data point
\[ r_i(x) = Y_i - \sum_k \phi_k(T_i, x). \]
We write the residual vector as
\[ R(x) = (r_1(x), r_2(x), \ldots, r_m(x))^T \in \mathbb{R}^m \]
with $x \in \mathbb{R}^n$, i.e. $n$ parameters and $m$ data points as before.
As before we aim to minimize the 2-norm of the residual, so our cost function to minimize is
\[ C(x) = \tfrac{1}{2}\,\|R(x)\|_2^2 = \tfrac{1}{2}\,R(x)^T R(x). \]
The factor of 1/2 is to simplify later equations.
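In code the cost function is a one-liner, assuming a residual routine like the sketch above:

    def cost(x, T, Y):
        R = residual(x, T, Y)     # residual vector R(x) in R^m
        return 0.5 * (R @ R)      # C(x) = (1/2) R(x)^T R(x)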
Nonlinear least squares
We could aim to minimize C(x) using any of the standard methods of nonlinear optimization [see MAST30013] but we instead describe methods that take advantage of the special form of C(x) in the least squares case.
In some sense, nonlinear least squares problems are intermediate between nonlinear systems of equations and nonlinear optimization. It won’t be surprising that we need iterative methods. In particular, we solve a sequence of linear least squares problems.
As with root-finding, there are optimization methods that use only function values and methods that also use gradient information. Newton's method, which uses second derivative information, is regarded as too expensive for real problems. We describe three methods that use gradient information.
Steepest descent
The simplest method using gradient information uses the fact that the direction of steepest descent of a function C of several variables is
\[ d = -\nabla C(x) \in \mathbb{R}^n. \]
For the NLSQ case, we have
\[ \nabla C(x) = J(x)^T R(x), \]
where
\[ J_{ij} = \frac{\partial r_i}{\partial x_j} \]
is the Jacobian, an $m \times n$ matrix.
Since a necessary condition for a minimum is $\nabla C(x) = 0$, we have the condition
\[ J(x^*)^T R(x^*) = 0 \]
at a minimum $x^*$.
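A minimal sketch of forming $\nabla C = J^T R$, using a forward-difference Jacobian; the helper names (and the residual routine they call) are assumptions carried over from the sketches above:

    import numpy as np

    def jacobian_fd(x, T, Y, h=1e-7):
        # J_ij = d r_i / d x_j, approximated by forward differences
        R0 = residual(x, T, Y)
        J = np.empty((R0.size, x.size))
        for j in range(x.size):
            xp = x.copy()
            xp[j] += h
            J[:, j] = (residual(xp, T, Y) - R0) / h
        return J

    def grad_cost(x, T, Y):
        # gradient of C(x) = (1/2) ||R(x)||^2 is J(x)^T R(x)
        return jacobian_fd(x, T, Y).T @ residual(x, T, Y)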
Line search
This suggests an algorithm
start from an initial guess x_0
until termination criterion is met
    compute descent direction d_k at current position x_c
    x_+ = x_c + a_k d_k
        [step a distance a_k along direction d_k to get trial position]
    test trial position for sufficient decrease etc.
    if OK, accept x_+ and current a_k; x_c := x_+
    else adjust a_k and repeat the attempted step [backtrack]
end
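A runnable version of this loop for steepest descent, with a standard backtracking (sufficient decrease) test; the constant 1e-4 and the halving factor are conventional choices, not prescribed by the notes:

    import numpy as np

    def steepest_descent(cost, grad, x0, tol=1e-6, max_iter=10000):
        x_c = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x_c)
            if np.linalg.norm(g) < tol:              # termination criterion
                break
            d_k = -g                                 # steepest descent direction
            a_k = 1.0
            # backtrack until the sufficient decrease condition holds
            while cost(x_c + a_k * d_k) > cost(x_c) - 1e-4 * a_k * (g @ g):
                a_k *= 0.5
                if a_k < 1e-14:
                    return x_c                       # no acceptable step found
            x_c = x_c + a_k * d_k                    # accept trial position
        return x_c

Here cost and grad take only x; in the least squares setting one would wrap the earlier sketches, e.g. lambda x: cost(x, T, Y).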
Global convergence
Under suitable (reasonable) assumptions, this algorithm is globally convergent to a local minimizer. That does not mean that it finds the global minimum — that is much harder to achieve.
However, this algorithm sometimes struggles in the latter stages of the iteration and is generally slow (linearly convergent).
We would like a method that converges faster once we’ve gotten close to the local minimizer.
Think Newton’s method!
Example: the Rosenbrock function
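For concreteness, a sketch of the Rosenbrock function and its gradient, fed to the steepest_descent routine sketched earlier (my own illustration; the lecture's actual demonstration may differ):

    import numpy as np

    def rosenbrock(x):
        # f(x1, x2) = (1 - x1)^2 + 100 (x2 - x1^2)^2, minimum at (1, 1)
        return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

    def rosenbrock_grad(x):
        return np.array([
            -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
            200 * (x[1] - x[0]**2),
        ])

    x = steepest_descent(rosenbrock, rosenbrock_grad, [-1.2, 1.0], max_iter=5000)
    print(x)   # creeps slowly along the curved valley towards (1, 1)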
Gauss-Newton method
For large problems, Newton’s method (which requires second derivative information at each step) is regarded as too expensive. For the NLSQ problem, a related method is used to get faster convergence near the minimizer.
If we linearize the residual vector about the current guess $x_c$, we get the local linear model
\[ m(x) = R(x_c) + J(x_c)(x - x_c) = R(x_c) + J(x_c)\,s \approx 0, \]
and we seek the least squares solution of this overdetermined system,
i.e. we solve a sequence of linear least squares problems
\[ J(x_c)\,s = -R(x_c); \qquad x_+ = x_c + s. \]
This is called a Gauss-Newton step.
The corresponding normal equations are
\[ J(x_c)^T J(x_c)\,s = -J(x_c)^T R(x_c); \qquad x_+ = x_c + s. \]
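A minimal Gauss-Newton iteration, solving each subproblem $J s = -R$ with numpy.linalg.lstsq; the residual and jacobian arguments are assumed to be routines like the earlier sketches, wrapped so they take only x:

    import numpy as np

    def gauss_newton(residual, jacobian, x0, tol=1e-8, max_iter=50):
        x_c = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            R = residual(x_c)
            J = jacobian(x_c)
            s, *_ = np.linalg.lstsq(J, -R, rcond=None)   # Gauss-Newton step: J s = -R
            x_c = x_c + s                                # x_+ = x_c + s
            if np.linalg.norm(s) < tol * (1 + np.linalg.norm(x_c)):
                break
        return x_c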
An example
Newton step
Alternatively, we could minimize the local quadratic model of the cost function
\[ \min_s\; m(s) = C(x_c) + \nabla C(x_c)^T s + \tfrac{1}{2}\, s^T \nabla^2 C(x_c)\, s, \]
where the necessary condition $\nabla m(s) = 0$ gives
\[ \nabla^2 C(x_c)\, s = -\nabla C(x_c). \]
Recall the cost function is $C(x) = \tfrac{1}{2} R(x)^T R(x)$, so
\[ \nabla C(x_c) = J(x_c)^T R(x_c) \]
and
\[ \nabla^2 C(x_c) = J(x_c)^T J(x_c) + \sum_{i=1}^m r_i(x_c)\, \nabla^2 r_i(x_c), \]
giving
\[ \Big[ J(x_c)^T J(x_c) + \sum_{i=1}^m r_i(x_c)\, \nabla^2 r_i(x_c) \Big] s = -J(x_c)^T R(x_c); \qquad x_+ = x_c + s. \]
This Newton step is NOT the same as a Gauss-Newton step.
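For contrast with the Gauss-Newton sketch, here is a Newton step that forms $\nabla^2 C$ by differencing the gradient (purely illustrative; forming the full Hessian is exactly the cost the Gauss-Newton step avoids). Pass e.g. lambda x: grad_cost(x, T, Y), using the earlier sketched helper, as grad.

    import numpy as np

    def newton_step(grad, x_c, h=1e-5):
        # Hessian of C by central differences of the gradient
        n = x_c.size
        H = np.empty((n, n))
        for j in range(n):
            e = np.zeros(n)
            e[j] = h
            H[:, j] = (grad(x_c + e) - grad(x_c - e)) / (2 * h)
        s = np.linalg.solve(H, -grad(x_c))   # (nabla^2 C) s = -nabla C(x_c)
        return x_c + s                       # x_+ = x_c + s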
Connection with Gauss-Newton
The second term in $\nabla^2 C$, i.e. $\sum_{i=1}^m r_i(x_c)\,\nabla^2 r_i(x_c)$, requires the evaluation of $m$ Hessians, each an $n \times n$ matrix, and so is expensive. This term is missing from the Gauss-Newton step, which requires only first derivative information $J$.
If the problem is a zero residual problem, i.e. $R(x^*) = 0$, then the second term vanishes close enough to the minimum, so we might expect the second term to be negligible for small residual problems, or for models that are nearly linear, i.e. $\nabla^2 r_i$ small.
So we could regard the Gauss-Newton step as an approximate Newton step, using the dominant (and cheaper) parts of the Newton step.
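One way to see the approximation numerically (a sketch reusing the x, T, Y data and the jacobian_fd and grad_cost helpers assumed above): the difference between a finite-difference Hessian of C and $J^T J$ estimates the neglected term, and it shrinks as the residuals shrink.

    # Compare the Gauss-Newton matrix J^T J with the full Hessian of C at a
    # point x; their difference estimates sum_i r_i(x) nabla^2 r_i(x).
    J = jacobian_fd(x, T, Y)
    gn_matrix = J.T @ J                      # Gauss-Newton approximation

    n = x.size
    full_hessian = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = 1e-5
        full_hessian[:, j] = (grad_cost(x + e, T, Y) - grad_cost(x - e, T, Y)) / 2e-5

    print(np.linalg.norm(full_hessian - gn_matrix))   # small for small residuals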
Local convergence
We know Newton’s method is quadratically convergent, once we are close enough to the minimum. What about the Gauss-Newton step?
Under standard assumptions, one can prove:
Theorem: there exist $K, \delta > 0$ such that if $x_c \in B(\delta)$, then
\[ \|e_+\| \le K \big( \|e_c\|^2 + \|R(x^*)\|\,\|e_c\| \big), \]
where $e_c = x_c - x^*$ and $e_+ = x_+ - x^*$ denote the errors.
Consequences:
1 for a zero residual problem, Gauss-Newton is quadratically convergent
2 for a small residual problem, or close to the minimum, Gauss-Newton is linearly convergent
\[ \|e_+\| \le r\,\|e_c\|, \qquad r < 1, \]
and typically faster than steepest descent
3 for a large residual problem, or far from the minimum, there is no reason to expect Gauss-Newton to converge
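A quick empirical check of consequence 1 (a sketch reusing the residual, jacobian_fd and gauss_newton helpers assumed above, with data generated exactly from the model so that $R(x^*) = 0$):

    # Zero-residual problem: errors from one Gauss-Newton step at a time
    # should decay roughly quadratically until finite-difference error in
    # the Jacobian dominates.
    T = np.linspace(0.0, 2.0, 30)
    x_true = np.array([2.0, -1.5])
    Y = x_true[0] * np.exp(x_true[1] * T)    # exact data, so R(x*) = 0

    x_c = np.array([1.0, -1.0])
    errs = []
    for _ in range(6):
        x_c = gauss_newton(lambda x: residual(x, T, Y),
                           lambda x: jacobian_fd(x, T, Y),
                           x_c, max_iter=1)  # a single Gauss-Newton step
        errs.append(np.linalg.norm(x_c - x_true))
    print(errs)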
End of Lecture 17