Linear Least-Squares Problems
Goals of this chapter
• Introduce and solve the linear least-squares problem, ubiquitous in data fitting applications.
• Introduce algorithms based on orthogonal transformations.
• Evaluate different algorithms and understand what their basic features translate into in terms of a tradeoff between stability and efficiency.
• Introduce SVD use for rank-deficient and highly ill-conditioned problems.
• Consider weighted least-squares and present the highly popular Kalman filter as a case study.
This chapter is concerned with methods for solving the optimization problem30
min_x ∥b − Ax∥₂,
where the dimensions of the real matrix and vectors are
A ∈ R^{m×n},   x ∈ R^n,   b ∈ R^m,   m ≥ n.
We assume that A has full column rank, i.e., rank(A) = n: in other words, the n columns of A are linearly independent vectors in Rm. Notice that in the overdetermined case, m > n, there is typically no x satisfying Ax = b exactly, even in the absence of roundoff error.
The least-squares problem arises often in many diverse application fields, especially where data fitting is required; recall Example 3.18. Instances arise in machine learning, computer vision, and computer graphics applications, to name but a few. In computer vision people may want to match local invariant features of cluttered images under arbitrary rotations, scalings,
30 The name of the problem arises from squaring the term to be minimized, namely solving min_x ∥b − Ax∥₂², which has the same solution and amounts to finding the least squares.
change of brightness and contrast, and so on. Such actions are parametrized and the best values for the parameters are then sought to match the data. In computer graphics a parametrization of a surface mesh in three dimensions may be sought that yields a morphing of one animal into another, say. In such applications the question of just how to parametrize, or how to generate an efficient predictive model A in the present terminology, is often the crux of the matter and is far from trivial. But once the method of parametrization is determined, what follows is often an instance of the problem considered here. Other applications seek to find an approximating function v(t, x) depending on a continuous variable t that fits data pairs (ti, bi), i = 1, . . . , m, as in Example 3.18. We will apply our solution techniques to such instances in this chapter.
In Section 6.1 we formulate an n × n system of linear equations called the normal equations that yield the solution for the stated minimization problem. Solving the normal equations is a fast and straightforward approach for solving linear least-squares problems. However, the resulting algorithm is not as stable as possible. Orthogonal transformations provide a more stable way to proceed, and this is described in Section 6.2. Algorithms to actually carry out the necessary QR decomposition are described in Section 6.3.
What if the assumption that A has full column rank is violated, i.e. rank(A) < n? Just as when attempting to solve a square linear system with a singular matrix A, the solution for our least-squares optimization problem is not unique in that case. A sensible choice from the class of best solutions is then required. Clouding the issue further are situations where A is “almost” rank-deficient. Methods based on using SVD (see Section 3.4) for solving least-squares problems are then appropriate and are discussed in Section 6.4. Often this boils down to using a method called truncated SVD, which derives and solves a well-defined nearby problem.
Problems often arise in practice where optimization with respect to some “energy” is desired. The minimization problem posed at the beginning of this chapter is then replaced by the weighted least-squares problem

min_x (b − Ax)ᵀ W (b − Ax),

for a given symmetric positive definite m × m weight matrix W. The usual least-squares problem is recovered for W = I, the identity. The more general case is considered in Section 6.5.

6.1 Least-squares problems and the normal equations
In this section we first solve the least-squares minimization problem, obtaining the normal equa- tions. Then we show examples related to data fitting.
The space range(A) spanned by the columns of our m × n matrix A (i.e., all vectors z of the form z = Ay) is generally of dimension at most n, and we further assume that the dimension equals n, so that A has a full column rank. In other words, we assume that its columns are linearly independent. Notice also that in the overdetermined case, m > n, b generally does not lie in the range space of A.
Deriving the normal equations
Let us rewrite the problem that we aim to solve as31
min_x ψ(x),   where ψ(x) = (1/2)∥b − Ax∥².
31In this chapter, as in other Chapters that deal with linear algebra problems, we use x to denote both the variable of a function such as ψ(x) and its minimizer. This is in contrast to Chapters 2, 9 and 10, where a solution to a nonlinear problem in x is denoted by x∗. The confusion is probably greatest right here, where the problem considered is that of minimizing a function. However, the techniques used in this chapter are different from those used for general nonlinear optimization problems, and hopefully the slight inconsistency here is harmless.
We have squared the normed expression, thus getting rid of the square root sign in the definition of the norm,32 and multiplied it by 1/2: this will not change the minimizer, in the sense that the same solution coefficients xj will be obtained. Notice that we have dropped the subscript 2 in the norm notation: there is no other norm here to get confused by.
Finally, we define the residual vector by
r = b − Ax.

Writing these matrix and vectors explicitly we have

x = ( x_1, …, x_n )ᵀ,   b = ( b_1, …, b_m )ᵀ,   r = ( r_1, …, r_m )ᵀ,   and

A = [ a_{1,1}  ···  a_{1,n}
         ⋮              ⋮
      a_{m,1}  ···  a_{m,n} ].
Note that the matrix A is m×n, with m > n and perhaps m ≫ n, so it is “tall and skinny”, and correspondingly we do not expect r to vanish at the optimum. See Figure 6.1.
Figure 6.1. Matrices and vectors and their dimensions.
We have a minimization problem for a smooth scalar function in several variables, given by
ψ(x) = (1/2)∥r∥².
The necessary conditions for a minimum may be viewed as an extension to those in Section 2.6
and are obtained by setting the derivatives of ψ with respect to each unknown x_k to zero. This gives the n equations

∂ψ(x)/∂x_k = 0,   k = 1, …, n.

32 If you are not sure you remember everything you need to remember about norms, Section 3.2 may come to your aid.

Example 6.1. Consider the toy problem defined by

A = [ 1  0
      1  1
      1  2 ],    b = ( 0.1, 0.9, 2.0 )ᵀ.

Then

r = b − Ax = ( 0.1 − x_1,  0.9 − x_1 − x_2,  2.0 − x_1 − 2x_2 )ᵀ

and ψ(x) = (1/2)∥r∥² = (1/2)[ (0.1 − x_1)² + (0.9 − x_1 − x_2)² + (2.0 − x_1 − 2x_2)² ]. Taking the derivatives of ψ(x) with respect to x_1 and x_2 and equating each to zero, we obtain two linear equations in component form, which we write as

(−1)∗(0.1 − x_1) + (−1)∗(0.9 − x_1 − x_2) + (−1)∗(2.0 − x_1 − 2x_2) = 0,
 (0)∗(0.1 − x_1) + (−1)∗(0.9 − x_1 − x_2) + (−2)∗(2.0 − x_1 − 2x_2) = 0.
But now, note that we can write these two equations in a pleasingly concise matrix-vector notation as −Aᵀr = 0. Furthermore, this vector equality can be immediately rearranged to read

AᵀAx = Aᵀb.
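To make Example 6.1 concrete on a computer, here is a minimal MATLAB sketch (added for illustration; it is not part of the original example, and the variable names are arbitrary) that forms and solves the normal equations for the toy data above:

A = [1 0; 1 1; 1 2];     % the 3-by-2 matrix of Example 6.1
b = [0.1; 0.9; 2.0];
B = A'*A;  y = A'*b;     % form the normal equations A'*A*x = A'*b
x = B \ y;               % x comes out as (0.05, 0.95)'
r = b - A*x;             % residual; A'*r is zero up to roundoff
disp(x');  disp((A'*r)');

The computed solution agrees with the values obtained from the explicit regression formula for these same data in Example 6.3 below.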
Continuing with the general case, since

ψ(x) = (1/2)∥r∥² = (1/2) Σ_{i=1}^{m} ( b_i − Σ_{j=1}^{n} a_{i,j} x_j )²,

the conditions for a minimum yield

∂ψ(x)/∂x_k = Σ_{i=1}^{m} ( b_i − Σ_{j=1}^{n} a_{i,j} x_j ) (−a_{i,k}) = 0

for k = 1, 2, …, n. The latter equality can be rewritten as

Σ_{i=1}^{m} Σ_{j=1}^{n} a_{i,k} a_{i,j} x_j = Σ_{i=1}^{m} a_{i,k} b_i   for k = 1, …, n.
In matrix-vector form this expression looks much simpler; as in Example 6.1, it reads
AᵀAx = Aᵀb.
This system of n linear equations in n unknowns is called the normal equations. Note that B = AT A can be much smaller in size than A; see Figure 6.1. The matrix B is symmetric positive definite given that A has full column rank; see Section 3.3.
Theorem: Least Squares.
The least-squares problem
min_x ∥Ax − b∥₂,
where A has full column rank, has a unique solution that satisfies the normal equations
AᵀAx = Aᵀb.
Least squares solution uniqueness
Is the solution of the normal equations really a minimizer (and not, say, a maximizer) of the least-squares problem? And if so, is it the global minimum? The Least Squares Theorem in the framed box says it is.
To see this, note first that our objective function is a quadratic in n variables. Indeed, we can write
ψ(x) = (1/2)(b − Ax)ᵀ(b − Ax) = (1/2) xᵀBx − bᵀAx + (1/2)∥b∥².
Expanding this quadratic function in a multivariate Taylor series (see page 326) around the alleged minimizer (i.e., the solution of the normal equations, denoted x) amounts to the usual Taylor expansion in each variable: for any nonzero vector increment ξ we write
ψ(x + ξ) = ψ(x) + Σ_{j=1}^{n} ξ_j ∂ψ/∂x_j (x) + (1/2) Σ_{j=1}^{n} Σ_{k=1}^{n} ξ_j ξ_k ∂²ψ/∂x_j ∂x_k (x).
Hence, using the fact that B is positive definite to write in vector notation that ξᵀBξ > 0, we obtain

ψ(x + ξ) = ψ(x) + (1/2) ξᵀBξ > ψ(x),

completing the proof of the theorem.
Solving via the normal equations
Let us emphasize again that x, which has been our vector argument above, is now specified as the solution for the normal equations, which is indeed the solution of the least-squares minimization problem.
Furthermore, if A has full column rank, then B = AT A is symmetric positive definite. The least-squares problem has therefore been reduced, at least in principle, to that of solving an n × n system of linear equations of the form described in Section 4.4, using the Cholesky decomposition.
The beauty does not end here, geometrically speaking. Notice that for the corresponding residual at optimum we have
Aᵀ(b − Ax) = Aᵀr = 0.
Hence the least-squares solution is characterized by a residual that is orthogonal to the column space of A.33 Figure 6.2 illustrates the projection.
Figure 6.2. Discrete least-squares approximation.
Our solution is given by x = (AᵀA)⁻¹Aᵀb. The matrix multiplying b is important enough to have a name and special notation: it is called the pseudo-inverse of A, denoted by

A† = (AᵀA)⁻¹Aᵀ.
Notice that A†A = I, the n × n identity. However, the m × m matrix P = AA† is not equal to an identity matrix. Rather, it is an orthogonal projector, i.e., it is a symmetric matrix satisfying P 2 = P . Consult Figure 6.2 again and see Exercises 3.18 and 6.15.
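These identities are easy to check numerically. The following MATLAB sketch (an added illustration, not from the text) verifies them for the matrix of Example 6.1; MATLAB's built-in pinv computes the same pseudo-inverse via the SVD.

A = [1 0; 1 1; 1 2];           % full column rank, m = 3, n = 2
Adag = (A'*A) \ A';            % pseudo-inverse (A'*A)^{-1}*A'
disp(norm(Adag*A - eye(2)))    % A†A = I: prints ~0
P = A*Adag;                    % orthogonal projector onto range(A)
disp(norm(P - P'))             % P is symmetric: prints ~0
disp(norm(P*P - P))            % P is idempotent, P^2 = P: prints ~0
disp(norm(Adag - pinv(A)))     % agrees with MATLAB's pinv: prints ~0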
The solution via the normal equations using direct solvers amounts to the four steps laid out in the algorithm given below.
Algorithm: Least Squares via Normal Equations.
1. Form B = AᵀA and y = Aᵀb.
2. Compute the Cholesky factor of B, i.e., the lower triangular matrix G satisfying B = GGᵀ.
3. Solve the lower triangular system Gz = y for z.
4. Solve the upper triangular system Gᵀx = z for x.
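A minimal MATLAB sketch of these four steps follows (added for illustration; the function name lsq_normal is ours, not the book's). Note that MATLAB's chol returns an upper triangular factor R with B = RᵀR, so the lower triangular factor is G = Rᵀ.

function x = lsq_normal(A, b)
% Solve min ||b - A*x||_2 via the normal equations and Cholesky,
% assuming A has full column rank.
B = A'*A;  y = A'*b;    % step 1: form B and y
R = chol(B);            % step 2: B = R'*R, i.e., G = R'
z = R' \ y;             % step 3: solve the lower triangular system G*z = y
x = R \ z;              % step 4: solve the upper triangular system G'*x = z
end

For instance, lsq_normal([1 0; 1 1; 1 2], [0.1; 0.9; 2.0]) returns the solution (0.05, 0.95)ᵀ of Example 6.1.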
Efficiency of the normal equations-based least-squares algorithm
The overall computational work for the algorithm is dominated by its first step. It is approximately mn² floating point operations (flops) for that step: the cost of n(n + 1)/2 inner products of each pair of the m-vectors making up the columns of A. The cost of the remaining steps follows from our discussion in Chapter 4. For step 2 the cost is n³/3 + O(n²) flops, while steps 3 and 4 cost O(n²) flops. This is another case where despite the fact that operation counts are in general unsatisfactory for measuring true performance, they do deliver the essential result that the main cost here, especially when m ≫ n, is in forming the matrix B = AᵀA.

33 The term ‘normal’ typically refers in a geometric context to a line that is perpendicular to a surface. Accordingly, the name ‘normal equations’ arises from the orthogonality of the residual to the column space of A.
Example 6.2. Let’s see some more numbers in a slightly larger example than Example 6.1. Choose m = 5, n = 3, and specify
A = [  1   0   1
       2   3   5
       5   3  −2
       3   5   4
      −1   6   3 ],    b = ( 4, −2, 5, −2, 1 )ᵀ.
We wish to solve the least-squares problem minx ∥b − Ax∥. Following the normal equations algorithm we calculate
B = AᵀA = [ 40  30  10
            30  79  47
            10  47  55 ],    y = Aᵀb = ( 18, 5, −21 )ᵀ.
Now solving the square system Bx = y via the Cholesky decomposition and forward and backward substitution yields the final result x = (0.35, 0.40, −0.79)ᵀ, correct to the number of digits shown.
Here are more numbers. The optimal residual (rounded) is
r = b − Ax = ( 4.44, 0.04, 0.59, −1.89, 1.31 )ᵀ.
This vector is orthogonal to each column of A. Finally, the pseudo-inverse of A (rounded) is
A† = B⁻¹Aᵀ = [  0.05   0.07   0.10   0.05  −0.13
               −0.05  −0.07   0.06   0.01   0.16
                0.05   0.14  −0.11   0.05  −0.06 ].
As with the inverse of a square nonsingular matrix, we do not recommend explicitly calculating such a matrix, and will not do so again in this book.
Please see if you can write a short MATLAB script verifying all these calculations.

Data fitting
Generally, data fitting problems arise as follows. We have observed data b and a model function that for any candidate model x provides predicted data. The task is to find x such that the predicted data match the observed data to the extent possible, by minimizing their difference in the least-squares sense. In the linear case which we study here, the predicted data are given by Ax. In this context the assumption that A has full column rank does not impose a serious restriction: it just implies that there is no redundancy in the representation of the predicted data, so that for any given vector x ∈ R^n there is no other vector x̃ ∈ R^n such that Ax̃ = Ax.
Example 6.3 (Linear regression). Consider fitting a given data set of m pairs (ti , bi ) by a straight line. Thus, we want to find the coefficients x1 and x2 of
v(t) = x1 + x2t,
such that v(ti) ≈ bi, i = 1,…,m. So n = 2 here, and
A = [ 1  t_1
      1  t_2
      ⋮   ⋮
      1  t_m ].
The components of the normal equations are

B_{1,1} = Σ_{i=1}^{m} 1 = m,   B_{1,2} = Σ_{i=1}^{m} 1·t_i = Σ_{i=1}^{m} t_i,   B_{2,1} = B_{1,2},   B_{2,2} = Σ_{i=1}^{m} t_i².

This leads to a system of just two linear equations given by

m x_1 + ( Σ_{i=1}^{m} t_i ) x_2 = Σ_{i=1}^{m} b_i,
( Σ_{i=1}^{m} t_i ) x_1 + ( Σ_{i=1}^{m} t_i² ) x_2 = Σ_{i=1}^{m} t_i b_i.

The solution is written explicitly as the following formula:

x_1 = ( Σ_{i=1}^{m} t_i² Σ_{i=1}^{m} b_i − Σ_{i=1}^{m} t_i b_i Σ_{i=1}^{m} t_i ) / ( m Σ_{i=1}^{m} t_i² − (Σ_{i=1}^{m} t_i)² ),
x_2 = ( m Σ_{i=1}^{m} t_i b_i − Σ_{i=1}^{m} t_i Σ_{i=1}^{m} b_i ) / ( m Σ_{i=1}^{m} t_i² − (Σ_{i=1}^{m} t_i)² ).

This formula is extensively used in statistics and data science.

For readers who like to see examples with numbers, consider the data (used also in Example 6.1):

t_i : 0.0  1.0  2.0
b_i : 0.1  0.9  2.0

Then m = 3, Σ_{i=1}^{m} t_i = 0.0 + 1.0 + 2.0 = 3.0, Σ_{i=1}^{m} b_i = 0.1 + 0.9 + 2.0 = 3.0, Σ_{i=1}^{m} t_i² = 5.0, and Σ_{i=1}^{m} t_i b_i = 0.9 + 4.0 = 4.9. This yields (please verify) the solution x_1 = 0.05, x_2 = 0.95.
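The same numbers can be reproduced with a few lines of MATLAB (an added sketch, not part of the example):

t = [0.0; 1.0; 2.0];  b = [0.1; 0.9; 2.0];  m = length(t);
d  = m*sum(t.^2) - sum(t)^2;                     % common denominator, = 6
x1 = (sum(t.^2)*sum(b) - sum(t.*b)*sum(t)) / d   % prints 0.05
x2 = (m*sum(t.*b) - sum(t)*sum(b)) / d           % prints 0.95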
One purpose of this example is to convince you of the beauty of matrix-vector notation and concise computer programs as an alternative to the above exposition.
A more usual regression curve would of course be constructed for many more data points. See Figure 6.3 for a plot of a regression curve obtained with these formulas where m = 25. Here ti = i < ti+1 denotes years since 1990, and bi is the average deviation in temperature from one degree Celsius in a certain secret location over a certain period of time. Although it does not always hold that bi < bi+1, the straight line fit has a positive slope, which indicates a clear warming trend.
Figure 6.3. Linear regression curve (in blue) through green data. Here, m = 25, n = 2.
Polynomial data fitting
Extending the linear regression formulas to a higher degree polynomial fit, v(t) ≡ p_{n−1}(t) = x_1 + x_2 t + ··· + x_n t^{n−1}, is straightforward too. Writing for each data point v(t_i) = x_1 + x_2 t_i + ··· + x_n t_i^{n−1} = (1, t_i, …, t_i^{n−1}) x, the matrix A is the extension of the Vandermonde matrix encountered in Section 3.5, given by

A = [ 1  t_1      t_1²      ···  t_1^{n−1}
      1  t_2      t_2²      ···  t_2^{n−1}
      ⋮   ⋮          ⋮               ⋮
      1  t_{m−1}  t_{m−1}²  ···  t_{m−1}^{n−1}
      1  t_m      t_m²      ···  t_m^{n−1} ].
Note that the structure of the matrix A depends on our choice of the basis functions used to describe polynomials. More on this is provided in Chapter 11. For now let us just say that the simple basis used here, called monomial basis, is often good only for polynomials of a really low degree. But then, these are the ones we may want in the present context anyway.
Here is a MATLAB function for best approximation by low order polynomials, using the normal equations.
function coefs = lsfit (t, b, n)
% function coefs = lsfit (t, b, n)
% Construct coefficients of the polynomial of
% degree at most n-1 that best fits data (t,b)
t = t(:); b = b(:);   % make sure t and b are column vectors
m = length(t);
% tall and skinny A
A = ones(m,n);
for j = 1:n-1
  A(:,j+1) = A(:,j).*t;
end
% normal equations and solution
B = A'*A; y = A'*b;
coefs = B\y;
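As a quick check (added here, not in the original), lsfit with n = 2 reproduces the straight-line fit of Example 6.3:

coefs = lsfit([0 1 2], [0.1 0.9 2.0], 2)   % returns approximately (0.05, 0.95)'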
Example 6.4. Sample the function f(t) = cos(2πt) + 10(t − .5)⁵ at 21 equidistant points on the interval [0, 1] and construct best fits by polynomials of degree at most n − 1, for each n = 1, 2, 3, 4, 5.
Following is an appropriate MATLAB script.
m = 21;                              % 21 equidistant sample points
tt = 0:1/(m-1):1;
bb = cos(2*pi*tt) + 10*(tt-.5).^5;
% find polynomial coefficients
for n = 1:5
  coefs{n} = lsfit(tt,bb,n);
end
% Evaluate and plot
t = 0:.01:1;
z = ones(5,101);
for n = 1:5
  z(n,:) = z(n,:) * coefs{n}(n);     % evaluate p_{n-1}(t) by Horner's rule
  for j = n-1:-1:1
    z(n,:) = z(n,:).*t + coefs{n}(j);
  end
end
plot(t,z,tt,bb,'ro')
xlabel('t')
ylabel('p_{n-1}')
The resulting approximants pn−1(t) are plotted in Figure 6.4. You can see the improvement in the fitting error as the degree n − 1 increases. Only the quartic polynomial (n = 5) has enough inflexion points to chase the exact curve fairly faithfully.
Data fitting vs. interpolation
Reflecting on Examples 6.4 and 3.18 (page 94), there is a seeming paradox hidden in our arguments, namely, that we attempt to minimize the residual for a fixed n, n < m, but refuse to simply increase n until n = m. Indeed, this would drive the residual to zero, the resulting scheme being a polynomial that interpolates the data. Why not just interpolate as in Example 3.18? (Interpolation techniques are discussed at length in Chapter 11.)
Figure 6.4. The first 5 best polynomial approximations to f(t) = cos(2πt) + 10(t − .5)⁵ sampled at 0 : .05 : 1. The data values appear as red circles. Clearly, the larger n the better the polynomial p_{n−1} fits the data.
The simple answer is that choosing n is part of our modeling efforts, and the ensuing least- squares minimization problem is part of the solution process with n already fixed. But there are reasons for choosing n small in the first place. One is hinted at in Example 6.3, namely, that we are trying to find the trend in the data on a long time scale.
An additional reason for not interpolating the data values is that they may contain measurement errors. Moreover, we may want a model function that depends only on a few parameters xj for ease of manipulation, although we determine them based on all the given data. In other words, it is beneficial to avoid what is known as overfit. For a poignant example we refer the reader to Exercise 6.3.
The solution of the least-squares data fitting problem through solving the normal equations has the advantage of being straightforward and efficient. A linear least-squares solver is implemented in the MATLAB backslash operator as well as in the MATLAB command polyfit. Replacing the last three lines in our function lsfit by the line
coefs=A\b;
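As an aside (an added note, not from the text): MATLAB's polyfit returns its coefficients ordered from the highest power down, so its output must be reversed to compare with the x_1 + x_2 t + ··· + x_n t^{n−1} ordering produced by lsfit. For data vectors t and b, a degree-(n−1) fit could be checked as follows:

p = polyfit(t, b, n-1);    % built-in least-squares fit; highest-power coefficient first
coefs_pf = flipud(p(:));   % reverse to match lsfit's ordering (constant term first)
% coefs_pf should agree with lsfit(t, b, n) up to roundoff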