STAT 513/413: Lecture 5 Another bit of linear algebra
(solving equations)
Recall Lecture 4: solving equations
The theory for a linear model y ∼ Xβ may suggest obtaining the least-squares estimates via the formula b = (XᵀX)⁻¹Xᵀy
but experts in numerical computations know that it should rather be done via solving the system of equations (XᵀX)b = Xᵀy
and experts in statistical computing know that the best way is to do it via the so-called QR decomposition (the MATLAB "backslash" operator), applied directly to the matrix X rather than to XᵀX
b <- qr.solve(X, y)
Also, qr.solve can solve systems of equations Ax = y even when A is not a square matrix
Solving systems of linear equations: Example 1
This will be our Example 1:
x1 + 2x2 = 5
x1 −  x2 = 2

we know how to solve it: x1 = 3, x2 = 1
A sophisticated way to write the system: Ax = b
where
A = ( 1  2 )    x = ( x1 )    b = ( 5 )
    ( 1 −1 )        ( x2 )        ( 2 )
Subtracting the first line from the second and leaving the first one as is, we get the upper triangular form of the system

x1 + 2x2 = 5
   − 3x2 = −3

which is easy to solve: first obtain x2 = 1 from the second equation, and then substitute it into the first: x1 + 2 = 5, which yields x1 = 3
So, (upper) triangular systems are solved easily by backsolving
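Backsolving is available in base R as backsolve(); a minimal sketch on the triangular system above:

R = matrix(c(1, 0, 2, -3), nrow = 2)  # the upper triangular matrix of the system
b = c(5, -3)                          # its right-hand side
backsolve(R, b)                       # gives 3 1, that is, x1 = 3, x2 = 1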
Systems: determined, under- and overdetermined
An underdetermined system has more than one solution

x1 + 2x2 = 5      this is an underdetermined system:
2x1 + 4x2 = 10    there are not enough independent equations
Such a system cannot be solved: we can only deduce something like a relationship giving the form of all possible solutions; here it is x1 = 5 − 2x2. This is typically not a task suitable for a computer

Every system that has fewer equations than unknowns is underdetermined. Nonetheless, as we can see, even systems with enough equations (or more) may be underdetermined
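A quick illustration in R (a sketch; the exact error wording may vary across R versions): solve() refuses the matrix of the underdetermined system above

A = rbind(c(1,2), c(2,4))   # the underdetermined system's matrix
solve(A, c(5,10))           # error: the matrix is exactly singular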
Determined systems

A determined system: there is a unique solution to it

Apparently, a determined system has to have at least as many equations as unknowns. Usually it has the same number of both (like our Example 1), but it is not a rule: it can happen that there are more equations than unknowns

x1 + 2x2 = 5      this is a determined system:
x1 −  x2 = 2      the number of independent equations is just right
2x1 − 2x2 = 4     (this one is a multiple of the previous one)

This system also has a unique solution: x1 = 3 and x2 = 1
An example with more equations than unknowns
We will do this when computing the invariant probabilities of a Markov chain. The first attempt is
> P = matrix(c(0.5,0.1,0.5,0.9),2,2)
> P
     [,1] [,2]
[1,]  0.5  0.5
[2,]  0.1  0.9
# note: a transition matrix of a Markov chain
> A = diag(2) - t(P)
> solve(A,c(0,0))
Error in solve.default(A, c(0, 0)) :
  system is computationally singular: reciprocal condition number = 2
> A
     [,1] [,2]
[1,]  0.5 -0.1
[2,] -0.5  0.1
# indeed
Indeed, we are solving for probabilities p1 and p2; we should add the equation p1 + p2 = 1 (otherwise any multiple of a solution is a solution too). After that, the system is determined, but its matrix is not square
Example 2
> A=rbind(A,c(1,1))
> A
     [,1] [,2]
[1,]  0.5 -0.1
[2,] -0.5  0.1
[3,]  1.0  1.0
> solve(A,c(0,0,1))
Error in solve.default(A, c(0, 0, 1)) : ’a’ (3 x 2) must be square
> qr.solve(A,c(0,0,1))
[1] 0.1666667 0.8333333
This will be our Example 2
And now: what is this qr.solve?
Overdetermined systems
An overdetermined system has no solution

x1 + 2x2 = 5      this is an overdetermined system:
x1 −  x2 = 2      the number of independent equations is too big
x1 + 2x2 = 4
Typically, an overdetermined system has more equations than unknowns – but it is not a rule; in fact, an overdetermined system may have any number of equations, starting from two
We can see that there are no x1,x2 satisfying all the equations: so in the classical sense, we cannot solve the system.
We may, however, look for some solution satisfying the system in the approximate sense
Approximate solutions of overdetermined systems
We are looking for x such that Ax ≈ b
where it is not really clear what ≈ has to mean, but out of many possibilities we choose the mathematically and computationally convenient one: we look for x that makes ∥Ax − b∥² minimal
Making ∥Ax − b∥² minimal is equivalent to making ∥Ax − b∥ minimal, but the former,

∥u∥² = uᵀu = u₁² + u₂² + … + uₚ²

is easier to handle than the latter,

∥u∥ = √(uᵀu) = (u₁² + u₂² + … + uₚ²)^(1/2)
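Both quantities are easy to compute in R; a tiny illustration, with a hypothetical vector u:

u = c(3, 4)
sum(u^2)         # the squared norm: 25
sqrt(sum(u^2))   # the norm itself: 5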
QR decomposition (full version)
Motivation: operations performed on the left- and right-hand side of the system Ax = b can be interpreted as multiplying by certain matrices; for numerical stability, it is best if these matrices are orthogonal; a product of orthogonal matrices is again an orthogonal matrix
QR decomposition: writing A in the form A = QR
where Q is orthogonal, QᵀQ = QQᵀ = I, and R is upper triangular
There are several ways of computing the QR decomposition, connected with the names of Gram-Schmidt, Householder, Givens – we do not cover this here
How is this used for solving equations?
We start with
Ax = y
which is actually, due to the fact that A = QR
QRx = y
On multiplying both sides from the left by Qᵀ we obtain

QᵀQRx = Qᵀy,  that is,

Rx = Qᵀy
and since R is now upper triangular, we can obtain x by backsolving
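In R, the whole recipe takes a few lines; a sketch on Example 1's system (qr(), qr.Q(), qr.R() and backsolve() are all base R functions; the following slides spell these steps out):

A = matrix(c(1, 1, 2, -1), nrow = 2)   # Example 1 again
y = c(5, 2)
Q = qr.Q(qr(A))                        # the orthogonal factor
R = qr.R(qr(A))                        # the upper triangular factor
backsolve(R, t(Q) %*% y)               # backsolving Rx = Qᵀy gives 3, 1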
Economy version
If A is a square, p × p matrix, then everything is easy: both Q and R are square matrices with the same dimension p × p
But A does not have to be square: every matrix has a QR decomposition, with possible singularity reflected in R. If A is p×q, then Q has to be p×p (an orthogonal matrix must be square, otherwise QᵀQ = QQᵀ could not happen) and R is thus p×q
An important case is when p > q: then the last p−q rows of R are zeros. In such a case, it may be of interest also to consider the "economy" ("thin", "reduced") version:
A = Q̇Ṙ, where Q̇ is p×q and Ṙ is q×q; of course, then Q̇ cannot be orthogonal: it still holds true that Q̇ᵀQ̇ = I_q, but nothing like that for Q̇Q̇ᵀ
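A quick way to see the difference in R (a sketch, using the 3×2 matrix of Example 2):

A = rbind(c(0.5, -0.1), c(-0.5, 0.1), c(1, 1))
dim(qr.Q(qr(A)))                    # 3 2: economy version, R's default
dim(qr.Q(qr(A), complete=TRUE))     # 3 3: full version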
Some mathematical analysis perhaps
We can analyze the connection between the full Q and the economy Q̇ using blockwise matrix calculations. Knowing that Q̇ is a part of Q, we can write Q = (Q̇ Q̃). When we multiply Q by R, the corresponding blocks of R are Ṙ and O, the latter block containing exclusively zeros. (That follows just from the analysis of dimensions and the fact that R is upper triangular.) Writing stacked blocks as (· ; ·), we have

A = QR = (Q̇ Q̃)(Ṙ ; O) = Q̇Ṙ + Q̃O = Q̇Ṙ

We have also

I_p = QᵀQ = (Q̇ᵀ ; Q̃ᵀ)(Q̇ Q̃) = (Q̇ᵀQ̇  Q̇ᵀQ̃ ; Q̃ᵀQ̇  Q̃ᵀQ̃) = (I_q  O ; O  I_{p−q})

which shows that Q̇ᵀQ̇ = I_q. Finally

I_p = QQᵀ = (Q̇ Q̃)(Q̇ᵀ ; Q̃ᵀ) = Q̇Q̇ᵀ + Q̃Q̃ᵀ

but that does not say anything interesting
The default in Matlab is the full version

>> A=[0.5 -1; -0.5 0.1; 1 1]
A =
    0.5000   -1.0000
   -0.5000    0.1000
    1.0000    1.0000
>> [Q R]=qr(A)
Q =
   -0.4082   -0.8398    0.3578
    0.4082    0.1826    0.8944
   -0.8165    0.5112    0.2683
R =
   -1.2247   -0.3674
         0    1.3693
         0         0
Economy version has to be specifically asked for

>> [Q R]=qr(A,0)
Q =
   -0.4082   -0.8398
    0.4082    0.1826
   -0.8165    0.5112
R =
   -1.2247   -0.3674
         0    1.3693

In R, it is the other way round: the default is the economy version
Let us try it on square matrices first
> A=matrix(c(1,1,2,-1),nrow=2)
> A                       # remember? the matrix from Example 1
     [,1] [,2]
[1,]    1    2
[2,]    1   -1
> solve(A,c(5,2))         # and here Example 1 is solved
[1] 3 1
> qr(A)
$qr
           [,1]       [,2]
[1,] -1.4142136 -0.7071068
[2,]  0.7071068 -2.1213203
$rank
[1] 2
…
> qr(A)$qr
           [,1]       [,2]
[1,] -1.4142136 -0.7071068
[2,]  0.7071068 -2.1213203
After reading the help ?qr

> Q=qr.Q(qr(A))
> Q
           [,1]       [,2]
[1,] -0.7071068 -0.7071068
[2,] -0.7071068  0.7071068
> R=qr.R(qr(A))
> R
          [,1]       [,2]
[1,] -1.414214 -0.7071068
[2,]  0.000000 -2.1213203

And finally check it out

> Q %*% R
     [,1] [,2]
[1,]    1    2
[2,]    1   -1
Finishing Example 1
After transforming to QᵀAx = Qᵀy, we get
> t(Q) %*% A       # left-hand side
              [,1]       [,2]
[1,] -1.414214e+00 -0.7071068
[2,]  1.110223e-16 -2.1213203
> t(Q) %*% c(5,2)  # right-hand side
          [,1]
[1,] -4.949747
[2,] -2.121320
And then we do the backsolving
> x2 = (t(Q) %*% c(5,2))[2] / R[2,2]
> x2
[1] 1
> x1 = ((t(Q) %*% c(5,2))[1] - R[1,2]*x2)/R[1,1]
> x1
[1] 3
It works!
Example 2

> A
     [,1] [,2]
[1,]  0.5 -0.1
[2,] -0.5  0.1
[3,]  1.0  1.0
> Q=qr.Q(qr(A))
> Q
           [,1]       [,2]
[1,] -0.4082483  0.5773503
[2,]  0.4082483 -0.5773503
[3,] -0.8164966 -0.5773503
> round(t(Q) %*% Q,digits=3)
     [,1] [,2]
[1,]    1    0
[2,]    0    1
Example 2 continued

> round(Q %*% t(Q),digits=2)
     [,1] [,2] [,3]
[1,]  0.5 -0.5    0
[2,] -0.5  0.5    0
[3,]  0.0  0.0    1
> R=qr.R(qr(A))
> R
          [,1]       [,2]
[1,] -1.224745 -0.7348469
[2,]  0.000000 -0.6928203
> Q %*% R
     [,1] [,2]
[1,]  0.5 -0.1
[2,] -0.5  0.1
[3,]  1.0  1.0
Example 2 finished
> LHS=t(Q) %*% A
> RHS=t(Q) %*% c(0,0,1)
> cbind(LHS,RHS)
[,1] [,2] [,3]
[1,] -1.224745e+00 -0.7348469 -0.8164966
[2,] 2.220446e-16 -0.6928203 -0.5773503
> x2=RHS[2]/LHS[2,2]
> x2
[1] 0.8333333
> x1=(RHS[1]-LHS[1,2]*x2)/LHS[1,1]
> x1
[1] 0.1666667
> qr.solve(A,c(0,0,1))
[1] 0.1666667 0.8333333
Note that both systems were determined, but only the matrix of the first one was square. So, QR decomposition can nicely solve determined linear systems, even when the matrix is not square. However, it can do more: overdetermined systems.
Overdetermined system: Example 3
> A=cbind(c(1,1,1),c(1,2,3))
> A
     [,1] [,2]
[1,]    1    1
[2,]    1    2
[3,]    1    3
> y=c(1,4,0)
This system is clearly overdetermined, but we can solve it using QR decomposition as if it were a determined one; while it looks the same when running R commands, the mathematics behind it is different here
> qr.solve(A,y)
[1] 2.666667 -0.500000
> abline(qr.solve(A,y))   # adds the fitted line to the plot on the next slide
> A %*% qr.solve(A,y)
[,1]
[1,] 2.166667
[2,] 1.666667
[3,] 1.166667
So, we do not have Ax = y, but only Ax ≈ y
What goes on here?
In fact, we are solving a regression problem: fitting a line through the points (1,1), (2,4), (3,0) via least squares (find the fitted values!)
[Figure: the three points (1,1), (2,4), (3,0) and the fitted least-squares line; x-axis: x, y-axis: y]
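The figure can be reproduced in a few lines (a sketch, assuming A and y from Example 3 are still around):

plot(A[,2], y, xlim=c(0,3), ylim=c(0,4), xlab="x", ylab="y")
abline(qr.solve(A, y))   # the least-squares line: intercept 2.667, slope -0.5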
> Q=qr.Q(qr(A))
> R=qr.R(qr(A))
> Q
           [,1]          [,2]
[1,] -0.5773503  7.071068e-01
[2,] -0.5773503  2.775558e-16
[3,] -0.5773503 -7.071068e-01
> R
          [,1]      [,2]
[1,] -1.732051 -3.464102
[2,]  0.000000 -1.414214
> LHS=t(Q) %*% A
> RHS=t(Q) %*% y
> cbind(LHS,RHS)
[,1] [,2] [,3]
[1,] -1.732051e+00 -3.464102 -2.8867513
[2,] 4.440892e-16 -1.414214 0.7071068
> b2=RHS[2]/LHS[2,2]
> b1=(RHS[1]-LHS[1,2]*b2)/LHS[1,1]
> c(b1,b2)
[1] 2.666667 -0.500000
Indeed
Actually, full QR is in R too
> Q=qr.Q(qr(A),complete=TRUE)
> R=qr.R(qr(A),complete=TRUE)
> Q
           [,1]          [,2]       [,3]
[1,] -0.5773503  7.071068e-01  0.4082483
[2,] -0.5773503  2.775558e-16 -0.8164966
[3,] -0.5773503 -7.071068e-01  0.4082483
> R
          [,1]      [,2]
[1,] -1.732051 -3.464102
[2,]  0.000000 -1.414214
[3,]  0.000000  0.000000
> Q %*% R
[,1] [,2]
[1,] 1 1
[2,] 1 2
[3,] 1 3
So, why does it work for overdetermined systems?
Some first heuristics: check, for some arbitrary x (perhaps for several of them, if you wish)
> x # just arbitrary x
[1] 0.6434361 0.4814825
> Q=qr.Q(qr(A),complete=TRUE) # still Example 3
> sum((A %*% x - y)^2)
[1] 10.10418
So, this is the squared distance of Ax to y. After premultiplying with Qᵀ
> sum((t(Q) %*% A %*% x - t(Q) %*% y)^2)
[1] 10.10418
the distance remains the same. This corresponds to the fact that orthogonal transformations (multiplying by Q, as well as by Qᵀ, represents such transformations) preserve lengths (and angles)
That is what is behind the whole thing. Of course, the full explanation is only via …
… general mathematics!
Note that A = QR = Q̇Ṙ, where Q̇ is a component of the full QR decomposition's Q = (Q̇ Q̃) and has orthonormal columns. We have

∥Ax − y∥² = (Ax − y)ᵀ(Ax − y) = (y − Ax)ᵀ(y − Ax) = ∥y − Ax∥²

∥Qᵀ(y − Ax)∥² = (Qᵀ(y − Ax))ᵀ(Qᵀ(y − Ax)) = (y − Ax)ᵀQQᵀ(y − Ax) = (y − Ax)ᵀ(y − Ax) = ∥y − Ax∥²

and, writing stacked blocks as (· ; ·) again and using Q̇ᵀQ̇ = I_q and Q̃ᵀQ̇ = O,

∥Qᵀ(y − Ax)∥² = ∥(Q̇ᵀ ; Q̃ᵀ)(y − Ax)∥² = ∥(Q̇ᵀy − Q̇ᵀQ̇Ṙx ; Q̃ᵀy − Q̃ᵀQ̇Ṙx)∥² = ∥Q̇ᵀy − Ṙx∥² + ∥Q̃ᵀy∥²

Looking for x minimizing this, we note that the second term does not depend on it, and the first term will be minimal when it becomes zero – that is, when Ṙx = Q̇ᵀy

As Ṙ is q×q upper triangular, we just obtain x by backsolving – the whole thing succeeds as long as Ṙ has rank q; that is true if A has full rank min{p,q}
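In R terms, this is exactly what the economy version delivers; a sketch, again with Example 3's A and y:

Qdot = qr.Q(qr(A))                # economy Q, with orthonormal columns
Rdot = qr.R(qr(A))                # q × q upper triangular
backsolve(Rdot, t(Qdot) %*% y)    # solves Ṙx = Q̇ᵀy: 2.666667 -0.500000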
Conclusion
This likely explains why economy is the default in R: the whole thing serves for finding least-squares fits!
QR decomposition can nicely solve determined linear systems, even when the matrix is not square. However, it can do more: overdetermined systems, and thus, in particular, it is used to compute least-squares fits directly from the matrix X, without a need to compute XᵀX
So, to obtain the least-squares solution b for y ∼ Xβ: the expert in numerical computations does not do

solve(t(X) %*% X) %*% (t(X) %*% y)

but the expert in statistical computing does not do even

solve(t(X) %*% X, t(X) %*% y)
or solve(crossprod(X), crossprod(X,y))

and not even

qr.solve(crossprod(X), crossprod(X,y))

but

qr.solve(X, y)
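For the record: on a well-conditioned problem all of these return the same numbers; the differences are in speed and, above all, in numerical stability. A sketch with hypothetical simulated data:

set.seed(1)                              # hypothetical data, for illustration only
X = cbind(1, rnorm(10)); y = rnorm(10)
c(solve(t(X) %*% X) %*% t(X) %*% y)      # the naive formula
c(solve(crossprod(X), crossprod(X,y)))   # solving the normal equations
qr.solve(X, y)                           # QR, applied directly to X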