Scientific Computing Spring 2020
Lecture 6: Eigenvalue and Singular Value Decompositions
Lecturer: Qi Deng Scribes: Qi Deng
Note: LaTeX template courtesy of UC Berkeley EECS dept.
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.
6.1 Eigenvalue decomposition
Let A be an n × n matrix. We say that λ and x (x ≠ 0) are an eigenvalue and a corresponding eigenvector of A if
Ax = λx
holds. From linear algebra we know that the eigenvalues are the roots of the characteristic polynomial p(λ) = det(λI − A). Suppose Ax_k = λ_k x_k for k = 1, 2, …, n, and let X = [x_1, …, x_n]; then we can write AX = XΛ. If X is invertible, then we say that A is diagonalizable and has the eigenvalue decomposition (EVD):
A = XΛX^{-1}.
Here Λ is a diagonal matrix containing the eigenvalues of A. Of course, not all matrices are diagonalizable; one simple non-diagonalizable matrix is

A = [ 1  1
      0  1 ].

We list two useful sufficient conditions guaranteeing that a matrix is diagonalizable.
1. If A has n distinct eigenvalues, then A is diagonalizable.
2. If A is symmetric, then A is diagonalizable. Specifically, A is orthogonally similar to Λ: A = QΛQ^{-1} where Q is orthogonal.
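As a quick numerical check of the symmetric case (a NumPy sketch, not part of the notes' derivation), np.linalg.eigh returns the eigenvalues of a symmetric matrix together with an orthogonal eigenvector matrix:

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B + B.T                                    # symmetric, hence diagonalizable

lam, Q = np.linalg.eigh(A)                     # eigenvalues and orthonormal eigenvectors
print(np.allclose(A, Q @ np.diag(lam) @ Q.T))  # True: A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))         # True: Q is orthogonal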
We also note that, in general, a square matrix is orthogonally similar to an upper triangular matrix (Schur decomposition):
Q^T A Q = Λ + U
where U is strictly upper triangular^1.

6.2 Algorithms
Numerical algorithms can be grouped into two classes: exact methods and iterative methods. Typical exact methods, such as Gaussian elimination, guarantee an exact solution after a finite number of flops. In contrast, iterative methods recursively update the solution variables and gradually improve the accuracy, with the solution error decreasing as a function of the iteration number.
^1 Upper triangular with the diagonal equal to zero: U_ii = 0, i = 1, 2, …, n.
6.2.1 Power method
For computing the EVD, a possible approach is to find the roots of the characteristic polynomial using Newton's method, yet this is inefficient for polynomials of degree > 5. More practical methods for the EVD are iterative methods based on matrix powers. Suppose A = XΛX^{-1}; then A^k = XΛ^kX^{-1}, and the entries of Λ^k are much easier to separate and pick out. To begin with, we present the power iteration (Algorithm 1) for computing the leading eigenvalue. It will then be generalized to finding the full spectrum.
Algorithm 1: Power method
Input: q_0;
for k = 1, 2, 3, … do
    y_k = A q_{k-1};
    q_k = y_k / ∥y_k∥_2;
    λ^(k) = q_k^T A q_k;
end
Theorem 1 (Power method). Suppose that A is diagonalizable and that in the EVD of A we have |λ_1| > |λ_2| > |λ_3| > ··· > |λ_n|. In Algorithm 1, assume that q_0 is not deficient (q_0^T x_1 ≠ 0). Then lim_{k→∞} λ^(k) = λ_1. More specifically,

|λ_1 − λ^(k)| = O( |λ_2/λ_1|^k ).
Proof. From the recursion we note that q_k ∝ y_k ∝ A^k q_0. Using the EVD, A^k q_0 = X Λ^k X^{-1} q_0. Suppose q_0 is in general position, so that α = X^{-1} q_0 satisfies α_1 ≠ 0. Then

A^k q_0 = X Λ^k α = Σ_{i=1}^n α_i λ_i^k x_i = α_1 λ_1^k ( x_1 + Σ_{i=2}^n (α_i/α_1)(λ_i/λ_1)^k x_i ).

Since q_k ∈ span{A^k q_0}, we have q_k ∝ z_k := x_1 + Σ_{i=2}^n (α_i/α_1)(λ_i/λ_1)^k x_i → x_1 as k → ∞. Thus q_k converges (up to scaling) to the first eigenvector. Moreover,

λ^(k) − λ_1 = q_k^T (A − λ_1 I) q_k = z_k^T (A − λ_1 I) z_k / (z_k^T z_k)
            = ( z_k^T Σ_{i=2}^n (α_i/α_1)(λ_i/λ_1)^k (λ_i − λ_1) x_i ) / (z_k^T z_k).

Since z_k → x_1, the denominator z_k^T z_k → x_1^T x_1 > 0 while the numerator is dominated by the i = 2 term, hence

|λ_1 − λ^(k)| = O( |λ_2/λ_1|^k ).
We remark that it is important to assume that q_0 is in general position: q_0^T x_1 ≠ 0. For example, we can generate q_0 as a random Gaussian vector. The power method may fail if this condition is not satisfied: if q_0 = x_2, then q_k = x_2 for all k.
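As an illustration, here is a minimal NumPy sketch of Algorithm 1 with a random Gaussian start; the function name and the stopping parameters max_iter and tol are choices made for this example, not part of the algorithm statement.

import numpy as np

def power_method(A, max_iter=1000, tol=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(A.shape[0])         # random Gaussian q0 (general position w.h.p.)
    q /= np.linalg.norm(q)
    lam = q @ A @ q
    for _ in range(max_iter):
        y = A @ q                               # y_k = A q_{k-1}
        q = y / np.linalg.norm(y)               # q_k = y_k / ||y_k||_2
        lam_new = q @ A @ q                     # lambda^(k) = q_k^T A q_k
        if abs(lam_new - lam) < tol:
            return lam_new, q
        lam = lam_new
    return lam, q

A = np.diag([5.0, 2.0, 1.0]) + 0.1 * np.ones((3, 3))
lam1, x1 = power_method(A)
print(lam1)                                     # close to the dominant eigenvalue of A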
Algorithm 2: Orthogonal iteration
Input: Q_0 s.t. Q_0^T Q_0 = I_{r×r};
for k = 1, 2, 3, … do
    // Matrix product
    Y_k = A Q_{k-1};
    // QR factorization
    Q_k R_k = Y_k;
    λ(Q_k^T A Q_k) = {λ_1^(k), …, λ_r^(k)};
end
6.2.2 QR iteration
We next generalize the power method to find multiple eigenvalues simultaneously. Suppose we want to find r eigenvalues; we then keep track of q_1, q_2, …, q_r and want to turn them into eigenvectors. The key step is to use the QR decomposition to orthogonalize these vectors; otherwise, if we performed the power method on each q_i separately, they would all converge to the leading eigenvector. We describe the orthogonal iteration method in Algorithm 2. Using an analogous but more technical analysis, one can show that
lim_{k→∞} λ_i^(k) = λ_i,   i = 1, 2, …, r.
Furthermore, if we want to find all the eigenvalues, we initialize Q_0 with a full n × n orthogonal matrix. When the algorithm converges, Q_k → Q and R_k → R, and we have AQ = QR. In other words, we obtain the Schur decomposition
A = QRQ^T,
in which the eigenvalues lie on the diagonal of R.
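A short NumPy sketch of Algorithm 2 under these assumptions (the fixed iteration count and the final call to np.linalg.eigvals on the small r × r matrix Q_k^T A Q_k are choices made for this example):

import numpy as np

def orthogonal_iteration(A, r, num_iters=500, seed=0):
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, r)))   # Q0 with orthonormal columns
    for _ in range(num_iters):
        Y = A @ Q                                      # matrix product Y_k = A Q_{k-1}
        Q, _ = np.linalg.qr(Y)                         # QR factorization Q_k R_k = Y_k
    return np.linalg.eigvals(Q.T @ A @ Q), Q           # approximate leading eigenvalues

A = np.diag([6.0, 4.0, 2.0, 1.0]) + 0.05 * np.ones((4, 4))
vals, Q = orthogonal_iteration(A, r=2)
print(np.sort(vals.real)[::-1])                        # approximates the two largest eigenvalues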
Recall that in Algorithm 2 we have R_k = Q_k^T A Q_{k-1}. Besides Q_k, we can directly track A_k = Q_k^T A Q_k. We note that

A_k = Q_k^T A Q_{k-1} Q_{k-1}^T Q_k = R_k (Q_{k-1}^T Q_k)

and

(Q_{k-1}^T Q_k) R_k = Q_{k-1}^T A Q_{k-1} = A_{k-1}.

Since Q_{k-1}^T Q_k is also orthogonal, by renaming Q_{k-1}^T Q_k as Q_k we obtain the QR iteration for tracking A_k (Algorithm 3).
Algorithm 3: QR iteration
for k = 1, 2, 3, … do
    // QR factorization
    A_{k-1} = Q_k R_k;
    // Matrix product
    A_k = R_k Q_k;
end
The drawback of both the QR and orthogonal iterations is that they require O(n^3) flops in each iteration due to the QR factorization of A_k. This makes the QR iteration look expensive, considering that a full Gaussian elimination takes only O(n^3) flops in total.
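For illustration, here is a minimal sketch of the unshifted QR iteration (Algorithm 3) in NumPy; practical implementations first reduce to Hessenberg form and use shifts, which this sketch omits.

import numpy as np

def qr_iteration(A, num_iters=200):
    Ak = A.copy()
    for _ in range(num_iters):
        Q, R = np.linalg.qr(Ak)     # QR factorization A_{k-1} = Q_k R_k
        Ak = R @ Q                  # matrix product A_k = R_k Q_k
    return Ak                       # approaches (quasi-)upper triangular form

A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 1.0]])
Ak = qr_iteration(A)
print(np.diag(Ak))                      # approximate eigenvalues
print(np.sort(np.linalg.eigvalsh(A)))   # reference values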
We highlight a more efficient routine that reduces the per-iteration cost to O(n^2). The key step is to first find a Hessenberg matrix H_0 orthogonally similar to A, and then perform the QR iteration on H_0. First, recall that a matrix is upper Hessenberg if a_ij = 0 for i > j + 1:

H_upper = [ h_11   h_12   ···   h_1n
            h_21   h_22   ···   h_2n
                    ⋱      ⋱     ⋮
             0        h_{n,n-1}  h_nn ].
We make a crucial observation about the QR factorization of a Hessenberg matrix. If A_{k-1} is Hessenberg, then we have
1. The QR factorization of A_{k-1} can be done in O(n^2) flops. Consider using Householder transforms: V_{n-1} V_{n-2} ··· V_1 A_{k-1} = R_k. Since each V_i only has to act on two adjacent rows (O(n) data), the total cost is O(n^2).
2. A_k = R_k Q_k can be computed in O(n^2) flops, and the resulting matrix A_k is still a Hessenberg matrix. Since A_k = R_k V_1 V_2 ··· V_{n-1} (each Householder V_i is symmetric), each V_i only mixes two adjacent columns and introduces nonzeros only on the first subdiagonal.
It remains to find the initial matrix H_0 = Q^T A Q. This can be achieved by the Householder method, which is similar to what we have seen in the Householder QR factorization. The only difference is that we apply Householder transforms Q_1, …, Q_{n-2} to turn A into Hessenberg form,

Q_{n-2} ··· Q_2 Q_1 A = H,

and then compute

H_0 = H Q_1^T Q_2^T ··· Q_{n-2}^T.

One can show that the right multiplications preserve the zeros, so H_0 is still upper Hessenberg.
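One can check this reduction numerically with SciPy's scipy.linalg.hessenberg, which performs a Householder-based Hessenberg reduction via LAPACK (a sketch for verification only):

import numpy as np
from scipy.linalg import hessenberg

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))

H0, Q = hessenberg(A, calc_q=True)       # H0 = Q^T A Q with H0 upper Hessenberg
print(np.allclose(A, Q @ H0 @ Q.T))      # True: A = Q H0 Q^T
print(np.allclose(np.tril(H0, -2), 0))   # True: zeros below the first subdiagonal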
6.3 Singular value decomposition
Theorem 2. Let A ∈ R^{m×n}. Then A can be written as

A = UΣV^T,     (6.1)

where U ∈ R^{m×m} and V ∈ R^{n×n} are orthogonal matrices and Σ ∈ R^{m×n} is a diagonal matrix with nonnegative entries.
The columns of U and V are called left and right singular vectors and the diagonal elements σi are called singular values. Without loss of generality, we assume σ1 ≥ σ2 ≥ . . . ≥ σr ≥ 0, where r = min{m, n}.
Singular values and eigenvalues. From (6.1), we have the EVDs
A^T A = V Σ^2 V^T,   A A^T = U Σ^2 U^T.
Therefore, one can obtain the SVD of A by performing an EVD of A^T A or A A^T.
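A quick numerical sanity check of this relation (a sketch using np.linalg.svd and np.linalg.eigvalsh; the random test matrix is arbitrary):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U diag(s) V^T
lam = np.linalg.eigvalsh(A.T @ A)[::-1]            # eigenvalues of A^T A, descending

print(np.allclose(s**2, lam))                      # True: sigma_i^2 = lambda_i(A^T A)
print(np.allclose(A, U @ np.diag(s) @ Vt))         # True: A = U Sigma V^T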
Matrix norm and SVD. The SVD is closely related to matrix norms. We have the following identities:
1. ∥A∥_2 = σ_1.
2. ∥A∥_F = ( Σ_i σ_i^2 )^{1/2}.
3. ∥A∥_F = ∥Σ∥_F.
Rank. If σ_k > σ_{k+1} = 0, then the matrix A has rank k. Sometimes we use the thin SVD A = U_k Σ_k V_k^T, where U_k = [u_1, u_2, …, u_k] and V_k = [v_1, v_2, …, v_k] are the first k columns of U and V, and Σ_k = diag(σ_1, …, σ_k). One can quickly identify that

ker(A) = {x : Ax = 0} = span{v_{k+1}, v_{k+2}, …, v_n}

and

range(A) = {x : x = Ay, y ∈ R^n} = span{u_1, u_2, …, u_k}.
Another way of viewing the SVD is via the outer product form

A = Σ_{i=1}^r σ_i u_i v_i^T,

which decomposes A into a sum of “orthogonal” components weighted by the singular values. Each component S_i = u_i v_i^T is a rank-one matrix satisfying S_i S_j^T = 0 and S_i^T S_j = 0 when i ≠ j. Let us write A_t = Σ_{i=1}^t σ_i u_i v_i^T. The next theorem implies that A_t is the best rank-t approximation of A in the 2-norm sense.
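The outer product form is easy to verify numerically (a small sketch; the test matrix is arbitrary):

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as a sum of rank-one terms sigma_i * u_i v_i^T.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_rebuilt))   # True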
Theorem 3 (Eckart–Young). Let t < k = rank(A). Then

min_{rank(X)=t} ∥A − X∥_2 = ∥A − A_t∥_2 = σ_{t+1}.
Proof. First of all, A − A_t = Σ_{i=t+1}^k σ_i u_i v_i^T. Hence ∥A − A_t∥_2 = σ_max(A − A_t) = σ_{t+1}. Now suppose rank(X) = t. Since rank(X) + dim(ker(X)) = n, the kernel of X has dimension n − t; because (n − t) + (t + 1) > n, we must have

ker(X) ∩ span{v_1, v_2, …, v_{t+1}} ≠ {0}.

Let z be a unit vector in this intersection: Xz = 0 and z = Σ_{i=1}^{t+1} α_i v_i = V_{t+1} α for some coefficients with ∥α∥_2 = 1. Hence

∥A − X∥_2 ≥ ∥(A − X)z∥_2 = ∥Az∥_2 = ∥U_{t+1} Σ_{t+1} V_{t+1}^T z∥_2 = ( Σ_{i=1}^{t+1} σ_i^2 α_i^2 )^{1/2} ≥ σ_{t+1}.

Therefore X = A_t attains the optimum.
Theorem 3 has a profound consequence: if the singular values decrease rapidly, then A_t may capture most of the information in A even for small t.
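The following sketch illustrates Theorem 3 numerically: the spectral-norm error of the truncated SVD A_t equals σ_{t+1} (the matrix sizes and the choice of t are arbitrary).

import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

t = 2
At = U[:, :t] @ np.diag(s[:t]) @ Vt[:t, :]   # truncated SVD A_t
err = np.linalg.norm(A - At, ord=2)          # spectral (2-)norm of the residual
print(np.isclose(err, s[t]))                 # True: ||A - A_t||_2 = sigma_{t+1}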
6.3.1 SVD for least squares
We note that the SVD can be used to solve least squares problems. Assume that m ≥ n, that A has full column rank, and let A = UΣV^T be the (thin) SVD. Then we have

x = (A^T A)^{-1} A^T b = (V Σ^2 V^T)^{-1} V Σ U^T b = V Σ^{-2} V^T V Σ U^T b = V Σ^{-1} U^T b.
The overall running time of SVD-based least squares is 4n^2 m − (4/3) n^3 + O(n^2) flops. The SVD is particularly appealing for rank-deficient systems, for which rank(A) = k < n. If A does not have full rank, the least squares problem has infinitely many solutions, and we must use some other criterion to select a particular solution. Let us use the SVD to find the minimum-norm solution.
Theorem 4. Suppose A = UΣV^T is the SVD and rank(A) = k. Then

x* = Σ_{i=1}^k σ_i^{-1} (u_i^T b) v_i

minimizes ∥Ax − b∥_2 and has the smallest 2-norm among all minimizers. Moreover, ∥Ax* − b∥_2^2 = Σ_{i=k+1}^m (u_i^T b)^2.
Proof. For any x, we have

∥Ax − b∥_2^2 = ∥U^T(Ax − b)∥_2^2 = ∥Σα − U^T b∥_2^2 = Σ_{i=1}^k (σ_i α_i − u_i^T b)^2 + Σ_{i=k+1}^m (u_i^T b)^2,

where α = V^T x. At optimality, we must have α_i = u_i^T b / σ_i for 1 ≤ i ≤ k. Moreover, ∥α∥_2 = ∥x∥_2, hence the minimum-norm solution must have α_i = 0 for k + 1 ≤ i ≤ n.
Define the pseudo-inverse of A as A† = V Σ† U^T, where Σ† = diag(σ_1^{-1}, σ_2^{-1}, …, σ_k^{-1}, 0, …, 0). The theorem above simply says that the minimum-norm LS solution is

x* = A† b.

If A has full column rank, then A† = (A^T A)^{-1} A^T, and if m = n and A is nonsingular, then A† = A^{-1}.
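A sketch of the minimum-norm least-squares solution of Theorem 4 on a rank-deficient example, compared against np.linalg.pinv (which determines the numerical rank from a tolerance); the problem sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(5)
m, n, k = 8, 5, 3
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # rank k < n
b = rng.standard_normal(m)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
# x* = sum_{i<=k} sigma_i^{-1} (u_i^T b) v_i  (Theorem 4)
x_star = Vt[:k, :].T @ ((U[:, :k].T @ b) / s[:k])

print(np.allclose(x_star, np.linalg.pinv(A) @ b))   # True: matches the pseudo-inverse solution
print(np.linalg.norm(A @ x_star - b))               # least-squares residual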