Eigenvalues and Singular Values
Goals of this chapter
• Introduce the power method and its variants for computing one or some eigenvalues/eigenvectors (or eigenpairs) of a matrix.
• Discuss the computation of singular values and present a few examples that demonstrate its usefulness.
• Describe the QR iteration for computing all the eigenvalues of a general matrix.
In this chapter we consider algorithms for solving the eigenvalue problem Ax = λx,
and for computing the singular value decomposition (SVD) A = UΣV T .
We also discuss several relevant applications.
Recall from Section 3.1 (page 78) that eigenvalues, and therefore also singular values, generally cannot be computed precisely in a finite number of steps, even in the absence of floating point error. All algorithms for computing eigenvalues and singular values are therefore necessarily iterative, unlike those in Chapters 4 and 5 for solving linear systems.
Note: Although many of our observations apply to complex matrices, we will assume that A is real. Furthermore, for most of this chapter, we assume that A has real eigenvalues, although this is certainly not true for all real matrices. Note that we can always assume that an eigenvector corresponding to a real eigenvalue is also real.
Methods for finding eigenvalues can be split into two distinct categories. The first is generally based on decompositions involving similarity transformations for finding several or all eigenvalues. The other category mainly deals with very large and typically sparse matrices for which just a few eigenvalues and/or eigenvectors are required; these are algorithms that are largely based on matrix-vector products.
We start in Section 8.1 by discussing the effect of repeated multiplication of a vector by the given matrix. This leads to the fundamental power method, which finds applications, for example, in search algorithms for large networks. The important concept of shift and invert is introduced next, leading to the inverse iteration. We will see how it can accelerate convergence, albeit at a price.
In Section 8.2 we turn to the SVD, introduced in Section 4.4, and explain how it can be used in practice. One such application brings us back to solving rank-deficient or highly ill-conditioned least squares problems, which was described in Section 6.4.
Complete descriptions of robust algorithms for finding all eigenvalues or singular values of a given general matrix involve several nontrivial aspects and belong in a more advanced text. However, we discuss some details of such algorithms in Section 8.3.
The Lanczos and Arnoldi processes can be utilized for computing a few eigenpairs of large and sparse matrices; this is discussed in the more advanced Section 8.4.
8.1 The power method and variants
Note: Compared to the sequence of Chapters 4–5 and 7 for solving linear systems there appears to be an inverted order here: methods based on matrix-vector products and aiming at possibly finding only a few eigenvalues are discussed first, while the general methods for finding all eigenvalues for matrices without a special structure are delayed until Section 8.3. This is because those latter methods rely on the ones introduced in Section 8.1 as building blocks.
The power method is a simple technique for computing the dominant eigenvalue and eigenvector of a matrix. It is based on the idea that repeated multiplication of a random vector (almost any vector) by the given matrix yields vectors that eventually tend towards the direction of the dominant eigenvector.
But why should we ever want to compute the dominant eigenvector of a matrix? Here is a motivating case study.
Example 8.1. Google’s search engine continues to be the dominant Internet search technology: it is a mechanism for ranking pages and displaying top hits – webpages that are, in the search engine’s judgment, the most relevant sites to the user’s query.
The company will not reveal the exact details of its current algorithms; indeed, these are rumored to be continually modified. But the basic algorithm that has produced an Internet search giant is called PageRank and was published in 1998 by Google’s founders, Sergey Brin and Larry Page. This algorithm amounts to a computation of the dominant eigenvector of a large and very sparse matrix.
Before diving into the gory details, though, let us give some necessary context. The Web is an enormous creature of billions of webpages, and it dynamically changes: pages are updated, added and deleted constantly. Search engines are continuously busy using their vast computing power to “crawl” the Web and update their records. When a user enters a search query, the search
engine is ready with billions of records, and according to the specifics of the query the relevant webpages can be instantly retrieved. A tricky part, though, is to rank those relevant pages and determine the order in which they are displayed to the user; indeed, the typical surfer rarely flips through more than a few top hits. The idea that formed the basis for the search engine was that beyond questions of contents (and rest assured that these are not overlooked) it is also crucial to take the link structure of the Web into account.
The idea of PageRank is this. Given a network linkage graph with n nodes (webpages), the importance of a webpage is determined by the number and the importance of pages that link to it. Mathematically, suppose the importance, or rank,44 of page i is given by xi. To determine the value of xi we first record all the pages that link to it. Suppose the locations (or indices) of these pages are given by the set {Bi}. If a webpage whose index is j ∈ Bi points to Nj pages including page i, and its own rank is xj, then we say that it contributes a share of xj/Nj to the rank xi of page i. Thus, in one formula, we set

x_i = \sum_{j ∈ B_i} \frac{1}{N_j} x_j,   i = 1, …, n.
Looking carefully at this expression, we can see that in fact it is nothing but an eigenvalue problem! We seek a vector x such that x = Ax, where the nonzero values aij of the matrix A are the elements 1/Nj associated with page i. In other words, we are aiming to compute an eigenvector of A that corresponds to an eigenvalue equal to 1. Since the number of links in and out of a given webpage is smaller by many orders of magnitude than the overall number of webpages, the matrix A is extremely sparse.
There are a few unanswered questions at this point. For example, how do we at all know that the matrix A has an eigenvalue equal to 1? If it does, is that eigenvalue large or small in magnitude compared to the other eigenvalues of the matrix? If there is a solution x to the problem, is it unique up to magnitude? Even if it is unique, is it guaranteed to be real? And even if it is real, would all its entries be necessarily positive, as philosophically befits the term “rank?” Some of the answers to these questions are simple and some are more involved. We will discuss them in Example 8.3. For now, please take our word that with a few adjustments to this basic model (see Example 8.3), A indeed has an eigenvalue 1, which happens to be the dominant eigenvalue, with algebraic multiplicity 1, and the problem has a unique real positive solution x. This is exactly the vector of ranking that we are looking for, and the power method discussed next can be used to find it.
Developing the power method
Suppose that the eigenpairs of a matrix A are given by {λj, xj}, for j = 1,…,n. Note that here xj are eigenvectors as in Section 3.1, not iterates as in Chapter 7. Let v0, ∥v0∥ = 1, be an arbitrary initial guess, and consider the following algorithm:
for k = 1, 2, . . . until termination do
    ṽ = A v_{k−1}
    v_k = ṽ / ∥ṽ∥
end
The output of this algorithm at the kth step is a vector vk = γk A^k v0, where γk is a scalar that guarantees that ∥vk∥ = 1. As long as A is real and v0 is real, all the vk are real by definition. The reason for the repeated normalization is that there is nothing in the definition of an eigenvector to prevent our iterates from growing in magnitude, which would indeed happen for large eigenvalues, accelerating roundoff error growth. Hence we keep the iterate magnitude in check.
44 This term relates better to an army rank than to a matrix rank, in the current context.
We will assume throughout that the matrix A has n linearly independent eigenvectors. Hence, it is possible to express v0 as a linear combination of the eigenvectors {xj }: there are coefficients βj such that
v_0 = \sum_{j=1}^{n} β_j x_j.

Upon multiplying v0 by A we obtain

A v_0 = A \sum_{j=1}^{n} β_j x_j = \sum_{j=1}^{n} β_j A x_j = \sum_{j=1}^{n} β_j λ_j x_j.
The eigenvectors that correspond to the larger eigenvalues are therefore more pronounced in the new linear combination. Continuing, for any positive integer k we have

A^k v_0 = \sum_{j=1}^{n} β_j λ_j^k x_j.

Now, assume that the eigenvalues λ1, λ2, . . . , λn are sorted in decreasing order in terms of their magnitude, and the magnitude of the second eigenvalue is smaller than that of the first, so

|λ_1| > |λ_j|,   j = 2, …, n.

That is, the possibility of equality of magnitudes is excluded. Since we are assuming that A is real, this implies that λ1 is real, because complex eigenvalues of real matrices appear in complex conjugate pairs and if λ1 were part of a complex conjugate pair, λ2 would have the same magnitude. Suppose also that v0 has a component in the direction of x1, that is, β1 ≠ 0. Then

v_k = γ_k A^k v_0 = γ_k \sum_{j=1}^{n} β_j λ_j^k x_j = γ_k λ_1^k β_1 x_1 + γ_k λ_1^k \sum_{j=2}^{n} β_j (λ_j/λ_1)^k x_j,

where γk is a normalization factor, there to ensure that ∥vk∥ = 1. Now, by our assumption about the dominance of the first eigenvalue it follows that for j > 1 we have (λ_j/λ_1)^k → 0 as k → ∞. Thus, the larger k is, the more dominant x1 is in vk, and in the limit we obtain a unit vector in the direction of x1.
We thus have a simple way of approximately computing the dominant eigenvector. What
about the dominant eigenvalue? In the case of Example 8.1 and upcoming Example 8.3 this eigenvalue is known, but in most cases it is not. Here we can use the Rayleigh quotient, defined for any given vector by
μ(v) = \frac{v^T A v}{v^T v}.
If v were an eigenvector, then μ(v) would simply give the associated eigenvalue. If v is not an eigenvector, then the Rayleigh quotient of v gives the closest possible approximation to the eigenvalue in the least squares sense. (We ask you to show this in Exercise 8.2.)
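For instance, here is a small MATLAB illustration; the matrix and the vector are arbitrary choices of ours:

A = [4 1 0; 1 3 1; 0 1 2];        % an arbitrary small symmetric matrix
v = [1; 0.9; 0.3];                % not an eigenvector, but close to one in direction
mu = (v' * A * v) / (v' * v);     % Rayleigh quotient, approximately 4.71
% compare with max(eig(A)) = 3 + sqrt(3), approximately 4.73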
Note that by the normalization of vk we have vk^T vk = ∥vk∥^2 = 1, and hence μ(vk) = vk^T A vk.
Algorithm: Power Method.
Input: matrix A and initial guess v0.
Output: approximate dominant eigenvector vk and approximate dominant eigenvalue λ_1^{(k)}.

for k = 1, 2, . . . until termination
    ṽ = A v_{k−1}
    v_k = ṽ / ∥ṽ∥_2
    λ_1^{(k)} = v_k^T A v_k
end
This then is our estimate for the eigenvalue λ1 in the kth iteration. The power method for computing the dominant eigenpair of a matrix, under the conditions we have stated so far, is given in the algorithm above.
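In MATLAB, a minimal implementation might look as follows. This is only a sketch: the function name, the relative-change stopping test, and the iteration cap are our own choices, since the algorithm above leaves the termination criterion unspecified.

function [v, lambda] = powermethod(A, v0, tol, maxit)
% POWERMETHOD  Sketch of the power method for the dominant eigenpair of A.
% v0 is the initial guess, tol a relative tolerance on the change in the
% eigenvalue estimate, and maxit a cap on the number of iterations.
v = v0 / norm(v0);               % normalize the initial guess
lambda = v' * A * v;             % Rayleigh quotient of v0
for k = 1:maxit
    vt = A * v;                  % v_tilde = A * v_{k-1}
    v = vt / norm(vt);           % v_k = v_tilde / ||v_tilde||_2
    lambda_new = v' * A * v;     % lambda_1^(k) = v_k' * A * v_k
    if abs(lambda_new - lambda) <= tol * abs(lambda_new)
        lambda = lambda_new;     % converged: relative change in the estimate is small
        return
    end
    lambda = lambda_new;
end
end

For instance, [v, lambda] = powermethod(diag(1:32), randn(32,1), 1e-10, 500) should return lambda very close to 32.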
Example 8.2. It is surprising how much a diagonal matrix can tell us about the qualities of an eigenvalue computation method for a more general symmetric matrix A. This is mainly because such matrices have a spectral decomposition A = QDQT with Q orthogonal and D diagonal, and so the diagonal matrix here is representative of the ‘non-orthogonal’ part, loosely speaking. Thus, we are emboldened to present in Figure 8.1 two experiments applying the power method to diagonal matrices.
Figure 8.1. Convergence behavior of the power method for two diagonal matrices, Example 8.2.
In MATLAB syntax the matrices are

u = [1:32];
v = [1:30, 30, 32];
A = diag(u);
B = diag(v);
Thus, in both A and B the largest eigenvalue is λ1 = 32, but while in A the second eigenvalue is λ2 = 31, in B it is λ2 = 30.

The plot depicts the absolute error, given by

|λ_1^{(k)} − λ_1| ≡ |λ_1^{(k)} − 32|.

The results in the graph show the significant improvement in convergence for the second matrix B. The accelerated convergence can best be understood by comparing log(31/32) to log(30/32). The ratio between these two values is approximately 2.03, which indicates a roughly similar factor in speed of convergence.
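This experiment can be reproduced along the following lines; a sketch, in which the 200 iterations and the random initial guess are our own choices and need not match those used for Figure 8.1.

dA = 1:32;
dB = [1:30, 30, 32];
A = diag(dA);  B = diag(dB);
v0 = randn(32, 1);  v0 = v0 / norm(v0);       % one random initial guess for both runs
vA = v0;  vB = v0;
errA = zeros(200, 1);  errB = zeros(200, 1);
for k = 1:200
    vA = A * vA;  vA = vA / norm(vA);  errA(k) = abs(vA' * A * vA - 32);
    vB = B * vB;  vB = vB / norm(vB);  errB(k) = abs(vB' * B * vB - 32);
end
semilogy(1:200, errA, 1:200, errB)            % absolute error in the eigenvalue estimate
legend('A (\lambda_2 = 31)', 'B (\lambda_2 = 30)')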
Exercise 8.3 considers a similar experiment for non-diagonal matrices.
Example 8.3. As promised, let us return to the problem described in Example 8.1, and illustrate how the power method works to yield the PageRank vector defined there. Let e be a vector of length n with all elements equal to 1, and define v0 = (1/n) e. Then, for k = 0, 1, . . . the iteration is defined by

v_i^{(k+1)} = \sum_{j ∈ B_i} \frac{1}{N_j} v_j^{(k)},   i = 1, …, n.

This is equivalent to applying the power method without vector normalization and eigenvalue estimation – operations that are not required for the present simple example.
To illustrate this, consider the link structure of the toy network depicted in Figure 8.2. Here the webpages are nodes in the graph, numbered 1 through 6. Notice that the graph is directed, or equivalently, the matrix associated with it is nonsymmetric. Indeed, a page pointing to another
page does not necessarily imply that the latter also points to the former.
Figure 8.2. A toy network for the PageRank Example 8.3.
The task of constructing the link matrix becomes straightforward now. The jth column rep- resents the outlinks from page j. For instance, the third column of the matrix indicates that node 3 has outlinks to the three nodes numbered 1, 4 and 6. Each of them is thus assigned the value 1/3 in their corresponding location in the (third) column, and the remaining column entries are set to zero. We obtain
A = [ 0    0    1/3   0    0    0
      1/2  0    0     1/2  1/2  0
      0    1/2  0     0    0    1
      1/2  0    1/3   0    0    0
      0    0    0     1/2  0    0
      0    1/2  1/3   0    1/2  0 ].
In the matrix A the rows indicate the inlinks. For example, row 2 indicates that the nodes linking to node 2 are numbered 1, 4 and 5. Note that only the columns sum up to 1; we have no control over the sum of elements in any particular row. Such a matrix is called column stochastic.45
The matrix A is elementwise nonnegative. There is a very elegant, more than century old theory which ensures for this matrix that a unique, simple eigenvalue 1 exists46. Based on this theory and other considerations it can be shown that all other eigenvalues of A are smaller in magnitude and the associated eigenvector of the dominant eigenvalue has real entries.
To compute PageRank, we start with the vector v0 = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)T , and repeatedly multiply it by A. (This choice of an initial guess is based more on a “democratic” state of mind than any mathematical or computational hunch. In the absence of any knowledge on the solution, why not start from a vector that indicates equal ranking to all?!) Eventually, the method converges to
x = (0.2223, 0.3612, 0.6669, 0.3334, 0.1667, 0.4863)^T,
which is the desired PageRank vector. This shows that node 3 is the top ranked entry. The output of a query will not typically include the values of x; rather, the ranking is given, which according to the values of x is (3,6,2,4,1,5).
Note that in this special case, since ∥v0∥1 = 1, all subsequent iterates satisfy ∥vk∥1 = 1 (why?), and hence we are justified in skipping the normalization step in the power iteration.
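In MATLAB, the whole toy computation takes only a few lines; a sketch, in which the fixed count of 100 iterations is our own choice:

A = [ 0    0    1/3   0    0    0
      1/2  0    0     1/2  1/2  0
      0    1/2  0     0    0    1
      1/2  0    1/3   0    0    0
      0    0    0     1/2  0    0
      0    1/2  1/3   0    1/2  0 ];
n = 6;
v = ones(n, 1) / n;                  % v0 = (1/n) e, the "democratic" initial guess
for k = 1:100                        % plain power iteration, no normalization step
    v = A * v;                       % columns of A sum to 1, so norm(v, 1) stays 1
end
[~, order] = sort(v, 'descend');     % order should come out as [3 6 2 4 1 5]

The resulting v is a scalar multiple of the PageRank vector x displayed above; the ordering of its entries is what determines the ranking.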
There is also a probabilistic interpretation to this model. A random surfer follows links in a random walk fashion, with the probability of following a particular link from webpage j given by the value 1/Nj in the corresponding edge. We ask ourselves what is the probability of the surfer being at a given webpage after a long (“infinite”) time, regardless of where they started their journey.
Can we expect this basic model to always work for a general Internet network? Not without making adjustments. Figure 8.3 illustrates two potential difficulties. A dangling node is depicted in the left diagram. Such a situation occurs when there are no outlinks from a certain webpage. Note that this implies nothing about the webpage’s rank! It could be a very important webpage that simply has no links; think, for example, of a country’s constitution. The difficulty that a dangling node generates is that the corresponding column in A is zero. In our toy example, imagine that node 6 did not have a link to node 3. In that case the matrix would have a zero 6th column. This practically means that once we hit node 6, we are “stuck” and cannot continue to follow links.
One simple way this can be fixed in the PageRank model is by changing the entire column from zeros to entries all equal to 1/n (in our example, 1/6). In terms of the probabilistic in- terpretation, this means assuming that if a surfer follows links and arrives at a webpage without outlinks, they can jump out to any webpage with equal probability. This is called a dangling node correction.
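In MATLAB terms this correction is a one-line fix; a sketch, assuming A is the n-by-n link matrix assembled as above:

dangling = (sum(A, 1) == 0);   % logical mask of zero columns (pages with no outlinks)
A(:, dangling) = 1 / n;        % replace each such column by entries all equal to 1/n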
Another potential difficulty with the basic model is the possibility of entering a “dead end”, as is illustrated in the right diagram of Figure 8.3. In graph-speak, this is called a terminal strong component. In the diagram node 1 leads to a cyclic path, and once a surfer arrives at it there is no way out. There is no reason to think that “closed loops” that are isolated from the outer world do not exist in abundance on the Web. There are many instances where this general situation (of course, not necessarily with a cyclic path) can happen; think, for example, of a large body of documentation for a programming language like Java. Webpages point to one another, and it is likely that there are external links to this component of the graph, but it may be the case that there are only interlinks, and no links from this body of documentation back to the outside world.

Figure 8.3. Things that can go wrong with the basic model: depicted are a dangling node (left) and a terminal strong component featuring a cyclic path (right).

45 In the literature the matrix A is often denoted by P^T, indicating that its transpose, P, is row stochastic and contains the information on outlinks per node in its rows. To reduce notational overhead, we do not use a transpose throughout this example.

46 The theory we are referring to here is called Perron-Frobenius, and it can be found in a very vast body of literature on nonnegative matrices.
In the matrix A, this is translated to a block of entries that barely “connect”, or do not connect at all, to any rows or columns of the matrix that do not belong to that block. For one, this may eliminate the uniqueness of the PageRank vector. The way this is fixed in the PageRank model is by forming a convex combination of the matrix A (after the dangling node correction) with a rank-1 matrix. A popular strategy is to replace A by
αA + (1 − α)ue^T,
where α is a damping factor, e is a vector of all 1s, and u is a personalization vector. We then compute the dominant eigenvector of this new matrix. For example, if α = 0.85 and u = (1/n) e then the fix can be interpreted as a model where with probability 0.85 a surfer follows links, but there is a 0.15 chance that the surfer will jump randomly to anywhere on the Web. If u contains zeros then the random jump is done selectively, only to those portions of the graph that correspond to nonzero entries of u; this explains the term “personalization”.
The matrix αA + (1 − α)ue^T has some very interesting spectral properties. In particular, while its dominant eigenvalue is 1, the rest of its eigenvalues (which are generally complex) are bounded in magnitude by α. This readily leads to the conclusion that the smaller α is, the faster the power method converges. The problem is that a small α means that we give less weight to the link structure of the Web.
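In practice the matrix αA + (1 − α)ue^T is never formed explicitly, since ue^T is dense; a power step can be applied using only a matrix-vector product with the sparse A. Here is a sketch of one such step, with α = 0.85 and the uniform personalization vector u = (1/n) e used for illustration:

alpha = 0.85;
u = ones(n, 1) / n;                              % uniform personalization vector
% one power-method step with the damped matrix, without forming it:
% (alpha*A + (1-alpha)*u*e') * v  =  alpha*(A*v) + (1-alpha)*u*(e'*v)
v = alpha * (A * v) + (1 - alpha) * u * sum(v);
% if A is column stochastic and norm(v,1) = 1, then sum(v) = 1 and the
% update simplifies to  v = alpha*(A*v) + (1-alpha)*u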