Algorithms for manifold learning

Lawrence Cayton
lcayton@cs.ucsd.edu
June 15, 2005

Abstract
Manifold learning is a popular recent approach to nonlinear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high; though each data point consists of perhaps thousands of features, it may be described as a function of only a few underlying parameters. That is, the data points are actually samples from a low-dimensional manifold that is embedded in a high-dimensional space. Manifold learning algorithms attempt to uncover these parameters in order to find a low-dimensional representation of the data. In this paper, we discuss the motivation, background, and algorithms proposed for manifold learning. Isomap, Locally Linear Embedding, Laplacian Eigenmaps, Semidefinite Embedding, and a host of variants of these algorithms are examined.
1 Introduction
Many recent applications of machine learning – in data mining, computer vision, and elsewhere – require deriving a classifier or function estimate from an extremely large data set. Modern data sets often consist of a large number of examples, each of which is made up of many features. Though access to an abundance of examples is purely beneficial to an algorithm attempting to generalize from the data, managing a large number of features – some of which may be irrelevant or even misleading – is typically a burden to the algorithm. Overwhelmingly complex feature sets will slow the algorithm down and make finding global optima difficult. To lessen this burden on standard machine learning algorithms (e.g. classifiers, function estimators), a number of techniques have been developed to vastly reduce the quantity of features in a data set —i.e. to reduce the dimensionality of data.
Dimensionality reduction has other, related uses in addition to simplifying data so that it can be efficiently processed. Perhaps the most obvious is visualization; if data lies in a 100-dimensional space, one cannot get an intuitive feel for what the data looks like. However, if a meaningful two- or three-dimensional representation of the data can be found, then it is possible to "eyeball" it. Though this may seem like a trivial point, many statistical and machine learning algorithms have very poor optimality guarantees, so the ability to actually see the data and the output of an algorithm is of great practical interest.
Beyond visualization, a dimensionality reduction procedure may help reveal the underlying forces governing a data set. For example, suppose we are to classify an email as spam or not spam. A typical approach to this problem would be to represent an email as a vector of counts of the words appearing in the email. The dimensionality of this data can easily be in the hundreds, yet an effective dimensionality reduction technique may reveal that there are only a few exceptionally telling features, such as the word "Viagra."
There are many approaches to dimensionality reduction based on a variety of assumptions and used in a variety of contexts. We will focus on an approach initiated recently based on the observation that high-dimensional data is often much simpler than the dimensionality would indicate. In particular, a given high-dimensional data set may contain many features that are all measurements of the same underlying cause, so are closely related. This type of phenomenon is common, for example, when taking video footage of a single object from multiple angles simultaneously. The features of such a data set contain much overlapping information; it would be helpful to somehow get a simplified, non-overlapping representation of the data whose features are identifiable with the underlying parameters that govern the data. This intuition is formalized using the notion of a manifold:
the data set lies along a low-dimensional manifold embedded in a high-dimensional space, where the low-dimensional space reflects the underlying parameters and the high-dimensional space is the feature space. Attempting to uncover this manifold structure in a data set is referred to as manifold learning.
1.1 Organization
We begin by looking at the intuitions, basic mathematics, and assumptions of manifold learning in section 2. In section 3, we detail several algorithms proposed for learning manifolds: Isomap, Locally Linear Embedding, Laplacian Eigenmaps, Semidefinite Embedding, and some variants. These algorithms will be compared and contrasted with one another in section 4. We conclude with future directions for research in section 5.
1.2 Notation and Conventions
Throughout, we will be interested in the problem of reducing the dimensionality of a given data set consisting of high-dimensional points in Euclidean space.
• The high-dimensional input points will be referred to as x1, x2, . . . , xn. At times it will be convenient to work with these points as a single matrix X, where the ith row of X is xi.
• The low-dimensional representations that the dimensionality reduction algorithms find will be referred to as y1, y2, . . . , yn. Y is the matrix of these points.
• n is the number of points in the input.
• D is the dimensionality of the input (i.e. xi ∈ RD).
• d is the dimensionality of the manifold that the input is assumed to lie on and, accordingly, the dimensionality of the output (i.e. yi ∈ Rd).
• k is the number of nearest neighbors used by a particular algorithm.
• N(i) denotes the set of the k-nearest neighbors of xi.
• Throughout, we assume that the (eigenvector, eigenvalue) pairs are ordered by the eigenvalues. That is, if the (eigenvector, eigenvalue) pairs are (vi, λi), for i = 1, . . . , n, then λ1 ≥ λ2 ≥ · · · ≥ λn. We refer to v1, . . . , vd as the top d eigenvectors, and vn−d+1, . . . , vn as the bottom d eigenvectors.
2 Preliminaries
In this section, we consider the motivations for manifold learning and introduce the basic concepts at work. We have already discussed the importance of dimensionality reduction for machine learning; we now look at a classical procedure for this task and then turn to more recent efforts.
2.1 Linear Dimensionality Reduction
Perhaps the most popular algorithm for dimensionality reduction is Principal Components Analysis (PCA). Given a data set, PCA finds the directions (vectors) along which the data has maximum variance in addition to the relative importance of these directions. For example, suppose that we feed a set of three-dimensional points that all lie on a two-dimensional plane to PCA. PCA will return two vectors that span the plane along with a third vector that is orthogonal to the plane (so that all of R3 is spanned by these vectors). The two vectors that span the plane will be given a positive weight, but the third vector will have a weight of zero, since the data does not vary along that direction. PCA is most useful in the case when data lies on or close to a linear subspace of the data set. Given this type of data, PCA will find a basis for the linear subspace and allow one to disregard the irrelevant features.
The manifold learning algorithms may be viewed as non-linear analogs to PCA, so it is worth detailing PCA somewhat. Suppose that X ∈ Rn×D is a matrix whose rows are D-dimensional data points. We are looking for the d-dimensional linear subspace of RD along which the data has maximum variance. If in fact the data lies perfectly along a subspace of RD, PCA will reveal that subspace; otherwise, PCA will introduce some error. The objective function it optimizes is
max_V var(XV),
where V is an orthogonal D×d matrix. If d = 1, then V is simply a unit-length vector which gives the direction of maximum variance. In general, the columns of V give the d dimensions that we project the original data onto. This cost function admits an eigenvalue-based solution, which we now detail for d = 1. The direction of maximum variation is given by a unit vector v to project the data points onto, which is found as follows:
max_{∥v∥=1} var(Xv) = max_{∥v∥=1} [E(Xv)² − (E Xv)²]    (1)
                    = max_{∥v∥=1} E(Xv)²                 (2)
                    = max_{∥v∥=1} Σ_i (xi^T v)²          (3)
                    = max_{∥v∥=1} Σ_i (v^T xi)(xi^T v)   (4)
                    = max_{∥v∥=1} v^T (Σ_i xi xi^T) v    (5)
                    = max_{∥v∥=1} v^T X^T X v.           (6)
Equality (2) follows from assuming the data is mean-centered, without loss of generality. The last line is a form of Rayleigh's Quotient and so is maximized by setting v to the eigenvector with the largest corresponding eigenvalue. In general, the d-dimensional PCA solution is XV, where V is the D×d matrix whose columns are the top d eigenvectors (i.e. (vi, λi) are (eigenvector, eigenvalue) pairs with λ1 ≥ λ2 ≥ · · · ≥ λd).
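To make the eigenvalue-based solution concrete, here is a minimal NumPy sketch of PCA as described above; the function name and the toy data are illustrative additions, not from the paper.

```python
import numpy as np

def pca(X, d):
    """Project the rows of X onto their top d principal components.

    A sketch of the eigenvalue-based solution above: mean-center X,
    then take the top d eigenvectors of X^T X.
    """
    Xc = X - X.mean(axis=0)               # mean-centering justifies equality (2)
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
    order = np.argsort(eigvals)[::-1]     # sort so that lambda_1 >= lambda_2 >= ...
    V = eigvecs[:, order[:d]]             # top d eigenvectors as columns of V
    return Xc @ V, eigvals[order[:d]]     # projected data and the eigenvalue "weights"

# Toy example: 3-d points lying on a 2-d plane; the third eigenvalue is ~0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3))
Y, weights = pca(X, 2)
```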
Despite PCA’s popularity it has a number of lim- itations. Perhaps the most blatant drawback is the requirement that the data lie on linear subspace. Re- turning the previous example, what if the plane was curled as it is in figure 1? Though the data is still intuitively two-dimensional, PCA will not correctly extract this two-dimensional structure. The prob- lem is that this data set, commonly referred as the “swiss roll,” is a two-dimensional manifold, not a two- dimensional subspace. Manifold learning algorithms essentially attempt to duplicate the behavior of PCA, but on manifolds instead of linear subspaces.
We now briefly review the concept of a manifold and formalize the manifold learning problem.
2.2 Manifolds
Consider the curve shown in figure 2. Note that the curve is in R3, yet it has zero volume, and in fact zero area. The extrinsic dimensionality – three – is somewhat misleading since the curve can be parameterized by a single variable. One way of formalizing this intuition is via the idea of a manifold: the curve is a one-dimensional manifold because it locally "looks like" a copy of R1. Let us quickly review some basic terminology from geometry and topology in order to crystallize this notion of dimensionality.

Figure 1: A curled plane: the swiss roll.

Figure 2: A one-dimensional manifold embedded in three dimensions.
Definition 1. A homeomorphism is a continuous function whose inverse is also a continuous function.
Definition 2. A d-dimensional manifold M is a set that is locally homeomorphic with Rd. That is, for each x ∈ M, there is an open neighborhood around x, Nx, and a homeomorphism f : Nx → Rd. These neighborhoods are referred to as coordinate patches, and the map is referred to as a coordinate chart. The image of the coordinate charts is referred to as the parameter space.
Manifolds are terrifically well-studied in mathematics and the above definition is extremely general. We will be interested only in the case where M is a subset of RD, where D is typically much larger than d. In other words, the manifold will lie in a high-dimensional space (RD), but will be homeomorphic with a low-dimensional space (Rd, with d < D).
Additionally, all of the algorithms we will look at impose some smoothness requirements that further constrain the class of manifolds considered.
Definition 3. A smooth (or differentiable) manifold is a manifold such that each coordinate chart is differentiable with a differentiable inverse (i.e., each coordinate chart is a diffeomorphism).
We will be using the term embedding in a number of places, though our usage will be somewhat loose and inconsistent. An embedding of a manifold M into RD is a smooth homeomorphism from M to a subset of RD. The algorithms discussed in this paper find embeddings of discrete point-sets, by which we simply mean a mapping of the point set into another space (typically lower-dimensional Euclidean space). An embedding of a dissimilarity matrix into Rd, as discussed in section 3.1.2, is a configuration of points whose interpoint distances match those given in the matrix.
With this background in mind, we proceed to the main topic of this paper.
2.3 Manifold Learning
We have discussed the importance of dimensionality reduction and provided some intuition concerning the inadequacy of PCA for data sets lying along some non-flat surface; we now turn to the manifold learning approach to dimensionality reduction.
We are given data x1, x2, . . . , xn ∈ RD and we wish to reduce the dimensionality of this data. PCA is appropriate if the data lies on a low-dimensional subspace of RD. Instead, we assume only that the data lies on a d-dimensional manifold embedded into RD, where d < D. Moreover, we assume that the manifold is given by a single coordinate chart.¹ We can now describe the problem formally.

Problem: Given points x1, . . . , xn ∈ RD that lie on a d-dimensional manifold M that can be described by a single coordinate chart f : M → Rd, find y1, . . . , yn ∈ Rd, where yi := f(xi).

¹ All smooth, compact manifolds can be described by a single coordinate chart, so this assumption is not overly restrictive.
Solving this problem is referred to as manifold learning, since we are trying to "learn" a manifold from a set of points. A simple and oft-used example in the manifold learning literature is the swiss roll, a two-dimensional manifold embedded in R3. Figure 3 shows the swiss roll and a "learned" two-dimensional embedding of the manifold found using Isomap, an algorithm discussed later. Notice that points nearby in the original data set neighbor one another in the two-dimensional data set; this effect occurs because the chart f between these two graphs is a homeomorphism (as are all charts).
Though the extent to which "real-world" data sets exhibit low-dimensional manifold structure is still being explored, a number of positive examples have been found. Perhaps the most notable occurrence is in video and image data sets. For example, suppose we have a collection of frames taken from a video of a person rotating his or her head. The dimensionality of the data set is equal to the number of pixels in a frame, which is generally very large. However, the images are actually governed by only a couple of degrees of freedom (such as the angle of rotation). Manifold learning techniques have been successfully applied to a number of similar video and image data sets [Ple03].

Figure 3: The swiss roll (top) and the learned two-dimensional representation (bottom).
3 Algorithms
In this section, we survey a number of algorithms proposed for manifold learning. Our treatment goes roughly in chronological order: Isomap and LLE were the first algorithms developed, followed by Laplacian Eigenmaps. Semidefinite Embedding and the variants of LLE were developed most recently.

Before beginning, we note that every algorithm that we will discuss requires a neighborhood-size parameter k. Every algorithm assumes that within each neighborhood the manifold is approximately flat. Typically, one simply runs an algorithm over a variety of choices of neighborhood size and compares the outputs to pick the appropriate k. Other neighborhood schemes, such as picking a neighborhood radius ε, are sometimes used in place of the k-nearest neighbor scheme.

3.1 Isomap

Isomap [TdL00] – short for isometric feature mapping – was one of the first algorithms introduced for manifold learning. The algorithm is perhaps the best known and most applied among the multitude of procedures now available for the problem at hand. It may be viewed as an extension to Multidimensional Scaling (MDS), a classical method for embedding dissimilarity information into Euclidean space. Isomap consists of two main steps:
1. Estimate the geodesic distances (distances along a manifold) between points in the input using shortest-path distances on the data set's k-nearest neighbor graph.
2. Use MDS to find points in low-dimensional Euclidean space whose interpoint distances match the distances found in step 1.
We consider each of these points in turn.
3.1.1 Estimating geodesic distances
Again, we are assuming that the data is drawn from a d-dimensional manifold embedded in RD and we hope to find a chart back into Rd. Isomap assumes further that there is an isometric chart —i.e. a chart that preserves the distances between points. That is, if xi, xj are points on the manifold and G(xi, xj) is the geodesic distance between them, then there is a chart f : M → Rd such that

∥f(xi) − f(xj)∥ = G(xi, xj).
Furthermore, it is assumed that the manifold is smooth enough that the geodesic distance between nearby points is approximately linear. Thus, the Euclidean distance between nearby points in the high-dimensional data space is assumed to be a good approximation to the geodesic distances between these points.
For points that are distant in the high-dimensional space, the Euclidean distance between them is not a good estimate of the geodesic distance. The problem is that though the manifold is locally linear (at least approximately), this approximation breaks down as the distance between points increases. Estimating distances between distant points, thus, requires a different technique. To perform this estimation, the Isomap algorithm first constructs a k-nearest neighbor graph that is weighted by the Euclidean distances. Then, the algorithm runs a shortest-path algorithm (Dijkstra's or Floyd's) and uses its output as the estimates for the remainder of the geodesic distances.
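The sketch below illustrates this first step with plain NumPy and Floyd's algorithm; the function name is an illustrative assumption and no attempt is made at efficiency.

```python
import numpy as np

def geodesic_distances(X, k):
    """Estimate geodesic distances as in Isomap's first step (a sketch).

    Builds the k-nearest-neighbor graph weighted by Euclidean distance and
    runs Floyd's algorithm for all-pairs shortest paths. O(n^3): fine for
    illustration, not for large n.
    """
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    euc = np.sqrt((diff ** 2).sum(-1))         # pairwise Euclidean distances
    G = np.full((n, n), np.inf)                # inf marks "no edge"
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(euc[i])[1:k + 1]     # k nearest neighbors of x_i
        G[i, nbrs] = euc[i, nbrs]
    G = np.minimum(G, G.T)                     # symmetrize the neighbor graph
    for m in range(n):                         # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G                                   # Isomap squares these before cMDS
```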
3.1.2 Multidimensional Scaling
Once these geodesic distances are calculated, the Isomap algorithm finds points whose Euclidean distances equal these geodesic distances. Since the manifold is isometrically embedded, such points exist, and in fact, they are unique up to translation and rotation. Multidimensional Scaling [CC01] is a classical technique that may be used to find such points.
The basic problem it solves is this: given a matrix D ∈ Rn×n of dissimilarities, construct a set of points whose interpoint Euclidean distances match those in D closely. There are numerous cost functions for this task and a variety of algorithms for minimizing these cost functions. We focus on classical MDS (cMDS), since that is what is used by Isomap.
The cMDS algorithm, shown in figure 4, is based on the following theorem which establishes a connection between the space of Euclidean distance matrices and the space of Gram matrices (inner-product matrices).
Theorem 1. A nonnegative symmetric matrix D ∈ Rn×n, with zeros on the diagonal, is a Euclidean distance matrix if and only if B := −(1/2)HDH, where H := I − (1/n)11^T, is positive semidefinite. Furthermore, this B will be the Gram matrix for a mean-centered configuration with interpoint distances given by D.
A proof of this theorem is given in the appendix. This theorem gives a simple technique to convert a Euclidean distance matrix D into an appropriate Gram matrix B. Let X be the configuration that we are after. Since B is its Gram matrix, B = XX^T. To get X from B, we spectrally decompose B into UΛU^T. Then, X = UΛ^{1/2}.
What if D is not Euclidean? In the application of Isomap, for example, D will generally not be perfectly Euclidean since we are only able to approximate the geodesic distances. In this case, B, as described in the above theorem, will not be positive semidefinite and thus will not be a Gram matrix. To handle this case, cMDS projects B onto the cone of positive semidefinite matrices by setting its negative eigenvalues to 0 (step 3).

classical Multidimensional Scaling
input: D ∈ Rn×n (Dii = 0, Dij ≥ 0), d ∈ {1, . . . , n}
1. Set B := −(1/2)HDH, where H := I − (1/n)11^T is the centering matrix.
2. Compute the spectral decomposition of B: B = UΛU^T.
3. Form Λ+ by setting [Λ+]ij := max{Λij, 0}.
4. Set X := UΛ+^{1/2}.
5. Return [X]n×d.
Figure 4: Classical MDS.
Finally, users of cMDS often want a low-dimensional embedding. In general, X := UΛ+^{1/2} (step 4) will be an n-dimensional configuration. This dimensionality is reduced to d in step 5 of the algorithm by simply removing the n − d trailing columns of X. Doing so is equivalent to projecting X onto its top d principal components, as we now demonstrate.

Let X := UΛ+^{1/2} be the n-dimensional configuration found in step 4. Then, UΛ+U^T is the spectral decomposition of XX^T. Suppose that we use PCA to project X onto d dimensions. Then, PCA returns XV, where V ∈ Rn×d and the columns of V are given by the eigenvectors of X^T X: v1, . . . , vd. Let us compute these eigenvectors:

X^T X vi = ξi vi
(UΛ+^{1/2})^T (UΛ+^{1/2}) vi = ξi vi
(Λ+^{1/2} U^T U Λ+^{1/2}) vi = ξi vi
Λ+ vi = ξi vi.

Thus, vi = ei, where ei is the ith standard basis vector for Rn, and ξi = λi. Then, XV = X[e1 · · · ed] = [X]n×d, so the truncation performed in step 5 is indeed equivalent to projecting X onto its first d principal components.
This equivalence is useful because if a set of dissimilarities can be realized in d dimensions, then all but the first d columns of X will be zeros; moreover, Λ+ reveals the relative importance of each dimension, since these eigenvalues are the same as those found by PCA. In other words, classical MDS automatically finds the dimensionality of the configuration associated with a set of dissimilarities, where dimensionality refers to the dimensionality of the subspace along which X lies. Within the context of Isomap, classical MDS automatically finds the dimensionality of the parameter space of the manifold, which is the dimensionality of the manifold.
Classical MDS finds the optimal d-dimensional configuration of points for the cost function ∥HDH − HD*H∥²_F, where D* runs over all Euclidean distance matrices for d-dimensional configurations. Unfortunately, this cost function makes cMDS tremendously sensitive to noise and outliers; moreover, the problem is NP-hard for a wide variety of more robust cost functions [CD05].
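A minimal NumPy sketch of the cMDS procedure of figure 4 follows; the function name is illustrative, and it assumes D already holds squared dissimilarities, as Isomap provides.

```python
import numpy as np

def classical_mds(D, d):
    """Classical MDS on a matrix D of squared dissimilarities (a sketch of figure 4)."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * H @ D @ H                      # candidate Gram matrix (Theorem 1)
    eigvals, U = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]         # decreasing eigenvalue order
    eigvals, U = eigvals[order], U[:, order]
    lam_plus = np.maximum(eigvals, 0.0)       # project onto the PSD cone (step 3)
    X = U * np.sqrt(lam_plus)                 # X = U Lambda_+^{1/2} (step 4)
    return X[:, :d], lam_plus                 # keep the leading d columns (step 5)
```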
3.1.3 Putting it together
The Isomap algorithm consists of estimating geodesic distances using shortest-path distances and then finding an embedding of these distances in Euclidean space using cMDS. Figure 5 summarizes the algorithm.
One particularly helpful feature of Isomap – not found in some of the other algorithms – is that it automatically provides an estimate of the dimensionality of the underlying manifold. In particular, the number of non-zero eigenvalues found by cMDS gives the underlying dimensionality. Generally, however, there will be some noise and uncertainty in the estimation process, so one looks for a tell-tale gap in the spectrum to determine the dimensionality.
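One simple way to operationalize this "gap" heuristic is sketched below; the tolerance and the use of the largest consecutive-eigenvalue ratio are assumptions of the sketch, not part of Isomap itself.

```python
import numpy as np

def estimate_dim(eigvals, tol=1e-8):
    """Guess the manifold dimensionality from the cMDS spectrum (a sketch)."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    lam = lam[lam > tol]                 # discard (near-)zero eigenvalues
    if lam.size < 2:
        return lam.size
    gaps = lam[:-1] / lam[1:]            # ratios of consecutive eigenvalues
    return int(np.argmax(gaps)) + 1      # number of eigenvalues before the largest gap
```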
Unlike many of the algorithms to be discussed, Isomap has an optimality guarantee [BdSLT00]; roughly, Isomap is guaranteed to recover the parameterization of a manifold under the following assumptions.
1. The manifold is isometrically embedded into RD.
2. The underlying parameter space is convex —i.e. the distance between any two points in the parameter space is given by a geodesic that lies entirely within the parameter space. Intuitively, the parameter space of the manifold from which we are sampling cannot contain any holes.
3. The manifold is well sampled everywhere.
4. The manifold is compact.
The proof of this optimality begins by showing that, with enough samples, the graph distances approach the underlying geodesic distances. In the limit, these distances will be perfectly Euclidean and cMDS will return the correct low-dimensional points. Moreover, cMDS is continuous: small changes in the input will result in small changes in the output. Because of this robustness to perturbations, the solution of cMDS will converge to the correct parameterization in the limit.
Isomap has been extended in two major directions: to handle conformal mappings and to handle very large data sets [dST03]. The former extension is referred to as C-Isomap and addresses the issue that many embeddings preserve angles, but not distances. C-Isomap has a theoretical optimality guarantee similar to that of Isomap, but requires that the manifold be uniformly sampled.
Isomap
input: x1, . . . , xn ∈ RD, k
1. Form the k-nearest neighbor graph with edge weights Wij := ∥xi − xj ∥ for neighboring points xi,xj.
2. Compute the shortest path distances between all pairs of points using Dijkstra’s or Floyd’s algo- rithm. Store the squares of these distances in D.
3. Return Y :=cMDS(D).
Figure 5: Isomap.
The second extension, Landmark Isomap, is based on using landmark points to speed up the calculation of interpoint distances and cMDS. Both of these steps require O(n³) timesteps², which is prohibitive for large data sets. The basic idea is to select a small set of l landmark points, where l > d, but l ≪ n. Next, the approximate geodesic distances from each point to the landmark points are computed (in contrast to computing all n×n interpoint distances, only n×l are computed). Then, cMDS is run on the l×l distance matrix for the landmark points. Finally, the remainder of the points are located using a simple formula based on their distances to the landmarks. This technique brings the time complexity of the two computational bottlenecks of Isomap down considerably: the time complexity of L-Isomap is O(dln log n + l³) instead of O(n³).

² The distance calculation can be brought down to O(kn² log n) using Fibonacci heaps.
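As a sketch of where the savings come from, the distance-computation step might look as follows, assuming SciPy's csgraph routines are available and given a k-nearest-neighbor graph with Euclidean edge weights (np.inf marking non-edges), as in the earlier sketch. Only the n × l shortest-path distances are computed; the landmark-placement formula and the landmark cMDS step are omitted.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

def landmark_geodesics(G, num_landmarks, rng=None):
    """Compute geodesic estimates only to l landmark points (a sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n = G.shape[0]
    landmarks = rng.choice(n, size=num_landmarks, replace=False)
    W = np.where(np.isinf(G), 0.0, G)     # in csgraph's dense format, zero/inf mark missing edges
    D_l = dijkstra(W, directed=False, indices=landmarks)
    return landmarks, D_l.T               # n x l matrix of shortest-path distances
```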
3.2 Locally Linear Embedding
Locally Linear Embedding (LLE) [SR03] was introduced at about the same time as Isomap, but is based on a substantially different intuition. The idea comes from visualizing a manifold as a collection of overlapping coordinate patches. If the neighborhood sizes are small and the manifold is sufficiently smooth, then these patches will be approximately linear. Moreover, the chart from the manifold to Rd will be roughly linear on these small patches. The idea, then, is to identify these linear patches, characterize the geometry of them, and find a mapping to Rd that preserves this local geometry and is nearly linear. It is assumed that these local patches will overlap with one another so that the local reconstructions will combine into a global one.
As in all of these methods, the number of neighbors chosen is a parameter of the algorithm. Let N(i) be the set of the k-nearest neighbors of point xi. The first step of the algorithm models the manifold as a collection of linear patches and attempts to characterize the geometry of these linear patches. To do so, it attempts to represent xi as a weighted, convex combination of its nearest-neighbors. The weights are chosen to minimize the following squared error for each i:
∥xi − Σ_{j∈N(i)} Wij xj∥².

The weight matrix W is used as a surrogate for the local geometry of the patches. Intuitively, Wi reveals the layout of the points around xi. There are a couple of constraints on the weights: each row must sum to one (equivalently, each point is represented as a convex combination of its neighbors) and Wij = 0 if j ∉ N(i). The second constraint reflects that LLE is a local method; the first makes the weights invariant to global translations: if each xi is shifted by α then

∥xi + α − Σ_{j∈N(i)} Wij (xj + α)∥² = ∥xi − Σ_{j∈N(i)} Wij xj∥²,

so W is unaffected. Moreover, W is invariant to global rotations and scalings.

Fortunately, there is a closed form solution for W, which may be derived using Lagrange multipliers. In particular, the reconstruction weights for each point xi are given by

W̃ij = Σ_k (C^{-1})jk / Σ_{lm} (C^{-1})lm,

where C is the local covariance matrix with entries Cjk := (xi − ηj)^T (xi − ηk) and ηj, ηk are neighbors of xi. W̃ ∈ Rn×k is then transformed into the sparse n × n matrix W; Wij = W̃il if xj is the lth neighbor of xi, and is 0 if j ∉ N(i).

Wi is a characterization of the local geometry around xi of the manifold. Since it is invariant to rotations, scalings, and translations, Wi will also characterize the local geometry around xi for any linear mapping of the neighborhood patch surrounding xi. We expect that the chart from this neighborhood patch to the parameter space should be approximately linear since the manifold is smooth. Thus, Wi should capture the local geometry of the parameter space as well. The second step of the algorithm finds a configuration in d dimensions (the dimensionality of the parameter space) whose local geometry is characterized well by W. Unlike with Isomap, d must be known a priori or estimated. The d-dimensional configuration is found by minimizing

Σ_i ∥yi − Σ_j Wij yj∥²    (7)

with respect to y1, . . . , yn ∈ Rd. Note that this expression has the same form as the error minimized in the first step of the algorithm, but that we are now minimizing with respect to y1, . . . , yn, rather than W. The cost function (7) may be rewritten as

Y^T M Y,

where

Mij := δij − Wij − Wji + Σ_k Wki Wkj,

and δij = 1 if i = j and 0 otherwise. M may be written more concisely as (I − W)^T (I − W). There are a couple of constraints on Y. First, Y^T Y = I, which forces the solution to be of rank d. Second, Σ_i Yi = 0; this constraint centers the embedding on the origin. This cost function is a form of Rayleigh's quotient, so is minimized by setting the columns of Y to the bottom d eigenvectors of M. However, the bottom (eigenvector, eigenvalue) pair is (1, 0), so the final coordinate of each point is identical. To avoid this degenerate dimension, the d-dimensional embedding is given by the bottom non-constant d eigenvectors.
Conformal Eigenmaps [SS05] is a recent technique that may be used in conjunction with LLE to provide an automatic estimate of d, the dimensionality of the manifold. LLE is summarized in figure 6.

Locally Linear Embedding
input: x1, . . . , xn ∈ RD, d, k
1. Compute reconstruction weights. For each point xi, set
   Wij := Σ_k (C^{-1})jk / Σ_{lm} (C^{-1})lm.
2. Compute the low-dimensional embedding.
   • Let U be the matrix whose columns are the eigenvectors of (I − W)^T (I − W) with nonzero accompanying eigenvalues.
   • Return Y := [U]n×d.
Figure 6: Locally Linear Embedding.
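A compact NumPy sketch of the two LLE steps described above follows; the small regularization of the local covariance is an added assumption for numerical stability and is not part of the derivation.

```python
import numpy as np

def lle(X, d, k, reg=1e-3):
    """Locally Linear Embedding (a sketch of the two steps above)."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    euc = np.sqrt((diff ** 2).sum(-1))
    # Step 1: reconstruction weights.
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(euc[i])[1:k + 1]
        Z = X[nbrs] - X[i]                       # neighbors shifted so x_i is the origin
        C = Z @ Z.T                              # local covariance matrix (k x k)
        C += reg * np.trace(C) * np.eye(k)       # regularization (assumption)
        w = np.linalg.solve(C, np.ones(k))       # proportional to C^{-1} 1
        W[i, nbrs] = w / w.sum()                 # rows sum to one
    # Step 2: bottom non-constant eigenvectors of (I - W)^T (I - W).
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigvals, U = np.linalg.eigh(M)               # ascending eigenvalues
    return U[:, 1:d + 1]                         # skip the constant eigenvector
```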
3.3 Laplacian Eigenmaps
The Laplacian Eigenmaps [BN01] algorithm is based on ideas from spectral graph theory. Given a graph and a matrix of edge weights, W, the graph Laplacian is defined as L := D − W, where D is the diagonal matrix with elements Dii = Σ_j Wij.³ The eigenvalues and eigenvectors of the Laplacian reveal a wealth of information about the graph such as whether it is complete or connected. Here, the Laplacian will be exploited to capture local information about the manifold.

³ L is referred to as the unnormalized Laplacian in many places. The normalized version is D^{-1/2} L D^{-1/2}.

We define a "local similarity" matrix W which reflects the degree to which points are near to one another. There are two choices for W:

1. Wij = 1 if xj is one of the k-nearest neighbors of xi and equals 0 otherwise.

2. Wij = e^{−∥xi−xj∥²/2σ²} for neighboring nodes; 0 otherwise. This is the Gaussian heat kernel, which has an interesting theoretical justification given in [BN03]. On the other hand, using the heat kernel requires manually setting σ, so is considerably less convenient than the simple-minded weight scheme.

We use this similarity matrix W to find points y1, . . . , yn ∈ Rd that are the low-dimensional analogs of x1, . . . , xn. Like with LLE, d is a parameter of the algorithm, but it can be automatically estimated by using Conformal Eigenmaps as a post-processor. For simplicity, we first derive the algorithm for d = 1 —i.e. each yi is a scalar.

Intuitively, if xi and xj have a high degree of similarity, as measured by W, then they lie very near to one another on the manifold. Thus, yi and yj should be near to one another. This intuition leads to the following cost function, which we wish to minimize:

Σ_{ij} Wij (yi − yj)².    (8)

This function is minimized by setting y1 = · · · = yn = 0; we avoid this trivial solution by enforcing the constraint

y^T D y = 1.    (9)

This constraint effectively fixes a scale at which the yi's are placed. Without this constraint, setting yi' = αyi, with any constant α < 1, would give a configuration with lower cost than the configuration y1, . . . , yn. Note that we could achieve the same result with the constraint y^T y = 1; however, using (9) leads to a simple eigenvalue solution.

Using Lagrange multipliers, it can be shown that the bottom eigenvector of the generalized eigenvalue problem Ly = λDy minimizes (8). However, the bottom (eigenvalue, eigenvector) pair is (0, 1), which is yet another trivial solution (it maps each point to 1). Avoiding this solution, the embedding is given by the eigenvector with smallest non-zero eigenvalue.

We now turn to the more general case, where we are looking for d-dimensional representatives y1, . . . , yn of the data points. We will find an n×d matrix Y, whose rows are the embedded points. As before, we wish to minimize the distance between similar points; our cost function is thus

Σ_{ij} Wij ∥yi − yj∥² = tr(Y^T L Y).    (10)

In the one-dimensional case, we added a constraint to avoid the trivial solution of all zeros. Analogously, we must constrain Y so that we do not end up with a solution with dimensionality less than d. The constraint Y^T D Y = I forces Y to be of full dimensionality; we could use Y^T Y = I instead, but Y^T D Y = I is preferable because it leads to a closed-form eigenvalue solution. Again, we may solve this problem via the generalized eigenvalue problem Ly = λDy. This time, however, we need the bottom d non-zero eigenvalues and eigenvectors. Y is then given by the matrix whose columns are these d eigenvectors.

The algorithm is summarized in figure 7.

Laplacian Eigenmaps
input: x1, . . . , xn ∈ RD, d, k, σ
1. Set Wij = e^{−∥xi−xj∥²/2σ²} if xj ∈ N(i), and 0 otherwise.
2. Let U be the matrix whose columns are the eigenvectors of Ly = λDy with nonzero accompanying eigenvalues.
3. Return Y := [U]n×d.
Figure 7: Laplacian Eigenmaps.
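Below is a minimal Python sketch of figure 7, assuming SciPy is available for the generalized eigenproblem Ly = λDy; the symmetrization of W is an implementation assumption.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, d, k, sigma):
    """Laplacian Eigenmaps (a sketch of figure 7)."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    sq = (diff ** 2).sum(-1)                   # squared Euclidean distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sq[i])[1:k + 1]
        W[i, nbrs] = np.exp(-sq[i, nbrs] / (2 * sigma ** 2))  # heat-kernel weights
    W = np.maximum(W, W.T)                     # symmetrize (assumption)
    D = np.diag(W.sum(axis=1))                 # degree matrix
    L = D - W                                  # unnormalized graph Laplacian
    eigvals, U = eigh(L, D)                    # generalized problem, ascending order
    return U[:, 1:d + 1]                       # drop the trivial constant eigenvector
```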
3.3.1 Connection to LLE
Though LLE is based on a different intuition than Laplacian Eigenmaps, Belkin and Niyogi [BN03] were able to show that the algorithms are equivalent in some situations. The equivalence rests on another motivation for Laplacian Eigenmaps, not discussed here, that frames the algorithm as a discretized version of finding the eigenfunctions⁴ of the Laplace-Beltrami operator, L, on a manifold. The authors demonstrate that LLE, under certain assumptions, is finding the eigenfunctions of (1/2)L², which are the same as the eigenfunctions of L. In the proof, they assume that the neighborhood patches selected by LLE are perfectly locally linear (i.e., the neighbors of a point x actually lie on the tangent plane at x). Furthermore, the equivalence is only shown in expectation over locally uniform measures on the manifold.

⁴ In the algorithm presented, we are looking for eigenvectors. The eigenvectors are the eigenfunctions evaluated at the data points.
The connection is an intriguing one, but has not been exploited or even demonstrated on any real data sets. It is not clear how reasonable the assumptions are. These two algorithms are still considered distinct, with different characteristics.
3.4 Semidefinite Embedding
Semidefinite embedding (SDE) [WSS04], yet another algorithm for manifold learning, is based on a physical intuition. Imagine that each point is connected to its nearest neighbors with a rigid rod. Now, we take this structure and pull it as far apart as possible —i.e. we attempt to maximize the distance between points that are not neighbors. If the manifold looks like a curved version of the parameter space, then hopefully this procedure will unravel it properly.
To make this intuition a little more concrete, let us look at a simple example. Suppose we sample some points from a sine curve in R2. The sine curve is a one-dimensional manifold embedded into R2, so let us apply the SDE idea to it to try to find a one-dimensional representation. We first attach each point to its two nearest neighbors. We then pull apart the resulting structure as far as possible. Figure 8 shows the steps of the procedure. As the figure shows, we have successfully found a one-dimensional representation of the data set. On the other hand, if the neighbors were chosen a bit differently, the procedure would have failed.
Figure 8: An illustration of the idea behind SDE. The top figure is a few points sampled from a sine curve. The middle figure shows the result of connecting each point to its two neighbors. The bottom shows the result of the "pulling" step.

With this simple example in mind, we proceed to describe the algorithm more precisely. The SDE algorithm uses a semidefinite program [VB96], which is essentially a linear program with additional constraints that force the variables to form a positive semidefinite matrix. Recall that a matrix is a Gram matrix (an inner product matrix) if and only if it is positive semidefinite. The SDE algorithm uses a semidefinite program to find an appropriate Gram matrix, then extracts a configuration from the Gram matrix using the same trick employed in classical MDS.
The primary constraint of the program is that distances between neighboring points remain the same —i.e. if xi and xj are neighbors, then ∥yi − yj∥ = ∥xi − xj∥, where yi and yj are the low-dimensional representatives of xi and xj. However, we must express this constraint in terms of the Gram matrix, B, since this is the variable the semidefinite program works with (and not Y). Translating the constraint is simple enough, though, since
∥yi − yj∥² = ∥yi∥² + ∥yj∥² − 2⟨yi, yj⟩
           = ⟨yi, yi⟩ + ⟨yj, yj⟩ − 2⟨yi, yj⟩
           = Bii + Bjj − 2Bij.
The constraint, then, is

Bii + Bjj − 2Bij = ∥xi − xj∥²

for all neighboring points xi, xj.
We wish to "pull" the remainder of the points as far apart as possible, so we maximize the interpoint distances of our configuration subject to the above constraint. The objective function is:

max Σ_{i,j} ∥yi − yj∥² = max Σ_{i,j} (Bii + Bjj − 2Bij)
                       = max Σ_{i,j} (Bii + Bjj)
                       = max tr(B).

The second equality follows from a regularization constraint: since the interpoint distances are invariant to translation, there is an extra degree of freedom in the problem which we remove by enforcing

Σ_{ij} Bij = 0.

The semidefinite program may be solved by using any of a wide variety of software packages. The second part of the algorithm is identical to the second part of classical MDS: the eigenvalue decomposition UΛU^T is computed for B and the new coordinates are given by UΛ+^{1/2}. Additionally, the eigenvalues reveal the underlying dimensionality of the manifold. The algorithm is summarized in figure 9.

Semidefinite Embedding
input: x1, . . . , xn ∈ RD, k
1. Solve the following semidefinite program:
   maximize tr(B)
   subject to: Σ_{ij} Bij = 0;
               Bii + Bjj − 2Bij = ∥xi − xj∥² (for all neighboring xi, xj);
               B ≽ 0.
2. Compute the spectral decomposition of B: B = UΛU^T.
3. Form Λ+ by setting [Λ+]ij := max{Λij, 0}.
4. Return Y := UΛ+^{1/2}.
Figure 9: Semidefinite Embedding.

One major gripe with SDE is that solving a semidefinite program requires tremendous computational resources. State-of-the-art semidefinite programming packages cannot handle an instance with n ≥ 2000 or so on a powerful desktop machine. This is a rather severe restriction since real data sets often contain many more than 2000 points. One possible approach to solving larger instances would be to develop a projected subgradient algorithm [Ber99]. This technique resembles gradient descent, but can be used to handle non-differentiable convex functions. This approach has been exploited for a similar problem: finding robust embeddings of dissimilarity information [CD05]. This avenue has not been pursued for SDE.

Instead, there is an approximation known as lSDE [WPS05], which uses matrix factorization to keep the size of the semidefinite program down. The idea is to factor the Gram matrix B into QLQ^T, where L is an l×l Gram matrix of landmark points and l ≪ n. The semidefinite program only finds L and then the rest of the points are located with respect to the landmarks. This method was a full two orders of magnitude faster than SDE in some experiments [WPS05]. This approximation to SDE appears to produce high-quality results, though more investigation is needed.
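For small n, the semidefinite program of figure 9 can be written down directly. The sketch below assumes the cvxpy modeling package (with an SDP-capable solver installed) and is illustrative rather than an efficient implementation.

```python
import numpy as np
import cvxpy as cp  # assumption: cvxpy and an SDP solver are available

def sde(X, k):
    """Semidefinite Embedding (a sketch of figure 9) for small n."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    sq = (diff ** 2).sum(-1)
    neighbors = [np.argsort(sq[i])[1:k + 1] for i in range(n)]

    B = cp.Variable((n, n), PSD=True)             # B must be positive semidefinite
    constraints = [cp.sum(B) == 0]                # centering / regularization constraint
    for i in range(n):
        for j in neighbors[i]:
            constraints.append(B[i, i] + B[j, j] - 2 * B[i, j] == sq[i, j])
    prob = cp.Problem(cp.Maximize(cp.trace(B)), constraints)
    prob.solve()

    # Extract coordinates exactly as classical MDS does.
    eigvals, U = np.linalg.eigh(B.value)
    order = np.argsort(eigvals)[::-1]
    lam_plus = np.maximum(eigvals[order], 0.0)
    return U[:, order] * np.sqrt(lam_plus)        # Y = U Lambda_+^{1/2}
```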
3.5 Other algorithms
We have looked at four algorithms for manifold learning, but there are certainly more. In this section, we briefly discuss two other algorithms.
As discussed above, both the LLE and Laplacian Eigenmaps algorithms are Laplacian-based algorithms. Hessian LLE (hLLE) [DG03] is closely related to these algorithms, but replaces the graph Laplacian with a Hessian⁵ estimator. Like the LLE algorithm, hLLE looks at locally linear patches of the manifold and attempts to map them to a low-dimensional space. LLE models these patches by representing each xi as a weighted combination of its neighbors. In place of this technique, hLLE runs principal components analysis on the neighborhood set to get an estimate of the tangent space at xi (i.e. the linear patch surrounding xi). The second step of LLE maps these linear patches to linear patches in Rd using the weights found in the first step to guide the process. Similarly, hLLE maps the tangent space estimates to linear patches in Rd, warping the patches as little as possible. The "warping" factor is the Frobenius norm of the Hessian matrix. The mapping with minimum norm is found by solving an eigenvalue problem.

⁵ Recall that the Hessian of a function is the matrix of its partial second derivatives.
The hLLE algorithm has a fairly strong optimality guarantee – stronger, in some sense, than that of Isomap. Whereas Isomap requires that the parameter space be convex and the embedded manifold be globally isometric to the parameter space, hLLE requires only that the parameter space be connected and locally isometric to the manifold. In these cases hLLE is guaranteed to asymptotically recover the parameterization. Additionally, some empirical evidence supporting this theoretical claim has been given: Donoho and Grimes [DG03] demonstrate that
hLLE outperforms Isomap on a variant of the swiss roll data set where a hole has been punched out of the surface. Though its theoretical guarantee and some limited empirical evidence make the algorithm a strong contender, hLLE has yet to be widely used, perhaps partly because it is a bit more sophisticated and difficult to implement than Isomap or LLE. It is also worth pointing out that the theoretical guarantees are of a slightly different nature than those of Isomap: whereas Isomap has some finite sample error bounds, the theorems pertaining to hLLE hold only in the continuum limit. Even so, the hLLE algorithm is a viable alternative to many of the algorithms discussed.
Another recent algorithm proposed for manifold learning is Local Tangent Space Alignment (LTSA) [ZZ04]. Similar to hLLE, the LTSA algorithm estimates the tangent space at each point by performing PCA on its neighborhood set. Since the data is assumed to lie on a d-dimensional manifold, and be locally linear, these tangent spaces will admit a d-dimensional basis. These tangent spaces are aligned to find global coordinates for the points. The alignment is computed by minimizing a cost function that allows any linear transformation of each local coordinate space. LTSA is a relatively recent algorithm that has not found widespread popularity yet; as such, its relative strengths and weaknesses are still under investigation.
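The tangent-space estimation step shared by hLLE and LTSA — local PCA on a neighborhood — is sketched below in NumPy; names are illustrative and the Hessian/alignment steps of the full algorithms are not shown.

```python
import numpy as np

def tangent_basis(X, i, k, d):
    """Estimate a d-dimensional tangent basis at x_i by local PCA (a sketch)."""
    sq = ((X - X[i]) ** 2).sum(-1)
    nbrs = np.argsort(sq)[1:k + 1]            # k nearest neighbors of x_i
    Z = X[nbrs] - X[nbrs].mean(axis=0)        # center the neighborhood
    # The top d right singular vectors span the estimated tangent plane at x_i.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Vt[:d].T                           # D x d orthonormal basis
```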
There are a number of other algorithms in addition to the ones described here. We have touched on those algorithms that have had the greatest impact on developments in the field thus far.
4 Comparisons
We have discussed a myriad of different algorithms, all of which are for roughly the same task. Choosing between these algorithms is difficult; there is no clear "best" one. In this section, we discuss the trade-offs associated with each algorithm.
4.1 Embedding type
One major distinction between these algorithms is the type of embedding that they compute. Isomap, Hessian LLE, and Semidefinite Embedding all look for isometric embeddings —i.e. they all assume that there is a coordinate chart from the parameter space to the high-dimensional space that preserves interpoint distances and attempt to uncover this chart. In contrast, LLE (and the C-Isomap variant of Isomap) look for conformal mappings – mappings which preserve local angles, but not necessarily distances. It remains open whether the LLE algorithm actually can recover such mappings correctly. It is unclear exactly what type of embedding the Laplacian Eigenmaps algorithm computes [BN03]; it has been dubbed "proximity preserving" in places [WS04]. This breakdown is summarized in the following table.
Algorithm                   Mapping
C-Isomap                    conformal
Isomap                      isometric
Hessian LLE                 isometric
Laplacian Eigenmaps         unknown
Locally Linear Embedding    conformal
Semidefinite Embedding      isometric
This comparison is not particularly practical, since the form of embedding for a particular data set is usually unknown; however, if, say, an isometric method fails, then perhaps a conformal method would be a reasonable alternative.
4.2 Local versus global
Manifold learning algorithms are commonly split into two camps: local methods and global methods. Isomap is considered a global method because it constructs an embedding derived from the geodesic distance between all pairs of points. LLE is considered a local method because the cost function that it uses to construct an embedding only considers the placement of each point with respect to its neighbors. Similarly, Laplacian Eigenmaps and the derivatives of LLE are local methods. Semidefinite embedding, on the other hand, straddles the local/global division: it is a local method in that intra-neighborhood distances are equated with geodesic distances, but a global method in that its objective function considers distances between all pairs of points. Figure 10 summarizes the breakdown of these algorithms.

Figure 10: The local/global split. (LLE and Laplacian Eigenmaps are local; Isomap is global; SDE straddles the two.)
The local/global split reveals some distinguishing characteristics between algorithms in the two categories. Local methods tend to characterize the local geometry of manifolds accurately, but break down at the global level. For instance, the points in the embedding found by local methods tend to be well placed with respect to their neighbors, but non-neighbors may be much nearer to one another than they are on the manifold. The reason for these aberrations is simple: the constraints on the local methods do not pertain to non-local points. The behavior for Isomap is just the opposite: the intra-neighborhood distances for points in the embedding tend to be inaccurate whereas the large interpoint distances tend to be correct. Classical MDS creates this behavior; its cost function contains fourth-order terms of the distances, which exacerbates the differences in scale between small and large distances. As a result, the emphasis of the cost function is on the large interpoint distances and the local topology of points is often corrupted in the embedding process. SDE behaves more like the local methods in that the distances between neighboring points are guaranteed to be the same before and after the embedding.
The local methods also seem to handle sharp curvature in manifolds better than Isomap. Additionally, shortcut edges – edges between points that should not actually be classified as neighbors – can have a potentially disastrous effect in Isomap. A few shortcut edges can badly damage all of the geodesic distance approximations. In contrast, this type of error does not seem to propagate through the entire embedding with the local methods.
4.3 Time complexity
The primary computational bottleneck for all of these algorithms – except for semidefinite embedding – is a spectral decomposition. Computing this decomposition for a dense n × n matrix requires O(n³) time steps. For sparse matrices, however, a spectral decomposition can be performed much more quickly. The global/local split discussed above turns out to be a dense/sparse split as well.
First, we look at the local methods. LLE builds a matrix of weights W ∈ Rn×n such that Wij ≠ 0 only if j is one of i's nearest neighbors. Thus, W is very sparse: each row contains only k nonzero elements and k is usually small compared to the number of points. LLE performs a sparse spectral decomposition of (I − W)^T (I − W). Moreover, this matrix contains some additional structure that may be used to speed up the spectral decomposition [SR03]. Similarly, the Laplacian Eigenmaps algorithm finds the eigenvectors of the generalized eigenvalue problem Ly = λDy. Recall that L := D − W, where W is a weight matrix and D is the diagonal matrix of row-sums of W. All three of these matrices are sparse: W only contains k nonzero entries per row, D is diagonal, and L is their difference.

The two LLE variants described, Hessian LLE and Local Tangent Space Alignment, both require one spectral decomposition per point; luckily, the matrices to be decomposed only contain n × k entries, so the decompositions require only O(k³) time steps. Both algorithms also require a spectral decomposition of an n × n matrix, but these matrices are sparse.
For all of these local methods, the spectral decompositions scale, pessimistically, with O((d + k)n²), which is a substantial improvement over O(n³).
The spectral decomposition employed by Isomap, on the other hand, is not sparse. The actual decomposition, which occurs during the cMDS phase of the algorithm, is of B = −(1/2)HDH, where D is the matrix of approximate interpoint square geodesic distances, and H is a constant matrix. The matrix B ∈ Rn×n only has rank approximately equal to the dimensionality of the manifold, but is nevertheless dense since D has zeros only on the diagonal. Thus the decomposition requires a full O(n³) timesteps. This complexity is prohibitive for large data sets. To manage large data sets, Landmark Isomap was created. The overall time complexity of Landmark Isomap is the much more manageable O(dln log n + l³), but the solution is only an approximate one.
Semidefinite embedding is by far the most computationally demanding of these methods. It also requires a spectral decomposition of a dense n × n matrix, but the real bottleneck is the semidefinite program. Semidefinite programming methods are still relatively new and, perhaps as a result, quite slow. The theoretical time complexity of these procedures varies, but standard contemporary packages cannot handle an instance of the SDE program if the number of points is greater than 2000 or so. Smaller instances can be handled, but often require an enormous amount of time. The lSDE approximation is a fast alternative to SDE that has been shown to yield a speedup of two orders of magnitude in some experiments [WPS05]. There has been no work relating the time requirements of lSDE back to other manifold learning algorithms yet.
To summarize: the local methods – LLE, Laplacian Eigenmaps, and variants – scale well up to very large data sets; Isomap can handle medium size data sets and the Landmark approximation may be used for large data sets; SDE can handle only fairly small instances, though its approximation can tackle medium to large instances.
4.4 Other issues
One shortcoming of Isomap, first pointed out as a theoretical issue in [BdSLT00] and later examined empirically, is that it requires that the underlying parameter space be convex. Without convexity, the geodesic estimates will be flawed. Many manifolds do not have convex parameter spaces; one simple example is given by taking the swiss roll and punching a hole out of it. More importantly, [DG02] found that many image manifolds do not have this property. None of the other algorithms seem to have this shortcoming.
The only algorithms with theoretical optimality guarantees are Hessian LLE and Isomap. Both of these guarantees are asymptotic ones, though, so it is questionable how useful they are. On the other hand, these guarantees at least demonstrate that hLLE and Isomap are conceptually correct.
Isomap and SDE are the only two algorithms that automatically provide an estimate of the dimensionality of the manifold. The other algorithms require the target dimensionality to be known a priori and often produce substantially different results for different choices of dimensionality. However, the recent Conformal Eigenmaps algorithm provides a patch with which to automatically estimate the dimensionality when using Laplacian Eigenmaps, LLE, or its variants.
5 What’s left to do?
Though manifold learning has become an increasingly popular area of study within machine learning and related fields, much is still unknown. One issue is that all of these algorithms require a neighborhood size
parameter k. The solutions found are often very different depending on the choice of k, yet there is little work on actually choosing k. One notable exception is [WZZ04], but these techniques are discussed only for LTSA.
Perhaps the most curious aspect of the field is that there are already so many algorithms for the same task, yet new ones continue to be developed. In this paper, we have discussed a total of nine algorithms, but our coverage was certainly not exhaustive. Why are there so many?
The primary reason for the abundance of algorithms is that evaluating these algorithms is difficult. Though there are a couple of theoretical performance results for these algorithms, the primary evaluation methodology has been to run the algorithm on some (often artificial) data sets and see whether the results are intuitively pleasing. More rigorous evaluation methodologies remain elusive. Performing an experimental evaluation requires knowing the underlying manifolds behind the data and then determining whether the manifold learning algorithm is preserving useful information. But, determining whether a real-world data set actually lies on a manifold is probably as hard as learning the manifold; and, using an artificial data set may not give realistic results. Additionally, there is little agreement on what constitutes "useful information" – it is probably heavily problem-dependent. This uncertainty stems from a certain slack in the problem: though we want to find a chart into low-dimensional space, there may be many such charts, some of which are more useful than others. Another problem with experimentation is that one must somehow select a realistic cross-section of manifolds.
Instead of an experimental evaluation, one might establish the strength of an algorithm theoretically. In many cases, however, algorithms used in machine learning do not have strong theoretical guarantees associated with them. For example, Expectation-Maximization can perform arbitrarily poorly on a data set, yet it is a tremendously useful and popular algorithm. A good manifold learning algorithm may well have similar behavior. Additionally, the type of theoretical results that are of greatest utility are finite-sample guarantees. In contrast, the algorithmic guarantees for hLLE and Isomap6, as well as a number of other algorithms from machine learning, are asymptotic results.
6The theoretical guarantees for Isomap are a bit more useful as they are able to bound the error in the finite sample case.
Devising strong evaluation techniques looks to be a difficult task, yet is absolutely necessary for progress in this area. Perhaps even more fundamental is determining whether real-world data sets actually exhibit a manifold structure. Some video and image data sets have been found to feature this structure, but much more investigation needs to be done. It may well turn out that most data sets do not contain embedded manifolds, or that the points lie along a manifold only very roughly. If the fundamental assumption of manifold learning turns out to be unrealistic for data mining, then it needs to be determined if these techniques can be adapted so that they are still useful for dimensionality reduction.
The manifold learning approach to dimensionality reduction shows a lot of promise and has had some successes. But the most fundamental questions are still basically open:
Is the assumption of a manifold structure reasonable?
If it is, then:
How successful are these algorithms at uncovering this structure?
Acknowledgments
Thanks to Sanjoy Dasgupta for valuable suggestions. Thanks also to Henrik Wann Jensen and Virginia de Sa for being on my committee. Figures 1 and 3 were generated using modified versions of code downloaded from the Isomap website (http://isomap.stanford.edu) and the Hessian LLE website (http://basis.stanford.edu/HLLE).
References
[BdSLT00] Mira Bernstein, Vin de Silva, John Langford, and Joshua Tenenbaum. Graph approximations to geodesics on embedded manifolds. Manuscript, Dec 2000.

[Ber99] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.

[BN01] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, 2001.

[BN03] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, June 2003.

[Bur05] Christopher J.C. Burges. Geometric methods for feature extraction and dimensional reduction. In L. Rokach and O. Maimon, editors, Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers, 2005.

[CC01] T.F. Cox and M.A.A. Cox. Multidimensional Scaling. Chapman and Hall/CRC, 2nd edition, 2001.

[CD05] Lawrence Cayton and Sanjoy Dasgupta. Cost functions for euclidean embedding. Unpublished manuscript, 2005.

[DG02] David L. Donoho and Carrie Grimes. When does isomap recover the natural parametrization of families of articulated images? Technical report, Department of Statistics, Stanford University, 2002.

[DG03] David L. Donoho and Carrie Grimes. Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100:5591–5596, 2003.

[dST03] Vin de Silva and Joshua Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, 2003.

[dST04] Vin de Silva and Joshua B. Tenenbaum. Sparse multidimensional scaling using landmark points. Manuscript, June 2004.

[Mil65] John W. Milnor. Topology from the differentiable viewpoint. The University Press of Virginia, 1965.

[Ple03] Robert Pless. Using isomap to explore video sequences. In Proc. International Conference on Computer Vision, 2003.

[SR03] L. Saul and S. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119–155, 2003.

[SS05] F. Sha and L.K. Saul. Analysis and extension of spectral methods for nonlinear dimensionality reduction. Submitted to ICML, 2005.

[TdL00] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.

[VB96] Lieven Vandenberghe and Stephen Boyd. Semidefinite programming. SIAM Review, 1996.

[WPS05] K. Weinberger, B. Packer, and L. Saul. Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.

[WS04] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.

[WSS04] K.Q. Weinberger, F. Sha, and L.K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the International Conference on Machine Learning, 2004.

[WZZ04] Jing Wang, Zhenyue Zhang, and Hongyuan Zha. Adaptive manifold learning. In Advances in Neural Information Processing Systems, 2004.

[ZZ03] Hongyuan Zha and Zhenyue Zhang. Isometric embedding and continuum isomap. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.

[ZZ04] Zhenyue Zhang and Hongyuan Zha. Principal manifolds and nonlinear dimension reduction via local tangent space alignment. SIAM Journal of Scientific Computing, 26(1):313–338, 2004.
A Proof of theorem 1
First, we restate the theorem.
Theorem. A nonnegative symmetric matrix D ∈ Rn×n, with zeros on the diagonal, is a Euclidean distance matrix if and only if B := −(1/2)HDH, where H := I − (1/n)11^T, is positive semidefinite. Furthermore, this B will be the Gram matrix for a mean-centered configuration with interpoint distances given by D.

Proof: Suppose that D is a Euclidean distance matrix. Then, there exist x1, . . . , xn such that ∥xi − xj∥² = Dij for all i, j. We assume without loss that these points are mean-centered —i.e. Σ_i xij = 0 for all j. We expand the norm:

Dij = ∥xi − xj∥² = ⟨xi, xi⟩ + ⟨xj, xj⟩ − 2⟨xi, xj⟩,    (11)

so

⟨xi, xj⟩ = −(1/2)(Dij − ⟨xi, xi⟩ − ⟨xj, xj⟩).    (12)

Let Bij := ⟨xi, xj⟩. We proceed to rewrite Bij in terms of D. From (11),

Σ_i Dij = Σ_i Bii + Σ_i Bjj − 2 Σ_i Bij
        = Σ_i (Bii + Bjj).

The second equality follows because the configuration is mean-centered. It follows that

(1/n) Σ_i Dij = (1/n) Σ_i Bii + Bjj.    (13)

Similarly,

(1/n) Σ_j Dij = (1/n) Σ_j Bjj + Bii,    (14)

and

(1/n²) Σ_{ij} Dij = (2/n) Σ_i Bii.    (15)

Subtracting (15) from the sum of (14) and (13) leaves Bjj + Bii on the right-hand side, so we may rewrite (12) as

Bij = −(1/2)( Dij − (1/n) Σ_i Dij − (1/n) Σ_j Dij + (1/n²) Σ_{ij} Dij )
    = −(1/2)[HDH]ij.

Thus, −(1/2)HDH is a Gram matrix for x1, . . . , xn.

Conversely, suppose that B := −(1/2)HDH is positive semidefinite. Then, there is a matrix X such that XX^T = B. We show that the points given by the rows of X have square interpoint distances given by D. First, we expand B using its definition:

Bij = −(1/2)[HDH]ij = −(1/2)( Dij − (1/n) Σ_i Dij − (1/n) Σ_j Dij + (1/n²) Σ_{ij} Dij ).

Let xi be the ith row of X. Then,

∥xi − xj∥² = ∥xi∥² + ∥xj∥² − 2⟨xi, xj⟩ = Bii + Bjj − 2Bij = Dij.

The final equality follows from expanding Bij, Bii, and Bjj as written above. Thus, we have shown that the configuration x1, . . . , xn has square interpoint distances given by D, so D is a Euclidean distance matrix.
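The identity at the heart of this proof is easy to check numerically; the NumPy sketch below (an illustrative addition) verifies that −(1/2)HDH recovers the Gram matrix of a mean-centered configuration and is positive semidefinite.

```python
import numpy as np

# Numerical sanity check of Theorem 1 on a random mean-centered configuration.
rng = np.random.default_rng(0)
n, dim = 6, 3
X = rng.normal(size=(n, dim))
X -= X.mean(axis=0)                                   # mean-center the points
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # squared interpoint distances
H = np.eye(n) - np.ones((n, n)) / n                   # centering matrix
B = -0.5 * H @ D @ H
assert np.allclose(B, X @ X.T)                        # Gram matrix recovered
assert np.all(np.linalg.eigvalsh(B) > -1e-9)          # positive semidefinite
```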