Chapter 14: Unsupervised Learning
January 3, 2014
Outline
1 Introduction
2 Cluster Analysis
3 Self-Organizing Maps
4 Principal Components, Curves and Surfaces
5 Independent Component Analysis and Exploratory Projection Pursuit
6 Multidimensional Scaling
Introduction
Problem: One has a set of N observations (x1,x2, · · · , xN ) of a random p-vector X with joint density Pr (X ). The goal is to directly infer the properties of this probability density without the help of a “supervisor”.
The dimension p is often very large.
The properties of interest are often more complicated than simple location estimates.
Unsupervised learning can be more difficult because no direct measure of success is available; heuristic arguments must be used to judge both the motivation for an algorithm and the quality of its results.
Cluster analysis attempts to find multiple convex regions of the X-space that contain modes of Pr(X). This can tell whether or not Pr(X) can be represented by a mixture of simpler densities representing distinct types or classes of observations.
Cluster Analysis
Cluster analysis or data segmentation methods group or segment a collection of objects into subsets or 'clusters', such that objects within each cluster are more closely related to one another than to objects assigned to different clusters. The goal is to find patterns or uncover structure in the data.
In addition, the goal is sometimes to arrange the clusters into a natural hierarchy.
Ideally, cluster analysis will form groups that are well separated in the space of X.
In most cases, the number of clusters is unknown.
Cluster Analysis Example
FIGURE 14.4. Simulated data in the plane (coordinates X1, X2), clustered into three classes (represented by orange, blue and green) by the K-means clustering algorithm.
Proximity Matrices
Fundamental to all clustering techniques is the choice of distance or dissimilarity measure between two objects.
Proximity Matrices can be either similarity or dissimilarity matrices.
A suitable monotone-decreasing function can be used to convert a similarity matrix into a dissimilarity matrix.
If there are N objects, the dissimilarity matrix D is of dimension N × N, symmetric and non-negative with all diagonal elements 0; it is used as the input to a clustering algorithm.
Subjectively judged dissimilarities are seldom distances in the strict sense, since the triangle inequality does not hold.
Dissimilarities Based on Attributes
We need to consider the nature of the variables, the scales of measurement, and very often subject-matter knowledge.
Quantitative variables: squared differences (Euclidean distance) or absolute differences (Manhattan distance) are common choices. For two points x = (x_1, ..., x_p) and y = (y_1, ..., y_p):
Euclidean distance: d(x, y) = (x − y)'(x − y)
Statistical distance: d(x, y) = (x − y)'A(x − y)
Minkowski metric: d(x, y) = [Σ_{i=1}^p |x_i − y_i|^m]^{1/m}
Manhattan distance: d(x, y) = Σ_{i=1}^p |x_i − y_i|
Notes: the correlation-based measure computed on standardized inputs is equivalent to the measure based on squared differences.
Ordinal variables: replace their M original values with (i − 1/2)/M, i = 1, ..., M. They can then be treated as quantitative variables.
Categorical variables: the most common choice is L_{rr'} = 1 for all r ≠ r'.
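As a hedged illustration (not from the slides), base R's dist() computes several of these dissimilarities directly; the toy matrix x below is made up for the example.

# Toy data: 5 observations on 3 quantitative variables
x <- matrix(rnorm(15), nrow = 5)
dist(x, method = "euclidean")          # Euclidean distance
dist(x, method = "manhattan")          # Manhattan (L1) distance
dist(x, method = "minkowski", p = 3)   # Minkowski metric with m = 3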
Object Dissimilarity
To combine dissimilarities based on a mixture of attribute types, a weighted sum is used; choosing appropriate weights is usually difficult and depends on the context of each problem.
D(x_i, x_{i'}) = Σ_{j=1}^p w_j d_j(x_{ij}, x_{i'j}),   with Σ_{j=1}^p w_j = 1
Setting the weight wj to the same value for each variable (say,
wj = 1,∀j) does not necessarily give all attributes equal influence. For example, with p quantitative variables and squared-error distance used for each coordinate, the relative importance of each variable is proportional to its variance over the data set.
Specifying an appropriate dissimilarity measure is far more important in obtaining success with clustering than choice of clustering algorithm.
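A small hedged sketch of this point (the data-generating step is invented for illustration): clustering the raw data and the standardized data, which corresponds to weights w_j proportional to 1/(2 · var(X_j)), can give different memberships.

set.seed(1)
# Hypothetical data: two groups separated along X1, with a noisy high-variance X2
x <- cbind(X1 = c(rnorm(50, -4, 0.5), rnorm(50, 4, 0.5)),
           X2 = rnorm(100, 0, 5))
km_raw <- kmeans(x, centers = 2, nstart = 20)         # raw features
km_std <- kmeans(scale(x), centers = 2, nstart = 20)  # standardized features
table(km_raw$cluster, km_std$cluster)                 # memberships may differ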
Object Dissimilarity
Giving all attributes equal influence can be highly counter-productive:
FIGURE 14.5. Simulated data: on the left, K-means clustering (with K=2) has been applied to the raw data. The two colors indicate the cluster memberships. On the right, the features were first standardized before clustering. This is equivalent to using feature weights 1/[2 · var(Xj)]. The standardization has obscured the two well-separated groups. Note that each plot uses the same units in the horizontal and vertical axes.
Combinatorial Algorithms
Minimizing within-cluster point scatter is equivalent to maximizing between-cluster point scatter:
T = (1/2) Σ_{i=1}^N Σ_{i'=1}^N d_{ii'} = (1/2) Σ_{k=1}^K Σ_{C(i)=k} [ Σ_{C(i')=k} d_{ii'} + Σ_{C(i')≠k} d_{ii'} ]
W(C) = (1/2) Σ_{k=1}^K Σ_{C(i)=k} Σ_{C(i')=k} d_{ii'}
B(C) = (1/2) Σ_{k=1}^K Σ_{C(i)=k} Σ_{C(i')≠k} d_{ii'}
T = W(C) + B(C)
One simply minimizes W or equivalently maximizes B over all
possible assignments of the N data points to K clusters.
Optimization by complete enumeration is not feasible for any medium or large data sets. Some algorithms can provide local optima search results.
K-Means Clustering Algorithm
It is intended for situations in which all variables are of the quantitative type, and squared Euclidean distance is chosen as the dissimilarity measure.
Minimizing W(C):
W(C) = (1/2) Σ_{k=1}^K Σ_{C(i)=k} Σ_{C(i')=k} ||x_i − x_{i'}||² = Σ_{k=1}^K N_k Σ_{C(i)=k} ||x_i − x̄_k||²
where x̄_k is the mean vector associated with the kth cluster and N_k = Σ_{i=1}^N I(C(i) = k).
K-Means Clustering Algorithm
Initially, the data points are assigned at random to the K sets.
Step 1: compute the centroid of each set.
Step 2: assign every point to the cluster whose centroid is closest to that point.
These two steps are alternated until a stopping criterion is met, i.e., when there is no further change in the assignment of the data points.
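A minimal sketch of these two alternating steps, written as plain R for illustration (the function name kmeans_sketch is invented here; it ignores the possibility of empty clusters for brevity):

kmeans_sketch <- function(x, K, iter_max = 50) {
  x  <- as.matrix(x)
  cl <- sample(K, nrow(x), replace = TRUE)        # random initial assignment
  for (it in seq_len(iter_max)) {
    # Step 1: centroid of each current cluster (assumes no cluster is empty)
    centers <- t(sapply(seq_len(K), function(k) colMeans(x[cl == k, , drop = FALSE])))
    # Step 2: reassign each point to its nearest centroid
    # (Euclidean distance; the argmin is the same as for squared distance)
    d <- as.matrix(dist(rbind(centers, x)))[-seq_len(K), seq_len(K)]
    new_cl <- max.col(-d)                         # column index of the smallest distance
    if (all(new_cl == cl)) break                  # stop: no assignment changed
    cl <- new_cl
  }
  list(cluster = cl, centers = centers)
}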
Notes on K-Means Clustering Algorithm
In general, the algorithm does not achieve a global minimum of W(C) over the assignments. In fact, since the algorithm uses discrete assignment rather than a set of continuous parameters, the ”minimum” it reaches cannot even be properly called a local minimum. Despite these limitations, the algorithm is used fairly frequently as a result of its ease of implementation.
Initialization of K-means: K-means results are highly dependent on the initialization used. One should start the algorithm with many different random choices for the starting means, and choose the solution having the smallest value of the objective function.
Distance Measure: There are different distance measures that can be used. The most common ones are:
L2 distance (squared Euclidean distance)
L1 distance (Manhattan distance): The absolute value of the
componentwise difference between the point and the class center. This is the simplest distance to calculate and may be more robust to outliers. (Also called K-medians).
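As a hedged sketch, the multiple-restart strategy above corresponds to the nstart argument of R's built-in kmeans() (the iris data are used purely for illustration):

x <- iris[, 1:4]
fit <- kmeans(x, centers = 3, nstart = 25)   # 25 random starts, keep the best
fit$tot.withinss                             # value of the within-cluster objective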
Notes on K-Means Clustering Algorithm
Termination of K-means algorithm: Theoretically, k-means should terminate when no more changes of class membership occur. There are proofs of termination for k-means; these rely on the fact that both steps of k-means reduce the within-cluster variance, so eventually there is no move left that will continue to reduce it. Running to completion may, however, require a large number of iterations. In software, termination typically occurs when one of the following criteria is met:
Terminate after fewer than m changes of classes
Terminate after J iterations.
Number of classes: The number of classes is usually given as an input variable. There are sometimes good reasons to use a specific number of classes: you may want to classify a scene into vegetation, soil and water, in which case 3 or 4 classes should suffice. Other methods that decide the number of clusters from the data itself will be introduced later.
A simple example using Iris data
First do two runs of the algorithm with different starting points to see if it converges to the same clusters. The three centres are randomly chosen using sample().
ir.kmeans <- kmeans(ir,ir[sample(1:150,3),])
junk <- kmeans(ir,ir[sample(1:150,3),])
table(junk$cluster,ir.kmeans$cluster)
     1  2  3
  1 50  0  0
  2  0 38  0
  3  0  0 62
This will not always happen!
Model-based clustering
Assume that each group of points belongs to a parametric distribution (for example a multivariate normal).
Banfield and Raftery (1993) suggested such an approach, in which we can specify how the cluster shape, size, and orientation varies between clusters.
Clusters are resolved by maximizing the likelihood function:
L(θ_1, θ_2, ..., θ_K, p_1, p_2, ..., p_K | x_1, x_2, ..., x_N) = Π_{n=1}^N Σ_{k=1}^K p_k f(x_n | θ_k)
The EM algorithm is usually used to maximize this likelihood (usually via its logarithm).
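As a hedged sketch (the mclust package is not mentioned in these notes), Mclust() fits such Gaussian mixture models by EM and chooses the covariance structure and number of components by BIC:

library(mclust)
fit <- Mclust(iris[, 1:4], G = 1:5)        # try 1 to 5 mixture components
summary(fit)                               # selected model and number of clusters
table(fit$classification, iris$Species)    # compare clusters with the species labels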
Gaussian Mixtures as Soft K-means Clustering
This is a model-based clustering. Suppose data follows a mixture model density:
g(x) = Σ_{k=1}^K π_k N(μ_k, σ²I),   π_k ≥ 0,   Σ_{k=1}^K π_k = 1
The EM algorithm is used here to maximize the likelihood of the Gaussian mixture. "Soft" here means the likelihood is written in terms of probabilistic assignments of each point to the different clusters.
K-medoids
K-means clustering cannot be applied when we have only distances rather than raw observations.
K-medoids is a variation of the K-means algorithm that works with a general dissimilarity (for example Manhattan distance) and restricts each cluster centre to be one of the cases belonging to that cluster.
This modification of the centre location makes the algorithm computationally much more expensive than the K-means method.
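A hedged sketch using pam() from the cluster package (introduced on the next slide); unlike kmeans(), it accepts a dissimilarity matrix directly:

library(cluster)
d   <- dist(iris[, 1:4], method = "manhattan")
fit <- pam(d, k = 3, diss = TRUE)
fit$medoids                          # the observations chosen as cluster centres
table(fit$clustering, iris$Species)  # compare clusters with the species labels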
K-medoids
Function in R: pam() (Partitioning Around Medoids) in library(cluster). See page 517 for the Country Dissimilarities example:
FIGURE 14.10. Survey of country dissimilarities. (Left panel:) dissimilarities reordered and blocked according to 3-medoid clustering. Heat map is coded from most similar (dark red) to least similar (bright red). (Right panel:) two-dimensional multidimensional scaling plot, with 3-medoid clusters indicated by different colors.
How to estimate the number of classes K* from data?
For data segmentation K is usually defined as part of the problem.
Data based methods for estimating K ∗ typically examine the within cluster dissimilarity Wk as a function of the number of clusters K.
Method 1: based on the intuition that {W_K − W_{K+1} : K < K*} >> {W_K − W_{K+1} : K ≥ K*}, an estimate K̂* is obtained by identifying a 'kink' in the plot of W_K as a function of K.
Method 2: The 'Gap' statistic compares the curve log W_K to the curve obtained from data uniformly distributed over a rectangle. K̂* is estimated to be the place where the gap between the two curves is largest. The formal rule for estimating K* is
K̂* = argmin_K {K : G(K) ≥ G(K+1) − s'_{K+1}}
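A hedged sketch of the gap statistic using clusGap() from the cluster package (the B = 50 reference samples and the use of kmeans are illustrative choices):

library(cluster)
gap <- clusGap(scale(iris[, 1:4]), FUNcluster = kmeans, nstart = 20, K.max = 8, B = 50)
plot(gap)                             # observed vs. expected log W_K and the gap curve
# apply the one-standard-error rule described above
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")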
How to estimate the number of classes K ∗ from data?
FIGURE 14.11. (Left panel): observed (green) and expected (blue) values of log W_K for the simulated data of Figure 14.4. Both curves have been translated to equal zero at one cluster. (Right panel): Gap curve, equal to the difference between the observed and expected values of log W_K. The Gap estimate K* is the smallest K producing a gap within one standard deviation of the gap at K+1; here K* = 2.
How to estimate the number of classes K ∗ from data?
Method 3: The silhouette plot is described in the Splus documentation and Kaufman and Rousseeuw’s book. The idea is that for case i in group A, we let
a(i) = average dissimilarity of i to all other objects of A
b(i) = minimum average dissimilarity of i to any other cluster C.
The width of the ith silhouette is given by
s(i) = (b(i) − a(i)) / max(a(i), b(i))
We interpret it as follows:
Value      Interpretation
near 1     i is well classified
near 0     i is on a boundary between clusters
near -1    i is in the wrong class.
More on silhouette plot
The silhouette plot is a graph of s(i) for each case. A gap separates clusters. The average of s(i) values within a cluster indicates how good the cluster is, and the average of all s(i) values indicates how good the overall clustering is.
To explore this, we run pam for several different numbers of clusters and look at the silhouette values:
More on silhouette plot
for (i in 2:6){
  ir.pam <- pam(ir, k = i)
  cat(i, " avg: ", round(ir.pam$silinfo$avg.width, 2))
  cat("  clus: ", round(ir.pam$silinfo$clus.avg.widths, 2), "\n")
}
2 avg: 0.69 clus: 0.81 0.62
3 avg: 0.55 clus: 0.8 0.42 0.45
4 avg: 0.49 clus: 0.77 0.35 0.39 0.31
5 avg: 0.49 clus: 0.76 0.27 0.43 0.38 0.39
6 avg: 0.47 clus: 0.76 0.31 0.41 0.19 0.38 0.39
So this suggests that 2 clusters is best, since it has the largest average silhouette coefficient.
Agglomerative Hierarchical Clustering
Agglomerative hierarchical methods (bottom-up): start with n clusters and, at each stage, join the two closest clusters into one until only one cluster is left. In deciding which two clusters are closest at each stage, the following methods can be considered:
Single linkage (minimum distance)
Complete linkage (maximum distance)
Average linkage
Ward's method (finding the smallest increase in ESS)
Single linkage can create long strings of clusters (chaining); complete linkage tends to generate compact, roughly spherical clusters, as does average linkage. The hierarchy of merges can be represented by a tree structure or dendrogram.
To decide on a number of clusters, we look for a big jump in dissimilarity.
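A hedged sketch of these linkages with base R's hclust() (the use of the iris measurements is illustrative):

d <- dist(scale(iris[, 1:4]))
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")
hc_ward     <- hclust(d, method = "ward.D2")   # Ward's minimum-variance criterion
plot(hc_average)                               # dendrogram; look for a big jump in height
cutree(hc_average, k = 3)                      # cut the tree into 3 clusters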
Dendrogram
All agglomerative and some divisive methods (when viewed bottom-up) possess a monotonicity property.
The dissimilarity between merged clusters is monotone increasing with the level of the merger.
A dendrogram plots the height of each node as proportional to the value of the intergroup dissimilarity between its two daughters.
Dendrograms from agglomerative hierarchical clustering
FIGURE 14.12. Dendrogram from agglomerative hierarchical clustering with average linkage to the human tumor microarray data.
Dendrograms from agglomerative hierarchical clustering
FIGURE 14.13. Dendrograms from agglomerative hierarchical clustering of human tumor microarray data (panels: Average Linkage, Complete Linkage, Single Linkage).
Divisive Hierarchical Clustering
An advantage of divisive over agglomerative methods can occur when partitioning the data into a relatively small number of clusters. Divisive hierarchical methods (top-down) start with one large cluster containing all n observations; clusters are then divided until each cluster contains only a single observation.
Here is a method which is fully described in chapter 6 of Kaufman and Rousseeuw (1990). The diana-algorithm constructs a hierarchy of clusterings by this method.
At each stage, the cluster with the largest diameter is selected. (The diameter of a cluster is the largest dissimilarity between any two of its observations.)
To divide the selected cluster, the algorithm first looks for its most disparate observation (i.e., which has the largest average dissimilarity to the other observations of the selected cluster). This observation initiates the ”splinter group”.
In subsequent steps, the algorithm reassigns observations that are closer to the ”splinter group” than to the ”old party”. The result is a division of the selected cluster into two new clusters.
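A hedged sketch of this divisive procedure with diana() from the cluster package (using the iris measurements for illustration):

library(cluster)
dv <- diana(dist(scale(iris[, 1:4])))
plot(dv, which.plots = 2)        # dendrogram of the divisive hierarchy
cutree(as.hclust(dv), k = 3)     # cut into 3 clusters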
Self-Organizing Maps-SOM
A constrained version of K-means clustering, in which the prototypes are encouraged to lie in a one- or two-dimensional manifold in the feature space.
Initialization: two-dimensional rectangular grid of K prototypes mj ∈ Rp lie in the two-dimensional principal component plane of the data.
The SOM procedure tries to bend the plane so that the prototypes approximate the data points as well as possible
The effect of the update is to move the prototypes closer to the data, but also to maintain a smooth two-dimensional spatial relationship between the prototypes.
Once the model is fit, the observations can be mapped down onto the two-dimensional grid.
Self-Organizing Maps-SOM
Observations x_i are processed one at a time:
Find the closest prototype m_j to x_i in Euclidean distance in R^p; then
for all neighbors m_k of m_j, move m_k toward x_i via the update
m_k ← m_k + α(x_i − m_k)
The neighbors of m_j are the prototypes whose grid coordinates lie within distance r of those of m_j; that is, neighbors are defined in the space of the one- or two-dimensional manifold, not in R^p.
The learning rate α is typically decreased from, say, 1.0 to 0.0 over a few thousand iterations.
The threshold r is decreased linearly from a starting value R to 1 over a few thousand iterations.
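A hedged sketch using the kohonen package (not part of these notes); the grid size, rlen and alpha values are illustrative choices:

library(kohonen)
x <- scale(iris[, 1:4])
fit <- som(x, grid = somgrid(5, 5, "rectangular"), rlen = 500, alpha = c(0.9, 0.01))
plot(fit, type = "mapping")   # observations mapped onto the 5 x 5 grid of prototypes
plot(fit, type = "changes")   # mean distance to the closest prototype per iteration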
SOM example
FIGURE 14.15. Simulated data in three classes, near the surface of a half-sphere.
SOM example
FIGURE 14.16. Self-organizing map applied to the half-sphere data example. Left panel is the initial configuration, right panel the final one. The 5 × 5 grid of prototypes are indicated by circles, and the points that project to each prototype are plotted randomly within the corresponding circle.
SOM example
FIGURE 14.17. Wiremesh representation of the fitted SOM model in R^3. The lines represent the horizontal and vertical edges of the topological lattice. The double lines indicate that the surface was folded diagonally back on itself in order to model the red points. The cluster members have been jittered to indicate their color, and the purple points are the node centers.
SOM example
FIGURE 14.18. Half-sphere data: reconstruction error for the SOM as a function of iteration. Error for k-means clustering is indicated by the horizontal line.
Self-Organizing Maps-SOM
If r is taken to be very small, so that each neighborhood contains only one point, then the spatial connection between prototypes is lost.
SOM is an online version of K-means clustering.
To check whether SOM works for the problem, compute and compare the reconstruction error ||x − mj ||2, summed over observations for SOM and K-means. If the SOM is a reasonable method, the reconstruction error should not be much larger than that of K-means.
Document Organization and Retrieval example
FIGURE 14.19. Heatmap representation of the SOM model fit to a corpus of 12,088 newsgroup comp.ai.neural-nets contributions (courtesy WEBSOM homepage). The lighter areas indicate higher-density areas. Populated nodes are automatically labeled.
Principal Components
Given an n × p data matrix X, we seek a p × q projection matrix Γ with columns γ_i, i = 1, ..., q, such that XΓ, the q-dimensional projection of X, captures most of the information in the original p-dimensional space.
One solution to this problem is principal components, which identifies a solution Γ such that
1 The columns of Γ are orthogonal.
2 The projected data XΓ are uncorrelated.
3 The variance of Xγi is maximized subject to (1). That is, γ1 is chosen to maximize Var(Xγ1), γ2 is chosen to maximize Var(Xγ2) among all vectors orthogonal to γ1, etc.
4 The variance of the orthogonal distances from the original data to the data projected onto the first k principal components is minimized over all possible k-dimensional subspaces subject to (2). That is, if we use the projected data to “reconstruct” the original data, the reconstruction error is minimized.
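A hedged sketch with base R's prcomp(), where the rotation matrix plays the role of Γ (the iris data are used only for illustration):

pc <- prcomp(iris[, 1:4], scale. = TRUE)
pc$rotation[, 1:2]    # first two columns of Gamma (the loadings), mutually orthogonal
scores <- pc$x[, 1:2] # projected data X Gamma: the first two principal component scores
summary(pc)           # proportion of variance explained by each component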
Principal Components Example 1: Handwritten Digits
Figure 14.22 shows a sample of 130 handwritten 3’s
Each 3 is a digitized 16 × 16 grayscale image, thus p = 256. Here N=658.
FIGURE 14.22. A sample of 130 handwritten 3's shows a variety of writing styles.
The first two principal components of Handwritten Digits
FIGURE 14.23. (Left panel:) the first two principal components of the handwritten threes. The circled points are the closest projected images to the vertices of a grid, defined by the marginal quantiles of the principal components. (Right panel:) The images corresponding to the circled points. These show the nature of the first two principal components.
Principal Curves and Surfaces
Principal curves generalize the principal component line, providing a smooth one-dimensional curved approximation to a set of data points in Rp.
A principal surface is more general, providing a curved manifold approximation of dimension 2 or more.
Principal Curves
Def: Let f(λ) be a parameterized smooth curve in R^p. Hence f(λ) is a vector function with p coordinates, each a smooth function of the single parameter λ. For each data value x, let λ_f(x) define the closest point on the curve to x. Then f(λ) is called a principal curve for the distribution of the random vector X if
f(λ) = E(X | λ_f(X) = λ)
FIGURE 14.27. The principal curve of a set of data, f(λ) = [f_1(λ), f_2(λ)]. Each point on the curve is the average of all data points that project there.
Principal points
Principal points: The set of k points that minimize the expected distance from X to its prototype are called the principal points of the distribution. Principal points are the distributional analogs of centroids found by K-means clustering.
Finding a Principal Curve
Let X^T = (X_1, X_2, ..., X_p) and write the principal curve as f(λ) = [f_1(λ), f_2(λ), ..., f_p(λ)].
Algorithm: alternate the following two steps.
(a) f̂_j(λ) ← E(X_j | λ_f(X) = λ), j = 1, 2, ..., p. A scatterplot smoother is used to estimate the conditional expectations, smoothing each X_j as a function of the arc length λ̂(X).
(b) λ̂_f(x) ← argmin_{λ'} ||x − f̂(λ')||², computed for each observed point.
With finite data, the principal curve algorithm starts with the linear principal component, and iterates the above two steps until convergence.
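A hedged sketch using principal_curve() from the princurve package (a two-variable subset of iris is used purely for illustration):

library(princurve)
x  <- as.matrix(iris[, 1:2])
pc <- principal_curve(x)      # alternates the projection and smoothing steps
plot(x)
lines(pc)                     # the fitted principal curve
pc$lambda[1:5]                # arc-length values for the first few observations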
Principal surfaces
The most commonly used is the two-dimensional principal surface, with coordinate functions
f(λ1,λ2) = [f1(λ1,λ2),··· ,fp(λ1,λ2)]
The estimates in step (a) are obtained from two-dimensional
surface smoothers.
Principal surfaces are very similar to self-organizing maps.
But the principal surface estimates a separate prototype
f (λ1 (xi ), λ2 (xi )) for each data point xi , while the SOM shares a smaller number of prototypes for all data points.
Principal surfaces are continuous and preserve distances locally, while SOMs are discrete.
Principal surfaces fit to the half-sphere data
FIGURE 14.28. Principal surface fit to half-sphere data. (Left panel:) fitted two-dimensional surface. (Right panel:) projections of data points onto the surface, resulting in coordinates λ̂_1, λ̂_2.
Factor analysis
Factor analysis applies when indirect measurements arise from underlying sources that typically cannot be measured directly.
Examples: psychological tests; EEG brain scans, which measure neuronal activity in various parts of the brain indirectly via electromagnetic signals recorded at sensors; the trading prices of stocks.
Factor analysis is a classical technique developed in the statistical literature that aims to identify these latent sources.
The assumption of Gaussian distributions has hindered its usefulness to some extent.
Factor analysis
The SVD of X, X = UDV^T, can be represented as
X = (√N U)(DV^T / √N) = S A^T
Hence each of the columns of X is a linear combination of the columns of S. The columns of S have zero mean, are uncorrelated and have unit variance. Thus we can interpret the SVD as an estimate of a latent variable model.
However, for any orthogonal p × p matrix R we have X = S A^T = S R R^T A^T = S* A*^T, thus it is impossible to identify any particular latent variables as unique underlying sources.
Factor analysis
The classical factor analysis model assumes:
X_1 = a_{11} S_1 + ··· + a_{1q} S_q + ε_1
X_2 = a_{21} S_1 + ··· + a_{2q} S_q + ε_2
⋮
X_p = a_{p1} S_1 + ··· + a_{pq} S_q + ε_p
Here S is a vector of q < p underlying latent variables or factors, A is a p × q matrix of factor loadings, the εj are uncorrelated zero-mean disturbances.
Typically the Sj and the εj are modelled as Gaussian random variables, and the model is fit by maximum likelihood. Unfortunately the identifiability issue remains.
If the Var (εj ) are all assumed to be equal, the leading q components of the SVD identify the subspace determined by S.
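A hedged sketch of maximum-likelihood factor analysis with base R's factanal() (a single factor is fitted to the four iris measurements purely for illustration):

fa <- factanal(iris[, 1:4], factors = 1, rotation = "none")
fa$loadings       # the estimated p x q matrix A of factor loadings
fa$uniquenesses   # the estimated variances of the disturbances eps_j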
Independent Component Analysis
The ICA model has exactly the same form of assumption:
X_1 = a_{11} S_1 + a_{12} S_2 + ··· + a_{1p} S_p
X_2 = a_{21} S_1 + a_{22} S_2 + ··· + a_{2p} S_p
⋮
X_p = a_{p1} S_1 + a_{p2} S_2 + ··· + a_{pp} S_p
but now the Si are assumed to be statistically independent and non-Gaussian.
We wish to recover the mixing matrix A in X = AS
Without loss of generality, we can assume that X has already been whitened to have Cov(X) = I; This in turn implies that A is orthogonal.
ICA tries to find an orthogonal A such that the components of the vector random variable S = AT X are independent (and non-Gaussian).
Independent Component Analysis
The differential entropy H of a random variable Y with density g(y) is given by
H(Y) = −∫ g(y) log g(y) dy
Among all random variables with equal variance, Gaussian variables have the maximum entropy.
In information theory and statistics, negentropy is used as a measure of distance to normality. Negentropy J(Y) is defined as J(Y) = H(Z) − H(Y). Negentropy is always nonnegative, is invariant by any linear invertible change of coordinates, and vanishes if and only if Y is Gaussian.
Independent Component Analysis
The mutual information I(Y) measures the dependence between the components of Y:
I(Y) = Σ_{j=1}^p H(Y_j) − H(Y) = Σ_{j=1}^p H(Y_j) − H(X) − log|det A| = Σ_{j=1}^p H(Y_j) − H(X),
where the last equality holds because A is orthogonal, so log|det A| = 0.
Finding an A to minimize I(Y ) = I(AT X ) looks for the orthogonal transformation that leads to the most independence between its components. This is equivalent to minimizing the sum of the entropies of the separate components of Y, which in turn amounts to maximizing their departures from Gaussianity.
In summary, ICA looks for orthogonal projections such that the projected data look as far from Gaussian as possible.
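A hedged sketch with fastICA() from the fastICA package, recreating data in the spirit of the uniform-mixture example on the later slides (the sources and mixing matrix here are made up):

library(fastICA)
set.seed(1)
S <- cbind(runif(500), runif(500))    # two independent (non-Gaussian) uniform sources
A <- matrix(c(1, 2, 2, 1), 2, 2)      # mixing: X1 = S1 + 2 S2, X2 = 2 S1 + S2
X <- S %*% A
ica <- fastICA(X, n.comp = 2)
head(ica$S)                           # estimated independent components
ica$A                                 # estimated mixing matrix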
ICA example: an artificial time series data
FIGURE 14.37. Illustration of ICA vs. PCA on artificial time-series data. The upper left panel shows the two source signals, measured at 1000 uniformly spaced time points. The upper right panel shows the observed mixed signals. The lower two panels show the principal component and independent component solutions.
ICA example: mixed two uniform dist. data (X1+2X2, 2X1+X2)
FIGURE 14.38. Mixtures of independent uniform random variables. The upper left panel shows 500 realizations from the two independent uniform sources, the upper right panel their mixed versions. The lower two panels show the PCA and ICA solutions, respectively.
Compare the first 5 PCA and ICA results on the Handwritten Digits data
FIGURE 14.39. A comparison of the first five ICA components computed using FastICA (above diagonal) with the first five PCA components (below diagonal). Each component is standardized to have unit variance.
Compare the first 5 PCA and ICA results on the Handwritten Digits data
FIGURE 14.40. The highlighted digits from Figure 14.39 (panels: Mean, ICA 1, ICA 2, ICA 3, ICA 4, ICA 5). By comparing with the mean digits, we see the nature of the ICA components.
EEG data
FIGURE 14.41. Fifteen seconds of EEG data (of 1917 seconds) at nine (of 100) scalp channels (top panel), as well as nine ICA components (lower panel). While nearby electrodes record nearly identical mixtures of brain and non-brain activity, ICA components are temporally distinct. The colored scalps represent the ICA unmixing coefficients â_j as a heatmap, showing brain or scalp location of the source.
Multidimensional Scaling
The idea is to find a lower-dimensional representation of the data that preserves the pairwise distances as well as possible.
MDS requires only the dissimilarities dij , in contrast to methods that need the observations.
Multidimensional scaling seeks values z_1, z_2, ..., z_N ∈ R^k in a lower-dimensional space to minimize the so-called stress function
S_M(z_1, z_2, ..., z_N) = Σ_{i≠i'} (d_{ii'} − ||z_i − z_{i'}||)²
This is called least squares scaling.
A variation on least squares scaling, the Sammon mapping, minimizes
S_Sm(z_1, z_2, ..., z_N) = Σ_{i≠i'} (d_{ii'} − ||z_i − z_{i'}||)² / d_{ii'}
which puts more emphasis on preserving smaller pairwise distances.
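A hedged sketch: classical (metric) scaling with base R's cmdscale() and the Sammon mapping with MASS::sammon(); duplicate rows are dropped because sammon() requires strictly positive dissimilarities.

library(MASS)
d <- dist(scale(unique(iris[, 1:4])))
z_classical <- cmdscale(d, k = 2)              # classical scaling
z_sammon    <- sammon(d, k = 2)$points         # Sammon mapping (least squares variant)
plot(z_classical, xlab = "First MDS coordinate", ylab = "Second MDS coordinate")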
Classical Multidimensional Scaling
Classical scaling minimizes
S_C(z_1, z_2, ..., z_N) = Σ_{i≠i'} (B_{ii'} − ⟨z_i − z̄, z_{i'} − z̄⟩)²
where B_{ii'} = ⟨x_i − x̄, x_{i'} − x̄⟩ is the centered inner product.
Suppose X is a centered n × p data matrix; then the centered inner products are given by B = XX^T.
From the eigen-decomposition B = AΛA^T = (AΛ^{1/2})(AΛ^{1/2})^T, the first k columns of AΛ^{1/2} give the first k principal component scores.
The squared Euclidean distances can be obtained from B: d²_{ij} = B_{ii} + B_{jj} − 2B_{ij}.
Question: Given dij , can we reconstruct X ?
Classical Multidimensional Scaling
Note that d_{ij} is invariant to changes in location, rotations, and reflections.
Thus we cannot expect to recover X completely. But assuming the data are centered, and letting Σ_i B_{ii} = tr(B) = D, we have
Σ_i d²_{ij} = Σ_i B_{ii} + n B_{jj} − 0 = D + n B_{jj}
Σ_j d²_{ij} = n B_{ii} + Σ_j B_{jj} − 0 = D + n B_{ii}
Σ_{ij} d²_{ij} = 2nD
(the cross terms vanish because the data are centered).
Classical Multidimensional Scaling
Now we can get B from dij :
B_{ij} = (1/2)[B_{ii} + B_{jj} − d²_{ij}]                                                     (3)
       = (1/2)[(1/n)(Σ_j d²_{ij} − D) + (1/n)(Σ_i d²_{ij} − D) − d²_{ij}]                     (4)
       = (1/2)[(1/n)(Σ_i d²_{ij} + Σ_j d²_{ij}) − d²_{ij} − (1/n²) Σ_{ij} d²_{ij}]            (5)
With eigen-decomposition of B, we can reconstruct the principal component scores of X.
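A hedged numerical check of this reconstruction (the double centering with the matrix J below is algebraically equivalent to equations (3)-(5); the iris data are illustrative):

x  <- scale(as.matrix(iris[, 1:4]), scale = FALSE)        # centered data
D2 <- as.matrix(dist(x))^2                                # squared Euclidean distances
n  <- nrow(D2)
J  <- diag(n) - matrix(1/n, n, n)                         # centering matrix
B  <- -0.5 * J %*% D2 %*% J                               # recovered inner products X X^T
e  <- eigen(B, symmetric = TRUE)
scores <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))  # first two PC scores from B
max(abs(abs(scores) - abs(prcomp(x)$x[, 1:2])))           # near zero (columns may differ in sign)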
Example of classical Multidimensional Scaling
FIGURE 14.43. First two coordinates for half-sphere data, from classical multidimensional scaling.
Metric scaling and nonmetric scaling
Least squares and classical scaling are referred to as metric scaling methods, in the sense that the actual dissimilarities or similarities are approximated.
Nonmetric scaling effectively uses only ranks. It seeks to minimize the stress function
S_NM(z_1, z_2, ..., z_N) = Σ_{i≠i'} [||z_i − z_{i'}|| − θ(d_{ii'})]² / Σ_{i≠i'} ||z_i − z_{i'}||²
over the zi and an arbitrary increasing function θ.
With θ fixed, we minimize over zi by gradient descent. With the zi fixed, the method of isotonic regression is used to find the best monotonic approximation θ(dii′ ) to ∥zi − zi′ ∥. These steps are iterated until the solutions stabilize.
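A hedged sketch with MASS::isoMDS(), which iterates these two steps; duplicates are dropped because the function requires strictly positive dissimilarities.

library(MASS)
d  <- dist(scale(unique(iris[, 1:4])))
nm <- isoMDS(d, k = 2)       # starts from the classical scaling solution by default
nm$stress                    # final stress (as a percentage)
plot(nm$points, xlab = "First MDS coordinate", ylab = "Second MDS coordinate")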