Unsupervised Learning

Outline: What? Why? (examples, applications) What to do? How?

What?
Data: x_i = (x_{i1}, ..., x_{ip})^T, i = 1, ..., n; matrix form X_{n×p}, whose i-th row is x_i^T. No labels y_i.
Why? Wide applications in many areas:
- Psychology: IQ tests.
- Business: market basket analysis.
- Computer vision / feature extraction: digit and signature recognition.
- Engineering / signal processing: the cocktail party problem.
- CS: data compression (e.g., too many photos on an iPhone).
- Medicine / biology: genome analysis for patients, cancer detection.

Challenging:
- No Y: no teacher, no gold standard, "anything goes".
- Data can be too big: too many examples.
What to do? Main tasks:
- Representation learning / data compression.
- Visualization / interpretability.
- Clustering (grouping; sometimes as input for prediction).
Clustering intuition: "birds of a feather flock together" (物以类聚, 人以群分).

Aside, on representations: Data → Representation → Knowledge → Intelligence → Decisions.
How? Methods:
- PCA, kernel PCA.
- Clustering: K-means, Gaussian Mixture Models, density-based methods (kernel density estimation, DBSCAN), spectral clustering, hierarchical clustering.
- Matrix decomposition: SVD.
Principal Component Analysis (PCA)

PCA is also known as the Karhunen-Loève transform.

Why PCA?
- Dimension reduction
- Lossy data compression
- Feature extraction
- Visualization
Can we recognize car plates (e.g., "HX 920")? Start with a simpler task: handwritten digit recognition (0-9).
FIGURE 1.2. Examples of handwritten digits from U.S. postal envelopes.
Example 3: Handwritten Digit Recognition
The data from this example come from the handwritten ZIP codes on envelopes from U.S. postal mail. Each image is a segment from a five digit ZIP code, isolating a single digit. The images are 16×16 eight-bit grayscale maps, with each pixel ranging in intensity from 0 to 255. Some sample images are shown in Figure 1.2.
The images have been normalized to have approximately the same size and orientation. The task is to predict, from the 16 × 16 matrix of pixel intensities, the identity of each image (0, 1, . . . , 9) quickly and accurately. If it is accurate enough, the resulting algorithm would be used as part of an automatic sorting procedure for envelopes. This is a classification problem for which the error rate needs to be kept very low to avoid misdirection of
An even simpler task: model the variation of a single digit (say, "3") between individuals or over time. Q: what are the common features, and which are important? A: many solutions, linear (PCA, ICA) and nonlinear (kernel PCA, autoencoders). Sources of variation for a handwritten "3": scale, shift, rotation (by some degree).
How?
Data x_i = (x_{i1}, ..., x_{ip})^T, i = 1, ..., n; matrix form X_{n×p} with rows x_i^T.

Singular value decomposition (SVD):
X_{n×p} = U_{n×p} D_{p×p} V_{p×p}^T,
where U^T U = I_{p×p}, V^T V = V V^T = I_{p×p}, and D = diag(d_1, ..., d_p) with d_1 ≥ d_2 ≥ ... ≥ d_p ≥ 0. Equivalently, X^T_{p×n} = V D U^T.
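A quick numerical sketch of this matrix form (numpy; the toy data and variable names are my own, not from the notes): center X, take its SVD, and read off the principal directions V and the projected coordinates UD.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p)) @ np.array([[2.0, 0.0, 0.0],
                                        [1.0, 1.0, 0.0],
                                        [0.0, 0.0, 0.2]])   # correlated toy data

Xc = X - X.mean(axis=0)                              # center the columns
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)    # Xc = U diag(d) V^T

V = Vt.T              # columns are the principal directions v_1, ..., v_p
scores = U * d        # = Xc @ V: the n coordinates along each v_j

# sanity check: the decomposition reproduces the centered data
assert np.allclose(Xc, scores @ Vt)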
Special case: p = 2, n points in R².
Old basis {e_1, e_2}: x_i = x_{i1} e_1 + x_{i2} e_2.
New basis {v_1, v_2} (the columns of V): x_i = (v_1^T x_i) v_1 + (v_2^T x_i) v_2.
The j-th column of UD, i.e., d_j u_j, gives the n coordinates of the projections of the x_i's onto v_j; the scale d_j differs for each PC. v_1 is the first principal direction, v_2 the second.
Example: if X_1 and X_2 are strongly correlated, most of the information lies along the direction v_1 (the combination a X_1 + b X_2); the second direction v_2 carries little information and can be dropped, since it is not that important.
Two equivalent formulations of PCA:
- Maximum-variance formulation (Hotelling, 1933).
- Minimum-error formulation (Pearson, 1901).

Maximum variance (Hotelling, 1933): sequentially search for unit vectors v_1, v_2, ..., v_p such that
- Var(X v_1) is the largest;
- Var(X v_2) is the 2nd largest, subject to v_2^T v_1 = 0;
- in general, Var(X v_j) is the j-th largest, subject to v_j^T v_i = 0 for all i < j and v_j^T v_j = 1.
Finding v_1. First center X so that (1/n) Σ_i x_i = 0. Then
Var(Xv) = v^T S v, where S = (1/n) Σ_i x_i x_i^T is the sample covariance matrix.
Maximize v^T S v subject to v^T v = 1. Lagrangian:
L(v, λ) = v^T S v − λ (v^T v − 1).
Setting ∂L/∂v = 2Sv − 2λv = 0 gives Sv = λv, i.e., v is an eigenvector of S with eigenvalue λ. For such v, v^T S v = λ v^T v = λ, so max_{v^T v = 1} v^T S v = λ_max. Hence v_1 is the eigenvector with the largest eigenvalue λ_1: the First Principal Component direction.
Finding v_2. Maximize Var(Xv_2) = v_2^T S v_2 subject to v_2^T v_2 = 1 and v_2^T v_1 = 0. Lagrangian:
L(v_2, λ_2, δ) = v_2^T S v_2 − λ_2 (v_2^T v_2 − 1) − δ v_2^T v_1.
Setting ∂L/∂v_2 = 2Sv_2 − 2λ_2 v_2 − δ v_1 = 0 and multiplying on the left by v_1^T gives 2 v_1^T S v_2 − δ = 0; but v_1^T S v_2 = λ_1 v_1^T v_2 = 0, so δ = 0 and Sv_2 = λ_2 v_2. Thus v_2^T S v_2 = λ_2, and
λ_2 = max { v^T S v : v^T v = 1, v^T v_1 = 0 }
is the second-largest eigenvalue; its eigenvector v_2 is the Second Principal Component direction.
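The eigen-derivation can be checked numerically; a small sketch (numpy, with a made-up correlated dataset), assuming the 1/n convention for S used above:

import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 2)) @ np.array([[3.0, 0.0], [2.0, 1.0]])  # strongly correlated X1, X2
Xc = X - X.mean(axis=0)

S = Xc.T @ Xc / n                  # sample covariance, 1/n convention as in the notes
lam, V = np.linalg.eigh(S)         # eigh: ascending eigenvalues, orthonormal columns
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]   # lambda_1 >= lambda_2; columns v_1, v_2

# Var(X v_1) = lambda_1, Var(X v_2) = lambda_2, and v_1 is orthogonal to v_2
proj = Xc @ V
assert np.allclose(proj.var(axis=0), lam)   # ddof=0 matches the 1/n convention
assert abs(V[:, 0] @ V[:, 1]) < 1e-12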
Minimum-error formulation (Pearson, 1901): search for orthonormal v_1, ..., v_M such that the reconstruction error
Σ_{i=1}^n || x_i − Σ_{j=1}^M (v_j^T x_i) v_j ||²
is minimized. The solution is the same: the top M eigenvectors of S.
Applications: data compression, dimension reduction (ESL).

FIGURE 10.2. Ninety observations simulated in three dimensions. Left: the
first two principal component directions span the plane that best fits the data. It minimizes the sum of squared distances from each point to the plane. Right: the first two principal component score vectors give the coordinates of the projection of the 90 observations onto the plane. The variance in the plane is maximized.
x_ij ≈ Σ_{m=1}^M z_im φ_jm    (10.5)

(assuming the original data matrix X is column-centered). In other words, together the M principal component score vectors and M principal component loading vectors can give a good approximation to the data when M is sufficiently large. When M = min(n − 1, p), then the representation is exact: x_ij = Σ_{m=1}^M z_im φ_jm.
10.2.3 More on PCA
Scaling the Variables
We have already mentioned that before PCA is performed, the variables should be centered to have mean zero. Furthermore, the results obtained when we perform PCA will also depend on whether the variables have been individually scaled (each multiplied by a different constant). This is in contrast to some other supervised and unsupervised learning techniques, such as linear regression, in which scaling the variables has no effect. (In linear regression, multiplying a variable by a factor of c will simply lead to multiplication of the corresponding coefficient estimate by a factor of 1/c, and thus will have no substantive effect on the model obtained.)
For instance, Figure 10.1 was obtained after scaling each of the variables to have standard deviation one. This is reproduced in the left-hand plot in Figure 10.3. Why does it matter that we scaled the variables? In these data,
Pros and cons of PCA.
Pros:
- Easy to implement: solve S v_i = λ_i v_i, i = 1, ..., p, and inspect the scree plot of the eigenvalues.
- Good for data that look like a mixture of (roughly elliptical) Gaussian clusters.
Cons:
- Not good for non-Gaussian structure (e.g., curved or ring-shaped point clouds).
Many solutions exist; one of them is kernel PCA.
Kernel PCA.
Key steps in PCA:
1. Centering: (1/n) Σ_i x_i = 0.
2. Covariance: S = (1/n) Σ_i x_i x_i^T.
3. Eigendecomposition: S v_i = λ_i v_i.
Now note that each v_i = Σ_{j=1}^n α_{ij} x_j is a linear combination of the data points (v_i lies in the span of {x_1, ..., x_n}).
Finding v_i is equivalent to finding the coefficient vector α_i = (α_{i1}, ..., α_{in})^T. Substitute v = Σ_j α_j x_j into S v = λ v:
(1/n) Σ_l x_l x_l^T ( Σ_j α_j x_j ) = λ Σ_j α_j x_j.
Multiply both sides on the left by x_k^T:
(1/n) Σ_j [ Σ_l (x_k^T x_l)(x_l^T x_j) ] α_j = λ Σ_j (x_k^T x_j) α_j,   k = 1, ..., n.
Letting K_{ij} = x_i^T x_j = k(x_i, x_j) (the n×n Gram / kernel matrix), this reads
K² α = n λ K α,
which is solved by the eigenproblem
K α = n λ α.
Remarks:
- S (p×p) and K (n×n) have the same non-zero eigenvalues; the eigenvectors correspond via v_i = Σ_j α_{ij} x_j.
- Given x ∈ R^p, its score along v_i is
  v_i^T x = Σ_{j=1}^n α_{ij} x_j^T x = Σ_{j=1}^n α_{ij} k(x_j, x).
Kernel trick: replace the inner product x^T y by a kernel k(x, y) = φ(x)^T φ(y) for some (implicit) feature map φ: R^p → F. Everything above only needs k(·, ·), never φ itself.
One issue: φ(x_i) is NOT centered in feature space. Typically one works with the centered kernel matrix
K̃ = K − 1_n K − K 1_n + 1_n K 1_n,
where 1_n is the n×n matrix with all entries equal to 1/n.
Kernel PCA algorithm.
1. Choose a kernel k(·, ·).
2. Calculate K, with K_ij = k(x_i, x_j), and the centered version K̃ = K − 1_n K − K 1_n + 1_n K 1_n.
3. Solve K̃ α_i = n λ_i α_i.
4. Score of a point x along v_i: Σ_{j=1}^n α_{ij} k(x_j, x).
No need to know φ; only k(·, ·) is needed. Typical choices of k(·, ·):
- Gaussian kernel: k(x, y) = exp(−||x − y||² / (2σ²));
- Polynomial kernel: k(x, y) = (c + x^T y)^d.
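A minimal sketch of these four steps (numpy; the Gaussian kernel, the concentric-ring toy data and all names below are choices made here for illustration, not prescribed by the notes):

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), evaluated for all pairs of rows
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_pca_scores(X, n_components=2, sigma=1.0):
    """Scores of the training points along the top kernel-PCA directions."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    one = np.full((n, n), 1.0 / n)
    Kt = K - one @ K - K @ one + one @ K @ one     # centered Gram matrix K~
    mu, A = np.linalg.eigh(Kt)                     # Kt alpha_i = mu_i alpha_i, mu_i = n lambda_i
    mu, A = mu[::-1][:n_components], A[:, ::-1][:, :n_components]
    A = A / np.sqrt(np.maximum(mu, 1e-12))         # unit-norm directions in feature space
    return Kt @ A                                  # row i: scores of x_i along v_1, ..., v_m

# toy demo: two concentric rings, which plain (linear) PCA cannot separate
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.r_[np.full(100, 1.0), np.full(100, 3.0)] + 0.05 * rng.normal(size=200)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
Z = kernel_pca_scores(X, n_components=2, sigma=1.0)
print(Z[:5])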
Demos: plain PCA is no good when the intrinsic dimension is 2 but the structure is nonlinear, e.g., data best described by a radius and an angle; a polynomial kernel k(x, y) = (1 + x^T y)² or a Gaussian (radial) kernel handles such cases.

Other applications of kernel PCA: image denoising, novelty detection.
Related method: Factor Analysis, which models x = A z + ε with low-dimensional latent factors z (e.g., z ~ N(0, I)).
Clustering

Outline: objectives; similarity/dissimilarity measures; techniques (K-means, GMM, ...); applications; challenges.

Objective: group n objects so that similar objects end up in the same group. Either a similarity sim(x_i, x_j) or a dissimilarity d(x_i, x_j) can be used.

The most challenging part of the clustering job is the choice of measure. The clustering depends on which components (features) are used and on the measure: distance vs. similarity, Euclidean distance, cosine, Pearson correlation, and whether the measure is scale-variant or scale-invariant. Many choices; more examples later.
Challenges:
- High-dimensional data: denoise / reduce dimension first (e.g., with PCA).
- Mixed feature types, e.g., x = (weight, gender, ...): a similarity sim(x, y) or a kernel (linear or nonlinear) is needed; a similarity or dissimilarity is often easier to define than a proper distance.
Similarity / dissimilarity.
For x = (x_1, ..., x_p)^T and y = (y_1, ..., y_p)^T, how do we define sim(x, y) or d(x, y)?

Types of variables:
- Quantitative: blood pressure, height, weight, age.
- Ordinal: academic grade, degree of preference (can't stand, dislike, OK, like, terrific).
- Categorical: color (red, blue, green, black, yellow, pink); place of birth (BJ, SGP, HK, SZ, NY).
- Mixed: e.g., x = (gender, grade, ...).
Quantitative / continuous variables: use a distance of the form
d_r(x, y) = ( Σ_{j=1}^p |x_j − y_j|^r )^{1/r},
with r = 2 the Euclidean distance and r = 1 the L1 (Manhattan) distance.
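A tiny sketch of this family of distances (numpy; the function name is mine):

import numpy as np

def minkowski(x, y, r=2):
    # d_r(x, y) = (sum_j |x_j - y_j|^r)^(1/r); r=2 Euclidean, r=1 Manhattan
    return (np.abs(np.asarray(x) - np.asarray(y)) ** r).sum() ** (1.0 / r)

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(minkowski(x, y, r=2))   # sqrt(5) ~= 2.236
print(minkowski(x, y, r=1))   # 3.0
print(np.max(np.abs(x - y)))  # the r -> infinity limit: 2.0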
Pearson correlation:
ρ(x, y) = Σ_j (x_j − x̄)(y_j − ȳ) / sqrt( Σ_j (x_j − x̄)² · Σ_j (y_j − ȳ)² ),
with dissimilarity d(x, y) = 1 − ρ(x, y).
Result: if x and y are standardized (mean 0, variance 1 across their p coordinates), then
Σ_j (x_j − y_j)² = Σ_j x_j² + Σ_j y_j² − 2 Σ_j x_j y_j = 2p (1 − ρ(x, y)),
so squared Euclidean distance and correlation dissimilarity agree up to the constant 2p.
A more robust version of Pearson: Spearman rank correlation,
ρ_s(x, y) = ρ(r_x, r_y), where r_{xj} = rank of x_j and r_{yj} = rank of y_j;
dissimilarity d(x, y) = 1 − ρ_s(x, y).
Cosine similarity: cos θ = x^T y / (||x|| ||y||); sim(x, y) = cos θ, d(x, y) = 1 − cos θ.
Correlation dissimilarity and cosine dissimilarity look similar in form but are different (the correlation centers x and y first).
Kendall's τ (paired rank correlation): for the paired coordinates (x_j, y_j), j = 1, ..., p,
τ = ( # concordant pairs − # discordant pairs ) / ( p(p − 1)/2 ),
where a pair (j, j') is concordant if x_j − x_j' and y_j − y_j' have the same sign, discordant otherwise.
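These correlation-based measures are available in scipy; a short sketch converting each similarity into a dissimilarity as above (the toy vectors are arbitrary):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

rho, _ = stats.pearsonr(x, y)       # Pearson correlation
rho_s, _ = stats.spearmanr(x, y)    # Spearman: Pearson on the ranks
tau, _ = stats.kendalltau(x, y)     # Kendall: concordant vs. discordant pairs
cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# turn each similarity into a dissimilarity as in the notes
for name, s in [("pearson", rho), ("spearman", rho_s), ("kendall", tau), ("cosine", cos)]:
    print(name, 1.0 - s)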
Categorical and ordinal variables.
- Ordinal, e.g., grades A, B, C, D, F: map to 5, 4, 3, 2, 1 and treat as continuous.
- Categorical, e.g., a label in {E, S, W, N} or a color: either define a dissimilarity between categories directly, or use one-hot (dummy) coding, e.g., A → (1, 0, 0, 0, 0), C → (0, 0, 1, 0, 0).
Mixed type: for x_i = (x_{i1}, ..., x_{ip}) with coordinates of different types, combine coordinate-wise dissimilarities with weights,
d(x_i, x_j) = Σ_{l=1}^p w_l d_l(x_{il}, x_{jl}),   w_l ≥ 0, Σ_l w_l = 1.
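A sketch of this weighted mixed-type dissimilarity, using a simple Gower-style choice for each per-coordinate d_l (the feature types, weights and ranges below are hypothetical examples, not from the notes):

import numpy as np

def mixed_dissimilarity(xi, xj, types, weights, ranges=None):
    """d(xi, xj) = sum_l w_l * d_l(x_il, x_jl), with w_l >= 0 and sum w_l = 1.
    types[l] is 'num' (quantitative, scaled by the feature range) or 'cat' (0/1 mismatch)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    total = 0.0
    for l, t in enumerate(types):
        if t == "num":
            rng_l = ranges[l] if ranges is not None else 1.0
            d_l = abs(float(xi[l]) - float(xj[l])) / rng_l
        else:                      # categorical: simple 0/1 mismatch
            d_l = 0.0 if xi[l] == xj[l] else 1.0
        total += w[l] * d_l
    return total

# hypothetical example: (weight in kg, gender, grade)
a = (70.0, "F", "A")
b = (82.0, "M", "A")
print(mixed_dissimilarity(a, b, types=["num", "cat", "cat"],
                          weights=[0.5, 0.25, 0.25], ranges=[40.0, None, None]))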
Setup. Data X = {x_1, ..., x_n}; partition into K groups (clusters) C_1, ..., C_K, where K may be known or unknown. Write C(i) = k if x_i ∈ C_k.

Total variation:
T = (1/2) Σ_i Σ_j d(x_i, x_j)
  = (1/2) Σ_k Σ_{C(i)=k} [ Σ_{C(j)=k} d(x_i, x_j) + Σ_{C(j)≠k} d(x_i, x_j) ]
  = W(C) + B(C),
the within-cluster plus the between-cluster variation. Since T is fixed, minimizing W(C) is equivalent to maximizing B(C).
For squared Euclidean distance, d(x_i, x_j) = ||x_i − x_j||² = Σ_{l=1}^p (x_il − x_jl)², and
W(C) = (1/2) Σ_{k=1}^K Σ_{C(i)=k} Σ_{C(j)=k} ||x_i − x_j||² = Σ_{k=1}^K N_k Σ_{C(i)=k} ||x_i − x̄_k||²,
where N_k = |C_k| and x̄_k is the mean of cluster C_k.

Proof sketch: WLOG assume p = 1. With the kernel h(x_1, x_2) = (x_1 − x_2)²/2 (a U-statistic argument),
(1/n²) Σ_i Σ_j (x_i − x_j)²/2 = (1/n) Σ_i (x_i − x̄)²;
apply this within each cluster and multiply by N_k².

K-means: minimize W(C) over cluster assignments,
min_C Σ_{k=1}^K N_k Σ_{C(i)=k} ||x_i − x̄_k||².
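The identity between the pairwise form and the centroid form of W(C) can be verified numerically; a small numpy check (random data and an arbitrary balanced assignment, names are mine):

import numpy as np

rng = np.random.default_rng(2)
n, p, K = 30, 4, 3
X = rng.normal(size=(n, p))
C = rng.permutation(np.arange(n) % K)   # an arbitrary cluster assignment C(i)

# pairwise form: (1/2) sum_k sum_{C(i)=k} sum_{C(j)=k} ||x_i - x_j||^2
lhs = 0.0
for k in range(K):
    Xk = X[C == k]
    diff = Xk[:, None, :] - Xk[None, :, :]
    lhs += 0.5 * (diff ** 2).sum()

# centroid form: sum_k N_k sum_{C(i)=k} ||x_i - xbar_k||^2
rhs = 0.0
for k in range(K):
    Xk = X[C == k]
    rhs += len(Xk) * ((Xk - Xk.mean(axis=0)) ** 2).sum()

assert np.isclose(lhs, rhs)   # W(C) computed both ways agrees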
Remarks: this is a combinatorial optimization problem and it is NP-hard; instead we use an iterative descent algorithm.

K-means algorithm. Fix K.
1. Initialize: randomly assign the x_i to K clusters (or randomly pick K initial means).
2. Repeat until convergence:
   (a) Calculate the current cluster means m_1, ..., m_K.
   (b) Assign each x_i to the closest mean: C(i) = argmin_{1≤k≤K} ||x_i − m_k||².
Remarks: convergence to a local optimum is guaranteed, but the global minimum is NOT. Different random initializations give different clusterings; run the algorithm with several initializations and choose the one with the smallest objective, as in the sketch below.
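A minimal sketch of this iteration with random restarts (numpy; it uses the standard total within-cluster sum of squares Σ_k Σ_{C(i)=k} ||x_i − m_k||² as the objective, and the toy blobs and all names are my own):

import numpy as np

def kmeans(X, K, n_iter=100, rng=None):
    """One run of the K-means iteration: random assignment, then alternate means/assignments."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    C = rng.integers(0, K, size=n)                    # step 1: random initial assignment
    for _ in range(n_iter):
        # step 2(a): current cluster means; re-seed a cluster if it went empty
        m = np.array([X[C == k].mean(axis=0) if np.any(C == k) else X[rng.integers(n)]
                      for k in range(K)])
        # step 2(b): assign each x_i to the closest mean
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=-1)
        C_new = d2.argmin(axis=1)
        if np.array_equal(C_new, C):
            break
        C = C_new
    W = sum(((X[C == k] - m[k]) ** 2).sum() for k in range(K))
    return C, m, W

def kmeans_restarts(X, K, n_restarts=10, seed=0):
    """Several random initializations; keep the run with the smallest objective."""
    rng = np.random.default_rng(seed)
    return min((kmeans(X, K, rng=rng) for _ in range(n_restarts)), key=lambda run: run[2])

# toy data: three well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2)) for c in ([0, 0], [3, 0], [0, 3])])
C, m, W = kmeans_restarts(X, K=3)
print("best objective:", round(W, 2))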
Now, we would like to find an algorithm to solve (10.11)—that is, a method to partition the observations into K clusters such that the objective of (10.11) is minimized. This is in fact a very difficult problem to solve precisely, since there are almost Kn ways to partition n observations into K clusters. This is a huge number unless K and n are tiny! Fortunately, a very simple algorithm can be shown to provide a local optimum—a pretty good solution—to the K-means optimization problem (10.11). This approach is laid out in Algorithm 10.1.
Algorithm 10.1 K-Means Clustering
1. Randomly assign a number, from 1 to K, to each of the observations.
These serve as initial cluster assignments for the observations.
2. Iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
(b) Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
Algorithm 10.1 is guaranteed to decrease the value of the objective (10.11) at each step. To understand why, the following identity is illuminating:
(1/|C_k|) Σ_{i,i'∈C_k} Σ_{j=1}^p (x_ij − x_i'j)² = 2 Σ_{i∈C_k} Σ_{j=1}^p (x_ij − x̄_kj)²,    (10.12)
where x̄_kj = (1/|C_k|) Σ_{i∈C_k} x_ij is the mean for feature j in cluster C_k.
In Step 2(a) the cluster means for each feature are the constants that minimize the sum-of-squared deviations, and in Step 2(b), reallocating the observations can only improve (10.12). This means that as the algorithm is run, the clustering obtained will continually improve until the result no longer changes; the objective of (10.11) will never increase. When the result no longer changes, a local optimum has been reached. Figure 10.6 shows the progression of the algorithm on the toy example from Figure 10.5. K-means clustering derives its name from the fact that in Step 2(a), the cluster centroids are computed as the mean of the observations assigned to each cluster.
Because the K-means algorithm finds a local rather than a global optimum, the results obtained will depend on the initial (random) cluster assignment of each observation in Step 1 of Algorithm 10.1. For this reason, it is important to run the algorithm multiple times from different random
FIGURE 10.6. The progress of the K-means algorithm on the example of Figure 10.5 with K=3. Top left: the observations are shown. Top center: in Step 1 of the algorithm, each observation is randomly assigned to a cluster. Top right: in Step 2(a), the cluster centroids are computed. These are shown as large colored disks. Initially the centroids are almost completely overlapping because the initial cluster assignments were chosen at random. Bottom left: in Step 2(b), each observation is assigned to the nearest centroid. Bottom center: Step 2(a) is once again performed, leading to new cluster centroids. Bottom right: the results obtained after ten iterations.
initial configurations. Then one selects the best solution, i.e. that for which the objective (10.11) is smallest. Figure 10.7 shows the local optima obtained by running K-means clustering six times using six different initial cluster assignments, using the toy data from Figure 10.5. In this case, the best clustering is the one with an objective value of 235.8.
As we have seen, to perform K-means clustering, we must decide how many clusters we expect in the data. The problem of selecting K is far from simple. This issue, along with other practical considerations that arise in performing K-means clustering, is addressed in Section 10.3.3.
Applications of K-means: vector quantization / data compression; human tumor data.
FIGURE 10.7. K-means clustering performed six times on the data from Figure 10.5 with K = 3, each time with a different random assignment of the observations in Step 1 of the K-means algorithm. Above each plot is the value of the objective (10.11). Three different local optima were obtained, one of which resulted in a smaller value of the objective and provides better separation between the clusters. Those labeled in red all achieved the same best solution, with an objective value of 235.8.
10.3.2 Hierarchical Clustering
One potential disadvantage of K-means clustering is that it requires us to pre-specify the number of clusters K. Hierarchical clustering is an alter- native approach which does not require that we commit to a particular choice of K. Hierarchical clustering has an added advantage over K-means clustering in that it results in an attractive tree-based representation of the observations, called a dendrogram.
In this section, we describe bottom-up or agglomerative clustering. This is the most common type of hierarchical clustering, and refers to the fact that a dendrogram (generally depicted as an upside-down tree; see
Vector quantization (ESL). Original image: 1024 × 1024 pixels, treated as 2 × 2 blocks; the compressed versions use K = 200 (center) and K = 4 (right) code vectors.
FIGURE 14.9. Sir Ronald A. Fisher (1890 − 1962) was one of the founders
of modern day statistics, to whom we owe maximum-likelihood, sufficiency, and
many other fundamental concepts. The image on the left is a 1024×1024 grayscale
image at 8 bits per pixel. The center image is the result of 2 × 2 block VQ, using
200 code vectors, with a compression rate of 1.9 bits/pixel. The right image uses only four code vectors, with a compression rate of 0.50 bits/pixel.
We see that the procedure is successful at grouping together samples of
the same cancer. In fact, the two breast cancers in the second cluster were
later found to be misdiagnosed and were melanomas that had metastasized.
However, K-means clustering has shortcomings in this application. For one,
it does not give a linear ordering of objects within a cluster: we have simply
listed them in alphabetic order above. Secondly, as the number of clusters
K is changed, the cluster memberships can change in arbitrary ways. That
is, with say four clusters, the clusters need not be nested within the three clusters above. For these reasons, hierarchical clustering (described later), is probably preferable for this application.
(Note: divide into coarser groups.)
14.3.9 Vector Quantization
The K-means clustering algorithm represents a key tool in the apparently unrelated area of image and signal compression, particularly in vector quan-
tization or VQ (Gersho and Gray, 1992). The left image in Figure 14.9 is a
digitized photograph of a famous statistician, Sir Ronald Fisher. It consists of 1024 × 1024 pixels, where each pixel is a grayscale value ranging from 0 to 255, and hence requires 8 bits of storage per pixel. The entire image oc- cupies 1 megabyte of storage. The center image is a VQ-compressed version of the left panel, and requires 0.239 of the storage (at some loss in quality). The right image is compressed even more, and requires only 0.0625 of the storage (at a considerable loss in quality).
The version of VQ implemented here first breaks the image into small
blocks, in this case 2×2 blocks of pixels. Each of the 512×512 blocks of four pixels
(Footnote: This example was prepared by Maya Gupta.)
Block representation: each 2×2 block becomes a vector x_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4}). Note that some blocks are very similar. Dissimilarity: d(x_i, x_j) = Σ_l (x_il − x_jl)².
K-means: find K prototypes (the codebook), e.g., K = 200 (middle image) or K = 4 (right image), and replace each original block by its corresponding center.
Compression rate: the original is 8 bits/pixel; with K = 200, log2(200)/4 ≈ 1.9 bits/pixel; with K = 4, log2(4)/4 = 0.5 bits/pixel. A substantial reduction.
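A sketch of 2×2-block vector quantization with K-means, reproducing the bits-per-pixel arithmetic above (scikit-learn's KMeans is assumed available; the image here is a random stand-in, not the Fisher photograph):

import numpy as np
from sklearn.cluster import KMeans

# stand-in for a grayscale image (the real example uses a 1024 x 1024 photograph)
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(256, 256)).astype(float)

# block representation: each 2x2 block becomes a vector x_i in R^4
h, w = img.shape
blocks = img.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)

K = 200
km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(blocks)

# replace each block by its codeword (the corresponding cluster center)
coded = km.cluster_centers_[km.labels_]
img_vq = coded.reshape(h // 2, w // 2, 2, 2).transpose(0, 2, 1, 3).reshape(h, w)

bits_per_pixel = np.log2(K) / 4   # ~1.9 for K = 200, 0.5 for K = 4
print(round(bits_per_pixel, 2))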
Human tumor microarray data (ESL). Data: x_ij = gene expression level measured on a DNA microarray, for 6830 genes and 64 cancer patients.

Some data-exploration questions:
- Which patients are similar (in their gene expression)?
- Which genes are similar (across patients)?
- What is the relation between genes and patients?
- How does the clustering compare with the ground truth (the known cancer types)?

FIGURE 1.3. DNA microarray data: expression matrix of 6830 genes (rows) and 64 samples (columns), for the human tumor data. Only a random sample of 100 rows are shown. The display is a heat map, ranging from bright green (negative, under expressed) to bright red (positive, over expressed). Missing values are gray. The rows and columns are displayed in a randomly chosen order.
mail. In order to achieve this low error rate, some objects can be assigned to a “don’t know” category, and sorted instead by hand.
Example 4: DNA Expression Microarrays
DNA stands for deoxyribonucleic acid, and is the basic material that makes up human chromosomes. DNA microarrays measure the expression of a gene in a cell by measuring the amount of mRNA (messenger ribonucleic acid) present for that gene. Microarrays are considered a breakthrough technology in biology, facilitating the quantitative study of thousands of genes simultaneously from a single sample of cells.
Here is how a DNA microarray works. The nucleotide sequences for a few thousand genes are printed on a glass slide. A target sample and a reference
sample are labeled with red and green dyes, and each are hybridized with the DNA on the slide. Through fluoroscopy, the log (red/green) intensities of RNA hybridizing at each site is measured. The result is a few thousand numbers, typically ranging from say −6 to 6, measuring the expression level of each gene in the target relative to the reference sample. Positive values indicate higher expression in the target versus the reference, and vice versa for negative values.
A gene expression dataset collects together the expression values from a series of DNA microarray experiments, with each column representing an experiment. There are therefore several thousand rows representing individual genes, and tens of columns representing samples: in the particular example of Figure 1.3 there are 6830 genes (rows) and 64 samples (columns), although for clarity only a random sample of 100 rows are shown. The figure displays the data set as a heat map, ranging from green (negative) to red (positive). The samples are 64 cancer tumors from different patients.
The challenge here is to understand how the genes and samples are or- ganized. Typical questions include the following:
(a) which samples are most similar to each other, in terms of their expression profiles across genes?
(b) which genes are most similar to each other, in terms of their expression profiles across samples?
(c) do certain genes show very high (or low) expression for certain cancer samples?
We could view this task as a regression problem, with two categorical predictor variables—genes and samples—with the response variable being the level of expression. However, it is probably more useful to view it as an unsupervised learning problem. For example, for question (a) above, we think of the samples as points in 6830–dimensional space, which we want to cluster together in some way.
K-means on the human tumor data (ESL). The within-cluster sum of squares decreases slowly with K and there is no clear elbow, so K could be taken large; for illustration, try K = 3.

FIGURE 14.8. Total within-cluster sum of squares for K-means clustering applied to the human tumor microarray data.

The data are a 6830 × 64 matrix of real numbers, each representing an expression measurement for a gene (row) and sample (column). Here we cluster the samples, each of which is a vector of length 6830, corresponding to expression values for the 6830 genes. Each sample has a label such as breast (for breast cancer), melanoma, and so on; we don't use these labels in the clustering, but will examine posthoc which labels fall into which clusters.
We applied K-means clustering with K running from 1 to 10, and computed the total within-sum of squares for each clustering, shown in Figure 14.8. Typically one looks for a kink in the sum of squares curve (or its logarithm) to locate the optimal number of clusters (see Section 14.3.11). Here there is no clear indication: for illustration we chose K = 3 giving the three clusters shown in Table 14.2.

TABLE 14.2. Human tumor data: number of cancer cases of each type, in each of the three clusters from K-means clustering.
Cluster  Breast  CNS  Colon  K562  Leukemia  MCF7
   1        3     5     0      0       0       0
   2        2     0     0      2       6       2
   3        2     0     7      0       0       0
Cluster  Melanoma  NSCLC  Ovarian  Prostate  Renal  Unknown
   1         1        7      6        2        9       1
   2         7        2      0        0        0       0
   3         0        0      0        0        0       0
(Cluster sizes: 34, 21, 9; 64 samples in total.)

Roughly: cluster 1 ≈ {Breast, CNS, NSCLC, Ovarian, Prostate, Renal}, cluster 2 ≈ {Leukemia, MCF7, K562, Melanoma}, cluster 3 ≈ {Colon}.
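The within-cluster sum-of-squares curve of Figure 14.8 can be recomputed for any dataset; a sketch (scikit-learn; the simulated data here merely stand in for the 64 tumor samples):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 50))       # stand-in for the 64 samples (rows = samples)

within_ss = []
for K in range(1, 11):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    within_ss.append(km.inertia_)   # total within-cluster sum of squares
for K, w in zip(range(1, 11), within_ss):
    print(K, round(w, 1))
# look for a kink ("elbow") in this curve; for the tumor data there is no clear one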
Drawbacks of K-means:
- Difficult to choose K.
- Within each cluster, no idea of the degree of similarity and no linear ordering of the objects.
- Not nested: when K changes, the clusters can change arbitrarily.
- Sensitive to outliers.
Solutions: hierarchical clustering; more robust measures (e.g., L1, Huber); K-medoids.

K-medoids: only the dissimilarities d_ij are given (the x_i's themselves may not be available), and no means are computed. It can handle mixed types of data, not only Euclidean / continuous data.
K-medoids (ESL 14.3.10): replace the cluster mean by a medoid, an actual observation in the cluster (cf. the median); only the pairwise dissimilarities d_ij are needed.

Algorithm 14.2 K-medoids Clustering.
1. For a given cluster assignment C find the observation in the cluster minimizing total distance to other points in that cluster:
   i*_k = argmin_{i: C(i)=k} Σ_{C(i')=k} D(x_i, x_i').    (14.35)
   Then m_k = x_{i*_k}, k = 1, 2, ..., K are the current estimates of the cluster centers.
2. Given a current set of cluster centers {m_1, ..., m_K}, minimize the total error by assigning each observation to the closest (current) cluster center:
   C(i) = argmin_{1≤k≤K} D(x_i, m_k).    (14.36)
3. Iterate steps 1 and 2 until the assignments do not change.
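A minimal sketch of this alternation working directly from a dissimilarity matrix (numpy; this is the simple alternation of steps 1 and 2 above, not the exchange strategy of Kaufman and Rousseeuw discussed below, and all names and the toy dissimilarities are mine):

import numpy as np

def k_medoids(D, K, n_iter=100, seed=0):
    """Alternate (14.35) and (14.36) on a symmetric dissimilarity matrix D (n x n)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=K, replace=False)       # initial centers are observations
    C = D[:, medoids].argmin(axis=1)                     # assign to the closest medoid
    for _ in range(n_iter):
        # step 1: within each cluster, pick the observation minimizing total dissimilarity
        new_medoids = np.array([
            np.flatnonzero(C == k)[D[np.ix_(C == k, C == k)].sum(axis=1).argmin()]
            if np.any(C == k) else medoids[k]
            for k in range(K)
        ])
        # step 2: re-assign every observation to its closest medoid
        C_new = D[:, new_medoids].argmin(axis=1)
        if np.array_equal(C_new, C) and np.array_equal(new_medoids, medoids):
            break
        medoids, C = new_medoids, C_new
    return C, medoids

# usage on any dissimilarity matrix (e.g., one like the country survey in Table 14.3)
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 2))
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)   # L1 dissimilarities, for illustration
C, medoids = k_medoids(D, K=3)
print(C, medoids)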
Using squared Euclidean distance requires all of the variables to be of the quantitative type. In addition, squared Euclidean distance places the highest influence on the largest distances. This causes the procedure to lack robustness against outliers that produce very large distances. These restrictions can be removed at the expense of computation.
The only part of the K-means algorithm that assumes squared Euclidean distance is the minimization step (14.32); the cluster representatives {m_1, ..., m_K} in (14.33) are taken to be the means of the currently assigned clusters. The algorithm can be generalized for use with arbitrarily defined dissimilarities D(x_i, x_i') by replacing this step by an explicit optimization with respect to {m_1, ..., m_K} in (14.33). In the most common form, centers for each cluster are restricted to be one of the observations assigned to the cluster, as summarized in Algorithm 14.2. This algorithm assumes attribute data, but the approach can also be applied to data described only by proximity matrices (Section 14.3.1). There is no need to explicitly compute cluster centers; rather we just keep track of the indices i*_k.
Solving (14.32) for each provisional cluster k requires an amount of computation proportional to the number of observations assigned to it, whereas for solving (14.35) the computation increases to O(N_k²). Given a set of cluster "centers," {i_1, ..., i_K}, obtaining the new assignments
C(i) = argmin_{1≤k≤K} d_{i i*_k}    (14.37)
requires computation proportional to K · N as before. Thus, K-medoids is far more computationally intensive than K-means.
Alternating between (14.35) and (14.37) represents a particular heuristic search strategy for trying to solve
min_{C, {i_k}_1^K} Σ_{k=1}^K Σ_{C(i)=k} d_{i i_k}.    (14.38)
Kaufman and Rousseeuw (1990) propose an alternative strategy for directly solving (14.38) that provisionally exchanges each center i_k with an observation that is not currently a center, selecting the exchange that produces the greatest reduction in the value of the criterion (14.38). This is repeated until no advantageous exchanges can be found. Massart et al. (1983) derive a branch-and-bound combinatorial method that finds the global minimum of (14.38) that is practical only for very small data sets.

Example: Country Dissimilarities
This example, taken from Kaufman and Rousseeuw (1990), comes from a study in which political science students were asked to provide pairwise dissimilarity measures for 12 countries: Belgium, Brazil, Chile, Cuba, Egypt, France, India, Israel, United States, Union of Soviet Socialist Republics, Yugoslavia and Zaire. The average dissimilarity scores are given in Table 14.3. We applied 3-medoid clustering to these dissimilarities. Note that K-means clustering could not be applied because we have only distances rather than raw observations. The left panel of Figure 14.10 shows the dissimilarities reordered and blocked according to the 3-medoid clustering. The right panel is a two-dimensional multidimensional scaling plot, with the 3-medoid cluster assignments indicated by colors (multidimensional scaling is discussed in Section 14.8). Both plots show three well-separated clusters, but the MDS display indicates that "Egypt" falls about halfway between two clusters.
(Notes: only the d_ij are given, no x_i's; the d_ij are dissimilarities, in fact NOT distances, though a dissimilarity can be converted into a distance. MDS = multidimensional scaling.)

TABLE 14.3. Data from a political science survey: values are average pairwise dissimilarities of countries from a questionnaire given to political science students.
      BEL   BRA   CHI   CUB   EGY   FRA   IND   ISR   USA   USS   YUG
BRA  5.58
CHI  7.00  6.50
CUB  7.08  7.00  3.83
EGY  4.83  5.08  8.17  5.83
FRA  2.17  5.75  6.67  6.92  4.92
IND  6.42  5.00  5.58  6.00  4.67  6.42
ISR  3.42  5.50  6.42  6.42  5.00  3.92  6.17
USA  2.50  4.92  6.25  7.33  4.50  2.25  6.33  2.75
USS  6.08  6.67  4.25  2.67  6.00  6.17  6.17  6.92  6.17
YUG  5.25  6.83  4.50  3.75  5.75  5.42  6.08  5.83  6.67  3.67
ZAI  4.75  3.00  6.08  6.67  5.00  5.58  4.83  6.17  5.67  6.50  6.92
FIGURE 14.10. Survey of country dissimilarities. (Left panel:) dissimilarities reordered and blocked according to 3-medoid clustering. Heat map is coded from most similar (dark red) to least similar (bright red). (Right panel:) two-dimen- sional multidimensional scaling plot, with 3-medoid clusters indicated by different colors.
Hierarchical Clustering
14.3.11 Practical Issues
In order to apply K-means or K-medoids one must select the number of clusters K∗ and an initialization. The latter can be defined by specifying an initial set of centers {m1,…,mK} or {i1,…,iK} or an initial encoder C(i). Usually specifying the centers is more convenient. Suggestions range from simple random selection to a deliberate strategy based on forward stepwise assignment. At each step a new center ik is chosen to minimize the criterion (14.33) or (14.38), given the centers i1, . . . , ik−1 chosen at the previous steps. This continues for K steps, thereby producing K initial centers with which to begin the optimization algorithm.
A choice for the number of clusters K depends on the goal. For data
segmentation K is usually defined as part of the problem. For example,
a company may employ K sales people, and the goal is to partition a
customer database into K segments, one for each sales person, such that the
customers assigned to each one are as similar as possible. Often, however,
cluster analysis is used to provide a descriptive statistic for ascertaining the
extent to which the observations comprising the data base fall into natural
distinct groupings. Here the number of such groups K∗ is unknown and
one requires that it, as well as the groupings themselves, be estimated from the data.
Data-based methods for estimating K∗ typically examine the within- cluster dissimilarity WK as a function of the number of clusters K. Separate solutions are obtained for K ∈ {1, 2, . . . , Kmax}. The corresponding values
Further topics: spectral clustering (ESL), non-negative matrix factorization (NMF; ESL).

Unsupervised learning: so much done, so much yet to be done.