EECS 391: Intro to AI
Lecture 19, Tue Nov 13
Unsupervised Learning and Clustering
[Figure: Fisher's Iris data (unlabeled) — scatter plot of petal width (cm) vs. petal length (cm)]
[Figure: the same scatter plot with class labels — Iris setosa, Iris versicolor, and Iris virginica]
[Figure: the labeled iris data with axis-parallel decision-tree boundaries — one split on dimension 1 (petal length), one on dimension 2 (petal width)]
[Figure: Fisher's Iris data with the three species labeled]

In which example would you be more confident about the class? Decision boundaries provide a classification but not uncertainty.
The general classification problem

Given data, we want to learn a model that can correctly classify novel observations.

• Input: a set of $T$ observations $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_T\}$, each an $N$-dimensional vector $\mathbf{x}_i = \{x_1, \ldots, x_N\}_i$ (binary, discrete, or continuous).
• Desired output: a binary classification vector $\mathbf{y} = \{y_1, \ldots, y_K\}$, where
$$y_i = \begin{cases} 1 & \text{if } \mathbf{x} \in C_i \text{ (class } i\text{)}, \\ 0 & \text{otherwise.} \end{cases}$$
• Model (e.g., a decision tree): defined by $M$ parameters $\theta = \{\theta_1, \ldots, \theta_M\}$.

How do we approach this probabilistically?
The answer to all questions of uncertainty
• Let's apply Bayes' rule to infer the most probable class given the observation:
$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{\sum_k p(\mathbf{x}|C_k)\,p(C_k)}$$
• This is the answer, but what does it mean? How do we specify the terms?
  – $p(C_k)$ is the prior probability of the different classes.
  – $p(\mathbf{x}|C_k)$ is the data likelihood, i.e., the probability of $\mathbf{x}$ given class $C_k$.
• How should we define this?
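To make the rule concrete, here is a minimal sketch of Bayes'-rule classification with one-dimensional Gaussian class-conditional densities. The means, standard deviations, and priors below are illustrative stand-ins, not the actual fits to the iris data.

```python
# Bayes'-rule classification sketch with 1-D Gaussian likelihoods.
# All numbers are illustrative, not fitted iris values.
import numpy as np
from scipy.stats import norm

# Hypothetical per-class petal-length models: (mean, sd) for C1..C3
class_params = {"C1": (1.5, 0.2), "C2": (4.3, 0.5), "C3": (5.6, 0.6)}
priors = {"C1": 1/3, "C2": 1/3, "C3": 1/3}   # equal class sizes

def posterior(x):
    # p(Ck|x) = p(x|Ck) p(Ck) / sum_k p(x|Ck) p(Ck)
    joint = {k: norm.pdf(x, mu, sd) * priors[k]
             for k, (mu, sd) in class_params.items()}
    z = sum(joint.values())                  # p(x), the evidence
    return {k: v / z for k, v in joint.items()}

print(posterior(4.9))  # posterior mass split mainly between C2 and C3
```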
What classifier would give “optimal” performance?
• Consider the iris data. How would we minimize the number of future misclassifications?
• We would need to know the true distribution of the classes. Assume they follow a Gaussian distribution.
• The number of samples in each class is the same (50), so assume $p(C_k)$ is equal for all classes.
• Because $p(\mathbf{x})$ is the same for all classes, we have:
$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{p(\mathbf{x})} \propto p(\mathbf{x}|C_k)\,p(C_k)$$
[Figure: Gaussian class-conditional densities $p(\text{petal length}\,|\,C_2)$ and $p(\text{petal length}\,|\,C_3)$ overlaid on the petal width vs. petal length scatter plot]
Where do we put the boundary?
[Figure: the two class-conditional densities $p(\text{petal length}\,|\,C_2)$ and $p(\text{petal length}\,|\,C_3)$ along the petal-length axis]
[Figure: the same densities with a candidate decision boundary marked]
• $R_{32}$: region where $C_3$ is misclassified as $C_2$.
• $R_{23}$: region where $C_2$ is misclassified as $C_3$.
• Shifting the boundary trades off the two errors.
• The misclassification error is defined by
$$p(\text{error}) = \int_{R_{32}} p(\mathbf{x}|C_3)\,p(C_3)\,d\mathbf{x} + \int_{R_{23}} p(\mathbf{x}|C_2)\,p(C_2)\,d\mathbf{x}$$
• which in our case is proportional to the data likelihood.
• Note: if the boundary sits too far to the right, there is a region that yields $p(C_3|\mathbf{x}) > p(C_2|\mathbf{x})$, but we're still classifying that region as $C_2$!
The optimal decision boundary
• The misclassification error is minimized at the point where
$$p(C_3|\mathbf{x}) = p(C_2|\mathbf{x}) \;\Rightarrow\; \frac{p(\mathbf{x}|C_3)\,p(C_3)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|C_2)\,p(C_2)}{p(\mathbf{x})} \;\Rightarrow\; p(\mathbf{x}|C_3) = p(\mathbf{x}|C_2)$$
(the last step uses the equal priors assumed above).
[Figure: the posteriors $p(C_2\,|\,\text{petal length})$ and $p(C_3\,|\,\text{petal length})$ crossing at the optimal decision boundary]
• Note: this assumes we have only two classes.
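As a worked sketch of this derivation, the code below locates the crossing point of two one-dimensional Gaussian likelihoods (equal priors) and evaluates the resulting misclassification error from the two integrals above. The means and standard deviations are illustrative stand-ins for the fitted iris densities.

```python
# Find the optimal boundary between two 1-D Gaussian classes with
# equal priors, then compute p(error). Parameters are illustrative.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

mu2, sd2 = 4.3, 0.5   # hypothetical p(petal length | C2)
mu3, sd3 = 5.6, 0.6   # hypothetical p(petal length | C3)

# Optimal boundary: where the two (equal-prior) likelihoods cross.
boundary = brentq(lambda x: norm.pdf(x, mu2, sd2) - norm.pdf(x, mu3, sd3),
                  mu2, mu3)

# p(error) = ∫_{R32} p(x|C3)p(C3) dx + ∫_{R23} p(x|C2)p(C2) dx
p_err = 0.5 * norm.cdf(boundary, mu3, sd3) + 0.5 * norm.sf(boundary, mu2, sd2)
print(f"boundary ≈ {boundary:.2f}, p(error) ≈ {p_err:.3f}")
```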
Clustering: Classification without labels
• In many situations we don't have labeled training data, only unlabeled data.
• E.g., in the iris data set, what if we were just starting out and didn't know any classes?
[Figure: the unlabeled iris scatter plot, petal width (cm) vs. petal length (cm)]
Types of learning
• Supervised: world (or data) → model $\{\theta_1, \ldots, \theta_n\}$, trained against a desired output $\{y_1, \ldots, y_n\}$.
• Unsupervised: world (or data) → model $\{\theta_1, \ldots, \theta_n\}$, with no desired output given.
• Reinforcement: world (or data) → model $\{\theta_1, \ldots, \theta_n\}$ → model output, shaped by a reinforcement signal.
A different approach to classification
• Nearby points are likely to be members of the same class.
• What if we used the points themselves to classify? Classify $\mathbf{x}$ in $C_k$ if $\mathbf{x}$ is “similar” to a point we already know is in $C_k$.
• E.g., the unclassified point $\mathbf{x}$ is more similar to Class 2 than to Class 1.
• Issue: how do we define “similar”? The simplest choice is Euclidean distance:
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}$$
• We could define other metrics depending on the application, e.g., for text documents, images, etc.
[Figure: three labeled classes (Class 1, 2, 3) with an unclassified point x lying nearest to Class 2]
Potential advantages:
• don't need an explicit model
• the more examples the better
• might handle more complex classes
• easy to implement
• “no brain on part of the designer”
Nearest neighbor classification on the iris dataset
[Figure: nearest-neighbor classification applied to the iris data]
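A minimal k-nearest-neighbor sketch using this Euclidean distance follows; the function name and the tiny toy dataset are illustrative, not from the slides.

```python
# Minimal k-nearest-neighbor classifier sketch (Euclidean distance).
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    # Euclidean distance from x to every training point
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    # Take the labels of the k nearest points and vote
    nearest = train_y[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Toy usage: two petal measurements per point, three classes (0, 1, 2)
train_X = np.array([[1.4, 0.2], [1.5, 0.3], [4.5, 1.5], [4.1, 1.3],
                    [5.8, 2.2], [6.1, 1.9]])
train_y = np.array([0, 0, 1, 1, 2, 2])
print(knn_classify(np.array([4.8, 1.6]), train_X, train_y, k=3))  # -> 1
```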
Example: Handwritten digits
• Use Euclidean distance to see which known digit is closest to each class.
• But not all neighbors are the same. “k-nearest neighbors”: look at the k nearest neighbors and choose the most frequent class.
• Caution: it can get expensive to find the neighbors.
[Figure: sample digit images, from LeCun et al., 1998]
digit data available at: http://yann.lecun.com/exdb/mnist/
Error Bounds for NN
• Amazing fact: asymptotically, err(1-NN) < 2 err(Bayes). For $M$ classes:
$$e_B \le e_{1\mathrm{NN}} \le 2e_B - \frac{M}{M-1}\,e_B^2$$
This is a tight upper bound, achieved in the “zero-information” case when the classes have identical densities.
• For k-NN there are also bounds, e.g., for two classes and odd $k$:
$$e_B \le e_{k\mathrm{NN}} \le \sum_{i=0}^{(k-1)/2} \binom{k}{i} \left[ e_B^{i+1}(1-e_B)^{k-i} + e_B^{k-i}(1-e_B)^{i+1} \right]$$
• For more on these bounds, see the book A Probabilistic Theory of Pattern Recognition, by L. Devroye, L. Gyorfi & G. Lugosi (1996).

Example: USPS Digits
• Take 16x16 grayscale images (8-bit) of handwritten digits.
• Use Euclidean distance in raw pixel space (dumb!) and 7-NN.
• Classification error (leave-one-out): 4.85%.
[Figure: example test digits with their 7 nearest neighbours]

Nonparametric (Instance-Based) Models
• Q: What are the parameters in k-NN? What is the complexity?
  A: the scalar k and the entire training set. Models that need the entire training set at test time but (hopefully) have very few other parameters are known as nonparametric, instance-based, or case-based.
• What if we want a classifier that uses only a small number of parameters at test time (e.g., for speed or memory reasons)?
  Idea 1: a single linear boundary, of arbitrary orientation.
  Idea 2: many boundaries, but axis-parallel and tree-structured.
[Figure: a single oblique linear boundary vs. an axis-parallel, tree-structured partition with thresholds t1–t5 and leaf regions A–F]

Linear Classification for Binary Output
• Goal: find the line (or hyperplane) which best separates two classes:
$$c(\mathbf{x}) = \mathrm{sign}[\mathbf{x}^\top \mathbf{w} - w_0]$$
where $\mathbf{w}$ is the weight vector and $w_0$ the threshold.
• $\mathbf{w}$ is a vector perpendicular to the decision boundary.
• This is the opposite of nonparametric: only $d + 1$ parameters!
• Typically we augment $\mathbf{x}$ with a constant term $\pm 1$ (a “bias unit”) and then absorb $w_0$ into $\mathbf{w}$, so we don't have to treat it specially.
(nearest-neighbor examples from Sam Roweis; digits are just represented as a vector)

The problem of using templates (i.e., Euclidean distance)
• Which of these is more like the example? A or B?
• Euclidean distance only cares about how many pixels overlap.
• We could try to define a distance metric that is insensitive to small deviations in position, scale, rotation, etc.
• Digit example: 60,000 training images, 10,000 test images, no “preprocessing” (from Simard et al., 1998).

Performance of various classifiers on the test data (from http://yann.lecun.com/exdb/mnist/):

  Classifier                                      Error rate (%)
  linear                                          12
  k=3 nearest neighbor (Euclidean distance)       5
  2-layer neural network (300 hidden units)       4.7
  nearest neighbor (Euclidean distance)           3.1
  k-nearest neighbor (improved distance metric)   1.1
  convolutional neural net                        0.95
  best (the conv. net with elastic distortions)   0.4
  humans                                          0.2 – 2.5

A real example: clustering electrical signals from neurons
• An application of PCA: spike sorting.
• Recording chain: electrode → amplifier → filter → A/D → software analysis (replacing the traditional oscilloscope).
[Figure: an extracellular waveform (0–25 ms) containing many different spikes]
• How do we sort the different spikes? Basic problem: the only information is the signal itself; the true classes are always unknown.

Sorting with level detection
[Figure: spike waveforms (−0.5 to 1.5 ms) aligned by a threshold crossing, as on an oscilloscope]
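To make “level detection” concrete, here is a minimal sketch of threshold-crossing event detection on a sampled voltage trace. The synthetic signal and threshold value are made up for illustration.

```python
# Spike detection by level (threshold) crossing — illustrative only.
import numpy as np

rng = np.random.default_rng(0)
trace = rng.normal(0.0, 1.0, 10_000)   # background noise
trace[2_000] += 8.0                    # two fake "spikes"
trace[7_500] += 6.0

threshold = 5.0                        # level set by eye, as on a scope
# Indices where the signal crosses the threshold from below
crossings = np.flatnonzero((trace[:-1] < threshold) & (trace[1:] >= threshold))
print(crossings)                       # sample indices just before each spike
```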
[Figure: more threshold-aligned spike waveforms — level detection doesn't always work]

Why level detection doesn't work
[Figure: overlapping amplitude distributions — background activity, peak amplitude of neuron 1, peak amplitude of neuron 2 — with candidate thresholds A and B]
• One dimension is not sufficient to separate the spikes.

Idea: try more features
[Figure: a spike waveform annotated with its max amplitude and min amplitude]
• What other features could we use?

Maximum vs. minimum
[Figure: scatter plot of spike maximum (µV) vs. spike minimum (µV), showing several clusters]
• This allows better discrimination than max alone, but is it optimal?

Features using Principal Components (not covered)
[Figure: scatter plot of 1st PC score vs. 2nd PC score, showing well-separated clusters]

k-means clustering
• Idea: try to estimate $k$ cluster centers by minimizing the “distortion”, defined as
$$D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|\mathbf{x}_n - \mu_k\|^2, \qquad r_{nk} = \begin{cases} 1 & \text{if } \mathbf{x}_n \in \text{cluster } k, \\ 0 & \text{otherwise.} \end{cases}$$
• $r_{nk}$ is 1 for the closest cluster mean to $\mathbf{x}_n$, so each point contributes the distance to its closest center.
• How do we learn the cluster means? We need to derive a learning rule.

Deriving a learning rule for the cluster means
• Our objective function is the distortion $D$ above.
• Differentiate with respect to the mean (the parameter we want to estimate):
$$\frac{\partial D}{\partial \mu_k} = 2 \sum_{n=1}^{N} r_{nk} (\mathbf{x}_n - \mu_k)$$
• We know the optimum is where $\partial D / \partial \mu_k = 0$, so we can solve for the mean:
$$\mu_k = \frac{\sum_n r_{nk}\, \mathbf{x}_n}{\sum_n r_{nk}}$$
• This is simply a weighted mean for each cluster.
• Thus we have a simple estimation algorithm, k-means clustering (sketched in code after the example below):
  1. select k points at random as the initial cluster means
  2. assign each point to its closest mean and re-estimate (update) the means
  3. repeat until converged
• Convergence (to a local minimum) is guaranteed.

k-means clustering example
[Figure sequence, on the 1st vs. 2nd PC score data: select 3 points at random for the cluster means; then update them using the estimate; and iterate... stopping when converged, i.e., no change.]
• There can be multiple local minima. [Figure: an example run converged to a poor local minimum.]
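Here is the promised minimal k-means sketch, directly implementing the update rule derived above (random initialization, nearest-mean assignment, weighted-mean update, iterate to convergence). The data and function names are illustrative.

```python
# Minimal k-means sketch following the derived update rule.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # 1. random init
    for _ in range(n_iters):
        # 2a. r_nk: index of the closest mean for each point
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # 2b. set each mean to the average of its assigned points
        # (a production version would also handle empty clusters)
        new_means = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_means, means):                 # 3. converged
            break
        means = new_means
    return means, assign

# Toy usage on three synthetic clusters
X = np.vstack([np.random.default_rng(1).normal(c, 0.5, (50, 2))
               for c in ([0, 0], [4, 0], [2, 3])])
means, assign = kmeans(X, k=3)
print(np.round(means, 2))
```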
A probabilistic interpretation: Gaussian mixture models
• We've already seen a one-dimensional version: the spike-amplitude example has three classes — neuron 1, neuron 2, and background noise.
• Each can be modeled as a Gaussian, and any given data point comes from just one Gaussian.
• The whole set of data is then modeled by a mixture of three Gaussians.
• How do we model this?

[Excerpt from Lewicki's review of spike sorting:]

Figure 4 illustrates the distribution of amplitudes for the background activity and the peak amplitudes of the spikes from two units. Amplitude is along the horizontal axis. Setting the threshold level to the position at A introduces a large number of spikes from unit 1. Increasing the threshold to B reduces the number of spikes that are misclassified, but at the expense of many missed spikes.

Types of detection errors. Very often it is not possible to separate the desired spikes from the background noise with perfect accuracy. The threshold level determines the trade-off between missed spikes (false negatives) and the number of background events that cross threshold (false positives). If the threshold is set to the level at A, all of the spikes from unit 1 are detected, but there is a very large number of false positives due to the contamination of spikes from unit 2. If the threshold is increased to the level at B, only spikes from unit 1 are detected, but a large number fall below threshold. Ideally, the threshold should be set to optimize the desired ratio of false positives to false negatives. If the background noise level is small compared to the amplitude of the spikes and the amplitude distributions are well separated, then both of these errors will be close to zero and the position of the threshold hardly matters.

Misclassification error due to overlaps. In addition to the background noise, which, to a first approximation, is Gaussian in nature, the spike height can vary greatly if there are other neurons in the local region that generate action potentials of significant size. If the peak of the desired unit and the dip of a background unit line up, a spike will be missed. How frequently this occurs depends on the firing rates of the units involved. A rough estimate of the percentage of spikes missed due to overlaps is
$$\%\,\text{missed spikes} = 100\,rd/1000$$
where $r$ is the firing rate in hertz and $d$ is the duration of the negative phase in milliseconds. Thus if the background neuron is firing at 20 Hz and the duration of the negative phase is approximately 0.5 ms, then approximately 1% of the spikes will be missed. Note that this is only a problem when the negative phase of the background spikes is sufficiently large to cause the spikes of interest to drop below threshold.

The Gaussian mixture model density
• The likelihood of the data given a particular class $c_k$ is $p(\mathbf{x}|c_k, \mu_k, \Sigma_k)$, where $\mathbf{x}$ is the spike waveform and $\mu_k$ and $\Sigma_k$ are the mean and covariance for class $c_k$.
• The marginal likelihood is computed by summing over the likelihoods of the $K$ classes:
$$p(\mathbf{x}|\theta_{1:K}) = \sum_{k=1}^{K} p(\mathbf{x}|c_k, \theta_k)\, p(c_k)$$
• $\theta_{1:K}$ defines the parameters of all the classes: $\theta_{1:K} = \{\mu_1, \Sigma_1, \ldots, \mu_K, \Sigma_K\}$.
• $p(c_k)$ is the probability of the $k$th class, with $\sum_k p(c_k) = 1$.
• What does this mean in this example? How do we determine the class $c_k$ from the data $\mathbf{x}$? (A code sketch of this density follows.)
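The sketch below evaluates the mixture marginal likelihood exactly as written above: a prior-weighted sum of Gaussian likelihoods. The means, covariances, and priors are illustrative, not fitted values.

```python
# Gaussian mixture density p(x | theta_1:K) — illustrative parameters.
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D mixture: three classes with means, covariances, priors
mus    = [np.array([0., 0.]), np.array([4., 0.]), np.array([2., 3.])]
sigmas = [np.eye(2), np.eye(2) * 0.5, np.eye(2) * 2.0]
priors = [0.5, 0.3, 0.2]                     # p(c_k), sums to 1

def mixture_density(x):
    # p(x|theta) = sum_k p(x|c_k, theta_k) p(c_k)
    return sum(p * multivariate_normal.pdf(x, mean=mu, cov=S)
               for mu, S, p in zip(mus, sigmas, priors))

print(mixture_density(np.array([1.0, 1.0])))
```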
Bayesian classification with multivariate Gaussian mixtures
• Again, use Bayes' rule:
$$p(c_k|\mathbf{x}^{(n)}, \theta_{1:K}) = p_{n,k} = \frac{p(\mathbf{x}^{(n)}|c_k, \theta_k)\, p(c_k)}{\sum_k p(\mathbf{x}^{(n)}|c_k, \theta_k)\, p(c_k)}$$
• This tells us the probability that waveform $\mathbf{x}^{(n)}$ came from class $c_k$.

Estimating the parameters: fitting the model density to the data
• The objective of density estimation is to maximize the likelihood of the data.
• If we assume the samples are independent, the data likelihood is just the product of the marginal likelihoods:
$$p(\mathbf{x}_{1:N}|\theta_{1:K}) = \prod_{n=1}^{N} p(\mathbf{x}_n|\theta_{1:K})$$
• The class parameters are determined by optimization. It is far more practical to optimize the log-likelihood.
• One elegant approach to this is the EM algorithm.

The EM algorithm for a Gaussian mixture
• EM stands for Expectation-Maximization, and involves two steps that are iterated. For the case of a Gaussian mixture model:
  1. E-step: compute $p_{n,k} = p(c_k|\mathbf{x}^{(n)}, \theta_{1:K})$, and let $p_k = \sum_n p_{n,k}$.
  2. M-step: compute the new mean, covariance, and class prior for each class:
$$\mu_k \leftarrow \frac{\sum_n p_{n,k}\, \mathbf{x}^{(n)}}{p_k}, \qquad \Sigma_k \leftarrow \frac{\sum_n p_{n,k} (\mathbf{x}^{(n)} - \mu_k)(\mathbf{x}^{(n)} - \mu_k)^\top}{p_k}, \qquad p(c_k) \leftarrow \frac{p_k}{N}$$
• This is just the sample mean and covariance, weighted by the class-conditional probabilities $p_{n,k}$.
• The updates are derived by setting the log-likelihood gradient to zero (i.e., at the maximum). A minimal implementation follows.
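The sketch below iterates the E- and M-steps exactly as listed above. It uses plain likelihoods rather than log-space computation and adds no covariance regularization, so it is illustrative rather than production-ready; all names and the toy data are assumptions.

```python
# Minimal EM sketch for a Gaussian mixture model.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mus = X[rng.choice(N, size=K, replace=False)]        # init means from data
    sigmas = np.array([np.cov(X.T) for _ in range(K)])   # init covariances
    pis = np.full(K, 1.0 / K)                            # init priors p(c_k)
    for _ in range(n_iters):
        # E-step: responsibilities p_{n,k} = p(c_k | x_n, theta)
        lik = np.column_stack(
            [pis[k] * multivariate_normal.pdf(X, mus[k], sigmas[k])
             for k in range(K)])
        p_nk = lik / lik.sum(axis=1, keepdims=True)
        p_k = p_nk.sum(axis=0)
        # M-step: weighted mean, covariance, and class prior
        mus = (p_nk.T @ X) / p_k[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (p_nk[:, k, None] * diff).T @ diff / p_k[k]
        pis = p_k / N
    return mus, sigmas, pis

# Toy usage on two synthetic clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, (100, 2)),
               rng.normal([3, 3], 0.8, (100, 2))])
mus, sigmas, pis = em_gmm(X, K=2)
print(np.round(mus, 2), np.round(pis, 2))
```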
[Figure: four-cluster solution with Bayesian decision boundaries, plotted as 1st vs. 2nd component score. But wait! Here's a nine-cluster solution of the same data.]

[Excerpt from Lewicki's review, figure 9 and accompanying text:]

Application of Gaussian clustering to spike sorting. (a) The ellipses show the three-sigma error contours of the four clusters. The lines show the Bayesian decision boundaries separating the larger clusters. (b) The same data modelled with nine clusters. The elliptical line extending across the bottom is the three-sigma error contour of the largest cluster.

Classification is performed by calculating the probability that a data point belongs to each of the classes, which is obtained with Bayes' rule:
$$p(c_k|\mathbf{x}, \theta_{1:K}) = \frac{p(\mathbf{x}|c_k, \theta_k)\, p(c_k)}{\sum_k p(\mathbf{x}|c_k, \theta_k)\, p(c_k)}$$
This implicitly defines the Bayesian decision boundaries for the model. Because the cluster membership is probabilistic, the cluster boundaries can be computed as a function of confidence level. This will yield better classification, because if the model is accurate the boundaries will be optimal, i.e., give the fewest misclassifications. The class parameters are optimized by maximizing the likelihood of the data:
$$p(\mathbf{x}_{1:N}|\theta_{1:K}) = \prod_{n=1}^{N} p(\mathbf{x}_n|\theta_{1:K})$$
For the examples shown here, the cluster parameters were obtained using the publicly available software package AutoClass (Cheeseman and Stutz 1996). This package uses the Bayesian methods described above to determine the means and the covariance matrices as well as the class probabilities $p(c_k)$. The ellipses (or circles) in figure 9 show the three-sigma (three standard deviations) error contours of the Gaussian model for each cluster. The figure illustrates two different models of the data, one with four clusters and one with nine. In this case, the clusters corresponding to the two large spikes appear in both solutions, but this illustrates that choosing the number of clusters is not always an easy task. Note that the cluster in the middle, corresponding to the background spikes, is not modelled by a single class, but by two or more overlapping Gaussians. The lines in figure 9(a) show the Bayesian decision boundaries that separate the three larger clusters. The decision boundary for the smaller circular cluster is not shown, but it is in roughly the same position as the cluster's error contour. If the Gaussian cluster model were accurate, most of the data would fall within the three-sigma error boundary. In this case, three of the contours match the variability in the data, but the upper-right contour is significantly larger than the cluster itself. The reason …

Uh oh... how many clusters are there really?
[Figure: distortion D vs. k = number of clusters — distortion decreases monotonically as k grows]

How do we choose k?
• Increasing k will always decrease our distortion; this will overfit the data. How can we avoid this? Or, how do we choose the best k?
• One way: cross-validation. Use our distortion metric,
$$D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|\mathbf{x}_n - \mu_k\|^2,$$
then measure the distortion on a test data set, and stop when we reach a minimum. (See the sketch below.)
[Figure: k = 10 clusters overfitting the PC-score data — training-set error keeps falling while test-set error turns back up]
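A minimal sketch of this model-selection procedure: fit k-means on a training split for each candidate k, then evaluate the distortion on a held-out split and look for where it bottoms out. It reuses the `kmeans()` sketch from earlier; the data and split are illustrative.

```python
# Choosing k by cross-validated distortion (uses kmeans() from above).
import numpy as np

def distortion(X, means):
    # D = sum_n min_k ||x_n - mu_k||^2  (each point scored by closest mean)
    d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.6, (100, 2)) for c in ([0, 0], [4, 0], [2, 3])])
rng.shuffle(X)
train, test = X[:200], X[200:]

for k in range(1, 8):
    means, _ = kmeans(train, k)
    print(k, round(distortion(test, means), 1))
# Training distortion always falls as k grows; held-out distortion
# typically drops sharply up to the true k (here 3), then flattens
# or rises — that is the cross-validated choice of k.
```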