Machine Learning Clustering
Dariush Hosseini
Department of Computer Science, University College London
Lecture Overview
1 Lecture Overview
2 Introduction
3 k -means
4 Mixture of Gaussians
Lecture Overview
By the end of this lecture you should:
1 Understand the nature and (broad) purpose of Unsupervised Learning
2 Know the k-means clustering algorithm and the assumptions that it makes
3 Understand the use of the Mixture of Gaussians Model and the EM algorithm for clustering
Introduction
Unsupervised Learning
Recall that in Unsupervised Learning we seek to learn from
unlabelled data
There are many cases where we may not have any associated labelling information with our input data, but where we would nevertheless like to learn more about the inherent structure of the data
There are two main tasks associated with unsupervised learning:
Introduction
Unsupervised Learning
Clustering:
Seek to learn groups or clusters from our input data, and in so doing assign each instance to a cluster
Sample task: market segmentation analysis
Dimensionality Reduction:
Reduce the dimensionality of the input data, while still retaining certain structural properties
Sample task: data visualisation
Introduction
Clustering
So how do we infer groups and assign instances? Intuitively this is straightforward:
We group objects such that similar objects end up in the same group…
…And such that dissimilar objects are separated into different groups
Introduction
Similarity¹
But how should we measure similarity?
¹ Illustrations based on Shalev-Shwartz & Ben-David, ‘Understanding Machine Learning’ [2014]
Introduction
Similarity
Transitivity?
Here points are clustered according to whether they are closely related to neighbours…
…Regardless of how far apart the first and last neighbours are
Introduction
Similarity
Proximity?
Here points are clustered according to some sort of mutual shared distance…
…They cluster around a shared point
Introduction
Assessment
Assuming we can agree a similarity measure, how do we assess how well we are doing?
In classification tasks we can assess how well our predictions compare with the labels of our data
…But in clustering tasks:
There are no labels
Often we don’t even know how many labels there are
Introduction
Impossibility?
So is clustering in fact impossible?
Introduction
Impossibility?
The theorem (Kleinberg’s impossibility result for clustering) is contingent on acceptance of a set of (plausible) desirable properties of a clustering function
But such properties change depending on the problem setting
In fact there are many clustering algorithms, appropriate for different problem settings. Broadly:
Partitional Algorithms: For example, Spectral Clustering techniques
Hierarchical Algorithms: For example, Agglomerative Clustering techniques
Introduction
Many, Many Clustering Approaches²
² http://scikit-learn.org
Introduction
Applications
Instead of seeking a generically useful clustering algorithm, we focus on the task at hand, and let that guide us towards a sensible approach
What kind of tasks might we be interested in?
Data Exploration
Image Segmentation Market Research
Data Preprocessing
Image Compression Semi-supervised learning
Introduction
Example: Image Compression
We plot each pixel as a point in the R^3 RGB space
Introduction
Example: Image Compression
Now we seek to cluster the data in this space
Each cluster is represented by some centroid – a point which is the prototype for all data in that cluster
In this context, we call the centroids code book vectors
Each point is then codified with the coordinate of the centroid
In this context, this process is called vector quantization
Each pixel in the original picture is now coloured according to its vector quantized RGB-value:
Introduction
Example: Image Compression
The result is a compressed image
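As a concrete illustration, the following is a minimal sketch of this vector quantization procedure using scikit-learn's KMeans; the file name "photo.png" and the choice of k = 16 code book vectors are assumptions for illustration only, not part of the lecture.

```python
# A minimal sketch of image compression by vector quantization with k-means.
# Assumes an RGB image file "photo.png" and k = 16 code book vectors.
import numpy as np
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

image = plt.imread("photo.png")[:, :, :3]          # H x W x 3 array of RGB values
pixels = image.reshape(-1, 3).astype(float)        # each pixel is a point in R^3

k = 16                                             # size of the code book
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

codebook = kmeans.cluster_centers_                 # the k centroids (code book vectors)
codes = kmeans.labels_                             # index of the nearest centroid per pixel

compressed = codebook[codes].reshape(image.shape)  # recolour each pixel with its centroid
plt.imshow(compressed)
plt.show()
```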
k-means Algorithm
In the k-means setting, we assume that instances lying within a cluster have small inter-point distances compared with their distances to instances lying outside of the cluster
We assume that we are interested in finding k clusters, and the j-th of these is represented by a centroid, or prototype, that we will represent as μ[j] ∈ Rm
These prototypes represent the ‘centres’ of the clusters
Our goal is to find these centroids such that the sum of squares of the distances of each instance to its closest centroid is minimised
As usual, we assume that our input data consists of n instances:
{x^{(i)}}_{i=1}^{n}, where x^{(i)} ∈ R^m
For each of these instances, we introduce a set of indicator variables, {ρ_i^{[j]}}_{j=1}^{k} (where ρ_i^{[j]} ∈ {0, 1}), that are used to describe whether x^{(i)} is assigned to cluster j
So, for a particular point, x^{(i)}, let:

j^* = \arg\min_l \|x^{(i)} - \mu^{[l]}\|_2^2

ρ_i^{[j]} = 1 if j = j^*, and 0 otherwise
Distortion Function
In the k-means setting, our objective function is known as the distortion function, Θ, and is defined as the sum of squared distances between each instance and its assigned cluster centroid:

\Theta = \sum_{i=1}^{n} \sum_{j=1}^{k} \rho_i^{[j]} \, \|x^{(i)} - \mu^{[j]}\|_2^2
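As a small aside, the distortion function is straightforward to compute; a minimal NumPy sketch (the function and argument names here are mine, not from the lecture):

```python
import numpy as np

def distortion(X, mu, assign):
    """Theta: sum of squared distances between each instance and its assigned centroid.

    X      : (n, m) array of instances x^(i)
    mu     : (k, m) array of centroids mu^[j]
    assign : (n,) array of cluster indices (equivalent to the one-hot rho_i^[j])
    """
    diffs = X - mu[assign]          # x^(i) minus the centroid of its assigned cluster
    return np.sum(diffs ** 2)
```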
Optimisation
But minimising this function is hard…
It looks like a chicken-and-egg problem:
If we had good prototypes, {μ^{[j]}}_{j=1}^{k}, we could solve by assigning data points to the nearest cluster
If we had good assignments, {ρ_i^{[j]}}_{i=1,j=1}^{n,k}, we could solve for the prototypes
Greedy Optimisation: k-means
The k-means algorithm, also known as Lloyd’s Algorithm, attempts to solve the optimisation problem in a greedy fashion:
We start by choosing some initial (random) value for each μ^{[j]}
E-Step:
Then, in the first step, we minimise Θ with respect to {ρ_i^{[j]}}_{i=1,j=1}^{n,k}, keeping all of the μ^{[j]} fixed
M-Step:
Then, in the second step, we minimise Θ with respect to {μ^{[j]}}_{j=1}^{k}, keeping all of the ρ_i^{[j]} fixed
The E-Step and M-Step are then repeated until convergence
The E step determines the values of {ρ_i^{[j]}}_{j=1}^{k}
Each of the n terms in Θ is independent, and so we can optimise each independently
We choose ρ_i^{[j]} to be equal to 1 for whichever value of j gives the minimum squared Euclidean distance:

ρ_i^{[j]} = 1 if j = \arg\min_l \|x^{(i)} - \mu^{[l]}\|_2^2, and 0 otherwise

Ties are broken in a random manner
The M step determines the values of {μ^{[j]}}_{j=1}^{k}
Θ is a quadratic function of μ^{[j]}, so we can take the derivative of Θ with respect to μ^{[j]} and set it equal to zero to obtain:

2 \sum_{i=1}^{n} \rho_i^{[j]} (x^{(i)} - \mu^{[j]}) = 0

Then solving for μ^{[j]} gives:

\mu^{[j]} = \frac{\sum_{i=1}^{n} \rho_i^{[j]} x^{(i)}}{\sum_{i=1}^{n} \rho_i^{[j]}}
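Putting the E-step and M-step together, here is a minimal NumPy sketch of Lloyd's algorithm; the initialisation (sampling k data points) and the handling of empty clusters are my own choices, not prescribed above:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """A sketch of Lloyd's algorithm: alternate E-steps (assignments) and M-steps (centroids)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]                        # initial centroids

    for _ in range(n_iters):
        # E-step: assign each instance to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (n, k) distance matrix
        assign = np.argmin(dists, axis=1)

        # M-step: move each centroid to the mean of its assigned instances
        new_mu = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else mu[j]
                           for j in range(k)])                          # keep empty clusters in place

        if np.allclose(new_mu, mu):                                     # converged
            break
        mu = new_mu

    return mu, assign
```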
Convergence
By definition of the E-step, which minimises Θ over the assignments with the prototypes held fixed:

\Theta\big(\rho_{(t)}, \mu_{(t-1)}\big) = \sum_{i=1}^{n}\sum_{j=1}^{k} \rho_{(t)i}^{[j]} \|x^{(i)} - \mu_{(t-1)}^{[j]}\|_2^2 \;\leq\; \sum_{i=1}^{n}\sum_{j=1}^{k} \rho_{(t-1)i}^{[j]} \|x^{(i)} - \mu_{(t-1)}^{[j]}\|_2^2 = \Theta\big(\rho_{(t-1)}, \mu_{(t-1)}\big)

And by definition of the M-step, which minimises Θ over the prototypes with the assignments held fixed:

\Theta\big(\rho_{(t)}, \mu_{(t)}\big) = \sum_{i=1}^{n}\sum_{j=1}^{k} \rho_{(t)i}^{[j]} \|x^{(i)} - \mu_{(t)}^{[j]}\|_2^2 \;\leq\; \sum_{i=1}^{n}\sum_{j=1}^{k} \rho_{(t)i}^{[j]} \|x^{(i)} - \mu_{(t-1)}^{[j]}\|_2^2 = \Theta\big(\rho_{(t)}, \mu_{(t-1)}\big)

Thus:

\Theta\big(\rho_{(t)}, \mu_{(t)}\big) \;\leq\; \Theta\big(\rho_{(t-1)}, \mu_{(t-1)}\big)

So the k-means algorithm will converge… But not necessarily to a global optimum
Example: Old Faithful³
³ Illustrations based on Bishop, ‘Pattern Recognition & Machine Learning’ [2007]

[Figure sequence: successive E-steps and M-steps of the k-means algorithm applied to the two-dimensional Old Faithful dataset (BCS Summer School, Exeter, 2003, C. Bishop)]
k-means: Limitations
The value of k must be specified prior to performing clustering, and clusters can be unstable with respect to changes in k
The objective is non-convex
Works poorly on clusters which are non-convex in shape: assumes clusters are hyperspheres
Works poorly on clusters of different sizes
Sensitive to data-scaling (see the sketch below)
Hard assignments to clusters are not stable under small perturbations of the data
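Because of the sensitivity to data-scaling noted above, in practice it is common to standardise features before running k-means; a brief sketch using scikit-learn (the synthetic dataset here is an illustrative assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Two features on very different scales: without standardisation the second feature dominates
X = np.random.default_rng(0).normal(size=(300, 2)) * np.array([1.0, 100.0])

X_std = StandardScaler().fit_transform(X)                  # zero mean, unit variance per feature
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)
```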
Mixture of Gaussians
Latent Variable Models
A latent variable is one that cannot be observed directly
It is inferred, through a model
It might be real, but unobservable, or more abstract (but nonetheless useful)
A latent variable model is a statistical model that relates a set of observable variables to a set of latent variables
Latent variable models allow us to express joint distributions more efficiently at the expense of model assumptions
Mixture of Gaussians
Latent Variable Models: Boolean Example
Consider 4 Boolean random variables, A, B, C, D, with outcomes
denoted by a, b, c, d respectively:
In order to fully characterise the joint pmf, pABCD(a, b, c, d), ∀a, b, c, d, we require: (16 – 1) = 15 variables
Mixture of Gaussians
Latent Variable Models: Boolean Example
Now consider adding a Boolean latent random variable, E, with outcomes denoted by e, such that:
[Graphical model: E is a parent of each of A, B, C, and D]
Mixture of Gaussians
Latent Variable Models: Boolean Example
This diagram is a short-hand for the following:

p_{A,B,C,D,E}(a, b, c, d, e) = p_A(a|e) p_B(b|e) p_C(c|e) p_D(d|e) p_E(e)

And in order to fully characterise the joint pmf, p_{ABCDE}(a, b, c, d, e), ∀a, b, c, d, e, we require only: (2 + 2 + 2 + 2 + 1) = 9 variables
Mixture of Gaussians
Mixture Models
Not all random variables can be modelled solely with standard probability distribution functions
A mixture model gives us access to a richer set of probability distribution functions (which are often multimodal):
[Figure: an example of a multimodal mixture density]
Mixture of Gaussians
Mixture of Gaussians for Clustering
A different, generative approach to clustering is to model the data using a latent variable mixture model
In particular we use a Mixture of Gaussians (also known as a Gaussian Mixture Model)
Mixture of Gaussians
A set of unlabelled points, {x^{(i)} ∈ R^m}_{i=1}^{n}, and k clusters
Each point has an unknown, latent, cluster assignment associated with it, z^{(i)} ∈ {1,…,k}, which is the outcome of a multinomial random variable, Z:

z^{(i)} ∼ Multinomial(π^{[1]},…,π^{[k]}), where: p_Z(z^{(i)} = j) = π^{[j]}, \quad \sum_{j=1}^{k} \pi^{[j]} = 1, \quad \pi^{[j]} \geq 0, ∀j

Contingent on the cluster assignment, z^{(i)}, each x^{(i)} is the outcome of a Gaussian random variable, X:

x^{(i)} | (z^{(i)} = j) ∼ N(x^{(i)}; μ^{[j]}, Σ^{[j]})

Where: μ^{[j]} ∈ R^m, Σ^{[j]} ∈ R^{m×m} are the mean and covariance associated with cluster j
Mixture of Gaussians
Gaussian Mixture Model
The joint distribution is thus given by:

p_{X,Z}(x^{(i)}, z^{(i)}) = p_X(x^{(i)}|z^{(i)}) \, p_Z(z^{(i)})

And marginalising over z^{(i)}:

p_X(x^{(i)}) = \sum_{z^{(i)}} p_X(x^{(i)}|z^{(i)}) \, p_Z(z^{(i)}) = \sum_{j=1}^{k} \pi^{[j]} N(x^{(i)}; \mu^{[j]}, \Sigma^{[j]})

Thus the unconditional probability of each x^{(i)} is modelled as a Gaussian Mixture Model (GMM)
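To make the generative story above concrete, here is a small sketch that samples from a two-component Gaussian mixture exactly as described (the particular parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a k = 2 mixture in R^2
pi = np.array([0.3, 0.7])                                   # mixing coefficients pi^[j]
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]          # means mu^[j]
Sigmas = [np.eye(2), np.array([[1.0, 0.8], [0.8, 1.0]])]    # covariances Sigma^[j]

n = 500
z = rng.choice(len(pi), size=n, p=pi)                        # z^(i) ~ Multinomial(pi^[1], ..., pi^[k])
X = np.array([rng.multivariate_normal(mus[j], Sigmas[j])     # x^(i) | z^(i) = j ~ N(mu^[j], Sigma^[j])
              for j in z])
```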
Mixture of Gaussians
Gaussian Mixture Model⁴
⁴ Illustrations based on Bishop, ‘Pattern Recognition & Machine Learning’ [2007]
Mixture of Gaussians
Maximum Likelihood
In order to use this model we need to learn its parameters: {μ^{[j]}, Σ^{[j]}, π^{[j]}}_{j=1}^{k}
We could try to maximise the log-likelihood (L):

L = \sum_{i=1}^{n} \log p_X(x^{(i)}; \{\mu^{[j]}, \Sigma^{[j]}, \pi^{[j]}\}_{j=1}^{k}) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi^{[j]} N(x^{(i)}; \mu^{[j]}, \Sigma^{[j]})

Subject to:

\sum_{j=1}^{k} \pi^{[j]} = 1, \quad \Sigma^{[j]} \succ 0
But this is a constrained, non-concave optimisation problem. It is hard to solve.
Mixture of Gaussians
Let’s try to tackle this problem by assuming that it’s concave, and unconstrained, to start with
We’ll differentiate L with respect to each variable of interest and set the result equal to zero:
Mixture of Gaussians
Maximising L with respect to μ^{[j]}:

\frac{\partial L}{\partial \mu^{[j]}} = \sum_{i=1}^{n} \frac{\pi^{[j]}}{\sum_{l=1}^{k} \pi^{[l]} N(x^{(i)}; \mu^{[l]}, \Sigma^{[l]})} \, \frac{\partial N(x^{(i)}; \mu^{[j]}, \Sigma^{[j]})}{\partial \mu^{[j]}} = \sum_{i=1}^{n} \gamma_i^{[j]} \, \frac{\partial}{\partial \mu^{[j]}} \log N(x^{(i)}; \mu^{[j]}, \Sigma^{[j]})

Where:

\gamma_i^{[j]} = \frac{\pi^{[j]} N(x^{(i)}; \mu^{[j]}, \Sigma^{[j]})}{\sum_{l=1}^{k} \pi^{[l]} N(x^{(i)}; \mu^{[l]}, \Sigma^{[l]})}

Since \log N(x^{(i)}; \mu^{[j]}, \Sigma^{[j]}) = -\frac{1}{2}(x^{(i)} - \mu^{[j]})^T \Sigma^{[j]-1} (x^{(i)} - \mu^{[j]}) + \text{const.}, this gives:

\frac{\partial L}{\partial \mu^{[j]}} = \sum_{i=1}^{n} \gamma_i^{[j]} \, \Sigma^{[j]-1} (x^{(i)} - \mu^{[j]})

Setting this to zero and solving for μ^{[j]} gives:

\mu^{[j]*} = \frac{\sum_{i=1}^{n} \gamma_i^{[j]} x^{(i)}}{\sum_{i=1}^{n} \gamma_i^{[j]}}
Mixture of Gaussians
Maximising L with respect to Σ^{[j]}:

\frac{\partial L}{\partial \Sigma^{[j]}} = \sum_{i=1}^{n} \gamma_i^{[j]} \, \frac{\partial}{\partial \Sigma^{[j]}} \log N(x^{(i)}; \mu^{[j]}, \Sigma^{[j]})

Setting this to zero and solving gives:

\Sigma^{[j]*} = \frac{\sum_{i=1}^{n} \gamma_i^{[j]} (x^{(i)} - \mu^{[j]})(x^{(i)} - \mu^{[j]})^T}{\sum_{i=1}^{n} \gamma_i^{[j]}}

(Note that strictly we should enforce the psd constraint, but the result would not differ)
Mixture of Gaussians
Maximising L with respect to π^{[j]}:

Finally we optimise with respect to π^{[j]}, and enforce the \sum_{j=1}^{k} \pi^{[j]} = 1 constraint using the method of Lagrange multipliers
This yields the following Lagrangian:

L = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi^{[j]} N(x^{(i)}; \mu^{[j]}, \Sigma^{[j]}) + \lambda \left( \sum_{j=1}^{k} \pi^{[j]} - 1 \right)

Differentiating, and enforcing stationarity, we obtain:

\frac{\partial L}{\partial \pi^{[j]}} = \sum_{i=1}^{n} \frac{\gamma_i^{[j]}}{\pi^{[j]}} + \lambda = 0

Rearranging and summing over j:

\sum_{j=1}^{k} \sum_{i=1}^{n} \gamma_i^{[j]} = -\lambda \sum_{j=1}^{k} \pi^{[j]} \quad \Rightarrow \quad \lambda = -n

And so:

\pi^{[j]*} = \frac{1}{n} \sum_{i=1}^{n} \gamma_i^{[j]}
Mixture of Gaussians
But each of our expressions for μ^{[j]*}, Σ^{[j]*}, π^{[j]*} is in terms of γ_i^{[j]}…
…And γ_i^{[j]} is a function of μ^{[j]*}, Σ^{[j]*}, π^{[j]*}
So it’s still not clear how to solve the related equations
Mixture of Gaussians
Responsibility
γ_i^{[j]} can be viewed as the responsibility that component j takes for explaining x^{(i)}:

\gamma_i^{[j]} = \frac{\pi^{[j]} N(x^{(i)}; \mu^{[j]}, \Sigma^{[j]})}{\sum_{l=1}^{k} \pi^{[l]} N(x^{(i)}; \mu^{[l]}, \Sigma^{[l]})}
It looks like another chicken-and-egg problem:
If we knew {μ^{[j]*}, Σ^{[j]*}, π^{[j]*}}_{j=1}^{k}, we could solve for the responsibilities
If we knew the responsibilities, {γ_i^{[j]}}_{i=1,j=1}^{n,k}, we could solve for {μ^{[j]*}, Σ^{[j]*}, π^{[j]*}}_{j=1}^{k}
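For completeness, here is a sketch of how these responsibilities might be computed in practice, working in log-space for numerical stability (the function name is mine; multivariate_normal and logsumexp are standard SciPy utilities):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(X, pis, mus, Sigmas):
    """gamma_i^[j] = pi^[j] N(x^(i); mu^[j], Sigma^[j]) / sum_l pi^[l] N(x^(i); mu^[l], Sigma^[l])."""
    log_weighted = np.column_stack([
        np.log(pis[j]) + multivariate_normal.logpdf(X, mean=mus[j], cov=Sigmas[j])
        for j in range(len(pis))
    ])                                                               # (n, k): log(pi^[j] N(...))
    return np.exp(log_weighted - logsumexp(log_weighted, axis=1, keepdims=True))
```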
Mixture of Gaussians
The EM Algorithm for GMM
One approach is to adopt an analogue to the k-means algorithm for solving this problem
The Expectation Maximisation (EM) algorithm for Gaussian Mixture Models is such an approach
We’ll begin by stating the algorithm before moving on to examine some of its properties from a more general perspective:
Mixture of Gaussians
The EM Algorithm for GMM
We start by choosing some initial (random) value for {μ^{[j]}, Σ^{[j]}, π^{[j]}}_{j=1}^{k}
E-Step:
Then, in the first step, we compute the responsibilities, γ_i^{[j]}, given the current parameters μ^{[j]}, Σ^{[j]}, π^{[j]}
M-Step:
Then, in the second step, we maximise L with respect to {μ^{[j]}, Σ^{[j]}, π^{[j]}}_{j=1}^{k}, keeping all of the γ_i^{[j]} fixed:

\mu^{[j]*} = \frac{\sum_{i=1}^{n} \gamma_i^{[j]} x^{(i)}}{\sum_{i=1}^{n} \gamma_i^{[j]}}, \quad
\Sigma^{[j]*} = \frac{\sum_{i=1}^{n} \gamma_i^{[j]} (x^{(i)} - \mu^{[j]})(x^{(i)} - \mu^{[j]})^T}{\sum_{i=1}^{n} \gamma_i^{[j]}}, \quad
\pi^{[j]*} = \frac{1}{n} \sum_{i=1}^{n} \gamma_i^{[j]}
The E-Step and M-Step are then repeated until convergence
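Putting the two steps together, here is a minimal NumPy/SciPy sketch of the EM algorithm as stated above; the initialisation, the small regularisation added to the covariances, and the returned log-likelihood history are my own choices, not prescribed by the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_gmm(X, k, n_iters=100, tol=1e-6, seed=0):
    """A sketch of EM for a Gaussian Mixture Model; returns parameters and the log-likelihood per epoch."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    pis = np.full(k, 1.0 / k)                                   # initial mixing coefficients
    mus = X[rng.choice(n, size=k, replace=False)]               # initialise means at random data points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(m) for _ in range(k)])
    log_lik = []

    for _ in range(n_iters):
        # E-step: responsibilities gamma_i^[j], computed in log-space for stability
        log_w = np.column_stack([np.log(pis[j]) +
                                 multivariate_normal.logpdf(X, mean=mus[j], cov=Sigmas[j])
                                 for j in range(k)])
        log_norm = logsumexp(log_w, axis=1, keepdims=True)
        gamma = np.exp(log_w - log_norm)                        # (n, k)
        log_lik.append(log_norm.sum())

        # M-step: the closed-form updates derived above
        Nj = gamma.sum(axis=0)                                  # effective number of points per component
        mus = (gamma.T @ X) / Nj[:, None]
        Sigmas = np.array([((gamma[:, j, None] * (X - mus[j])).T @ (X - mus[j])) / Nj[j]
                           + 1e-6 * np.eye(m) for j in range(k)])
        pis = Nj / n

        if len(log_lik) > 1 and abs(log_lik[-1] - log_lik[-2]) < tol:
            break

    return pis, mus, Sigmas, np.array(log_lik)
```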
Mixture of Gaussians
Motivation
Let’s look at the EM algorithm more generally:
Let θ be the set of parameters which we want to solve for (in the GMM case θ = {μ^{[j]}, Σ^{[j]}, π^{[j]}}_{j=1}^{k})
Then our log-likelihood function is given by:

L(\theta) = \sum_{i=1}^{n} \log p_X(x^{(i)}; \theta) = \sum_{i=1}^{n} \log \sum_{z^{(i)}} p_{X,Z}(x^{(i)}, z^{(i)}; \theta)

Where the last equality follows by marginalising over the hidden variable, z^{(i)}
Mixture of Gaussians
Motivation
Now, for each i, let q_Z^{(i)} be some probability distribution over z^{(i)}
So \sum_{z^{(i)}} q_Z^{(i)}(z^{(i)}) = 1 and q_Z^{(i)}(z^{(i)}) \geq 0
Then, we can always state, without loss of generality:

L(\theta) = \sum_{i=1}^{n} \log \sum_{z^{(i)}} q_Z^{(i)}(z^{(i)}) \, \frac{p_{X,Z}(x^{(i)}, z^{(i)}; \theta)}{q_Z^{(i)}(z^{(i)})} \qquad (1)
Mixture of Gaussians
Jensen’s Inequality
Jensen’s Inequality states that for a convex function, f, and a random variable, X:

E[f(X)] \geq f(E[X])

(Recall the convexity definition from earlier lectures: \lambda f(x) + (1-\lambda) f(y) \geq f(\lambda x + (1-\lambda) y))
In particular, f(x) = -\log(x) is convex since:

\nabla_x^2 (-\log(x)) = \frac{1}{x^2} > 0

Hence:

E[\log(X)] \leq \log(E[X]) \qquad (2)
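As a quick numerical sanity check of expression (2), a few lines comparing E[log(X)] with log(E[X]) on a random sample (the choice of an exponential distribution here is arbitrary):

```python
import numpy as np

x = np.random.default_rng(0).exponential(scale=2.0, size=100_000)   # a positive random variable
print(np.mean(np.log(x)), "<=", np.log(np.mean(x)))                  # E[log X] <= log E[X]
```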
Mixture of Gaussians
Motivation
Then from expressions (1) & (2):

L(\theta) = \sum_{i=1}^{n} \log \sum_{z^{(i)}} q_Z^{(i)}(z^{(i)}) \, \frac{p_{X,Z}(x^{(i)}, z^{(i)}; \theta)}{q_Z^{(i)}(z^{(i)})} \;\geq\; \sum_{i=1}^{n} \sum_{z^{(i)}} q_Z^{(i)}(z^{(i)}) \log \frac{p_{X,Z}(x^{(i)}, z^{(i)}; \theta)}{q_Z^{(i)}(z^{(i)})} \qquad (3)

This holds for any q_Z^{(i)}…
…But we want the tightest bound possible since we want to maximise L
…This obtains at equality…
…Let’s try to obtain equality by guessing a form for q_Z^{(i)}:
Mixture of Gaussians
Motivation
Let’s try q_Z^{(i)} = p_Z(z^{(i)}|x^{(i)}; \theta), then:

\sum_{i=1}^{n} \sum_{z^{(i)}} q_Z^{(i)}(z^{(i)}) \log \frac{p_{X,Z}(x^{(i)}, z^{(i)}; \theta)}{q_Z^{(i)}(z^{(i)})}
= \sum_{i=1}^{n} \sum_{z^{(i)}} p_Z(z^{(i)}|x^{(i)}; \theta) \log \frac{p_{X,Z}(x^{(i)}, z^{(i)}; \theta)}{p_Z(z^{(i)}|x^{(i)}; \theta)}
= \sum_{i=1}^{n} \sum_{z^{(i)}} p_Z(z^{(i)}|x^{(i)}; \theta) \log p_X(x^{(i)}; \theta)
= \sum_{i=1}^{n} \log p_X(x^{(i)}; \theta) = L(\theta)
As desired
Now, let’s revisit the EM algorithm in this more general setting:
Mixture of Gaussians
A More General EM Algorithm
We start by choosing some initial (random) value for θ, which allows us to calculate p_Z(z^{(i)}|x^{(i)}; θ)
E-Step:
We set q_Z^{(i)} = p_Z(z^{(i)}|x^{(i)}; θ), giving the tightest lower bound on the L(θ) which we are trying to maximise
M-Step:
Then, in the second step, we optimise the RHS of expression (3) with this setting of q_Z^{(i)}, which allows us to optimise the likelihood further with respect to θ
The E-Step and M-Step are then repeated until convergence
Mixture of Gaussians
Convergence
At the end of the t-th epoch, we have, by expression (3):

L(\theta_{(t)}) \;\geq\; \sum_{i=1}^{n} \sum_{z^{(i)}} q_{Z(t)}^{(i)}(z^{(i)}) \log \frac{p_{X,Z}(x^{(i)}, z^{(i)}; \theta_{(t)})}{q_{Z(t)}^{(i)}(z^{(i)})} \qquad (4)

But in the preceding M-step, θ_{(t)} was set as follows:

\theta_{(t)} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{z^{(i)}} q_{Z(t)}^{(i)}(z^{(i)}) \log \frac{p_{X,Z}(x^{(i)}, z^{(i)}; \theta)}{q_{Z(t)}^{(i)}(z^{(i)})}

Therefore:

\sum_{i=1}^{n} \sum_{z^{(i)}} q_{Z(t)}^{(i)}(z^{(i)}) \log \frac{p_{X,Z}(x^{(i)}, z^{(i)}; \theta_{(t)})}{q_{Z(t)}^{(i)}(z^{(i)})} \;\geq\; \sum_{i=1}^{n} \sum_{z^{(i)}} q_{Z(t)}^{(i)}(z^{(i)}) \log \frac{p_{X,Z}(x^{(i)}, z^{(i)}; \theta_{(t-1)})}{q_{Z(t)}^{(i)}(z^{(i)})} \qquad (5)
Mixture of Gaussians
Convergence
But the RHS of this inequality is the output of the preceding E-step, in which the tightest bound is achieved by setting q_{Z(t)}^{(i)} such that:

\sum_{i=1}^{n} \sum_{z^{(i)}} q_{Z(t)}^{(i)}(z^{(i)}) \log \frac{p_{X,Z}(x^{(i)}, z^{(i)}; \theta_{(t-1)})}{q_{Z(t)}^{(i)}(z^{(i)})} = L(\theta_{(t-1)}) \qquad (6)

So, by expressions (4), (5), & (6):

L(\theta_{(t)}) \;\geq\; L(\theta_{(t-1)})
So, the EM algorithm causes the likelihood to converge monotonically…
But not necessarily to a global optimum
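This monotonic behaviour can be checked empirically with the em_gmm sketch given earlier (a helper from these notes, not part of the lecture); the synthetic data below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(4.0, 1.0, size=(200, 2))])      # two well-separated blobs

pis, mus, Sigmas, log_lik = em_gmm(X, k=2)                 # em_gmm as sketched earlier
assert np.all(np.diff(log_lik) >= -1e-8)                   # log-likelihood never decreases (up to round-off)
```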
Mixture of Gaussians
The EM Algorithm: Convergence

[Figure sequence: EM viewed as coordinate ascent on a lower bound of L(θ). The E-step sets q_Z^{(i)} = p_Z(z^{(i)}|x^{(i)}; θ_{(t)}), making the bound tight at θ_{(t)}; the M-step then maximises the bound to move from θ_{(t)} to θ_{(t+1)}]
Mixture of Gaussians
The EM Algorithm for GMM: Revisited
By the E-Step:

q_Z^{(i)} = p_Z(z^{(i)}|x^{(i)}; \theta)

By Bayes’ Rule:

p_Z(z^{(i)} = j|x^{(i)}; \theta) = \frac{p_X(x^{(i)}|z^{(i)} = j; \theta) \, p_Z(z^{(i)} = j)}{\sum_{l=1}^{k} p_X(x^{(i)}|z^{(i)} = l; \theta) \, p_Z(z^{(i)} = l)}

So for the GMM in particular:

p_Z(z^{(i)} = j|x^{(i)}; \theta) = \frac{\pi^{[j]} N(x^{(i)}; \mu^{[j]}, \Sigma^{[j]})}{\sum_{l=1}^{k} \pi^{[l]} N(x^{(i)}; \mu^{[l]}, \Sigma^{[l]})}
And this justifies our form for the responsibilities earlier
Mixture of Gaussians
Example: Old Faithful⁵
⁵ Illustrations based on Bishop, ‘Pattern Recognition & Machine Learning’ [2007]

[Figure sequence: successive iterations of EM for a Mixture of Gaussians applied to the Old Faithful dataset (BCS Summer School, Exeter, 2003, C. Bishop)]