
Outline
• Latent Variable Models
• The Expectation-Maximization Procedure
• Gaussian Mixture Models
• K-Means Clustering
• Kernel K-Means

Motivation
[Figure: 2D PCA projections of the Iris, Boston, Diabetes, and Digits datasets.]
Complex data cannot be modeled accurately by standard probability distributions (e.g. Gaussian) and require a more complex description, e.g. with latent variables.

Latent Variable Models
A latent variable model (LVM) is a description of the data distribution that makes use of a layer of abstraction (latent variables) to improve modeling capability.
[Figure: the "classical" description generates the data x directly from p(x|θ); the latent variable description first draws the latent variable z from p(z|θ) and then generates the data x from p(x|z,θ).]

Relation between the "classical" and the latent variable description:

$$p(x \mid \theta) = \sum_{z \in Z} p(z \mid \theta) \cdot p(x \mid z, \theta)$$

Learning a Latent Variable Model
Assume a dataset D whose samples have been drawn i.i.d. In that case, the log-likelihood of the data according to the model can be written as:
$$\log p(D \mid \theta) = \log \prod_{x \in D} p(x \mid \theta) = \sum_{x \in D} \log p(x \mid \theta),$$
and we wish to maximize that quantity. Injecting the latent variable model, the objective to optimize becomes
$$\sum_{x \in D} \log \sum_{z \in Z} p(z \mid \theta) \cdot p(x \mid z, \theta).$$
Problem: The maximum of log p(D|θ) cannot be found analytically.
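As a small illustration, a minimal sketch (toy one-dimensional data, two latent clusters, hypothetical parameter values) that evaluates this objective by marginalizing over z inside the logarithm:

```python
import numpy as np
from scipy.stats import norm

# Toy dataset and hypothetical parameters (for illustration only)
D = np.array([-1.2, 0.3, 2.5, 3.1])   # observed samples x
alpha = np.array([0.5, 0.5])          # p(z|theta) for z in {1, 2}
mu = np.array([0.0, 3.0])             # per-cluster means
sigma = np.array([1.0, 1.0])          # per-cluster standard deviations

# p(x|theta) = sum_z p(z|theta) * p(x|z,theta), evaluated for every x in D
px = sum(alpha[i] * norm.pdf(D, mu[i], sigma[i]) for i in range(2))

# log p(D|theta) = sum_x log p(x|theta); the sum over z sits inside the log,
# which is why the maximum cannot be found analytically.
log_likelihood = np.log(px).sum()
print(log_likelihood)
```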

Learning a Latent Variable Model
Strategy: If the function p(D|θ) cannot be easily optimized, find a lower-bound of that function that is easier to optimize.

Building a Lower-Bound
Jensen’s inequality
For any element λ of the d-dimensional simplex (i.e. λ ≥ 0 and λ⊤1 = 1) and any concave function f : ℝᵈ → ℝ, we have
$$f\Big(\sum_{i=1}^{d} \lambda_i a_i\Big) \;\geq\; \sum_{i=1}^{d} \lambda_i f(a_i)$$
Application to the latent variable model:
• Let λ be some probability distribution q(z|x) independent of θ.
• Let the aᵢ contain the remaining terms.
• Because the latent variable objective Σ_{x∈D} log Σ_{z∈Z} p(z|θ) · p(x|z,θ) does not contain such a distribution q(z|x) independent of θ, create it!
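A minimal numerical sanity check of this inequality for the concave function f = log, with arbitrary toy values:

```python
import numpy as np

lam = np.array([0.2, 0.3, 0.5])   # element of the simplex: lam >= 0, sums to 1
a = np.array([1.0, 4.0, 9.0])     # arbitrary positive values

lhs = np.log(np.dot(lam, a))      # f(sum_i lam_i * a_i)
rhs = np.dot(lam, np.log(a))      # sum_i lam_i * f(a_i)
assert lhs >= rhs                 # Jensen: for concave f, lhs >= rhs
print(lhs, rhs)
```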

Building a Lower-Bound
Step 1: Introduce q(z|x) in both the numerator and the denominator:

$$\log p(D \mid \theta) = \sum_{x \in D} \log \sum_{z \in Z} \frac{p(z \mid \theta) \cdot p(x \mid z, \theta)}{q(z \mid x)} \cdot q(z \mid x)$$

Step 2: Apply the Jensen inequality with λ = q(z|x):

$$\geq \sum_{x \in D} \sum_{z \in Z} \log\Big(\frac{p(z \mid \theta) \cdot p(x \mid z, \theta)}{q(z \mid x)}\Big) \cdot q(z \mid x)$$

... and verify the bound:

$$= \sum_{x \in D} \sum_{z \in Z} \log\Big(\frac{p(x \mid \theta) \cdot p(z \mid x, \theta)}{q(z \mid x)}\Big) \cdot q(z \mid x)$$

$$= \underbrace{\sum_{x \in D} \sum_{z \in Z} \log\big(p(x \mid \theta)\big) \cdot q(z \mid x)}_{\log p(D \mid \theta)} \;-\; \underbrace{\sum_{x \in D} \sum_{z \in Z} \log\Big(\frac{q(z \mid x)}{p(z \mid x, \theta)}\Big) \cdot q(z \mid x)}_{\sum_{x \in D} \mathrm{KL}(q(z \mid x)\,\|\,p(z \mid x, \theta)) \;\geq\; 0}$$
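A minimal numerical check of this decomposition on a toy two-cluster model with hypothetical parameters: the lower bound equals log p(x|θ) minus the KL term, and is therefore never larger than the log-likelihood:

```python
import numpy as np
from scipy.stats import norm

x = 0.7                                   # a single observation
alpha = np.array([0.4, 0.6])              # p(z|theta)
mu, sigma = np.array([0.0, 2.0]), 1.0     # per-cluster Gaussians (toy values)

joint = alpha * norm.pdf(x, mu, sigma)    # p(z|theta) * p(x|z,theta)
log_px = np.log(joint.sum())              # log p(x|theta)
posterior = joint / joint.sum()           # p(z|x,theta)

q = np.array([0.5, 0.5])                  # some distribution q(z|x), here uniform
lower_bound = np.sum(q * np.log(joint / q))
kl = np.sum(q * np.log(q / posterior))

print(lower_bound, log_px - kl)           # identical: bound = log p(x|theta) - KL
print(lower_bound <= log_px)              # True, since KL >= 0
```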

Building a Lower-Bound
Question: How can we ensure that KL(q(z|x) ‖ p(z|x, θ)) is small, so that maximizing the lower bound with respect to θ gives a good approximation of the true log-likelihood?
Chicken & egg problem:
• If we use a simple q, e.g. uniformly distributed, we get a loose lower bound, from which we get a bad θ.
• To get a tight lower bound, we need to choose q(z|x) ≈ p(z|x, θ), which requires that we know θ.

The Expectation-Maximization (EM) Algorithm
Strategy: Start with some random solution θ and alternately estimate q and θ, until we converge.

The Expectation-Maximization (EM) Algorithm
$$\theta_0 \leftarrow \mathrm{random}()$$

(E-step) $$q_1(z \mid x) \leftarrow p(z \mid x, \theta_0)$$

(M-step) $$\theta_1 \leftarrow \arg\max_\theta \sum_{x \in D} \sum_{z \in Z} \log\Big(\frac{p(x, z \mid \theta)}{q_1(z \mid x)}\Big) \cdot q_1(z \mid x)$$

(E-step) $$q_2(z \mid x) \leftarrow p(z \mid x, \theta_1)$$

(M-step) $$\theta_2 \leftarrow \arg\max_\theta \sum_{x \in D} \sum_{z \in Z} \log\Big(\frac{p(x, z \mid \theta)}{q_2(z \mid x)}\Big) \cdot q_2(z \mid x)$$

$$\vdots$$
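A minimal sketch of this alternation in code, assuming hypothetical helper functions e_step (returning the posteriors q(z|x)) and m_step (maximizing the lower bound for fixed q); concrete versions of both steps for the GMM appear later in the lecture:

```python
import numpy as np

def em(X, init_theta, e_step, m_step, n_iter=100, tol=1e-6):
    """Generic EM loop: alternate the E-step and the M-step until convergence.

    e_step(X, theta) -> q, an (n_samples, n_clusters) array of posteriors p(z|x, theta)
    m_step(X, q)     -> theta, the parameters maximizing the lower bound for fixed q
    (Both helpers are hypothetical placeholders for a concrete model such as a GMM.)
    """
    theta = init_theta
    prev_q = None
    for _ in range(n_iter):
        q = e_step(X, theta)      # E-step: tighten the bound, q(z|x) = p(z|x, theta)
        theta = m_step(X, q)      # M-step: maximize the bound with respect to theta
        if prev_q is not None and np.max(np.abs(q - prev_q)) < tol:
            break                 # stop when the posteriors no longer change
        prev_q = q
    return theta, q
```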

The Expectation-Maximization (EM) Algorithm
[Figure: EM as alternating optimization in the (θ, q) plane, moving from the initial point (θ0, q1) towards the convergence point (θn, qn).]

Properties of the algorithm:
• Block coordinate descent
• Locally optimal step size
• The algorithm lands in a local optimum of the function p(D|θ).

Advantages of EM compared to gradient descent:
• no learning rate,
• no need to compute the gradients.

Application: Cluster Analysis
[Figure: 2D PCA projections of the Iris, Boston, Diabetes, and Digits datasets.]
Question: Can we build a model of the cluster structure of the data?

The Gaussian Mixture Model (GMM)
• The Gaussian Mixture Model is a latent variable model specialized for clustering.

[Figure: the latent variable description (general formulation), with p(z|θ) and p(x|z,θ), compared to the Gaussian mixture model, where p(z=i|θ) selects one of the clusters (i = 1, 2, 3, 4) and p(x|z=i,θ) generates the data of that cluster.]

• The latent variable z of a GMM is an integer between 1 and C. Each value of z indicates a cluster of the data.
• With each cluster, we associate a different data-generating distribution (e.g. a Gaussian with cluster-specific mean and covariance).

The Gaussian Mixture Model (GMM)
The GMM is formally defined by the two equations:
$$p(z = i \mid \theta) = \alpha_i$$

$$p(x \mid z = i, \theta) = (2\pi)^{-d/2} \cdot |\Sigma_i|^{-1/2} \cdot \exp\Big(-\tfrac{1}{2}(x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i)\Big)$$
The parameters θ of the model to be learned are:
• The mixing coefficients (αi), i = 1, …, C, subject to αi ≥ 0 and Σ_{i=1}^{C} αi = 1.
• The mean vectors (μi), i = 1, …, C.
• The covariance matrices (Σi), i = 1, …, C, subject to positive semi-definiteness.
Various simplifications of the GMM can be used in practice, e.g. for speed, statistical robustness, or simplicity of implementation:
• diagonal / isotropic / tied / fixed covariance matrices,
• fixed mixing coefficients, etc.
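A minimal NumPy sketch (hypothetical function name gmm_density) of how the density p(x|θ) = Σi αi · p(x|z=i, θ) defined by these parameters can be evaluated:

```python
import numpy as np

def gmm_density(x, alpha, mu, Sigma):
    """Evaluate p(x|theta) = sum_i alpha_i * N(x; mu_i, Sigma_i) for a single point x.

    alpha: (C,) mixing coefficients, mu: (C, d) means, Sigma: (C, d, d) covariances.
    """
    d = x.shape[0]
    density = 0.0
    for a, m, S in zip(alpha, mu, Sigma):
        diff = x - m
        norm_const = (2 * np.pi) ** (-d / 2) * np.linalg.det(S) ** (-0.5)
        density += a * norm_const * np.exp(-0.5 * diff @ np.linalg.solve(S, diff))
    return density

# Example with C = 2 clusters in d = 2 dimensions (toy parameters)
alpha = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = np.stack([np.eye(2), 0.5 * np.eye(2)])
print(gmm_density(np.array([1.0, 1.0]), alpha, mu, Sigma))
```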

EM for the GMM (simplified)
Consider the simplified GMM:
$$p(z = i \mid \theta) = \frac{1}{C} \qquad\qquad p(x \mid z = i, \theta) = \frac{1}{(\pi/\gamma)^{d/2}} \exp\big(-\gamma \|x - \mu_i\|^2\big)$$

E-step: (apply Bayes' rule)

$$q(z = i \mid x) = \frac{p(z = i \mid \theta) \cdot p(x \mid z = i, \theta)}{p(x \mid \theta)} = \frac{\exp\big(-\gamma \|x - \mu_i\|^2\big)}{\sum_{j=1}^{C} \exp\big(-\gamma \|x - \mu_j\|^2\big)}$$

M-step: (set the gradient of the lower bound to zero)

$$\frac{\partial}{\partial \theta} \sum_{x \in D} \sum_{i=1}^{C} \log\big(\exp(-\gamma \|x - \mu_i\|^2)\big) \cdot q(z = i \mid x) = 0
\quad\Rightarrow\quad
\mu_i = \frac{\sum_{x \in D} x \cdot q(z = i \mid x)}{\sum_{x \in D} q(z = i \mid x)}$$

Implementing GMMs
The (simplified) GMM can be implemented easily in a few lines of code:
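A minimal NumPy sketch of such an implementation, following the E-step and M-step derived above (hypothetical function name gmm_em; uniform mixing coefficients and a fixed, hypothetical width parameter gamma):

```python
import numpy as np

def gmm_em(X, C, gamma=1.0, n_iter=100, seed=0):
    """EM for the simplified GMM: p(z=i) = 1/C, p(x|z=i) proportional to exp(-gamma*||x - mu_i||^2)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=C, replace=False)]   # initialize means on data points

    for _ in range(n_iter):
        # E-step: responsibilities q(z=i|x) via Bayes' rule (softmax of -gamma * squared distance)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, C) squared distances
        logits = -gamma * d2
        logits -= logits.max(axis=1, keepdims=True)                # numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)

        # M-step: mu_i = weighted mean of the data under the responsibilities
        mu = (q.T @ X) / q.sum(axis=0)[:, None]

    return mu, q

# Usage on toy 2D data drawn from two well-separated blobs
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0])
mu, q = gmm_em(X, C=2, gamma=1.0)
print(mu)
```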

GMM Demo (1st try)
[Figure: GMM demo (1st try).]

GMM Demo (2nd try)
[Figure: GMM demo (2nd try).]

Gaussian Mixture Model (Summary)
• The number of clusters needs to be determined in advance.
• EM converges to a local optimum after a few iterations.
• The EM objective is non-convex, so initialization matters.
• For a better result, run the algorithm multiple times and retain the best solution.

The K-Means Algorithm
Cluster assignment step:

$$c(x) = \arg\min_k \|x - \mu_k\|^2$$

Centroid update step:

$$\mu_k = \frac{\sum_{x : c(x) = k} x}{\sum_{x : c(x) = k} 1}$$
• K-means can be interpreted as the limit case of EM for Gaussian distributions with γ → ∞.
• K-means produces hard assignments (i.e. each data point belongs to a single cluster).
• K-means has the same limitations as the GMM (non-convex objective, predefined cluster shape).
• K-means has only one parameter: the number of clusters.
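A minimal NumPy sketch of the two alternating steps above (hypothetical function name kmeans, centroids initialized on random data points):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain k-means: alternate cluster assignment and centroid update."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]        # initial centroids

    for _ in range(n_iter):
        # Assignment step: c(x) = argmin_k ||x - mu_k||^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)

        # Update step: mu_k = mean of the points assigned to cluster k
        new_mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu

    return mu, c

# Usage on toy data
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0])
mu, c = kmeans(X, K=2)
print(mu)
```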

Kernelizing K-Means
Problem: Standard k-means requires clusters to have a round shape; it fails on 'banana'-shaped datasets.
Idea: Replace x by the nonlinear feature map φ(x) and define μi = Σ_j βij φ(xj), i.e. centroids become weighted combinations of data points.
1. Cluster assignment step:

$$c(x) = \arg\min_i \|\varphi(x) - \mu_i\|^2 = \arg\max_i \Big( 2 \sum_j \beta_{ij}\, k(x, x_j) - \beta_i^\top K \beta_i \Big)$$

2. Centroid update step:

$$\beta_i = \arg\min_{\beta_i} \sum_{x : c(x) = i} \|\varphi(x) - \mu_i\|^2 = K^{-1} \cdot \frac{\sum_{x : c(x) = i} k(X, x)}{\sum_{x : c(x) = i} 1}$$

[Figure: cluster assignments of standard k-means vs. kernel k-means.]
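A minimal NumPy sketch of kernel k-means operating on a precomputed kernel matrix (hypothetical function name kernel_kmeans); it uses the closed form of the centroid update, in which the K⁻¹ factor cancels so that βij = 1/nᵢ for the points assigned to cluster i:

```python
import numpy as np

def kernel_kmeans(K, C, n_iter=50, seed=0):
    """Kernel k-means on a precomputed kernel matrix K (n x n).

    Centroids are represented implicitly as mu_i = sum_j beta[i, j] * phi(x_j).
    """
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    c = rng.integers(0, C, size=n)                    # random initial assignments

    for _ in range(n_iter):
        # Centroid update: beta[i, j] = 1/n_i if c(x_j) = i else 0
        # (closed-form solution of the update rule above; K^{-1} cancels).
        beta = np.zeros((C, n))
        for i in range(C):
            members = (c == i)
            if members.any():
                beta[i, members] = 1.0 / members.sum()

        # Assignment step: argmax_i 2 * sum_j beta[i,j] k(x, x_j) - beta_i^T K beta_i
        cross = 2.0 * beta @ K                               # (C, n)
        norms = np.einsum('ij,jk,ik->i', beta, K, beta)      # (C,) values of beta_i^T K beta_i
        new_c = (cross - norms[:, None]).argmax(axis=0)
        if np.array_equal(new_c, c):
            break
        c = new_c

    return c

# Usage with an RBF kernel (hypothetical bandwidth) on toy data X of shape (n, d)
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4.0])
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2)
print(kernel_kmeans(K, C=2))
```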

Other Clustering Algorithms
• Deep k-means
• Hierarchical clustering
  – Divisive clustering
  – Agglomerative clustering
• DBSCAN
• Affinity Propagation

Beyond Clustering
[Figure: mixture model (today's lecture) vs. product of experts (next week's lecture).]

Summary
• Latent variable models are a powerful approach to modeling complex data distributions.
• Expectation-maximization is a framework to efficiently optimize these models.
• Latent variable models can be used for clustering, e.g. the Gaussian mixture model.
• The well-known k-means algorithm can be seen as a limit case of the Gaussian mixture model.
• K-means can be kernelized to produce nonlinear boundaries between clusters.