https://xkcd.com/1338/
Mixture Models and Expectation Maximisation
Pre-read/watch:
K-Means algorithm (PRML 9.1) and update equations of Gaussian Mixture Models (MML book)
EM revisited
– An alternative view of EM
– Connections between GMM and K-means
– Bernoulli mixture
EM in general – does it really maximise likelihood, and why?
Practical considerations and other topics
– impossibility of clustering [Kleinberg 2003]
– Kmeans++ [Vassilvitskii and Arthur, 2006]
Recap: what is clustering
Luxburg, U.V., Williamson, R.C., & Guyon, I. (2012). Clustering: Science or Art? ICML Unsupervised and Transfer Learning.
– Initialise a set of cluster centers / component means {𝝁k}
– Expectation step: re-assign data points to clusters
– Maximisation step: re-compute the cluster means – update {𝝁k}
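A minimal numpy sketch of this loop (the names X, K, mu are illustrative, and the empty-cluster handling is just one simple choice, not prescribed by the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=100, rng=np.random.default_rng(0)):
    # Initialise cluster centers {mu_k} with K distinct data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Expectation-like step: re-assign each point to its nearest center
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # N x K
        z = dists.argmin(axis=1)
        # Maximisation step: re-compute the cluster means, updating {mu_k}
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # stop when the centers no longer move
            break
        mu = new_mu
    return mu, z
```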
Gaussian Mixture Models
K-means and GMM – hard vs soft assignments
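A small numpy/scipy sketch contrasting the two kinds of assignment (the names mu, sigma2, pi and the spherical-covariance assumption are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def hard_assignments(X, mu):
    # K-means: each point belongs to exactly one cluster (0/1 assignment)
    d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def soft_assignments(X, mu, sigma2, pi):
    # GMM E-step: responsibility gamma_nk proportional to pi_k * N(x_n | mu_k, sigma2_k I)
    K, D = mu.shape
    dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma2[k] * np.eye(D))
                     for k in range(K)], axis=1)        # N x K
    return dens / dens.sum(axis=1, keepdims=True)       # each row sums to 1
```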
Wait … what is being maximised in each maximisation step?
Expectation-maximisation: X observed, Z “latent”
E step: compute the posterior P(Z | X, 𝞱); take the expectation of the complete-data log-likelihood P(X, Z | 𝞱) w.r.t. Z
M step: maximise this expectation w.r.t. 𝞱
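In symbols, using the standard EM notation (Q for the expected complete-data log-likelihood is assumed notation, not from the slides):

```latex
\begin{align}
  \text{E step:}\quad & Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})
      = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})
        \, \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \\
  \text{M step:}\quad & \boldsymbol{\theta}^{\text{new}}
      = \arg\max_{\boldsymbol{\theta}} \; Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})
\end{align}
```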
Convince ourselves that this is true for GMM
MAP objective for GMM, revisited
K-means and GMM – differences and connections
Complexity: how many parameters for each? (a rough count is sketched below)
Implementation: can a cluster center “die” in each model, and how to handle it? What if some Gaussian components are singular?
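As a back-of-the-envelope answer to the complexity question above, assuming D-dimensional data, K components, and full covariance matrices for the GMM (the exact count depends on the covariance type chosen):

```latex
\begin{align}
  \text{K-means:}\quad & K D \quad \text{(the cluster centers } \boldsymbol{\mu}_k\text{)} \\
  \text{GMM:}\quad & \underbrace{K D}_{\text{means}}
      \;+\; \underbrace{K \tfrac{D(D+1)}{2}}_{\text{covariances}}
      \;+\; \underbrace{K - 1}_{\text{mixing coefficients}}
\end{align}
```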
Connections: with shared spherical covariances εI, EM for a GMM reduces to K-means in the limit ε → 0 – the hard assignments that emerge do not depend on the exact value of ε, as long as ε > 0.
Mixture of Bernoulli Distributions
D separate Bernoulli distributions: binary variables xi, each governed by a Bernoulli distribution with mean 𝝁i
Data likelihood 🙁 (the sum over components sits inside the log – no closed-form maximum)
Complete data likelihood
Expectation of the complete data likelihood (written out below)
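A sketch of the quantities just listed, in PRML-style notation for the Bernoulli mixture (γ(znk) are the E-step responsibilities; the exact subscripting is assumed, not taken from the slides):

```latex
\begin{align}
  p(\mathbf{x} \mid \boldsymbol{\mu}_k)
      &= \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i} \\
  \ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\pi})
      &= \sum_{n=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k \, p(\mathbf{x}_n \mid \boldsymbol{\mu}_k) \Big\}
      \qquad \text{(sum inside the log)} \\
  \mathbb{E}_{\mathbf{Z}}\big[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\pi})\big]
      &= \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})
         \Big\{ \ln \pi_k + \sum_{i=1}^{D} \big[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln (1 - \mu_{ki}) \big] \Big\}
\end{align}
```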
EM revisited
– Connections between GMM and K-means
– Bernoulli mixture
EM in general – does it really maximise likelihood, and why?
Practical considerations and other topics
– impossibility of clustering [Kleinberg 2003]
– Kmeans++ [Vassilvitskii and Arthur, 2006]
KL divergence (reminder)
Apply Jensen’s inequality: −ln(·) is convex, so KL(p ∥ q) ≥ 0
Equality holds iff p(x) = q(x) for all x
If we have “incorrectly” represented p(x) with q(x), how much more information do we need to recover p(x)?
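As a reminder, the definition and the Jensen’s-inequality argument in one line (standard notation, summing over a discrete x for simplicity):

```latex
\begin{align}
  \mathrm{KL}(p \,\|\, q) &= \sum_{x} p(x) \ln \frac{p(x)}{q(x)}
      \;=\; -\sum_{x} p(x) \ln \frac{q(x)}{p(x)} \\
      &\ge\; -\ln \sum_{x} p(x) \frac{q(x)}{p(x)}
      \;=\; -\ln \sum_{x} q(x) \;=\; 0
      \qquad \text{(Jensen, since } -\ln(\cdot) \text{ is convex)}
\end{align}
```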
The EM algorithm in general
Goal: show that the EM algorithm maximises the likelihood function.
Illustrating the decomposition
E step: set q(Z) = p(Z | X, 𝞱old), i.e. the posterior p(Z | X, 𝞱) evaluated at the current parameters (see the decomposition below)
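The decomposition being illustrated, in the usual PRML form (q(Z) is any distribution over the latent variables):

```latex
\begin{align}
  \ln p(\mathbf{X} \mid \boldsymbol{\theta})
      &= \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p) \\
  \mathcal{L}(q, \boldsymbol{\theta})
      &= \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})},
  \qquad
  \mathrm{KL}(q \,\|\, p)
      = -\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}
\end{align}
```

Since the KL term is non-negative, L(q, 𝞱) is a lower bound on ln p(X | 𝞱); setting q(Z) = p(Z | X, 𝞱old) makes the KL term vanish, so the bound touches the log-likelihood at 𝞱old.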
EM as alternating maximisation
Lower bound
L(q, 𝞱) is a convex function having a unique maximum (for mixture components from the exponential family).
Extensions: Generalised EM seeks to improve rather than maximise L(q, 𝞱); expectation conditional maximisation seeks to maximise L(q, 𝞱) for a subset of the parameters; incremental algorithms also exist.
EM revisited
– Connections between GMM and K-means
– Bernoulli mixture
EM in general – does it really maximise likelihood, and why?
Practical considerations and other topics
– impossibility of clustering [Kleinberg 2003]
– Kmeans++ [Vassilvitskii and Arthur, 2006]
Impossibility of clustering [Kleinberg, NeurIPS 2003]
Single-linkage operates by initializing each point as its own cluster, and then repeatedly merging the pair of clusters whose distance to one another (as measured from their closest points of approach) is minimum. Stopping conditions: k-cluster, distance-r (see the sketch below).
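One possible way to try both stopping conditions with scipy’s hierarchical clustering routines (the data X and the thresholds 3 and 0.5 are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(100, 2))
Z = linkage(X, method="single")            # repeatedly merge the closest pair of clusters

labels_k = fcluster(Z, t=3, criterion="maxclust")    # k-cluster stopping condition (k = 3)
labels_r = fcluster(Z, t=0.5, criterion="distance")  # distance-r stopping condition (r = 0.5)
```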
K-means++
Vassilvitskii, Sergei, and Arthur, David. “k-means++: The advantages of careful seeding.” In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035. 2006.
Problem with K-means: finds a local optimum that could be arbitrarily worse than the global optimum.
Algorithm summary (a seeding sketch is given below):
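A minimal sketch of the k-means++ seeding step described in the paper (function and variable names are illustrative): pick the first center uniformly at random, then pick each subsequent center with probability proportional to the squared distance to the nearest already-chosen center.

```python
import numpy as np

def kmeans_pp_init(X, K, rng=np.random.default_rng(0)):
    centers = [X[rng.integers(len(X))]]                     # first center: uniform at random
    for _ in range(K - 1):
        # squared distance from each point to its nearest chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()                               # sample proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
# The resulting centers then seed the usual K-means iterations.
```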
What do you do in practice? (see the sketch below)
– Normalise data.
– Use Kmeans++ to initialise Kmeans (sklearn.cluster.kmeans_plusplus).
– Use Kmeans (and spherical covariances) to initialise GMM.
– Try a number of different initialisations: sklearn.cluster.KMeans(n_clusters=8, *, init=’k-means++’, n_init=10, …
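Putting the recipe together with sklearn, as a sketch (the synthetic data and the specific parameter values are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 4))

X_norm = StandardScaler().fit_transform(X)                        # normalise data
km = KMeans(n_clusters=8, init="k-means++", n_init=10).fit(X_norm)  # k-means++ seeding, several restarts
gmm = GaussianMixture(n_components=8, covariance_type="spherical",
                      init_params="kmeans").fit(X_norm)           # K-means (spherical) initialisation for the GMM
```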
EM revisited
– Connections between GMM and K-means
– Bernoulli mixture
EM in general – does it really maximise likelihood, and why?
Practical considerations and other topics
– impossibility of clustering [Kleinberg 2003]
– Kmeans++ [Vassilvitskii and Arthur, 2006]
Gradescope logistics:
● Rubric list is available – feel free to look at what the rest of the de-scoring criteria are
● Remember to link submission pages to questions
Assignment 1 “post mortem” – a few notable errors:
● Theory Q4.2 – derivation error (center data or not?) propagates through to the rest of the question
● Programming Q2 – used pytorch to find weights but did not show workings
● No grades from automarker if there is a python error – more access to the auto-grader in assignment 2