Introduction to Machine Learning EM algorithm
Prof. Kutty
Generative models
Gaussian Mixture Model (GMM)
Mixture of Gaussians
image source: Bishop 2006
MLE of GMM with known labels: Example
2.1, 0, 3.5, −1, 1.5, 2.5, −0.5, 0.05,1, −2, 0, 1, −2, 1.1, −0.5, −0.03
Log-Likelihood for GMMs with known labels
$$
p(S_n) = \prod_{i=1}^n p(\bar{x}^{(i)}, z^{(i)})
       = \prod_{i=1}^n p(\bar{x}^{(i)} \mid z^{(i)})\, p(z^{(i)})
       = \prod_{i=1}^n \prod_{j=1}^k \left[\gamma_j\, N(\bar{x}^{(i)}; \bar{\mu}^{(j)}, \sigma_j^2)\right]^{\delta(j,i)}
$$
where $\delta(j,i) = 1$ if example $i$ is assigned to cluster $j$ and $0$ otherwise.
Maximum log-likelihood objective:
$$
\ln p(S_n) = \ln \prod_{i=1}^n \prod_{j=1}^k \left[\gamma_j\, N(\bar{x}^{(i)}; \bar{\mu}^{(j)}, \sigma_j^2)\right]^{\delta(j,i)}
           = \sum_{i=1}^n \sum_{j=1}^k \delta(j,i)\, \ln\!\left(\gamma_j\, N(\bar{x}^{(i)}; \bar{\mu}^{(j)}, \sigma_j^2)\right)
$$
Gaussian Mixture Model (GMM) Model Parameters
How many independent model parameters in a mixture of 4 spherical Gaussians?
use this link for in-class exercises
https://forms.gle/jqAdK1sSMhcx6zDHA
Recall: pdf of a spherical Gaussian
$$
P(\bar{x} \mid \bar{\mu}^{(j)}, \sigma_j^2) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\!\left(-\frac{1}{2\sigma_j^2}\,\|\bar{x} - \bar{\mu}^{(j)}\|^2\right)
$$
Gaussian Mixture Model (GMM) Model Parameters
How many independent model parameters in a mixture of 4 spherical Gaussians?
use this link for in-class exercises
https://forms.gle/jqAdK1sSMhcx6zDHA
Recall: pdf of a spherical Gaussian
$$
P(\bar{x} \mid \bar{\mu}^{(j)}, \sigma_j^2) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\!\left(-\frac{1}{2\sigma_j^2}\,\|\bar{x} - \bar{\mu}^{(j)}\|^2\right)
$$
$$
\bar{\theta} = [\gamma_1, \ldots, \gamma_k,\ \bar{\mu}^{(1)}, \ldots, \bar{\mu}^{(k)},\ \sigma_1^2, \ldots, \sigma_k^2]
$$
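One way to count, consistent with the $k(d+2)-1$ formula quoted later for BIC: a mixture of 4 spherical Gaussians in $d$ dimensions has 4 means with $d$ entries each, 4 variances, and 4 mixing weights of which only 3 are free (they sum to 1), so
$$
4d + 4 + (4 - 1) = 4d + 7 = 4(d + 2) - 1 \ \text{independent parameters.}
$$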
MLE for GMMs with known labels
Maximum log likelihood objective
$$
\sum_{i=1}^n \sum_{j=1}^k \delta(j,i)\, \ln\!\left(\gamma_j\, N(\bar{x}^{(i)}; \bar{\mu}^{(j)}, \sigma_j^2)\right)
$$
MLE solution (given “cluster labels”):
$$
\hat{n}_j = \sum_{i=1}^n \delta(j, i) \qquad \text{number of points assigned to cluster } j
$$
$$
\hat{\gamma}_j = \frac{\hat{n}_j}{n} \qquad \text{fraction of points assigned to cluster } j
$$
$$
\hat{\bar{\mu}}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^n \delta(j, i)\, \bar{x}^{(i)} \qquad \text{mean of points in cluster } j
$$
$$
\hat{\sigma}_j^2 = \frac{1}{d\,\hat{n}_j} \sum_{i=1}^n \delta(j, i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(j)}\|^2 \qquad \text{spread in cluster } j
$$
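A minimal numpy sketch of these known-label MLE formulas (the names X, labels, and mle_known_labels are illustrative, not from the slides):

import numpy as np

def mle_known_labels(X, labels, k):
    """MLE for a mixture of spherical Gaussians when cluster labels are known.
    X: (n, d) data matrix; labels: length-n array of cluster indices in {0, ..., k-1}."""
    n, d = X.shape
    gammas = np.zeros(k)
    means = np.zeros((k, d))
    variances = np.zeros(k)
    for j in range(k):
        mask = (labels == j)
        n_j = mask.sum()                          # number of points assigned to cluster j
        gammas[j] = n_j / n                       # fraction of points in cluster j
        means[j] = X[mask].mean(axis=0)           # mean of points in cluster j
        variances[j] = ((X[mask] - means[j]) ** 2).sum() / (d * n_j)  # spread in cluster j
    return gammas, means, variances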
MLE for GMMs with unknown labels
Parameters of GMMs
2.1, 0, 3.5, −1, 1.5, 2.5, −0.5, 0.05,1, −2, 0, 1, −2, 1.1, −0.5, −0.03
Learning the Model Parameters
$$
p(S_n) = \prod_{i=1}^n p(\bar{x}^{(i)})
       = \prod_{i=1}^n \sum_{j=1}^k p(\bar{x}^{(i)}, z^{(i)} = j)
       = \prod_{i=1}^n \sum_{j=1}^k p(\bar{x}^{(i)} \mid z^{(i)} = j)\, p(z^{(i)} = j)
       = \prod_{i=1}^n \sum_{j=1}^k \gamma_j\, N(\bar{x}^{(i)}; \bar{\mu}^{(j)}, \sigma_j^2)
$$
Given the training data, find the model parameters that maximize the log-likelihood
$$
\ln p(S_n) = \ln \prod_{i=1}^n \sum_{j=1}^k \gamma_j\, N(\bar{x}^{(i)}; \bar{\mu}^{(j)}, \sigma_j^2)
           = \sum_{i=1}^n \ln \sum_{j=1}^k \gamma_j\, N(\bar{x}^{(i)}; \bar{\mu}^{(j)}, \sigma_j^2)
$$
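A small numpy sketch of this log-likelihood for spherical Gaussians (function and variable names are illustrative); the log-sum-exp form is used for numerical stability since the sum over clusters sits inside the log:

import numpy as np
from scipy.special import logsumexp

def gmm_log_likelihood(X, gammas, means, variances):
    """ln p(S_n) = sum_i ln sum_j gamma_j N(x_i; mu_j, sigma_j^2 I)."""
    n, d = X.shape
    # log of gamma_j * N(x_i | mu_j, sigma_j^2 I), shape (n, k)
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_pdf = -0.5 * sq_dist / variances - 0.5 * d * np.log(2 * np.pi * variances)
    return logsumexp(np.log(gammas) + log_pdf, axis=1).sum()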
Expectation Maximization for GMMs
Expectation Maximization for GMMs: overview
Iterate until convergence
- E step: use current estimate of mixture model to softly assign examples to clusters
– M step: re-estimate each cluster model separately based on the points assigned to it (similar to the “known label” case)
fix (assume) the current model parameters
$$
\bar{\theta} = [\gamma_1, \ldots, \gamma_k,\ \bar{\mu}^{(1)}, \ldots, \bar{\mu}^{(k)},\ \sigma_1^2, \ldots, \sigma_k^2]
$$
Expectation Maximization for GMMs
E-step: softly assign points to clusters according to the posterior probability
&'( = *6+(-̅(7)|/̅6,168) ∑9*9+(-̅(7)|/̅9,198) eg
P blue dataptz
pdf bluegaussian
i3 “soft” cluster assignment % – 5
blue evaluated for torrentz.ie I i
Pdf bluegaussian evaluated for datapoint3
given a datapoint (̅(“) what is the probability that cluster – generated it Analogousto6 -5 notethat∑’% -5 =1
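A minimal numpy sketch of this E-step for spherical Gaussians (names like e_step and resp are illustrative):

import numpy as np

def e_step(X, gammas, means, variances):
    """Soft assignments p(j|i) for a mixture of spherical Gaussians.
    Returns an (n, k) array of responsibilities whose rows sum to 1."""
    n, d = X.shape
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)              # (n, k)
    pdf = np.exp(-0.5 * sq_dist / variances) / (2 * np.pi * variances) ** (d / 2)
    weighted = gammas * pdf                   # numerator: gamma_j * N(x_i | mu_j, sigma_j^2 I)
    return weighted / weighted.sum(axis=1, keepdims=True)   # normalize over clusters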
Expectation Maximization for GMMs E-step: Example
E-step: softly assign points to clusters according to current guess of model parameters
$$
p(j \mid i) = \frac{\gamma_j\, P(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2)}{\sum_{t=1}^k \gamma_t\, P(\bar{x}^{(i)} \mid \bar{\mu}^{(t)}, \sigma_t^2)}
$$
use this link for in-class exercises
https://forms.gle/jqAdK1sSMhcx6zDHA
Example (1D): mixture weights and Gaussian parameters
Cluster 1: $\gamma_1 = 0.5$, mean = 0, variance = 1
Cluster 2: $\gamma_2 = 0.3$, mean = 1, variance = 1
Cluster 3: $\gamma_3 = 0.2$, mean = 3, variance = 4
Datapoints are shown on the slide figure; the first datapoint is $x^{(1)} = 0$.
Compute $p(1 \mid 1)$, $p(2 \mid 1)$, $p(3 \mid 1)$ using
$$
N(x^{(1)}; \mu^{(j)}, \sigma_j^2) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left(-\frac{(x^{(1)} - \mu^{(j)})^2}{2\sigma_j^2}\right)
$$
Which is the likeliest cluster for datapoint $x^{(1)}$?
Expectation Maximization for GMMs E-step: Example
E-step: softly assign points to clusters according to current guess of model parameters. Example:
$$
p(j \mid i) = \frac{\gamma_j\, P(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2)}{\sum_{t=1}^k \gamma_t\, P(\bar{x}^{(i)} \mid \bar{\mu}^{(t)}, \sigma_t^2)}
$$
Cluster 1: $\gamma_1 = 0.5$, mean = 0, variance = 1
Cluster 2: $\gamma_2 = 0.3$, mean = 1, variance = 1
Cluster 3: $\gamma_3 = 0.2$, mean = 3, variance = 4
For datapoint $x^{(1)} = 0$, with $N(x^{(1)}; \mu^{(j)}, \sigma_j^2) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left(-\frac{(x^{(1)} - \mu^{(j)})^2}{2\sigma_j^2}\right)$:
$$
p(1 \mid 1) \propto 0.5 \times 0.39894, \qquad
p(2 \mid 1) \propto 0.3 \times 0.24197, \qquad
p(3 \mid 1) \propto 0.2 \times 0.06476
$$
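Completing the normalization (arithmetic not shown on the slide):
$$
p(1 \mid 1) = \frac{0.5 \times 0.39894}{0.5 \times 0.39894 + 0.3 \times 0.24197 + 0.2 \times 0.06476} = \frac{0.19947}{0.28501} \approx 0.70,
\qquad p(2 \mid 1) \approx 0.25, \qquad p(3 \mid 1) \approx 0.05,
$$
so cluster 1 is the likeliest cluster for $x^{(1)}$.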
Expectation Maximization for GMMs
• M-Step: optimizes each cluster separately given p(j|i)
$$
\hat{n}_j = \sum_{i=1}^n p(j \mid i), \qquad
\hat{\gamma}_j = \frac{\hat{n}_j}{n}, \qquad
\hat{\bar{\mu}}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^n p(j \mid i)\, \bar{x}^{(i)}, \qquad
\hat{\sigma}_j^2 = \frac{1}{d\,\hat{n}_j} \sum_{i=1}^n p(j \mid i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(j)}\|^2
$$
Expectation Maximization for GMMs:
M step (note correspondence with known labels)
if you knew the "soft" cluster assignments $p(j \mid i)$, you could compute the MLE parameters $\hat{\bar{\theta}}$ as follows
MLE for GMM with known labels  →  "soft" version:
$$
\hat{n}_j = \sum_{i=1}^n \delta(j, i) \ \longrightarrow\ \hat{n}_j = \sum_{i=1}^n p(j \mid i) \qquad \text{effective number of points assigned to cluster } j
$$
$$
\hat{\gamma}_j = \frac{\hat{n}_j}{n} \qquad \text{"fraction" of points assigned to cluster } j
$$
$$
\hat{\bar{\mu}}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^n \delta(j, i)\, \bar{x}^{(i)} \ \longrightarrow\ \hat{\bar{\mu}}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^n p(j \mid i)\, \bar{x}^{(i)} \qquad \text{weighted mean of points in cluster } j
$$
$$
\hat{\sigma}_j^2 = \frac{1}{d\,\hat{n}_j} \sum_{i=1}^n \delta(j, i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(j)}\|^2 \ \longrightarrow\ \hat{\sigma}_j^2 = \frac{1}{d\,\hat{n}_j} \sum_{i=1}^n p(j \mid i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(j)}\|^2 \qquad \text{weighted spread in cluster } j
$$
Expectation Maximization for GMMs M-step: Example
• M-Step: optimizes each cluster separately given p(j|i)
$$
\hat{n}_j = \sum_{i=1}^n p(j \mid i), \qquad
\hat{\gamma}_j = \frac{\hat{n}_j}{n}, \qquad
\hat{\bar{\mu}}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^n p(j \mid i)\, \bar{x}^{(i)}, \qquad
\hat{\sigma}_j^2 = \frac{1}{d\,\hat{n}_j} \sum_{i=1}^n p(j \mid i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(j)}\|^2
$$
use this link for in-class exercises
https://forms.gle/jqAdK1sSMhcx6zDHA
Compute $\hat{n}_1$, $\hat{\gamma}_1$, and $\hat{\bar{\mu}}^{(1)}$ for cluster 1.
Datapoints:
$\bar{x}^{(1)} = (0, 1)^T$
$\bar{x}^{(2)} = (2, 1)^T$
$\bar{x}^{(3)} = (1, 1)^T$
$\bar{x}^{(4)} = (0, 2)^T$
$\bar{x}^{(5)} = (2, 2)^T$
with soft assignments to cluster 1 (given on the slide): $p(1 \mid i) = 0.2,\ 0.1,\ 0.4,\ 0.7,\ 0.8$ for $i = 1, \ldots, 5$.
Expectation Maximization for GMMs M-step: Example
• M-Step: optimizes each cluster separately given p(j|i)
$$
\hat{n}_j = \sum_{i=1}^n p(j \mid i), \qquad
\hat{\gamma}_j = \frac{\hat{n}_j}{n}, \qquad
\hat{\bar{\mu}}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^n p(j \mid i)\, \bar{x}^{(i)}, \qquad
\hat{\sigma}_j^2 = \frac{1}{d\,\hat{n}_j} \sum_{i=1}^n p(j \mid i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(j)}\|^2
$$
$$
\hat{n}_1 = 0.2 + 0.1 + 0.4 + 0.7 + 0.8 = 2.2, \qquad
\hat{\gamma}_1 = \frac{\hat{n}_1}{n} = \frac{2.2}{5} = 0.44
$$
Datapoints: $\bar{x}^{(1)} = (0,1)^T$, $\bar{x}^{(2)} = (2,1)^T$, $\bar{x}^{(3)} = (1,1)^T$, $\bar{x}^{(4)} = (0,2)^T$, $\bar{x}^{(5)} = (2,2)^T$
$$
\hat{\bar{\mu}}^{(1)} = \frac{1}{\hat{n}_1} \sum_{i=1}^5 p(1 \mid i)\, \bar{x}^{(i)}
= \frac{1}{2.2}\left[p(1\mid 1)\bar{x}^{(1)} + p(1\mid 2)\bar{x}^{(2)} + p(1\mid 3)\bar{x}^{(3)} + p(1\mid 4)\bar{x}^{(4)} + p(1\mid 5)\bar{x}^{(5)}\right]
$$
$$
= \frac{1}{2.2}\left[0.2\,(0,1)^T + 0.1\,(2,1)^T + 0.4\,(1,1)^T + 0.7\,(0,2)^T + 0.8\,(2,2)^T\right]
$$
Similarly compute $\hat{\sigma}_1^2 = \frac{1}{d\,\hat{n}_1} \sum_{i=1}^5 p(1 \mid i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(1)}\|^2$.
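Carrying out the remaining arithmetic (not shown on the slide):
$$
\hat{\bar{\mu}}^{(1)} = \frac{1}{2.2}\,\big(0 + 0.2 + 0.4 + 0 + 1.6,\ \ 0.2 + 0.1 + 0.4 + 1.4 + 1.6\big)^T
= \frac{1}{2.2}\,(2.2,\ 3.7)^T \approx (1,\ 1.68)^T.
$$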
Expectation Maximization for GMMs
EM algorithm for GMM: initialize parameters
• E-step: softly assign points to clusters according to posterior
• M-Step: optimizes each cluster separately given p(j|i)
E-step:
$$
p(j \mid i) = \frac{\gamma_j\, P(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2)}{\sum_{t=1}^k \gamma_t\, P(\bar{x}^{(i)} \mid \bar{\mu}^{(t)}, \sigma_t^2)}
$$
M-step:
$$
\hat{n}_j = \sum_{i=1}^n p(j \mid i), \qquad
\hat{\gamma}_j = \frac{\hat{n}_j}{n}, \qquad
\hat{\bar{\mu}}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^n p(j \mid i)\, \bar{x}^{(i)}, \qquad
\hat{\sigma}_j^2 = \frac{1}{d\,\hat{n}_j} \sum_{i=1}^n p(j \mid i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(j)}\|^2
$$
Iterate until convergence.
Expectation Maximization
This example extends to mixtures of general multivariate normal distributions.
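Putting the pieces together, a compact, self-contained numpy sketch of the full EM loop for a mixture of spherical Gaussians (the initialization scheme, function name, and convergence tolerance are illustrative choices, not prescribed by the slides):

import numpy as np
from scipy.special import logsumexp

def spherical_gmm_em(X, k, n_iters=100, tol=1e-6, seed=0):
    """EM for a mixture of k spherical Gaussians; returns (gammas, means, variances)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    gammas = np.full(k, 1.0 / k)                       # mixture weights
    means = X[rng.choice(n, size=k, replace=False)]    # initialize means at random data points
    variances = np.full(k, X.var())                    # shared initial variance
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: responsibilities p(j|i)
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)            # (n, k)
        log_pdf = -0.5 * sq / variances - 0.5 * d * np.log(2 * np.pi * variances)
        log_w = np.log(gammas) + log_pdf
        log_norm = logsumexp(log_w, axis=1, keepdims=True)
        resp = np.exp(log_w - log_norm)                                        # (n, k)
        # M-step: re-estimate each cluster from its softly assigned points
        n_hat = resp.sum(axis=0)                                               # effective counts
        gammas = n_hat / n
        means = (resp.T @ X) / n_hat[:, None]
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        variances = (resp * sq).sum(axis=0) / (d * n_hat)
        # check convergence via the log-likelihood
        ll = log_norm.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return gammas, means, variances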
Model Selection: how to pick k?
Bayesian Information Criterion (BIC)
$$
\mathrm{BIC}(D; \hat{\bar{\theta}}) = l(D; \hat{\bar{\theta}}) - \frac{d_k}{2} \log n
$$
where $l(D; \hat{\bar{\theta}})$ is the log-likelihood, $n$ is the number of training data points, and $d_k$ is the number of independent model parameters (the model-complexity penalty).
Here we'd want to maximize the BIC. The BIC is sometimes defined as the negative of the above; in that case, we want to minimize it.
Model Selection for Mixtures
The penalty term approximates the advantage that we would expect to get from larger k regardless of the data. The BIC score is easy to evaluate for each k, as we already get $l(D; \hat{\bar{\theta}})$ from the EM algorithm. All that we need in addition is to evaluate the penalty term. A mixture of k spherical Gaussians in d dimensions has exactly $k(d + 2) - 1$ parameters. Figure 1 below shows that the resulting BIC score indeed has the highest value for the correct 3-component mixture.
BIC is an asymptotic (large n) approximation to a statistically more well-founded criterion known as the Bayesian score. As such, BIC will select the right model (under certain regularity conditions) when n is (very) large relative to d. However, we will adopt the BIC criterion here even for smaller n due to its simplicity.
Figure 1: 2-, 3-, and 4-component mixtures (G = 2, G = 3, G = 4) estimated for the same data; the corresponding log-likelihoods and BIC scores are shown below each plot: $\mathrm{BIC}(D; \hat{\bar{\theta}}) = -131.16$, $-118.93$, and $-121.78$, respectively.
From Jaakkola
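A hedged numpy sketch of using this criterion to pick k for spherical GMMs; the function name is illustrative, and the parameter count $k(d+2)-1$ is the one quoted above:

import numpy as np
from scipy.special import logsumexp

def bic_spherical_gmm(X, gammas, means, variances):
    """BIC(D; theta_hat) = log-likelihood - (#params / 2) * log(n) for a fitted
    mixture of k spherical Gaussians in d dimensions (#params = k(d + 2) - 1)."""
    n, d = X.shape
    k = len(gammas)
    sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_pdf = -0.5 * sq / variances - 0.5 * d * np.log(2 * np.pi * variances)
    loglik = logsumexp(np.log(gammas) + log_pdf, axis=1).sum()
    n_params = k * (d + 2) - 1
    return loglik - 0.5 * n_params * np.log(n)

# pick the k with the highest BIC, e.g. over k = 1..5, reusing the EM sketch above:
# best_k = max(range(1, 6), key=lambda k: bic_spherical_gmm(X, *spherical_gmm_em(X, k)))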
Bayesian Networks: Applications
Alexiou, A., et al. [2017] A Bayesian Model for the Prediction and Early Diagnosis of Alzheimer's Disease
Bayesian Networks by Example
nodes: variables
directed edges: dependencies (e.g., an edge $X_1 \to X_3$)
$X_1$ is a parent of $X_3$; $X_3$ is a child of $X_1$
intuitively, read an edge as "influences"
CPT for $X_3$: one row for each assignment of its parents $(X_1, X_2)$, i.e., $2 \times 2 = 4$ rows, with columns $\Pr(X_3 = T \mid x_1, x_2)$ and $\Pr(X_3 = F \mid x_1, x_2)$
joint probability distribution:
$$
\Pr(X_1, X_2, X_3) = \Pr(X_1)\,\Pr(X_2)\,\Pr(X_3 \mid X_1, X_2)
$$
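A small Python sketch of reading a joint probability off this factorization; the CPT numbers below are made-up placeholders purely for illustration:

# Hypothetical CPTs for binary variables X1, X2, X3 (values "T"/"F"); numbers are illustrative only.
p_x1 = {"T": 0.3, "F": 0.7}
p_x2 = {"T": 0.6, "F": 0.4}
# Pr(X3 = T | X1, X2), one entry per assignment of the parents (2 x 2 = 4 rows)
p_x3_T_given = {("T", "T"): 0.9, ("T", "F"): 0.5, ("F", "T"): 0.4, ("F", "F"): 0.1}

def joint(x1, x2, x3):
    """Pr(X1=x1, X2=x2, X3=x3) = Pr(X1) Pr(X2) Pr(X3 | X1, X2)."""
    p3 = p_x3_T_given[(x1, x2)]
    return p_x1[x1] * p_x2[x2] * (p3 if x3 == "T" else 1.0 - p3)

# e.g. Pr(X1=T, X2=F, X3=T) = 0.3 * 0.4 * 0.5
print(joint("T", "F", "T"))   # 0.06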
Two notions of Independence
Marginal independence:
$$
\Pr(X_1, X_2) = \Pr(X_1)\,\Pr(X_2), \qquad \text{written } X_1 \perp X_2
$$
Conditional independence:
$$
\Pr(X_1, X_2 \mid X_3) = \Pr(X_1 \mid X_3)\,\Pr(X_2 \mid X_3), \qquad \text{written } X_1 \perp X_2 \mid X_3
$$
Alternatively, $\Pr(X_1 \mid X_2, X_3) = \Pr(X_1 \mid X_3)$.
Bayesian Networks provide us with a way to determine these independence properties via the dependency graph
d-separation: Inferring independence
Inferring independence properties
P ⊥ J | S?
Step 1: keep only the "ancestral" graph of the variables of interest
Inferring independence properties
P ⊥ J | S?
Step 2: connect nodes that share a common child, then change the graph to undirected
* if there are multiple parents, connect them pairwise
Inferring independence properties
If all paths between the variables of interest go through a particular node, then the variables are independent given that node
intuitively, that node "blocks" the influence of the first variable on the second
P ⊥ J | S?
If there is no path between the variables of interest, then they are marginally independent
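A rough Python sketch of this two-step check for a query like P ⊥ J | S; the graph edges in the usage comment are placeholders, since the actual network is on the slide figure:

from collections import deque

def independent_given(dag, query_vars, given):
    """Restrict to the ancestral graph of the variables of interest, connect co-parents
    pairwise, drop edge directions, then test whether every path between the two query
    variables passes through the conditioning nodes."""
    nodes = set(query_vars) | set(given)
    # Step 1: ancestral graph = variables of interest plus all their ancestors
    frontier, ancestral = deque(nodes), set(nodes)
    while frontier:
        v = frontier.popleft()
        for parent, child in dag:
            if child == v and parent not in ancestral:
                ancestral.add(parent)
                frontier.append(parent)
    edges = {(p, c) for (p, c) in dag if p in ancestral and c in ancestral}
    # Step 2: connect nodes with a common child (pairwise), then make the graph undirected
    undirected = {frozenset(e) for e in edges}
    for child in ancestral:
        parents = [p for (p, c) in edges if c == child]
        undirected |= {frozenset((a, b)) for a in parents for b in parents if a != b}
    # Treat the conditioning nodes as "blocked", then look for a remaining path
    a, b = query_vars
    frontier, seen = deque([a]), {a} | set(given)
    while frontier:
        v = frontier.popleft()
        for e in undirected:
            if v in e:
                w = next(iter(e - {v}))
                if w == b:
                    return False          # a path survives: not independent given `given`
                if w not in seen:
                    seen.add(w)
                    frontier.append(w)
    return True                           # all paths blocked (or none exist)

# Hypothetical usage with placeholder edges (parent, child):
# dag = [("S", "P"), ("S", "J")]
# print(independent_given(dag, ("P", "J"), given={"S"}))   # True for this toy graph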