Data Mining and Machine Learning
HMM Training
Peter Jančovič
Slide 1
Objectives
Reminder: Maximum Likelihood (ML) parameter estimation
– ML for Gaussian PDFs and for GMMs
ML for HMMs
– Viterbi-style training
– Baum-Welch algorithm
Slide 2
Fitting a Gaussian PDF to Data
Suppose y = y1,…,yn,…,yT is a sequence of T data values
Given a Gaussian PDF p with mean μ and variance σ², define:

   p(y | μ, σ²) = ∏t p(yt | μ, σ²)   (product over t = 1,…,T)

How do we choose μ and σ²?
Define the best-fitting Gaussian to be the one such that p(y | μ, σ²) is maximised – Maximum Likelihood (ML) estimation of μ, σ²
Slide 3
ML estimation of μ, σ²
Intuitively:
– The maximum likelihood estimate of μ should be the average value of y1,…,yT (the sample mean)
– The maximum likelihood estimate of σ² should be the variance of y1,…,yT (the sample variance)
This turns out to be true: p(y | μ, σ²) is maximised by setting:

   μ = (1/T) Σt yt,   σ² = (1/T) Σt (yt − μ)²   (sums over t = 1,…,T)
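To make these estimates concrete, here is a minimal Python sketch (the function names are illustrative, not from the slides) that fits a Gaussian to a 1-D sample by ML and evaluates the resulting log-likelihood:

```python
# Minimal sketch: ML fit of a 1-D Gaussian (sample mean and sample variance).
import numpy as np

def fit_gaussian_ml(y):
    """Return the ML estimates (mu, var) for a 1-D sample y."""
    mu = y.mean()                       # sample mean
    var = ((y - mu) ** 2).mean()        # sample variance (divide by T, not T-1)
    return mu, var

def log_likelihood(y, mu, var):
    """log p(y | mu, var) = sum_t log N(y_t; mu, var)."""
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var))

y = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)
mu, var = fit_gaussian_ml(y)
print(mu, var, log_likelihood(y, mu, var))
```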
Slide 4
ML training for GMMs
Now consider
– A Gaussian Mixture Model with M components, which has
– M means μ1,…,μM
– M variances σ1²,…,σM²
– M mixture weights w1,…,wM
– A training sequence y1,…,yT
How do we find the maximum likelihood estimate of μ1,…,μM, σ1²,…,σM², w1,…,wM?
Slide 5
GMM Parameter Estimation
If we knew which component each sample yt came from, then parameter estimation would be easy
– Set μm to be the average value of the samples which belong to the mth component
– Set σm² to be the variance of the samples which belong to the mth component
– Set wm to be the proportion of samples which belong to the mth component
But we don’t know which component each sample belongs to
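If the component labels were known, these three settings are just per-component averages. A minimal sketch, assuming a hypothetical labels array recording which component generated each sample (exactly the information we do not actually observe):

```python
# Sketch: GMM parameters when each sample's component is known.
import numpy as np

def gmm_from_labels(y, labels, M):
    """labels[t] = index of the component that y[t] 'belongs to' (hypothetical input)."""
    w, mu, var = np.zeros(M), np.zeros(M), np.zeros(M)
    for m in range(M):
        ym = y[labels == m]                  # samples belonging to component m
        w[m] = len(ym) / len(y)              # proportion of samples in component m
        mu[m] = ym.mean()                    # their average value
        var[m] = ((ym - mu[m]) ** 2).mean()  # their variance
    return w, mu, var                        # assumes every component gets at least one sample
```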
Slide 6
Solution – the E-M Algorithm (1)
Guess initial values μ1^(0),…,μM^(0), σ1^(0),…,σM^(0), w1^(0),…,wM^(0)
1. For each m, calculate the probabilities

   pm(yt) = p(yt | μm^(0), σm^(0))

2. Use these probabilities to estimate how much each sample yt ‘belongs to’ the mth component:

   λm,t = P(m | yt)

   This is a measure of how much yt ‘belongs to’ the mth component
Slide 7
Solution – the E-M Algorithm (2)
3. Calculate the new GMM parameters:

   wm^(1) = (1/T) Σt λm,t

   μm^(1) = Σt λm,t yt / Σt λm,t

   (σm^(1))² = Σt λm,t (yt − μm^(1))² / Σt λm,t   (sums over t = 1,…,T)

REPEAT steps 1–3
Slide 8
Calculation of λm,t
In other words, λm,t is the probability of the mth component given the data point yt.
From Bayes’ theorem:

   λm,t = P(m | yt) = p(yt | m) P(m) / p(yt) = pm(yt) wm / Σk pk(yt) wk   (sum over k = 1,…,M)

where pm(yt) is calculated from the mth Gaussian component, wm is the mth mixture weight, and the denominator sums over all M components.
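Putting steps 1–3 together, here is a minimal Python sketch of one E-M iteration for a 1-D GMM, following the λm,t and update formulas above; the function names and the toy data are illustrative assumptions rather than part of the lecture:

```python
# Sketch: one E-M iteration for a 1-D Gaussian Mixture Model.
import numpy as np

def gaussian_pdf(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_step(y, w, mu, var):
    """y: (T,) data; w, mu, var: (M,) current GMM parameters."""
    # Steps 1-2 (E-step): lambda[m, t] = w_m p_m(y_t) / sum_k w_k p_k(y_t)
    lam = w[:, None] * gaussian_pdf(y[None, :], mu[:, None], var[:, None])
    lam /= lam.sum(axis=0, keepdims=True)
    # Step 3 (M-step): new weights, means and variances from the lambdas
    occ = lam.sum(axis=1)                          # sum_t lambda_{m,t}
    w_new = occ / len(y)
    mu_new = (lam @ y) / occ
    var_new = (lam * (y[None, :] - mu_new[:, None]) ** 2).sum(axis=1) / occ
    return w_new, mu_new, var_new

# Usage: REPEAT steps 1-3 from an initial guess
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 0.5, 500)])
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    w, mu, var = em_step(y, w, mu, var)
print(w, mu, var)
```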
Slide 9
ML training for HMMs
Now consider
– An N-state HMM M, each of whose states is associated with a Gaussian PDF
– A training sequence y1,…,yT
For simplicity assume that each yt is 1-dimensional
Slide 10
ML training for HMMs
If we knew that:
– y1,…,ye(1) correspond to state 1
– ye(1)+1,…,ye(2) correspond to state 2
– …
– ye(n-1)+1,…,ye(n) correspond to state n
– …
Then we could set the mean of state n to the average value of ye(n-1)+1,…,ye(n)
Slide 11
ML Training for HMMs
y1,…,ye(1)   ye(1)+1,…,ye(2)   ye(2)+1,…,yT
Unfortunately we don’t know that ye(n-1)+1,…,ye(n) correspond to state n…
Slide 12
Solution
1. Define an initial HMM – M0
2. Use the Viterbi algorithm to compute the optimal state sequence (alignment) between M0 and y1,…,yT
[Figure: Viterbi alignment of the observations y1, y2, y3, …, yt, …, yT against the states of M0]
Slide 13
Solution (continued)
Use optimal state sequence to segment y
[Figure: the observation sequence y1, …, ye(1), ye(1)+1, …, ye(2), …, yT segmented at the state boundaries given by the optimal state sequence]
Reestimate parameters to get a new model M1
Slide 14
Solution (continued)
Now repeat whole process using M1 instead of M0,
to get a new model M2
Then repeat again using M2 to get a new model M3 ….
p(y|M0) ≤ p(y|M1) ≤ p(y|M2) ≤ … ≤ p(y|Mn) ≤ …
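A minimal Python sketch of this Viterbi-style training loop for an HMM with 1-D Gaussian state PDFs. The left-to-right transition matrix, the fixed transition probabilities and the helper names are assumptions for illustration, not the lecture's exact recipe; only the state means and variances are reestimated here:

```python
# Sketch: Viterbi training (align with Viterbi, then reestimate the state Gaussians).
import numpy as np

def log_gauss(y, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var)

def viterbi_path(y, logA, mu, var):
    """Most likely state sequence for y, assuming the model starts in state 0."""
    T, N = len(y), len(mu)
    delta = np.full((T, N), -np.inf)          # best log-score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)         # backpointers
    delta[0, 0] = log_gauss(y[0], mu[0], var[0])
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + logA[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + log_gauss(y[t], mu[j], var[j])
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):            # trace back the optimal state sequence
        path[t] = psi[t + 1, path[t + 1]]
    return path

def viterbi_reestimate(y, path, N):
    """New state means/variances from the segmentation (assumes every state is visited)."""
    mu = np.array([y[path == i].mean() for i in range(N)])
    var = np.array([y[path == i].var() for i in range(N)])
    return mu, var

# Toy usage: 3-state left-to-right model M0, transition probabilities kept fixed
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 30), rng.normal(4, 1, 30), rng.normal(8, 1, 30)])
logA = np.log(np.array([[0.9, 0.1, 0.0],
                        [0.0, 0.9, 0.1],
                        [0.0, 0.0, 1.0]]) + 1e-300)
mu, var = np.array([1.0, 3.0, 6.0]), np.ones(3)   # initial model M0
for _ in range(5):                                # each pass: Mn -> Mn+1
    path = viterbi_path(y, logA, mu, var)
    mu, var = viterbi_reestimate(y, path, 3)
print(mu, var)
```

Each pass through the loop plays the role of going from Mn to Mn+1 in the description above.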
Slide 15
Local optimization
[Figure: p(y|M) plotted as a function of the model M – the sequence M0, M1, …, Mn climbs towards a local optimum, which need not be the global optimum]
Slide 16
Baum-Welch optimization
The algorithm just described is often called Viterbi training or Viterbi reestimation
It is often used to train large sets of HMMs
An alternative method is called Baum-Welch reestimation – it is a soft version of the Viterbi estimation
Reestimation of the mean value associated with state i:

   μi = Σt γt(i) yt / Σt γt(i)   (sums over t = 1,…,T)

where γt(i) = P(xt = i | y, M) is the state occupation probability defined on the next slide
Slide 17
Baum-Welch Reestimation
γt(i) = P(xt = i | y, M)
[Figure: trellis over the observations y1, y2, y3, y4, y5, …, yt, yt+1, …, yT; each observation yt contributes to the reestimate for state i with weight γt(i)]
Slide 18
‘Forward’ Probabilities
αt(i) = Prob(y1,…,yt, xt = i | M) = [Σj αt−1(j) aji] bi(yt)
[Figure: trellis over the observations y1, y2, y3, y4, y5, …, yt, yt+1, …, yT illustrating the forward recursion]
Slide 19
‘Backward’ Probabilities
βt(i) = Prob(yt+1,…,yT | xt = i, M) = Σj aij βt+1(j) bj(yt+1)
[Figure: trellis over the observations y1, y2, y3, y4, y5, …, yt, yt+1, …, yT illustrating the backward recursion]
Slide 20
‘Forward-Backward’ Algorithm
γt(i) = P(xt = i | y, M) = αt(i) βt(i) / Σj αt(j) βt(j)
[Figure: trellis over the observations y1, y2, y3, y4, y5, …, yt, yt+1, …, yT – αt(i) accumulates evidence from the left, βt(i) from the right]
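A minimal Python sketch of the forward-backward computation and the Baum-Welch mean update it supports. Scaling (needed in practice to avoid numerical underflow on long sequences), transition and variance reestimation, and multiple training sequences are all omitted, and the helper names are assumptions for illustration:

```python
# Sketch: forward-backward probabilities and the Baum-Welch mean reestimate.
import numpy as np

def gauss(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def forward_backward(y, A, pi, mu, var):
    """Return gamma[t, i] = P(x_t = i | y, M) for Gaussian-emission states."""
    T, N = len(y), len(mu)
    b = gauss(y[:, None], mu[None, :], var[None, :])   # b_i(y_t), shape (T, N)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * b[0]                               # forward initialisation
    for t in range(1, T):                              # alpha_t(i) = [sum_j alpha_{t-1}(j) a_ji] b_i(y_t)
        alpha[t] = (alpha[t - 1] @ A) * b[t]
    beta[-1] = 1.0                                     # backward initialisation
    for t in range(T - 2, -1, -1):                     # beta_t(i) = sum_j a_ij b_j(y_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (b[t + 1] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)          # gamma_t(i) = alpha_t(i) beta_t(i) / sum_j alpha_t(j) beta_t(j)
    return gamma

def reestimate_means(y, gamma):
    """mu_i = sum_t gamma_t(i) y_t / sum_t gamma_t(i)."""
    return (gamma * y[:, None]).sum(axis=0) / gamma.sum(axis=0)

# Toy usage with a 2-state model
y = np.array([0.1, -0.2, 0.3, 4.1, 3.8, 4.3])
A = np.array([[0.8, 0.2], [0.0, 1.0]])
pi = np.array([1.0, 0.0])
gamma = forward_backward(y, A, pi, np.array([0.0, 4.0]), np.array([1.0, 1.0]))
print(reestimate_means(y, gamma))
```

Where Viterbi training assigns each yt entirely to one state, the γt(i) weights share each observation softly across all states.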
Slide 21
Notes on HMM parameter estimation
The Baum-Welch/Viterbi algorithm is only guaranteed to find a locally optimal HMM set – hence choice of M0 can be important
Baum-Welch/Viterbi is a supervised training algorithm which requires labelled speech data
The labelling need not be at the same level as the HMM set – phoneme level HMMs can be trained using data labelled orthographically at the phrase or sentence level
For large applications B-W reestimation can be very computationally expensive
Slide 22
Summary
Maximum Likelihood (ML) estimation
Viterbi HMM parameter estimation
Baum-Welch HMM parameter estimation – Forward and backward probabilities
Slide 23