Data Mining and Machine Learning
HMMs for Automatic Speech Recognition:
Word and Sub-Word Level HMMs
Peter Jančovič
Slide 1
Content
Word level HMMs
Sub-word HMMs
– Phoneme-level HMMs
Context-sensitive sub-word HMMs
– Biphone HMMs
– Triphone HMMs
Triphone HMM training issues
Phoneme Decision Trees (PDTs)
Slide 2
Word Level HMMs
Early systems (1980s) used word level HMMs
I.e. each word is modelled by a single, dedicated HMM (cf. the ‘zero’ figure on the next slide)
Advantages:
– Good performance due to explicit modelling of word-dependent variability
Slide 3
[Figure: 6-state HMM of the digit ‘zero’]
Slide 4
Word Level HMMs
Disadvantages:
– Many examples of each word required for training
– Fails to exploit regularities in spoken language
Word-level systems typically restricted to well-defined, demanding, small-vocabulary applications
Slide 5
Sub-Word Level HMMs
Build HMMs for a complete set of sub-word ‘building blocks’
Construct word-level HMMs by concatenation of sub-word HMMs
E.g. slide = / s l aI d /
→ concatenate the HMMs for / s /, / l /, / aI /, / d /
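To make the concatenation concrete, here is a minimal Python sketch (my own illustration, not from the slides): each phoneme HMM is reduced to a list of state identifiers, and a word model is formed by chaining them left to right. A real system would also join the transition matrices and emission PDFs.

```python
def word_hmm(pronunciation, phoneme_hmms):
    """Concatenate sub-word HMM state sequences into a word model.

    pronunciation -- list of phoneme symbols, e.g. ['s', 'l', 'aI', 'd']
    phoneme_hmms  -- dict mapping each phoneme to its list of states
    """
    states = []
    for phoneme in pronunciation:
        # The entry state of each sub-word HMM follows the exit of
        # the previous one (left-to-right topology).
        states.extend(phoneme_hmms[phoneme])
    return states

# Hypothetical 3-state phoneme HMMs
phoneme_hmms = {p: [f"{p}_{i}" for i in (1, 2, 3)]
                for p in ['s', 'l', 'aI', 'd']}

print(word_hmm(['s', 'l', 'aI', 'd'], phoneme_hmms))
# ['s_1', 's_2', 's_3', 'l_1', ..., 'd_2', 'd_3']
```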
Slide 6
Sub-Word Level HMMs
Advantages
– Able to exploit regularities in speech patterns
– More efficient use of training data
– E.g. in a phoneme-based system, “five” (/ f aI v /) and “nine” (/ n aI n /) both contribute to the /aI/ model
– Flexibility – acoustic models can be built immediately for words which did not occur in the training data
Slide 7
Phoneme-Level HMMs
Why choose phonemes rather than any other sub-word unit?
Disadvantages
– Phonemes are defined in terms of the contrastive properties of speech sounds within a language – not their consistency with HMM assumptions!
Slide 8
Advantages of Phoneme-HMMs
Completeness & compactness – approx. 50 phonemes required to describe English
Well studied – potential for exploitation of ‘speech knowledge’ (e.g. pronunciation differences due to accent…)
Availability of extensive phoneme-based pronunciation dictionaries
Slide 9
Context-Sensitivity Problem
– Acoustic realization of a phoneme depends on the context in which it occurs
– Think of your lip shape for the “k” sound in the words “book shop” and “thick”
Slide 10
Biphones and Triphones
Solution
– Context-sensitive phoneme-level HMMs
– E.g.
– ‘biphones’: (k:_S) in “book shop”
– ‘triphones’: (k:u_S) in “book shop”
Almost all systems use triphone HMMs
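A small sketch of how such labels can be generated, using the slides’ (phone:left_right) notation; the function name and the broad transcription of “book shop” below are my assumptions.

```python
def to_triphones(phones):
    """Rewrite a phone sequence as context-sensitive (p:l_r) labels."""
    labels = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else ''   # empty at utterance edge
        right = phones[i + 1] if i < len(phones) - 1 else ''
        labels.append(f"({p}:{left}_{right})")
    return labels

# "book shop", assuming the broad transcription / b u k S Q p /
print(to_triphones(['b', 'u', 'k', 'S', 'Q', 'p']))
# ['(b:_u)', '(u:b_k)', '(k:u_S)', '(S:k_Q)', '(Q:S_p)', '(p:Q_)']
```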
Slide 11
Triphones – problems
Increased number of model parameters
– Need more (well-chosen) training data
Which triphone?
– If a word in the application contains a triphone which was not in the training set, which triphone HMM should we use?
Slide 12
Number of parameters
If there are 50 phones, the maximum number of triphone HMMs is 50³ = 125,000
Most ruled out by phonological constraints – most phone triples never occur in speech
But many are legal
Slide 13
Example: Model Parameters
Each model has 3 emitting states
Each state modelled as, say, a 10-component Gaussian mixture
Each feature vector is 40-dimensional
Hence the number of parameters per model is:
3 × (10 × (40 + 40 + 1) + 9) = 2,457
where 3 is the number of states, 10 the number of mixture components per state, 40 + 40 + 1 the mean vector, variance vector and mixture weight of each component, and 9 the transition probabilities.
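The same count in a few lines of Python (a sketch; grouping the 9 transition probabilities per state simply mirrors the slide’s formula):

```python
def hmm_params(n_states=3, n_mix=10, dim=40, n_trans=9):
    # Per mixture component: mean vector (dim) + diagonal variance
    # vector (dim) + one mixture weight.
    per_component = dim + dim + 1
    # Per state: all mixture components plus transition probabilities,
    # grouped as in the slide's formula.
    return n_states * (n_mix * per_component + n_trans)

print(hmm_params())         # 2457
print(1000 * hmm_params())  # 2,457,000 for 1,000 acoustic models
```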
Slide 14
Acoustic model parameters
So, even if we only have 1,000 acoustic models (instead of 125,000), the total number of acoustic model parameters will be 2,457,000
Too many to estimate with a practical quantity of data
The most common solution is HMM parameter tying
Different HMMs share the same parameters
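A minimal sketch of what tying means in practice (illustrative Python, not a real toolkit’s API): two logical states hold references to one physical output PDF, so its parameters are stored, and re-estimated, only once.

```python
class GaussianMixturePDF:
    """Placeholder for a tied state output PDF."""
    def __init__(self, means, variances, weights):
        self.means, self.variances, self.weights = means, variances, weights

# One physical PDF...
shared_pdf = GaussianMixturePDF(means=[...], variances=[...], weights=[...])

# ...referenced by states of two different (logical) triphone models:
state_a = {"model": "(e:f_t)", "state": 2, "pdf": shared_pdf}
state_b = {"model": "(e:s_t)", "state": 2, "pdf": shared_pdf}

assert state_a["pdf"] is state_b["pdf"]  # a single parameter set
```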
Slide 15
Tied variance
Variances are more costly to estimate than means
Simple solution – divide the set of all HMMs into classes, so that within a class all HMM state PDFs have the same variance
This is tied variance
If all HMM state PDFs share the same variance, the variance is referred to as a grand variance
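A sketch of how a grand variance might be estimated (my own illustration, assuming diagonal covariances): each state keeps its own mean, but the squared deviations are pooled over all states’ data to give one shared variance vector.

```python
import numpy as np

def grand_variance(frames_per_state):
    """Pool deviations from each state's own mean into one variance."""
    pooled = []
    for frames in frames_per_state:
        mu = frames.mean(axis=0)
        pooled.append(frames - mu)        # deviations from own mean
    pooled = np.concatenate(pooled, axis=0)
    return (pooled ** 2).mean(axis=0)     # shared diagonal variance

rng = np.random.default_rng(0)
states = [rng.normal(loc=m, size=(100, 40)) for m in (0.0, 3.0)]
print(grand_variance(states).shape)       # (40,) -- one 40-dim variance
```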
Slide 16
Phone decision trees
The most common approach to general HMM tying is decision tree clustering
Decision tree clustering can be applied to individual states or to whole HMMs – we’ll consider states
Basic idea is to use knowledge about which phones are likely to induce similar contextual effects
[Figure: many ‘logical’ (context-dependent) models mapped onto a smaller set of shared ‘physical’ states]
Slide 17
Phonetic knowledge
For example, we know that /f/ and /s/ are both unvoiced fricatives, produced in a similar manner
Therefore we might hypothesise that, for example, an utterance of the vowel /e/ preceded by /f/ might be similar to one preceded by /s/
This is the basic idea behind decision tree clustering
Slide 18
Phone Decision Tree
[Figure: example phone decision tree for the phone /e/. Each internal node asks a yes/no (Y/N) question about the left (L) or right (R) context, e.g. {/s/, /f/; L}, {/e/, /i/, /A/; L}, {/p/, /t/, /k/; L}, {/s/, /f/; R}, {/#/; R}, {/e/, /i/; R}; each leaf is a tied state.]
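The routing implied by such a tree can be sketched in a few lines (the node structure and the particular questions below are illustrative, loosely following the example tree): each internal node tests whether the left or right context belongs to a phone set, and each leaf names a tied (physical) state.

```python
class Node:
    def __init__(self, phone_set=None, side=None, yes=None, no=None, leaf=None):
        self.phone_set, self.side = phone_set, side   # question: context in set?
        self.yes, self.no, self.leaf = yes, no, leaf  # branches or tied state

def tied_state(node, left, right):
    """Follow Y/N answers down the tree to a physical state."""
    if node.leaf is not None:
        return node.leaf
    context = left if node.side == 'L' else right
    return tied_state(node.yes if context in node.phone_set else node.no,
                      left, right)

# Tiny tree for states of /e/, with questions like those in the figure:
tree = Node({'s', 'f'}, 'L',
            yes=Node(leaf='e_tied_1'),
            no=Node({'p', 't', 'k'}, 'L',
                    yes=Node(leaf='e_tied_2'),
                    no=Node(leaf='e_tied_3')))

print(tied_state(tree, left='f', right='n'))  # e_tied_1
print(tied_state(tree, left='p', right='t'))  # e_tied_2
```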
Slide 19
Summary
Word-level and Sub-Word HMMs
Phoneme-level HMMs
Context-sensitivity
– Biphones & Triphones
Triphone decision trees
Slide 20