Data Mining and Machine Learning
Language Modelling for Automatic Speech Recognition
Peter Jančovič Slide 1
Objectives
Understand the role of the language model in speech recognition
Approaches to Language Modelling:
– Rule-Based Language Models
– Statistical Language Models
N-gram Language Models
Slide 2
Speech Recognition: Statistical Methods
Given an unknown utterance y, we want to find the word sequence W such that P(W | y) is maximised.
By Bayes’ Theorem,
P(W | y) = p(y | W) P(W) / p(y)
P(W) – the probability that the word sequence W is in the application language – the language model probability
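Since p(y) does not depend on W, maximising P(W | y) is the same as maximising p(y | W) P(W). In LaTeX notation (a restatement of the formula above, with W-hat denoting the recognised word sequence):

\hat{W} = \arg\max_W P(W \mid y)
        = \arg\max_W \frac{p(y \mid W)\, P(W)}{p(y)}
        = \arg\max_W p(y \mid W)\, P(W)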
Slide 3
Language Modelling
The Language Model (Grammar) is used to compute the probability P(W) that the sequence of words W ‘belongs to’ the language
Constrains the recognition problem – fewer possible interpretations
Basically there are two types of candidate LM:
– Rule-based (traditional) language model
– Probabilistic language model
Slide 4
Rule-Based Language Models
Language models in linguistics and natural language processing are typically rule-based
A rule-based language model consists of:
– A set of non-terminal units (e.g. sentence, noun-phrase, verb-phrase,…)
– A set of terminal units (e.g. words)
– A set of rules, defining how non-terminal units can be expanded into sequences of non-terminal and terminal units
Corresponds to the formal notion of grammar – as taught in school
Slide 5
Rule-Based Language Models
Let S denote the non-terminal root node corresponding to ‘sentence’
A sequence of words is grammatical if it can be derived from S by a sequence of rules
Example: Consider the sentence “The cat devoured the tiny mouse”
(From Geoffrey Finch, “How to study linguistics”, MacMillan, 1998)
Slide 6
Rule-Based Language Models
Example rules:
– S :- NP + VP
– NP :- det + noun
– NP :- det + adj + noun
– VP :- verb + NP
– det :- “the”
– noun :- “cat”
– noun :- “mouse”
– verb :- “devoured”
– adj :- “tiny”

Parse tree for “The cat devoured the tiny mouse”:
[S [NP [det The] [noun cat]] [VP [verb devoured] [NP [det the] [adj tiny] [noun mouse]]]]
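To make the derivation idea concrete, here is a minimal Python sketch (not from the slides) that encodes the rules above as a dictionary and tests whether a word sequence can be derived from S; the encoding and function names are my own:

RULES = {
    "S":    [["NP", "VP"]],
    "NP":   [["det", "noun"], ["det", "adj", "noun"]],
    "VP":   [["verb", "NP"]],
    "det":  [["the"]],
    "noun": [["cat"], ["mouse"]],
    "verb": [["devoured"]],
    "adj":  [["tiny"]],
}

def derives(symbol, words, start):
    # Positions reachable in `words` after expanding `symbol` from `start`.
    if symbol not in RULES:                       # terminal: match one word
        if start < len(words) and words[start] == symbol:
            return {start + 1}
        return set()
    ends = set()
    for expansion in RULES[symbol]:               # try each rule for `symbol`
        positions = {start}
        for part in expansion:
            positions = {e for p in positions for e in derives(part, words, p)}
        ends |= positions
    return ends

def grammatical(sentence):
    words = sentence.lower().split()
    return len(words) in derives("S", words, 0)   # must consume every word

print(grammatical("The cat devoured the tiny mouse"))   # True
print(grammatical("The mouse devoured"))                # False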
Slide 7
Rule-Based Language Models
Disadvantages
– Normally applied to written language
– A deterministic model of this type may not be able to accommodate variability of spoken language
– Cannot easily handle uncertainty
– Cannot be derived automatically from example data and is based on human knowledge
Slide 8
Rule-Based Language Models
Advantages
– Can model complex structure, e.g. non-local dependencies
– Significant human expertise and knowledge already exists
– Much effort has already been devoted to the construction of large language models of this type
“She ran, waving enthusiastically, across the bridge”
Slide 9
Finite State Language Models
Describe all possible sentences as routes through a finite state network
Typically ‘hand-crafted’ using graphical design tools
Not normally used for vocabulary sizes greater than ~1,000 words
Slide 10
Finite-State Syntax
[Figure: a finite-state network from start to end whose nodes are *letter and *digit macros]
Slide 11
Expansion of Macros
[Figure: the *digit macro expands into a sub-network running from *start through *pause, one of the words One, Two, Three, Four, Five, Six, Seven, Eight, Nine or Zero, and *pause again, to *end]
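To make this concrete, a finite-state LM can be stored as a transition table, and a sentence is accepted if it traces a route from the start state to the end state. The toy network below (single digit words in a loop) is a minimal sketch invented for illustration, not the exact network in the figures:

DIGITS = ["one", "two", "three", "four", "five",
          "six", "seven", "eight", "nine", "zero"]
ARCS = {
    "start": [(w, "digit") for w in DIGITS],
    "digit": [(w, "digit") for w in DIGITS] + [("</s>", "end")],
}

def accepts(words):
    # True if the word sequence is a route from 'start' to 'end'.
    state = "start"
    for w in words + ["</s>"]:
        for label, next_state in ARCS.get(state, []):
            if label == w:
                state = next_state
                break
        else:                      # no matching arc: not in the language
            return False
    return state == "end"

print(accepts("three five zero".split()))   # True
print(accepts("three cat".split()))         # False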
Slide 12
Statistical Language Models
With a rule-based language model, a sequence of words W is either
– in the language (grammatical) or
– outside the language (not grammatical)
With a statistical language model, a sequence of words W is in the language (grammatical) with probability P(W)
The most common statistical language model is known as the N-gram model
Slide 13
N-gram Language Models
Let W = W1, W2, …, WK be a sequence of words.
In general:
P(W) = P(W1) P(W2|W1) … P(Wk|Wk-1,…,W1) … P(WK|WK-1,…,W1)
In an N-gram language model, we assume:
P(Wk|Wk-1,Wk-2,…,W1) = P(Wk|Wk-1,…,Wk-N+1)
i.e. the probability of the kth word in the sequence depends only on the identities of the previous N-1 words
The most commonly used N-gram models are 2-gram (bigram) and 3-gram (trigram) models
Slide 14
Bigram and Trigram Models
In a Bigram Language Model, we assume:
P(Wk|Wk-1,Wk-2,…,W1) = P(Wk|Wk-1)
Similarly, in a Trigram Language Model, we assume:
P(Wk|Wk-1,Wk-2,…,W1) = P(Wk|Wk-1,Wk-2)
These probabilities can be estimated from data
Slide 15
Estimation of Bigram Probabilities
For example, given a training text, an estimate of the bigram probability P(W2|W1) is given by:
P(W2|W1) = N(W1,W2) / N(W1)
where:
– N(W1,W2) = number of times the word pair W1,W2 occurs in the training text
– and N(W1) = number of times the word W1 occurs in the training text
Slide 16
Bigram Probabilities – Example
Consider the training text:
“John sat on the old chair. John read the old book. John was interesting. The book was interesting”
Suppose this is used to train a bigram grammar.
‘the’ occurs 3 times in the text, while the bigrams ‘the old’ and ‘the book’ occur twice and once respectively. Hence
P(‘old’|’the’)=2/3, and P(‘book’|‘the’)=1/3.
Similarly, if the symbol # denotes start of sentence, then P(‘john’|#) = 3/4 and P(‘the’|#) = 1/4.
Slide 17
Example Continued
The probability of the sentence S
“John sat on the old chair” is given by:
P(S)
= P(john|#) • P(sat|john) • P(on|sat) • P(the|on) • P(old|the) • P(chair|old) • P($|chair)
= 3/4 • 1/3 • 1 • 1 • 2/3 • 1/2 • 1 = 1/12
where $ denotes end of sentence.
Similarly
P(“The old chair”)=1/12
P(“John read the old chair”)=1/12
But P(“John read the interesting book”)=0
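The whole example can be reproduced in a few lines of Python. This is a minimal sketch, not course code; # and $ mark sentence start and end as above:

from collections import Counter

# The training text from Slide 17.
sentences = [
    "john sat on the old chair",
    "john read the old book",
    "john was interesting",
    "the book was interesting",
]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    words = ["#"] + s.split() + ["$"]
    unigrams.update(words[:-1])                   # history counts N(W1)
    bigrams.update(zip(words[:-1], words[1:]))    # pair counts N(W1,W2)

def p_bigram(w2, w1):
    # P(W2|W1) = N(W1,W2) / N(W1)
    return bigrams[(w1, w2)] / unigrams[w1]

def p_sentence(sentence):
    words = ["#"] + sentence.split() + ["$"]
    prob = 1.0
    for w1, w2 in zip(words[:-1], words[1:]):
        prob *= p_bigram(w2, w1)
    return prob

print(p_bigram("old", "the"))                         # 0.666… = 2/3
print(p_sentence("john sat on the old chair"))        # 0.0833… = 1/12
print(p_sentence("john read the interesting book"))   # 0.0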
Slide 18
Bigram & Trigram Estimation
Most practical systems use a trigram language model
In reality, there is never enough text to estimate trigram probabilities in this simple way
E.g. experiments with trigram language models for a 1,000 word vocabulary application (which admits 1,000³ = 10⁹ distinct trigrams):
– using 1.5 million words for training, and 300,000 words to test the models
– 23% of the trigrams in the test corpus were absent from the training corpus
Hence much more sophisticated training procedures are needed
Slide 19
Estimation of N-gram statistics
In general, there will not be enough data to estimate N-gram statistics reliably.
Possible solutions:
– Robust estimation methods from statistics
– Deleted interpolation
– ‘Back-off’
Slide 20
Deleted interpolation
‘Interpolate’ the trigram probability from estimated trigram, bigram and unigram probabilities:
P̂(w3|w1,w2) = λ1·P(w3|w1,w2) + λ2·P(w3|w2) + λ3·P(w3)
Estimate λ1, λ2, λ3 through recognition experiments.
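As a sketch in Python (assuming trigram, bigram and unigram estimators p3, p2, p1 already exist; the function name and example weights are illustrative, with the λ's tuned as described above):

def interpolated_trigram(w3, w2, w1, p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
    # P-hat(w3|w1,w2) = λ1·P(w3|w1,w2) + λ2·P(w3|w2) + λ3·P(w3)
    l1, l2, l3 = lambdas          # non-negative weights summing to 1
    return l1 * p3(w3, w2, w1) + l2 * p2(w3, w2) + l3 * p1(w3)

Because the weights sum to one and each component is itself a probability distribution over w3, the interpolated estimate remains a valid distribution.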
Slide 21
‘Backoff’
Decide how many examples T are needed for robust estimation. Then:
P̂(w3|w1,w2) =
– P(w3|w1,w2)  if N(w1,w2,w3) ≥ T
– P(w3|w2)     if N(w1,w2,w3) < T and N(w1,w2) ≥ T
– P(w3)        otherwise
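A minimal Python sketch of this scheme, assuming Counter-style count tables N3, N2, N1 keyed by word tuples (the names and threshold default are my own; note that, unlike full Katz back-off, this simplified slide version does not renormalise the distribution):

def backoff_trigram(w3, w2, w1, N3, N2, N1, T=5):
    if N3[(w1, w2, w3)] >= T:
        return N3[(w1, w2, w3)] / N2[(w1, w2)]    # trigram estimate
    if N2[(w1, w2)] >= T:
        return N2[(w2, w3)] / N1[(w2,)]           # back off to bigram
    return N1[(w3,)] / sum(N1.values())           # back off to unigram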
Slide 22
N-gram Language Models – Summary
Advantages
– Can be trained automatically from data
– Probabilistic model
– Consistent with acoustic model
– Mathematically sound algorithms
Slide 23
N-gram Language Models – Summary
Disadvantages
– Large amounts of training data needed
– Difficult to incorporate human knowledge
– Cannot model long-term dependencies: “She walked, hands in pockets, quickly across the bridge”
Slide 24
Summary
Role of language modelling in speech recognition
Rule-based language models
Finite state language models
N-gram language models
Difficulty of estimating N-gram statistics
Slide 25