
Data Mining and Machine Learning
Language Modelling for Automatic Speech Recognition
Peter Jančovič
Slide 1

Objectives
• Understand the role of the language model in speech recognition
• Approaches to Language Modelling:
  – Rule-Based Language Models
  – Statistical Language Models
• N-gram Language Models
Slide 2

Speech Recognition: Statistical Methods
• Given an unknown utterance y, we want to find the word sequence W such that P(W | y) is maximised
• By Bayes' Theorem,
  P(W | y) = p(y | W) P(W) / p(y)
• P(W) – the probability that the word sequence W is in the application language – the language model probability
Slide 3
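To make the decision rule concrete, here is a minimal Python sketch (my illustration, not from the slides) of how a recogniser picks the best word sequence by combining the two terms in the log domain; acoustic_log_likelihood and language_model_log_prob are hypothetical placeholder functions.

import math

def recognise(y, candidate_word_sequences,
              acoustic_log_likelihood, language_model_log_prob):
    """Pick the word sequence W that maximises P(W | y).

    p(y) is the same for every candidate W, so maximising P(W | y) is
    equivalent to maximising log p(y | W) + log P(W).
    """
    best_W, best_score = None, -math.inf
    for W in candidate_word_sequences:
        score = acoustic_log_likelihood(y, W) + language_model_log_prob(W)
        if score > best_score:
            best_W, best_score = W, score
    return best_W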

Language Modelling
• The Language Model (Grammar) is used to compute the probability P(W) that the sequence of words W 'belongs to' the language
• Constrains the recognition problem – fewer possible interpretations
• Basically there are two types of candidate LM:
  – Rule-based (traditional) language model
  – Probabilistic language model
Slide 4

Rule-Based Language Models
• Language models in linguistics and natural language processing are typically rule-based
• A rule-based language model consists of:
  – A set of non-terminal units (e.g. sentence, noun-phrase, verb-phrase, …)
  – A set of terminal units (e.g. words)
  – A set of rules, defining how non-terminal units can be expanded into sequences of non-terminal and terminal units
• Corresponds to the formal notion of a grammar – like in school
Slide 5

Rule-Based Language Models
• Let S denote the non-terminal root node corresponding to 'sentence'
• A sequence of words is grammatical if it can be derived from S by a sequence of rules
• Example: Consider the sentence "The cat devoured the tiny mouse"
(From Geoffrey Finch, “How to study linguistics”, MacMillan, 1998)
Slide 6

Rule-Based Language Models
• Example rules:
  – S :- NP + VP
  – NP :- det + noun
  – VP :- verb + NP
  – NP :- det + adj + noun
  – det :- "the"
  – noun :- "cat"
  – verb :- "devoured"
  – adj :- "tiny"
  – noun :- "mouse"
[Figure: parse tree for "The cat devoured the tiny mouse" – S splits into NP (det noun) and VP (verb NP), where the second NP expands as det adj noun]
Slide 7
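The rule set above can be written down directly as a tiny context-free grammar. The following Python sketch (my illustration, not part of the slides) encodes the rules and naively checks whether a word sequence can be derived from S:

# The example rules as a context-free grammar, with a top-down check
# that a word sequence can be derived from the root symbol S.
RULES = {
    "S":    [["NP", "VP"]],
    "NP":   [["det", "noun"], ["det", "adj", "noun"]],
    "VP":   [["verb", "NP"]],
    "det":  [["the"]],
    "noun": [["cat"], ["mouse"]],
    "verb": [["devoured"]],
    "adj":  [["tiny"]],
}

def derives(symbols, words):
    """True if the symbol sequence can be rewritten into exactly `words`."""
    if not symbols:
        return not words
    head, rest = symbols[0], symbols[1:]
    if head not in RULES:                       # terminal (an actual word)
        return bool(words) and words[0] == head and derives(rest, words[1:])
    return any(derives(list(expansion) + rest, words) for expansion in RULES[head])

print(derives(["S"], "the cat devoured the tiny mouse".split()))   # True
print(derives(["S"], "cat the devoured mouse tiny the".split()))   # False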

Rule-Based Language Models
• Disadvantages
  – Normally applied to written language
  – A deterministic model of this type may not be able to accommodate the variability of spoken language
  – Cannot easily handle uncertainty
  – Cannot be derived automatically from example data – is based on human knowledge
Slide 8

Rule-Based Language Models
• Advantages
  – Can model complex structure, e.g. non-local dependencies
  – Significant human expertise and knowledge already exists
  – Much effort has already been devoted to the construction of large language models of this type
  Example of a non-local dependency: "She ran, waving enthusiastically, across the bridge"
Slide 9

Finite State Language Models
• Describe all possible sentences as routes through a finite state network
• Typically 'hand-crafted' using graphical design tools
• Not normally used for vocabulary sizes greater than ~1,000 words
Slide 10

Finite-State Syntax
[Figure: a finite-state network from 'start' to 'end', with arcs labelled by the macros *letter and *digit]
Slide 11

Expansion of Macros
[Figure: expansion of the macro *digit into a sub-network from *start to *end, with one arc per digit word ("one", "two", …, "nine", "zero") and optional *pause arcs]
Slide 12
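As a rough illustration of the idea (a sketch under my own assumptions, not the exact network in the figures): a hand-crafted finite-state language model can be stored as a transition table whose arcs are labelled with word-class macros such as *letter and *digit, each macro expanding into the words it accepts. The toy network below accepts one spoken letter followed by three digits.

MACROS = {
    "*digit":  {"one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "zero"},
    "*letter": {chr(c) for c in range(ord("a"), ord("z") + 1)},
}

# state -> list of (macro, next_state); this toy network accepts one letter
# followed by three digits, e.g. "b three seven zero"
NETWORK = {
    "start": [("*letter", "s1")],
    "s1":    [("*digit", "s2")],
    "s2":    [("*digit", "s3")],
    "s3":    [("*digit", "end")],
}

def accepts(words, state="start"):
    """True if the word sequence is a route from 'start' to 'end'."""
    if state == "end":
        return not words
    if not words:
        return False
    return any(words[0] in MACROS[macro] and accepts(words[1:], next_state)
               for macro, next_state in NETWORK.get(state, []))

print(accepts("b three seven zero".split()))   # True
print(accepts("three b seven zero".split()))   # False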

Statistical Language Models
• With a rule-based language model, a sequence of words W is either
  – in the language (grammatical) or
  – outside the language (not grammatical)
• With a statistical language model, a sequence of words W is in the language (grammatical) with probability P(W)
• The most common statistical language model is known as the N-gram model
Slide 13

N-gram Language Models
• Let W = W1, W2, …, WK be a sequence of words
• In general:
  P(W) = P(W1) P(W2|W1) … P(Wn|Wn-1, …, W1) … P(WK|WK-1, …, W1)
• In an N-gram language model, we assume:
  P(Wk|Wk-1, Wk-2, …, W1) = P(Wk|Wk-1, …, Wk-N+1)
  i.e. the probability of the k-th word in the sequence depends only on the identities of the previous N-1 words
• The most commonly used N-gram models are 2-gram (bigram) and 3-gram (trigram) models
Slide 14

Bigram and Trigram Models
• In a Bigram Language Model, we assume:
  P(Wk|Wk-1, Wk-2, …, W1) = P(Wk|Wk-1)
• Similarly, in a Trigram Language Model, we assume:
  P(Wk|Wk-1, Wk-2, …, W1) = P(Wk|Wk-1, Wk-2)
• These probabilities can be estimated from data
Slide 15

Estimation of Bigram Probabilities
• For example, given a training text, an estimate of the bigram probability P(W2|W1) is given by:
  P(W2|W1) = N(W1,W2) / N(W1)
  where:
  – N(W1,W2) = number of times the word pair W1,W2 occurs in the training text
  – N(W1) = number of times the word W1 occurs in the training text
Slide 16
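A minimal Python sketch (my illustration, not from the slides) of this relative-frequency estimate, using '#' and '$' for the sentence start and end symbols introduced on the following slides:

from collections import Counter

def train_bigrams(sentences):
    """Count unigrams and bigrams so that P(w2|w1) = N(w1,w2) / N(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        # '#' marks the start of a sentence and '$' the end
        words = ["#"] + sentence.lower().split() + ["$"]
        for w1, w2 in zip(words[:-1], words[1:]):
            unigrams[w1] += 1
            bigrams[(w1, w2)] += 1
    return unigrams, bigrams

def bigram_prob(w2, w1, unigrams, bigrams):
    """Relative-frequency estimate of P(w2 | w1) = N(w1,w2) / N(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0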

Bigram Probabilities – Example
• Consider the training text:
  "John sat on the old chair. John read the old book. John was interesting. The book was interesting"
• Suppose this is used to train a bigram grammar.
• 'the' occurs 3 times in the text, while the bigrams 'the old' and 'the book' occur twice and once respectively. Hence
  P('old'|'the') = 2/3, and P('book'|'the') = 1/3.
• Similarly, if the symbol # denotes start of sentence, then
  P('john'|#) = 3/4, and P('the'|#) = 1/4
Slide 17

Example Continued
• The probability of the sentence S = "John sat on the old chair" is given by:
  P(S) = P(john|#) • P(sat|john) • P(on|sat) • P(the|on) • P(old|the) • P(chair|old) • P($|chair)
       = 3/4 • 1/3 • 1 • 1 • 2/3 • 1/2 • 1 = 1/12
  (where $ denotes end of sentence)
• Similarly
  P("The old chair") = 1/12
  P("John read the old chair") = 1/12
• But P("John read the interesting book") = 0
Slide 18
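The worked example can be reproduced by scoring whole sentences with the chain rule under the bigram model; this is a self-contained sketch of my own, not part of the slides:

from collections import Counter

# Bigram counts from the training text on the previous slide; '#' and '$'
# are the sentence start and end symbols.
corpus = ["John sat on the old chair", "John read the old book",
          "John was interesting", "The book was interesting"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["#"] + sentence.lower().split() + ["$"]
    for w1, w2 in zip(words[:-1], words[1:]):
        unigrams[w1] += 1
        bigrams[(w1, w2)] += 1

def sentence_prob(sentence):
    """P(S) = P(w1|#) · P(w2|w1) · ... · P($|wn) under the bigram model."""
    words = ["#"] + sentence.lower().split() + ["$"]
    prob = 1.0
    for w1, w2 in zip(words[:-1], words[1:]):
        prob *= bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return prob

print(sentence_prob("John sat on the old chair"))       # 1/12 ≈ 0.0833
print(sentence_prob("The old chair"))                   # 1/12
print(sentence_prob("John read the interesting book"))  # 0.0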

Bigram & Trigram Estimation
• Most practical systems use a trigram language model
• In reality, there is never enough text to estimate trigram probabilities in this simple way
• E.g. experiments with trigram language models for a 1,000 word vocabulary application:
  – using 1.5 million words for training, and 300,000 words to test the models
  – 23% of the trigrams in the test corpus were absent from the training corpus
• Hence much more sophisticated training procedures are needed
Slide 19

Estimation of N-gram statistics
• In general, there will not be enough data to estimate N-gram statistics reliably.
• Possible solutions:
  – Robust estimation methods from statistics
  – Deleted interpolation
  – 'Back-off'
Slide 20

Deleted interpolation
• 'Interpolate' the trigram probability from estimated trigram, bigram and unigram probabilities:
  P̂(w3 | w1 w2) = λ1·P(w3 | w1 w2) + λ2·P(w3 | w2) + λ3·P(w3)
• Estimate λ1, λ2, λ3 through recognition experiments
Slide 21
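A minimal sketch (not from the slides) of deleted interpolation; p3, p2 and p1 stand for already-estimated trigram, bigram and unigram probability functions, and the weights (which must sum to 1, values here are just placeholders) would in practice be tuned through recognition experiments on held-out data:

def interpolated_trigram(w3, w2, w1, p3, p2, p1,
                         lambda1=0.6, lambda2=0.3, lambda3=0.1):
    """P_hat(w3 | w1 w2) = l1·P(w3|w1 w2) + l2·P(w3|w2) + l3·P(w3)."""
    return (lambda1 * p3(w3, w2, w1)
            + lambda2 * p2(w3, w2)
            + lambda3 * p1(w3))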

‘Backoff’
• Decide how many examples T are needed for robust estimation.
• Then:
  P̂(w3 | w1 w2) = P(w3 | w1 w2)   if N(w1 w2 w3) ≥ T
                 = P(w3 | w2)      if N(w1 w2 w3) < T and N(w1 w2) ≥ T
                 = P(w3)           otherwise
  where N(·) is the number of occurrences in the training text
Slide 22
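A minimal sketch (not from the slides) of the threshold rule above. Here counts is assumed to be a collections.Counter keyed by word tuples, e.g. counts[("the", "old", "chair")]; real back-off schemes also renormalise the backed-off probabilities, which is omitted in this simplified form:

def backoff_trigram(w3, w2, w1, counts, T=3):
    """P_hat(w3 | w1 w2) using the simple threshold rule from the slide."""
    if counts[(w1, w2, w3)] >= T:                        # trigram seen often enough
        return counts[(w1, w2, w3)] / counts[(w1, w2)]
    if counts[(w1, w2)] >= T:                            # back off to the bigram estimate
        return counts[(w2, w3)] / counts[(w2,)] if counts[(w2,)] else 0.0
    total_words = sum(c for ngram, c in counts.items() if len(ngram) == 1)
    return counts[(w3,)] / total_words if total_words else 0.0   # unigram estimate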

N-gram Language Models – Summary
• Advantages
  – Can be trained automatically from data
  – Probabilistic model
  – Consistent with the acoustic model
  – Mathematically sound algorithms
Slide 23

N-gram Language Models – Summary
• Disadvantages
  – Large amounts of training data needed
  – Difficult to incorporate human knowledge
  – Cannot model long-term dependency: "She walked, hands in pockets, quickly across the bridge"
Slide 24

Summary
• Role of language modelling in speech recognition
• Rule-based language models
• Finite state language models
• N-gram language models
• Difficulty of estimating N-gram statistics
Slide 25