Data Mining and Machine Learning
Application of HMMs for Automatic Speech Recognition – Introduction
Peter Jančovič
Objectives
Introduce automatic speech recognition
Understand why speech recognition is difficult
– Continuity
– Variability
– Confusability
– Effects of accent
Speech recognition terminology
Why is speech recognition difficult?
Intuitively…
Meaning is represented by sentences
A sentence is a sequence of words
A word is a sequence of phonemes
…
This view of speech is based on text
But speech is NOT just “Acoustic text”
Speech is…
Continuous
– “We were away a year ago”
Variable
– “bread and butter” or “brembudder”
Ambiguous
– “The grey tape can fix that leak”
– “The great ape can fix that leek”
– “The great ape can fix that league”
– “The great tape can. Fix that’ll eek!”
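The ambiguity above can be sketched in a few lines of Python. This is only an illustration, assuming a hypothetical four-word lexicon and informal phoneme symbols (neither comes from the slides): a single unbroken phoneme stream is split into words in every way the lexicon allows.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: word -> phoneme sequence (informal symbols, assumed).
LEXICON: Dict[str, Tuple[str, ...]] = {
    "grey": ("g", "r", "ey"),
    "great": ("g", "r", "ey", "t"),
    "tape": ("t", "ey", "p"),
    "ape": ("ey", "p"),
}


def segmentations(phonemes: Tuple[str, ...]) -> List[List[str]]:
    """Return every way to split `phonemes` into a sequence of lexicon words."""
    if not phonemes:
        return [[]]  # one way to segment nothing: the empty word sequence
    results: List[List[str]] = []
    for word, pron in sorted(LEXICON.items()):
        n = len(pron)
        if phonemes[:n] == pron:
            # The word matches the start of the stream; segment the remainder.
            for rest in segmentations(phonemes[n:]):
                results.append([word] + rest)
    return results


# Continuous speech carries no word boundaries, so both readings survive.
stream = ("g", "r", "ey", "t", "ey", "p")
for words in segmentations(stream):
    print(" ".join(words))
```

Both “great ape” and “grey tape” are printed for the same stream, so the phoneme sequence alone cannot settle the choice between them.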
Speech is Continuous
Variability
Confusability
English Vowels: /h_d/
“League” or “Leek”?
“league” = / l i g /
“leek” = / l i k /
Difference appears to be in the final consonant:
– /g/ is voiced
– /k/ is unvoiced
But in natural fluent speech, the duration of the vowel /i/ may be a more important cue to recognition!
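As a rough illustration of how close the two words are at the symbolic level, the sketch below computes a standard Levenshtein (edit) distance over the phoneme transcriptions given above. The function name is hypothetical, and a symbolic distance like this says nothing about acoustic cues such as vowel duration.

```python
from typing import Sequence


def phoneme_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] = distance between the phonemes of `a` seen so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


league = ["l", "i", "g"]
leek = ["l", "i", "k"]
print(phoneme_edit_distance(league, leek))  # a single substitution: /g/ vs /k/
```

The two transcriptions differ by one substitution, so any weakening of the final consonant in fluent speech leaves very little to separate the words.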
ABI – Accents of the British Isles
Corpus of recordings of 15 different accents of British English
– 300+ subjects
– Approx. 20 minutes of speech per subject
– 20+ subjects (10 male, 10 female) per accent
– Each subject born in town, lived there all of his or her life, parents born in town
– Funded by 20/20 Speech
– Up to £2K (academic) or £20K (industry)
ABI II
– Extended corpus
– Systematic study of effect of accent on recogniser performance
– Systematic study of acoustic correlates of accent
[Map labels: Lowestoft, Elgin, Glasgow, Ulster, Denbigh]
Approaches to Speech Recognition
Many approaches to speech recognition have been tried in the past
Researchers in the early days believed that the acoustic data alone did not contain enough information to recognise speech, and that additional sources of knowledge were necessary
– acoustic-phonetic, lexical (words), syntactic (grammar), semantic and domain-specific knowledge
The most successful approach to date is based on a combination of hidden Markov models (HMMs) with (deep) neural networks
Speech Recognition Terminology
The basic problem in speech recognition is variability
Early attempts tried to solve the problem by removing it
Speaker variability
– Speaker-dependent speech recognition systems train on, and subsequently recognise, a single speaker
– Multiple-speaker systems work for a particular population of speakers
– Speaker-independent systems work for any speaker, with no implicit or explicit speaker-specific training
– Speaker-adaptive systems automatically adapt to a new speaker, e.g. begin with a speaker-independent system and then adapt it to a particular speaker to obtain a speaker-dependent system
Terminology (Continued)
Another source of variability is co-articulation between words
Isolated word recognition systems require the user to leave gaps between words
Connected speech recognition systems recognise isolated phrases or sentences
Continuous speech recognition systems recognise continuous speech
Vocabulary Size
Another important issue is vocabulary size
Small vocabulary systems work with vocabularies of 10-100 words
Medium vocabularies comprise around 100 to 5,000 words
Large Vocabulary Continuous Speech Recognition (LVCSR) systems can cope with 60,000 words, while unlimited vocabulary systems have no vocabulary size limitation
Historical perspective
1970s
– US ARPA programme
– Whole-word pattern matching (DP): Sakoe & Chiba
– Bruce Lowerre’s HARPY system (CMU)
1980s
– US DARPA programme: Resource Management (RM) task
– ‘Popularisation’ of HMMs (Rabiner, Levinson)
– IBM Tangora system (Jelinek, Bahl)
– The SPHINX system (Kai-Fu Lee) (CMU)
1990s
– DARPA programme: WSJ, BN, Switchboard; large vocabulary systems
– Large-scale HMM-based systems (Cambridge University, LIMSI, IBM, Dragon, …)
2000s and 2010s
– Google, Amazon, Apple, …
– Hybrid DNN-HMM systems
Phoneme-HMM Speech Recogniser
[Block diagram. Labelled components: speech signal, front-end signal processing, phoneme HMM store, pronunciation dictionary, N-gram language model, application compiler, Viterbi decoder. Output: optimal word sequence or N-best list]
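To give a feel for what the Viterbi decoder does, here is a minimal sketch, not the recogniser on the slide. It assumes a hypothetical three-state left-to-right chain for “leek” (/l i k/), discrete observation symbols standing in for front-end output, and invented probabilities. A real system searches a network compiled from the phoneme HMMs, pronunciation dictionary and language model, over continuous acoustic features.

```python
import math
from typing import Dict, List

NEG_INF = float("-inf")


def lg(p: float) -> float:
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


# Hypothetical toy model: all numbers below are invented for illustration.
STATES = ["l", "i", "k"]
INIT = {"l": 1.0, "i": 0.0, "k": 0.0}
TRANS = {
    "l": {"l": 0.5, "i": 0.5, "k": 0.0},
    "i": {"l": 0.0, "i": 0.6, "k": 0.4},
    "k": {"l": 0.0, "i": 0.0, "k": 1.0},
}
EMIT = {  # probability of each discrete acoustic label in each state
    "l": {"L": 0.8, "I": 0.1, "K": 0.1},
    "i": {"L": 0.1, "I": 0.8, "K": 0.1},
    "k": {"L": 0.1, "I": 0.1, "K": 0.8},
}


def viterbi(obs: List[str]) -> List[str]:
    """Most likely state sequence for `obs`, computed in the log domain."""
    # delta[s]: best log score of any path that ends in state s at this frame
    delta = {s: lg(INIT[s]) + lg(EMIT[s][obs[0]]) for s in STATES}
    backpointers: List[Dict[str, str]] = []
    for o in obs[1:]:
        new_delta: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for s in STATES:
            best = max(STATES, key=lambda p: delta[p] + lg(TRANS[p][s]))
            ptr[s] = best
            new_delta[s] = delta[best] + lg(TRANS[best][s]) + lg(EMIT[s][o])
        delta = new_delta
        backpointers.append(ptr)
    # Trace back from the best final state to recover the whole path.
    state = max(STATES, key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]


print(viterbi(list("LLIIIK")))
```

The decoder assigns each frame to a state, so the number of frames aligned to “i” is in effect a vowel-duration estimate.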
Summary
Why automatic speech recognition is difficult
– Speech is not “acoustic text”: Continuity, Variability, Confusability
Speech recognition terminology
Historical perspective