Data Mining and Machine Learning
Application of HMMs for ASR: Feature representation of speech
Peter Jančovič
Objectives
Front-end analysis for ASR – feature representation of speech
– To understand motivation and stages for ‘typical’ parameterisation of speech signals used for ASR
– Mel Frequency Cepstral Coefficients (MFCCs)
What is “Front-End Analysis”?
First stage in any speech recognition system
Goal is to convert the raw acoustic speech waveform into a form which is suitable (or even optimal) for automatic speech recognition
In general pattern recognition terms, front-end analysis is feature extraction
Where do we start?
The Human Auditory System
taken from J N Holmes, “Speech Synthesis and Recognition”, Van Nostrand Reinhold (1988)
The Basilar Membrane
Australian National University – http://online.anu.edu.au/ITA/ACAT/drw/PPofM/hearing/hearing3.html
Frequency response of the basilar membrane
School for Advanced Studies, Trieste, Italy – http://poirot.sissa.it/multidisc/cochlea/utils/basilar.htm
Lessons from Psycho-Acoustics
Human speech perception begins with frequency analysis on the basilar membrane
Frequency is not perceived on a linear scale – hence the use of non-linear perceptual frequency scales: mel scale, Bark scale, …
Each point on the basilar membrane can be modelled as a band-pass filter – a critical band is the implicit bandwidth of such an ‘auditory filter’
Loudness is perceived on a logarithmic scale
Phase is of limited significance for speech recognition
Front-end analysis for ASR
Speech waveform typically low-pass filtered at 4 kHz to 8 kHz
Sampled at 8,000 to 16,000 samples per second
Frequency analysis:
– 20 ms analysis window
– 10 ms overlap between windows
– Hamming window
– Discrete Fourier Transform
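The windowing step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `hamming` and `frame_signal` and the 440 Hz test tone are made up for the example, and the window uses the common 0.54/0.46 Hamming coefficients.

```python
import math

def hamming(n_samples):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def frame_signal(samples, rate_hz, win_ms=20, overlap_ms=10):
    """Split samples into overlapping, Hamming-weighted frames (hypothetical helper)."""
    win_len = rate_hz * win_ms // 1000                 # samples per window
    shift = rate_hz * (win_ms - overlap_ms) // 1000    # samples between frame starts
    window = hamming(win_len)
    frames = []
    for start in range(0, len(samples) - win_len + 1, shift):
        chunk = samples[start:start + win_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

# One second of a made-up 440 Hz test tone at 16,000 samples per second
rate = 16000
tone = [math.sin(2.0 * math.pi * 440.0 * t / rate) for t in range(rate)]
frames = frame_signal(tone, rate)
print(len(frames), len(frames[0]))
```

At 16,000 samples per second a 20 ms window holds 320 samples and successive windows start 160 samples (10 ms) apart, which gives roughly 100 frames per second.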
Frequency analysis for ASR
Example: 8 kHz bandwidth system
Analogue speech signal → 8 kHz low-pass filter → A/D conversion (16k samples/s) → 20 ms window, 10 ms overlap → DFT → 100 ‘spectra’ (160 point) per second
Log Power Spectrum
Phase ignored by taking the modulus of the complex spectrum
Logarithm applied
– For consistency with psycho-acoustic results
– To compress dynamic range
160 point short-time spectrum → modulus & logarithm → 160 point short-time log-power spectrum
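A minimal sketch of the modulus-and-logarithm step, assuming a 320-sample frame (20 ms at 16,000 samples per second). The function name `log_power_spectrum`, the small floor that guards against log(0), and the choice to keep bins 0 to N/2 - 1 (so that 160 points result) are assumptions made for this example. A direct DFT is used for clarity; a real system would use an FFT.

```python
import cmath
import math

def log_power_spectrum(frame, floor=1e-10):
    """Return log |X[k]|^2 for k = 0 .. N/2 - 1 using a direct DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2):
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        power = abs(x_k) ** 2                          # modulus squared: phase is discarded
        spectrum.append(math.log(max(power, floor)))   # floor avoids log(0)
    return spectrum

# A made-up frame holding a 1 kHz tone: 320 samples at 16,000 samples per second
rate, n = 16000, 320
frame = [math.sin(2.0 * math.pi * 1000.0 * t / rate) for t in range(n)]
log_spec = log_power_spectrum(frame)
peak_bin = max(range(len(log_spec)), key=lambda k: log_spec[k])
print(len(log_spec), peak_bin * rate / n)
```

Each bin is 16000 / 320 = 50 Hz wide, so the 1 kHz tone peaks in bin 20.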
Mel-scale & smoothing
The mel spectrum can be computed by averaging the short-time Fourier spectrum over ‘bins’ whose width depends on frequency…
…or by using band-pass filters with appropriate, frequency-dependent bandwidths
160 point short-time log-power spectrum → mel scale ‘binning’ → 20 point, smoothed mel-frequency log-power spectrum
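The binning can be sketched as a weighted average over triangular bins that are equally spaced on the mel scale. The sketch assumes the widely used mapping mel(f) = 2595 log10(1 + f / 700); the function names and the triangular weighting are illustrative choices rather than anything prescribed by the slides. It follows the order shown here (binning applied to the log-power spectrum), whereas many toolkits apply the filters before taking the logarithm.

```python
import math

def hz_to_mel(f):
    """Common mel mapping (assumed here): 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_bin(spectrum, rate_hz, n_bins=20):
    """Weighted average of a spectrum over triangular bins equally spaced in mel."""
    n_points = len(spectrum)
    hz_per_point = (rate_hz / 2.0) / n_points
    top_mel = hz_to_mel(rate_hz / 2.0)
    # n_bins + 2 edge frequencies; bin i spans edges[i] .. edges[i + 2]
    edges = [mel_to_hz(top_mel * i / (n_bins + 1)) for i in range(n_bins + 2)]
    out = []
    for i in range(n_bins):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        total, weight_sum = 0.0, 0.0
        for k in range(n_points):
            f = k * hz_per_point
            if lo < f <= mid:
                w = (f - lo) / (mid - lo)     # rising edge of the triangle
            elif mid < f < hi:
                w = (hi - f) / (hi - mid)     # falling edge of the triangle
            else:
                w = 0.0
            total += w * spectrum[k]
            weight_sum += w
        out.append(total / weight_sum if weight_sum > 0.0 else 0.0)
    return out

# A flat 160 point spectrum should stay flat after binning
flat = [1.0] * 160
mel_spec = mel_bin(flat, 16000)
print(len(mel_spec))
```

Because the bin edges are equally spaced in mel, the bins are narrow at low frequencies and progressively wider towards 8 kHz.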
Mel Scale Filterbank
From Steve Young, “The HTK Book”, Cambridge University Engineering Department
Cepstrum
Cosine transform applied to remove correlation between components of mel-scale log power spectrum
– Mel Cepstrum: MFCC = Mel Frequency Cepstral Coefficients
– Mathematical expediency
20 point mel scale log power spectrum → cosine transform → 20 MFCCs (use only first 12)
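As a rough sketch, the cosine transform can be written as a type-II discrete cosine transform of the 20 mel-scale values, keeping coefficients 1 to 12. The function name `mfcc_from_mel`, the sqrt(2/N) scaling, and the omission of the zeroth coefficient are assumptions made for this example; conventions differ between toolkits.

```python
import math

def mfcc_from_mel(log_mel, n_ceps=12):
    """DCT-II of a mel-scale log power spectrum, keeping coefficients 1 .. n_ceps."""
    n = len(log_mel)
    scale = math.sqrt(2.0 / n)     # assumed normalisation
    return [scale * sum(log_mel[j] * math.cos(math.pi * i * (j + 0.5) / n)
                        for j in range(n))
            for i in range(1, n_ceps + 1)]

# A made-up 20 point input shaped like the third cosine basis function
n = 20
log_mel = [math.cos(math.pi * 3 * (j + 0.5) / n) for j in range(n)]
ceps = mfcc_from_mel(log_mel)
largest = max(range(len(ceps)), key=lambda i: abs(ceps[i]))
print(len(ceps), largest + 1)
```

Since the cosine basis functions are mutually orthogonal, this input produces a single large coefficient (the third) and near-zero values elsewhere, which illustrates how the transform separates the components of the spectrum.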
Energy & Delta Coefficients
Add energy as 13th parameter
Compute estimate of time-derivative of each parameter – delta cepstrum (or Δ cepstrum)
Compute estimate of time-acceleration of each parameter – delta² cepstrum (or Δ² cepstrum)
Cepstrum + Δ Cepstrum + Δ² Cepstrum = ‘standard’ 39 dimensional representation (e.g. in HTK)
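One way to estimate the time-derivatives is a regression over neighbouring frames, d_t = Σ θ (c_{t+θ} − c_{t−θ}) / (2 Σ θ²) with θ = 1 … Θ. The sketch below assumes Θ = 2 and repeats the first and last frames at the edges; the function names and the test data are made up for illustration.

```python
def deltas(frames, width=2):
    """Regression estimate of the time-derivative of each parameter."""
    n = len(frames)
    denom = 2.0 * sum(th * th for th in range(1, width + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for th in range(1, width + 1):
                ahead = frames[min(t + th, n - 1)][d]   # repeat last frame at the end
                behind = frames[max(t - th, 0)][d]      # repeat first frame at the start
                num += th * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out

def add_deltas(static):
    """Append delta and delta-delta parameters to each static frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Ten made-up frames of 13 static parameters (12 MFCCs + energy),
# each parameter rising by 1 per frame
static = [[float(t)] * 13 for t in range(10)]
features = add_deltas(static)
print(len(features[0]), features[5][13], features[5][26])
```

With 13 static parameters per frame, appending the Δ and Δ² parameters gives 3 × 13 = 39 values per frame. For this steadily rising input the Δ values away from the edges are 1 and the Δ² values are 0.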
Front-end analysis – summary
Speech signal → Frame windowing → DFT → Abs, Log & power → Mel frequency binning → Cosine Transform → Add energy → Δ cepstrum & Δ² cepstrum
Summary
Introduction to front-end speech processing for ASR
– Motivations from human hearing
– Description of ‘typical’ front-end representation
– Short-time log power spectrum
– Mel scale filtering
– Cosine transform
– Δ and Δ² parameters