Data Mining and Machine Learning
Application of HMMs for ASR: Feature representation of speech
Peter Jančovič

Objectives
• Front-end analysis for ASR – feature representation of speech
– To understand motivation and stages for ‘typical’ parameterisation of speech signals used for ASR
– Mel Frequency Cepstral Coefficients (MFCCs)

What is “Front-End Analysis”?
• First stage in any speech recognition system
• Goal is to convert the raw acoustic speech waveform into a form which is suitable (or even optimal) for automatic speech recognition
• In general pattern recognition terms, front-end analysis is feature extraction
• Where do we start?

The Human Auditory System
taken from J N Holmes, “Speech Synthesis and Recognition”, Van Nostrand Reinhold (1988)

The Basilar Membrane
Australian National University – http://online.anu.edu.au/ITA/ACAT/drw/PPofM/hearing/hearing3.html

Frequency response of the basilar membrane
School for Advanced Studies, Trieste, Italy – http://poirot.sissa.it/multidisc/cochlea/utils/basilar.htm

Lessons from Psycho-Acoustics
• Human speech perception begins with frequency analysis on the basilar membrane
• Frequency is not perceived on a linear scale – hence the use of non-linear perceptual frequency scales: mel scale, Bark scale, …
• Each point on the basilar membrane can be modelled as a band-pass filter – a critical band is the implicit bandwidth of such an ‘auditory filter’
• Loudness is perceived on a logarithmic scale
• Phase is of limited significance for speech recognition
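The slides do not give a formula for the mel scale. A minimal sketch, assuming the widely used mapping mel(f) = 2595 log10(1 + f/700) (the form given in the HTK Book); the function names hz_to_mel and mel_to_hz are illustrative only:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Map a frequency in Hz to mels (assumed formula, see lead-in)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse mapping, from mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in Hz shrink on the mel scale as frequency rises.
for f in (100, 1000, 4000, 8000):
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} mel")
```

Under this mapping the scale is roughly linear at low frequencies and roughly logarithmic at high frequencies, which reflects the non-linear frequency perception noted above.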

Front-end analysis for ASR
• Speech waveform typically low-pass filtered at 4kHz to 8kHz
• Sampled at 8,000 to 16,000 samples per second
• Frequency analysis:
– 20 ms analysis window
– 10 ms overlap between windows
– Hamming window
– Discrete Fourier Transform

Frequency analysis for ASR
Example: 8kHz bandwidth system
Analogue speech signal → 8kHz low-pass filter → A/D conversion (16k samples/s) → 20 ms window, 10 ms overlap → DFT → 100 ‘spectra’ (160 point) per second
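The numbers in this example can be checked with a short sketch. It uses only the Python standard library and a naive DFT for clarity (a practical system would use an FFT routine); the 1 kHz test tone and all function names are assumptions made for illustration:

```python
import cmath
import math

SAMPLE_RATE = 16000                       # samples per second
FRAME_LEN = SAMPLE_RATE * 20 // 1000      # 20 ms window  -> 320 samples
FRAME_SHIFT = SAMPLE_RATE * 10 // 1000    # 10 ms advance -> 160 samples

def hamming(n_points):
    """Hamming window coefficients."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def frames(signal, frame_len=FRAME_LEN, frame_shift=FRAME_SHIFT):
    """Split the signal into overlapping, Hamming-windowed frames."""
    window = hamming(frame_len)
    return [[s * w for s, w in zip(signal[start:start + frame_len], window)]
            for start in range(0, len(signal) - frame_len + 1, frame_shift)]

def dft(frame, n_bins):
    """Naive DFT; bin k corresponds to k * SAMPLE_RATE / len(frame) Hz."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
            for k in range(n_bins)]

# One second of a 1 kHz tone (hypothetical test input).
signal = [math.sin(2.0 * math.pi * 1000.0 * t / SAMPLE_RATE)
          for t in range(SAMPLE_RATE)]
windowed = frames(signal)
spectrum = dft(windowed[0], FRAME_LEN // 2)   # 160 points, 50 Hz apart
peak_bin = max(range(len(spectrum)), key=lambda k: abs(spectrum[k]))
print(len(windowed), "frames; peak at", peak_bin * SAMPLE_RATE / FRAME_LEN, "Hz")
```

A 320-sample frame yields 160 distinct frequency points below 8kHz, and a 10 ms advance gives close to 100 frames per second (slightly fewer for a finite signal, since the last partial frames are dropped here).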

Log Power Spectrum
• Phase ignored by taking the modulus of the complex spectrum
• Logarithm applied
– For consistency with psycho-acoustic results
– To compress dynamic range
160 point short-time spectrum → modulus & logarithm → 160 point short-time log-power spectrum
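As a rough illustration of this step, the sketch below takes a complex spectrum, discards phase via the squared modulus, and applies the logarithm. The small floor value that avoids log(0) is an assumption, not something the slides specify:

```python
import math

def log_power_spectrum(spectrum, floor=1e-10):
    """Return log(|X[k]|^2) for each complex spectral value X[k].

    `floor` is an assumed lower bound on power to avoid log(0).
    """
    return [math.log(max(abs(x) ** 2, floor)) for x in spectrum]

# Two values with equal modulus but different phase give the same result.
example = log_power_spectrum([3 + 4j, -5 + 0j, 1j])
print(example)
```

Because only the modulus enters the calculation, any two spectra that differ in phase alone map to the same log power spectrum.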

Mel-scale & smoothing
• The mel spectrum can be computed by averaging the short-time Fourier spectrum over ‘bins’ whose width depends on frequency…
• …or by using band-pass filters with appropriate, frequency-dependent bandwidths
160 point short-time log-power spectrum → mel scale ‘binning’ → 20 point, smoothed mel-frequency log-power spectrum
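One possible way to implement the binning is with overlapping triangular weights whose centres are equally spaced on the mel scale. This is a sketch under assumptions: the triangular shape, the mel formula 2595 log10(1 + f/700), and the function names are not taken from the slides, and some toolkits apply the filters before the logarithm rather than after it.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_points=160, max_hz=8000.0):
    """Triangular weights, one row per filter, centres equally spaced in mel."""
    max_mel = hz_to_mel(max_hz)
    # n_filters + 2 edge frequencies: each filter spans three consecutive edges.
    edges = [mel_to_hz(max_mel * i / (n_filters + 1)) for i in range(n_filters + 2)]
    hz_per_point = max_hz / n_points      # spectral point k sits at k * hz_per_point
    bank = []
    for j in range(1, n_filters + 1):
        lo, centre, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_points):
            f = k * hz_per_point
            if lo < f <= centre:
                row.append((f - lo) / (centre - lo))    # rising slope
            elif centre < f < hi:
                row.append((hi - f) / (hi - centre))    # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mel_bin(spectrum, bank):
    """Weighted average of the spectrum inside each mel filter."""
    out = []
    for row in bank:
        total = sum(row)
        weighted = sum(w * s for w, s in zip(row, spectrum))
        out.append(weighted / total if total > 0 else 0.0)
    return out

bank = mel_filterbank()
smoothed = mel_bin([3.0] * 160, bank)     # a flat spectrum stays flat
print(len(smoothed), smoothed[0], smoothed[-1])
```

With these settings the lowest filter covers only a few spectral points while the highest covers several dozen, which is the frequency-dependent bin width described above.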

Mel Scale Filterbank
From Steve Young, “The HTK Book”, Cambridge University Engineering Department

Cepstrum
• Cosine transform applied to remove correlation between components of mel-scale log power spectrum
– Mel Cepstrum: MFCC = Mel Frequency Cepstral Coefficients
– Mathematical expediency
20 point mel scale log power spectrum → Cosine Transform → 20 MFCCs (use only first 12)
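The cosine transform step can be sketched with an unnormalised DCT-II. Two assumptions are made here that the slides leave open: scaling conventions differ between toolkits, and the ‘first 12’ coefficients are taken to be c1 to c12, with c0 omitted because it mainly reflects overall level. The function names are illustrative.

```python
import math

def dct_ii(values):
    """Unnormalised DCT-II: c[i] = sum_j v[j] * cos(pi * i * (j + 0.5) / N)."""
    n = len(values)
    return [sum(v * math.cos(math.pi * i * (j + 0.5) / n)
                for j, v in enumerate(values))
            for i in range(n)]

def mfcc_from_mel(mel_log_spectrum, n_keep=12):
    """Keep coefficients c1..c_n_keep (assumed convention, see lead-in)."""
    cepstrum = dct_ii(mel_log_spectrum)
    return cepstrum[1:n_keep + 1]

# A flat 20 point mel log spectrum has no spectral shape, so c1..c12 vanish.
flat = mfcc_from_mel([2.0] * 20)
# A single cosine ripple across the spectrum excites exactly one coefficient.
ripple = mfcc_from_mel([math.cos(math.pi * 3 * (j + 0.5) / 20) for j in range(20)])
print(len(flat), max(abs(c) for c in flat), ripple[2])
```

Low-order coefficients describe the broad spectral envelope and higher-order ones describe fine detail, so truncating to the first 12 acts as further smoothing.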

Energy & Delta Coefficients
• Add energy as 13th parameter
• Compute estimate of time-derivative of each parameter – delta cepstrum (or Δ cepstrum)
• Compute estimate of time-acceleration of each parameter – delta2 cepstrum (or Δ² cepstrum)
• Cepstrum + Δ cepstrum + Δ² cepstrum = ‘standard’ 39 dimensional representation (13 static + 13 Δ + 13 Δ², e.g. in HTK)
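The slides do not say how the time-derivative is estimated. A minimal sketch, assuming a linear regression over a window of ±2 frames (the form described in the HTK Book) with the first and last frames repeated at the edges; the window size and function names are assumptions:

```python
def deltas(frames, theta=2):
    """Regression estimate of the time derivative of every parameter.

    d[t] = sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2), for k = 1..theta.
    """
    n = len(frames)
    denom = 2.0 * sum(k * k for k in range(1, theta + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, theta + 1):
                ahead = frames[min(t + k, n - 1)][d]   # repeat last frame at the end
                behind = frames[max(t - k, 0)][d]      # repeat first frame at the start
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out

def add_deltas(static):
    """Append delta and delta-delta parameters to each static vector."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Ten frames of 13 static parameters that rise by 1 per frame (toy input).
static = [[float(t)] * 13 for t in range(10)]
features = add_deltas(static)
print(len(features[0]), features[5][13], features[5][26])
```

For the toy ramp input, the delta away from the edges equals the slope of 1 per frame and the delta-delta is zero, and each frame grows from 13 to 39 parameters.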

Front-end analysis – summary
Speech signal → Frame windowing → DFT → Abs, Log & power → Mel frequency binning → Cosine Transform → Add energy → Δ cepstrum & Δ² cepstrum

Summary
• Introduction to front-end speech processing for ASR
– Motivations from human hearing
– Description of ‘typical’ front-end representation
– Short-time log power spectrum
– Mel scale filtering
– Cosine transform
– Δ and Δ² parameters