Coursework 1 briefing
Data collection
• Decide on a sampling frequency (8 kHz, 16 kHz, 44 kHz) – the MATLAB function resample() can change this afterwards, but only reducing the rate is useful
• Scaling
  • Check signal amplitudes are consistent across recordings
  • Apply rescaling or normalisation
• Clipping
  • Check the recording gain setting to avoid clipping
• Lead-in or zero-amplitude signal
  • If the amplitude takes time to settle, or the recording contains zeros, cut this out of the recordings or apply a filter
• Make sure you are happy with the recording conditions before you undertake a long data collection – speak to me first if unsure
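The checks above can be sketched in MATLAB as follows. This is only an illustration under stated assumptions: a synthetic tone stands in for a real recording, the rates, target peak and clipping threshold are made-up values, and resample() requires the Signal Processing Toolbox.

```matlab
% Sketch: condition one recording before feature extraction.
% A synthetic tone stands in for a real recording (assumption).
fsIn  = 16000;                        % assumed original sampling rate (Hz)
fsOut = 8000;                         % assumed target sampling rate (Hz)
t = (0:fsIn-1).' / fsIn;              % one second of sample times
x = 0.3 * sin(2*pi*440*t);            % stand-in for the recorded signal

% Flag likely clipping: count samples at or very near full scale.
clipThreshold = 0.999;                % hypothetical threshold
numClipped = sum(abs(x) >= clipThreshold);

% Reduce the sampling rate (Signal Processing Toolbox).
y = resample(x, fsOut, fsIn);

% Peak-normalise so that every recording has the same maximum amplitude.
targetPeak = 0.9;                     % hypothetical target level
y = y * (targetPeak / max(abs(y)));

fprintf('%d samples after resampling, %d clipped samples\n', numel(y), numClipped);
```

Normalisation cannot repair a clipped recording, so the clipping check is made on the original samples before any rescaling.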
Data collection – clipping
[Figure: two waveform plots of amplitude (−1 to 1) against sample number (up to 5000), labelled "Correct" and "Clipped"]
Adjust recording level to avoid clipping
Data collection – leading zeros
[Figure: two waveform plots of amplitude (−1 to 1) against sample number, illustrating leading zeros in a recording]
• Remove (crop) leading or trailing zeros, and adjust the label file accordingly
• Zeros left in the signal may cause problems with subsequent feature extraction
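A minimal sketch of the cropping step, assuming a simple amplitude threshold is enough to find where the signal starts and ends. The threshold value and the synthetic signal are illustrative assumptions. The offset is converted to 100 ns units because HTK label files store times in those units.

```matlab
% Sketch: crop leading/trailing zeros and work out the label time shift.
fs = 16000;                                   % assumed sampling rate (Hz)
tone = 0.5 * sin(2*pi*200*(0:7999).' / fs);   % stand-in for the speech
x = [zeros(2000,1); tone; zeros(1000,1)];     % recording with zero padding

threshold = 0.01;                             % hypothetical amplitude threshold
active = find(abs(x) > threshold);            % samples that carry signal
firstSample = active(1);
lastSample  = active(end);
xCropped = x(firstSample:lastSample);

% Every label time must move earlier by the amount cut from the start.
offsetSeconds  = (firstSample - 1) / fs;
offsetHtkUnits = round(offsetSeconds * 1e7);  % HTK label times use 100 ns units
```

Subtract offsetHtkUnits from each start and end time in the label file so that the labels still line up with the cropped audio.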
Mel-scale filterbank
[Figure: mel-scaled filterbank of ~20 channels on a frequency axis from 0 to 4 kHz]
[Block diagram: speech signal → pre-emphasis → Hamming window → spectral analysis → mel-scale filterbank → log → DCT → feature vectors]
• Three things to decide:
  • Linear or non-linear frequency mapping
  • Number of channels
  • Shape of filterbank channels
• Frequency mapping – ideally mel-scale (cf. the mel-scale equation), but it is better to start with a linear mapping where channels are spaced equally in frequency
• Number of channels – 20–25 is typical when sampling at 8 kHz, and 30 is typical at 16 kHz
• Shape of filterbank channels – the standard MFCC implementation uses triangular filterbank channels that overlap. A simple implementation could use non-overlapping rectangular channels
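To make the two frequency mappings concrete, the sketch below computes channel edge frequencies for both. It assumes the commonly used mel formula mel(f) = 2595 log10(1 + f/700), which may differ from the equation used in the lectures. The helper names hz2mel and mel2hz are introduced here for illustration only.

```matlab
% Sketch: channel edges for a linear and a mel-spaced filterbank.
fs = 8000;                            % assumed sampling rate (Hz)
K  = 20;                              % number of channels
fMax = fs / 2;                        % highest usable frequency (Hz)

% Commonly used mel mapping and its inverse (assumed form).
hz2mel = @(f) 2595 * log10(1 + f / 700);
mel2hz = @(m) 700 * (10.^(m / 2595) - 1);

% K non-overlapping channels need K+1 edges.
linearEdges = linspace(0, fMax, K + 1);                    % equal width in Hz
melEdges    = mel2hz(linspace(0, hz2mel(fMax), K + 1));    % equal width in mel

fprintf('First channel: linear 0-%.0f Hz, mel 0-%.0f Hz\n', ...
        linearEdges(2), melEdges(2));
```

With the mel mapping the low-frequency channels are narrow and the high-frequency channels are wide, whereas the linear mapping gives every channel the same width.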
Rectangular filterbank
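[Figure: plots on a 0–140 horizontal axis and a 0–1 vertical axis with Channel 1 marked; its output, 8.34, fills element 1 of the K-element feature vector]
Assume a K-channel filterbank applied to the magnitude spectrum mag:
binsPerChannel = floor(numel(mag) / K);   % assumption: equal-width channels
filterbank = zeros(1, K);
for channel = 1:K
    firstBin = (channel - 1) * binsPerChannel + 1;   % first bin for channel
    lastBin  = channel * binsPerChannel;             % last bin for channel
    filterbank(channel) = sum(mag(firstBin:lastBin));
end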
[Figure: the same plots with Channel 1 and Channel 2 marked; their outputs, 8.34 and 2.17, fill elements 1 and 2 of the K-element feature vector. Each channel is computed with the same loop as for Channel 1.]
Mel-scale filterbank
[Figure: magnitude spectrum on a 0–140 horizontal axis, together with a mel-scaled filterbank of ~20 channels on a frequency axis from 0 to 4 kHz, feeding the feature vector. Successive builds mark Channel 1 (feature vector value 8) and then Channel 2 (feature vector value 6).]
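One way to build overlapping triangular channels on a mel scale is sketched below. It is an illustration, not the reference MFCC implementation: the FFT size, the mel formula mel(f) = 2595 log10(1 + f/700), and the flat stand-in spectrum are all assumptions.

```matlab
% Sketch: triangular, overlapping mel-spaced filterbank as a weight matrix.
fs = 8000;  nfft = 256;  K = 20;              % assumed analysis settings
numBins  = nfft / 2 + 1;                      % bins from 0 Hz to fs/2
binFreqs = (0:numBins-1) * fs / nfft;         % centre frequency of each bin

hz2mel = @(f) 2595 * log10(1 + f / 700);      % assumed mel mapping
mel2hz = @(m) 700 * (10.^(m / 2595) - 1);

% K+2 points: channel k rises from point k to k+1, then falls to k+2.
points = mel2hz(linspace(0, hz2mel(fs/2), K + 2));

weights = zeros(K, numBins);
for channel = 1:K
    lo = points(channel);  centre = points(channel+1);  hi = points(channel+2);
    rising  = (binFreqs - lo) / (centre - lo);
    falling = (hi - binFreqs) / (hi - centre);
    weights(channel, :) = max(0, min(rising, falling));   % triangle, zero outside
end

mag = ones(numBins, 1);                       % stand-in magnitude spectrum
fbank = weights * mag;                        % one value per channel
```

Storing the filterbank as a matrix means each frame needs only one matrix-vector product, and replacing the triangles with rectangles only changes how weights is filled.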
Discrete cosine transform
• MATLAB has a dct function
• Need to decide the level of truncation following the DCT
• A 20-channel filterbank input to the DCT gives 20 DCT coefficients as output
• The level of truncation will affect recognition performance
• As a simple rule, truncate to keep the first half of the DCT coefficients
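The log, DCT and truncation steps can be sketched as below, with made-up filterbank values. dct() is part of the Signal Processing Toolbox, and adding eps before the log is an assumption made here to guard against channels that are exactly zero.

```matlab
% Sketch: log, DCT and truncation for one frame.
K = 20;                               % number of filterbank channels
fbank = (1:K).';                      % stand-in filterbank outputs (assumption)

logFbank = log(fbank + eps);          % eps avoids log(0) on silent channels
c = dct(logFbank);                    % K inputs give K DCT coefficients

numKeep  = K / 2;                     % simple rule: keep the first half
features = c(1:numKeep);              % lowest-order coefficients are kept
```

Changing numKeep is one of the simplest parameters to vary when investigating recognition performance.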
Writing HTK files
• HTK files are binary files and follow a strict structure – a 12-byte header followed by the coefficients (4-byte floats each) – see the HTK manual
% Open file for writing in big-endian byte order:
fid = fopen(filename, 'w', 'ieee-be');
% Write the header information:
fwrite(fid, numVectors, 'int32');    % number of vectors in file (4 byte int)
fwrite(fid, vectorPeriod, 'int32');  % sample period in 100ns units (4 byte int)
fwrite(fid, numDims * 4, 'int16');   % number of bytes per vector (2 byte int)
fwrite(fid, parmKind, 'int16');      % code for the sample kind (2 byte int)
% Write the data, one coefficient at a time:
for i = 1:numVectors
    for j = 1:numDims
        fwrite(fid, data(i, j), 'float32');
    end
end
fclose(fid);
• Further details in the HTK Book
• Use HList to check that the HTK file contains the correct data from MATLAB
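As a complement to HList, the file can also be read back in MATLAB to confirm the header and data. The sketch below writes a tiny example file and then reads it again. The data values, the temporary file name, the 10 ms vector period and the USER parameter kind (code 9) are choices made for this illustration.

```matlab
% Sketch: write a small HTK file, then read it back to check it.
data = [1 2 3; 4 5 6];                % 2 vectors of 3 coefficients (made up)
[numVectors, numDims] = size(data);
vectorPeriod = 100000;                % 10 ms in 100 ns units (assumed)
parmKind = 9;                         % USER parameter kind (assumed choice)
filename = [tempname '.htk'];         % temporary file for the example

fid = fopen(filename, 'w', 'ieee-be');
fwrite(fid, numVectors, 'int32');
fwrite(fid, vectorPeriod, 'int32');
fwrite(fid, numDims * 4, 'int16');
fwrite(fid, parmKind, 'int16');
fwrite(fid, data.', 'float32');       % transpose: each vector stored contiguously
fclose(fid);

fid = fopen(filename, 'r', 'ieee-be');
nSamples   = fread(fid, 1, 'int32');  % number of vectors
sampPeriod = fread(fid, 1, 'int32');  % vector period in 100 ns units
sampSize   = fread(fid, 1, 'int16');  % bytes per vector
kind       = fread(fid, 1, 'int16');  % parameter kind code
readBack   = fread(fid, [sampSize/4, nSamples], 'float32').';
fclose(fid);
fileInfo = dir(filename);             % expect 12 + 4*numVectors*numDims bytes
delete(filename);
```

If readBack matches data and the file size equals the header plus 4 bytes per coefficient, the structure is right. HList remains the authoritative check that HTK itself can read the file.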
Coding ideas
• Important to use a structured approach to feature extraction
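for frameNumber = 1:numFrames
    frame = x(frameStart:frameEnd);   % extract one frame
    % 1. hamming     – apply the Hamming window
    % 2. getMagSpec  – compute the magnitude spectrum
    % 3. filterbank  – apply the filterbank
    % 4. log         – take the log of the channel outputs
    % 5. dct         – apply the DCT
    % 6. truncation  – keep the lower-order coefficients
end
% write parameterised file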
• Each stage can be a MATLAB function with inputs and outputs
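A runnable sketch of this structure is given below, with each stage written as a function handle so that inputs and outputs are explicit. All parameter values, the synthetic input and the rectangular equal-width filterbank are assumptions for illustration. In a real solution each handle would normally be its own function file, and dct() needs the Signal Processing Toolbox.

```matlab
% Sketch: frame-by-frame feature extraction with one handle per stage.
fs = 8000;  t = (0:fs-1).' / fs;
x = sin(2*pi*440*t) + 0.5*sin(2*pi*1200*t);    % stand-in speech signal

frameLength = 256;  frameShift = 128;          % assumed frame settings
K = 20;  numKeep = K / 2;                      % channels, coefficients kept
numBins = frameLength / 2 + 1;

% Hamming window written out explicitly.
window = 0.54 - 0.46 * cos(2*pi*(0:frameLength-1).' / (frameLength-1));

% Rectangular, non-overlapping, equal-width filterbank as a 0/1 matrix.
W = zeros(K, numBins);
binsPerChannel = floor(numBins / K);
for channel = 1:K
    W(channel, (channel-1)*binsPerChannel + 1 : channel*binsPerChannel) = 1;
end

% One handle per stage (hypothetical names).
applyWindow     = @(frame) frame .* window;
getMagSpec      = @(frame) abs(fft(frame));
applyFilterbank = @(spec) W * spec(1:numBins);
truncate        = @(c) c(1:numKeep);

numFrames = floor((numel(x) - frameLength) / frameShift) + 1;
features = zeros(numFrames, numKeep);
for frameNumber = 1:numFrames
    frameStart = (frameNumber - 1) * frameShift + 1;
    frameEnd   = frameStart + frameLength - 1;
    frame = x(frameStart:frameEnd);
    fbank = applyFilterbank(getMagSpec(applyWindow(frame)));
    c = dct(log(fbank + eps));                 % eps guards against log(0)
    features(frameNumber, :) = truncate(c).';
end
```

Keeping the stages separate makes it easy to swap one of them, for example replacing the rectangular filterbank with a mel-scaled one, while leaving the rest of the loop untouched.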
Analysing results
• Sentence level, word level – correct and accuracy
• Hits, substitutions, deletions and insertions
.lab = one two three four
.rec = one three four (deletion error: "two" is missing)
.lab = one two three four
.rec = one two five four (substitution error: "three" recognised as "five")
.lab = one two three four
.rec = one seven two three four (insertion error: "seven" has been added)
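The word-level scores for the three examples can be worked through as below, using the definitions reported by HTK's HResults: %Corr = 100 × H / N and %Acc = 100 × (H − I) / N, where H = N − D − S. The variable names are chosen for this sketch.

```matlab
% Sketch: word-level %Corr and %Acc for the three example errors.
N = 4;                                % words in the reference (.lab)
% Rows: deletion, substitution and insertion examples. Columns: D, S, I.
counts = [1 0 0;
          0 1 0;
          0 0 1];
D = counts(:,1);  S = counts(:,2);  I = counts(:,3);

H = N - D - S;                        % hits
percentCorrect  = 100 * H / N;        % [75; 75; 100]
percentAccuracy = 100 * (H - I) / N;  % [75; 75; 75]
```

The insertion example shows why the two scores differ: every reference word is recognised, so %Corr is 100, but the extra word lowers %Acc to 75.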
Research ideas
• Many variations on design and implementation
• Many parameters to adjust
• Interesting to see the effect that changes make on recognition
accuracy
• Keep training data and test data fixed and adjust one parameter
at a time – can then see the result of that change
• The evaluation should investigate this and show graphs, tables, etc.
• No need to include lots of confusion matrices – they take up a lot of space
• The key number is %Acc – a confusion matrix may help to explain a particular result