MLEnd_starter_kit
Environment set up
In this section we will set up a Colab environment for the MLEnd mini-project. Before starting, follow these simple instructions:
Go to https://drive.google.com/
Create a folder named ‘Data’ in ‘MyDrive’. On the left, click ‘New’ > ‘Folder’, enter the name ‘Data’, and click ‘create’
Open the ‘Data’ folder and create a folder named ‘MLEnd’.
Move the file ‘trainingMLEnd.csv’ to the newly created folder ‘MyDrive/Data/MLEnd’.
In [ ]:
from google.colab import drive
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os, sys, re, pickle, glob
import urllib.request
import zipfile
#from IPython.display import Audio
import IPython.display as ipd
from tqdm import tqdm
import librosa
drive.mount('/content/drive')
Run the following cell to check that the MLEnd folder contains the file ‘trainingMLEnd.csv’:
In [ ]:
path = '/content/drive/MyDrive/Data/MLEnd'
os.listdir(path)
Data download
In this section we will download the data that you need to build your solutions. Even though we call it the “training” dataset, you can do whatever you want with it, for instance use part of it for validation. We keep a separate dataset for testing purposes, which we won’t share with anyone.
First, we will define a function that will allow us to download a file into a chosen location.
In [ ]:
def download_url(url, save_path):
    # Open the remote file and write its contents to save_path
    with urllib.request.urlopen(url) as dl_file:
        with open(save_path, 'wb') as out_file:
            out_file.write(dl_file.read())
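The function above reads the whole response into memory before writing it. For a large archive, a streamed copy may be gentler on memory. The following is a minimal sketch of that idea; the name `download_url_streamed` and the `chunk_size` parameter are hypothetical and not part of the starter kit.

```python
import shutil
import urllib.request


def download_url_streamed(url, save_path, chunk_size=1024 * 1024):
    """Hypothetical variant of download_url that copies the response in chunks."""
    with urllib.request.urlopen(url) as dl_file:
        with open(save_path, 'wb') as out_file:
            # copyfileobj reads and writes chunk_size bytes at a time,
            # so the whole file never has to fit in memory at once.
            shutil.copyfileobj(dl_file, out_file, chunk_size)
```

It is called in the same way as `download_url(url, save_path)`.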
The next step is to download the file ‘training.zip’ into the folder ‘MyDrive/Data/MLEnd’. Note that this might take a while.
In [ ]:
url = "https://collect.qmul.ac.uk/down?t=6H8231DQL1NGDI9A/613DLM2R3OFV5EEH9INK2OG"
save_path = '/content/drive/MyDrive/Data/MLEnd/training.zip'
download_url(url, save_path)
Finally, let’s unzip the training file.
In [ ]:
directory_to_extract_to = '/content/drive/MyDrive/Data/MLEnd/training/'
with zipfile.ZipFile(save_path, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)
Once this step is completed, you should have all the audio files in the location ‘MyDrive/Data/MLEnd/training/training’.
Understanding our dataset
Let’s check how many audio files we have in our training dataset:
In [ ]:
files = glob.glob('/content/drive/MyDrive/Data/MLEnd/training/*/*.wav')
len(files)
This figure (20k) corresponds to the number of items or samples in our dataset. Let’s listen to some random audio files:
In [ ]:
# five random files
for _ in range(5):
    n = np.random.randint(len(files))  # pick an index that is valid for any number of files
    display(ipd.Audio(files[n]))
Can you recognise the numeral and intonation? Can you recognise the speaker?
Let’s now load the contents of ‘trainingMLEnd.csv’ into a pandas DataFrame and explore them:
In [ ]:
labels = pd.read_csv('/content/drive/MyDrive/Data/MLEnd/trainingMLEnd.csv')
labels
This file consists of 20k rows and 4 columns. Each row corresponds to one of the items in our dataset, and each item is described by four attributes:
File ID (audio file)
Numeral
Participant ID
Intonation
Could you explore this dataset further and identify how many items we have per numeral, per individual and per intonation?
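One way to start on this question is `value_counts`, which counts how often each value appears in a column. The sketch below runs on a toy table rather than the real labels. The column names 'File ID' and 'digit_label' are the ones used later in this notebook, while 'participant' and 'intonation' are assumed names, so check `labels.columns` before adapting it.

```python
import pandas as pd

# Toy stand-in for the labels DataFrame (values are made up for illustration).
# 'participant' and 'intonation' are assumed column names.
toy_labels = pd.DataFrame({
    'File ID': ['0000000.wav', '0000001.wav', '0000002.wav', '0000003.wav'],
    'digit_label': [4, 2, 4, 70],
    'participant': ['S1', 'S2', 'S1', 'S3'],
    'intonation': ['question', 'excited', 'neutral', 'question'],
})

# Number of items per value, for each attribute of interest
for column in ['digit_label', 'participant', 'intonation']:
    print(toy_labels[column].value_counts())

counts_per_numeral = toy_labels['digit_label'].value_counts()
```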
Feature extraction: Pitch
Audio files are complex data types. Specifically, they are discrete signals or time series, consisting of values on a 1D grid. These values are themselves known as samples, which might be a bit confusing, as we have used this term to refer to the items in our dataset. The sampling frequency is the rate at which the samples in an audio file are produced. For instance, a sampling frequency of 5 Hz indicates that 5 samples are produced per second, or 1 sample every 0.2 s.
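The relationship between sample index, sampling frequency and time can be sketched as follows, using the 5 Hz example from the text. The signal length of 10 samples is an arbitrary choice for illustration.

```python
import numpy as np

fs = 5                           # sampling frequency in Hz (example from the text)
n_samples = 10                   # hypothetical signal length
t = np.arange(n_samples) / fs    # time stamp of each sample, in seconds

sampling_period = 1 / fs         # seconds between consecutive samples
duration = n_samples / fs        # total duration in seconds
print(t)
print(sampling_period, duration)
```

The same expression, `np.arange(len(x))/fs`, is used to build the time axis when plotting a real audio signal.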
Let’s plot one of our audio signals:
In [ ]:
n=0
fs = None # Sampling frequency. With sr=None, librosa keeps the file's native rate (22050 Hz is the default only when sr is omitted)
x, fs = librosa.load(files[n],sr=fs)
t = np.arange(len(x))/fs
plt.plot(t,x)
plt.xlabel('time (sec)')
plt.ylabel('amplitude')
plt.show()
display(ipd.Audio(files[n]))
The file that we are listening to is:
In [ ]:
files[n]
Can you recognise the numeral and intonation? Compare them with the values for the numeral and intonation that you can find in the labels DataFrame. By changing the value of n in the previous cell, you can listen to other examples. If you are doing this during one of our lab sessions, please make sure that your mic is muted!
Exactly, how complex is an audio signal? Let’s start by looking at the number of samples in one of our audio files:
In [ ]:
n=0
x, fs = librosa.load(files[n],sr=fs)
print('This audio signal has', len(x), 'samples')
If we use a raw audio signal as a predictor, we will be operating in a predictor space consisting of tens of thousands of dimensions. Compare this figure with the number of items that we have in our dataset. Do we have enough items to train a model that takes one of these audio signals as an input?
One approach is to extract a few features from our signals and use these features as predictors instead. Here we will use four features of an audio signal, namely:
Power.
Pitch mean.
Pitch standard deviation.
Fraction of voiced region.
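As a rough illustration of what these four quantities are, the sketch below computes them on a tiny synthetic signal and a made-up pitch track. It assumes that unvoiced frames are marked with NaN in the pitch track, which is how `librosa.pyin` reports them by default. All values are invented.

```python
import numpy as np

# Synthetic stand-ins: a short signal and a pitch track (in Hz) in which
# NaN marks unvoiced frames.
x = np.array([0.0, 0.5, -0.5, 1.0, -1.0, 0.0])
f0 = np.array([np.nan, 100.0, 120.0, 140.0, np.nan])
voiced_flag = ~np.isnan(f0)

power = np.sum(x**2) / len(x)       # mean squared amplitude
pitch_mean = np.nanmean(f0)         # mean pitch over voiced frames
pitch_std = np.nanstd(f0)           # pitch spread over voiced frames
voiced_fr = np.mean(voiced_flag)    # fraction of frames that are voiced

features = [power, pitch_mean, pitch_std, voiced_fr]
print(features)
```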
In the next cell, we define a new function that gets the pitch of an audio signal.
In [9]:
def getPitch(x,fs,winLen=0.02):
    #winLen = 0.02
    p = winLen*fs  # window length in samples
    frame_length = int(2**int(p-1).bit_length())  # round up to a power of two
    hop_length = frame_length//2  # consecutive frames overlap by half
    f0, voiced_flag, voiced_probs = librosa.pyin(y=x, fmin=80, fmax=450, sr=fs,
                                                 frame_length=frame_length,hop_length=hop_length)
    return f0,voiced_flag
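The frame-length arithmetic in `getPitch` is compact, so it may help to look at it in isolation. The sketch below repeats only that arithmetic for a few sampling frequencies. The helper name `frame_and_hop` is hypothetical and the listed frequencies are just examples.

```python
def frame_and_hop(fs, winLen=0.02):
    """Reproduce the frame_length and hop_length arithmetic used in getPitch."""
    p = winLen * fs                                    # window length in samples
    # For a whole number of samples p, this is the smallest power of two >= p
    frame_length = int(2 ** int(p - 1).bit_length())
    hop_length = frame_length // 2                     # frames overlap by half
    return frame_length, hop_length


for fs in (16000, 22050, 44100):
    print(fs, frame_and_hop(fs))
```

With a 20 ms window, a file sampled at 22050 Hz gives 441 samples per window, which is rounded up to a frame of 512 samples and a hop of 256.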
Let’s now consider the problem of identifying a numeral between 0 and 9. The next cell defines a function that takes a number of files and creates a NumPy array containing the 4 audio features used as predictors (X) and their labels (y).
In [10]:
def getXy(files,labels_file,scale_audio=False, onlySingleDigit=False):
    X,y =[],[]
    for file in tqdm(files):
        fileID = file.split('/')[-1]
        # Look up the numeral for this file in the labels DataFrame
        yi = list(labels_file[labels_file['File ID']==fileID]['digit_label'])[0]
        if onlySingleDigit and yi>9:
            continue
        else:
            fs = None # sr=None keeps the file's native sampling rate
            x, fs = librosa.load(file,sr=fs)
            if scale_audio: x = x/np.max(np.abs(x))
            f0, voiced_flag = getPitch(x,fs,winLen=0.02)
            power = np.sum(x**2)/len(x)
            # If every frame is unvoiced (all NaN), fall back to 0
            pitch_mean = np.nanmean(f0) if np.mean(np.isnan(f0))<1 else 0
            pitch_std = np.nanstd(f0) if np.mean(np.isnan(f0))<1 else 0
            voiced_fr = np.mean(voiced_flag)
            xi = [power,pitch_mean,pitch_std,voiced_fr]
            X.append(xi)
            y.append(yi)
    return np.array(X),np.array(y)
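In `getXy`, the label of each file is found by filtering the whole labels DataFrame, which is repeated for every file. When processing all files, one possible alternative is to build a dictionary from file ID to label once and then look labels up in it. The sketch below shows the idea on a toy table that uses the same two column names. The file names and paths are hypothetical.

```python
import pandas as pd

# Toy labels table using the two column names that getXy relies on.
toy_labels = pd.DataFrame({
    'File ID': ['a.wav', 'b.wav', 'c.wav'],
    'digit_label': [3, 15, 7],
})

# Build the mapping once; each lookup is then a dictionary access instead of
# a scan over the whole DataFrame.
label_of = dict(zip(toy_labels['File ID'], toy_labels['digit_label']))

toy_files = ['/some/folder/a.wav', '/some/folder/c.wav']  # hypothetical paths
y = [label_of[f.split('/')[-1]] for f in toy_files]
print(y)
```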
Let's apply getXy to the first 500 files. Note that the first 500 files contain numerals outside the [0, 9] range, which we will be discarding.
In [ ]:
X,y = getXy(files[:500],labels_file=labels,scale_audio=True, onlySingleDigit=True)
# If you want to use all 20000 files, run next line instead
#X,y = getXy(files,labels_file=labels,scale_audio=True, onlySingleDigit=True)
The next cell shows the shape of X and y and prints the labels vector y:
In [ ]:
print('The shape of X is', X.shape)
print('The shape of y is', y.shape)
print('The labels vector is', y)
Finally, to be on the cautious side, let's eliminate any potential item with a NaN (not a number).
In [ ]:
# If nan sample, remove them
if np.sum(np.isnan(X)):
    idx = np.isnan(X).sum(1)>0
    X = X[~idx]
    y = y[~idx]
print(np.sum(np.isnan(X)))
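To see what this filtering does, here is a minimal sketch on a toy feature matrix in which one row contains a NaN. The arrays `X_toy` and `y_toy` are invented for illustration.

```python
import numpy as np

X_toy = np.array([[0.1, 120.0, 5.0, 0.6],
                  [0.2, np.nan, 4.0, 0.5],
                  [0.3, 110.0, 6.0, 0.7]])
y_toy = np.array([1, 2, 3])

has_nan = np.isnan(X_toy).any(axis=1)   # True for rows with at least one NaN
X_clean = X_toy[~has_nan]               # keep only complete rows
y_clean = y_toy[~has_nan]               # drop the matching labels too
print(X_clean.shape, y_clean)
```

The same mask is applied to both arrays so that predictors and labels stay aligned.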
Modeling: Support Vector Machines
Let’s build a support vector machine (SVM) model for the predictive task of identifying digits in an audio signal, using the dataset that we have just created.
We will use the SVM method provided by scikit-learn and will split the dataset defined by X and y into a training set and a validation set.
In [ ]:
from sklearn import svm
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.3)
X_train.shape, X_val.shape, y_train.shape, y_val.shape
Can you identify the number of items in the training and validation sets?
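To check your reasoning about the split sizes, the following sketch applies the same call to a toy dataset of 150 items. The arrays and the `random_state` value are arbitrary choices for illustration. With `test_size=0.3`, 30% of the items go to the validation set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.zeros((150, 4))     # 150 hypothetical items with 4 features each
y_toy = np.arange(150) % 10    # hypothetical digit labels

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
print(X_tr.shape, X_va.shape, y_tr.shape, y_va.shape)
```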
Let’s now fit an SVM model and print both the training accuracy and the validation accuracy.
In [ ]:
model = svm.SVC(C=1)
model.fit(X_train,y_train)
yt_p = model.predict(X_train)
yv_p = model.predict(X_val)
print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation Accuracy', np.mean(yv_p==y_val))
Compare the training and validation accuracies. What do you observe? What do you think the accuracy of a random classifier would be?
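One way to get a feel for the random-classifier baseline is to simulate it. The sketch below assumes ten equally likely classes, which may not hold exactly for the real data, and draws both the labels and the guesses at random.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_classes = 100_000, 10

y_true = rng.integers(0, n_classes, size=n_items)   # simulated balanced labels
y_rand = rng.integers(0, n_classes, size=n_items)   # uniform random guesses

random_accuracy = np.mean(y_rand == y_true)
print('Random-guess accuracy', random_accuracy)
```

If the classes in your subset are not balanced, comparing against the frequency of the most common label is another reasonable baseline.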
Let’s normalise the predictors, to see if the performance improves.
In [ ]:
mean = X_train.mean(0)
sd = X_train.std(0)
X_train = (X_train-mean)/sd
X_val = (X_val-mean)/sd
model = svm.SVC(C=1,gamma=2)
model.fit(X_train,y_train)
yt_p = model.predict(X_train)
yv_p = model.predict(X_val)
print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation Accuracy', np.mean(yv_p==y_val))
Once again, compare the training and validation accuracies. Do you think this classifier is better than the previous one? What could you do to build a better classifier?
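The manual normalisation above can also be expressed with scikit-learn's `StandardScaler` inside a pipeline, which keeps the training statistics bundled with the model. The sketch below uses randomly generated data (not the MLEnd features) and checks that both approaches scale the validation set identically.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Randomly generated stand-in data with four features on different scales
rng = np.random.default_rng(0)
loc, scale = [0.0, 150.0, 20.0, 0.5], [1.0, 30.0, 5.0, 0.1]
X_tr = rng.normal(loc=loc, scale=scale, size=(80, 4))
X_va = rng.normal(loc=loc, scale=scale, size=(20, 4))
y_tr = rng.integers(0, 10, size=80)

# Manual normalisation using training statistics only
mean, sd = X_tr.mean(0), X_tr.std(0)
X_va_manual = (X_va - mean) / sd

# The same statistics are learned by StandardScaler when the pipeline is fitted
pipe = make_pipeline(StandardScaler(), SVC(C=1, gamma=2))
pipe.fit(X_tr, y_tr)
X_va_scaled = pipe.named_steps['standardscaler'].transform(X_va)

print(np.allclose(X_va_manual, X_va_scaled))
```

With a pipeline, calling `pipe.predict(X_va)` applies the scaling automatically, which reduces the risk of forgetting to normalise new data.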