Introduction to supervised machine learning¶
In this session we will introduce the use of supervised machine learning on biological data. We will discuss nearest neighour classifier, support vector machine, neural networks and deep learning (specifically convolutional neural networks).
Our toy example will be to classify species as endangered or not based genomic data. The rationale is that species with a small (effective population size) will have higher chances to be threatened. The amount of genomic variability (e.g. polymorphic sites and haplotype diversity) is taken as a proxy for the (effective) population size of each species. The genomic variation at marker loci is represented as an image.
Objective¶
We assume that we have collected some genomic data on many species which have already been categorised into 4 classes of conservation status:
1. least concern
2. vulnerable
3. endangered
4. critically endagered.
Our goal is to implement a classifier that given genomic data can predict whether a given species is endangered or not.
More specifically, we assume we have 4 classes of conservation status (labels: LC, VU, EN, CR) and 400 samples per class. Each data point is an image which represents the (biallelic) genetic variation (binary data) across individuals (on the rows) over several genetic loci (on the columns). Images are double sorted to remove some noise associated to the order of samples and polymorphic sites.
Manipulating images¶
We will use images for our classification. To do so we need to learn how we can import and manipulate images in python. We will use imageio package to load and save images which are stored as numpy objects.
In [ ]:
import imageio
Let’s assume to load and manipulate an image of a cat and learn of this data is managed in python.
In [ ]:
# change the directory accordingly
img = imageio.imread(“~/Documents/Teaching/statistical_inference/Slides/Learning/Pics/cat_generic.jpg”)
print(img)
Images are matrices with 3 dimensions. These are stored as numpy arrays with rank equal to 3. The third dimension is the colour channel. To being able to manipulate such objects we need to learn what numpy arrays are.
Numpy¶
An array is a grid of values, all of the same type. It is indexed by a tuple of non-negative integers. The number of its dimensions is called rank, while the shape is a tuple of integers giving the size along each dimension.
In [ ]:
import numpy as np
a = np.array([1, 2, 3]) # rank=1
print(a)
In [ ]:
print(type(a))
print(a.shape)
print(a[1])
a[0] = 5
print(a)
In [ ]:
b = np.array([[1,2,3],[4,5,6]]) # rank=2
print(b)
In [ ]:
print(b.shape)
b[0,0], b[0,1], b[1,0]
QUESTION (1)
What is the shape of img, the numpy object containing the image of a cat? What is its data type (hint: use .dtype method)? How about its rank? (Possible solutions for DIYs in Solutions/learning.py)
In [ ]:
# …
Now that we know the rank and shape of img we may want to access some of its values. Let’s see how we can perform array indexing in numpy.
In [ ]:
a = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
print(a)
# slicing
b = a[:2,1:3]
print(b)
In [ ]:
# a slice of an array is a view into the same data, so modifying it will modify the original array!
print(a[0,1])
b[0,0] = 100
print(a[0,1])
QUESTION (2)
Extract the top left corner of the cat image (e.g. 200×200) and write a new image. Hint: you can use imageio.imwrite which takes the file name to save and the numpy object. What happens if you just take one slice of the third dimension?
In [ ]:
# imageio.imwrite(…)
In [ ]:
imageio.imwrite(“~/Downloads/cat_slice.jpg”, img[:200, :200, :])
imageio.imwrite(“~/Downloads/cat_grey.jpg”, img[:, :, 0])
Let’s see how to access values which satisfy a condition we set.
In [ ]:
a = np.array([[1,2],[3,4],[5,6]])
print(a > 2)
print(a[a > 2])
We can easily do mathematical operations involving matrices.
In [ ]:
x = np.array([[1,2],[3,4]], dtype=np.float32) # notice that we force the data type to be floats
y = np.array([[5,6],[7,8]], dtype=np.float32)
print(x + y)
print(np.add(x, y))
print(x – y)
#np.subtract(x, y)
print(x * y) # elementwise!
#np.multiply(x, y)
print(x / y) # elementwise!
#np.divide(x, y)
print(np.sqrt(x))
In [ ]:
x = np.array([[1,2],[3,4]])
print(x)
print(x.sum())
print(x.sum(axis=0))
print(x.sum(axis=1))
QUESTION (3)
What happens if we substract the cat image from the cat image? What happens if we add them? Write such images.
In [ ]:
# …
The last think to mention about numpy arrays is how we can manipulate them using broadcasting.
In [ ]:
# multiply a matrix by a constant
x = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
print(x * 2)
In [ ]:
x = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
y = np.array([1,0,1])
print(x.shape)
print(y.shape)
z = x + y
print(z)
QUESTION (4)
Create a tinted image of the cat, by multiplying its colour channel by [1, 0.95, 0.9]. Instead of writing a new file, plot it on the screen. We use the package matplotlib and its function imshow which takes the numpy array in input. You need to force the data type to be unsigned integer 8 bits and you can achieve this with the function np.uint8().
In [ ]:
%matplotlib notebook
import matplotlib.pyplot as plt
# …
# plt.imshow(???)
Training and testing data¶
The first think we do is to load images representing genomic data from various species. These images have already been converted into numpy arrays. (Note that we have already preprocessed our data, so pixel intensitities have been scaled between 0 and 1.)
In [ ]:
X_train = np.load(“Data/X_train.npy”)
print(X_train.shape)
If we want to plot one image, then we need to do a proper slicing.
In [ ]:
print(X_train[0,:,:,0].shape)
plt.imshow(X_train[0,:,:,0], cmap=plt.cm.binary)
which corresponds to the following matrix:
In [ ]:
X_train[0,:,:,0]
Apart from the data itself (called X), we also need labels (called y) associated to each data point. These labels indicate which class each data point belongs to. What is the rank and shape of this numpy array?
In [ ]:
y_train = np.load(“Data/y_train.npy”)
print(y_train.shape)
y_train
If you recall, we collected 400 samples per class so we should expect 1600 data points. We have 1200 entries in the training set. Why?
We split the data we have into training and testing. The learning will be done in the training set and the measurement of accuracy will happen on the testing set. What is the rank and shape of the these numpy objects?
In [ ]:
X_test = np.load(“Data/X_test.npy”)
print(X_test.shape)
y_test = np.load(“Data/y_test.npy”)
print(y_test.shape)
Finally, we want to predict the classification of an unknown species, the Marsican bear (labelled as ursus). Let’s load its image, check its rank and plot it.
In [ ]:
X_ursus = np.load(“Data/X_ursus.npy”)
print(X_ursus.shape)
plt.imshow(X_ursus[0,:,:,0], cmap=plt.cm.binary)
Nearest Neighbour Classifier¶
Now we wan to implement a NN classifier to assign a label to our image of the Marsican bear. The idea is to select the image(s) which is (are) closer to the one of interest. For instance, the simple elementwise difference between images can be calculated as:
In [ ]:
X_ursus[0,:,:,0] – X_train[0,:,:,0]
QUESTION (5) Implement a NN classifier using the L1 distance and predict the label for the Ursus image. A possible framework is given below but feel free to use your own creativity. (A possible solution is given in in Solutions/learning.py.)
In [ ]:
labels = [‘LC’, ‘VU’, ‘EN’, ‘CR’] # these are the 4 possible classes
min_distance = 9999 # initialise a value for the distance
for i in range(0, X_train.shape[0]):
distance = …
if …
…
…
Now you can easily implement a NN classifer applied to the whole testing set. What’s the achieved accuracy? You can do it yourself, but the idea is that you have to execute the above operation for each sample from the testing set.
TensorFlow / keras¶
We will be using TensorFlow to build and train our deep neural network. In particular, we will use Keras which is TensorFlow’s high-level API. Keras is used for fast prototyping and production since it is
• User friendly: a simple, consistent interface optimized for common use cases.
• Modular and composable: models are made by connecting building blocks together.
• Easy to extend: you can write custom building blocks, new layers, loss functions, etc. etc.
tensorflow.keras is the implementation of Keras API specification. tf.keras can run any Keras-compatible code. In this practical, we will just focus on using keras directly.
Sequential model¶
Let’s build a simple model. In Keras, you assemble layers to build models. A model is a graph of layers. The most common type of model is a stack of layers: the keras.Sequential model.
In [ ]:
from keras import models, layers
For instance, we can build a simple, fully-connected network with the following code.
In [ ]:
model = models.Sequential()
# Adds a densely-connected layer with 16 units to the model:
model.add(layers.Dense(16, activation=’relu’))
# Add another:
model.add(layers.Dense(16, activation=’relu’))
# Add a softmax layer with 10 output units:
model.add(layers.Dense(5, activation=’softmax’))
Configure the layers¶
There are several keras.layers available with some common constructor parameters:
• activation: set the activation function for the layer, specified by the name of a built-in function or as a callable object.
• kernel_initializer and bias_initializer: the initialization schemes that create the layer’s weights (kernel and bias).
• kernel_regularizer and bias_regularizer: the regularization schemes that apply the layer’s weights (kernel and bias). The following examples show how to create instances of keras.layers.Dense layers using constructor arguments.
In [ ]:
from keras import activations, regularizers, initializers
In [ ]:
# Create a sigmoid layer:
layers.Dense(16, activation=’sigmoid’)
# Or:
layers.Dense(16, activation=activations.sigmoid)
# A linear layer with L1 regularization of factor 0.01 applied to the kernel matrix:
layers.Dense(16, kernel_regularizer=regularizers.l1(0.01))
# A linear layer with L2 regularization of factor 0.01 applied to the bias vector:
layers.Dense(16, bias_regularizer=regularizers.l2(0.01))
# A linear layer with a kernel initialized to a random orthogonal matrix:
layers.Dense(16, kernel_initializer=’orthogonal’)
# A linear layer with a bias vector initialized to 2.0s:
layers.Dense(16, bias_initializer=initializers.constant(2.0))
Training¶
After the model is constructed, we can configure its learning process by calling the compile method.
In [ ]:
from keras import optimizers
In [ ]:
model = models.Sequential([
# Adds a densely-connected layer with 16 units to the model:
layers.Dense(16, activation=’relu’),
# Add another:
layers.Dense(16, activation=’relu’),
# Add a softmax layer with 5 output units:
layers.Dense(5, activation=’softmax’)])
model.compile(optimizer=optimizers.Adam(0.001),
loss=’categorical_crossentropy’,
metrics=[‘accuracy’])
model.compile takes three main arguments:
• optimizer: it specifies the training procedure, such as Adam, RMSProp, or GradientDescent.
• loss: function to minimize during optimization, such as mean square error (mse), categorical_crossentropy, and binary_crossentropy, specified by name or by passing a callable object from the keras.losses module.
• metrics: to monitor training; string names or callables from the keras.metrics module.
Let’s make a quick example using numpy to train and evaluate our model using the fit method.
In [ ]:
data = np.random.random((1000, 32))
labels = np.random.random((1000, 5))
model.fit(data, labels, epochs=10, batch_size=32)
model.fit takes three main arguments:
• epochs: training is structured into epochs, iterations over the entire input data.
• batch_size: the model slices the data into smaller batches and iterates over these batches during training.
• validation_data: tuple of inputs and labels for validation.
When prototyping a model, you want to easily monitor its performance on some validation data. Passing the validation_data argument allows the model to display the loss and metrics in inference mode for the passed data, at the end of each epoch. Here’s an example using validation_data.
In [ ]:
data = np.random.random((1000, 32))
labels = np.random.random((1000, 5))
val_data = np.random.random((100, 32))
val_labels = np.random.random((100, 5))
model.fit(data, labels, epochs=10, batch_size=32,
validation_data=(val_data, val_labels))
For large datasets or multi-device training you can use the Datasets API.
Evaluate and predict¶
The model.evaluate and model.predict methods can use numpy data (and/or a tensorflow.data.Dataset) to evaluate the inference-mode loss and metrics.
In [ ]:
data = np.random.random((1000, 32))
labels = np.random.random((1000, 5))
model.evaluate(data, labels, batch_size=32)
We can predict the output for the data provided.
In [ ]:
result = model.predict(data, batch_size=32)
print(result.shape)
You can read the full documentation for building custom models and more advanced models. You may also want to check out TensorBoard and Low Level information which is helpful for debugging. A series of tutorials is also available.
Predicting conservation status from genomic data¶
Let’s build our convnet. First thing, we need to define the architecture. To do that, we need to import some modules from keras, as previously seen.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from keras import models, layers, activations, optimizers, regularizers
Let’s load the data.
In [ ]:
X_train = np.load(“Data/X_train.npy”)
y_train = np.load(“Data/y_train.npy”)
X_test = np.load(“Data/X_test.npy”)
y_test = np.load(“Data/y_test.npy”)
X_ursus = np.load(“Data/X_ursus.npy”)
In [ ]:
y_train
As a first example, we will just do a binary classification into either ‘LC’ or ‘VU’ classes for a subset of the data.
In [ ]:
# training data
y2_train = np.zeros(y_train.shape[0], dtype=’float32′)
y2_train[np.where(np.argmax(y_train, axis=1) < 2)[0]] = 1.
# testing data
y2_test = np.zeros(y_test.shape[0], dtype='float32')
y2_test[np.where(np.argmax(y_test, axis=1) < 2)[0]] = 1.
In [ ]:
y2_train
Architecture¶
Building the neural network requires configuring the layers of the model, then compiling the model. The sequential model type from keras is simply a linear stack of neural network layers. Each layer will be added (or stacked) to the initial model.
In [ ]:
model = models.Sequential()
The model needs also to know what input shape it should expect. The first layer (but only the first) needs to receive such information. Let's add a convolutional layer. We can read how to do it from here. Also note that we can add activation and padding layers directly into this convolutional layer.
In [ ]:
model.add(layers.Conv2D(filters=16, kernel_size=(3,3), strides=(1,1), activation="relu", padding="same", input_shape=(64, 64, 1)))
After a convolutional layer it is often used a max-pooling layer. Read its definition here.
In [ ]:
model.add(layers.MaxPooling2D(pool_size=(2,2)))
You can add several cycles of Conv-MaxPool-Dropout. If you then want to move towards the final fully connected neural network, you need to first flatten the network.
In [ ]:
model.add(layers.Flatten())
Finally we can add a fully connected network with a relu activation using a Dense layer, another dropout layer, and then the output layer with a softmax activation.
In [ ]:
model.add(layers.Dense(units=32, activation="relu"))
# we need 1 units at the output since it is a binary classification, also the final activation is 'sigmoid'
model.add(layers.Dense(units=1, activation="sigmoid"))
We can even print a summary of our architecture with the specification of the learnable parameters.
In [ ]:
model.summary()
Compiling¶
We need to "compile" the network and prepare it for the training. We do it by specifying the loss function and optimisation to use.
In [ ]:
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['accuracy'])
Training¶
Let's train our network. We pass the training data set and the network will optimise its parameters to minimise the loss function. We can allocate a portion of the training data as validation data. You can read more here.
In [ ]:
hist = model.fit(X_train, y2_train, batch_size=32, epochs=10, validation_split=0.15)
The validation accuracy is used to tune the hyper-parameters (e.g. learning rate, dropout rate). It's convenient to plot the decay of loss and increase of accuracy for both the training and validation set.
In [ ]:
from matplotlib import rcParams
train_loss = hist.history['loss']
val_loss = hist.history['val_loss']
train_acc = hist.history['acc']
val_acc = hist.history['val_acc']
xc = range(10)
x_axis = np.zeros(len(xc))
for x,i in enumerate (xc):
x_axis[i] = x + 1
rcParams['axes.titlepad'] = 20
plt.figure(1,figsize=(7,5),facecolor='white')
plt.plot(x_axis,train_loss)
plt.plot(x_axis,val_loss)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Training loss and validation loss',fontsize=12)
plt.grid(True)
plt.legend(['Training loss','Validation loss'],fontsize=12)
plt.style.use(['classic'])
plt.figure(2,figsize=(7,5),facecolor='white')
plt.plot(x_axis,train_acc)
plt.plot(x_axis,val_acc)
plt.xlabel('Epoch',fontsize=12)
plt.ylabel('Accuracy',fontsize=12)
plt.title('Training accuracy and validation accuracy',fontsize=12)
plt.grid(True)
plt.legend(['Training accuracy','Validation accuracy'],fontsize=12,loc=4)
plt.style.use(['classic'])
Evaluation¶
Now we need to test our network. In other words we evaluate it using the testing data set.
In [ ]:
score = model.evaluate(X_test, y2_test)
print(score) # will return the test loss and test accuracy
We can even plot a confusion matrix.
In [ ]:
from sklearn.metrics import confusion_matrix
# if you don't have sklearn install it with: conda install scikit-learn
y2_pred = np.where(model.predict(X_test, batch_size=None, verbose=1)<0.5,0.,1.)
cm = confusion_matrix(y2_test, y2_pred)
In [ ]:
classes = ['low', 'high'] # these are the 2 labels
np.set_printoptions(precision=2)
fig = plt.figure(facecolor='white')
title = 'Normalized confusion matrix'
cmap = plt.cm.Blues
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=90, fontsize=8)
#plt.xticks(tick_marks, rotation=45, fontsize=6) # alternative
plt.yticks(tick_marks, classes, fontsize=8)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
Prediction¶
Finally we used the final optimised network to predict the label for unknown entries. More info here.
In [ ]:
y_ursus = model.predict(X_ursus)
print(y_ursus)
The nice thing is that these are proper posterior probabilities. Therefore we can even calculate Bayes factors and do model testing. Imagine that we want to test whether this species is threatened (so not "least concern" or "vulnerable"). We can calculate this Bayes factor as follows:
In [ ]:
(y_ursus[0]) / (1 - y_ursus[0])
BYON (Build Your Own Network)¶
Based on the previous templare, build and train a network to predict the 4 different classes of conservation status. Build your own architecture and compiler using the full data set (X_train and y_train). Test the accuracy on the testing data set. Finally, predict the conservation status of X_ursus.
You need to think carefully about your final layer(s), loss function and validation metrics. metrics. Also, be aware of various tensor sizes and make sure they are correct in your layers and in your predictions.
Some general tips:
• you can use multiple cycles of Conv+MaxPool layers
• you can play with the dropout rates and/or regularizers to prevent overfitting
• you can tune the learning rate
• you can add more/less units (or neurons) in the dense layers
• you can add more/less filters or change their size
• try a leakyReLu or other activation functions
• change how you initialise the weights
• ...
Be aware that the accuracy won't improve by default if you make the network deeper: you have more parameters to optimise with the same data set! It's always a good idea to check the number of parameters to estimate (and size of each layer). Do not overcomplicate things!
Ideally, you may want to try different configurations and retain the one with the highest validation accuracy. Then this network will be passed to the testing set. For instance, assume that you have one hyper-parameter and you want to estimate it. How would you build such pipeline?
The best way of learning is by doing. Read the keras manual https://keras.io/ to learn how to implement your ideas. Pick a partner or more and start a team.
It's a competition.
Good luck!
In [ ]:
# ...