Deep learning¶
Our toy example will be to classify species as endangered or not based genomic data. The rationale is that species with a small (effective population size) will have higher chances to be threatened. The amount of genomic variability (e.g. polymorphic sites and haplotype diversity) is taken as a proxy for the (effective) population size of each species. The genomic variation at marker loci is represented as an image.
Objective¶
We assume that we have collected some genomic data on many species which have already been categorised into 4 classes of conservation status: least concern, vulnerable, endangered, critically endagered. Our goal is to implement a classifier that given genomic data can predic whether that species is endangered or not.
Now the know how to process and manipulate images with python, we can actually do some science! As explained before our goal is to build a classifier to predict whether a certain species is endangered or not. We use genetic information as proxy for the ability of the species to react to novel conditions.
We assume we have 4 classes of conservation status (LC, VU, EN, CR) and 400 samples per class. Each data point is an image which represents the (biallelic) genetic variation (so it’s binary) across individuals (on the rows) over several genetic loci (on the columns). Images are double sorted to remove some noise associated to the order of samples and polymorphic sites.
In [ ]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
Training and testing data¶
The first think we do is to load images representing genomic data from various species. These images have already been converted into numpy arrays. (Note that we have already preprocessed our data, so pixel intensitities have been scaled between 0 and 1.)
In [ ]:
X_train = np.load(“Data/X_train.npy”)
print(X_train.shape)
If we want to plot one image, then we need to do a proper slicing.
In [ ]:
print(X_train[0,:,:,0].shape)
plt.imshow(X_train[0,:,:,0], cmap=plt.cm.binary)
which corresponds to the following matrix:
In [ ]:
X_train[0,:,:,0]
Apart from the data itself (called X), we also need labels (called y) associated to each data point. These labels indicate which class each data point belongs to. What is the rank and shape of this numpy array?
In [ ]:
classes = [“LC”, “VU”, “EN”, “CR”]
In [ ]:
y_train = np.load(“Data/y_train.npy”)
print(y_train.shape)
y_train
In [ ]:
for idx, elem in enumerate(classes):
index = np.where(np.argmax(y_train, axis=1)==idx)[0][0]
print(idx, elem, index)
plt.figure()
plt.title(elem)
plt.imshow(X_train[index,:,:,0], cmap=plt.cm.binary)
If you recall, we collected 400 samples per class so we should expect 1600 data points. We have 1200 entries in the training set. Why?
We split the data we have into training and testing. The learning will be done in the training set and the measurement of accuracy will happen on the testing set. What is the rank and shape of the these numpy objects?
In [ ]:
X_test = np.load(“Data/X_test.npy”)
print(X_test.shape)
y_test = np.load(“Data/y_test.npy”)
print(y_test.shape)
Finally, we want to predict the classification of an unknown species, the Marsican bear (labelled as ursus). Let’s load its image, check its rank and plot it.
In [ ]:
X_ursus = np.load(“Data/X_ursus.npy”)
print(X_ursus.shape)
plt.imshow(X_ursus[0,:,:,0], cmap=plt.cm.binary)
keras / TensorFlow¶
We will be using TensorFlow to build and train our deep neural network. In particular, we will use Keras which is TensorFlow’s high-level API. Keras is used for fast prototyping and production since it is
• User friendly: a simple, consistent interface optimized for common use cases.
• Modular and composable: models are made by connecting building blocks together.
• Easy to extend: you can write custom building blocks, new layers, loss functions, etc. etc.
tf.keras is the implementation of Keras API specification. tf.keras can run any Keras-compatible code.
Predicting conservation status¶
Let’s build our convnet. First thing, we need to define the architecture. To do that, we need to import some modules from keras. Instead of using tensorflow, we can use the high-levekl API keras with a tensforlow backend.
In [ ]:
import keras
In [ ]:
X_train = np.load(“Data/X_train.npy”)
y_train = np.load(“Data/y_train.npy”)
X_test = np.load(“Data/X_test.npy”)
y_test = np.load(“Data/y_test.npy”)
X_ursus = np.load(“Data/X_ursus.npy”)
Architecture¶
Building the neural network requires configuring the layers of the model, then compiling the model. The sequential model type from keras is simply a linear stack of neural network layers. Each layer will be added (or stacked) to the initial model.
In [ ]:
model = keras.Sequential()
The model needs also to know what input shape it should expect. The first layer (but only the first) needs to receive such information. Let’s add a convolutional layer. We can read how to do it from here. Also note that we can add activation and padding layers directly into this convolutional layer.
In [ ]:
model.add(keras.layers.Conv2D(filters=16, kernel_size=(3,3), strides=(1,1), activation=”relu”, padding=”same”, input_shape=(64, 64, 1)))
After a convolutional layer it is often used a max-pooling layer. Read its definition here.
In [ ]:
model.add(keras.layers.MaxPooling2D(pool_size=(2,2)))
To prevent overfitting, one trick is to use a Dropout layer.
In [ ]:
model.add(keras.layers.Dropout(rate=0.5))
You can add several cycles of Conv-MaxPool-Dropout. If you then want to move towards the final fully connected neural network, you need to first flatten the network.
In [ ]:
model.add(keras.layers.Flatten())
Finally we can add a fully connected network with a relu activation using a Dense layer, another dropout layer, and then the output layer with a softmax activation.
In [ ]:
model.add(keras.layers.Dense(units=32, activation=”relu”))
# we need 4 units at the output since we have 4 classes
model.add(keras.layers.Dense(units=4, activation=”softmax”))
We can even print a summary of our architecture with the specification of the learnable parameters.
In [ ]:
model.summary()
In [ ]:
from keras.utils import plot_model
plot_model(model, to_file=’model.png’)
# xdg-open model.png
Compiling¶
We need to “compile” the network and prepare it for the training. We do it by specifying the loss function and optimisation to use.
In [ ]:
model.compile(optimizer=keras.optimizers.Adam(0.001),
loss=’categorical_crossentropy’,
metrics=[‘accuracy’])
Training¶
Let’s train our network. We pass the training data set and the network will optimise its parameters to minimise the loss function. We can allocate a portion of the training data as validation data. You can read more here.
In [ ]:
hist = model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.20)
The validation accuracy is used to tune the hyper-parameters (e.g. learning rate, dropout rate). It’s convenient to plot the decay of loss and increase of accuracy for both the training and validation set.
In [ ]:
from matplotlib import rcParams
train_loss = hist.history[‘loss’]
val_loss = hist.history[‘val_loss’]
train_acc = hist.history[‘acc’]
val_acc = hist.history[‘val_acc’]
xc = range(10)
x_axis = np.zeros(len(xc))
for x,i in enumerate (xc):
x_axis[i] = x + 1
rcParams[‘axes.titlepad’] = 20
plt.figure()
plt.plot(x_axis,train_loss)
plt.plot(x_axis,val_loss)
plt.xlabel(‘Epoch’, fontsize=12)
plt.ylabel(‘Loss’, fontsize=12)
plt.title(‘Training loss and validation loss’,fontsize=12)
plt.grid(True)
plt.legend([‘Training loss’,’Validation loss’],fontsize=12)
plt.style.use([‘classic’])
In [ ]:
plt.figure()
plt.plot(x_axis,train_acc)
plt.plot(x_axis,val_acc)
plt.xlabel(‘Epoch’,fontsize=12)
plt.ylabel(‘Accuracy’,fontsize=12)
plt.title(‘Training accuracy and validation accuracy’,fontsize=12)
plt.grid(True)
plt.legend([‘Training accuracy’,’Validation accuracy’],fontsize=12,loc=4)
plt.style.use([‘classic’])
Evaluation¶
Now we need to test our network. In other words we evaluate it using the testing data set.
In [ ]:
score = model.evaluate(X_test, y_test)
print(score) # will return the test loss and test accuracy
We can even plot a confusion matrix.
In [ ]:
from sklearn.metrics import confusion_matrix
# if you don’t have sklearn install it with: conda install scikit-learn
Y_pred = model.predict(X_test, batch_size=None, verbose=1)
y_pred = np.argmax(Y_pred, axis=1)
classes = [“LC”, “VU”, “EN”, “CR”] # these are the 4 possible classes
cm = confusion_matrix(np.argmax(y_test,axis=1), y_pred)
np.set_printoptions(precision=2)
fig = plt.figure(facecolor=’white’)
title = ‘Normalized confusion matrix’
cmap = plt.cm.Blues
cm = cm.astype(‘float’) / cm.sum(axis=1)[:, np.newaxis]
plt.imshow(cm, interpolation=’nearest’, cmap=cmap)
plt.title(title)
plt.colorbar()
#plt.colorbar(shrink=0.7) # alternative
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=90, fontsize=8)
#plt.xticks(tick_marks, rotation=45, fontsize=6) # alternative
plt.yticks(tick_marks, classes, fontsize=8)
plt.tight_layout()
plt.ylabel(‘True label’)
plt.xlabel(‘Predicted label’)
Prediction¶
Finally we used the final optimised network to predict the label for unknown entries. More info here.
In [ ]:
y_ursus = model.predict(X_ursus)
print(classes)
print(y_ursus)
The nice thing is that these are proper posterior probabilities. Therefore we can even calculate Bayes factors and do model testing. Imagine that we want to test whether this species is threatened (so not “least concern”). We can calculate this Baye factor as follows:
In [ ]:
(1 – y_ursus[0,0])/(3/4) / (y_ursus[0,0])/(1/4)