20th August 2019
Introduction to machine learning and neural networks applied to biological data
https://bitbucket.org/mfumagal/statistical_inference
Matteo Fumagalli
Intended Learning Outcomes
By the end of this session, you will be able to:
Describe the three key components of a classifier: score function, loss function, optimisation
Identify the elements of a neural network, including neurons and hyper-parameters
Illustrate the specific layers in a neural network for visual recognition
Appreciate the use of deep learning to solve biological problems
Demonstrate how to implement, train and evaluate a deep neural network in Python
Who is the deepest learner? It’s a competition!
The challenge: predict whether a species is endangered, vulnerable or of least concern from genomic data.
The score to beat: 75% by me. The prize: a free drink at the pub.
Ursus arctos marsicanus
Evolution of AI
What do you see?
What does the computer see?
Is it THAT difficult?
Challenges
The classifier must be invariant to the cross product of all these variations, while retaining sensitivity to the inter-class variations.
Data-driven approach
We need a (large) training dataset of labeled images.
Pipeline for image classification
1. Training set: N images of K classes
2. Learning: training a classifier
3. Evaluation: against the ground truth
Nearest Neighbour Classifier
Figure 1: CIFAR-10 dataset: 60k tiny images of 10 classes.
The nearest neighbour classifier will take a test image, compare it to every single one of the training images, and predict the label of the closest training image.
The choice of distance
L1 distance: $d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|$
L2 distance: $d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}$
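As a minimal sketch (illustrative NumPy code, with toy pixel arrays standing in for flattened images), the two distances could be computed as:

import numpy as np

# Two images flattened into 1D arrays of raw pixel values (toy values, for illustration)
I1 = np.array([12, 200, 55, 31], dtype=np.float64)
I2 = np.array([10, 190, 60, 25], dtype=np.float64)

d1 = np.sum(np.abs(I1 - I2))          # L1 (Manhattan) distance
d2 = np.sqrt(np.sum((I1 - I2) ** 2))  # L2 (Euclidean) distance
print(d1, d2)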
What’s their accuracy?
What’s human accuracy?
What’s state-of-the-art neural networks’ accuracy?
k-Nearest Neighbour Classifier
Figure 2: An example of the difference between Nearest Neighbor and a 5-Nearest Neighbor classifier, using 2-dimensional points and 3 classes (red, blue, green).
What value of k should we use? Which distance?
Hyperparameter tuning
Agree or disagree?
The engineer says: ”We should try out many different values and see what works best.”
Validation set
Split your training set into a (smaller) training set and a validation set. Use the validation set to tune all hyperparameters.
At the end run a single time on the test set and report performance.
The good engineer says: ”Evaluate on the test set only a single time, at the very end.”
Data splits
Figure 3: The training set is split into folds: 1-4 become the training set while 5 is the validation set used to tune the hyperparameters.
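A rough sketch of this procedure for tuning k in a Nearest Neighbour classifier (synthetic data and a toy knn_predict helper, both illustrative, not the lecture's code):

import numpy as np

# Toy data standing in for flattened images: 500 samples, 16 "pixels", 3 classes (synthetic)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
y = rng.integers(0, 3, size=500)

# Split the training data into a (smaller) training fold and a validation fold
Xtr, ytr = X[:400], y[:400]
Xval, yval = X[400:], y[400:]

def knn_predict(Xtr, ytr, Xte, k):
    # L1 distances between every test and training point, then a majority vote over the k nearest
    dists = np.abs(Xte[:, None, :] - Xtr[None, :, :]).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(ytr[idx]).argmax() for idx in nearest])

# Try several values of k and keep the one with the best validation accuracy
for k in [1, 3, 5, 10]:
    print(k, np.mean(knn_predict(Xtr, ytr, Xval, k) == yval))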
Where is the Nearest Neighbour classifier spending most of its (computational) time?
Wrap up
the problem of image classification: predicting labels for novel test entries
training set vs testing set
a simple Nearest Neighbor classifier requires hyperparameters
validation set to tune hyperparameters
Nearest Neighbor classifier has low accuracy (distances based on raw pixel values!) and is expensive at testing
Our aim: a solution which gives 90% accuracy, discards the training set once learning is complete, and evaluates a test image in less than a millisecond!
Linear classification
New approach based on:
score function to map raw data to class scores
loss function to quantify the agreement between predicted and true labels
Parameterised mapping from images to label scores
Our aim is to define the score function that maps the pixel values of an image to confidence scores for each class.
Assuming that:
N images, each with dimensionality D, and K distinct classes
xi ∈ RD is the i-th image with dimensionality D and label yi, with i = 1…N and yi ∈ 1…K
then we define a score function: f : RD → RK
Linear classifier
Linear mapping: f(xi; W, b) = W xi + b
W are called weights and b is the bias vector. What are the dimensions of xi , W and b?
xi has size [D x 1]
W has size [K x D]
b has size [K x 1]
Interpreting a linear classifier (i)
Interpreting a linear classifier (ii)
Interpreting a linear classifier (iii)
Template (or prototype) matching.
Bias trick
With the bias trick (append a constant 1 to xi and fold b into W as an extra column), our new score function becomes: f(xi; W) = W xi
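A small NumPy sketch of the score function and the bias trick (shapes chosen for CIFAR-10-like inputs; the values are random placeholders):

import numpy as np

D, K = 3072, 10                            # e.g. 32*32*3 pixels and 10 classes
x = np.random.randn(D)                     # one flattened image, [D x 1]
W = 0.001 * np.random.randn(K, D)          # weights, [K x D]
b = np.zeros(K)                            # biases, [K x 1]

scores = W.dot(x) + b                      # f(x; W, b) = Wx + b

# Bias trick: append a constant 1 to x and fold b into W as an extra column
x_ext = np.append(x, 1.0)                  # [(D+1) x 1]
W_ext = np.hstack([W, b.reshape(K, 1)])    # [K x (D+1)]
scores_ext = W_ext.dot(x_ext)              # identical scores, f(x; W) = Wx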
Loss function*
To measure our ”unhappiness” with predicted outcomes.
* sometimes called cost function or objective
Multiclass Support Vector Machine (SVM) loss
The SVM loss is set so that the SVM "wants" the correct class for each image to have a higher score than the incorrect ones by some fixed margin.
$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \delta)$
Example:
$s = [13, -7, 11]$, $y_i = 0$, $\delta = 10$
$L_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10) = 0 + 8 = 8$
Hinge loss
The max(0, −) term is called the hinge loss; the squared version max(0, −)² is the squared hinge loss.
Regularisation
If W correctly classifies each sample, then all λW with λ > 1 will have zero loss.
Which W should we choose?
Our new multiclass SVM loss function is:
$L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \delta) \right] + \lambda \sum_{k,l} W_{k,l}^2$
comprising a data loss term and a regularisation loss term λR(W), here an L2 penalty.
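A sketch of this loss in NumPy (vectorised over a batch; the shapes, example arrays and the lam value are illustrative assumptions):

import numpy as np

def svm_loss(W, X, y, delta=1.0, lam=0.1):
    # X: [N x D] data, y: [N] correct class indices, W: [K x D] weights
    scores = X.dot(W.T)                                 # [N x K] class scores
    correct = scores[np.arange(len(y)), y][:, None]     # score of the correct class, per sample
    margins = np.maximum(0, scores - correct + delta)   # hinge on every class
    margins[np.arange(len(y)), y] = 0                   # do not count j = y_i
    data_loss = margins.sum() / len(y)                  # average data loss over N samples
    reg_loss = lam * np.sum(W ** 2)                     # L2 regularisation loss
    return data_loss + reg_loss

W = 0.01 * np.random.randn(3, 5)
X = np.random.randn(4, 5)
y = np.array([0, 2, 1, 0])
print(svm_loss(W, X, y))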
Softmax classifier
Generalisation of the binary logistic regression classifier to multiple classes.
Cross-entropy loss function:
$L_i = -\log \left( \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \right)$
Probabilistic interpretation of Softmax scores
$P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}$
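A sketch of the Softmax classifier's cross-entropy loss in NumPy (the score-shifting line is a standard trick for numerical stability; the example scores reuse the SVM example above):

import numpy as np

def softmax_cross_entropy(f, y):
    # f: [N x K] class scores, y: [N] correct class indices
    f = f - f.max(axis=1, keepdims=True)                    # shift scores for numerical stability
    p = np.exp(f) / np.exp(f).sum(axis=1, keepdims=True)    # softmax probabilities P(y_i | x_i; W)
    return -np.log(p[np.arange(len(y)), y]).mean()          # average cross-entropy loss

scores = np.array([[13.0, -7.0, 11.0]])
print(softmax_cross_entropy(scores, np.array([0])))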
Likelihood or Bayesian?
SVM vs. Softmax classifier
Wrap up
A score function maps image pixels to class scores (using a linear function that depends on W and b).
Once learning is done, we can discard the training data, and prediction is fast.
A loss function (e.g. SVM and Softmax) measures how compatible a given set of parameters is with respect to the ground truth labels in the training dataset.
How do we determine (optimise) the parameters that give the lowest loss?
Key components for image classification
1 score function
2 loss function
3 optimisation
Optimisation is the process of finding the set of parameters W that minimise the loss function L.
Visualising the loss function
If W0 is a random starting point and W1 a random direction, then compute L(W0 + aW1) for different values of a.
(averaged across all images, xi )
Optimisation
Random search
Random local search
Gradient descent (numerical or analytical)
$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$
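A direct translation of this definition into a (slow but simple) numerical gradient, sketched in NumPy with a toy function:

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # Approximate df/dx_i at x using the definition above, one dimension at a time
    fx = f(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_step = x.copy()
        x_step[i] += h
        grad[i] = (f(x_step) - fx) / h
    return grad

print(numerical_gradient(lambda x: np.sum(x ** 2), np.array([1.0, -2.0, 3.0])))  # roughly [2, -4, 6]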
Hyperparameters
Step size or learning rate
Batch size:
Compute the gradient over batches (e.g. 32, 64, 128…) of the training data.
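To illustrate how these two hyperparameters enter the update loop, here is a toy mini-batch gradient descent on a simple squared-error problem (synthetic data; not the image-classification loss, just the update mechanics):

import numpy as np

# Synthetic regression data: y = X w_true + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X.dot(w_true) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
learning_rate = 0.1      # step size (hyperparameter)
batch_size = 64          # mini-batch size (hyperparameter)

for step in range(200):
    idx = rng.integers(0, len(y), size=batch_size)      # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T.dot(Xb.dot(w) - yb) / batch_size    # gradient of the mean squared error on the batch
    w -= learning_rate * grad                           # parameter update
print(w)                                                # close to w_true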
Backpropagation
We can compute the gradient analytically using the chain rule: $f(x, y, z) = (x + y)z$
with $q = x + y$ and $f = qz$
$\frac{df}{dx} = \frac{df}{dq} \frac{dq}{dx}$
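Worked out in code, with illustrative input values:

# Forward pass for f(x, y, z) = (x + y)z
x, y, z = -2.0, 5.0, -4.0
q = x + y           # q = 3
f = q * z           # f = -12

# Backward pass: apply the chain rule
dfdq = z            # df/dq = z
dfdz = q            # df/dz = q
dfdx = dfdq * 1.0   # df/dx = (df/dq)(dq/dx), with dq/dx = 1
dfdy = dfdq * 1.0   # df/dy = (df/dq)(dq/dy), with dq/dy = 1
print(dfdx, dfdy, dfdz)   # -4.0 -4.0 3.0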
Wrap up
The 3 elements: score function, loss function, optimisation. Next: let’s put them all together in a neural network.
Neurons
Activation functions
The activation function defines the firing rate of the neuron.
The sigmoid non-linearity squashes real numbers to the range [0, 1].
Rectified Linear Unit (ReLU): f(x) = max(0, x)
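Both activations in a couple of lines of NumPy (for illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes real numbers into (0, 1)

def relu(x):
    return np.maximum(0, x)           # f(x) = max(0, x)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), relu(x))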
Neural network architecture
Collection of neurons connected in an acyclic graph. Last output layer represents class scores.
A 2-layer Neural Network
Size: 4 + 2 = 6 neurons, [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.
A 3-layer Neural Network
Size: 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32 weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters.
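A forward pass of the 2-layer network above, sketched in NumPy (random weights, purely illustrative):

import numpy as np

x = np.random.randn(3)                          # input vector with 3 features
W1, b1 = np.random.randn(4, 3), np.zeros(4)     # hidden layer: 4 x 3 = 12 weights, 4 biases
W2, b2 = np.random.randn(2, 4), np.zeros(2)     # output layer: 2 x 4 = 8 weights, 2 biases

h = np.maximum(0, W1.dot(x) + b1)               # hidden activations (ReLU non-linearity)
scores = W2.dot(h) + b2                         # class scores from the output layer
print(scores)                                   # 12 + 8 = 20 weights and 6 biases in total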
Representational power
Given any continuous function f(x) and some ε > 0, there exists a Neural Network g(x; W) with one hidden layer (with a reasonable choice of non-linearity, e.g. sigmoid) such that for all x, |f(x) − g(x)| < ε.
In other words, the neural network can approximate any continuous function.
In practice, more layers work better...
Setting up the architecture
Capacity vs. overfitting
We aim for better generalisation.
Setting up the data
Data preprocessing:
mean subtraction
normalisation
PCA and whitening
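A minimal preprocessing sketch (X is a synthetic [N x D] data matrix here; PCA/whitening omitted):

import numpy as np

X = np.random.rand(100, 3072) * 255                 # toy data: one flattened image per row

X_centred = X - X.mean(axis=0)                      # mean subtraction (per feature)
X_normalised = X_centred / X_centred.std(axis=0)    # normalisation to unit standard deviation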
Setting up the model
Weights’ initialisation:
all zeros
small random numbers
calibrating the variances
sparse initialisation
Setting up the model
Regularisation
Options: L2, L1, maxnorm and dropout.
Dropout
Dropout can be interpreted as sampling a Neural Network within the full Neural Network, and only updating the parameters of the sampled network based on the input data.
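A sketch of (inverted) dropout applied to one hidden layer's activations, assuming a keep probability p (toy activations, for illustration):

import numpy as np

p = 0.5                                     # probability of keeping a unit (hyperparameter)
h = np.random.randn(4)                      # hidden-layer activations

# Training time: randomly drop units and rescale so expected activations are unchanged
mask = (np.random.rand(*h.shape) < p) / p   # inverted-dropout mask
h_train = h * mask

# Test time: use all units, with no extra scaling thanks to the inverted formulation
h_test = h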
Setting up the model
Loss functions:
SVM (hinge loss)
cross-entropy
hierarchical softmax
attribute classification
regression
Setting up the learning
effects of different learning rates
loss decay
Setting up the learning
Training vs. validation accuracy
Wrap up
Neural Networks are made of layers of neurons/units with activation functions
Choice of the architecture: capacity vs overfitting
Preprocessing of the data and choice of hyperparameters for the model and learning
What about images? Can we use neural networks directly on raw images? What's the issue?
Convolutional Neural Networks (CNN)
A CNN arranges its neurons in three dimensions (width, height, depth). Every layer of a CNN transforms the 3D input volume to a 3D output volume of neuron activations. Neurons in a layer are connected only to a small region of the layer before it.
CNN architecture
Convolutional Layer
Pooling Layer
Fully-Connected Layer
We will stack these layers to form a full CNN architecture.
Convolutional layer
A set of learnable filters, each of which slides across the width and height of the input volume.
3 hyper-parameters: depth (number of filters), stride, zero-padding.
Convolutional layer
Accepts a volume of size W1 x H1 x D1
Requires 4 parameters: number of filters K, their size F, the stride S, the amount of zero padding P
Produces a volume of size W2 x H2 x D2 where:
W2 = (W1 − F + 2P)/S + 1
H2 = (H1 − F + 2P)/S + 1
D2 = K
Usually: F = 3, S = 1, P = 1.
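These formulas can be checked with a few lines of Python (the 32 x 32 x 3 input and 12 filters are just an example):

def conv_output_size(W1, H1, D1, K, F, S, P):
    # Output volume of a convolutional layer, following the formulas above
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    return W2, H2, D2

print(conv_output_size(32, 32, 3, K=12, F=3, S=1, P=1))   # (32, 32, 12): spatial size preserved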
Demo: http://cs231n.github.io/convolutional-networks/
Weights in filters
Example filters learned by Krizhevsky et al. Each of the 96 filters shown here is of size [11 x 11 x 3], and each one is shared by the 55 x 55 neurons in one depth slice.
Pooling layer
The pooling size determines how much the spatial dimensions are downsampled, i.e. the proportion of activations retained.
Layer patterns
INPUT → [[CONV → RELU]*N → POOL?]*M → [FC → RELU]*K → FC
More examples:
http://cs231n.github.io/convolutional-networks/
Applications to biological data
From various sources:
-omics (gen-, transcript-, epigen-, prote-, metabol-)
bioimaging (cellular images, ...)
medical images (clinical imaging)
brain/body machine interfaces (ECG, EEG, ...)
Applications to biological data
Omics
Mining DNA/RNA sequence data to:
identify splice junctions
classify cancers based on somatic point mutations
predict DNA- and RNA-binding motifs
relate disease-associated variants to gene expression
estimate DNA methylation patterns
...
Splice junctions
Deep neural networks outperform other methods.
Open issues
the theory of deep learning is not completely understood, making outcomes difficult to interpret
susceptible to misclassification and overclassification
uncertainty in building architectures
bootstrapping not possible
Future perspectives
improving theoretical foundations on the basis of experimental data
assessment of model’s computational complexity and learning efficiency
novel data visualization tools
in biology: reduce data redundancy and extract novel information
ad hoc computational infrastructures
Wrap up
A CNN arranges its neurons in three dimensions
Different types of layers (convolution, pooling, ...)
Weights in filters are learned
Promising applications to biological data.
TensorFlow
easy and intuitive way to do ML
TensorFlow
concept-heavy but code-light
many parameters, but only few are important to adjust
TensorFlow
Low-level APIs
What is a tensor?
Examples of data tensors
vector data: 2D tensors of shape (samples, features)
timeseries or sequence data: 3D tensors of shape (samples, timesteps, features)
images: 4D tensors of shape (samples, height, width, channels)
video: 5D tensors of shape (samples, frames, height, width, channels)
The first axis is the sample or batch dimension.
Image data
A batch of 128 colour images can be stored in a 4D tensor with shape (128, 256, 256, 3).
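In NumPy, such a batch and the very basic questions about it (rank, shape, data type) look like this (the zero-filled array is a placeholder for real images):

import numpy as np

images = np.zeros((128, 256, 256, 3), dtype=np.float32)   # 128 colour images of 256 x 256 pixels
print(images.ndim)    # rank: 4
print(images.shape)   # (128, 256, 256, 3): the first axis is the sample/batch dimension
print(images.dtype)   # float32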
Anatomy of a neural network
Build your first neural network
1 collect and preprocess a dataset: most of the actual work
2 build your model: few lines of code
3 train: one line of code
4 evaluate: one line of code
5 predict: one line of code
source: Get started with TensorFlow’s High-Level APIs (Google I/O ’18)
Step 1: collect a dataset
Import the data and spend a lot of time asking very basic questions about it: rank, shape, number of objects, printing, format, data type, ...
Step 1: collect a dataset
70,000 28x28 grayscale images in 10 categories of clothing articles
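The dataset (Fashion-MNIST) ships with Keras, so step 1 can be sketched roughly as in the TensorFlow tutorial linked below (the 60,000/10,000 split is the standard one):

import tensorflow as tf

# Load 70,000 28x28 grayscale images of clothing articles in 10 categories
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data()

print(train_images.shape, train_images.dtype)   # (60000, 28, 28) uint8
print(test_images.shape)                        # (10000, 28, 28)

# Scale pixel values from [0, 255] to [0, 1] before feeding them to the network
train_images = train_images / 255.0
test_images = test_images / 255.0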
Step 2: build a model
start simple! do not overfit the training set.
define loss function
define optimization (important but defaults are good)
When you choose a network topology, you constrain your space of possibilities (hypothesis space) to a specific series of tensor operations, mapping input data to output data.
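A sketch of such a simple model in Keras, following the structure of the linked tutorial (the layer sizes are illustrative choices, not prescribed):

import tensorflow as tf

# Flatten the 28x28 image, one small hidden layer, one softmax output layer
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # 28x28 image -> vector of 784 pixels
    tf.keras.layers.Dense(128, activation='relu'),    # hidden layer
    tf.keras.layers.Dense(10, activation='softmax')   # one probability per clothing class
])

# Loss function and optimisation algorithm (the defaults are a good starting point)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])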
Step 3: train the model
Only ”epochs” (and ”batch size”) matter here.
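With the model and data above, training is indeed one line (5 epochs and a batch size of 32 are illustrative values):

model.fit(train_images, train_labels, epochs=5, batch_size=32)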
Step 4: evaluate
Step 5: predict
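Evaluation and prediction, continuing the sketch above (run the evaluation on the test set only once, at the very end):

# Step 4: evaluate on the held-out test set
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(test_acc)

# Step 5: predict one probability per class for each test image
predictions = model.predict(test_images)
print(predictions[0].argmax())   # predicted class of the first test image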
https://www.tensorflow.org/tutorials/keras/basic_classification
Keras
”Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.”
Intended Learning Outcomes
At the end of this session, you should now be able to:
Describe the three key components of a classifier: score function, loss function, optimisation
Identify the elements of a neural network, including neurons and hyper-parameters
Illustrate the specific layers in a neural network for visual recognition
Appreciate the use of deep learning to solve biological problems
Demonstrate how to implement, train and evaluate a deep neural network in Python
IUCN Red List of Threatened Species
LC: least concern
EN: endangered
VU: vulnerable
CR: critically endangered
Population genetics
Genomic data
haplotypes/individuals on rows, genomic positions on columns
CNN applied to population genomic data
Who is the deepest learner? It’s a competition!
The challenge: predict whether a species is endangered, vulnerable or of least concern from genomic data.
The score to beat: 75% by me. The prize: a free drink at the pub.
Ursus arctos marsicanus
Well done!
You are all data scientists* now!
*data science: statistics done by non-statisticians