
20th August 2019
Introduction to machine learning and neural networks applied to biological data
https://bitbucket.org/mfumagal/statistical_inference
Matteo Fumagalli

Intended Learning Outcomes
By the end of this session, you will be able to:
Describe the three key components of a classifier: score function, loss function, optimisation
Identify the elements of a neural network, including neurons and hyper-parameters
Illustrate the specific layers in a neural network for visual recognition
Appreciate the use of deep learning to solve biological problems
Demonstrate how to implement, train and evaluate a deep neural network in Python

Who is the deepest learner? It’s a competition!
The challenge: predict whether a species is endangered, vulnerable or of least concern from genomic data.
The score to beat: 75% by me. The prize: a free drink at the pub.
Ursus arctos marsicanus

Evolution of AI

What do you see?

What does the computer see?
Is it THAT difficult?

Challenges
The classifier must be invariant to the cross product of all these variations, while retaining sensitivity to the inter-class variations.

Data-driven approach
We need a (large) training dataset of labeled images.

Pipeline for image classification
1. Training set: N images of K classes
2. Learning: training a classifier
3. Evaluation: against the ground truth

Nearest Neighbour Classifier
Figure 1: CIFAR-10 dataset: 60k tiny images of 10 classes.
The nearest neighbour classifier will take a test image, compare it to every single one of the training images, and predict the label of the closest training image.
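
A minimal sketch of this procedure in Python with NumPy (names are illustrative; Xtr and ytr are assumed to hold the flattened training images and their labels):

import numpy as np

def predict_nearest_neighbour(Xtr, ytr, x_test):
    # L1 distances between the test image and every training image
    distances = np.sum(np.abs(Xtr - x_test), axis=1)
    # predict the label of the closest training image
    return ytr[np.argmin(distances)]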


The choice of distance
L1 distance: d1(I1, I2) = sum_p |I1_p − I2_p|
L2 distance: d2(I1, I2) = sqrt( sum_p (I1_p − I2_p)^2 )
(sums run over all pixels p)
What’s their accuracy?
What’s human accuracy?
What’s state-of-the-art neural networks’ accuracy?
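
As a rough NumPy sketch of the two distances (I1 and I2 stand in for two flattened images):

import numpy as np

I1 = np.random.rand(3072)   # e.g. a flattened 32x32x3 image
I2 = np.random.rand(3072)

d1 = np.sum(np.abs(I1 - I2))          # L1 (Manhattan) distance
d2 = np.sqrt(np.sum((I1 - I2) ** 2))  # L2 (Euclidean) distance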

k-Nearest Neighbour Classifier
Figure 2: An example of the difference between Nearest Neighbor and a 5-Nearest Neighbor classifier, using 2-dimensional points and 3 classes (red, blue, green).
What value of k should we use? Which distance?

Hyperparameter tuning
Agree or disagree?
The engineer says: ”We should try out many different values and see what works best.”

Validation test
Split your training set into training set and a validation set. Use validation set to tune all hyperparameters.
At the end run a single time on the test set and report performance.
The good engineer says: ”Evaluate on the test set only a single time, at the very end.”
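
A minimal sketch of this workflow, assuming NumPy arrays Xtr, ytr and a hypothetical KNearestNeighbour classifier with train/predict methods:

import numpy as np

# hold out the last 1,000 training examples as a validation set
Xtr_fit, ytr_fit = Xtr[:-1000], ytr[:-1000]
Xval, yval = Xtr[-1000:], ytr[-1000:]

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 10, 20, 50, 100]:
    model = KNearestNeighbour(k=k)                 # hypothetical classifier
    model.train(Xtr_fit, ytr_fit)
    acc = np.mean(model.predict(Xval) == yval)     # validation accuracy
    if acc > best_acc:
        best_k, best_acc = k, acc
# only after choosing best_k: run once on the test set and report performance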

Data splits
Figure 3: The training set is split into folds: 1-4 become the training set while 5 is the validation set used to tune the hyperparameters.
Where is the Nearest Neighbour classifier spending most of its (computational) time?

Wrap up
the problem of image classification: predicting labels for novel test entries
training set vs testing set
a simple Nearest Neighbor classifier requires hyperparameters
validation set to tune hyperparameters
Nearest Neighbor classifier has low accuracy (distances based on raw pixel values!) and is expensive at testing
Our aim: a solution which gives 90% accuracy, discards the training set once learning is complete, and evaluates a test image in less than a millisecond!

Linear classification
New approach based on:
score function to map raw data to class scores
loss function to quantify the agreement between predicted and true labels

Parameterised mapping from images to label scores
Our aim is to define the score function that maps the pixel values of an image to confidence scores for each class.
Assuming that:
N images, each with dimensionality D, and K distinct classes
xi ∈ R^D is the i-th image with dimensionality D and label yi, with i = 1…N and yi ∈ {1, …, K}
then we define a score function: f : R^D → R^K

Linear classifier
Linear mapping: f(xi; W, b) = W xi + b
W are called weights and b is the bias vector. What are the dimensions of xi , W and b?

Linear classifier
Linear mapping: f(xi; W, b) = W xi + b
W are called weights and b is the bias vector.
What are the dimensions of xi, W and b?
xi has size [D x 1]
W has size [K x D]
b has size [K x 1]
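
In NumPy the mapping is a single matrix-vector product; a minimal sketch with the shapes above (random numbers stand in for learned parameters):

import numpy as np

D, K = 3072, 10                   # e.g. CIFAR-10: 32x32x3 pixels, 10 classes
xi = np.random.rand(D)            # one flattened image, shape (D,)
W = np.random.randn(K, D) * 0.01  # weights, shape (K, D)
b = np.zeros(K)                   # biases, shape (K,)
scores = W.dot(xi) + b            # class scores, shape (K,)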

Linear classifier

Interpreting a linear classifier (i)

Interpreting a linear classifier (ii)

Interpreting a linear classifier (iii)
Template (or prototype) matching.

Bias trick
Append a constant 1 to every input vector xi and fold the bias b into W as an extra column.
Our new score function: f(xi; W) = W xi
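
A small sketch of the trick, reusing the xi, W and b arrays from the previous sketch:

import numpy as np

xi_ext = np.append(xi, 1.0)                # (D+1,): input with a constant 1
W_ext = np.hstack([W, b.reshape(-1, 1)])   # (K, D+1): bias folded into W
scores = W_ext.dot(xi_ext)                 # identical to W.dot(xi) + b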

Loss function*
To measure our ”unhappiness” with predicted outcomes.

Loss function*
To measure our ”unhappiness” with predicted outcomes.
* sometimes called cost function or objective

Multiclass Support Vector Machine (SVM) loss
The SVM loss is set so that the SVM ”wants” the correct class for each image to have a higher score than the incorrect ones by some fixed margin.
Li = sum_{j ≠ yi} max(0, sj − syi + δ)
Example:
s = [13, −7, 11], yi = 0, δ = 10, Li = ?

Multiclass Support Vector Machine (SVM) loss
The SVM loss is set so that the SVM ”wants” the correct class for each image to have a higher score than the incorrect ones by some fixed margin.
Li = sum_{j ≠ yi} max(0, sj − syi + δ)
Example:
s = [13, −7, 11], yi = 0, δ = 10, Li = 8
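
As a sanity check, a minimal sketch that reproduces the worked example:

import numpy as np

def svm_loss_i(s, yi, delta=10.0):
    # hinge terms for all incorrect classes j != yi
    margins = np.maximum(0, s - s[yi] + delta)
    margins[yi] = 0                      # do not count the correct class
    return np.sum(margins)

print(svm_loss_i(np.array([13.0, -7.0, 11.0]), yi=0))  # prints 8.0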

Hinge loss
The zero-thresholded term max(0, −) is called the hinge loss; its squared version max(0, −)^2 is the squared hinge loss.

Regularisation
If W correctly classifies each sample, then all λW with λ > 1 will have zero loss.
Which W should we choose?

Regularisation
If W correctly classifies each sample, then all λW with λ > 1 will have zero loss.
Which W should we choose?
Our new multiclass SVM loss function is:
L = (1/N) sum_i sum_{j ≠ yi} [ max(0, f(xi; W)_j − f(xi; W)_{yi} + δ) ] + λ sum_k sum_l W_{k,l}^2
including one data loss term and one regularisation loss term λR(W), here an L2 penalty.
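
Sketched in code (reusing svm_loss_i from above; X, Y, W, b and the regularisation strength lam are assumed to be already defined):

data_loss = np.mean([svm_loss_i(W.dot(x) + b, y) for x, y in zip(X, Y)])
reg_loss = lam * np.sum(W ** 2)    # L2 penalty on all entries of W
L = data_loss + reg_loss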

Softmax classifier
Generalisation of the binary logistic regression classifier to multiple classes.
Cross-entropy loss function:
Li = −log( e^{f_{yi}} / sum_j e^{f_j} )
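
A minimal, numerically stable sketch for a single example (f is the vector of class scores):

import numpy as np

def softmax_loss_i(f, yi):
    f = f - np.max(f)                    # shift scores for numerical stability
    p = np.exp(f) / np.sum(np.exp(f))    # softmax probabilities
    return -np.log(p[yi])                # cross-entropy loss for example i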

Probabilistic interpretation of Softmax scores
P(yi | xi; W) = e^{f_{yi}} / sum_j e^{f_j}
Likelihood or Bayesian?

SVM vs. Softmax classifier

Wrap up
A score function maps image pixels to class scores (using a linear function that depends on W and b).
Once learning is done, we can discard the training data, and prediction is fast.
A loss function (e.g. SVM and Softmax) measures how compatible a given set of parameters is with respect to the ground truth labels in the training dataset.
How do we determine (optimise) the parameters that give the lowest loss?

Key components for image classification
1 score function
2 loss function
3 optimisation
Optimisation is the process of finding the set of parameters W that minimise the loss function L.

Visualising the loss function
If W0 is a random starting point and W1 a random direction, compute L(W0 + aW1) for different values of a
(with the loss averaged across all images xi)

Optimisation
Random search
Random local search
Gradient descent (numerical or analytical)
df(x)/dx = lim_{h→0} [ f(x + h) − f(x) ] / h
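
A rough sketch of the numerical gradient using the forward-difference formula above (loss_fn is any function of the 1-D parameter vector w):

import numpy as np

def numerical_gradient(loss_fn, w, h=1e-5):
    grad = np.zeros_like(w)
    fw = loss_fn(w)
    for i in range(w.size):
        w_step = w.copy()
        w_step[i] += h                        # perturb one dimension
        grad[i] = (loss_fn(w_step) - fw) / h  # forward difference
    return grad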

Hyperparameters
Step size or learning rate
Batch size:
Compute the gradient over batches (e.g. 32, 64, 128…) of the training data.
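
A sketch of the resulting update loop, assuming hypothetical helpers sample_batch and gradient_of_loss and an existing weight matrix W:

learning_rate = 1e-3        # step size (hyper-parameter)
batch_size = 64             # batch size (hyper-parameter)

for step in range(1000):
    X_batch, y_batch = sample_batch(X_train, y_train, batch_size)  # hypothetical helper
    grad_W = gradient_of_loss(W, X_batch, y_batch)                 # hypothetical helper
    W = W - learning_rate * grad_W      # vanilla gradient descent update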

Backpropagation
We can compute the gradient analytically using the chain rule. Take f(x, y, z) = (x + y)z.
With q = x + y and f = qz:
df/dx = (df/dq) (dq/dx)
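
A tiny worked sketch of this computation, forward pass then gradients via the chain rule (example values are arbitrary):

x, y, z = -2.0, 5.0, -4.0
# forward pass
q = x + y               # q = 3
f = q * z               # f = -12
# backward pass (chain rule)
df_dq = z               # df/dq = z
dq_dx = 1.0             # dq/dx = 1
df_dx = df_dq * dq_dx   # df/dx = z = -4
df_dy = df_dq * 1.0     # df/dy = z = -4
df_dz = q               # df/dz = x + y = 3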

Wrap up
The 3 elements: score function, loss function, optimisation. Next: let’s put them all together in a neural network.

Neurons

Activation functions
The activation function defines the firing rate of the neuron.
The sigmoid non-linearity squashes real numbers to the range [0, 1].
Rectified Linear Unit (ReLU): f(x) = max(0, x)
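
Both activations in a couple of NumPy lines:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

def relu(x):
    return np.maximum(0, x)           # thresholds activations at zero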

Neural network architecture
Collection of neurons connected in an acyclic graph. Last output layer represents class scores.
A 2-layer Neural Network Size:

Neural network architecture
Collection of neurons connected in an acyclic graph. Last output layer represents class scores.
A 2-layer Neural Network
Size: 4 + 2 = 6 neurons, [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.
A 3-layer Neural Network Size:

Neural network architecture
Collection of neurons connected in an acyclic graph. Last output layer represents class scores.
A 2-layer Neural Network
Size: 4 + 2 = 6 neurons, [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.
A 3-layer Neural Network
Size: 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32 weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters.
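
A minimal sketch of a forward pass through the 3-layer network above (input size 3, two hidden layers of 4 units, one output; random weights for illustration only):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 12 weights, 4 biases
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)   # 16 weights, 4 biases
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)   # 4 weights, 1 bias

x = rng.normal(size=3)            # input vector
h1 = sigmoid(W1.dot(x) + b1)      # first hidden layer
h2 = sigmoid(W2.dot(h1) + b2)     # second hidden layer
out = W3.dot(h2) + b3             # output score
# 32 weights + 9 biases = 41 learnable parameters in total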

Representational power
Given any continuous function f (x ) and some ε > 0, there exists a Neural Network g(x;W) with one hidden layer (with a reasonable choice of non-linearity, e.g. sigmoid) such that for all x,
|f(x) − g(x)| < ε. In other words, the neural network can approximate any continuous function. In practice, more layers work better.

Setting up the architecture
Capacity vs. ?

Setting up the architecture
Capacity vs. overfitting: we aim at a better generalisation.

Setting up the data
Data preprocessing:
mean subtraction
normalisation
PCA and whitening

Setting up the model
Weights' initialisation:
all zero
small random numbers
calibrate the variances
sparse

Setting up the model
Regularisation
Options: L2, L1, maxnorm and dropout.

Dropout
Dropout can be interpreted as sampling a Neural Network within the full Neural Network, and only updating the parameters of the sampled network based on the input data.

Setting up the model
Loss functions:
SVM (hinge loss)
cross-entropy
hierarchical softmax
attribute classification
for regression

Setting up the learning
effects of different learning rates
loss decay

Setting up the learning
Training vs. validation accuracy

Wrap up
Neural Networks are made of layers of neurons/units with activation functions
Choice of the architecture: capacity vs. overfitting
Preprocessing of the data and choice of hyperparameters for the model and learning

What about images?
Can we use neural networks straight from images? What's the issue?

Convolutional Neural Networks (CNN)
A CNN arranges its neurons in three dimensions (width, height, depth).
Every layer of a CNN transforms the 3D input volume to a 3D output volume of neuron activations.
Neurons in a layer are connected only to a small region of the layer before it.

CNN architecture
Convolutional Layer
Pooling Layer
Fully-Connected Layer
We will stack these layers to form a full CNN architecture.

Convolutional layer
Set of learnable filters which slide across the width and height of the input volume.
3 hyper-parameters: depth (number of filters), stride, zero-padding.

Convolutional layer
Accepts a volume of size W1 x H1 x D1
Requires 4 parameters: number of filters K, their size F, the stride S, the amount of zero padding P
Produces a volume of size W2 x H2 x D2 where:
W2 = (W1 − F + 2P)/S + 1
H2 = (H1 − F + 2P)/S + 1
D2 = K
Usually: F = 3, S = 1, P = 1 (see the output-size sketch at the end of this section).
Demo: http://cs231n.github.io/convolutional-networks/

Weights in filters
Example filters learned by Krizhevsky et al. Each of the 96 filters shown here is of size [11 x 11 x 3], and each one is shared by the 55 x 55 neurons in one depth slice.

Pooling layer
Size will influence the proportion of weights retained.

Layer patterns
INPUT → [[CONV → RELU]*N → POOL?]*M → [FC → RELU]*K → FC
More examples: http://cs231n.github.io/convolutional-networks/

Applications to biological data
From various sources:
-omics (gen-, transcript-, epigen-, prote-, metabol-)
bioimaging (cellular images, ...)
medical images (clinical imaging)
brain/body machine interfaces (ECG, EEG, ...)

Applications to biological data
Omics
Mining DNA/RNA sequence data to:
identify splice junctions
classify somatic point mutation-based cancers
predict DNA- and RNA-binding motifs
relate disease-associated variants to gene expression
estimate DNA methylation patterns
...

Splice junctions
Deep neural networks outperform other methods.
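
Referring back to the convolutional-layer formulas above, a minimal sketch of the output-volume computation (the printed values are purely illustrative):

def conv_output_size(W1, H1, D1, K, F, S, P):
    # spatial size of the output volume of a convolutional layer
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K                      # one activation map per filter
    return W2, H2, D2

print(conv_output_size(32, 32, 3, K=96, F=3, S=1, P=1))   # (32, 32, 96)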
Open issues
theory of deep learning is not completely understood, making outcomes difficult to interpret
susceptible to misclassification and overclassification
uncertainty in building architectures
bootstrapping not possible

Future perspectives
improving theoretical foundations on the basis of experimental data
assessment of the model's computational complexity and learning efficiency
novel data visualisation tools in biology: reduce data redundancy and extract novel information
ad hoc computational infrastructures

Wrap up
A CNN arranges its neurons in three dimensions
Different types of layers (convolution, pooling, ...)
Weights in filters are learned
Promising applications to biological data.

TensorFlow
easy and intuitive way to do ML

TensorFlow
concept-heavy but code-light
many parameters, but only a few are important to adjust

TensorFlow
Low-level APIs
What is a tensor?

Examples of data tensors
vector data: 2D tensors of shape (samples, features)
timeseries or sequence data: 3D tensors of shape (samples, timesteps, features)
images: 4D tensors of shape (samples, height, width, channels)
video: 5D tensors of shape (samples, frames, height, width, channels)
The first axis is the sample or batch dimension.

Image data
A batch of 128 colour images can be stored in a 4D tensor with shape (128, 256, 256, 3).

Anatomy of a neural network

Build your first neural network
1 collect and preprocess a dataset: most of the actual work
2 build your model: a few lines of code
3 train: one line of code
4 evaluate: one line of code
5 predict: one line of code
source: Get started with TensorFlow's High-Level APIs (Google I/O '18)

Step 1: collect a dataset
Import the data and spend a lot of time asking questions on: rank, shape, number of objects, printing, format, data type, ...: very basic questions!

Step 1: collect a dataset
70,000 28x28 grayscale images in 10 categories of clothing articles

Step 2: build a model
start simple! do not overlearn the training set.
define the loss function
define the optimisation (important, but defaults are good)
When you choose a network topology, you constrain your space of possibilities (hypothesis space) to a specific series of tensor operations, mapping input data to output data.

Step 3: train the model
Only ”epochs” (and ”batch size”) matter here.
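
Putting steps 2 to 5 together, a minimal Keras sketch for the Fashion-MNIST clothing images described above (layer sizes and training settings are illustrative choices, not necessarily those used in the lecture):

from tensorflow import keras

# step 1: load the 70,000 28x28 grayscale clothing images
(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0    # rescale pixels to [0, 1]

# step 2: build and compile the model (a few lines of code)
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# step 3: train (one line; only epochs and batch size really matter here)
model.fit(x_train, y_train, epochs=5, batch_size=64)

# step 4: evaluate (one line)
test_loss, test_acc = model.evaluate(x_test, y_test)

# step 5: predict (one line)
predictions = model.predict(x_test)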
Step 4: evaluate

Step 5: predict
https://www.tensorflow.org/tutorials/keras/basic_classification

Keras
”Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.”

Intended Learning Outcomes
At the end of this session, you are now able to:
Describe the three key components of a classifier: score function, loss function, optimisation
Identify the elements of a neural network, including neurons and hyper-parameters
Illustrate the specific layers in a neural network for visual recognition
Appreciate the use of deep learning to solve biological problems
Demonstrate how to implement, train and evaluate a deep neural network in Python

IUCN Red List of Threatened Species
LC: least concern
EN: endangered
VU: vulnerable
CR: critically endangered

Population genetics

Genomic data
haplotypes/individuals on rows, genomic positions on columns

CNN applied to population genomic data

Who is the deepest learner? It's a competition!
The challenge: predict whether a species is endangered, vulnerable or of least concern from genomic data.
The score to beat: 75% by me. The prize: a free drink at the pub.
Ursus arctos marsicanus

Well done!
You are all data scientists* now!
*data science: statistics done by non-statisticians