Deep Learning Classification in Keras
COMP3220 — Document Processing and the Semantic Web
Week 04 Lecture 1: Deep Learning for Text Classification
Diego Mollá
Department of Computer Science, Macquarie University
COMP3220 2021H1
Programme
1 Deep Learning
2 Classification in Keras
Reading
Deep Learning Book Chapters 2, 3, and 6.1.
Additional Reading
Jurafsky & Martin, Chapter 7, "Neural Networks and Neural Language Models" (Section 7.5 will be covered in week 5).
Programme
1 Deep Learning
A Neural Network
Deep Learning
2 Classification in Keras
What is Deep Learning?
Deep learning is an extension of the neural networks first developed in the late 20th century.
The main differences between deep learning and the early neural networks are:
1 A principled manner to combine simple neural network architectures to build complex architectures.
2 Better algorithms to train the architectures.
Besides improvements in the theory, three main drivers of the success of deep learning are:
1 The availability of large training data.
2 The availability of much faster computers.
3 Massively parallel methods that run on specialised hardware, e.g. Graphics Processing Units (GPUs).
Programme
1 Deep Learning
A Neural Network
Deep Learning
2 Classification in Keras
Linear Regression: The Simplest Neural Network
Linear regression is one of the simplest machine learning methods to predict a numerical outcome.
For example, we may want to predict the height of a person based on their age.
Linear regression will try to find the line that best fits the training data:
[Plot: training points of Height against Age, with the fitted line $\text{Height} = \theta_0 + \theta_1 \times \text{Age}$]
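As a preview of the Keras material later in this lecture, here is a minimal sketch of this one-variable regression as a neural network: a single Dense unit with no activation learns $\theta_0$ and $\theta_1$. The age/height numbers, learning rate and epoch count are made-up illustrative choices, not from the slides.

```python
import numpy as np
from tensorflow.keras import models, layers, optimizers

# Made-up training data: age in years, height in cm.
age = np.array([[2.0], [5.0], [8.0], [11.0], [14.0]])
height = np.array([[88.0], [110.0], [128.0], [145.0], [163.0]])

# A single Dense unit with no activation computes height = theta0 + theta1 * age.
model = models.Sequential()
model.add(layers.Dense(1, input_shape=(1,)))
model.compile(optimizer=optimizers.SGD(learning_rate=0.005), loss='mse')

model.fit(age, height, epochs=3000, verbose=0)
print(model.predict(np.array([[10.0]])))  # predicted height at age 10
```

After training, the layer's bias and kernel hold $\theta_0$ and $\theta_1$ respectively.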
Linear Regression with Multiple Variables
For example, we want to predict the value of a house based on two features:
$x_1$: area in square metres.
$x_2$: number of bedrooms.
We can predict the value based on a linear combination of the two features:
$$f(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
where all $\theta_i$ are learnt during the training stage.
[Diagram: a single neuron with inputs $1$, $x_1$, $x_2$, weights $\theta_0$, $\theta_1$, $\theta_2$, and output $y$]
Supervised Machine Learning as an Optimisation Problem
The machine learning approach attempts to learn the parameters of the prediction function that minimise the loss (prediction error) on the training data.
$$\Theta^* = \operatorname{argmin}_\Theta L(X, Y)$$
where $X = \{x^{(1)}, x^{(2)}, \cdots, x^{(n)}\}$ is the training data, and $Y = \{y^{(1)}, y^{(2)}, \cdots, y^{(n)}\}$ are the labels of the training data.
In linear regression:
$$f(x^{(i)}) = \theta_0 + \theta_1 x_1^{(i)} + \cdots + \theta_p x_p^{(i)}$$
$$L(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - f(x^{(i)})\right)^2$$
This loss is the mean squared error.
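A quick NumPy check (with made-up numbers) makes the formula concrete:

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])   # labels y^(i)
f = np.array([2.5, 0.0, 2.0, 8.0])    # predictions f(x^(i))

mse = np.mean((y - f) ** 2)           # (1/n) * sum of squared differences
print(mse)                            # 0.375
```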
Optimisation Problems in Other Approaches
Logistic Regression
Logistic regression is commonly used for classification.
$$f(x^{(i)}) = \frac{1}{1 + e^{-\theta_0 - \theta_1 x_1^{(i)} - \cdots - \theta_p x_p^{(i)}}}$$
$$L(X, Y) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \times \log f(x^{(i)}) + (1 - y^{(i)}) \times \log(1 - f(x^{(i)})) \right]$$
This loss is called cross-entropy.
Support Vector Machines
Initially, SVM was formulated differently, but it can also be seen as:
$$f(x^{(i)}) = \operatorname{sign}(p(x^{(i)})), \quad p(x^{(i)}) = \theta_0 + \theta_1 x_1^{(i)} + \cdots + \theta_p x_p^{(i)}$$
$$L(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \max\{0, 1 - y^{(i)} \times p(x^{(i)})\}$$
This is called the hinge loss.
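Both losses are just as easy to compute directly. A sketch with made-up predictions, assuming labels in {0, 1} for cross-entropy and in {−1, +1} for the hinge loss:

```python
import numpy as np

# Cross-entropy: labels y in {0, 1}, predicted probabilities f in (0, 1).
y = np.array([1, 0, 1, 1])
f = np.array([0.9, 0.2, 0.7, 0.6])
cross_entropy = -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

# Hinge loss: labels y in {-1, +1}, raw scores p(x) before taking the sign.
y_svm = np.array([1, -1, 1, 1])
p_x = np.array([2.1, -0.5, 0.3, 1.2])
hinge = np.mean(np.maximum(0, 1 - y_svm * p_x))

print(cross_entropy, hinge)  # approximately 0.299 and 0.3
```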
Solving the Optimisation Problem
A common approach to find the minimum of the loss function is to find the value where the gradient of the loss function is zero.
This results in a system of equations that can be solved.
System of equations in linear regression
$$\frac{\partial}{\partial \theta_0} \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - \theta_0 - \theta_1 x_1^{(i)} - \cdots - \theta_p x_p^{(i)}\right)^2 = 0$$
$$\frac{\partial}{\partial \theta_1} \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - \theta_0 - \theta_1 x_1^{(i)} - \cdots - \theta_p x_p^{(i)}\right)^2 = 0$$
$$\cdots$$
$$\frac{\partial}{\partial \theta_p} \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - \theta_0 - \theta_1 x_1^{(i)} - \cdots - \theta_p x_p^{(i)}\right)^2 = 0$$
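For linear regression this system has a closed-form solution: setting the gradient to zero leads to the normal equations $(X^T X)\Theta = X^T y$. A minimal NumPy sketch; the data and "true" parameters below are fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # prepend a bias column
true_theta = np.array([1.0, 2.0, -3.0, 0.5])               # fabricated parameters
y = X @ true_theta + 0.1 * rng.normal(size=n)              # noisy targets

# Setting the gradient of the MSE to zero gives (X^T X) theta = X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [1.0, 2.0, -3.0, 0.5]
```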
Gradient Descent
Solving the system of equations $\frac{\partial}{\partial \theta_0} L(X,Y) = 0, \frac{\partial}{\partial \theta_1} L(X,Y) = 0, \ldots$ can be too time-consuming.
e.g. in linear regression, computing the closed-form solution requires solving a linear system in the $p+1$ parameters, which costs $O(p^3)$ for $p$ features.
Some loss functions are very complex (e.g. in deep learning approaches) and it is not practical to attempt to solve the equations analytically at all.
Gradient descent is an iterative approach that searches for a minimum of the loss function.
Gradient Descent Algorithm
1 $\theta_0 = 0, \ldots, \theta_p = 0$
2 Repeat until convergence:
$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} L(X, Y)$ (simultaneously for all $j$)
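A direct translation of this algorithm into NumPy, for linear regression with the MSE loss (the learning rate, iteration count and toy data are ad-hoc choices):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, steps=1000):
    """Gradient descent for linear regression; X includes a leading bias column."""
    n, p = X.shape
    theta = np.zeros(p)                              # step 1: all parameters at zero
    for _ in range(steps):                           # step 2: repeat until convergence
        gradient = -2.0 / n * X.T @ (y - X @ theta)  # gradient of the MSE loss
        theta = theta - alpha * gradient             # update all theta_j simultaneously
    return theta

# Toy usage: y = 1 + 2x, recovered from five noise-free points.
X = np.hstack([np.ones((5, 1)), np.arange(5.0).reshape(-1, 1)])
y = 1.0 + 2.0 * np.arange(5.0)
print(batch_gradient_descent(X, y))  # approaches [1.0, 2.0]
```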
Batch Gradient Descent
There are automated methods to compute the derivatives of many complex loss functions.
This made it possible to develop the current deep learning approaches.
Note, however, that every step of the gradient descent algorithm requires processing the entire training data.
This is what is called batch gradient descent.
Mini-batch Gradient Descent
In mini-batch gradient descent, only part of the training data is used to compute the gradient of the loss function.
The entire data set is partitioned into small batches, and at each step of the gradient descent iterations, only one batch is processed.
If the batch size is 1, this is usually called stochastic gradient descent.
When all batches are processed, we say that we have completed an epoch and start processing the first batch again.
[Diagram: the training data partitioned into batch 1, batch 2, batch 3, batch 4]
Mini-Batch Gradient Descent Algorithm
1 $\theta_0 = 0, \ldots, \theta_p = 0$
2 Repeat until (near) convergence:
1 Shuffle $(X, Y)$ and split it into $m$ mini-batches $(X_1, Y_1), \cdots, (X_m, Y_m)$.
2 For every mini-batch $(X_i, Y_i)$: $\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} L(X_i, Y_i)$
https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
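A NumPy sketch of the same idea with shuffling and mini-batches, again for linear regression with the MSE loss (batch size, learning rate, epoch count and toy data are illustrative):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.02, batch_size=2, epochs=500):
    """Mini-batch gradient descent for linear regression with the MSE loss."""
    n, p = X.shape
    theta = np.zeros(p)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)                # shuffle (X, Y) at each epoch
        for start in range(0, n, batch_size):     # one gradient step per mini-batch
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            gradient = -2.0 / len(idx) * Xb.T @ (yb - Xb @ theta)
            theta = theta - alpha * gradient
    return theta

# With batch_size=1 this becomes stochastic gradient descent.
X = np.hstack([np.ones((6, 1)), np.arange(6.0).reshape(-1, 1)])
y = 1.0 + 2.0 * np.arange(6.0)
print(minibatch_gradient_descent(X, y))  # approaches [1.0, 2.0]
```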
Batch vs. Mini-Batch Gradient Descent
Batch Gradient Descent
At each iteration step we take the most direct path towards reaching a minimum.
The algorithm converges in a relatively small number of steps.
Each step may take a long time to compute (if the training data is large).
Mini-batch Gradient Descent
At each iteration step some random noise is introduced, so we take a path that is only roughly in the direction of the minimum.
The algorithm reaches near convergence in a larger number of steps.
Each step is very quick to compute.
Programme
1 Deep Learning
A Neural Network
Deep Learning
2 Classification in Keras
A Deep Learning Architecture
A deep learning architecture is a large neural network. The principle is the same as with a simple neural network.
1 Define a complex network that generates a complex prediction $f(x_1, x_2, \cdots, x_p)$. This is normally based on simpler building blocks.
2 Define a loss function $L(X, Y)$. There are popular loss functions for classification, regression, etc.
3 Determine the gradient of the loss function. This is done automatically (the sketch below shows how these three steps map onto Keras).
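These three steps map almost one-to-one onto Keras code: the layer stack defines the prediction function, compile attaches a loss and an optimiser, and the gradients are derived and applied automatically during fit. A minimal sketch in the style of the Deep Learning book; the input size, layer widths, optimiser and placeholder data are arbitrary assumptions:

```python
import numpy as np
from tensorflow.keras import models, layers

# Step 1: define the network, i.e. the prediction function f(x1, ..., xp),
# from simpler building blocks (Dense layers here).
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(20,)))
model.add(layers.Dense(1, activation='sigmoid'))

# Step 2: define a loss function; binary cross-entropy is a popular choice
# for binary classification (mean squared error would suit regression).
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=['accuracy'])

# Step 3: the gradient of the loss is determined and applied automatically
# during fit; random placeholder data is used here just to show the call.
x = np.random.random((100, 20))
y = np.random.randint(0, 2, size=(100, 1))
model.fit(x, y, epochs=5, batch_size=32, verbose=0)
```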
A feedforward neural network
a.k.a. multilayer perceptron (MLP)
[Diagram: a feedforward network with inputs $x_1, \ldots, x_4$, hidden layers $h_{11}, h_{12}, h_{13}$ and $h_{21}, h_{22}, h_{23}$, and outputs $o_1, o_2$; weights $\theta$, $\theta'$ and $\theta''$ connect consecutive layers]
$$h_{11} = f_{h_{11}}(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4)$$
$$h_{21} = f_{h_{21}}(\theta'_0 + \theta'_1 h_{11} + \theta'_2 h_{12} + \theta'_3 h_{13})$$
$$o_1 = f_{o_1}(\theta''_0 + \theta''_1 h_{21} + \theta''_2 h_{22} + \theta''_3 h_{23})$$
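A network with this shape (4 inputs, two hidden layers of 3 units each, 2 outputs) could be declared in Keras as follows. The choice of activations is an assumption, since the slide leaves the functions $f_h$ and $f_o$ unspecified:

```python
from tensorflow.keras import models, layers

# Inputs x1..x4, hidden layers h11..h13 and h21..h23, outputs o1, o2.
mlp = models.Sequential()
mlp.add(layers.Dense(3, activation='relu', input_shape=(4,)))  # h11, h12, h13
mlp.add(layers.Dense(3, activation='relu'))                    # h21, h22, h23
mlp.add(layers.Dense(2, activation='softmax'))                 # o1, o2
mlp.summary()  # lists each layer's parameters (the theta, theta', theta'')
```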
Programme
1 Deep Learning
2 Classification in Keras
Classification in Keras
This section is based on the Jupyter notebooks provided by the Deep Learning book: https://github.com/fchollet/deep-learning-with-python-notebooks
Simple classification of numbers.
Binary classification of movie reviews.
Multi-class classification of news wires.
Study these notebooks carefully since they contain important information about how neural networks are constructed and how they operate. The notebooks also introduce important terminology that you need to understand.
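As a taste of the second notebook, here is a condensed sketch of binary classification of IMDB movie reviews, following the book's approach of multi-hot encoding the 10,000 most frequent words (layer sizes and training settings follow the book's notebook):

```python
import numpy as np
from tensorflow.keras import models, layers
from tensorflow.keras.datasets import imdb

# Load the reviews, keeping only the 10,000 most frequent words.
(train_data, train_labels), _ = imdb.load_data(num_words=10000)

def vectorize(sequences, dimension=10000):
    # Multi-hot encoding: one column per vocabulary word.
    results = np.zeros((len(sequences), dimension), dtype='float32')
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

x_train = vectorize(train_data)
y_train = np.asarray(train_labels, dtype='float32')

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)
```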
Take-home Messages
1 Understand the general process in deep learning.
2 Understand the jargon in deep learning: activation, loss, batches, epochs, ...
3 Implement and evaluate a feedforward network in Keras for text classification.
What’s Next
Week 5
Embeddings and Text Sequences
Reading
Deep Learning book, chapter 6.
Additional Reading
Jurafsky & Martin's book, chapter 9 (Section 9.4 may be introduced in week 6).