COMP3220 — Document Processing and the Semantic Web
Week 04 Lecture 1: Deep Learning for Text Classification
Diego Mollá COMP3220 2021H1
Abstract
Deep learning has recently achieved spectacular results in several text processing applications. In this lecture we will introduce the basics of deep learning and how it can be applied to text classification.
Contents
1 Deep Learning
1.1 A Neural Network
1.2 Deep Learning
2 Classification in Keras
Update March 14, 2021
Reading
Deep Learning Book, Chapters 2 and 3, and Section 6.1.
Additional Reading
Jurafsky & Martin, Chapter 7 "Neural Networks and Neural Language Models" (7.5 will be covered in week 5).
1 Deep Learning
What is Deep Learning?
Deep learning is an extension to the neural networks first developed during the late 20th century. The main differences between deep learning and the early neural networks are:
1. A principled way to combine simple neural network architectures into complex architectures.
2. Better algorithms to train the architectures.
Besides improvements in the theory, three main drivers of the success of deep learning are:
1. The availability of large training data.
2. The availability of much faster computers.
3. Massive parallel methods that use specialised hardware.
– Graphics Processing Units (GPUs).
1.1 A Neural Network
Linear Regression: The Simplest Neural Network
Linear regression is one of the simplest machine learning methods for predicting a numerical outcome. For example, we may want to predict the height of a person based on their age.
Based on the training data, linear regression will try to find the line that best fits the training data:
Height = θ0 + θ1 Age
[Figure: the fitted line, with Age on the horizontal axis and Height on the vertical axis.]
In this example, the neural network needs to learn the values of θ0 and θ1 that generate the line that best fits the training data.
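As a minimal sketch of what "finding the line that best fits the training data" means, the following fits θ0 and θ1 by least squares with NumPy. The ages and heights are invented for illustration only, and `predict_height` is a hypothetical helper name, not part of the lecture material.

```python
import numpy as np

# Hypothetical training data (invented for illustration): age in years, height in cm.
age = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
height = np.array([86.0, 102.0, 115.0, 128.0, 138.0])

# Design matrix with a column of ones, so that theta_0 plays the role of the intercept.
A = np.column_stack([np.ones_like(age), age])

# Least-squares fit: the theta that minimises the sum of squared prediction errors.
theta, *_ = np.linalg.lstsq(A, height, rcond=None)
theta0, theta1 = theta


def predict_height(a):
    """Height = theta0 + theta1 * Age, using the fitted parameters."""
    return theta0 + theta1 * a
```

Calling `predict_height(7.0)` then returns the height that the fitted line assigns to a seven-year-old in this toy data set.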
Linear Regression with Multiple Variables
For example, we want to predict the value of a house based on two features:
– x1: Area in square metres.
– x2: Number of bedrooms.
We can predict the value based on a linear combination of the two features:
$f(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$
where all θi are learnt during the training stage.
[Figure: a single-node network with inputs 1, x1 and x2, weighted by θ0, θ1 and θ2 respectively, and output y.]
Supervised Machine Learning as an Optimisation Problem
The machine learning approach will attempt to learn the parameters of the learning function that minimise the loss (prediction error) in the training data.
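$\Theta = \arg\min_{\Theta} L(X, Y)$
– $X = \{x^{(1)}, x^{(2)}, \cdots, x^{(n)}\}$ is the training data, and
– $Y = \{y^{(1)}, y^{(2)}, \cdots, y^{(n)}\}$ are the labels of the training data.
In linear regression:
– $f(x^{(i)}) = \theta_0 + \theta_1 x_1^{(i)} + \cdots + \theta_p x_p^{(i)}$
– $L(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}) \right)^2$. This loss is the mean squared error.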
The notation used above is as follows:
Θ stands for all the parameters θ0, θ1, θ2, ...
$x^{(1)}$ represents the first sample in the training data, $x^{(2)}$ represents the second sample, and so on. Each sample is a vector of features, so the sample at position i is represented as $x^{(i)} = (x_1^{(i)}, x_2^{(i)}, \cdots, x_p^{(i)})$.
$y^{(i)}$ represents the label of the sample at position i, that is, the label of $x^{(i)}$.
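To make the notation concrete, here is a small sketch that evaluates the linear prediction and the mean squared error for a fixed Θ. The house data and the parameter values are made up for illustration, and the function names `predict` and `mse_loss` are assumptions of this sketch.

```python
import numpy as np


def predict(theta, X):
    """f(x) = theta_0 + theta_1 x_1 + ... + theta_p x_p, for each row (sample) of X."""
    return theta[0] + X @ theta[1:]


def mse_loss(theta, X, y):
    """Mean squared error over the n samples in X."""
    return np.mean((y - predict(theta, X)) ** 2)


# Hypothetical data: each row is a sample x^(i) = (area in square metres, bedrooms).
X = np.array([[50.0, 1.0], [80.0, 2.0], [120.0, 3.0]])
y = np.array([200.0, 310.0, 450.0])  # made-up house values

theta = np.array([10.0, 3.0, 20.0])  # arbitrary theta_0, theta_1, theta_2
loss = mse_loss(theta, X, y)         # every prediction is off by 20, so the loss is 400
```

Training amounts to searching for the `theta` that makes `loss` as small as possible.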
Optimisation Problems in Other Approaches
Logistic Regression
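Logistic regression is commonly used for classification.
$L(X, Y) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log f(x^{(i)}) + (1 - y^{(i)}) \log \left( 1 - f(x^{(i)}) \right) \right]$
This loss is called cross-entropy.
Where
$f(x^{(i)}) = \frac{1}{1 + e^{-\theta_0 - \theta_1 x_1^{(i)} - \cdots - \theta_p x_p^{(i)}}}$
Support Vector Machines
Initially, SVM was formulated differently, but it can also be seen as:
$f(x^{(i)}) = \mathrm{sign}\left( p(x^{(i)}) \right)$, where $p(x^{(i)}) = \theta_0 + \theta_1 x_1^{(i)} + \cdots + \theta_p x_p^{(i)}$
$L(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \max\{0, 1 - y^{(i)} \times p(x^{(i)})\}$
This is called the hinge loss.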
The symbol on this slide means that its content is supplementary and will not be asked in the exam.
The only thing that you need to learn from this slide is that there are multiple types of loss functions, and a particular machine learning problem will need to use the appropriate one. We will return to this later, but in general:
The mean squared error is used for regression problems.
Cross-entropy is used in classification problems.
The hinge loss is used when we want to implement Support Vector Machines.
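The two classification losses can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than a reference implementation: labels are taken to be 0 or 1 for cross-entropy and -1 or +1 for the hinge loss, the clipping constant `eps` is added here only to avoid log(0), and the scores are invented.

```python
import numpy as np


def sigmoid(z):
    """Logistic function used by logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))


def cross_entropy(y, f, eps=1e-12):
    """Binary cross-entropy; y in {0, 1}, f is the predicted probability of class 1."""
    f = np.clip(f, eps, 1.0 - eps)  # assumption of this sketch: guard against log(0)
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))


def hinge(y, p):
    """Hinge loss; y in {-1, +1}, p is the raw score theta_0 + theta_1 x_1 + ..."""
    return np.mean(np.maximum(0.0, 1.0 - y * p))


# Made-up raw scores for three samples, with the same labels in both encodings.
scores = np.array([2.0, -1.0, 0.5])
labels_01 = np.array([1.0, 0.0, 0.0])
labels_pm = np.array([1.0, -1.0, -1.0])

ce = cross_entropy(labels_01, sigmoid(scores))
hl = hinge(labels_pm, scores)
```

Only the third sample is on the wrong side of the margin, so it is the only one that contributes to the hinge loss, whereas every sample contributes something to the cross-entropy.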
3
Solving the Optimisation Problem
A common approach to finding the minimum of the loss function is to find the parameter values where the gradient of the loss function is zero.
This results in a system of equations that can be solved.
System of equations in linear regression
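$\frac{\partial}{\partial\theta_0} \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \theta_0 - \theta_1 x_1^{(i)} - \cdots - \theta_p x_p^{(i)} \right)^2 = 0$
$\frac{\partial}{\partial\theta_1} \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \theta_0 - \theta_1 x_1^{(i)} - \cdots - \theta_p x_p^{(i)} \right)^2 = 0$
···
$\frac{\partial}{\partial\theta_p} \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \theta_0 - \theta_1 x_1^{(i)} - \cdots - \theta_p x_p^{(i)} \right)^2 = 0$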
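For linear regression, setting every partial derivative to zero gives a linear system in Θ, usually written in matrix form as $A^\top A \, \Theta = A^\top y$, where A is the training data with an extra column of ones. The sketch below solves that system directly on synthetic, noise-free data. The data, the name `true_theta` and the variable names are assumptions made for illustration.

```python
import numpy as np

# Synthetic, noise-free data generated from known parameters (illustration only).
rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.normal(size=(n, p))
true_theta = np.array([1.0, 2.0, -3.0])  # theta_0, theta_1, theta_2
y = true_theta[0] + X @ true_theta[1:]

# Column of ones so that theta_0 is handled like any other parameter.
A = np.column_stack([np.ones(n), X])

# Solve the linear system obtained by setting the gradient of the MSE to zero.
theta = np.linalg.solve(A.T @ A, A.T @ y)

# Gradient of the mean squared error at the solution; it should be (numerically) zero.
grad = (-2.0 / n) * A.T @ (y - A @ theta)
```

Because the data contain no noise, `theta` recovers `true_theta` up to rounding error.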
Gradient Descent
Solving the system of equations $\frac{\partial}{\partial\theta_0} L(X, Y) = 0$, $\frac{\partial}{\partial\theta_1} L(X, Y) = 0$, ... can be too time-consuming.
e.g. in linear regression, the complexity of computing the formula that solves the system of equations is $O(n^3)$.
Some loss functions are very complex (e.g. in deep learning approaches) and it is not practical to attempt to solve the equations at all.
Gradient descent is an iterative approach that finds the minimum of the loss function.
Gradient Descent Algorithm
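1. $\theta_0 = 0, \ldots, \theta_p = 0$
2. Repeat until convergence: $\theta_j = \theta_j - \alpha \frac{\partial}{\partial\theta_j} L(X, Y)$ for every $j = 0, \ldots, p$, where $\alpha$ is the learning rate (the step size).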
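The two steps above can be sketched for linear regression with the mean squared error as follows. The learning rate, the stopping tolerance and the toy data are arbitrary choices made for this illustration, and `gradient_descent` is a hypothetical function name.

```python
import numpy as np


def gradient_descent(A, y, alpha=0.1, tol=1e-10, max_iters=10_000):
    """Gradient descent for linear regression with the mean squared error.

    A is the training data with a leading column of ones; y holds the labels.
    """
    n = len(y)
    theta = np.zeros(A.shape[1])  # step 1: start with all parameters at zero
    for _ in range(max_iters):    # step 2: repeat until convergence
        # Gradient of the MSE with respect to every theta_j, using ALL the data.
        grad = (-2.0 / n) * A.T @ (y - A @ theta)
        new_theta = theta - alpha * grad
        # Assumption of this sketch: "convergence" means the update became tiny.
        if np.max(np.abs(new_theta - theta)) < tol:
            return new_theta
        theta = new_theta
    return theta


# Toy data generated from y = 1 + 2x, so the minimum is at theta = (1, 2).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
A = np.column_stack([np.ones_like(x), x])

theta = gradient_descent(A, y)
```

If `alpha` is too large the updates overshoot and diverge, and if it is too small the loop needs many more iterations, which is why the learning rate is usually tuned.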
Batch Gradient Descent
There are automated methods to compute the derivatives of many complex loss functions.
– This made it possible to develop the current deep learning approaches.
Note, however, that every step of the gradient descent algorithm requires processing the entire training data.
This is what is called batch gradient descent.
Mini-batch Gradient Descent
In mini-batch gradient descent, only part of the training data is used to compute the gradient of the loss function.
The entire data set is partitioned into small batches, and at each step of the gradient descent iterations, only one batch is processed.
– If the batch size is 1, this is usually called stochastic gradient descent.
When all batches have been processed, we say that we have completed an epoch, and we start processing the first batch again.
[Figure: the training data partitioned into batch 1, batch 2, batch 3 and batch 4.]
Mini-Batch Gradient Descent Algorithm
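1. $\theta_0 = 0, \ldots, \theta_p = 0$
2. Repeat until (near) convergence:
(a) Shuffle (X, Y) and split it into m mini-batches $(X_1, Y_1), \cdots, (X_m, Y_m)$.
(b) For every mini-batch $(X_i, Y_i)$:
i. $\theta_j = \theta_j - \alpha \frac{\partial}{\partial\theta_j} L(X_i, Y_i)$ for every $j = 0, \ldots, p$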
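A minimal sketch of the mini-batch variant for linear regression with the mean squared error is shown below. The batch size, learning rate, number of epochs and toy data are arbitrary choices for illustration, and a fixed number of epochs stands in for "until (near) convergence".

```python
import numpy as np


def minibatch_gd(A, y, alpha=0.05, batch_size=2, epochs=500, seed=0):
    """Mini-batch gradient descent for linear regression with the MSE loss."""
    rng = np.random.default_rng(seed)
    n = len(y)
    theta = np.zeros(A.shape[1])                 # step 1: all parameters at zero
    for _ in range(epochs):                      # one pass over all batches = one epoch
        order = rng.permutation(n)               # (a) shuffle the training data
        for start in range(0, n, batch_size):    # (b) process one mini-batch at a time
            idx = order[start:start + batch_size]
            A_b, y_b = A[idx], y[idx]
            # Gradient of the MSE computed on this mini-batch only.
            grad = (-2.0 / len(idx)) * A_b.T @ (y_b - A_b @ theta)
            theta = theta - alpha * grad
    return theta


# Toy data generated from y = 1 + 2x, so the minimum is at theta = (1, 2).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
A = np.column_stack([np.ones_like(x), x])

theta = minibatch_gd(A, y)
```

Setting `batch_size=1` gives stochastic gradient descent, and setting it to `len(y)` recovers batch gradient descent. The toy data here are noise-free, so the parameters settle at the minimum; with noisy data they would keep fluctuating around it.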
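[Figure: trajectories of batch, mini-batch and stochastic gradient descent on a contour plot of the loss function. Source: https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3]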
The figure shows the trajectories of the parameters Θ towards the minimum. In the figure, only two dimensions are shown, and the ovals mean regions of the loss function with equal loss value. In batch gradient descent, the values of Θ move directly towards a minimum until the minimum loss is reached, at which point the values are stationary. In mini-batch and stochastic gradient descent, the path towards reaching the minimum is more erratic and the values will not become stationary.
Batch vs. Mini-Batch Gradient Descent
Batch Gradient Descent
At each iteration step we take the most direct path towards reaching a minimum. The algorithm converges in a relatively small number of steps.
Each step may take a long time to compute (if the training data is large).
Mini-batch Gradient Descent
At each iteration step there’s some random noise introduced and we take a path roughly in the direction of the minimum.
The algorithm reaches near convergence in a larger number of steps.
Each step is very quick to compute.
1.2 Deep Learning
A Deep Learning Architecture
A deep learning architecture is a large neural network.
The principle is the same as with a simple neural network.
1. Define a complex network that generates a complex prediction $f(x_1, x_2, \cdots, x_p)$. This is normally based on simpler building blocks.
2. Define a loss function L(X, Y ). There are some popular loss functions for classification, regression, etc.
3. Determine the gradient of the loss function. This is done automatically.
A feedforward neural network
a.k.a. multilayer perceptron (MLP)
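[Figure: a feedforward network with an input layer (x1, x2, x3, x4), two hidden layers with three nodes each ($h_1^1, h_2^1, h_3^1$ and $h_1^2, h_2^2, h_3^2$), and an output layer (o1, o2). The weights are θ into the first hidden layer, θ′ into the second hidden layer, and θ″ into the output layer.]
$h_1^1 = f_{h_1^1}(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4)$
$h_1^2 = f_{h_1^2}(\theta'_0 + \theta'_1 h_1^1 + \theta'_2 h_2^1 + \theta'_3 h_3^1)$
$o_1 = f_{o_1}(\theta''_0 + \theta''_1 h_1^2 + \theta''_2 h_2^2 + \theta''_3 h_3^2)$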
A feedforward neural network is one of the earliest deep learning architectures:
The network is composed of a sequence of densely connected layers.
The first layer is the input layer and it receives the vector that represents the input data.
The last layer is the output layer and it has one node per classification label.
There may be one or more intermediate layers called hidden layers.
The layers are densely connected: each node in a layer is connected to each node in the next layer. The output of a node is a function of the weighted sum of all nodes from the previous layer, where the weights are the parameters that will be learned at the training stage.
The function that applies to a node is called the activation function. There are several popular activation functions. We will see some of them later.
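The computation performed by such a network can be sketched as a forward pass in NumPy, here for four inputs, two hidden layers of three nodes each, and two output nodes. The weights are random (the network is untrained), and the choice of ReLU for the hidden layers and softmax for the output layer is an assumption of this sketch, not something fixed by the text above.

```python
import numpy as np


def relu(z):
    """A popular activation function: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)


def softmax(z):
    """Turns the output scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the maximum for numerical stability
    return e / e.sum()


rng = np.random.default_rng(0)

# One weight matrix and one bias vector per densely connected layer.
# Row k of a matrix holds the weights of node k; the bias plays the role of theta_0.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)  # input (4)    -> hidden 1 (3)
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)  # hidden 1 (3) -> hidden 2 (3)
W3, b3 = rng.normal(size=(2, 3)), np.zeros(2)  # hidden 2 (3) -> output (2)


def forward(x):
    """Each node outputs an activation of the weighted sum of the previous layer."""
    h1 = relu(W1 @ x + b1)
    h2 = relu(W2 @ h1 + b2)
    return softmax(W3 @ h2 + b3)


x = np.array([0.5, -1.0, 2.0, 0.0])  # made-up input vector
o = forward(x)                       # one score per classification label
```

Training would adjust `W1`, `b1`, `W2`, `b2`, `W3` and `b3` by gradient descent on a loss such as cross-entropy.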
2 Classification in Keras
Classification in Keras
This section is based on the Jupyter notebooks provided by the Deep Learning book: https://github.com/fchollet/deep-learning-with-python-notebooks
– Simple classification of numbers.
– Binary classification of movie reviews.
– Multi-class classification of news wires.
Study these notebooks carefully since they contain important information about how neural networks are constructed and how they operate. The notebooks also introduce important terminology that you need to understand.
Take-home Messages
1. Understand the general process in deep learning.
2. Understand the jargon in deep learning: activation, loss, batches, epochs, ...
3. Implement and evaluate a feedforward network in Keras for text classification.
What's Next
Week 5
Embeddings and Text Sequences
Reading
Deep Learning book, chapter 6.
Additional Reading
Jurafsky & Martin's book, chapter 9. (9.4 may be introduced in week 6)