Introduction to Neural Networks
CS918 Natural Language Processing
Elena Kochkina
Table of contents
Basics
  Neuron
  Activation functions
  Feedforward NN
  Backpropagation
Types of Neural Networks
  Recursive NNs
  Convolutional NNs
  Recurrent NNs
Network regularisation
  Regularisation methods
Resources
Neuron
[Figure: a single neuron receiving inputs x1, x2, ..., xn]
Inputs:
X = [x1, x2, ..., xn], X ∈ R^n
Neuron
[Figure: a single neuron with inputs x1, ..., xn and weights ω1, ..., ωn]
Inputs:
X = [x1, x2, ..., xn], X ∈ R^n
Weights:
W = [ω1, ω2, ..., ωn], W ∈ R^{m×n}, where m is the number of neurons
Bias:
B = [b1, ..., bm], B ∈ R^m
The neuron performs the following computation: ∑_{i=1}^{n} ωi xi + b
In matrix form: W · X + B
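A sketch of this computation in NumPy (the sizes and random values are illustrative assumptions, not part of the slides):

import numpy as np

n, m = 4, 3                    # n inputs, m neurons (illustrative sizes)
X = np.random.randn(n)         # input vector, X ∈ R^n
W = np.random.randn(m, n)      # weight matrix, one row of n weights per neuron
B = np.random.randn(m)         # bias vector, one bias per neuron

pre_activation = W @ X + B     # W · X + B, shape (m,)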
Neuron
[Figure: a single neuron with inputs x1, ..., xn, a constant input +1, and weights ω1, ..., ωn, ωn+1]
Inputs:
X = [x1, x2, ..., xn, 1], X ∈ R^{n+1}
Weights:
W = [ω1, ω2, ..., ωn, ωn+1], W ∈ R^{m×(n+1)}, where m is the number of neurons
(the bias is absorbed into the weights as ωn+1, multiplying the constant input 1)
The neuron performs the following computation: ∑_{i=1}^{n+1} ωi xi
In matrix form: W · X
Activation function
[Figure: a single neuron with inputs x1, ..., xn, +1, weights ω1, ..., ωn+1, an activation function f(x) and output y]
Inputs:
X = [x1, x2, ..., xn, 1], X ∈ R^{n+1}
Weights:
W = [ω1, ω2, ..., ωn, ωn+1], W ∈ R^{m×(n+1)}, where m is the number of neurons
Activation function: f(x)
The output is y = f(∑_{i=1}^{n+1} ωi xi)
In matrix form: Y = f(W · X)
Activation functions

Function                  Formula
Linear                    f(X) = X
Sigmoid/Logistic          f(X) = 1 / (1 + e^{−X})
Rectified Linear (ReLU)   f(X) = max(0, X)
Tanh                      f(X) = tanh(X)
Softmax                   f: Yi = e^{Xi} / ∑_j e^{Xj}

Most activations are element-wise and non-linear.
Their derivatives are easy to compute.
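A sketch of these activations in NumPy; all are element-wise except softmax, which normalises over the whole vector:

import numpy as np

def linear(x):   return x
def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
def relu(x):     return np.maximum(0.0, x)
def tanh(x):     return np.tanh(x)

def softmax(x):
    # subtract the max for numerical stability; does not change the result
    e = np.exp(x - np.max(x))
    return e / e.sum()

# derivatives are simple, e.g. sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
# and tanh'(x) = 1 - tanh(x)**2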
Feedforward Neural Network with One Hidden Layer
[Figure: input layer X, hidden layer H and output layer Y, connected by weights Wx and Wy with activations f and g]
Input: X ∈ R^n
Target output: Ŷ ∈ R^k
Predicted output: Y ∈ R^k
Hidden: H ∈ R^p
Weights: Wx ∈ R^{p×n}, Wy ∈ R^{k×p}
Activations: f, g
Loss function: L
Loss: E ∈ R
Forward Propagation:
H = f(Wx · X)
Y = g(Wy · H)
E = L(Y, Ŷ)
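A sketch of this forward pass in NumPy, assuming f = tanh, g = identity and a squared-error loss (these concrete choices match the assumptions on the next slide; the sizes and data are illustrative):

import numpy as np

n, p, k = 4, 5, 2                        # input, hidden and output sizes (illustrative)
Wx = np.random.randn(p, n) * 0.1         # Wx ∈ R^{p×n}
Wy = np.random.randn(k, p) * 0.1         # Wy ∈ R^{k×p}

def forward(X, Y_hat):
    H = np.tanh(Wx @ X)                  # H = f(Wx · X)
    Y = Wy @ H                           # Y = g(Wy · H), with g = identity
    E = 0.5 * np.sum((Y - Y_hat)**2)     # E = L(Y, Ŷ)
    return H, Y, E

X, Y_hat = np.random.randn(n), np.random.randn(k)
H, Y, E = forward(X, Y_hat)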
Backpropagation
Backpropagation is based on the chain rule.
Let's assume:
g(X) = X
f(X) = tanh(X)
L(Y, Ŷ) = 1/2 ∑_{i=1}^{k} (Yi − Ŷi)²

∂E/∂Wy = ∂E/∂Y · ∂Y/∂Wy = (Y − Ŷ) · H^T
∂E/∂Wx = ∂E/∂Y · ∂Y/∂H · ∂H/∂Wx = (Wy^T · (Y − Ŷ)) ⊙ (1 − tanh²(Wx · X)) · X^T
[Figure: the network X → H → Y with weights Wx, Wy and activations f, g]
Stochastic Gradient Descent
ϵ > 0 – learning rate
W = W − ϵ ∂E/∂W, ∀W ∈ [Wy, Wx]
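A sketch of these gradients and the SGD update in NumPy, under the same assumptions (g identity, f = tanh, squared-error loss); the sizes, data and learning rate are illustrative:

import numpy as np

n, p, k = 4, 5, 2
Wx = np.random.randn(p, n) * 0.1
Wy = np.random.randn(k, p) * 0.1
X, Y_hat = np.random.randn(n), np.random.randn(k)   # one (input, target) pair

eps = 0.01                                   # learning rate ϵ > 0
for step in range(100):                      # plain SGD on a single example
    H = np.tanh(Wx @ X)                      # forward pass
    Y = Wy @ H
    dY = Y - Y_hat                           # ∂E/∂Y for L = 1/2 Σ (Y − Ŷ)²
    dWy = np.outer(dY, H)                    # ∂E/∂Wy = (Y − Ŷ) · Hᵀ
    dZ = (Wy.T @ dY) * (1.0 - H**2)          # chain rule through Wy and f = tanh
    dWx = np.outer(dZ, X)                    # ∂E/∂Wx
    Wy -= eps * dWy                          # W = W − ϵ ∂E/∂W
    Wx -= eps * dWx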
Types of Neural Networks
▶ Feedforward
▶ Recursive
▶ Convolutional
▶ Recurrent
▶ etc
Recursive NNs
Recursive Neural Networks (RNNs) have been successful in learning
sequence and tree structures in natural language processing.
Source: Deep Learning for Natural Language Processing
Recursive NNs
Simplest architecture: nodes are combined into parents using a weight matrix that is shared across the whole network, and a non-linearity such as tanh:
p_{1,2} = tanh(W [c1; c2]),
where W is a learned n × 2n weight matrix.
source: Wikipedia
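A sketch of this composition step in NumPy (the vector dimensionality and the child vectors are illustrative):

import numpy as np

n = 5                                              # dimensionality of node vectors (illustrative)
W = np.random.randn(n, 2 * n) * 0.1                # shared n × 2n weight matrix
c1, c2 = np.random.randn(n), np.random.randn(n)    # child node vectors

p12 = np.tanh(W @ np.concatenate([c1, c2]))        # p_{1,2} = tanh(W [c1; c2])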
Recursive NNs
Typically, Stochastic Gradient Descent (SGD) is used to train the network.
The gradient is computed using Backpropagation Through Structure (BPTS).
In principle this is the same as standard backpropagation.
There are two differences resulting from the tree structure:
▶ Derivatives are split at each node
▶ Derivatives of W are summed over all nodes
Convolution
Continuous convolution:
(f ∗ g)(t) = ∫_{−∞}^{+∞} f(τ) g(t − τ) dτ
Discrete convolution:
(f ∗ g)(c) = ∑_a f(a) · g(c − a)
(f ∗ g)(c) = ∑_{a+b=c} f(a) · g(b)
Properties:
▶ Convolution is commutative: f ∗ g = g ∗ f
▶ Convolution is associative: (f ∗ g) ∗ h = f ∗ (g ∗ h)
Figure: source: Wikipedia
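A sketch of the discrete convolution of two finite sequences, written directly from (f ∗ g)(c) = ∑_{a+b=c} f(a) · g(b); the result matches NumPy's np.convolve (the example sequences are illustrative):

import numpy as np

def discrete_convolve(f, g):
    # (f * g)(c) = sum over a + b = c of f(a) * g(b)
    out = np.zeros(len(f) + len(g) - 1)
    for a, fa in enumerate(f):
        for b, gb in enumerate(g):
            out[a + b] += fa * gb
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
print(discrete_convolve(f, g))        # same values as np.convolve(f, g)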
Convolution
source (left): developer.apple.com
source (right): http://docs.gimp.org/en/plug-in-convmatrix.html
Convolutional NN layers. Illustration
A max-pooling layer takes the maximum of features over small
blocks of a previous layer.
source: http://colah.github.io/posts/2014-07-Conv-Nets-Modular/
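A sketch of max-pooling over non-overlapping blocks of a 2D feature map (the block size and input are illustrative, and the input height/width are assumed to be divisible by the block size):

import numpy as np

def max_pool(feature_map, block=2):
    # take the maximum over each non-overlapping block × block region
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // block, block, w // block, block)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(x))                    # 2×2 output, the maximum of each 2×2 block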
Convolutional NNs
CNNs are trained using backpropagation.
CNNs are widely used in computer vision for image and video recognition: Krizhevsky et al., 2012; Ciresan et al., 2011; etc.
CNNs can also be used in NLP for sentence modelling, classification, semantic parsing, etc.: Collobert et al., 2008; Kalchbrenner et al., 2014; etc.
Recurrent NNs
A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle.
[Figure: a cyclic network with weights WX, WH, WY, unfolded in time into inputs Xt, hidden states Zt, Ht and outputs Yt]
RNNs are often used for processing sequential data / time series.
The standard way of training an RNN is Backpropagation Through Time (BPTT).
Recurrent NNs.
Forward Propagation:
Given H0
Zt = WX · Xt + WH · Ht−1
Ht = f(Zt)
Yt = g(WY · Ht)
Et = L(Yt, Ŷt)
E = ∑_{t=1}^{T} Et

Backpropagation Through Time is analogous to standard backpropagation: we sum up the errors, and we also sum up the gradients of each weight over all time steps for one training example.
[Figure: the unfolded RNN with weights WX, WH, WY]
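A sketch of this forward pass through time in NumPy, assuming f = tanh, g = identity and the squared-error loss used earlier (the sizes, weights and sequence are illustrative):

import numpy as np

n_in, n_h, n_out, T = 3, 5, 2, 4             # sizes and sequence length (illustrative)
WX = np.random.randn(n_h, n_in) * 0.1
WH = np.random.randn(n_h, n_h) * 0.1
WY = np.random.randn(n_out, n_h) * 0.1

X = np.random.randn(T, n_in)                 # input sequence X_1 ... X_T
Y_hat = np.random.randn(T, n_out)            # target sequence Ŷ_1 ... Ŷ_T

H = np.zeros(n_h)                            # given H_0
E = 0.0
for t in range(T):
    Z = WX @ X[t] + WH @ H                   # Z_t = WX · X_t + WH · H_{t−1}
    H = np.tanh(Z)                           # H_t = f(Z_t), here f = tanh
    Y = WY @ H                               # Y_t = g(WY · H_t), here g = identity
    E += 0.5 * np.sum((Y - Y_hat[t])**2)     # E = Σ_t E_t with a squared-error L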
Recurrent NNs. Illustration
source: http://iamtrask.github.io
Long Short-Term Memory (LSTM)
Forward Propagation:
Given H0, C0
Ct = ft ⊙ Ct−1 + it ⊙ f1(WX · Xt + WH · Ht−1)
Ht = ot ⊙ f2(Ct)
it = σ(Wi,X · Xt + Wi,C · Ct−1 + Wi,H · Ht−1)
ot = σ(Wo,X · Xt + Wo,C · Ct + Wo,H · Ht−1)
ft = σ(Wf,X · Xt + Wf,C · Ct−1 + Wf,H · Ht−1)
(Graves, 2006)
See also:
▶ Gated Recurrent Units (GRU)
▶ Bi-directional RNNs
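A sketch of a single LSTM step following these equations, including the peephole-style W·,C terms; all sizes, initial states and the weight layout are illustrative, and the code follows the slide's notation rather than any particular library:

import numpy as np

n_in, n_h = 3, 4                                      # illustrative sizes
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))            # logistic gate activation

# weights on X_t for the cell input and the input/output/forget gates,
# recurrent weights on H_{t−1}, and peephole weights on the cell state C
W = {name: np.random.randn(n_h, n_in) * 0.1 for name in ["c", "i", "o", "f"]}
U = {name: np.random.randn(n_h, n_h) * 0.1 for name in ["c", "i", "o", "f"]}
P = {name: np.random.randn(n_h, n_h) * 0.1 for name in ["i", "o", "f"]}

def lstm_step(x_t, H_prev, C_prev):
    i = sigma(W["i"] @ x_t + P["i"] @ C_prev + U["i"] @ H_prev)   # input gate i_t
    f = sigma(W["f"] @ x_t + P["f"] @ C_prev + U["f"] @ H_prev)   # forget gate f_t
    C = f * C_prev + i * np.tanh(W["c"] @ x_t + U["c"] @ H_prev)  # C_t, with f1 = tanh
    o = sigma(W["o"] @ x_t + P["o"] @ C + U["o"] @ H_prev)        # output gate o_t uses C_t
    H = o * np.tanh(C)                                            # H_t = o_t ⊙ f2(C_t)
    return H, C

H, C = np.zeros(n_h), np.zeros(n_h)                   # given H_0, C_0
H, C = lstm_step(np.random.randn(n_in), H, C)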
Neural Attention
▶ Rocktäschel T. et al. Reasoning about entailment with neural attention
▶ Bahdanau D., Cho K., Bengio Y. Neural machine translation by jointly learning to align and translate
▶ Xu K. et al. Show, attend and tell: Neural image caption generation with visual attention
Multitask Learning
[Figure: task-specific inputs (xa, ya) and (xb, yb) feed shared layer(s), which produce separate outputs Label A and Label B]
▶ Sebastian Ruder (2017). An Overview of Multi-Task Learning in Deep Neural Networks.
▶ Søgaard, Anders, and Yoav Goldberg. "Deep multi-task learning with low level tasks supervised at lower layers."
Regularisation methods
Regularisation helps prevent overfitting.
Common regularisation methods:
▶ L2/L1 weight regularisation
▶ Dropout
▶ Batch Normalisation
source: Wikipedia
Weight regularisation
The idea of weight regularisation is to add an extra term, a regularisation penalty, to the cost function:
E = L(Y, Ŷ) + λ R(W)
L2 regularisation (most common):
E = L(Y, Ŷ) + λ ∑_w w²
L1 regularisation:
E = L(Y, Ŷ) + λ ∑_w |w|
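A sketch of adding the L2 penalty to a cost in NumPy (λ, the weight matrices and the data-loss value are illustrative placeholders):

import numpy as np

lam = 1e-3                                                  # regularisation strength λ
weights = [np.random.randn(5, 4), np.random.randn(2, 5)]    # e.g. [Wx, Wy]

data_loss = 0.42                                            # L(Y, Ŷ) from a forward pass (placeholder)
l2_penalty = lam * sum(np.sum(W**2) for W in weights)
E = data_loss + l2_penalty                                  # E = L(Y, Ŷ) + λ Σ w²

# the penalty simply adds 2 λ W to each weight gradient:
# dW_total = dW_data + 2 * lam * W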
Dropout
source: Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting
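Srivastava et al. describe dropout as randomly dropping units (with their connections) during training. A sketch of the common equivalent "inverted" formulation, which rescales the kept activations at training time so the layer is used unchanged at test time (the keep probability and layer values are illustrative):

import numpy as np

def dropout(h, p_keep=0.5, train=True):
    # randomly zero units during training; rescale so the expected value is unchanged
    if not train:
        return h
    mask = (np.random.rand(*h.shape) < p_keep) / p_keep
    return h * mask

h = np.random.randn(10)          # activations of some hidden layer
print(dropout(h, p_keep=0.8))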
Batch Normalisation
▶ Batch Normalization enables higher learning rates
▶ Batch Normalization regularizes the model
Ioffe, Sergey, and Christian Szegedy. ”Batch normalization:
Accelerating deep network training by reducing internal covariate
shift.”
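A sketch of the batch-normalisation transform described in the cited paper: each feature is normalised using the mini-batch mean and variance, then scaled and shifted by learned parameters γ and β (the batch and shapes here are illustrative):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: mini-batch of activations, shape (batch_size, features)
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalise
    return gamma * x_hat + beta              # scale and shift with learned γ, β

x = np.random.randn(32, 8)                   # batch of 32 examples, 8 features
gamma, beta = np.ones(8), np.zeros(8)
y = batch_norm(x, gamma, beta)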
Parameter space of deep learning models
Existing Deep Learning Frameworks
Resources
▶ Deep Learning for Natural Language Processing (without
Magic)
▶ The Unreasonable Effectiveness of Recurrent Neural Networks
▶ Long Short-Term Memory in Recurrent Neural Networks
▶ CS231n: Convolutional Neural Networks for Visual
Recognition
▶ Anyone Can Learn To Code an LSTM-RNN in Python
▶ Blog posts about Neural networks
▶ AI, Deep Learning, NLP Blog – Good explanation of BPTT