Introduction to Neural Networks
CS918 Natural Language Processing

Elena Kochkina


Table of contents

Basics
Neuron
Activation functions
Feedforward NN
Backpropagation

Types of Neural Networks
Recursive NNs
Convolutional NNs
Recurrent NNs

Network regularisation
Regularisation methods

Resources


Neuron

[Figure: inputs x1, x2, .., xn feeding into a single summation unit Σ]

Inputs:
X = [x1, x2, .., xn]
X ∈ R^n


Neuron

[Figure: inputs x1, x2, .., xn, each multiplied by a weight ω1, ω2, .., ωn, feeding into a summation unit Σ]

Inputs:
X = [x1, x2, .., xn]
X ∈ R^n
Weights:
W = [ω1, ω2, .., ωn]
W ∈ R^(m×n), where m is the number of neurons
Bias:
B = [b1, .., bm]
B ∈ R^m

A neuron performs the following computation:
∑_{i=1}^{n} ωi xi + b
In matrix form: W · X + B


Neuron

[Figure: inputs x1, x2, .., xn plus a constant input +1, with weights ω1, ω2, .., ωn, ωn+1, feeding into a summation unit Σ]

Inputs:
X = [x1, x2, .., xn, 1]
X ∈ R^(n+1)
Weights:
W = [ω1, ω2, .., ωn, ωn+1]
W ∈ R^(m×(n+1)), where m is the number of neurons

The bias is absorbed into the weights via the constant input, so the neuron computes:
∑_{i=1}^{n+1} ωi xi
In matrix form: W · X


Activation function

[Figure: inputs x1, x2, .., xn and a constant +1, with weights ω1, .., ωn+1, feeding a summation unit Σ followed by an activation function f(x) that produces the output y]

Inputs:
X = [x1, x2, .., xn, 1]
X ∈ R^(n+1)
Weights:
W = [ω1, ω2, .., ωn, ωn+1]
W ∈ R^(m×(n+1)), where m is the number of neurons
Activation function: f(x)

The output is y = f(∑_{i=1}^{n+1} ωi xi)
In matrix form: f(W · X)
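A minimal sketch of this computation in NumPy, assuming the logistic sigmoid as the activation function; the input and weight values below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    # Logistic activation: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Inputs with the bias trick: a constant 1 is appended as the last component of X
x = np.array([0.5, -1.2, 3.0, 1.0])   # X = [x1, x2, x3, 1]
w = np.array([0.1, 0.4, -0.3, 0.2])   # W = [ω1, ω2, ω3, ω4], where ω4 acts as the bias

y = sigmoid(np.dot(w, x))             # y = f(Σ_i ωi xi) = f(W · X)
print(y)
```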


Activation functions

Function                   Formula
Linear                     f(X) = X
Sigmoid/Logistic           f(X) = 1 / (1 + e^(−X))
Rectified Linear (ReLU)    f(X) = max(0, X)
Tanh                       f(X) = tanh(X)
Softmax                    Yi = e^(Xi) / ∑_j e^(Xj)

Most activations are applied element-wise and are non-linear.
Their derivatives are easy to compute.
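A minimal NumPy sketch of these activations; the softmax subtracts the maximum before exponentiating, which improves numerical stability without changing the result:

```python
import numpy as np

def linear(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def tanh(x):
    return np.tanh(x)

def softmax(x):
    # Subtracting the max keeps the exponentials small; the output is unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), sigmoid(x), softmax(x))
```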


Feedforward Neural Network with One Hidden Layer

[Figure: input layer X connected to hidden layer H through weights Wx and activation f; hidden layer H connected to output layer Y through weights Wy and activation g]

Input: X ∈ R^n
Target output: Ŷ ∈ R^k
Predicted output: Y ∈ R^k
Hidden: H ∈ R^p
Weights: Wx ∈ R^(p×n), Wy ∈ R^(k×p)
Activations: f, g
Loss function: L
Loss: E ∈ R

Forward Propagation

H = f(Wx · X)

Y = g(Wy · H)

E = L(Y, Ŷ)
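A minimal NumPy sketch of this forward pass, assuming f = tanh, g = identity and a squared-error loss; the sizes n, p, k, the random weights and the target are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 4, 3, 2                          # input, hidden and output sizes (illustrative)

Wx = rng.normal(scale=0.1, size=(p, n))    # Wx ∈ R^(p×n)
Wy = rng.normal(scale=0.1, size=(k, p))    # Wy ∈ R^(k×p)

X = rng.normal(size=n)                     # input X ∈ R^n
Y_hat = np.array([0.0, 1.0])               # target output Ŷ ∈ R^k

H = np.tanh(Wx @ X)                        # H = f(Wx · X)
Y = Wy @ H                                 # Y = g(Wy · H), with g = identity
E = 0.5 * np.sum((Y - Y_hat) ** 2)         # E = L(Y, Ŷ)
print(E)
```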


Backpropagation

Backpropagation is based on the chain rule.
Let us assume:
g(X) = X
f(X) = tanh(X)
L(Y, Ŷ) = 1/2 ∑_{i=1}^{k} (Yi − Ŷi)^2

∂E/∂Wy = (∂E/∂Y)(∂Y/∂Wy) = (Y − Ŷ) ⊗ H

∂E/∂Wx = (∂E/∂Y)(∂Y/∂H)(∂H/∂Wx) = ((Y − Ŷ) · Wy) ⊙ (1 − tanh^2(Wx · X)) ⊗ X

(⊗ denotes the outer product, ⊙ the element-wise product)

[Figure: the same network X → H → Y with weights Wx, Wy and activations f, g]

Stochastic Gradient Descent
ϵ > 0 – learning rate
W = W − ϵ ∂E/∂W, ∀W ∈ {Wx, Wy}
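A minimal NumPy sketch of these gradients and one SGD step, under the same assumptions as above (g = identity, f = tanh, squared-error loss); the sizes, random weights and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 4, 3, 2
Wx = rng.normal(scale=0.1, size=(p, n))
Wy = rng.normal(scale=0.1, size=(k, p))
X = rng.normal(size=n)
Y_hat = np.array([0.0, 1.0])
eps = 0.1                                      # learning rate ϵ

# Forward pass
H = np.tanh(Wx @ X)                            # H = f(Wx · X)
Y = Wy @ H                                     # Y = g(Wy · H), g = identity
E = 0.5 * np.sum((Y - Y_hat) ** 2)             # E = L(Y, Ŷ)

# Backward pass (chain rule)
dE_dY = Y - Y_hat                              # ∂E/∂Y = (Y − Ŷ)
dE_dWy = np.outer(dE_dY, H)                    # ∂E/∂Wy = (Y − Ŷ) ⊗ H
dE_dH = Wy.T @ dE_dY                           # ∂E/∂H
dE_dWx = np.outer(dE_dH * (1 - H ** 2), X)     # ∂E/∂Wx, using tanh'(z) = 1 − tanh²(z)

# SGD update: W = W − ϵ ∂E/∂W
Wy -= eps * dE_dWy
Wx -= eps * dE_dWx
```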


Types of Neural Networks

▶ Feedforward

▶ Recursive

▶ Convolutional

▶ Recurrent

▶ etc


Recursive NNs
Recursive Neural Networks (RNNs) have been successful in learning
sequence and tree structures in natural language processing.

Source: Deep Learning for Natural Language Processing


Recursive NNs

Simplest architecture: nodes are combined
into parents using a weight matrix that is
shared across the whole network, and a
non-linearity such as tanh.

p1,2 = tanh(W [c1; c2]),

where W is a learned n × 2n weight matrix

source: Wikipedia
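A minimal NumPy sketch of this composition step, assuming child vectors of dimension n and a shared n × 2n matrix W; the sizes and random values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.normal(scale=0.1, size=(n, 2 * n))    # shared n × 2n composition matrix

c1 = rng.normal(size=n)                       # left child vector
c2 = rng.normal(size=n)                       # right child vector

p12 = np.tanh(W @ np.concatenate([c1, c2]))   # p_{1,2} = tanh(W [c1; c2])
print(p12.shape)                              # parent vector, again of dimension n
```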


Recursive NNs

Typically, Stochastic Gradient Descent (SGD) is used to train the
network.
The gradient is computed using Backpropagation Through
Structure (BPTS).

BPTS is principally the same as general backpropagation.

Two differences result from the tree structure:

▶ Derivatives are split at each node

▶ Derivatives of W are summed over all nodes


Convolution

Continuous convolution:
(f ∗ g)(t) = ∫_{−∞}^{+∞} f(τ) g(t − τ) dτ

Discrete convolution:
(f ∗ g)(c) = ∑_a f(a) · g(c − a)

(f ∗ g)(c) = ∑_{a+b=c} f(a) · g(b)

Properties:

▶ Convolution is commutative:
f ∗ g = g ∗ f

▶ Convolution is associative:
(f ∗ g) ∗ h = f ∗ (g ∗ h)

Figure: source: Wikipedia
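A minimal NumPy sketch of the discrete convolution formula; the signal and kernel are illustrative, and np.convolve is shown only to confirm it computes the same sum ∑_a f(a) · g(c − a):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])          # a short discrete signal
g = np.array([0.25, 0.5, 0.25])        # a smoothing kernel

# Direct implementation of (f * g)(c) = Σ_a f(a) · g(c − a)
out_len = len(f) + len(g) - 1
conv = np.zeros(out_len)
for c in range(out_len):
    for a in range(len(f)):
        if 0 <= c - a < len(g):
            conv[c] += f[a] * g[c - a]

print(conv)
print(np.convolve(f, g))               # same result using NumPy's built-in
```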


Convolution

source (left): developer.apple.com
source (right): http://docs.gimp.org/en/plug-in-convmatrix.html



Convolutional NN layers. Illustration

A max-pooling layer takes the maximum of features over small
blocks of a previous layer.

source: http://colah.github.io/posts/2014-07-Conv-Nets-Modular/
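A minimal NumPy sketch of max-pooling over non-overlapping blocks of a 1-D feature map; the block size and values are illustrative:

```python
import numpy as np

features = np.array([1.0, 3.0, 2.0, 5.0, 0.0, 4.0])   # output of a previous layer
block = 2                                              # pooling window size

# Take the maximum over each non-overlapping block of size `block`
pooled = features.reshape(-1, block).max(axis=1)
print(pooled)                                          # [3. 5. 4.]
```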


Convolutional NNs

CNNs are trained using backpropagation.

CNNs are widely used in computer vision for image and video
recognition: Krizhevsky et al., 2012; Ciresan et al., 2011; etc.

CNNs can also be used in NLP for sentence modeling,
classification, semantic parsing, etc.: Collobert et al., 2008;
Kalchbrenner et al., 2014; etc.


Recurrent NNs

A recurrent neural network (RNN) is a class of artificial neural
network where connections between units form a directed cycle.

[Figure: a cyclic RNN with input X, output Y and weights WX, WH, WY, unfolded in time into copies at steps t and t+1, with inputs Xt, Xt+1, states Zt, Ht and Zt+1, Ht+1, and outputs Yt, Yt+1]

RNNs are often used for processing sequential data / time series.
The standard way of training an RNN is Backpropagation Through
Time (BPTT).


Recurrent NNs

Forward Propagation:
Given H0
Zt = WX · Xt + WH · Ht−1
Ht = f(Zt)
Yt = g(WY · Ht)
Et = L(Yt, Ŷt)
E = ∑_{t=1}^{T} Et

Backpropagation through time is analogous to standard
backpropagation: we sum up the errors over the time steps, and we
also sum up the gradients of each weight at each time step for
one training example.

[Figure: the unfolded RNN with weights WX, WH, WY, inputs Xt, Xt+1, states Zt, Ht, Zt+1, Ht+1 and outputs Yt, Yt+1]
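A minimal NumPy sketch of this forward pass over a short sequence, assuming f = tanh, g = identity and a squared-error loss; the sizes, random weights and sequence are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 4, 2, 5          # illustrative sizes and sequence length

WX = rng.normal(scale=0.1, size=(n_hid, n_in))
WH = rng.normal(scale=0.1, size=(n_hid, n_hid))
WY = rng.normal(scale=0.1, size=(n_out, n_hid))

X = rng.normal(size=(T, n_in))              # input sequence X_1 .. X_T
Y_hat = rng.normal(size=(T, n_out))         # target sequence Ŷ_1 .. Ŷ_T

H = np.zeros(n_hid)                         # H_0
E = 0.0
for t in range(T):
    Z = WX @ X[t] + WH @ H                  # Z_t = WX · X_t + WH · H_{t−1}
    H = np.tanh(Z)                          # H_t = f(Z_t)
    Y = WY @ H                              # Y_t = g(WY · H_t), g = identity
    E += 0.5 * np.sum((Y - Y_hat[t]) ** 2)  # E = Σ_t E_t
print(E)
```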



Recurrent NNs. Illustration

source: http://iamtrask.github.io


Long Short-Term Memory (LSTM)

Forward Propagation:
Given H0, C0
Ct = ft ⊙ Ct−1 + it ⊙ f1(WX · Xt + WH · Ht−1)
Ht = ot ⊙ f2(Ct)
it = σ(Wi,X · Xt + Wi,C · Ct−1 + Wi,H · Ht−1)
ot = σ(Wo,X · Xt + Wo,C · Ct + Wo,H · Ht−1)
ft = σ(Wf,X · Xt + Wf,C · Ct−1 + Wf,H · Ht−1)

(Graves, 2006)

See also:

▶ Gated Recurrent Units (GRU)

▶ Bi-directional RNN
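A minimal NumPy sketch of a single step of these (peephole-style) equations, assuming f1 = f2 = tanh; the sizes, random weights and the dictionary key names are illustrative, not part of the original notation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4

# One weight matrix per gate and per input type (X, C, H), mirroring the equations above;
# names ending in "X" map from the input, the rest map from the cell or hidden state
W = {name: rng.normal(scale=0.1, size=(n_hid, n_in if name.endswith("X") else n_hid))
     for name in ["iX", "iC", "iH", "oX", "oC", "oH", "fX", "fC", "fH", "cX", "cH"]}

X_t = rng.normal(size=n_in)
H_prev = np.zeros(n_hid)                    # H_{t−1}
C_prev = np.zeros(n_hid)                    # C_{t−1}

i_t = sigmoid(W["iX"] @ X_t + W["iC"] @ C_prev + W["iH"] @ H_prev)    # input gate
f_t = sigmoid(W["fX"] @ X_t + W["fC"] @ C_prev + W["fH"] @ H_prev)    # forget gate
C_t = f_t * C_prev + i_t * np.tanh(W["cX"] @ X_t + W["cH"] @ H_prev)  # cell state
o_t = sigmoid(W["oX"] @ X_t + W["oC"] @ C_t + W["oH"] @ H_prev)       # output gate
H_t = o_t * np.tanh(C_t)                                              # hidden state
```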


Neural Attention

▶ Rocktäschel T. et al. Reasoning about entailment with neural attention

▶ Bahdanau D., Cho K., Bengio Y. Neural machine translation by jointly learning to align and translate

▶ Xu K. et al. Show, attend and tell: Neural image caption generation with visual attention


Multitask Learning

[Figure: inputs (xa, ya) and (xb, yb) for two tasks feed into shared layer(s), which branch into separate outputs for Label A and Label B]

▶ Sebastian Ruder (2017). An Overview of Multi-Task Learning
in Deep Neural Networks.

▶ Søgaard, Anders, and Yoav Goldberg. "Deep multi-task
learning with low level tasks supervised at lower layers."


Regularisation methods

Regularisation helps prevent overfitting.

Common regularisation methods:

▶ L2/L1 weight regularisation

▶ Dropout

▶ Batch Normalisation

source: Wikipedia


Weight regularisation

The idea of weight regularisation is to add an extra term, a
regularisation penalty, to the cost function:

E = L(Y, Ŷ) + λ R(W)

L2 regularisation (most common):

E = L(Y, Ŷ) + λ ∑_w w^2

L1 regularisation:

E = L(Y, Ŷ) + λ ∑_w |w|
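A minimal NumPy sketch of adding an L2 penalty to the cost of the earlier one-hidden-layer network; the weight shapes, stand-in data loss and λ are illustrative. The penalty contributes 2λW to each weight gradient before the SGD update:

```python
import numpy as np

rng = np.random.default_rng(0)
Wx = rng.normal(scale=0.1, size=(3, 4))
Wy = rng.normal(scale=0.1, size=(2, 3))
lam = 1e-3                                   # regularisation strength λ (illustrative)

data_loss = 0.42                             # stand-in for L(Y, Ŷ) from the forward pass

# L2 penalty: λ Σ_w w², summed over all weight matrices
penalty = lam * (np.sum(Wx ** 2) + np.sum(Wy ** 2))
E = data_loss + penalty

# Contribution of the penalty to the gradients
dpenalty_dWx = 2 * lam * Wx
dpenalty_dWy = 2 * lam * Wy
print(E)
```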


Dropout

source: Srivastava et al. in Dropout: A Simple Way to Prevent
Neural Networks from Overfitting


Batch Normalisation

▶ Batch Normalisation enables higher learning rates

▶ Batch Normalisation regularises the model

Ioffe, Sergey, and Christian Szegedy. "Batch normalization:
Accelerating deep network training by reducing internal covariate
shift."


Parameter space of deep learning models


Existing Deep Learning Frameworks


Resources

▶ Deep Learning for Natural Language Processing (without
Magic)

▶ The Unreasonable Effectiveness of Recurrent Neural Networks

▶ Long Short-Term Memory in Recurrent Neural Networks

▶ CS231n: Convolutional Neural Networks for Visual
Recognition

▶ Anyone Can Learn To Code an LSTM-RNN in Python

▶ Blog posts about Neural networks

▶ AI, Deep Learning, NLP Blog – Good explanation of BPTT
