Linear models: Recap
- Perceptron:
  score(x, y; θ) = θ · f(x, y)
- Naïve Bayes:
  log P(x, y; θ) = log P(x | y; φ) + log P(y; μ) = log B(x) + θ · f(x, y)
- Logistic Regression:
  log P(y | x; θ) = θ · f(x, y) − log Σ_{y'∈Y} exp(θ · f(x, y'))
Features and weights in linear models: Recap
- Feature representation: f(x, y)
  f(x, y=1) = [ x ; 0; 0; ···; 0 ]   (the trailing zeros span (K−1)×V positions)
  f(x, y=2) = [ 0; ···; 0 ; x ; 0; ···; 0 ]   (V leading zeros, (K−2)×V trailing zeros)
  f(x, y=K) = [ 0; 0; ···; 0 ; x ]   ((K−1)×V leading zeros)
- Weights: θ
  θ = [ θ_1; ···; θ_V ; θ_1; ···; θ_V ; ··· ; θ_1; ···; θ_V ]   (one block of V weights per class y = 1, ..., K)
Rearranging the features and weights
- Represent the features x as a column vector of length V, and represent the weights as a K × V matrix Θ:
  x = [x_1; x_2; ···; x_V]
  Θ = [θ_{1,1} θ_{1,2} ··· θ_{1,V}    (row for y = 1)
       θ_{2,1} θ_{2,2} ··· θ_{2,V}    (row for y = 2)
       ···
       θ_{K,1} θ_{K,2} ··· θ_{K,V}]   (row for y = K)
- What is Θx?
Scores for each class
- Verify that Θx gives the scores for each class 1, 2, ···, K:
  Θx = [θ_1 · x ; θ_2 · x ; ··· ; θ_K · x] = [ψ_1 ; ψ_2 ; ··· ; ψ_K]
  where θ_k · x is the score ψ_k for class k.
Implementation in Pytorch
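The slide's notebook is not reproduced here; the following is a minimal sketch of the same idea (the sizes V = 5 and K = 3 and all values are made up): an nn.Linear layer stores the K × V weight matrix Θ, and applying it to x produces the K class scores.

import torch
from torch import nn

V, K = 5, 3
x = torch.rand(V)                      # feature vector, e.g., word counts
linear = nn.Linear(V, K, bias=False)   # weight matrix Theta of shape (K, V)
scores = linear(x)                     # Theta @ x: one score per class
probs = torch.softmax(scores, dim=0)   # turns scores into a distribution (next slides)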
Digression: Matrix multiplication
- A matrix with m rows and n columns:
  A ∈ R^(m×n), B ∈ R^(n×p), C = AB ∈ R^(m×p), where C_ij = Σ_{k=1}^{n} A_ik B_kj
- Example:
  [2 3     [1 0 2     [2 3 7
   1 2]  ×  0 1 1]  =  1 2 4]
Digression: 3-D matrix multiplication
Tensor shape: (batch-size, sentence-length, embedding size)
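As a hedged illustration of how 3-D tensors multiply in PyTorch (the tensor names and sizes below are invented), torch.matmul broadcasts over the leading batch dimension, while torch.bmm multiplies two 3-D tensors batch by batch:

import torch

batch_size, sent_len, emb = 8, 12, 300
X = torch.randn(batch_size, sent_len, emb)   # (batch, sentence length, embedding size)
W = torch.randn(emb, 5)                      # e.g., a projection to 5 scores per token

Y = torch.matmul(X, W)                       # matmul broadcasts: shape (8, 12, 5)

A = torch.randn(batch_size, sent_len, emb)
B = torch.randn(batch_size, emb, sent_len)
C = torch.bmm(A, B)                          # batch matrix multiply: shape (8, 12, 12)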
SoftMax
- SoftMax, also known as the normalized exponential function:
  SoftMax(ψ)_i = exp(ψ_i) / Σ_{j=1}^{K} exp(ψ_j),  for i = 1, 2, ···, K
- Applying SoftMax turns the scores into a probability distribution:
  SoftMax(Θx) = [P(y = 1); P(y = 2); ···; P(y = K)]
- Verify this is exactly logistic regression
Logistic regression as a neural network
  y = SoftMax(Θx)   (example with V = 5, K = 3)
Going deep
- There is no reason why we can't add layers in the middle:
  z = σ(Θ_1 x)
  y = SoftMax(Θ_2 z)
Going even deeper
- There is no reason why we can't add even more layers in the middle
- But why?
  z_1 = σ(Θ_1 x)
  z_2 = σ(Θ_2 z_1)
  y = SoftMax(Θ_3 z_2)
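A minimal sketch of such a deeper model in PyTorch (layer sizes are made up; this is not the course's reference implementation):

import torch
from torch import nn

V, H1, H2, K = 5, 8, 8, 3          # input size, two hidden sizes, number of classes

model = nn.Sequential(
    nn.Linear(V, H1),              # Theta_1
    nn.Sigmoid(),                  # sigma
    nn.Linear(H1, H2),             # Theta_2
    nn.Sigmoid(),
    nn.Linear(H2, K),              # Theta_3
    nn.Softmax(dim=-1),            # y = SoftMax(Theta_3 z_2)
)

y = model(torch.rand(V))           # a probability distribution over K classes

In practice the final Softmax is usually omitted and the raw scores are passed to a cross-entropy loss, which applies the softmax internally.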
Non-linear classification
Linear models like logistic regression map data into a high-dimensional vector space, are expressive enough, and work well for many NLP problems. Why do we need more complex non-linear models?
- There have been rapid advances in deep learning, a family of nonlinear methods that learn complex functions of the input through multiple layers of computation.
- Deep learning facilitates the incorporation of word embeddings, dense vector representations of words that can be learned from massive amounts of unlabeled data.
- Word embeddings have evolved from early static embeddings (e.g., Word2vec, GloVe) to recent dynamic embeddings (ELMo, BERT, XLNet).
- There have also been rapid advances in specialized hardware called graphics processing units (GPUs). Many deep learning models can be implemented efficiently on GPUs.
Feedforward Neural networks: an intuitive justification
- In image classification, instead of using the input (pixels) to predict the image type directly, you can imagine first predicting the shapes of parts of an image: a mouth, a hand, an ear.
- In text processing, we can imagine a similar scenario. Say we want to classify movie reviews (or movies themselves) into the label set {Good, Bad, OK}. Instead of predicting these labels directly, we first predict a set of composite features such as the story, acting, soundtrack, cinematography, etc. from the raw input (the words in the text).
Face Recognition
Feedforward neural networks
Formally, this is what we do:
- Use the text x to predict the features z. Specifically, train a logistic regression classifier to compute P(z_k | x), for each k ∈ {1, 2, ···, K_z}.
- Use the features z to predict the label y. Train a logistic regression classifier to compute P(y | z). Since z is unknown or hidden, we use P(z | x) as the features.
Caveat: it's easy to demonstrate that this is what the model does for image processing, but it's hard to show that this is what's actually going on in language processing. Interpretability is a major issue in neural models for language processing.
The hidden layer: computing the composite features
- If we assume each z_k is binary, that is, z_k ∈ {0, 1}, then P(z_k | x) can be modeled with binary logistic regression:
  P(z_k = 1 | x; Θ^(x→z)) = σ(θ_k^(x→z) · x) = (1 + exp(−θ_k^(x→z) · x))^(−1)
- The weight matrix Θ^(x→z) ∈ R^(K_z×V) is constructed by stacking (not concatenating, as in linear models) the weight vectors for each z_k:
  Θ^(x→z) = [θ_1^(x→z), θ_2^(x→z), ···, θ_{K_z}^(x→z)]^⊤
- We assume an offset/bias term is included in x and its parameter is included in each θ_k^(x→z).
Notation: Θ^(x→z) ∈ R^(K_z×V) is a real-valued matrix with K_z rows and V columns.
Activation functions
- Sigmoid: the range of the sigmoid function is (0, 1):
  σ(x) = 1 / (1 + e^(−x))
  Question: what's the value of the sigmoid function when x = 0?
- Tanh: the range of the tanh activation function is (−1, 1):
  tanh(x) = (e^(2x) − 1) / (e^(2x) + 1)
  Question: what's the value of the tanh function when x = 0?
- ReLU: the rectified linear unit (ReLU) is zero for negative inputs, and linear for positive inputs:
  ReLU(x) = max(x, 0) = 0 if x < 0, x otherwise
Sigmoid and tanh are sometimes described as squashing functions.
Activation functions in Pytorch
import torch
from torch import nn

input = torch.randn(4)

sigmoid = nn.Sigmoid()
output = sigmoid(input)

tanh = nn.Tanh()
output = tanh(input)

relu = nn.ReLU()
output = relu(input)
The output layer
- The output layer is computed by the multiclass logistic regression probability:
  P(y = j | z; Θ^(z→y), b) = exp(θ_j^(z→y) · z + b_j) / Σ_{j'∈Y} exp(θ_{j'}^(z→y) · z + b_{j'})
- The weight matrix Θ^(z→y) ∈ R^(K_y×K_z) is again constructed by stacking the weight vectors for each y_k:
  Θ^(z→y) = [θ_1^(z→y), θ_2^(z→y), ···, θ_{K_y}^(z→y)]^⊤
- The vector of probabilities over each possible value of y is denoted:
  P(y | z; Θ^(z→y), b) = SoftMax(Θ^(z→y) z + b)
The negative loglikelihood or cross-entropy loss
In a multi-class classification setting, a softmax output produces a probability distribution over the possible labels. It pairs well with the negative conditional log-likelihood (just as in logistic regression):
  L = −Σ_{i=1}^{N} log P(y^(i) | x^(i); Θ)
or, equivalently, the cross-entropy loss. Writing ỹ_j ≜ P(y = j | x^(i); Θ):
  L = −Σ_{i=1}^{N} e_{y^(i)} · log ỹ
where e_{y^(i)} is a one-hot vector of zeros with a value of one at position y^(i).
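A small sanity-check sketch (the logits and label are invented): PyTorch's built-in cross-entropy loss is exactly the negative log-likelihood of the softmax distribution defined above.

import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 0.5, -1.0]])    # logits for one instance, K = 3
y = torch.tensor([0])                        # true label index y^(i)

log_probs = torch.log_softmax(scores, dim=1)
manual_loss = -log_probs[0, y[0]]            # -log P(y^(i) | x^(i))

builtin_loss = F.cross_entropy(scores, y)    # same value
assert torch.allclose(manual_loss, builtin_loss)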
Alternative loss functions
- There are alternatives to SoftMax and the cross-entropy loss, just as there are alternatives in linear models.
- Pairing an affine transformation (remember the perceptron) with a margin loss:
  ψ(y; x^(i), Θ) = θ_y^(z→y) · z + b_y
  ℓ_MARGIN(Θ; x^(i), y^(i)) = max_{y≠y^(i)} ( 1 + ψ(y; x^(i), Θ) − ψ(y^(i); x^(i), Θ) )_+
Training a feedforward network with the backpropagation algorithm
A feedforward network with a cross-entropy loss
The following sums up a feed-forward neural network with one hidden layer, complete with a cross-entropy loss that drives parameter estimation:
  z ← f(Θ^(x→z) x^(i)),  z ∈ R^(K_z)
  ỹ ← SoftMax(Θ^(z→y) z + b),  ỹ ∈ R^(K_y)
  ℓ^(i) ← −e_{y^(i)} · log ỹ
where f is an elementwise activation function (e.g., σ or ReLU), and ℓ^(i) is the loss at instance i.
Updating the parameters of a feedforward network
As usual, Θ^(x→z), Θ^(z→y), and b can be estimated by online gradient-based optimization such as stochastic gradient descent:
  b ← b − η^(t) ∇_b ℓ^(i)
  θ_k^(z→y) ← θ_k^(z→y) − η^(t) ∇_{θ_k^(z→y)} ℓ^(i)
  θ_n^(x→z) ← θ_n^(x→z) − η^(t) ∇_{θ_n^(x→z)} ℓ^(i)
where η^(t) is the learning rate at iteration t, ℓ^(i) is the loss on instance (or minibatch) i, θ_n^(x→z) is the nth column of the matrix Θ^(x→z), and θ_k^(z→y) is the kth column of the matrix Θ^(z→y).
Compute the gradient of the cross-entropy loss on the hidden-layer weights Θ^(z→y)
  ∇_{θ_k^(z→y)} ℓ^(i) = [ ∂ℓ^(i)/∂θ_{k,1}^(z→y), ∂ℓ^(i)/∂θ_{k,2}^(z→y), ···, ∂ℓ^(i)/∂θ_{k,K_y}^(z→y) ]^⊤
  ∂ℓ^(i)/∂θ_{k,j}^(z→y) = ∂/∂θ_{k,j}^(z→y) [ −θ_{y^(i)}^(z→y) · z + log Σ_{y∈Y} exp(θ_y^(z→y) · z) ]
                        = ( P(y = j | z; Θ^(z→y), b) − δ(j = y^(i)) ) z_k
where δ(j = y^(i)) returns 1 if j = y^(i) and 0 otherwise, and z_k is the kth element of z.
Applying the chain rule to compute the derivatives
We use the chain rule to break down the gradient of the loss on the hidden layer weights:
  ∂ℓ^(i)/∂θ_{k,j}^(z→y) = (∂ℓ^(i)/∂o_j)(∂o_j/∂θ_{k,j}^(z→y))
where
  ℓ^(i) = −e_{y^(i)} · log ỹ = −log( e^(o_i) / Σ_k e^(o_k) ) = −o_i + log Σ_k e^(o_k)
  o_j = θ_j^(z→y) · z
Note: o_i is the logit that corresponds to the true label y^(i).
Derivative of the cross-entropy loss with respect to the logits
  ∂ℓ^(i)/∂o_j = ∂/∂o_j ( −o_i + log Σ_k e^(o_k) )
              = ∂/∂o_j log Σ_k e^(o_k) − ∂o_i/∂o_j
              = e^(o_j) / Σ_k e^(o_k) − ∂o_i/∂o_j
              = ỹ_j − δ(j = y^(i))
Gradient on the hidden layer weights Θ^(z→y)
One more step to compute the gradient of the loss with respect to the weights:
  ∂o_j/∂θ_{k,j}^(z→y) = ∂/∂θ_{k,j}^(z→y) ( θ_j^(z→y) · z ) = ∂/∂θ_{k,j}^(z→y) Σ_k θ_{k,j}^(z→y) z_k = z_k
  ∂ℓ^(i)/∂θ_{k,j}^(z→y) = (∂ℓ^(i)/∂o_j)(∂o_j/∂θ_{k,j}^(z→y)) = ( ỹ_j − δ(j = y^(i)) ) z_k
Similarly, we can also compute the derivative with respect to the hidden units:
  ∂o_j/∂z_k = θ_{k,j}^(z→y)
  ∂ℓ^(i)/∂z_k = Σ_j (∂ℓ^(i)/∂o_j)(∂o_j/∂z_k) = Σ_j ( ỹ_j − δ(j = y^(i)) ) θ_{k,j}^(z→y)
This value will be useful for computing the gradient of the loss function on the weights from the inputs to the hidden layer.
Compute the gradient on the input weights Θ^(x→z)
Apply the chain rule of differentiation:
  ∂ℓ^(i)/∂θ_{n,k}^(x→z) = (∂ℓ^(i)/∂z_k)(∂z_k/∂θ_{n,k}^(x→z))
                        = (∂ℓ^(i)/∂z_k) ∂f(θ_k^(x→z) · x)/∂θ_{n,k}^(x→z)
                        = (∂ℓ^(i)/∂z_k) f'(θ_k^(x→z) · x) x_n
where f'(θ_k^(x→z) · x) is the derivative of the activation function f, applied at the input θ_k^(x→z) · x. Depending on what the actual activation function is, the derivative will also be different.
Derivative of the sigmoid activation function
  ∂ℓ^(i)/∂θ_{n,k}^(x→z) = (∂ℓ^(i)/∂z_k) σ(θ_k^(x→z) · x)(1 − σ(θ_k^(x→z) · x)) x_n
                        = (∂ℓ^(i)/∂z_k) z_k (1 − z_k) x_n
- If the negative log-likelihood ℓ^(i) does not depend much on z_k, then ∂ℓ^(i)/∂z_k ≈ 0. In this case it doesn't matter how z_k is computed, and so ∂ℓ^(i)/∂θ_{n,k}^(x→z) ≈ 0.
- If x_n = 0, then it does not matter how we set the weight θ_{n,k}^(x→z), so ∂ℓ^(i)/∂θ_{n,k}^(x→z) = 0.
A simple neural network
[Network diagram: inputs x_1, x_2, x_3 feed hidden units s_1/z_1 and s_2/z_2, which feed the output o/ỹ, compared against the true label y^(i).]

  Θ^(x→s) = [θ_11^(x→s) θ_12^(x→s) θ_13^(x→s)     (row s_1; columns x_1, x_2, x_3)
             θ_21^(x→s) θ_22^(x→s) θ_23^(x→s)]    (row s_2)
  Θ^(z→o) = [θ_11^(z→o) θ_12^(z→o)]               (row o; columns z_1, z_2)
A simple neural network: forward computation
[Same network diagram as above.]

  s_1 = θ_11^(x→s) x_1 + θ_12^(x→s) x_2 + θ_13^(x→s) x_3
  s_2 = θ_21^(x→s) x_1 + θ_22^(x→s) x_2 + θ_23^(x→s) x_3
  z_1 = σ(s_1)
  z_2 = σ(s_2)
  o = θ_11^(z→o) z_1 + θ_12^(z→o) z_2
  ỹ = σ(o)

Note: when making predictions with the sigmoid function, choose label 1 if ỹ > 0.5; otherwise choose 0.
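A minimal sketch of this forward pass with PyTorch tensors (the weight values and inputs below are invented for illustration):

import torch

x = torch.tensor([1.0, 0.0, 2.0])              # x_1, x_2, x_3
Theta_xs = torch.tensor([[0.1, 0.2, 0.3],      # row for s_1
                         [0.4, 0.5, 0.6]])     # row for s_2
Theta_zo = torch.tensor([0.7, 0.8])            # weights from z_1, z_2 to o

s = Theta_xs @ x                               # s_1, s_2
z = torch.sigmoid(s)                           # z_1, z_2
o = Theta_zo @ z                               # scalar output score
y_tilde = torch.sigmoid(o)                     # predicted probability
prediction = int(y_tilde > 0.5)                # label 1 if y_tilde > 0.5, else 0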
A simple neural network with cross-entropy sigmoid loss
- Cross-entropy sigmoid loss:
  L_H(p, q) = −Σ_i p_i log q_i = −y log ỹ − (1 − y) log(1 − ỹ)
- y ∈ {0, 1} is the true label
- ỹ is the predicted value
- If the true label is 1, the loss is −log ỹ. The loss is bigger if ỹ is smaller.
- If the true label is 0, the loss is −log(1 − ỹ). The loss is bigger if ỹ is bigger.
A simple neural network: Compute the derivatives
  ∂L/∂o = ỹ − y
  ∂o/∂z_1 = θ_11^(z→o),   ∂o/∂z_2 = θ_12^(z→o)
  ∂o/∂θ_11^(z→o) = z_1,   ∂o/∂θ_12^(z→o) = z_2
  ∂z_1/∂s_1 = z_1(1 − z_1),   ∂z_2/∂s_2 = z_2(1 − z_2)
  ∂s_1/∂θ_11^(x→s) = x_1,   ∂s_1/∂θ_12^(x→s) = x_2,   ∂s_1/∂θ_13^(x→s) = x_3
  ∂s_2/∂θ_21^(x→s) = x_1,   ∂s_2/∂θ_22^(x→s) = x_2,   ∂s_2/∂θ_23^(x→s) = x_3
Compute the derivatives on the weights Θ^(z→o)
Apply the chain rule:
  ∂L/∂θ_11^(z→o) = (∂L/∂o)(∂o/∂θ_11^(z→o)) = (ỹ − y) z_1
  ∂L/∂θ_12^(z→o) = (∂L/∂o)(∂o/∂θ_12^(z→o)) = (ỹ − y) z_2
Compute derivatives with respect to the weights Θ^(x→s)

  ∂L/∂θ_11^(x→s) = (∂L/∂o)(∂o/∂z_1)(∂z_1/∂s_1)(∂s_1/∂θ_11^(x→s)) = (ỹ − y) θ_11^(z→o) z_1(1 − z_1) x_1
  ∂L/∂θ_12^(x→s) = (∂L/∂o)(∂o/∂z_1)(∂z_1/∂s_1)(∂s_1/∂θ_12^(x→s)) = (ỹ − y) θ_11^(z→o) z_1(1 − z_1) x_2
  ∂L/∂θ_13^(x→s) = (∂L/∂o)(∂o/∂z_1)(∂z_1/∂s_1)(∂s_1/∂θ_13^(x→s)) = (ỹ − y) θ_11^(z→o) z_1(1 − z_1) x_3
  ∂L/∂θ_21^(x→s) = (∂L/∂o)(∂o/∂z_2)(∂z_2/∂s_2)(∂s_2/∂θ_21^(x→s)) = (ỹ − y) θ_12^(z→o) z_2(1 − z_2) x_1
  ∂L/∂θ_22^(x→s) = (∂L/∂o)(∂o/∂z_2)(∂z_2/∂s_2)(∂s_2/∂θ_22^(x→s)) = (ỹ − y) θ_12^(z→o) z_2(1 − z_2) x_2
  ∂L/∂θ_23^(x→s) = (∂L/∂o)(∂o/∂z_2)(∂z_2/∂s_2)(∂s_2/∂θ_23^(x→s)) = (ỹ − y) θ_12^(z→o) z_2(1 − z_2) x_3
Vectorizing forward computation
  s = Θ^(x→s) x = [θ_11^(x→s) θ_12^(x→s) θ_13^(x→s); θ_21^(x→s) θ_22^(x→s) θ_23^(x→s)] [x_1; x_2; x_3]
    = [θ_11^(x→s) x_1 + θ_12^(x→s) x_2 + θ_13^(x→s) x_3 ; θ_21^(x→s) x_1 + θ_22^(x→s) x_2 + θ_23^(x→s) x_3] = [s_1; s_2]
  z = σ([s_1; s_2]) = [σ(s_1); σ(s_2)] = [z_1; z_2]
  o = Θ^(z→o) z = [θ_11^(z→o) θ_12^(z→o)] [z_1; z_2] = θ_11^(z→o) z_1 + θ_12^(z→o) z_2
  ỹ = σ(o)
Note: matrix multiplication in the forward computation sums over the inputs.
Backpropagation
- When applying the chain rule, local derivatives are frequently reused. For example, ∂L/∂o is used to compute the gradient on both Θ^(z→o) and Θ^(x→s). It is therefore useful to cache them.
- We can only compute the derivatives of the parameters when we have all the necessary "inputs" demanded by the chain rule, so careful sequencing is necessary to take advantage of the cached derivatives.
- This combination of sequencing, caching, and differentiation is called backpropagation.
- To make this process manageable, backpropagation is typically done in vectorized form (tensors).
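A hedged sketch of how an autodiff framework does this bookkeeping for us: running the toy network above through PyTorch's autograd reproduces the hand-derived gradient (ỹ − y)z on Θ^(z→o). The numbers are made up.

import torch

x = torch.tensor([1.0, 0.0, 2.0])
y = torch.tensor(1.0)                                    # true label
Theta_xs = torch.tensor([[0.1, 0.2, 0.3],
                         [0.4, 0.5, 0.6]], requires_grad=True)
Theta_zo = torch.tensor([0.7, 0.8], requires_grad=True)

z = torch.sigmoid(Theta_xs @ x)
y_tilde = torch.sigmoid(Theta_zo @ z)
loss = -(y * torch.log(y_tilde) + (1 - y) * torch.log(1 - y_tilde))
loss.backward()                                          # backpropagation

print(Theta_zo.grad)                                     # autograd's gradient
print(((y_tilde - y) * z).detach())                      # hand-derived (y_tilde - y) z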
Elements in a neural network
- Variables include the input x, the hidden units z, the outputs y, and the loss function.
- Inputs are not computed from other nodes in the graph. For example, the input to a feedforward neural network can be the feature vector extracted from a training (or test) instance.
- Backpropagation computes the gradient of the loss with respect to all variables other than the inputs, and propagates it to the parameters.
- Parameters include weights and biases. They do not have parents; they are initialized and then updated by gradient descent.
- The loss is not used to compute any other nodes in the graph, and is usually computed from the predicted label ŷ and the true label y^(i).
Backpropagation: caching “error signals”
Let D_o represent the gradient of the loss function with respect to o, and D_s the gradient of the loss function with respect to s. Assuming a cross-entropy loss and a sigmoid output (which can be generalized to SoftMax), the error signal at the output o is:
  D_o = ỹ − y^(i)
This error signal can then be used to compute the gradient with respect to Θ^(z→o):
  ∇Θ^(z→o) = D_o z^⊤ = (ỹ − y) [z_1  z_2] = [(ỹ − y) z_1   (ỹ − y) z_2]
Backpropagation: Computing error signals
Using D_o, we can also compute the error signals at the hidden layer:
  D_s = (Θ^(z→o) D_o) ⊙ F'(s)
      = ([∂o/∂z_1  ∂o/∂z_2] × ∂ℓ/∂o) ⊙ [∂z_1/∂s_1  ∂z_2/∂s_2]
      = ([θ_11^(z→o)  θ_12^(z→o)] × (ỹ − y)) ⊙ [z_1(1 − z_1)  z_2(1 − z_2)]
      = [(ỹ − y) θ_11^(z→o) z_1(1 − z_1)   (ỹ − y) θ_12^(z→o) z_2(1 − z_2)]
Backpropagation: caching “error signals”
Using the error signals D_s, we can now compute the gradient on Θ^(x→s):
  ∇Θ^(x→s) = D_s^⊤ [x_1  x_2  x_3]
           = [(ỹ − y) θ_11^(z→o) z_1(1 − z_1); (ỹ − y) θ_12^(z→o) z_2(1 − z_2)] × [x_1  x_2  x_3]
           = [(ỹ − y) θ_11^(z→o) z_1(1 − z_1) x_1   (ỹ − y) θ_11^(z→o) z_1(1 − z_1) x_2   (ỹ − y) θ_11^(z→o) z_1(1 − z_1) x_3
              (ỹ − y) θ_12^(z→o) z_2(1 − z_2) x_1   (ỹ − y) θ_12^(z→o) z_2(1 − z_2) x_2   (ỹ − y) θ_12^(z→o) z_2(1 − z_2) x_3]
Computation graph of a three-layer feedforward neural network

[Computation graph (figure): X ∈ R^(a×n) —dot with Θ^(in→1) ∈ R^(m×a)→ S^(1) ∈ R^(m×n) —f(·)→ Z^(1) ∈ R^(m×n) —dot with Θ^(1→2) ∈ R^(k×m)→ S^(2) ∈ R^(k×n) —f(·)→ Z^(2) ∈ R^(k×n) —dot with Θ^(2→out) ∈ R^(l×k)→ O ∈ R^(l×n) —SoftMax→ Ỹ ∈ R^(l×n) —xent with Y^(i) ∈ R^(l×n)→ loss. Backward quantities: error signals D^(out) ∈ R^(l×n), C^(2), D^(2) ∈ R^(k×n), C^(1), D^(1) ∈ R^(m×n); activation derivatives F'(Z^(2)) ∈ R^(k×n), F'(Z^(1)) ∈ R^(m×n); gradients ∇Θ^(2→out) ∈ R^(l×k), ∇Θ^(1→2) ∈ R^(k×m), ∇Θ^(in→1) ∈ R^(m×a).]
Forward computation in matrix form
  S^(1) = Θ^(in→1) X
  Z^(1) = f(S^(1))
  S^(2) = Θ^(1→2) Z^(1)
  Z^(2) = f(S^(2))
  O = Θ^(2→out) Z^(2)
  Ỹ_[:,j] = SoftMax(O_[:,j])   (SoftMax applies to each column vector)

Matrix multiplication in the forward computation sums over inputs and hidden units. We assume the input and hidden units are processed in batches of n instances.
Backpropagation: Computing error signals
  D^(out) = Ỹ − Y^(i)
  C^(2) = (Θ^(2→out))^⊤ D^(out)
  D^(2) = F'(Z^(2)) ⊙ C^(2)
  C^(1) = (Θ^(1→2))^⊤ D^(2)
  D^(1) = F'(Z^(1)) ⊙ C^(1)

Matrix multiplication in the error signal computation sums over outputs and hidden units.
Backpropagation: Compute the gradient with error signals
  ∇Θ^(2→out) = D^(out) (Z^(2))^⊤
  ∇Θ^(1→2) = D^(2) (Z^(1))^⊤
  ∇Θ^(in→1) = D^(1) X^⊤

Matrix multiplication when computing the gradients sums over the instances in the mini-batch.
Notes in backpropagation
- Where does caching take place?
- Why is sequencing important?
- What computation patterns can you observe?
Pay attention to what you sum over
- In the forward computation, you sum over all inputs (the feature vector) for each output.
- In the backward computation, you sum over the gradients of all outputs for each input.
- When you compute the gradient for the weights, you sum over the gradient for each data point (and average them).
- Make sure you line up the columns of the first matrix with the rows of the second matrix: that is what you sum over.
Automatic differentiation software
- Backpropagation is mechanical. There is no reason for everyone to repeat the same work.
- Many automatic differentiation libraries now exist, e.g., Torch (PyTorch), MXNet, TensorFlow.
- Using these libraries, users only need to specify the forward computation, and the gradient can be computed automatically. These libraries have accelerated research and development in deep learning considerably.
- Libraries that support dynamic computation graphs are better suited for many NLP problems.
Bells and whistles in neural net training
Tricks in training neural networks
There are various tricks that people use when training neural networks:
- Regularization: adjusting the gradient
- Dropout: adjusting the hidden units
- Optimization methods: adjusting the learning rate
- Initialization: using particular forms of initialization
Regularization
Neural networks can be regularized in a similar way to linear models. Neural networks can also be regularized with the Frobenius norm, a straightforward extension of the L2 norm to matrices; in fact, in many cases it is just referred to as L2 regularization:
  L = Σ_{i=1}^{N} ℓ^(i) + λ_{z→y} ||Θ^(z→y)||_F^2 + λ_{x→z} ||Θ^(x→z)||_F^2
where ||Θ||_F^2 = Σ_{i,j} θ_{i,j}^2 is the squared Frobenius norm, which generalizes the L2 norm to matrices. The bias parameters b are not regularized, as they do not contribute to the sensitivity of the classifier to the inputs.
L2 regularization
- Compute the gradient of a loss with L2 regularization:
  ∂L/∂θ = Σ_{i=1}^{N} ∂ℓ^(i)/∂θ + λθ
- Update the weights:
  θ = θ − η ( Σ_{i=1}^{N} ∂ℓ^(i)/∂θ + λθ )
- "Weight decay": λ is a tunable hyperparameter that pulls a weight back when it has become too big.
- Question: does it matter which layer θ is from when computing the regularization term?
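In PyTorch this is a one-line option on the optimizer; the manual alternative below adds the squared Frobenius norm to the loss yourself (a sketch; λ = 1e-4 is an arbitrary choice). Note that the optimizer's weight_decay applies to every parameter it is given, including biases, whereas the manual version can skip them, as the slide suggests.

import torch
from torch import nn, optim

net = nn.Linear(25, 3)

# Option 1: the optimizer applies the L2 penalty ("weight decay") for you.
opt = optim.SGD(net.parameters(), lr=0.1, weight_decay=1e-4)

# Option 2: add the squared Frobenius norm of the weight matrices to the loss.
def l2_penalty(model, lam=1e-4):
    return lam * sum((p ** 2).sum() for p in model.parameters() if p.dim() > 1)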
L1 regularization
- L1 regularization loss:
  L = Σ_{i=1}^{N} ℓ^(i) + λ_{z→y} ||Θ^(z→y)||_1 + λ_{x→z} ||Θ^(x→z)||_1
- Compute the gradient:
  ∂L/∂θ = Σ_{i=1}^{N} ∂ℓ^(i)/∂θ + λ sign(θ)
- Update the weights:
  θ = θ − η ( Σ_{i=1}^{N} ∂ℓ^(i)/∂θ + λ sign(θ) )
Comparison of L1 and L2
- In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount that is proportional to θ.
- When a particular weight has a large absolute value |θ|, L1 regularization shrinks the weight much less than L2 regularization does. By contrast, when |θ| is small, L1 regularization shrinks the weight much more than L2 regularization.
- The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero. So L1 regularization effectively does feature selection.
Dropout
- Dropout randomly drops a certain percentage of the nodes to prevent over-reliance on a few features or hidden units, i.e., feature co-adaptation, where some features are only useful when working together with a few other features. The ultimate goal is to avoid overfitting.
Dropout
- Dropout can be achieved using masks:
  z^(1) = g^1(Θ^(1) x + b^1)
  m^1 ∼ Bernoulli(r^1)
  z̃^(1) = m^1 ⊙ z^(1)
  z^(2) = g^2(Θ^(2) z̃^(1) + b^2)
  m^2 ∼ Bernoulli(r^2)
  z̃^(2) = m^2 ⊙ z^(2)
  y = Θ^(3) z̃^(2)
where m^1 and m^2 are mask vectors. The values of their elements are either 1 or 0, drawn from a Bernoulli distribution with parameter r (usually r = 0.5).
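A small sketch of dropout in PyTorch (the tensor sizes are arbitrary). nn.Dropout implements the mask idea above; note that it additionally rescales the surviving units by 1/(1 − p) during training so that no rescaling is needed at test time.

import torch
from torch import nn

z = torch.rand(1, 8)                       # a batch of hidden-unit activations

drop = nn.Dropout(p=0.5)
drop.train()
z_tilde = drop(z)                          # about half the entries are zeroed

drop.eval()                                # at test time dropout is a no-op
assert torch.equal(drop(z), z)

# The mask formulation from the slide, written out by hand:
m = torch.bernoulli(torch.full_like(z, 0.5))
z_tilde_manual = m * z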
Optimization methods
- Momentum
- Adagrad
- Root Mean Square Prop (RMSProp)
- Adam
Momentum
At each timestep t, compute the gradient ∇_Θ L, and then compute the momentum update as follows:
  V_t = γ V_{t−1} + η ∇_Θ L
  Θ = Θ − V_t
The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction.
Adagrad
Weight and bias update for Adagrad at each time step t:
  V_∇Θ = V_∇Θ + (∇Θ)^2
  Θ = Θ − η ∇Θ / √(V_∇Θ + ε)
e.g., ε = 10^(−8)
Root Mean Square Prop (RMSProp)
Weight update for RMSProp at each time step t:
  S_∇Θ = β S_∇Θ + (1 − β)(∇Θ)^2
  Θ = Θ − η ∇Θ / √(S_∇Θ + ε)
e.g., β = 0.9, η = 0.001, ε = 10^(−8)
Adaptive Moment Estimation (Adam)
Weight update at time step t for Adam:
  V_∇Θ = β_1 V_∇Θ + (1 − β_1) ∇Θ
  S_∇Θ = β_2 S_∇Θ + (1 − β_2)(∇Θ)^2
  V_∇Θ^corrected = V_∇Θ / (1 − β_1^t)
  S_∇Θ^corrected = S_∇Θ / (1 − β_2^t)
  Θ = Θ − η V_∇Θ^corrected / (√(S_∇Θ^corrected) + ε)
Adam combines Momentum and RMSProp.
Define a neural net
from torch import nn

class Net(nn.Module):
    '''Subclassing nn.Module is important so that PyTorch can track the parameters.'''
    def __init__(self, in_dim=25, out_dim=3):
        super(Net, self).__init__()
        self.in_dim = in_dim
        self.out_dim = out_dim
        self.linear = nn.Linear(self.in_dim, self.out_dim)
        self.softmax = nn.Softmax(dim=1)  # softmax over the class dimension

    def forward(self, input_matrix):
        logit = self.linear(input_matrix)
        return logit  # return raw scores, not normalized probabilities

    def xtropy_loss(self, input_matrix, target_label_vec):
        loss = nn.CrossEntropyLoss()
        logits = self.forward(input_matrix)
        return loss(logits, target_label_vec)
Use optimizers in Pytorch
import torch
import torch.optim as optim
from operator import itemgetter

net = Net(input_dim, output_dim)
optimizer = optim.Adam(net.parameters(), lr=lrate)
for epoch in range(epochs):
    total_nll = 0
    for batch in batchize(train_data, batch_size):
        optimizer.zero_grad()  # zero out the gradients
        vectorized = vectorize_batch(batch, feat_index, label_index)
        feat_vec = map(itemgetter(0), vectorized)
        label_vec = map(itemgetter(1), vectorized)
        feat_list = list(feat_vec)
        label_list = list(label_vec)
        x = torch.Tensor(feat_list)
        y = torch.LongTensor(label_list)
        loss = net.xtropy_loss(x, y)
        total_nll += loss
        loss.backward()
        optimizer.step()
torch.save(net.state_dict(), net_path)
Sparse and Dense embeddings as input to neural networks
Input to feedforward neural networks
- Assuming a bag-of-words model, the input x is the count of each word (feature) x_j.
- To compute the hidden unit z_k:
  z_k = Σ_{j=1}^{V} θ_{j,k}^(x→z) x_j
- The connections from word j to each of the hidden units z_k form a vector θ_j^(x→z), which is sometimes described as the embedding of word j.
- If there is a lot of training data, word embeddings can be learned within the same network as the classification task.
- Word embeddings can also be learned separately from unlabeled data, using techniques such as Word2Vec and GloVe.
- The latest trend is to learn contextualized word embeddings, which are computed dynamically for each classification instance (e.g., ELMo, BERT). This requires more advanced architectures (Transformers) that we will talk about later in the course.
One-hot encodings for features
A one-hot encoding is one in which each dimension corresponds to a unique feature, and the resulting feature vector of a classification instance can be thought of as the sum of indicator feature vectors in which a single dimension has a value of one while all others have a value of zero.
Example
Consider a bag-of-words representation over a vocabulary of 40,000 words. A short document of 20 words will be represented with a very sparse 40,000-dimensional vector in which at most 20 dimensions have non-zero values.
Combination of sparse vectors
    [0 0 0 1 0 0 0 0]
  + [0 1 0 0 0 0 0 0]
  + [0 0 0 0 0 0 1 0]
  = [0 1 0 1 0 0 1 0]
Sparse vs Dense representations
- Sparse representation
  - Each feature is a sparse vector in which one dimension is 1 and the rest are 0s (thus "one-hot").
  - The dimensionality of the one-hot vector is the same as the number of distinct features.
  - Features can be completely independent of one another. The feature "word is 'dog'" is as dissimilar to "word is 'thinking'" as it is to "word is 'cat'".
  - Features for one classification instance can only be combined by summation.
- Dense representation
  - Each feature is a d-dimensional vector, with a d that is generally much smaller than the dimension of a one-hot vector.
  - Model training will cause similar features to have similar vectors: information is shared between similar features.
  - Features can be combined via summation (or averaging), concatenation, or some combination of the two.
  - Use concatenation if we care about relative position.
Using dense vectors in a classifier
Each core feature is embedded into a d-dimensional space (typically 50-300) and represented as a vector in that space.
- Extract a set of core linguistic features f_1, ···, f_k that are relevant for predicting the output class.
- For each feature f_i of interest, retrieve the corresponding vector v_i.
- Combine the vectors (by concatenation, summation, or a combination of both) into an input vector x, as sketched below.
- Note: concatenation doesn't work when the number of feature vectors varies, as in document classification.
- Feed x into a nonlinear classifier (a feed-forward neural network).
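A minimal sketch of the combination step (the three feature vectors are random stand-ins for vectors retrieved from a pre-trained embedding table):

import torch

v1, v2, v3 = torch.rand(4), torch.rand(4), torch.rand(4)   # retrieved feature vectors

x_sum = v1 + v2 + v3                        # summation: fixed size, order-insensitive
x_avg = torch.stack([v1, v2, v3]).mean(0)   # averaging
x_cat = torch.cat([v1, v2, v3])             # concatenation: keeps relative position,
                                            # but needs a fixed number of features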
Relationship between one-hot and dense vectors
- Dense representations are typically pre-computed or pre-trained word embeddings.
- One-hot and dense representations may not be as different as one might think.
- In fact, using sparse, one-hot vectors as input when training a neural network amounts to dedicating the first layer of the network to learning a dense embedding vector for each feature based on the training data.
- With task-specific word embeddings, the training set is typically smaller, but the training objective for the embedding and the task objective are one and the same.
- With pre-trained word embeddings, the training data is easy to come by (just unannotated text), but the embedding objective and the task objective may diverge.
Two approaches of getting dense word vectors
- Count-based methods, known in NLP as Distributional Semantic Models (DSM) or Vector Semantic Models (VSM)
- Predictive methods, originating from the neural network community, aimed at producing distributed representations for words, commonly known as word embeddings
- Distributed word representations were initially a by-product of neural language models and later became a task in their own right
Distributional semantics
- Based on the well-known observation of Z. Harris: words are similar if they occur in the same context (Harris, 1954)
- Later summarized as a slogan: "You shall know a word by the company it keeps." (J. R. Firth, 1957)
- A long history of using word-context matrices to represent word meaning, where each row is a word and each column represents a context word it can occur with in a corpus
- Each word is represented as a sparse vector in a high-dimensional space
- Word distances and similarities can then be computed with such a matrix
Steps for building a distributional semantic model
- Preprocess a (large) corpus: tokenization at a minimum, possibly lemmatization, POS tagging, or syntactic parsing
- Define the "context" for a target term (a word or phrase). The context can be a window centered on the target term, or terms that are syntactically related to the target term (subject-of, object-of, etc.)
- Compute a term-context matrix where each row corresponds to a term and each column corresponds to a context term
- Each target term is then represented with a high-dimensional vector of context terms
Mathematical processing for building a DSM
- Weight the term-context matrix with an association-strength metric such as Positive Pointwise Mutual Information (PPMI) to correct frequency bias:
  PPMI(x, y) = max( log [ p(x, y) / (p(x) p(y)) ], 0 )
- Its dimensionality can also be reduced by matrix factorization techniques such as singular value decomposition (SVD):
  A = U Σ V^⊤
  A ∈ R^(m×n), U ∈ R^(m×k), Σ ∈ R^(k×k), V ∈ R^(n×k), n ≫ k
- This results in a matrix that has much lower dimension but retains most of the information of the original matrix; a small sketch follows below.
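A small numpy sketch of these two steps on a made-up 3×3 term-context count matrix (not real corpus data):

import numpy as np

counts = np.array([[10.0, 0.0, 3.0],     # rows: terms, columns: context terms
                   [2.0,  8.0, 1.0],
                   [0.0,  1.0, 9.0]])

p_xy = counts / counts.sum()
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)

with np.errstate(divide="ignore"):        # log(0) -> -inf, clipped to 0 below
    pmi = np.log(p_xy / (p_x * p_y))
ppmi = np.maximum(pmi, 0.0)               # Positive PMI

U, S, Vt = np.linalg.svd(ppmi)            # truncated SVD: keep the top k dimensions
k = 2
term_vectors = U[:, :k] * S[:k]           # low-dimensional term representations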
Predictive methods
- Learn word embeddings from large amounts of naturally occurring text, using various language-model objectives:
  - Decide on the context window
  - Define the objective function used to predict the context words based on the target word, or to predict the target word based on its context
  - Train the neural network
  - The resulting weight matrix serves as the vector representation for the target word
- "Don't count, predict!" (Baroni et al., 2014) conducted systematic studies and found that prediction-based word embeddings outperform count-based embeddings.
- One of the popular early word embeddings is Word2vec.
Word2vec
- Word2vec is a software package that consists of two main models: CBOW (Continuous Bag of Words) and Skip-gram.
- It popularized the use of distributed representations as input to neural networks in natural language processing, and inspired many follow-on works (e.g., GloVe, ELMo, BERT, XLNet).
- It has its roots in language modeling (the use of window-based context to predict the target word), but gives up the goal of building good language models and focuses instead on getting good word embeddings.
Understanding word2vec: a simple CBOW model with only one context word input
- The input x ∈ R^V is a one-hot vector where x_k = 1 and x_k' = 0 for k' ≠ k. Θ ∈ R^(N×V) is the weight matrix from the input layer to the hidden layer. Each column of Θ is an N-dimensional vector representation v_w of the associated word of the input layer:
  z = Θx := v_{w_i}
- Θ' ∈ R^(V×N) is the matrix from the hidden layer to the output layer, and u_{w_j} is the j-th row of Θ'. A "similarity" score o_j for each target word w_j and context word w_i can be computed as:
  o_j = u_{w_j}^⊤ v_{w_i}
- Finally, we use softmax to obtain a posterior distribution:
  p(w_j | w_i) = y_j = exp(o_j) / Σ_{j'=1}^{V} exp(o_{j'})
where y_j is the output of the j-th unit in the output layer.
Computing the hidden layer is just embedding lookup
The hidden layer computation "retrieves" v_{w_i}:
  Θ = [0.1 0.3 0.5 0.4 0.6 0.1 0.3 0.5 0.4 0.6
       0.2 0.5 0.8 0.7 0.9 0.4 0.8 0.2 0.5 0.1
       0.2 0.5 0.8 0.7 0.9 0.4 0.8 0.2 0.5 0.1]
  x = [0 0 1 0 0 0 0 0 0 0]^⊤
  v_{w_i} = z = Θx = [0.5 0.8 0.8]^⊤
Note there is no activation at the hidden layer (or, equivalently, a linear activation function), so this is a "degenerate neural network".
Computing the output layer
  o = Θ'z
  Θ' = [0.3 0.4 0.6
        0.7 0.1 0.6
        0.5 0.2 0.7
        0.2 0.6 0.3
        0.6 0.5 0.6
        0.3 0.1 0.1
        0.2 0.4 0.8
        0.3 0.2 0.1
        0.3 0.4 0.6
        0.3 0.5 0.1]
  z = [0.5 0.8 0.8]^⊤
  o = Θ'z = [0.95 0.91 0.97 0.82 1.18 0.31 1.06 0.39 0.95 0.63]^⊤
Each row of Θ' corresponds to the vector of a target word w_j.
Taking the softmax over the output
  y = softmax(o) = [0.11039215 0.10606362 0.11262222 0.09693485 0.13893957 0.05820895 0.12322834 0.063057 0.11039215 0.08016116]^⊤
The output y is a probability distribution over the entire vocabulary.
Input vector and output vector
Since there is no activation function at the hidden layer, the output is really just the dot product of the vector of the input context word and the vector of the output target word:
  p(w_j | w_i) = y_j = exp(o_j) / Σ_{j'=1}^{V} exp(o_{j'}) = exp(u_{w_j}^⊤ v_{w_i}) / Σ_{j'=1}^{V} exp(u_{w_{j'}}^⊤ v_{w_i})
where v_{w_i} from Θ is the input vector for word w_i and u_{w_j} from Θ' is the output vector for word w_j.
Computing the gradient on the hidden-output weights
- Use the familiar cross-entropy loss:
  L = −Σ_{j=1}^{|V|} t_j log y_j = −log y_{j*}
  where j* is the index of the target word.
- Given that y_j is the output of a softmax function, the gradient on the output is:
  ∂L/∂o_j = y_j − t_j
  ∂L/∂θ'_{ji} = (∂L/∂o_j)(∂o_j/∂θ'_{ji}) = (y_j − t_j) z_i
- Update the hidden→output weights:
  θ'_{ji} = θ'_{ji} − η (y_j − t_j) z_i
Updating the input→hidden weights
- Compute the error at the hidden layer:
  ∂L/∂z_i = Σ_{j=1}^{V} (∂L/∂o_j)(∂o_j/∂z_i) = Σ_{j=1}^{V} (y_j − t_j) θ'_{ji}
- Since
  z_i = Σ_{k=1}^{V} θ_{ik} x_k
  the derivative of L with respect to the input→hidden weights is:
  ∂L/∂θ_{ik} = (∂L/∂z_i)(∂z_i/∂θ_{ik}) = Σ_{j=1}^{V} (y_j − t_j) θ'_{ji} x_k
- Update the input→hidden weights:
  θ_{ik} = θ_{ik} − η Σ_{j=1}^{V} (y_j − t_j) θ'_{ji} x_k
Gradient computation in matrix form
  D_o = Y − T = [0.11039215 0.10606362 0.11262222 0.09693485 0.13893957 0.05820895 0.12322834 0.063057 0.11039215 0.08016116]^⊤
              − [0 0 1 0 0 0 0 0 0 0]^⊤
              = [0.11039215 0.10606362 −0.88737778 0.09693485 0.13893957 0.05820895 0.12322834 0.063057 0.11039215 0.08016116]^⊤
Computing the errors at the hidden layer
  D_z = D_o^⊤ Θ'
      = [0.110 0.106 −0.887 0.097 0.139 0.058 0.123 0.063 0.110 0.080] Θ'
      = [−0.115 −0.157 −0.194]
Computing the updates to Θ'
  ∇Θ' = D_o z^⊤
      = [0.110 0.106 −0.887 0.097 0.139 0.058 0.123 0.063 0.110 0.080]^⊤ × [0.5 0.8 0.8]
      = [ 0.055  0.088  0.088
          0.053  0.085  0.085
         −0.443 −0.710 −0.710
          0.048  0.078  0.078
          0.069  0.111  0.111
          0.029  0.047  0.047
          0.062  0.099  0.099
          0.032  0.050  0.050
          0.055  0.088  0.088
          0.040  0.064  0.064]
Computing the update to Θ
  ∇Θ = x D_z = [0 0 1 0 0 0 0 0 0 0]^⊤ × [−0.11538456 −0.15687943 −0.19388611]
All rows are zero except the one corresponding to the input word:
  [ 0            0            0
    0            0            0
   −0.11538456  −0.15687943  −0.19388611
    0            0            0
    ···
    0            0            0 ]
Only the embedding of the input context word receives a non-zero update.
CBOW for multiple context words
  z = (1/M) Θ (x_1 + x_2 + ··· + x_M)
    = (1/M) (v_{w_1} + v_{w_2} + ··· + v_{w_M})
where M is the number of words in the context, w_1, w_2, ···, w_M are the words in the context, and v_w is an input vector. The loss function is
  E = −log p(w_j | w_1, w_2, ···, w_M)
    = −o_{j*} + log Σ_{j'=1}^{V} exp(o_{j'})
    = −u_{w_j}^⊤ z + log Σ_{j'=1}^{V} exp(u_{w_{j'}}^⊤ z)
Computing the hidden layer for multiple context words
  z = Θ(x_1 + x_2 + ··· + x_M)
where the sum of the one-hot context vectors selects and adds up the corresponding columns of Θ; with the four context words in this example,
  z = [1.5 3.0 3.0]^⊤
During backprop, the vectors for four words are updated instead of just one.
Skip-gram: model
  p(w_{c,j} = w_{O,c} | w_I) = y_{c,j} = exp(o_{c,j}) / Σ_{j'=1}^{V} exp(o_{j'})
where w_{c,j} is the j-th word on the c-th panel of the output layer, w_{O,c} is the actual c-th word among the output context words, w_I is the only input word, y_{c,j} is the output of the j-th unit on the c-th panel of the output layer, and o_{c,j} is the net input of the j-th unit on the c-th panel of the output layer:
  o_{c,j} = o_j = u_{w_j} · z,  for c = 1, 2, ···, C
Skip-gram: loss function
  L = −log p(w_{O,1}, w_{O,2}, ···, w_{O,C} | w_I)
    = −log Π_{c=1}^{C} [ exp(o_{c,j_c*}) / Σ_{j'=1}^{V} exp(o_{j'}) ]
    = −Σ_{c=1}^{C} o_{c,j_c*} + C × log Σ_{j'=1}^{V} exp(o_{j'})
where j_c* is the index of the actual c-th output context word.
The losses of the C context words are combined by multiplying their probabilities. Note: o_{j'} is the same for all C panels.
Skip-gram: updating the weights
- We take the derivative of L with respect to the net input of every unit on every panel of the output layer, o_{c,j}, and obtain
  e_{c,j} = ∂L/∂o_{c,j} = y_{c,j} − t_{c,j}
  which is the prediction error of the unit.
- We define a V-dimensional vector E = [E_1, ···, E_V] as the sum of the prediction errors over the context words: E_j = Σ_{c=1}^{C} e_{c,j}
  ∂L/∂θ'_{ji} = Σ_{c=1}^{C} (∂L/∂o_{c,j})(∂o_{c,j}/∂θ'_{ji}) = E_j · z_i
- Updating the hidden→output weight matrix:
  θ'_{ji} = θ'_{ji} − η · E_j · z_i
- There is no change in how the input→hidden weights are updated.
Optimizing computational efficiency
- Computing the softmax at the output layer is expensive: it involves iterating over the entire vocabulary.
- Two methods for optimizing computational efficiency:
  - Hierarchical softmax: an alternative way to compute the probability of a word that reduces the computational complexity from |V| to log|V|.
  - Negative sampling: instead of updating the weights for all the words in the vocabulary, only sample a small number of words that are not actual context words in the training corpus.
Hierarchical softmax
[Figure: a binary tree over the vocabulary. Internal nodes n9-n15 carry branch probabilities (e.g., 0.65/0.35 at the root n15); the leaves are words with their resulting probabilities, e.g., Australia 0.11, penguin 0.06, watch 0.15, koala 0.1, kangaroo 0.1, tree 0.11, eat 0.34, leaf 0.028.]
Computing the probabilities of the leaf nodes
  P("kangaroo" | z) = P_{n15}(Left | z) × P_{n13}(Left | z) × P_{n9}(Right | z)
  P_n(Right | z) = 1 − P_n(Left | z)
  P_n(Left | z) = σ(θ_n^⊤ z)
where θ_n is a vector from a set of new parameters that replace Θ'.
Huffman tree building
A simple algorithm:
1. Prepare a collection of n initial Huffman trees, each of which is a single leaf node. Put the n trees onto a priority queue organized by weight (frequency).
2. Remove the two trees with the lowest weight. Join them to create a new tree whose root has the two trees as children, and whose weight is the sum of the weights of the two child trees. Put this new tree into the priority queue.
3. Repeat step 2 until all of the partial Huffman trees have been combined into one.
Negative sampling
- Computing the softmax over the vocabulary is expensive. Another alternative is to approximate the softmax by only updating a small sample of (context) words at a time.
- Given a pair of words (w, c), let P(D = 1 | w, c) be the probability that the pair of words came from the training corpus, and P(D = 0 | w, c) the probability that the pair did not come from the corpus.
- This probability can be modeled as a sigmoid function:
  P(D = 1 | w, c) = σ(u_w^⊤ v_c) = 1 / (1 + e^(−u_w^⊤ v_c))
New learning objective for negative sampling
- We need a new objective for negative sampling, which is to minimize the following loss function:
  L = −Σ_{w_j ∈ D} log σ(o_{w_j}) − Σ_{w_j ∈ D'} log σ(−o_{w_j})
  where D is a set of correct context-target word pairs and D' is a set of incorrect context-target word pairs.
- Note that we use a sum for the positive samples as well as the negative samples. In the skip-gram algorithm, there will be multiple positive context words in the output; in the CBOW algorithm, there will be only one positive target word.
- The derivative of the loss function with respect to the output word is:
  ∂L/∂o_{w_j} = σ(o_{w_j}) − t_{w_j}
  where t_{w_j} = 1 if w_j ∈ D and t_{w_j} = 0 if w_j ∈ D'.
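A hedged sketch of this loss in PyTorch, for one positive target word and five sampled negatives (the vector sizes and the number of negatives are arbitrary):

import torch
import torch.nn.functional as F

emb = 50
v_c = torch.randn(emb, requires_grad=True)       # input vector of the context word
u_pos = torch.randn(emb, requires_grad=True)     # output vector of the true target word
u_neg = torch.randn(5, emb, requires_grad=True)  # output vectors of 5 negative samples

o_pos = u_pos @ v_c                              # score o_{w_j} for the positive pair
o_neg = u_neg @ v_c                              # scores for the negative pairs

# L = -log sigma(o_pos) - sum_j log sigma(-o_neg_j)
loss = -F.logsigmoid(o_pos) - F.logsigmoid(-o_neg).sum()
loss.backward()                                  # only these few vectors receive gradients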
Updates to the hidden→output weights
- Compute the gradient on the output weights:
  ∂L/∂u_{w_j} = (σ(o_{w_j}) − t_{w_j}) z
- When updating the output weights, only the weight vectors for the words in the positive sample and the negative sample need to be updated:
  u_{w_j} = u_{w_j} − η (σ(o_{w_j}) − t_{w_j}) z
Updates to the input→hidden weights
- Compute the derivative of the loss function with respect to the hidden layer:
  ∂L/∂z = Σ_{w_j ∈ D ∪ D'} (σ(o_{w_j}) − t_{w_j}) u_{w_j}
- In the CBOW algorithm, the weights for all input context words are updated. In the Skip-gram algorithm, the weights for the target word are updated:
  v_{w_i} = v_{w_i} − η (∂L/∂z) x_i
How to pick the negative samples?
- If we just randomly pick a word from a corpus, the probability of any given word w_i getting picked is:
  p(w_i) = freq(w_i) / Σ_{j=0}^{V} freq(w_j)
  More frequent words will be more likely to be picked, and this may not be ideal.
- Adjust the formula to give the less frequent words a bit more chance of getting picked:
  p(w_i) = freq(w_i)^(3/4) / Σ_{j=0}^{V} freq(w_j)^(3/4)
- Generate a sequence of words using the adjusted probability, and randomly pick n_{D'} words.
Use of embeddings: word and short document similarity
- Word embeddings can be used to compute word similarity with cosine similarity:
  sim_cos(u, v) = (u · v) / (||u||_2 ||v||_2)
- How accurately they compute word similarity can also be used to evaluate word embeddings.
- They can also be used to compute the similarity of short documents:
  sim_doc(D_1, D_2) = Σ_{i=1}^{m} Σ_{j=1}^{n} cos(w_i^1, w_j^2)
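A short sketch of both computations in PyTorch (the vectors are random placeholders for real word embeddings):

import torch
import torch.nn.functional as F

u, v = torch.rand(300), torch.rand(300)

sim = F.cosine_similarity(u, v, dim=0)        # u.v / (||u||_2 ||v||_2)
sim_manual = (u @ v) / (u.norm() * v.norm())  # the same formula written out

# Document similarity as the sum of pairwise cosines between the two word sets:
D1 = torch.rand(4, 300)                       # 4 word vectors for document 1
D2 = torch.rand(6, 300)                       # 6 word vectors for document 2
sim_doc = sum(F.cosine_similarity(w1, w2, dim=0) for w1 in D1 for w2 in D2)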
Use of embeddings: word analogy
- What's even more impressive is that they can be used to compute word analogies:
  analogy(m : w → k : ?) = argmax_{v ∈ V\{m,k,w}} cos(v, k − m + w)
  analogy(m : w → k : ?) = argmax_{v ∈ V\{m,k,w}} cos(v, k) − cos(v, m) + cos(v, w)
  analogy(m : w → k : ?) = argmax_{v ∈ V\{m,k,w}} cos(v, k) cos(v, w) / (cos(v, m) + ε)
Word analogy
Use of word embeddings
- Computing word similarities is not a "real" problem in the eyes of many.
- The most important use of word embeddings is as input for predicting the outcome of tasks that have real-world applications.
- There has been much follow-on work on developing more effective word embeddings, e.g., GloVe:
  - word2vec: http://vectors.nlpl.eu/repository
  - fasttext: https://fasttext.cc/docs/en/english-vectors.html
  - GloVe: https://nlp.stanford.edu/projects/glove
- Contextualized word embeddings:
  - ELMo: https://allennlp.org/elmo
  - BERT: https://github.com/google-research/bert
  - RoBERTa: https://pytorch.org/hub/pytorch_fairseq_roberta
Embeddings in Pytorch
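The slide's notebook is not reproduced here; below is a minimal hedged sketch of the core object, nn.Embedding, which is exactly a learnable lookup table (the vocabulary size, dimension, and indices are invented):

import torch
from torch import nn

vocab_size, emb_dim = 10, 3
embedding = nn.Embedding(vocab_size, emb_dim)   # a learnable V x d lookup table

word_ids = torch.tensor([2, 4, 4, 7])           # a short "sentence" of word indices
vectors = embedding(word_ids)                   # shape (4, 3): one row per token

sentence_vec = vectors.mean(dim=0)              # e.g., average into one dense input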
Commonly used neural architectures
- One important aspect of neural network modeling is figuring out the right representation for a problem.
- Commonly used architectures:
  - Variants of Recurrent Neural Networks (RNNs), which have been instrumental in improving many states of the art in NLP
  - Convolutional Networks (CNNs), which have been very effective in image processing and in some NLP problems (e.g., sentence classification)
Convolutional Networks for text classification
Convolutional Networks
- A convolutional network is designed to identify indicative local features in a large structure, and to combine them into a fixed-size vector representation of the structure with a pooling function, capturing the local aspects that are most informative for the prediction task at hand.
- A convolutional network is not fully connected the way a feedforward network is.
- It has been tremendously successful in image processing (computer vision), where the input is the raw pixels of an image.
- In NLP, it has been shown to be effective for sentence classification, among other tasks.
Why it has been so effective in image processing
[Figure: image pixels.]
Four operations in a convolutional network
- Convolution
- Non-linear activation (ReLU)
- Pooling or subsampling (max)
- Classification with a fully connected layer
Image convolution

  O = conv(X, U),  kernel size = (2, 2), strides = 1

  Input image X = [1 1 1
                   0 1 1
                   0 0 1]
  Filter U = [1 0
              0 1]
  Output ("feature map") O = [2 2
                              0 2]
Image convolution: sliding the filter

[Figure: the 2×2 filter slides across the 3×3 input with stride 1; each window position contributes one element of O (e.g., the top-left window [1 1; 0 1] multiplied elementwise by U and summed gives 2).]
Forward computation
  o_11 = x_11 u_11 + x_12 u_12 + x_21 u_21 + x_22 u_22
  o_12 = x_12 u_11 + x_13 u_12 + x_22 u_21 + x_23 u_22
  o_21 = x_21 u_11 + x_22 u_12 + x_31 u_21 + x_32 u_22
  o_22 = x_22 u_11 + x_23 u_12 + x_32 u_21 + x_33 u_22
ReLU
- Nonlinear transformation with ReLU:
  output = ReLU(input) = max(0, input)
- As we know, ReLU is an element-wise transformation that does not change the dimension of the feature map.
- ReLU replaces all negative values in the feature map with 0.
Image ReLU
ReLU activation and Max pooling
- ReLU activation is a component-wise function and does not change the dimension of the input:
  [2 2; 0 2] = ReLU([2 2; 0 2])
- Max pooling does change the dimension of the input; you need to specify the pool size and strides:
  [2] = Max([2 2; 0 2]),  pool size = (2, 2), strides = 2
Training a CNN
- Loss functions: cross-entropy loss, squared error loss
- What are the parameters of a CNN?
  - The filters (kernels), the weight matrices for the feedforward network on top of the convolution and pooling layers, and the biases
- Computing the gradient for the convolution layers is different from a feedforward neural network...
Computing the gradient on U
  ∂E/∂u_11 = (∂E/∂o_11)(∂o_11/∂u_11) + (∂E/∂o_12)(∂o_12/∂u_11) + (∂E/∂o_21)(∂o_21/∂u_11) + (∂E/∂o_22)(∂o_22/∂u_11)
           = (∂E/∂o_11) x_11 + (∂E/∂o_12) x_12 + (∂E/∂o_21) x_21 + (∂E/∂o_22) x_22
  ∂E/∂u_12 = (∂E/∂o_11) x_12 + (∂E/∂o_12) x_13 + (∂E/∂o_21) x_22 + (∂E/∂o_22) x_23
  ∂E/∂u_21 = (∂E/∂o_11) x_21 + (∂E/∂o_12) x_22 + (∂E/∂o_21) x_31 + (∂E/∂o_22) x_32
  ∂E/∂u_22 = (∂E/∂o_11) x_22 + (∂E/∂o_12) x_23 + (∂E/∂o_21) x_32 + (∂E/∂o_22) x_33
Each gradient sums up the errors from all outputs that the filter component has contributed to.
Reverse Convolution
The computation of the gradient on the filter can be vectorized as a convolution of the input with the output gradients:
  [∂E/∂u_11 ∂E/∂u_12; ∂E/∂u_21 ∂E/∂u_22] = conv( [x_11 x_12 x_13; x_21 x_22 x_23; x_31 x_32 x_33], [∂E/∂o_11 ∂E/∂o_12; ∂E/∂o_21 ∂E/∂o_22] )
Computing the gradient on X (if this is not the input layer)
  ∂E/∂x_11 = (∂E/∂o_11) u_11
  ∂E/∂x_12 = (∂E/∂o_11) u_12 + (∂E/∂o_12) u_11
  ∂E/∂x_13 = (∂E/∂o_12) u_12
  ∂E/∂x_21 = (∂E/∂o_11) u_21 + (∂E/∂o_21) u_11
  ∂E/∂x_22 = (∂E/∂o_11) u_22 + (∂E/∂o_12) u_21 + (∂E/∂o_21) u_12 + (∂E/∂o_22) u_11
  ∂E/∂x_23 = (∂E/∂o_12) u_22 + (∂E/∂o_22) u_12
  ∂E/∂x_31 = (∂E/∂o_21) u_21
  ∂E/∂x_32 = (∂E/∂o_21) u_22 + (∂E/∂o_22) u_21
  ∂E/∂x_33 = (∂E/∂o_22) u_22
Full convolution
  [∂E/∂x_11 ∂E/∂x_12 ∂E/∂x_13; ∂E/∂x_21 ∂E/∂x_22 ∂E/∂x_23; ∂E/∂x_31 ∂E/∂x_32 ∂E/∂x_33]
    = full_conv( [∂E/∂o_11 ∂E/∂o_12; ∂E/∂o_21 ∂E/∂o_22], [u_22 u_21; u_12 u_11] )
where the gradient map is zero-padded ("full" convolution) and the filter is rotated by 180°.
Gradient on X if it is not the inputs
[Figure: the full convolution visualized. The 180°-rotated filter [u_22 u_21; u_12 u_11] slides over the zero-padded gradient map [δo_11 δo_12; δo_21 δo_22]; each window position yields one ∂E/∂x_ij, matching the sums above.]
Sample code of 2D convolution with Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, kernel_size=(5, 5), strides=(1, 1),
                 activation='relu',
                 input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Conv2D(64, (5, 5), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(1000, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
Tutorials on how to train CNNs can be found on the Tensorflow website: https://www.tensorflow.org/tutorials/images/cnn
Why convolutational networks for NLP?
- Even though bag-of-words models are simple and work well for some text classification tasks, they don't account for cases where multiple words combine to create meaning, such as "not interesting".
- The analogy with image processing is treating individual pixels as separate features (though the analogy may be going too far).
Input to a convolutional network in a text classification task
The input to a convolutional network can be pre-trained word embeddings (e.g., the weight matrix produced by Word2Vec or GloVe, https://nlp.stanford.edu/projects/glove) applied to the input sentence:
  X^(0) = Θ^(x→z) [e_{w_1}, e_{w_2}, ···, e_{w_M}],   X^(0) ∈ R^(K_e×M)
where e_{w_m} is a column vector of zeros with a 1 at position w_m, and K_e is the size of the embeddings.
Alternative text representations
- Alternatively, a text can be represented as a sequence of word tokens w_1, w_2, w_3, ···, w_M. This view is useful for models such as Convolutional Neural Networks (ConvNets), which process text as a sequence.
- Each word token w_m is represented as a one-hot vector e_{w_m} with dimension V. The complete document can be represented by the horizontal concatenation of these one-hot vectors: W = [e_{w_1}, e_{w_2}, ···, e_{w_M}] ∈ R^(V×M).
- To show that this is equivalent to the bag-of-words model, we can recover the word counts from the matrix-vector product W [1, 1, ···, 1]^⊤ ∈ R^V.
“Convolve” the input with a set of filters
- A filter is a weight matrix C^(k) ∈ R^(K_e×h), where C^(k) is the kth filter. Note that the first dimension of the filter is the same as the size of the embedding.
- Unlike in image processing, the filter covers the full height of the input (the embedding dimension) but not its full width (the sentence length), so it slides only along the word sequence.
- To merge adjacent words, we convolve X^(0) by sliding a set of filters across it:
  X^(1) = f(b + C ∗ X^(0))
where f is an activation function (e.g., tanh, ReLU), b is a vector of bias terms, and ∗ is the convolution operator.
Computing the convolution
- At each position m (the mth word in the sequence), we compute the elementwise product of the kth filter and the window of h words starting at m (think of it as an n-gram of length h), and take its sum: C^(k) ⊙ X^(0)_{m:m+h−1}
- The value at the mth position with the kth filter can be computed as:
  x^(1)_{k,m} = f( b_k + Σ_{k'=1}^{K_e} Σ_{n=1}^{h} c^(k)_{k',n} × x^(0)_{k',m+n−1} )
- When we finish the convolution step, if we have K_f filters of dimension R^(K_e×h), then X^(1) ∈ R^(K_f×(M−h+1)).
- In practice, filters of different sizes are often used to capture n-grams of different lengths, so X^(1) will be K_f vectors of variable lengths, each of size M − h_k + 1. A PyTorch sketch follows below.
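A hedged PyTorch sketch of one convolution layer over a text matrix followed by max pooling over time (all sizes are made up; nn.Conv1d treats the embedding dimension as input channels):

import torch
from torch import nn

K_e, M, K_f, h = 4, 7, 6, 2         # embedding size, sentence length, #filters, filter width
X0 = torch.rand(1, K_e, M)          # (batch, channels = K_e, length = M)

conv = nn.Conv1d(in_channels=K_e, out_channels=K_f, kernel_size=h)
X1 = torch.relu(conv(X0))           # shape (1, K_f, M - h + 1): narrow convolution

z, _ = X1.max(dim=2)                # max pooling over time -> fixed-size (1, K_f) vector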
Convolution step when processing text
  X^(0) (embedding size 3 × sequence length 7):
    [0.4 0.7 0.8 0.3 2.3 3.2 0.7
     0.5 1.3 0.2 1.5 0.8 0.6 1.8
     1.2 2.2 1.1 4.3 0.5 0.3 3.4]
  Filter U (3 × 2):
    [1 0
     0 1
     1 0]
  X^(1) = conv(X^(0), U) = [2.9  3.1  3.4  ?  ?  ?]
(The remaining positions are left as an exercise.)
Padding
- To deal with the beginning and end of the input, the base matrix is often padded with h − 1 column vectors of zeros at the beginning and end. This is called wide convolution.
- If no padding is applied, then the output of each convolution layer will be h − 1 units smaller than the input. This is known as narrow convolution.
Pooling
- After D convolutional layers, assuming the filters have identical lengths, we have a representation of the document as a matrix X^(D) ∈ R^(K_z×M).
- It is very likely that documents will have different lengths, so we need to turn them into matrices of the same size before feeding them to a feedforward network to perform classification.
- This can be done by pooling across time (over the sequence of words).
Prediction and training with CNN
- The CNN output needs to be fed into a feedforward network to make a prediction ŷ and to compute the loss ℓ^(i) during training.
- The parameters of a CNN include the weight matrices of the feedforward network and the filters C^(k) of the CNN, as well as the biases.
- The parameters can be updated with backpropagation, which may involve computing the gradient of the max pooling function:
  ∂z_k/∂x^(D)_{k,m} = 1 if x^(D)_{k,m} = max( x^(D)_{k,1}, x^(D)_{k,2}, ···, x^(D)_{k,M} ), and 0 otherwise
Di↵erent pooling methods
- Max pooling:
  z_k = max( x^(D)_{k,1}, x^(D)_{k,2}, ···, x^(D)_{k,M} )
- Average pooling:
  z_k = (1/M) Σ_{m=1}^{M} x^(D)_{k,m}
A graphic representation of a CNN
Sample code of convolution with Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, \
    Embedding, Flatten, Conv1D, GlobalMaxPooling1D

model = Sequential()
model.add(Embedding(input_dim=n_input_words,
                    output_dim=embedding_dim,
                    input_length=maxlen))
model.add(Conv1D(num_filters, kernel_size,
                 activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(50, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
An Interim Summary of Supervised Learning Methods
- In all the linear and non-linear models we have discussed so far, we assume we have labeled training data on which we can perform supervised learning:
  - a training set where you get observations x and labels y;
  - a test set where you only get observations x.
- A summary of the supervised learning models we have discussed so far:
  - Linear models: Naïve Bayes, Logistic Regression, Perceptron, Support Vector Machines
  - Non-linear models: feed-forward networks, Convolutional Networks
  - Sparse vs. dense feature representations as input to classifiers
- Given sufficient amounts of high-quality data, supervised learning methods tend to produce more accurate classifiers than alternative learning paradigms.
NLP problems that can be formulated as simple text classifications
- An NLP problem can be formulated as a simple text classification if there is no inter-dependence between the labels (y) of the classification instances:
  - Word sense disambiguation
  - Sentiment and opinion analysis
  - Genre classification
  - Others
- NLP problems that cannot be formulated as simple text classifications (or you can, but the results won't be optimal):
  - Sequence labeling problems such as POS tagging and Named Entity Recognition
  - Structured prediction problems such as syntactic parsing
Beyond Supervised Learning
There are other learning scenarios where labeled training sets are available to varying degrees, or not available at all:
- When there is no labeled data at all, we have to do unsupervised learning (e.g., K-Means clustering, variants of the EM algorithm)
- When there is a small amount of labeled data, we might want to try semi-supervised learning
- When there is a lot of labeled data in one domain but only a small amount of labeled data in the target domain, we might try domain adaptation
K-Means clustering algorithm

procedure K-MEANS(x_{1:N}, K)
    for i ∈ 1 ··· N do
        z^(i) ← RandomInt(1, K)                        ▷ initialize cluster membership
    repeat
        for k ∈ 1 ··· K do                             ▷ recompute cluster centers
            ν_k ← ( Σ_{i=1}^{N} δ(z^(i) = k) x^(i) ) / Σ_{i=1}^{N} δ(z^(i) = k)
        for i ∈ 1 ··· N do                             ▷ reassign points to the nearest center
            z^(i) ← argmin_k ||x^(i) − ν_k||^2
    until converged
    return z^(i)
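A compact Python/numpy sketch of the pseudocode above (the data at the bottom is synthetic; the empty-cluster guard is an implementation detail not in the pseudocode):

import numpy as np

def k_means(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.integers(0, K, size=len(X))            # random initial assignments
    for _ in range(iters):
        # recompute cluster centers (re-seeding any empty cluster at random)
        centers = np.array([X[z == k].mean(axis=0) if np.any(z == k)
                            else X[rng.integers(len(X))] for k in range(K)])
        # reassign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        new_z = dists.argmin(axis=1)
        if np.array_equal(new_z, z):               # converged
            break
        z = new_z
    return z

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
labels = k_means(X, K=2)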
K-Means training
[Figure: four scatter plots showing successive K-means iterations on 2-D points; cluster centers and assignments are updated until convergence.]
- K-means clustering is non-parametric and has no parameters to update.
- The number of clusters needs to be pre-specified before the training process starts.
Semi-supervised learning
- Initialize parameters with supervised learning and then apply unsupervised learning (such as the EM algorithm)
- Multi-view learning: co-training
  - Divide the features into multiple views, and train a classifier for each view
  - Each classifier predicts labels for a subset of the unlabeled instances, using only the features available in its view. These predictions are then used as ground truth to train the classifiers associated with the other views.
  - Named entity example: named-entity view and local-context view
  - Word sense disambiguation: local-context view and global-context view
Domain adaptation
Supervised domain adaptation: "frustratingly simple" domain adaptation (Daumé III, 2007)
- Create copies of each feature: one for each domain and one for the cross-domain setting:
  f(x, y, d) = {(boring, NEG, Movie):1, (boring, NEG, *):1, (flavorless, NEG, Movie):1, (flavorless, NEG, *):1, (three-day-old, NEG, Movie):1, (three-day-old, NEG, *):1, ···}
  where d is the domain.
- Let the learning algorithm allocate weights between domain-specific and cross-domain features: for words that facilitate prediction in both domains, the learner will use the cross-domain features; for words that are only relevant to a particular domain, the domain-specific features will be used.
Other learning paradigms
- Active learning: a learning paradigm often used to reduce the number of instances that have to be annotated while still producing the same level of accuracy
- Distant supervision: there is no labeled data, but you can generate some (potentially noisy) training data with an external resource such as a dictionary. For example, you can generate named entity annotations with a list of names.
- Multitask learning: the learner induces a representation that can be used to solve multiple tasks (e.g., learning POS tagging together with syntactic parsing)