COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042
Natural Language Processing
Lecture 7
Semester 1 2021 Week 4
Jey Han Lau
Deep Learning for NLP:
Feedforward Networks
Outline
• Feedforward Neural Networks Basics
• Applications in NLP
• Convolutional Networks
Deep Learning
• A branch of machine learning
• Re-branded name for neural networks
• Why deep? Many layers are chained together in
modern deep learning models
• Neural networks: historically inspired by the way
computation works in the brain
‣ Consists of computation units called neurons
Feed-forward NN
• Aka multilayer perceptrons
• Each arrow carries a
weight, reflecting its
importance
• Certain layers have non-
linear activation functions
Neuron
• Each neuron is a function
‣ given input x, computes a real-valued (scalar) output h
‣ scales input (with weights, w) and adds offset (bias, b)
‣ applies a non-linear function, such as the logistic sigmoid, hyperbolic tangent (tanh), or rectified linear unit
‣ w and b are parameters of the model

    h = tanh( ∑_j w_j x_j + b )
Figure 8.1: The sigmoid function takes a real value and maps it to the range [0, 1]. Because it is nearly linear around 0 but saturates (flattens) toward the ends, it tends to squash outlier values toward 0 or 1.

The unit multiplies each input value by a weight (w1, w2, and w3, respectively), adds them to a bias term b, and then passes the resulting sum through a sigmoid function to produce a number between 0 and 1.

Figure 8.2: A neural unit, taking 3 inputs x1, x2, and x3 (and a bias b that we represent as a weight for an input clamped at +1) and producing an output y. We include some convenient intermediate variables: the output of the summation, z, and the output of the sigmoid, a. In this case the output of the unit y is the same as a, but in deeper networks we'll reserve y to mean the final output of the entire network, leaving a as the activation of an individual node.

Let's walk through an example just to get an intuition. Suppose we have a unit with the following weights and bias:

    w = [0.2, 0.3, 0.9]
    b = 0.5

What would this unit do with the following input vector?

    x = [0.5, 0.6, 0.1]
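A minimal NumPy sketch of this worked example, using exactly the weights, bias, and input given above (it prints z = 0.87 and a ≈ 0.70):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([0.2, 0.3, 0.9])   # weights from the example
b = 0.5                         # bias
x = np.array([0.5, 0.6, 0.1])   # input vector

z = w @ x + b                   # weighted sum plus bias: 0.87
a = sigmoid(z)                  # squashed through the sigmoid: ~0.70
print(z, a)
```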
Matrix Vector Notation
‣ Typically have several hidden units, i.e.

    h_i = tanh( ∑_j w_ij x_j + b_i )

‣ Each with its own weights (w_i) and bias term (b_i)
‣ Can be expressed using matrix and vector operators

    h = tanh( W x + b )

‣ where W is a matrix comprising the weight vectors, and b is a vector of all bias terms
‣ Non-linear function applied element-wise
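A small NumPy sketch of the matrix-vector form; the layer sizes (4 inputs, 3 hidden units) and the random initialisation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([1.0, 0.5, -0.2, 0.3])   # input vector (4 features)
W = rng.normal(size=(3, 4))           # weight matrix: one row of weights per hidden unit
b = rng.normal(size=3)                # one bias per hidden unit

h = np.tanh(W @ x + b)                # tanh applied element-wise
print(h.shape)                        # (3,) -- one activation per hidden unit
```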
Output Layer
• Binary classification problem
‣ e.g. classify whether a tweet is + or – in sentiment
‣ sigmoid activation function
• Multi-class classification problem
‣ e.g. native language identification
‣ softmax ensures probabilities > 0 and sum to 1
    [ exp(v_1) / ∑_i exp(v_i),  exp(v_2) / ∑_i exp(v_i),  ...,  exp(v_m) / ∑_i exp(v_i) ]
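A minimal sketch of softmax in NumPy; the example scores are made up, and the max-subtraction is a standard numerical-stability trick rather than part of the definition:

```python
import numpy as np

def softmax(v):
    # Subtracting the max does not change the result but avoids overflow in exp
    e = np.exp(v - np.max(v))
    return e / e.sum()

v = np.array([2.0, 1.0, 0.1])   # illustrative output-layer scores
p = softmax(v)
print(p, p.sum())               # probabilities > 0 that sum to 1
```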
Learning from Data
• How to learn the parameters from data?
• Consider how well the model “fits” the training data,
in terms of the probability it assigns to the correct
output
‣ want to maximise total probability, L
‣ equivalently minimise -log L with respect to parameters
• Trained using gradient descent
‣ tools like TensorFlow, PyTorch and DyNet use autodiff to compute gradients automatically
    L = ∏_{i=0}^{m} P(y_i | x_i)
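A hedged PyTorch sketch of this training setup (the slide lists PyTorch as one autodiff tool); the toy data, layer sizes, and learning rate are all illustrative assumptions:

```python
import torch
import torch.nn as nn

# Toy data: 4-dimensional inputs, 3 classes (illustrative only)
X = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))

model = nn.Sequential(nn.Linear(4, 5), nn.Tanh(), nn.Linear(5, 3))
loss_fn = nn.CrossEntropyLoss()              # -log P(correct class), averaged over examples
optim = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    optim.zero_grad()
    loss = loss_fn(model(X), y)              # negative log-likelihood of the batch
    loss.backward()                          # autodiff computes the gradients
    optim.step()                             # gradient descent update
```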
Regularisation
• Have many parameters, overfits easily
• Low bias, high variance
• Regularisation is very very important in NNs
• L1-norm: sum of absolute values of all parameters (W, b, etc.)
• L2-norm: sum of squares
• Dropout: randomly zero-out some neurons of a layer
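A minimal NumPy sketch of how the L1 and L2 penalties could be added to the loss; the weights, the data-loss value, and the regularisation strength lam are illustrative assumptions:

```python
import numpy as np

W = np.random.randn(3, 4)      # illustrative weight matrix
b = np.random.randn(3)
lam = 0.01                     # regularisation strength (hyperparameter)

l1_penalty = lam * (np.abs(W).sum() + np.abs(b).sum())   # sum of absolute values
l2_penalty = lam * ((W ** 2).sum() + (b ** 2).sum())     # sum of squares

data_loss = 0.42                      # placeholder for -log L on the training data
total_loss = data_loss + l2_penalty   # minimised instead of the data loss alone
```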
Dropout
• If dropout rate = 0.1, a
random 10% of neurons
now have 0 values
• Can apply dropout to any
layer, but in practice,
mostly to the hidden
layers
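A sketch of one common way to implement dropout ("inverted" dropout, which rescales the surviving activations so their expected value is unchanged and nothing needs to change at test time); the layer size and dropout rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, rate=0.1, training=True):
    if not training:
        return h                               # no dropout at test time
    mask = rng.random(h.shape) >= rate         # keep each neuron with probability 1 - rate
    return h * mask / (1.0 - rate)             # rescale the surviving activations

h = np.tanh(rng.normal(size=10))               # some hidden-layer activations
print(dropout(h, rate=0.1))                    # ~10% of values zeroed out
```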
Why Does Dropout Work?

• It prevents the model from being over-reliant on certain neurons
• It penalises large parameter weights
• It normalises the values of different neurons of a layer, ensuring that they have zero mean
• It introduces noise into the network

PollEv.com/jeyhanlau569
Applications in NLP
Topic Classification
• Given a document, classify it into a predefined set
of topics (e.g. economy, politics, sports)
• Input: bag-of-words
           love   cat   dog   doctor
  doc 1     0      2     3      0
  doc 2     2      0     2      0
  doc 3     0      0     0      4
  doc 4     3      0     0      2
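A small plain-Python sketch of building such bag-of-words count vectors; the two documents are hypothetical texts chosen to reproduce the first two rows of the table:

```python
from collections import Counter

vocab = ["love", "cat", "dog", "doctor"]          # fixed vocabulary
docs = [
    "cat cat dog dog dog",                        # hypothetical doc 1 -> [0, 2, 3, 0]
    "love love dog dog",                          # hypothetical doc 2 -> [2, 0, 2, 0]
]

def bag_of_words(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]             # count vector in vocabulary order

for d in docs:
    print(bag_of_words(d))
```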
Topic Classification – Training
• Randomly initialise W and b

    h1 = tanh( W1 x + b1 )
    h2 = tanh( W2 h1 + b2 )
    y = softmax( W3 h2 )

• x = [0, 2, 3, 0]
• y = [0.1, 0.6, 0.3]: probability distribution over C1, C2, C3
• if true label is C1: L = −log(0.1)
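A NumPy sketch of this forward pass and loss; the input x and the three-class output match the slide, while the hidden-layer sizes and random initialisation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

# Randomly initialise W and b (sizes are illustrative)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(4, 5)), np.zeros(4)
W3 = rng.normal(size=(3, 4))

x = np.array([0.0, 2.0, 3.0, 0.0])      # bag-of-words input from the slide

h1 = np.tanh(W1 @ x + b1)
h2 = np.tanh(W2 @ h1 + b2)
y = softmax(W3 @ h2)                    # probability distribution over C1, C2, C3

true_class = 0                          # suppose the true label is C1
loss = -np.log(y[true_class])           # L = -log P(true class)
```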
Topic Classification – Prediction
• x = [1, 3, 5, 0] (test document)

    h1 = tanh( W1 x + b1 )
    h2 = tanh( W2 h1 + b2 )
    y = softmax( W3 h2 )

• y = [0.2, 0.1, 0.7]
• Predicted class = C3
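At prediction time the same forward pass is run and the highest-probability class is taken, e.g.:

```python
import numpy as np

y = np.array([0.2, 0.1, 0.7])            # output distribution for the test document
predicted_class = int(np.argmax(y))      # index 2, i.e. C3
```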
Topic Classification – Improvements
• Add bag of bigrams as input
• Preprocess text to lemmatise words and remove
stopwords
• Instead of raw counts, we can weight words using
TF-IDF or indicators (0 or 1 depending on
presence of words)
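As a hedged illustration (not part of the lecture), the bigram, stopword, and weighting options roughly map onto scikit-learn's TfidfVectorizer; lemmatisation would still need a separate preprocessing step:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat chases the dog", "the doctor treats patients"]   # illustrative documents

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),        # unigrams + bigrams as input features
    stop_words="english",      # remove stopwords
)
X = vectorizer.fit_transform(docs)          # document-term matrix with TF-IDF weights
# For plain 0/1 indicators, CountVectorizer(binary=True) could be used instead.
print(vectorizer.get_feature_names_out())
```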
Language Model Revisited
• Assign a probability to a sequence of words
• Framed as “sliding a window” over the sentence,
predicting each word from finite context
E.g., n = 3, a trigram model
• Training involves collecting frequency counts
‣ Difficulty with rare events → smoothing
    P(w1, w2, ..., wm) = ∏_{i=1}^{m} P(w_i | w_{i−2}, w_{i−1})
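For reference, a tiny sketch of the count-based estimate for one trigram; the counts are made up and smoothing is omitted:

```python
from collections import defaultdict

# Illustrative trigram and bigram counts collected from a training corpus
trigram_counts = defaultdict(int, {("salt", "and", "pepper"): 8,
                                   ("salt", "and", "vinegar"): 2})
bigram_counts = defaultdict(int, {("salt", "and"): 10})

def p_trigram(w, u, v):
    # P(w | u, v) = count(u, v, w) / count(u, v); no smoothing here
    return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

print(p_trigram("pepper", "salt", "and"))   # 0.8
```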
Language Models as Classifiers
LMs can be considered simple classifiers, e.g. for a trigram model:

    P(w_i | w_{i−2} = salt, w_{i−1} = and)

classifies the likely next word in a sequence, given "salt" and "and".
Feed-forward NN Language Model
• Use neural network as a classifier to model

    P(w_i | w_{i−2} = salt, w_{i−1} = and)

• Input features = the previous two words
• Output class = the next word
• How to represent words? Embeddings, e.g.

    [0.1, -1.5, 2.3, 0.9, -3.2, 2.5, 1.1]
Word Embeddings
• Maps discrete word symbols to continuous vectors
in a relatively low dimensional space
• Word embeddings allow the model to capture
similarity between words
‣ dog vs. cat
‣ walking vs. running
Topic Classification
    h1 = tanh( W1 x + b1 )
    h2 = tanh( W2 h1 + b2 )
    y = softmax( W3 h2 )

• First layer = sum of input word embeddings
• W1 = word embeddings!
Training a FFNN LM
• P(w_i = grass | w_{i−3} = a, w_{i−2} = cow, w_{i−1} = eats)
• Lookup word embeddings (W1) for a, cow and eats
• Concatenate them and feed it to the network:

    x = v_a ⊕ v_cow ⊕ v_eats
    h = tanh( W2 x + b1 )
    y = softmax( W3 h )

  Word embeddings W1:

         a     grass   eats   hunts   cow
        0.9     0.2    -3.3   -0.1   -0.5
        0.2    -2.3     0.6   -1.5    1.2
       -0.6     0.8     1.1    0.3   -2.4
        1.5     0.8     0.1    2.5    0.4
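A NumPy sketch of this lookup-and-concatenate forward pass, reusing the embedding table above; the hidden-layer size and the other weight matrices are illustrative random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["a", "grass", "eats", "hunts", "cow"]
word2id = {w: i for i, w in enumerate(vocab)}

# Embedding table W1 from the slide: one 4-dimensional column per word
W1 = np.array([[ 0.9,  0.2, -3.3, -0.1, -0.5],
               [ 0.2, -2.3,  0.6, -1.5,  1.2],
               [-0.6,  0.8,  1.1,  0.3, -2.4],
               [ 1.5,  0.8,  0.1,  2.5,  0.4]])

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

context = ["a", "cow", "eats"]
x = np.concatenate([W1[:, word2id[w]] for w in context])   # x = v_a ⊕ v_cow ⊕ v_eats (length 12)

W2, b = rng.normal(size=(8, 12)), np.zeros(8)               # illustrative hidden layer
W3 = rng.normal(size=(len(vocab), 8))                       # output embeddings, |V| x d

h = np.tanh(W2 @ x + b)
y = softmax(W3 @ h)                     # probability distribution over the whole vocabulary
loss = -np.log(y[word2id["grass"]])     # training minimises -log P(grass | a, cow, eats)
```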
Training a FFNN LM
• y gives the probability distribution over all words in the vocabulary, e.g.

        rabbit   grass   eats   hunts   cow
         0.01    0.80    0.05   0.10    0.04

• P(w_i = grass | w_{i−3} = a, w_{i−2} = cow, w_{i−1} = eats) = 0.8
  L = −log(0.8)
• Most parameters are in the word embeddings W1 (size = d1 × |V|) and the output embeddings W3 (size = |V| × d3)
Input and Output Word Embeddings
• P(w_i = grass | w_{i−3} = a, w_{i−2} = cow, w_{i−1} = eats)
• Lookup word embeddings (W1) for a, cow and eats
• Concatenate them and feed it to the network:

    x = v_a ⊕ v_cow ⊕ v_eats
    h = tanh( W2 x + b1 )
    y = softmax( W3 h )

  Word embeddings W1 (d1 × |V|):

         a     grass   eats   hunts   cow
        0.9     0.2    -3.3   -0.1   -0.5
        0.2    -2.3     0.6   -1.5    1.2
       -0.6     0.8     1.1    0.3   -2.4
        1.5     0.8     0.1    2.5    0.4

  Output word embeddings W3 (|V| × d3)
Language Model: Architecture
(Bengio et al., 2003)

• Input: the context words, e.g. "a cow eats"
• Concatenate their word embeddings
• Apply a non-linear activation
• Softmax to produce a probability distribution over all words in the vocabulary, e.g. P(w_t = grass | context):

    [ exp(v_1) / ∑_i exp(v_i),  exp(v_2) / ∑_i exp(v_i),  ...,  exp(v_m) / ∑_i exp(v_i) ]
Advantages of FFNN LM
• Count-based N-gram models (lecture 3)
‣ cheap to train (just collect counts)
‣ problems with sparsity and scaling to larger contexts
‣ don’t adequately capture properties of words
(grammatical and semantic similarity), e.g., film vs
movie
• FFNN N-gram models
‣ automatically capture word properties, leading to more
robust estimates
What Are the Disadvantages of a Feedforward NN Language Model?

– Very slow to train
– Captures only limited context
– Unable to handle unseen n-grams
– Unable to handle unseen words

PollEv.com/jeyhanlau569
POS Tagging
• POS tagging can also be framed as classification:

    P(t_i | w_{i−1} = cow, w_i = eats)

  classifies the likely POS tag for "eats".
• FFNN LM architecture can be adapted to the task directly
Feed-forward NN for Tagging
• Inputs:
  ‣ recent words: w_{i−2}, w_{i−1}, w_i
  ‣ recent tags: t_{i−2}, t_{i−1}
• And output: current tag t_i
• Frame as neural network with
  ‣ 5 inputs: 3 x word embeddings and 2 x tag embeddings
  ‣ 1 output: vector of size |T|, using softmax
• Train to minimise

    − ∑_i log P(t_i | w_{i−2}, w_{i−1}, w_i, t_{i−2}, t_{i−1})
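A hedged NumPy sketch of this tagger's input and output layers: three word embeddings and two tag embeddings are concatenated, passed through a hidden layer, and softmaxed over |T| tags; all sizes, tables, and indices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

d_word, d_tag, n_tags, vocab_size = 4, 3, 5, 100    # illustrative sizes
word_emb = rng.normal(size=(vocab_size, d_word))    # word embedding table
tag_emb = rng.normal(size=(n_tags, d_tag))          # tag embedding table

# Indices for w_{i-2}, w_{i-1}, w_i and t_{i-2}, t_{i-1} (made-up ids)
word_ids, tag_ids = [17, 42, 7], [1, 3]

x = np.concatenate([word_emb[i] for i in word_ids] +
                   [tag_emb[j] for j in tag_ids])   # 3*d_word + 2*d_tag = 18 values

W1, b1 = rng.normal(size=(10, x.size)), np.zeros(10)
W2 = rng.normal(size=(n_tags, 10))

h = np.tanh(W1 @ x + b1)
p = softmax(W2 @ h)          # distribution over |T| tags for position i
loss = -np.log(p[2])         # -log P(t_i = gold tag); summed over positions during training
```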
FFNN for Tagging (architecture diagram)
Convolutional Networks
Convolutional Networks
• Commonly used in computer vision
• Identify indicative local predictors
• Combine them to produce a fixed-size
representation
https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
Convolutional Networks for NLP
• Sliding window (e.g. 3 words) over sequence
• W = convolution filter (linear transformation+tanh)
• max-pool to produce a fixed-size representation
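A small NumPy sketch of this idea: slide a window of 3 words over the sequence, apply a filter W (linear transformation + tanh) to each window, then max-pool over positions to get a fixed-size representation; the sentence length, embedding size, and number of filters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_emb, window, n_filters = 7, 4, 3, 5      # illustrative sizes
E = rng.normal(size=(seq_len, d_emb))               # word embeddings for a 7-word sentence
W = rng.normal(size=(n_filters, window * d_emb))    # one row per convolution filter
b = np.zeros(n_filters)

# Apply the filters to every window of 3 consecutive words
features = []
for start in range(seq_len - window + 1):
    window_vec = E[start:start + window].reshape(-1)     # concatenate 3 word embeddings
    features.append(np.tanh(W @ window_vec + b))
features = np.stack(features)                            # (positions, filters) = (5, 5)

sentence_rep = features.max(axis=0)    # max-pool over positions: fixed-size representation
print(sentence_rep.shape)              # (5,) regardless of sentence length
```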
Final Words
• Pros
‣ Excellent performance
‣ Less hand-engineering of features
‣ Flexible — customised architecture for different tasks
• Cons
‣ Much slower than classical ML models… needs GPU
‣ Lots of parameters due to vocabulary size
‣ Data hungry, not so good on tiny data sets
‣ Pre-training on big corpora helps
Readings
• Feed-forward network: G15, section 4; JM Ch. 7.3
• Convolutional network: G15, section 9