COMP5046
Natural Language Processing
Lecture 3: Word Classification and Machine Learning
Dr. Caren Han
Semester 1, 2021
School of Computer Science, University of Sydney
0
LECTURE PLAN
Lecture 3: Word Classification and Machine Learning
1. Previous Lecture: Word Embedding Review
2. Word Embedding Evaluation
3. Deep Neural Network for Natural Language Processing
   1. Perceptron and Neural Network (NN)
   2. Multilayer Perceptron
   3. Applications
4. Next Week Preview: see how Deep Learning can be used for NLP – Text Classification, etc.
1
Previous Lecture Review: Word2Vec Models
CBOW
Predict center word from (bag of) context words
Skip-gram
Predict context words given center word
1
Previous Lecture Review
Word2Vec with Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”
Using window slicing (window size m = 2), build the training data.
One-hot vectors (|V| = 7): Sydney = [1,0,0,0,0,0,0], is = [0,1,0,0,0,0,0], the = [0,0,1,0,0,0,0], state = [0,0,0,1,0,0,0], capital = [0,0,0,0,1,0,0], of = [0,0,0,0,0,1,0], NSW = [0,0,0,0,0,0,1]

Center word → Context ("outside") words:
Sydney  [1,0,0,0,0,0,0] → is, the: [0,1,0,0,0,0,0], [0,0,1,0,0,0,0]
is      [0,1,0,0,0,0,0] → Sydney, the, state: [1,0,0,0,0,0,0], [0,0,1,0,0,0,0], [0,0,0,1,0,0,0]
the     [0,0,1,0,0,0,0] → Sydney, is, state, capital: [1,0,0,0,0,0,0], [0,1,0,0,0,0,0], [0,0,0,1,0,0,0], [0,0,0,0,1,0,0]
state   [0,0,0,1,0,0,0] → is, the, capital, of: [0,1,0,0,0,0,0], [0,0,1,0,0,0,0], [0,0,0,0,1,0,0], [0,0,0,0,0,1,0]
capital [0,0,0,0,1,0,0] → the, state, of, NSW: [0,0,1,0,0,0,0], [0,0,0,1,0,0,0], [0,0,0,0,0,1,0], [0,0,0,0,0,0,1]
of      [0,0,0,0,0,1,0] → state, capital, NSW: [0,0,0,1,0,0,0], [0,0,0,0,1,0,0], [0,0,0,0,0,0,1]
NSW     [0,0,0,0,0,0,1] → capital, of: [0,0,0,0,1,0,0], [0,0,0,0,0,1,0]
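A small sketch of the window-slicing step above: for each centre word, collect the context words within ±m positions (m = 2 here). This is illustrative code, not the library implementation.

```python
# Build (center word, context words) training pairs with window size m = 2.
sentence = "Sydney is the state capital of NSW".split()
m = 2

training_data = []
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - m), min(len(sentence), i + m + 1))
               if j != i]
    training_data.append((center, context))

for center, context in training_data:
    print(center, "<-", context)
# e.g. state <- ['is', 'the', 'capital', 'of']
```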
1
Previous Lecture Review: CBOW – Neural Network Architecture
Predict center word from (bag of) context words
Sentence: "Sydney is the state capital of NSW"
Input layer: one-hot vectors for the context words "is", "the", "capital", "of"
Projection layer: v̂ = (V_is + V_the + V_capital + V_of) / 2m, where m is the window size
Output layer: z = W′ (N×V) · v̂, ŷ = softmax(z), compared against the center word "state"
N = dimension of the word embedding (representation) – a parameter
• How does this weight work?
• What is the Softmax?
• What is the Cross Entropy?
• How to train the model?
1
Previous Lecture Review: CBOW – Neural Network Architecture
Predict center word from (bag of) context words
Sentence: "Sydney is the state capital of NSW"
(Input layer → projection layer v̂ = (V_is + V_the + V_capital + V_of) / 2m → output layer ŷ = softmax(W′ · v̂), predicting the center word "state"; N = embedding dimension.)
How can we know whether this word2vec is well trained?
0
LECTURE PLAN
Lecture 3: Word Classification and Machine Learning
1. Previous Lecture: Word Embedding Review
2. Word Embedding Evaluation
3. Deep Neural Network for Natural Language Processing
   1. Perceptron and Neural Network (NN)
   2. Multilayer Perceptron
   3. Applications
4. Next Week Preview: see how Deep Learning can be used for NLP – Text Classification, etc.
2
Word Embedding Evaluation: How to evaluate word vectors?

Intrinsic: evaluation on a specific/intermediate subtask
• Fast to compute
• Helps to understand that system
• Not clear if really helpful unless correlation to the real task is established

Extrinsic: evaluation on a real task
• Can take a long time to compute accuracy
• Unclear if the subsystem is the problem, or its interaction with other subsystems
2
Word Embedding Evaluation: Intrinsic word vector evaluation
Word Vector Analogies
a : b :: c : ?    e.g. man : woman :: king : ?
• Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
2
Word Embedding Evaluation: Intrinsic word vector evaluation
Word Vector Analogies: King – Man + Woman = ?

No.  Training Dataset   Type                 Result
1    TED Script         word2vec CBOW        President
2    TED Script         word2vec Skip-gram   Luther
3    TED Script         fastText CBOW        Kidding
4    TED Script         fastText Skip-gram   Jarring
5    Google News        word2vec CBOW        queen
6    Google News        word2vec Skip-gram   queen
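A hedged sketch of running the same analogy test with gensim and pretrained Google News vectors (it assumes the vectors file is available locally; the file name shown is the usual distribution name, not something provided with this lecture).

```python
# Word-vector analogy via vector arithmetic: king - man + woman ≈ ?
from gensim.models import KeyedVectors

# Hypothetical local path to the pretrained Google News vectors.
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Cosine-similarity analogy query.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# With the Google News vectors, 'queen' is expected near the top of the list.
```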
2
Word Embedding Evaluation: Intrinsic word vector evaluation
Evaluation Result Comparison
The Semantic-Syntactic word relationship test set covers a wide variety of relationship types.
Using 640-dimensional word vectors, a skip-gram model achieved 55% semantic accuracy and 59% syntactic accuracy.
(Original word2vec paper – Mikolov et al., 2013)
2
Word Embedding Evaluation: Intrinsic word vector evaluation
Evaluation Result Comparison
The Semantic-Syntactic word relationship test set covers a wide variety of relationship types (accuracy table reproduced from the paper).
(Original GloVe paper – Pennington et al., 2014)
2
Word Embedding Evaluation: Intrinsic word vector evaluation
Evaluation Result Comparison
Accuracy on the Semantic-Syntactic word relationship test also depends on the window size (m) and the vector dimension (N).
(Original GloVe paper – Pennington et al., 2014)
2
Word Embedding Evaluation: How to evaluate word vectors?

Intrinsic: evaluation on a specific/intermediate subtask
• Fast to compute
• Helps to understand that system
• Not clear if really helpful unless correlation to the real task is established

Extrinsic: evaluation on a real task
• Can take a long time to compute accuracy
• Unclear if the subsystem is the problem, or its interaction with other subsystems
0
LECTURE PLAN
Lecture 3: Word Classification and Machine Learning
1. Previous Lecture: Word Embedding Review
2. Word Embedding Evaluation
3. Deep Neural Network for Natural Language Processing
   1. Perceptron and Neural Network (NN)
   2. Multilayer Perceptron
   3. Applications
4. Next Week Preview: see how Deep Learning can be used for NLP – Text Classification, etc.
3
Deep Learning for NLP
Deep Learning with Neural Network
Neuron and Perceptron
Neuron: input signals → output signals
Perceptron: inputs → outputs
(The perceptron is the artificial counterpart of the biological neuron: it takes inputs, weights them, and produces an output.)
3
Deep Learning for NLP
Deep Learning with Neural Network
Inputs and Outputs (Labels) for Natural Language Processing
xi – Inputs (features): words (indices or vectors!), context windows, sentences, documents, etc.
yi – Outputs (labels): what we try to predict/classify, e.g. word meaning, sentiment, named entity
Perceptron: inputs → outputs
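A minimal perceptron sketch matching the inputs → outputs picture above: a weighted sum of input features plus a bias, passed through a step activation. The feature values, weights, and bias below are illustrative, not taken from the lecture.

```python
# A single perceptron: output = step(w . x + b).
import numpy as np

def perceptron(x, w, b):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    return int(np.dot(w, x) + b > 0)

x = np.array([1.0, 0.0, 1.0])    # a tiny feature vector for a word/context (illustrative)
w = np.array([0.5, -0.3, 0.8])   # weights (illustrative)
b = -0.6                         # bias (illustrative)
print(perceptron(x, w, b))       # -> 1
```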
3
Deep Learning for NLP
Deep Learning with Neural Network
Input: x = number of apples given by Lisa
Output: y = number of bananas received by Lisa
Parameters: need to be estimated
"Lisa, give me an apple. I will give you three bananas then!"
Example: Input (x) = 1 → Output (y) = 3
Model: a single weight W connects the input to the output, y = Wx; here W = 3, i.e. y = 3x.
3
Deep Learning for NLP
3
Deep Learning for NLP
Deep Learning with Neural Network – Model
Input: x = number of apples given by Lisa
Output: y = number of bananas received by Lisa
Parameters: need to be estimated
"Guess how much I will give you back!"
Data: x = 1, 5, 6 → y = 0, 16, 20
Model: y = Wx. What is W then?
3
Deep Learning for NLP
Deep Learning with Neural Network – Parameter
Input: x = number of apples given by Lisa
Output: y = number of bananas received by Lisa
Parameters: need to be estimated
Data (plotted as a scatter plot, axes 0-20):
x: 1, 5, 6
y: 0, 16, 20
Model: y = Wx. What is W then?
What if W = 3?   ŷ = 3×1 = 3, 3×5 = 15, 3×6 = 18   (vs. y = 0, 16, 20)
What if W = 3.2?  ŷ = 3.2×1 = 3.2, 3.2×5 = 16, 3.2×6 = 19.2   (vs. y = 0, 16, 20)
3
Deep Learning for NLP
Deep Learning with Neural Network – Parameter
Input: x = number of apples given by Lisa
Output: y = number of bananas received by Lisa
Parameters: need to be estimated
Data: (x, y) = (1, 0), (5, 16), (6, 20)
Weight alone is not enough... add a bias:
Model: y = Wx + b   (parameters: weight W and bias b)
How can we find the parameters, W and b?
3
Deep Learning for NLP
Deep Learning with Neural Network – Cost
Actual Data: y = ?x + ?   (weight and bias unknown)
x: 1, 5, 6
y: 0, 16, 20

Model Ex#1: ŷ = 1x + 0 → predictions ŷ = 1, 5, 6
Model Ex#2: ŷ = 2x + 2 → predictions ŷ = 4, 12, 14

Which one is closer? Let's calculate the cost (loss) from the squared differences (y − ŷ)²:
C(w, b) = Σ_{n∈{0,1,2}} (y_n − ŷ_n)²   (a squared-error / Mean Squared Error style cost)
3
Deep Learning for NLP
WAIT! Loss Function? Cost Calculation?
Input (x) → [Weight, Bias] → Predicted (ŷ) → Loss Function ← Output (y)
1) Mean Squared Error (MSE): measures the average of the squares of the errors
2) Cross Entropy: calculating the difference between two probability distributions
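Minimal NumPy sketches of the two loss functions just defined; the example inputs are illustrative only.

```python
import numpy as np

def mse(y, y_hat):
    """Mean Squared Error: average of the squared differences."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross entropy between a true distribution y (e.g. one-hot) and predicted probabilities y_hat."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return -np.sum(y * np.log(y_hat + eps))

print(mse([0, 16, 20], [1, 5, 6]))                 # squared-error example with Model Ex#1 predictions
print(cross_entropy([0, 1, 0], [0.1, 0.7, 0.2]))   # one-hot target vs. a softmax-style output
```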
3
Deep Learning for NLP
Deep Learning with Neural Network – Cost (loss)
Actual Data: y = ?x + ?; x: 1, 5, 6; y: 0, 16, 20
Let's calculate the cost!  C(w, b) = Σ_{n∈{0,1,2}} (y_n − ŷ_n)²

Model Ex#1: ŷ = 1x + 0 → ŷ = 1, 5, 6; (y − ŷ)² = 1, 121, 196 → C(1, 0) = 318
Model Ex#2: ŷ = 2x + 2 → ŷ = 4, 12, 14; (y − ŷ)² = 16, 16, 36 → C(2, 2) = 68
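A quick check of the two costs above with the squared-error cost C(w, b).

```python
# C(w, b) = sum over the data of (y - (w*x + b))**2
xs = [1, 5, 6]
ys = [0, 16, 20]

def cost(w, b):
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))

print(cost(1, 0))  # -> 318  (Model Ex#1)
print(cost(2, 2))  # -> 68   (Model Ex#2)
```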
3
Deep Learning for NLP
Deep Learning with Neural Network – Cost (loss)
Actual Data: x = 1, 5, 6; y = 0, 16, 20; model y = Wx + b (weight W, bias b unknown)
Let's calculate the costs and get the lowest one!
arg min_{w,b} C(w, b)
The lowest: w, b = 4, -4 gives C(w0, b0) = 0   (check: 4×1 - 4 = 0, 4×5 - 4 = 16, 4×6 - 4 = 20)
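A brute-force reading of "calculate the costs and get the lowest one": scan a small grid of integer (w, b) candidates and keep the cheapest. The grid range is an assumption made for this sketch.

```python
import itertools

xs, ys = [1, 5, 6], [0, 16, 20]
cost = lambda w, b: sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))

# Try every (w, b) pair in a small integer grid and keep the lowest-cost one.
best = min(itertools.product(range(-10, 11), repeat=2), key=lambda wb: cost(*wb))
print(best, cost(*best))   # -> (4, -4) 0
```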
3
Deep Learning for NLP
Deep Learning with Neural Network – Optimizer
Backpropagation (weight update):
Input (x) → [Weight, Bias] → Predicted (ŷ) → Loss Function ← Output (y); the loss is fed back to update the weight and bias.
Result: y = 4x - 4 (weight 4, bias -4), i.e. arg min_{w,b} C(w, b) is reached at w, b = 4, -4 with C(w0, b0) = 0.
Oh, wait... do we need to calculate the cost for every possible value of W and b?
That would be expensive to compute (hours or days).
3
Deep Learning for NLP
Finding the Optimal weight and bias – Gradient Descent
Gradient = slope of the cost curve (Cost (Error) plotted against the weight)
There are different types of gradient descent optimization algorithms:
Batch Gradient Descent, Stochastic Gradient Descent, Momentum, Adam, etc.
3
Deep Learning for NLP: Choose the optimal Learning Rate!
Learning Rate: a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated.
• Too large a learning rate → unstable performance (the updates overshoot the minimum).
• Too small a learning rate → long training time (gradient descent can be slow).
Update rule: new_weight = existing_weight - learning_rate * gradient
For a single neuron this expands to:
new_weight = existing_weight - learning_rate * (current_output - desired_output) * gradient(current_output) * existing_input
(Plots: Cost (Error) against weight for a large and a small learning rate.)
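A minimal gradient-descent sketch for the y = Wx + b example, using the analytic gradients of the squared-error cost and the update rule above. The learning rate and number of steps are illustrative choices.

```python
xs, ys = [1, 5, 6], [0, 16, 20]
w, b = 2.0, 2.0      # initial guess (as in the slides)
lr = 0.01            # learning rate (hyperparameter)

for step in range(5000):
    # analytic gradients of C(w, b) = sum_n (y_n - (w*x_n + b))**2
    dw = sum(-2 * (y - (w * x + b)) * x for x, y in zip(xs, ys))
    db = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys))
    w -= lr * dw     # new_weight = existing_weight - learning_rate * gradient
    b -= lr * db

print(round(w, 3), round(b, 3))   # -> close to 4.0 and -4.0
```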
3
Deep Learning for NLP
Finding the Optimal weight and bias – Gradient Descent
There are different types of Gradient descent optimization algorithms:
Batch Gradient Descent, Stochastic Gradient Descent, Momentum, Adam, etc.
3
Deep Learning for NLP: Stochastic Gradient Descent
The cost would be very expensive to calculate for all windows in the corpus! You would wait a very long time before making a single update!
The solution is to use a different gradient descent method; the most common is Stochastic Gradient Descent (SGD).
Vanilla (Batch) gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online.
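A minimal sketch of what "one update at a time" means on the toy data: the parameters are updated after each single example instead of after a full pass over the dataset. The learning rate and epoch count are illustrative.

```python
import random

xs, ys = [1, 5, 6], [0, 16, 20]
pairs = list(zip(xs, ys))
w, b, lr = 2.0, 2.0, 0.005

for epoch in range(3000):
    random.shuffle(pairs)            # visit the examples in a random order each epoch
    for x, y in pairs:
        err = y - (w * x + b)
        w += lr * 2 * err * x        # one parameter update per single example
        b += lr * 2 * err

print(round(w, 2), round(b, 2))      # -> roughly 4.0 and -4.0
```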
3
Deep Learning for NLP
Finding the Optimal weight and bias – Gradient Descent
There are different types of Gradient descent optimization algorithms:
Batch Gradient Descent, Stochastic Gradient Descent, Momentum, Adam, etc.
3
Deep Learning for NLP
Deep Learning with Neural Network – Optimizer
Backpropagation (weight update): do we need to calculate the cost for every possible value of W and b? That would be expensive to compute (hours or days).
Instead, use the gradient!
Actual Data: x = 1, 5, 6; y = 0, 16, 20; model y = Wx + b
arg min_{w,b} C(w, b)
Deep Learning for NLP
Deep Learning with Neural Network – Optimizer
Actual Data: x = 1, 5, 6; y = 0, 16, 20; model y = Wx + b
Use the gradient!  arg min_{w,b} C(w, b)
Start from w, b = 2, 2 → C(w0, b0) = 68
Take a small step hw in the weight and measure the rate of change:
r = (C(w0 + hw, b0) - C(w0, b0)) / hw
hw = 1:     r = (C(2+1, 2) - C(2, 2)) / 1 = (26 - 68) / 1 = -42
hw = 0.1:   r = -98
hw = 0.01:  r ≈ -104
hw = 0.001: r ≈ -104
As hw → 0, r → ∂C/∂w (w0, b0), the partial derivative of the cost with respect to the weight.
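A short check of the finite-difference values above; the computed rates approach -104 as the step shrinks.

```python
# Numerical estimate of dC/dw at (w0, b0) = (2, 2) via the forward difference.
xs, ys = [1, 5, 6], [0, 16, 20]
cost = lambda w, b: sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))

w0, b0 = 2.0, 2.0
for h in (1.0, 0.1, 0.01, 0.001):
    r = (cost(w0 + h, b0) - cost(w0, b0)) / h
    print(h, round(r, 2))   # -42.0, -97.8, -103.38, -103.94 -> approaching -104
```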
3
Deep Learning for NLP
Deep Learning with Neural Network – Optimizer
Actual Data: x = 1, 5, 6; y = 0, 16, 20; current model ŷ = 2x + 2 (w, b = 2, 2; C(w0, b0) = 68)
The gradient can also be computed analytically. Since ŷ_n = w·x_n + b:
∂C/∂w = ∂/∂w Σ_n (y_n − ŷ_n)² = −Σ_n 2(y_n − ŷ_n) x_n
and this is exactly the limit of r as hw → 0: ∂C/∂w (w0, b0).
3
Deep Learning for NLP
Deep Learning with Neural Network – Optimizer
Actual Data: y = ?x + ?; current model ŷ = 2x + 2 (w, b = 2, 2; C(w0, b0) = 68)
∂C/∂w = −Σ_n 2(y_n − ŷ_n) x_n, evaluated from the predictions below:

x    ŷ    y    (y − ŷ)    2(y − ŷ)x
1    4    0     -4         -8
5    12   16     4          40
6    14   20     6          72

Σ_n 2(y_n − ŷ_n) x_n = 104, so ∂C/∂w (w0, b0) = -104, matching the limit of r as hw → 0.
Deep Learning for NLP
Deep Learning with Neural Network – Optimizer
Actual Data: y = ?x + ?; current model ŷ = 2x + 2 (w, b = 2, 2; C(w0, b0) = 68)
Similarly for the bias:
∂C/∂b = ∂/∂b Σ_n (y_n − ŷ_n)² = −Σ_n 2(y_n − ŷ_n)

x    ŷ    y    (y − ŷ)    2(y − ŷ)
1    4    0     -4         -8
5    12   16     4          8
6    14   20     6          12

Σ_n 2(y_n − ŷ_n) = 12, so as hb → 0, r = ∂C/∂b (w0, b0) = -12 (and, from before, ∂C/∂w (w0, b0) = -104).
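The same gradients computed analytically, matching the finite-difference estimates above.

```python
# Analytic gradients of the squared-error cost at (w0, b0) = (2, 2).
xs, ys = [1, 5, 6], [0, 16, 20]
w0, b0 = 2.0, 2.0

y_hat = [w0 * x + b0 for x in xs]
dC_dw = sum(-2 * (y - yh) * x for x, y, yh in zip(xs, ys, y_hat))
dC_db = sum(-2 * (y - yh) for y, yh in zip(ys, y_hat))

print(dC_dw, dC_db)   # -> -104.0 -12.0
```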
Deep Learning for NLP
Deep Learning with Neural Network – Optimizer
Actual Data: x = 1, 5, 6; y = 0, 16, 20; model y = Wx + b; starting point w, b = 2, 2 with C(w0, b0) = 68
Gradients at (w0, b0):  ∂C/∂w = -104,  ∂C/∂b = -12
Gradient descent updates w and b in the direction opposite to these gradients to reduce the cost.
3
Deep Learning for NLP
Deep Learning with Neural Network
Data, Model, Cost, Optimizer – the full system:
• Data: x = 1, 5, 6; y = 0, 16, 20
• Model: y = Wx + b
• Cost: C(w, b) = Σ_{n∈{0,1,2}} (y_n − ŷ_n)²
• Optimizer: arg min_{w,b} C(w, b)
• Resulting system: y = 4x - 4 (weight 4, bias -4)
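The same Data / Model / Cost / Optimizer pipeline written as a short PyTorch sketch (it assumes PyTorch is installed; the learning rate and step count are illustrative, not values from the lecture).

```python
import torch

X = torch.tensor([[1.0], [5.0], [6.0]])
Y = torch.tensor([[0.0], [16.0], [20.0]])

model = torch.nn.Linear(1, 1)                        # Model: y = Wx + b
loss_fn = torch.nn.MSELoss()                         # Cost: mean squared error
opt = torch.optim.SGD(model.parameters(), lr=0.01)   # Optimizer

for step in range(5000):
    opt.zero_grad()
    loss = loss_fn(model(X), Y)   # forward pass + cost
    loss.backward()               # backpropagation (gradients)
    opt.step()                    # weight update

print(model.weight.item(), model.bias.item())   # -> approximately 4.0 and -4.0
```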
3
Deep Learning for NLP
3
Deep Learning for NLP
Deep Learning with Neural Network
Input: x = number of apples given by Lisa; Output: y = number of bananas received by Lisa
Trained model: y = 4x - 4
"Guess how much I will give you back!" For x = 1, 5, 6 the model predicts y = 0, 16, 20; for a new input x = 3 it predicts y = 8.
3
Deep Learning for NLP
Deep Learning with Neural Network
In general the input is not a single number but many features, e.g. pixels (pixel(1,1), pixel(1,2), ...) or word vectors (Vector1, Vector2, ...):
y = w1x1 + w2x2 + w3x3 + w4x4 + ... + wnxn + b
Data: millions of parameters, millions of samples.
3
Deep Learning for NLP
Deep Learning with Neural Network
Input: x = number of apples given by Lisa; Output: y = number of bananas received by Lisa
"There is a limit of bananas I can give you!"
So far x = 1, 5, 6, 3 → y = 0, 16, 20, 8, but the number of bananas cannot keep growing with x forever.
3
Deep Learning for NLP
Deep Learning with Neural Network
Nonlinear Neural Network
The linear model y = 4x - 4 (weight 4, bias -4) fits the original data x = 1, 5, 6 → y = 0, 16, 20.
With the extended data x = 1, 5, 6, 9, 11 → y = 0, 16, 20, 20, 20 (the output saturates at 20), a single linear model such as y = 2x + 3 no longer fits the points: Underfitting issue. (Scatter plots, axes 0-20.)
3
Deep Learning for NLP
Deep Learning with Neural Network
Nonlinear Neural Network
Data: x = 1, 5, 6, 9, 11 → y = 0, 16, 20, 20, 20
y = ??? How to make this possible?
A: Use different linear functions depending on the value of x:
y = (w1·x + b1)·s1 + (w2·x + b2)·s2
where s1 = 1 if x < 6 (and 0 otherwise) and s2 = 1 if x >= 6 (and 0 otherwise).
With W1 = 4, b1 = -4, W2 = 0, b2 = 20, this piecewise model fits all the data points.
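A quick numeric check of the piecewise model above; the switch variables s1 and s2 are written as Python conditionals, and the parameter values are the ones from the slide.

```python
w1, b1 = 4, -4
w2, b2 = 0, 20

def y(x):
    s1 = 1 if x < 6 else 0      # first linear piece is active for x < 6
    s2 = 1 if x >= 6 else 0     # second (flat) piece is active for x >= 6
    return (w1 * x + b1) * s1 + (w2 * x + b2) * s2

print([y(x) for x in (1, 5, 6, 9, 11)])   # -> [0, 16, 20, 20, 20]
```

This kind of input-dependent switching is what activation functions provide in a multilayer perceptron, which is why stacking linear units with nonlinear activations can fit data that a single linear model cannot.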
3
Deep Learning for NLP
Deep Learning with Neural Network
3
Deep Learning for NLP
3
Deep Learning for NLP
3
Deep Learning for NLP
3
Deep Learning for NLP: Single Neuron vs. Multilayer
3
Deep Learning for NLP
CBOW – Neural Network Architecture (ReCAP with Optimizer)
Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)
1. Initialise each word in a one-hot vector form:
   x_k = [0, ..., 0, 1, 0, ..., 0]
2. Use the context words (2m of them, for window size m) as input to the Word2Vec-CBOW model:
   (x_{c-m}, x_{c-m+1}, ..., x_{c-1}, x_{c+1}, ..., x_{c+m-1}, x_{c+m}) ∈ R^|V|
3. The model has two parameter matrices:
   1) Parameter matrix from the input layer to the hidden/projection layer: W ∈ R^{V×N}
   2) Parameter matrix to the output layer: W′ ∈ R^{N×V}
3
Deep Learning for NLP
CBOW – Neural Network Architecture (ReCAP with Optimizer)
Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)
4. Initial words are represented as one-hot vectors, so multiplying a one-hot vector by W (V×N) gives a 1×N (embedded word) vector, e.g.
   [0 1 0 0] × [[10, 2, 18], [15, 22, 3], [25, 11, 19], [4, 7, 22]] = [15, 22, 3]
   (v_{c-m} = W x_{c-m}, ..., v_{c+m} = W x_{c+m}) ∈ R^N
5. Average those 2m embedded vectors to calculate the value of the hidden layer:
   v̂ = (v_{c-m} + v_{c-m+1} + ... + v_{c+m}) / 2m
3
Deep Learning for NLP
CBOW – Neural Network Architecture (ReCAP with Optimizer)
Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)
6. Calculate the score value for the output layer (the score is higher when words are closer):
   z = W′ × v̂ ∈ R^|V|
7. Calculate the probability using softmax:
   ŷ = softmax(z) ∈ R^|V|
8. Train the parameter matrices using the objective function, focusing on minimising its value:
   H(ŷ, y) = - Σ_{j=1}^{|V|} y_j log(ŷ_j)
   We use a one-hot vector y (one 1, the rest 0), so only the term for the true word contributes:
   H(ŷ, y) = - y_i log(ŷ_i)
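A minimal NumPy sketch of one CBOW forward pass (steps 1-8) on the toy sentence; the embedding size N = 3 and the random initial weights are illustrative choices, not values from the lecture.

```python
import numpy as np

V, N, m = 7, 3, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))       # input -> projection matrix  (V x N)
W_out = rng.normal(size=(N, V))   # projection -> output matrix (N x V), "W'"

vocab = ["Sydney", "is", "the", "state", "capital", "of", "NSW"]
idx = {w: i for i, w in enumerate(vocab)}

context = ["is", "the", "capital", "of"]                  # 2m context words around "state"
v_hat = np.mean([W[idx[w]] for w in context], axis=0)     # step 5: average embeddings

z = v_hat @ W_out                                         # step 6: scores over the vocabulary
y_hat = np.exp(z - z.max()); y_hat /= y_hat.sum()         # step 7: softmax
loss = -np.log(y_hat[idx["state"]])                       # step 8: cross entropy vs. one-hot "state"
print(y_hat.round(3), float(loss))
```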
3
Deep Learning for NLP
CBOW – Neural Network Architecture (ReCAP with Optimizer)
Predict center word from (bag of) context words
Sentence: "Sydney is the state capital of NSW"
(Architecture diagram: input layer = one-hot vectors for "is", "the", "capital", "of"; projection layer v̂ = (V_is + V_the + V_capital + V_of) / 2m; output layer z = W′(N×V) · v̂, ŷ = softmax(z), compared with the center word "state"; N = dimension of the word embedding – a parameter.)
• How does this weight work?  • What is the Softmax?  • What is the Cross Entropy?  • How to train the model?
3
Deep Learning for NLP
CBOW – Neural Network Architecture (ReCAP with Optimizer)
Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)
6. Calculate the score value for the output layer: z = W′ × v̂ ∈ R^|V|
7. Calculate the probability using softmax: ŷ = softmax(z) ∈ R^|V|
8. Train the parameter matrices using the objective function.
The softmax is an operator that will be used frequently. It transforms a vector z into a vector whose i-th component is
   softmax(z)_i = e^{z_i} / Σ_{j=1}^{|V|} e^{z_j}
• Exponentiating makes every component positive.
• Dividing by Σ_j e^{z_j} normalizes the vector so its components sum to 1, giving a probability distribution.
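A stand-alone softmax in NumPy, written in the standard numerically stable form (subtracting the maximum before exponentiating does not change the result).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # exponentiate to make every component positive
    return e / e.sum()          # normalize so the components sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))   # -> probabilities summing to 1, largest on the first entry
```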
3
Deep Learning for NLP
CBOW – Neural Network Architecture (ReCAP with Optimizer)
Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)
6. Calculate the score value for the output layer: z = W′ × v̂ ∈ R^|V|
7. Calculate the probability using softmax: ŷ = softmax(z) ∈ R^|V|
8. Train the parameter matrices using the objective function (Cross Entropy), focusing on minimising its value:
   H(ŷ, y) = - Σ_{j=1}^{|V|} y_j log(ŷ_j)
   Since y is a one-hot vector (one 1, the rest 0), only the term for the true word contributes: H(ŷ, y) = - y_i log(ŷ_i)
3
Deep Learning for NLP
CBOW – Neural Network Architecture (ReCAP with Optimizer)
Predict center word from (bag of) context words
Sentence: "Sydney is the state capital of NSW"
(Architecture diagram: input layer = one-hot context vectors; projection layer v̂ = (V_is + V_the + V_capital + V_of) / 2m; output layer ŷ = softmax(W′ · v̂), predicting the center word "state"; N = dimension of the word embedding – a parameter.)
• How does this weight work?  • What is the Softmax?  • What is the Cross Entropy?  • How to train the model?
3
Deep Learning for NLP
CBOW – Neural Network Architecture (ReCAP with Optimizer)
Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)
8-1. The optimization objective function can be presented as
   minimize J = - log P(w_c | w_{c-m}, ..., w_{c+m}) = - u_c^T v̂ + log Σ_{j=1}^{|V|} exp(u_j^T v̂)
   where u_i = the output vector representation of word w_i.
3
Prediction-based Word Representation: Skip-Gram – Neural Network Architecture
Predict context ("outside") words (position independent) given the center word
Summary of Skip-Gram Training (Review your understanding with equations)
1. Initialise the centre word in a one-hot vector form: x_k = [0, ..., 0, 1, 0, ..., 0], x ∈ R^|V|
2. The model has two parameter matrices:
   1) Parameter matrix from the input layer to the hidden/projection layer: W ∈ R^{V×N}
   2) Parameter matrix to the output layer: W′ ∈ R^{N×V}
3
Prediction-based Word Representation: Skip-Gram – Neural Network Architecture
Predict context ("outside") words (position independent) given the center word
Summary of Skip-Gram Training (Review your understanding with equations)
3. The initial word is represented as a one-hot vector, so multiplying it by W (V×N) gives a 1×N (embedded word) vector, e.g.
   [0 1 0 0] × [[10, 2, 18], [15, 22, 3], [25, 11, 19], [4, 7, 22]] = [15, 22, 3]
   v_c = W x ∈ R^N   (as there is only one input)
4. Calculate the score value for the output layer by multiplying by the parameter matrix W′:
   z = W′ v_c
3
Prediction-based Word Representation: Skip-Gram – Neural Network Architecture
Predict context ("outside") words (position independent) given the center word
Summary of Skip-Gram Training (Review your understanding with equations)
5. Calculate the probability using softmax: ŷ = softmax(z)
6. Calculate 2m probability vectors, as we need to predict 2m context words:
   ŷ_{c-m}, ..., ŷ_{c-1}, ŷ_{c+1}, ..., ŷ_{c+m}
   and compare them with the ground-truth (one-hot) vectors y^{(c-m)}, ..., y^{(c-1)}, y^{(c+1)}, ..., y^{(c+m)}
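A minimal NumPy sketch of one Skip-Gram forward pass on the toy sentence; the embedding size and random initial weights are illustrative, not values from the lecture.

```python
import numpy as np

V, N = 7, 3
rng = np.random.default_rng(1)
W = rng.normal(size=(V, N))       # input -> projection matrix
W_out = rng.normal(size=(N, V))   # projection -> output matrix ("W'")

vocab = ["Sydney", "is", "the", "state", "capital", "of", "NSW"]
idx = {w: i for i, w in enumerate(vocab)}

v_c = W[idx["state"]]                                  # step 3: embed the centre word
z = v_c @ W_out                                        # step 4: scores over the vocabulary
y_hat = np.exp(z - z.max()); y_hat /= y_hat.sum()      # step 5: softmax

# step 6: the same distribution is compared against each of the 2m context words
context = ["is", "the", "capital", "of"]
loss = -sum(np.log(y_hat[idx[w]]) for w in context)
print(float(loss))
```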
3
Prediction-based Word Representation: Skip-Gram – Neural Network Architecture
Predict context ("outside") words (position independent) given the center word
Summary of Skip Gram Training (Review your understanding with equations)
8. As in CBOW, use an objective function for us to evaluate the model. A key difference here is that we invoke a Naïve Bayes assumption to break out the probabilities. It is a strong naïve conditional independence assumption. Given the centre word, all output words are completely independent.
𝑢𝑖 =the output vector representation of word 𝑤𝑖
3
Prediction-based Word Representation: Skip-Gram – Neural Network Architecture
Predict context ("outside") words (position independent) given the center word
Summary of Skip Gram Training (Review your understanding with equations)
8-1. With this objective function, we can compute the gradients with respect to the unknown parameters and at each iteration update them via Stochastic Gradient Descent
4
Deep Learning for NLP
Word2Vec-SkipGram Overview
With a simple diagram
https://aegis4048.github.io/optimize_computational_efficiency_of_skip-gram_with_negative_sampling
4
Deep Learning for NLP
Key Parameter (2) for Training methods: Negative Samples (From lecture 2)
The number of negative samples is another factor of the training process.
Negative samples are added to the dataset – samples of words that are not neighbours of the input word.

Negative samples: 2
Input word   Output word   Target
eat          mango         1
eat          exam          0
eat          tobacco       0

Negative samples: 5
Input word   Output word   Target
eat          mango         1
eat          exam          0
eat          tobacco       0
eat          pool          0
eat          supervisor    0

(*1 = appeared (a real neighbour), 0 = did not appear)
The original paper prescribes 5-20 as being a good number of negative samples. It also states that 2-5 seems to be enough when you have a large enough dataset.
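A hedged sketch of setting the number of negative samples when training skip-gram with gensim (parameter names follow gensim 4.x; the tiny corpus here is only for illustration).

```python
from gensim.models import Word2Vec

sentences = [["sydney", "is", "the", "state", "capital", "of", "nsw"],
             ["i", "eat", "a", "mango"]]

# sg=1 selects skip-gram; negative=5 sets the number of negative samples per update.
model = Word2Vec(sentences, vector_size=100, window=2, sg=1,
                 negative=5, min_count=1, epochs=50)
print(model.wv.most_similar("sydney", topn=3))
```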
4
Deep Learning for NLP
Word2Vec-SkipGram Overview – negative sampling
With a simple diagram
https://aegis4048.github.io/optimize_computational_efficiency_of_skip-gram_with_negative_sampling
4
Deep Learning for NLP
Application
Application #1: Embedding Pretraining
http://ronxin.github.io/wevi/
0
LECTURE PLAN
Lecture 3: Word Classification and Machine Learning
1. Previous Lecture: Word Embedding Review
2. Word Embedding Evaluation
3. Deep Neural Network for Natural Language Processing
   1. Perceptron and Neural Network (NN)
   2. Multilayer Perceptron
   3. Applications
4. Next Week Preview: see how Deep Learning can be used for NLP – Text Classification, etc.
Reference
Reference for this lecture
• Deng, L., & Liu, Y. (Eds.) (2018). Deep Learning in Natural Language Processing. Springer.
• Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning. O'Reilly Media, Inc.
• Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
• Blunsom, P. (2017). Deep Natural Language Processing, lecture notes, Oxford University.
• Manning, C. (2017). Natural Language Processing with Deep Learning, lecture notes, Stanford University.