
Deep Learning for NLP: Feedforward Networks
COMP90042
Natural Language Processing
Lecture 7
Semester 1 2021 Week 4 Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE

Outline
• Feedforward Neural Networks Basics
• Applications in NLP
• Convolutional Networks

Deep Learning
• A branch of machine learning
• Re-branded name for neural networks
• Why deep? Many layers are chained together in modern deep learning models
• Neural networks: historically inspired by the way computation works in the brain
‣ Consists of computation units called neurons

Feed-forward NN
• Aka multilayer perceptrons
• Each arrow carries a weight, reflecting its importance
• Certain layers have non-linear activation functions

Neuron
[Figure: a neural unit taking three inputs x1, x2, x3, each multiplied by a weight (w1, w2, w3), plus a bias b (an input clamped at +1), with the weighted sum passed through a sigmoid σ to produce output y (after Jurafsky & Martin, Figures 8.1–8.2)]
• Each neuron is a function (see the sketch below)
‣ given input x, computes real-value (scalar) h

    h = tanh( ∑_j wj xj + b )

‣ scales input (with weights, w) and adds offset (bias, b)
‣ applies a non-linear function, such as the logistic sigmoid, hyperbolic tangent (tanh), or rectified linear unit
• w and b are parameters of the model
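To make the computation concrete, here is a minimal NumPy sketch of a single neuron; the weights, bias, and inputs are made-up values, not ones from the lecture.

```python
import numpy as np

# A single neuron: scale the inputs by their weights, add the bias, then apply
# a non-linear function (tanh here). All values below are made up for illustration.
w = np.array([0.2, -0.5, 1.0])   # weights w1, w2, w3
b = 0.1                          # bias
x = np.array([1.0, 2.0, 0.5])    # inputs x1, x2, x3

h = np.tanh(np.dot(w, x) + b)    # h = tanh(sum_j w_j x_j + b)
print(h)                         # a single scalar in (-1, 1)
```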

Matrix Vector Notation
‣ Typically have several hidden units, i.e.

    hi = tanh( ∑_j Wij xj + bi )

  each with its own weights (wi) and bias term (bi)
‣ Can be expressed using matrix and vector operators (see the sketch below)

    h = tanh(W x + b)

  where W is a matrix comprising the weight vectors, and b is a vector of all bias terms
‣ Non-linear function applied element-wise
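A minimal sketch of the same idea in matrix–vector form, assuming a made-up layer of 4 hidden units over a 3-dimensional input.

```python
import numpy as np

# A hidden layer as h = tanh(W x + b): W stacks one weight vector per hidden
# unit, b holds one bias per unit, and tanh is applied element-wise.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))      # 4 hidden units, 3 inputs (made-up sizes)
b = np.zeros(4)
x = np.array([1.0, 2.0, 0.5])

h = np.tanh(W @ x + b)
print(h.shape)                   # (4,): one activation per hidden unit
```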

Output Layer
• Binary classification problem
‣ e.g. classify whether a tweet is + or – in sentiment
‣ sigmoid activation function
• Multi-class classification problem
‣ e.g. native language identification
‣ softmax ensures probabilities > 0 and sum to 1 (see the sketch below)

    [ exp(v1)/∑_i exp(vi), exp(v2)/∑_i exp(vi), …, exp(vm)/∑_i exp(vi) ]
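A small NumPy sketch of the softmax above; subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result. The scores are made up for a 3-class example.

```python
import numpy as np

# Softmax: exponentiate the output scores and normalise so they are
# positive and sum to 1.
def softmax(v):
    e = np.exp(v - np.max(v))    # subtract max(v) for numerical stability
    return e / e.sum()

v = np.array([2.0, 1.0, 0.1])    # made-up output scores
p = softmax(v)
print(p, p.sum())                # probabilities, summing to 1.0
```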

Learning from Data
• How to learn the parameters from data?
• Consider how well the model “fits” the training data, in terms of the probability it assigns to the correct output

    L = ∏_{i=0}^{m} P(yi | xi)

‣ want to maximise total probability, L
‣ equivalently minimise −log L with respect to parameters
• Trained using gradient descent
‣ tools like tensorflow, pytorch, dynet use autodiff to compute gradients automatically (see the sketch below)
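A minimal sketch of such a training loop in PyTorch, with toy data and made-up layer sizes; the library's autodiff computes the gradients and SGD performs the gradient-descent updates.

```python
import torch
import torch.nn as nn

# Toy example: minimise -log L (cross-entropy) by gradient descent.
# The data, model sizes, and hyperparameters are made up for illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 3))
loss_fn = nn.CrossEntropyLoss()                 # -log P(correct class)
optim = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(20, 4)                          # 20 toy training examples
y = torch.randint(0, 3, (20,))                  # toy class labels

for epoch in range(100):
    optim.zero_grad()
    loss = loss_fn(model(X), y)                 # how well the model "fits" the data
    loss.backward()                             # autodiff computes gradients
    optim.step()                                # gradient-descent update
```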

Regularisation
• Have many parameters, overfit easily
• Low bias, high variance
• Regularisation is very very important in NNs
• L1-norm: sum of absolute values of all parameters (W, b, etc.)
• L2-norm: sum of squares
• Dropout: randomly zero-out some neurons of a layer

Dropout
• If dropout rate = 0.1, a random 10% of neurons now have 0 values
• Can apply dropout to any layer, but in practice, mostly to the hidden layers (see the sketch below)
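A small PyTorch sketch of dropout with rate 0.1; note that PyTorch's implementation also rescales the surviving values by 1/(1−p) during training, and is a no-op at test time.

```python
import torch
import torch.nn as nn

# Dropout with rate 0.1: during training, each value is zeroed with
# probability 0.1 (and the survivors are rescaled by 1/(1-p)).
drop = nn.Dropout(p=0.1)
h = torch.ones(10)               # a toy layer of 10 neuron values

drop.train()
print(drop(h))                   # roughly 10% of the values are now 0

drop.eval()
print(drop(h))                   # at test time, dropout does nothing
```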

Why Does Dropout Work?
• It prevents the model from being over-reliant on certain neurons
• It penalises large parameter weights
• It normalises the values of different neurons of a layer, ensuring that they have zero-mean
• It introduces noise into the network

PollEv.com/jeyhanlau569


Applications in NLP

Topic Classification
• Given a document, classify it into a predefined set of topics (e.g. economy, politics, sports)
• Input: bag-of-words

             love   cat   dog   doctor
    doc 1      0     2     3      0
    doc 2      2     0     2      0
    doc 3      0     0     0      4
    doc 4      3     0     0      2

Topic Classification – Training

    h1 = tanh(W1 x + b1)
    h2 = tanh(W2 h1 + b2)
    y  = softmax(W3 h2)

• Randomly initialise W and b
• x = [0, 2, 3, 0]
• y = [0.1, 0.6, 0.3]: probability distribution over C1, C2, C3
• L = −log(0.1) if true label is C1 (see the sketch below)
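A NumPy sketch of this forward pass and loss for the training example above; the weights are randomly initialised and the layer sizes are made up.

```python
import numpy as np

# Forward pass for x = [0, 2, 3, 0] with randomly initialised parameters,
# then the loss L = -log P(C1), assuming the true label is C1.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # made-up hidden size of 5
W2, b2 = rng.normal(size=(5, 5)), np.zeros(5)
W3 = rng.normal(size=(3, 5))                     # 3 output classes

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

x = np.array([0.0, 2.0, 3.0, 0.0])               # bag-of-words counts
h1 = np.tanh(W1 @ x + b1)
h2 = np.tanh(W2 @ h1 + b2)
y = softmax(W3 @ h2)                             # distribution over C1, C2, C3

loss = -np.log(y[0])                             # -log P(C1)
print(y, loss)
```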

Topic Classification – Prediction

    h1 = tanh(W1 x + b1)
    h2 = tanh(W2 h1 + b2)
    y  = softmax(W3 h2)

• x = [1, 3, 5, 0] (test document)
• y = [0.2, 0.1, 0.7]
• Predicted class = C3

Topic Classification – Improvements
• + Bag of bigrams as input
• Preprocess text to lemmatise words and remove stopwords
• Instead of raw counts, we can weight words using TF-IDF or indicators (0 or 1 depending on presence of words) (see the sketch below)
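An illustrative scikit-learn sketch of some of these improvements: TF-IDF weighting, stopword removal, and bag-of-bigram features; the two documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF weighting with unigram + bigram features and English stopword removal.
docs = ["the cat chases the dog", "the doctor sees the patient"]

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(docs)                   # documents x features, TF-IDF weighted
print(vec.get_feature_names_out())
print(X.toarray())
```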

Language Model Revisited
• Assign a probability to a sequence of words
• Framed as “sliding a window” over the sentence, predicting each word from finite context
• E.g., n = 3, a trigram model:

    P(w1, w2, …, wm) = ∏_{i=1}^{m} P(wi | wi−2, wi−1)

• Training involves collecting frequency counts (see the sketch below)
‣ Difficulty with rare events → smoothing
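A minimal sketch of the count-based approach: estimate P(wi | wi−2, wi−1) as a ratio of trigram to bigram counts over a toy corpus (no smoothing).

```python
from collections import Counter

# Maximum-likelihood trigram estimate: count(w1 w2 w3) / count(w1 w2).
# The corpus is a toy example.
corpus = "the cow eats grass and the cow eats hay".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def trigram_prob(w1, w2, w3):
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(trigram_prob("the", "cow", "eats"))   # 1.0 in this toy corpus
```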

Language Models as Classifiers
• LMs can be considered simple classifiers, e.g. for a trigram model:

    P(wi | wi−2 = salt, wi−1 = and)

  classifies the likely next word in a sequence, given “salt” and “and”.

Feed-forward NN Language Model
• Use neural network as a classifier to model

    P(wi | wi−2 = salt, wi−1 = and)

• Input features = the previous two words
• Output class = the next word
• How to represent words? Embeddings

    e.g. a word embedding: [0.1, -1.5, 2.3, 0.9, -3.2, 2.5, 1.1]

Word Embeddings
• Maps discrete word symbols to continuous vectors in a relatively low dimensional space
• Word embeddings allow the model to capture similarity between words
‣ dog vs. cat
‣ walking vs. running

Topic Classification

    h1 = tanh(W1 x + b1)
    h2 = tanh(W2 h1 + b2)
    y  = softmax(W3 h2)

• W1 = word embeddings!
• First layer = sum of input word embeddings

Training a FFNN LM
• P(wi = grass | wi−3 = a, wi−2 = cow, wi−1 = eats)
• Lookup word embeddings (W1) for a, cow and eats

            a      grass   eats    hunts   cow
           0.9     0.2    -3.3    -0.1    -0.5
           0.2    -2.3     0.6    -1.5     1.2
          -0.6     0.8     1.1     0.3    -2.4
           1.5     0.8     0.1     2.5     0.4

• Concatenate them and feed it to the network (see the sketch below):

    x = v_a ⊕ v_cow ⊕ v_eats
    h = tanh(W2 x + b1)
    y = softmax(W3 h)
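A minimal PyTorch sketch of this model, with the toy five-word vocabulary and made-up embedding and hidden sizes; nn.Embedding plays the role of W1 and the final linear layer the role of W3.

```python
import torch
import torch.nn as nn

# Look up embeddings for the context words, concatenate them, apply a tanh
# hidden layer, then a softmax over the vocabulary. Sizes are made up.
vocab = ["a", "grass", "eats", "hunts", "cow"]
word_to_id = {w: i for i, w in enumerate(vocab)}
d_emb, d_hidden = 4, 8

embeddings = nn.Embedding(len(vocab), d_emb)          # W1: input word embeddings
hidden = nn.Linear(3 * d_emb, d_hidden)               # W2, b1
output = nn.Linear(d_hidden, len(vocab), bias=False)  # W3: output word embeddings

context = torch.tensor([word_to_id[w] for w in ["a", "cow", "eats"]])
x = embeddings(context).view(1, -1)                   # concatenate v_a, v_cow, v_eats
h = torch.tanh(hidden(x))
y = torch.softmax(output(h), dim=-1)                  # distribution over the vocabulary

print(y[0, word_to_id["grass"]])                      # P(grass | a, cow, eats)
```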

Training a FFNN LM
• y gives the probability distribution over all words in the vocabulary

            rabbit   grass   eats   hunts   cow
     y =     0.01    0.80    0.05   0.10    0.04

• P(wi = grass | wi−3 = a, wi−2 = cow, wi−1 = eats) = 0.8
• L = −log(0.8)
• Most parameters are in the word embeddings W1 (size = d1 × |V|) and the output embeddings W3 (size = |V| × d3)

Input and Output Word Embeddings
• P(wi = grass | wi−3 = a, wi−2 = cow, wi−1 = eats)
• Lookup word embeddings (W1) for a, cow and eats

            a      grass   eats    hunts   cow
           0.9     0.2    -3.3    -0.1    -0.5
           0.2    -2.3     0.6    -1.5     1.2
          -0.6     0.8     1.1     0.3    -2.4
           1.5     0.8     0.1     2.5     0.4

  Word embeddings W1: d1 × |V|

• Concatenate them and feed it to the network:

    x = v_a ⊕ v_cow ⊕ v_eats
    h = tanh(W2 x + b1)
    y = softmax(W3 h)

  Output word embeddings W3: |V| × d3

Language Model: Architecture

[Figure (Bengio et al., 2003): the context words “a cow eats” are mapped to word embeddings and concatenated, passed through a non-linear activation, and then a softmax produces a probability distribution over all words in the vocabulary, giving P(wt = grass | context)]

Advantages of FFNN LM
• Count-based N-gram models (lecture 3)
‣ cheap to train (just collect counts)
‣ problems with sparsity and scaling to larger contexts
‣ don’t adequately capture properties of words (grammatical and semantic similarity), e.g., film vs. movie
• FFNN N-gram models
‣ automatically capture word properties, leading to more robust estimates

What Are The Disadvantages of Feedforward NN Language Model?
– Very slow to train
– Captures only limited context
– Unable to handle unseen n-grams
– Unable to handle unseen words

PollEv.com/jeyhanlau569


POS Tagging
• POS tagging can also be framed as classification:

    P(ti | wi−1 = cow, wi = eats)

‣ classifies the likely POS tag for “eats”.
• FFNN LM architecture can be adapted to the task directly

Feed-forward NN for Tagging
• Inputs:
‣ recent words wi−2, wi−1, wi
‣ recent tags ti−2, ti−1
• And outputs: current tag ti
• Frame as neural network with
‣ 5 inputs: 3 × word embeddings and 2 × tag embeddings
‣ 1 output: vector of size |T|, using softmax (see the sketch below)
• Train to minimise

    −∑_i log P(ti | wi−2, wi−1, wi, ti−2, ti−1)
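A minimal PyTorch sketch of this tagger, with made-up vocabulary, tagset, and embedding sizes; it concatenates three word embeddings and two tag embeddings and outputs a softmax over |T| tags.

```python
import torch
import torch.nn as nn

# Concatenate embeddings of w_{i-2}, w_{i-1}, w_i and t_{i-2}, t_{i-1},
# then predict a distribution over the |T| tags. Sizes and ids are made up.
n_words, n_tags = 1000, 45
d_word, d_tag, d_hidden = 50, 10, 64

word_emb = nn.Embedding(n_words, d_word)
tag_emb = nn.Embedding(n_tags, d_tag)
hidden = nn.Linear(3 * d_word + 2 * d_tag, d_hidden)
out = nn.Linear(d_hidden, n_tags)

words = torch.tensor([[3, 17, 42]])      # toy ids for w_{i-2}, w_{i-1}, w_i
tags = torch.tensor([[5, 8]])            # toy ids for t_{i-2}, t_{i-1}
x = torch.cat([word_emb(words).flatten(1), tag_emb(tags).flatten(1)], dim=1)
h = torch.tanh(hidden(x))
probs = torch.softmax(out(h), dim=-1)    # P(t_i | words, previous tags)
print(probs.shape)                       # (1, 45)
```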

FFNN for Tagging
[Figure: architecture of the feed-forward NN tagger]

Convolutional Networks

Convolutional Networks
• Commonly used in computer vision
• Identify indicative local predictors
• Combine them to produce a fixed-size representation

https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/

Convolutional Networks for NLP
• Sliding window (e.g. 3 words) over sequence
• W = convolution filter (linear transformation + tanh)
• max-pool to produce a fixed-size representation (see the sketch below)
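A minimal PyTorch sketch of this idea, with made-up sizes: a 1-D convolution slides a 3-word window over an embedded sentence, and max-pooling over positions gives a fixed-size representation.

```python
import torch
import torch.nn as nn

# Convolution over word embeddings: one filter response per 3-word window,
# then max-pooling over positions. All sizes are made up for illustration.
d_emb, n_filters, sent_len = 50, 100, 12

emb = torch.randn(1, d_emb, sent_len)        # (batch, embedding dim, sentence length)
conv = nn.Conv1d(d_emb, n_filters, kernel_size=3)

features = torch.tanh(conv(emb))             # (1, 100, 10): one value per window position
pooled, _ = features.max(dim=2)              # max-pool over positions
print(pooled.shape)                          # (1, 100): fixed size, regardless of sentence length
```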

Final Words
• Pros
‣ Excellent performance
‣ Less hand-engineering of features
‣ Flexible — customised architecture for different tasks
• Cons
‣ Much slower than classical ML models… needs GPU
‣ Lots of parameters due to vocabulary size
‣ Data hungry, not so good on tiny data sets
‣ Pre-training on big corpora helps

Readings
• Feed-forward network: G15, section 4; JM Ch. 7.3
• Convolutional network: G15, section 9