
Deep Learning for NLP: Feedforward Networks
COMP90042
Natural Language Processing Lecture 7
COPYRIGHT 2020, THE UNIVERSITY OF MELBOURNE

Corrections on L3: page 21/22

Deep Learning
• A branch of machine learning
• Re-branded name for neural networks
• Why deep? Many layers are chained together in modern deep learning models
• Neural networks: historically inspired by the way computation works in the brain
‣ consist of computation units called neurons

Feed-forward NN
• Aka multilayer perceptrons
• Each arrow carries a weight, reflecting its importance
• Sigmoid function represents a non-linear function

NN Units
• Each “unit” is a function
‣ given input x, computes a real-value (scalar) h
‣ scales the input (with weights, w) and adds an offset (bias, b)
‣ applies a non-linear function, such as logistic sigmoid, hyperbolic tangent (tanh), or rectified linear unit
• [Figure 8.2: a neural unit taking 3 inputs x1, x2, x3 (plus a weight for an input clamped at +1); each input is multiplied by a weight (w1, w2, w3), the weighted sum z plus bias b is passed through a sigmoid σ, and the resulting activation is the output y, a value between 0 and 1]
• Example unit, with the following weight vector and bias: w = [0.2, 0.3, 0.9], b = 0.5
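A minimal numpy sketch of this unit, using the example weights and bias above; the input vector x is invented for illustration:

```python
import numpy as np

def sigmoid(z):
    # logistic function: squashes any real value into (0, 1)
    return 1 / (1 + np.exp(-z))

w = np.array([0.2, 0.3, 0.9])   # weights from the example above
b = 0.5                         # bias
x = np.array([0.5, 0.6, 0.1])   # a hypothetical input vector

z = w @ x + b                   # weighted sum of the inputs plus the bias
y = sigmoid(z)                  # squashed into (0, 1)
print(z, y)                     # z ≈ 0.87, y ≈ 0.70
```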

Matrix Vector Notation
‣ Typically have several hidden units, i.e.
  hᵢ = tanh( Σⱼ Wᵢⱼ xⱼ + bᵢ )
‣ Each with its own weights (wᵢ) and bias term (bᵢ)
‣ Can be expressed using matrix and vector operators
  h⃗ = tanh( W x⃗ + b⃗ )
‣ where W is a matrix comprising the weight vectors, and b⃗ is a vector of all bias terms
‣ Non-linear function applied element-wise
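The same computation in matrix-vector form, as a short numpy sketch with arbitrary layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))      # 4 hidden units, each with 3 weights (one row per unit)
b = rng.normal(size=4)           # one bias per hidden unit
x = np.array([0.5, 0.6, 0.1])    # input vector

h = np.tanh(W @ x + b)           # all hidden units computed in one matrix-vector product
print(h.shape)                   # (4,)
```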

Output Layer
• Binary classification problem (e.g. classify whether a tweet is positive or negative in sentiment):
‣ sigmoid activation function (aka logistic function)
• Multi-class classification problem (e.g. classify the topics of a document):
‣ softmax ensures probabilities are > 0 and sum to 1
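A small numpy sketch of the two output choices; the scores fed in are made up:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract the max for numerical stability
    return e / e.sum()           # entries are > 0 and sum to 1

# binary case: a single output score squashed to a probability
print(sigmoid(1.3))                          # e.g. P(positive sentiment) ≈ 0.79

# multi-class case: one score per topic, normalised into a distribution
print(softmax(np.array([2.0, 0.5, -1.0])))   # ≈ [0.79, 0.18, 0.04], sums to 1
```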

Feed-forward NN
  h⃗₁ = tanh( W₁ x⃗ + b⃗₁ )
  h⃗₂ = tanh( W₂ h⃗₁ + b⃗₂ )
  y⃗ = softmax( W₃ h⃗₂ )
• Matrices and biases = parameters of the model
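A minimal numpy sketch of the full forward pass defined by these equations; the layer sizes and random weights are purely illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# hypothetical sizes: 4 input features, two hidden layers of 8, 3 output classes
d_in, d_h1, d_h2, n_classes = 4, 8, 8, 3
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(d_h1, d_in)), np.zeros(d_h1)
W2, b2 = rng.normal(size=(d_h2, d_h1)), np.zeros(d_h2)
W3 = rng.normal(size=(n_classes, d_h2))

def forward(x):
    h1 = np.tanh(W1 @ x + b1)
    h2 = np.tanh(W2 @ h1 + b2)
    return softmax(W3 @ h2)      # probability distribution over the classes

print(forward(np.array([0.0, 2.0, 3.0, 0.0])))   # 3 probabilities summing to 1
```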

Learning from Data
• How to learn the parameters from data?
• Consider how well the model “fits” the training data, in terms of the probability it assigns to the correct output
  L = ∏ᵢ P(yᵢ | xᵢ), i = 0…m
‣ want to maximise total probability, L
‣ equivalently minimise −log L with respect to parameters
• Trained using gradient descent
‣ tools like tensorflow, pytorch, dynet use autodiff to compute gradients automatically
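A possible PyTorch sketch of this training setup on toy data (architecture and data are invented): CrossEntropyLoss gives −log P(correct class), and backward()/step() are the autodiff and gradient-descent steps mentioned above:

```python
import torch
import torch.nn as nn

# toy bag-of-words inputs (4 features) with class labels (3 classes)
X = torch.tensor([[0., 2., 3., 0.],
                  [2., 0., 2., 0.],
                  [0., 0., 0., 4.]])
y = torch.tensor([1, 0, 2])

model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(),
                      nn.Linear(8, 8), nn.Tanh(),
                      nn.Linear(8, 3))              # softmax is folded into the loss below
loss_fn = nn.CrossEntropyLoss()                     # -log P(correct class), averaged over examples
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)     # -log L on the training data
    loss.backward()                 # autodiff computes gradients for every parameter
    opt.step()                      # gradient descent update
```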

Topic Classification
• Given a document, classify it into a predefined set of topics (e.g. economy, politics, sports)
• Input: bag-of-words

          love  cat  dog  doctor
  doc 1     0    2    3      0
  doc 2     2    0    2      0
  doc 3     0    0    0      4
  doc 4     3    0    0      2
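A tiny sketch of how such bag-of-words count vectors could be built; the raw document texts below are invented so that they reproduce the counts for doc 1 and doc 2:

```python
from collections import Counter

vocab = ["love", "cat", "dog", "doctor"]

def bow(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]    # count vector over the fixed vocabulary

print(bow("dog cat dog cat dog"))        # [0, 2, 3, 0]  (doc 1)
print(bow("love dog love dog"))          # [2, 0, 2, 0]  (doc 2)
```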

Topic Classification
  h⃗₁ = tanh( W₁ x⃗ + b⃗₁ )
  h⃗₂ = tanh( W₂ h⃗₁ + b⃗₂ )
  y⃗ = softmax( W₃ h⃗₂ )
• x⃗ = [0, 2, 3, 0] for the first document
• y⃗ = [0.1, 0.6, 0.3]: probability distribution over the 3 classes

Topic Classification – Improvements
• + Bag of bigrams as input
• Preprocess text to lemmatise words and remove stopwords
• Instead of raw counts, we can weight words using TF-IDF or indicators (0 or 1 depending on presence of words)
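A possible sketch of these improvements with scikit-learn's TfidfVectorizer (unigrams + bigrams, English stopword removal, TF-IDF weighting); lemmatisation would need a separate tool such as NLTK or spaCy and is not shown:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat chases the dog",          # toy documents
        "my doctor loves dogs and cats"]

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())         # surviving unigram/bigram features (scikit-learn >= 1.0)
print(X.toarray())                         # TF-IDF weighted document vectors
```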

Authorship Attribution
• Given a document, infer the identity of its author or characteristics of the author (e.g. gender, age, native language)
• Stylistic properties of text are more important than content words in this task
‣ POS tags and function words (e.g. on, of, the, and)
• Good approximation of function words: top-300 most frequent words in a large corpus
• Input: bag of function words, bag of POS tags, bag of POS bigrams/trigrams
• Word weighting: density (e.g. ratio between no. of function words and content words in a window of text)
• Other features: distribution of distances between consecutive function words
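A rough Python sketch of function-word features; the word set here is a tiny hypothetical stand-in for the top-300 list described above, and the density follows the ratio idea on this slide:

```python
from collections import Counter

# hypothetical stand-in for the top-300 most frequent words in a large corpus
FUNCTION_WORDS = {"the", "of", "and", "on", "a", "to", "in", "is", "it", "that"}

def stylistic_features(text):
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)   # bag of function words
    n_function = sum(counts.values())
    n_content = len(tokens) - n_function
    density = n_function / max(n_content, 1)   # ratio of function words to content words
    return counts, density

print(stylistic_features("the cat sat on the mat and it purred"))
```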

Language Model (Recap)
• Assign a probability to a sequence of words
• Framed as “sliding a window” over the sentence, predicting each word from finite context
• E.g., n = 3, a trigram model:
  P(w₁, w₂, …, wₘ) = ∏ᵢ P(wᵢ | wᵢ₋₂ wᵢ₋₁), i = 1…m
• Training (estimation) from frequency counts
‣ difficulty with rare events → smoothing
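A minimal sketch of trigram estimation from frequency counts on a toy corpus (MLE, no smoothing, so unseen events get probability 0):

```python
from collections import Counter

corpus = ["the cow eats grass", "the cow eats hay"]   # toy corpus
trigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    for i in range(2, len(words)):
        trigrams[tuple(words[i - 2:i + 1])] += 1
        bigrams[tuple(words[i - 2:i])] += 1

def p(w, u, v):
    # MLE estimate of P(w | u v); unseen events get probability 0, hence smoothing
    return trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0

print(p("grass", "cow", "eats"))    # 0.5
```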

Language Models as Classifiers
• LMs can be considered simple classifiers, e.g. a trigram model
  P(wᵢ | wᵢ₋₂ = “cow”, wᵢ₋₁ = “eats”)
  classifies the likely next word in a sequence.

Feed-forward NN Language Model
• Use a neural network as a classifier to model
  P(wᵢ | wᵢ₋₂ = “cow”, wᵢ₋₁ = “eats”)
‣ input features = the previous two words
‣ output class = the next word
• How to represent words? Embeddings

Word Embeddings
• Map discrete word symbols to continuous vectors in a relatively low dimensional space
• Word embeddings allow the model to capture similarity between words
‣ dog vs. cat
‣ walking vs. running
• Alleviates data-sparsity problems
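A small PyTorch sketch of an embedding lookup; the vocabulary and dimension are toy values, and the vectors are randomly initialised rather than learned:

```python
import torch
import torch.nn as nn

vocab = {"rabbit": 0, "grass": 1, "eats": 2, "hunts": 3, "cow": 4}   # toy vocabulary
emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)       # d = 4, random init

ids = torch.tensor([vocab["cow"], vocab["eats"]])
vectors = emb(ids)               # one continuous d-dimensional vector per word symbol
print(vectors.shape)             # torch.Size([2, 4])
```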

Topic Classification
  h⃗₁ = tanh( W₁ x⃗ + b⃗₁ )
  h⃗₂ = tanh( W₂ h⃗₁ + b⃗₂ )
  y⃗ = softmax( W₃ h⃗₂ )
• x⃗ = [0, 2, 3, 0] for the first document
• y⃗ = [0.1, 0.6, 0.3]: probability distribution over the 3 classes
• Word embeddings! First layer = sum of input word embeddings
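A sketch of the “first layer = sum of input word embeddings” idea in PyTorch, treating the first weight matrix as an embedding table; values are random and only the shapes matter here:

```python
import torch
import torch.nn as nn

vocab = ["love", "cat", "dog", "doctor"]
emb = nn.Embedding(len(vocab), 4)          # plays the role of W1: one embedding per vocab word

counts = torch.tensor([0., 2., 3., 0.])    # bag-of-words vector x for doc 1
doc_vec = counts @ emb.weight              # sum of the document's word embeddings, weighted by counts
h1 = torch.tanh(doc_vec)                   # first hidden layer (bias omitted for brevity)
print(doc_vec.shape)                       # torch.Size([4])
```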

Language Model: Architecture
Bengio et al, 2003

Example
  P(wᵢ = “grass” | wᵢ₋₂ = “cow”, wᵢ₋₁ = “eats”)
• Look up word embeddings for cow and eats

          rabbit  grass  eats  hunts   cow
            0.9    0.2   -3.3   -0.1  -0.5
            0.2   -2.3    0.6   -1.5   1.2
           -0.6    0.8    1.1    0.3  -2.4
            1.5    0.8    0.1    2.5   0.4

• Concatenate them and feed to the network:
  x⃗ = v⃗_cow ⊕ v⃗_eats
  h⃗₁ = tanh( W₁ x⃗ + b⃗₁ )
  y⃗ = softmax( W₂ h⃗₁ )
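A hedged PyTorch sketch of this forward pass; the weights and embeddings are randomly initialised rather than the values in the table, so the printed probability is only illustrative:

```python
import torch
import torch.nn as nn

vocab = ["rabbit", "grass", "eats", "hunts", "cow"]
word_to_id = {w: i for i, w in enumerate(vocab)}
d, d_hidden = 4, 16                        # d matches the 4-dimensional table; hidden size is arbitrary

emb = nn.Embedding(len(vocab), d)          # input word embeddings (random here)
layer1 = nn.Linear(2 * d, d_hidden)        # W1, b1 over the concatenated context
layer2 = nn.Linear(d_hidden, len(vocab))   # output scores, one per vocabulary word

ids = torch.tensor([word_to_id["cow"], word_to_id["eats"]])
x = emb(ids).flatten()                     # x = v_cow ⊕ v_eats (concatenation)
h1 = torch.tanh(layer1(x))
y = torch.softmax(layer2(h1), dim=-1)      # distribution over the next word
print(y[word_to_id["grass"]])              # P(grass | cow, eats); roughly uniform before training
```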

Output
• y⃗ gives the probability distribution over all words in the vocabulary

          rabbit  grass  eats  hunts   cow
   y⃗ =    0.01    0.80   0.05   0.10  0.04

  P(wᵢ = “grass” | wᵢ₋₂ = “cow”, wᵢ₋₁ = “eats”) = 0.8
• Most parameters are in the word embeddings (size = d × |V|) and the output embeddings (size = |V| × d)

Example (continued)
• Same example and network as above, with the parameter matrices annotated:
‣ word embeddings: d × |V|
‣ output word embeddings: |V| × d

Why Bother?
• N-gram LMs
‣ cheap to train (just compute counts)
‣ problems with sparsity and scaling to larger contexts
‣ don’t adequately capture properties of words (grammatical and semantic similarity), e.g., film vs. movie
• NN LMs more robust
‣ force words through low-dimensional embeddings
‣ automatically capture word properties, leading to more robust estimates
‣ flexible: minor change to adapt to other tasks (tagging)

POS Tagging
• POS tagging can also be framed as classification:
  P(tᵢ | wᵢ₋₁ = “cow”, wᵢ = “eats”)
  classifies the likely POS tag for “eats”
• Why not use a fancier classifier? (a neural net)
• NNLM architecture can be adapted to the task directly

Feed-forward NN for Tagging
• MEMM tagger takes as input:
‣ recent words wᵢ₋₂, wᵢ₋₁, wᵢ
‣ recent tags tᵢ₋₂, tᵢ₋₁
  and outputs: current tag tᵢ
• Frame as neural network with
‣ 5 inputs: 3 × word embeddings and 2 × tag embeddings
‣ 1 output: vector of size |T|, using softmax
• Train to minimise
  −Σᵢ log P(tᵢ | wᵢ₋₂, wᵢ₋₁, wᵢ, tᵢ₋₂, tᵢ₋₁)
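A possible PyTorch sketch of this tagger's forward computation; all sizes and the example indices are made up:

```python
import torch
import torch.nn as nn

n_words, n_tags = 1000, 45         # hypothetical vocabulary and tagset sizes
d_w, d_t, d_h = 50, 10, 100        # hypothetical embedding and hidden dimensions

word_emb = nn.Embedding(n_words, d_w)
tag_emb = nn.Embedding(n_tags, d_t)
hidden = nn.Linear(3 * d_w + 2 * d_t, d_h)     # 3 word embeddings + 2 tag embeddings, concatenated
out = nn.Linear(d_h, n_tags)                   # one score per tag in |T|

def log_p_tag(word_ids, tag_ids):
    # word_ids = [w_{i-2}, w_{i-1}, w_i]; tag_ids = [t_{i-2}, t_{i-1}]
    x = torch.cat([word_emb(word_ids).flatten(), tag_emb(tag_ids).flatten()])
    return torch.log_softmax(out(torch.tanh(hidden(x))), dim=-1)   # log P(t_i | context)

scores = log_p_tag(torch.tensor([1, 2, 3]), torch.tensor([5, 7]))
print(scores.shape)    # torch.Size([45]); training minimises -scores[gold_tag], summed over i
```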

FF-NN for Tagging
[figure: network architecture for the feed-forward tagger]

Convolutional Networks
• Commonly used in computer vision
• Identify indicative local predictors
• Combine them to produce a fixed-size representation
https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/

Convolutional Networks for NLP
• Sliding window (e.g. 3 words) over sequence
• W = convolution filter (linear transformation + tanh)
• Max-pool to produce a fixed-size representation
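A short PyTorch sketch of this pipeline: embed the words, convolve over 3-word windows with a tanh non-linearity, then max-pool over positions to get a fixed-size representation (all sizes and the input are made up):

```python
import torch
import torch.nn as nn

vocab_size, d, n_filters = 1000, 50, 64            # hypothetical sizes
emb = nn.Embedding(vocab_size, d)
conv = nn.Conv1d(in_channels=d, out_channels=n_filters, kernel_size=3)   # 3-word window

word_ids = torch.randint(0, vocab_size, (1, 7))    # a toy 7-word "sentence" (batch of 1)
x = emb(word_ids).transpose(1, 2)                  # (1, d, 7): channels = embedding dimensions
h = torch.tanh(conv(x))                            # (1, n_filters, 5): one value per window
doc_vec = h.max(dim=2).values                      # max-pool over positions -> fixed size
print(doc_vec.shape)                               # torch.Size([1, 64])
```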

Final Words
• Neural networks
‣ Robust to word variation, typos, etc.
‣ Excellent generalization
‣ Flexible: customised architecture for different tasks
• Cons
‣ Much slower than classical ML models… but GPU acceleration
‣ Lots of parameters due to vocabulary size
‣ Data hungry, not so good on tiny data sets
‣ Pre-training on big corpora helps

Readings
• Feed-forward network: G15, section 4
• Convolutional network: G15, section 9