COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042
Natural Language Processing
Lecture 7
Semester 1 2021 Week 4
Jey Han Lau
Deep Learning for NLP:
Feedforward Networks
Outline
• Feedforward Neural Networks Basics
• Applications in NLP
• Convolutional Networks
Deep Learning
• A branch of machine learning
• Re-branded name for neural networks
• Why deep? Many layers are chained together in
modern deep learning models
• Neural networks: historically inspired by the way
computation works in the brain
‣ Consists of computation units called neurons
Feed-forward NN
• Aka multilayer perceptrons
• Each arrow carries a
weight, reflecting its
importance
• Certain layers have non-
linear activation functions
Neuron
• Each neuron is a function
‣ given input x, computes a real-valued (scalar) output h
‣ scales input (with weights, w) and adds offset (bias, b)
‣ applies a non-linear function, such as the logistic sigmoid, hyperbolic tangent (tanh), or rectified linear unit
‣ w and b are parameters of the model

    h = tanh( ∑_j w_j x_j + b )
Figure 8.1: The sigmoid function takes a real value and maps it to the range [0, 1]. Because it is nearly linear around 0 but saturates (flattens) toward the ends, it tends to squash outlier values toward 0 or 1.

The unit multiplies each input value by a weight (w1, w2, and w3, respectively), adds them to a bias term b, and then passes the resulting sum through a sigmoid function to produce a number between 0 and 1.

Figure 8.2: A neural unit, taking 3 inputs x1, x2, and x3 (and a bias b that we represent as a weight for an input clamped at +1) and producing an output y. We include some convenient intermediate variables: the output of the summation, z, and the output of the sigmoid, a. In this case the output of the unit y is the same as a, but in deeper networks we'll reserve y to mean the final output of the entire network, leaving a as the activation of an individual node.

Let's walk through an example just to get an intuition. Suppose we have a unit with the following weights and bias:

    w = [0.2, 0.3, 0.9]
    b = 0.5

What would this unit do with the following input vector?

    x = [0.5, 0.6, 0.1]
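A minimal NumPy sketch of this worked example, using exactly the weights, bias, and input given above (it prints z = 0.87 and a ≈ 0.70):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([0.2, 0.3, 0.9])   # weights from the example
b = 0.5                         # bias
x = np.array([0.5, 0.6, 0.1])   # input vector

z = w @ x + b                   # weighted sum plus bias: 0.87
a = sigmoid(z)                  # squashed through the sigmoid: ~0.70
print(z, a)
```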
Matrix Vector Notation
‣ Typically have several hidden units, i.e.

    h_i = tanh( ∑_j w_ij x_j + b_i )

‣ Each with its own weights (w_i) and bias term (b_i)
‣ Can be expressed using matrix and vector operators

    h = tanh( W x + b )

‣ where W is a matrix comprising the weight vectors, and b is a vector of all bias terms
‣ Non-linear function applied element-wise
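A small NumPy sketch of the matrix-vector form; the layer sizes (4 inputs, 3 hidden units) and the random initialisation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([1.0, 0.5, -0.2, 0.3])   # input vector (4 features)
W = rng.normal(size=(3, 4))           # weight matrix: one row of weights per hidden unit
b = rng.normal(size=3)                # one bias per hidden unit

h = np.tanh(W @ x + b)                # tanh applied element-wise
print(h.shape)                        # (3,) -- one activation per hidden unit
```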
Output Layer
• Binary classification problem
‣ e.g. classify whether a tweet is + or – in sentiment
‣ sigmoid activation function
• Multi-class classification problem
‣ e.g. native language identification
‣ softmax ensures probabilities > 0 and sum to 1
    [ exp(v_1) / ∑_i exp(v_i),  exp(v_2) / ∑_i exp(v_i),  ...,  exp(v_m) / ∑_i exp(v_i) ]
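A minimal sketch of softmax in NumPy; the example scores are made up, and the max-subtraction is a standard numerical-stability trick rather than part of the definition:

```python
import numpy as np

def softmax(v):
    # Subtracting the max does not change the result but avoids overflow in exp
    e = np.exp(v - np.max(v))
    return e / e.sum()

v = np.array([2.0, 1.0, 0.1])   # illustrative output-layer scores
p = softmax(v)
print(p, p.sum())               # probabilities > 0 that sum to 1
```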
Learning from Data
• How to learn the parameters from data?
• Consider how well the model “fits” the training data,
in terms of the probability it assigns to the correct
output
‣ want to maximise total probability, L
‣ equivalently minimise -log L with respect to parameters
• Trained using gradient descent
‣ tools like TensorFlow, PyTorch and DyNet use autodiff to compute gradients automatically
    L = ∏_{i=0}^{m} P(y_i | x_i)
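A hedged PyTorch sketch of this training setup (the slide lists PyTorch as one autodiff tool); the toy data, layer sizes, and learning rate are all illustrative assumptions:

```python
import torch
import torch.nn as nn

# Toy data: 4-dimensional inputs, 3 classes (illustrative only)
X = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))

model = nn.Sequential(nn.Linear(4, 5), nn.Tanh(), nn.Linear(5, 3))
loss_fn = nn.CrossEntropyLoss()              # -log P(correct class), averaged over examples
optim = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    optim.zero_grad()
    loss = loss_fn(model(X), y)              # negative log-likelihood of the batch
    loss.backward()                          # autodiff computes the gradients
    optim.step()                             # gradient descent update
```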
Regularisation
• Have many parameters, overfits easily
• Low bias, high variance
• Regularisation is very very important in NNs
• L1-norm: sum of absolute values of all parameters (W, b, etc.)
• L2-norm: sum of squares
• Dropout: randomly zero-out some neurons of a layer
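A minimal NumPy sketch of how the L1 and L2 penalties could be added to the loss; the weights, the data-loss value, and the regularisation strength lam are illustrative assumptions:

```python
import numpy as np

W = np.random.randn(3, 4)      # illustrative weight matrix
b = np.random.randn(3)
lam = 0.01                     # regularisation strength (hyperparameter)

l1_penalty = lam * (np.abs(W).sum() + np.abs(b).sum())   # sum of absolute values
l2_penalty = lam * ((W ** 2).sum() + (b ** 2).sum())     # sum of squares

data_loss = 0.42                      # placeholder for -log L on the training data
total_loss = data_loss + l2_penalty   # minimised instead of the data loss alone
```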
Dropout
• If dropout rate = 0.1, a
random 10% of neurons
now have 0 values
• Can apply dropout to any
layer, but in practice,
mostly to the hidden
layers
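A sketch of one common way to implement dropout ("inverted" dropout, which rescales the surviving activations so their expected value is unchanged and nothing needs to change at test time); the layer size and dropout rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, rate=0.1, training=True):
    if not training:
        return h                               # no dropout at test time
    mask = rng.random(h.shape) >= rate         # keep each neuron with probability 1 - rate
    return h * mask / (1.0 - rate)             # rescale the surviving activations

h = np.tanh(rng.normal(size=10))               # some hidden-layer activations
print(dropout(h, rate=0.1))                    # ~10% of values zeroed out
```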
Why Does Dropout Work?

• It prevents the model from being over-reliant on certain neurons
• It penalises large parameter weights
• It normalises the values of different neurons of a layer, ensuring that they have zero mean
• It introduces noise into the network

PollEv.com/jeyhanlau569
Applications in NLP
Topic Classification
• Given a document, classify it into a predefined set
of topics (e.g. economy, politics, sports)
• Input: bag-of-words
           love   cat   dog   doctor
  doc 1     0      2     3      0
  doc 2     2      0     2      0
  doc 3     0      0     0      4
  doc 4     3      0     0      2
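A small plain-Python sketch of building such bag-of-words count vectors; the two documents are hypothetical texts chosen to reproduce the first two rows of the table:

```python
from collections import Counter

vocab = ["love", "cat", "dog", "doctor"]          # fixed vocabulary
docs = [
    "cat cat dog dog dog",                        # hypothetical doc 1 -> [0, 2, 3, 0]
    "love love dog dog",                          # hypothetical doc 2 -> [2, 0, 2, 0]
]

def bag_of_words(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]             # count vector in vocabulary order

for d in docs:
    print(bag_of_words(d))
```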
Topic Classification – Training
• Randomly initialise W and b

    h1 = tanh( W1 x + b1 )
    h2 = tanh( W2 h1 + b2 )
    y = softmax( W3 h2 )

• x = [0, 2, 3, 0]
• y = [0.1, 0.6, 0.3]: probability distribution over C1, C2, C3
• if true label is C1: L = −log(0.1)
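A NumPy sketch of this forward pass and loss; the input x and the three-class output match the slide, while the hidden-layer sizes and random initialisation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

# Randomly initialise W and b (sizes are illustrative)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(4, 5)), np.zeros(4)
W3 = rng.normal(size=(3, 4))

x = np.array([0.0, 2.0, 3.0, 0.0])      # bag-of-words input from the slide

h1 = np.tanh(W1 @ x + b1)
h2 = np.tanh(W2 @ h1 + b2)
y = softmax(W3 @ h2)                    # probability distribution over C1, C2, C3

true_class = 0                          # suppose the true label is C1
loss = -np.log(y[true_class])           # L = -log P(true class)
```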
Topic Classification – Prediction
• x = [1, 3, 5, 0] (test document)

    h1 = tanh( W1 x + b1 )
    h2 = tanh( W2 h1 + b2 )
    y = softmax( W3 h2 )

• y = [0.2, 0.1, 0.7]
• Predicted class = C3
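At prediction time the same forward pass is run and the highest-probability class is taken, e.g.:

```python
import numpy as np

y = np.array([0.2, 0.1, 0.7])            # output distribution for the test document
predicted_class = int(np.argmax(y))      # index 2, i.e. C3
```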
Topic Classification – Improvements
• Add bag of bigrams as input
• Preprocess text to lemmatise words and remove
stopwords
• Instead of raw counts, we can weight words using
TF-IDF or indicators (0 or 1 depending on
presence of words)
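As a hedged illustration (not part of the lecture), the bigram, stopword, and weighting options roughly map onto scikit-learn's TfidfVectorizer; lemmatisation would still need a separate preprocessing step:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat chases the dog", "the doctor treats patients"]   # illustrative documents

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),        # unigrams + bigrams as input features
    stop_words="english",      # remove stopwords
)
X = vectorizer.fit_transform(docs)          # document-term matrix with TF-IDF weights
# For plain 0/1 indicators, CountVectorizer(binary=True) could be used instead.
print(vectorizer.get_feature_names_out())
```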
Language Model Revisited
• Assign a probability to a sequence of words
• Framed as “sliding a window” over the sentence,
predicting each word from finite context
E.g., n = 3, a trigram model
• Training involves collecting frequency counts
‣ Difficulty with rare events → smoothing
    P(w1, w2, ..., wm) = ∏_{i=1}^{m} P(w_i | w_{i−2}, w_{i−1})
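For reference, a tiny sketch of the count-based estimate for one trigram; the counts are made up and smoothing is omitted:

```python
from collections import defaultdict

# Illustrative trigram and bigram counts collected from a training corpus
trigram_counts = defaultdict(int, {("salt", "and", "pepper"): 8,
                                   ("salt", "and", "vinegar"): 2})
bigram_counts = defaultdict(int, {("salt", "and"): 10})

def p_trigram(w, u, v):
    # P(w | u, v) = count(u, v, w) / count(u, v); no smoothing here
    return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

print(p_trigram("pepper", "salt", "and"))   # 0.8
```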
Language Models as Classifiers
LMs can be considered simple classifiers, e.g. for a trigram model:

    P(w_i | w_{i−2} = salt, w_{i−1} = and)

classifies the likely next word in a sequence, given "salt" and "and".
Feed-forward NN Language Model
• Use neural network as a classifier to model

    P(w_i | w_{i−2} = salt, w_{i−1} = and)

• Input features = the previous two words
• Output class = the next word
• How to represent words? Embeddings, e.g.

    [0.1, -1.5, 2.3, 0.9, -3.2, 2.5, 1.1]
Word Embeddings
• Maps discrete word symbols to continuous vectors
in a relatively low dimensional space
• Word embeddings allow the model to capture
similarity between words
‣ dog vs. cat
‣ walking vs. running
Topic Classification
    h1 = tanh( W1 x + b1 )
    h2 = tanh( W2 h1 + b2 )
    y = softmax( W3 h2 )

• First layer = sum of input word embeddings
• W1 = word embeddings!
Training a FFNN LM
• P(w_i = grass | w_{i−3} = a, w_{i−2} = cow, w_{i−1} = eats)
• Lookup word embeddings (W1) for a, cow and eats
• Concatenate them and feed it to the network:

    x = v_a ⊕ v_cow ⊕ v_eats
    h = tanh( W2 x + b1 )
    y = softmax( W3 h )

  Word embeddings W1:

         a     grass   eats   hunts   cow
        0.9     0.2    -3.3   -0.1   -0.5
        0.2    -2.3     0.6   -1.5    1.2
       -0.6     0.8     1.1    0.3   -2.4
        1.5     0.8     0.1    2.5    0.4
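A NumPy sketch of this lookup-and-concatenate forward pass, reusing the embedding table above; the hidden-layer size and the other weight matrices are illustrative random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["a", "grass", "eats", "hunts", "cow"]
word2id = {w: i for i, w in enumerate(vocab)}

# Embedding table W1 from the slide: one 4-dimensional column per word
W1 = np.array([[ 0.9,  0.2, -3.3, -0.1, -0.5],
               [ 0.2, -2.3,  0.6, -1.5,  1.2],
               [-0.6,  0.8,  1.1,  0.3, -2.4],
               [ 1.5,  0.8,  0.1,  2.5,  0.4]])

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

context = ["a", "cow", "eats"]
x = np.concatenate([W1[:, word2id[w]] for w in context])   # x = v_a ⊕ v_cow ⊕ v_eats (length 12)

W2, b = rng.normal(size=(8, 12)), np.zeros(8)               # illustrative hidden layer
W3 = rng.normal(size=(len(vocab), 8))                       # output embeddings, |V| x d

h = np.tanh(W2 @ x + b)
y = softmax(W3 @ h)                     # probability distribution over the whole vocabulary
loss = -np.log(y[word2id["grass"]])     # training minimises -log P(grass | a, cow, eats)
```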
Training a FFNN LM
• y gives the probability distribution over all words in the vocabulary, e.g.

        rabbit   grass   eats   hunts   cow
         0.01    0.80    0.05   0.10    0.04

• P(w_i = grass | w_{i−3} = a, w_{i−2} = cow, w_{i−1} = eats) = 0.8
  L = −log(0.8)
• Most parameters are in the word embeddings W1 (size = d1 × |V|) and the output embeddings W3 (size = |V| × d3)
Input and Output Word Embeddings
• P(w_i = grass | w_{i−3} = a, w_{i−2} = cow, w_{i−1} = eats)
• Lookup word embeddings (W1) for a, cow and eats
• Concatenate them and feed it to the network:

    x = v_a ⊕ v_cow ⊕ v_eats
    h = tanh( W2 x + b1 )
    y = softmax( W3 h )

  Word embeddings W1 (d1 × |V|):

         a     grass   eats   hunts   cow
        0.9     0.2    -3.3   -0.1   -0.5
        0.2    -2.3     0.6   -1.5    1.2
       -0.6     0.8     1.1    0.3   -2.4
        1.5     0.8     0.1    2.5    0.4

  Output word embeddings W3 (|V| × d3)
Language Model: Architecture
(Bengio et al., 2003)

• Input: the context words, e.g. "a cow eats"
• Concatenate their word embeddings
• Apply a non-linear activation
• Softmax to produce a probability distribution over all words in the vocabulary, e.g. P(w_t = grass | context):

    [ exp(v_1) / ∑_i exp(v_i),  exp(v_2) / ∑_i exp(v_i),  ...,  exp(v_m) / ∑_i exp(v_i) ]
Advantages of FFNN LM
• Count-based N-gram models (lecture 3)
‣ cheap to train (just collect counts)
‣ problems with sparsity and scaling to larger contexts
‣ don’t adequately capture properties of words
(grammatical and semantic similarity), e.g., film vs
movie
• FFNN N-gram models
‣ automatically capture word properties, leading to more
robust estimates
What Are the Disadvantages of a Feedforward NN Language Model?

– Very slow to train
– Captures only limited context
– Unable to handle unseen n-grams
– Unable to handle unseen words

PollEv.com/jeyhanlau569
POS Tagging
• POS tagging can also be framed as classification:

    P(t_i | w_{i−1} = cow, w_i = eats)

  classifies the likely POS tag for "eats".
• FFNN LM architecture can be adapted to the task directly
Feed-forward NN for Tagging
• Inputs:
  ‣ recent words: w_{i−2}, w_{i−1}, w_i
  ‣ recent tags: t_{i−2}, t_{i−1}
• And output: current tag t_i
• Frame as neural network with
  ‣ 5 inputs: 3 x word embeddings and 2 x tag embeddings
  ‣ 1 output: vector of size |T|, using softmax
• Train to minimise

    − ∑_i log P(t_i | w_{i−2}, w_{i−1}, w_i, t_{i−2}, t_{i−1})
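A hedged NumPy sketch of this tagger's input and output layers: three word embeddings and two tag embeddings are concatenated, passed through a hidden layer, and softmaxed over |T| tags; all sizes, tables, and indices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

d_word, d_tag, n_tags, vocab_size = 4, 3, 5, 100    # illustrative sizes
word_emb = rng.normal(size=(vocab_size, d_word))    # word embedding table
tag_emb = rng.normal(size=(n_tags, d_tag))          # tag embedding table

# Indices for w_{i-2}, w_{i-1}, w_i and t_{i-2}, t_{i-1} (made-up ids)
word_ids, tag_ids = [17, 42, 7], [1, 3]

x = np.concatenate([word_emb[i] for i in word_ids] +
                   [tag_emb[j] for j in tag_ids])   # 3*d_word + 2*d_tag = 18 values

W1, b1 = rng.normal(size=(10, x.size)), np.zeros(10)
W2 = rng.normal(size=(n_tags, 10))

h = np.tanh(W1 @ x + b1)
p = softmax(W2 @ h)          # distribution over |T| tags for position i
loss = -np.log(p[2])         # -log P(t_i = gold tag); summed over positions during training
```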
FFNN for Tagging (architecture diagram)
Convolutional Networks
Convolutional Networks
• Commonly used in computer vision
• Identify indicative local predictors
• Combine them to produce a fixed-size
representation
https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
Convolutional Networks for NLP
• Sliding window (e.g. 3 words) over sequence
• W = convolution filter (linear transformation+tanh)
• max-pool to produce a fixed-size representation
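A small NumPy sketch of this idea: slide a window of 3 words over the sequence, apply a filter W (linear transformation + tanh) to each window, then max-pool over positions to get a fixed-size representation; the sentence length, embedding size, and number of filters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_emb, window, n_filters = 7, 4, 3, 5      # illustrative sizes
E = rng.normal(size=(seq_len, d_emb))               # word embeddings for a 7-word sentence
W = rng.normal(size=(n_filters, window * d_emb))    # one row per convolution filter
b = np.zeros(n_filters)

# Apply the filters to every window of 3 consecutive words
features = []
for start in range(seq_len - window + 1):
    window_vec = E[start:start + window].reshape(-1)     # concatenate 3 word embeddings
    features.append(np.tanh(W @ window_vec + b))
features = np.stack(features)                            # (positions, filters) = (5, 5)

sentence_rep = features.max(axis=0)    # max-pool over positions: fixed-size representation
print(sentence_rep.shape)              # (5,) regardless of sentence length
```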
Final Words
• Pros
‣ Excellent performance
‣ Less hand-engineering of features
‣ Flexible — customised architecture for different tasks
• Cons
‣ Much slower than classical ML models… needs GPU
‣ Lots of parameters due to vocabulary size
‣ Data hungry, not so good on tiny data sets
‣ Pre-training on big corpora helps
Readings
• Feed-forward network: G15, section 4; JM Ch. 7.3
• Convolutional network: G15, section 9