Adaptive Supertagging [Clark & Curran, 2007]
Start with an initial probability cutoff β

He      reads             the      book
NP      (S[pss]\NP)/NP    NP/N     N
N       (S\NP)/NP         NP/NP    (S\NP)/NP
N/N     S\NP              N/N
NP/NP   (S[pt]\NP)/NP
        (S[dcl]\NP)/NP
Adaptive Supertagging [Clark & Curran, 2007]
Prune a category if its probability is below β times the probability of the best category
Adaptive Supertagging [Clark & Curran, 2007]
Decrease β if no spanning analysis is found
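A minimal sketch of this loop in Python; the β schedule and the `parse` callable are illustrative assumptions, not details from the slides:

```python
def adaptive_supertag(tag_probs, betas=(0.075, 0.03, 0.01, 0.005, 0.001), parse=None):
    """tag_probs: one dict per word mapping CCG category -> probability.
    Keep categories within beta * best; back off to a smaller beta if the
    parser finds no spanning analysis (parse is a hypothetical callable)."""
    pruned = []
    for beta in betas:                      # start with the tightest cutoff
        pruned = []
        for probs in tag_probs:
            best = max(probs.values())
            pruned.append([c for c, p in probs.items() if p >= beta * best])
        chart = parse(pruned) if parse else None
        if chart is not None:               # spanning analysis found
            return pruned, chart
    return pruned, None                     # give up at the loosest cutoff
```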
Neural Networks (Fig: courtesy R Socher)
Neural networks can be built for different input and output types.
– Outputs can be:
– Linear, single output
– Linear, multiple outputs
– Binary, single output (logistic)
– Binary, multiple outputs (logistic)
– 1-of-K multinomial output (softmax)
– Inputs can be:
– A scalar number
– A vector of real numbers
– A vector of binary values
Goal of training: Given the training data (inputs, targets) and the architecture, determine the model parameters.
Model parameters for a 3-layer network:
– Weight matrix from the input layer to the hidden layer (W_ji)
– Weight matrix from the hidden layer to the output layer (W_kj)
– Bias terms for the hidden layer
– Bias terms for the output layer
Our strategy will be:
– Compute the error at the output
– Determine the contribution of each parameter to the error by taking the derivative of the error w.r.t. the parameter
– Update each parameter commensurate with the error it contributed (the usual gradient-descent update, shown below)
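The last step is the usual gradient-descent update; the learning-rate symbol η is an assumption, since the slide does not name it:

w \leftarrow w - \eta \, \frac{\partial E}{\partial w}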
Design Choices
• When building a neural network, the designer chooses the following hyperparameters and nonlinearities based on the characteristics of the application (an illustrative configuration follows this list):
• Number of hidden layers
• Number of hidden units in each layer
• Learning rate
• Regularization coefficient
• Number of outputs
• Type of output (linear, logistic, softmax)
• Choice of nonlinearity at the output layer and hidden layer (see next slide)
• Input representation and dimensionality
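A minimal sketch of such a configuration as a Python dataclass; every name and default value below is illustrative, not a recommendation from the slides:

```python
from dataclasses import dataclass

@dataclass
class Hyperparameters:
    """Illustrative design choices fixed before training."""
    n_hidden_layers: int = 1
    n_hidden_units: int = 100
    learning_rate: float = 0.01
    l2_coefficient: float = 1e-4       # regularization coefficient
    n_outputs: int = 10
    output_type: str = "softmax"       # "linear", "logistic" or "softmax"
    hidden_nonlinearity: str = "tanh"  # or "sigmoid", "relu"
    input_dim: int = 50                # input representation/dimensionality
```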
Commonly used nonlinearities (fig: courtesy Socher)
Objective Functions and gradients
• Linear – Mean squared error
• E(w) = \frac{1}{2N} \sum_{n=1}^{N} (t_n - y_n)^2
• Logistic with binary classification: cross-entropy error
• Logistic with k outputs (k > 2): cross-entropy error
• Softmax, 1-of-K multinomial classification: cross-entropy error (minimize the NLL)
• In all of the above cases we can show that the gradient with respect to the net input of output unit k is (y_k − t_k), where y_k is the predicted output for unit k and t_k is the corresponding target (a short derivation for the softmax case follows)
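A short sketch of why this holds in the softmax case, writing a_k for the net input of output unit k:

y_k = \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad E = -\sum_k t_k \ln y_k, \qquad \frac{\partial E}{\partial a_k} = y_k \sum_j t_j - t_k = y_k - t_k

since the one-hot targets satisfy \sum_j t_j = 1; the logistic and linear cases work out to the same expression.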
High Level Backpropagation Algorithm
• Apply the input vector to the network and forward propagate. This
will yield the activations for hidden layer(s) and the output layer
• net_j = \sum_i w_{ji} z_i
• z_j = h(net_j), where h is your choice of nonlinearity; usually it is the sigmoid or tanh. The rectified linear unit (ReLU) is also used.
• Evaluate the error 𝛿𝑘 for all the output units
𝛿𝑘 = 𝑜𝑘 − 𝑡𝑘 where 𝑜𝑘 is the output produced by the model and 𝑡𝑘 is the
target provided in the training dataset
• Backpropagate the δ's to obtain δ_j for each hidden unit j:
\delta_j = h'(net_j) \sum_k w_{kj} \delta_k   (for a sigmoid hidden unit, h'(net_j) = z_j(1 − z_j))
• Evaluate the required derivatives:
\frac{\partial E}{\partial w_{ji}} = \delta_j z_i
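A minimal numpy sketch of these four steps for a single training example, assuming a sigmoid hidden layer, linear output units and squared error; all variable names are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, b1, W2, b2, lr=0.1):
    """One forward/backward pass for a 3-layer net following the slide's delta rules."""
    # Forward propagate: net_j = sum_i w_ji * z_i, z_j = h(net_j)
    net_hidden = W1 @ x + b1
    z = sigmoid(net_hidden)
    o = W2 @ z + b2                               # linear output units

    # Output deltas: delta_k = o_k - t_k
    delta_out = o - t

    # Backpropagate: delta_j = h'(net_j) * sum_k w_kj * delta_k
    delta_hidden = z * (1.0 - z) * (W2.T @ delta_out)

    # Required derivatives: dE/dw_ji = delta_j * z_i, then gradient-descent update
    W2 -= lr * np.outer(delta_out, z)
    b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hidden, x)          # here z_i = x_i
    b1 -= lr * delta_hidden
    return W1, b1, W2, b2
```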
Recurrent neural networks (RNN)
Recurrent neural networks
• Use the same computational function and parameters across different
time steps of the sequence
• Each time step takes the current input and the previous hidden state to compute the output (a forward-pass sketch follows)
• Loss: typically computed at every time step
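A minimal forward-pass sketch of this recurrence; the parameter names (W, U, V, b, c) follow the usual convention and are assumptions, not taken from the slide:

```python
import numpy as np

def rnn_forward(xs, h0, W, U, V, b, c):
    """Apply the same function and parameters at every time step:
    each step combines the current input with the previous hidden state."""
    h, outputs = h0, []
    for x in xs:
        h = np.tanh(b + W @ h + U @ x)   # new hidden state from input + previous state
        outputs.append(c + V @ h)        # output entry for this time step
    return outputs, h
```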
Recurrent neural networks
Figure from Deep Learning, by Goodfellow, Bengio and Courville
(Unrolled computational graph with nodes for the input, state, output, loss and label at each time step.)
Recurrent neural networks
Figure from Deep Learning,
Goodfellow, Bengio and Courville
Math formula:
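The formula itself appears on the figure; assuming tanh hidden units and softmax outputs, the standard recurrence in this notation (the book writes h for the state; the later slides write s) is:

a^{(t)} = b + W s^{(t-1)} + U x^{(t)}
s^{(t)} = \tanh(a^{(t)})
o^{(t)} = c + V s^{(t)}
\hat{y}^{(t)} = \mathrm{softmax}(o^{(t)})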
Advantage
• Hidden state: a lossy summary of the past
• Shared functions and parameters: greatly reduce the capacity, which is good for generalization in learning
• Explicitly use the prior knowledge that sequential data can be processed in the same way at different time steps (e.g., NLP)
• Yet still powerful (actually universal): any function computable by a
Turing machine can be computed by such a recurrent network of a
finite size (see, e.g., Siegelmann and Sontag (1995))
Recurrent Network Variations
This network can theoretically learn contexts arbitrarily far
back
Many structural variations
– Elman/Simple Net
– Jordan Net
– Mixed
– Context sub-blocks, etc.
– Multiple hidden/context layers, etc.
– Generalized row representation
How do we learn the weights?
Simple Recurrent Training – Elman Training
Can think of net as just being a normal MLP structure
where part of the input happens to be a copy of the last set
of state/hidden node activations. The MLP itself does not
even need to be aware that the context inputs are coming
from the hidden layer
Then can train with standard BP training
While the network can theoretically look back arbitrarily far in time, the Elman learning gradient goes back only 1 step in time, and is thus limited in the context it can learn
– What if the current output depended on input from 2 time steps back?
Can still be useful for applications with short-term dependencies (a minimal sketch of the Elman view follows)
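A minimal sketch of this view, with hypothetical weight names; note that the context is just a copy, so standard BP never propagates a gradient back through the copy:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def elman_step(x, context, W_in, W_ctx, W_out, b_h, b_o):
    """Treat the copied hidden activations ('context') as ordinary extra inputs
    to a normal MLP, then train that MLP with standard BP."""
    h = sigmoid(W_in @ x + W_ctx @ context + b_h)   # hidden layer
    y = sigmoid(W_out @ h + b_o)                    # output layer
    new_context = h.copy()    # copy of hidden activations fed in at the next step
    return y, new_context
```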
Recurrent Neural Networks 7
BPTT – Backprop Through Time
BPTT allows us to look back further as we train
However we have to pre-specify a value k, which is the maximum number of steps that learning will look back
During training we unfold the network in time as if it were a standard feedforward network with k layers
– But where the weights of each unfolded layer are the same (shared)
We then train the unfolded k-layer feedforward net with standard BP (a truncated-BPTT sketch follows at the end of this slide)
Execution still happens with the actual recurrent version
Is not knowing k a priori that bad? How do you choose it?
– Cross-validation, just like finding the best number of hidden nodes, etc.; thus we can find a good k fairly reasonably for a given task
– But problematic if the amount of state needed varies a lot
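A sketch of truncated BPTT for a vanilla tanh RNN with a squared-error loss at the last step of the window; the parameter names and the single end-of-window loss are simplifying assumptions:

```python
import numpy as np

def truncated_bptt_grads(xs, h0, W, U, V, target, k):
    """Unfold the RNN for the last k steps (shared W, U, V) and backprop a
    squared-error loss at the final output, as in the unfolded k-layer net."""
    xs = xs[-k:]                         # keep only the k most recent inputs
    hs, h = [h0], h0
    for x in xs:                         # forward: unfolded k-layer net
        h = np.tanh(W @ h + U @ x)
        hs.append(h)
    y = V @ h                            # linear output at the last step
    dy = y - target                      # dL/dy for L = 0.5 * ||y - target||^2
    dV = np.outer(dy, h)
    dh = V.T @ dy
    dW, dU = np.zeros_like(W), np.zeros_like(U)
    for t in reversed(range(len(xs))):   # backward through the k unfolded steps
        da = (1.0 - hs[t + 1] ** 2) * dh # through tanh
        dW += np.outer(da, hs[t])        # shared weights: gradients accumulate
        dU += np.outer(da, xs[t])
        dh = W.T @ da                    # pass gradient to the previous step
    return dW, dU, dV
```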
• k is the number of feedback/context blocks in the unfolded net
• Note k = 1 is just a standard MLP with no feedback
• The 1st block h(0) activations are just initialized to a constant or 0, so k = 1 is still the same as a standard MLP; just leave it out for a feedforward MLP
• The last context block is h(k−1)
• k = 2 is Elman training
Training RNN
• Principle: unfold the computational graph, and use backpropagation
• Called back-propagation through time (BPTT) algorithm
• Can then apply any general-purpose gradient-based techniques
• Conceptually: first compute the gradients of the internal nodes, then
compute the gradients of the parameters
Recurrent neural networks
Figure from Deep Learning, by Goodfellow, Bengio and Courville
Gradient at 𝐿(𝑡): (the total loss is the sum of the losses at the individual time steps)
Gradient at 𝑜(𝑡):
Gradient at 𝑠(𝜏):
Gradient at 𝑠(𝑡):
Gradient at parameter 𝑉:
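The formulas themselves are on the figure; as a sketch in the slides' notation (state s, tanh hidden units, softmax outputs and negative log-likelihood loss, all of which are assumptions about what the figure shows):

\frac{\partial L}{\partial L^{(t)}} = 1
\left(\nabla_{o^{(t)}} L\right)_i = \hat{y}^{(t)}_i - \mathbf{1}_{i = y^{(t)}}
\nabla_{s^{(\tau)}} L = V^\top \nabla_{o^{(\tau)}} L \quad (\text{final time step } \tau)
\nabla_{s^{(t)}} L = W^\top \mathrm{diag}\!\left(1 - (s^{(t+1)})^2\right) \nabla_{s^{(t+1)}} L + V^\top \nabla_{o^{(t)}} L
\nabla_{V} L = \sum_t \left(\nabla_{o^{(t)}} L\right) s^{(t)\top}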
Dealing with the vanishing/exploding gradient in RNNs
Gradient clipping – for large gradients – a type of adaptive learning rate
Linear self-connection with weight near one to preserve the gradient – leaky unit
Skip connections
– Make sure the state can be influenced by units d steps back; still limited by the amount of skipping, etc.
Time delays and different time scales
LSTM – Long Short-Term Memory – current state of the art
– Gated recurrent network
– Keeps a self loop to maintain state and gradient constant as long as needed – the self loop is gated by another learning node, the forget gate
– Learns when to use and forget the state (a single-step sketch follows)
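A minimal numpy sketch of one LSTM step; the gate weight names (Wf, Wi, Wg, Wo) and the concatenated-input formulation are illustrative conventions, not details from the slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wg, Wo, bf, bi, bg, bo):
    """One LSTM step: the self loop carries the cell state c, and the learned
    forget gate f decides how long that state (and its gradient) is kept."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + bf)          # forget gate: keep or drop old state
    i = sigmoid(Wi @ z + bi)          # input gate: admit new information
    g = np.tanh(Wg @ z + bg)          # candidate state
    o = sigmoid(Wo @ z + bo)          # output gate
    c = f * c_prev + i * g            # gated self loop maintains the state
    h = o * np.tanh(c)                # new hidden state
    return h, c
```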
Other Recurrent Approaches
LSTM
RTRL – Real Time Recurrent Learning
– No need to specify a k; will look arbitrarily far back
– But note that with an expectation of looking arbitrarily far back, you create a very difficult learning problem
– Looking back more requires an increase in data, else overfit – lots of irrelevant options which could lead to minor accuracy improvements
– Have reasonable expectations
– O(n^4) and O(n^3) versions – too expensive in practice
Recursive Network – dynamic tree structures
Reservoir computing: Echo State Networks and Liquid State Machines
Hessian-Free Learning
Tuned initial states and momentum
Neural Turing Machine – an RNN which can learn to read/write memory
Relaxation networks – Hopfield, Boltzmann, Multcons, etc.
Supertagging with an RNN
• Using only dense features
– word embedding
– suffix embedding
– capitalization
• The input layer is a concatenation of the embeddings of all words in a context window (a sketch follows)
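A minimal sketch of how such an input vector might be assembled; the lookup tables (word_emb, suffix_emb, cap_emb), the 2-character suffix, and the window size are illustrative assumptions:

```python
import numpy as np

def input_vector(words, position, word_emb, suffix_emb, cap_emb, window=7):
    """Concatenate word, suffix and capitalization embeddings for every word
    in a context window centred on `position` (hypothetical lookup tables)."""
    half = window // 2
    parts = []
    for i in range(position - half, position + half + 1):
        w = words[i] if 0 <= i < len(words) else "<PAD>"
        parts.append(word_emb.get(w, word_emb["<UNK>"]))       # word embedding
        parts.append(suffix_emb.get(w[-2:], suffix_emb["<UNK>"]))  # suffix embedding
        parts.append(cap_emb[w[0].isupper()])                   # capitalization embedding
    return np.concatenate(parts)
```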
Supertagging with an RNN
(Figure: the RNN supertagger applied to the example input "… bought some books and …".)
1-best Supertagging Results: dev

Model            Accuracy   Time (secs)
c&c (gold pos)   92.60      –
c&c (auto pos)   91.50      0.57
NN               91.10      21.00
RNN              92.63      –
RNN+dropout      93.07      2.02

Table 1: 1-best tagging accuracy and speed comparison on CCGBank Section 00 with a single CPU core (1,913 sentences); tagging time in secs.
1-best Supertagging Results: test

Model            Section 23   Wiki    Bio
c&c (gold pos)   93.32        88.80   91.85
c&c (auto pos)   92.02        88.80   89.08
NN               91.57        89.00   88.16
RNN              93.00        90.00   88.27

Table 2: 1-best tagging accuracy comparison on CCGBank Section 23 (2,407 sentences), Wikipedia (200 sentences) and Bio-GENIA (1,000 sentences).
Multi-tagging Results: dev
(Plot: multi-tagging accuracy (0.96–1.0) vs. ambiguity level (0–16) for RNN + dropout, RNN, NN and C&C.)
Multi-tagging Results: test
(Plot: multi-tagging accuracy (0.965–1.0) vs. ambiguity level (0–90) for RNN + dropout, NN and C&C.)
Final Parsing Results

             CCGBank Section 23              Wikipedia
Model        LP      LR      LF      cov.    LP      LR      LF      cov.
c&c          86.24   84.85   85.54   99.42   81.58   80.08   80.83   99.50
(NN)         86.71   85.56   86.13   99.92   82.65   81.36   82.00   100
(RNN)        87.68   86.47   87.07   99.96   83.22   81.78   82.49   100
c&c          86.24   84.17   85.19   100     81.58   79.48   80.52   100
(NN)         86.71   85.40   86.05   100     –       –       –       –
(RNN)        87.68   86.41   87.04   100     –       –       –       –

Table 3: Parsing test results (auto pos). We evaluate on all sentences (100% coverage) as well as on only those sentences that returned spanning analyses (% cov.). RNN and NN both have 100% coverage on the Wikipedia data.