Adaptive Supertagging [Clark & Curran, 2007]
Start with an initial probability cutoff β
Prune a category if its probability is below β times the probability of the best category for that word
Decrease β if no spanning analysis is found (a pruning-loop sketch follows below)
[Slide figure: candidate supertag lattice for "He reads the book"
He: NP, N, N/N, NP/NP
reads: (S[pss]\NP)/NP, (S\NP)/NP, S\NP, (S[pt]\NP)/NP, (S[dcl]\NP)/NP
the: NP/N, NP/NP, N/N
book: N, (S\NP)/NP]
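The loop described above can be sketched in a few lines of Python. The tag_probs structure, the has_spanning_analysis callback and the β schedule below are illustrative assumptions, not the C&C implementation.

```python
def adaptive_supertag(tag_probs, has_spanning_analysis,
                      betas=(0.075, 0.03, 0.01, 0.005, 0.001)):
    """Adaptive supertagging sketch.
    tag_probs: one dict per word, mapping CCG category -> probability.
    has_spanning_analysis: hypothetical callback that tries to parse the
        pruned lattice and reports whether a spanning analysis exists.
    betas: decreasing sequence of probability cutoffs (illustrative values)."""
    for beta in betas:                       # decrease beta until the parse succeeds
        lattice = []
        for probs in tag_probs:
            best = max(probs.values())
            # keep a category only if p >= beta * p(best category for this word)
            lattice.append({cat for cat, p in probs.items() if p >= beta * best})
        if has_spanning_analysis(lattice):
            return lattice, beta
    return lattice, beta                     # fall back to the most permissive cutoff
```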
Recurrent neural networks (RNN)
Recurrent neural networks
• Use the same computational function and parameters across different time steps of the sequence
• Each time step: take the current input and the previous hidden state to compute the new hidden state and the output
• Loss: typically computed every time step
Recurrent neural networks
[Figure: unrolled RNN computational graph with input, state, output, loss and label nodes. Figure from Deep Learning, by Goodfellow, Bengio and Courville]
Recurrent neural networks
Math formulas (simple RNN with tanh state units and softmax output):
a(t) = b + W s(t−1) + U x(t)
s(t) = tanh(a(t))
o(t) = c + V s(t)
ŷ(t) = softmax(o(t))
Figure from Deep Learning, Goodfellow, Bengio and Courville
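A minimal NumPy sketch of the forward recurrence above, assuming tanh state units and a softmax output layer; parameter names mirror the formulas, everything else is illustrative.

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, b, c):
    """Forward pass of a simple RNN (tanh state, softmax output).
    x_seq: (T, input_dim); U: (state_dim, input_dim);
    W: (state_dim, state_dim); V: (output_dim, state_dim)."""
    T = x_seq.shape[0]
    s = np.zeros((T + 1, W.shape[0]))              # s[0] is the initial state
    y_hat = np.zeros((T, V.shape[0]))
    for t in range(1, T + 1):
        a = b + W @ s[t - 1] + U @ x_seq[t - 1]    # a(t) = b + W s(t-1) + U x(t)
        s[t] = np.tanh(a)                          # s(t) = tanh(a(t))
        o = c + V @ s[t]                           # o(t) = c + V s(t)
        e = np.exp(o - o.max())
        y_hat[t - 1] = e / e.sum()                 # y_hat(t) = softmax(o(t))
    return s, y_hat
```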
Advantage
• Hidden state: a lossy summary of the past
• Shared functions and parameters: greatly reduce model capacity, which helps generalization in learning
• Explicitly uses the prior knowledge that sequential data can be processed in the same way at different time steps (e.g., NLP)
• Yet still powerful (actually universal): any function computable by a Turing machine can be computed by such a recurrent network of a finite size (see, e.g., Siegelmann and Sontag (1995))
Recurrent Network Variations
This network can theoretically learn contexts arbitrarily far back
Many structural variations
– Elman/Simple Net
– Jordan Net
– Mixed
– Context sub-blocks, etc.
– Multiple hidden/context layers, etc.
– Generalized row representation
How do we learn the weights?
Simple Recurrent Training – Elman Training
Can think of the net as just a normal MLP structure where part of the input happens to be a copy of the last set of state/hidden node activations. The MLP itself does not even need to be aware that the context inputs come from the hidden layer
Then we can train with standard BP training (a sketch follows below)
While the network can theoretically look back arbitrarily far in time, the Elman learning gradient goes back only 1 step in time, and is thus limited in the context it can learn
– What if the current output depended on input from 2 time steps back?
Can still be useful for applications with short-term dependencies
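One way to picture this, as a sketch: the context units are simply extra inputs copied from the previous hidden activations, and everything else is ordinary MLP backprop. The mlp_forward/mlp_backward helpers below are hypothetical stand-ins for any standard MLP implementation.

```python
import numpy as np

def elman_step(x_t, context, target, params, mlp_forward, mlp_backward, lr=0.1):
    """One Elman training step. The context is just a copy of the previous
    hidden activations, appended to the input; the MLP is trained with
    ordinary backprop and never 'knows' where the context came from.
    mlp_forward / mlp_backward are hypothetical standard-MLP helpers."""
    inputs = np.concatenate([x_t, context])          # input block + context block
    hidden, output = mlp_forward(params, inputs)     # ordinary MLP forward pass
    grads = mlp_backward(params, inputs, hidden, output, target)
    for name in params:                              # plain gradient descent step
        params[name] -= lr * grads[name]
    return hidden                                    # copied in as the next context
```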
BPTT – Backprop Through Time
BPTT allows us to look back further as we train
However, we have to pre-specify a value k, which is the maximum distance that learning will look back
During training we unfold the network in time as if it were a standard feedforward network with k layers
– But where the weights of each unfolded layer are the same (shared)
We then train the unfolded k layer feedforward net with standard BP
Execution still happens with the actual recurrent version
Is not knowing k a priori that bad? How do you choose it?
– Cross Validation, just like finding best number of hidden nodes, etc., thus we can find a good k fairly reasonably for a given task
– But problematic if the amount of state needed varies a lot
• k is the number of feedback/context blocks in the unfolded net.
• Note k=1 is just standard MLP with no feedback
• The 1st block's h(0) activations are just initialized to a constant or 0, so k=1 is still the same as a standard MLP; we can simply leave the context block out for a feedforward MLP
• The last context block is h(k−1)
• k=2 is Elman training (an unfolding sketch follows below)
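A small sketch of the unfolding view, reusing the rnn_forward function from the earlier sketch: the k unrolled layers all share U, W and V, and only the inputs inside the k-step window can influence learning. The window slicing and zero initial state are illustrative choices.

```python
def unfold_k_steps(x_seq, t_end, k, U, W, V, b, c):
    """View the last k time steps ending at t_end as a k-layer feedforward
    net whose layers all share U, W and V (weights tied across layers).
    The window's initial state is just zeros, per h(0) above, so nothing
    earlier than k steps back can influence learning."""
    window = x_seq[max(0, t_end - k):t_end]    # the k inputs feeding the unrolled layers
    return rnn_forward(window, U, W, V, b, c)  # standard BP on this window is truncated BPTT
```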
Training RNN
• Principle: unfold the computational graph, and use backpropagation
• Called the back-propagation through time (BPTT) algorithm
• Can then apply any general-purpose gradient-based techniques
• Conceptually: first compute the gradients of the internal nodes, then compute the gradients of the parameters
Recurrent neural networks
Gradients for BPTT (total loss is the sum of the losses at different time steps, L = Σ_t L(t)):
• Gradient at L(t): ∂L/∂L(t) = 1
• Gradient at o(t) (softmax output, negative log-likelihood loss): ∇o(t) L = ŷ(t) − one-hot(y(t))
• Gradient at s(τ) (the final time step τ, which only feeds o(τ)): ∇s(τ) L = Vᵀ ∇o(τ) L
• Gradient at s(t) for t < τ: ∇s(t) L = Wᵀ diag(1 − s(t+1)²) ∇s(t+1) L + Vᵀ ∇o(t) L
• Gradient at parameter V: ∇V L = Σ_t (∇o(t) L) s(t)ᵀ
Formulas follow the treatment in Deep Learning, Goodfellow, Bengio and Courville
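The formulas above, collected into a minimal NumPy backward pass that pairs with the rnn_forward sketch: integer labels y_seq, softmax outputs, tanh units. Bias gradients are omitted for brevity; everything here is illustrative rather than the book's code.

```python
import numpy as np

def rnn_backward(x_seq, y_seq, s, y_hat, U, W, V):
    """BPTT gradients for the simple RNN sketched earlier.
    y_seq: integer class labels, one per time step;
    s, y_hat: outputs of rnn_forward."""
    T = x_seq.shape[0]
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    da_next = np.zeros(W.shape[0])                 # grad at a(t+1); zero past the end
    for t in reversed(range(1, T + 1)):
        do = y_hat[t - 1].copy()
        do[y_seq[t - 1]] -= 1.0                    # grad at o(t): y_hat(t) - one_hot(y(t))
        dV += np.outer(do, s[t])                   # grad at V: sum_t grad_o(t) s(t)^T
        ds = V.T @ do + W.T @ da_next              # grad at s(t): from o(t) and from s(t+1)
        da = (1.0 - s[t] ** 2) * ds                # through tanh: diag(1 - s(t)^2)
        dW += np.outer(da, s[t - 1])               # shared W accumulates over all steps
        dU += np.outer(da, x_seq[t - 1])           # shared U accumulates over all steps
        da_next = da
    return dU, dW, dV
```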
Dealing with the vanishing/exploding gradient in RNNs
Gradient clipping – for large gradients – a type of adaptive LR (a sketch follows below)
Linear self-connection with weight near one to preserve the gradient – leaky unit
Skip connections
– Make sure units can be influenced by units d steps back; still limited by the amount of skipping, etc.
Time delays and different time scales
LSTM – Long short-term memory – current state of the art
– Gated recurrent network
– Keeps a self-loop to maintain state and gradient constant as long as needed – the self-loop is gated by another learning node – the forget gate
– Learns when to use and forget the state
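A common concrete form of the gradient clipping mentioned above is clipping by global norm; a minimal sketch, with an illustrative threshold:

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Clip by global norm: rescale all gradients when their combined L2 norm
    exceeds max_norm, so update directions are kept but step sizes are bounded.
    grads: dict mapping parameter name -> gradient array; max_norm is illustrative."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads = {name: g * scale for name, g in grads.items()}
    return grads
```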
Other Recurrent Approaches
GRUs
RTRL – Real Time Recurrent Learning
– Does not have to specify a k; will look arbitrarily far back
– But note that with an expectation of looking arbitrarily far back, you create a very difficult learning problem
– Looking back further requires an increase in data, else overfit
– Lots of irrelevant options which could lead to only minor accuracy improvements
– Have reasonable expectations
– n⁴ and n³ versions – too expensive in practice
Recursive Network – Dynamic tree structures
Reservoir computing: Echo State Networks and Liquid State Machines
Hessian-Free Learning
Tuned initial states and momentum
Neural Turing Machine – RNN which can learn to read/write memory
Relaxation networks – Hopfield, Boltzmann, Multcons, etc.
Supertagging with a RNN
• Using only dense features
– word embedding
– suffix embedding
– capitalization
• The input layer is a concatenation of all embeddings of all words in a context window
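A sketch of how such an input layer could be assembled. The lookup tables word_emb, suffix_emb and cap_emb, the window size and the suffix length below are hypothetical; the paper's exact feature set is not reproduced here.

```python
import numpy as np

def build_input(words, position, word_emb, suffix_emb, cap_emb,
                window=3, suffix_len=2):
    """Concatenate word, suffix and capitalization embeddings for every
    word in a context window centred on `position`.
    word_emb/suffix_emb: dicts str -> np.ndarray with "<UNK>" and "<PAD>" keys.
    cap_emb: dict bool -> np.ndarray (capitalized or not)."""
    half = window // 2
    parts = []
    for i in range(position - half, position + half + 1):
        if 0 <= i < len(words):
            w = words[i]
            parts.append(word_emb.get(w.lower(), word_emb["<UNK>"]))
            parts.append(suffix_emb.get(w.lower()[-suffix_len:], suffix_emb["<UNK>"]))
            parts.append(cap_emb[w[0].isupper()])   # capitalization feature
        else:                                       # off either end of the sentence
            parts.append(word_emb["<PAD>"])
            parts.append(suffix_emb["<PAD>"])
            parts.append(cap_emb[False])
    return np.concatenate(parts)                    # the RNN input at this position
```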
Supertagging with a RNN
… bought some books and …
1-best Supertagging Results: dev

Model            Accuracy   Time
c&c (gold pos)   92.60      –
c&c (auto pos)   91.50      0.57
NN               91.10      21.00
RNN              92.63      –
RNN+dropout      93.07      2.02

Table 1: 1-best tagging accuracy and speed comparison on CCGBank Section 00 with a single CPU core (1,913 sentences); tagging time in secs.
1-best Supertagging Results: test

Model            Section 23   Wiki    Bio
c&c (gold pos)   93.32        88.80   91.85
c&c (auto pos)   92.02        88.80   89.08
NN               91.57        89.00   88.16
RNN              93.00        90.00   88.27

Table 2: 1-best tagging accuracy comparison on CCGBank Section 23 (2,407 sentences), Wikipedia (200 sentences) and Bio-GENIA (1,000 sentences).
Multi-tagging Results: dev
[Figure: multi-tagging accuracy (y-axis, 0.96–1.0) vs. ambiguity level (x-axis, 0–16) on the dev set, comparing RNN+dropout, RNN, NN and C&C.]
Multi-tagging Results: test
[Figure: multi-tagging accuracy (y-axis, 0.965–1.0) vs. ambiguity level (x-axis, 0–90) on the test set, comparing RNN+dropout, NN and C&C.]
Final Parsing Results

                CCGBank Section 23              Wikipedia
Model           LP     LR     LF     cov.       LP     LR     LF     cov.
c&c             86.24  84.85  85.54  99.42      81.58  80.08  80.83  99.50
NN              86.71  85.56  86.13  99.92      82.65  81.36  82.00  100
RNN             87.68  86.47  87.07  99.96      83.22  81.78  82.49  100
c&c             86.24  84.17  85.19  100        81.58  79.48  80.52  100
NN              86.71  85.40  86.05  100        –      –      –      –
RNN             87.68  86.41  87.04  100        –      –      –      –

Table 3: Parsing test results (auto pos). We evaluate on all sentences (100% coverage) as well as on only those sentences that returned spanning analyses (% cov.). RNN and NN both have 100% coverage on the Wikipedia data.