
Adaptive Supertagging [Clark & Curran, 2007]
Start with an initial probability cutoff β. Candidate lexical categories for each word of "He reads the book":

He      NP   N   N/N   NP/NP
reads   (S[pss]\NP)/NP   (S\NP)/NP   S\NP   (S[pt]\NP)/NP   (S[dcl]\NP)/NP
the     NP/N   NP/NP   N/N
book    N   (S\NP)/NP

Adaptive Supertagging [Clark & Curran, 2007]
Prune a category if its probability is below β times the probability of the best category for that word:

He      NP   N   N/N   NP/NP
reads   (S[pss]\NP)/NP   (S\NP)/NP   S\NP   (S[pt]\NP)/NP   (S[dcl]\NP)/NP
the     NP/N   NP/NP   N/N
book    N   (S\NP)/NP

Adaptive Supertagging [Clark & Curran, 2007]
Decrease β if no spanning analysis is found (a sketch of the full procedure follows below):

He      NP   N   N/N   NP/NP
reads   (S[pss]\NP)/NP   (S\NP)/NP   S\NP   (S[pt]\NP)/NP   (S[dcl]\NP)/NP
the     NP/N   NP/NP   N/N
book    N   (S\NP)/NP
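
A minimal sketch of this adaptive loop. The helper functions `tag_probabilities` and `has_spanning_analysis` are hypothetical stand-ins for the supertagger and parser, and the β schedule is illustrative, not Clark & Curran's exact implementation:

```python
def adaptive_supertag(sentence, tag_probabilities, has_spanning_analysis,
                      betas=(0.075, 0.03, 0.01, 0.005, 0.001)):
    """Assign lexical categories with an adaptive probability cutoff.

    tag_probabilities(sentence) -> list of {category: prob} dicts, one per word
    has_spanning_analysis(categories) -> True if the parser finds a spanning analysis
    betas: decreasing cutoffs to back off through (illustrative values)
    """
    per_word_probs = tag_probabilities(sentence)
    for beta in betas:                       # start with the tightest cutoff
        categories = []
        for probs in per_word_probs:
            best = max(probs.values())
            # prune a category if its probability is below beta * best
            kept = [c for c, p in probs.items() if p >= beta * best]
            categories.append(kept)
        if has_spanning_analysis(categories):
            return categories
    return categories                        # fall back to the most permissive assignment
```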

Recurrent neural networks (RNN)

Recurrent neural networks
• Use the same computational function and parameters across different time steps of the sequence
• At each time step: use the current input and the previous hidden state to compute the new hidden state and the output
• Loss: typically computed at every time step (see the sketch below)
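
A minimal numpy sketch of these ideas, assuming a tanh hidden state and a softmax output with a cross-entropy loss at every step; the parameter names (U, W, V, b, c) follow the notation used later in these slides, and the dimensions are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, ys, U, W, V, b, c, s0):
    """Run an RNN over a sequence, reusing the same parameters at every step.

    xs: list of input vectors x(t); ys: list of gold label indices y(t)
    U, W, V, b, c: shared parameters; s0: initial hidden state
    Returns hidden states, output distributions, and the total loss (sum over steps).
    """
    s, states, probs, total_loss = s0, [], [], 0.0
    for x, y in zip(xs, ys):
        s = np.tanh(b + W @ s + U @ x)        # new state from input and previous state
        p = softmax(c + V @ s)                # output distribution at this step
        total_loss += -np.log(p[y])           # per-step cross-entropy loss
        states.append(s)
        probs.append(p)
    return states, probs, total_loss

# Illustrative usage with made-up dimensions
rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 5, 8, 3, 4
xs = [rng.normal(size=n_in) for _ in range(T)]
ys = [int(rng.integers(n_out)) for _ in range(T)]
U, W, V, b, c = [rng.normal(scale=0.1, size=shape) for shape in
                 [(n_hid, n_in), (n_hid, n_hid), (n_out, n_hid), (n_hid,), (n_out,)]]
states, probs, loss = rnn_forward(xs, ys, U, W, V, b, c, s0=np.zeros(n_hid))
```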

Recurrent neural networks
[Figure: RNN computational graph, showing the input, hidden state, output, loss, and label at each time step; from Deep Learning, by Goodfellow, Bengio, and Courville]

Recurrent neural networks
Math formulas (following the notation of Goodfellow et al., with hidden state s(t)):

a(t) = b + W s(t−1) + U x(t)
s(t) = tanh(a(t))
o(t) = c + V s(t)
ŷ(t) = softmax(o(t))

[Figure from Deep Learning, by Goodfellow, Bengio, and Courville]

Advantages
• Hidden state: a lossy summary of the past
• Shared functions and parameters: greatly reduce model capacity, which helps generalization
• Explicitly use the prior knowledge that sequential data can be processed in the same way at different time steps (e.g., in NLP)
• Yet still powerful (in fact universal): any function computable by a Turing machine can be computed by such a recurrent network of finite size (see, e.g., Siegelmann and Sontag (1995))


Recurrent Network Variations
• This network can theoretically learn contexts arbitrarily far back
• Many structural variations
– Elman/Simple Net
– Jordan Net (a sketch contrasting the Elman and Jordan context updates follows this list)
– Mixed
– Context sub-blocks, etc.
– Multiple hidden/context layers, etc.
– Generalized row representation
• How do we learn the weights?
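
A minimal numpy sketch of the two most common variations, assuming a single hidden layer; the only difference shown is what gets copied into the context block (previous hidden activations for Elman, previous output for Jordan). Names and shapes are illustrative:

```python
import numpy as np

def elman_step(x, context, W_in, W_ctx, W_out):
    """Elman/simple net: the context block holds the previous hidden activations,
    so W_ctx has shape (n_hidden, n_hidden)."""
    h = np.tanh(W_in @ x + W_ctx @ context)   # hidden from input + context
    y = W_out @ h                              # output
    return y, h                                # next step's context = h

def jordan_step(x, context, W_in, W_ctx, W_out):
    """Jordan net: the context block holds the previous output,
    so W_ctx has shape (n_hidden, n_output)."""
    h = np.tanh(W_in @ x + W_ctx @ context)
    y = W_out @ h
    return y, y                                # next step's context = y
```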

Simple Recurrent Training – Elman Training
• Can think of the net as just a normal MLP structure where part of the input happens to be a copy of the last set of state/hidden node activations. The MLP itself does not even need to be aware that the context inputs come from the hidden layer
• Can then train with standard BP training
• While the network can theoretically look back arbitrarily far in time, the Elman learning gradient goes back only 1 step in time, so it is limited in the context it can learn
– Would fail if the current output depended on input from 2 time steps back
• Can still be useful for applications with short-term dependencies (see the sketch below)
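
A minimal numpy sketch of this view, assuming a squared-error loss and a hypothetical one-step training routine: the context input is just a stored copy of the previous hidden activations, treated as a constant, so the gradient never flows more than one step back.

```python
import numpy as np

def elman_train_step(x, context, target, W_in, W_ctx, W_out, lr=0.1):
    """One step of Elman training: standard BP on an MLP whose extra inputs
    are a frozen copy of the previous hidden activations (no gradient flows
    back through `context`, so learning looks back only 1 step)."""
    h = np.tanh(W_in @ x + W_ctx @ context)       # forward pass
    y = W_out @ h
    err = y - target                               # squared-error gradient at the output
    dW_out = np.outer(err, h)
    dh = (W_out.T @ err) * (1.0 - h ** 2)          # backprop through tanh
    dW_in = np.outer(dh, x)
    dW_ctx = np.outer(dh, context)                 # context treated as a plain input
    W_in -= lr * dW_in
    W_ctx -= lr * dW_ctx
    W_out -= lr * dW_out
    return h                                       # becomes the next step's context
```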

BPTT – Backprop Through Time
• BPTT allows us to look back further as we train
• However, we have to pre-specify a value k, which is the maximum number of steps that learning will look back
• During training we unfold the network in time as if it were a standard feedforward network with k layers
– But where the weights of each unfolded layer are the same (shared)
• We then train the unfolded k-layer feedforward net with standard BP
• Execution still happens with the actual recurrent version
• Is not knowing k a priori that bad? How do you choose it?
– Cross validation, just like finding the best number of hidden nodes, etc.; thus we can find a good k fairly reasonably for a given task
– But problematic if the amount of state needed varies a lot

• k is the number of feedback/context blocks in the unfolded net
• Note k=1 is just a standard MLP with no feedback
• The 1st block's h(0) activations are just initialized to a constant or 0, so k=1 is still the same as a standard MLP; just leave it out for a feedforward MLP
• The last context block is h(k−1)
• k=2 is Elman training (a truncated-BPTT sketch with an explicit k follows below)
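
A minimal numpy sketch of this unfolding, assuming the simple tanh cell from earlier and a squared-error loss at the final step: the network is unrolled over the last k inputs, treated as a k-layer feedforward net with shared weights, and trained with standard BP (gradients from all unfolded copies are summed into the single shared weight matrices). Names, dimensions, and the single-output loss are illustrative:

```python
import numpy as np

def bptt_k_step(xs, target, U, W, V, k, lr=0.05):
    """Truncated BPTT: unfold the net over the last k inputs and backprop
    through the unfolded copies, whose weights are shared."""
    xs = xs[-k:]                                   # only look back k steps
    s = np.zeros(W.shape[0])                       # h(0): constant initial context
    states, prevs = [], []
    for x in xs:                                   # forward through the k unfolded layers
        prevs.append(s)
        s = np.tanh(U @ x + W @ s)
        states.append(s)
    y = V @ s
    err = y - target                               # squared-error gradient at the output
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dV += np.outer(err, s)
    ds = V.T @ err                                 # gradient flowing into the last state
    for t in reversed(range(len(xs))):             # backward through the unfolded layers
        da = ds * (1.0 - states[t] ** 2)           # through tanh
        dU += np.outer(da, xs[t])                  # shared weights: accumulate gradients
        dW += np.outer(da, prevs[t])
        ds = W.T @ da                              # pass gradient to the previous copy
    for P, dP in ((U, dU), (W, dW), (V, dV)):
        P -= lr * dP                               # one standard gradient step
    return y
```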

Training RNN
• Principle: unfold the computational graph, and use backpropagation
• Called the back-propagation through time (BPTT) algorithm
• Can then apply any general-purpose gradient-based techniques
• Conceptually: first compute the gradients of the internal nodes, then compute the gradients of the parameters

Recurrent neural networks
Math formulas (recall), with the per-step losses summed into the total loss:

s(t) = tanh(b + W s(t−1) + U x(t))
o(t) = c + V s(t),   ŷ(t) = softmax(o(t)),   L = Σ_t L(t)

[Figure from Deep Learning, by Goodfellow, Bengio, and Courville]

Recurrent neural networks
Gradient at 𝐿(𝑡) (the total loss is the sum of the per-step losses):

L = Σ_t L(t)   ⇒   ∂L/∂L(t) = 1

[Figure from Deep Learning, by Goodfellow, Bengio, and Courville]

Recurrent neural networks
Gradient at 𝑜(𝑡) (softmax outputs with cross-entropy loss):

(∇_o(t) L)_i = ∂L/∂o(t)_i = ŷ(t)_i − 1[i = y(t)]

[Figure from Deep Learning, by Goodfellow, Bengio, and Courville]

Recurrent neural networks
Gradient at 𝑠(𝜏) (the final time step; its only descendant is o(τ)):

∇_s(τ) L = Vᵀ ∇_o(τ) L

[Figure from Deep Learning, by Goodfellow, Bengio, and Courville]

Recurrent neural networks
Gradient at 𝑠(𝑡) (for t < τ; descendants are o(t) and s(t+1), with tanh units):

∇_s(t) L = Wᵀ diag(1 − s(t+1)²) ∇_s(t+1) L + Vᵀ ∇_o(t) L

[Figure from Deep Learning, by Goodfellow, Bengio, and Courville]

Recurrent neural networks
Gradient at parameter 𝑉 (sum the contributions from every time step):

∇_V L = Σ_t (∇_o(t) L) s(t)ᵀ

(a numpy sketch of these BPTT gradient computations follows below)

[Figure from Deep Learning, by Goodfellow, Bengio, and Courville]
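
A minimal numpy sketch that implements the forward pass and the gradients above (∂L/∂L(t), ∇_o(t)L, ∇_s(t)L, ∇_V L, plus the remaining parameter gradients), assuming tanh units and a softmax cross-entropy loss at every step; shapes and initialization are up to the caller and illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_bptt(xs, ys, U, W, V, b, c, s0):
    """Forward pass plus BPTT gradients for a tanh RNN with per-step
    softmax cross-entropy losses; returns the total loss and the
    gradients of U, W, V, b, c."""
    T = len(xs)
    states, probs, loss = [], [], 0.0
    s = s0
    for t in range(T):                                  # forward
        s = np.tanh(b + W @ s + U @ xs[t])
        p = softmax(c + V @ s)
        loss += -np.log(p[ys[t]])
        states.append(s)
        probs.append(p)

    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    ds_next = np.zeros_like(s0)                         # no gradient flows in from beyond step T-1
    for t in reversed(range(T)):                        # backward
        do = probs[t].copy()
        do[ys[t]] -= 1.0                                # grad at o(t): yhat(t) - onehot(y(t))
        dV += np.outer(do, states[t])                   # grad at V: sum_t do s(t)^T
        dc += do
        ds = V.T @ do + ds_next                         # grad at s(t)
        da = ds * (1.0 - states[t] ** 2)                # through tanh
        db += da
        prev = states[t - 1] if t > 0 else s0
        dW += np.outer(da, prev)
        dU += np.outer(da, xs[t])
        ds_next = W.T @ da                              # contribution to s(t-1)
    return loss, (dU, dW, dV, db, dc)
```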

Dealing with the vanishing/exploding gradient in RNNs
• Gradient clipping – for large gradients – a type of adaptive learning rate (see the sketch below)
• Linear self-connection with weight near one for the gradient – leaky unit
• Skip connections
– Make sure units can be influenced by units d steps back; still limited by the amount of skipping, etc.
• Time delays and different time scales
• LSTM – Long Short-Term Memory – current state of the art
– Gated recurrent network
– Keeps a self loop to maintain state and gradient constant as long as needed – the self loop is gated by another learning node – the forget gate
– Learns when to use and forget the state
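
A minimal numpy sketch of one common clipping variant, clipping by global norm (the exact scheme here is illustrative, not prescribed by the slide): if the combined gradient norm exceeds a threshold, rescale all gradients, which acts like a locally adaptive learning rate for large gradients.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined L2 norm
    is at most max_norm (no change if it is already smaller)."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Illustrative use with the BPTT gradients sketched earlier:
# loss, grads = rnn_bptt(xs, ys, U, W, V, b, c, s0)
# dU, dW, dV, db, dc = clip_by_global_norm(list(grads))
```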

Other Recurrent Approaches
• GRUs
• RTRL – Real Time Recurrent Learning
– Does not require specifying a k; will look arbitrarily far back
– But note that with an expectation of looking arbitrarily far back, you create a very difficult learning problem
– Looking back further requires more data, else overfit – lots of irrelevant options which could lead to only minor accuracy improvements
– Have reasonable expectations
– n⁴ and n³ versions – too expensive in practice
• Recursive Network – dynamic tree structures
• Reservoir computing: Echo State Networks and Liquid State Machines
• Hessian-Free Learning
• Tuned initial states and momentum
• Neural Turing Machine – an RNN which can learn to read/write memory
• Relaxation networks – Hopfield, Boltzmann, Multcons, etc.

Supertagging with a RNN
• Using only dense features
– word embeddings
– suffix embeddings
– capitalization
• The input layer is a concatenation of the embeddings of all words in a context window (see the sketch below)
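
A minimal numpy sketch of building such an input layer, assuming hypothetical lookup tables and a window of 3 words around the target position; embedding sizes, feature keys, and the 2-character suffix are illustrative, not the paper's exact configuration:

```python
import numpy as np

def input_layer(words, pos, word_emb, suffix_emb, cap_emb, window=3):
    """Concatenate word, suffix, and capitalization embeddings for every
    word in a context window centered at position `pos`.

    word_emb / suffix_emb: dicts mapping a key to an embedding vector,
    each with an '<unk>' fallback entry; cap_emb: two vectors indexed by 0/1.
    Returns a single 1-D input vector for the RNN."""
    half = window // 2
    feats = []
    for i in range(pos - half, pos + half + 1):
        w = words[i] if 0 <= i < len(words) else '<pad>'
        feats.append(word_emb.get(w.lower(), word_emb['<unk>']))      # word embedding
        feats.append(suffix_emb.get(w.lower()[-2:], suffix_emb['<unk>']))  # suffix embedding
        feats.append(cap_emb[int(w[0].isupper())])                    # capitalization feature
    return np.concatenate(feats)
```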

Supertagging with a RNN
[Figure: the RNN supertagger illustrated on the context "… bought some books and …"]

1-best Supertagging Results: dev
Model            Accuracy   Time
c&c (gold pos)   92.60      –
c&c (auto pos)   91.50      0.57
NN               91.10      21.00
RNN              92.63      –
RNN+dropout      93.07      2.02

Table 1: 1-best tagging accuracy and speed comparison on CCGBank Section 00 with a single CPU core (1,913 sentences); tagging time in seconds.

1-best Supertagging Results: test
Model            Section 23   Wiki    Bio
c&c (gold pos)   93.32        88.80   91.85
c&c (auto pos)   92.02        88.80   89.08
NN               91.57        89.00   88.16
RNN              93.00        90.00   88.27

Table 2: 1-best tagging accuracy comparison on CCGBank Section 23 (2,407 sentences), Wikipedia (200 sentences) and Bio-GENIA (1,000 sentences).

Multi-tagging Results: dev
[Figure: multi-tagging accuracy (y-axis, 0.96–1.0) vs. ambiguity level (x-axis, 0–16) on the dev set, comparing RNN + dropout, RNN, NN, and C&C]

Multi-tagging Results: test
[Figure: multi-tagging accuracy (y-axis, 0.965–1.0) vs. ambiguity level (x-axis, 0–90) on the test sets, comparing RNN + dropout, NN, and C&C]

Final Parsing Results

                 CCGBank Section 23               Wikipedia
Model      LP      LR      LF      cov.     LP      LR      LF      cov.
c&c        86.24   84.85   85.54   99.42    81.58   80.08   80.83   99.50
NN         86.71   85.56   86.13   99.92    82.65   81.36   82.00   100
RNN        87.68   86.47   87.07   99.96    83.22   81.78   82.49   100
c&c        86.24   84.17   85.19   100      81.58   79.48   80.52   100
NN         86.71   85.40   86.05   100      –       –       –       –
RNN        87.68   86.41   87.04   100      –       –       –       –

Table 3: Parsing test results (auto pos). We evaluate on all sentences (100% coverage) as well as on only those sentences that returned spanning analyses (% cov.). RNN and NN both have 100% coverage on the Wikipedia data.