
Adaptive Supertagging [Clark & Curran, 2007]

Start with an initial prob. cutoff β

He       reads             the      book
NP       (S[pss]\NP)/NP    NP/N     N
N        (S\NP)/NP         NP/NP    (S\NP)/NP
N/N      S\NP              N/N
NP/NP    (S[pt]\NP)/NP
         (S[dcl]\NP)/NP

Adaptive Supertagging [Clark & Curran, 2007]

Prune a category if its probability is below β times the prob. of the best
category

He       reads             the      book
NP       (S[pss]\NP)/NP    NP/N     N
N        (S\NP)/NP         NP/NP    (S\NP)/NP
N/N      S\NP              N/N
NP/NP    (S[pt]\NP)/NP
         (S[dcl]\NP)/NP

Adaptive Supertagging [Clark & Curran, 2007]

Decrease β if no spanning analysis

He       reads             the      book
NP       (S[pss]\NP)/NP    NP/N     N
N        (S\NP)/NP         NP/NP    (S\NP)/NP
N/N      S\NP              N/N
NP/NP    (S[pt]\NP)/NP
         (S[dcl]\NP)/NP
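A minimal sketch of this adaptive supertagging loop; the tagger, parser, and β back-off schedule below are hypothetical placeholders, not the C&C implementation itself:

```python
# Sketch of the adaptive supertagging loop (Clark & Curran, 2007), assuming
# hypothetical tag_probs / try_parse callables and an illustrative beta schedule.

BETAS = [0.075, 0.03, 0.01, 0.005, 0.001]   # assumed back-off schedule

def adaptive_supertag(sentence, tag_probs, try_parse):
    for beta in BETAS:
        # Keep, for each word, every category whose probability is at least
        # beta times the probability of that word's best category.
        candidates = []
        for probs in tag_probs(sentence):            # probs: {category: prob}
            best = max(probs.values())
            candidates.append({c for c, p in probs.items() if p >= beta * best})
        chart = try_parse(sentence, candidates)
        if chart is not None:                        # spanning analysis found
            return chart
        # otherwise decrease beta and retry with more categories per word
    return None
```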


Neural Networks (Fig: courtesy R Socher)
Neural networks can be built for different input and output types.

– Outputs can be:
– Linear, single output (Linear)
– Linear, multiple outputs (Linear)
– Single binary output (Logistic)
– Multiple binary outputs (Logistic)
– 1-of-K multinomial output (Softmax)

– Inputs can be:
– A scalar number
– Vector of real numbers
– Vector of binary values

Goal of training: Given the training data (inputs, targets) and the
architecture, determine the model parameters.
Model parameters for a 3-layer network:
– Weight matrix from the input layer to the hidden layer (W_ji)
– Weight matrix from the hidden layer to the output layer (W_kj)
– Bias terms for the hidden layer
– Bias terms for the output layer
Our strategy will be:
– Compute the error at the output
– Determine the contribution of each parameter to the error by taking the derivative of the error w.r.t. that parameter
– Update each parameter commensurate with the error it contributed.
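A minimal NumPy sketch of these parameters and the generic update rule; layer sizes, initialization scale, and learning rate are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 3      # arbitrary layer sizes
eta = 0.1                            # learning rate (arbitrary)

# Model parameters of a 3-layer (one hidden layer) network
W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))    # input -> hidden weights
b1 = np.zeros(n_hidden)                        # hidden-layer biases
W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))   # hidden -> output weights
b2 = np.zeros(n_out)                           # output-layer biases

def sgd_step(param, grad):
    # Update a parameter in proportion to its contribution to the error
    return param - eta * grad
```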

Design Choices

• When building a neural network, the designer would choose the following hyperparameters and nonlinearities based on the application characteristics:
• Number of hidden layers

• Number of hidden units in each layer

• Learning rate

• Regularization coefficient

• Number of outputs

• Type of output (linear, logistic, softmax)

• Choice of nonlinearity at the output layer and hidden layer (see next slide)

• Input representation and dimensionality

Commonly used nonlinearities (fig: courtesy Socher)
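For reference, minimal NumPy definitions of the usual choices (sigmoid, tanh, ReLU, softmax):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes to (-1, 1), zero-centred

def relu(x):
    return np.maximum(0.0, x)         # rectified linear unit

def softmax(x):
    e = np.exp(x - np.max(x))         # shift for numerical stability
    return e / e.sum()
```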

Objective Functions and gradients

• Linear – Mean squared error

• $E(w) = \frac{1}{2N} \sum_{n=1}^{N} (t_n - y_n)^2$

• Logistic with binary classifications: Cross Entropy Error

• Logistic with k outputs: k > 2: Cross Entropy Error

• Softmax: 1 of K multinomial classification: Cross Entropy Error, minimize NLL

• In all the above cases we can show that the gradient w.r.t. the pre-activation of output unit k is $(y_k - t_k)$, where $y_k$ is the predicted output for unit k and $t_k$ is the corresponding target
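A small numerical sketch checking this claim for the softmax/cross-entropy case (the values are arbitrary):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def nll(a, t):
    # negative log-likelihood of the true class under the softmax
    return -np.log(softmax(a)[t.argmax()])

a = np.array([0.5, -1.0, 2.0])   # pre-activations of the output units
t = np.array([0.0, 0.0, 1.0])    # 1-of-K target
y = softmax(a)

analytic = y - t                 # claimed gradient w.r.t. the pre-activations

# Finite-difference check
eps = 1e-6
numeric = np.zeros_like(a)
for k in range(len(a)):
    ap, am = a.copy(), a.copy()
    ap[k] += eps
    am[k] -= eps
    numeric[k] = (nll(ap, t) - nll(am, t)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # expect True
```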

High Level Backpropagation Algorithm

• Apply the input vector to the network and forward propagate. This
will yield the activations for hidden layer(s) and the output layer
• $net_j = \sum_i w_{ji} z_i$

• $z_j = h(net_j)$ where h is your choice of nonlinearity, usually sigmoid or tanh; the Rectified Linear Unit (ReLU) is also used.

• Evaluate the error $\delta_k$ for all the output units: $\delta_k = o_k - t_k$, where $o_k$ is the output produced by the model and $t_k$ is the target provided in the training dataset

• Backpropagate the $\delta$'s to obtain $\delta_j$ for each hidden unit j:
$\delta_j = h'(net_j) \sum_k w_{kj}\, \delta_k$

• Evaluate the required derivatives:
$\frac{\partial E}{\partial W_{ji}} = \delta_j\, z_i$
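A minimal NumPy sketch putting these steps together for a single hidden layer with tanh units and linear outputs; shapes, data, and learning rate are illustrative assumptions:

```python
import numpy as np

def backprop_step(x, t, W1, b1, W2, b2, eta=0.1):
    # Forward propagate: net_j = sum_i w_ji z_i, z_j = h(net_j), h = tanh here
    net_hidden = W1 @ x + b1
    z = np.tanh(net_hidden)            # hidden activations
    o = W2 @ z + b2                    # linear output units

    # Output-layer errors: delta_k = o_k - t_k
    delta_out = o - t

    # Backpropagate: delta_j = h'(net_j) * sum_k w_kj delta_k, tanh' = 1 - tanh^2
    delta_hidden = (1.0 - z ** 2) * (W2.T @ delta_out)

    # Required derivatives: dE/dW_ji = delta_j z_i (outer products)
    dW2, db2 = np.outer(delta_out, z), delta_out
    dW1, db1 = np.outer(delta_hidden, x), delta_hidden

    # Gradient-descent update
    return (W1 - eta * dW1, b1 - eta * db1,
            W2 - eta * dW2, b2 - eta * db2)
```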

Recurrent neural networks (RNN)

Recurrent neural networks

• Use the same computational function and parameters across different
time steps of the sequence

• Each time step takes the current input and the previous hidden state to compute the output

• Loss: typically computed every time step
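A minimal sketch of one time step, assuming the common Elman-style formulation with a tanh state and softmax output; the parameter names U, W, V, b, c follow the textbook convention and are assumptions here:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn_step(x_t, s_prev, U, W, V, b, c):
    # The same function and parameters (U, W, V, b, c) are reused at every step
    s_t = np.tanh(U @ x_t + W @ s_prev + b)   # new state from input + previous state
    o_t = V @ s_t + c                         # output pre-activations
    y_t = softmax(o_t)                        # prediction at this time step
    return s_t, y_t
```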

Recurrent neural networks

Figure from Deep Learning, by Goodfellow, Bengio and Courville (computational graph with input, state, output, loss, and label nodes)

Recurrent neural networks

Figure from Deep Learning,
Goodfellow, Bengio and Courville

Math formula:
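In the textbook's notation, written with $s^{(t)}$ for the state as in the later slides, the recurrence is:

$s^{(t)} = \tanh\!\big(U x^{(t)} + W s^{(t-1)} + b\big)$
$o^{(t)} = V s^{(t)} + c$
$\hat{y}^{(t)} = \mathrm{softmax}\big(o^{(t)}\big)$, with total loss $L = \sum_t L^{(t)}$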


Advantage

• Hidden state: a lossy summary of the past

• Shared functions and parameters: greatly reduce the model capacity, which helps generalization in learning

• Explicitly use the prior knowledge that sequential data can be processed in the same way at different time steps (e.g., NLP)

• Yet still powerful (actually universal): any function computable by a
Turing machine can be computed by such a recurrent network of a
finite size (see, e.g., Siegelmann and Sontag (1995))


Recurrent Network Variations

• This network can theoretically learn contexts arbitrarily far back

• Many structural variations

– Elman/Simple Net

– Jordan Net

– Mixed

– Context sub-blocks, etc.

– Multiple hidden/context layers, etc.

– Generalized row representation

• How do we learn the weights?


Simple Recurrent Training – Elman Training

• Can think of the net as just being a normal MLP structure where part of the input happens to be a copy of the last set of state/hidden node activations. The MLP itself does not even need to be aware that the context inputs are coming from the hidden layer

• Then can train with standard BP training

• While the network can theoretically look back arbitrarily far in time, the Elman learning gradient goes back only 1 step in time, and is thus limited in the context it can learn

– What if the current output depended on the input 2 time steps back?

• Can still be useful for applications with short-term dependencies


BPTT – Backprop Through Time

• BPTT allows us to look back further as we train

• However we have to pre-specify a value k, which is the maximum that learning will look back

• During training we unfold the network in time as if it were a standard feedforward network with k layers

– But where the weights of each unfolded layer are the same (shared)

• We then train the unfolded k-layer feedforward net with standard BP

• Execution still happens with the actual recurrent version

• Is not knowing k a priori that bad? How do you choose it?

– Cross-validation, just like finding the best number of hidden nodes, etc.; thus we can find a good k fairly reasonably for a given task

– But problematic if the amount of state needed varies a lot

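A rough, self-contained sketch of k-step truncated unfolding for a simple tanh RNN with linear outputs and squared error; biases are omitted for brevity and this is not any particular library's API:

```python
import numpy as np

def truncated_bptt_step(xs, ts, U, W, V, k, eta=0.01):
    """One update over the last k time steps of a sequence (truncated BPTT).
    xs, ts: lists of input / target vectors; U, W, V: shared RNN weights."""
    xs, ts = xs[-k:], ts[-k:]                  # unfold only k steps back
    steps = len(xs)
    s = [np.zeros(W.shape[0])]                 # s[0]: fixed initial state
    # Forward pass through the unfolded "k-layer" net (weights shared)
    for x in xs:
        s.append(np.tanh(U @ x + W @ s[-1]))
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ds_next = np.zeros(W.shape[0])
    # Backward pass: standard BP through the unfolded layers
    for t in reversed(range(steps)):
        do = (V @ s[t + 1]) - ts[t]            # dE/do for squared error
        dV += np.outer(do, s[t + 1])
        ds = V.T @ do + W.T @ ds_next          # error reaching state t
        dnet = (1.0 - s[t + 1] ** 2) * ds      # through tanh
        dU += np.outer(dnet, xs[t])
        dW += np.outer(dnet, s[t])
        ds_next = dnet
    # Shared weights: the per-layer gradients are summed before one update
    return U - eta * dU, W - eta * dW, V - eta * dV
```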


• k is the number of feedback/context blocks in the unfolded net

• Note k=1 is just a standard MLP with no feedback

• The 1st block's h(0) activations are just initialized to a constant or 0, so k=1 is still the same as a standard MLP; just leave it out for a feedforward MLP

• The last context block is h(k-1)

• k=2 is Elman training


Training RNN

• Principle: unfold the computational graph, and use backpropagation

• Called back-propagation through time (BPTT) algorithm

• Can then apply any general-purpose gradient-based techniques

• Conceptually: first compute the gradients of the internal nodes, then
compute the gradients of the parameters


Recurrent neural networks

Figure from Deep Learning,
Goodfellow, Bengio and Courville

Gradient at 𝐿(𝑡) (the total loss is the sum of the losses at different time steps):
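In the textbook's derivation, with $L = \sum_t L^{(t)}$:

$\frac{\partial L}{\partial L^{(t)}} = 1$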


Gradient at 𝑜(𝑡):
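Assuming a softmax output with negative log-likelihood loss, as in the textbook:

$(\nabla_{o^{(t)}} L)_i = \hat{y}^{(t)}_i - \mathbf{1}_{i = y^{(t)}}$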


Gradient at 𝑠(𝜏):
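At the final time step $\tau$ the state feeds only the output, so (textbook formulation, with $s$ for the hidden state):

$\nabla_{s^{(\tau)}} L = V^\top \nabla_{o^{(\tau)}} L$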


Gradient at 𝑠(𝑡):
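For $t < \tau$ the state influences both $o^{(t)}$ and $s^{(t+1)}$; with tanh units the textbook's recursion is:

$\nabla_{s^{(t)}} L = W^\top \mathrm{diag}\!\big(1 - (s^{(t+1)})^2\big)\, \nabla_{s^{(t+1)}} L + V^\top \nabla_{o^{(t)}} L$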


Gradient at parameter 𝑉:
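Summing the per-step contributions (textbook formulation):

$\nabla_V L = \sum_t (\nabla_{o^{(t)}} L)\, (s^{(t)})^\top$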

Dealing with the vanishing/exploding gradient in RNNs

• Gradient clipping – for large gradients – a type of adaptive LR (see the sketch after this list)

• Linear self-connection near one for the gradient – Leaky unit

• Skip connections

– Make sure a unit can be influenced by units d steps back; still limited by the amount of skipping, etc.

• Time delays and different time scales

• LSTM – Long Short-Term Memory – current state of the art

– Gated recurrent network

– Keeps a self-loop to maintain state and gradient constant as long as needed – the self-loop is gated by another learned node, the forget gate

– Learns when to use and forget the state
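As referenced in the gradient-clipping bullet above, a minimal sketch of norm-based clipping; the threshold is an arbitrary illustrative value:

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    # Rescale the gradient if its norm exceeds the threshold, keeping its
    # direction but limiting the step size
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```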


Other Recurrent Approaches

• LSTM

• RTRL – Real Time Recurrent Learning

– Does not have to specify a k; will look arbitrarily far back

– But note that with an expectation of looking arbitrarily far back, you create a very difficult learning problem

– Looking back further requires more data, else you overfit – lots of irrelevant options which could lead to only minor accuracy improvements

– Have reasonable expectations

– n^4 and n^3 versions – too expensive in practice

• Recursive Network – dynamic tree structures

• Reservoir computing: Echo State Networks and Liquid State Machines

• Hessian-Free Learning

• Tuned initial states and momentum

• Neural Turing Machine – an RNN which can learn to read/write memory

• Relaxation networks – Hopfield, Boltzmann, Multcons, etc.


Supertagging with a RNN

• Using only dense features

– word embedding
– suffix embedding
– capitalization

• The input layer is a concatenation of all embeddings of all words in
a context window
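A minimal sketch of how the dense input at one position could be assembled; the embedding tables, unknown-word handling, and window size here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def make_input(words, i, word_emb, suffix_emb, window=7):
    """Concatenate word, suffix, and capitalization features for the words in a
    context window centred on position i. word_emb / suffix_emb are hypothetical
    dicts mapping strings to vectors and must contain an "<UNK>" entry."""
    half = window // 2
    feats = []
    for j in range(i - half, i + half + 1):
        w = words[j] if 0 <= j < len(words) else "<PAD>"
        feats.append(word_emb.get(w.lower(), word_emb["<UNK>"]))    # word embedding
        feats.append(suffix_emb.get(w[-2:], suffix_emb["<UNK>"]))   # suffix embedding
        feats.append(np.array([1.0 if w[:1].isupper() else 0.0]))   # capitalization flag
    return np.concatenate(feats)     # this vector is the network input at position i
```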

Supertagging with a RNN

(Figure: the RNN supertagger processing the input "… bought some books and …")


1-best Supertagging Results: dev

Model             Accuracy   Time
c&c (gold pos)    92.60      –
c&c (auto pos)    91.50      0.57
NN                91.10      21.00
RNN               92.63      –
RNN+dropout       93.07      2.02

Table 1: 1-best tagging accuracy and speed comparison on CCGBank Section 00 with a single CPU core (1,913 sentences); tagging time in seconds.

1-best Supertagging Results: test

Model             Section 23   Wiki    Bio
c&c (gold pos)    93.32        88.80   91.85
c&c (auto pos)    92.02        88.80   89.08
NN                91.57        89.00   88.16
RNN               93.00        90.00   88.27

Table 2: 1-best tagging accuracy comparison on CCGBank Section 23 (2,407 sentences), Wikipedia (200 sentences) and Bio-GENIA (1,000 sentences).

Multi-tagging Results: dev

(Figure: multi-tagging accuracy, 0.96–1.00, plotted against ambiguity level, 0–16, on the dev set for RNN + dropout, RNN, NN, and C&C)

Multi-tagging Results: test

(Figure: multi-tagging accuracy, 0.965–1.00, plotted against ambiguity level, 0–90, on the test set for RNN + dropout, NN, and C&C)

Final Parsing Results

                CCGBank Section 23              Wikipedia
Model           LP     LR     LF     cov.       LP     LR     LF     cov.
c&c             86.24  84.85  85.54  99.42      81.58  80.08  80.83  99.50
(NN)            86.71  85.56  86.13  99.92      82.65  81.36  82.00  100
(RNN)           87.68  86.47  87.07  99.96      83.22  81.78  82.49  100
c&c             86.24  84.17  85.19  100        81.58  79.48  80.52  100
(NN)            86.71  85.40  86.05  100        –      –      –      –
(RNN)           87.68  86.41  87.04  100        –      –      –      –

Table 3: Parsing test results (auto pos). We evaluate on all sentences (100% coverage) as well as on only those sentences that returned spanning analyses (% cov.). RNN and NN both have 100% coverage on the Wikipedia data.