
Recurrent Neural Networks (RNNs) for sequence labeling

Long distance dependency in sequence labeling
Window-based feature extraction for linear sequence models can reach very high accuracy, but it is insufficient in some cases:
- POS tagging: The man who whistles tunes/VBZ pianos (the long-distance subject "The man" is what makes "tunes" a verb here rather than a noun)
- Named Entity Recognition: Norma Jean's song entitled "Pretty soon I don't know what but something/I-title is going to happen"
- Language modeling: The man who whistles tunes pianos, ...

Simple Recurrent Neural Networks
- A simple recurrent operation is the Elman unit. It is defined as:

  RNN(x_m, h_{m-1}) = g(Θ h_{m-1} + x_m)

  where Θ ∈ ℝ^{K×K} is a recurrence matrix and g is a nonlinear transformation function, often defined as the element-wise hyperbolic tangent tanh.
- Although each w_m depends only on the context vector h_{m-1}, this vector is affected by all previous tokens w_0, w_1, ..., w_{m-1}. This is crucially different from n-gram language models.
- While in principle this simple RNN can handle long-distance dependencies, in practice it is quite inadequate, due to the repeated application of the non-linearity (see the sketch below).
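A minimal sketch of the Elman recurrence in PyTorch; the size K and the toy inputs are illustrative assumptions, not from the slides:

```python
import torch

K = 4                                # assumed hidden/embedding size
Theta = 0.1 * torch.randn(K, K)      # recurrence matrix Theta in R^{K x K}
h = torch.zeros(K)                   # initial state h_0

def elman_step(x_m, h_prev):
    # RNN(x_m, h_{m-1}) = g(Theta h_{m-1} + x_m), with g = tanh
    return torch.tanh(Theta @ h_prev + x_m)

# run over a toy sequence of word vectors x_1 ... x_M
xs = [torch.randn(K) for _ in range(6)]
for x in xs:
    h = elman_step(x, h)             # h_m is affected by all previous tokens
print(h)
```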

Layers in an RNN
- How many layers do we have in an RNN?
- The number of layers in an RNN is not fixed: it varies with the length of the input.
- Derivatives can be computed automatically with packages such as Torch, MXNet, and TensorFlow.
- How many weight matrices do we need for an RNN?

Parameters of a simple RNN
- x_i ∈ ℝ^K, the "input" word vectors (word embeddings);
- β_i ∈ ℝ^K, the "output" word vectors;
- Θ ∈ ℝ^{K×K}, the recurrence operator;
- h_0, the initial state.
Each of these parameters can be tuned by formulating an objective L(w) over the training corpus.

Backpropagation through time (BPTT)
- Let ℓ_{m+1} represent the negative log-likelihood of word m+1:

  ℓ_{m+1} = −log P(w_{m+1} | w_1, w_2, ..., w_m)

- Since the loss depends on the parameters only through h_m (not through x_m), we can apply the chain rule of differentiation:

  ∂ℓ_{m+1}/∂θ_{k,k'} = (∂ℓ_{m+1}/∂h_m) · (∂h_m/∂θ_{k,k'})

- What, then, is the derivative of h_m with respect to an element of Θ?

Recurrence in gradient computation
- Recall the simple RNN with the Elman unit: h_m = g(Θ h_{m-1} + x_m)
- For any element h_{m,k}:

  h_{m,k} = g(θ_k · h_{m-1} + x_{m,k})

- Applying the chain rule and the product rule, we get:

  ∂h_{m,k}/∂θ_{k,k'} = g'(x_{m,k} + θ_k · h_{m-1}) (h_{m-1,k'} + θ_k · ∂h_{m-1}/∂θ_{k,k'})

- This means the derivative ∂h_m/∂θ_{k,k'} depends on ∂h_{m-1}/∂θ_{k,k'}, which in turn depends on ∂h_{m-2}/∂θ_{k,k'}, etc., until h_0 is reached.
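A small autograd illustration of this recursion: unrolling the Elman recurrence and calling backward() propagates the gradient through every earlier hidden state back to h_0. Sizes, inputs, and the loss are toy assumptions:

```python
import torch

K = 3
Theta = (0.1 * torch.randn(K, K)).requires_grad_()   # recurrence matrix
h = torch.zeros(K)                                   # h_0
xs = [torch.randn(K) for _ in range(5)]              # toy inputs x_1..x_5

# unroll the Elman recurrence; each h_m depends on Theta and all earlier states
for x in xs:
    h = torch.tanh(Theta @ h + x)

loss = h.sum()          # stand-in for the negative log-likelihood l_{m+1}
loss.backward()         # backpropagation through time, back to h_0
print(Theta.grad)       # d loss / d theta_{k,k'} accumulates terms from every step
```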

Variants of RNNs
- The Elman RNN demonstrates an RNN's capacity to model longer context, and the need to use BPTT to update the network:

  RNN(x_m, h_{m-1}) = g(Θ h_{m-1} + x_m)

- A more common RNN also parameterizes the input with another parameter matrix:

  RNN(x_m, h_{m-1}) = g(Θ_h h_{m-1} + Θ_x x_m)

  Updating Θ_x does not require backpropagation through time (see the sketch below).
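This two-matrix parameterization matches what torch.nn.RNNCell computes, h' = tanh(W_ih x + b_ih + W_hh h + b_hh); a minimal sketch with assumed sizes and a toy loop:

```python
import torch
import torch.nn as nn

K_in, K_hid = 5, 4                      # assumed input / hidden sizes
cell = nn.RNNCell(K_in, K_hid)          # holds Theta_x (weight_ih) and Theta_h (weight_hh)

h = torch.zeros(1, K_hid)               # h_0 (batch of size 1)
for m in range(6):                      # toy sequence of length 6
    x_m = torch.randn(1, K_in)
    h = cell(x_m, h)                    # h_m = tanh(Theta_h h_{m-1} + Theta_x x_m + biases)
print(h.shape)                          # torch.Size([1, 4])
```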

Hyperparameters
- The optimal size K of the word and context vectors loosely depends on the size of the training data.
- If the training data is large, a larger K is preferred.
- Conversely, if the training data is small, a smaller K should be chosen; otherwise, the model may memorize the training data and fail to generalize.

Vanishing or exploding gradients and gated networks
- RNNs require repeated applications of non-linear functions. Backpropagation can therefore lead to vanishing gradients (gradients decay to zero) or exploding gradients (gradients grow towards infinity).
- Exploding gradients can be addressed by clipping gradients (setting a maximum value); see the sketch below.
- Vanishing gradients must be addressed by changing the model itself. Gated networks such as the LSTM (Long Short-Term Memory) and the GRU are popular solutions to this problem.
- You can find a demonstration here: https://cs224d.stanford.edu/notebooks/vanishing_grad_example.html
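A sketch of gradient clipping in a training step using torch.nn.utils.clip_grad_norm_; the model, loss, and max_norm value are illustrative placeholders:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=8)          # placeholder recurrent model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 1, 8)                            # (seq_len, batch, input_size)
out, _ = model(x)
loss = out.pow(2).mean()                             # dummy loss for illustration

optimizer.zero_grad()
loss.backward()
# cap the gradient norm at a maximum value to guard against exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```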

Gated Recurrent Units (GRUs)
- The GRU uses a reset gate and an update gate to control how much information from the previous hidden state is passed on to the next hidden state (a code sketch follows the equations).
  reset gate:        r_{m+1} = σ(Θ^{h→r} h_m + Θ^{x→r} x_{m+1} + b_r)
  update gate:       u_{m+1} = σ(Θ^{h→u} h_m + Θ^{x→u} x_{m+1} + b_u)
  update candidate:  h̃_{m+1} = tanh(Θ^{h→h}(h_m ⊙ r_{m+1}) + Θ^{x→h} x_{m+1})
  output:            h_{m+1} = u_{m+1} ⊙ h_m + (1 − u_{m+1}) ⊙ h̃_{m+1}
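These four equations transcribe almost line-for-line into code; a minimal sketch with an assumed size K and random parameters (element-wise ⊙ is *, σ is torch.sigmoid):

```python
import torch

K = 4                                           # assumed hidden/input size
# one weight matrix per arrow in the equations, plus the gate biases
Th_r, Tx_r, b_r = torch.randn(K, K), torch.randn(K, K), torch.zeros(K)
Th_u, Tx_u, b_u = torch.randn(K, K), torch.randn(K, K), torch.zeros(K)
Th_h, Tx_h = torch.randn(K, K), torch.randn(K, K)

def gru_step(h_m, x_next):
    r = torch.sigmoid(Th_r @ h_m + Tx_r @ x_next + b_r)        # reset gate
    u = torch.sigmoid(Th_u @ h_m + Tx_u @ x_next + b_u)        # update gate
    h_tilde = torch.tanh(Th_h @ (h_m * r) + Tx_h @ x_next)     # update candidate
    return u * h_m + (1 - u) * h_tilde                         # output h_{m+1}

h = torch.zeros(K)
for x in [torch.randn(K) for _ in range(5)]:
    h = gru_step(h, x)
```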

What is a gate?
- "Hard" gate vs. "soft" gate: a hard gate is a binary vector (each entry is exactly 0 or 1, e.g. [1, 0, 1, 0, 0]ᵀ), whereas a soft gate σ(x) has entries strictly between 0 and 1 (e.g. 0.99, 0.998, values near 0).
- Hard gates are easier to understand conceptually, but soft gates are differentiable (and therefore learnable).

GRU gates

[Figure: GRU cell diagram, built up over three slides. The previous hidden state h_m and the input x_{m+1} feed the reset gate r_{m+1} and the update gate u_{m+1} through σ; the reset gate modulates h_m inside the tanh candidate, and the update gate interpolates (⊙, 1−, +) between h_m and the candidate to produce the new hidden state h_{m+1}.]

GRU: Pytorch implementation
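A minimal usage sketch with torch.nn.GRU; the sizes and the random input are illustrative assumptions rather than the lecture's own listing:

```python
import torch
import torch.nn as nn

K_in, K_hid = 8, 16                         # assumed embedding / hidden sizes
gru = nn.GRU(input_size=K_in, hidden_size=K_hid)

x = torch.randn(10, 1, K_in)                # (seq_len, batch, input_size)
h0 = torch.zeros(1, 1, K_hid)               # (num_layers, batch, hidden_size)
output, h_n = gru(x, h0)                    # output holds the hidden state at every step
print(output.shape, h_n.shape)              # torch.Size([10, 1, 16]) torch.Size([1, 1, 16])
```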

Long Short-Term Memory (LSTM)

- The LSTM enhances the hidden state h_m with a memory cell c_m.
- This memory cell does not pass through a squashing function (tanh or σ), and can therefore propagate information through the network over long distances (a code sketch follows the equations).
  forget gate:         f_{m+1} = σ(Θ^{h→f} h_m + Θ^{x→f} x_{m+1} + b_f)
  input gate:          i_{m+1} = σ(Θ^{h→i} h_m + Θ^{x→i} x_{m+1} + b_i)
  update candidate:    c̃_{m+1} = tanh(Θ^{h→c} h_m + Θ^{x→c} x_{m+1})
  memory cell update:  c_{m+1} = f_{m+1} ⊙ c_m + i_{m+1} ⊙ c̃_{m+1}
  output gate:         o_{m+1} = σ(Θ^{h→o} h_m + Θ^{x→o} x_{m+1} + b_o)
  output:              h_{m+1} = o_{m+1} ⊙ tanh(c_{m+1})
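As with the GRU, the equations transcribe directly; a minimal sketch with an assumed size K and random parameters (not the lecture's own implementation):

```python
import torch

K = 4                                           # assumed hidden/input size
# one weight matrix per arrow in the equations, plus the gate biases
Th_f, Tx_f, b_f = torch.randn(K, K), torch.randn(K, K), torch.zeros(K)
Th_i, Tx_i, b_i = torch.randn(K, K), torch.randn(K, K), torch.zeros(K)
Th_c, Tx_c = torch.randn(K, K), torch.randn(K, K)
Th_o, Tx_o, b_o = torch.randn(K, K), torch.randn(K, K), torch.zeros(K)

def lstm_step(h_m, c_m, x_next):
    f = torch.sigmoid(Th_f @ h_m + Tx_f @ x_next + b_f)    # forget gate
    i = torch.sigmoid(Th_i @ h_m + Tx_i @ x_next + b_i)    # input gate
    c_tilde = torch.tanh(Th_c @ h_m + Tx_c @ x_next)       # update candidate
    o = torch.sigmoid(Th_o @ h_m + Tx_o @ x_next + b_o)    # output gate
    c_next = f * c_m + i * c_tilde                         # memory cell update
    h_next = o * torch.tanh(c_next)                        # output
    return h_next, c_next

h, c = torch.zeros(K), torch.zeros(K)
for x in [torch.randn(K) for _ in range(5)]:
    h, c = lstm_step(h, c, x)
```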

LSTM gates

[Figure: LSTM cell diagram, built up over three slides. The previous hidden state h_m and the input x_{m+1} feed the forget gate f_{m+1}, the input gate i_{m+1}, and the output gate o_{m+1} through σ, and the candidate memory c̃_{m+1} through tanh; the new memory cell is c_{m+1} = f_{m+1} ⊙ c_m + i_{m+1} ⊙ c̃_{m+1}, and the new hidden state is h_{m+1} = o_{m+1} ⊙ tanh(c_{m+1}).]

LSTM: Pytorch implementation
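A minimal usage sketch with torch.nn.LSTM; the sizes and the random input are illustrative assumptions rather than the lecture's own listing:

```python
import torch
import torch.nn as nn

K_in, K_hid = 8, 16                          # assumed embedding / hidden sizes
lstm = nn.LSTM(input_size=K_in, hidden_size=K_hid)

x = torch.randn(10, 1, K_in)                 # (seq_len, batch, input_size)
h0 = torch.zeros(1, 1, K_hid)                # initial hidden state
c0 = torch.zeros(1, 1, K_hid)                # initial memory cell
output, (h_n, c_n) = lstm(x, (h0, c0))
print(output.shape)                          # torch.Size([10, 1, 16])
```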

Success stories of Recurrent Neural Networks in NLP

The RNN is very versatile and can be used in a variety of different NLP problems:
- Sequence labeling problems: POS tagging, Named Entity Recognition, etc.
- Sequence-to-sequence problems: Neural Machine Translation, Machine Reading, dialogue systems, language generation, etc.
- It also serves as an underlying feature-extraction mechanism for many classification problems (e.g., the use of Bi-LSTM RNNs for extracting spans).

RNNs for sequence labeling
- Recall that an RNN outputs a hidden vector h_m at each time step:

  h_m = g(x_m, h_{m-1}),  m = 1, 2, ..., M

- Score each label y_m as a linear function of h_m:

  ψ_m(y) = β_y · h_m,    ŷ = argmax_y ψ_m(y)

- We can also turn this score into a probability distribution using SoftMax:

  P(y | w_{1:m}) = exp ψ_m(y) / Σ_{y'∈Y} exp ψ_m(y')

- This is a classifier that uses only the context from the left (see the sketch below).
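A minimal sketch of this left-to-right tagger; the hidden size, tagset size, and toy input are illustrative assumptions, and β is realized as a linear layer:

```python
import torch
import torch.nn as nn

K, num_tags = 16, 5                           # assumed hidden size and tagset size
rnn = nn.RNN(input_size=K, hidden_size=K)     # produces h_m at each step
beta = nn.Linear(K, num_tags, bias=False)     # psi_m(y) = beta_y . h_m

x = torch.randn(10, 1, K)                     # toy sequence of 10 word vectors
h_all, _ = rnn(x)                             # (seq_len, batch, K)
scores = beta(h_all)                          # (seq_len, batch, num_tags)
probs = torch.softmax(scores, dim=-1)         # P(y | w_{1:m}) via SoftMax
y_hat = scores.argmax(dim=-1)                 # y_hat = argmax_y psi_m(y)
```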

Bi-directional LSTM (BiLSTM)
- Forward LSTM and backward LSTM:

  h⃗_m = g(x_m, h⃗_{m-1}),  m = 1, 2, ..., M
  h⃖_m = g(x_m, h⃖_{m+1}),  m = 1, 2, ..., M
  h_m = [h⃗_m ; h⃖_m]
- A BiLSTM summarizes the surrounding context from both directions, which typically works better than window-based context built from n-grams on both sides of the target word (see the sketch below).
- But it does not take the transitions between tags into consideration.
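A minimal BiLSTM tagger sketch using torch.nn.LSTM with bidirectional=True; sizes and the toy input are illustrative assumptions:

```python
import torch
import torch.nn as nn

K, num_tags = 16, 5                                   # assumed sizes
bilstm = nn.LSTM(input_size=K, hidden_size=K, bidirectional=True)
beta = nn.Linear(2 * K, num_tags, bias=False)         # h_m = [h_fwd ; h_bwd] has size 2K

x = torch.randn(10, 1, K)                             # toy sequence of 10 word vectors
h_all, _ = bilstm(x)                                  # (seq_len, batch, 2K)
scores = beta(h_all)                                  # per-token, per-tag scores
y_hat = scores.argmax(dim=-1)                         # still ignores tag-tag transitions
```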

Neural Structure Prediction: LSTM-CRF
Neural sequence labeling can be combined with the Viterbi algorithm by defining the local score as:
  s_m(y_m, y_{m-1}) = β_{y_m} · h_m + η_{y_{m-1}, y_m}

[Figure: LSTM-CRF applied to the example sentence "pick the can up".]
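A minimal Viterbi decoding sketch over local scores of this form; the emission scores (standing in for β_{y_m} · h_m) and the transition scores η are random placeholders:

```python
import torch

M, num_tags = 4, 3                              # assumed sentence length and tagset size
emission = torch.randn(M, num_tags)             # beta_y . h_m for each position m
eta = torch.randn(num_tags, num_tags)           # transition scores eta_{y_{m-1}, y_m}

# viterbi[m, y] = best score of any tag sequence ending in tag y at position m
viterbi = torch.full((M, num_tags), float("-inf"))
backptr = torch.zeros(M, num_tags, dtype=torch.long)
viterbi[0] = emission[0]

for m in range(1, M):
    # s_m(y_m, y_{m-1}) = emission[m, y_m] + eta[y_{m-1}, y_m]
    total = viterbi[m - 1].unsqueeze(1) + eta + emission[m].unsqueeze(0)
    viterbi[m], backptr[m] = total.max(dim=0)   # best previous tag for each current tag

# follow back-pointers from the best final tag
best_path = [int(viterbi[-1].argmax())]
for m in range(M - 1, 0, -1):
    best_path.append(int(backptr[m, best_path[-1]]))
best_path.reverse()
print(best_path)                                # highest-scoring tag sequence
```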