
Recurrent Neural Networks
CMPUT 366: Intelligent Systems
 

P&M §10.0-10.2, 10.10

Lecture Outline
1. Recap
2. Unfolding Computations
3. Recurrent Neural Networks
4. Long Short-Term Memory

Recap: Convolutional Neural Networks
• Convolutional networks: a specialized architecture for images
• Number of parameters controlled by using convolutions and pooling operations instead of dense connections
• Fewer parameters means more efficient to train

2D Convolution
[Figure 9.1: An example of 2-D convolution without kernel-flipping, with the output restricted to positions where the kernel lies entirely within the image ("valid" convolution). Each output element is formed by applying the kernel to the corresponding region of the input: with input rows (a b c d / e f g h / i j k l) and kernel (w x / y z), the output entries are aw + bx + ey + fz, bw + cx + fy + gz, cw + dx + gy + hz, ew + fx + iy + jz, fw + gx + jy + kz, and gw + hx + ky + lz. (Image: Goodfellow 2016)]

Edge Detection by Convolution
[Figure 9.6: Efficiency of edge detection. The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighbouring pixel on the left. This shows the strength of all the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall; the input image is 320 pixels wide while the output image is 319 pixels wide. The transformation can be described by a convolution kernel containing two elements, and requires 319 × 280 × 3 = 267,960 floating-point operations (two multiplications and one addition per output pixel) to compute using convolution. Describing the same transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over eight billion, entries in the matrix, making convolution four billion times more efficient for representing this transformation. The straightforward matrix multiplication algorithm performs over sixteen billion floating-point operations, making convolution roughly 60,000 times more efficient computationally. Of course, most of the entries of the matrix would be zero; if we stored only the nonzero entries, both matrix multiplication and convolution would require the same number of floating-point operations. (Image: Goodfellow 2016)]
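As a quick refresher, here is a minimal NumPy sketch of the "valid" 2-D convolution (without kernel flipping) described above; the function name and the toy arrays are my own, chosen only for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution without kernel flipping, as in Figure 9.1:
    the kernel is only applied where it lies entirely inside the image."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Two-element kernel from the edge-detection example: each output pixel is the
# difference between two horizontally adjacent input pixels, so the output is
# one pixel narrower than the input.
image = np.arange(12.0).reshape(3, 4)     # tiny stand-in for a 320x280 image
edge_kernel = np.array([[-1.0, 1.0]])
print(conv2d_valid(image, edge_kernel))
```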





Sequence Modelling
• For many tasks, especially involving language, we want to model the behaviour of sequences
• Example: Translation
  The cat is on the carpet ⟹ Le chat est sur le tapis
• Example: Sentiment analysis
  This pie is great ⟹ POSITIVE
  This pie is okay, not great ⟹ NEUTRAL
  This pie is not okay ⟹ NEGATIVE

Sequential Inputs
Question: How should we represent sequential input to a neural network?
1. A 1-hot vector for each word
   (sequence must be a particular length)
2. 1-hot vectors for the last few words
   (n-gram)
3. A single vector indicating each word that is present
   (bag of words)
4. A single vector summing the semantic embeddings of all the words
[Illustration: vector representations of the example sentence "The cat is on the carpet"]
Some of these representations are sketched in code below.
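A minimal sketch (mine, not from the slides) of how the example sentence could be encoded under representations 1, 3, and 4; the toy vocabulary and embedding values are made up.

```python
import numpy as np

vocab = ["the", "cat", "is", "on", "carpet"]           # toy vocabulary (assumed)
index = {w: i for i, w in enumerate(vocab)}
sentence = "the cat is on the carpet".split()

# 1. One 1-hot vector per word: a (sequence length) x (vocab size) matrix,
#    so the network's input size depends on the sequence length.
one_hot = np.zeros((len(sentence), len(vocab)))
one_hot[np.arange(len(sentence)), [index[w] for w in sentence]] = 1.0

# 3. Bag of words: a single fixed-length vector counting each word.
bag = one_hot.sum(axis=0)                              # "the" appears twice

# 4. Sum of (made-up) semantic embeddings: also a single fixed-length vector.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))          # 4-dim toy embeddings
summed = sum(embeddings[index[w]] for w in sentence)

print(one_hot.shape, bag, summed.shape)
```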

Dynamical Systems
• A dynamical system is a system whose state at time t + 1 depends on its state at time t:
  s(t) = f(s(t−1); θ)
  where s is called the state of the system.
• An expression that depends on the same expression at an earlier time is recurrent: the definition of s at time t refers back to the same definition at time t − 1.
Unfolding Computations
• A recurrent expression can be converted to a non-recurrent expression by unfolding, i.e. repeatedly applying the definition. For a finite number of time steps τ, applying the definition τ − 1 times yields an expression that does not involve recurrence, which can then be represented by a traditional directed acyclic computational graph.
• For example, unfolding for τ = 3 time steps:
  s(3) = f(s(2); θ) = f(f(s(1); θ); θ)

Classical Dynamical Systems
[Figure 10.1: The classical dynamical system s(t) = f(s(t−1); θ) illustrated as an unfolded computational graph: s(…) → s(t−1) → s(t) → s(t+1) → s(…), with each arrow an application of f. Each node represents the state at some time t, and f maps the state at t to the state at t + 1. The same parameters (the same value of θ used to parametrize f) are used for all time steps. (Image: Goodfellow 2016)]
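A tiny sketch of unfolding (mine; the transition function f and the parameter value are arbitrary), showing that iterating the recurrence is the same computation as the unfolded expression:

```python
import numpy as np

theta = 0.9                                # made-up parameter
f = lambda s, theta: np.tanh(theta * s)    # made-up transition function

# Recurrent form: apply the definition step by step, starting from s(1).
s = 1.0
for _ in range(2):                         # two applications give s(3)
    s = f(s, theta)

# Unfolded form for tau = 3: s(3) = f(f(s(1); theta); theta), no recurrence.
s_unfolded = f(f(1.0, theta), theta)

assert np.isclose(s, s_unfolded)
```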

External Signals
• Dynamical systems can also be driven by an external signal x(t):
  s(t) = f(s(t−1), x(t); θ)
• Just as almost any function can be considered a feedforward neural network, essentially any function involving recurrence can be considered a recurrent neural network.
• Many recurrent neural networks use an equation of this form (or a similar one) to define the values of their hidden units. To indicate that the state is the hidden units of the network, we rewrite it using the variable h for the state:
  h(t) = f(h(t−1), x(t); θ)
• When the recurrent network is trained to perform a task that requires predicting the future from the past, it typically learns to use h(t) as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to t. The summary is in general necessarily lossy, since it maps an arbitrary-length sequence (x(t), x(t−1), …, x(2), x(1)) to a fixed-length vector h(t). Depending on the training criterion, the summary might keep some aspects of the past with more precision than others: an RNN used for statistical language modelling only needs enough information to predict the rest of the sentence, while the most demanding case is when h(t) must be rich enough to approximately recover the input sequence, as in autoencoder frameworks.

Unfolding Computation Graphs
• These systems can also be represented by non-recurrent, unfolded computations:
[Figure 10.2: A recurrent network with no outputs. This network just processes information from the input x by incorporating it into the state h that is passed forward through time. Left: circuit diagram; the black square indicates a delay of a single time step. Right: the same network seen as an unfolded computational graph. Typical RNNs will add extra architectural features such as output layers that read information out of the state h to make predictions. (Image: Goodfellow 2016)]
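A minimal sketch (mine; the sizes, parameter values, and the tanh form of f are assumptions) of a state update driven by an input sequence, showing how an arbitrary-length sequence is compressed into a fixed-length state vector h:

```python
import numpy as np

rng = np.random.default_rng(0)
state_size, input_size = 3, 5                  # made-up sizes
W = rng.normal(size=(state_size, state_size))  # state-to-state weights
U = rng.normal(size=(state_size, input_size))  # input-to-state weights
b = np.zeros(state_size)

def step(h_prev, x_t):
    """One application of h(t) = f(h(t-1), x(t); theta), with f chosen as tanh."""
    return np.tanh(W @ h_prev + U @ x_t + b)

h = np.zeros(state_size)                       # initial state
sequence = rng.normal(size=(7, input_size))    # any sequence length works
for x_t in sequence:
    h = step(h, x_t)

print(h.shape)   # (3,): a fixed-length (lossy) summary of the whole sequence
```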


Recurrent Neural Networks
• Recurrent neural network: a specialized architecture for modelling sequential data
• Input presented one element at a time
• Parameter sharing by:
  • Treating the sequence as a system with state
  • Introducing hidden layers that represent state
  • Computing state transitions and output using the same functions at each stage

Recurrent Hidden Units: Sequence to Sequence
• Armed with the graph-unrolling and parameter-sharing ideas above, we can design a wide variety of recurrent neural networks.
• Input values x connected to hidden state h by weights U
• Hidden state h mapped to output o by weights V
• Hidden state h(t−1) connected to hidden state h(t) by weights W
• All hidden states computed must be stored for computing gradients
• Gradients computed by back-propagation through time: from the final loss all the way back to the initial input
[Figure 10.3: The computational graph to compute the training loss of a recurrent network that maps an input sequence of x values to a corresponding sequence of output o values. A loss L measures how far each o is from the corresponding training target y. Left: circuit diagram. Right: the unfolded computational graph, with the same weights U (input to hidden), W (hidden to hidden), and V (hidden to output) used at every time step; it shows the path along which information flows forward in time (computing outputs and losses) and backward in time (computing gradients). (Image: Goodfellow 2016)]
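A compact sketch (mine; the sizes, initialization, and the softmax/cross-entropy choice are assumptions, not from the slides) of the forward pass this graph describes, storing every hidden state as back-propagation through time requires:

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size, output_size = 5, 4, 3         # made-up sizes
U = rng.normal(size=(hidden_size, input_size)) * 0.1   # input -> hidden
W = rng.normal(size=(hidden_size, hidden_size)) * 0.1  # hidden -> hidden
V = rng.normal(size=(output_size, hidden_size)) * 0.1  # hidden -> output
b, c = np.zeros(hidden_size), np.zeros(output_size)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = rng.normal(size=(6, input_size))   # input sequence x(1..6)
ys = rng.integers(0, output_size, 6)    # target class at each step

h = np.zeros(hidden_size)
hs, loss = [h], 0.0                     # keep every state for BPTT
for x_t, y_t in zip(xs, ys):
    h = np.tanh(U @ x_t + W @ h + b)    # same U, W, b at every step
    o = V @ h + c                       # same V, c at every step
    p = softmax(o)
    loss += -np.log(p[y_t])             # cross-entropy against target y(t)
    hs.append(h)

print(loss, len(hs))                    # total loss; 7 stored states h(0..6)
```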

Recurrent Hidden Units: Sequence Input, Single Output
• Update state as inputs are provided
• Only compute a single output at the end
• W, U still shared at every stage
• Back-propagation through time still requires evaluating every state in the gradient computation
[Figure 10.5: A time-unfolded recurrent neural network with a single output o(τ) at the end of the sequence. Such a network can be used to summarize a sequence and produce a fixed-size representation used as input for further processing. There might be a target right at the end, or the gradient on the output o(τ) can be obtained by back-propagating from further downstream modules. (Image: Goodfellow 2016)]
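A short self-contained sketch (mine; the sizes and the softmax read-out are assumptions) of this single-output case, e.g. for the sentiment-analysis example earlier:

```python
import numpy as np

rng = np.random.default_rng(2)
input_size, hidden_size, n_classes = 5, 4, 3           # made-up sizes
U = rng.normal(size=(hidden_size, input_size)) * 0.1   # input -> hidden
W = rng.normal(size=(hidden_size, hidden_size)) * 0.1  # hidden -> hidden (shared)
V = rng.normal(size=(n_classes, hidden_size)) * 0.1    # hidden -> single output

xs = rng.normal(size=(8, input_size))   # e.g. embedded words of a review
h = np.zeros(hidden_size)
for x_t in xs:                          # update the state as inputs arrive
    h = np.tanh(U @ x_t + W @ h)        # same U, W at every stage

scores = V @ h                          # one output, computed only at the end
probs = np.exp(scores - scores.max())
probs /= probs.sum()                    # e.g. POSITIVE / NEUTRAL / NEGATIVE
print(probs)
```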

Encoder/Decoder Architecture for Sequence to Sequence
• We have seen (figure 10.5) how an RNN can map an input sequence to a fixed-size vector, how an RNN can map a fixed-size vector to a sequence (figure 10.9), and (figure 10.3) how an RNN can map an input sequence to an output sequence of the same length.
• Can combine these approaches for sequence-to-sequence mapping, where input and output lengths may differ:
  1. Encoder: accept the entire input sequence x(1), …, x(nx) to construct a single "context" output C
  2. Decoder: construct a new sequence y(1), …, y(ny) using the context C as its only input
[Figure 10.12: Example of an encoder-decoder or sequence-to-sequence RNN architecture: an encoder reads x(1), …, x(nx) and emits the context C, which a decoder uses to generate y(1), …, y(ny). (Image: Goodfellow 2016)]
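A minimal sketch (mine; the shapes, tanh cells, and greedy argmax decoding are assumptions) of the two-stage idea: encode the whole input into a context C, then let the decoder generate a new sequence from C alone.

```python
import numpy as np

rng = np.random.default_rng(3)
in_size, hid, out_size = 5, 4, 6                 # made-up sizes
# Encoder parameters
U_enc = rng.normal(size=(hid, in_size)) * 0.1
W_enc = rng.normal(size=(hid, hid)) * 0.1
# Decoder parameters (its own recurrence, fed by C and its previous output)
W_dec = rng.normal(size=(hid, hid)) * 0.1
U_ctx = rng.normal(size=(hid, hid)) * 0.1
U_out = rng.normal(size=(hid, out_size)) * 0.1
V_dec = rng.normal(size=(out_size, hid)) * 0.1

def encode(xs):
    """Run the encoder over the whole input; the final state is the context C."""
    h = np.zeros(hid)
    for x_t in xs:
        h = np.tanh(U_enc @ x_t + W_enc @ h)
    return h

def decode(C, n_y):
    """Generate n_y output symbols using only C (and the decoder's own outputs)."""
    h, y_prev, ys = np.zeros(hid), np.zeros(out_size), []
    for _ in range(n_y):
        h = np.tanh(W_dec @ h + U_ctx @ C + U_out @ y_prev)
        y_prev = np.eye(out_size)[np.argmax(V_dec @ h)]  # greedy 1-hot output
        ys.append(y_prev)
    return ys

xs = rng.normal(size=(7, in_size))   # e.g. "The cat is on the carpet"
C = encode(xs)
print(len(decode(C, n_y=6)))         # e.g. "Le chat est sur le tapis"
```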

Recurrence through (only) the Outputs
• Can have the recurrence go from the output (at t − 1) to the hidden layer (at t), instead of hidden to hidden
• Less general (why?)
• Question: Why would we want to do this?
[Figure 10.4: An RNN whose only recurrence is the feedback connection from the output to the hidden layer. At each time step t, the input is x(t), the hidden layer activations are h(t), the outputs are o(t), the targets are y(t) and the loss is L(t). Left: circuit diagram. Right: unfolded computational graph. Such an RNN is less powerful (can express a smaller set of functions) than those in the family represented by figure 10.3: the RNN in figure 10.3 can choose to put any information it wants about the past into its hidden state and pass it forward, while here only the information captured in the output o is passed forward. (Image: Goodfellow 2016)]

Teacher Forcing
• Dependence on the previous step is only through the output, not the hidden state
• Loss gradient depends only on a single transition: for any loss that compares the prediction at time t to the target at time t, the time steps are decoupled
• Training can be parallelized, with the gradient for each step t computed in isolation (no need to compute previous states, because the training set provides the ideal value of the previous output)
[Figure 10.6: Illustration of teacher forcing, a training technique applicable to RNNs that have connections from their output to their hidden states at the next time step. At train time, the correct output y(t−1) from the training set is fed into h(t); at test time, the true output is not known, so the model's own output o(t−1) is fed back instead. (Image: Goodfellow 2016)]
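A minimal sketch (mine; the shapes and the tanh/softmax choices are assumptions) contrasting the two regimes: at train time the ground-truth previous output is fed in, at test time the model's own previous output is fed back.

```python
import numpy as np

rng = np.random.default_rng(4)
in_size, hid, out_size = 5, 4, 3                      # made-up sizes
U = rng.normal(size=(hid, in_size)) * 0.1             # input -> hidden
R = rng.normal(size=(hid, out_size)) * 0.1            # previous output -> hidden
V = rng.normal(size=(out_size, hid)) * 0.1            # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(x_t, prev_out):
    """Hidden state depends on the current input and the *previous output* only."""
    h = np.tanh(U @ x_t + R @ prev_out)
    return softmax(V @ h)

xs = rng.normal(size=(6, in_size))
targets = np.eye(out_size)[rng.integers(0, out_size, 6)]   # 1-hot targets y(t)

# Train time (teacher forcing): feed the ground-truth y(t-1); each step is
# independent given the data, so the loop could run in parallel.
train_outs = [step(xs[t], targets[t - 1] if t > 0 else np.zeros(out_size))
              for t in range(len(xs))]

# Test time: the true outputs are unknown, so feed back the model's own outputs.
prev, test_outs = np.zeros(out_size), []
for x_t in xs:
    prev = step(x_t, prev)
    test_outs.append(prev)
```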

Long-Range Dependence
• Example: "The submarine, which was the subject of a well known song by the Beatles, was yellow."
• Information sometimes needs to be accumulated over a long part of the sequence
• But how long an individual piece of context-dependent information should be kept is itself context-dependent
• Often need to accumulate information in the state, and then forget it later
[Figure 10.3 (repeated): the unfolded computational graph of a recurrent network, showing the path along which information flows forward in time (computing outputs and losses) and backward in time (computing gradients). (Image: Goodfellow 2016)]

Long Short-Term Memory (LSTM)
• LSTM networks replace the regular hidden units with cells
• The input feature is computed with a regular artificial neuron unit
• The feature is accumulated into the state only if the sigmoidal input gate allows it
• The state decays according to the value of the forget gate, which controls the weight of the state's self-loop
• The output can be shut off by the output gate
[Figure 10.16: Block diagram of the LSTM recurrent network "cell". Cells are connected recurrently to each other, replacing the usual hidden units of ordinary recurrent networks. The diagram shows the input, the input gate, the state with its self-loop controlled by the forget gate, and the output gate. (Image: Goodfellow 2016)]
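A minimal sketch (mine; the shapes, initialization, and exact gating equations follow the standard LSTM formulation rather than anything specific to the slides) of a single LSTM cell step, showing the input, forget, and output gates acting on the internal state:

```python
import numpy as np

rng = np.random.default_rng(5)
in_size, hid = 5, 4                                    # made-up sizes
def mats():                                            # one weight pair per gate
    return rng.normal(size=(hid, in_size)) * 0.1, rng.normal(size=(hid, hid)) * 0.1
(U_i, W_i), (U_f, W_f), (U_o, W_o), (U_g, W_g) = mats(), mats(), mats(), mats()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev):
    i = sigmoid(U_i @ x_t + W_i @ h_prev)   # input gate: admit new information?
    f = sigmoid(U_f @ x_t + W_f @ h_prev)   # forget gate: how much state decays
    o = sigmoid(U_o @ x_t + W_o @ h_prev)   # output gate: expose the state?
    g = np.tanh(U_g @ x_t + W_g @ h_prev)   # input feature (regular neuron)
    s = f * s_prev + i * g                  # state: gated self-loop + gated input
    h = o * np.tanh(s)                      # output can be shut off by the gate
    return h, s

h, s = np.zeros(hid), np.zeros(hid)
for x_t in rng.normal(size=(6, in_size)):
    h, s = lstm_step(x_t, h, s)
print(h, s)
```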

Summary
• Naïvely representing sequential inputs for a neural network requires infeasibly many input nodes (and hence parameters)
• Recurrent neural networks are a specialized architecture for handling sequential inputs
  • State accumulates across input elements
  • Each stage computed from the previous stage using the same parameters
• Long short-term memory (LSTM) cells allow context-dependent accumulation and forgetting