5b: Long Short Term Memory
Simple Recurrent Networks (SRNs) can learn medium-range dependencies but have difficulty learning
long range dependencies. Long Short Term Memory (LSTM) is able to learn long range dependencies
using a combination of forget, input and output gates (Hochreiter & Schmidhuber, 1997).
The LSTM maintains a context layer which is distinct from the hidden layer but contains the same
number of units. The full workings of the LSTM at each timestep are described by these equations:
Gates:
f_t = σ(U_f h_{t−1} + W_f x_t + b_f)
i_t = σ(U_i h_{t−1} + W_i x_t + b_i)
o_t = σ(U_o h_{t−1} + W_o x_t + b_o)
Candidate Activation:
g_t = tanh(U_g h_{t−1} + W_g x_t + b_g)
State:
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t
Output:
h_t = o_t ⊙ tanh(c_t)
First, the forget gate (f_t) is used to determine, for each context unit, a ratio between 0 and 1 by which
the value of this context unit will be multiplied. If the ratio is close to zero, the previous value of the
corresponding context unit will be largely forgotten; if it is close to 1, the previous value will be
largely preserved.
Next, update values (g_t) between −1 and +1 are computed using tanh, and the input gate (i_t) is
used to determine ratios by which these update values will be multiplied before being added to the
current context values.
Finally, the output gate (o_t) is computed and used to determine the ratios by which tanh of the
context unit values will be multiplied in order to produce the next hidden unit values.
In this way, the context units are able to specialise, with some of them changing their values
frequently while others preserve their state for many timesteps, until particular circumstances cause
the gates to be ‘opened’ and allow the value of those units to change.
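To make the equations concrete, here is a minimal NumPy sketch of a single LSTM timestep. The function name lstm_step and the parameter layout (a dict mapping each gate to its (U, W, b) triple) are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM timestep implementing the equations above.
    params maps 'f', 'i', 'o', 'g' to (U, W, b) triples (illustrative layout)."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    U_f, W_f, b_f = params['f']
    U_i, W_i, b_i = params['i']
    U_o, W_o, b_o = params['o']
    U_g, W_g, b_g = params['g']

    f_t = sigmoid(U_f @ h_prev + W_f @ x_t + b_f)   # forget gate
    i_t = sigmoid(U_i @ h_prev + W_i @ x_t + b_i)   # input gate
    o_t = sigmoid(U_o @ h_prev + W_o @ x_t + b_o)   # output gate
    g_t = np.tanh(U_g @ h_prev + W_g @ x_t + b_g)   # candidate activation

    c_t = f_t * c_prev + i_t * g_t                  # new context (cell state)
    h_t = o_t * np.tanh(c_t)                        # new hidden state
    return h_t, c_t
```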
Embedded Reber Grammar
The ability of different sequence processing algorithms to learn long range dependencies can be
explored using the Reber Grammar and Embedded Reber Grammar.
[Figure: the Reber Grammar finite state machine (left) and the Embedded Reber Grammar (right); image source: Fahlman, 1991]
The Reber Grammar (RG) is defined by the finite state machine shown on the left. When there is a
choice between two transitions, they are understood to be chosen with equal probability. The
Embedded Reber Grammar (ERG) is shown on the right, where each box marked ‘REBER
GRAMMAR’ contains an identical copy of the finite state machine on the left. The difficulty in learning
the ERG is that the network must remember which transition (T or P) occurred after the initial B, and
retain this information while it is processing the transitions associated with the RG in one of the two
identical boxes, in order to correctly predict the T or P occurring before the final E.
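As a concrete illustration of the two grammars, the following Python sketch generates Reber and Embedded Reber strings from the standard transition table. The node numbering, the function names, and the choice to keep the inner B and E symbols in the embedded strings are assumptions for illustration; implementations differ on these details.

```python
import random

# Transition table for the Reber Grammar finite state machine:
# state -> list of (symbol, next_state).  State 0 follows the initial B,
# and reaching state 5 emits the final E.  (Node numbering is an assumption;
# only the set of generated strings matters.)
REBER = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
}

def make_reber():
    """Generate one Reber Grammar string, choosing each transition uniformly."""
    s, state = 'B', 0
    while state != 5:
        sym, state = random.choice(REBER[state])
        s += sym
    return s + 'E'

def make_embedded_reber():
    """Generate one Embedded Reber Grammar string: the second symbol (T or P)
    must be remembered to predict the matching symbol before the final E."""
    branch = random.choice('TP')
    return 'B' + branch + make_reber() + branch + 'E'

if __name__ == '__main__':
    print(make_reber())            # e.g. BTXSE
    print(make_embedded_reber())   # e.g. BPBTSSXXTVVEPE
```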
In the exercises for this week, you will be demonstrating that the SRN is able to learn the RG but
struggles to learn the ERG, whereas the LSTM can also learn the ERG. We can imagine that one of the
context units is somehow assigned the task of retaining the knowledge of the initial T or P, and that
this knowledge is preserved by keeping that unit’s forget gate value high and its input and output
gate values low.
Gated Recurrent Unit
The Gated Recurrent Unit (GRU) is similar to LSTM but has only two gates instead of three. Its update
equations are as follows:
Gates:
z_t = σ(U_z h_{t−1} + W_z x_t + b_z)
r_t = σ(U_r h_{t−1} + W_r x_t + b_r)
Candidate activation:
g_t = tanh(U_g (r_t ⊙ h_{t−1}) + W_g x_t + b_g)
Output:
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ g_t
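For comparison with the LSTM sketch above, here is a minimal NumPy sketch of a single GRU timestep; again, the function name and the parameter layout are illustrative assumptions.

```python
import numpy as np

def gru_step(x_t, h_prev, params):
    """One GRU timestep implementing the equations above.
    params maps 'z', 'r', 'g' to (U, W, b) triples (illustrative layout)."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    U_z, W_z, b_z = params['z']
    U_r, W_r, b_r = params['r']
    U_g, W_g, b_g = params['g']

    z_t = sigmoid(U_z @ h_prev + W_z @ x_t + b_z)           # update gate
    r_t = sigmoid(U_r @ h_prev + W_r @ x_t + b_r)           # reset gate
    g_t = np.tanh(U_g @ (r_t * h_prev) + W_g @ x_t + b_g)   # candidate activation
    h_t = (1.0 - z_t) * h_prev + z_t * g_t                  # new hidden state
    return h_t
```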
References
Fahlman, S. E. (1991). The recurrent cascade-correlation architecture (Technical Report CMU-CS-91-
100). Carnegie-Mellon University, Department of Computer Science.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Further Reading:
Two excellent web resources for LSTM:
Understanding LSTM Networks (Colah, 2015, GitHub)
LSTM (Long Short Term Memory) (Herta, n.d., christianherta.de)
Quiz 6: Recurrent Networks and LSTM
Question 1
Explain the format and method by which input was fed to the NetTalk system, and the target output.
Question 2
Explain the role of the context layer in an Elman network.
Question 3
Draw a diagram of an LSTM and write the equations for its operation.
Question 4
Draw a diagram of a Gated Recurrent Unit and write the equations for its operation.
Question 5
Briefly describe the problem of long range dependencies, and discuss how well each of the following
architectures is able to deal with long range dependencies:
sliding window approach
Simple Recurrent (Elman) Network
Long Short Term Memory (LSTM)
Gated Recurrent Unit (GRU)
Coding: SRN and LSTM trained on Reber Grammar
In this exercise, we will test the ability of SRN and LSTM to learn the Reber Grammar (RG) and
Embedded Reber Grammar (ERG). Specifically, we will test the following (see the sketch after this list):
Train an SRN on the Reber Grammar
Train an SRN on the Embedded Reber Grammar
Train an LSTM on the ERG
We’ll have one more setting where we train the model only on sequences that exceed a
specified minimum length.
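The sketch below shows one way such an experiment might be wired up in PyTorch: generate Embedded Reber strings, one-hot encode them, and train an LSTM on next-symbol prediction. It is not the course's provided starter code; the class name NextSymbolLSTM, the hyperparameters, and the inline generator are illustrative assumptions. For the SRN case, nn.LSTM could be swapped for nn.RNN, and the minimum-length setting could be added by regenerating strings until they exceed the threshold.

```python
import random
import torch
import torch.nn as nn

# Hypothetical stand-in for the course's data generator: Embedded Reber strings
# over the alphabet BTSXPVE (same transition table as sketched earlier).
SYMS = 'BTSXPVE'
REBER = {0: [('T', 1), ('P', 2)], 1: [('S', 1), ('X', 3)], 2: [('T', 2), ('V', 4)],
         3: [('X', 2), ('S', 5)], 4: [('P', 3), ('V', 5)]}

def make_erg():
    s, state = 'B', 0
    while state != 5:
        sym, state = random.choice(REBER[state])
        s += sym
    s += 'E'
    branch = random.choice('TP')
    return 'B' + branch + s + branch + 'E'

def encode(s):
    """One-hot encode a string; the input is every symbol except the last,
    the target is every symbol except the first (next-symbol prediction)."""
    idx = torch.tensor([SYMS.index(c) for c in s])
    x = nn.functional.one_hot(idx[:-1], len(SYMS)).float().unsqueeze(0)
    return x, idx[1:].unsqueeze(0)

class NextSymbolLSTM(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(len(SYMS), hidden, batch_first=True)
        self.out = nn.Linear(hidden, len(SYMS))
    def forward(self, x):
        h, _ = self.lstm(x)
        return self.out(h)

model = NextSymbolLSTM()
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    x, target = encode(make_erg())
    logits = model(x)
    loss = loss_fn(logits.view(-1, len(SYMS)), target.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(step, loss.item())
```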
Week 5 Thursday video