5b: Long Short Term Memory
Simple Recurrent Networks (SRNs) can learn medium-range dependencies but have difficulty learning
long range dependencies. Long Short Term Memory (LSTM) is able to learn long range dependencies
using a combination of forget, input and output gates (Hochreiter & Schmidhuber, 1997).
The LSTM maintains a context layer which is distinct from the hidden layer but contains the same
number of units. The full workings of the LSTM at each timestep are described by these equations:
Gates:
f_t = σ(U_f h_{t−1} + W_f x_t + b_f)
i_t = σ(U_i h_{t−1} + W_i x_t + b_i)
o_t = σ(U_o h_{t−1} + W_o x_t + b_o)
Candidate Activation:
g_t = tanh(U_g h_{t−1} + W_g x_t + b_g)
State:
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t
Output:
h_t = o_t ⊙ tanh(c_t)
First, the forget gate (f_t) is used to determine, for each context unit, a ratio between 0 and 1 by which
the value of this context unit will be multiplied. If the ratio is close to zero, the previous value of the
corresponding context unit will be largely forgotten; if it is close to 1, the previous value will be
largely preserved.
Next, update values (g_t) between −1 and +1 are computed using tanh, and the input gate (i_t) is
used to determine ratios by which these update values will be multiplied before being added to the
current context values.
Finally, the output gate (o_t) is computed and used to determine the ratios by which tanh of the
context unit values will be multiplied in order to produce the next hidden unit values.
In this way, the context units are able to specialise, with some of them changing their values
frequently while others preserve their state for many timesteps, until particular circumstances cause
the gates to be ‘opened’ and allow the value of those units to change.
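To make the equations concrete, here is a minimal NumPy sketch of a single LSTM timestep. The function name lstm_step and the parameter layout (a dict mapping each gate to its (U, W, b) triple) are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM timestep implementing the equations above.
    params maps 'f', 'i', 'o', 'g' to (U, W, b) triples (illustrative layout)."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    U_f, W_f, b_f = params['f']
    U_i, W_i, b_i = params['i']
    U_o, W_o, b_o = params['o']
    U_g, W_g, b_g = params['g']

    f_t = sigmoid(U_f @ h_prev + W_f @ x_t + b_f)   # forget gate
    i_t = sigmoid(U_i @ h_prev + W_i @ x_t + b_i)   # input gate
    o_t = sigmoid(U_o @ h_prev + W_o @ x_t + b_o)   # output gate
    g_t = np.tanh(U_g @ h_prev + W_g @ x_t + b_g)   # candidate activation

    c_t = f_t * c_prev + i_t * g_t                  # new context (cell state)
    h_t = o_t * np.tanh(c_t)                        # new hidden state
    return h_t, c_t
```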
Embedded Reber Grammar
The ability of different sequence processing algorithms to learn long range dependencies can be
explored using the Reber Grammar and Embedded Reber Grammar.
[Figure: the Reber Grammar finite state machine (left) and the Embedded Reber Grammar (right); image source: Fahlman, 1991]
The Reber Grammar (RG) is defined by the finite state machine shown on the left. When there is a
choice between two transitions, they are understood to be chosen with equal probability. The
Embedded Reber Grammar (ERG) is shown on the right, where each box marked ‘REBER
GRAMMAR’ contains an identical copy of the finite state machine on the left. The difficulty in learning
the ERG is that the network must remember which transition (T or P) occurred after the initial B, and
retain this information while it is processing the transitions associated with the RG in one of the two
identical boxes, in order to correctly predict the T or P occurring before the final E.
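As a concrete illustration of the two grammars, the following Python sketch generates Reber and Embedded Reber strings from the standard transition table. The node numbering, the function names, and the choice to keep the inner B and E symbols in the embedded strings are assumptions for illustration; implementations differ on these details.

```python
import random

# Transition table for the Reber Grammar finite state machine:
# state -> list of (symbol, next_state).  State 0 follows the initial B,
# and reaching state 5 emits the final E.  (Node numbering is an assumption;
# only the set of generated strings matters.)
REBER = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
}

def make_reber():
    """Generate one Reber Grammar string, choosing each transition uniformly."""
    s, state = 'B', 0
    while state != 5:
        sym, state = random.choice(REBER[state])
        s += sym
    return s + 'E'

def make_embedded_reber():
    """Generate one Embedded Reber Grammar string: the second symbol (T or P)
    must be remembered to predict the matching symbol before the final E."""
    branch = random.choice('TP')
    return 'B' + branch + make_reber() + branch + 'E'

if __name__ == '__main__':
    print(make_reber())            # e.g. BTXSE
    print(make_embedded_reber())   # e.g. BPBTSSXXTVVEPE
```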
In the exercises for this week, you will be demonstrating that the SRN is able to learn the RG but
struggles to learn the ERG, whereas the LSTM can also learn the ERG. We can imagine that one of the
context units is somehow assigned the task of retaining the knowledge of the initial T or P, and that
this knowledge is preserved by keeping that unit’s forget gate value high and its input and output
gate values low.
Gated Recurrent Unit
The Gated Recurrent Unit (GRU) is similar to LSTM but has only two gates instead of three. Its update
equations are as follows:
Gates:
z_t = σ(U_z h_{t−1} + W_z x_t + b_z)
r_t = σ(U_r h_{t−1} + W_r x_t + b_r)
Candidate activation:
g_t = tanh(U_g (r_t ⊙ h_{t−1}) + W_g x_t + b_g)
Output:
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ g_t
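For comparison with the LSTM sketch above, here is a minimal NumPy sketch of a single GRU timestep; again, the function name and the parameter layout are illustrative assumptions.

```python
import numpy as np

def gru_step(x_t, h_prev, params):
    """One GRU timestep implementing the equations above.
    params maps 'z', 'r', 'g' to (U, W, b) triples (illustrative layout)."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    U_z, W_z, b_z = params['z']
    U_r, W_r, b_r = params['r']
    U_g, W_g, b_g = params['g']

    z_t = sigmoid(U_z @ h_prev + W_z @ x_t + b_z)           # update gate
    r_t = sigmoid(U_r @ h_prev + W_r @ x_t + b_r)           # reset gate
    g_t = np.tanh(U_g @ (r_t * h_prev) + W_g @ x_t + b_g)   # candidate activation
    h_t = (1.0 - z_t) * h_prev + z_t * g_t                  # new hidden state
    return h_t
```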
References
Fahlman, S. E. (1991). The recurrent cascade-correlation architecture (Technical Report CMU-CS-91-
100). Carnegie-Mellon University, Department of Computer Science.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Further Reading:
Two excellent web resources for LSTM:
Understanding LSTM Networks (Colah, 2015, GitHub)
LSTM (Long Short Term Memory) (Herta, n.d., christianherta.de)
Quiz 6: Recurrent Networks and LSTM
Question 1
Explain the format and method by which input was fed to the NetTalk system, and the target output.
Question 2
Explain the role of the context layer in an Elman network.
Question 3
Draw a diagram of an LSTM and write the equations for its operation.
Question 4
Draw a diagram of a Gated Recurrent Unit and write the equations for its operation.
Question 5
Briefly describe the problem of long range dependencies, and discuss how well each of the following
architectures is able to deal with long range dependencies:
sliding window approach
Simple Recurrent (Elman) Network
Long Short Term Memory (LSTM)
Gated Recurrent Unit (GRU)
Coding: SRN and LSTM trained on Reber Grammar
In this exercise, we will test the ability of SRN and LSTM to learn the Reber Grammar (RG) and
Embedded Reber Grammar (ERG). Specifically, we will test the following (see the sketch after this list):
Train an SRN on the Reber Grammar
Train an SRN on the Embedded Reber Grammar
Train an LSTM on the ERG
We’ll have one more setting where we train the model only on sequences that exceed a
specified minimum length.
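The sketch below shows one way such an experiment might be wired up in PyTorch: generate Embedded Reber strings, one-hot encode them, and train an LSTM on next-symbol prediction. It is not the course's provided starter code; the class name NextSymbolLSTM, the hyperparameters, and the inline generator are illustrative assumptions. For the SRN case, nn.LSTM could be swapped for nn.RNN, and the minimum-length setting could be added by regenerating strings until they exceed the threshold.

```python
import random
import torch
import torch.nn as nn

# Hypothetical stand-in for the course's data generator: Embedded Reber strings
# over the alphabet BTSXPVE (same transition table as sketched earlier).
SYMS = 'BTSXPVE'
REBER = {0: [('T', 1), ('P', 2)], 1: [('S', 1), ('X', 3)], 2: [('T', 2), ('V', 4)],
         3: [('X', 2), ('S', 5)], 4: [('P', 3), ('V', 5)]}

def make_erg():
    s, state = 'B', 0
    while state != 5:
        sym, state = random.choice(REBER[state])
        s += sym
    s += 'E'
    branch = random.choice('TP')
    return 'B' + branch + s + branch + 'E'

def encode(s):
    """One-hot encode a string; the input is every symbol except the last,
    the target is every symbol except the first (next-symbol prediction)."""
    idx = torch.tensor([SYMS.index(c) for c in s])
    x = nn.functional.one_hot(idx[:-1], len(SYMS)).float().unsqueeze(0)
    return x, idx[1:].unsqueeze(0)

class NextSymbolLSTM(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(len(SYMS), hidden, batch_first=True)
        self.out = nn.Linear(hidden, len(SYMS))
    def forward(self, x):
        h, _ = self.lstm(x)
        return self.out(h)

model = NextSymbolLSTM()
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    x, target = encode(make_erg())
    logits = model(x)
    loss = loss_fn(logits.view(-1, len(SYMS)), target.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(step, loss.item())
```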
Week 5 Thursday video