9a: Hopfield Networks and Boltzmann Machines
Week 9: Overview
We have previously discussed supervised learning, where the learner is presented with pairs of
inputs and target outputs, and RL, where the agent takes actions in an environment and receives
rewards. In this final week, we move to Unsupervised Learning, where the system is only given a set
of inputs (without any reward or target output) and the aim is to find some structure in the inputs. We
will look at Hopfield Networks, Boltzmann Machines, Autoencoders, Artist-Critic Coevolution and
Generative Adversarial Networks.
Weekly learning outcomes
By the end of this module, you will be able to:
explain the difference between supervised, reinforcement and unsupervised learning
describe Hopfield Networks and Boltzmann Machines
compute the weight matrix and dynamics for simple Hopfield Networks
describe generative models, including regularized and variational autoencoders
train a variational autoencoder to produce handwritten digits
explain artist-critic coevolution and generative adversarial networks (GANs)
explain the capacity for deep neural networks to be fooled, and the importance of adversarial
training for image generation
Hopfield Networks
Content Addressable Memory
Humans have the ability to retrieve something from memory when presented with only part of it. For
example,
“To be or not to be, …”
“I came, I saw, …”
Can we recreate this in computers?
Auto-Associative Memory
We want to store a set of images in a neural network in such a way that, starting with a corrupted or
occluded version, we can reconstruct the original image.
Energy Based Models
We can try to define an energy function $E(\mathbf{x})$ in configuration space, in such a way that the local
minima of this energy function correspond to the stored items.
Constraint Satisfaction Problems
Example: Place $n$ queens on an $n$-by-$n$ chessboard in such a way that no two queens are attacking
each other. We assume there is exactly one queen in each column, so we just need to assign a row to
each queen, in such a way that there are no "conflicts".
Local Search
Some algorithms for solving Constraint Satisfaction Problems work by “Iterative Improvement” or
“Local Search”.
These algorithms assign all variables randomly in the beginning (thus violating several constraints),
then change one variable at a time, trying to reduce the number of violations at each step.
Hill-Climbing by Min-Conflicts
Variable selection: randomly select any conflicted variable
Value selection by the min-conflicts heuristic: choose the value that violates the fewest constraints (see the sketch below)
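As an illustration, here is a minimal sketch of min-conflicts search for the $n$-queens problem described above. The function names and the step limit are our own; the algorithm can wander on unlucky runs, so a step limit is used.

```python
import random

def conflicts(rows, col):
    """Number of other queens attacking the queen in column col."""
    return sum(
        rows[c] == rows[col] or abs(rows[c] - rows[col]) == abs(c - col)
        for c in range(len(rows)) if c != col
    )

def min_conflicts(n=8, max_steps=10_000):
    # Random initial assignment: one queen per column, random row
    # (this will typically violate several constraints).
    rows = [random.randrange(n) for _ in range(n)]
    for _ in range(max_steps):
        conflicted = [c for c in range(n) if conflicts(rows, c) > 0]
        if not conflicted:
            return rows                    # no violations: solution found
        col = random.choice(conflicted)    # variable selection: any conflicted variable
        # value selection: the row that violates the fewest constraints
        rows[col] = min(range(n),
                        key=lambda r: conflicts(rows[:col] + [r] + rows[col + 1:], col))
    return None                            # step limit reached without a solution
```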
Hill-Climbing
The term Hill-Climbing suggests climbing up to regions of greater "fitness".
Inverted View
When we are minimizing violated constraints, it makes sense to think of starting at the top of a ridge
and climbing down into the valleys.
Hopfield Network: Energy Function
Consider the set of all black-and-white images with $d$ pixels, where each configuration is an image
$\mathbf{x} = \{x_j\}_{1 \le j \le d}$ with each $x_j$ equal to $+1$ or $-1$.
We want to construct an energy function of the form

$$E(\mathbf{x}) = -\Big(\frac{1}{2}\sum_{i,j} x_i\, w_{ij}\, x_j + \sum_i b_i\, x_i\Big)$$

such that the stored images $\{\mathbf{x}^{(k)}\}_{1 \le k \le p}$ are local minima for $E(\mathbf{x})$.

The idea is to make $w_{ij}$ positive if the two pixels $x_i$ and $x_j$ tend to have the same color, and make
$w_{ij}$ negative if pixels $x_i$ and $x_j$ tend to have opposite colors (when averaged across the set of stored
images).
Hopfield Network: Image Retrieval
Consider a state space where each configuration (state) consists of a vector $\mathbf{x} = \{x_j\}_{1 \le j \le d}$,
with each $x_j$ either $+1$ or $-1$. We can define an energy function as

$$E(\mathbf{x}) = -\Big(\frac{1}{2}\sum_{i,j} x_i\, w_{ij}\, x_j + \sum_i b_i\, x_i\Big).$$

We normally assume $w_{ii} = 0$ for all $i$, and $w_{ij} = w_{ji}$ for all $i, j$. These look very much like the
weights and biases of a neural network. But it differs from the feedforward networks we are used to:

the components (neurons) $x_i$ do not vary continuously, but instead take only the discrete values $-1$
and $+1$
neurons are iteratively updated, either synchronously or asynchronously, based on the
current values of the neighboring neurons

Start with an initial state $\mathbf{x}$ and then repeatedly try to "flip" neuron activations one at a time, in order
to reach a lower-energy state. If we choose to modify neuron $x_i$, its new value should be

$$x_i \leftarrow \begin{cases} +1, & \text{if } \sum_j w_{ij}\, x_j + b_i > 0 \\ x_i, & \text{if } \sum_j w_{ij}\, x_j + b_i = 0 \\ -1, & \text{if } \sum_j w_{ij}\, x_j + b_i < 0 \end{cases}$$

This ensures that the energy $E(\mathbf{x})$ will never increase. It will eventually reach a local minimum.
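To make the dynamics concrete, here is a minimal NumPy sketch of the energy function and a single asynchronous update (variable names are our own):

```python
import numpy as np

def energy(x, W, b):
    """E(x) = -( 0.5 * sum_ij x_i w_ij x_j + sum_i b_i x_i )"""
    return -(0.5 * x @ W @ x + b @ x)

def update(x, W, b, i):
    """Asynchronously update neuron i; this never increases E(x)."""
    s = W[i] @ x + b[i]         # w_ii = 0, so x_i itself does not contribute
    if s > 0:
        x[i] = 1
    elif s < 0:
        x[i] = -1               # if s == 0, x_i is left unchanged
    return x
```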
Hopfield Network: Choosing Weights based on a Set of Images
Suppose we want to store $p$ items $\{\mathbf{x}^{(k)}\}_{1 \le k \le p}$
into a network with $d$ neurons. We can set $b_i = 0$ and

$$w_{ij} = \frac{1}{d}\sum_{k=1}^{p} x_i^{(k)}\, x_j^{(k)}$$

In other words, $w_{ij} = (-1 + 2c)\,p/d$,
where $c$ is the fraction of training items for which $x_i^{(k)} = x_j^{(k)}$.
This is known as Hebbian learning, by analogy with a process in the brain where the connection
strength between two neurons increases when they fire simultaneously or in rapid succession.
One consequence of this choice for $b_i$ and $w_{ij}$ is that, if $\mathbf{x}$ is a stable attractor, then the negative
image $(-\mathbf{x})$ is also a stable attractor.
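A sketch of this weight computation, assuming the $p$ patterns are stored as the rows of an array:

```python
import numpy as np

def hebbian_weights(X):
    """X: array of shape (p, d) whose rows are the +/-1 patterns x^(k).
    Returns W with w_ij = (1/d) sum_k x_i^(k) x_j^(k) and w_ii = 0."""
    p, d = X.shape
    W = (X.T @ X) / d
    np.fill_diagonal(W, 0)     # enforce w_ii = 0
    return W
```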
Once the items are stored, then for any item $\mathbf{x} = \mathbf{x}^{(l)}$ we have

$$\sum_{j=1}^{d} w_{ij}\, x_j^{(l)} = \frac{1}{d}\sum_{j=1}^{d}\sum_{k=1}^{p} x_i^{(k)}\, x_j^{(k)}\, x_j^{(l)} = x_i^{(l)} + \frac{1}{d}\sum_j \sum_{k \ne l} x_i^{(k)}\, x_j^{(k)}\, x_j^{(l)}$$
The last term on the right is called the crosstalk term, representing interference from the other stored
items. If, for all $i$, the crosstalk term is smaller than 1 in absolute value (or it has the same sign as
$x_i^{(l)}$) then $x_i^{(l)}$ will not change and $\mathbf{x}^{(l)}$ will be a stable attractor.
The number of patterns $p$ that can be reliably stored in a Hopfield network is proportional to the
number of neurons $d$ in the network. A careful mathematical analysis shows that if $p/d < 0.138$, we
can expect the patterns to be stored and retrieved successfully. If we try to store more patterns than
these, additional "spurious" stable states may emerge.
Hopfield Network Example
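The following self-contained sketch (the random patterns and parameter choices are our own, not an example from the lecture) stores a few patterns with Hebbian weights and retrieves one from a corrupted probe:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 100, 5                                # p/d = 0.05, well below 0.138
X = rng.choice([-1, 1], size=(p, d))         # p random +/-1 patterns

W = (X.T @ X) / d                            # Hebbian weights, b = 0
np.fill_diagonal(W, 0)

x = X[0].copy()                              # corrupt the first pattern
flips = rng.choice(d, size=20, replace=False)
x[flips] *= -1

for _ in range(10):                          # asynchronous update sweeps
    for i in rng.permutation(d):
        s = W[i] @ x
        if s != 0:
            x[i] = 1 if s > 0 else -1

# the negative image -X[0] is also a stable attractor
print("recovered:", np.array_equal(x, X[0]) or np.array_equal(x, -X[0]))
```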
Boltzmann Machines
Generative Models
The Hopfield Network is used to store specific items and retrieve them. What if, instead, we want to
generate new items, which are somehow "similar" to the stored items, but not quite the same?
This is known as a generative model. The first attempt to do this using neural networks was the
Boltzmann Machine.
Ising Model of Ferromagnetism
Boltzmann Machine
The Boltzmann Machine uses exactly the same energy function as the Hopfield network:

$$E(\mathbf{x}) = -\Big(\frac{1}{2}\sum_{i,j} x_i\, w_{ij}\, x_j + \sum_i b_i\, x_i\Big)$$

The Boltzmann Machine is very similar to the Hopfield Network, except that the neurons take the
values 0 and 1 and are updated stochastically: neuron $i$ is set to $x_i = 1$ with probability

$$p = \frac{1}{1 + e^{-\Delta E/T}}$$

where $\Delta E$ is the decrease in energy obtained by setting $x_i = 1$ rather than $x_i = 0$, and $T$ is the Temperature. Note that:

if this process is repeated for many iterations, we will eventually obtain a sample from the
Boltzmann distribution
when $T \to \infty$, the value of 0 or 1 is always chosen with equal probability, thus producing a
uniform distribution on the state space
as $T \to 0$, the behaviour becomes similar to that of the Hopfield Network (never allowing the
energy to increase)
the Temperature $T$ may be held fixed, or it may start high and be gradually reduced (known as
Simulated Annealing)
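A sketch of the stochastic update for 0/1 units, as described above (the helper name is our own):

```python
import numpy as np

def boltzmann_update(x, W, b, i, T, rng):
    """Stochastically set x_i (0 or 1) with p(x_i = 1) = 1 / (1 + exp(-dE/T)),
    where dE = E(x with x_i=0) - E(x with x_i=1) = sum_j w_ij x_j + b_i."""
    dE = W[i] @ x + b[i]                   # uses w_ii = 0 and symmetric W
    p = 1.0 / (1.0 + np.exp(-dE / T))
    x[i] = 1 if rng.random() < p else 0
    return x
```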
Restricted Boltzmann Machine
The Boltzmann Machine is limited in that the probability of each unit must be a linearly separable
function of the surrounding units. It becomes more powerful if we make a division between "visible"
units $\mathbf{v}$ and "hidden" units $\mathbf{h}$.
The visible and hidden units roughly correspond to input and hidden units in a feedforward network.
The aim is that the hidden units should learn some hidden features or “latent variables” which help
the system to model the distribution of the inputs.
If we allow visible-to-visible and hidden-to-hidden connections, the network takes too long to train.
So we normally restrict the model by allowing only visible-to-hidden connections.
This is known as a Restricted Boltzmann Machine.
inputs are binary vectors
two-layer bi-directional neural network
visible layer v
hidden layer h
no vis-to-vis or hidden-to-hidden connections
all visible units connected to all hidden units
$$E(\mathbf{v}, \mathbf{h}) = -\Big(\sum_i b_i\, v_i + \sum_j c_j\, h_j + \sum_{i,j} v_i\, w_{ij}\, h_j\Big)$$
trained to maximize the expected log probability of the data
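The energy can be computed directly from this formula; a one-function NumPy sketch (shapes are our own convention):

```python
import numpy as np

def rbm_energy(v, h, b, c, W):
    """E(v,h) = -( sum_i b_i v_i + sum_j c_j h_j + sum_ij v_i w_ij h_j ).
    v: (n_visible,), h: (n_hidden,), W: (n_visible, n_hidden)."""
    return -(b @ v + c @ h + v @ W @ h)
```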
Conditional Distributions
Because the input and hidden units are decoupled, we can calculate the
conditional distribution of $\mathbf{h}$ given $\mathbf{v}$, and vice-versa:

$$p(\mathbf{h} \mid \mathbf{v}) = \frac{p(\mathbf{v}, \mathbf{h})}{p(\mathbf{v})} = \frac{1}{p(\mathbf{v})}\,\frac{1}{Z}\exp\Big(\sum_i b_i\, v_i + \sum_j c_j\, h_j + \sum_{i,j} v_i\, w_{ij}\, h_j\Big) = \frac{1}{Z'}\exp\Big(\sum_j c_j\, h_j + \sum_{i,j} v_i\, w_{ij}\, h_j\Big)$$

It follows that

$$p(\mathbf{h} \mid \mathbf{v}) = \prod_j p(h_j \mid \mathbf{v}) = \prod_j \sigma\big((2\mathbf{h} - 1) \odot (\mathbf{c} + W^T \mathbf{v})\big)_j$$
$$p(\mathbf{v} \mid \mathbf{h}) = \prod_i p(v_i \mid \mathbf{h}) = \prod_i \sigma\big((2\mathbf{v} - 1) \odot (\mathbf{b} + W \mathbf{h})\big)_i$$

where $\odot$ is component-wise multiplication and $\sigma(s) = 1/(1 + \exp(-s))$ is the sigmoid function.
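Since both conditionals factorize, sampling reduces to independent sigmoid draws. A sketch, assuming 0/1 units and $W$ of shape (n_visible, n_hidden):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sample_h_given_v(v, c, W, rng):
    """p(h_j = 1 | v) = sigmoid(c_j + (W^T v)_j)."""
    p = sigmoid(c + W.T @ v)
    return (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h, b, W, rng):
    """p(v_i = 1 | h) = sigmoid(b_i + (W h)_i)."""
    p = sigmoid(b + W @ h)
    return (rng.random(p.shape) < p).astype(float)
```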
Alternating Gibbs Sampling
With the Restricted Boltzmann Machine, we can sample from the Boltzmann distribution as follows:
choose $\mathbf{v}_0$ randomly
then sample $\mathbf{h}_0$ from $p(\mathbf{h} \mid \mathbf{v}_0)$
then sample $\mathbf{v}_1$ from $p(\mathbf{v} \mid \mathbf{h}_0)$
then sample $\mathbf{h}_1$ from $p(\mathbf{h} \mid \mathbf{v}_1)$
etc. (a sketch follows below)
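The alternating chain is then just a loop. A sketch, reusing the sample_h_given_v and sample_v_given_h helpers from the previous sketch:

```python
def gibbs_chain(b, c, W, steps, rng):
    """Alternating Gibbs sampling, starting from a random v_0."""
    v = (rng.random(W.shape[0]) < 0.5).astype(float)   # random v_0
    h = sample_h_given_v(v, c, W, rng)                 # h_0 ~ p(h | v_0)
    for _ in range(steps):
        v = sample_v_given_h(h, b, W, rng)             # v_{t+1} ~ p(v | h_t)
        h = sample_h_given_v(v, c, W, rng)             # h_{t+1} ~ p(h | v_{t+1})
    return v, h
```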
Contrastive Divergence
RBM can be trained by Contrastive Divergence:
select one or more positive samples $\{\mathbf{v}^{(k)}\}$ from the training data
for each $\mathbf{v}^{(k)}$, sample a hidden vector $\mathbf{h}^{(k)}$ from $p(\mathbf{h} \mid \mathbf{v}^{(k)})$
generate "fake" samples $\{\tilde{\mathbf{v}}^{(k)}\}$ by alternating Gibbs sampling
for each $\tilde{\mathbf{v}}^{(k)}$, sample a hidden vector $\tilde{\mathbf{h}}^{(k)}$ from $p(\mathbf{h} \mid \tilde{\mathbf{v}}^{(k)})$
update $\{b_i\}, \{c_j\}, \{w_{ij}\}$ to increase $\log p(\mathbf{v}^{(k)}, \mathbf{h}^{(k)}) - \log p(\tilde{\mathbf{v}}^{(k)}, \tilde{\mathbf{h}}^{(k)})$:

$$b_i \leftarrow b_i + \eta\,(v_i - \tilde{v}_i)$$
$$c_j \leftarrow c_j + \eta\,(h_j - \tilde{h}_j)$$
$$w_{ij} \leftarrow w_{ij} + \eta\,(v_i h_j - \tilde{v}_i \tilde{h}_j)$$
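A sketch of one such update for a single training vector, using one Gibbs step for the fake sample (so this is in fact the CD-1 shortcut described next); eta is the learning rate, and the samplers are those sketched earlier:

```python
import numpy as np

def cd1_step(v_data, b, c, W, eta, rng):
    """One contrastive-divergence update, modifying b, c, W in place."""
    h_data = sample_h_given_v(v_data, c, W, rng)   # positive phase
    v_fake = sample_v_given_h(h_data, b, W, rng)   # one Gibbs step: "fake" sample
    h_fake = sample_h_given_v(v_fake, c, W, rng)   # negative phase
    W += eta * (np.outer(v_data, h_data) - np.outer(v_fake, h_fake))
    b += eta * (v_data - v_fake)
    c += eta * (h_data - h_fake)
```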
Quick Contrastive Divergence
It was noticed in the early 2000s that the process can be sped up by taking just one additional
sample instead of running for many iterations:
$\mathbf{v}_0, \mathbf{h}_0$ are used as the positive sample, and $\mathbf{v}_1, \mathbf{h}_1$ as the negative sample
this can be compared to the Negative Sampling that was used with word2vec – it is not
guaranteed to approximate the true gradient, but it works well in practice
Deep Boltzmann Machine
The same approach can be applied iteratively to a multi-layer network. The weights from the input to
the first hidden layer are trained first. Keeping those fixed, the weights from the first to the second
layer are trained, and so on.
Greedy Layerwise Pretraining
One application for the deep Boltzmann Machine is greedy unsupervised layerwise pretraining. Each
pair of layers in succession is trained as an RBM. The resulting values are then used as the initial
weights and biases for a feedforward neural network, which is then trained by backpropagation for
some other task, such as classification.
For the sigmoid or tanh activation function, this kind of pre-training leads to much better results
than training directly by backpropagation from random initial weights.
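A schematic sketch of the layerwise procedure; here train_rbm is a hypothetical stand-in for a CD-based trainer such as the one sketched above:

```python
import numpy as np

def greedy_pretrain(data, hidden_sizes, train_rbm):
    """Train a stack of RBMs one pair of layers at a time.
    train_rbm(v, n_hidden) is assumed to return (W, b, c) for one RBM."""
    weights = []
    v = data                                    # shape (n_samples, n_visible)
    for n_hidden in hidden_sizes:
        W, b, c = train_rbm(v, n_hidden)        # train this pair of layers
        weights.append((W, c))                  # keep W and hidden biases
        v = 1.0 / (1.0 + np.exp(-(v @ W + c)))  # p(h=1|v) feeds the next RBM
    return weights                              # initial weights for backprop
```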
Exercise: Hopfield Networks
Question 1
Compute the weight matrix $W$ for a Hopfield Network with two memory vectors
$[\,1, -1, 1, -1, 1, 1\,]$ and $[\,1, 1, 1, -1, -1, -1\,]$ stored in it.

Question 2
Confirm that both of the memory vectors from Question 1 are stable states for the network with
weights $W$.

Question 3
Consider the following weight matrix:

$$W = 0.2 \times \begin{bmatrix} 0 & -1 & 1 & -1 & -1 \\ -1 & 0 & -1 & 1 & 1 \\ 1 & -1 & 0 & -1 & -1 \\ -1 & 1 & -1 & 0 & 1 \\ -1 & 1 & -1 & 1 & 0 \end{bmatrix}$$

Starting from the initial state $[\,1, 1, 1, 1, -1\,]^T$, compute the state flow to a stable state using
asynchronous updates.

Question 4
Keeping the weight matrix $W$ from Question 3, and starting in the same initial state $[\,1, 1, 1, 1, -1\,]^T$,
compute the state flow to a stable state, but this time using synchronous updates.
Week 9 Wednesday video