
9a: Hopfield Networks and Boltzmann Machines

Week 9: Overview

We have previously discussed supervised learning, where the learner is presented with pairs of
inputs and target outputs, and reinforcement learning, where the agent takes actions in an
environment and receives rewards. In this final week, we move to Unsupervised Learning, where the
system is only given a set of inputs (without any reward or target output) and the aim is to find
some structure in the inputs. We will look at Hopfield Networks, Boltzmann Machines, Autoencoders,
Artist-Critic Coevolution and Generative Adversarial Networks.

Weekly learning outcomes

By the end of this module, you will be able to:

explain the difference between supervised, reinforcement and unsupervised learning

describe Hopfield Networks and Boltzmann Machines

compute the weight matrix and dynamics for simple Hopfield Networks

describe generative models, including regularized and variational autoencoders

train a variational autoencoder to produce handwritten digits

explain artist-critic coevolution and generative adversarial networks (GANs)

explain the capacity for deep neural networks to be fooled, and the importance of adversarial
training for image generation

Hopfield Networks

Content Addressable Memory

Humans have the ability to retrieve something from memory when presented with only part of it. For
example,

“To be or not to be, …”

“I came, I saw, …”

Can we recreate this in computers?

Auto-Associative Memory

We want to store a set of images in a neural network in such a way that, starting with a corrupted or
occluded version, we can reconstruct the original image.

Energy Based Models

We can try to define an energy function E(x) in configuration space, in such a way that the local
minima of this energy function correspond to the stored items.

Constraint Satisfaction Problems

Example: Place n queens on an n-by-n chessboard in such a way that no two queens are attacking
each other. We assume there is exactly one queen in each column, so we just need to assign a row to
each queen, in such a way that there are no "conflicts".

Local Search

Some algorithms for solving Constraint Satisfaction Problems work by “Iterative Improvement” or
“Local Search”.

These algorithms assign all variables randomly in the beginning (thus violating several constraints),
then change one variable at a time, trying to reduce the number of violations at each step.

Hill-Climbing by Min-Conflicts

Variable selection: randomly select any conflicted variable

Value selection by the min-conflicts heuristic: choose the value that violates the fewest constraints
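The min-conflicts procedure above can be sketched in Python. This is a minimal sketch applied to the n-queens example; `min_conflicts_queens` and its parameters are illustrative names, not from the lecture.

```python
import random

def min_conflicts_queens(n, max_steps=10000, seed=0):
    """Solve n-queens by iterative improvement: one queen per column,
    repeatedly move a conflicted queen to a row with fewest conflicts."""
    rng = random.Random(seed)
    # Random initial assignment: rows[c] is the row of the queen in column c.
    rows = [rng.randrange(n) for _ in range(n)]

    def conflicts(col, row):
        # Count queens (in other columns) attacking square (row, col).
        return sum(
            1 for c in range(n)
            if c != col and (rows[c] == row or abs(rows[c] - row) == abs(c - col))
        )

    for _ in range(max_steps):
        conflicted = [c for c in range(n) if conflicts(c, rows[c]) > 0]
        if not conflicted:
            return rows                      # no violated constraints: a solution
        col = rng.choice(conflicted)         # variable selection: random conflicted variable
        counts = [conflicts(col, r) for r in range(n)]
        best = min(counts)                   # min-conflicts value selection,
        rows[col] = rng.choice([r for r in range(n) if counts[r] == best])  # ties broken randomly
    return None                              # give up after max_steps
```

Random tie-breaking among minimum-conflict rows helps the search escape plateaus rather than cycling.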

Hill-Climbing

The term Hill-Climbing suggests climbing up to regions of greater "fitness".

Inverted View

When we are minimizing violated constraints, it makes sense to think of starting at the top of a ridge
and climbing down into the valleys.

Hopfield Network: Energy Function

Consider the set of all black-and-white images with d pixels, where each configuration is an image

x = {x_j}_{1≤j≤d}

with each x_j equal to +1 or −1. We want to construct an energy function of the form

E(x) = −( ½ ∑_{i,j} x_i w_{ij} x_j + ∑_i b_i x_i )

such that the stored images {x^(k)}_{1≤k≤p} are local minima for E(x).

The idea is to make w_{ij} positive if the two pixels x_i and x_j tend to have the same color, and
make w_{ij} negative if pixels x_i and x_j tend to have opposite colors (when averaged across the
set of stored images).

Hopfield Network: Image Retrieval

Consider a state space where each configuration (state) consists of a vector

x = {x_j}_{1≤j≤d}

with each x_j either +1 or −1. We can define an energy function as

E(x) = −( ½ ∑_{i,j} x_i w_{ij} x_j + ∑_i b_i x_i ).

We normally assume w_{ii} = 0 for all i, and w_{ij} = w_{ji} for all i, j. These look very much
like the weights and biases of a neural network, but it differs from the feedforward networks we
are used to.

The components (neurons) x_i do not vary continuously, but instead take only the discrete values
−1 and +1. Neurons are iteratively updated, either synchronously or asynchronously, based on the
current values of the neighboring neurons.

Start with an initial state x and then repeatedly try to "flip" neuron activations one at a time,
in order to reach a lower-energy state. If we choose to modify neuron x_i, its new value should be

x_i ← { +1,  if ∑_j w_{ij} x_j + b_i > 0
        x_i, if ∑_j w_{ij} x_j + b_i = 0
        −1,  if ∑_j w_{ij} x_j + b_i < 0

This ensures that the energy E(x) will never increase. It will eventually reach a local minimum.

Hopfield Network: Choosing Weights based on a Set of Images

Suppose we want to store p items {x^(k)}_{1≤k≤p} into a network with d neurons. We can set
b_i = 0 and

w_{ij} = (1/d) ∑_{k=1}^{p} x_i^(k) x_j^(k)

In other words, w_{ij} = (−1 + 2c) p/d, where c is the fraction of training items for which
x_i^(k) = x_j^(k).

This is known as Hebbian learning, by analogy with a process in the brain where the connection
strength between two neurons increases when they fire simultaneously or in rapid succession.

One consequence of this choice for b_i and w_{ij} is that, if x is a stable attractor, then the
negative image (−x) is also a stable attractor.

Once the items are stored, then for any stored item x = x^(l) we have

∑_{j=1}^{d} w_{ij} x_j^(l) = (1/d) ∑_{j=1}^{d} ∑_{k=1}^{p} x_i^(k) x_j^(k) x_j^(l)
                           = x_i^(l) + (1/d) ∑_j ∑_{k≠l} x_i^(k) x_j^(k) x_j^(l)

The last term on the right is called the crosstalk term, representing interference from the other
stored items. If, for all i, the crosstalk term is smaller than 1 in absolute value (or it has the
same sign as x_i^(l)), then x_i will not change and x^(l) will be a stable attractor.

The number of patterns p that can be reliably stored in a Hopfield network is proportional to the
number of neurons d in the network. A careful mathematical analysis shows that if p/d < 0.138, we
can expect the patterns to be stored and retrieved successfully. If we try to store more patterns
than this, additional "spurious" stable states may emerge.

Hopfield Network Example

Boltzmann Machines

Generative Models

The Hopfield Network is used to store specific items and retrieve them. What if, instead, we want
to generate new items, which are somehow "similar" to the stored items, but not quite the same?
This is known as a generative model. The first attempt to do this using neural networks was the
Boltzmann Machine.
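As a concrete illustration of the Hebbian storage rule and the asynchronous retrieval dynamics described for the Hopfield Network, here is a minimal NumPy sketch; the function names are illustrative, not from the lecture.

```python
import numpy as np

def hopfield_weights(patterns):
    """Hebbian rule: w_ij = (1/d) * sum_k x_i^(k) x_j^(k), with w_ii = 0."""
    X = np.array(patterns, dtype=float)   # shape (p, d)
    p, d = X.shape
    W = X.T @ X / d
    np.fill_diagonal(W, 0.0)              # enforce w_ii = 0
    return W

def hopfield_retrieve(W, x, max_sweeps=20, seed=0):
    """Asynchronous updates (b_i = 0): flip one neuron at a time
    until a stable state (local energy minimum) is reached."""
    rng = np.random.default_rng(seed)
    x = np.array(x, dtype=float)
    d = len(x)
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(d):      # visit neurons in random order
            s = W[i] @ x                  # local field sum_j w_ij x_j
            new = x[i] if s == 0 else (1.0 if s > 0 else -1.0)
            if new != x[i]:
                x[i] = new
                changed = True
        if not changed:
            break                         # no neuron flipped: stable state
    return x
```

For example, storing two orthogonal 8-pixel patterns and then corrupting one pixel of the first pattern, retrieval recovers the stored pattern.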
Ising Model of Ferromagnetism

Boltzmann Machine

The Boltzmann Machine uses exactly the same energy function as the Hopfield network:

E(x) = −( ½ ∑_{i,j} x_i w_{ij} x_j + ∑_i b_i x_i )

The Boltzmann Machine is very similar to the Hopfield Network, except that:

the components x_i take the values 0 and 1, rather than −1 and +1

neurons are updated stochastically: the chosen unit x_i is set to 1 with probability

p = 1 / (1 + e^(−ΔE/T))

where ΔE = ∑_j w_{ij} x_j + b_i is the difference in energy between the states x_i = 0 and
x_i = 1, and T is the temperature

if this process is repeated for many iterations, we will eventually obtain a sample from the
Boltzmann distribution

when T → ∞, the value 0 or 1 is always chosen with equal probability, thus producing a
uniform distribution on the state space

as T → 0, the behaviour becomes similar to that of the Hopfield Network (never allowing the
energy to increase)

the Temperature T may be held fixed, or it may start high and be gradually reduced (known as
Simulated Annealing)
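A single stochastic update step, assuming 0/1 units and the acceptance probability p = 1/(1 + e^(−ΔE/T)) described above, might be sketched as follows (function and parameter names are illustrative):

```python
import math
import random

def boltzmann_step(x, W, b, T, rng):
    """Stochastically update one randomly chosen unit of a Boltzmann Machine.

    Units take values 0/1; unit i is set to 1 with probability
    p = 1 / (1 + exp(-dE/T)), where dE = sum_j w_ij x_j + b_i is the
    energy difference between the states x_i = 0 and x_i = 1."""
    i = rng.randrange(len(x))
    dE = sum(W[i][j] * x[j] for j in range(len(x))) + b[i]
    p = 1.0 / (1.0 + math.exp(-dE / T))
    x[i] = 1 if rng.random() < p else 0
    return x
```

For simulated annealing, the same step would be called repeatedly while T is gradually decreased toward 0, at which point the dynamics approach the deterministic Hopfield update.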

Restricted Boltzmann Machine

The Boltzmann Machine is limited in that the probability of each unit must be a linearly separable
function of the surrounding units. It becomes more powerful if we make a division between "visible"
units v and "hidden" units h.

The visible and hidden units roughly correspond to input and hidden units in a feedforward network.
The aim is that the hidden units should learn some hidden features or “latent variables” which help
the system to model the distribution of the inputs.

If we allow visible-to-visible and hidden-to-hidden connections, the network takes too long to train.
So we normally restrict the model by allowing only visible-to-hidden connections.

This is known as a Restricted Boltzmann Machine.

inputs are binary vectors

two-layer bi-directional neural network

visible layer v

hidden layer h

no vis-to-vis or hidden-to-hidden connections

all visible units connected to all hidden units

E(v, h) = −( ∑_i b_i v_i + ∑_j c_j h_j + ∑_{i,j} v_i w_{ij} h_j )

trained to maximize the expected log probability of the data

Conditional Distributions

Because the input and hidden units are decoupled, we can calculate the conditional distribution of
h given v, and vice-versa:

p(h | v) = p(v, h) / p(v)
         = (1 / (Z p(v))) exp( ∑_i b_i v_i + ∑_j c_j h_j + ∑_{i,j} v_i w_{ij} h_j )
         = (1 / Z′) exp( ∑_j c_j h_j + ∑_{i,j} v_i w_{ij} h_j )

It follows that

p(h | v) = ∏_j p(h_j | v) = ∏_j σ( (2h − 1) ⊙ (c + Wᵀ v) )_j

p(v | h) = ∏_i p(v_i | h) = ∏_i σ( (2v − 1) ⊙ (b + W h) )_i

where ⊙ is component-wise multiplication and σ(s) = 1/(1 + exp(−s)) is the sigmoid function.
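For a hidden unit h_j ∈ {0, 1}, the factorised conditional reduces per unit to p(h_j = 1 | v) = σ(c_j + (Wᵀv)_j), and symmetrically for the visible units. A minimal NumPy sketch of these per-unit "on" probabilities (function names are illustrative):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def p_h_given_v(v, W, c):
    """Vector of P(h_j = 1 | v) for each hidden unit: sigmoid(c + W^T v)."""
    return sigmoid(c + W.T @ v)

def p_v_given_h(h, W, b):
    """Vector of P(v_i = 1 | h) for each visible unit: sigmoid(b + W h)."""
    return sigmoid(b + W @ h)
```

Note that σ(−s) = 1 − σ(s), which is how the single expression σ((2h − 1) ⊙ (c + Wᵀv)) covers both the h_j = 1 and h_j = 0 cases.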

Alternating Gibbs Sampling

With the Restricted Boltzmann Machine, we can sample from the Boltzmann distribution as follows:

choose v_0 randomly

then sample h_0 from p(h | v_0)

then sample v_1 from p(v | h_0)

then sample h_1 from p(h | v_1)

etc.
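Assuming the factorised conditionals p(h | v) and p(v | h) take the usual sigmoid form for 0/1 units, the alternating sampling scheme above can be sketched as follows (names are illustrative):

```python
import numpy as np

def gibbs_chain(v0, W, b, c, steps, rng):
    """Alternating Gibbs sampling in an RBM: v0 -> h0 -> v1 -> h1 -> ...
    Requires steps >= 1; returns the final (v, h) pair of the chain."""
    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
    v = v0
    for _ in range(steps):
        h = (rng.random(c.shape) < sigmoid(c + W.T @ v)).astype(float)  # h ~ p(h | v)
        v = (rng.random(b.shape) < sigmoid(b + W @ h)).astype(float)    # v ~ p(v | h)
    return v, h
```

Run for many steps, the chain's samples approach the Boltzmann distribution of the RBM.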

Contrastive Divergence

An RBM can be trained by Contrastive Divergence:

select one or more positive samples {v^(k)} from the training data

for each v^(k), sample a hidden vector h^(k) from p(h | v^(k))

generate "fake" samples {ṽ^(k)} by alternating Gibbs sampling

for each ṽ^(k), sample a hidden vector h̃^(k) from p(h | ṽ^(k))

update {b_i}, {c_j}, {w_ij} to increase log p(v^(k), h^(k)) − log p(ṽ^(k), h̃^(k)):

b_i ← b_i + η (v_i − ṽ_i)
c_j ← c_j + η (h_j − h̃_j)
w_{ij} ← w_{ij} + η (v_i h_j − ṽ_i h̃_j)

Quick Contrastive Divergence

It was noticed in the early 2000s that the process can be sped up by taking just one additional
sample instead of running for many iterations:

v_0, h_0 are used as the positive sample, and v_1, h_1 as the negative sample

this can be compared to the Negative Sampling that was used with word2vec – it is not
guaranteed to approximate the true gradient, but it works well in practice
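The one-step (CD-1) scheme above, with v_0, h_0 as the positive sample and v_1, h_1 as the negative sample, might be sketched as follows. This is a NumPy sketch with illustrative names; eta is the learning rate η.

```python
import numpy as np

def cd1_update(v0, W, b, c, eta, rng):
    """One CD-1 update of an RBM from a single training vector v0 (0/1 values)."""
    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
    h0 = (rng.random(c.shape) < sigmoid(c + W.T @ v0)).astype(float)  # h0 ~ p(h | v0)
    v1 = (rng.random(b.shape) < sigmoid(b + W @ h0)).astype(float)    # v1 ~ p(v | h0)
    h1 = (rng.random(c.shape) < sigmoid(c + W.T @ v1)).astype(float)  # h1 ~ p(h | v1)
    # Positive sample (v0, h0) minus negative sample (v1, h1):
    W += eta * (np.outer(v0, h0) - np.outer(v1, h1))
    b += eta * (v0 - v1)
    c += eta * (h0 - h1)
    return W, b, c
```

In practice this update is applied over many training vectors (often in mini-batches), with the chain re-started from the data each time.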

Deep Boltzmann Machine

The same approach can be applied iteratively to a multi-layer network. The weights from the input to
the first hidden layer are trained first. Keeping those fixed, the weights from the first to the
second layer are trained, and so on.

Greedy Layerwise Pretraining

One application of the deep Boltzmann Machine is greedy unsupervised layerwise pretraining. Each
pair of layers in succession is trained as an RBM. The resulting values are then used as the initial
weights and biases for a feedforward neural network, which is then trained by backpropagation for
some other task, such as classification.

For the sigmoid or tanh activation function, this kind of pre-training leads to much better results
than training directly by backpropagation from random initial weights.

Exercise: Hopfield Networks

Question 1

Compute the weight matrix W for a Hopfield Network with the two memory vectors
[ 1, −1, 1, −1, 1, 1 ] and [ 1, 1, 1, −1, −1, −1 ] stored in it.

Question 2

Confirm that both of the memory vectors from Question 1 are stable states for the network with
weights W.

Question 3

Consider the following weight matrix:

W = 0.2 ×
⎡  0  −1   1  −1  −1 ⎤
⎢ −1   0  −1   1   1 ⎥
⎢  1  −1   0  −1  −1 ⎥
⎢ −1   1  −1   0   1 ⎥
⎣ −1   1  −1   1   0 ⎦

Starting from the initial state [ 1, 1, 1, 1, −1 ]ᵀ, compute the state flow to a stable state using
asynchronous updates.

Question 4

Keeping the weight matrix from Question 3, and starting in the same initial state
[ 1, 1, 1, 1, −1 ]ᵀ, compute the state flow to a stable state, but this time using synchronous
updates.

Week 9 Wednesday video