COMP9444
Neural Networks and Deep Learning

8a. Hopfield Networks and Boltzmann Machines

Outline

• Content Addressable Memory
• Hopfield Network
• Generative Models
• Boltzmann Machine
• Restricted Boltzmann Machine
• Deep Boltzmann Machine
• Greedy Layerwise Pretraining

Content Addressable Memory

Humans have the ability to retrieve something from memory when presented with only part of it. For example,

    "To be, or not to be, ..."
    "I came, I saw, ..."

Can we recreate this in computers?

Auto-Associative Memory

We want to store a set of images in a neural network in such a way that, starting with a corrupted or occluded version, we can reconstruct the original image.

Energy Based Models

We can try to define an energy function E(x) in configuration space, in such a way that the local minima of this energy function correspond to the stored items.

Constraint Satisfaction Problems

Example: place n queens on an n-by-n chessboard in such a way that no two queens are attacking each other.

We assume there is exactly one queen in each column, so we just need to assign a row to each queen, in such a way that there are no "conflicts".

Local Search

Some algorithms for solving Constraint Satisfaction Problems work by "Iterative Improvement" or "Local Search". These algorithms assign all variables randomly in the beginning (thus violating several constraints), and then change one variable at a time, trying to reduce the number of violations at each step.

Hill-Climbing by Min-Conflicts

• Variable selection: randomly select any conflicted variable
• Value selection by min-conflicts heuristic
  ◦ choose the value that violates the fewest constraints

[Figure: a sequence of n-queens board configurations produced by min-conflicts, with the number of conflicts falling from h = 5 to h = 2 to h = 0]
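
The min-conflicts procedure is easy to prototype. Here is a minimal Python sketch for n-queens (my own illustration, not code from the lecture; all function names are my own):

```python
import random

def conflicts(rows, col):
    """Number of other queens attacking the queen in the given column.
    rows[c] is the row assigned to the queen in column c."""
    return sum(
        1 for c in range(len(rows)) if c != col and
        (rows[c] == rows[col] or abs(rows[c] - rows[col]) == abs(c - col))
    )

def conflicts_if_moved(rows, col, row):
    """Conflict count the queen in 'col' would have if moved to 'row'."""
    old, rows[col] = rows[col], row
    h = conflicts(rows, col)
    rows[col] = old
    return h

def min_conflicts(n, max_steps=10000):
    # start with a random (usually conflicting) assignment of one queen per column
    rows = [random.randrange(n) for _ in range(n)]
    for _ in range(max_steps):
        conflicted = [c for c in range(n) if conflicts(rows, c) > 0]
        if not conflicted:
            return rows                       # h = 0: a solution
        col = random.choice(conflicted)       # variable selection: random conflicted queen
        # value selection: move this queen to the row with the fewest conflicts
        rows[col] = min(range(n), key=lambda r: conflicts_if_moved(rows, col, r))
    return None                               # no solution within the step limit

print(min_conflicts(8))
```

Breaking ties in the min step at random (rather than taking the first minimum) usually helps the search escape plateaus.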

Hill-Climbing

The term "hill-climbing" suggests climbing up to regions of greater "fitness".

Inverted View

When we are minimizing violated constraints, it makes sense to think of starting at the top of a ridge and climbing down into the valleys.

Energy Function for Images

Consider the set of all black-and-white images with d pixels, where each configuration x is an image x = \{x_j\}_{1 \le j \le d}, with

    x_j = -1 if pixel j is black, +1 if pixel j is white.

We want to construct an energy function of the form

    E(x) = -\Big( \frac{1}{2} \sum_{i,j} x_i w_{ij} x_j + \sum_i b_i x_i \Big)

such that the stored images \{x^{(k)}\}_{1 \le k \le p} are local minima for E(x).

The idea is to make w_{ij} positive if the two pixels x_i and x_j tend to have the same color, and make w_{ij} negative if pixels x_i and x_j tend to have opposite colors (when averaged across the set of stored images).

Hopfield Network

Consider a state space where each configuration (state) consists of a vector x = \{x_j\}_{1 \le j \le d}, with each x_j equal to either +1 or -1. We can define an energy function as

    E(x) = -\Big( \frac{1}{2} \sum_{i,j} x_i w_{ij} x_j + \sum_i b_i x_i \Big)

We normally assume w_{ii} = 0 for all i, and w_{ij} = w_{ji} for all i, j.

These look very much like the weights and biases of a neural network, but the Hopfield network differs from the feedforward networks we are used to:

• the components (neurons) x_i do not vary continuously, but instead take only the discrete values -1 and +1
• neurons are iteratively updated, either synchronously or asynchronously, based on the current values of the neighboring neurons
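
As a quick illustration (not part of the lecture slides), the energy function above can be computed in a few lines of NumPy; the function name and example values are my own:

```python
import numpy as np

def hopfield_energy(x, W, b):
    """E(x) = -( 0.5 * x^T W x + b^T x ) for a state x of +/-1 values."""
    return -(0.5 * x @ W @ x + b @ x)

# example: three neurons, zero biases, symmetric weights with zero diagonal
W = np.array([[ 0.0,  1.0, -1.0],
              [ 1.0,  0.0,  1.0],
              [-1.0,  1.0,  0.0]])
b = np.zeros(3)
print(hopfield_energy(np.array([1.0, 1.0, -1.0]), W, b))   # prints -1.0
```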

Hopfield Network

    E(x) = -\Big( \frac{1}{2} \sum_{i,j} x_i w_{ij} x_j + \sum_i b_i x_i \Big)

Start with an initial state x and then repeatedly try to "flip" neuron activations one at a time, in order to reach a lower-energy state. If we choose to modify neuron x_i, its new value should be

    x_i \leftarrow \begin{cases} +1, & \text{if } \sum_j w_{ij} x_j + b_i > 0, \\ x_i, & \text{if } \sum_j w_{ij} x_j + b_i = 0, \\ -1, & \text{if } \sum_j w_{ij} x_j + b_i < 0. \end{cases}

This ensures that the energy E(x) will never increase; it will eventually reach a local minimum.

Hopfield Network

Suppose we want to store p items \{x^{(k)}\}_{1 \le k \le p} in a network with d neurons. We can set b_i = 0 and

    w_{ij} = \frac{1}{d} \sum_{k=1}^{p} x_i^{(k)} x_j^{(k)}

This is known as Hebbian learning, by analogy with a process in the brain where the connection strength between two neurons increases when they fire simultaneously or in rapid succession.

In other words, w_{ij} = (-1 + 2c)\, p/d, where c is the fraction of training items for which x_i^{(k)} = x_j^{(k)}.

One consequence of this choice for b_i and w_{ij} is that, if x is a stable attractor, then the negative image (-x) is also a stable attractor.
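
Putting the Hebbian storage rule and the update rule together, here is a minimal NumPy sketch of a Hopfield network (my own illustration, not code from the lecture); patterns are vectors of +1/-1 values:

```python
import numpy as np

class HopfieldNetwork:
    def __init__(self, d):
        self.W = np.zeros((d, d))
        self.b = np.zeros(d)          # biases stay at zero under Hebbian storage

    def store(self, patterns):
        """Hebbian learning: w_ij = (1/d) * sum_k x_i^(k) x_j^(k), with w_ii = 0."""
        X = np.asarray(patterns, dtype=float)        # shape (p, d), entries +/-1
        self.W = X.T @ X / X.shape[1]
        np.fill_diagonal(self.W, 0.0)

    def recall(self, x, max_sweeps=20):
        """Asynchronous updates until no neuron changes (a local minimum of E)."""
        x = np.array(x, dtype=float)
        for _ in range(max_sweeps):
            changed = False
            for i in np.random.permutation(len(x)):
                s = self.W[i] @ x + self.b[i]
                new = x[i] if s == 0 else (1.0 if s > 0 else -1.0)
                if new != x[i]:
                    x[i], changed = new, True
            if not changed:
                break
        return x
```

Starting recall() from a corrupted copy of a stored pattern will typically converge back to that pattern, provided the load p/d is small enough (see the capacity discussion below).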

Hopfield Network

Once the items are stored, then for any item x = x^{(l)} we have

    \sum_{j=1}^{d} w_{ij} x_j^{(l)} = \frac{1}{d} \sum_{j=1}^{d} \sum_{k=1}^{p} x_i^{(k)} x_j^{(k)} x_j^{(l)} = x_i^{(l)} + \frac{1}{d} \sum_{j} \sum_{k \neq l} x_i^{(k)} x_j^{(k)} x_j^{(l)}

The last term on the right is called the crosstalk term, representing interference from the other stored items. If, for all i, the crosstalk term is smaller than 1 in absolute value (or it has the same sign as x_i^{(l)}), then x_i will not change and x^{(l)} will be a stable attractor.

Hopfield Network

• The number of patterns p that can be reliably stored in a Hopfield network is proportional to the number of neurons d in the network.
• A careful mathematical analysis shows that if p/d < 0.138, we can expect the patterns to be stored and retrieved successfully.
• If we try to store more patterns than this, additional "spurious" stable states may emerge.

Generative Models

The Hopfield Network is used to store specific items and retrieve them. What if, instead, we want to generate new items which are somehow "similar" to the stored items, but not quite the same? This is known as a generative model. The first attempt to do this using neural networks was the Boltzmann Machine.

Ising Model of Ferromagnetism

[Figure: the Ising model of ferromagnetism]

Boltzmann Machine (20.1)

The Boltzmann Machine uses exactly the same energy function as the Hopfield network:

    E(x) = -\Big( \sum_{i<j} x_i w_{ij} x_j + \sum_i b_i x_i \Big)

Boltzmann Distribution

The Boltzmann Distribution is a probability distribution over a state space, given by

    p(x) = \frac{e^{-E(x)/T}}{Z}

where T is a (fixed) parameter called the Temperature, and Z = \sum_x e^{-E(x)/T} is a normalizing constant which makes the probabilities sum to 1.

Boltzmann Machine

The Boltzmann Machine operates like the Hopfield network, except that there is some randomness in the neuron updates. In both cases, we repeatedly choose one neuron x_i and decide whether or not to "flip" the value of x_i; whether the flip is accepted depends on the resulting change in energy. For the Hopfield network, the flip is accepted only when the energy decreases, i.e. we never move to a higher energy state. For the Boltzmann Machine, we instead choose the new value stochastically: if x and x' denote the states with x_i = 1 and x_i = 0 respectively (all other neurons unchanged), and \Delta E = E(x') - E(x), then according to the Boltzmann distribution the probability of x_i taking the value 1 or 0 must be

    p(x_i = 1) = \frac{p(x)}{p(x) + p(x')} = \frac{1}{1 + e^{-\Delta E / T}}

    p(x_i = 0) = 1 - p(x_i = 1) = \frac{1}{1 + e^{+\Delta E / T}}

In other words, there is some probability of moving to a higher energy state (or remaining in a higher energy state when a lower one is available).

Boltzmann Machine

Each chosen unit is set to 1 with probability

    p(x_i \to 1) = \frac{1}{1 + e^{-\Delta E / T}}

• if this process is repeated for many iterations, we will eventually obtain a sample from the Boltzmann distribution
• when T → ∞, the value of 0 or 1 is always chosen with equal probability, thus producing a uniform distribution on the state space
• as T → 0, the behaviour becomes similar to that of the Hopfield Network (never allowing the energy to increase)
• the Temperature T may be held fixed, or it may start high and be gradually reduced (known as Simulated Annealing)

Boltzmann Machine Limitations

The Boltzmann Machine is limited in that the probability of each unit must be a linearly separable function of the surrounding units. It becomes more powerful if we make a division between "visible" units v and "hidden" units h.

The visible and hidden units roughly correspond to the input and hidden units in a feedforward network. The aim is that the hidden units should learn some hidden features or "latent variables" which help the system to model the distribution of the inputs.
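
As an illustration of the stochastic update rule above (my own sketch, not code from the lecture), the following NumPy function performs repeated single-unit updates at temperature T on 0/1 units, using the Boltzmann Machine energy function:

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_sample(W, b, T=1.0, steps=10000):
    """Repeatedly pick a unit and set it to 1 with probability
    sigma((W[i] @ x + b[i]) / T), which equals 1 / (1 + exp(-Delta E / T))."""
    d = len(b)
    x = rng.integers(0, 2, size=d).astype(float)     # random initial 0/1 state
    for _ in range(steps):
        i = rng.integers(d)
        # Delta E = E(x with x_i = 0) - E(x with x_i = 1), assuming w_ii = 0
        delta_e = W[i] @ x - W[i, i] * x[i] + b[i]
        p_one = 1.0 / (1.0 + np.exp(-delta_e / T))
        x[i] = 1.0 if rng.random() < p_one else 0.0
    return x
```

Holding T fixed yields samples from the Boltzmann distribution; gradually lowering T inside the loop would turn this into simulated annealing, and in the limit T → 0 it behaves like a Hopfield update.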

Restricted Boltzmann Machine (16.7)

If we allow visible-to-visible and hidden-to-hidden connections, the network takes too long to train. So we normally restrict the model by allowing only visible-to-hidden connections. This is known as a Restricted Boltzmann Machine.

Restricted Boltzmann Machine

• inputs are binary vectors
• two-layer bi-directional neural network
  ◦ visible layer v
  ◦ hidden layer h
• no vis-to-vis or hidden-to-hidden connections
• all visible units connected to all hidden units
• trained to maximize the expected log probability of the data

    E(v, h) = -\Big( \sum_i b_i v_i + \sum_j c_j h_j + \sum_{i,j} v_i w_{ij} h_j \Big)

Conditional Distributions (20.2)

Because the input and hidden units are decoupled, we can calculate the conditional distribution of h given v, and vice-versa:

    p(h \mid v) = \frac{p(v, h)}{p(v)} = \frac{1}{p(v)} \frac{1}{Z} \exp\Big( \sum_i b_i v_i + \sum_j c_j h_j + \sum_{i,j} v_i w_{ij} h_j \Big) = \frac{1}{Z'} \exp\Big( \sum_j c_j h_j + \sum_{i,j} v_i w_{ij} h_j \Big)

It follows that

    p(h \mid v) = \prod_j p(h_j \mid v) = \prod_j \sigma\big( (2h - 1) \odot (c + W^T v) \big)_j

    p(v \mid h) = \prod_i p(v_i \mid h) = \prod_i \sigma\big( (2v - 1) \odot (b + W h) \big)_i

where ⊙ is component-wise multiplication and σ(s) = 1/(1 + exp(-s)) is the sigmoid function.

Alternating Gibbs Sampling

With the Restricted Boltzmann Machine, we can sample from the Boltzmann distribution as follows:

• choose v_0 randomly
• then sample h_0 from p(h | v_0)
• then sample v_1 from p(v | h_0)
• then sample h_1 from p(h | v_1)
• etc.
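
The conditional distributions translate directly into code. Below is a minimal NumPy sketch of sampling h given v, v given h, and the alternating Gibbs chain (my own illustration, not the course's code; the parameter names b, c, W follow the slide's notation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sample_h_given_v(v, W, c):
    """p(h_j = 1 | v) = sigma(c_j + (W^T v)_j), sampled independently for each j."""
    p = sigmoid(c + W.T @ v)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, W, b):
    """p(v_i = 1 | h) = sigma(b_i + (W h)_i), sampled independently for each i."""
    p = sigmoid(b + W @ h)
    return (rng.random(p.shape) < p).astype(float), p

def alternating_gibbs(W, b, c, steps=100):
    """Start from a random visible vector and alternate h ~ p(h|v), v ~ p(v|h)."""
    v = rng.integers(0, 2, size=len(b)).astype(float)
    h = None
    for _ in range(steps):
        h, _ = sample_h_given_v(v, W, c)
        v, _ = sample_v_given_h(h, W, b)
    return v, h
```

Note that p(h_j = 1 | v) = σ(c_j + (W^T v)_j) is just the product formula above evaluated at h_j = 1.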

Contrastive Divergence (18.2)

The RBM can be trained by Contrastive Divergence:

• select one or more positive samples {v^{(k)}} from the training data
• for each v^{(k)}, sample a hidden vector h^{(k)} from p(h | v^{(k)})
• generate "fake" samples {ṽ^{(k)}} by alternating Gibbs sampling
• for each ṽ^{(k)}, sample a hidden vector h̃^{(k)} from p(h | ṽ^{(k)})
• update {b_i}, {c_j}, {w_ij} to increase log p(v^{(k)}, h^{(k)}) - log p(ṽ^{(k)}, h̃^{(k)}):

    b_i \leftarrow b_i + \eta\, (v_i - \tilde{v}_i)
    c_j \leftarrow c_j + \eta\, (h_j - \tilde{h}_j)
    w_{ij} \leftarrow w_{ij} + \eta\, (v_i h_j - \tilde{v}_i \tilde{h}_j)

Quick Contrastive Divergence

It was noticed in the early 2000's that the process can be sped up by taking just one additional sample instead of running the Gibbs chain for many iterations:

• v_0, h_0 are used as the positive sample, and v_1, h_1 as the negative sample
• this can be compared to the Negative Sampling that was used with word2vec – it is not guaranteed to approximate the true gradient, but it works well in practice
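
A minimal sketch of one such CD-1 update step, reusing the hypothetical sampling helpers from the previous sketch (my own code, not the course's; η is the learning rate, and the sampled binary vectors are used rather than probabilities, for simplicity):

```python
def cd1_step(v0, W, b, c, eta=0.1):
    """One quick Contrastive Divergence (CD-1) update for a single training vector v0."""
    # positive phase: hidden sample driven by the data
    h0, _ = sample_h_given_v(v0, W, c)
    # negative phase: one step of alternating Gibbs sampling
    v1, _ = sample_v_given_h(h0, W, b)
    h1, _ = sample_h_given_v(v1, W, c)
    # updates from the slide: increase log p(v0, h0) - log p(v1, h1)
    W += eta * (np.outer(v0, h0) - np.outer(v1, h1))
    b += eta * (v0 - v1)
    c += eta * (h0 - h1)
    return W, b, c
```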

Deep Boltzmann Machine (20.4)

The same approach can be applied iteratively to a multi-layer network: the weights from the input to the first hidden layer are trained first; keeping those fixed, the weights from the first to the second hidden layer are trained, and so on.

Greedy Layerwise Pretraining

One application for the Deep Boltzmann Machine is greedy unsupervised layerwise pretraining. Each pair of layers in succession is trained as an RBM. The resulting values are then used as the initial weights and biases for a feedforward neural network, which is then trained by backpropagation for some other task, such as classification.

For the sigmoid or tanh activation function, this kind of pre-training leads to a much better result than training directly by backpropagation from random initial weights.
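
To illustrate the idea, here is a sketch under my own assumptions (reusing the hypothetical cd1_step and sampling helpers above, not the course's reference code): each layer is trained as an RBM on the activations produced by the layers already trained, and the learned weights then initialize a feedforward network.

```python
def pretrain_stack(data, layer_sizes, epochs=10, eta=0.1):
    """Greedy layerwise pretraining: train one RBM per pair of adjacent layers."""
    params = []
    inputs = data                                    # shape (num_examples, layer_sizes[0])
    for n_in, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = 0.01 * rng.standard_normal((n_in, n_hid))
        b = np.zeros(n_in)
        c = np.zeros(n_hid)
        for _ in range(epochs):
            for v0 in inputs:
                W, b, c = cd1_step(v0, W, b, c, eta)
        params.append((W, c))                        # W and hidden biases initialize the next layer
        # propagate the data upwards: hidden probabilities become the next layer's input
        inputs = np.array([sigmoid(c + W.T @ v) for v in inputs])
    return params

# the returned (W, c) pairs would be copied into the corresponding layers of a
# feedforward network, which is then fine-tuned by backpropagation
```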