Machine Learning 1 TU Berlin, WiSe 2020/21
Training a Restricted Boltzmann Machine
In this exercise, we implement and train an RBM to model the distribution of MNIST digits 3. The following code loads some libraries, extracts the MNIST digits 3, and converts them to binary vectors.
In [1]: import matplotlib
%matplotlib inline
from matplotlib import pyplot as plt
import numpy,numpy.random
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
X = (mnist.data[mnist.target==3]>127.5)*1.0
X.shape,X.min(),X.max()
Out[1]: ((7141, 784), 0.0, 1.0)
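Note: fetch_mldata has been removed from recent releases of scikit-learn. If it is unavailable in your installation, a possible workaround (assuming a scikit-learn version that provides fetch_openml with the as_frame argument, and an internet connection) is to load the data from OpenML instead:

# Hypothetical alternative loading route in case fetch_mldata is unavailable.
# fetch_openml returns the targets as strings, hence the comparison with '3'.
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', as_frame=False)
X = (mnist.data[mnist.target == '3'] > 127.5) * 1.0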
The RBM learning algorithm consists of several parts, which we will implement one after the other.
Implement the sigmoid function (10 P)
Implement a function that receives a vector of values and applies the sigmoid function

$$\mathrm{sigm}(t) = \frac{\exp(t)}{1+\exp(t)}$$
element-wise to that vector. Note that a naive implementation of the function is numerically unstable. Instability can be avoided by using the hyperbolic tangent and applying some rescaling/offsetting so that it matches the sigmoid. Alternatively, one can distinguish between large and small values of t and treat the two cases differently. In any case, your implementation should be numerically stable and fast (i.e. use vector operations instead of loops).
In [2]: def sigm(z):
    # -----------------------------------
    # TODO: replace by your code
    # -----------------------------------
    import solution
    p = solution.sigm(z)
    # -----------------------------------
    return p
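For reference, one possible numerically stable implementation uses the identity sigm(t) = (1 + tanh(t/2))/2. The sketch below (under the illustrative name sigm_sketch) is only one way to satisfy the requirements, not necessarily the intended solution:

def sigm_sketch(z):
    # Stable sigmoid via tanh: sigm(t) = 0.5*(1 + tanh(t/2)).
    # tanh never overflows, unlike exp(t) for large positive t.
    return 0.5 * (1.0 + numpy.tanh(0.5 * numpy.asarray(z)))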
Your implementation can be tested with the code below.
In [3]: z = numpy.linspace(-5,5,100)
plt.figure(figsize=(4,1.5)); plt.plot(z,sigm(z)); plt.show()
Before implementing the RBM, we need a data structure to store its parameters. Here, we use a dictionary containing the weights and bias given as numpy arrays. Our RBM has 784 input dimensions (the number of pixels in a MNIST digit), and 100 hidden units (i.e. 100 experts).
In [4]: d = 784
H = 100
rbm = {
'W': numpy.random.normal(0,0.1,[d,H]),
'b': numpy.zeros([H]),
}
The initial weights of the RBM are set at random in order to break symmetries.
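As a quick sanity check of this data structure (assuming the sigm function from the previous task), the hidden activation probabilities of a single digit can be computed with one matrix-vector product:

# p(h_j = 1 | x) for the first digit; the result has shape (100,)
p_h = sigm(X[0].dot(rbm['W']) + rbm['b'])
print(p_h.shape)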
Implement the alternate Gibbs sampler (10 P)
We would like to implement a function that performs alternate Gibbs sampling to generate samples from the distribution modeled by the RBM. Recall: Each unit of the RBM is sampled according to a Bernoulli distribution of some parameter p depending on the other units:
$$\forall_{j=1}^{H}:\ h_j \sim \mathrm{Bernoulli}\big(p = \mathrm{sigm}(x^\top w_j + b_j)\big) \qquad \forall_{i=1}^{d}:\ x_i \sim \mathrm{Bernoulli}\big(p = \mathrm{sigm}(h^\top w_i)\big)$$

where $w_i$ denotes the vector of weights that connect to input feature $i$.
Hint: drawing from a Bernoulli distribution is equivalent to drawing from a binomial distribution with n=1. Hence, you can make use of the function numpy.random.binomial, which supports vector operations and accepts a vector of probabilities as input. Here is an example:
In [5]: numpy.random.binomial(1,numpy.linspace(0,1,22))
Out[5]: array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1])
Task: Implement the function below that receives as input a variable rbm representing the dictionary containing the RBM parameters, a data point x given as a numpy array of shape (784,), and an integer k corresponding to the number of alternate Gibbs sampling steps it should perform. Your function should return a sample of the same shape as the data point.
Hint: You can use matrix-vector multiplication to compute several probability values in parallel.
In [6]: def gibbs(rbm,x,k):
    # -----------------------------------
    # TODO: replace by your code
    # -----------------------------------
    import solution
    xm = solution.gibbs(rbm,x,k)
    # -----------------------------------
    return xm
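A minimal sketch of such a sampler, following the two conditional distributions above and assuming the sigm function implemented earlier (the name gibbs_sketch is only illustrative):

def gibbs_sketch(rbm, x, k):
    # Alternate Gibbs sampling: sample hidden units given the visible units,
    # then visible units given the hidden units, and repeat k times.
    xm = x
    for _ in range(k):
        ph = sigm(xm.dot(rbm['W']) + rbm['b'])   # p(h_j = 1 | x)
        h  = numpy.random.binomial(1, ph)        # sample hidden units
        px = sigm(rbm['W'].dot(h))               # p(x_i = 1 | h)
        xm = numpy.random.binomial(1, px) * 1.0  # sample visible units
    return xm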
The Gibbs sampler you have implemented can be tested on some MNIST digit.
In [7]: xm = gibbs(rbm,X[0],3)
plt.figure(figsize=(4,1.5))
ax = plt.subplot(1,2,1); ax.set_title('before'); ax.imshow(X[0].reshape(28,28),cmap='gray')
ax = plt.subplot(1,2,2); ax.set_title('after'); ax.imshow(xm.reshape(28,28),cmap='gray')
plt.show()
The sample looks like random noise. This is because the RBM has not been trained yet.
Implementing the gradient (10 P)
Now we will create a function that implements the gradient of the log-likelihood that the RBM associates with a given data point x. Recall: the components of the gradient are given by:

$$\nabla_{w_j} \log p(x|\theta) = x \cdot \mathrm{sigm}(w_j^\top x + b_j) - \mathbb{E}_{p(x|\theta)}\big[x \cdot \mathrm{sigm}(w_j^\top x + b_j)\big]$$
$$\nabla_{b_j} \log p(x|\theta) = \mathrm{sigm}(w_j^\top x + b_j) - \mathbb{E}_{p(x|\theta)}\big[\mathrm{sigm}(w_j^\top x + b_j)\big]$$
The rightmost term will be approximated using CD-k, i.e. we start from the data point and apply k steps of Gibbs sampling using the function implemented above. We set the number of steps to k=3.
Task: Implement a function that computes the proposed approximation of the gradient for a given data point x given as an array of shape (784,). Your function should return a tuple (dW,db) where dW and db are arrays of the same size as the parameters W and b of the RBM, but containing the approximated gradients.
In [8]: def gradient(rbm,x):
    # -----------------------------------
    # TODO: replace by your code
    # -----------------------------------
    import solution
    dW,db = solution.gradient(rbm,x)
    # -----------------------------------
    return dW,db
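One possible CD-k approximation of this gradient, assuming the sigm and gibbs functions from the previous tasks and k=3 (a sketch only, not necessarily the reference solution):

def gradient_sketch(rbm, x, k=3):
    # Positive phase: hidden probabilities for the data point itself
    ph_data = sigm(x.dot(rbm['W']) + rbm['b'])
    # Negative phase: approximate the model expectation with a CD-k sample
    xm = gibbs(rbm, x, k)
    ph_model = sigm(xm.dot(rbm['W']) + rbm['b'])
    # Difference of outer products gives the gradient w.r.t. W,
    # difference of hidden probabilities gives the gradient w.r.t. b
    dW = numpy.outer(x, ph_data) - numpy.outer(xm, ph_model)
    db = ph_data - ph_model
    return dW, db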
Implementing gradient ascent (10 P)

We now would like to learn the parameters of the RBM by performing a gradient ascent procedure. The function to implement should loop for the number of iterations nbit specified as argument, at each iteration pick one data point at random, compute the gradient using the method implemented above, and perform the update step using the learning rate lr specified as argument. Recall: the update step is given by:
$$w_j \leftarrow w_j + \gamma \cdot \nabla_{w_j} \log p(x|\theta) \qquad b_j \leftarrow b_j + \gamma \cdot \nabla_{b_j} \log p(x|\theta)$$

where $\gamma$ is the learning rate.
Task: Implement a function that performs the gradient ascent procedure using the learning rate and number of iterations specified as arguments. The function receives a dataset X given as an array of shape (N,784) where N is the number of data points. The function terminates after all training iterations have been performed. The function does not need to return anything.
In [9]: def train(rbm,X,lr=0.005,nbit=25000):
    # -----------------------------------
    # TODO: replace by your code
    # -----------------------------------
    import solution
    solution.train(rbm,X,lr=lr,nbit=nbit)
    # -----------------------------------
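A minimal sketch of such a training loop, assuming the gradient function from the previous task (the name train_sketch is only illustrative):

def train_sketch(rbm, X, lr=0.005, nbit=25000):
    # Stochastic gradient ascent: one randomly chosen data point per iteration
    for _ in range(nbit):
        x = X[numpy.random.randint(len(X))]
        dW, db = gradient(rbm, x)
        rbm['W'] += lr * dW
        rbm['b'] += lr * db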
Training and Inspecting the RBM
We now train the RBM. If the RBM has been implemented correctly, the code below should run in approximately 1-2 minutes. For debugging purposes, you may initially reduce the number of iterations to 1000, and switch back to the original number of iterations for the final run.
In [10]: train(rbm,X)
Once the RBM has been trained, the result of training can be visualized. The following code renders the
weights of the RBM as a mosaic where each tile corresponds to one latent variable (or expert).
In [11]: plt.figure(figsize=(10,10))
plt.axis('off')
plt.subplots_adjust(left=0,right=1,bottom=0,top=1)
plt.imshow(rbm['W'].reshape(28,28,10,10).transpose(2,0,3,1).reshape(280,280),cmap='gray')
Out[11]: (figure: 10 x 10 mosaic of the learned RBM weight filters)
We observe that the RBM has learned a rich set of features, some of which are localized in pixel space, and therefore apply to a large region of the input space. These features are statistically robust and also transferable to other digits.