Outline
Covered content:
- Products of Experts (PoE)
- Restricted Boltzmann Machine (RBM)
- Structure of an RBM
- RBM learning algorithms
- Application of RBMs
Reference:
- Hinton, Geoffrey E. (2002). "Training Products of Experts by Minimizing Contrastive Divergence". Neural Computation 14(8): 1771-1800.
Beyond Clustering
[Figure: mixture model (last week's lecture) vs. product of experts (today's lecture)]
Mixture model vs. Product of experts
Mixture Model: Mixture models are commonly used for clustering problems, and the probability is defined as a weighted sum of probability distributions:

p(x|\alpha,\theta) = \sum_{k=1}^{C} \alpha_k \, p(x|\theta_k)

with \alpha_k denoting the mixing coefficients, subject to the constraints \sum_{k=1}^{C} \alpha_k = 1 and \alpha_k \geq 0.

Product of Experts (PoE): Products of experts are commonly used for learning distributed representations (e.g. unsupervised feature extraction) and are defined as a product of functions:

p(x|\theta) = \frac{1}{Z} \prod_{j=1}^{H} \underbrace{g(x|\theta_j)}_{j\text{-th expert}}

where Z is a normalization term set to ensure that p(x|\theta) integrates to 1.
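As a minimal illustration of the difference between the two definitions, the following sketch (assuming NumPy is available; all names and parameter values are illustrative) evaluates a two-component mixture and a two-expert product on a discretized 1-D domain, approximating the PoE normalization constant Z numerically.

```python
import numpy as np

# Illustrative 1-D example: a two-component mixture and a two-expert product.
x = np.linspace(-5, 5, 1001)          # discretized input domain
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Mixture model: weighted SUM of (normalized) distributions.
alpha = np.array([0.3, 0.7])
p_mix = alpha[0] * gauss(x, -1.0, 1.0) + alpha[1] * gauss(x, 2.0, 0.5)

# Product of experts: PRODUCT of functions, divided by Z.
g1 = gauss(x, -1.0, 2.0)
g2 = gauss(x, 1.0, 1.0)
p_poe_unnorm = g1 * g2
Z = np.sum(p_poe_unnorm) * dx         # numerical approximation of Z
p_poe = p_poe_unnorm / Z

print(np.sum(p_mix) * dx, np.sum(p_poe) * dx)   # both approximately 1
```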
Example 1: Product of Gaussians
Consider the experts to be univariate Gaussians, each specialized on a particular input dimension:

g(x|\theta_i) = \frac{1}{(2\pi\sigma_i^2)^{1/2}} \exp\Big(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\Big)

The product of experts (PoE) can be developed as:

p(x|\theta) = \prod_{i=1}^{d} g(x|\theta_i) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\Big)

with \Sigma = \mathrm{diag}(\sigma_1^2,\dots,\sigma_d^2), i.e. a multivariate Gaussian distribution without correlation between features.
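This identity can be checked numerically in a few lines (a sketch assuming NumPy and SciPy are available; names and random values are illustrative):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

d = 3
rng = np.random.default_rng(0)
mu = rng.normal(size=d)
sigma = rng.uniform(0.5, 2.0, size=d)
x = rng.normal(size=d)

# Product of d univariate Gaussian experts, one per input dimension.
poe = np.prod([norm.pdf(x[i], mu[i], sigma[i]) for i in range(d)])

# Equivalent multivariate Gaussian with diagonal covariance.
mvn = multivariate_normal.pdf(x, mean=mu, cov=np.diag(sigma**2))

assert np.isclose(poe, mvn)   # the two densities agree
```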
Example 2: Product of Gaussian Mixtures
Let the experts be the univariate mixtures:

g(x|\theta_i) = \sum_{k=1}^{C} \pi^{-1/2} \exp\big(-(x_i-\mu_{ik})^2\big)

The product of experts (PoE) can be developed as:

p(x|\theta) = \frac{1}{Z} \prod_{i=1}^{d} \sum_{k=1}^{C} \pi^{-1/2} \exp\big(-(x_i-\mu_{ik})^2\big)
= \frac{1}{Z} \sum_{k_1=1}^{C} \cdots \sum_{k_d=1}^{C} \underbrace{\prod_{i=1}^{d} \pi^{-1/2} \exp\big(-(x_i-\mu_{i k_i})^2\big)}_{\text{multivariate Gaussian}}

i.e. a mixture of exponentially many (C^d) multivariate Gaussians. A PoE can therefore encode many variations with few parameters.
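The exponential expansion into C^d terms can be verified by brute force for small d and C (a sketch assuming NumPy; all names are illustrative):

```python
import numpy as np
from itertools import product

d, C = 3, 4
rng = np.random.default_rng(1)
mu = rng.normal(size=(d, C))        # mu[i, k]: mean of component k of expert i
x = rng.normal(size=d)

def expert(i):
    # g(x | theta_i) = sum_k pi^(-1/2) exp(-(x_i - mu_ik)^2)
    return np.sum(np.pi**-0.5 * np.exp(-(x[i] - mu[i])**2))

# Left-hand side: product over experts of a C-component sum.
lhs = np.prod([expert(i) for i in range(d)])

# Right-hand side: sum over all C^d component combinations (k_1, ..., k_d).
rhs = sum(
    np.prod([np.pi**-0.5 * np.exp(-(x[i] - mu[i, k[i]])**2) for i in range(d)])
    for k in product(range(C), repeat=d)
)

assert np.isclose(lhs, rhs)   # same unnormalized density, written two ways
```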
Example 3: Product of t-Student Distributions
Define the experts to be t-Student distributions in some projected space (z = w_j^\top x):

g(x|\theta_j) = \frac{1}{\alpha_j + (w_j^\top x)^2} \qquad (\|w_j\| = 1)

The resulting product of experts

p(x|\theta) = \frac{1}{Z} \prod_{j=1}^{H} \frac{1}{\alpha_j + (w_j^\top x)^2}

with H the number of experts, produces a non-Gaussian multivariate distribution, which can be useful to model e.g. image or speech data. This PoE has connections to other analyses, e.g. independent component analysis (ICA).
The Restricted Boltzmann Machine
[Figure: the RBM as a bipartite graph connecting binary latent units h to binary input units x]

The restricted Boltzmann machine (RBM) is a joint probability model defined over input features x ∈ {0,1}^d and latent variables h ∈ {0,1}^H:

p(x,h|\theta) = \frac{1}{Z} \exp\Big(\sum_{i=1}^{d}\sum_{j=1}^{H} x_i w_{ij} h_j + \sum_{j=1}^{H} b_j h_j\Big)

The parameter w_{ij} can be interpreted as the connection strength between input feature x_i and latent variable h_j: the larger w_{ij}, the more strongly x_i and h_j co-activate.
The Restricted Boltzmann Machine (PoE View)
Connection between RBM and PoE
The RBM, when marginalized over its hidden units, has the structure of a product of experts with g(x,\theta_j) = 1 + \exp(w_j^\top x + b_j).
Proof:
p(x|\theta) = \sum_{h\in\{0,1\}^H} p(x,h|\theta)
= \sum_{h\in\{0,1\}^H} \frac{1}{Z} \exp\Big(\sum_{i=1}^{d}\sum_{j=1}^{H} x_i w_{ij} h_j + \sum_{j=1}^{H} b_j h_j\Big)
= \frac{1}{Z} \sum_{h\in\{0,1\}^H} \prod_{j=1}^{H} \exp\big((w_j^\top x + b_j)\cdot h_j\big)
= \frac{1}{Z} \prod_{j=1}^{H} \sum_{h_j\in\{0,1\}} \exp\big((w_j^\top x + b_j)\cdot h_j\big)
= \frac{1}{Z} \prod_{j=1}^{H} \big(1 + \exp(w_j^\top x + b_j)\big)
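For small d and H, this identity can be verified by brute force over all hidden configurations (a sketch assuming NumPy; names are illustrative). Both sides are unnormalized, i.e. equal up to the same constant Z.

```python
import numpy as np
from itertools import product

d, H = 4, 3
rng = np.random.default_rng(2)
W = rng.normal(size=(d, H))          # W[i, j] = w_ij
b = rng.normal(size=H)
x = rng.integers(0, 2, size=d)

# Brute-force marginalization: sum over h of exp(x^T W h + b^T h)  (unnormalized)
lhs = sum(
    np.exp(x @ W @ np.array(h) + b @ np.array(h))
    for h in product([0, 1], repeat=H)
)

# Product-of-experts form: prod_j (1 + exp(w_j^T x + b_j))  (unnormalized)
rhs = np.prod(1.0 + np.exp(W.T @ x + b))

assert np.isclose(lhs, rhs)
```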
Interpreting the RBM Experts
The experts
g(x,\theta_j) = 1 + \exp(w_j^\top x + b_j)
forming the RBM implement two behaviors:
- w_j^\top x + b_j > 0, i.e. g(x;\theta_j) \gg 1: the example x is in the area of competence of the expert, and the latter speaks in favor of x.
- w_j^\top x + b_j < 0, i.e. g(x;\theta_j) \approx 1: the example x is outside the expert's area of competence, and the expert 'withdraws', i.e. multiplies by 1.
This double behavior is important to implement the nonlinearity.
RBM as a Neural Network
The product of experts (PoE) model

p(x|\theta) = \frac{1}{Z} \prod_{j=1}^{H} \big(1 + \exp(w_j^\top x + b_j)\big)

can be rewritten as:

\log p(x|\theta) = \underbrace{\sum_{j=1}^{H} \underbrace{\log\big(1 + \exp(w_j^\top x + b_j)\big)}_{\text{softplus}}}_{f(x,\theta)} - \log Z

where f(x,θ) can be interpreted as a neural network.

[Figure: f(x,θ) drawn as a neural network: input x, weights w, a layer of softplus units, and a sum-pooling ∑]
Note: The neural network can predict which examples x are more probable relative to other examples, but not the actual probability score, because log Z is difficult to compute.
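Concretely, f(x,θ) is just a sum of softplus activations, and can be used to rank examples by how likely the model finds them. A minimal sketch (assuming NumPy; the weights and data below are illustrative placeholders, not a trained model):

```python
import numpy as np

def log_prob_unnormalized(X, W, b):
    """f(x, theta) = sum_j softplus(w_j^T x + b_j), computed row-wise for X (N x d).
    Equal to log p(x|theta) up to the constant log Z."""
    Z = X @ W + b                                  # (N, H) pre-activations
    return np.sum(np.logaddexp(0.0, Z), axis=1)    # numerically stable softplus

# Rank a batch of binary examples by their (unnormalized) log-probability.
rng = np.random.default_rng(3)
W = rng.normal(size=(784, 100))                    # stand-in for trained RBM weights
b = rng.normal(size=100)
X = rng.integers(0, 2, size=(10, 784))
scores = log_prob_unnormalized(X, W, b)
ranking = np.argsort(-scores)                      # most to least likely
```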
RBM as a Neural Network
RBM trained on MNIST digits (only digits “3” with 100-125 foreground pixels).
[Figure: the 2% least likely and the 2% most likely digits under the trained RBM]
Observations:
- Digits with anomalies are predicted by the RBM to be less likely than clean digits.
RBM as a Neural Network
Universal approximation
The RBM is a universal approximator of data distributions in the binary input space {0,1}^d.

"Proof" by construction: Add a softplus neuron pointing to each corner of the hypercube. Scale the weight w_j according to the probability of each corner, and choose the bias accordingly.

Note: This construction requires exponentially many neurons. In practice, each expert θ_j captures more than a single corner of the hypercube (i.e. learns general features).
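The construction can be made concrete for a small input space. The sketch below (assuming NumPy; the scale s and constant K are illustrative choices controlling the approximation quality, not values from the lecture) builds one softplus expert per corner of {0,1}^d and checks that the resulting PoE distribution matches a target distribution.

```python
import numpy as np
from itertools import product

d = 3
corners = np.array(list(product([0, 1], repeat=d)), dtype=float)   # all 2^d inputs

# Arbitrary (strictly positive) target distribution over the corners.
p_target = np.arange(1, len(corners) + 1, dtype=float)
p_target /= p_target.sum()

# One expert per corner c: w = s*(2c-1), b = -s*sum(c) + t, so that
# w^T x + b = -s * Hamming(x, c) + t, i.e. the expert only "fires" at its corner.
s, K = 30.0, 1e6
W = (s * (2 * corners - 1)).T                      # shape (d, 2^d)
b = -s * corners.sum(axis=1) + np.log(K * p_target)

# Resulting RBM distribution, computed exactly by enumerating all inputs.
unnorm = np.prod(1.0 + np.exp(corners @ W + b), axis=1)
p_rbm = unnorm / unnorm.sum()

print(np.max(np.abs(p_rbm - p_target)))            # close to 0: the RBM matches the target
```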
RBM as a Neural Network
[Figure: RBM weights (w_j)_{j=1}^H and K-means centroids]
Observations:
- In the RBM, the experts (w_j)_j encode only part of the digit (e.g. a stroke). Each expert therefore contributes to making all data containing that particular stroke more likely.
- The experts that the RBM extracts can be used for transfer learning, e.g. reused to predict different digits/characters.
- K-means centroids are much more localized, and consequently less transferable.
Learning an RBM
Recap: mixture model (EM bound)
The mixture model can be optimized by finding the optimum of a lower bound at each iteration:

\log p(\mathcal{D}|\theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{C} \beta_k \,\frac{\alpha_k \, p(x_n|\theta_k)}{\beta_k} \;\geq\; \sum_{n=1}^{N} \sum_{k=1}^{C} \beta_k \log\Big(\frac{\alpha_k \, p(x_n|\theta_k)}{\beta_k}\Big)

where N is the number of data points, and (\alpha_k)_k and (\beta_k)_k are distributions over the C mixture elements.
Question: Can the same EM approach be used for products of experts?

\log p(\mathcal{D}|\theta) = \sum_{n=1}^{N} \big(f(x_n,\theta) - \log Z\big)
Answer: No, the expression cannot be bounded in the same way as for the mixture model (it has a different structure).
Gradient of the RBM
Idea: Although EM is not applicable, we can still compute the gradient of the log-likelihood and perform gradient descent.
\frac{\partial}{\partial\theta} \log p(\mathcal{D}|\theta)
= \frac{\partial}{\partial\theta} \sum_{n=1}^{N} \big(f(x_n,\theta) - \log Z\big)
= \frac{\partial}{\partial\theta} \sum_{n=1}^{N} \Big(f(x_n,\theta) - \log \sum_{x\in\{0,1\}^d} \exp(f(x,\theta))\Big)
= \sum_{n=1}^{N} \frac{\partial}{\partial\theta} f(x_n,\theta) \;-\; N \cdot \frac{\sum_{x\in\{0,1\}^d} \exp(f(x,\theta)) \, \frac{\partial}{\partial\theta} f(x,\theta)}{\sum_{x\in\{0,1\}^d} \exp(f(x,\theta))}
= \sum_{n=1}^{N} \frac{\partial}{\partial\theta} f(x_n,\theta) \;-\; N \cdot \mathbb{E}_{x\sim p(x|\theta)}\Big[\frac{\partial}{\partial\theta} f(x,\theta)\Big]

The gradient is a difference between a data-dependent term and a model-dependent term.
RBM Update Rule
Based on the gradient calculation above, we can build the update rule:
\theta \leftarrow \theta + \gamma \cdot \Big(\frac{1}{N}\sum_{n=1}^{N} \frac{\partial f(x_n,\theta)}{\partial\theta} \;-\; \mathbb{E}_{x\sim p(x|\theta)}\Big[\frac{\partial f(x,\theta)}{\partial\theta}\Big]\Big)

Interpretation:
- The first term makes the data more likely according to the model.
- The second term makes everything that the model considers likely, less likely.
- The training procedure terminates when we reach some equilibrium.

[Figure: model density p(x) and the data points]
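For a toy RBM with very few binary inputs, the model expectation in this update rule can be computed exactly by enumerating all 2^d states, which gives a direct (if intractable at scale) implementation of the rule. A sketch assuming NumPy, with all names illustrative:

```python
import numpy as np
from itertools import product

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x, W, b):
    return np.sum(np.logaddexp(0.0, W.T @ x + b))    # f(x, theta)

def grad_f(x, W, b):
    # df/dW_ij = x_i * sigm(z_j),  df/db_j = sigm(z_j),  with z_j = w_j^T x + b_j
    s = sigm(W.T @ x + b)
    return np.outer(x, s), s

def exact_update(X, W, b, gamma=0.1):
    """One gradient-ascent step, with the model expectation computed exactly
    by enumerating all 2^d binary inputs (only feasible for tiny d)."""
    d = W.shape[0]
    states = np.array(list(product([0, 1], repeat=d)), dtype=float)
    logp = np.array([f(x, W, b) for x in states])
    p = np.exp(logp - np.max(logp))
    p /= p.sum()                                     # p(x | theta) over all 2^d states

    # Data term: average gradient over the data points.
    gW_data = np.mean([grad_f(x, W, b)[0] for x in X], axis=0)
    gb_data = np.mean([grad_f(x, W, b)[1] for x in X], axis=0)
    # Model term: expectation of the gradient under p(x | theta).
    gW_model = sum(p[s] * grad_f(states[s], W, b)[0] for s in range(len(states)))
    gb_model = sum(p[s] * grad_f(states[s], W, b)[1] for s in range(len(states)))

    return W + gamma * (gW_data - gW_model), b + gamma * (gb_data - gb_model)
```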
Computation of the RBM update rule
\theta \leftarrow \theta + \gamma \cdot \Big(\frac{1}{N}\sum_{n=1}^{N} \frac{\partial f(x_n,\theta)}{\partial\theta} \;-\; \mathbb{E}_{x\sim p(x|\theta)}\Big[\frac{\partial f(x,\theta)}{\partial\theta}\Big]\Big)
Observations
- The first term is easy to compute (backprop in f(x,θ), averaged over all data points).
- The second term is more tricky, because p(x|θ) is defined over exponentially many states. An unbiased approximation of the expected gradient can be obtained by generating a sample {x_1, ..., x_m} from the model distribution p(x|θ).
Question: How do we sample from p(x|θ)?
- Idea: Switch back to the 'classical' (i.e. non-PoE) view of the RBM, where we can use latent variables to ease sampling.
Recap: ‘Classical’ View of the RBM
[Figure: the RBM as a bipartite graph connecting binary latent units h to binary input units x]

The RBM is a probability model defined over input features x ∈ {0,1}^d and latent variables h ∈ {0,1}^H:

p(x,h|\theta) = \frac{1}{Z} \exp\Big(\sum_{i=1}^{d}\sum_{j=1}^{H} x_i w_{ij} h_j + \sum_{j=1}^{H} b_j h_j\Big)

The parameter w_{ij} can be interpreted as the connection strength between input feature x_i and latent variable h_j: the larger w_{ij}, the more strongly x_i and h_j co-activate.
Conditional Distributions in an RBM
Sampling p(x) and p(h) is hard; however, the hidden variables h can be sampled easily and independently when we condition on the state x, and similarly for x conditioned on h. Let z_j = \sum_i x_i w_{ij} + b_j. We proceed as:

p(h|x,\theta) = \frac{\frac{1}{Z}\exp\big(\sum_j z_j h_j\big)}{\sum_{h'} \frac{1}{Z}\exp\big(\sum_j z_j h'_j\big)} = \prod_j \frac{\exp(z_j h_j)}{\sum_{h'_j\in\{0,1\}} \exp(z_j h'_j)} = \prod_j \underbrace{\frac{\exp(z_j h_j)}{1+\exp(z_j)}}_{p(h_j|x,\theta)}

and we observe that, conditioned on x, the latent variables (h_j)_j can be sampled easily (Bernoulli distributions) and independently.

By symmetry of the RBM model, we get a similar result for the visible units conditioned on the latent variables, i.e. p(x_i|h,\theta) = \exp(z_i x_i)/(1+\exp(z_i)) with z_i = \sum_j w_{ij} h_j.
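These two conditionals translate directly into code. A minimal sketch (assuming NumPy, with W stored as a d×H matrix so that w_j is the j-th column; all names are illustrative):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_x(x, W, b, rng):
    # p(h_j = 1 | x) = sigm(z_j) with z_j = sum_i x_i w_ij + b_j
    p = sigm(W.T @ x + b)
    return (rng.random(p.shape) < p).astype(float)

def sample_x_given_h(h, W, rng):
    # p(x_i = 1 | h) = sigm(z_i) with z_i = sum_j w_ij h_j
    # (no visible bias, following the model definition above)
    p = sigm(W @ h)
    return (rng.random(p.shape) < p).astype(float)
```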
Block Gibbs Sampling
Observation: We know p(h_j|x,θ) and p(x_i|h,θ). But how do we sample from p(x|θ), which we need to compute the RBM gradient?

Block Gibbs sampling: Start with some random input x^{(0)}, then sample alternately:

h^{(0)} \sim \big(p(h_j \mid x^{(0)}, \theta)\big)_{j=1}^{H}
x^{(1)} \sim \big(p(x_i \mid h^{(0)}, \theta)\big)_{i=1}^{d}
h^{(1)} \sim \big(p(h_j \mid x^{(1)}, \theta)\big)_{j=1}^{H}
x^{(2)} \sim \big(p(x_i \mid h^{(1)}, \theta)\big)_{i=1}^{d}
\vdots

until x^{(\cdot)} converges in distribution to p(x|\theta).
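The alternating scheme is only a few lines of code. A self-contained sketch (assuming NumPy; the function name and the number of steps are illustrative):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def block_gibbs_sample(W, b, n_steps=1000, rng=None):
    """Approximate sample from p(x|theta) by alternating h ~ p(h|x) and x ~ p(x|h)."""
    rng = rng or np.random.default_rng()
    d, H = W.shape
    x = rng.integers(0, 2, size=d).astype(float)                # random initial x^(0)
    for _ in range(n_steps):
        h = (rng.random(H) < sigm(W.T @ x + b)).astype(float)   # h^(t) ~ p(h | x^(t))
        x = (rng.random(d) < sigm(W @ h)).astype(float)         # x^(t+1) ~ p(x | h^(t))
    return x
```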
Fast Approximations
In practice, Gibbs sampling can take a long time to converge. Instead, we can use fast approximations of converged Gibbs sampling.
Common approximation strategies:
- Contrastive divergence (CD-k): Start from an existing data point, and perform k steps of alternating Gibbs sampling.
- Persistent contrastive divergence (PCD): Start from the Gibbs sample x of the previous iteration of gradient descent, and perform one step of Gibbs sampling.
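Combining the gradient formula with one step of Gibbs sampling started at the data gives a CD-1 parameter update. The sketch below is one common way of implementing it (assuming NumPy; all names are illustrative); the statistics x_i·sigm(z_j) and sigm(z_j) are exactly the derivatives of f(x,θ) derived earlier.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(Xbatch, W, b, gamma=0.01, rng=None):
    """One CD-1 step: approximate the model expectation with one step of block
    Gibbs sampling started from the data points themselves."""
    rng = rng or np.random.default_rng()
    N, d = Xbatch.shape

    # Positive (data-dependent) statistics: sigm(z_j) plays the role of df/db_j,
    # and x_i * sigm(z_j) the role of df/dw_ij.
    ph_data = sigm(Xbatch @ W + b)                       # (N, H)

    # One Gibbs step starting from the data ("reconstruction").
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    px_model = sigm(h @ W.T)                             # p(x_i = 1 | h)
    Xtilde = (rng.random(px_model.shape) < px_model).astype(float)
    ph_model = sigm(Xtilde @ W + b)

    # Negative (model-dependent) statistics from the reconstructions.
    gW = (Xbatch.T @ ph_data - Xtilde.T @ ph_model) / N
    gb = np.mean(ph_data - ph_model, axis=0)

    return W + gamma * gW, b + gamma * gb
```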
Application: RBM for Data Denoising
[Figure: noisy digits before denoising]
Suppose you receive an example xnoisy, and would like to denoise it. A reconstruction of that digit can be obtained by mapping to the latent variables and projecting back:
1. Projection on the latent variables:

h = \mathrm{sigm}(W x_{\text{noisy}} + b)

2. Backprojection on the input domain:

x_{\text{rec}} = \mathrm{sigm}(W^\top h)

[Figure: reconstructed digits after denoising]
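In code, the two-step reconstruction might look as follows (a sketch assuming NumPy; consistent with the earlier sketches, W is stored as a d×H matrix, so the transposes are swapped relative to the slide's notation; names are illustrative):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def denoise(x_noisy, W, b):
    """Reconstruct a noisy input with a trained RBM."""
    h = sigm(W.T @ x_noisy + b)      # 1. projection on the latent variables
    x_rec = sigm(W @ h)              # 2. backprojection on the input domain
    return x_rec
```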
Application: RBM for Representation Learning
The RBM model can be used as a neural network initialization to achieve lower generalization error on some classification task.
- Step 1: Create a layer of features based on the learned RBM parameters:

\phi(x) = \big(\mathrm{sigm}(w_1^\top x), \;\dots,\; \mathrm{sigm}(w_H^\top x)\big)

- Step 2: Add a top layer that maps φ(x) to the class probabilities, and train the resulting neural network end-to-end using gradient descent.

[Figure: MNIST example]
Source: Erhan et al. (2010), "Why Does Unsupervised Pre-training Help Deep Learning?"
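A minimal version of the two-step construction (a sketch assuming NumPy; only the forward pass is shown, and the top-layer parameters V, c as well as the end-to-end training loop are illustrative placeholders):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

# Step 1: feature layer initialized from the learned RBM parameters (W_rbm: d x H).
def phi(X, W_rbm):
    return sigm(X @ W_rbm)                 # phi(x) = (sigm(w_1^T x), ..., sigm(w_H^T x))

# Step 2: top layer mapping phi(x) to class probabilities; V and c would then be
# trained (and W_rbm fine-tuned) end-to-end by gradient descent on labeled data.
def predict_proba(X, W_rbm, V, c):
    return softmax(phi(X, W_rbm) @ V + c)  # (N, n_classes)
```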
Summary
- The product of experts is an unsupervised learning approach that is substantially different from mixture models.
- Products of experts such as the RBM are optimized by gradient descent (instead of EM).
- The RBM has several desirable properties: a simple gradient, and a block Gibbs sampler (quite fast).
- The RBM can be used for various tasks, e.g. probability modeling, data denoising, representation learning.