Outline
Covered content:
- Products of Experts (PoE)
- Restricted Boltzmann Machine (RBM)
- Structure of an RBM
- RBM learning algorithms
- Application of RBMs
Reference:
- Hinton, Geoffrey E. (2002). "Training Products of Experts by Minimizing Contrastive Divergence". Neural Computation 14(8): 1771-1800.
Beyond Clustering
[Figure: mixture model (last week's lecture) vs. product of experts (today's lecture)]
Mixture model vs. Product of experts
Mixture Model: Mixture models are commonly used for clustering problems, and the probability is defined as a weighted sum of probability distributions:

p(x|\alpha,\theta) = \sum_{k=1}^{C} \alpha_k \, p(x|\theta_k)

with \alpha_k denoting the mixing coefficients, subject to the constraints \sum_{k=1}^{C} \alpha_k = 1 and \alpha_k \geq 0.

Product of Experts (PoE): Products of experts are commonly used for learning distributed representations (e.g. unsupervised feature extraction) and are defined as a product of functions:

p(x|\theta) = \frac{1}{Z} \prod_{j=1}^{H} \underbrace{g(x|\theta_j)}_{j\text{-th expert}}

where Z is a normalization term set to ensure that p(x|\theta) integrates to 1.
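As a minimal illustration of the difference between the two definitions, the following sketch (assuming NumPy is available; all names and parameter values are illustrative) evaluates a two-component mixture and a two-expert product on a discretized 1-D domain, approximating the PoE normalization constant Z numerically.

```python
import numpy as np

# Illustrative 1-D example: a two-component mixture and a two-expert product.
x = np.linspace(-5, 5, 1001)          # discretized input domain
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Mixture model: weighted SUM of (normalized) distributions.
alpha = np.array([0.3, 0.7])
p_mix = alpha[0] * gauss(x, -1.0, 1.0) + alpha[1] * gauss(x, 2.0, 0.5)

# Product of experts: PRODUCT of functions, divided by Z.
g1 = gauss(x, -1.0, 2.0)
g2 = gauss(x, 1.0, 1.0)
p_poe_unnorm = g1 * g2
Z = np.sum(p_poe_unnorm) * dx         # numerical approximation of Z
p_poe = p_poe_unnorm / Z

print(np.sum(p_mix) * dx, np.sum(p_poe) * dx)   # both approximately 1
```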
Example 1: Product of Gaussians
Consider the experts to be univariate Gaussians, each specialized on a particular input dimension:

g(x|\theta_i) = \frac{1}{(2\pi\sigma_i^2)^{1/2}} \exp\Big(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\Big)

The product of experts (PoE) can be developed as:

p(x|\theta) = \prod_{i=1}^{d} g(x|\theta_i) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\Big)

with \Sigma = \mathrm{diag}(\sigma_1^2,\dots,\sigma_d^2), i.e. a multivariate Gaussian distribution without correlation between features.
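This identity can be checked numerically in a few lines (a sketch assuming NumPy and SciPy are available; names and random values are illustrative):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

d = 3
rng = np.random.default_rng(0)
mu = rng.normal(size=d)
sigma = rng.uniform(0.5, 2.0, size=d)
x = rng.normal(size=d)

# Product of d univariate Gaussian experts, one per input dimension.
poe = np.prod([norm.pdf(x[i], mu[i], sigma[i]) for i in range(d)])

# Equivalent multivariate Gaussian with diagonal covariance.
mvn = multivariate_normal.pdf(x, mean=mu, cov=np.diag(sigma**2))

assert np.isclose(poe, mvn)   # the two densities agree
```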
Example 2: Product of Gaussian Mixtures
Let the experts be the univariate mixtures:

g(x|\theta_i) = \sum_{k=1}^{C} \pi^{-1/2} \exp\big(-(x_i-\mu_{ik})^2\big)

The product of experts (PoE) can be developed as:

p(x|\theta) = \frac{1}{Z} \prod_{i=1}^{d} \sum_{k=1}^{C} \pi^{-1/2} \exp\big(-(x_i-\mu_{ik})^2\big)
= \frac{1}{Z} \sum_{k_1=1}^{C} \cdots \sum_{k_d=1}^{C} \underbrace{\prod_{i=1}^{d} \pi^{-1/2} \exp\big(-(x_i-\mu_{i k_i})^2\big)}_{\text{multivariate Gaussian}}

i.e. a mixture of exponentially many (C^d) multivariate Gaussians. A PoE can therefore encode many variations with few parameters.
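The exponential expansion into C^d terms can be verified by brute force for small d and C (a sketch assuming NumPy; all names are illustrative):

```python
import numpy as np
from itertools import product

d, C = 3, 4
rng = np.random.default_rng(1)
mu = rng.normal(size=(d, C))        # mu[i, k]: mean of component k of expert i
x = rng.normal(size=d)

def expert(i):
    # g(x | theta_i) = sum_k pi^(-1/2) exp(-(x_i - mu_ik)^2)
    return np.sum(np.pi**-0.5 * np.exp(-(x[i] - mu[i])**2))

# Left-hand side: product over experts of a C-component sum.
lhs = np.prod([expert(i) for i in range(d)])

# Right-hand side: sum over all C^d component combinations (k_1, ..., k_d).
rhs = sum(
    np.prod([np.pi**-0.5 * np.exp(-(x[i] - mu[i, k[i]])**2) for i in range(d)])
    for k in product(range(C), repeat=d)
)

assert np.isclose(lhs, rhs)   # same unnormalized density, written two ways
```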
Example 3: Product of t-Student Distributions
Define the experts to be t-Student distributions in some projected space (z = w_j^\top x):

g(x|\theta_j) = \frac{1}{\alpha_j + (w_j^\top x)^2} \qquad (\|w_j\| = 1)

The resulting product of experts

p(x|\theta) = \frac{1}{Z} \prod_{j=1}^{H} \frac{1}{\alpha_j + (w_j^\top x)^2}

with H the number of experts, produces a non-Gaussian multivariate distribution, which can be useful to model e.g. image or speech data. This PoE has connections to other analyses, e.g. independent component analysis (ICA).
The Restricted Boltzmann Machine
[Figure: the RBM as a bipartite graph connecting binary latent units h to binary input units x]

The restricted Boltzmann machine (RBM) is a joint probability model defined over input features x ∈ {0,1}^d and latent variables h ∈ {0,1}^H:

p(x,h|\theta) = \frac{1}{Z} \exp\Big(\sum_{i=1}^{d}\sum_{j=1}^{H} x_i w_{ij} h_j + \sum_{j=1}^{H} b_j h_j\Big)

The parameter w_{ij} can be interpreted as the connection strength between input feature x_i and latent variable h_j: the larger w_{ij}, the more strongly x_i and h_j co-activate.
The Restricted Boltzmann Machine (PoE View)
Connection between RBM and PoE
The RBM, when marginalized over its hidden units, has the structure of a product of experts with g(x,\theta_j) = 1 + \exp(w_j^\top x + b_j).
Proof:
p(x|\theta) = \sum_{h\in\{0,1\}^H} p(x,h|\theta)
= \sum_{h\in\{0,1\}^H} \frac{1}{Z} \exp\Big(\sum_{i=1}^{d}\sum_{j=1}^{H} x_i w_{ij} h_j + \sum_{j=1}^{H} b_j h_j\Big)
= \frac{1}{Z} \sum_{h\in\{0,1\}^H} \prod_{j=1}^{H} \exp\big((w_j^\top x + b_j)\cdot h_j\big)
= \frac{1}{Z} \prod_{j=1}^{H} \sum_{h_j\in\{0,1\}} \exp\big((w_j^\top x + b_j)\cdot h_j\big)
= \frac{1}{Z} \prod_{j=1}^{H} \big(1 + \exp(w_j^\top x + b_j)\big)
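For small d and H, this identity can be verified by brute force over all hidden configurations (a sketch assuming NumPy; names are illustrative). Both sides are unnormalized, i.e. equal up to the same constant Z.

```python
import numpy as np
from itertools import product

d, H = 4, 3
rng = np.random.default_rng(2)
W = rng.normal(size=(d, H))          # W[i, j] = w_ij
b = rng.normal(size=H)
x = rng.integers(0, 2, size=d)

# Brute-force marginalization: sum over h of exp(x^T W h + b^T h)  (unnormalized)
lhs = sum(
    np.exp(x @ W @ np.array(h) + b @ np.array(h))
    for h in product([0, 1], repeat=H)
)

# Product-of-experts form: prod_j (1 + exp(w_j^T x + b_j))  (unnormalized)
rhs = np.prod(1.0 + np.exp(W.T @ x + b))

assert np.isclose(lhs, rhs)
```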
Interpreting the RBM Experts
The experts
g(x,\theta_j) = 1 + \exp(w_j^\top x + b_j)
forming the RBM implement two behaviors:
- w_j^\top x + b_j > 0, i.e. g(x;\theta_j) \gg 1: the example x is in the area of competence of the expert, and the latter speaks in favor of x.
- w_j^\top x + b_j < 0, i.e. g(x;\theta_j) \approx 1: the example x is outside the expert's area of competence, and the expert 'withdraws', i.e. multiplies by 1.
This double behavior is important to implement the nonlinearity.
RBM as a Neural Network
The product of experts (PoE) model

p(x|\theta) = \frac{1}{Z} \prod_{j=1}^{H} \big(1 + \exp(w_j^\top x + b_j)\big)

can be rewritten as:

\log p(x|\theta) = \underbrace{\sum_{j=1}^{H} \underbrace{\log\big(1 + \exp(w_j^\top x + b_j)\big)}_{\text{softplus}}}_{f(x,\theta)} - \log Z

where f(x,θ) can be interpreted as a neural network.

[Figure: f(x,θ) drawn as a neural network: input x, weights w, a layer of softplus units, and a sum-pooling ∑]
Note: The neural network can predict which examples x are more probable relative to other examples, but not the actual probability score, because log Z is difficult to compute.
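Concretely, f(x,θ) is just a sum of softplus activations, and can be used to rank examples by how likely the model finds them. A minimal sketch (assuming NumPy; the weights and data below are illustrative placeholders, not a trained model):

```python
import numpy as np

def log_prob_unnormalized(X, W, b):
    """f(x, theta) = sum_j softplus(w_j^T x + b_j), computed row-wise for X (N x d).
    Equal to log p(x|theta) up to the constant log Z."""
    Z = X @ W + b                                  # (N, H) pre-activations
    return np.sum(np.logaddexp(0.0, Z), axis=1)    # numerically stable softplus

# Rank a batch of binary examples by their (unnormalized) log-probability.
rng = np.random.default_rng(3)
W = rng.normal(size=(784, 100))                    # stand-in for trained RBM weights
b = rng.normal(size=100)
X = rng.integers(0, 2, size=(10, 784))
scores = log_prob_unnormalized(X, W, b)
ranking = np.argsort(-scores)                      # most to least likely
```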
RBM as a Neural Network
RBM trained on MNIST digits (only digits “3” with 100-125 foreground pixels).
[Figure: the 2% least likely and the 2% most likely digits under the trained RBM]
Observations:
- Digits with anomalies are predicted by the RBM to be less likely than clean digits.
RBM as a Neural Network
Universal approximation
The RBM is a universal approximator of data distributions in the binary input space {0,1}^d.

"Proof" by construction: Add a softplus neuron pointing to each corner of the hypercube. Scale the weight w_j according to the probability of each corner, and choose the bias accordingly.

Note: This construction requires exponentially many neurons. In practice, each expert θ_j captures more than a single corner of the hypercube (i.e. learns general features).
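The construction can be made concrete for a small input space. The sketch below (assuming NumPy; the scale s and constant K are illustrative choices controlling the approximation quality, not values from the lecture) builds one softplus expert per corner of {0,1}^d and checks that the resulting PoE distribution matches a target distribution.

```python
import numpy as np
from itertools import product

d = 3
corners = np.array(list(product([0, 1], repeat=d)), dtype=float)   # all 2^d inputs

# Arbitrary (strictly positive) target distribution over the corners.
p_target = np.arange(1, len(corners) + 1, dtype=float)
p_target /= p_target.sum()

# One expert per corner c: w = s*(2c-1), b = -s*sum(c) + t, so that
# w^T x + b = -s * Hamming(x, c) + t, i.e. the expert only "fires" at its corner.
s, K = 30.0, 1e6
W = (s * (2 * corners - 1)).T                      # shape (d, 2^d)
b = -s * corners.sum(axis=1) + np.log(K * p_target)

# Resulting RBM distribution, computed exactly by enumerating all inputs.
unnorm = np.prod(1.0 + np.exp(corners @ W + b), axis=1)
p_rbm = unnorm / unnorm.sum()

print(np.max(np.abs(p_rbm - p_target)))            # close to 0: the RBM matches the target
```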
RBM as a Neural Network
[Figure: RBM weights (w_j)_{j=1}^H and K-means centroids]
Observations:
- In the RBM, the experts (w_j)_j encode only part of the digit (e.g. a stroke). Each expert therefore contributes to making all data containing that particular stroke more likely.
- The experts that the RBM extracts can be used for transfer learning, e.g. reused to predict different digits/characters.
- K-means centroids are much more localized, and consequently less transferable.
Learning an RBM
Recap: mixture model (EM bound)
The mixture model can be optimized by finding the optimum of a lower bound at each iteration:

\log p(\mathcal{D}|\theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{C} \beta_k \,\frac{\alpha_k \, p(x_n|\theta_k)}{\beta_k} \;\geq\; \sum_{n=1}^{N} \sum_{k=1}^{C} \beta_k \log\Big(\frac{\alpha_k \, p(x_n|\theta_k)}{\beta_k}\Big)

where N is the number of data points, and (\alpha_k)_k and (\beta_k)_k are distributions over the C mixture elements.
Question: Can the same EM approach be used for products of experts?

\log p(\mathcal{D}|\theta) = \sum_{n=1}^{N} \big(f(x_n,\theta) - \log Z\big)
Answer: No, the expression cannot be bounded in the same way as for the mixture model (it has a different structure).
Gradient of the RBM
Idea: Although EM is not applicable, we can still compute the gradient of the log-likelihood and perform gradient descent.
\frac{\partial}{\partial\theta} \log p(\mathcal{D}|\theta)
= \frac{\partial}{\partial\theta} \sum_{n=1}^{N} \big(f(x_n,\theta) - \log Z\big)
= \frac{\partial}{\partial\theta} \sum_{n=1}^{N} \Big(f(x_n,\theta) - \log \sum_{x\in\{0,1\}^d} \exp(f(x,\theta))\Big)
= \sum_{n=1}^{N} \frac{\partial}{\partial\theta} f(x_n,\theta) \;-\; N \cdot \frac{\sum_{x\in\{0,1\}^d} \exp(f(x,\theta)) \, \frac{\partial}{\partial\theta} f(x,\theta)}{\sum_{x\in\{0,1\}^d} \exp(f(x,\theta))}
= \sum_{n=1}^{N} \frac{\partial}{\partial\theta} f(x_n,\theta) \;-\; N \cdot \mathbb{E}_{x\sim p(x|\theta)}\Big[\frac{\partial}{\partial\theta} f(x,\theta)\Big]

The gradient is a difference between a data-dependent term and a model-dependent term.
RBM Update Rule
Based on the gradient calculation above, we can build the update rule:
\theta \leftarrow \theta + \gamma \cdot \Big(\frac{1}{N}\sum_{n=1}^{N} \frac{\partial f(x_n,\theta)}{\partial\theta} \;-\; \mathbb{E}_{x\sim p(x|\theta)}\Big[\frac{\partial f(x,\theta)}{\partial\theta}\Big]\Big)

Interpretation:
- The first term makes the data more likely according to the model.
- The second term makes everything that the model considers likely, less likely.
- The training procedure terminates when we reach some equilibrium.

[Figure: model density p(x) and the data points]
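For a toy RBM with very few binary inputs, the model expectation in this update rule can be computed exactly by enumerating all 2^d states, which gives a direct (if intractable at scale) implementation of the rule. A sketch assuming NumPy, with all names illustrative:

```python
import numpy as np
from itertools import product

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x, W, b):
    return np.sum(np.logaddexp(0.0, W.T @ x + b))    # f(x, theta)

def grad_f(x, W, b):
    # df/dW_ij = x_i * sigm(z_j),  df/db_j = sigm(z_j),  with z_j = w_j^T x + b_j
    s = sigm(W.T @ x + b)
    return np.outer(x, s), s

def exact_update(X, W, b, gamma=0.1):
    """One gradient-ascent step, with the model expectation computed exactly
    by enumerating all 2^d binary inputs (only feasible for tiny d)."""
    d = W.shape[0]
    states = np.array(list(product([0, 1], repeat=d)), dtype=float)
    logp = np.array([f(x, W, b) for x in states])
    p = np.exp(logp - np.max(logp))
    p /= p.sum()                                     # p(x | theta) over all 2^d states

    # Data term: average gradient over the data points.
    gW_data = np.mean([grad_f(x, W, b)[0] for x in X], axis=0)
    gb_data = np.mean([grad_f(x, W, b)[1] for x in X], axis=0)
    # Model term: expectation of the gradient under p(x | theta).
    gW_model = sum(p[s] * grad_f(states[s], W, b)[0] for s in range(len(states)))
    gb_model = sum(p[s] * grad_f(states[s], W, b)[1] for s in range(len(states)))

    return W + gamma * (gW_data - gW_model), b + gamma * (gb_data - gb_model)
```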
Computation of the RBM update rule
\theta \leftarrow \theta + \gamma \cdot \Big(\frac{1}{N}\sum_{n=1}^{N} \frac{\partial f(x_n,\theta)}{\partial\theta} \;-\; \mathbb{E}_{x\sim p(x|\theta)}\Big[\frac{\partial f(x,\theta)}{\partial\theta}\Big]\Big)
Observations
- The first term is easy to compute (backprop in f(x,θ), averaged over all data points).
- The second term is more tricky, because p(x|θ) is defined over exponentially many states. An unbiased approximation of the expected gradient can be obtained by generating a sample {x_1, ..., x_m} from the model distribution p(x|θ).
Question: How do we sample from p(x|θ)?
- Idea: Switch back to the 'classical' (i.e. non-PoE) view of the RBM, where we can use latent variables to ease sampling.
Recap: ‘Classical’ View of the RBM
[Figure: the RBM as a bipartite graph connecting binary latent units h to binary input units x]

The RBM is a probability model defined over input features x ∈ {0,1}^d and latent variables h ∈ {0,1}^H:

p(x,h|\theta) = \frac{1}{Z} \exp\Big(\sum_{i=1}^{d}\sum_{j=1}^{H} x_i w_{ij} h_j + \sum_{j=1}^{H} b_j h_j\Big)

The parameter w_{ij} can be interpreted as the connection strength between input feature x_i and latent variable h_j: the larger w_{ij}, the more strongly x_i and h_j co-activate.
Conditional Distributions in an RBM
Sampling p(x) and p(h) is hard; however, the hidden variables h can be sampled easily and independently when we condition on the state x, and similarly for x conditioned on h. Let z_j = \sum_i x_i w_{ij} + b_j. We proceed as:

p(h|x,\theta) = \frac{\frac{1}{Z}\exp\big(\sum_j z_j h_j\big)}{\sum_{h'} \frac{1}{Z}\exp\big(\sum_j z_j h'_j\big)} = \prod_j \frac{\exp(z_j h_j)}{\sum_{h'_j\in\{0,1\}} \exp(z_j h'_j)} = \prod_j \underbrace{\frac{\exp(z_j h_j)}{1+\exp(z_j)}}_{p(h_j|x,\theta)}

and we observe that, conditioned on x, the latent variables (h_j)_j can be sampled easily (Bernoulli distributions) and independently.

By symmetry of the RBM model, we get a similar result for the visible units conditioned on the latent variables, i.e. p(x_i|h,\theta) = \exp(z_i x_i)/(1+\exp(z_i)) with z_i = \sum_j w_{ij} h_j.
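These two conditionals translate directly into code. A minimal sketch (assuming NumPy, with W stored as a d×H matrix so that w_j is the j-th column; all names are illustrative):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_x(x, W, b, rng):
    # p(h_j = 1 | x) = sigm(z_j) with z_j = sum_i x_i w_ij + b_j
    p = sigm(W.T @ x + b)
    return (rng.random(p.shape) < p).astype(float)

def sample_x_given_h(h, W, rng):
    # p(x_i = 1 | h) = sigm(z_i) with z_i = sum_j w_ij h_j
    # (no visible bias, following the model definition above)
    p = sigm(W @ h)
    return (rng.random(p.shape) < p).astype(float)
```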
Block Gibbs Sampling
Observation: We know p(h_j|x,θ) and p(x_i|h,θ). But how do we sample from p(x|θ), which we need to compute the RBM gradient?

Block Gibbs sampling: Start with some random input x^{(0)}, then sample alternately:

h^{(0)} \sim \big(p(h_j \mid x^{(0)}, \theta)\big)_{j=1}^{H}
x^{(1)} \sim \big(p(x_i \mid h^{(0)}, \theta)\big)_{i=1}^{d}
h^{(1)} \sim \big(p(h_j \mid x^{(1)}, \theta)\big)_{j=1}^{H}
x^{(2)} \sim \big(p(x_i \mid h^{(1)}, \theta)\big)_{i=1}^{d}
\vdots

until x^{(\cdot)} converges in distribution to p(x|\theta).
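The alternating scheme is only a few lines of code. A self-contained sketch (assuming NumPy; the function name and the number of steps are illustrative):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def block_gibbs_sample(W, b, n_steps=1000, rng=None):
    """Approximate sample from p(x|theta) by alternating h ~ p(h|x) and x ~ p(x|h)."""
    rng = rng or np.random.default_rng()
    d, H = W.shape
    x = rng.integers(0, 2, size=d).astype(float)                # random initial x^(0)
    for _ in range(n_steps):
        h = (rng.random(H) < sigm(W.T @ x + b)).astype(float)   # h^(t) ~ p(h | x^(t))
        x = (rng.random(d) < sigm(W @ h)).astype(float)         # x^(t+1) ~ p(x | h^(t))
    return x
```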
Fast Approximations
In practice, Gibbs sampling can take a long time to converge. Instead, we can use fast approximations of converged Gibbs sampling.
Common approximation strategies:
- Contrastive divergence (CD-k): Start from an existing data point, and perform k steps of alternating Gibbs sampling.
- Persistent contrastive divergence (PCD): Start from the Gibbs sample x of the previous iteration of gradient descent, and perform one step of Gibbs sampling.
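Combining the gradient formula with one step of Gibbs sampling started at the data gives a CD-1 parameter update. The sketch below is one common way of implementing it (assuming NumPy; all names are illustrative); the statistics x_i·sigm(z_j) and sigm(z_j) are exactly the derivatives of f(x,θ) derived earlier.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(Xbatch, W, b, gamma=0.01, rng=None):
    """One CD-1 step: approximate the model expectation with one step of block
    Gibbs sampling started from the data points themselves."""
    rng = rng or np.random.default_rng()
    N, d = Xbatch.shape

    # Positive (data-dependent) statistics: sigm(z_j) plays the role of df/db_j,
    # and x_i * sigm(z_j) the role of df/dw_ij.
    ph_data = sigm(Xbatch @ W + b)                       # (N, H)

    # One Gibbs step starting from the data ("reconstruction").
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    px_model = sigm(h @ W.T)                             # p(x_i = 1 | h)
    Xtilde = (rng.random(px_model.shape) < px_model).astype(float)
    ph_model = sigm(Xtilde @ W + b)

    # Negative (model-dependent) statistics from the reconstructions.
    gW = (Xbatch.T @ ph_data - Xtilde.T @ ph_model) / N
    gb = np.mean(ph_data - ph_model, axis=0)

    return W + gamma * gW, b + gamma * gb
```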
Application: RBM for Data Denoising
[Figure: noisy digits before denoising]
Suppose you receive an example xnoisy, and would like to denoise it. A reconstruction of that digit can be obtained by mapping to the latent variables and projecting back:
1. Projection on the latent variables:

h = \mathrm{sigm}(W x_{\text{noisy}} + b)

2. Backprojection on the input domain:

x_{\text{rec}} = \mathrm{sigm}(W^\top h)

[Figure: reconstructed digits after denoising]
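In code, the two-step reconstruction might look as follows (a sketch assuming NumPy; consistent with the earlier sketches, W is stored as a d×H matrix, so the transposes are swapped relative to the slide's notation; names are illustrative):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def denoise(x_noisy, W, b):
    """Reconstruct a noisy input with a trained RBM."""
    h = sigm(W.T @ x_noisy + b)      # 1. projection on the latent variables
    x_rec = sigm(W @ h)              # 2. backprojection on the input domain
    return x_rec
```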
Application: RBM for Representation Learning
The RBM model can be used as a neural network initialization to achieve lower generalization error on some classification task.
- Step 1: Create a layer of features based on the learned RBM parameters:

\phi(x) = \big(\mathrm{sigm}(w_1^\top x), \;\dots,\; \mathrm{sigm}(w_H^\top x)\big)

- Step 2: Add a top layer that maps φ(x) to the class probabilities, and train the resulting neural network end-to-end using gradient descent.

[Figure: MNIST example]
Source: Erhan et al. (2010), "Why Does Unsupervised Pre-training Help Deep Learning?"
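A minimal version of the two-step construction (a sketch assuming NumPy; only the forward pass is shown, and the top-layer parameters V, c as well as the end-to-end training loop are illustrative placeholders):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

# Step 1: feature layer initialized from the learned RBM parameters (W_rbm: d x H).
def phi(X, W_rbm):
    return sigm(X @ W_rbm)                 # phi(x) = (sigm(w_1^T x), ..., sigm(w_H^T x))

# Step 2: top layer mapping phi(x) to class probabilities; V and c would then be
# trained (and W_rbm fine-tuned) end-to-end by gradient descent on labeled data.
def predict_proba(X, W_rbm, V, c):
    return softmax(phi(X, W_rbm) @ V + c)  # (N, n_classes)
```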
Summary
- The product of experts is an unsupervised learning approach that is substantially different from mixture models.
- Products of experts such as the RBM are optimized by gradient descent (instead of EM).
- The RBM has several desirable properties: a simple gradient, and a block Gibbs sampler (quite fast).
- The RBM can be used for various tasks, e.g. probability modeling, data denoising, representation learning.