L16 – Deep Belief Networks
EECS 391
Intro to AI
Deep Belief Networks
L16 Thu Nov 2
Michael S. Lewicki ◇ CWRU
Hierarchy of brain areas in the mammalian visual system
[Figure: flat map of the macaque monkey brain and the hierarchy of visual areas, from Felleman and Van Essen (1991). Simple and complex cells are found in the early areas of this hierarchy (primary visual cortex).]
Real neurons
[Figure: intracellular recordings of song-specific HVc neurons, reproduced from Lewicki (1996), J. Neurosci. 16(18):5854–5863. Phasically excited song-specific cells respond to the forward song with precisely timed bursts and show a consistent hyperpolarization before each burst; the response is lost when the syllable order or the entire song is reversed. Scale bars: 20 mV, 100 ms.]
The Neocognitron (Fukushima, 1980)
• features are pooled from local areas
• the aim is to allow recognition of a variety of feature positions, while preserving relative position
What do you see?
[Figure: data points plotted on x and y axes.]
The data: a set of stochastic binary patterns. Each column is a distinct eight-dimensional binary feature; the accompanying panels show the true hidden causes of the data. This is a learning problem, which we'll cover in a later lecture.
Hierarchical Statistical Models
A Bayesian belief network: each unit S_i has parents pa(S_i), and the bottom layer is the data D.

The joint probability of the binary states is

$$P(S \mid W) = \prod_i P(S_i \mid \mathrm{pa}(S_i), W)$$

The probability of S_i depends only on its parents:

$$P(S_i \mid \mathrm{pa}(S_i), W) = \begin{cases} h\big(\sum_j S_j w_{ji}\big) & \text{if } S_i = 1 \\ 1 - h\big(\sum_j S_j w_{ji}\big) & \text{if } S_i = 0 \end{cases}$$

The function h specifies how causes are combined: $h(u) = 1 - \exp(-u)$, $u > 0$.

Main points:
• hierarchical structure allows the model to form high-order representations
• upper states are priors for lower states
• weights encode higher-order features
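To make the generative rule concrete, here is a minimal Python sketch (not from the slides) that evaluates P(S_i = 1 | pa(S_i), W) and the joint probability of a complete binary state vector; the weight-matrix layout and function names are illustrative assumptions, and bias terms for root units are omitted.

```python
import numpy as np

def h(u):
    # combination rule from the slides: h(u) = 1 - exp(-u), for u >= 0
    return 1.0 - np.exp(-u)

def p_on(S, W, i):
    """P(S_i = 1 | pa(S_i), W), with W[j, i] the (nonnegative) weight from unit j to unit i."""
    u_i = np.asarray(S) @ W[:, i]     # causal input u_i = sum_j S_j w_ji
    return h(u_i)

def joint_prob(S, W):
    """P(S | W) = prod_i P(S_i | pa(S_i), W) for a complete binary state vector S."""
    p = 1.0
    for i in range(len(S)):
        p1 = p_on(S, W, i)
        p *= p1 if S[i] == 1 else 1.0 - p1
    return p
```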
Another approach: Inference by stochastic simulation
Basic idea:
1. Draw N samples from a sampling distribution S
2. Compute an approximate posterior probability
3. Show this converges to the true probability
Sampling from an empty network
function Prior-Sample(bn) returns an event sampled from bn
inputs: bn, a belief network specifying joint distribution P(X1, . . . , Xn)
x← an event with n elements
for i = 1 to n do
xi ← a random sample from P(Xi | parents(Xi))
given the values of Parents(Xi) in x
return x
Sampling with no evidence (from the prior)
Example: the sprinkler network
Cloudy is a parent of both Sprinkler and Rain; Sprinkler and Rain are parents of WetGrass.

P(C) = .50

P(S|C):  C = T: .10,  C = F: .50
P(R|C):  C = T: .80,  C = F: .20

P(W|S,R):
  S R | P(W)
  T T | .99
  T F | .90
  F T | .90
  F F | .01
Keep sampling to estimate the joint probabilities of interest.
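As a concrete illustration, here is a minimal Python sketch of Prior-Sample applied to the sprinkler network above; the dictionary-based CPT representation and the function names are my own assumptions, not AIMA code.

```python
import random

# CPTs for the sprinkler network (values from the example above)
P_C = 0.5
P_S_given_C = {True: 0.10, False: 0.50}
P_R_given_C = {True: 0.80, False: 0.20}
P_W_given_SR = {(True, True): 0.99, (True, False): 0.90,
                (False, True): 0.90, (False, False): 0.01}

def bernoulli(p):
    return random.random() < p

def prior_sample():
    """Sample one complete event in topological order (Cloudy, Sprinkler, Rain, WetGrass)."""
    c = bernoulli(P_C)
    s = bernoulli(P_S_given_C[c])
    r = bernoulli(P_R_given_C[c])
    w = bernoulli(P_W_given_SR[(s, r)])
    return {'Cloudy': c, 'Sprinkler': s, 'Rain': r, 'WetGrass': w}

# estimate a joint probability by counting, e.g. P(Rain = true, WetGrass = true)
n = 100_000
count = 0
for _ in range(n):
    x = prior_sample()
    if x['Rain'] and x['WetGrass']:
        count += 1
print(count / n)
```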
Rejection sampling
P̂(X|e) estimated from samples agreeing with e
function Rejection-Sampling(X,e, bn,N) returns an estimate of P (X |e)
local variables: N, a vector of counts over X, initially zero
for j = 1 to N do
x←Prior-Sample(bn)
if x is consistent with e then
N[x]←N[x]+1 where x is the value of X in x
return Normalize(N[X])
E.g., estimate P(Rain|Sprinkler = true) using 100 samples
27 samples have Sprinkler = true
Of these, 8 have Rain = true and 19 have Rain = false.
P̂(Rain|Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩
Similar to a basic real-world empirical estimation procedure
What if we do have some evidence? Rejection sampling.
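A minimal sketch of rejection sampling as a generic counting procedure; the `sampler` argument is assumed to be a function like the `prior_sample` sketch above that returns one complete assignment as a dict (these names are illustrative, not AIMA's).

```python
from collections import Counter

def rejection_sampling(query_var, evidence, sampler, n=10_000):
    """Estimate P(query_var | evidence) from prior samples that agree with the evidence."""
    counts = Counter()
    for _ in range(n):
        x = sampler()
        if all(x[var] == val for var, val in evidence.items()):
            counts[x[query_var]] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples were consistent with the evidence")
    return {val: c / total for val, c in counts.items()}

# e.g. rejection_sampling('Rain', {'Sprinkler': True}, prior_sample)
```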
Analysis of rejection sampling
P̂(X|e) = α N_PS(X, e)           (algorithm defn.)
        = N_PS(X, e) / N_PS(e)   (normalized by N_PS(e))
        ≈ P(X, e) / P(e)         (property of Prior-Sample)
        = P(X|e)                 (defn. of conditional probability)
Hence rejection sampling returns consistent posterior estimates.
Problem: hopelessly expensive if P(e) is small, since P(e) drops off exponentially with the number of evidence variables!
Approximate inference using MCMC
“State” of network = current assignment to all variables.
Generate next state by sampling one variable given Markov blanket
Sample each variable in turn, keeping evidence fixed
function MCMC-Ask(X,e, bn,N) returns an estimate of P (X |e)
local variables: N[X ], a vector of counts over X, initially zero
Z, the nonevidence variables in bn
x, the current state of the network, initially copied from e
initialize x with random values for the variables in Z
for j = 1 to N do
for each Zi in Z do
sample the value of Zi in x from P(Zi |mb(Zi))
given the values of MB(Zi) in x
N[x ]←N[x ] + 1 where x is the value of X in x
return Normalize(N[X ])
Can also choose a variable to sample at random each time
Approximate inference using Markov Chain Monte Carlo (MCMC)
The extent of dependencies in Bayesian networks
• A node X is conditionally independent of its non-descendants, given its parents.
• A node X is conditionally independent of all other nodes in the network, given its Markov blanket: its parents, its children, and its children's other parents.
The Markov chain
With Sprinkler = true, WetGrass = true, there are four states:
[Figure: the four network states, corresponding to the four assignments of (Cloudy, Rain), with Sprinkler = true and WetGrass = true held fixed.]
Wander about for a while, average what you see
MCMC example contd.
Estimate P(Rain|Sprinkler = true,WetGrass = true)
Sample Cloudy or Rain given its Markov blanket, repeat.
Count number of times Rain is true and false in the samples.
E.g., visit 100 states
31 have Rain = true, 69 have Rain = false
P̂(Rain|Sprinkler = true,WetGrass = true)
= Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩
Theorem: chain approaches stationary distribution:
long-run fraction of time spent in each state is exactly
proportional to its posterior probability
After obtaining the MCMC samples
But:
1. Difficult to tell when samples have converged. Theorem only applies in
limit, and it could take time to “settle in”.
2. Can also be inefficient if each state depends on many other variables.
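For concreteness, here is a minimal Python sketch of MCMC-Ask specialized to the query above, Gibbs-sampling Cloudy and Rain with the evidence held fixed; the CPT values come from the example slides, while the function names and code structure are my own assumptions.

```python
import random

# CPTs for the sprinkler network (values from the example slides)
P_C = 0.5
P_S_given_C = {True: 0.10, False: 0.50}
P_R_given_C = {True: 0.80, False: 0.20}
P_W_given_SR = {(True, True): 0.99, (True, False): 0.90,
                (False, True): 0.90, (False, False): 0.01}

def bernoulli(p):
    return random.random() < p

def sample_cloudy(rain, sprinkler):
    # P(C | mb(C)) is proportional to P(C) P(S|C) P(R|C)
    w = {}
    for c in (True, False):
        pc = P_C if c else 1 - P_C
        ps = P_S_given_C[c] if sprinkler else 1 - P_S_given_C[c]
        pr = P_R_given_C[c] if rain else 1 - P_R_given_C[c]
        w[c] = pc * ps * pr
    return bernoulli(w[True] / (w[True] + w[False]))

def sample_rain(cloudy, sprinkler, wet):
    # P(R | mb(R)) is proportional to P(R|C) P(W|S,R)
    w = {}
    for r in (True, False):
        pr = P_R_given_C[cloudy] if r else 1 - P_R_given_C[cloudy]
        pw = P_W_given_SR[(sprinkler, r)] if wet else 1 - P_W_given_SR[(sprinkler, r)]
        w[r] = pr * pw
    return bernoulli(w[True] / (w[True] + w[False]))

def mcmc_rain_given_evidence(n=100_000):
    """Estimate P(Rain | Sprinkler=true, WetGrass=true) by Gibbs sampling."""
    sprinkler, wet = True, True                    # evidence, kept fixed
    cloudy, rain = bernoulli(0.5), bernoulli(0.5)  # random initial state
    count = 0
    for _ in range(n):
        cloudy = sample_cloudy(rain, sprinkler)
        rain = sample_rain(cloudy, sprinkler, wet)
        count += rain
    return count / n
```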
• The model represents stochastic binary features.
• Each input x_i encodes the probability that the ith binary input feature is present.
• The set of features represented by ϕ_j is defined by weights f_ij, which encode the probability that feature i is an instance of ϕ_j.
• Trick: it is easier to adapt weights in an unbounded space, so use a transformation of the probabilities f_ij to unconstrained weights, and optimize in w-space.
Gibbs sampling (back to the noisy-OR example)
Beyond tables: modeling causal relationships using Noisy-OR
• We assume each cause C_j can produce effect E_i with probability f_ij.
• The noisy-OR model assumes the parent causes of effect E_i contribute independently.
• The probability that none of them caused effect E_i is simply the product of the probabilities that each one did not cause E_i.
• The probability that any of them caused E_i is one minus the above, i.e.

$$P(E_i \mid \mathrm{par}(E_i)) = P(E_i \mid C_1, \ldots, C_n) = 1 - \prod_j \big(1 - P(E_i \mid C_j)\big) = 1 - \prod_j (1 - f_{ij})$$

Example: catching a cold (C) can be caused by being hit by a viral droplet (D), touching a contaminated object (O), or eating contaminated food (F), with link probabilities f_CD, f_CO, and f_CF:

$$P(C \mid D, O, F) = 1 - (1 - f_{CD})(1 - f_{CO})(1 - f_{CF})$$
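The combination rule above is easy to compute directly; here is a minimal sketch (illustrative names, not from the slides):

```python
import numpy as np

def noisy_or(cause_states, link_probs):
    """P(effect = 1 | causes): one minus the product, over active causes, of (1 - f).

    cause_states: binary array (1 if the cause is present)
    link_probs:   f_j = P(effect | cause j alone)
    """
    cause_states = np.asarray(cause_states, dtype=float)
    link_probs = np.asarray(link_probs, dtype=float)
    return 1.0 - np.prod((1.0 - link_probs) ** cause_states)

# e.g. the cold example with D and O present, F absent:
# noisy_or([1, 1, 0], [f_CD, f_CO, f_CF]) == 1 - (1 - f_CD)(1 - f_CO)
```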
Inferring the best representation of the observed variables
• Given the input D, there is no simple way to determine which states are the input's most likely causes.
– Computing the most probable network state is an inference process:
– we want to find the explanation of the data with the highest probability
– this can be done efficiently with Gibbs sampling
• Gibbs sampling is another example of an MCMC method.
• Key idea: the samples are guaranteed to converge to the true posterior probability distribution.
Gibbs Sampling
Gibbs sampling is a way to select an ensemble of states that are representative of the posterior distribution P(S|D,W).
• Each state of the network is updated iteratively according to the probability of S_i given the remaining states.
• This conditional probability can be computed using (Neal, 1992)

$$P(S_i = a \mid S_j : j \ne i, W) \propto P(S_i = a \mid \mathrm{pa}(S_i), W) \prod_{j \in \mathrm{ch}(S_i)} P(S_j \mid \mathrm{pa}(S_j), S_i = a, W)$$

• The limiting ensemble of states will be typical samples from P(S|D,W).
• This also works if any subset of states is fixed and the rest are sampled.
Network Interpretation of the Gibbs Sampling Equation
The probability of S_i changing state, given the remaining states, is

$$P(S_i = 1 - S_i \mid S_j : j \ne i, W) = \frac{1}{1 + \exp(-\Delta x_i)}$$

where Δx_i indicates how much changing the state S_i changes the probability of the whole network state:

$$\Delta x_i = \log h(u_i; 1 - S_i) - \log h(u_i; S_i) + \sum_{j \in \mathrm{ch}(S_i)} \big[ \log h(u_j + \delta_{ij}; S_j) - \log h(u_j; S_j) \big]$$

• u_i is the causal input to S_i: $u_i = \sum_k S_k w_{ki}$
• δ_ij specifies the change in u_j for a change in S_i: $\delta_{ij} = +w_{ij}$ if $S_i = 0$, or $-w_{ij}$ if $S_i = 1$
The Gibbs sampling equations (derivation omitted)
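A minimal sketch of one Gibbs update using the flip probability above; reading h(u; S) as h(u) when S = 1 and 1 − h(u) when S = 0, and taking δ_ij as the change in the child's causal input when S_i flips, are my assumptions, and the data structures are illustrative.

```python
import numpy as np

def h(u):
    # combination rule from the slides: h(u) = 1 - exp(-u)
    return 1.0 - np.exp(-u)

def h_state(u, s):
    # assumed reading of h(u; S): probability of the observed binary state s given input u
    return h(u) if s == 1 else 1.0 - h(u)

def gibbs_flip_prob(i, S, W, children):
    """P(S_i flips | all other states), following the expression for delta x_i above.

    S: binary state vector (NumPy array); W[k, j] is the weight from unit k to unit j;
    children[i] lists the indices of the children of unit i (illustrative structures).
    """
    u = S @ W                                       # causal input to every unit
    dx = np.log(h_state(u[i], 1 - S[i])) - np.log(h_state(u[i], S[i]))
    for j in children[i]:
        delta = W[i, j] if S[i] == 0 else -W[i, j]  # change in u_j if S_i flips
        dx += np.log(h_state(u[j] + delta, S[j])) - np.log(h_state(u[j], S[j]))
    return 1.0 / (1.0 + np.exp(-dx))
```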
Interpretation of the Gibbs sampling equation
• The Gibbs equation can be interpreted as: feedback + Σ feedforward.
• Feedback: how consistent is S_i with its current causes?
• Σ feedforward: how likely is S_i to be a cause of its children?
• Feedback allows the lower-level units to use information that is only computable at higher levels.
• Feedback determines (disambiguates) the state when the feedforward input is ambiguous.
The Shifter Problem
[Figure: A, the shift patterns; B, the weights of a 32-20-2 network after learning.]
Gibbs sampling: feedback disambiguates lower-level states. Once the structure is learned, the Gibbs updating converges in two sweeps.
Difficulties with identifying the causal structure
• The space of S, C is huge.
• If we define P(I|S,C), we must be able to invert it or search it.
• We need efficient algorithms.
• Many unknowns: identity of objects, types of scene elements, illuminations.
• We might never have encountered some structures.
• Is it even the right approach?
• Can we solve a simple case?
[Figure: graphical model with shape, material, viewpoint, and illumination as causes of the image.]
Motivation: learn hierarchical, context-dependent representations
• many real-world patterns are hierarchical in structure
• interpretation of patterns depends on context
• essential for complex recognition tasks
[Hierarchy: pixels → features → surface properties, contours, … → objects, scene properties and structure, …]
Toy problems – Handwritten digits
I – the data: pixel images (black or white), complex "random" patterns with simple underlying structure, generated by p(I|S, C)
S – the scene: the digit class, 0, 1, 2, …, 9

The inferred explanation is the most probable scene:

$$\hat{S} = \arg\max_S p(S \mid I, C) = \arg\max_S \frac{p(I \mid S, C)\, p(S \mid C)}{p(I \mid C)}$$
The toy problem:
is it a pattern or a collection of features?
The higher-order lines problem
[Figure: A, the true generative model, with cause probabilities 0.6 and 0.1, 0.1, 0.1, 0.1; B, patterns sampled from the model.]
Can we infer the structure of the network given only the patterns?
Weights in a 25-10-5 belief network after learning: the first layer of weights learns that patterns are combinations of lines; the second layer learns combinations of the first-layer features.
Logistic belief nets
• generative model: $p(v_i = 1) = \sigma\big(b_i + \sum_j h_j w_{ij}\big)$
• $\sigma(x) = \dfrac{1}{1 + \exp(-x)}$ is the logistic function
• with deep networks, it is possible to learn complex joint probability distributions
[Figure: layered network with bottom-up recognition connections and top-down generation connections.]
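A minimal sketch (illustrative, not Hinton's code) of top-down generation through one layer of a logistic belief net, using the rule above; the weight and bias names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_layer(h, W, b, rng=None):
    """Sample a lower layer v given the layer above h: p(v_i = 1) = sigmoid(b_i + sum_j h_j W[j, i])."""
    if rng is None:
        rng = np.random.default_rng()
    p = sigmoid(b + h @ W)
    return (rng.random(p.shape) < p).astype(int)
```

Generating a full pattern then amounts to sampling the top layer from its prior and applying `generate_layer` once per layer on the way down.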
Wake-sleep learning
• For probabilistic models:
– top-down weights generate patterns from the model distribution
– bottom-up weights convey the distribution of data vectors
– ideally the two distributions should match
• The "wake-sleep" algorithm adjusts the weights so that the distribution from recognition (wake) matches the distribution from generation (sleep).
Wake-sleep learning
• For each digit in the training set:
– bottom-up pass: use the recognition weights to stochastically set the hidden states, $p(h_j = 1) = \sigma\big(b_j + \sum_i v_i w_{ij}\big)$
– adjust the generative weights to improve how the model generates the training data: $\Delta w_{ji} \propto h_j (h_i - \hat{h}_i)$, where $\hat{h}_i$ is the probability of activating state i given the inferred states $h_j$
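A minimal sketch of one wake-phase step under the update rule above, applied to the bottom generative layer (visible units) for concreteness; the same form applies between hidden layers. The variable names, shapes, and learning rate are my own assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wake_phase_update(v, R, G, b_hid, b_vis, lr=0.01, rng=None):
    """One wake-phase step for a single visible/hidden layer pair.

    v: binary data vector; R: recognition weights, shape (n_vis, n_hid);
    G: generative weights, shape (n_hid, n_vis).
    """
    if rng is None:
        rng = np.random.default_rng()
    # bottom-up pass: stochastically set the hidden states with the recognition weights
    h = (rng.random(b_hid.shape) < sigmoid(b_hid + v @ R)).astype(float)
    # generative prediction of the data from the inferred hidden states
    v_hat = sigmoid(b_vis + h @ G)
    # adjust the generative weights: delta w_ji proportional to h_j * (v_i - v_hat_i)
    G = G + lr * np.outer(h, v - v_hat)
    b_vis = b_vis + lr * (v - v_hat)
    return G, b_vis
```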
Generative model for hand-written digits
from Hinton (2007)
• Generation:
– use alternating Gibbs sampling from the top-level associative memory
– use the directed weights to stochastically generate pixel probabilities from the sampled binary states of the 500 hidden units
• Recognition:
– use the bottom-up weights to produce binary activities in the two lower layers
– use alternating Gibbs sampling in the top two layers
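A minimal sketch of alternating Gibbs sampling in an RBM-style top-level associative memory; this is a generic illustration, not Hinton's implementation, and the weight and bias names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def alternating_gibbs(v, W, b_vis, b_hid, n_steps=20, rng=None):
    """Alternate between sampling the hidden and visible layers of the top-level
    associative memory, starting from an initial visible state v."""
    if rng is None:
        rng = np.random.default_rng()
    for _ in range(n_steps):
        p_h = sigmoid(b_hid + v @ W)            # hidden given visible
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(b_vis + h @ W.T)          # visible given hidden
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v
```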
Demo: deep-belief network model (Hinton)
http://www.cs.toronto.edu/~hinton/adi/index.htm