
L16 – Deep Belief Networks

EECS 391
Intro to AI

Deep Belief Networks

L16 Thu Nov 2

Michael S. Lewicki ◇ CWRU EECS 531: Computer Vision

Hierarchy of brain areas in the mammalian visual system

Flat map of macaque monkey brain and hierarchy of brain areas, from Felleman and Van Essen (1991); an annotation marks where simple and complex cells sit in the hierarchy.

Real neurons

subsequent spikes. To examine this relationship, the action potentials in response to auditory stimuli (during stimulus and background periods) were sorted according to the prepotential, defined here as the average potential in the 50 msec epoch before each spike relative to the average resting membrane potential. Spikes occurring within 30 msec of an already considered spike were excluded. The prepotential and number of following spikes were measured for three groups of neurons: song-specific cells that had a phasic response to the forward song (n = 3), song-specific cells that had a tonic response throughout the forward song (n = 3), and nonauditory cells that showed spontaneous bursting (n = 5). One-way ANOVA was performed for each group to determine whether there was a statistically significant change in prepotential for different numbers of subsequent spikes.

Figure 8 summarizes the distributions of prepotentials before each spike or spike burst for the three different groups. Song-specific neurons that responded phasically to song were significantly hyperpolarized before each action potential (p < 0.001, F test) (Fig. 8, left), but tonically excited song-specific cells showed no significant change in prepotential versus the number of following spikes (p > 0.5, F test) (Fig. 8, middle). The spike bursts produced by the nonauditory cells appear similar to those of the phasically excited song-specific cells and, like those cells, also showed a significant hyperpolarization before bursting compared with the prepotential for a single spike (p < 0.001, F test) (Fig. 8, right).

Examples of the trend suggested by this analysis can be seen using data taken from one of the phasically excited song-specific cells (the one shown in Fig. 6). The waveform patterns following the 10 most negative prepotentials of the data collected in response to the forward song are shown in Figure 9a. Nearly every trace is followed by a burst. The most positive prepotentials, however, are all followed by single action potentials (Fig. 9b). The other two phasically excited song-specific cells showed similar patterns.

Conversely, one can examine the membrane potential before a burst containing a certain number of spikes. Figure 10a shows four waveforms aligned on the first action potential of bursts containing six spikes. These are preceded by a consistent hyperpolarization, whereas the bursts containing only three spikes are not (Fig.

Figure 3. Precise timing in a phasically excited HVc cell. The horizontal line indicates the average resting potential, which was −59 mV. Some phasic cells show high regularity in the firing times of the bursts and in the subthreshold membrane potential. The action potentials in these bursts also show attenuation over the time course of the burst.

Figure 2. An intracellular recording of a phasically excited song-specific HVc cell. The conventions are the same as in Figure 1, except that in bottom
traces (a–c), the waveform rasters are overlaid to show the consistency of both the hyperpolarizations and of the temporal positions of the phasic bursts.
The boxed region indicates where this cell responded phasically with bursts of action potentials to the forward song. The phasic response is lost when the
syllable order is reversed or when the entire song is reversed.

Lewicki • Intracellular Responses of Song-Specific Neurons J. Neurosci., September 15, 1996, 16(18):5854–5863 5857

may subserve complex auditory neurons observed in other sys-
tems, such as the cat (Weinberger and McKenna, 1988; McKenna
et al., 1989), squirrel monkey (Wollberg and Newman, 1972;
Newman and Wollberg, 1973; Glass and Wollberg, 1983), and
rhesus monkey (Rauschecker et al., 1995). The temporal pattern
of the response of song-specific cells can be highly regular, even at
the level of the subthreshold membrane potential. It remains to be
seen whether the presence of such precise timing reflects that of
afferent cells or whether it is refined by the circuitry in HVc.
Precise timing is correlated with phasic bursting, and it is thus
possible that phasic bursting and its associated hyperpolarization
subserve a neural code utilizing spike timing. Such a code has
obvious utility in the context of song learning and production.

REFERENCES
Bonke D, Scheich H, Langner G (1979) Responsiveness of units in the

auditory neostriatum of the guinea fowl (Numida meleagris) to species-
specific calls and synthetic stimuli. I. Tonotopy and functional zones of
field L. J Comp Physiol [A] 132:243–255.

Bottjer SW, Miesner EA, Arnold AP (1984) Forebrain lesions disrupt
development but not maintenance of song in passerine birds. Science
224:901–903.

Doupe AJ, Konishi M (1991) Song-selective auditory circuits in the vocal
control system of the zebra finch. Proc Natl Acad Sci USA
88:11339–11343.

Doupe AJ, Konishi M (1992) Song-selective auditory neurons emerge
during vocal learning in the zebra finch. Soc Neurosci Abstr 18:527.

Fortune ES, Margoliash D (1994) In vivo characterization of identified
HVc neurons in the zebra finch. Soc Neurosci Abstr 20:165.

Fortune ES, Margoliash D (1995) Parallel pathways and convergence
onto HVc and adjacent neostriatum of adult zebra finches (Taeniopygia
guttata). J Comp Neurol 360:413–441.

Glass I, Wollberg Z (1983) Auditory-cortex responses to sequences of
normal and reversed squirrel-monkey vocalizations. Brain Behav Evol
22:13–21.

Hose B, Langner G, Scheich H (1987) Topographic representation of
periodicities in the forebrain of the myna bird: one map for pitch and
rhythm. Brain Res 422:367–373.

Jahnsen H, Llinas R (1984) Electrophysiological properties of guinea-pig
thalamic neurons: an in vitro study. J Physiol (Lond) 349:105–226.

Katz LC, Gurney ME (1981) Auditory responses in the zebra finch’s
motor system for song. Brain Res 211:192–197.

Kelley DB, Nottebohm F (1979) Projections of a telencephalic auditory
nucleus—field L—in the canary. J Comp Neurol 183:455–470.

Kirkwood A, Dudek SM, Gold JT, Aizenman CD, Bear MF (1993)
Common forms of synaptic plasticity in the hippocampus and neocortex
in vitro. Science 260:1518–1521.

Knipschild M, Dorrscheidt GJ, Rubsamen R (1992) Setting complex
tasks to single units in the avian auditory forebrain. I. Processing of
complex artificial stimuli. Hear Res 57:216–230.

Konishi M (1965) The role of auditory feedback in the control of vocal-
ization in the white-crowned sparrow. Z Tierpsychol 22:771–783.

Kubota M, Saito N (1991) Sodium-dependent and calcium-dependent
conductances of neurons in the zebra finch hyperstriatum-ventrale pars
caudale in vitro. J Physiol (Lond) 440:131–142.

Larson J, Wong D, Lynch G (1986) Patterned stimulation at the theta
frequency is optimal for the induction of hippocampal long-term po-
tentiation. Brain Res 368:347–350.

Leppelsack HJ (1983) Analysis of song in the auditory pathway of song-
birds. In: Advances in vertebrate neuroethology (Ewert JP, ed), pp
783–800. New York: Plenum.

Lewicki MS, Arthur BJ (1995) Sensitivity to auditory temporal context
increases significantly from field L to HVc. Soc Neurosci Abstr 21:958.

Lewicki MS, Konishi M (1995) Mechanisms underlying the sensitivity of
songbird forebrain neurons to temporal order. Proc Natl Acad Sci USA
92:5582–5586.

Mainen ZF, Sejnowski TJ (1995) Reliability of spike timing in neocorti-
cal neurons. Science 268:1503–1506.

Malenka RC (1994) Synaptic plasticity in the hippocampus: LTP and
LTD. Cell 78:535–538.

Figure 11. Photomicrographs of avidin–horseradish peroxidase-stained neurons. a, A song-specific HVc cell (data shown in Fig. 2); the axon of this cell could be traced to area X. b, c, Two nonauditory HVc cells; both cells had clear projections to RA. All scale bars, 20 μm.




The Neocognitron (Fukushima, 1980)

• features are pooled from local areas
• the aim is to allow recognition of a variety of feature positions, while preserving relative position

Michael S. Lewicki ◇ CWRUEECS 600: Computational Perception

What do you see?

[scatter plot of data points, axes x and y]

The data: a set of stochastic binary patterns. Each column is a distinct eight-dimensional binary feature; the figure also shows the true hidden causes of the data.

This is a learning problem, which we’ll cover in a later lecture.

Hierarchical Statistical Models
A Bayesian belief network:

[network diagram: parents pa(S_i) → S_i, with observed data D at the bottom]

The joint probability of binary states is

P(S|W) = ∏_i P(S_i | pa(S_i), W)

The probability of S_i depends only on its parents:

P(S_i = 1 | pa(S_i), W) = h(∑_j S_j w_ji)
P(S_i = 0 | pa(S_i), W) = 1 − h(∑_j S_j w_ji)

The function h specifies how causes are combined: h(u) = 1 − exp(−u), u > 0.

Main points:
• hierarchical structure allows the model to form high-order representations
• upper states are priors for lower states
• weights encode higher-order features
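As a concrete sketch (not from the slides), the two equations above give the joint probability of any complete state vector directly; the function name `joint_log_prob` and the small `bias` leak input are my assumptions, the latter so that h's argument stays positive even when no parent is active:

```python
import numpy as np

def h(u):
    # causes combine as a noisy OR: h(u) = 1 - exp(-u), u > 0
    return 1.0 - np.exp(-u)

def joint_log_prob(S, W, bias=1e-2):
    """Log of P(S|W) = prod_i P(S_i | pa(S_i), W) for binary states S.

    S    : (n,) 0/1 state vector, topologically ordered
    W    : (n, n) weights, W[j, i] = strength with which parent j causes unit i
    bias : small leak input (an assumption, keeps h's argument positive)
    """
    logp = 0.0
    for i in range(len(S)):
        u = bias + S @ W[:, i]          # total causal input to unit i
        p_on = h(u)
        logp += np.log(p_on if S[i] == 1 else 1.0 - p_on)
    return logp
```

Summing per-unit log terms rather than multiplying probabilities keeps the computation stable for larger networks.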

Another approach: Inference by stochastic simulation

Basic idea:

1. Draw N samples from a sampling distribution S

2. Compute an approximate posterior probability

3. Show this converges to the true probability

Sampling from an empty network

function Prior-Sample(bn) returns an event sampled from bn
  inputs: bn, a belief network specifying joint distribution P(X1, …, Xn)

  x ← an event with n elements
  for i = 1 to n do
    xi ← a random sample from P(Xi | parents(Xi)) given the values of Parents(Xi) in x
  return x

Chapter 14.4–5

Sampling with no evidence (from the prior)

Example

Network: Cloudy → Sprinkler, Cloudy → Rain, {Sprinkler, Rain} → WetGrass

P(C) = .50

P(S|C):          P(R|C):
  C = T : .10      C = T : .80
  C = F : .50      C = F : .20

P(W|S,R):
  S = T, R = T : .99
  S = T, R = F : .90
  S = F, R = T : .90
  S = F, R = F : .01


Keep sampling to estimate joint probabilities of interest.
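Prior-Sample applied to the sprinkler network above can be sketched in a few lines (the CPT values are the example's; variable and function names are mine):

```python
import random

# CPTs from the sprinkler example (dict keys are parent values)
P_C = 0.5
P_S = {True: 0.10, False: 0.50}                    # P(Sprinkler=true | Cloudy)
P_R = {True: 0.80, False: 0.20}                    # P(Rain=true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,    # P(WetGrass=true | S, R)
       (False, True): 0.90, (False, False): 0.01}

def prior_sample():
    """Sample each variable in topological order, given its sampled parents."""
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return {'Cloudy': c, 'Sprinkler': s, 'Rain': r, 'WetGrass': w}
```

Repeating prior_sample() many times and counting events of interest gives exactly the Monte Carlo estimates discussed next.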

Rejection sampling

P̂(X|e) estimated from samples agreeing with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X|e)
  local variables: N, a vector of counts over X, initially zero

  for j = 1 to N do
    x ← Prior-Sample(bn)
    if x is consistent with e then
      N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N[X])

E.g., estimate P(Rain|Sprinkler = true) using 100 samples
27 samples have Sprinkler = true

Of these, 8 have Rain = true and 19 have Rain = false.

P̂(Rain|Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩

Similar to a basic real-world empirical estimation procedure


What if we do have some evidence? Rejection sampling.
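A runnable sketch of rejection sampling for this query (names are mine; the snippet redefines the sprinkler CPTs so it stands alone). Note that P(Rain|Sprinkler=true) depends only on P(C), P(S|C), and P(R|C), and works out exactly: P(s) = .5(.1) + .5(.5) = .30, P(r, s) = .5(.1)(.8) + .5(.5)(.2) = .09, so P(r|s) = .09/.30 = 0.3:

```python
import random

P_C = 0.5
P_S = {True: 0.10, False: 0.50}
P_R = {True: 0.80, False: 0.20}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}

def prior_sample():
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return c, s, r, w

def rejection_sample_rain_given_sprinkler(n):
    """Keep only samples consistent with the evidence Sprinkler=true,
    then normalize the counts of Rain among the kept samples."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        c, s, r, w = prior_sample()
        if s:                           # consistent with the evidence?
            counts[r] += 1
    total = counts[True] + counts[False]
    return {value: k / total for value, k in counts.items()}
```

Long runs settle near ⟨0.3, 0.7⟩, consistent with the ⟨0.296, 0.704⟩ from the 100-sample example; only about 30% of samples survive rejection here, which illustrates why the method becomes hopeless when P(e) is small.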

Analysis of rejection sampling

P̂(X|e) = αNPS(X, e) (algorithm defn.)
= NPS(X, e)/NPS(e) (normalized by NPS(e))
≈ P(X, e)/P (e) (property of PriorSample)
= P(X|e) (defn. of conditional probability)

Hence rejection sampling returns consistent posterior estimates

Problem: hopelessly expensive if P (e) is small

P (e) drops off exponentially with number of evidence variables!

Chapter 14.4–5 24


Approximate inference using MCMC

“State” of network = current assignment to all variables.

Generate next state by sampling one variable given Markov blanket
Sample each variable in turn, keeping evidence fixed

function MCMC-Ask(X, e, bn, N) returns an estimate of P(X|e)
  local variables: N[X], a vector of counts over X, initially zero
    Z, the nonevidence variables in bn
    x, the current state of the network, initially copied from e

  initialize x with random values for the variables in Z
  for j = 1 to N do
    for each Zi in Z do
      sample the value of Zi in x from P(Zi | mb(Zi)) given the values of MB(Zi) in x
      N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N[X])

Can also choose a variable to sample at random each time


Approximate inference using Markov Chain Monte Carlo (MCMC)
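MCMC-Ask specialized to the sprinkler query can be sketched as follows (function names and the CPT assignment are my assumptions; the two nonevidence variables Cloudy and Rain are each resampled from their Markov blanket):

```python
import random

P_C = 0.5
P_S = {True: 0.10, False: 0.50}    # P(Sprinkler=true | Cloudy)
P_R = {True: 0.80, False: 0.20}    # P(Rain=true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}

def bern(p):
    return random.random() < p

def mcmc_rain(n_sweeps, s=True, w=True):
    """Gibbs sampling for P(Rain | Sprinkler=s, WetGrass=w)."""
    c, r = bern(0.5), bern(0.5)        # random initial state of nonevidence vars
    counts = {True: 0, False: 0}
    for _ in range(n_sweeps):
        # resample Cloudy: P(C | s, r) is proportional to P(C) P(s|C) P(r|C)
        pt = P_C * (P_S[True] if s else 1 - P_S[True]) * (P_R[True] if r else 1 - P_R[True])
        pf = (1 - P_C) * (P_S[False] if s else 1 - P_S[False]) * (P_R[False] if r else 1 - P_R[False])
        c = bern(pt / (pt + pf))
        # resample Rain: P(R | c, s, w) is proportional to P(R|c) P(w|s,R)
        pt = P_R[c] * (P_W[(s, True)] if w else 1 - P_W[(s, True)])
        pf = (1 - P_R[c]) * (P_W[(s, False)] if w else 1 - P_W[(s, False)])
        r = bern(pt / (pt + pf))
        counts[r] += 1
    return {k: v / n_sweeps for k, v in counts.items()}
```

With these CPTs, exact inference gives P(Rain = true | s, w) = 0.0891/0.2781 ≈ 0.320, and long runs of the chain settle there; the ⟨0.31, 0.69⟩ in the slide is an illustrative 100-state count, not the exact posterior.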

The extent of dependencies in Bayesian networks

• A node X is conditionally independent of its non-descendants given its parents.

• A node X is conditionally independent of all other nodes in the network given its Markov blanket: its parents, its children, and its children’s other parents.

The Markov chain

With Sprinkler = true, WetGrass = true, there are four states: the four joint assignments to Cloudy and Rain (the evidence variables stay fixed).

Wander about for a while, average what you see



MCMC example contd.

Estimate P(Rain|Sprinkler = true,WetGrass = true)

Sample Cloudy or Rain given its Markov blanket, repeat.
Count number of times Rain is true and false in the samples.

E.g., visit 100 states
31 have Rain = true, 69 have Rain = false

P̂(Rain|Sprinkler = true,WetGrass = true)
= Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩

Theorem: chain approaches stationary distribution:
long-run fraction of time spent in each state is exactly
proportional to its posterior probability


After obtaining the MCMC samples

But:
1. It is difficult to tell when the samples have converged: the theorem only applies in the limit, and the chain can take time to “settle in”.
2. It can also be inefficient if each state depends on many other variables.

• Model represents stochastic binary features.
• Each input xi encodes the probability that the ith binary input feature is present.
• The set of features represented by ϕj is defined by weights fij, which encode the probability that feature i is an instance of ϕj.
• Trick: it’s easier to adapt weights in an unbounded space, so transform the bounded probabilities fij and optimize in w-space.

Gibbs sampling (back to the noisy-OR example)

Beyond tables: modeling causal relationships using Noisy-OR

• We assume each cause Cj can produce
effect Ei with probability fij.

• The noisy-OR model assumes the
parent causes of effect Ei contribute
independently.

• The probability that none of them
caused effect Ei is simply the product of
the probabilities that each one did not
cause Ei.

• The probability that any of them caused
Ei is just one minus the above, i.e.

P(Ei | par(Ei)) = P(Ei | C1, …, Cn)
               = 1 − ∏_j (1 − P(Ei | Cj))
               = 1 − ∏_j (1 − fij)

Example: catching a cold (C) can be caused by being hit by a viral droplet (D), touching a contaminated object (O), or eating contaminated food (F), with strengths f_CD, f_CO, and f_CF:

P(C | D, O, F) = 1 − (1 − f_CD)(1 − f_CO)(1 − f_CF)
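The product form is easy to check numerically; a minimal sketch (the f values below are illustrative, not from the slides):

```python
def noisy_or(strengths):
    """P(effect | its active causes): one minus the probability that
    no active cause produced the effect."""
    p_none = 1.0
    for f in strengths:
        p_none *= 1.0 - f
    return 1.0 - p_none

# cold example with f_CD = 0.3, f_CO = 0.2, f_CF = 0.1 (illustrative values)
p_cold = noisy_or([0.3, 0.2, 0.1])   # 1 - 0.7 * 0.8 * 0.9
```

Note that the noisy-OR CPT needs only n parameters for n parents, instead of the 2^n entries of a full table.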


Inferring the best representation of the observed variables

• Given the input D, there is no simple way to determine which states are its most likely causes.
  – Computing the most probable network state is an inference process.
  – We want to find the explanation of the data with the highest probability.
  – This can be done efficiently with Gibbs sampling.

• Gibbs sampling is another example of an MCMC method.
• Key idea: the samples are guaranteed to converge to the true posterior probability distribution.

Gibbs Sampling

Gibbs sampling is a way to select an ensemble of states that are representative of
the posterior distribution P (S|D,W).

• Each state of the network is updated iteratively according to the probability of
Si given the remaining states.

• this conditional probability can be computed using (Neal, 1992):

P(S_i = a | S_j : j ≠ i, W) ∝ P(S_i = a | pa(S_i), W) ∏_{j∈ch(S_i)} P(S_j | pa(S_j), S_i = a, W)

• the limiting ensemble of states will be typical samples from P(S|D,W)
• this also works if any subset of states is fixed and the rest are sampled

Network Interpretation of the Gibbs Sampling Equation

The probability of S_i changing state given the remaining states is

P(S_i = 1 − S_i | S_j : j ≠ i, W) = 1 / (1 + exp(−Δx_i))

Δx_i indicates how much changing the state S_i changes the probability of the whole network state:

Δx_i = log h(u_i; 1 − S_i) − log h(u_i; S_i) + ∑_{j∈ch(S_i)} [log h(u_j + Δ_ij; S_j) − log h(u_j; S_j)]

• u_i is the causal input to S_i: u_i = ∑_k S_k w_ki
• Δ_ij specifies the change in u_j for a change in S_i: Δ_ij = +w_ij if S_i = 0, or −w_ij if S_i = 1
• here h(u; 1) = h(u) and h(u; 0) = 1 − h(u)

The Gibbs sampling equations (derivation omitted)
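The flip probability can be implemented directly; the sketch below (names and the `bias` leak input are my assumptions) computes Δx from the factors of the joint that involve S_i, which is equivalent to the expression above since all other factors cancel in the difference:

```python
import numpy as np

def h(u, s):
    # P(unit = s | causal input u) under noisy-OR: h(u) = 1 - exp(-u) for s = 1
    return 1.0 - np.exp(-u) if s == 1 else np.exp(-u)

def flip_probability(i, S, W, bias=1e-2):
    """Probability that Gibbs sampling flips unit i, given all other states.

    W[j, k] is the weight from parent j to child k; `bias` is a small leak
    input (an assumption, keeps h's argument positive).
    """
    def local_log_prob(S):
        # factors of the joint that involve unit i: its own term + its children's
        lp = np.log(h(bias + S @ W[:, i], S[i]))
        for j in np.nonzero(W[i])[0]:           # children of unit i
            lp += np.log(h(bias + S @ W[:, j], S[j]))
        return lp
    S_flip = S.copy()
    S_flip[i] = 1 - S[i]
    dx = local_log_prob(S_flip) - local_log_prob(S)
    return 1.0 / (1.0 + np.exp(-dx))
```

Sweeping over the units, flipping each with this probability, yields the sampler described in the slides.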

Interpretation of the Gibbs sampling equation

• The Gibbs equation can be interpreted as: feedback + ∑ feedforward
• feedback: how consistent is S_i with its current causes?
• ∑ feedforward: how likely is S_i to be a cause of its children?
• feedback allows the lower-level units to use information only computable at higher levels
• feedback determines (disambiguates) the state when the feedforward input is ambiguous

The Shifter Problem

[A: shift patterns; B: weights of a 32-20-2 network after learning]

Gibbs sampling: feedback disambiguates lower-level states

[Gibbs sweeps 1–4 shown]

Once the structure is learned, Gibbs updating converges in two sweeps.

Difficulties with identifying the causal structure

• Space of S, C is huge
• If we define P(I|S,C), we must be able to invert it or search it
• Need efficient algorithms
• Many unknowns: identity of objects, types of scene elements, illuminations
• Might never have encountered some structures
• Is it even the right approach?
• Can we solve a simple case?

[diagram: shape, material, illumination, and viewpoint generate the image]

Motivation: learn hierarchical, context-dependent representations

• many real-world patterns are hierarchical in structure
• interpretation of patterns depends on context
• essential for complex recognition tasks

Representational hierarchy: pixels → features → surface properties, contours, … → objects, scene properties and structure, …


Toy problems – Handwritten digits

Digit classes: 0 1 2 … 9. Model: p(I|S, C).

Data: pixels (black or white); complex “random” patterns with simple underlying structure.

I –
S –

The inferred explanation is the most probable scene:

Ŝ = arg max_S p(S | I, C) = arg max_S p(I | S, C) p(S | C) / p(I | C)

CP08: Perceptual Inference · Michael S. Lewicki, CMU


The toy problem:
is it a pattern or a collection of features?

The higher-order lines problem

[A: the true generative model, with probabilities 0.6 and 0.1, 0.1, 0.1, 0.1; B: patterns sampled from the model]

Can we infer the structure of the network given only the patterns?

Weights in a 25-10-5 belief network after learning

The first layer of weights learns that patterns are combinations of lines; the second layer learns combinations of the first-layer features.

Logistic belief nets

• generative model: p(v_i = 1) = σ(b_i + ∑_j h_j w_ij)

• σ is the logistic function: σ(x) = 1 / (1 + exp(−x))

• with deep networks, it is possible to learn complex joint probability distributions

(recognition: bottom-up pass; generation: top-down pass)
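The generative pass is a one-liner; a minimal sketch (the array convention W[i, j] = weight from hidden unit j to visible unit i, and the names, are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_visible(h, W, b):
    """Top-down generation in a logistic belief net:
    p(v_i = 1) = sigmoid(b_i + sum_j h_j W[i, j]),
    then sample each visible unit independently."""
    p = sigmoid(b + W @ h)
    v = (rng.random(p.shape) < p).astype(int)
    return v, p
```

Stacking such layers and sampling top-down is exactly what makes deep belief nets generative models.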

Wake-sleep learning

• For probabilistic models
  – top-down weights generate patterns from the model distribution
  – bottom-up weights convey the distribution of data vectors
  – ideally the two distributions should match

• The “wake-sleep” algorithm adjusts the weights so that the distribution from recognition (wake) matches the distribution from generation (sleep).

Wake-sleep learning

• For each digit in the training set:
  – bottom-up pass: use recognition weights to stochastically set the hidden states: p(h_j = 1) = σ(b_j + ∑_i v_i w_ij)
  – adjust the generative weights to improve how the model generates the training data: Δw_ji ∝ h_j (h_i − ĥ_i)
  – ĥ_i is the probability of activating state i given the inferred states h_j
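The wake-phase generative update can be sketched for a single hidden layer driving a visible layer v (names, shapes, and the learning rate are my assumptions; the delta rule is the Δw_ji ∝ h_j (h_i − ĥ_i) above):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wake_phase_step(v, R, G, b, lr=0.1):
    """One wake-phase step.

    v : (nv,) binary data vector
    R : (nv, nh) recognition weights (bottom-up)
    G : (nh, nv) generative weights (top-down), updated in place
    b : (nv,) generative biases
    """
    # bottom-up pass: stochastically set hidden states from the data
    h = (rng.random(R.shape[1]) < sigmoid(v @ R)).astype(float)
    # probability the generative model assigns to each visible unit given h
    v_hat = sigmoid(b + h @ G)
    # delta rule: move the generative reconstruction toward the data
    G += lr * np.outer(h, v - v_hat)
    return h, v_hat
```

The sleep phase is the mirror image: fantasize data top-down and train the recognition weights to recover the hidden states that produced it.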

Generative model for hand-written digits

from Hinton (2007)

• Generation:
  – use alternating Gibbs sampling from the top-level associative memory
  – use directed weights to stochastically generate pixel probabilities from the sampled binary states of the 500 hidden units

• Recognition:
  – use bottom-up weights to produce binary activities in the two lower layers
  – use alternating Gibbs sampling in the top two layers
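The top-level associative memory is a restricted Boltzmann machine, so “alternating Gibbs sampling” just bounces between its two layers; a minimal sketch (shapes, names, and the bias layout are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def alternating_gibbs(v, W, a, b, steps=100):
    """Alternate h ~ p(h|v) and v ~ p(v|h) in an RBM.

    v : (nv,) initial visible state
    W : (nv, nh) weights; a : (nv,) visible biases; b : (nh,) hidden biases
    """
    for _ in range(steps):
        h = (rng.random(W.shape[1]) < sigmoid(b + v @ W)).astype(float)
        v = (rng.random(W.shape[0]) < sigmoid(a + W @ h)).astype(float)
    return v
```

In the digit model this chain runs in the top two layers; the directed weights below then map the sampled binary states down to pixel probabilities.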

Demo: deep-belief network model (Hinton)

http://www.cs.toronto.edu/~hinton/adi/index.htm