L20 – Neural Networks
k-means clustering (recap)
• Idea: try to estimate k cluster centers by
minimizing “distortion”
• Define distortion as:
• r_nk is 1 for the cluster mean closest to x_n.
• Each point x_n is scored by its squared distance from its closest center.
• How do we learn the cluster means?
• Need to derive a learning rule.
D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

r_{nk} = 1 if x_n \in cluster k, 0 otherwise.
Deriving a learning rule for the cluster means
• Our objective function is:
• Differentiate w.r.t. the mean (the parameter we want to estimate):
• We know the optimum is where the derivative is zero.
• Here, we can solve for the mean:
• This is simply a weighted mean for
each cluster.
• Thus we have a simple estimation
algorithm (k-means clustering)
1. select k points at random
2. estimate (update) means
3. repeat until converged
• convergence (to a local minimum)
is guaranteed
D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

\frac{\partial D}{\partial \mu_k} = -2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)

\frac{\partial D}{\partial \mu_k} = -2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0

\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}
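The three-step algorithm above can be sketched in a few lines of NumPy (an illustrative implementation, not course-supplied code):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: alternate nearest-mean assignments r_nk and mean updates mu_k."""
    rng = np.random.default_rng(seed)
    # 1. select k points at random as the initial means
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assignment: r_nk = 1 for the closest mean (stored as an index per point)
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = d.argmin(axis=1)
        # 2. update: mu_k = sum_n r_nk x_n / sum_n r_nk
        for j in range(k):
            if (r == j).any():
                mu[j] = X[r == j].mean(axis=0)
    # 3. distortion D for the final assignment
    D = ((X - mu[r]) ** 2).sum()
    return mu, r, D
```

Each pass can only lower the distortion, which is why convergence to a local minimum is guaranteed.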
[Figure: distortion D plotted against k = number of clusters]
How do we choose k?
• Increasing k will always decrease our distortion, and so will overfit the data.
– How can we avoid this?
– Or how do we choose the best k?
• One way: cross validation
• Use our distortion metric:
• Then just measure the distortion on
a test data set, and stop when we
reach a minimum.
D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2
[Figure: k = 10 clusters plotted by 1st PC score vs. 2nd PC score, and a learning curve where training set error keeps falling while test set error turns up: overfitting]
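The cross-validation recipe above can be sketched as follows (a toy illustration with made-up blob data; `fit_means` and `distortion` are hypothetical helpers, not course code):

```python
import numpy as np

def fit_means(X, k, iters=20, seed=0):
    """Run k-means on X and return the k cluster means."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        r = ((X[:, None] - mu[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (r == j).any():
                mu[j] = X[r == j].mean(0)
    return mu

def distortion(X, mu):
    """D = sum_n min_k ||x_n - mu_k||^2: each point scored against its closest mean."""
    return ((X[:, None] - mu[None]) ** 2).sum(-1).min(1).sum()

# Made-up data: two Gaussian blobs, split into train and test halves.
rng = np.random.default_rng(2)
blobs = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0.0, 6.0)])
rng.shuffle(blobs)
train, test = blobs[:40], blobs[40:]

# Training distortion always drops as k grows; the held-out distortion is
# what tells us when extra clusters stop helping.
train_D = [distortion(train, fit_means(train, k)) for k in (1, 2, 3, 4)]
test_D = [distortion(test, fit_means(train, k)) for k in (1, 2, 3, 4)]
```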
EECS 391 Intro to AI – Neural Networks (L20, Tue Nov 16)
The Iris dataset with decision tree boundaries
[Figure: Iris data, petal length (cm) vs. petal width (cm), with decision tree boundaries]
The optimal decision boundary for C2 vs C3
[Figure: petal length (cm) vs. petal width (cm), with the class-conditional densities p(petal length | C2) and p(petal length | C3) and the resulting decision boundary]
Is this really optimal?
Could we do better?
Arbitrary decision boundaries would be more powerful
[Figure: Iris data, petal length (cm) vs. petal width (cm), with an arbitrary decision boundary]
Decision boundaries
could be non-linear
[Figure: a non-linear decision boundary in the (x_1, x_2) plane]
Defining a decision boundary
The decision boundary:
y = \mathbf{m}^T \mathbf{x} + b = 0

\mathbf{x} \in class 1 if y \ge 0, class 2 if y < 0.

On the boundary, m_1 x_1 + m_2 x_2 = -b, so

x_2 = -\frac{m_1 x_1 + b}{m_2}

Or in terms of scalars:

y = \mathbf{m}^T \mathbf{x} + b = m_1 x_1 + m_2 x_2 + b = \sum_i m_i x_i + b
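These definitions translate directly to code; the weights and bias here are made-up example values:

```python
import numpy as np

m = np.array([1.0, 2.0])   # example weight vector (illustrative values)
b = -4.0                   # example bias

def classify(x):
    """Class 1 if y = m.x + b >= 0, class 2 otherwise."""
    y = m @ np.asarray(x, dtype=float) + b
    return 1 if y >= 0 else 2

def boundary_x2(x1):
    """Solve m1*x1 + m2*x2 + b = 0 for x2: a point on the boundary line."""
    return -(m[0] * x1 + b) / m[1]
```

For example, with these weights the point (0, 2) lies exactly on the boundary (y = 0), so it falls in class 1.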
Linear separability
[Figure: Iris data, petal length (cm) vs. petal width (cm): one pair of classes is linearly separable, the other is not]
Diagraming the classifier as a “neural” network
[Diagram: input units x_1 … x_M connect through weights w_1 … w_M to the output unit y; the bias b can instead be written as a weight w_0 on a constant input x_0 = 1.]

With an explicit bias:

y = \mathbf{w}^T \mathbf{x} + b = \sum_{i=1}^{M} w_i x_i + b

With the bias absorbed (x_0 = 1, w_0 = b):

y = \mathbf{w}^T \mathbf{x} = \sum_{i=0}^{M} w_i x_i
Determining, i.e. learning, the optimal linear discriminant
• First we must define an objective function, i.e. the goal of learning
• Simple idea: adjust weights so that the output y(x_n) matches the class c_n
• Objective: minimize the sum-squared error over all patterns x_n:
• Note the notation xn defines a pattern vector:
• We can define the desired class as:
E = \frac{1}{2} \sum_{n=1}^{N} (\mathbf{w}^T \mathbf{x}_n - c_n)^2

\mathbf{x}_n = \{x_1, \ldots, x_M\}_n

c_n = 0 if \mathbf{x}_n \in class 1; \; c_n = 1 if \mathbf{x}_n \in class 2
We’ve seen this before: curve fitting
example from Bishop (2006), Pattern Recognition and Machine Learning
t = \sin(2\pi x) + noise

[Figure: data points (x_n, t_n) sampled from the noisy sine, with the fitted curve y(x_n, w)]
Neural networks compared to polynomial curve fitting
[Figure: four polynomial fits of increasing degree M to the noisy \sin(2\pi x) data]

y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big[ y(x_n, \mathbf{w}) - t_n \big]^2
example from Bishop (2006), Pattern Recognition and Machine Learning
For the linear
network, M=1 and
there are multiple
input dimensions
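The polynomial fit can be reproduced with an ordinary least-squares solve on a design matrix of powers of x (a standard sketch, not the book's code):

```python
import numpy as np

def fit_poly(x, t, M):
    """Minimize E(w) = 1/2 sum_n [y(x_n, w) - t_n]^2 for y(x, w) = sum_j w_j x^j."""
    A = np.vander(x, M + 1, increasing=True)   # columns: x^0, x^1, ..., x^M
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w

def predict(x, w):
    return np.vander(x, len(w), increasing=True) @ w
```

With M = 1 this is exactly the linear network: one input dimension plus the bias term w_0.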
General form of a linear network
• A linear neural network is simply a
linear transformation of the input.
• Or, in matrix-vector form:
• Multiple outputs correspond to multivariate regression
\mathbf{y} = \mathbf{W} \mathbf{x}

y_j = \sum_{i=0}^{M} w_{i,j} x_i

[Diagram: inputs x_0 = 1, x_1, …, x_M connected by the weights W (entries w_{ij}) to outputs y_1, …, y_K]
Training the network: Optimization by gradient descent
• We can adjust the weights incrementally
to minimize the objective function.
• This is called gradient descent
• Or gradient ascent if we’re maximizing.
• The gradient descent rule for weight wi is:
• Or in vector form:
• For gradient ascent, the sign
of the gradient step changes.
[Figure: error-surface contours in (w_1, w_2) with successive gradient steps w^1, w^2, w^3, w^4 descending toward the minimum]

w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}

\mathbf{w}^{t+1} = \mathbf{w}^t - \epsilon \frac{\partial E}{\partial \mathbf{w}}
Computing the gradient
• Idea: minimize error by gradient descent
• Take the derivative of the objective function wrt the weights:
• And in vector form:
E = \frac{1}{2} \sum_{n=1}^{N} (\mathbf{w}^T \mathbf{x}_n - c_n)^2

\frac{\partial E}{\partial w_i} = \frac{2}{2} \sum_{n=1}^{N} (w_0 x_{0,n} + \cdots + w_i x_{i,n} + \cdots + w_M x_{M,n} - c_n) \, x_{i,n} = \sum_{n=1}^{N} (\mathbf{w}^T \mathbf{x}_n - c_n) \, x_{i,n}

\frac{\partial E}{\partial \mathbf{w}} = \sum_{n=1}^{N} (\mathbf{w}^T \mathbf{x}_n - c_n) \, \mathbf{x}_n
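Putting the gradient and the update rule together gives a small batch-gradient-descent trainer (an illustrative sketch; the data and settings below are made up):

```python
import numpy as np

def train_linear(X, c, epochs=200, eps=None):
    """Minimize E = 1/2 sum_n (w.x_n - c_n)^2 by gradient descent.

    X is N x (M+1) with a leading column of ones (the x_0 = 1 bias input);
    c holds the 0/1 class labels.
    """
    N = len(X)
    eps = 0.1 / N if eps is None else eps   # a small step size, as on the slides
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - c)            # sum_n (w.x_n - c_n) x_n
        w = w - eps * grad                  # w^{t+1} = w^t - eps * dE/dw
    return w
```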
Simulation: learning the decision boundary
• Each iteration updates the weights with a gradient step:
• Epsilon is a small value:
ε = 0.1/N
• Epsilon too large:
- learning diverges
• Epsilon too small:
- convergence slow
[Figure: data and the learned boundary in the (x_1, x_2) plane, and the learning curve: error vs. iteration]

\frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} (\mathbf{w}^T \mathbf{x}_n - c_n) \, x_{i,n}

w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}
Simulation: learning the decision boundary
• Learning converges onto the solution that minimizes the error.
• For linear networks, this is guaranteed to converge to the minimum.
• It is also possible to derive a closed-form solution (covered later).
[Figure: converged decision boundary in the (x_1, x_2) plane, and the learning curve: error vs. iteration]
Learning is slow when epsilon is too small
• Here, larger step sizes would
converge more quickly to the
minimum
[Figure: error surface E(w): small steps creep slowly toward the minimum]
Divergence when epsilon is too large
• If the step size is too large, learning
can oscillate between different sides
of the minimum
[Figure: error surface E(w): large steps overshoot and oscillate around the minimum]
Multi-layer networks
• Can we extend our network to
multiple layers? We have:
• Or in matrix form
• Thus a two-layer linear network is
equivalent to a one-layer linear
network with weights U=VW.
• It is not more powerful.
[Diagram: inputs x, first layer y with weights W, second layer z with weights V]

y_j = \sum_i w_{i,j} x_i

z_j = \sum_k v_{j,k} \, y_k = \sum_k v_{j,k} \sum_i w_{i,k} x_i

\mathbf{z} = \mathbf{V} \mathbf{y} = \mathbf{V} \mathbf{W} \mathbf{x}
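This degeneracy is easy to check numerically (random weights, illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # first layer: 4 inputs -> 3 hidden units
V = rng.normal(size=(2, 3))   # second layer: 3 hidden -> 2 outputs
U = V @ W                     # the equivalent one-layer weights

x = rng.normal(size=4)
z_two_layer = V @ (W @ x)     # z = V(Wx)
z_one_layer = U @ x           # z = (VW)x
```

The two outputs agree (up to floating-point rounding) for every input x.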
How do we address this?
Non-linear neural networks
• Idea: introduce a non-linearity:
• Now, multiple layers are not equivalent
• Common nonlinearities:
- threshold
- sigmoid
y_j = f\Big( \sum_i w_{i,j} x_i \Big)

z_j = f\Big( \sum_k v_{j,k} \, y_k \Big) = f\Big( \sum_k v_{j,k} \, f\big( \sum_i w_{i,k} x_i \big) \Big)

[Figure: the two nonlinearities f(x)]

threshold: y = 0 if x < 0, \; 1 if x \ge 0

sigmoid: y = \frac{1}{1 + \exp(-x)}
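Both nonlinearities in code (vectorized over NumPy arrays):

```python
import numpy as np

def threshold(x):
    """Hard threshold: 0 for x < 0, 1 for x >= 0."""
    return np.where(x < 0, 0.0, 1.0)

def sigmoid(x):
    """Smooth squashing to (0, 1); sigmoid(0) = 0.5, and scaling the
    input up steepens it toward the hard threshold."""
    return 1.0 / (1.0 + np.exp(-x))
```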
Posterior odds interpretation of a sigmoid
• The sigmoid can be interpreted as the posterior odds between two hypotheses
• “a” is the log odds ratio of p(C1|x) and p(C2|x).
p(C_1 | x) = \frac{p(x | C_1) p(C_1)}{p(x | C_1) p(C_1) + p(x | C_2) p(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)

where a = \log \frac{p(x | C_1) p(C_1)}{p(x | C_2) p(C_2)}
Modeling logical operators
• A one-layer binary-threshold
network can implement the logical
operators AND and OR, but not
XOR.
• Why not?
[Diagram: three single-unit networks on inputs x_1, x_2, computing x_1 AND x_2, x_1 OR x_2, and (unsuccessfully) x_1 XOR x_2]

y_j = f\Big( \sum_i w_{i,j} x_i \Big), \quad threshold: y = 0 if x < 0, \; 1 if x \ge 0
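AND and OR each need only one threshold unit; the weights below are one of many valid hand-picked choices (illustrative, not from the slides):

```python
import numpy as np

def unit(w, b, x):
    """A single binary-threshold unit: outputs 1 when w.x + b >= 0."""
    return 1 if np.dot(w, x) + b >= 0 else 0

def AND(x1, x2):
    return unit([1.0, 1.0], -1.5, [x1, x2])   # fires only when both inputs are 1

def OR(x1, x2):
    return unit([1.0, 1.0], -0.5, [x1, x2])   # fires when either input is 1

# XOR has no such single unit: its positive cases (0,1) and (1,0) cannot be
# separated from (0,0) and (1,1) by one line, so a hidden layer is needed.
```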
The general classification/regression problem
Data: D = \{\mathbf{x}_1, \ldots, \mathbf{x}_T\}, \quad \mathbf{x}_i = \{x_1, \ldots, x_N\}_i

desired output: \mathbf{y} = \{y_1, \ldots, y_K\}

model: \theta = \{\theta_1, \ldots, \theta_M\}

Given data, we want to learn a model that can correctly classify novel observations or map the inputs to the outputs.

For classification:

y_i = 1 if \mathbf{x}_i \in C_i (class i), \; 0 otherwise

The input is a set of T observations, each an N-dimensional vector (binary, discrete, or continuous).

The model (e.g. a decision tree or a multi-layer neural network) is defined by M parameters.

Regression allows arbitrary y.
• Error function is defined as before,
where we use the target vector tn to
define the desired output for network
output yn.
• The “forward pass” computes the
outputs at each layer:
A general multi-layer neural network
E = \frac{1}{2} \sum_{n=1}^{N} \big( \mathbf{y}_n(\mathbf{x}_n, \mathbf{W}_{1:L}) - \mathbf{t}_n \big)^2

[Diagram: input layer (x_0 = 1, x_1, …, x_M), hidden layers 1 and 2 (each with a bias unit y_0 = 1 and weights w_{ij}), and the output layer]

y_j^l = f\Big( \sum_i w_{i,j}^l \, y_i^{l-1} \Big), \quad l = \{1, \ldots, L\}, \quad \mathbf{x} \equiv \mathbf{y}^0, \quad output = \mathbf{y}^L
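The forward pass is a short loop over the layers (a sketch with made-up sizes; the bias unit y_0 = 1 is appended at each layer):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights):
    """y^l = f(W^l y^{l-1}) with y^0 = x; the last column of each W^l is the
    weight on the constant bias unit."""
    y = np.asarray(x, dtype=float)
    for W in weights:
        y = sigmoid(W @ np.append(y, 1.0))
    return y

# Hypothetical 2-4-1 network with random weights; shapes are (out, in + 1).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 5))]
out = forward([0.5, -1.0], weights)
```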
Deriving the gradient for a sigmoid neural network
• The mathematical procedure for training is gradient descent: the same as before, except the gradients are more complex to derive.
• Convenient fact for the sigmoid
non-linearity:
• backward pass computes the
gradients: back-propagation
\frac{d\sigma(x)}{dx} = \frac{d}{dx} \frac{1}{1 + \exp(-x)} = \sigma(x) \, (1 - \sigma(x))

E = \frac{1}{2} \sum_{n=1}^{N} \big( \mathbf{y}_n(\mathbf{x}_n, \mathbf{W}_{1:L}) - \mathbf{t}_n \big)^2

[Diagram: the same multi-layer network: input, layer 1, layer 2, output]

\mathbf{W}^{t+1} = \mathbf{W}^t - \epsilon \frac{\partial E}{\partial \mathbf{W}}
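The convenient sigmoid-derivative identity is easy to confirm against a finite difference (a quick numerical check, not part of the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """The identity used in back-propagation: sigma'(x) = sigma(x)(1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Centered finite-difference approximation of the derivative.
xs = np.linspace(-4.0, 4.0, 9)
h = 1e-5
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2.0 * h)
```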
New problem: local minima
Applications: Driving (output is analog: steering direction)
Real example: a network with one hidden layer (4 hidden units)
• Learns to drive on roads
• Demonstrated at highway speeds over 100s of miles
• Training data: images + the corresponding steering angle
• Important: conditioning of the training data to generate new examples avoids overfitting
D. Pomerleau. Neural Network Perception for Mobile Robot Guidance. Kluwer Academic Publishers, 1993.
Real image input is augmented to avoid overfitting
Hand-written digits: LeNet
• Takes as input an image of a handwritten digit
• Each pixel is an input unit
• Complex network with many layers
• Output is the digit class
• Tested on a large (50,000+) database of handwritten samples
• Real-time
• Used commercially
• Very low error rate (<< 1%)
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998.
http://yann.lecun.com/exdb/lenet/
Object recognition
• LeCun, Huang, Bottou (2004). Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. Proceedings of CVPR 2004.
• http://www.cs.nyu.edu/~yann/research/norb/
Is AI still at the balloon stage?
Limitations of this approach
AI research focuses on underlying principles. Russell & Norvig:
Aeronautical engineering texts do not define the goal of their field as making “machines that
fly so exactly like pigeons that they can fool even other pigeons.”
My own neural recordings from my PhD studies
[Excerpt: two pages from Lewicki, “Intracellular Responses of Song-Specific Neurons,” J. Neurosci., September 15, 1996, 16(18):5854–5863, showing intracellular recordings from phasically excited song-specific HVc cells, prepotential analyses, photomicrographs of stained neurons, and the paper’s reference list.]
[Figure: sound waveform and intracellular neural response from a neuron in an auditory brain area; scale bars: 20 mV, 100 ms]
Brains vs computers
Brains (adult cortex)
• surface area: 2500 cm2
• squishy
• neurons: 20 billion
• synapses: 240,000 billion
• neuron size: 15,000 nm
• synapse size: 1,000 nm
• synaptic OPS: 30,000 billion
Intel quad core (Nehalem)
• surface area: 107 mm2
• crystalline
• transistors: 0.82 billion
• transistor size: 45 nm
• FLOPS: 45 billion
Deep Blue: 512
processors, 1 TFLOP
Brains vs computers: energy efficiency
Brains (adult cortex)
• surface area: 2500 cm2
• squishy
• neurons: 20 billion
• synapses: 240,000 billion
• neuron size: 15,000 nm
• synapse size: 1,000 nm
• synaptic OPS: 30,000 billion
• power usage: ~12 W
• 2500 GFLOPS/W
Intel quad core (Nehalem, 2008)
• surface area: 107 mm2
• crystalline
• transistors: 0.82 billion
• transistor size: 45 nm
• FLOPS: 45 billion
• power usage: 60 W
• 0.75 GFLOPS/W
Google’s Deep Learning Chip: The Tensor Processing Unit
Brains vs computers: energy efficiency
Brains (adult cortex)
• surface area: 2500 cm2
• squishy
• neurons: 20 billion
• synapses: 240,000 billion
• neuron size: 15,000 nm
• synapse size: 1,000 nm
• synaptic tera-OPS: ~30
• power usage: ~12 W
• 2.5 TOPS/W
Google TPU 2.0
• surface area: ~331 mm2
• crystalline
• transistors: 2-2.5 billion
• transistor size: 28 nm
• tera-OPS (8 bit): 92 (TPU 2.0 is 180 teraops)
• power usage: 40 W
• 2.3 TOPS/W (4.5 for TPU 2.0 assuming 40 W)
Reconstruction of neural motion processing circuits in a fly
A visual motion detection circuit suggested by Drosophila connectomics
379 neurons; 8,637 chemical synaptic connections
Summary
• Decision boundaries
- Bayes optimal
- linear discriminant
- linear separability
• Classification vs regression
• Optimization by gradient descent
• Degeneracy of a multi-layer linear network
• Non-linearities: threshold, sigmoid, others?
• Issues:
- very general architecture, can solve many problems
- large number of parameters: need to avoid overfitting
- usually requires a large amount of data, or special architecture
- local minima, training can be slow, need to set the step size