
L20 – Neural Networks

k-means clustering (recap)

• Idea: try to estimate k cluster centers by
minimizing “distortion”

• Define distortion as:

• rnk is 1 for the closest cluster mean to xn.
• Each point xn contributes its squared distance

from its closest center.

• How do we learn the cluster means?
• Need to derive a learning rule.

D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

r_{nk} = 1 if x_n \in cluster k, 0 otherwise.

Deriving a learning rule for the cluster means

• Our objective function is:

  D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

• Differentiate w.r.t. the mean (the parameter we want to estimate):

  \frac{\partial D}{\partial \mu_k} = -2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)

• We know the optimum is where the derivative is zero:

  \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0

• Here, we can solve for the mean:

  \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

• This is simply the mean of the points assigned to each cluster.

• Thus we have a simple estimation algorithm (k-means clustering):

  1. select k points at random as the initial cluster means
  2. assign each point to its closest mean
  3. re-estimate (update) the means
  4. repeat from step 2 until converged

• Convergence (to a local minimum) is guaranteed.
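The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code; the random initialization, the toy two-blob data, and the convergence test are all made up for the example:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: X is (N, d); returns (means, assignments)."""
    rng = np.random.default_rng(seed)
    # 1. select k points at random as the initial means
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. assign each point to its closest mean (r_nk = 1 for that k)
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = d.argmin(axis=1)
        # 3. update each mean to the average of its assigned points
        new_mu = np.array([X[r == j].mean(axis=0) if np.any(r == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):      # 4. repeat until converged
            break
        mu = new_mu
    return mu, r

# two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
mu, r = kmeans(X, 2)
```

Run on the two blobs above, the means converge to the blob centers near (0, 0) and (5, 5).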

[Figure: distortion D plotted against k = number of clusters (k = 1…10) — D decreases monotonically as k grows.]

How do we choose k?

• Increasing k will always decrease our
distortion, which will overfit the data.

– How can we avoid this?
– Or how do we choose the best k?

• One way: cross-validation.
• Use our distortion metric:

• Then measure the distortion on a
held-out test data set, and choose the k
where it reaches its minimum.

D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2
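The cross-validation idea can be sketched directly: fit the means on training data, then evaluate the same distortion metric on held-out data for each candidate k. The helper functions, the deterministic farthest-point initialization (used here for reproducibility instead of random restarts), and the three-blob data are all made up for this example:

```python
import numpy as np

def distortion(X, mu):
    """D = sum_n min_k ||x_n - mu_k||^2: each point counts its closest mean."""
    d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return float(d.min(axis=1).sum())

def fit_means(X, k, n_iter=50):
    """k-means with a deterministic farthest-point initialization."""
    mu = [X[0]]
    for _ in range(k - 1):              # seed each new mean far from the others
        d = ((X[:, None, :] - np.array(mu)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        mu.append(X[int(d.argmax())])
    mu = np.array(mu)
    for _ in range(n_iter):             # standard assign/update iterations
        r = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        mu = np.array([X[r == j].mean(axis=0) if np.any(r == j) else mu[j]
                       for j in range(k)])
    return mu

# three synthetic clusters; hold out part of each as a test set
rng = np.random.default_rng(2)
blobs = [rng.normal(c, 0.2, (40, 2)) for c in (0.0, 4.0, 8.0)]
train = np.vstack([b[:30] for b in blobs])
test = np.vstack([b[30:] for b in blobs])

# training distortion always falls as k grows; the held-out distortion
# stops improving near the true k = 3
test_D = {k: distortion(test, fit_means(train, k)) for k in (1, 2, 3, 4, 5)}
```

On this data the held-out distortion drops sharply from k = 1 to k = 3 and then levels off, which is the stopping criterion the slide describes.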

[Figure: data plotted by 1st PC score vs. 2nd PC score, clustered with k = 10 clusters — an example of overfitting.]

[Figure: training set error keeps falling as k increases, while test set error reaches a minimum and then rises — overfitting.]

EECS 391
Intro to AI

Neural Networks

L20 Tue Nov 16

The Iris dataset with decision tree boundaries

[Figure: the Iris data plotted as petal length (cm) vs. petal width (cm), with axis-aligned decision tree boundaries.]

The optimal decision boundary for C2 vs C3

[Figure: petal length (cm) vs. petal width (cm), with the class-conditional densities p(petal length | C2) and p(petal length | C3); the optimal boundary lies where the densities cross.]

Is this really optimal?

Could we do better?

Arbitrary decision boundaries would be more powerful

[Figure: petal length (cm) vs. petal width (cm) — an arbitrary, curved boundary separating the overlapping classes.]

Decision boundaries could be non-linear.

[Figure: a non-linear decision boundary in the (x1, x2) plane.]

Defining a decision boundary

The decision boundary:

y = \mathbf{m}^T \mathbf{x} + b = 0,
\qquad
\mathbf{x} \in \begin{cases} \text{class 1} & \text{if } y \ge 0 \\ \text{class 2} & \text{if } y < 0 \end{cases}

In two dimensions the boundary is a line:

m_1 x_1 + m_2 x_2 = -b \;\Rightarrow\; x_2 = -\frac{m_1 x_1 + b}{m_2}

Or in terms of scalars:

y = \mathbf{m}^T \mathbf{x} + b = m_1 x_1 + m_2 x_2 + b = \sum_i m_i x_i + b

Linear separability

[Figure: petal length (cm) vs. petal width (cm) — one pair of Iris classes is linearly separable; the other is not.]

Diagramming the classifier as a “neural” network

• The “input units” x_1, …, x_M connect to an “output unit” y through “weights” w_1, …, w_M, plus a “bias” b:

y = \mathbf{w}^T \mathbf{x} + b = \sum_{i=1}^{M} w_i x_i + b

• Folding the bias into the weights with a constant input x_0 = 1 (so w_0 = b):

y = \mathbf{w}^T \mathbf{x} = \sum_{i=0}^{M} w_i x_i

Determining, i.e. learning, the optimal linear discriminant

• First we must define an objective function, i.e. the goal of learning.
• Simple idea: adjust the weights so that the output y(x_n) matches the class c_n.
• Objective: minimize the sum-squared error over all patterns x_n:

E = \frac{1}{2} \sum_{n=1}^{N} (\mathbf{w}^T \mathbf{x}_n - c_n)^2

• Note the notation x_n defines a pattern vector: x_n = \{x_1, \ldots, x_M\}_n
• We can define the desired class as:

c_n = \begin{cases} 0 & x_n \in \text{class 1} \\ 1 & x_n \in \text{class 2} \end{cases}

We’ve seen this before: curve fitting

• Example from Bishop (2006), Pattern Recognition and Machine Learning: fit noisy samples of a sine wave, t = \sin(2\pi x) + \text{noise}.
[Figure: data points (x_n, t_n) and the fitted curve y(x_n, \mathbf{w}).]

Neural networks compared to polynomial curve fitting

y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left[ y(x_n, \mathbf{w}) - t_n \right]^2

[Figure: polynomial fits of increasing order M; example from Bishop (2006), Pattern Recognition and Machine Learning.]

• For the linear network, M = 1, but there are multiple input dimensions.

General form of a linear network

• A linear neural network is simply a linear transformation of the input:

y_j = \sum_{i=0}^{M} w_{i,j} x_i

• Or, in matrix-vector form (“outputs” y, “weights” W, “inputs” x, with the “bias” folded in via x_0 = 1):

\mathbf{y} = W \mathbf{x}

• Multiple outputs correspond to multivariate regression.

Training the network: optimization by gradient descent

• We can adjust the weights incrementally to minimize the objective function.
• This is called gradient descent (or gradient ascent if we’re maximizing).
• The gradient descent rule for weight w_i is:

w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}

• Or in vector form:

\mathbf{w}^{t+1} = \mathbf{w}^t - \epsilon \frac{\partial E}{\partial \mathbf{w}}

• For gradient ascent, the sign of the gradient step changes.

Computing the gradient

• Idea: minimize the error by gradient descent.
• Take the derivative of the objective function w.r.t. the weights:

E = \frac{1}{2} \sum_{n=1}^{N} (\mathbf{w}^T \mathbf{x}_n - c_n)^2

\frac{\partial E}{\partial w_i}
= \frac{2}{2} \sum_{n=1}^{N} (w_0 x_{0,n} + \cdots + w_i x_{i,n} + \cdots + w_M x_{M,n} - c_n)\, x_{i,n}
= \sum_{n=1}^{N} (\mathbf{w}^T \mathbf{x}_n - c_n)\, x_{i,n}

• And in vector form:

\frac{\partial E}{\partial \mathbf{w}} = \sum_{n=1}^{N} (\mathbf{w}^T \mathbf{x}_n - c_n)\, \mathbf{x}_n

Simulation: learning the decision boundary

• Each iteration updates the weights along the gradient:

\frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} (\mathbf{w}^T \mathbf{x}_n - c_n)\, x_{i,n},
\qquad
w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}

• Epsilon is a small value: ε = 0.1/N.
• Epsilon too large: learning diverges.
• Epsilon too small: convergence is slow.
[Figure: successive snapshots of the decision boundary in the (x_1, x_2) plane, with the learning curve (error vs. iteration) falling as training proceeds.]

• Learning converges onto the solution that minimizes the error.
• For linear networks, this is guaranteed to converge to the minimum.
• It is also possible to derive a closed-form solution (covered later).

Learning is slow when epsilon is too small

• Here, larger step sizes would converge more quickly to the minimum.
[Figure: small steps creeping down the error surface toward the minimum.]

Divergence when epsilon is too large

• If the step size is too large, learning can oscillate between different sides of the minimum.
[Figure: large steps bouncing across the error surface, moving away from the minimum.]

Multi-layer networks

• Can we extend our network to multiple layers? We have:

y_j = \sum_i w_{i,j} x_i, \qquad
z_j = \sum_k v_{k,j} y_k = \sum_k v_{k,j} \sum_i w_{i,k} x_i

• Or in matrix form:

\mathbf{z} = V \mathbf{y} = V W \mathbf{x}

• Thus a two-layer linear network is equivalent to a one-layer linear network with weights U = VW.
• It is not more powerful. How do we address this?

Non-linear neural networks

• Idea: introduce a non-linearity:

y_j = f\Big(\sum_i w_{i,j} x_i\Big), \qquad
z_j = f\Big(\sum_k v_{k,j} y_k\Big) = f\Big(\sum_k v_{k,j}\, f\big(\sum_i w_{i,k} x_i\big)\Big)

• Now, multiple layers are not equivalent.
• Common nonlinearities:

\text{threshold: } f(x) = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases}
\qquad
\text{sigmoid: } f(x) = \frac{1}{1 + \exp(-x)}

Posterior odds interpretation of a sigmoid

• The sigmoid can be interpreted as the posterior odds between two hypotheses:

p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)}
= \frac{1}{1 + \exp(-a)} = \sigma(a),
\quad\text{where}\quad
a = \log \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}

• “a” is the log odds ratio of p(C_1|x) and p(C_2|x).

Modeling logical operators

• A one-layer binary-threshold network can implement the logical operators AND and OR, but not XOR.
• Why not? AND and OR are linearly separable in the (x_1, x_2) plane; XOR is not.

The general classification/regression problem

Data: D = \{\mathbf{x}_1, \ldots, \mathbf{x}_T\}, \mathbf{x}_i = \{x_1, \ldots, x_N\}_i
— the input is a set of T observations, each an N-dimensional vector (binary, discrete, or continuous).
Desired output: \mathbf{y} = \{y_1, \ldots, y_K\}.
Model: \theta = \{\theta_1, \ldots, \theta_M\} — defined by M parameters, e.g. a decision tree or a multi-layer neural network.

• Given data, we want to learn a model that can correctly classify novel observations, or map the inputs to the outputs.
• For classification:

y_i = \begin{cases} 1 & \text{if } \mathbf{x}_i \in C_i \; (\text{class } i) \\ 0 & \text{otherwise} \end{cases}

• Regression: arbitrary \mathbf{y}.

A general multi-layer neural network

• The error function is defined as before, where the target vector t_n defines the desired output for network output y_n:

E = \frac{1}{2} \sum_{n=1}^{N} \left( \mathbf{y}_n(\mathbf{x}_n, W_{1:L}) - \mathbf{t}_n \right)^2

• The “forward pass” computes the outputs at each layer:

y_j^l = f\Big(\sum_i w_{i,j}^l\, y_i^{l-1}\Big), \quad l = \{1, \ldots, L\},
\qquad \mathbf{x} \equiv \mathbf{y}^0, \quad \text{output} = \mathbf{y}^L

Deriving the gradient for a sigmoid neural network

• The mathematical procedure for training is gradient descent: the same as before, except the gradients are more complex to derive.
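The linear-discriminant simulation above — the gradient \partial E/\partial w_i = \sum_n (\mathbf{w}^T\mathbf{x}_n - c_n) x_{i,n}, the update w \leftarrow w - \epsilon\, \partial E/\partial w, and the step size ε = 0.1/N — can be run on toy data. The two 2-D classes below are made up (stand-ins for the petal measurements on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
# two made-up, linearly separable 2-D classes
X1 = rng.normal([1.0, 1.0], 0.3, (50, 2))       # desired class c = 0
X2 = rng.normal([3.0, 3.0], 0.3, (50, 2))       # desired class c = 1
X = np.hstack([np.ones((100, 1)), np.vstack([X1, X2])])   # x_0 = 1 for the bias
c = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(3)
eps = 0.1 / len(X)                    # epsilon = 0.1/N, as on the slide
errors = []
for _ in range(500):
    y = X @ w                         # y_n = w^T x_n
    grad = X.T @ (y - c)              # dE/dw = sum_n (w^T x_n - c_n) x_n
    w -= eps * grad                   # w^{t+1} = w^t - eps * dE/dw
    errors.append(0.5 * ((y - c) ** 2).sum())

# thresholding the output midway between the class targets separates the data
accuracy = ((X @ w >= 0.5) == (c == 1)).mean()
```

The recorded learning curve (`errors`) falls monotonically toward the minimum, and the learned boundary classifies the training data correctly.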
• Convenient fact for the sigmoid non-linearity:

\frac{d\sigma(x)}{dx} = \frac{d}{dx} \frac{1}{1 + \exp(-x)} = \sigma(x)\,(1 - \sigma(x))

• The backward pass computes the gradients of

E = \frac{1}{2} \sum_{n=1}^{N} \left( \mathbf{y}_n(\mathbf{x}_n, W_{1:L}) - \mathbf{t}_n \right)^2

layer by layer: back-propagation. The weight update is

W^{t+1} = W^t - \epsilon \frac{\partial E}{\partial W}

• New problem: local minima.

Applications: driving (the output is analog: steering direction)

• Real example: a network with 1 hidden layer (4 hidden units).
• Learns to drive on roads; demonstrated at highway speeds over 100s of miles.
• Training data: images + the corresponding steering angle.
• Important: the training data is conditioned to generate new examples — the real image input is augmented — which avoids overfitting.
• D. Pomerleau. Neural network perception for mobile robot guidance. Kluwer Academic Publishing, 1993.

Hand-written digits: LeNet

• Takes as input an image of a handwritten digit; each pixel is an input unit.
• A complex network with many layers; the output is the digit class.
• Tested on a large (50,000+) database of handwritten samples; very low error rate (<< 1%).
• Real-time, and used commercially.
• Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998. http://yann.lecun.com/exdb/lenet/

Object recognition

• LeCun, Huang, Bottou (2004). Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. Proceedings of CVPR 2004.
• http://www.cs.nyu.edu/~yann/research/norb/

Limitations of this approach: is AI still at the balloon stage?

AI research focuses on underlying principles. Russell & Norvig:
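The sigmoid derivative and the backward pass above can be combined into a toy two-layer network that learns XOR — the function a one-layer threshold network cannot represent. The architecture (4 hidden units), step size, and iteration count are made up for this sketch; from this initialization it usually learns XOR, but as the local-minima slide notes, convergence of gradient descent on a non-linear network is not guaranteed:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR is not linearly separable, so a one-layer network cannot learn it
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([[0.], [1.], [1.], [0.]])
X1 = np.hstack([np.ones((4, 1)), X])          # x_0 = 1 supplies the bias

rng = np.random.default_rng(0)
W = rng.normal(0, 1, (3, 4))                  # layer 1: 2 inputs (+bias) -> 4 hidden
V = rng.normal(0, 1, (5, 1))                  # layer 2: 4 hidden (+bias) -> 1 output

eps = 0.5
errors = []
for _ in range(5000):
    # forward pass: y^l = sigmoid(sum_i w_ij y_i^{l-1})
    h = sigmoid(X1 @ W)
    h1 = np.hstack([np.ones((4, 1)), h])      # hidden bias unit
    y = sigmoid(h1 @ V)
    errors.append(0.5 * ((y - t) ** 2).sum())
    # backward pass, using dsigma/dx = sigma(x)(1 - sigma(x))
    dy = (y - t) * y * (1 - y)
    dh = (dy @ V[1:].T) * h * (1 - h)
    V -= eps * h1.T @ dy                      # W^{t+1} = W^t - eps * dE/dW
    W -= eps * X1.T @ dh
```

Tracking `errors` shows the sum-squared error falling as back-propagation adjusts both layers.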
 
 Aeronautical engineering texts do not define the goal of their field as making “machines that fly so exactly like pigeons that they can fool even other pigeons.”

My own neural recordings from my PhD studies

[Figure: sound waveform and the neural response of a neuron in an auditory brain area (scale bars: 20 mV, 100 ms) — intracellular recordings from song-specific HVc neurons in the songbird. From Lewicki, “Intracellular Responses of Song-Specific Neurons,” J. Neurosci. 16(18):5854–5863, 1996.]

Brains vs computers

Brains (adult cortex):
• surface area: 2500 cm²
• squishy
• neurons: 20 billion
• synapses: 240,000 billion
• neuron size: 15,000 nm
• synapse size: 1,000 nm
• synaptic OPS: 30,000 billion

Intel quad core (Nehalem, 2008):
• surface area: 107 mm²
• crystalline
• transistors: 0.82 billion
• transistor size: 45 nm
• FLOPS: 45 billion

(Deep Blue: 512 processors, 1 TFLOP.)

Brains vs computers: energy efficiency

Brains (adult cortex):
• power usage: ~12 W
• 2500 GFLOPS/W

Intel quad core (Nehalem, 2008):
• power usage: 60 W
• 0.75 GFLOPS/W

Google’s deep learning chip: the Tensor Processing Unit

Brains (adult cortex):
• synaptic tera-OPS: ~30
• power usage: ~12 W
• 2.5 TOPS/W

Google TPU:
• surface area: ~331 mm²
• crystalline
• transistors: 2–2.5 billion
• transistor size: 28 nm
• tera-OPS (8 bit): 92 (TPU 2.0 is 180 teraops)
• power usage: 40 W
• 2.3 TOPS/W (4.5 for TPU 2.0, assuming 40 W)

Reconstruction of neural motion processing circuits in a fly

• A visual motion detection circuit suggested by Drosophila connectomics: 379 neurons; 8,637 chemical synaptic connections.

Summary

• Decision boundaries
  - Bayes optimal
  - linear discriminant
  - linear separability
• Classification vs regression
• Optimization by gradient descent
• Degeneracy of a multi-layer linear network
• Non-linearities: threshold, sigmoid, others?
• Issues:
  - very general architecture, can solve many problems
  - large number of parameters: need to avoid overfitting
  - usually requires a large amount of data, or a special architecture
  - local minima; training can be slow; need to set the step size
 Intel quad core (Nehalem, 2008) • surface area: 107 mm2 • crystalline • transistors: 0.82 billion • transistor size: 45 nm • FLOPS: 45 billion • power usage: ~12 W • 2500 GFLOPS/W
 • power usage: 60 W • 0.75 GFLOPS/W Google’s Deep Learning Chip: The Tensor Processing Unit Brains vs computers: energy efficiency Brains (adult cortex) • surface area: 2500 cm2 • squishy • neurons: 20 billion • synapses: 240,000 billion • neuron size: 15,000 nm • synapse size: 1,000 nm • synaptic Terra OPS: ~30 
 Google TPU 2.0 • surface area: ~331 mm2 • crystalline • transistors: 2-2.5 billion • transistor size: 28 nm • Terra OPS (8 bit): 92 
 (TPU 2.0 is 180 teraops) • power usage: ~12 W • 2.5 TOPS/W
 • power usage: 40 W • 2.3 TOPS/W
 
 (4.5 for TPU 2.0 assuming 40W) Reconstruction of neural motion processing circuits in a fly A visual motion detection circuit suggested by Drosophila connectomics 379 neurons; 8,637 chemical synaptic connections Summary • Decision boundaries - Bayes optimal - linear discriminant - linear separability • Classification vs regression • Optimization by gradient descent • Degeneracy of a multi-layer linear network • Non-linearities:: threshold, sigmoid, others? • Issues: - very general architecture, can solve many problems - large number of parameters: need to avoid overfitting - usually requires a large amount of data, or special architecture - local minima, training can be slow, need to set stepsize