
ANLY-601
Advanced Pattern Recognition

Spring 2018

L19 — Neural Nets I

What’s a Neural Network?
• Exceedingly simple processing units.

• Parallel collections of such elements provide a map
from vector inputs to vector outputs.

[Figure: a single unit with inputs x_i, weights w_{ji}, and output O_j]

O_j = w_{j0} + w_{j1} x_1 + w_{j2} x_2 + \cdots + w_{jN} x_N = \sum_{i=0}^{N} w_{ji} x_i,   with x_0 = 1
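A minimal NumPy sketch of this input-to-output map (the function name, shapes, and example numbers are illustrative, not from the slides):

```python
import numpy as np

def unit_outputs(W, x):
    """Outputs O_j = sum_i w_ji * x_i of a layer of simple linear units.

    W : (M, N+1) weight matrix, column 0 holding the bias weights w_j0.
    x : (N,) input vector; a constant x_0 = 1 is prepended for the bias.
    """
    x_aug = np.concatenate(([1.0], x))   # x_0 = 1
    return W @ x_aug

# Example: 2 units, 3 inputs
W = np.array([[0.1, 0.5, -0.2, 0.3],
              [0.0, -1.0, 0.4, 0.2]])
x = np.array([1.0, 2.0, -1.0])
print(unit_outputs(W, x))   # vector of outputs O_j
```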

Characteristics

• Enormous flexibility in maps that can be represented.

• Mapping can be learned from examples.

• Generalize far better than models that are linear in the parameters.

• Relatively forgiving of noisy training data.

• Extrapolate gracefully to new data.

• Training times can be a few seconds, up to many hours for large nets
with large datasets.

• Evaluation of learned function is fast.

• Doesn’t require explicitly programming the target function.

• Success depends on picking appropriate features for the inputs x_i, and
a representation for the output.


What Are They Good For?

• Enormously flexible; can achieve a huge range of maps.

• Mapping can be learned from examples.

• Pattern Classification (statistical pattern recognition)

– Text-to-speech

– Handwritten, machine printed (OCR), cursive writing (online)
recognition.

– Event detection in high energy physics experiments

– Medical screening: Papnet, an adjunct to conventional screening,
reduces false negatives (http://www.mda.mil/mdalink/pdf/pap.pdf);
also tested on sputum smears.

– Acoustic front end for speech recognition systems.

– Illegal drug source identification


What Are They Good For?
• Regression / prediction of continuous-valued systems

– Time series prediction, e.g. financial forecasting

– Non-linear regression

• Control

– Plasma fusion reactor control

– Chemical process control

– Quality control in food production (Mars)

– Trailer truck backer-upper
http://www.handshake.de/user/blickle/Truck/

– Aircraft controller – recovery from damaged airframe

• Signal Processing

– Adaptive noise cancellation

– Adaptive vibration cancellation

– Image analysis


Why “Neural”?

• Artificial neural network (ANN) function is
derived from massive connectivity; real nervous
systems are also massively connected.

• Parallelism is exploited – biological processing
times are on the order of tens of milliseconds, yet
we make complex judgments and responses
in several tenths of a second – i.e. a few
dozen “processing steps”.

• ANN units are cartoons of real neurons.
The latter have complex dynamics, and can
have tens of thousands of inputs (in cortex).
Real nervous systems have a multitude of
neuron types.


Adaptive Linear Unit (Adaline)

• Training – adjust w so output matches target
values in least mean square sense
– Data : input / target pairs

– Performance metric, or cost function – mean squared
error

[Figure: a single linear unit with inputs x_i, weights w_i, and output O]

O = w \cdot x = \sum_{i=0}^{N} w_i x_i

Data: \{ x_d, t_d \}, \quad d = 1, \ldots, D

E(w) = \frac{1}{2D} \sum_{d=1}^{D} \left( t_d - O(x_d) \right)^2 = \frac{1}{2D} \sum_{d=1}^{D} \left( t_d - w \cdot x_d \right)^2
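As a sketch, the mean-squared-error cost above can be computed as follows (assuming the data are stacked into a matrix X with a leading column of ones; names are illustrative):

```python
import numpy as np

def mse_cost(w, X, t):
    """E(w) = 1/(2D) * sum_d (t_d - w . x_d)^2.

    X : (D, N+1) inputs with x_0 = 1 in the first column.
    t : (D,) targets, w : (N+1,) weights.
    """
    residuals = t - X @ w
    return 0.5 * np.mean(residuals ** 2)
```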

Linear Unit – Gradient Descent

– Optimization: crawl downhill (steepest descent) on

the error surface
















Weight update (steepest descent):

w_i \to w_i - \eta \frac{\partial E(w)}{\partial w_i}, \quad \text{or} \quad \Delta w_i = -\eta \frac{\partial E(w)}{\partial w_i}

with

\frac{\partial E(w)}{\partial w_i} = \frac{\partial}{\partial w_i} \left[ \frac{1}{2D} \sum_{d=1}^{D} \left( t_d - w \cdot x_d \right)^2 \right] = -\frac{1}{D} \sum_{d=1}^{D} \left( t_d - O(x_d) \right) x_{di}

So

\Delta w_i = \frac{\eta}{D} \sum_{d=1}^{D} \left( t_d - O(x_d) \right) x_{di}
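A hedged sketch of this steepest-descent loop in NumPy (assuming X stacks the inputs as rows with a leading column of ones; the function name, eta, and step count are illustrative choices, not from the slides):

```python
import numpy as np

def batch_gradient_descent(X, t, eta=0.01, n_steps=1000):
    """Steepest descent on E(w) = 1/(2D) sum_d (t_d - w . x_d)^2.

    Each step applies  delta_w = (eta / D) * sum_d (t_d - O(x_d)) x_d.
    X : (D, N+1) inputs (first column all ones), t : (D,) targets.
    """
    D, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_steps):
        error = t - X @ w                 # t_d - O(x_d) for every d
        w += (eta / D) * (X.T @ error)    # gradient step using all data at once
    return w
```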


Linear Unit – Gradient Descent
• The error function E(w) is quadratic in w, and bounded below by
zero. There is a unique, global minimum w*.

[Figure: parabolic error surface E(w) with its minimum at w*]

• Can show that for learning rate η sufficiently small, this algorithm will
approach w* exponentially. How small? Must have









0 < \eta < \frac{2}{\lambda_{max}}

where \lambda_{max} is the largest eigenvalue of the autocorrelation matrix

R = \frac{1}{D} \sum_{d=1}^{D} x_d x_d^T, \quad \text{i.e.} \quad R_{ij} = \frac{1}{D} \sum_{d=1}^{D} x_{di} x_{dj}
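A small sketch of how one might check this bound numerically (assuming X stacks the input vectors x_d as rows; the function name is illustrative):

```python
import numpy as np

def lms_learning_rate_bound(X):
    """Upper bound 2 / lambda_max on the learning rate eta, where
    lambda_max is the largest eigenvalue of the autocorrelation matrix
    R = 1/D sum_d x_d x_d^T."""
    D = X.shape[0]
    R = (X.T @ X) / D                        # R_ij = 1/D sum_d x_di x_dj
    lambda_max = np.linalg.eigvalsh(R).max() # R is symmetric
    return 2.0 / lambda_max
```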

Linear Unit
Stochastic Gradient Descent

• Gradient descent (A):

\Delta w_i = \frac{\eta}{D} \sum_{d=1}^{D} \left( t_d - O(x_d) \right) x_{di}

• Instead of summing over all data pairs for each update to w, just use one
data pair for each update. At each step, sample an input/target pair \{ x_d, t_d \}
at random from the data (with replacement) and modify w (B):

\Delta w_i = \eta \left( t_d - O(x_d) \right) x_{di}

This is the celebrated Widrow-Hoff or LMS (for Least Mean Square)
algorithm.

– Note that the gradient descent (A) is the average of this stochastic
gradient descent (B), over all training data.

– The stochastic descent is a noisy version of the true gradient descent.
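A minimal sketch of the LMS / Widrow-Hoff update (B), sampling one pair per step with replacement (function name and defaults are illustrative):

```python
import numpy as np

def lms_train(X, t, eta=0.01, n_updates=10000, rng=None):
    """Widrow-Hoff / LMS: one randomly sampled pair per weight update,
    delta_w = eta * (t_d - O(x_d)) * x_d."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(X.shape[1])
    for _ in range(n_updates):
        d = rng.integers(len(X))      # sample with replacement
        error = t[d] - X[d] @ w       # t_d - O(x_d)
        w += eta * error * X[d]
    return w
```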

Stochastic vs True Gradient Descent

[Figure: weight-space trajectories of true gradient descent and stochastic gradient descent; the stochastic path is a noisy version of the true-gradient path]

Linear Unit with LMS Training

• Used in adaptive filter applications: adaptive noise cancellation and
vibration damping, and linear prediction problems (linear regression, AR models).

Perceptron Classifier
• Early ANN – Rosenblatt, Principles of Neurodynamics:
Perceptrons and the Theory of Brain Mechanisms, Spartan,
1962.

Single unit with hard-limiter output.

• Represents a dichotomy: responds +1 or –1 to an input vector.
The input is a member of the class (+1) or not (–1). The concept is present
(+1), or not (–1).

e.g. – Does this picture have a tree in it? This is tough; the
inputs x will need to be superbly crafted features.

[Figure: a single unit with inputs x_i, weights w_{ji}, and hard-limited output O_j]

O(x) = \theta(w \cdot x) = \theta\!\left( \sum_{i=0}^{N_{in}} w_i x_i \right), \quad \theta(y) = \begin{cases} +1 & y \ge 0 \\ -1 & y < 0 \end{cases}

i.e. \ O(x) = \mathrm{sgn}(w \cdot x)

Perceptron Classifier – Geometry
Hypothesis space is the space of all possible weights w (R^{N+1}).

Learning means choosing a weight vector w that correctly
classifies the training data.

The perceptron weight vector defines a hyperplane s in the
N-dimensional feature space.

[Figure: hyperplane s in the feature space (x_1, x_2, x_3), with the weight vector w normal to s at distance d_0 from the origin; training points on one side labeled +1 and on the other labeled –1]

O(x) = \mathrm{sgn}(w \cdot x), \quad s = \{ x : w \cdot x = w_0 + \sum_{i=1}^{N} w_i x_i = 0 \}

The weight vector w is normal to s, and the bias w_0 sets the distance of s from the origin, d_0 = |w_0| / \|(w_1, \ldots, w_N)\|.

Perceptron Limitations
Boolean functions: AND and OR are linearly separable dichotomies, so a
single perceptron can realize them. XOR is not linearly separable – the
perceptron can only solve linearly separable dichotomies.

[Figure: the corners of the unit square in the (x_1, x_2) plane labeled +1/–1 for AND, OR, and XOR; a separating line s exists for AND and OR, but no single line separates the XOR labeling]

Perceptron Learning
• Training data: input/target pairs \{ x_d, t_d \} (e.g. pictures with trees, +1
target, and pictures without trees, –1 target).

• We want

w \cdot x_d > 0 \quad \text{for } t_d = +1
w \cdot x_d < 0 \quad \text{for } t_d = -1

this is equivalent to

(w \cdot x_d)\, t_d > 0 \quad \text{for all data}

A given data example will be misclassified if

(w \cdot x_d)\, t_d \le 0

• Define cost function

E(w) = - \sum_{d\ \mathrm{misclassified}} (w \cdot x_d)\, t_d \ge 0

• Do stochastic gradient descent on this cost function: if the
example x_d is misclassified, change the weights according to

\Delta w_i = \eta\, t_d\, x_{di}
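A sketch of this rule as a training loop (assuming targets in {+1, –1} and a leading column of ones in X so that w_0 acts as the bias; the epoch cap is an arbitrary safeguard, since convergence is only guaranteed for linearly separable data, as the next slide notes):

```python
import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    """Perceptron rule: for each misclassified example ((w . x_d) t_d <= 0),
    update delta_w = eta * t_d * x_d."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_d, t_d in zip(X, t):
            if (w @ x_d) * t_d <= 0:      # misclassified (or on the boundary)
                w += eta * t_d * x_d
                mistakes += 1
        if mistakes == 0:                 # all training data classified correctly
            break
    return w
```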

Perceptron Learning

• If the data are linearly separable, this algorithm will converge,
in a finite number of steps, to a weight that correctly classifies
all the training data.

Soft Threshold
Differentiable “Perceptron”

• In order to get past the restriction to linearly
separable problems, we are going to combine
many non-linear neurons. (Why non-linear?)

• In order to train the resulting networks, we
introduce a sigmoidal unit

[Figure: a single unit with inputs x_i, weights w_i, and output O passed through a sigmoid σ(y): smooth, bounded, monotonically increasing]

O = \sigma(w \cdot x) = \sigma\!\left( \sum_{i=0}^{N} w_i x_i \right)

Sigmoidal Functions

• Typical choices are

– Logistic function

\sigma(y) = \frac{1}{1 + \exp(-y)}

– Hyperbolic tangent

\sigma(y) = \tanh(y)
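For reference, a small sketch of these two squashing functions and their derivatives (the derivatives are the σ' factors that appear in the training rules on the next slides; names are illustrative):

```python
import numpy as np

def logistic(y):
    """sigma(y) = 1 / (1 + exp(-y)); targets in {0, 1}."""
    return 1.0 / (1.0 + np.exp(-y))

def logistic_prime(y):
    s = logistic(y)
    return s * (1.0 - s)              # sigma'(y) = sigma(y) (1 - sigma(y))

def tanh_prime(y):
    return 1.0 - np.tanh(y) ** 2      # d/dy tanh(y) = 1 - tanh^2(y)
```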

Training the Soft Threshold

Logistic function – targets are {0,1}

Hyperbolic tangent – targets are {-1,1}

Cost function

E(w) = \frac{1}{2D} \sum_{d=1}^{D} \left( t_d - O(x_d) \right)^2 = \frac{1}{2D} \sum_{d=1}^{D} \left( t_d - \sigma(w \cdot x_d) \right)^2

Train by gradient descent, w_i \to w_i - \eta \frac{\partial E(w)}{\partial w_i}, with

\frac{\partial E(w)}{\partial w_i}
= -\frac{1}{D} \sum_{d=1}^{D} \left( t_d - O(x_d) \right) \frac{\partial O(x_d)}{\partial w_i}
= -\frac{1}{D} \sum_{d=1}^{D} \left( t_d - O(x_d) \right) \sigma'(w \cdot x_d)\, \frac{\partial (w \cdot x_d)}{\partial w_i}
= -\frac{1}{D} \sum_{d=1}^{D} \left( t_d - O(x_d) \right) \sigma'(w \cdot x_d)\, x_{di}

So

\Delta w_i = \frac{\eta}{D} \sum_{d=1}^{D} \left( t_d - O(x_d) \right) \sigma'(w \cdot x_d)\, x_{di}

Training the Soft Threshold

• We have the gradient descent rule

\Delta w_i = \frac{\eta}{D} \sum_{d=1}^{D} \left( t_d - O(x_d) \right) \sigma'(w \cdot x_d)\, x_{di}

just like the linear gradient descent, except for the slope of the sigmoidal function.

• Stochastic gradient version

\Delta w_i = \eta \left( t_d - O(x_d) \right) \sigma'(w \cdot x_d)\, x_{di}

• Note that if we get up onto the flat “rails” of the sigmoid,
then the slope σ' gets very small, and the gradient of the
cost function gets very small ⇒ slow progress.
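A minimal sketch of this stochastic update for a single logistic unit (assuming targets in {0, 1} and a leading column of ones in X; the function name, eta, and the update count are illustrative):

```python
import numpy as np

def sigmoid_unit_sgd(X, t, eta=0.1, n_updates=10000, rng=None):
    """Stochastic gradient descent for one logistic unit:
    delta_w = eta * (t_d - O(x_d)) * sigma'(w . x_d) * x_d."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(X.shape[1])
    for _ in range(n_updates):
        d = rng.integers(len(X))
        net = X[d] @ w
        out = 1.0 / (1.0 + np.exp(-net))   # sigma(net)
        slope = out * (1.0 - out)          # sigma'(net)
        w += eta * (t[d] - out) * slope * X[d]
        # if |net| is large, slope ~ 0 and progress stalls (the "rails")
    return w
```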

Cost Function

• The cost surface is now not a simple parabolic function, but instead
is more complex looking.

[Figure: plots of the cost E versus the weights w_1 and w_2, showing a non-parabolic, more complex surface]

Workhorse Neural Networks
Multi-Layer Perceptrons (MLP)

Feed forward, layered networks, with sigmoidal hidden units


Can have more than two layers of weights.

Output nodes

Linear for regression, time series prediction, other problems needing full
range of real values in output.

Sigmoidal for classification problems.

Number of inputs and number of outputs are determined by the problem.
Number of hidden units is an architectural parameter.
More hidden nodes ⇒ more functions available.

[Figure: feed-forward network with a layer of inputs, a layer of sigmoidal hidden units, and a layer of outputs]

MLP Output

• Signal propagation (forward pass, bottom-up)

[Figure: network with inputs x_j, sigmoidal hidden units y_k, and outputs O_l]

y_k = \sigma(net_k), \quad net_k = \sum_i w_{ki}\, x_i

O_l = g(net_l), \quad net_l = \sum_k w_{lk}\, y_k

with \ g(net_l) = \sigma(net_l) \ for classification, \ g(net_l) = net_l \ for regression.
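A hedged sketch of this forward pass for one hidden layer (the weight shapes and bias handling are assumptions, not taken from the slides):

```python
import numpy as np

def mlp_forward(x, W_hidden, W_out, classification=True):
    """Forward pass of a two-layer MLP with sigmoidal hidden units.

    y_k = sigma(net_k),  net_k = sum_i w_ki x_i
    O_l = g(net_l),      net_l = sum_k w_lk y_k
    g is sigmoidal for classification, the identity for regression.
    Biases handled by prepending a constant 1 to x and to y.
    W_hidden : (H, N+1), W_out : (M, H+1), x : (N,).
    """
    def sigma(z):
        return 1.0 / (1.0 + np.exp(-z))

    x_aug = np.concatenate(([1.0], x))
    y = sigma(W_hidden @ x_aug)      # hidden-unit activations y_k
    y_aug = np.concatenate(([1.0], y))
    net_out = W_out @ y_aug          # output-unit net inputs net_l
    return sigma(net_out) if classification else net_out
```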

Example Uses
• Classification – e.g. from text fig 4.5. Able to produce non-linear

class boundaries.

• Fisher Iris data:


Example Uses

• Non-linear regression – able to map strongly non-linear functions.

[Figure: network output O(x) fitting a strongly non-linear function of x]

Gradient Descent in MLP
• Cost function as before:

E(w) = \frac{1}{2D} \sum_{d=1}^{D} \sum_{m=1}^{N_O} \left( t_{dm} - O_m(x_d) \right)^2, \quad \text{with } N_O \text{ the number of outputs}

• Learning by gradient descent

\Delta w_{ij} = -\eta\, \frac{\partial E(w)}{\partial w_{ij}}

• Calculating gradients takes some care.

• Surface can have multiple ‘local’ minima. Some may have
lower cost than others.

• Local optima are in different basins of attraction; where you
end depends on where you start.

Stochastic Gradient Descent in MLP

As above, but with no sum over the data pairs d:

E_d(w) = \frac{1}{2} \sum_{m=1}^{N_O} \left( t_{dm} - O_m(x_d) \right)^2, \quad \Delta w_{ij} = -\eta\, \frac{\partial E_d(w)}{\partial w_{ij}}

Stochastic descent has some robustness against getting stuck in
poor local minima. Where you end depends on where you start,
the learning rate, and the order the examples are given.

Can also be faster in clock-time for large data sets. Instead of
waiting to accumulate errors from all data before making a weight
change, make a small weight change in response to each datum.

Visualization of Stochastic
Gradient Descent

• Different 2-d slices through E(w) for 9-d
weight space

– eg 1


Visualization of Stochastic
Gradient Descent

– eg 2


Next

• Backpropagation training of MLP.

• Representation power – universal

approximation theorems.

• Inductive bias.

• Generalization, underfitting, overfitting.

• Bayesian methods for neural networks.