ANLY-601
Advanced Pattern Recognition
Spring 2018
L19 — Neural Nets I
What’s a Neural Network?
• Exceedingly simple processing units.
• Parallel collections of such elements provide a map
from vector inputs to vector outputs
[Figure: a single processing unit with inputs x_i, weights w_{ji}, and output O_j]

O_j = w_{j0} + w_{j1} x_1 + w_{j2} x_2 + \cdots + w_{jN} x_N = \sum_{i=0}^{N} w_{ji}\, x_i, \qquad x_0 \equiv 1
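To make the map concrete, here is a minimal NumPy sketch of one unit and of a parallel layer of such units computing O_j = Σ_i w_{ji} x_i; the function names and example weights are illustrative, not from the slides.

```python
import numpy as np

def unit_output(w_j, x):
    """Output of one unit: O_j = sum_{i=0}^N w_ji * x_i, with x_0 = 1 for the bias."""
    x_aug = np.concatenate(([1.0], x))   # prepend x_0 = 1 so w_j[0] acts as the bias w_j0
    return float(np.dot(w_j, x_aug))

def layer_output(W, x):
    """A parallel collection of units: maps a vector input x to a vector output O."""
    x_aug = np.concatenate(([1.0], x))
    return W @ x_aug                     # one row of W per unit

x = np.array([0.5, -1.0, 2.0])
W = np.array([[0.1, 0.4, -0.2, 0.3],    # weights of unit 1 (first entry is the bias)
              [0.0, 1.0,  0.5, -0.7]])  # weights of unit 2
print(layer_output(W, x))
```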
Characteristics
• Enormous flexibility in maps that can be represented.
• Mapping can be learned from examples.
• Generalize far better than models that are linear in the parameters.
• Relatively forgiving of noisy training data.
• Extrapolate gracefully to new data.
• Training times can be a few seconds, up to many hours for large nets
with large datasets.
• Evaluation of learned function is fast.
• Does not require programming the target function explicitly.
• Success depends on picking appropriate features for the inputs x_i, and an
appropriate representation for the output.
What Are They Good For?
• Enormously flexible, can achieve huge range of maps.
• Mapping can be learned from examples.
• Pattern Classification (statistical pattern recognition)
– Text-to-speech
– Handwritten, machine-printed (OCR), and cursive (online) handwriting
recognition.
– Event detection in high energy physics experiments
– Medical screening: Papnet, an adjunct to conventional screening that
reduces false negatives
http://www.mda.mil/mdalink/pdf/pap.pdf
(also tested on sputum smears).
– Acoustic front end for speech recognition systems.
– Illegal drug source identification
What Are They Good For?
• Regression / prediction of continuous-valued systems
– Time series prediction, e.g. financial forecasting
– Non-linear regression
• Control
– Plasma fusion reactor control
– Chemical process control
– Quality control in food production (Mars)
– Trailer truck backer-upper
http://www.handshake.de/user/blickle/Truck/
– Aircraft controller – recovery from damaged airframe
• Signal Processing
– Adaptive noise cancellation
– Adaptive vibration cancellation
– Image analysis
Why “Neural”?
• An artificial neural network's (ANN's) function is
derived from massive connectivity; real nervous
systems are also massively connected.
• Parallelism exploited – biological processing
times are on the order of tens of milliseconds, yet
we make complex judgments and responses
in several tenths of a second, i.e. a few
dozen “processing steps”.
• ANN units are cartoons of real neurons.
The latter have complex dynamics, and can
have tens of thousands of inputs (in cortex).
Real nervous systems have a multitude of
neuron types.
Adaptive Linear Unit (Adaline)
• Training – adjust w so output matches target
values in least mean square sense
– Data : input / target pairs
– Performance metric, or cost function – mean squared
error
[Figure: a single linear unit with inputs x_i, weights w_i, and output O]

O(x) = w \cdot x = \sum_{i=0}^{N} w_i\, x_i

Data:  \{x_d, t_d\},\; d = 1, \dots, D

E(w) = \frac{1}{2}\,\frac{1}{D} \sum_{d=1}^{D} \big(t_d - O(x_d)\big)^2 = \frac{1}{2D} \sum_{d=1}^{D} \big(t_d - w \cdot x_d\big)^2
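A minimal sketch of the Adaline output and its mean-squared-error cost in NumPy; the function names and the toy data are illustrative assumptions, not from the slides.

```python
import numpy as np

def adaline_output(w, X):
    """O(x_d) = w . x_d for every row x_d of X; a column of ones supplies x_0 = 1."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    return X_aug @ w

def mse_cost(w, X, t):
    """E(w) = 1/(2D) * sum_d (t_d - w . x_d)^2."""
    residual = t - adaline_output(w, X)
    return 0.5 * np.mean(residual ** 2)

X = np.array([[0.0], [1.0], [2.0]])
t = np.array([1.0, 3.0, 5.0])
print(mse_cost(np.array([1.0, 2.0]), X, t))   # w = [1, 2] fits t = 1 + 2x exactly -> 0.0
```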
Linear Unit – Gradient Descent
– Optimization: crawl downhill (steepest descent) on
the error surface
\frac{\partial E(w)}{\partial w_i}
= \frac{\partial}{\partial w_i}\left[\frac{1}{2D}\sum_{d=1}^{D}\big(t_d - w \cdot x_d\big)^2\right]
= -\frac{1}{D}\sum_{d=1}^{D}\big(t_d - w \cdot x_d\big)\,x_{di}

So

\frac{\partial E(w)}{\partial w_i} = -\frac{1}{D}\sum_{d=1}^{D}\big(t_d - O(x_d)\big)\,x_{di}

and the steepest-descent update is

w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = -\eta\,\frac{\partial E(w)}{\partial w_i} = \eta\,\frac{1}{D}\sum_{d=1}^{D}\big(t_d - O(x_d)\big)\,x_{di}

or, in vector form,  w \leftarrow w - \eta\,\nabla_w E(w)
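A minimal sketch of this batch steepest-descent rule in NumPy, assuming a fixed learning rate eta and step count (both illustrative choices):

```python
import numpy as np

def train_linear_gd(X, t, eta=0.05, n_steps=1000):
    """Batch steepest descent on the mean-squared error of a linear unit."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # bias input x_0 = 1
    D = X_aug.shape[0]
    w = np.zeros(X_aug.shape[1])
    for _ in range(n_steps):
        O = X_aug @ w                                  # current outputs O(x_d)
        grad = -(X_aug.T @ (t - O)) / D                # dE/dw_i = -(1/D) sum_d (t_d - O_d) x_di
        w -= eta * grad                                # w <- w - eta * grad E(w)
    return w
```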
Linear Unit – Gradient Descent
• The error function E(w) is quadratic in w, and bounded below by
zero. There is unique, global minimum w*.
• Can show that for learning rate sufficiently small, this algorithm will
approach w* exponentially. How small? Must have
Linear Unit – Gradient Descent
• Can show that for learning rate sufficiently small, this algorithm will
approach w* exponentially. How small? Must have
0 < \eta < \frac{2}{\lambda_{\max}}

where \lambda_{\max} is the largest eigenvalue of the autocorrelation matrix

R = \frac{1}{D}\sum_{d=1}^{D} x_d\, x_d^T, \quad\text{i.e.}\quad R_{ij} = \frac{1}{D}\sum_{d=1}^{D} x_{di}\, x_{dj}
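A short sketch of how one might estimate this learning-rate bound from data in NumPy; the function name is my own, not from the slides.

```python
import numpy as np

def max_stable_eta(X):
    """Return the bound 2 / lambda_max on the learning rate for the linear unit."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # include the constant input x_0 = 1
    R = (X_aug.T @ X_aug) / X_aug.shape[0]             # R = (1/D) sum_d x_d x_d^T
    lam_max = np.linalg.eigvalsh(R).max()              # largest eigenvalue of the symmetric matrix R
    return 2.0 / lam_max
```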
Linear Unit
Stochastic Gradient Descent
• Gradient descent (A):

  \Delta w_i = \eta\,\frac{1}{D}\sum_{d=1}^{D}\big(t_d - O(x_d)\big)\,x_{di}

• Instead of summing over all data pairs for each update to w, just use one
data pair for each update. At each step, sample an input/target pair { x_d, t_d }
at random from the data (with replacement) and modify w according to (B):

  \Delta w_i = \eta\,\big(t_d - O(x_d)\big)\,x_{di}

This is the celebrated Widrow-Hoff or LMS (for Least Mean Square)
algorithm.
– Note that the gradient descent update (A) is the average of this stochastic
gradient descent update (B) over all training data.
– The stochastic descent is a noisy version of the true gradient descent.
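A minimal sketch of the LMS update loop (rule B) in NumPy; the function name, random seed, and defaults are illustrative assumptions.

```python
import numpy as np

def train_lms(X, t, eta=0.01, n_steps=5000, seed=0):
    """Widrow-Hoff / LMS: one randomly drawn (x_d, t_d) pair per weight update."""
    rng = np.random.default_rng(seed)
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    D = X_aug.shape[0]
    w = np.zeros(X_aug.shape[1])
    for _ in range(n_steps):
        d = rng.integers(D)                    # sample a training pair with replacement
        err = t[d] - X_aug[d] @ w              # (t_d - O(x_d))
        w += eta * err * X_aug[d]              # Delta w_i = eta * (t_d - O(x_d)) * x_di
    return w
```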
Stochastic vs True Gradient Descent
[Figure: weight-space trajectories comparing true (batch) gradient descent with stochastic gradient descent]
Linear Unit with LMS Training
• Used in adaptive filter applications: adaptive
noise cancellation and vibration damping, linear
prediction problems (linear regression, AR
models).
Perceptron Classifier
• Early ANN — Rosenblatt, Principles of Neurodynamics:
Perceptrons and the Theory of Brain Mechanisms, Spartan,
1962.
Single unit with hard-limiter output
• Represents a dichotomy: responds +1 or –1 to an input vector.
The input is a member of the class (+1) or not (–1); the concept is present
(+1), or not (–1).
e.g. – Does this picture have a tree in it? This is tough; the
inputs x will need to be superbly crafted features.
[Figure: a single unit with inputs x_i, weights w_{ji}, and hard-limited output O_j]

O_j = \sigma\!\left(\sum_{i=0}^{N_{in}} w_{ji}\, x_i\right) = \sigma(w \cdot x) = \mathrm{sgn}(w \cdot x),
\qquad
\sigma(y) = \begin{cases} +1 & y \ge 0 \\ -1 & y < 0 \end{cases}
Perceptron Classifier – Geometry
The hypothesis space is the space of all possible weight vectors w
(R^{N+1}).
Learning means choosing a weight vector w that correctly
classifies the training data.
The perceptron weight vector defines a hyperplane s in the
N-dimensional feature space.
[Figure: the separating hyperplane s in the feature space (x_1, x_2, x_3), with the weight vector w normal to s, distance d_0 from the origin, and training points labeled +1 on one side and –1 on the other]

O(x) = \mathrm{sgn}(w \cdot x), \qquad s:\; w \cdot x = w_0 + \sum_{i=1}^{N} w_i\, x_i = 0

w (excluding the bias component w_0) is normal to s, and the distance from the origin to s is d_0 = -w_0 / \|w\|, where \|w\| is the norm of (w_1, \dots, w_N).
Perceptron Limitations
Boolean functions: AND and OR are linearly separable; XOR is not.
The perceptron can only solve linearly separable dichotomies.

[Figure: three panels in the (x_1, x_2) plane with points labeled +1 and –1; a separating line s exists for AND and for OR, but no single line separates the classes for XOR]
Perceptron Learning
• Training data input / target pairs (e.g. pictures with trees, +1
target, and pictures without trees, -1 target) { xd, td }
• We want

  w \cdot x_d > 0 \;\text{ for }\; t_d = +1, \qquad w \cdot x_d < 0 \;\text{ for }\; t_d = -1

this is equivalent to

  (w \cdot x_d)\, t_d > 0 \;\text{ for all data}

A given data example will be misclassified if

  (w \cdot x_d)\, t_d < 0

• Define cost function

  E(w) = -\sum_{d\ \text{misclassified}} (w \cdot x_d)\, t_d \;\ge\; 0

• Do stochastic gradient descent on this cost function: if the
example x_d is misclassified, change the weights according to

  \Delta w_i = \eta\, t_d\, x_{di}
Perceptron Learning
• If the data are linearly separable, this algorithm
will converge, in a finite number of steps, to a
weight that correctly classifies all the training
data.
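A minimal sketch of the perceptron learning rule in NumPy, run on a small linearly separable dichotomy (logical OR with ±1 targets); the names, learning rate, and stopping criterion are illustrative assumptions, not from the slides.

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100):
    """Perceptron rule: on a misclassified (x_d, t_d), add eta * t_d * x_d to w."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # bias input x_0 = 1
    w = np.zeros(X_aug.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_d, t_d in zip(X_aug, t):                 # targets are +1 / -1
            if np.sign(x_d @ w) * t_d <= 0:            # misclassified (treat 0 as wrong)
                w += eta * t_d * x_d
                errors += 1
        if errors == 0:                                # all training data classified correctly
            break
    return w

# Example: a linearly separable dichotomy (logical OR with +/-1 targets).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, 1, 1, 1])
w = train_perceptron(X, t)
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ w))    # -> [-1.  1.  1.  1.]
```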
Soft Threshold
Differentiable “Perceptron”
• In order to get past the restriction to linearly
separable problems, we are going to combine
many non-linear neurons. (Why non-linear?)
• In order to train the resulting networks, we
introduce a sigmoidal unit
[Figure: a single unit with inputs x_i, weights w_i, and output O, together with a plot of the sigmoid \sigma(y)]

O(x) = \sigma(w \cdot x) = \sigma\!\left(\sum_{i=0}^{N} w_i\, x_i\right)

\sigma is smooth, bounded, and monotonically increasing.
Sigmoidal Functions
• Typical choices are
– Logistic function

  \sigma(y) = \frac{1}{1 + \exp(-y)}

– Hyperbolic tangent

  \sigma(y) = \tanh(y)
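A minimal sketch of these two sigmoids and their derivatives in NumPy (the derivatives are used below when training soft-threshold units); the function names are my own.

```python
import numpy as np

def logistic(y):
    """Logistic sigmoid, outputs in (0, 1); pair with {0, 1} targets."""
    return 1.0 / (1.0 + np.exp(-y))

def logistic_prime(y):
    """sigma'(y) = sigma(y) * (1 - sigma(y)) for the logistic function."""
    s = logistic(y)
    return s * (1.0 - s)

def tanh_prime(y):
    """Derivative of tanh: 1 - tanh(y)^2; pair tanh with {-1, 1} targets."""
    return 1.0 - np.tanh(y) ** 2
```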
Training the Soft Threshold
Logistic function – targets are {0,1}
Hyperbolic tangent – targets are {-1,1}
Cost function
Train by gradient descent
E(w) = \frac{1}{2}\,\frac{1}{D}\sum_{d=1}^{D}\big(t_d - O(x_d)\big)^2 = \frac{1}{2D}\sum_{d=1}^{D}\big(t_d - \sigma(w \cdot x_d)\big)^2

\frac{\partial E(w)}{\partial w_i}
= -\frac{1}{D}\sum_{d=1}^{D}\big(t_d - O(x_d)\big)\,\frac{\partial O(x_d)}{\partial w_i}
= -\frac{1}{D}\sum_{d=1}^{D}\big(t_d - O(x_d)\big)\,\frac{\partial \sigma(w \cdot x_d)}{\partial w_i}
= -\frac{1}{D}\sum_{d=1}^{D}\big(t_d - O(x_d)\big)\,\sigma'(w \cdot x_d)\,\frac{\partial (w \cdot x_d)}{\partial w_i}

So

\frac{\partial E(w)}{\partial w_i} = -\frac{1}{D}\sum_{d=1}^{D}\big(t_d - O(x_d)\big)\,\sigma'(w \cdot x_d)\,x_{di}

and the gradient-descent update is

\Delta w_i = -\eta\,\frac{\partial E(w)}{\partial w_i} = \eta\,\frac{1}{D}\sum_{d=1}^{D}\big(t_d - O(x_d)\big)\,\sigma'(w \cdot x_d)\,x_{di}
Training the Soft Threshold
• We have the gradient descent rule

  \Delta w_i = \eta\,\frac{1}{D}\sum_{d=1}^{D}\big(t_d - O(x_d)\big)\,\sigma'(w \cdot x_d)\,x_{di}

just like the linear gradient descent rule,
except for the slope \sigma' of the sigmoidal function.
• Stochastic gradient version

  \Delta w_i = \eta\,\big(t_d - O(x_d)\big)\,\sigma'(w \cdot x_d)\,x_{di}

• Note that if we get up onto the flat “rails” of the sigmoid,
then the slope \sigma' gets very small, so the gradient of the
cost function gets very small and progress is slow.
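A minimal sketch of batch gradient descent for a single logistic unit with {0, 1} targets, including the σ′ factor; the function name, initialization, and defaults are illustrative assumptions.

```python
import numpy as np

def train_sigmoid_unit(X, t, eta=0.5, n_steps=2000):
    """Batch gradient descent for one logistic unit on the squared-error cost."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    D = X_aug.shape[0]
    w = np.zeros(X_aug.shape[1])               # small initial weights stay off the flat "rails"
    for _ in range(n_steps):
        net = X_aug @ w
        O = 1.0 / (1.0 + np.exp(-net))         # sigma(w . x_d)
        slope = O * (1.0 - O)                  # sigma'(w . x_d)
        grad = -(X_aug.T @ ((t - O) * slope)) / D
        w -= eta * grad                        # Delta w = eta * (1/D) sum (t - O) sigma' x
    return w
```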
Cost Function
• The cost surface is now not a simple parabolic
function, but instead is more complex looking.
[Figure: left, a 1-d slice of E along w_1 showing a non-parabolic, bumpy profile; right, a surface plot of E over the (w_1, w_2) plane]
Workhorse Neural Networks
Multi-Layer Perceptrons (MLP)
Feed-forward, layered networks with sigmoidal hidden units.
Can have more than two layers of weights.
Output nodes
Linear for regression, time series prediction, other problems needing full
range of real values in output.
Sigmoidal for classification problems.
Number of inputs, number of outputs determined by problem.
Number of hidden units is an architectural parameter.
More hidden nodes means more functions available.
[Figure: a layered network with inputs at the bottom, sigmoidal hidden units in the middle, and outputs at the top]
MLP Output
• Signal propagation (forward pass, bottom-up)
[Figure: inputs x at the bottom feed hidden units y_k, which feed outputs O_l]

Hidden units:  y_k = \sigma(net_k), \qquad net_k = \sum_i w_{ki}\, x_i

Outputs:  O_l = g(net_l) = g\!\left(\sum_k w_{lk}\, y_k\right)

with  g(net) = \sigma(net) for classification,  g(net) = net for regression.
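A minimal sketch of this forward pass for a two-layer MLP in NumPy, assuming bias weights stored in column 0 of each weight matrix (an illustrative convention, not from the slides).

```python
import numpy as np

def mlp_forward(x, W_hidden, W_out, classification=True):
    """Forward pass of a two-layer MLP with sigmoidal hidden units.
    W_hidden: (n_hidden, n_in + 1), W_out: (n_out, n_hidden + 1); column 0 holds biases."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
    x_aug = np.concatenate(([1.0], x))
    y = sigma(W_hidden @ x_aug)                # hidden activations y_k = sigma(net_k)
    y_aug = np.concatenate(([1.0], y))
    net_out = W_out @ y_aug                    # net_l = sum_k w_lk y_k
    return sigma(net_out) if classification else net_out   # g = sigma or identity

rng = np.random.default_rng(0)
x = np.array([0.2, -0.5])
W_hidden = rng.normal(scale=0.5, size=(3, 3))  # 3 hidden units, 2 inputs + bias
W_out = rng.normal(scale=0.5, size=(1, 4))     # 1 output, 3 hidden units + bias
print(mlp_forward(x, W_hidden, W_out))
```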
Example Uses
• Classification – e.g. from text fig 4.5. Able to produce non-linear
class boundaries.
• Fisher Iris data:
Example Uses
• Non-linear regression: able to map strongly non-linear functions.
[Figure: network output O(x) plotted against x, fitting a strongly non-linear function]
Gradient Descent in MLP
• Cost function as before:
• Learning by gradient descent
• Calculating gradients takes some care.
• Surface can have multiple ‘local’ minima. Some may have
lower cost than others.
• Local optima are in different basins of attraction; where you
end depends on where you start.
E(w) = \frac{1}{2}\,\frac{1}{D}\sum_{d=1}^{D}\sum_{m=1}^{N_O}\big(t_{dm} - O_m(x_d)\big)^2 \qquad (N_O = \text{number of outputs})

\Delta w_{ij} = -\eta\,\frac{\partial E(w)}{\partial w_{ij}}
Stochastic Gradient Descent in MLP
As above, but no sum over data pairs d
Stochastic descent has some robustness against getting stuck in
poor local minima. Where you end up depends on where you start,
the learning rate, and the order in which the examples are given.
Can also be faster in clock-time for large data sets. Instead of
waiting to accumulate errors from all data before making a weight
change, make a small weight change in response to each datum.
E_d(w) = \frac{1}{2}\sum_{m=1}^{N_O}\big(t_{dm} - O_m(x_d)\big)^2

\Delta w_{ij} = -\eta\,\frac{\partial E_d(w)}{\partial w_{ij}}
Visualization of Stochastic
Gradient Descent
• Different 2-d slices through E(w) for 9-d
weight space
– eg 1
Visualization of Stochastic
Gradient Descent
– eg 2
Next
• Backpropagation training of MLP.
• Representation power – universal
approximation theorems.
• Inductive bias.
• Generalization, underfitting, overfitting.
• Bayesian methods for neural networks.