Machine Learning Lecture:
Two-Layer Artificial Neural Networks (ANNs)
C.-C. Hung
Slides used in the classroom only
Textbook
Chapter 18, Section 18.7, pages 727–737.
Outline
What are ANNs?
Biological Neural Networks
ANN – The basics
Feed forward net
Training
Testing
Example – Voice recognition
Some ANNs
Recurrency
Elman nets
Hopfield nets
Characterizing artificial neural networks
What are Artificial Neural Networks?
Models of the brain and nervous system
Highly parallel
Process information much more like the brain than a serial computer
Learning
Very simple principles
Very complex behaviours
Applications
As powerful problem solvers
As biological models
Biological Neural Nets
Pigeons as art experts (Watanabe et al. 1995)
Experiment:
Pigeon in Skinner box (psychological experiments).
Present paintings of two different artists (e.g. Chagall / Van Gogh).
Reward for pecking when presented a particular artist (e.g. Van Gogh).
Chagall and his paintings
Marc Chagall was born Moishe/Marc Shagal in Liozne, near Vitebsk, in modern-day Belarus, in 1887. He was a Russian-French-Jewish artist of international repute who was arguably one of the most influential modernist artists of the 20th century, both as an early modernist and as an important part of the Jewish artistic tradition. He distinguished himself in many arenas: as a painter, book illustrator, ceramicist, stained-glass painter, stage-set designer and tapestry maker. Widely admired by both his contemporaries and by later artists, he forged his creative path in spite of the many difficulties and injustices he faced in his long lifetime.
https://www.marcchagall.net/
Van Gogh and his paintings
Between November of 1881 and July of 1890, Vincent van Gogh painted almost 900 paintings. Since his death, he has become one of the most famous painters in the world. Van Gogh’s paintings have captured the minds and hearts of millions of art lovers and have made art lovers of those new to the world of art. The following excerpts are from letters that Van Gogh wrote expressing how he evolved as a painter. There are also links to pages describing some of Vincent van Gogh’s most famous paintings, Starry Night, Sunflowers, Irises, Poppies, The Bedroom, Blossoming Almond Tree, The Mulberry Tree, The Night Café, and The Potato Eaters, in great detail.
https://www.vangoghgallery.com/painting/
Canal with Women Washing, 1888
Pigeons were able to discriminate between Van Gogh and Chagall with 95% accuracy (when presented with pictures they had been trained on)
Discrimination still 85% successful for previously unseen paintings of the artists
Pigeons do not simply memorize the pictures
They can extract and recognize patterns (the ‘style’)
They generalize from the already seen to make predictions
This is what neural networks (biological and artificial) are good at (unlike conventional computers)
Biological inspiration
Dendrites – Input
Soma (cell body) – Processing Element
Axon – Output
Biological inspiration
synapses
axon
dendrites
Information transmission happens at the synapses (i.e., the weights).
Biological inspiration
The spikes travelling along the axon of the pre-synaptic neuron trigger the release of neurotransmitter substances at the synapse.
The neurotransmitters cause excitation or inhibition in the dendrite of the post-synaptic neuron.
The integration of the excitatory and inhibitory signals may produce spikes in the post-synaptic neuron.
The contribution of the signals depends on the strength of the synaptic connection (i.e. weights).
Architecture of a typical artificial neural network
(Figure: input signals pass through two layers of weights to produce the output signals.)
Analogy between biological and artificial neural networks
Non-Symbolic Representations
Decision trees can be easily read
A disjunction of conjunctions (logic)
We call this a symbolic representation
Non-symbolic representations
More numerical in nature, more difficult to read
Artificial Neural Networks (ANNs)
A Non-symbolic representation scheme
They embed a giant mathematical function
To take inputs and compute an output which is interpreted as a categorization
Often shortened to “Neural Networks”
Don’t confuse them with real neural networks (in heads)
ANNs – The basics
ANNs incorporate the two fundamental components of biological neural nets:
Neurons (nodes)
Synapses (weights)
Neuron vs. Node
Structure of a node in ANN
Squashing function limits node output (e.g., the sigmoid keeps it between 0 and 1)
Synapse vs. weight
Feed-forward nets
Information flow is unidirectional
Data is presented to Input layer
Passed on to Hidden Layer
Passed on to Output layer
Information is distributed
Information processing is parallel
Internal representation (interpretation) of data
Feeding data through the net:
(1 * 0.25) + (0.5 * (-1.5)) = 0.25 + (-0.75) = -0.5
Squashing using the Sigmoid function: y = 1 / (1 + e^(0.5)) = 0.3775
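A minimal Python sketch of this forward step (the numbers are the ones on the slide):

```python
import math

def sigmoid(s):
    """Squashing function: maps any weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

inputs  = [1.0, 0.5]       # activations presented to the two input nodes
weights = [0.25, -1.5]     # weights on the connections into the node

s = sum(x * w for x, w in zip(inputs, weights))   # (1 * 0.25) + (0.5 * -1.5) = -0.5
print(s, sigmoid(s))                              # -0.5  0.3775...
```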
Feed-forward nets: Output
Data is presented to the network in the form of activations in the input layer
Examples
Pixel intensity (for pictures)
Molecule concentrations (for artificial nose)
Share prices (for stock market prediction)
Data usually requires preprocessing
Analogous to senses in biology
How to represent more abstract data, e.g. a name?
Choose a pattern, e.g.
0-0-1 for “Chris”
0-1-0 for “Becky”
Feed-forward nets: Input
Weight settings determine the behaviour of a network
How can we find the right weights?
Answer: Training
Feed-forward nets: Weights
Training the Network – Learning
Advantages
It works!
Relatively fast.
Downsides
Requires a training set.
Training can be slow.
Probably not biologically realistic.
Alternatives to Backpropagation
Hebbian learning: Not successful in feed-forward nets.
Reinforcement learning: Only limited success.
Artificial evolution (Genetic Algorithm)
More general, but can be even slower than backpropagation.
Feed-forward nets
Example: Voice Recognition
Task: Learn to discriminate between two different voices saying “Hello”.
Data
Sources
Steve Simpson
David Raubenheimer
Format
Frequency distribution (60 bins)
Analogy: cochlea
Network architecture
Feed-forward network
60 input (one for each frequency bin)
6 hidden
2 output (0-1 for “Steve”, 1-0 for “David”)
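A minimal sketch of a feed-forward pass through a network of this shape (60-6-2). The random weights and inputs are placeholders, not the trained network from the experiment, and bias inputs are omitted for brevity:

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def layer(inputs, weight_rows):
    """One fully connected layer: a weighted sum per node, then the squashing function."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, row))) for row in weight_rows]

n_in, n_hidden, n_out = 60, 6, 2   # architecture from the slide
w_hidden = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w_out    = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]

spectrum = [random.random() for _ in range(n_in)]   # 60 frequency bins of one "Hello"
hidden   = layer(spectrum, w_hidden)
output   = layer(hidden, w_out)   # after training, ~(0, 1) for "Steve" and ~(1, 0) for "David"
print(output)
```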
Example: Voice Recognition
Results – Voice Recognition
Performance of trained network
Discrimination accuracy between known “Hello”s
100%
Discrimination accuracy between new “Hello”s
100%
Example: Voice Recognition
Results – Voice Recognition (cont’d)
Network has learnt to generalize from original data.
Networks with different weight settings can have same functionality.
Trained networks ‘concentrate’ on lower frequencies.
Network is robust against non-functioning nodes.
Example: Voice Recognition
Pattern recognition
Character recognition
Face Recognition
Sonar mine/rock recognition (Gorman & Sejnowski, 1988)
Navigation of a car (Pomerleau, 1989)
Stock-market prediction
Pronunciation: NETtalk (Sejnowski & Rosenberg, 1987)
Applications of Feed-forward nets
Function Learning of Neural Networks
Map categorization learning to numerical problem
Each category given a number.
Or a range of real valued numbers (e.g., 0.5 – 0.9).
Function learning examples
Input = 1, 2, 3, 4 Output = 1, 4, 9, 16
Here the concept to learn is squaring integers
Input = [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]
Output = 1, 5, 11, 19
Here the concept is: [a, b, c] -> a*c – b
The calculation is more complicated than in the first example.
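A one-line check of this mapping:

```python
# The target function of the second example: [a, b, c] -> a*c - b
examples = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
print([a * c - b for a, b, c in examples])   # [1, 5, 11, 19]
```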
Neural networks:
Calculation is much more complicated in general.
But it is still just a numerical calculation.
Complicated Example:
Categorizing Vehicles
Input to function: pixel data from vehicle images
Output: numbers: 1 for a car; 2 for a bus; 3 for a tank
(Figure: four example images, with outputs 3, 2, 1 and 1.)
So, what functions can we use?
Biological motivation:
The brain does categorisation tasks like this easily.
The brain is made up of networks of neurons.
Naturally occurring neural networks
Each neuron is connected to many others.
Input to one neuron is the output from many others.
Neuron “fires” if a weighted sum S of inputs > threshold.
Artificial neural networks
Similar hierarchy with neurons firing.
Don’t take the analogy too far
Human brains: 100,000,000,000 neurons
ANNs: < 1000 usually
ANNs are a gross simplification of real neural networks
Recurrent Networks
Feed forward networks:
Information only flows one way
One input pattern produces one output
No sense of time (or memory of previous state)
Recurrency
Nodes connect back to other nodes or themselves
Information flow is multidirectional
Sense of time and memory of previous state(s)
Biological nervous systems show high levels of recurrency (but feed-forward structures exist too)
Elman Nets
Elman nets are feed forward networks with partial recurrency.
Unlike feed forward nets, Elman nets have a memory or sense of time.
Classic experiment on language acquisition and processing (Elman, 1990)
Task
Elman net to predict successive words in sentences.
Data
Suite of sentences, e.g.
“The boy catches the ball.”
“The girl eats an apple.”
Words are input one at a time
Representation
Binary representation for each word, e.g.
0-1-0-0-0 for “girl”
Training method
Backpropagation
Hopfield Networks
Sub-type of recurrent neural nets
Fully recurrent.
Weights are symmetric.
Nodes can only be on or off.
Random updating.
Learning: Hebb rule (cells that fire together wire together)
Biological equivalent to LTP (long-term potentiation) and LTD (long-term depression).
Can recall a memory if presented with a corrupt or incomplete version (hence called auto-associative or content-addressable memory)
Task: store images with resolution of 20x20 pixels
Hopfield net with 400 nodes
Memorize:
1. Present an image.
2. Apply the Hebb rule (cells that fire together, wire together): increase the weight between two nodes if both have the same activity, otherwise decrease it.
3. Go to 1.
Recall:
1. Present an incomplete pattern.
2. Pick a random node and update it.
3. Go to 2 until the network has settled.
Memories are attractors in state space
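A small Python sketch of the memorize/recall procedure above, using a toy 8-node pattern instead of a 20x20 image (the weight scaling and the fixed number of update steps are simple choices for illustration):

```python
import random

def train(patterns, n):
    """Hebb rule: raise w[i][j] when nodes i and j have the same activity, lower it otherwise."""
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:                       # each pattern is a list of +1/-1 node states
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]   # symmetric weights, no self-connections
    return w

def recall(w, state, steps=2000):
    """Random asynchronous updating: pick a node, set it by the sign of its weighted input."""
    n = len(state)
    state = list(state)
    for _ in range(steps):
        i = random.randrange(n)
        s = sum(w[i][j] * state[j] for j in range(n))
        state[i] = 1 if s >= 0 else -1
    return state

# Store one 8-node pattern, then recall it from a corrupted copy (2 of 8 nodes flipped)
pattern = [1, -1, 1, 1, -1, -1, 1, -1]
w = train([pattern], len(pattern))
noisy = list(pattern); noisy[0] = -noisy[0]; noisy[3] = -noisy[3]
print(recall(w, noisy) == pattern)           # True: the net settles back onto the stored memory
```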
Recap: Artificial neural networks
Inputs
Output
An artificial neural network is composed of many artificial neurons that are linked together according to a specific network architecture.
The objective of the neural network is to transform the inputs into meaningful outputs.
General idea
(Figure: numbers are presented to the input layer, values propagate through the hidden layers, and the output layer gives one value per category, e.g. Cat A, Cat B, Cat C. Each output value is calculated using all of the input unit values, and the category with the largest output value, here Cat A, is chosen.)
Artificial Neural Network
trying to mimic brain?
mathematical model?
(Figure: a biological neuron, showing the axon, dendrites, cell body and synapses.)
Characterizing Artificial Neural Networks: Three Elements
Topology (Architecture)
Transfer function (squashing function)
Training algorithm (learning algorithm)
What is an artificial Neuron?
(Figure: the inputs to the neuron are summed (S) and passed through the transfer function to give the output from the neuron.)
The McCulloch-Pitts model
The transfer function is also called the squashing function or the activation function.
Artificial neurons
The McCulloch-Pitts model:
spikes are interpreted as spike rates;
synaptic strengths are translated into synaptic weights;
excitation means a positive product between the incoming spike rate and the corresponding synaptic weight;
inhibition means a negative product between the incoming spike rate and the corresponding synaptic weight.
Artificial neurons
Nonlinear generalization of the McCulloch-Pitts neuron: y = f(x, w)
y is the neuron’s output, x is the vector of inputs, and w is the vector of synaptic weights.
Examples:
Sigmoidal neuron: y = 1 / (1 + exp(-(x^T w) - a))
Gaussian neuron: y = exp(-||x - w||^2 / (2a^2))
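A hedged sketch of the two example neurons, directly coding the formulas above (the parameter a plays the role of the bias for the sigmoidal neuron and of the width for the Gaussian neuron):

```python
import math

def sigmoidal_neuron(x, w, a):
    """y = 1 / (1 + exp(-(x . w) - a)); the parameter a shifts the sigmoid."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    return 1.0 / (1.0 + math.exp(-s - a))

def gaussian_neuron(x, w, a):
    """y = exp(-||x - w||^2 / (2 a^2)); responds most strongly when x is close to w."""
    d2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return math.exp(-d2 / (2.0 * a * a))

x = [0.5, -1.0, 2.0]
w = [1.0, 0.25, -0.5]
print(sigmoidal_neuron(x, w, a=0.1), gaussian_neuron(x, w, a=1.0))
```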
Activation/Squashing functions
Please note that X is the sum S from the previous page.
Representation of Information
If ANNs can correctly identify vehicles
They then contain some notion of “car”, “bus”, etc.
The categorisation is produced by the nodes: exactly how the input reals are turned into outputs.
But, in practice:
Each unit does the same calculation, based on the weighted sum of its inputs.
So the weights in the weighted sum are where the information is really stored.
We draw weights onto the ANN diagrams (see later)
“Black Box” representation:
Useful knowledge about learned concept is difficult to extract
ANN learning problem
Given a categorisation to learn (expressed numerically)
And training examples/samples represented numerically
With the correct categorisation for each example
Learn a neural network using the examples
which produces the correct output for unseen examples
Boils down to
(a) Choosing the correct network architecture (topology)
Number of hidden layers, number of units, etc.
(b) Choosing (the same) function for each unit (transfer function)
(c) Training the weights between units to work correctly (training algorithm)
Special Cases
Generally, we can have many hidden layers
In practice, usually only one or two.
Next lecture:
Look at ANNs with one hidden layer
Multi-layer ANNs
This lecture:
Look at ANNs with no hidden layer
Two-layer ANNs
Example: Perceptrons
Perceptrons
Multiple input nodes.
Single output node.
Takes a weighted sum of the inputs, call this S.
Unit function calculates the output for the network.
Useful to study because
We can use perceptrons to build larger networks.
Perceptrons have limited representational abilities
We will look at concepts they can’t learn later
Squashing Functions
Linear Functions
Simply output the weighted sum.
Threshold Functions
Output low values
Until the weighted sum gets over a threshold.
Then output high values.
Equivalent of “firing” of neurons.
Step function:
Output is +1 if S > Threshold T.
Output is –1 otherwise.
Sigma function:
Similar to step function but differentiable (next lecture).
(Figures: step function and sigma function.)
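A minimal sketch of the two unit functions described above (the threshold and the ±1 outputs follow the step-function definition on this slide):

```python
import math

def step(s, threshold=0.0):
    """Step unit function: output +1 if the weighted sum S exceeds the threshold T, -1 otherwise."""
    return 1 if s > threshold else -1

def sigma(s):
    """Sigma (sigmoid) unit function: a smooth, differentiable version of the step."""
    return 1.0 / (1.0 + math.exp(-s))

print(step(0.3, threshold=0.1), step(-0.3, threshold=0.1), sigma(0.3))
```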
Example: Perceptron
Categorisation of 2 x 2 pixel black & white images
Into “bright” and “dark”
Representation of this rule:
If it contains 2, 3 or 4 white pixels, it is “bright”
If it contains 0 or 1 white pixels, it is “dark”
Perceptron architecture:
Four input nodes, one for each pixel.
One output node: +1 for bright, -1 for dark.
Example: Perceptron
Example calculation: x1=-1, x2=1, x3=1, x4=-1
S = 0.25*(-1) + 0.25*(1) + 0.25*(1) + 0.25*(-1) = 0
0 > -0.1, so the output from the ANN is +1
So the image is categorised as “bright”
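The same calculation as a small Python sketch; the 0.25 weights and the -0.1 threshold are taken from the slide, while the white = +1 / black = -1 pixel encoding is an assumption:

```python
def classify(pixels, weights=(0.25, 0.25, 0.25, 0.25), threshold=-0.1):
    """Bright/dark perceptron: output +1 ("bright") if S > T, otherwise -1 ("dark")."""
    s = sum(x * w for x, w in zip(pixels, weights))
    return s, (1 if s > threshold else -1)

# Assumed encoding: white pixel = +1, black pixel = -1; example x1..x4 = -1, 1, 1, -1
print(classify([-1, 1, 1, -1]))   # (0.0, 1): S = 0 > -0.1, so "bright"
```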
Learning in Perceptrons
Need to learn
Both the weights between input and output units.
And the value for the threshold (What is this? Next slide)
Make calculations easier by
Thinking of the threshold as a weight from a special input unit where the output from the unit is always 1.
Exactly the same result
But we only have to worry about learning weights.
New Representation for Perceptrons
Special input node: always produces 1 (called the bias neuron).
The threshold has become the weight on this bias input, so the unit simply fires when the weighted sum (including the bias term) exceeds 0.
Bias Neuron
The bias neuron is a special neuron added to each layer in the neural network, which simply stores the value of 1. This makes it possible to move or “translate” the activation function left or right on the graph.
Without a bias neuron, each neuron takes the input and multiplies it by a weight, with nothing else added to the equation. For example, it is not possible to input a value of 0 and output 2. In many cases, it is necessary to move the entire activation function to the left or right to generate the required output values—this is made possible by the bias.
Although neural networks can work without bias neurons, in reality, they are almost always added, and their weights are estimated as part of the overall model.
Learning Algorithm
Weights are set randomly initially
For each training example E
Calculate the observed output from the ANN, o(E)
If the target output t(E) is different to o(E)
Then tweak all the weights so that o(E) gets closer to t(E)
Tweaking is done by perceptron training rule (next slide)
This routine is done for every input example E
Don’t necessarily stop when all examples used
Repeat the cycle again (called an ‘epoch’)
Until the ANN produces the correct output
For all the examples in the training set (or good enough)
Perceptron Training Rule
When t(E) is different to o(E)
Add on Δi to weight wi
Where Δi = η(t(E)-o(E))xi
Do this for every weight in the network
η is the learning rate (between 0.0 and 1.0) and xi is input.
Interpretation:
(t(E) – o(E)) will either be + or –
So we can think of the addition of Δi as the movement of the weight in a direction
Which will improve the network’s performance with respect to E
Multiplication by xi
Moves it more if the input is bigger
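Putting the learning algorithm and the training rule together, a minimal sketch (it assumes the bias-neuron representation from earlier, with x[0] fixed at 1):

```python
def perceptron_output(weights, x):
    """x[0] is the bias input, fixed at 1, so the threshold is folded into weights[0]."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else -1

def apply_rule(weights, x, target, output, eta=0.1):
    """Perceptron training rule: w_i <- w_i + eta * (t(E) - o(E)) * x_i, for every weight."""
    return [w + eta * (target - output) * xi for w, xi in zip(weights, x)]

def train(weights, examples, eta=0.1, max_epochs=100):
    """Repeat whole passes (epochs) over the training set until every example is correct."""
    for _ in range(max_epochs):
        all_correct = True
        for x, target in examples:              # each x already includes the bias input 1
            output = perceptron_output(weights, x)
            if output != target:                # only tweak the weights on a mistake
                weights = apply_rule(weights, x, target, output, eta)
                all_correct = False
        if all_correct:
            break
    return weights
```

The worked example on the following slides performs one application of this rule by hand.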
The Learning Rate
η is called the learning rate
Usually set to something small (e.g., 0.1) (can be negative)
Learning rate: to control the movement of the weights
Not to move too far for one example
Which may over-compensate for another example
If a large movement is actually necessary for the weights to correctly categorise E
This will occur over time with multiple epochs
Worked Example
Return to the “bright” and “dark” example
Use a learning rate of η = 0.1
Suppose we have set random weights: w0 = -0.5, w1 = 0.7, w2 = -0.2, w3 = 0.1, w4 = 0.9 (w0 is the weight on the bias input)
Worked Example
Use this training example, E, to update weights:
Here, x1 = -1, x2 = 1, x3 = 1, x4 = -1 as before
Propagate this information through the network:
S = (-0.5 * 1) + (0.7 * -1) + (-0.2 * +1) + (0.1 * +1) + (0.9 * -1) = -2.2
Hence the network outputs o(E) = -1
But this should have been “bright”=+1
So t(E) = +1
Calculating the Error Values
Δ0 = η(t(E)-o(E))x0
= 0.1 * (1 – (-1)) * (1) = 0.1 * (2) = 0.2
Δ1 = η(t(E)-o(E))x1
= 0.1 * (1 – (-1)) * (-1) = 0.1 * (-2) = -0.2
Δ2 = η(t(E)-o(E))x2
= 0.1 * (1 – (-1)) * (1) = 0.1 * (2) = 0.2
Δ3 = η(t(E)-o(E))x3
= 0.1 * (1 – (-1)) * (1) = 0.1 * (2) = 0.2
Δ4 = η(t(E)-o(E))x4
= 0.1 * (1 – (-1)) * (-1) = 0.1 * (-2) = -0.2
Calculating the New Weights
w’0 = -0.5 + Δ0 = -0.5 + 0.2 = -0.3
w’1 = 0.7 + Δ1 = 0.7 + -0.2 = 0.5
w’2 = -0.2 + Δ2 = -0.2 + 0.2 = 0
w’3= 0.1 + Δ3 = 0.1 + 0.2 = 0.3
w’4 = 0.9 + Δ4 = 0.9 – 0.2 = 0.7
New Look Perceptron
Calculate for the example, E, again:
S = (-0.3 * 1) + (0.5 * -1) + (0 * +1) + (0.3 * +1) + (0.7 * -1) = -1.2
Still gets the wrong categorisation
But the value is closer to zero (from -2.2 to -1.2)
In a few epochs time, this example will be correctly categorised
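A short script that reproduces this worked example end to end (values agree with the slides up to floating-point rounding):

```python
eta     = 0.1
weights = [-0.5, 0.7, -0.2, 0.1, 0.9]   # w0 (bias weight) .. w4, the starting weights
x       = [1, -1, 1, 1, -1]             # x0 = 1 (bias input), then x1..x4 from the image
target  = 1                             # the image is "bright"

s = sum(w * xi for w, xi in zip(weights, x))
output = 1 if s > 0 else -1
print(s, output)                        # approx -2.2, -1 (wrong category)

deltas  = [eta * (target - output) * xi for xi in x]
weights = [w + d for w, d in zip(weights, deltas)]
print(weights)                          # approx [-0.3, 0.5, 0.0, 0.3, 0.7]

s = sum(w * xi for w, xi in zip(weights, x))
print(s)                                # approx -1.2: still wrong, but closer to zero
```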
Learning Abilities of Perceptrons
Perceptrons are a very simple network.
Computational learning theory
Study of which concepts can and can’t be learned
By particular learning techniques (representation, method)
Minsky and Papert’s influential book
Showed the limitations of perceptrons.
Cannot learn some simple Boolean functions
Caused a “winter” of research for ANNs in AI
People thought it represented a fundamental limitation
But perceptrons are the simplest network
ANNs were revived by neuroscientists, etc.
Boolean Functions
Take in two inputs (-1 or +1)
Produce one output (-1 or +1)
In other contexts, use 0 and 1
Example: AND function
Produces +1 only if both inputs are +1
Example: OR function
Produces +1 if either input is +1
Related to the logical connectives from first-order logic (FOL).
Boolean Functions as Perceptrons
Problem: XOR boolean function
Produces +1 only if inputs are different
Cannot be represented as a perceptron
Because it is not linearly separable
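For contrast, AND and OR are linearly separable and can each be represented by a single perceptron. The weights and thresholds below are one possible choice for the ±1 encoding (they are not taken from the slides):

```python
def unit(x1, x2, w1, w2, threshold):
    """A single threshold unit over two +/-1 inputs: +1 if w1*x1 + w2*x2 > threshold, else -1."""
    return 1 if w1 * x1 + w2 * x2 > threshold else -1

def AND(x1, x2):
    return unit(x1, x2, 1, 1, 1.5)    # fires (+1) only when both inputs are +1

def OR(x1, x2):
    return unit(x1, x2, 1, 1, -1.5)   # fires (+1) when at least one input is +1

for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, AND(a, b), OR(a, b))
```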
Linearly Separable: Boolean Functions
Linearly separable:
Can use a line (dotted) to separate +1 and –1
Think of the line as representing the threshold
Angle of line determined by two weights in perceptron
Y-axis crossing determined by threshold
Remember: You are smart. Why?
XOR function: 2 – 3 neurons.
Fruit fly’s brain: around 200,000 neurons.
Firefly-inspired algorithms (some intelligence).
Human beings: Billions of neurons.
Therefore, you are potentially an “A” student.
XOR function
The XOR function can be implemented with only 2 neurons; however, such a network is no longer a (single-layer) perceptron.
XOR function with 2 neurons
XOR
Please note that 1.5 and 0.5 inside the boxes are thresholds
X Y XOR
0 0 0
0 1 1
1 0 1
1 1 0
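A sketch of the 2-neuron XOR network. The thresholds 1.5 and 0.5 are the ones noted on the slide; the -2 connection from the hidden unit to the output unit is an assumed (standard) choice, and the figure’s exact weights may differ:

```python
def step(s):
    return 1 if s > 0 else 0

def xor_two_neurons(x, y):
    """Hidden unit (threshold 1.5) acts as AND on the inputs; the output unit (threshold 0.5)
    combines the raw inputs with an inhibitory -2 connection from the hidden unit."""
    h = step(x + y - 1.5)             # fires only for input (1, 1)
    return step(x + y - 2 * h - 0.5)  # fires when exactly one input is 1

for x in (0, 1):
    for y in (0, 1):
        print(x, y, xor_two_neurons(x, y))   # reproduces the truth table: 0, 1, 1, 0
```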
XOR function with 3 neurons
X Y XOR
0 0 0
0 1 1
1 0 1
1 1 0
Perceptron learning algorithm (The binary version)
Step 1: Apply input X and calculate output y, then apply threshold operation. (note that X is a vector)
Step 2:
a) If y is correct, change nothing and go to step 1.
b) If y is incorrect and is zero, add each input to its corresponding weight or
c) If y is incorrect and is one, subtract each input from its corresponding weight.
Step 3: go to step 1.
Example for perceptron learning
Example: learning the OR function, with initial weights set randomly to w1 = 1, w2 = 0, and threshold T = 1
Steps 1 and 2: Apply input and calculate output
w1  w2 | input1 input2 | output | Correct (Y/N)
 1   0 |   0      0    |   0    | Y
 1   0 |   0      1    |   0    | N, so add 0 to w1 and 1 to w2
 1   1 |   1      0    |   1    | Y
 1   1 |   1      1    |   1    | Y
 1   1 |   0      0    |   0    | Y
 1   1 |   0      1    |   1    | Y
 1   1 |   1      0    |   1    | Y
 1   1 |   1      1    |   1    | Y
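A minimal implementation of the binary-version learning algorithm; run on the OR example above, it reproduces the final weights w1 = 1, w2 = 1:

```python
def train_binary_perceptron(examples, w, threshold, passes=3):
    """Binary perceptron rule: on an error, add the inputs to the weights if the output
    was 0, or subtract them if the output was 1; do nothing when the output is correct."""
    for _ in range(passes):
        for x, target in examples:
            # the trace in the table implies the unit fires when the sum reaches the threshold
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= threshold else 0
            if y == 0 and target == 1:
                w = [wi + xi for wi, xi in zip(w, x)]
            elif y == 1 and target == 0:
                w = [wi - xi for wi, xi in zip(w, x)]
    return w

# OR function, starting from w1 = 1, w2 = 0, T = 1 as in the example above
or_examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
print(train_binary_perceptron(or_examples, [1, 0], threshold=1))   # [1, 1]
```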
Perceptron learning algorithm (The continuous version)
Step 1: Apply input x and calculate output y.
Step 2: wi(n + 1) = wi(n) + Δi, where Δi = η(t(x) - o(x))xi
If t(x) - o(x) = 0: the output is correct, so the weights are unchanged.
If t(x) - o(x) > 0: the weights move towards x (the weighted sum increases).
If t(x) - o(x) < 0: the weights move away from x (the weighted sum decreases).
Step 3: go to step 1.
Exercise: Use the continuous version for any Boolean functions.
Recap – Neural Networks
Components – biological plausibility
Neurone / node
Synapse / weight
Feed forward networks
Unidirectional flow of information
Good at extracting patterns, generalisation and prediction
Distributed representation of data
Parallel processing of data
Training: Backpropagation
Not exact models, but good at demonstrating principles
Recurrent networks
Multidirectional flow of information
Memory / sense of time
Complex temporal dynamics (e.g. central pattern generators, CPGs)
Various training methods (Hebbian, evolution)
Often better biological models than FFNs
Summary
Artificial neural networks are inspired by the learning processes that take place in biological systems.
Artificial neurons and neural networks try to imitate the working mechanisms of their biological counterparts.
Learning can be perceived as an optimisation process.
Biological neural learning happens by the modification of the synaptic strength. Artificial neural networks learn in the same way.
The synapse strength modification rules for artificial neural networks can be derived by applying mathematical optimisation methods.
Summary
Learning tasks of artificial neural networks can be reformulated as function approximation tasks.
Neural networks can be considered as nonlinear function approximating tools (i.e., linear combinations of nonlinear basis functions), where the parameters of the networks should be found by applying optimisation methods.
The optimisation is done with respect to the approximation error measure.
In general, it is enough to have a single-hidden-layer neural network (such as a multi-layer feed-forward back-propagation network) to learn the approximation of a nonlinear function. In such cases, general optimisation can be applied to find the change rules for the synaptic weights.
Key Concepts
Neurons
Topology
Transfer functions (i.e. Squashing functions)
Training algorithms (i.e. Learning algorithms)
Perceptron
Implement simple Boolean functions using perceptrons
References
Simon Haykin, Neural Networks: A Comprehensive Foundation, IEEE Press, 1994
Exercises and Examples
Design a perceptron which can perform logic AND function using Perceptron learning algorithm (The binary version).
Epoch | Input (x, y) | Desired Output | Weights (w1, w2) | Actual Output | Error | Adjusted Weights
  1   |              |                |                  |               |       |
  2   |              |                |                  |               |       |
  3   |              |                |                  |               |       |
  4   |              |                |                  |               |       |
Exercises and Examples
Design a perceptron which can perform logic OR function using Perceptron learning algorithm (The binary version).
Epoch | Input (x, y) | Desired Output | Weights (w1, w2) | Actual Output | Error | Adjusted Weights
  1   |              |                |                  |               |       |
  2   |              |                |                  |               |       |
  3   |              |                |                  |               |       |
  4   |              |                |                  |               |       |
Questions & Suggestions?
The End
Appendix
Linearly Separable Functions
Result extends to functions taking many inputs
And outputting +1 and –1
Also extends to higher dimensions for outputs
More on activation functions
Step function: Y = 1 if X ≥ 0; Y = 0 if X < 0
Sign function: Y = +1 if X ≥ 0; Y = -1 if X < 0
Sigmoid function: Y = 1 / (1 + e^(-X))
Linear function: Y = X
(Figure labels, analogy between biological and artificial neural networks: Soma ↔ Neuron; Dendrite ↔ Input; Axon ↔ Output; Synapse ↔ Weight; layers: Input Layer, Middle Layer, Output Layer.)