Lecture 8: The Perceptron
COMP90049
Introduction to Machine Learning
Semester 1, 2020
Lea Frermann, CIS
1
Introduction
Roadmap
So far… Naive Bayes and Logistic Regression
• Probabilistic models
• Maximum likelihood estimation
• Examples and code
Today… The Perceptron
• Geometric motivation
• Error-based optimization
• …towards neural networks
2
Roadmap
So far… Naive Bayes and Logistic Regression
• Probabilistic models
• Maximum likelihood estimation
• Examples and code
Today… The Perceptron
• Geometric motivation
• Error-based optimization
• …towards neural networks
2
Recap: Classification algorithms
Naive Bayes
• Generative model of p(x , y)
• Find optimal parameter that maximize the log data likelihood
• Unrealistic independence assumption p(x |y) =
∏
i p(xi |y)
Logistic Regression
• Discriminative model of p(y |x)
• Find optimal parameters that maximize the conditional log data likelihood
• Allows for more complex features (fewer assumptions)
Perceptron
• Biological motivation: imitating neurons in the brain
• No more probabilities
• Instead: minimize the classification error directly
3
Recap: Classification algorithms
Naive Bayes
• Generative model of p(x , y)
• Find optimal parameter that maximize the log data likelihood
• Unrealistic independence assumption p(x |y) =
∏
i p(xi |y)
Logistic Regression
• Discriminative model of p(y |x)
• Find optimal parameters that maximize the conditional log data likelihood
• Allows for more complex features (fewer assumptions)
Perceptron
• Biological motivation: imitating neurons in the brain
• No more probabilities
• Instead: minimize the classification error directly
3
Introduction: Neurons in the Brain
• Humans are the best learners we know
• Can we take inspiration from human learning
• → the brain!
4
Introduction: Neurons in the Brain
4
Introduction: Neurons in the Brain
Source: https://upload.wikimedia.org/wikipedia/commons/1/10/Blausen_0657_MultipolarNeuron.png
4
https://upload.wikimedia.org/wikipedia/commons/1/10/Blausen_0657_MultipolarNeuron.png
Introduction: Neurons in the Brain
The hype
• 1943 McCulloch and Pitts introduced the first ‘artificial neurons’
• If the weighted sum of inputs is equal to or greater than a threshold,
then the output is 1. Otherwise the output is 0.
• the weights needed to be designed by hand
• In 1958 Rosenblatt invented the Perceptron, which can learn the
optimal parameters through the perceptron learning rule
• The perceptron can be trained to learn the correct weights, even if
randomly initialized [[ for a limited set of problems ]].
4
Introduction: Neurons in the Brain
The AI winter
• A few years later Misky and Papert (too?) successfully pointed out the
fundamental limitations of the perceptron.
• As a result, research on artificial neural networks stopped until the
mid-1980s
• But the limitations can be overcome by combining multiple perceptrons
into Artificial Neural Networks
• The perceptron is the basic component of today’s deep learning success!
4
Introduction: Artificial Neurons I
x
1
x
2
…
x
F
f(x;θ,b) y
input weights
activation
output
1
y = f(θTx)
θ
2
…
θ
F
θ
1
θ
0
5
Introduction: Artificial Neurons I
x
1
x
2
…
x
F
y
input weights
activation
output
1
y = f(θx+b)
f: y = 1 if f(θTx) >= 0
y = -1 if f(θTx) < 0
θ
2
...
θ
F
θ
1
θ
0
5
Perceptron: Definition I
• The Perceptron is a minimal neural network
• neural networks are composed of neurons
• A neuron is defined as follows:
• input = a vector x of numeric inputs (〈1, x1, x2, ...xn〉)
• output = a scalar yi ∈ R
• hyper-parameter: an activation function f
• parameters: θ = 〈θ0, θ1, θ2, ...θn〉
• Mathematically:
y i = f
∑
j
θjx
i
j
= f (θT x i)
6
Perceptron: Definition II
• Task: binary classification of instances into classes 1 and −1
• Model: a single-neuron (aka a “perceptron”) :
f (θT x) =
1 if θ
T x ≥ 0
−1 otherwise
• θT x is the decision boundary
• Graphically, f is the step function
7
Perceptron: Definition II
• Task: binary classification of instances into classes 1 and −1
• Model: a single-neuron (aka a “perceptron”) :
f (θT x) =
1 if θ
T x ≥ 0
−1 otherwise
• θT x is the decision boundary
• Example: 2-d case:
7
Towards the Perceptron Algorithm I
• As usual, learning means to modify the parameters (i.e., weights) of the
perceptron so that performance is optimized
• The perceptron is a supervised classification algorithm, so we learn
from observations of input-label pairs
(x1, y1), (x2, y2), ...(xN , yN)
• Simplest way to learn: compare predicted outputs ŷ against true outputs
y and minimize the number of mis-classifications. Unfortunately,
mathematically inconvenient.
• Second simplest idea: Find θ such that gap between the predicted value
ŷ i ← f (θT x i) and the true class label y ∈ {−1, 1} is minimized
More formally
Initialize parameters θ ← 0
for training sample (x , y ) do
Calculate the output ŷ = f (θT x)
if y = 1 and ŷ = −1 then
θ(new) ← θ(old) + x
if y = −1 and ŷ = 1 then
θ(new) ← θ(old) − x
until tired
8
Towards the Perceptron Algorithm I
Intuition Iterate over the training data and modify weights:
• if the true label y = 1 and ŷ = 1 then do nothing
• if the true label y = −1 and ŷ = −1 then do nothing
• if the true label y = 1 but ŷ = −1 then increase weights
• if the true label y = −1 but ŷ = 1 then decrease weights
More formally
Initialize parameters θ ← 0
for training sample (x , y ) do
Calculate the output ŷ = f (θT x)
if y = 1 and ŷ = −1 then
θ(new) ← θ(old) + x
if y = −1 and ŷ = 1 then
θ(new) ← θ(old) − x
until tired
8
Towards the Perceptron Algorithm I
Intuition Iterate over the training data and modify weights:
• if the true label y = 1 and ŷ = 1 then do nothing
• if the true label y = −1 and ŷ = −1 then do nothing
• if the true label y = 1 but ŷ = −1 then increase weights
• if the true label y = −1 but ŷ = 1 then decrease weights
More formally
Initialize parameters θ ← 0
for training sample (x , y ) do
Calculate the output ŷ = f (θT x)
if y = 1 and ŷ = −1 then
θ(new) ← θ(old) + x
if y = −1 and ŷ = 1 then
θ(new) ← θ(old) − x
until tired
8
Towards the Perceptron Algorithm II
• We set a learning rate or step size η
• and note that
(y i − ŷ i) =
0 if y i == ŷ i
2 if y i = 1and= ŷ i = −1
−2 if y i = −1and= ŷ i = 1
(1)
• For each individual weight θj , we compute an update such that
θj ← θj + η(y
i − ŷ i)x ij
9
The Perceptron Algorithm
D = {(xi , y i)|i = 1, 2, ...,N} the set of training instances
Initialise the weight vector θ ← 0
t← 0
repeat
t← t+1
for each training instance (x i , y i) ∈ D do
compute ŷ i,(t) = f (θT x i)
if ŷ i,(t) 6= y i then
for each each weight θj do
update θ(t)j ← θ
(t−1)
j + η(y
i − ŷ i,(t))x ij
else
θ
(t)
j ← θ
(t−1)
j
until tired
Return θ(t)
10
An example
Perceptron Example I
• Training instances:
〈xi1, xi2〉 yi
〈1, 1〉 1
〈1, 2〉 1
〈0, 0〉 −1
〈−1, 0〉 −1
• Learning rate η = 1
11
Perceptron Example II
• θ = 〈0, 0, 0〉
• learning rate: η = 1
• Epoch 1:
〈x1, x2〉 θ1 · 1 + θ2 · x1 + θ3 · x2 ŷ
(1)
i yi
〈1, 1〉 0 + 1× 0 + 1× 0 = 0 1 1
〈1, 2〉 0 + 1× 0 + 2× 0 = 0 1 1
〈0, 0〉 0 + 0× 0 + 0× 0 = 0 1 -1
Update to θ = 〈−2, 0, 0〉
〈−1, 0〉 −2 +−1× 0 + 0× 0 = −2 -1 -1
12
Perceptron Example III
• θ = 〈−2, 0, 0〉
• learning rate: η = 1
• Epoch 2:
〈x1, x2〉 θ1 · 1 + θ2 · x1 + θ3 · x2 ŷ
(2)
i yi
〈1, 1〉 −2 + 1× 0 + 1× 0 = −2 -1 1
Update to θ = 〈0, 2, 2〉
〈1, 2〉 0 + 1× 2 + 2× 2 = 6 1 1
〈0, 0〉 0 + 0× 2 + 0× 2 = 0 1 -1
Update to θ = 〈−2, 2, 2〉
〈−1, 0〉 −2 +−1× 2 + 0× 2 = −4 -1 -1
13
Perceptron Example IV
• θ = 〈−2, 2, 2〉
• learning rate: η = 1
• Epoch 3:
〈x1, x2〉 θ1 · 1 + θ2 · x1 + θ3 · x2 ŷ
(3)
i yi
〈1, 1〉 −2 + 1× 2 + 1× 2 = 2 1 1
〈1, 2〉 −2 + 1× 2 + 2× 2 = 4 1 1
〈0, 0〉 −2 + 0× 2 + 0× 2 = −2 -1 -1
〈−1, 0〉 −2 +−1× 2 + 0× 2 = −4 -1 -1
• Convergence, as no updates throughout epoch
14
Perceptron Convergence
Perceptron Rule:
θ
(t+1)
j ← θ
(t)
j + η(yi − ŷ i)x
i
j
• So, all we’re doing is adding and subtracting constants every time we
make a mistake.
• Does this really work!?
15
Perceptron Convergence
Perceptron Convergence
• The Perceptron algorithm is guaranteed to converge for
linearly-separable data
• the convergence point will depend on the initialisation
• the convergence point will depend on the learning rate
• (no guarantee of the margin being maximised)
• No guarantee of convergence over non-linearly separable data
15
Back to Logistic Regression and Gradient Descent
Perceptron Rule
θ
(t+1)
j ← θ
(t)
j + η(yi − ŷ i)x
i
j
Gradient Descent
θ
(t+1) ← θ(t) − η
∂f
∂θ(t)
Activation Functions
A single ‘neuron’ with a sigmoid activation which optimizes the
cross-entropy loss (negative log likelihood) is equivalent to logistic
regression
x
1
x
2
...
x
F
θ
2
...
θ
F
y
input weights
activation
output
f: y = 1 / (1 + exp(-θTx))
1
θ
1
θ
0
16
Back to Logistic Regression and Gradient Descent
Perceptron Rule
θ
(t+1)
j ← θ
(t)
j + η(yi − ŷ i)x
i
j
Gradient Descent
θ
(t+1) ← θ(t) − η
∂f
∂θ(t)
Activation Functions
A single ‘neuron’ with a sigmoid activation which optimizes the
cross-entropy loss (negative log likelihood) is equivalent to logistic
regression
x
1
x
2
...
x
F
θ
2
...
θ
F
y
input weights
activation
output
f: y = 1 / (1 + exp(-θTx))
1
θ
1
θ
0
16
Online learning vs. Batch learning
• It is an online algorithm: we update the weights after each training
example
• In contrast, Naive Bayes and logistic regression (with Gradient Descent)
are updated as a batch algorithm:
• compute statistics of the whole training data set
• update all parameters at once
• Online learning can be more efficient for large data sets
• Gradient Descent can be converted into an online version: stochastic
gradient descent
17
Multi-Class Perceptron
We can generalize the perceptron to more than 2 classes
• create a weight vector for each class k ∈ Y , θk
• score input wrt each class: θTk x for all k
• predict the class with maximum output ŷ = argmaxk∈Y θ
T
k x
• learning works as before: if for some (x i , y i) we make a wrong prediction
ŷ i 6= y i such that θTy i x
i < θTŷ i x
i ,
θy i ← θy i + ηx
i move towards predicting y i for x i
θŷ i ← θŷ i − ηx
i move away from predicting ŷ i for x i
18
Summary
This lecture: The Perceptron
• Biological motivation
• Error-based classifier
• The Perceptron Rule
• Relation to Logistic Regression
• Multi-class perceptron
Next
• More powerful machine learning through combining perceptrons
• More on linear separability
• More on activation functions
• Learning with backpropagation
19
References
• Rosenblatt, Frank. ”The perceptron: a probabilistic model for information
storage and organization in the brain.” Psychological review 65.6 (1958):
386.
• Minsky, Marvin, and Seymour Papert. ”Perceptrons: An essay in
computational geometry.” MIT Press. (1969).
• Bishop, Christopher M. Pattern recognition and machine learning.
Springer, 2006. Chapter 4.1.7
20
Introduction
An example