Lecture 8: The Perceptron
COMP90049
Introduction to Machine Learning, Semester 1, 2020
Lea Frermann, CIS
Introduction
Roadmap
So far… Naive Bayes and Logistic Regression
• Probabilistic models
• Maximum likelihood estimation
• Examples and code
Today… The Perceptron
• Geometric motivation
• Error-based optimization
• …towards neural networks
Recap: Classification algorithms
Naive Bayes
• Generative model of p(x,y)
• Find optimal parameters that maximize the log data likelihood
• Unrealistic independence assumption: p(x|y) = ∏_i p(xi|y)
Logistic Regression
• Discriminative model of p(y|x)
• Find optimal parameters that maximize the conditional log data likelihood
• Allows for more complex features (fewer assumptions)
Perceptron
• Biological motivation: imitating neurons in the brain
• No more probabilities
• Instead: minimize the classification error directly
Introduction: Neurons in the Brain
• Humans are the best learners we know
• Can we take inspiration from human learning?
• → the brain!
Figure: diagram of a multipolar neuron. Source: https://upload.wikimedia.org/wikipedia/commons/1/10/Blausen_0657_MultipolarNeuron.png
The hype
• In 1943, McCulloch and Pitts introduced the first ‘artificial neurons’
• If the weighted sum of inputs is equal to or greater than a threshold, then the output is 1. Otherwise the output is 0.
• the weights needed to be designed by hand
• In 1958 Rosenblatt invented the Perceptron, which can learn the optimal parameters through the perceptron learning rule
• The perceptron can be trained to learn the correct weights, even if randomly initialized (for a limited set of problems).
The AI winter
• A few years later, Minsky and Papert (perhaps too) successfully pointed out the fundamental limitations of the perceptron.
• As a result, research on artificial neural networks stopped until the mid-1980s
• But the limitations can be overcome by combining multiple perceptrons into Artificial Neural Networks
• The perceptron is the basic component of today’s deep learning success!
Introduction: Artificial Neurons I
Figure: a single artificial neuron. The inputs ⟨1, x1, x2, ..., xF⟩ are weighted by the parameters ⟨θ0, θ1, θ2, ..., θF⟩, summed, and passed through an activation function f to produce the output y = f(θ^T x).
Figure (continued): the same neuron with a step activation f, so that y = 1 if θ^T x ≥ 0 and y = −1 if θ^T x < 0.
Perceptron: Definition I
• The Perceptron is a minimal neural network
• neural networks are composed of neurons
• A neuron is defined as follows:
• input = a vector x of numeric inputs ⟨1, x1, x2, ..., xn⟩
• output = a scalar yi ∈ R
• hyper-parameter: an activation function f
• parameters: θ = ⟨θ0, θ1, θ2, ..., θn⟩
• Mathematically:
yi = f(∑_j θj xij) = f(θ^T xi)
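A minimal NumPy sketch of this definition (illustrative only, not from the subject materials; the name neuron_output and the convention of folding the bias into θ0 via a leading 1 in x are assumptions):

```python
import numpy as np

def neuron_output(theta, x, f):
    """Single-neuron output y = f(theta^T x).

    theta : weights <theta_0, theta_1, ..., theta_F> (theta_0 acts as the bias)
    x     : inputs  <1, x_1, ..., x_F> (the leading 1 multiplies the bias weight)
    f     : activation function
    """
    return f(np.dot(theta, x))
```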
Perceptron: Definition II
• Task: binary classification of instances into classes 1 and −1
• Model: a single neuron (aka a “perceptron”):

  f(θ^T x) = 1   if θ^T x ≥ 0
           = −1  otherwise

• θ^T x = 0 defines the decision boundary
• Graphically, f is the step function
• Example: 2-d case (figure omitted)
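The same sketch with the step activation plugged in (again illustrative; step and perceptron_predict are made-up names, and x is assumed to start with the constant 1):

```python
import numpy as np

def step(z):
    """Step activation: 1 if z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1

def perceptron_predict(theta, x):
    """Binary perceptron: classify x into {1, -1} via f(theta^T x)."""
    return step(np.dot(theta, x))
```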
Towards the Perceptron Algorithm I
• As usual, learning means to modify the parameters (i.e., weights) of the perceptron so that performance is optimized
• The perceptron is a supervised classification algorithm, so we learn from observations of input-label pairs
(x1, y1), (x2, y2), ...(xN , yN )
• Simplest way to learn: compare predicted outputs yˆ against true outputs y and minimize the number of mis-classifications. Unfortunately, this is mathematically inconvenient: the number of errors is not a differentiable function of the weights, so it is hard to optimize directly.
• Second simplest idea: find θ such that the gap between the predicted value yˆi ← f(θ^T xi) and the true class label yi ∈ {−1, 1} is minimized
Intuition: iterate over the training data and modify the weights:
• if the true label y = 1 and yˆ = 1, then do nothing
• if the true label y = −1 and yˆ = −1, then do nothing
• if the true label y = 1 but yˆ = −1, then increase the weights
• if the true label y = −1 but yˆ = 1, then decrease the weights
More formally
Initialize parameters θ ← 0
repeat
  for each training sample (x, y) do
    calculate the output yˆ = f(θ^T x)
    if y = 1 and yˆ = −1 then θ(new) ← θ(old) + x
    if y = −1 and yˆ = 1 then θ(new) ← θ(old) − x
until tired
Towards the Perceptron Algorithm II
• We set a learning rate or step size η
• and note that

  (yi − yˆi) = 0   if yi = yˆi
             = 2   if yi = 1 and yˆi = −1          (1)
             = −2  if yi = −1 and yˆi = 1

• For each individual weight θj, we compute the update

  θj ← θj + η (yi − yˆi) xij
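A vectorised sketch of this update (illustrative; perceptron_update is not a name from the lecture):

```python
import numpy as np

def perceptron_update(theta, x, y, y_hat, eta=1.0):
    """Apply theta_j <- theta_j + eta * (y - y_hat) * x_j to all weights at once.

    (y - y_hat) is 0 for a correct prediction and +2 or -2 for a mistake,
    so only misclassified instances change the weights.
    """
    return theta + eta * (y - y_hat) * np.asarray(x)
```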
The Perceptron Algorithm
D = {(xi, yi) | i = 1, 2, ..., N}: the set of training instances
Initialise the weight vector θ ← 0
t ← 0
repeat
  t ← t + 1
  for each training instance (xi, yi) ∈ D do
    compute yˆi,(t) = f(θ^T xi)
    if yˆi,(t) ≠ yi then
      for each weight θj do
        update θj(t) ← θj(t−1) + η (yi − yˆi,(t)) xij
    else
      θj(t) ← θj(t−1)
until tired
return θ(t)
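A direct Python translation of this pseudocode (a sketch, not from the subject materials; it assumes instances are NumPy rows whose first element is the constant 1, and replaces "until tired" with a fixed epoch cap):

```python
import numpy as np

def step(z):
    return 1 if z >= 0 else -1

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning rule.

    X : (N, F+1) array of instances, each starting with a constant 1 (bias input)
    y : (N,) array of labels in {1, -1}
    """
    theta = np.zeros(X.shape[1])              # initialise the weight vector to 0
    for _ in range(max_epochs):               # stands in for "until tired"
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_hat = step(np.dot(theta, x_i))
            if y_hat != y_i:                  # weights change only on mistakes
                theta = theta + eta * (y_i - y_hat) * x_i
                mistakes += 1
        if mistakes == 0:                     # a full error-free pass: converged
            break
    return theta
```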
An example
Perceptron Example I
• Training instances:
• Learning rate η = 1
⟨xi1, xi2⟩    yi
⟨1, 1⟩         1
⟨1, 2⟩         1
⟨0, 0⟩        −1
⟨−1, 0⟩       −1
Perceptron Example II
• θ = ⟨0, 0, 0⟩
• learning rate: η = 1
• Epoch 1:

⟨x1, x2⟩    θ1·1 + θ2·x1 + θ3·x2                          ŷ(1)   yi
⟨1, 1⟩      0 + 1×0 + 1×0 = 0                               1      1
⟨1, 2⟩      0 + 1×0 + 2×0 = 0                               1      1
⟨0, 0⟩      0 + 0×0 + 0×0 = 0 → update to θ = ⟨−2, 0, 0⟩    1     −1
⟨−1, 0⟩     −2 + (−1)×0 + 0×0 = −2                         −1     −1
Perceptron Example III
• θ = ⟨−2, 0, 0⟩
• learning rate: η = 1
• Epoch 2:

⟨x1, x2⟩    θ1·1 + θ2·x1 + θ3·x2                           ŷ(2)   yi
⟨1, 1⟩      −2 + 1×0 + 1×0 = −2 → update to θ = ⟨0, 2, 2⟩   −1      1
⟨1, 2⟩      0 + 1×2 + 2×2 = 6                                1      1
⟨0, 0⟩      0 + 0×2 + 0×2 = 0 → update to θ = ⟨−2, 2, 2⟩     1     −1
⟨−1, 0⟩     −2 + (−1)×2 + 0×2 = −4                          −1     −1
Perceptron Example IV
• θ = ⟨−2, 2, 2⟩
• learning rate: η = 1
• Epoch 3:

⟨x1, x2⟩    θ1·1 + θ2·x1 + θ3·x2        ŷ(3)   yi
⟨1, 1⟩      −2 + 1×2 + 1×2 = 2            1     1
⟨1, 2⟩      −2 + 1×2 + 2×2 = 4            1     1
⟨0, 0⟩      −2 + 0×2 + 0×2 = −2          −1    −1
⟨−1, 0⟩     −2 + (−1)×2 + 0×2 = −4       −1    −1
• Convergence, as no updates were made throughout the epoch
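The three epochs above can be checked with a short script (an illustrative sketch; it hard-codes the four instances with a leading 1 for the bias weight):

```python
import numpy as np

# The four training instances from the example, each prepended with 1 (bias input).
X = np.array([[1, 1, 1], [1, 1, 2], [1, 0, 0], [1, -1, 0]])
y = np.array([1, 1, -1, -1])

theta, eta = np.zeros(3), 1.0
for epoch in range(1, 4):                     # the example converges after 3 epochs
    for x_i, y_i in zip(X, y):
        y_hat = 1 if np.dot(theta, x_i) >= 0 else -1
        theta += eta * (y_i - y_hat) * x_i    # no change when y_i == y_hat
    print(epoch, theta)
# 1 [-2.  0.  0.]   2 [-2.  2.  2.]   3 [-2.  2.  2.]  -- matching Epochs 1-3 above
```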
Perceptron Convergence
Perceptron Rule:
θj(t+1) ← θj(t) + η (yi − yˆi) xij
• So, all we’re doing is adding and subtracting constants every time we make a mistake.
• Does this really work!?
Perceptron Convergence
• The Perceptron algorithm is guaranteed to converge for linearly-separable data
• the convergence point will depend on the initialisation
• the convergence point will depend on the learning rate
• (no guarantee of the margin being maximised)
• No guarantee of convergence over non-linearly separable data
Back to Logistic Regression and Gradient Descent
Perceptron Rule
θj(t+1) ← θj(t) + η (yi − yˆi) xij

Gradient Descent
θ(t+1) ← θ(t) − η ∂f/∂θ(t)

Activation Functions
A single ‘neuron’ with a sigmoid activation which optimizes the cross-entropy loss (negative log likelihood) is equivalent to logistic regression
Figure: the same neuron diagram, now with a sigmoid activation f: y = 1 / (1 + exp(−θ^T x)).
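As a sketch of this equivalence (illustrative, not from the subject materials; labels here follow the logistic regression convention y ∈ {0, 1} rather than the perceptron's {−1, 1}):

```python
import numpy as np

def sigmoid_neuron(theta, x):
    """p(y=1 | x) = 1 / (1 + exp(-theta^T x)): exactly the logistic regression model."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

def cross_entropy(p, y):
    """Negative log likelihood of one instance, with label y in {0, 1}."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```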
Online learning vs. Batch learning
• The perceptron is an online algorithm: we update the weights after each training example
• In contrast, Naive Bayes and logistic regression (with Gradient Descent) are trained as batch algorithms:
• compute statistics of the whole training data set
• update all parameters at once
• Online learning can be more efficient for large data sets
• Gradient Descent can be converted into an online version: stochastic gradient descent
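For contrast, a sketch of one batch epoch versus one online (stochastic gradient descent) epoch for the sigmoid neuron above (illustrative assumptions: y ∈ {0, 1}, X holds one instance per row with a leading 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_epoch(theta, X, y, eta=0.1):
    """Batch gradient descent: one update using the gradient over the whole data set."""
    y_hat = sigmoid(X @ theta)
    return theta - eta * X.T @ (y_hat - y) / len(y)

def online_epoch(theta, X, y, eta=0.1):
    """Stochastic gradient descent: update the weights after every single instance."""
    for x_i, y_i in zip(X, y):
        theta = theta - eta * (sigmoid(np.dot(theta, x_i)) - y_i) * x_i
    return theta
```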
Multi-Class Perceptron
We can generalize the perceptron to more than 2 classes
• create a weight vector θk for each class k ∈ Y
• score the input with respect to each class: θk^T x for all k
• predict the class with the maximum output: yˆ = argmax_{k ∈ Y} θk^T x
• learning works as before: if for some (xi, yi) we make a wrong prediction yˆi ≠ yi such that θ_yi^T xi < θ_yˆi^T xi, then update

  θ_yi ← θ_yi + η xi      (move towards predicting yi for xi)
  θ_yˆi ← θ_yˆi − η xi    (move away from predicting yˆi for xi)
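A sketch of the multi-class rule (illustrative; it assumes classes are indexed 0, ..., K−1 and that Theta is a float array with one weight vector per row):

```python
import numpy as np

def multiclass_predict(Theta, x):
    """Theta is (K, F+1), one weight vector per class; predict the best-scoring class."""
    return int(np.argmax(Theta @ x))

def multiclass_update(Theta, x, y, eta=1.0):
    """On a mistake, move the true class's weights towards x and the predicted
    class's weights away from x. Theta must be a float array."""
    y_hat = multiclass_predict(Theta, x)
    if y_hat != y:
        Theta[y] += eta * x        # move towards predicting y for x
        Theta[y_hat] -= eta * x    # move away from predicting y_hat for x
    return Theta
```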
Summary
This lecture: The Perceptron
• Biological motivation
• Error-based classifier
• The Perceptron Rule
• Relation to Logistic Regression
• Multi-class perceptron
Next
• More powerful machine learning through combining perceptrons
• More on linear separability
• More on activation functions
• Learning with backpropagation
References
• Rosenblatt, Frank. "The perceptron: a probabilistic model for information storage and organization in the brain." Psychological Review 65.6 (1958): 386.
• Minsky, Marvin, and Seymour Papert. "Perceptrons: An Introduction to Computational Geometry." MIT Press (1969).
• Bishop, Christopher M. "Pattern Recognition and Machine Learning." Springer (2006). Chapter 4.1.7.