Lecture 12: The Perceptron
Introduction to Machine Learning Semester 1, 2022
Copyright @ University of Melbourne 2022. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.
Copyright By PowCoder代写 加微信 powcoder
So far… Naive Bayes and Logistic Regression
• Probabilistic models (Naive Bayes and Logistic Regression) … and KNN! • Maximum likelihood estimation
• Examples and code
Today… The Perceptron
• Geometric motivation
• Error-based optimization
• …towards neural networks
Recap: Classification algorithms
Naive Bayes
• Generative model of p(x,y)
• Find optimal parameter that maximize the log data likelihood • Unrealistic independence assumption p(x|y) = i p(xi|y)
Logistic Regression
• Discriminative model of p(y|x)
• Find optimal parameters that maximize the conditional log data likelihood • Allows for more complex features (fewer assumptions)
Recap: Classification algorithms
Naive Bayes
• Generative model of p(x,y)
• Find optimal parameter that maximize the log data likelihood • Unrealistic independence assumption p(x|y) = i p(xi|y)
Logistic Regression
• Discriminative model of p(y|x)
• Find optimal parameters that maximize the conditional log data likelihood • Allows for more complex features (fewer assumptions)
Perceptron
• Biological motivation: imitating neurons in the brain • No more probabilities
• Instead: minimize the classification error directly
Introduction: Neurons in the Brain
• Humans are the best learners we know
• Can we take inspiration from human learning • → the brain!
Introduction: Neurons in the Brain
Introduction: Neurons in the Brain
Source: https://upload.wikimedia.org/wikipedia/commons/1/10/Blausen_0657_MultipolarNeuron.png
Introduction: Neurons in the Brain
• 1943 McCulloch and Pitts introduced the first ‘artificial neurons’
• If the weighted sum of inputs is equal to or greater than a threshold,
then the output is 1. Otherwise the output is 0.
• the weights needed to be designed by hand
• In 1958 Rosenblatt invented the Perceptron, which can learn the optimal parameters through the perceptron learning rule
• The perceptron can be trained to learn the correct weights, even if randomly initialized [[ for a limited set of problems ]].
Introduction: Neurons in the Brain
The AI winter
• A few years later Misky and Papert (too?) successfully pointed out the fundamental limitations of the perceptron.
• As a result, research on artificial neural networks stopped until the mid-1980s
• But the limitations can be overcome by combining multiple perceptrons into Artificial Neural Networks
• The perceptron is the basic component of today’s deep learning success!
Introduction: Artificial Neurons I
1 x1 x2 …
f(x;θ,b) y
y = f(θTx)
activation
input weights
Introduction: Artificial Neurons I
1 x1 x2 …
y = f(θx+b) f: y = 1 if
f(θTx) >= 0 f(θTx) < 0
activation
input weights
Perceptron: Definition I
• The Perceptron is a minimal neural network
• neural networks are composed of neurons
• A neuron is defined as follows:
• input = a vector x of numeric inputs (⟨1, x1 , x2 , ...xn ⟩)
• output=ascalaryi ∈R
• hyper-parameter: an activation function f • parameters: θ = ⟨θ0, θ1, θ2, ...θn⟩
• Mathematically:
jj
yi =f θxi j
Perceptron: Definition II
• Task: binary classification of instances into classes 1 and −1
• Model: a single-neuron (aka a “perceptron”) :
f(θTx)=1 ifθTx ≥0
−1 otherwise • Graphically, f is the step function
• θT x is the decision boundary
Perceptron: Definition II
• Task: binary classification of instances into classes 1 and −1
• Model: a single-neuron (aka a “perceptron”) :
f(θTx)=1 ifθTx ≥0
• θT x is the decision boundary • Example: 2-d case:
−1 otherwise
Towards the Perceptron Algorithm I
• As usual, learning means to modify the parameters (i.e., weights) of the perceptron so that performance is optimized
• The perceptron is a supervised classification algorithm, so we learn from observations of input-label pairs
(x1, y1), (x2, y2), ...(xN , yN )
• Simplest way to learn: compare predicted outputs yˆ against true outputs y and minimize the number of mis-classifications. Unfortunately, mathematically inconvenient.
• Second simplest idea: Find θ such that gap between the predicted value yˆi ←f(θTxi)andthetrueclasslabely ∈{−1,1}isminimized
Towards the Perceptron Algorithm I
Intuition Iterate over the training data and modify weights:
• ifthetruelabely=1andyˆ=1thendonothing
• ifthetruelabely=−1andyˆ=−1thendonothing
• if the true label y = 1 but yˆ = −1 then increase weights • if the true label y = −1 but yˆ = 1 then decrease weights
Towards the Perceptron Algorithm I
Intuition Iterate over the training data and modify weights:
• ifthetruelabely=1andyˆ=1thendonothing
• ifthetruelabely=−1andyˆ=−1thendonothing
• if the true label y = 1 but yˆ = −1 then increase weights • if the true label y = −1 but yˆ = 1 then decrease weights
More formally
Initialize parameters θ ← 0 for training sample (x,y) do
Calculate the output yˆ = f(θT x) if y = 1 and yˆ = −1 then
θ(new) ←θ(old) +x
if y = −1 and yˆ = 1 then
θ(new) ←θ(old) −x until tired
Towards the Perceptron Algorithm II
• We set a learning rate or step size η • and note that
−2ifyi =−1and=yˆi =1
• For each individual weight θj , we compute an update such that
θ ← θ + η ( y i − yˆ i ) x i jjj
0ify ==yˆ
(yi −yˆi)=2ifyi =1and=yˆi =−1 (1)
The Perceptron Algorithm
D = {(xi , yi )|i = 1, 2, ..., N} the set of training instances Initialise the weight vector θ ← 0
for each training instance (xi , yi ) ∈ D do
computeyˆi,(t) =f(θTxi) if yˆi,(t) ̸= yi then
for each each weight θj do
update θ(t) ← θ(t−1) + η(yi − yˆi,(t))xi
θ(t) ← θ(t−1) jj
until tired Return θ(t)
An example
Perceptron Example I
• Training instances:
• Learning rate η = 1
⟨xi1,xi2⟩ yi
⟨1,1⟩ 1 ⟨1,2⟩ 1 ⟨0, 0⟩ −1 ⟨−1, 0⟩ −1
Perceptron Example II
• θ=⟨0,0,0⟩
• learning rate: η = 1 • Epoch 1:
⟨x1,x2⟩ ⟨1, 1⟩
⟨1, 2⟩ ⟨0, 0⟩
θ1 ·1+θ2 ·x1 +θ3 ·x2 0 + 1 × 0 + 1 × 0 = 0
0 + 1 × 0 + 2 × 0 = 0
0 + 0 × 0 + 0 × 0 = 0 Update to θ = ⟨−2, 0, 0⟩ −2+−1×0+0×0 = −2
yˆ(1) yi i
1 1 1 1 1 -1
Perceptron Example III
• θ=⟨−2,0,0⟩
• learning rate: η = 1 • Epoch 2:
⟨x1,x2⟩ ⟨1, 1⟩
⟨1, 2⟩ ⟨0, 0⟩
θ1 ·1+θ2 ·x1 +θ3 ·x2
−2 + 1 × 0 + 1 × 0 = −2 Update to θ = ⟨0, 2, 2⟩
0 + 1 × 2 + 2 × 2 = 6
0 + 0 × 2 + 0 × 2 = 0 Update to θ = ⟨−2, 2, 2⟩ −2+−1×2+0×2 = −4
yˆ(2) yi i
Perceptron Example IV
• θ=⟨−2,2,2⟩
• learning rate: η = 1 • Epoch 3:
⟨x1,x2⟩ ⟨1, 1⟩
⟨0, 0⟩ ⟨−1,0⟩
θ1 ·1+θ2 ·x1 +θ3 ·x2 −2 + 1 × 2 + 1 × 2 = 2
−2 + 1 × 2 + 2 × 2 = 4 −2 + 0 × 2 + 0 × 2 = −2 −2+−1×2+0×2 = −4
yˆ(3) yi i
1 1 1 1 -1 -1 -1 -1
• Convergence, as no updates throughout epoch
Perceptron Convergence
Perceptron Rule:
θ ( t + 1 ) ← θ ( t ) + η ( y − yˆ i ) x i jjij
• So, all we’re doing is adding and subtracting constants every time we make a mistake.
• Does this really work!?
Perceptron Convergence
Perceptron Convergence
• The Perceptron algorithm is guaranteed to converge for
linearly-separable data
• the convergence point will depend on the initialisation
• the convergence point will depend on the learning rate • (no guarantee of the margin being maximised)
• No guarantee of convergence over non-linearly separable data
Back to Logistic Regression and Gradient Descent
Perceptron Rule Gradient Descent Activation Functions
θ ( t + 1 ) ← θ ( t ) + η ( y − yˆ i ) x i jjij
θ(t+1) ←θ(t) −η ∂f ∂θ(t)
Back to Logistic Regression and Gradient Descent
Perceptron Rule
Gradient Descent
Activation Functions
θ ( t + 1 ) ← θ ( t ) + η ( y − yˆ i ) x i jjij
θ(t+1) ←θ(t) −η ∂f ∂θ(t)
A single ‘neuron’ with a sigmoid activation which optimizes the cross-entropy loss (negative log likelihood) is equivalent to logistic regression
y f: y = output
1 / (1 + exp(-θTx))
input weights
activation
Online learning vs. Batch learning
• It is an online algorithm: we update the weights after each training example
• In contrast, Naive Bayes and logistic regression (with Gradient Descent) are updated as a batch algorithm:
• compute statistics of the whole training data set • update all parameters at once
• Online learning can be more efficient for large data sets
• Gradient Descent can be converted into an online version: stochastic gradient descent
Multi-Class Perceptron
We can generalize the perceptron to more than 2 classes
• create a weight vector for each class k ∈ Y , θk
• score input wrt each class: θkT x for all k
• predict the class with maximum output yˆ = argmaxk ∈Y θkT x
• learning works as before: if for some (x i , y i ) we make a wrong prediction yˆi ̸=yi suchthatθT xi <θT xi,
θyi ← θyi + ηxi θyˆi ← θyˆi − ηxi
move towards predicting yi for xi move away from predicting yˆi for xi
This lecture: The Perceptron
• Biological motivation
• Error-based classifier
• The Perceptron Rule
• Relation to Logistic Regression • Multi-class perceptron
• (Brief interlude on evaluation)
• More powerful machine learning through combining perceptrons • More on linear separability
• More on activation functions
• Learning with backpropagation
References
• Rosenblatt, Frank. ”The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological review 65.6 (1958): 386.
• Minsky, Marvin, and . ”Perceptrons: An essay in computational geometry.” MIT Press. (1969).
• Bishop, . Pattern recognition and machine learning. Springer, 2006. Chapter 4.1.7
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com