Lecture 6. Perceptron
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne
This lecture
• Perceptron
∗ Introduction to Artificial Neural Networks
∗ The perceptron model
∗ Stochastic gradient descent
The Perceptron Model
A building block for artificial neural networks, yet another linear classifier
Biological inspiration
• Humans perform well at many tasks that matter
• Originally, neural networks were an attempt to mimic the human brain
photo: Alvesgaspar, Wikimedia Commons, CC3
Artificial neural network
• As a crude approximation, the human brain can be thought of as a mesh of interconnected processing nodes (neurons) that relay electrical signals
• An artificial neural network is a network of processing elements
• Each element converts inputs to an output
• The output is a function (called the activation function) of a weighted sum of inputs

[Figure: a network of interconnected processing elements]
Outline
• In order to use an ANN we need to (a) design the network topology and (b) adjust the weights to the given data
∗ In this subject, we will exclusively focus on task (b) for a particular class of networks called feed-forward networks
• Training an ANN means adjusting weights for training data given a pre-defined network topology
• We will come back to ANNs and discuss ANN training in the next lecture
• Right now we will turn our attention to an individual network element because it is an interesting model in itself
Perceptron model

[Figure: inputs $x_1$ and $x_2$ are multiplied by weights $w_1$ and $w_2$, and a fixed input of 1 is multiplied by the bias weight $w_0$; the weighted sum $s$ is passed through the activation function $f$ to give the output $f(s)$]

• $x_1, x_2$ – inputs
• $w_1, w_2$ – synaptic weights
• $w_0$ – bias weight
• $f$ – activation function

Compare this model to logistic regression
Perceptron is a linear binary classifier
Perceptron is a linear classifier: $s$ is a linear function of the inputs, and the decision boundary is linear

where $s = \sum_{i=0}^{m} x_i w_i$

Perceptron is a binary classifier:
• Predict class A if $s \ge 0$
• Predict class B if $s < 0$

[Figure: example for 2D data: the plane of values of $s$ above the plane of data points, with the linear decision boundary where $s = 0$]
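A minimal sketch of this prediction rule in Python (our own illustration, not from the slides; names are ours), using the convention that a fixed input $x_0 = 1$ carries the bias weight:

```python
import numpy as np

def perceptron_predict(w, x):
    """Predict +1 (class A) or -1 (class B) for one input vector.

    w -- weights (w_0, w_1, ..., w_m), where w_0 is the bias weight
    x -- inputs (x_1, ..., x_m), without the bias component
    """
    x = np.concatenate(([1.0], x))  # prepend x_0 = 1 for the bias
    s = np.dot(w, x)                # s = sum_{i=0}^{m} x_i w_i
    return 1 if s >= 0 else -1

# Example: w_0 = -0.5, w_1 = 1.0 gives the 1D decision boundary x_1 = 0.5
print(perceptron_predict(np.array([-0.5, 1.0]), np.array([0.7])))  # 1
print(perceptron_predict(np.array([-0.5, 1.0]), np.array([0.2])))  # -1
```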
Exercise: find weights of a perceptron capable of perfect classification of the following dataset
$x_1$   $x_2$   $y$
0       0       Class B
0       1       Class B
1       0       Class B
1       1       Class A
art: OpenClipartVectors at pixabay.com (CC0)
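To experiment, here is a small checker (our own sketch, not part of the exercise) that tests a candidate weight vector against the table, encoding Class A as $+1$ and Class B as $-1$; the weights shown are placeholders for your answer:

```python
import numpy as np

# The exercise dataset: rows are (x1, x2); Class A = +1, Class B = -1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])

def check_weights(w0, w1, w2):
    """True iff the perceptron (w0, w1, w2) classifies every row correctly."""
    s = w0 + X @ np.array([w1, w2])      # s = w0 + w1*x1 + w2*x2, per row
    predicted = np.where(s >= 0, 1, -1)  # class A if s >= 0, else class B
    return bool(np.all(predicted == y))

# Placeholder weights -- substitute your candidate solution
print(check_weights(w0=0.0, w1=0.0, w2=0.0))  # False: everything predicted +1
```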
Loss function for perceptron
• “Training”: find weights that minimise some loss. Which loss?
• Our task is binary classification. Let's arbitrarily encode one class as $+1$ and the other as $-1$. So each training example is now $\{\boldsymbol{x}, y\}$, where $y$ is either $+1$ or $-1$
• Recall that, in a perceptron, $s = \sum_{i=0}^{m} x_i w_i$, and the sign of $s$ determines the predicted class: $+1$ if $s > 0$, and $-1$ if $s < 0$
• Consider a single training example. If $y$ and $s$ have the same sign then the example is classified correctly. If $y$ and $s$ have different signs, the example is misclassified
Loss function for perceptron
• Consider a single training example. If $y$ and $s$ have the same sign then the example is classified correctly. If $y$ and $s$ have different signs, the example is misclassified
• The perceptron uses a loss function in which there is no penalty for correctly classified examples, while the penalty (loss) is equal to $|s|$ for misclassified examples*
• Formally:
∗ $L(s, y) = 0$ if $s, y$ have the same sign
∗ $L(s, y) = |s|$ if $s, y$ have different signs
• This can be re-written as $L(s, y) = \max(0, -sy)$

[Figure: $L(s, y)$ plotted against $sy$: zero for $sy \ge 0$, increasing linearly as $sy$ decreases below 0]

* This is similar, but not identical, to another widely used loss function called the hinge loss
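The re-written form is a one-liner; a quick sketch (ours) with the $\pm 1$ encoding above:

```python
def perceptron_loss(s, y):
    """Perceptron loss: 0 when s and y share a sign, |s| otherwise (y is +1 or -1)."""
    return max(0.0, -s * y)

print(perceptron_loss(s=0.7, y=1))   # 0.0: correctly classified
print(perceptron_loss(s=-0.7, y=1))  # 0.7: misclassified, loss |s|
```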
Stochastic gradient descent

Randomly shuffle/split all training examples into $B$ batches
Choose initial $\boldsymbol{\theta}^{(1)}$
For $i$ from 1 to $T$ (iterations over the entire dataset are called epochs)
    For $j$ from 1 to $B$
        Do a gradient descent update using data from batch $j$

Advantage of such an approach: computational feasibility for large datasets
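A generic sketch of this loop (our own illustration; `gradient` stands in for whatever batch-loss gradient the model defines):

```python
import numpy as np

def sgd(theta, data, gradient, eta=0.1, T=10, B=5):
    """Stochastic gradient descent: T epochs over B batches.

    theta    -- initial parameters theta^(1)
    data     -- training examples, one per row
    gradient -- function(theta, batch) -> gradient of the loss on that batch
    """
    for i in range(T):                             # one epoch = one pass over the data
        shuffled = np.random.permutation(data)     # reshuffle rows each epoch
        for batch in np.array_split(shuffled, B):  # split into B batches
            theta = theta - eta * gradient(theta, batch)
    return theta
```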
Perceptron training algorithm
Choose initial guess $\boldsymbol{w}^{(0)}$, $k = 0$
For $i$ from 1 to $T$ (epochs)
    For $j$ from 1 to $N$ (training examples)
        Consider example $(\boldsymbol{x}_j, y_j)$
        Update*: $\boldsymbol{w}^{(k+1)} = \boldsymbol{w}^{(k)} - \eta \nabla L(\boldsymbol{w}^{(k)})$, then $k \leftarrow k + 1$

Here $L(\boldsymbol{w}) = \max(0, -sy)$ with $s = \sum_{i=0}^{m} x_i w_i$, and $\eta$ is the learning rate

*There is no derivative when $s = 0$, but this case is handled explicitly in the algorithm; see the next slides
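Assembled into runnable form, a sketch of this loop (our own code, not from the slides; it uses the per-weight updates derived on the next slides rather than a literal gradient call). For linearly separable data the loop exits early once an epoch makes no mistakes:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, T=100):
    """Train a perceptron. X is N x m (inputs); y holds labels +1 / -1."""
    N, m = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])  # prepend x_0 = 1 so w[0] is the bias w_0
    w = np.zeros(m + 1)                   # initial guess w^(0)
    for epoch in range(T):
        mistakes = 0
        for j in range(N):
            s = np.dot(w, Xb[j])
            predicted = 1 if s >= 0 else -1
            if predicted != y[j]:         # misclassified: take a gradient step
                w = w + eta * y[j] * Xb[j]
                mistakes += 1
        if mistakes == 0:                 # zero training loss: stop early
            break
    return w
```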
Perceptron training rule
• We have $\frac{\partial L}{\partial w_i} = 0$ when $sy > 0$
∗ We don't need to do an update when an example is correctly classified
• We have $\frac{\partial L}{\partial w_i} = -x_i$ when $y = 1$ and $s < 0$
• We have $\frac{\partial L}{\partial w_i} = x_i$ when $y = -1$ and $s > 0$
• Here $s = \sum_{i=0}^{m} x_i w_i$

[Figure: $L(s, y)$ against $sy$, as on the previous slide]
Perceptron training algorithm
When classified correctly, weights are unchanged
When misclassified: $\boldsymbol{w}^{(k+1)} = \boldsymbol{w}^{(k)} - \eta(\pm\boldsymbol{x})$, with the sign resolved case by case below
($\eta > 0$ is called the learning rate)

If $y = 1$ but $s < 0$: $w_i \leftarrow w_i + \eta x_i$ and $w_0 \leftarrow w_0 + \eta$
If $y = -1$ but $s \ge 0$: $w_i \leftarrow w_i - \eta x_i$ and $w_0 \leftarrow w_0 - \eta$
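With the convention $x_0 = 1$, both cases collapse into a single rule (a standard observation, not stated on this slide):

$$w_i \leftarrow w_i + \eta \, y \, x_i \quad \text{for } i = 0, \dots, m, \text{ whenever the example is misclassified}$$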
Convergence Theorem: if the training data is linearly separable, the algorithm is guaranteed to converge to a solution. That is, there exists a finite $K$ such that $L(\boldsymbol{w}^{(K)}) = 0$
Perceptron convergence theorem
• Assumptions
∗ Linear separability: there exists $\boldsymbol{w}^*$ such that $y_i (\boldsymbol{w}^*)' \boldsymbol{x}_i \ge \gamma$ for all training data $i = 1, \dots, N$ and some positive $\gamma$
∗ Bounded data: $\|\boldsymbol{x}_i\| \le R$ for $i = 1, \dots, N$ and some finite $R$
• Proof sketch
∗ Establish that $(\boldsymbol{w}^*)' \boldsymbol{w}^{(k)} \ge (\boldsymbol{w}^*)' \boldsymbol{w}^{(k-1)} + \gamma$
∗ It then follows that $(\boldsymbol{w}^*)' \boldsymbol{w}^{(k)} \ge k\gamma$
∗ Establish that $\|\boldsymbol{w}^{(k)}\|^2 \le k R^2$
∗ Note that $\cos(\boldsymbol{w}^*, \boldsymbol{w}^{(k)}) = \frac{(\boldsymbol{w}^*)' \boldsymbol{w}^{(k)}}{\|\boldsymbol{w}^*\| \, \|\boldsymbol{w}^{(k)}\|}$
∗ Use the fact that $\cos(\dots) \le 1$
∗ Arrive at $k \le \frac{R^2 \|\boldsymbol{w}^*\|^2}{\gamma^2}$
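Chaining these facts gives the bound explicitly (assuming $\boldsymbol{w}^{(0)} = \boldsymbol{0}$, as the proof sketch implicitly does):

$$k\gamma \;\le\; (\boldsymbol{w}^*)' \boldsymbol{w}^{(k)} \;\le\; \|\boldsymbol{w}^*\| \, \|\boldsymbol{w}^{(k)}\| \;\le\; \|\boldsymbol{w}^*\| \sqrt{k} \, R \quad\Longrightarrow\quad k \le \frac{R^2 \|\boldsymbol{w}^*\|^2}{\gamma^2}$$

Since $k$ counts updates, the algorithm stops misclassifying after at most $R^2 \|\boldsymbol{w}^*\|^2 / \gamma^2$ updates.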
Pros and cons of perceptron learning

• If the data is linearly separable, the perceptron training algorithm will converge to a correct solution
∗ There is a formal proof (good!)
∗ It will converge to some solution (separating boundary), one of infinitely many possible (bad!)
• However, if the data is not linearly separable, the training will fail completely rather than give some approximate solution
∗ Ugly!
Perceptron learning example
[Figure: perceptron with inputs $x_1, x_2$, weights $w_1, w_2$, and a bias input of 1 with weight $w_0$]

$s = w_1 x_1 + w_2 x_2 + w_0$

$f(s) = \begin{cases} -1, & s < 0 \\ 1, & s \ge 0 \end{cases}$

Basic setup (learning rate $\eta = 0.1$)
Perceptron learning example
Start with random weights: $w_1 = 0.2$, $w_2 = 0.0$, $w_0 = -0.1$ (learning rate $\eta = 0.1$)

$s = w_1 x_1 + w_2 x_2 + w_0$, $f(s) = \begin{cases} -1, & s < 0 \\ 1, & s \ge 0 \end{cases}$

[Figure: data in the $(x_1, x_2)$ plane; the initial decision boundary is the vertical line $x_1 = 0.5$]
Perceptron learning example
Consider training example 1: the point $(1, 1)$ with true class $-1$ (learning rate $\eta = 0.1$)

$s = 0.2 x_1 + 0.0 x_2 - 0.1 = 0.1 > 0$ at this point, so the prediction is class 1: misclassified

Update ($y = -1$ but $s \ge 0$):
$w_1 \leftarrow w_1 - \eta x_1 = 0.1$
$w_2 \leftarrow w_2 - \eta x_2 = -0.1$
$w_0 \leftarrow w_0 - \eta = -0.2$

[Figure: the point $(1, 1)$, class $-1$, lies on the wrong side of the boundary $x_1 = 0.5$]
Perceptron learning example
Update weights: $w_1 = 0.1$, $w_2 = -0.1$, $w_0 = -0.2$ (learning rate $\eta = 0.1$)

$s = w_1 x_1 + w_2 x_2 + w_0$, $f(s) = \begin{cases} -1, & s < 0 \\ 1, & s \ge 0 \end{cases}$

[Figure: the new boundary crosses the $x_1$ axis at 2; the point $(1, 1)$, class $-1$, is now on the correct side]
Perceptron learning example
Consider training example 2: the point $(2, 1)$ with true class 1 (learning rate $\eta = 0.1$)

$s = 0.1 x_1 - 0.1 x_2 - 0.2 = -0.1 < 0$ at this point, so the prediction is class $-1$: misclassified

Update ($y = 1$ but $s < 0$):
$w_1 \leftarrow w_1 + \eta x_1 = 0.3$
$w_2 \leftarrow w_2 + \eta x_2 = 0.0$
$w_0 \leftarrow w_0 + \eta = -0.1$

[Figure: the point $(2, 1)$, class 1, lies on the wrong side of the boundary]
Perceptron learning example
Update weights: $w_1 = 0.3$, $w_2 = 0.0$, $w_0 = -0.1$ (learning rate $\eta = 0.1$)

$s = w_1 x_1 + w_2 x_2 + w_0$, $f(s) = \begin{cases} -1, & s < 0 \\ 1, & s \ge 0 \end{cases}$

[Figure: the new boundary is the vertical line $x_1 = 1/3$; the point $(2, 1)$ is now on the correct side]
Perceptron learning example
Further examples, with $w_1 = 0.3$, $w_2 = 0.0$, $w_0 = -0.1$ (learning rate $\eta = 0.1$):

3rd point: $0.3 x_1 + 0.0 x_2 - 0.1 > 0$ there: correctly classified, no update
4th point, at $(1.5, 0.5)$: incorrectly classified, update
etc.

[Figure: boundary $x_1 = 1/3$ with classes $-1$ and 1; the 4th point $(1.5, 0.5)$ is misclassified]
Perceptron learning example
Further examples: $w_1 = \cdots$, $w_2 = \cdots$, $w_0 = \cdots$ (learning rate $\eta = 0.1$)

Eventually, all the data will be correctly classified (provided it is linearly separable)

[Figure: final decision boundary separating class $-1$ from class 1]
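The first two updates of this example can be replayed in a few lines (our own sketch; only the two points worked through above are included, since the rest of the dataset is not given on the slides):

```python
import numpy as np

eta = 0.1
w = np.array([-0.1, 0.2, 0.0])   # (w_0, w_1, w_2): the "random" starting weights

# The two training examples from the slides: ((x1, x2), true class)
examples = [((1.0, 1.0), -1), ((2.0, 1.0), 1)]

for (x1, x2), label in examples:
    x = np.array([1.0, x1, x2])  # x_0 = 1 carries the bias weight
    s = np.dot(w, x)
    predicted = 1 if s >= 0 else -1
    if predicted != label:       # misclassified: w <- w + eta * y * x
        w = w + eta * label * x
    print(f"s = {s:+.1f}, w = {np.round(w, 1)}")

# First update:  w = (-0.2, 0.1, -0.1), as on the "training example 1" slide
# Second update: w = (-0.1, 0.3,  0.0), as on the "training example 2" slide
```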
Summary
• Perceptron
∗ Introduction to Artificial Neural Networks
∗ The perceptron model
∗ Training algorithm
• Next lecture: multiple layers, backprop training