
Lecture 8: The Perceptron

COMP90049
Introduction to Machine Learning
Semester 1, 2020

Lea Frermann, CIS


Introduction

Roadmap

So far… Naive Bayes and Logistic Regression

• Probabilistic models

• Maximum likelihood estimation

• Examples and code

Today… The Perceptron

• Geometric motivation

• Error-based optimization

• …towards neural networks


Recap: Classification algorithms

Naive Bayes

• Generative model of p(x, y)
• Find optimal parameters that maximize the log data likelihood
• Unrealistic independence assumption p(x|y) = ∏i p(xi|y)

Logistic Regression

• Discriminative model of p(y |x)
• Find optimal parameters that maximize the conditional log data likelihood
• Allows for more complex features (fewer assumptions)

Perceptron

• Biological motivation: imitating neurons in the brain
• No more probabilities
• Instead: minimize the classification error directly


Introduction: Neurons in the Brain

• Humans are the best learners we know

• Can we take inspiration from human learning?

• → the brain!

(Figure: diagram of a multipolar neuron.
Source: https://upload.wikimedia.org/wikipedia/commons/1/10/Blausen_0657_MultipolarNeuron.png)

Introduction: Neurons in the Brain

The hype

• 1943: McCulloch and Pitts introduced the first ‘artificial neurons’:
  if the weighted sum of inputs is equal to or greater than a threshold,
  then the output is 1; otherwise the output is 0.
• The weights needed to be designed by hand.

• In 1958, Rosenblatt invented the Perceptron, which can learn the optimal
  parameters through the perceptron learning rule.
• The perceptron can be trained to learn the correct weights, even if
  randomly initialized (for a limited set of problems).
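A minimal sketch, assuming NumPy, of a McCulloch-Pitts style threshold unit. The
AND-gate weights and threshold below are a hypothetical illustration of hand-designed
weights, not taken from the slides.

  import numpy as np

  def mcculloch_pitts(x, weights, threshold):
      # Output 1 if the weighted sum of inputs reaches the threshold, else 0.
      return 1 if np.dot(weights, x) >= threshold else 0

  # Hypothetical example: hand-set weights that implement a logical AND.
  weights, threshold = np.array([1.0, 1.0]), 2.0
  for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
      print(x, mcculloch_pitts(np.array(x), weights, threshold))
  # outputs 0, 0, 0, 1 for the four inputs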

Introduction: Neurons in the Brain

The AI winter

• A few years later, Minsky and Papert (perhaps too successfully) pointed out
  the fundamental limitations of the perceptron.

• As a result, research on artificial neural networks stopped until the
mid-1980s

• But the limitations can be overcome by combining multiple perceptrons
into Artificial Neural Networks

• The perceptron is the basic component of today’s deep learning success!


Introduction: Artificial Neurons I

An artificial neuron takes the inputs 1, x1, x2, ..., xF, weights them by the
parameters θ0, θ1, θ2, ..., θF, and passes the weighted sum through an
activation function f to produce the output

  y = f(θT x)

or, with an explicit bias term b, y = f(θ x + b).

(Figure: input nodes 1, x1, ..., xF connected by weights θ0, θ1, ..., θF to an
activation unit f(x; θ, b), which produces the output y.)

For the perceptron, the activation f is the step function:

  f: y = 1  if θT x >= 0
     y = −1 if θT x < 0
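A minimal sketch, assuming NumPy, of this single neuron's prediction with the step
activation. The input vector is assumed to carry a leading 1 so that θ0 acts as the
bias; the weight values below are hypothetical.

  import numpy as np

  def step(z):
      # Step activation: class 1 if the weighted sum is non-negative, else class -1.
      return 1 if z >= 0 else -1

  def predict(theta, x):
      # x includes a leading 1 so that theta[0] plays the role of the bias term.
      return step(np.dot(theta, x))

  theta = np.array([-2.0, 2.0, 2.0])           # hypothetical weights
  print(predict(theta, np.array([1, 1, 1])))   # -> 1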
Perceptron: Definition I

• The Perceptron is a minimal neural network
• neural networks are composed of neurons
• A neuron is defined as follows:
  • input = a vector x of numeric inputs (〈1, x1, x2, ..., xn〉)
  • output = a scalar y_i ∈ R
  • hyper-parameter: an activation function f
  • parameters: θ = 〈θ0, θ1, θ2, ..., θn〉
• Mathematically: y_i = f( ∑j θj x_ij ) = f(θT x_i)

Perceptron: Definition II

• Task: binary classification of instances into the classes 1 and −1
• Model: a single neuron (aka a “perceptron”):

  f(θT x) =  1 if θT x ≥ 0
            −1 otherwise

• θT x = 0 is the decision boundary
• Graphically, f is the step function
• Example: the 2-d case (shown as a figure on the slide, with a linear boundary
  separating the two classes)

Towards the Perceptron Algorithm I

• As usual, learning means to modify the parameters (i.e., weights) of the
  perceptron so that performance is optimized
• The perceptron is a supervised classification algorithm, so we learn from
  observations of input-label pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N)
• Simplest way to learn: compare predicted outputs ŷ against true outputs y and
  minimize the number of mis-classifications. Unfortunately, mathematically
  inconvenient.
• Second simplest idea: find θ such that the gap between the predicted value
  ŷ_i = f(θT x_i) and the true class label y_i ∈ {−1, 1} is minimized

Intuition
Iterate over the training data and modify the weights:
• if the true label y = 1 and ŷ = 1 then do nothing
• if the true label y = −1 and ŷ = −1 then do nothing
• if the true label y = 1 but ŷ = −1 then increase the weights
• if the true label y = −1 but ŷ = 1 then decrease the weights

More formally
Initialize parameters θ ← 0
for each training sample (x, y) do
  calculate the output ŷ = f(θT x)
  if y = 1 and ŷ = −1 then θ(new) ← θ(old) + x
  if y = −1 and ŷ = 1 then θ(new) ← θ(old) − x
until tired

Towards the Perceptron Algorithm II

• We set a learning rate or step size η
• and note that

  (y_i − ŷ_i) =  0 if y_i = ŷ_i
                 2 if y_i = 1 and ŷ_i = −1
                −2 if y_i = −1 and ŷ_i = 1        (1)

• For each individual weight θ_j, we compute the update

  θ_j ← θ_j + η (y_i − ŷ_i) x_ij

The Perceptron Algorithm

D = {(x_i, y_i) | i = 1, 2, ..., N} is the set of training instances
Initialise the weight vector θ ← 0
t ← 0
repeat
  t ← t + 1
  for each training instance (x_i, y_i) ∈ D do
    compute ŷ_i(t) = f(θT x_i)
    if ŷ_i(t) ≠ y_i then
      for each weight θ_j do
        update θ_j(t) ← θ_j(t−1) + η (y_i − ŷ_i(t)) x_ij
    else
      θ_j(t) ← θ_j(t−1)
until tired
Return θ(t)

An example

Perceptron Example I

• Training instances:

  〈x_i1, x_i2〉   y_i
  〈1, 1〉          1
  〈1, 2〉          1
  〈0, 0〉         −1
  〈−1, 0〉        −1

• Learning rate η = 1
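A minimal sketch, assuming NumPy, of the algorithm above run on these four training
instances, with a leading 1 prepended to each input for the bias weight. Variable
names are illustrative, not from the slides.

  import numpy as np

  # Training instances <x1, x2> with labels y in {1, -1};
  # a leading 1 is prepended to each instance for the bias weight.
  X = np.array([[1, 1, 1], [1, 1, 2], [1, 0, 0], [1, -1, 0]], dtype=float)
  y = np.array([1, 1, -1, -1])

  theta = np.zeros(3)   # weight vector <bias weight, weight for x1, weight for x2>
  eta = 1.0             # learning rate

  converged = False
  while not converged:                            # repeat "until tired": here, until a pass makes no updates
      converged = True
      for x_i, y_i in zip(X, y):
          y_hat = 1 if theta @ x_i >= 0 else -1   # step activation applied to theta^T x
          if y_hat != y_i:
              theta += eta * (y_i - y_hat) * x_i  # perceptron rule: theta_j += eta * (y - y_hat) * x_j
              converged = False

  print(theta)   # -> [-2.  2.  2.], the weights reached in the worked example below

Note that with η = 1 each update adds or subtracts 2·x_i, which is exactly the
sequence of updates traced by hand in the following slides.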
Perceptron Example II

• θ = 〈0, 0, 0〉
• learning rate: η = 1
• Epoch 1:

  〈x1, x2〉   θ0·1 + θ1·x1 + θ2·x2     ŷ_i(1)   y_i
  〈1, 1〉     0 + 1×0 + 1×0 = 0          1        1
  〈1, 2〉     0 + 1×0 + 2×0 = 0          1        1
  〈0, 0〉     0 + 0×0 + 0×0 = 0          1       −1    Update to θ = 〈−2, 0, 0〉
  〈−1, 0〉    −2 + −1×0 + 0×0 = −2      −1       −1

Perceptron Example III

• θ = 〈−2, 0, 0〉
• learning rate: η = 1
• Epoch 2:

  〈x1, x2〉   θ0·1 + θ1·x1 + θ2·x2     ŷ_i(2)   y_i
  〈1, 1〉     −2 + 1×0 + 1×0 = −2       −1        1    Update to θ = 〈0, 2, 2〉
  〈1, 2〉     0 + 1×2 + 2×2 = 6          1        1
  〈0, 0〉     0 + 0×2 + 0×2 = 0          1       −1    Update to θ = 〈−2, 2, 2〉
  〈−1, 0〉    −2 + −1×2 + 0×2 = −4      −1       −1

Perceptron Example IV

• θ = 〈−2, 2, 2〉
• learning rate: η = 1
• Epoch 3:

  〈x1, x2〉   θ0·1 + θ1·x1 + θ2·x2     ŷ_i(3)   y_i
  〈1, 1〉     −2 + 1×2 + 1×2 = 2         1        1
  〈1, 2〉     −2 + 1×2 + 2×2 = 4         1        1
  〈0, 0〉     −2 + 0×2 + 0×2 = −2       −1       −1
  〈−1, 0〉    −2 + −1×2 + 0×2 = −4      −1       −1

• Convergence, as no updates throughout the epoch

Perceptron Convergence

Perceptron Rule: θ_j(t+1) ← θ_j(t) + η (y_i − ŷ_i) x_ij

• So, all we’re doing is adding and subtracting constants every time we make a
  mistake.
• Does this really work!?

• The Perceptron algorithm is guaranteed to converge for linearly-separable data
  • the convergence point will depend on the initialisation
  • the convergence point will depend on the learning rate
  • (no guarantee of the margin being maximised)
• No guarantee of convergence over non-linearly separable data

Back to Logistic Regression and Gradient Descent

Perceptron Rule:    θ_j(t+1) ← θ_j(t) + η (y_i − ŷ_i) x_ij
Gradient Descent:   θ(t+1) ← θ(t) − η ∂f/∂θ(t)

Activation Functions
A single ‘neuron’ with a sigmoid activation which optimizes the cross-entropy
loss (negative log likelihood) is equivalent to logistic regression.

(Figure: the same neuron diagram as before, with sigmoid activation
f: y = 1 / (1 + exp(−θT x)).)
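A minimal sketch, assuming NumPy, of the same single neuron with a sigmoid
activation trained by batch gradient descent on the cross-entropy loss, i.e.
logistic regression. The data reuses the four example points with labels
recoded to {0, 1}; the learning rate and iteration count are hypothetical.

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  # Same four points as the worked example, labels recoded from {-1, 1} to {0, 1}.
  X = np.array([[1, 1, 1], [1, 1, 2], [1, 0, 0], [1, -1, 0]], dtype=float)
  y = np.array([1.0, 1.0, 0.0, 0.0])

  theta = np.zeros(3)
  eta = 0.1
  for _ in range(1000):                  # batch gradient descent
      p = sigmoid(X @ theta)             # predicted probabilities p(y=1 | x)
      grad = X.T @ (p - y) / len(y)      # gradient of the cross-entropy (negative log likelihood)
      theta -= eta * grad                # theta <- theta - eta * d(loss)/d(theta)

  print(theta, sigmoid(X @ theta))       # learned weights and fitted probabilities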
Online Learning vs. Batch Learning

• The perceptron is an online algorithm: we update the weights after each
  training example
• In contrast, Naive Bayes and logistic regression (with Gradient Descent) are
  batch algorithms:
  • compute statistics of the whole training data set
  • update all parameters at once
• Online learning can be more efficient for large data sets
• Gradient Descent can be converted into an online version: stochastic
  gradient descent

Multi-Class Perceptron

We can generalize the perceptron to more than 2 classes (a code sketch follows
the references below):

• create a weight vector θ_k for each class k ∈ Y
• score the input with respect to each class: θ_kT x for all k
• predict the class with maximum output: ŷ = argmax_{k ∈ Y} θ_kT x
• learning works as before: if for some training instance (x, y) we make a
  wrong prediction ŷ ≠ y, i.e. θ_yT x < θ_ŷT x, then

  θ_y ← θ_y + ηx    (move towards predicting y for x)
  θ_ŷ ← θ_ŷ − ηx    (move away from predicting ŷ for x)

Summary

This lecture: The Perceptron
• Biological motivation
• Error-based classifier
• The Perceptron Rule
• Relation to Logistic Regression
• Multi-class perceptron

Next
• More powerful machine learning through combining perceptrons
• More on linear separability
• More on activation functions
• Learning with backpropagation

References

• Rosenblatt, Frank. “The perceptron: a probabilistic model for information
  storage and organization in the brain.” Psychological Review 65.6 (1958): 386.
• Minsky, Marvin, and Seymour Papert. “Perceptrons: An Introduction to
  Computational Geometry.” MIT Press, 1969.
• Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer,
  2006. Chapter 4.1.7.
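As referenced in the Multi-Class Perceptron section above, here is a minimal
sketch, assuming NumPy, of the multi-class update; the toy data, class count,
and epoch count are hypothetical.

  import numpy as np

  # Hypothetical toy data: 3 classes, inputs with a leading 1 for the bias weight.
  X = np.array([[1, 0.0, 1.0], [1, 1.0, 2.0], [1, -1.0, -1.0], [1, 2.0, 0.5]])
  y = np.array([0, 1, 2, 1])                  # class labels k in {0, 1, 2}

  K = 3
  Theta = np.zeros((K, X.shape[1]))           # one weight vector theta_k per class
  eta = 1.0

  for _ in range(10):                         # a fixed number of epochs ("until tired")
      for x_i, y_i in zip(X, y):
          y_hat = np.argmax(Theta @ x_i)      # predict the class with the highest score theta_k^T x
          if y_hat != y_i:
              Theta[y_i] += eta * x_i         # move towards predicting the true class
              Theta[y_hat] -= eta * x_i       # move away from the wrongly predicted class

  print(np.argmax(X @ Theta.T, axis=1))       # predictions on the training data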