
Lecture 8: The Perceptron
COMP90049
Introduction to Machine Learning, Semester 1, 2020
Lea Frermann, CIS

Introduction

Roadmap
So far… Naive Bayes and Logistic Regression
• Probabilistic models
• Maximum likelihood estimation
• Examples and code
Today… The Perceptron
• Geometric motivation
• Error-based optimization
• …towards neural networks

Recap: Classification algorithms
Naive Bayes
• Generative model of p(x,y)
• Find optimal parameters that maximize the log data likelihood
• Unrealistic independence assumption: p(x|y) = ∏i p(xi|y)
Logistic Regression
• Discriminative model of p(y|x)
• Find optimal parameters that maximize the conditional log data likelihood
• Allows for more complex features (fewer assumptions)
Perceptron
• Biological motivation: imitating neurons in the brain
• No more probabilities
• Instead: minimize the classification error directly

Introduction: Neurons in the Brain
• Humans are the best learners we know
• Can we take inspiration from human learning?
• → the brain!

Introduction: Neurons in the Brain
[Figure: a multipolar neuron. Source: https://upload.wikimedia.org/wikipedia/commons/1/10/Blausen_0657_MultipolarNeuron.png]

Introduction: Neurons in the Brain
The hype
• In 1943, McCulloch and Pitts introduced the first ‘artificial neurons’
• If the weighted sum of inputs is equal to or greater than a threshold,
then the output is 1. Otherwise the output is 0.
• the weights needed to be designed by hand
• In 1958 Rosenblatt invented the Perceptron, which can learn the optimal parameters through the perceptron learning rule
• The perceptron can be trained to learn the correct weights, even if randomly initialized (for a limited set of problems).

Introduction: Neurons in the Brain
The AI winter
• A few years later, Minsky and Papert (all too) successfully pointed out the fundamental limitations of the perceptron.
• As a result, research on artificial neural networks stopped until the mid-1980s
• But the limitations can be overcome by combining multiple perceptrons into Artificial Neural Networks
• The perceptron is the basic component of today’s deep learning success!

Introduction: Artificial Neurons I

[Diagram: a single artificial neuron. The inputs ⟨1, x1, x2, ..., xF⟩ are multiplied by the weights ⟨θ0, θ1, θ2, ..., θF⟩, summed, and passed through an activation function f to produce the output.]

    y = f(θᵀx)

With a step activation, y = 1 if θᵀx ≥ 0 and y = −1 if θᵀx < 0; the constant input 1 with weight θ0 plays the role of the bias b.

Perceptron: Definition I

• The Perceptron is a minimal neural network
• neural networks are composed of neurons
• A neuron is defined as follows:
  • input = a vector x of numeric inputs ⟨1, x1, x2, ..., xn⟩
  • output = a scalar yi ∈ R
  • hyper-parameter: an activation function f
  • parameters: θ = ⟨θ0, θ1, θ2, ..., θn⟩
• Mathematically:

    yi = f( Σj θj xij ) = f(θᵀxi)

Perceptron: Definition II

• Task: binary classification of instances into classes 1 and −1
• Model: a single neuron (aka a “perceptron”):

    f(θᵀx) =  1    if θᵀx ≥ 0
             −1    otherwise

• θᵀx = 0 is the decision boundary
• Graphically, f is the step function

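To make the definition concrete, here is a minimal Python/NumPy sketch of the prediction step (not from the original slides; the function name and the convention of prepending the constant input 1 are illustrative assumptions).

    import numpy as np

    def predict(theta, x):
        """Perceptron prediction: step activation applied to the weighted sum theta^T x.

        theta: weight vector <theta_0, theta_1, ..., theta_F>, including the bias weight theta_0
        x:     feature vector without the leading constant 1
        """
        x = np.concatenate(([1.0], x))       # prepend the constant input for the bias
        return 1 if theta @ x >= 0 else -1   # step activation: 1 if theta^T x >= 0, else -1
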
Towards the Perceptron Algorithm I

• As usual, learning means to modify the parameters (i.e., weights) of the perceptron so that performance is optimized
• The perceptron is a supervised classification algorithm, so we learn from observations of input-label pairs (x1, y1), (x2, y2), ..., (xN, yN)
• Simplest way to learn: compare predicted outputs ŷ against true outputs y and minimize the number of mis-classifications. Unfortunately, this is mathematically inconvenient.
• Second simplest idea: find θ such that the gap between the predicted value ŷi ← f(θᵀxi) and the true class label yi ∈ {−1, 1} is minimized

Intuition

Iterate over the training data and modify the weights:
• if the true label y = 1 and ŷ = 1 then do nothing
• if the true label y = −1 and ŷ = −1 then do nothing
• if the true label y = 1 but ŷ = −1 then increase the weights
• if the true label y = −1 but ŷ = 1 then decrease the weights

More formally

    Initialize parameters θ ← 0
    for each training sample (x, y) do
        Calculate the output ŷ = f(θᵀx)
        if y = 1 and ŷ = −1 then θ(new) ← θ(old) + x
        if y = −1 and ŷ = 1 then θ(new) ← θ(old) − x
    until tired

Towards the Perceptron Algorithm II

• We set a learning rate or step size η
• and note that

    (yi − ŷi) =   0    if yi = ŷi
                  2    if yi = 1 and ŷi = −1
                 −2    if yi = −1 and ŷi = 1

• For each individual weight θj, we compute an update such that

    θj ← θj + η (yi − ŷi) xij        (1)

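A one-step sketch of the update in Eq. (1), building on the predict sketch above (the function name and the NumPy encoding are assumptions for illustration):

    import numpy as np

    def perceptron_update(theta, x, y, eta=1.0):
        """Apply the rule theta_j <- theta_j + eta * (y - y_hat) * x_j to one instance (x, y), y in {-1, 1}."""
        y_hat = predict(theta, x)             # current prediction with the step activation
        x_b = np.concatenate(([1.0], x))      # include the constant input for theta_0
        # (y - y_hat) is 0 when the prediction is correct and +/-2 on a mistake,
        # so the weights only change when the perceptron errs
        return theta + eta * (y - y_hat) * x_b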

The Perceptron Algorithm

    D = {(xi, yi) | i = 1, 2, ..., N}, the set of training instances
    Initialise the weight vector θ ← 0
    t ← 0
    repeat
        t ← t + 1
        for each training instance (xi, yi) ∈ D do
            compute ŷi,(t) = f(θᵀxi)
            if ŷi,(t) ≠ yi then
                for each weight θj do
                    update θj(t) ← θj(t−1) + η (yi − ŷi,(t)) xij
            else
                θ(t) ← θ(t−1)
    until tired
    Return θ(t)

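The same algorithm as a runnable sketch, reusing the predict helper above; the function name, the max_epochs cap standing in for “until tired”, and the stop-when-no-updates check are illustrative assumptions.

    import numpy as np

    def train_perceptron(X, y, eta=1.0, max_epochs=100):
        """Train a perceptron on data X (N x F) with labels y in {-1, 1}.

        Returns theta = <theta_0, theta_1, ..., theta_F>, where theta_0 is the bias weight.
        """
        theta = np.zeros(X.shape[1] + 1)          # initialise theta <- 0 (including the bias)
        for epoch in range(max_epochs):           # "until tired"
            updated = False
            for x_i, y_i in zip(X, y):
                y_hat = predict(theta, x_i)       # step-activation prediction
                if y_hat != y_i:                  # mistake: apply the perceptron rule
                    theta += eta * (y_i - y_hat) * np.concatenate(([1.0], x_i))
                    updated = True
            if not updated:                       # a full pass with no updates:
                break                             # the training data is perfectly classified
        return theta
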
An example

Perceptron Example I

• Training instances (below)
• Learning rate η = 1

    ⟨xi1, xi2⟩     yi
    ⟨1, 1⟩          1
    ⟨1, 2⟩          1
    ⟨0, 0⟩         −1
    ⟨−1, 0⟩        −1

Perceptron Example II

• θ = ⟨0, 0, 0⟩
• learning rate: η = 1
• Epoch 1:

    ⟨x1, x2⟩    θ0·1 + θ1·x1 + θ2·x2           ŷi(1)    yi
    ⟨1, 1⟩      0 + 1×0 + 1×0 = 0                1        1
    ⟨1, 2⟩      0 + 1×0 + 2×0 = 0                1        1
    ⟨0, 0⟩      0 + 0×0 + 0×0 = 0                1       −1
                Update to θ = ⟨−2, 0, 0⟩
    ⟨−1, 0⟩     −2 + (−1)×0 + 0×0 = −2          −1       −1

Perceptron Example III

• θ = ⟨−2, 0, 0⟩
• learning rate: η = 1
• Epoch 2:

    ⟨x1, x2⟩    θ0·1 + θ1·x1 + θ2·x2           ŷi(2)    yi
    ⟨1, 1⟩      −2 + 1×0 + 1×0 = −2             −1        1
                Update to θ = ⟨0, 2, 2⟩
    ⟨1, 2⟩      0 + 1×2 + 2×2 = 6                1        1
    ⟨0, 0⟩      0 + 0×2 + 0×2 = 0                1       −1
                Update to θ = ⟨−2, 2, 2⟩
    ⟨−1, 0⟩     −2 + (−1)×2 + 0×2 = −4          −1       −1

Perceptron Example IV

• θ = ⟨−2, 2, 2⟩
• learning rate: η = 1
• Epoch 3:

    ⟨x1, x2⟩    θ0·1 + θ1·x1 + θ2·x2           ŷi(3)    yi
    ⟨1, 1⟩      −2 + 1×2 + 1×2 = 2               1        1
    ⟨1, 2⟩      −2 + 1×2 + 2×2 = 4               1        1
    ⟨0, 0⟩      −2 + 0×2 + 0×2 = −2             −1       −1
    ⟨−1, 0⟩     −2 + (−1)×2 + 0×2 = −4          −1       −1

• Convergence, as no updates throughout the epoch

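As a sanity check, the worked example can be reproduced with the train_perceptron sketch above (the variable names here are illustrative); with η = 1 the learned weights should come out as θ = ⟨−2, 2, 2⟩, matching Epoch 3.

    import numpy as np

    # Training instances and labels from the example above
    X = np.array([[1.0, 1.0], [1.0, 2.0], [0.0, 0.0], [-1.0, 0.0]])
    y = np.array([1, 1, -1, -1])

    theta = train_perceptron(X, y, eta=1.0)
    print(theta)                            # expected: [-2.  2.  2.]
    print([predict(theta, x) for x in X])   # expected: [1, 1, -1, -1]
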
Perceptron Convergence

Perceptron Rule:

    θj(t+1) ← θj(t) + η (yi − ŷi) xij

• So, all we’re doing is adding and subtracting constants every time we make a mistake.
• Does this really work!?

• The Perceptron algorithm is guaranteed to converge for linearly-separable data
  • the convergence point will depend on the initialisation
  • the convergence point will depend on the learning rate
  • (no guarantee of the margin being maximised)
• No guarantee of convergence over non-linearly separable data

Back to Logistic Regression and Gradient Descent

Perceptron Rule:

    θj(t+1) ← θj(t) + η (yi − ŷi) xij

Gradient Descent:

    θ(t+1) ← θ(t) − η ∂f/∂θ(t)

Activation Functions

A single ‘neuron’ with a sigmoid activation which optimizes the cross-entropy loss (negative log likelihood) is equivalent to logistic regression.

[Diagram: the same neuron, with inputs ⟨1, x1, ..., xF⟩ and weights ⟨θ0, ..., θF⟩, but a sigmoid activation: y = 1 / (1 + exp(−θᵀx)).]

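To illustrate the equivalence, a minimal sketch (not from the slides) of the same neuron with a sigmoid in place of the step activation; its output is the logistic regression probability p(y = 1 | x). Training it would use gradient descent on the cross-entropy loss rather than the perceptron rule.

    import numpy as np

    def predict_sigmoid(theta, x):
        """Single neuron with a sigmoid activation, identical to logistic regression:
        returns p(y = 1 | x) = 1 / (1 + exp(-theta^T x))."""
        x = np.concatenate(([1.0], x))            # constant input for the bias weight theta_0
        return 1.0 / (1.0 + np.exp(-(theta @ x)))
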
Online learning vs. Batch learning

• The perceptron is an online algorithm: we update the weights after each training example
• In contrast, Naive Bayes and logistic regression (with Gradient Descent) are trained as batch algorithms:
  • compute statistics of the whole training data set
  • update all parameters at once
• Online learning can be more efficient for large data sets
• Gradient Descent can be converted into an online version: stochastic gradient descent

Multi-Class Perceptron

We can generalize the perceptron to more than 2 classes:
• create a weight vector θk for each class k ∈ Y
• score the input with respect to each class: θkᵀx for all k
• predict the class with the maximum score: ŷ = argmax_{k ∈ Y} θkᵀx
• learning works as before (see the sketch below): if for some (xi, yi) we make a wrong prediction ŷi ≠ yi, i.e. the true class scores lower than the predicted class (θ_yiᵀ xi < θ_ŷiᵀ xi), then

    θ_yi ← θ_yi + η xi     (move towards predicting yi for xi)
    θ_ŷi ← θ_ŷi − η xi     (move away from predicting ŷi for xi)

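A minimal sketch of the multi-class rule (not from the slides); it assumes the classes are encoded as integer row indices into a float weight matrix Theta (one row per class) and reuses the constant-input trick for the bias.

    import numpy as np

    # Theta: float array of shape (num_classes, num_features + 1), e.g. np.zeros((K, F + 1))

    def multiclass_predict(Theta, x):
        """Predict the class whose weight vector gives the highest score theta_k^T x."""
        x = np.concatenate(([1.0], x))
        return int(np.argmax(Theta @ x))

    def multiclass_update(Theta, x, y, eta=1.0):
        """On a mistake, move theta_y towards x and theta_yhat away from x."""
        y_hat = multiclass_predict(Theta, x)
        if y_hat != y:
            x_b = np.concatenate(([1.0], x))
            Theta[y] += eta * x_b        # move towards predicting the true class y
            Theta[y_hat] -= eta * x_b    # move away from the wrongly predicted class
        return Theta
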
Summary

This lecture: The Perceptron
• Biological motivation
• Error-based classifier
• The Perceptron Rule
• Relation to Logistic Regression
• Multi-class perceptron

Next
• More powerful machine learning through combining perceptrons
• More on linear separability
• More on activation functions
• Learning with backpropagation

References

• Rosenblatt, Frank. “The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological Review 65.6 (1958): 386.
• Minsky, Marvin, and Seymour Papert. “Perceptrons: An Introduction to Computational Geometry.” MIT Press, 1969.
• Bishop, Christopher M. “Pattern Recognition and Machine Learning.” Springer, 2006. Chapter 4.1.7.