程序代写 Lecture 12: The Perceptron

Lecture 12: The Perceptron
Introduction to Machine Learning Semester 1, 2022
Copyright @ University of Melbourne 2022. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.

Copyright By PowCoder代写 加微信 powcoder

So far… Naive Bayes and Logistic Regression
• Probabilistic models (Naive Bayes and Logistic Regression) … and KNN! • Maximum likelihood estimation
• Examples and code
Today… The Perceptron
• Geometric motivation
• Error-based optimization
• …towards neural networks

Recap: Classification algorithms
Naive Bayes
• Generative model of p(x,y)
• Find optimal parameter that maximize the log data likelihood • Unrealistic independence assumption p(x|y) = 􏰎i p(xi|y)
Logistic Regression
• Discriminative model of p(y|x)
• Find optimal parameters that maximize the conditional log data likelihood • Allows for more complex features (fewer assumptions)

Recap: Classification algorithms
Naive Bayes
• Generative model of p(x,y)
• Find optimal parameter that maximize the log data likelihood • Unrealistic independence assumption p(x|y) = 􏰎i p(xi|y)
Logistic Regression
• Discriminative model of p(y|x)
• Find optimal parameters that maximize the conditional log data likelihood • Allows for more complex features (fewer assumptions)
Perceptron
• Biological motivation: imitating neurons in the brain • No more probabilities
• Instead: minimize the classification error directly

Introduction: Neurons in the Brain
• Humans are the best learners we know
• Can we take inspiration from human learning • → the brain!

Introduction: Neurons in the Brain

Introduction: Neurons in the Brain
Source: https://upload.wikimedia.org/wikipedia/commons/1/10/Blausen_0657_MultipolarNeuron.png

Introduction: Neurons in the Brain
• 1943 McCulloch and Pitts introduced the first ‘artificial neurons’
• If the weighted sum of inputs is equal to or greater than a threshold,
then the output is 1. Otherwise the output is 0.
• the weights needed to be designed by hand
• In 1958 Rosenblatt invented the Perceptron, which can learn the optimal parameters through the perceptron learning rule
• The perceptron can be trained to learn the correct weights, even if randomly initialized [[ for a limited set of problems ]].

Introduction: Neurons in the Brain
The AI winter
• A few years later Misky and Papert (too?) successfully pointed out the fundamental limitations of the perceptron.
• As a result, research on artificial neural networks stopped until the mid-1980s
• But the limitations can be overcome by combining multiple perceptrons into Artificial Neural Networks
• The perceptron is the basic component of today’s deep learning success!

Introduction: Artificial Neurons I
1 x1 x2 …
f(x;θ,b) y
y = f(θTx)
activation
input weights

Introduction: Artificial Neurons I
1 x1 x2 …
y = f(θx+b) f: y = 1 if
f(θTx) >= 0 f(θTx) < 0 activation input weights Perceptron: Definition I • The Perceptron is a minimal neural network • neural networks are composed of neurons • A neuron is defined as follows: • input = a vector x of numeric inputs (⟨1, x1 , x2 , ...xn ⟩) • output=ascalaryi ∈R • hyper-parameter: an activation function f • parameters: θ = ⟨θ0, θ1, θ2, ...θn⟩ • Mathematically:    jj yi =f 􏰂θxi j Perceptron: Definition II • Task: binary classification of instances into classes 1 and −1 • Model: a single-neuron (aka a “perceptron”) : f(θTx)=1 ifθTx ≥0 −1 otherwise • Graphically, f is the step function • θT x is the decision boundary Perceptron: Definition II • Task: binary classification of instances into classes 1 and −1 • Model: a single-neuron (aka a “perceptron”) : f(θTx)=1 ifθTx ≥0 • θT x is the decision boundary • Example: 2-d case: −1 otherwise Towards the Perceptron Algorithm I • As usual, learning means to modify the parameters (i.e., weights) of the perceptron so that performance is optimized • The perceptron is a supervised classification algorithm, so we learn from observations of input-label pairs (x1, y1), (x2, y2), ...(xN , yN ) • Simplest way to learn: compare predicted outputs yˆ against true outputs y and minimize the number of mis-classifications. Unfortunately, mathematically inconvenient. • Second simplest idea: Find θ such that gap between the predicted value yˆi ←f(θTxi)andthetrueclasslabely ∈{−1,1}isminimized Towards the Perceptron Algorithm I Intuition Iterate over the training data and modify weights: • ifthetruelabely=1andyˆ=1thendonothing • ifthetruelabely=−1andyˆ=−1thendonothing • if the true label y = 1 but yˆ = −1 then increase weights • if the true label y = −1 but yˆ = 1 then decrease weights Towards the Perceptron Algorithm I Intuition Iterate over the training data and modify weights: • ifthetruelabely=1andyˆ=1thendonothing • ifthetruelabely=−1andyˆ=−1thendonothing • if the true label y = 1 but yˆ = −1 then increase weights • if the true label y = −1 but yˆ = 1 then decrease weights More formally Initialize parameters θ ← 0 for training sample (x,y) do Calculate the output yˆ = f(θT x) if y = 1 and yˆ = −1 then θ(new) ←θ(old) +x if y = −1 and yˆ = 1 then θ(new) ←θ(old) −x until tired Towards the Perceptron Algorithm II • We set a learning rate or step size η • and note that −2ifyi =−1and=yˆi =1 • For each individual weight θj , we compute an update such that θ ← θ + η ( y i − yˆ i ) x i jjj 0ify ==yˆ (yi −yˆi)=2ifyi =1and=yˆi =−1 (1) The Perceptron Algorithm D = {(xi , yi )|i = 1, 2, ..., N} the set of training instances Initialise the weight vector θ ← 0 for each training instance (xi , yi ) ∈ D do computeyˆi,(t) =f(θTxi) if yˆi,(t) ̸= yi then for each each weight θj do update θ(t) ← θ(t−1) + η(yi − yˆi,(t))xi θ(t) ← θ(t−1) jj until tired Return θ(t) An example Perceptron Example I • Training instances: • Learning rate η = 1 ⟨xi1,xi2⟩ yi ⟨1,1⟩ 1 ⟨1,2⟩ 1 ⟨0, 0⟩ −1 ⟨−1, 0⟩ −1 Perceptron Example II • θ=⟨0,0,0⟩ • learning rate: η = 1 • Epoch 1: ⟨x1,x2⟩ ⟨1, 1⟩ ⟨1, 2⟩ ⟨0, 0⟩ θ1 ·1+θ2 ·x1 +θ3 ·x2 0 + 1 × 0 + 1 × 0 = 0 0 + 1 × 0 + 2 × 0 = 0 0 + 0 × 0 + 0 × 0 = 0 Update to θ = ⟨−2, 0, 0⟩ −2+−1×0+0×0 = −2 yˆ(1) yi i 1 1 1 1 1 -1 Perceptron Example III • θ=⟨−2,0,0⟩ • learning rate: η = 1 • Epoch 2: ⟨x1,x2⟩ ⟨1, 1⟩ ⟨1, 2⟩ ⟨0, 0⟩ θ1 ·1+θ2 ·x1 +θ3 ·x2 −2 + 1 × 0 + 1 × 0 = −2 Update to θ = ⟨0, 2, 2⟩ 0 + 1 × 2 + 2 × 2 = 6 0 + 0 × 2 + 0 × 2 = 0 Update to θ = ⟨−2, 2, 2⟩ −2+−1×2+0×2 = −4 yˆ(2) yi i Perceptron Example IV • θ=⟨−2,2,2⟩ • learning rate: η = 1 • Epoch 3: ⟨x1,x2⟩ ⟨1, 1⟩ ⟨0, 0⟩ ⟨−1,0⟩ θ1 ·1+θ2 ·x1 +θ3 ·x2 −2 + 1 × 2 + 1 × 2 = 2 −2 + 1 × 2 + 2 × 2 = 4 −2 + 0 × 2 + 0 × 2 = −2 −2+−1×2+0×2 = −4 yˆ(3) yi i 1 1 1 1 -1 -1 -1 -1 • Convergence, as no updates throughout epoch Perceptron Convergence Perceptron Rule: θ ( t + 1 ) ← θ ( t ) + η ( y − yˆ i ) x i jjij • So, all we’re doing is adding and subtracting constants every time we make a mistake. • Does this really work!? Perceptron Convergence Perceptron Convergence • The Perceptron algorithm is guaranteed to converge for linearly-separable data • the convergence point will depend on the initialisation • the convergence point will depend on the learning rate • (no guarantee of the margin being maximised) • No guarantee of convergence over non-linearly separable data Back to Logistic Regression and Gradient Descent Perceptron Rule Gradient Descent Activation Functions θ ( t + 1 ) ← θ ( t ) + η ( y − yˆ i ) x i jjij θ(t+1) ←θ(t) −η ∂f ∂θ(t) Back to Logistic Regression and Gradient Descent Perceptron Rule Gradient Descent Activation Functions θ ( t + 1 ) ← θ ( t ) + η ( y − yˆ i ) x i jjij θ(t+1) ←θ(t) −η ∂f ∂θ(t) A single ‘neuron’ with a sigmoid activation which optimizes the cross-entropy loss (negative log likelihood) is equivalent to logistic regression y f: y = output 1 / (1 + exp(-θTx)) input weights activation Online learning vs. Batch learning • It is an online algorithm: we update the weights after each training example • In contrast, Naive Bayes and logistic regression (with Gradient Descent) are updated as a batch algorithm: • compute statistics of the whole training data set • update all parameters at once • Online learning can be more efficient for large data sets • Gradient Descent can be converted into an online version: stochastic gradient descent Multi-Class Perceptron We can generalize the perceptron to more than 2 classes • create a weight vector for each class k ∈ Y , θk • score input wrt each class: θkT x for all k • predict the class with maximum output yˆ = argmaxk ∈Y θkT x • learning works as before: if for some (x i , y i ) we make a wrong prediction yˆi ̸=yi suchthatθT xi <θT xi, θyi ← θyi + ηxi θyˆi ← θyˆi − ηxi move towards predicting yi for xi move away from predicting yˆi for xi This lecture: The Perceptron • Biological motivation • Error-based classifier • The Perceptron Rule • Relation to Logistic Regression • Multi-class perceptron • (Brief interlude on evaluation) • More powerful machine learning through combining perceptrons • More on linear separability • More on activation functions • Learning with backpropagation References • Rosenblatt, Frank. ”The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological review 65.6 (1958): 386. • Minsky, Marvin, and . ”Perceptrons: An essay in computational geometry.” MIT Press. (1969). • Bishop, . Pattern recognition and machine learning. Springer, 2006. Chapter 4.1.7 程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com