Introduction to Machine Learning
Perceptron Revisited
Gradient Descent
Prof. Kutty
Announcements
• HW 1 released this evening; due in 2 weeks
• HW1 topics
– Early Draft Upload
– Vectors & Planes Review
– Linear Binary Classifiers
– numpy Exercises
– Perceptron Algorithm Without Offset
– Perceptron Convergence Proof
– SGD with Logistic Loss
Today’s Agenda
– Linear Classifiers
– Classification Algorithm: Perceptron
• Section 1
– The Perceptron Update
• Section 2
– Perceptron with Offset
• Section 3
– Non-linear separability and linear classifiers
• extending the notion of error (empirical risk)
• convexity
• Gradient Descent
• And later…
– Speeding things up with Stochastic Gradient Descent
If you do not have access to a laptop, tablet, or smartphone for clicker responses, please email me or see me after class.
Recap: The Algorithm
Linear classifier problem (supervised learning)
Given: training data (a labeled dataset) S_n = { (x̄^(i), y^(i)) }_{i=1}^n,
where x̄^(i) ∈ R^d and y^(i) ∈ {−1, +1}
Goal: Learn a linear decision boundary (through the origin)
h(x̄; θ̄) = sign(θ̄ ⋅ x̄)
that minimizes the training error, i.e., the fraction of misclassified points:
E_n(θ̄) = (1/n) ∑_{i=1}^n 1[ y^(i) ≠ h(x̄^(i); θ̄) ]
       = (1/n) ∑_{i=1}^n 1[ y^(i) (θ̄ ⋅ x̄^(i)) ≤ 0 ]
(here 1[⋅] denotes the indicator function)
Simplifying assumption: linear separability
Perceptron Algorithm
On input S_n = { (x̄^(i), y^(i)) }_{i=1}^n:
Initialize k = 0, θ̄^(0) = 0̄   (the current guess of the parameter)
while there exists a misclassified point:
    for i = 1, …, n:
        if y^(i) (θ̄^(k) ⋅ x̄^(i)) ≤ 0:
            θ̄^(k+1) = θ̄^(k) + y^(i) x̄^(i)
            k++
(i.e., θ̄^(k+1) = θ̄^(k) + x̄^(i) when y^(i) = +1, and θ̄^(k+1) = θ̄^(k) − x̄^(i) when y^(i) = −1)
If the data are linearly separable,
then this algorithm finds a separating hyperplane.
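For concreteness, here is a minimal numpy sketch of the algorithm above (not from the slides: the function name, argument names, and the max_passes safeguard for non-separable data are my own choices).

import numpy as np

def perceptron(X, y, max_passes=1000):
    # Perceptron without offset.
    # X: (n, d) array of points; y: (n,) array of labels in {-1, +1}.
    # If the data are linearly separable through the origin, the returned
    # theta satisfies y[i] * (theta @ X[i]) > 0 for every training point.
    n, d = X.shape
    theta = np.zeros(d)              # theta^(0) = 0
    for _ in range(max_passes):      # guard so the loop ends on non-separable data
        updated = False
        for i in range(n):
            if y[i] * (theta @ X[i]) <= 0:      # point i is misclassified
                theta = theta + y[i] * X[i]     # perceptron update
                updated = True
        if not updated:              # a full pass with no mistakes: converged
            break
    return theta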
Today’s Agenda
– Linear Classifiers
– Classification Algorithm: Perceptron
• Section 1
– The Perceptron Update
• Section 2
– Perceptron with Offset
• Section 3
– Non-linear separability and linear classifiers
• extending the notion of error (empirical risk)
• convexity
• Gradient Descent
• And later…
– Speeding things up with Stochastic Gradient Descent
Example: θ̄^(0) = [0, 0]^T   (training data shown in the figure on the slide)
while there exists a misclassified point:
    for i = 1, …, n:
        if y^(i) ≠ h(x̄^(i); θ̄^(k)):
            θ̄^(k+1) = θ̄^(k) + y^(i) x̄^(i)
            k++
The first point considered is misclassified, so we need to update:
θ̄^(1) = θ̄^(0) + x̄^(1) = [6, 6]^T
Clicker question: what is the next value of θ̄?
A. no more updates
B. [15, 7]^T
C. [−3, 5]^T
D. unsure
while there exists a misclassified point:
    for i = 1, …, n:
        if y^(i) ≠ h(x̄^(i); θ̄^(k)):
            θ̄^(k+1) = θ̄^(k) + y^(i) x̄^(i)
            k++
Answer:
θ̄^(0) = [0, 0]^T
i = 1, k = 1:  θ̄^(1) = θ̄^(0) + x̄^(1) = [6, 6]^T
i = 2, k = 2:  θ̄^(2) = θ̄^(1) − x̄^(2) = [−3, 5]^T
At this point no training example is misclassified, so the algorithm has converged.
Observations
• The perceptron algorithm updates based on a single (misclassified) point at a time
• The algorithm moves the hyperplane in the ‘right’ direction based on that point
y^(i) (θ̄^(k+1) ⋅ x̄^(i))  >  y^(i) (θ̄^(k) ⋅ x̄^(i))
Why does this update make sense?
Suppose in the k-th iteration the algorithm considers the point x̄^(i).
If x̄^(i) is correctly classified, the parameter is not updated.
k = 0, θ̄^(0) = 0̄
while there exists a misclassified point:
    for i = 1, …, n:
        if y^(i) (θ̄^(k) ⋅ x̄^(i)) ≤ 0:
            θ̄^(k+1) = θ̄^(k) + y^(i) x̄^(i)
            k++
Why does this update make sense?
Suppose in the k-th iteration the algorithm considers the point x̄^(i).
If x̄^(i) is misclassified, the parameter is updated as θ̄^(k+1) = θ̄^(k) + y^(i) x̄^(i).
After the update,
y^(i) (θ̄^(k+1) ⋅ x̄^(i)) = y^(i) ((θ̄^(k) + y^(i) x̄^(i)) ⋅ x̄^(i))
                        = y^(i) (θ̄^(k) ⋅ x̄^(i)) + (y^(i))² ‖x̄^(i)‖²
                        = y^(i) (θ̄^(k) ⋅ x̄^(i)) + ‖x̄^(i)‖²        (since (y^(i))² = 1)
                        > y^(i) (θ̄^(k) ⋅ x̄^(i))
so the update strictly increases the agreement between the classifier and the label of x̄^(i).
k = 0, θ̄^(0) = 0̄
while there exists a misclassified point:
    for i = 1, …, n:
        if y^(i) (θ̄^(k) ⋅ x̄^(i)) ≤ 0:
            θ̄^(k+1) = θ̄^(k) + y^(i) x̄^(i)
            k++
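A quick numeric sanity check of this derivation (toy numbers of my own; any misclassified point will do):

import numpy as np

theta = np.array([1.0, -2.0])   # current parameter theta^(k)  (made-up values)
x_i   = np.array([3.0,  3.0])   # the point considered in this iteration
y_i   = 1                       # its label; here y_i * (theta . x_i) = -3 <= 0, so it is misclassified

before    = y_i * (theta @ x_i)        # agreement before the update
theta_new = theta + y_i * x_i          # perceptron update
after     = y_i * (theta_new @ x_i)    # agreement after the update

# the derivation predicts: after = before + ||x_i||^2, since (y_i)^2 = 1
assert np.isclose(after, before + x_i @ x_i)
print(before, after)                   # -3.0  15.0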
Misclassifying despite an update
Worked example with x̄^(1) = [3, 3]^T (the remaining numbers are in the figure on the slide):
while there exists a misclassified point:
    for i = 1, …, n:
        if y^(i) ≠ h(x̄^(i); θ̄^(k)):
            θ̄^(k+1) = θ̄^(k) + y^(i) x̄^(i)
            k++
Each update on x̄^(1) increases y^(1)(θ̄ ⋅ x̄^(1)) by ‖x̄^(1)‖² = 18, but if y^(1)(θ̄^(k) ⋅ x̄^(1)) starts out sufficiently negative, x̄^(1) can still be misclassified after the update. The algorithm then updates on it again in a later pass, and repeats until the point is finally classified correctly.
Linear Classifier: (initial) simplifying assumptions
Goal: Learn a linear decision boundary, i.e., constrain the possible classifiers h to hyperplanes.
Simplifying assumptions:
• constrain the hypothesis class to the set of all hyperplanes that go through the origin
  – e.g., in R² this is the set of lines that go through the origin
• constrain the problem to datasets that are linearly separable
Today’s Agenda
– Linear Classifiers
– Classification Algorithm: Perceptron
• Section 1
– The Perceptron Update
• Section 2
– Perceptron with Offset
• Section 3
– Non-linear separability and linear classifiers
• extending the notion of error (empirical risk)
• convexity
• Gradient Descent
• And later…
– Speeding things up with Stochastic Gradient Descent
Linear Classifiers with offset
Given: training data S_n = { (x̄^(i), y^(i)) }_{i=1}^n
h(x̄; θ̄, b) = sign(θ̄ ⋅ x̄ + b)
Goal: Learn a linear decision boundary (not necessarily through the origin)
that minimizes the training error
E_n(θ̄, b) = (1/n) ∑_{i=1}^n 1[ y^(i) (θ̄ ⋅ x̄^(i) + b) ≤ 0 ]
The decision boundary is the set of points x̄ satisfying θ̄ ⋅ x̄ + b = 0.
Linearly separable with offset
Given training examples S_n = { (x̄^(i), y^(i)) }_{i=1}^n,
we say the data are linearly separable if there exist b ∈ R and θ̄ ∈ R^d such that
y^(i) (θ̄ ⋅ x̄^(i) + b) > 0   for i = 1, …, n.
We refer to the hyperplane θ̄ ⋅ x̄ + b = 0 as a separating hyperplane.
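As a small illustration of this definition, the sketch below (my own toy data and function name, not from the slides) checks whether a candidate pair (θ̄, b) separates a labeled dataset:

import numpy as np

def separates(theta, b, X, y):
    # True iff y[i] * (theta . X[i] + b) > 0 for every i,
    # i.e. (theta, b) defines a separating hyperplane for the data
    return bool(np.all(y * (X @ theta + b) > 0))

# toy example (made-up points and parameters)
X = np.array([[2.0, 1.0], [0.0, -1.0]])
y = np.array([+1, -1])
print(separates(np.array([1.0, 1.0]), -1.0, X, y))   # True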
Perceptron Algorithm (Modified)
Idea: fold the offset into the parameter vector by augmenting the dataset (a simple feature map).
Let x̄_aug^(i) = [1, x̄^(i)]^T and θ̄_aug = [b, θ̄]^T; note that θ̄_aug^(k) ∈ R^(d+1).
k = 0, θ̄_aug^(0) = 0̄
while there exists a misclassified point:
    for i = 1, …, n:
        if y^(i) ≠ h(x̄_aug^(i); θ̄_aug^(k)):
            θ̄_aug^(k+1) = θ̄_aug^(k) + y^(i) x̄_aug^(i)
            k++
Perceptron Algorithm (Modified)
k = 0, θ̄^(0) = 0̄, b^(0) = 0
while there exists a misclassified point:
    for i = 1, …, n:
        if y^(i) ≠ h(x̄^(i); θ̄^(k), b^(k)):
            θ̄^(k+1) = θ̄^(k) + y^(i) x̄^(i)
            b^(k+1) = b^(k) + y^(i)
            k++
To derive this, observe that
θ̄_aug^(k) ⋅ x̄_aug^(i) = [b^(k), θ̄^(k)] ⋅ [1, x̄^(i)] = θ̄^(k) ⋅ x̄^(i) + b^(k),
so the single augmented update θ̄_aug^(k+1) = θ̄_aug^(k) + y^(i) x̄_aug^(i) is exactly the pair of updates above.
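A minimal numpy sketch of the perceptron with offset, implemented via the augmentation trick above (function and variable names and the max_passes guard are my own):

import numpy as np

def perceptron_with_offset(X, y, max_passes=1000):
    # Prepend a constant 1 to every point: x_aug^(i) = [1, x^(i)],
    # run the through-origin perceptron on the augmented data,
    # and read off theta_aug = [b, theta] at the end.
    n, d = X.shape
    X_aug = np.hstack([np.ones((n, 1)), X])
    theta_aug = np.zeros(d + 1)
    for _ in range(max_passes):
        updated = False
        for i in range(n):
            if y[i] * (theta_aug @ X_aug[i]) <= 0:       # misclassified
                theta_aug = theta_aug + y[i] * X_aug[i]  # augmented update
                updated = True
        if not updated:
            break
    b, theta = theta_aug[0], theta_aug[1:]
    return theta, b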
Notes and Observations
Theorem: The perceptron algorithm converges after a finite number of steps when the training examples are linearly separable
• Zero training error at convergence
What if the examples are not linearly separable?
– the algorithm does not converge
– moreover, it does not find the classifier with the smallest training error
The perceptron algorithm
– updates based on a single (misclassified) point at a time
– moves the hyperplane in the ‘right’ direction based on that point
The perceptron is actually a (type of) single-layer Neural Network – a building block of Deep Learning
– more on this later in the course
“the first machine which is capable of having an original idea”
Today’s Agenda
– Linear Classifiers
– Classification Algorithm: Perceptron
• Section 1
– The Perceptron Update
• Section 2
– Perceptron with Offset
• Section 3
– Linear separability assumption
• Loss functions and Gradient Descent
• And later…
– Speeding things up with Stochastic Gradient Descent
Linear Classifier
Goal: Learn a linear decision boundary, i.e., constrain the possible classifiers h to hyperplanes.
Simplifying assumptions:
• constrain the hypothesis class to the set of all hyperplanes that go through the origin
  – e.g., in R² this is the set of lines that go through the origin
• constrain the problem to datasets that are linearly separable
Non-linearly separable data
When the data are not linearly separable we need a slightly different approach: the goal is still to generalize well to new examples.
Idea: focus on minimizing empirical risk
Goal: Find the parameter vector θ̄ that minimizes the empirical risk
R_n(θ̄) = (1/n) ∑_{i=1}^n 1[ y^(i) ≠ h(x̄^(i); θ̄) ]
        = (1/n) ∑_{i=1}^n 1[ y^(i) (θ̄ ⋅ x̄^(i)) ≤ 0 ]   for linear classifiers
Direct minimization of this function is difficult in general (NP-hard).
Empirical risk
Goal: Find the parameter vector θ̄ that minimizes the empirical risk
R_n(θ̄) = (1/n) ∑_{i=1}^n 1[ y^(i) ≠ h(x̄^(i); θ̄) ] = (1/n) ∑_{i=1}^n Loss( h(x̄^(i); θ̄), y^(i) )
        = (1/n) ∑_{i=1}^n Loss( y^(i) (θ̄ ⋅ x̄^(i)) )   for linear classifiers
Examples of loss functions for linear classifiers:
• 0-1 loss function
  Loss_{0-1}(z) = 0 if z > 0, and 1 otherwise,
  which recovers R_n(θ̄) = (1/n) ∑_{i=1}^n 1[ y^(i) (θ̄ ⋅ x̄^(i)) ≤ 0 ]
• Hinge loss function
Hinge loss function
Loss_hinge(z) = max{ 1 − z, 0 }
In all plots, horizontal axis is z and vertical axis is Loss(z)
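The two loss functions above written out in numpy (a small sketch; the example values of z are my own):

import numpy as np

def zero_one_loss(z):
    # 0-1 loss on the agreement z = y * (theta . x): 0 if z > 0, else 1
    return np.where(z > 0, 0.0, 1.0)

def hinge_loss(z):
    # hinge loss: max{1 - z, 0}; positive even for 'barely correct' predictions with 0 < z < 1
    return np.maximum(1.0 - z, 0.0)

z = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])
print(zero_one_loss(z))   # [1. 1. 0. 0. 0.]
print(hinge_loss(z))      # [3.  1.  0.5 0.  0. ]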
Empirical risk with hinge loss
Good news:
Empirical risk with hinge loss is a convex function
Idea: minimize the empirical risk using gradient descent, where
R_n(θ̄) = (1/n) ∑_{i=1}^n max{ 1 − y^(i) (θ̄ ⋅ x̄^(i)), 0 }
Advantages of the hinge loss function:
• incorporates the idea that some mistakes are 'worse' than others (larger violations incur larger loss)
• forces predictions to be more than just a 'little correct' (the loss is positive whenever y^(i) (θ̄ ⋅ x̄^(i)) < 1)
Gradient Descent (GD)
Idea: take a small step in the direction opposite to the gradient.
Informally, a convex function is characteristically 'bowl'-shaped.
θ̄^(k+1) = θ̄^(k) − η ∇_θ̄ R_n(θ̄) |_{θ̄ = θ̄^(k)}
Step size for Gradient Descent (GD)
How do we set η?
• constant η
– too large: can cause the algorithm to overshoot and 'oscillate'
– too small: convergence can be too slow
• variable η
– e.g., η_k = 1/(k+1)
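As a toy illustration of the GD update with a fixed step size (my own example, not from the slides): gradient descent on the convex function f(θ̄) = ‖θ̄ − c̄‖², whose minimizer is c̄.

import numpy as np

c = np.array([2.0, -1.0])            # known minimizer of f

def grad_f(theta):
    # gradient of f(theta) = ||theta - c||^2 is 2 * (theta - c)
    return 2.0 * (theta - c)

theta = np.zeros(2)                  # initial guess theta^(0)
eta = 0.1                            # fixed step size
for k in range(100):
    theta = theta - eta * grad_f(theta)   # theta^(k+1) = theta^(k) - eta * gradient
print(theta)                         # approximately [ 2. -1.]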
Gradient Descent (GD)
Start from an initial guess of the minimizer, θ̄^(0) = 0̄, and pick some (variable or fixed) step size η.
Update the guess of θ̄:
θ̄^(k+1) = θ̄^(k) − η ∇_θ̄ R_n(θ̄) |_{θ̄ = θ̄^(k)}
where
R_n(θ̄) = (1/n) ∑_{i=1}^n max{ 1 − y^(i) (θ̄ ⋅ x̄^(i)), 0 }
and
∇_θ̄ R_n(θ̄) = [ ∂R_n(θ̄)/∂θ_1, …, ∂R_n(θ̄)/∂θ_d ]^T
Goal: find the value of the parameter θ̄ that minimizes the empirical risk R_n(θ̄).
Gradient Descent (GD)
k = 0, θ̄^(0) = 0̄
while the convergence criterion is not met:
    θ̄^(k+1) = θ̄^(k) − η ∇_θ̄ R_n(θ̄) |_{θ̄ = θ̄^(k)}
    k++
Goal: Given S_n, find the value of the parameter θ̄ that minimizes the empirical risk
R_n(θ̄) = (1/n) ∑_{i=1}^n max{ 1 − y^(i) (θ̄ ⋅ x̄^(i)), 0 }
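A minimal numpy sketch of this procedure (function names are my own; note the hinge loss is not differentiable at the kink z = 1, so the code uses a standard subgradient, taking 0 at the kink — an assumption, since the slides do not spell this out):

import numpy as np

def hinge_risk_grad(theta, X, y):
    # (sub)gradient of R_n(theta) = (1/n) * sum_i max{1 - y[i] * (theta . X[i]), 0}:
    # each point with margin y[i] * (theta . X[i]) < 1 contributes -y[i] * X[i] / n,
    # all other points contribute 0
    margins = y * (X @ theta)
    active = (margins < 1).astype(float)
    return -(active * y) @ X / X.shape[0]

def gradient_descent(X, y, eta=0.1, num_iters=100):
    theta = np.zeros(X.shape[1])                            # theta^(0) = 0
    for k in range(num_iters):
        theta = theta - eta * hinge_risk_grad(theta, X, y)  # GD step
    return theta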