Introduction to Machine Learning
Stochastic Gradient Descent and Support Vector Machines
Prof. Kutty
Announcements
• HW1 due in 1 week!
– no late days can be applied for early upload!
• Join code for clicker https://join.iclicker.com/U8HAB
• Alternate midterm exam time
– only one alternate midterm exam time, from 11:20am-1:20pm on Wednesday, 2/23 (the same day as the regular midterm time)
– midterm conflict form opens after the add/drop period
• to have your request approved, you will need to specify the reason for the (hard) conflict (e.g., another scheduled midterm)
• if the conflict is with another course, we ask that you reach out to your other course for alternate arrangements first.
• Alternate final exam time
– only one alternate final exam time from 10am-noon on Thursday,
4/21 (the same day as the regular final exam time)
– final conflict form after midterm (see above for conditions)
• Please be sure to submit SSD accommodation requests through the accommodate system by Wednesday, January 19 @ 10:00 PM.
MSAIL Information Session
• 2 mass meetings
– Wednesday 1/19 from 6:00-7:00pm EST
– Thursday 1/20 from 6:00-7:00pm EST virtually
https://umich.zoom.us/j/7998899437.
• If you have any questions, contact
Today’s Agenda
– Loss functions and Gradient Descent
• Section 1
– Stochastic Gradient Descent
• Section 2
– Support Vector Machines
– hard margin SVM
• And later…
– Soft Margin SVMs and feature maps
Data that are linearly separable
[figure: linearly separable + and – points with a separating boundary]
perceptron works!
perceptron update (on a misclassified point $(\bar x^{(i)}, y^{(i)})$):
$\bar\theta^{(k+1)} = \bar\theta^{(k)} + y^{(i)}\bar x^{(i)}$
(updated guess = current guess + correction from the misclassified point)
Data that are not linearly separable
[figure: + and – points that no linear boundary can separate]
Idea: minimize empirical risk with hinge loss using gradient descent
In other words, find $\bar\theta$ that minimizes
$$R_n(\bar\theta) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}_{\mathrm{hinge}}\big(y^{(i)}(\bar\theta\cdot\bar x^{(i)})\big)$$
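For concreteness, here is a minimal NumPy sketch of the hinge loss and the empirical risk above; the array names `X`, `y`, `theta` are assumptions for illustration, not course-provided code.

```python
import numpy as np

def hinge_loss(z):
    # Loss_hinge(z) = max{1 - z, 0}, applied elementwise
    return np.maximum(1.0 - z, 0.0)

def empirical_risk(theta, X, y):
    # R_n(theta) = (1/n) * sum_i Loss_hinge( y^(i) * (theta . x^(i)) )
    agreements = y * (X @ theta)
    return hinge_loss(agreements).mean()
```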
Gradient Descent (GD)
Idea: take a small step in the direction opposite to the gradient
informally, a convex function is characteristically ‘bowl’-shaped
Update guess on $\theta$:
$\theta^{(k+1)} = \theta^{(k)} - \eta\,\nabla_\theta f(\theta)\big|_{\theta=\theta^{(k)}}$
Gradient Descent (GD)
Idea via a simple example
Goal: Find the value of the variable $\theta$ that minimizes the convex function $f(\theta)$.
Initial guess on the minimizer: $\theta^{(0)} = 4$; step size $\eta = \tfrac{1}{4}$.
Update guess on $\theta$:
$\theta^{(k+1)} = \theta^{(k)} - \eta\,\nabla_\theta f(\theta)\big|_{\theta=\theta^{(k)}}$
Gradient Descent (GD)
Idea via a simple example
Goal: Find the value of the variable $\theta$ that minimizes the convex function $f(\theta) = \theta^2$.
Initial guess on the minimizer: $\theta^{(0)} = 4$; step size $\eta = \tfrac{1}{4}$.
$\nabla_\theta f(\theta)\big|_{\theta=\theta^{(k)}} = \nabla_\theta\,\theta^2\big|_{\theta=\theta^{(k)}} = 2\theta^{(k)}$
so the update is
$\theta^{(k+1)} = \theta^{(k)} - \tfrac{1}{4}\big(2\theta^{(k)}\big) = \tfrac{1}{2}\theta^{(k)}$
Iterates: $\theta^{(0)} = 4,\ \theta^{(1)} = 2,\ \theta^{(2)} = 1,\ \theta^{(3)} = 0.5,\ \ldots$
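A short Python sketch of exactly this example; $f(\theta)=\theta^2$, $\theta^{(0)}=4$, and $\eta=1/4$ come from the slide, while the function and variable names are illustrative.

```python
def gradient_descent(grad, theta0, eta, num_steps):
    """Fixed-step gradient descent on a single scalar variable."""
    theta = theta0
    iterates = [theta]
    for _ in range(num_steps):
        theta = theta - eta * grad(theta)  # theta^(k+1) = theta^(k) - eta * grad f(theta^(k))
        iterates.append(theta)
    return iterates

# f(theta) = theta^2  =>  grad f(theta) = 2 * theta
print(gradient_descent(lambda th: 2.0 * th, theta0=4.0, eta=0.25, num_steps=3))
# [4.0, 2.0, 1.0, 0.5] -- matches the iterates above
```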
Gradient Descent (GD)
Goal: Given training data $S_n = \{(\bar x^{(i)}, y^{(i)})\}_{i=1}^{n}$,
find the value of the parameter $\bar\theta$ that minimizes the empirical risk
$$R_n(\bar\theta) = \frac{1}{n}\sum_{i=1}^{n} \max\{1 - y^{(i)}(\bar\theta\cdot\bar x^{(i)}),\ 0\}$$
k = 0, $\bar\theta^{(0)} = \bar 0$
while convergence criteria are not met
  $\bar\theta^{(k+1)} = \bar\theta^{(k)} - \eta\,\nabla_{\bar\theta} R_n(\bar\theta)\big|_{\bar\theta=\bar\theta^{(k)}}$
  k++
where $\eta$ is the (variable or fixed) step size and
$\nabla_{\bar\theta} R_n(\bar\theta) = \Big[\dfrac{\partial R_n(\bar\theta)}{\partial\theta_1},\ \ldots,\ \dfrac{\partial R_n(\bar\theta)}{\partial\theta_d}\Big]$
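A sketch of this full-batch update in NumPy, using the standard subgradient of the hinge loss at its kink; `X`, `y`, `eta`, and `num_iters` are illustrative names and defaults, not prescribed by the course.

```python
import numpy as np

def gd_hinge(X, y, eta=0.1, num_iters=100):
    """Full-batch (sub)gradient descent on R_n(theta) = average hinge loss."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        margins = y * (X @ theta)        # y^(i) * (theta . x^(i)) for every example
        active = margins < 1             # examples with nonzero hinge loss
        # subgradient of (1/n) * sum_i max{1 - y^(i)(theta . x^(i)), 0}
        grad = -(y[active][:, None] * X[active]).sum(axis=0) / n
        theta = theta - eta * grad       # a single update touches all n examples
    return theta
```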
Gradient Descent (GD)
Possible convergence criteria (detailed on the “Convergence criteria” slide below):
1. Keep track of $R_n(\bar\theta)$
2. Keep track of $\nabla_{\bar\theta} R_n(\bar\theta)$
3. Keep track of $\bar\theta$
k = 0, $\bar\theta^{(0)} = \bar 0$
while convergence criteria are not met
  $\bar\theta^{(k+1)} = \bar\theta^{(k)} - \eta\,\nabla_{\bar\theta} R_n(\bar\theta)\big|_{\bar\theta=\bar\theta^{(k)}}$
  k++
Gradient Descent (GD)
$\bar\theta^{(k+1)} = \bar\theta^{(k)} - \eta\,\nabla_{\bar\theta} R_n(\bar\theta)\big|_{\bar\theta=\bar\theta^{(k)}}$
where $R_n(\bar\theta) = \frac{1}{n}\sum_{i=1}^{n} \max\{1 - y^{(i)}(\bar\theta\cdot\bar x^{(i)}),\ 0\}$
Due to the summation involved in calculating the gradient, in order to make a single update you have to look at every training example.
If we have a lot of training examples, this will be slow.
Section 1: Stochastic Gradient Descent or How to speed things up!
Stochastic Gradient Descent (SGD)
Instead of looping over all examples before an update, update based on a single point.
Reminder: hinge loss inside the empirical risk
$R_n(\bar\theta) = \frac{1}{n}\sum_{i=1}^{n} \max\{1 - y^{(i)}(\bar\theta\cdot\bar x^{(i)}),\ 0\}$
k = 0, $\bar\theta^{(0)} = \bar 0$
while convergence criteria are not met
  randomly shuffle the points
  for i = 1, …, n   (equivalently, pick i uniformly at random without replacement)
    $\bar\theta^{(k+1)} = \bar\theta^{(k)} - \eta\,\nabla_{\bar\theta}\,\mathrm{Loss}_{h}\big(y^{(i)}(\bar\theta\cdot\bar x^{(i)})\big)\big|_{\bar\theta=\bar\theta^{(k)}}$
    k++
where $\nabla_{\bar\theta}\,\mathrm{Loss}_h\big(y^{(i)}(\bar\theta\cdot\bar x^{(i)})\big) = \nabla_{\bar\theta}\max\{1 - y^{(i)}(\bar\theta\cdot\bar x^{(i)}),\ 0\}$
Stochastic Gradient Descent
$\nabla_{\bar\theta}\max\{1 - y^{(i)}(\bar\theta\cdot\bar x^{(i)}),\ 0\}$
case $y^{(i)}(\bar\theta\cdot\bar x^{(i)}) > 1$: ___
case $y^{(i)}(\bar\theta\cdot\bar x^{(i)}) \le 1$: ___
No update to $\bar\theta$ is made if
A. $y^{(i)}(\bar\theta\cdot\bar x^{(i)}) \ge 1$
B. $y^{(i)}(\bar\theta\cdot\bar x^{(i)}) \le 1$
C. always
Stochastic Gradient Descent
$\nabla_{\bar\theta}\max\{1 - y^{(i)}(\bar\theta\cdot\bar x^{(i)}),\ 0\}$
Case $y^{(i)}(\bar\theta\cdot\bar x^{(i)}) > 1$:
• gradient is $\bar 0$
• no update is made
Case $y^{(i)}(\bar\theta\cdot\bar x^{(i)}) \le 1$:
• (sub)gradient is $-y^{(i)}\bar x^{(i)}$, so the update is $\bar\theta^{(k+1)} = \bar\theta^{(k)} + \eta\, y^{(i)}\bar x^{(i)}$
(At $y^{(i)}(\bar\theta\cdot\bar x^{(i)}) = 1$ the hinge loss is not differentiable; using a subgradient there is known as the subgradient method.)
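The case analysis above, written as a small per-example helper (a sketch; the names are illustrative):

```python
import numpy as np

def hinge_subgradient(theta, x_i, y_i):
    """A subgradient of max{1 - y_i * (theta . x_i), 0} with respect to theta."""
    if y_i * (theta @ x_i) >= 1:
        return np.zeros_like(theta)  # loss is zero (or we are at the kink): no update needed
    return -y_i * x_i                # otherwise the gradient is -y_i * x_i
```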
Stochastic Gradient Descent
hyperparameter: step size $\eta$
k = 0, $\bar\theta^{(0)} = \bar 0$
while convergence criteria are not met
  randomly shuffle the points
  for i = 1, …, n
    if $y^{(i)}(\bar\theta^{(k)}\cdot\bar x^{(i)}) < 1$   (nonzero hinge loss)
      $\bar\theta^{(k+1)} = \bar\theta^{(k)} + \eta\, y^{(i)}\bar x^{(i)}$
    k++
Looks a lot like the perceptron algorithm! Differences?
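A runnable sketch of this loop, assuming NumPy arrays `X` (n×d features) and `y` (labels in {−1, +1}); `eta` and `num_epochs` are illustrative hyperparameters.

```python
import numpy as np

def sgd_hinge(X, y, eta=0.01, num_epochs=10, seed=0):
    """SGD on the (unregularized) average hinge loss, one point per update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_epochs):
        for i in rng.permutation(n):          # randomly shuffle the points
            if y[i] * (theta @ X[i]) < 1:     # nonzero hinge loss
                theta = theta + eta * y[i] * X[i]
    return theta
```

Compared with the perceptron, the update fires whenever the margin is below 1 rather than only on misclassifications, and it is scaled by the step size η.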
Convergence criteria
1. Keep track of $R_n(\bar\theta)$: stop when it is less than some amount
2. Keep track of $\nabla_{\bar\theta} R_n(\bar\theta)$: stop when its magnitude is less than some amount
3. Keep track of $\bar\theta$: stop when it doesn't change by much
   (don't check these too often, or it defeats the purpose of SGD!)
4. Keep track of the number of iterations: stop when the max # of iterations is reached
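A sketch of how criteria 3 and 4 might be combined in code; `tol` and `max_iters` are illustrative thresholds, not values given in the course.

```python
import numpy as np

def converged(theta_new, theta_old, k, tol=1e-4, max_iters=10_000):
    """Stop when theta barely changes (criterion 3) or after too many updates (criterion 4)."""
    return np.linalg.norm(theta_new - theta_old) < tol or k >= max_iters
```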
Stochastic Gradient Descent
with an appropriate learning rate, since $R_n(\bar\theta)$ is convex, SGD will almost surely converge to the global minimum
SGD often gets close to the minimum faster than GD
Note: SGD can also be applied to non-convex functions
compared to GD, SGD is more sensitive to the step size and may never “converge” to the minimum:
the parameters may keep oscillating around the minimum; in practice, most of the values near the minimum will be reasonably good approximations to the true minimum
Section 2: Support Vector Machines
Picking a decision boundary
Suppose data are linearly separable
• Boundary that classifies the training set correctly and,
• That is maximally removed from training examples
closest to the decision boundary
Maximum Margin Separator as an optimization problem
Assume data are linearly separable
• Boundary that classifies the training set correctly and,
• that is maximally removed from the training examples closest to the decision boundary
s.t. $y^{(i)}(\bar\theta\cdot\bar x^{(i)} + b) \ge 1$ for $i \in \{1,\ldots,n\}$
Maximum Margin Separator as an optimization problem
Assume data are linearly separable
• Boundary that classifies the training set correctly
– $y^{(i)}(\bar\theta\cdot\bar x^{(i)} + b) \ge 1$ for $i \in \{1,\ldots,n\}$
That is maximally removed from training examples closest to the decision boundary
Define $\gamma^{(i)}(\bar\theta, b)$ to be the distance of datapoint $i$ to the hyperplane $\bar\theta\cdot\bar x + b = 0$.
[figure: two candidate separators, $\bar\theta^{(1)}\cdot\bar x + b^{(1)} = 0$ and $\bar\theta^{(2)}\cdot\bar x + b^{(2)} = 0$]
(here $\min_i$ denotes the minimum over all datapoints)
A. $\min_i \gamma^{(i)}(\bar\theta^{(1)}, b^{(1)}) > \min_i \gamma^{(i)}(\bar\theta^{(2)}, b^{(2)})$
B. $\min_i \gamma^{(i)}(\bar\theta^{(1)}, b^{(1)}) < \min_i \gamma^{(i)}(\bar\theta^{(2)}, b^{(2)})$
C. $\min_i \gamma^{(i)}(\bar\theta^{(1)}, b^{(1)}) = \min_i \gamma^{(i)}(\bar\theta^{(2)}, b^{(2)})$
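For reference, the distance $\gamma^{(i)}(\bar\theta, b)$ and the “minimum over all datapoints” can be computed as below (a NumPy sketch; the signed version assumes correctly classified points):

```python
import numpy as np

def distance_to_hyperplane(x_i, theta, b):
    """Euclidean distance from the point x_i to the hyperplane theta . x + b = 0."""
    return abs(theta @ x_i + b) / np.linalg.norm(theta)

def min_margin(X, y, theta, b):
    """Smallest signed distance over all datapoints -- the quantity being maximized."""
    return np.min(y * (X @ theta + b)) / np.linalg.norm(theta)
```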
Maximum Margin Separator as an optimization problem
Assume data are linearly separable
Support vectors lie on the margin boundaries.
$$\max_{\bar\theta,\, b}\ \min_i \gamma^{(i)}(\bar\theta, b)\quad \text{subject to } y^{(i)}(\bar\theta\cdot\bar x^{(i)} + b) \ge 1 \ \text{ for } i \in \{1,\ldots,n\}$$
i.e., maximize the distance to the closest datapoint, over all $\bar\theta, b$ that satisfy the constraints.
Hard Margin SVM Assuming data are linearly separable
optimization problem:
$$\min_{\bar\theta,\, b}\ \frac{\|\bar\theta\|^2}{2}\quad \text{subject to } y^{(i)}(\bar\theta\cdot\bar x^{(i)} + b) \ge 1 \ \text{ for } i \in \{1,\ldots,n\}$$
quadratic objective with linear inequality constraints, i.e., a quadratic program (QP)
Linear classifier output by this QP: $\mathrm{sign}(\bar\theta\cdot\bar x + b)$
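One way to solve this QP directly is with a generic convex-optimization library such as cvxpy; this is a sketch under the assumption that the data are linearly separable (otherwise the problem is infeasible), not the course's reference implementation.

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve min ||theta||^2 / 2  s.t.  y^(i) (theta . x^(i) + b) >= 1."""
    n, d = X.shape
    theta = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ theta + b) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(theta)), constraints)
    problem.solve()
    return theta.value, b.value

# Resulting linear classifier: sign(theta . x + b)
```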
Section 3: Soft Margin SVMs
Soft-Margin SVM
Suppose data are not linearly separable
Hard-margin SVM formulation:
$$\min_{\bar\theta,\, b}\ \frac{\|\bar\theta\|^2}{2}\quad \text{subject to } y^{(i)}(\bar\theta\cdot\bar x^{(i)} + b) \ge 1 \ \text{ for } i \in \{1,\ldots,n\}$$
What goes wrong?