
Introduction to Machine Learning Stochastic Gradient Descent
Support Vector Machines
Prof. Kutty


Announcements
• HW1 due in 1 week!
– no late days can be applied for early upload!
• Join code for clicker https://join.iclicker.com/U8HAB
• Alternate midterm exam time
– only one alternate midterm exam time, from 11:20am-1:20pm on Wednesday, 2/23 (the same day as the regular midterm time)
– midterm conflict form after the add/drop period
• to have your request approved, you will need to specify the reason for the (hard) conflict (e.g., another scheduled midterm)
• if the conflict is with another course, we ask that you reach out to your other course for alternate arrangements first.
• Alternate final exam time
– only one alternate final exam time from 10am-noon on Thursday,
4/21 (the same day as the regular final exam time)
– final conflict form after midterm (see above for conditions)
• Please be sure to submit SSD accommodation requests through the accommodate system by Wednesday, January 19 @ 10:00 PM.

MSAIL Information Session
• 2 mass meetings
– Wednesday 1/19 from 6:00-7:00pm EST
– Thursday 1/20 from 6:00-7:00pm EST virtually
https://umich.zoom.us/j/7998899437.
• If you have any questions, contact

Today’s Agenda
– Loss functions and Gradient Descent
• Section 1
– Stochastic Gradient Descent
• Section 2
– Support Vector Machines
– hard margin SVM
• And later…
– Soft Margin SVMs and feature maps

Data that are linearly separable
[Figure: linearly separable training data, + points on one side and − points on the other]
perceptron works!
perceptron update: θ̄^(k+1) = θ̄^(k) + y^(i) x̄^(i)
  (updated guess = current guess + label × misclassified point)
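As a concrete illustration, a minimal Python sketch of this update (the variable names are ours, not from the slides): when a point is misclassified, the current guess is nudged by adding y^(i) x̄^(i).

    import numpy as np

    def perceptron_update(theta, x_i, y_i):
        """One perceptron step: if (x_i, y_i) is misclassified, add y_i * x_i to theta."""
        if y_i * np.dot(theta, x_i) <= 0:     # misclassified (or on the boundary)
            theta = theta + y_i * x_i         # updated guess = current guess + y_i * x_i
        return theta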

Data that are not linearly separable
[Figure: data that are not linearly separable; + and − points cannot be split by a single line]
Idea: minimize empirical risk with hinge loss using gradient descent
In other words, find θ̄ that minimizes
R_n(θ̄) = (1/n) Σ_{i=1}^n Loss_hinge(y^(i)(θ̄ · x̄^(i)))

Gradient Descent (GD)
Idea: take a small step in the direction opposite to where the gradient points
informally, a convex function is characteristically ‘bowl’-shaped

Update guess on θ:
θ^(k+1) = θ^(k) − η ∇_θ f(θ)|_{θ = θ^(k)}
Gradient Descent (GD)
Idea via a simple example
Goal: Find value for variable 9 that minimizes the convex function :(9).
Initial guess on minimizer θ^(0) = 4 and step size η = 1/4
θ^(k+1) = (1/2) · θ^(k)

Update guess on θ:
θ^(k+1) = θ^(k) − η ∇_θ f(θ)|_{θ = θ^(k)}
Gradient Descent (GD)
Idea via a simple example
Goal: Find value for variable 9 that minimizes the convex function :(9).
Initial guess on minimizer θ^(0) = 4 and step size η = 1/4
θ^(k+1) = θ^(k) − (1/4) · 2θ^(k)
since ∇_θ f(θ)|_{θ=θ^(k)} = ∇_θ θ²|_{θ=θ^(k)} = 2θ^(k)
θ^(0) = 4, θ^(1) = 2, θ^(2) = 1, θ^(3) = 0.5
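A few lines of Python reproduce this worked example: with f(θ) = θ², step size η = 1/4, and initial guess θ^(0) = 4, each iterate is half the previous one.

    theta = 4.0           # theta^(0)
    eta = 0.25            # step size eta = 1/4

    for k in range(4):
        grad = 2 * theta              # gradient of f(theta) = theta^2 at theta^(k)
        theta = theta - eta * grad    # theta^(k+1) = theta^(k) - eta * grad
        print(k + 1, theta)           # prints 2.0, 1.0, 0.5, 0.25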

Gradient Descent (GD)
Goal: Given training data, find the value of parameter θ̄ that minimizes the empirical risk
R_n(θ̄) = (1/n) Σ_{i=1}^n max{1 − y^(i)(θ̄ · x̄^(i)), 0}

k = 0, θ̄^(k) = 0̄
while convergence criteria are not met:
    θ̄^(k+1) = θ̄^(k) − η ∇_θ̄ R_n(θ̄)|_{θ̄ = θ̄^(k)}
    k++

where η is the (variable or fixed) step size and
∇_θ̄ R_n(θ̄) = [∂R_n(θ̄)/∂θ_1, …, ∂R_n(θ̄)/∂θ_d]
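A minimal sketch of this loop in Python with NumPy (a fixed step size and a fixed iteration budget stand in for the convergence check; the function and variable names are ours, not from the slides):

    import numpy as np

    def grad_empirical_risk(theta, X, y):
        """(Sub)gradient of R_n(theta): average of -y_i * x_i over points with margin <= 1."""
        margins = y * (X @ theta)                      # y_i (theta . x_i) for every point
        active = margins <= 1.0                        # points with non-zero hinge loss
        return -(y[active, None] * X[active]).sum(axis=0) / len(y)

    def gradient_descent(X, y, eta=0.1, max_iter=1000):
        theta = np.zeros(X.shape[1])                   # theta^(0) = 0
        for _ in range(max_iter):                      # stand-in convergence criterion
            theta = theta - eta * grad_empirical_risk(theta, X, y)
        return theta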

Gradient Descent (GD)
Convergence criteria: what should we keep track of?
1. Keep track of R_n(θ̄)
2. Keep track of ∇_θ̄ R_n(θ̄)
3. Keep track of θ̄

k = 0, θ̄^(k) = 0̄
while convergence criteria are not met:
    θ̄^(k+1) = θ̄^(k) − η ∇_θ̄ R_n(θ̄)|_{θ̄ = θ̄^(k)}
    k++

Gradient Descent (GD)
θ̄^(k+1) = θ̄^(k) − η ∇_θ̄ R_n(θ̄)|_{θ̄ = θ̄^(k)}
where R_n(θ̄) = (1/n) Σ_{i=1}^n max{1 − y^(i)(θ̄ · x̄^(i)), 0}

Due to the summation involved in calculating the gradient, in order to make a single update, you have to look at every training example. If we have a lot of training examples, this will be slow.

Section 1: Stochastic Gradient Descent or How to speed things up!

Stochastic Gradient Descent (SGD)
Instead of looping over all examples before an update, update based on a single point.

Reminder: Hinge loss
R_n(θ̄) = (1/n) Σ_{i=1}^n max{1 − y^(i)(θ̄ · x̄^(i)), 0}

k = 0, θ̄^(k) = 0̄
while convergence criteria are not met:
    randomly shuffle points (i.e., pick points uniformly at random without replacement)
    for i = 1, …, n:
        θ̄^(k+1) = θ̄^(k) − η ∇_θ̄ Loss_hinge(y^(i)(θ̄ · x̄^(i)))|_{θ̄ = θ̄^(k)}
        k++

where ∇_θ̄ Loss_hinge(y^(i)(θ̄ · x̄^(i))) = ∇_θ̄ max{1 − y^(i)(θ̄ · x̄^(i)), 0}
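A sketch of this SGD loop in Python (shuffling once per pass and updating on a single point at a time; the step size, epoch count, and names are placeholders, not from the slides):

    import numpy as np

    def sgd_hinge(X, y, eta=0.01, n_epochs=10, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        theta = np.zeros(d)                              # theta^(0) = 0
        for _ in range(n_epochs):                        # stand-in convergence criterion
            for i in rng.permutation(n):                 # random order, without replacement
                if y[i] * np.dot(theta, X[i]) <= 1.0:    # non-zero hinge loss: (sub)gradient is -y_i x_i
                    theta = theta + eta * y[i] * X[i]
                # otherwise the gradient is 0 and no update is made
        return theta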

Stochastic Gradient Descent
∇_θ̄ max{1 − y^(i)(θ̄ · x̄^(i)), 0}
• case y^(i)(θ̄ · x̄^(i)) > 1: __
• case y^(i)(θ̄ · x̄^(i)) ≤ 1: __

No update to θ̄ is made if
A. y^(i)(θ̄ · x̄^(i)) ≥ 1
B. y^(i)(θ̄ · x̄^(i)) ≤ 1
C. neither

Stochastic Gradient Descent
∇_θ̄ max{1 − y^(i)(θ̄ · x̄^(i)), 0}
• if y^(i)(θ̄ · x̄^(i)) > 1:
    Gradient is 0̄
    No update is made
• if y^(i)(θ̄ · x̄^(i)) ≤ 1:
    (sub)gradient is −y^(i) x̄^(i), so the update is θ̄^(k+1) = θ̄^(k) + η y^(i) x̄^(i)
(the hinge loss is not differentiable at y^(i)(θ̄ · x̄^(i)) = 1; strictly, this is the subgradient method)
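The case analysis above can be written as a small helper function (a sketch; returning −y^(i) x̄^(i) at the kink y^(i)(θ̄ · x̄^(i)) = 1 is one valid subgradient choice):

    import numpy as np

    def hinge_subgradient(theta, x_i, y_i):
        """A subgradient of max{1 - y_i (theta . x_i), 0} with respect to theta."""
        if y_i * np.dot(theta, x_i) > 1.0:
            return np.zeros_like(theta)    # loss is 0 here: gradient is 0, so no update
        return -y_i * x_i                  # otherwise -y_i * x_i is a (sub)gradient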

Stochastic Gradient Descent
hyperparameter: step size η
k = 0, θ̄^(k) = 0̄
while convergence criteria are not met:
    randomly shuffle points
    for i = 1, …, n:
        if y^(i)(θ̄^(k) · x̄^(i)) < 1   (non-zero hinge loss)
            θ̄^(k+1) = θ̄^(k) + η y^(i) x̄^(i)
        k++
Looks a lot like the perceptron algorithm! Differences?

Convergence criteria
1. Keep track of R_n(θ̄): stop when less than some amount
2. Keep track of ∇_θ̄ R_n(θ̄): stop when less than some amount
3. Keep track of θ̄: stop when it doesn't change by much
   (don't do this too often, or it defeats the purpose of SGD!)
4. Keep track of the number of iterations: stop when the max # of iterations is reached

Stochastic Gradient Descent
• with an appropriate learning rate, since R_n(θ̄) is convex, SGD will almost surely converge to the global minimum
• SGD often gets close to the minimum faster than GD
• Note: SGD can also be applied to non-convex functions
• compared to GD, SGD is more sensitive to step size
• SGD may never "converge" to the minimum: the parameters may keep oscillating around the minimum
    – in practice, most of the values near the minimum will be reasonably good approximations to the true minimum

Section 2: Support Vector Machines

Picking a decision boundary
Suppose data are linearly separable
• Boundary that classifies the training set correctly and,
• That is maximally removed from training examples closest to the decision boundary

Maximum Margin Separator as an optimization problem
Assume data are linearly separable
• Boundary that classifies the training set correctly and,
• That is maximally removed from training examples closest to the decision boundary
  s.t. y^(i)(θ̄ · x̄^(i) + b) ≥ 1 for i ∈ {1, …, n}

Maximum Margin Separator as an optimization problem
Assume data are linearly separable
• Boundary that classifies the training set correctly: y^(i)(θ̄ · x̄^(i) + b) ≥ 1 for i ∈ {1, …, n}
• That is maximally removed from training examples closest to the decision boundary
Define γ^(i)(θ̄, b) to be the distance of datapoint i to the hyperplane θ̄ · x̄ + b = 0, and let min denote the minimum over all datapoints.
[Figure: two separating hyperplanes, θ̄^(1) · x̄ + b_1 = 0 and θ̄^(2) · x̄ + b_2 = 0]
A. min_i γ^(i)(θ̄^(1), b_1) > min_i γ^(i)(θ̄^(2), b_2)
C. min_i γ^(i)(θ̄^(1), b_1) = min_i γ^(i)(θ̄^(2), b_2)
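For the quantity γ^(i)(θ̄, b) defined above, here is a short sketch (with illustrative names) of computing each training point's distance to the hyperplane θ̄ · x̄ + b = 0 and taking the minimum over all datapoints:

    import numpy as np

    def min_margin(theta, b, X, y):
        """Smallest distance gamma^(i)(theta, b) from any training point to the hyperplane."""
        # signed distance of x_i to {x : theta . x + b = 0} is (theta . x_i + b) / ||theta||;
        # multiplying by y_i makes it positive when x_i is on the correct side
        gammas = y * (X @ theta + b) / np.linalg.norm(theta)
        return gammas.min()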

Maximum Margin Separator as an optimization problem
Assume data are linearly separable
(support vectors lie on the margin boundaries)

max_{θ̄, b} min_i γ^(i)(θ̄, b)
subject to y^(i)(θ̄ · x̄^(i) + b) ≥ 1 for i ∈ {1, …, n}

i.e., maximize over all (θ̄, b) that satisfy the constraints.

Hard Margin SVM Assuming data are linearly separable
optimization problem:
objective: min_{θ̄, b} (1/2)||θ̄||²
linear inequality constraints: y^(i)(θ̄ · x̄^(i) + b) ≥ 1 for i ∈ {1, …, n}

Linear classifier output by this QP: sign(θ̄ · x̄ + b)
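This QP can be handed to an off-the-shelf solver. Below is a sketch using cvxpy (our choice of solver, not something the slides prescribe) that encodes the objective and constraints above; it assumes the data really are linearly separable so the problem is feasible.

    import cvxpy as cp
    import numpy as np

    def hard_margin_svm(X, y):
        """Solve min_{theta, b} ||theta||^2 / 2  s.t.  y_i (theta . x_i + b) >= 1 for all i."""
        n, d = X.shape
        theta = cp.Variable(d)
        b = cp.Variable()
        constraints = [cp.multiply(y, X @ theta + b) >= 1]
        problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(theta)), constraints)
        problem.solve()
        return theta.value, b.value

    # the resulting linear classifier predicts sign(theta . x + b)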

Section 3: Soft Margin SVMs

Soft-Margin SVM
Suppose data are not linearly separable
SVM formulation:
min_{θ̄, b} (1/2)||θ̄||² subject to y^(i)(θ̄ · x̄^(i) + b) ≥ 1 for i ∈ {1, …, n}

What goes wrong?
