
Machine learning lecture slides
COMS 4771 Fall 2020

Optimization I: Convex optimization

Outline
- Convex sets and convex functions
- Local minimizers and global minimizers
- Gradient descent
- Analysis for smooth objective functions
- Stochastic gradient method
- Gradient descent for least squares linear regression

Convex sets
- Convex set: a set that contains every line segment between pairs of points in the set.
- Examples:
  - All of R^d
  - Empty set
  - Half-spaces
  - Intersections of convex sets
  - Convex hulls

Figure 1: Which of these sets are convex? (panels labeled, left to right: convex, not convex, convex, convex)

Convex functions (1)
- Convex function: a function satisfying the two-point version of Jensen's inequality:
  f((1−α)w + αw′) ≤ (1−α)f(w) + αf(w′) for all w, w′ ∈ R^d, α ∈ [0,1].

Figure 2: Which of these functions are convex? (left panel: f is not convex; right panel: f is convex)

Convex functions (2)
- Examples (one of these is numerically spot-checked in the sketch below):
  - f(w) = c for c ∈ R
  - f(w) = exp(w) (on R)
  - f(w) = |w|^c for c ≥ 1 (on R)
  - f(w) = b^T w for b ∈ R^d
  - f(w) = ∥w∥ for any norm ∥·∥
  - f(w) = w^T A w for any symmetric positive semidefinite matrix A
  - w ↦ a f(w) + g(w) for convex functions f, g and a ≥ 0
  - w ↦ max{f(w), g(w)} for convex functions f, g
  - f(w) = logsumexp(w) = ln(∑_{i=1}^d exp(w_i))
  - w ↦ f(g(w)) for convex function f and affine function g
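
A quick numerical spot-check of the two-point Jensen inequality for one of these, f = logsumexp, in Python (random trials over illustrative inputs; checking finitely many points is of course not a proof):

```python
import numpy as np

def logsumexp(w):
    # ln(sum_i exp(w_i)), computed stably by shifting by max(w)
    m = np.max(w)
    return m + np.log(np.sum(np.exp(w - m)))

rng = np.random.default_rng(0)
for _ in range(1000):
    w, wp = rng.normal(size=5), rng.normal(size=5)
    a = rng.uniform()
    lhs = logsumexp((1 - a) * w + a * wp)
    rhs = (1 - a) * logsumexp(w) + a * logsumexp(wp)
    assert lhs <= rhs + 1e-9  # two-point Jensen inequality, up to rounding
```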

Verifying convexity of Euclidean norm
- Verify f(w) = ∥w∥ is convex
  - For any w, w′ ∈ R^d and α ∈ [0,1], the triangle inequality and homogeneity of norms give
    ∥(1−α)w + αw′∥ ≤ ∥(1−α)w∥ + ∥αw′∥ = (1−α)∥w∥ + α∥w′∥.

Convexity of differentiable functions (1)
- Differentiable function f is convex iff
  f(w) ≥ f(w_0) + ∇f(w_0)^T (w − w_0) for all w, w_0 ∈ R^d.

Figure 3: Affine approximation a(x) to f(x) at x_0

- Twice-differentiable function f is convex iff ∇²f(w) is positive semidefinite for all w ∈ R^d.

Convexity of differentiable functions (2)
- Example: Verify f(w) = w⁴ is convex
  - Use second-order condition: f′′(w) = 12w² ≥ 0 for all w ∈ R.

Convexity of differentiable functions (3)
- Example: Verify f(w) = e^(b^T w) for b ∈ R^d is convex
  - Use first-order condition: ∇f(w_0) = e^(b^T w_0) b, and since e^z ≥ 1 + z for all z,
    f(w) = f(w_0) e^(b^T (w − w_0)) ≥ f(w_0)(1 + b^T (w − w_0)) = f(w_0) + ∇f(w_0)^T (w − w_0).

Verifying convexity of least squares linear regression
- Verify f(w) = ∥Aw − b∥_2² is convex
  - By the second-order condition: ∇f(w) = 2 A^T (Aw − b), so ∇²f(w) = 2 A^T A, which is positive semidefinite since v^T (2 A^T A) v = 2 ∥Av∥_2² ≥ 0 for all v (checked numerically below).
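
A minimal numpy sketch of this check (with an arbitrary random A standing in for real data): the Hessian 2 A^T A should have no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))  # arbitrary example matrix

H = 2 * A.T @ A               # Hessian of w -> ||Aw - b||_2^2 (constant in w)
eigs = np.linalg.eigvalsh(H)  # eigenvalues of the symmetric matrix H
assert eigs.min() >= -1e-10   # positive semidefinite, up to rounding
```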

Verifying convexity of logistic regression MLE problem
- Verify f(w) = (1/n) ∑_{i=1}^n ln(1 + e^(−y_i x_i^T w)) is convex
  - Each term is the convex function z ↦ ln(1 + e^(−z)) composed with the affine map w ↦ y_i x_i^T w, and an average of convex functions is convex.

Local minimizers
- Say w⋆ ∈ R^d is a local minimizer of f : R^d → R if there is an "open ball" U = {w ∈ R^d : ∥w − w⋆∥_2 < r} of positive radius r > 0 such that f(w⋆) ≤ f(w) for all w ∈ U.
- I.e., nothing looks better in the immediate vicinity of w⋆.

Figure 4: Local minimizer (a locally optimal point)

Local minimizers of convex problems
- If f is convex, and w⋆ is a local minimizer, then it is also a global minimizer.
- "Local to global" phenomenon
- Local search is well-motivated for convex optimization problems

Figure 5: Local-to-global phenomenon (a local minimizer is also global)

Gradient descent
- Consider the (unconstrained) convex optimization problem
  min_{w ∈ R^d} f(w).
- Gradient descent: iterative algorithm for (approximately) minimizing f
  - Given initial iterate w^(0) ∈ R^d and step size η > 0,
  - For t = 1, 2, …:
      w^(t) := w^(t−1) − η ∇f(w^(t−1)).
  - (Lots of things unspecified here…; a code sketch follows below.)
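
A minimal Python sketch of this loop (the helper name gradient_descent, the toy objective, and the fixed iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, w0, eta, num_iters):
    # Repeatedly step in the direction of the negative gradient.
    w = np.array(w0, dtype=float)
    for _ in range(num_iters):
        w = w - eta * grad_f(w)
    return w

# Toy example: minimize f(w) = ||w - c||_2^2, whose gradient is 2(w - c).
c = np.array([1.0, -2.0])
w_hat = gradient_descent(lambda w: 2 * (w - c), w0=np.zeros(2), eta=0.1, num_iters=200)
# w_hat is approximately c
```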

Motivation for gradient descent
- Why move in direction of (negative) gradient?
- Affine approximation of f(w + δ) around w:
  f(w + δ) ≈ f(w) + ∇f(w)^T δ.
- Therefore, want δ such that ∇f(w)^T δ < 0
- Use δ := −η ∇f(w) for some η > 0:
  ∇f(w)^T (−η ∇f(w)) = −η ∥∇f(w)∥_2² < 0 as long as ∇f(w) ≠ 0.
- Need η to be small enough so still have improvement given the error of the affine approximation.

Figure 6: Trajectory of gradient descent (iterates x0, x1, x2, x3, x4 moving toward the minimizer)

Example: Gradient of logistic loss

- Negative gradient of logistic loss on the i-th training example: using the chain rule,
  −∇{l_logistic(y_i x_i^T w)} = −l′_logistic(y_i x_i^T w) y_i x_i
                              = (1 − 1/(1 + exp(−y_i x_i^T w))) y_i x_i
                              = (1 − σ(y_i x_i^T w)) y_i x_i,
  where σ is the sigmoid function.
- Recall, Pr_w(Y = y | X = x) = σ(y x^T w) for (X, Y) following the logistic regression model.

Example: Gradient descent for logistic regression

- Objective function:
  f(w) := (1/n) ∑_{i=1}^n l_logistic(y_i x_i^T w).
- Gradient descent: given initial iterate w^(0) ∈ R^d and step size η > 0,
  - For t = 1, 2, …:
      w^(t) := w^(t−1) − η ∇f(w^(t−1))
             = w^(t−1) + (η/n) ∑_{i=1}^n (1 − σ(y_i x_i^T w^(t−1))) y_i x_i
- Interpretation of update: how much of y_i x_i to add to w^(t−1) is scaled by how far σ(y_i x_i^T w^(t−1)) currently is from 1 (implemented in the sketch below).
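
A minimal numpy sketch of this full-batch update (the function names and the synthetic-data assumption are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_logistic(X, y, eta, num_iters):
    # X: (n, d) matrix with rows x_i; y: (n,) labels in {-1, +1}
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        margins = y * (X @ w)            # y_i x_i^T w for all i at once
        coeffs = 1.0 - sigmoid(margins)  # how far each sigma(...) is from 1
        w = w + (eta / n) * (X.T @ (coeffs * y))  # add scaled y_i x_i terms
    return w
```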

Convergence of gradient descent on smooth objectives
- Theorem: Assume f is twice-differentiable and convex, and λ_max(∇²f(w)) ≤ β for all w ∈ R^d ("f is β-smooth"). Then gradient descent with step size η := 1/β satisfies
  f(w^(t)) ≤ f(w⋆) + β ∥w^(0) − w⋆∥_2² / (2t).
- The same holds even if f is only once-differentiable, as long as the gradient ∇f(w) does not change too fast with w:
  ∥∇f(w) − ∇f(w′)∥_2 ≤ β ∥w − w′∥_2.
- Note: it is possible to have convergence even with η > 1/β in some cases; should really treat η as a hyperparameter.

Example: smoothness of empirical risk with squared loss
- Empirical risk with squared loss:
  ∇²(∥Aw − b∥_2²) = 2 A^T A.
  So the objective function is β-smooth with β = λ_max(2 A^T A) = 2 λ_max(A^T A) (computed in the sketch below).
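
A short numpy sketch of computing this smoothness constant and the step size η = 1/β from the theorem (the example A here is arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))  # arbitrary example design matrix

beta = 2 * np.linalg.eigvalsh(A.T @ A).max()  # smoothness of ||Aw - b||_2^2
eta = 1.0 / beta                              # step size suggested by the theorem
```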

Example: smoothness of empirical risk with logistic loss
- Empirical risk with logistic loss:
  ∇²((1/n) ∑_{i=1}^n ln(1 + exp(−y_i x_i^T w))) = (1/n) ∑_{i=1}^n σ(y_i x_i^T w)(1 − σ(y_i x_i^T w)) x_i x_i^T.
  Since σ(z)(1 − σ(z)) ≤ 1/4 and y_i² = 1, the objective is β-smooth with β = λ_max((1/(4n)) ∑_{i=1}^n x_i x_i^T) (computed in the sketch below).
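
The corresponding bound in numpy (again with arbitrary example data X whose rows are the x_i):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # rows are the x_i

n = X.shape[0]
beta = np.linalg.eigvalsh(X.T @ X / (4 * n)).max()  # smoothness bound for logistic risk
eta = 1.0 / beta
```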

Figure 7: Gradient descent for logistic regression (contour plot of the objective with the trajectory of iterates)

Analysis of gradient descent for smooth objectives (1)
- By Taylor's theorem, can upper-bound f(w + δ) by a quadratic:
  f(w + δ) ≤ f(w) + ∇f(w)^T δ + (β/2) ∥δ∥_2².
- Gradient descent is based on making local quadratic upper-bounds, and minimizing that quadratic:
  min_{δ ∈ R^d} f(w) + ∇f(w)^T δ + (β/2) ∥δ∥_2².
  Minimized by δ := −(1/β) ∇f(w), since then ∇f(w)^T δ + (β/2) ∥δ∥_2² = −(1/β) ∥∇f(w)∥_2² + (1/(2β)) ∥∇f(w)∥_2².
- Plug in this value of δ into the above inequality to get
  f(w − (1/β) ∇f(w)) − f(w) ≤ −(1/(2β)) ∥∇f(w)∥_2².

Analysis of gradient descent for smooth objectives (2)
- If f is convex (in addition to β-smooth), then repeatedly making such local changes is sufficient to approximately minimize f.

Figure 8: Linear and quadratic approximations to a convex function

Example: Text classification (1)
- Data: articles posted to various internet message boards
- Label: −1 for articles from "religion", +1 for articles from "politics"
- Features:
  - Vocabulary of d = 61188 words
  - Each document is a binary vector x ∈ {0,1}^d, where
    x_i = 1{document contains i-th vocabulary word}
    (a feature-construction sketch follows below)
- Executed gradient descent with η = 0.25 for 500 iterations
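
A minimal sketch of constructing such binary bag-of-words features (the whitespace tokenization and the documents/vocabulary inputs are illustrative assumptions; the slides do not show the actual preprocessing pipeline):

```python
import numpy as np

def binary_features(documents, vocabulary):
    # documents: list of strings; vocabulary: list of d words
    index = {word: i for i, word in enumerate(vocabulary)}
    X = np.zeros((len(documents), len(vocabulary)))
    for row, doc in enumerate(documents):
        for token in doc.lower().split():
            if token in index:
                X[row, index[token]] = 1.0  # x_i = 1{document contains word i}
    return X
```

With X built this way, the gradient descent update from the previous slides applies unchanged.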

Example: Text classification (2)
Figure 9: Objective value (empirical risk) as a function of number of gradient descent iterations

Example: Text classification (3)
Figure 10: Error rate (training and test) as a function of number of gradient descent iterations

Stochastic gradient method (1)
- Every iteration of gradient descent takes Θ(nd) time.
  - Pass through all training examples to make a single update.
  - If n is enormous, too expensive to make many passes.
- Alternative: Stochastic gradient descent (SGD)
  - Another example of the plug-in principle!
  - Use one or a few training examples to estimate the gradient.
- Gradient at w^(t) (a.k.a. full batch gradient):
  (1/n) ∑_{j=1}^n ∇l(y_j x_j^T w^(t)).
- Pick term J uniformly at random: ∇l(y_J x_J^T w^(t)).
- What is the expected value of this random vector? (Since J is uniform on {1, …, n}, the expectation is exactly the full batch gradient, so the estimate is unbiased.)

Stochastic gradient method (2)
- Minibatch
  - To reduce the variance of the estimate, use several random examples J_1, …, J_B and average, called the minibatch gradient (see the sketch below):
    (1/B) ∑_{b=1}^B ∇l(y_{J_b} x_{J_b}^T w^(t)).
  - Rule of thumb: larger batch size B → larger step size η.
- Alternative: instead of picking examples uniformly at random, shuffle the order of training examples, and take the next example in this order.
  - Verify that the expected value is the same!
  - Seems to reduce variance as well, but not fully understood.
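
A minimal numpy sketch of the minibatch gradient estimate for the logistic loss (the batch size and sampling with replacement are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gradient(X, y, w, batch_size, rng):
    # Estimate the full-batch gradient from a random subset of examples.
    J = rng.integers(0, X.shape[0], size=batch_size)  # J_1, ..., J_B, uniform
    Xb, yb = X[J], y[J]
    coeffs = 1.0 - sigmoid(yb * (Xb @ w))
    # Average of per-example gradients -(1 - sigma(y x^T w)) y x
    return -(Xb.T @ (coeffs * yb)) / batch_size
```

In expectation over J_1, …, J_B, this returns exactly the full batch gradient.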

Example: SGD for logistic regression
- Logistic regression MLE for data (x_1, y_1), …, (x_n, y_n) ∈ R^d × {−1, +1} (a runnable sketch follows below).
- Start with w^(0) ∈ R^d, η > 0, t = 1
- For epoch p = 1, 2, …:
  - For each training example (x, y) in a random order:
      w^(t) := w^(t−1) + η (1 − σ(y x^T w^(t−1))) y x
      t := t + 1.
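
The same procedure as a runnable Python sketch (the epoch count and zero initialization are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, eta, num_epochs, seed=0):
    # One update per training example, in a freshly shuffled order each epoch.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        for i in rng.permutation(n):
            w = w + eta * (1.0 - sigmoid(y[i] * (X[i] @ w))) * y[i] * X[i]
    return w
```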

Optimization for linear regression
- Back to considering ordinary least squares.
- Gaussian elimination to solve the normal equations can be slow when d is large (time is O(nd²)).
- Alternative: find an approximate solution using gradient descent (a sketch follows below)
- Algorithm: start with some w^(0) ∈ R^d and η > 0.
  - For t = 1, 2, …:
      w^(t) := w^(t−1) − 2η A^T (A w^(t−1) − b)
  - Time to multiply a matrix by a vector is linear in the matrix size, so each iteration takes time O(nd).
- Can describe the behavior of gradient descent for the least squares (empirical risk) objective very precisely.
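
A minimal numpy sketch of this algorithm on random illustrative data, checking convergence toward the minimum-norm solution ŵ (computed here via the pseudoinverse):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
b = rng.normal(size=50)

# Choose eta so that 2*eta*lambda_i <= 1 for every eigenvalue lambda_i of A^T A.
eta = 1.0 / (2 * np.linalg.eigvalsh(A.T @ A).max())

w = np.zeros(A.shape[1])
for _ in range(2000):
    w = w - 2 * eta * A.T @ (A @ w - b)  # gradient step for ||Aw - b||_2^2

w_hat = np.linalg.pinv(A) @ b  # minimum-norm solution to the normal equations
print(np.linalg.norm(w - w_hat))  # should be near zero
```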

Behavior of gradient descent for linear regression
- Theorem: Let ŵ be the minimum Euclidean norm solution to the normal equations. Assume w^(0) = 0. Write the eigendecomposition A^T A = ∑_{i=1}^r λ_i v_i v_i^T with λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_r > 0. Then w^(t) ∈ range(A^T) and
  v_i^T w^(t) = (2ηλ_i ∑_{k=0}^{t−1} (1 − 2ηλ_i)^k) v_i^T ŵ,  i = 1, …, r.
- Implications:
  - If we choose η such that 2ηλ_i < 1, then
    2ηλ_i ∑_{k=0}^{t−1} (1 − 2ηλ_i)^k = 1 − (1 − 2ηλ_i)^t,
    which converges to 1 as t → ∞.
  - So, when 2ηλ_1 < 1, we have w^(t) → ŵ as t → ∞.
  - Rate of convergence is geometric, i.e., "exponentially fast convergence".
  - Algorithmic inductive bias!

Postscript

- There are many optimization algorithms for convex optimization
  - Gradient descent, Newton's method, BFGS, coordinate descent, mirror descent, etc.
  - Stochastic variants thereof
- Many are also usable even when the objective function is non-convex
  - Typically just converge to a local minimizer or stationary point
- Can also handle constraints on the optimization variable
  - E.g., want coordinates of w to lie in a specific range
- The algorithmic inductive bias is not always well-understood, but it is there!