Machine learning lecture slides
COMS 4771 Fall 2020
Optimization I: Convex optimization
Outline
Convex sets and convex functions
Local minimizers and global minimizers
Gradient descent
Analysis for smooth objective functions
Stochastic gradient method
Gradient descent for least squares linear regression
Convex sets
Convex set: a set that contains every line segment between pairs of points in the set.
Examples:
All of R^d
Empty set
Half-spaces
Intersections of convex sets
Convex hulls
Figure 1: Which of these sets are convex? (Left to right: convex, not convex, convex, convex.)
Convex functions (1)
Convex function: a function satisfying the two-point version of Jensen’s inequality:
f((1−α)w + αw′) ≤ (1−α)f(w) + αf(w′) for all w, w′ ∈ R^d, α ∈ [0, 1].
Figure 2: Which of these functions are convex? (Left: f is not convex; right: f is convex.)
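A quick numeric sanity check of this definition (an illustrative sketch; the helper name and test functions are assumptions, not from the slides): sample random pairs of points and values of α, and look for violations of the inequality. Such a test can only refute convexity, never certify it.

    import numpy as np

    def looks_convex(f, d, trials=10000, seed=0):
        # Randomly test the two-point Jensen inequality; False refutes convexity.
        rng = np.random.default_rng(seed)
        for _ in range(trials):
            w, wp = rng.normal(size=d), rng.normal(size=d)
            a = rng.uniform()
            if f((1 - a) * w + a * wp) > (1 - a) * f(w) + a * f(wp) + 1e-9:
                return False
        return True

    print(looks_convex(np.linalg.norm, d=3))    # True: norms are convex
    print(looks_convex(lambda w: -w @ w, d=3))  # False: -||w||^2 is not convex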
Convex functions (2)
Examples:
f(w) = c for c ∈ R
f(w) = exp(w) (on R)
f(w) = |w|^c for c ≥ 1 (on R)
f(w) = b^T w for b ∈ R^d
f(w) = ∥w∥ for any norm ∥·∥
f(w) = w^T A w for any symmetric positive semidefinite matrix A
w → a f(w) + g(w) for convex functions f, g and a ≥ 0
w → max{f(w), g(w)} for convex functions f, g
f(w) = logsumexp(w) = ln ∑_{i=1}^d exp(w_i)
w → f(g(w)) for convex function f and affine function g
Verifying convexity of Euclidean norm
Verify f(w) = ∥w∥ is convex: by the triangle inequality and homogeneity of norms,
∥(1−α)w + αw′∥ ≤ ∥(1−α)w∥ + ∥αw′∥ = (1−α)∥w∥ + α∥w′∥.
Convexity of differentiable functions (1)
Differentiable function f is convex iff f(w) ≥ f(w_0) + ∇f(w_0)^T (w − w_0) for all w, w_0 ∈ R^d.
Figure 3: Affine approximation a(x) to f(x) at x_0.
Twice-differentiable function f is convex iff ∇²f(w) is positive semidefinite for all w ∈ R^d.
Convexity of differentiable functions (2)
Example: Verify f(w) = w^4 is convex.
Use second-order condition: f′′(w) = 12w^2 ≥ 0 for all w ∈ R.
Convexity of differentiable functions (3)
Example: Verify f(w) = e^{b^T w} for b ∈ R^d is convex.
Use first-order condition: ∇f(w_0) = e^{b^T w_0} b, and since e^z ≥ 1 + z,
f(w) = e^{b^T w_0} e^{b^T (w − w_0)} ≥ e^{b^T w_0} (1 + b^T (w − w_0)) = f(w_0) + ∇f(w_0)^T (w − w_0).
Verifying convexity of least squares linear regression
Verify f(w) = ∥Aw − b∥_2^2 is convex: ∇²f(w) = 2A^T A, which is positive semidefinite since v^T A^T A v = ∥Av∥_2^2 ≥ 0 for all v.
Verifying convexity of logistic regression MLE problem
Verify f(w) = (1/n) ∑_{i=1}^n ln(1 + e^{−y_i x_i^T w}) is convex: each term is the convex function z → ln(1 + e^{−z}) composed with the affine map w → y_i x_i^T w, and an average of convex functions is convex.
Local minimizers
Say w⋆ ∈ R^d is a local minimizer of f: R^d → R if there is an “open ball” U = {w ∈ R^d : ∥w − w⋆∥_2 < r} of positive radius r > 0 such that f(w⋆) ≤ f(w) for all w ∈ U.
I.e., nothing looks better in the immediate vicinity of w⋆.
Figure 4: Local minimizer (a locally optimal point).
Local minimizers of convex problems
If f is convex, and w⋆ is a local minimizer, then it is also a global minimizer.
“Local to global” phenomenon
Local search is well-motivated for convex optimization problems
Figure 5: Local-to-global phenomenon.
Gradient descent
Consider the (unconstrained) convex optimization problem min_{w ∈ R^d} f(w).
Gradient descent: iterative algorithm for (approximately) minimizing f
Given initial iterate w^(0) ∈ R^d and step size η > 0,
For t = 1, 2, …:
  w^(t) := w^(t−1) − η ∇f(w^(t−1)).
(Lots of things unspecified here…)
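A minimal NumPy sketch of this loop (the fixed iteration count and the example objective are assumptions, since the slides leave these unspecified):

    import numpy as np

    def gradient_descent(grad_f, w0, eta, num_iters):
        # w^(t) := w^(t-1) - eta * grad f(w^(t-1))
        w = np.asarray(w0, dtype=float)
        for _ in range(num_iters):
            w = w - eta * grad_f(w)
        return w

    # Example: minimize f(w) = ||w||_2^2 / 2, whose gradient is w.
    w_hat = gradient_descent(lambda w: w, w0=np.ones(3), eta=0.1, num_iters=100)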
Motivation for gradient descent
Why move in direction of (negative) gradient?
Affine approximation of f(w + δ) around w:
f(w + δ) ≈ f(w) + ∇f(w)^T δ.
Therefore, want δ such that ∇f(w)^T δ < 0.
Use δ := −η ∇f(w) for some η > 0:
∇f(w)^T (−η ∇f(w)) = −η ∥∇f(w)∥_2^2 < 0
as long as ∇f(w) ≠ 0.
Need η to be small enough that there is still improvement despite the error of the affine approximation.
Figure 6: Trajectory of gradient descent (iterates x_0, x_1, x_2, x_3, x_4).
Example: Gradient of logistic loss
Negative gradient of logistic loss on the i-th training example: using the chain rule,
−∇ ℓ_logistic(y_i x_i^T w) = −ℓ′_logistic(y_i x_i^T w) y_i x_i
= (1 − 1/(1 + exp(−y_i x_i^T w))) y_i x_i
= (1 − σ(y_i x_i^T w)) y_i x_i
where σ is the sigmoid function.
Recall, Pr_w(Y = y | X = x) = σ(y x^T w) for (X, Y) following the logistic regression model.
Example: Gradient descent for logistic regression
Objective function:
f(w) := (1/n) ∑_{i=1}^n ℓ_logistic(y_i x_i^T w).
Gradient descent: given initial iterate w^(0) ∈ R^d and step size η > 0,
For t = 1, 2, …:
  w^(t) := w^(t−1) − η ∇f(w^(t−1))
         = w^(t−1) + (η/n) ∑_{i=1}^n (1 − σ(y_i x_i^T w^(t−1))) y_i x_i
Interpretation of update: how much of y_i x_i to add to w^(t−1) is scaled by how far σ(y_i x_i^T w^(t−1)) currently is from 1.
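This update in NumPy (a sketch assuming X is an n × d matrix with rows x_i^T and y is a vector of ±1 labels; the function names are assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_gd(X, y, eta, num_iters):
        # Full-batch gradient descent on the logistic regression objective.
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(num_iters):
            coeffs = 1.0 - sigmoid(y * (X @ w))       # 1 - sigma(y_i x_i^T w)
            w = w + (eta / n) * (X.T @ (coeffs * y))  # add scaled y_i x_i terms
        return w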
Convergence of gradient descent on smooth objectives
Theorem: Assume f is twice-differentiable and convex, and λ_max(∇²f(w)) ≤ β for all w ∈ R^d (“f is β-smooth”). Then gradient descent with step size η := 1/β satisfies
f(w^(t)) ≤ f(w⋆) + β∥w^(0) − w⋆∥_2^2 / (2t).
Same holds even if f is only once-differentiable, as long as the gradient ∇f(w) does not change too quickly with w:
∥∇f(w) − ∇f(w′)∥_2 ≤ β∥w − w′∥_2.
Note: it is possible to have convergence even with η > 1/β in some cases; should really treat η as a hyperparameter.
Example: smoothness of empirical risk with squared loss
Empirical risk with squared loss:
∇² (½ ∥Aw − b∥_2^2) = A^T A.
So the objective function is β-smooth with β = λ_max(A^T A).
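For example, the smoothness constant and the step size from the theorem can be computed as follows (a sketch; the matrix here is a random stand-in):

    import numpy as np

    A = np.random.randn(100, 5)               # stand-in design matrix
    beta = np.linalg.eigvalsh(A.T @ A).max()  # beta = lambda_max(A^T A)
    eta = 1.0 / beta                          # step size from the theorem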
Example: smoothness of empirical risk with logistic loss
Empirical risk with logistic loss:
∇² [(1/n) ∑_{i=1}^n ln(1 + exp(−y_i x_i^T w))] = (1/n) ∑_{i=1}^n σ(y_i x_i^T w)(1 − σ(y_i x_i^T w)) x_i x_i^T.
Since σ(z)(1 − σ(z)) ≤ 1/4, the objective is β-smooth with β = λ_max((1/(4n)) ∑_{i=1}^n x_i x_i^T).
Figure 7: Gradient descent for logistic regression.
Analysis of gradient descent for smooth objectives (1)
By Taylor’s theorem, can upper-bound f(w + δ) by a quadratic:
f(w + δ) ≤ f(w) + ∇f(w)^T δ + (β/2)∥δ∥_2^2.
Gradient descent is based on making local quadratic upper-bounds, and minimizing that quadratic:
min_{δ ∈ R^d} f(w) + ∇f(w)^T δ + (β/2)∥δ∥_2^2.
Minimized by δ := −(1/β) ∇f(w).
Plug this value of δ into the above inequality to get
f(w − (1/β)∇f(w)) − f(w) ≤ −(1/(2β)) ∥∇f(w)∥_2^2.
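A quick numeric check of this descent inequality on a quadratic (an illustrative sketch; the particular f is an assumption):

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.normal(size=(4, 4))
    H = M.T @ M                         # PSD Hessian; f(w) = w^T H w / 2
    f = lambda w: 0.5 * w @ H @ w
    grad = lambda w: H @ w
    beta = np.linalg.eigvalsh(H).max()  # f is beta-smooth
    w = rng.normal(size=4)
    lhs = f(w - grad(w) / beta) - f(w)
    rhs = -grad(w) @ grad(w) / (2 * beta)
    print(lhs <= rhs + 1e-12)           # True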
Analysis of gradient descent for smooth objectives (2)
If f is convex (in addition to β-smooth), then repeatedly making such local changes is sufficient to approximately minimize f.
Figure 8: Linear and quadratic approximations to a convex function.
Example: Text classification (1)
Data: articles posted to various internet message boards
Label: −1 for articles from “religion”, +1 for articles from “politics”
Features:
Vocabulary of d = 61188 words
Each document is a binary vector x ∈ {0, 1}^d, where
x_i = 1{document contains i-th vocabulary word} (see the sketch below)
Executed gradient descent with η = 0.25 for 500 iterations
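A sketch of this featurization (the tiny vocabulary and document here are stand-ins, not the actual dataset):

    import numpy as np

    vocab = {"faith": 0, "vote": 1, "senate": 2}  # word -> index; here d = 3

    def featurize(document, vocab):
        # x_i = 1{document contains i-th vocabulary word}
        x = np.zeros(len(vocab))
        for word in set(document.lower().split()):
            if word in vocab:
                x[vocab[word]] = 1.0
        return x

    print(featurize("The senate will vote today", vocab))  # [0. 1. 1.]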
Example: Text classification (2)
Figure 9: Objective value (empirical risk) as a function of the number of gradient descent iterations.
Example: Text classification (3)
Figure 10: Training and test error rates as a function of the number of gradient descent iterations.
Stochastic gradient method (1)
Every iteration of gradient descent takes Θ(nd) time.
Pass through all training examples to make a single update.
If n is enormous, too expensive to make many passes.
Alternative: Stochastic gradient descent (SGD)
Another example of plug-in principle!
Use one or a few training examples to estimate the gradient.
Gradient at w^(t):
(1/n) ∑_{j=1}^n ∇ℓ(y_j x_j^T w^(t)). (A.k.a. full batch gradient.)
Pick term J uniformly at random: ∇ℓ(y_J x_J^T w^(t)).
What is expected value of this random vector?
Stochastic gradient method (2)
Minibatch:
To reduce variance of the estimate, use several random examples J_1, …, J_B and average (called the minibatch gradient; see the sketch below):
(1/B) ∑_{b=1}^B ∇ℓ(y_{J_b} x_{J_b}^T w^(t)).
Rule of thumb: larger batch size B → larger step size η.
Alternative: instead of picking examples uniformly at random, shuffle the order of training examples, and take the next example in this order.
Verify that the expected value is the same!
Seems to reduce variance as well, but not fully understood.
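A sketch of the uniform-sampling minibatch method (names are assumptions; grad_loss(z) should return ℓ′(z), e.g., z → −(1 − σ(z)) for the logistic loss):

    import numpy as np

    def minibatch_sgd(X, y, grad_loss, eta, batch_size, num_iters, seed=0):
        # Each step estimates the full-batch gradient from B random examples.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(num_iters):
            J = rng.integers(n, size=batch_size)  # J_1, ..., J_B uniform
            z = y[J] * (X[J] @ w)                 # margins y_j x_j^T w^(t)
            g = (X[J].T @ (grad_loss(z) * y[J])) / batch_size
            w = w - eta * g
        return w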
Example: SGD for logistic regression
Logistic regression MLE for data (x_1, y_1), …, (x_n, y_n) ∈ R^d × {−1, +1}.
Start with w^(0) ∈ R^d, η > 0, t = 1.
For epoch p = 1, 2, …:
  For each training example (x, y) in a random order:
    w^(t) := w^(t−1) + η (1 − σ(y x^T w^(t−1))) y x
    t := t + 1.
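The same loop in NumPy (a sketch under the same data conventions as above):

    import numpy as np

    def sgd_logistic(X, y, eta, num_epochs, seed=0):
        # One update per example, in a freshly shuffled order each epoch.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(num_epochs):
            for i in rng.permutation(n):
                z = y[i] * (X[i] @ w)             # margin y x^T w
                sigma = 1.0 / (1.0 + np.exp(-z))
                w = w + eta * (1.0 - sigma) * y[i] * X[i]
        return w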
Optimization for linear regression
Back to considering ordinary least squares.
Gaussian elimination to solve the normal equations can be slow when d is large (time is O(nd^2)).
Alternative: find approximate solution using gradient descent
Algorithm: start with some w^(0) ∈ R^d and η > 0.
For t = 1, 2, …:
  w^(t) := w^(t−1) − 2η A^T (A w^(t−1) − b)
Time to multiply matrix by vector is linear in matrix size. So each iteration takes time O(nd).
Can describe behavior of gradient descent for least squares (empirical risk) objective very precisely.
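A sketch of this algorithm (per the next slide, a step size with 2ηλ_1 < 1, where λ_1 = λ_max(A^T A), guarantees convergence):

    import numpy as np

    def least_squares_gd(A, b, eta, num_iters):
        # Gradient descent on f(w) = ||Aw - b||_2^2, starting from w = 0.
        w = np.zeros(A.shape[1])
        for _ in range(num_iters):
            w = w - 2.0 * eta * (A.T @ (A @ w - b))  # O(nd) per iteration
        return w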
Behavior of gradient descent for linear regression
Theorem: Let ŵ be the minimum Euclidean norm solution to the normal equations. Assume w^(0) = 0. Write the eigendecomposition A^T A = ∑_{i=1}^r λ_i v_i v_i^T with λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_r > 0. Then w^(t) ∈ range(A^T) and
v_i^T w^(t) = 2ηλ_i ∑_{k=0}^{t−1} (1 − 2ηλ_i)^k · v_i^T ŵ, i = 1, …, r.
Implications:
If we choose η such that 2ηλ_i < 1, then
2ηλ_i ∑_{k=0}^{t−1} (1 − 2ηλ_i)^k = 1 − (1 − 2ηλ_i)^t,
which converges to 1 as t → ∞.
So, when 2ηλ_1 < 1, we have w^(t) → ŵ as t → ∞.
Rate of convergence is geometric, i.e., “exponentially fast convergence”.
Algorithmic inductive bias!
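A numeric check of this convergence (an illustrative sketch; the random problem instance is an assumption):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 10))
    b = rng.normal(size=50)
    lam1 = np.linalg.eigvalsh(A.T @ A).max()  # lambda_1
    eta = 0.4 / lam1                          # 2 * eta * lambda_1 = 0.8 < 1
    w = np.zeros(10)
    for _ in range(2000):
        w = w - 2.0 * eta * (A.T @ (A @ w - b))
    w_hat = np.linalg.pinv(A) @ b             # minimum-norm solution
    print(np.linalg.norm(w - w_hat))          # ~0: w^(t) converged to w_hat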
Postscript
There are many optimization algorithms for convex optimization
Gradient descent, Newton’s method, BFGS, coordinate descent, mirror descent, etc.
Stochastic variants thereof
Many also usable even when objective function is non-convex
Typically just converge to a local minimizer or stationary point
Can also handle constraints on the optimization variable
E.g., want coordinates of w to lie in a specific range
The algorithmic inductive bias is not always well-understood, but it is there!