Machine learning lecture slides
COMS 4771 Fall 2020
Optimization I: Convex optimization
Outline
- Convex sets and convex functions
- Local minimizers and global minimizers
- Gradient descent
- Analysis for smooth objective functions
- Stochastic gradient method
- Gradient descent for least squares linear regression
Convex sets
- Convex set: a set that contains every line segment between pairs of points in the set.
- Examples (see the short argument below for half-spaces):
  - All of $\mathbb{R}^d$
  - The empty set
  - Half-spaces
  - Intersections of convex sets
  - Convex hulls
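As a quick illustration of one of these examples (not spelled out on the slide), here is why a half-space $H = \{w \in \mathbb{R}^d : a^\top w \le c\}$ is convex: for any $w, w' \in H$ and $\alpha \in [0, 1]$,
$$a^\top\bigl((1-\alpha)w + \alpha w'\bigr) = (1-\alpha)\, a^\top w + \alpha\, a^\top w' \le (1-\alpha)c + \alpha c = c,$$
so every point on the segment between $w$ and $w'$ is again in $H$.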
Figure 1: Which of these sets are convex? (From left to right: convex, not convex, convex, convex.)
Convex functions (1)
- Convex function: a function satisfying the two-point version of Jensen's inequality:
  $$f((1-\alpha)w + \alpha w') \le (1-\alpha)f(w) + \alpha f(w'), \quad w, w' \in \mathbb{R}^d,\ \alpha \in [0, 1].$$
Figure 2: Which of these functions are convex? (Left panel: $f$ is not convex; right panel: $f$ is convex.)
Convex functions (2)
- Examples:
  - $f(w) = c$ for $c \in \mathbb{R}$
  - $f(w) = \exp(w)$ (on $\mathbb{R}$)
  - $f(w) = |w|^c$ for $c \ge 1$ (on $\mathbb{R}$)
  - $f(w) = b^\top w$ for $b \in \mathbb{R}^d$
  - $f(w) = \|w\|$ for any norm $\|\cdot\|$
  - $f(w) = w^\top A w$ for any symmetric positive semidefinite matrix $A$
  - $w \mapsto a f(w) + g(w)$ for convex functions $f, g$ and $a \ge 0$
  - $w \mapsto \max\{f(w), g(w)\}$ for convex functions $f, g$
  - $f(w) = \operatorname{logsumexp}(w) = \ln\bigl(\sum_{i=1}^d \exp(w_i)\bigr)$
  - $w \mapsto f(g(w))$ for convex function $f$ and affine function $g$
Verifying convexity of Euclidean norm
- Verify that $f(w) = \|w\|$ is convex (derivation filled in below).
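One way to fill in this exercise, combining the triangle inequality with the absolute homogeneity of norms: for any $w, w' \in \mathbb{R}^d$ and $\alpha \in [0, 1]$,
$$\|(1-\alpha)w + \alpha w'\| \le \|(1-\alpha)w\| + \|\alpha w'\| = (1-\alpha)\|w\| + \alpha\|w'\|,$$
which is exactly the two-point Jensen inequality for $f(w) = \|w\|$.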
Convexity of differentiable functions (1)
- Differentiable function $f$ is convex iff
  $$f(w) \ge f(w_0) + \nabla f(w_0)^\top (w - w_0) \quad \text{for all } w, w_0 \in \mathbb{R}^d.$$
Figure 3: Affine approximation $a(x)$ to $f(x)$ at $x_0$.
- Twice-differentiable function $f$ is convex iff $\nabla^2 f(w)$ is positive semidefinite for all $w \in \mathbb{R}^d$.
Convexity of differentiable functions (2)
- Example: Verify that $f(w) = w^4$ is convex.
  - Use the second-order condition.
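Filling in the computation:
$$f'(w) = 4w^3, \qquad f''(w) = 12w^2 \ge 0 \quad \text{for all } w \in \mathbb{R},$$
so the second derivative is nonnegative everywhere and the second-order condition gives convexity.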
Convexity of differentiable functions (3)
- Example: Verify that $f(w) = e^{b^\top w}$ for $b \in \mathbb{R}^d$ is convex.
  - Use the first-order condition.
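Filling in the argument: here $\nabla f(w_0) = e^{b^\top w_0}\, b$, and the first-order condition reduces to the scalar inequality $e^z \ge 1 + z$ applied with $z = b^\top(w - w_0)$:
$$f(w_0) + \nabla f(w_0)^\top (w - w_0) = e^{b^\top w_0}\bigl(1 + b^\top(w - w_0)\bigr) \le e^{b^\top w_0}\, e^{b^\top(w - w_0)} = e^{b^\top w} = f(w).$$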
Verifying convexity of least squares linear regression
- Verify that $f(w) = \|Aw - b\|_2^2$ is convex (derivation filled in below).
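Filling in the exercise via the second-order condition:
$$\nabla f(w) = 2A^\top(Aw - b), \qquad \nabla^2 f(w) = 2A^\top A, \qquad v^\top (2A^\top A) v = 2\|Av\|_2^2 \ge 0 \ \text{for all } v \in \mathbb{R}^d,$$
so the Hessian is positive semidefinite everywhere and $f$ is convex.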
Verifying convexity of logistic regression MLE problem
- Verify that $f(w) = \frac{1}{n} \sum_{i=1}^n \ln(1 + e^{-y_i x_i^\top w})$ is convex (derivation filled in below).
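Filling in the exercise using the composition rules from the earlier examples: the scalar logistic loss $\ell(z) = \ln(1 + e^{-z})$ has
$$\ell''(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) \ge 0,$$
so it is convex; each term $w \mapsto \ln(1 + e^{-y_i x_i^\top w})$ is a convex function composed with an affine map, hence convex; and a nonnegative average of convex functions is convex.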
Local minimizers
- Say $w^\star \in \mathbb{R}^d$ is a local minimizer of $f : \mathbb{R}^d \to \mathbb{R}$ if there is an "open ball" $U = \{w \in \mathbb{R}^d : \|w - w^\star\|_2 < r\}$ of positive radius $r > 0$ such that $f(w^\star) \le f(w)$ for all $w \in U$.
  - I.e., nothing looks better in the immediate vicinity of $w^\star$.
Figure 4: Local minimizer (marked "locally optimal").
Local minimizers of convex problems
- If $f$ is convex and $w^\star$ is a local minimizer, then it is also a global minimizer.
  - "Local to global" phenomenon
  - Local search is well-motivated for convex optimization problems
Figure 5: Local-to-global phenomenon.
Gradient descent
- Consider the (unconstrained) convex optimization problem
  $$\min_{w \in \mathbb{R}^d} f(w).$$
- Gradient descent: iterative algorithm for (approximately) minimizing $f$.
  - Given initial iterate $w^{(0)} \in \mathbb{R}^d$ and step size $\eta > 0$,
  - For $t = 1, 2, \ldots$:
    $$w^{(t)} := w^{(t-1)} - \eta \nabla f(w^{(t-1)}).$$
- (Lots of things unspecified here; a concrete sketch follows below.)
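A minimal NumPy sketch of the update rule (the function `grad_f`, the step size, and the iteration count here are illustrative choices, not part of the slides):

```python
import numpy as np

def gradient_descent(grad_f, w0, eta, num_iters):
    """Run gradient descent: w^(t) = w^(t-1) - eta * grad_f(w^(t-1))."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - eta * grad_f(w)
    return w

# Example: minimize f(w) = ||w - c||_2^2, whose gradient is 2(w - c).
c = np.array([1.0, -2.0])
w_hat = gradient_descent(lambda w: 2 * (w - c), w0=np.zeros(2),
                         eta=0.1, num_iters=200)
# w_hat is now close to the minimizer c.
```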
Motivation for gradient descent
- Why move in the direction of the (negative) gradient?
- Affine approximation of $f(w + \delta)$ around $w$:
  $$f(w + \delta) \approx f(w) + \nabla f(w)^\top \delta.$$
- Therefore, want $\delta$ such that $\nabla f(w)^\top \delta < 0$.
- Use $\delta := -\eta \nabla f(w)$ for some $\eta > 0$:
  $$\nabla f(w)^\top(-\eta \nabla f(w)) = -\eta \|\nabla f(w)\|_2^2 < 0 \quad \text{as long as } \nabla f(w) \neq 0.$$
- Need $\eta$ to be small enough so we still have improvement given the error of the affine approximation.

Figure 6: Trajectory of gradient descent (iterates $x_0, x_1, x_2, x_3, x_4$ approaching the minimizer, marked $*$).

Example: Gradient of logistic loss
- Negative gradient of the logistic loss on the $i$-th training example: using the chain rule,
  $$-\nabla\{\ell_{\text{logistic}}(y_i x_i^\top w)\} = -\ell'_{\text{logistic}}(y_i x_i^\top w)\, y_i x_i = \left(1 - \frac{1}{1 + \exp(-y_i x_i^\top w)}\right) y_i x_i = \bigl(1 - \sigma(y_i x_i^\top w)\bigr)\, y_i x_i,$$
  where $\sigma$ is the sigmoid function.
- Recall, $\Pr_w(Y = y \mid X = x) = \sigma(y x^\top w)$ for $(X, Y)$ following the logistic regression model.

Example: Gradient descent for logistic regression
- Objective function:
  $$f(w) := \frac{1}{n} \sum_{i=1}^n \ell_{\text{logistic}}(y_i x_i^\top w).$$
- Gradient descent: given initial iterate $w^{(0)} \in \mathbb{R}^d$ and step size $\eta > 0$,
  - For $t = 1, 2, \ldots$:
    $$w^{(t)} := w^{(t-1)} - \eta \nabla f(w^{(t-1)}) = w^{(t-1)} + \eta \cdot \frac{1}{n} \sum_{i=1}^n \bigl(1 - \sigma(y_i x_i^\top w^{(t-1)})\bigr)\, y_i x_i.$$
- Interpretation of the update: how much of $y_i x_i$ to add to $w^{(t-1)}$ is scaled by how far $\sigma(y_i x_i^\top w^{(t-1)})$ currently is from $1$. (A code sketch follows below.)
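A minimal NumPy sketch of this update (variable names, data layout, and hyperparameters are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, eta=0.1, num_iters=500):
    """Gradient descent on the average logistic loss.

    X: (n, d) feature matrix; y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        margins = y * (X @ w)                   # y_i x_i^T w
        coeffs = 1.0 - sigmoid(margins)         # how far sigma(...) is from 1
        w = w + eta * (X.T @ (coeffs * y)) / n  # add average of (1 - sigma) y_i x_i
    return w
```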
Convergence of gradient descent on smooth objectives
- Theorem: Assume $f$ is twice-differentiable and convex, and $\lambda_{\max}(\nabla^2 f(w)) \le \beta$ for all $w \in \mathbb{R}^d$ ("$f$ is $\beta$-smooth"). Then gradient descent with step size $\eta := 1/\beta$ satisfies
  $$f(w^{(t)}) \le f(w^\star) + \frac{\beta \|w^{(0)} - w^\star\|_2^2}{2t}.$$
- The same holds even if $f$ is only once-differentiable, as long as the gradient $\nabla f(w)$ does not change too fast with $w$:
  $$\|\nabla f(w) - \nabla f(w')\|_2 \le \beta \|w - w'\|_2.$$
- Note: it is possible to have convergence even with $\eta > 1/\beta$ in some cases; in practice, treat $\eta$ as a hyperparameter.
Example: smoothness of empirical risk with squared loss
- Empirical risk with squared loss:
  $$\nabla^2 \bigl\{ \|Aw - b\|_2^2 \bigr\} = 2A^\top A.$$
  So the objective function is $\beta$-smooth with $\beta = 2\lambda_{\max}(A^\top A)$.
Example: smoothness of empirical risk with logistic loss
- Empirical risk with logistic loss:
  $$\nabla^2 \Bigl\{ \frac{1}{n} \sum_{i=1}^n \ln(1 + \exp(-y_i x_i^\top w)) \Bigr\}$$
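The slide leaves the computation open; filling it in (using $\ell''(z) = \sigma(z)(1 - \sigma(z))$ and $y_i^2 = 1$):
$$\nabla^2 \Bigl\{ \frac{1}{n} \sum_{i=1}^n \ln(1 + \exp(-y_i x_i^\top w)) \Bigr\} = \frac{1}{n} \sum_{i=1}^n \sigma(y_i x_i^\top w)\bigl(1 - \sigma(y_i x_i^\top w)\bigr)\, x_i x_i^\top \preceq \frac{1}{4n} \sum_{i=1}^n x_i x_i^\top,$$
since $\sigma(z)(1 - \sigma(z)) \le 1/4$. So the objective is $\beta$-smooth with $\beta = \lambda_{\max}\bigl(\frac{1}{4n} \sum_{i=1}^n x_i x_i^\top\bigr)$.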
Figure 7: Gradient descent for logistic regression (contour plot of the objective over $[-10, 10]^2$ with the iterate trajectory).
Analysis of gradient descent for smooth objectives (1)
- By Taylor's theorem, we can upper-bound $f(w + \delta)$ by a quadratic:
  $$f(w + \delta) \le f(w) + \nabla f(w)^\top \delta + \frac{\beta}{2} \|\delta\|_2^2.$$
- Gradient descent is based on forming this local quadratic upper bound and minimizing it:
  $$\min_{\delta \in \mathbb{R}^d} \; f(w) + \nabla f(w)^\top \delta + \frac{\beta}{2} \|\delta\|_2^2,$$
  which is minimized by $\delta := -\frac{1}{\beta} \nabla f(w)$.
- Plugging this value of $\delta$ into the above inequality gives
  $$f\Bigl(w - \frac{1}{\beta} \nabla f(w)\Bigr) - f(w) \le -\frac{1}{2\beta} \|\nabla f(w)\|_2^2.$$
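Filling in the plug-in step (not spelled out on the slide): with $\delta = -\frac{1}{\beta}\nabla f(w)$,
$$f(w) + \nabla f(w)^\top \delta + \frac{\beta}{2}\|\delta\|_2^2 = f(w) - \frac{1}{\beta}\|\nabla f(w)\|_2^2 + \frac{1}{2\beta}\|\nabla f(w)\|_2^2 = f(w) - \frac{1}{2\beta}\|\nabla f(w)\|_2^2.$$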
Analysis of gradient descent for smooth objectives (2)
- If $f$ is convex (in addition to $\beta$-smooth), then repeatedly making such local changes is sufficient to approximately minimize $f$.
Figure 8: Linear and quadratic approximations to a convex function.
Example: Text classification (1)
- Data: articles posted to various internet message boards
- Label: $-1$ for articles from "religion", $+1$ for articles from "politics"
- Features:
  - Vocabulary of $d = 61188$ words
  - Each document is a binary vector $x \in \{0, 1\}^d$, where $x_i = \mathbf{1}\{\text{document contains the } i\text{-th vocabulary word}\}$
- Executed gradient descent with $\eta = 0.25$ for 500 iterations
Example: Text classification (2)
Figure 9: Objective value (empirical risk) as a function of the number of gradient descent iterations.
Example: Text classification (3)
Figure 10: Training and test error rates as a function of the number of gradient descent iterations.
Stochastic gradient method (1)
- Every iteration of gradient descent takes $\Theta(nd)$ time.
  - Pass through all training examples to make a single update.
  - If $n$ is enormous, it is too expensive to make many passes.
- Alternative: stochastic gradient descent (SGD)
  - Another example of the plug-in principle!
  - Use one or a few training examples to estimate the gradient.
- Gradient at $w^{(t)}$:
  $$\frac{1}{n} \sum_{j=1}^n \nabla \ell(y_j x_j^\top w^{(t)}).$$
  (A.k.a. the full batch gradient.)
- Pick a term $J$ uniformly at random and use
  $$\nabla \ell(y_J x_J^\top w^{(t)}).$$
- What is the expected value of this random vector? Since $J$ is uniform over $\{1, \ldots, n\}$, the expectation is exactly the full batch gradient above, so the stochastic gradient is an unbiased estimate.
Stochastic gradient method (2)
- Minibatch
  - To reduce the variance of the estimate, use several random examples $J_1, \ldots, J_B$ and average; this is called the minibatch gradient (a code sketch follows below):
    $$\frac{1}{B} \sum_{b=1}^B \nabla \ell(y_{J_b} x_{J_b}^\top w^{(t)}).$$
  - Rule of thumb: larger batch size $B$ → larger step size $\eta$.
- Alternative: instead of picking examples uniformly at random, shuffle the order of the training examples, and take the next example in this order.
  - Verify that the expected value is the same!
  - This seems to reduce variance as well, but the effect is not fully understood.
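A small NumPy sketch of the minibatch gradient estimate for the logistic loss (the sampling scheme with replacement and all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gradient(X, y, w, B):
    """Unbiased estimate of the full batch gradient of the logistic risk."""
    n = X.shape[0]
    J = rng.integers(0, n, size=B)            # indices J_1, ..., J_B, uniform
    XB, yB = X[J], y[J]
    coeffs = 1.0 - sigmoid(yB * (XB @ w))     # 1 - sigma(y_j x_j^T w)
    return -(XB.T @ (coeffs * yB)) / B        # average of grad l(y_j x_j^T w)
```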
Example: SGD for logistic regression
- Logistic regression MLE for data $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, +1\}$.
- Start with $w^{(0)} \in \mathbb{R}^d$, $\eta > 0$, $t = 1$.
- For epoch $p = 1, 2, \ldots$:
  - For each training example $(x, y)$ in a random order:
    $$w^{(t)} := w^{(t-1)} + \eta \bigl(1 - \sigma(y x^\top w^{(t-1)})\bigr)\, y x, \qquad t := t + 1.$$
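A minimal NumPy sketch of this epoch-based procedure (data, step size, and the number of epochs are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, eta=0.1, num_epochs=10):
    """SGD for logistic regression: one update per training example."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        for i in rng.permutation(n):          # random order each epoch
            xi, yi = X[i], y[i]
            w = w + eta * (1.0 - sigmoid(yi * (xi @ w))) * yi * xi
    return w
```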
Optimization for linear regression
- Back to considering ordinary least squares.
- Gaussian elimination to solve the normal equations can be slow when $d$ is large (time is $O(nd^2)$).
- Alternative: find an approximate solution using gradient descent.
  - Algorithm: start with some $w^{(0)} \in \mathbb{R}^d$ and $\eta > 0$.
  - For $t = 1, 2, \ldots$:
    $$w^{(t)} := w^{(t-1)} - 2\eta A^\top (A w^{(t-1)} - b).$$
  - Time to multiply a matrix by a vector is linear in the matrix size, so each iteration takes time $O(nd)$. (A code sketch follows below.)
- Can describe the behavior of gradient descent for the least squares (empirical risk) objective very precisely.
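A minimal NumPy sketch of this iteration (names illustrative); each step costs two matrix-vector products, i.e., $O(nd)$ time:

```python
import numpy as np

def least_squares_gd(A, b, eta, num_iters):
    """Gradient descent on f(w) = ||Aw - b||_2^2."""
    w = np.zeros(A.shape[1])
    for _ in range(num_iters):
        w = w - 2 * eta * (A.T @ (A @ w - b))   # grad f(w) = 2 A^T (Aw - b)
    return w
```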
Behavior of gradient descent for linear regression
- Theorem: Let $\hat{w}$ be the minimum Euclidean norm solution to the normal equations. Assume $w^{(0)} = 0$. Write the eigendecomposition $A^\top A = \sum_{i=1}^r \lambda_i v_i v_i^\top$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$. Then $w^{(t)} \in \operatorname{range}(A^\top)$ and
  $$v_i^\top w^{(t)} = \Bigl( 2\eta\lambda_i \sum_{k=0}^{t-1} (1 - 2\eta\lambda_i)^k \Bigr) v_i^\top \hat{w}, \quad i = 1, \ldots, r.$$
- Implications:
  - If we choose $\eta$ such that $2\eta\lambda_i < 1$, then
    $$2\eta\lambda_i \sum_{k=0}^{t-1} (1 - 2\eta\lambda_i)^k = 1 - (1 - 2\eta\lambda_i)^t,$$
    which converges to $1$ as $t \to \infty$.
  - So, when $2\eta\lambda_1 < 1$, we have $w^{(t)} \to \hat{w}$ as $t \to \infty$ (see the numeric check below).
  - The rate of convergence is geometric, i.e., "exponentially fast convergence".
  - Algorithmic inductive bias!
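A small numeric sanity check of the theorem (illustrative only): starting from $w^{(0)} = 0$ with $2\eta\lambda_1 \le 1$, the iterates should approach the minimum-norm solution, computed here via the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)

w_hat = np.linalg.pinv(A) @ b          # minimum Euclidean norm solution

eta = 1.0 / (2 * np.linalg.eigvalsh(A.T @ A).max())   # so 2*eta*lambda_1 <= 1
w = np.zeros(10)
for _ in range(5000):
    w = w - 2 * eta * (A.T @ (A @ w - b))

print(np.linalg.norm(w - w_hat))       # should be close to 0
```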
Postscript
- There are many optimization algorithms for convex optimization:
  - Gradient descent, Newton's method, BFGS, coordinate descent, mirror descent, etc.
  - Stochastic variants thereof
- Many are also usable even when the objective function is non-convex.
  - Typically they just converge to a local minimizer or stationary point.
- Can also handle constraints on the optimization variable.
  - E.g., want coordinates of $w$ to lie in a specific range.
- The algorithmic inductive bias is not always well-understood, but it is there!