Machine learning lecture slides
COMS 4771 Fall 2020
Optimization I: Convex optimization
Outline
- Convex sets and convex functions
- Local minimizers and global minimizers
- Gradient descent
- Analysis for smooth objective functions
- Stochastic gradient method
- Gradient descent for least squares linear regression
Convex sets
- Convex set: a set that contains every line segment between pairs of points in the set.
- Examples (see the short argument below for half-spaces):
  - All of $\mathbb{R}^d$
  - The empty set
  - Half-spaces
  - Intersections of convex sets
  - Convex hulls
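As a quick illustration of one of these examples (not spelled out on the slide), here is why a half-space $H = \{w \in \mathbb{R}^d : a^\top w \le c\}$ is convex: for any $w, w' \in H$ and $\alpha \in [0, 1]$,
$$a^\top\bigl((1-\alpha)w + \alpha w'\bigr) = (1-\alpha)\, a^\top w + \alpha\, a^\top w' \le (1-\alpha)c + \alpha c = c,$$
so every point on the segment between $w$ and $w'$ is again in $H$.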
Figure 1: Which of these sets are convex? (From left to right: convex, not convex, convex, convex.)
Convex functions (1)
- Convex function: a function satisfying the two-point version of Jensen's inequality:
  $$f((1-\alpha)w + \alpha w') \le (1-\alpha)f(w) + \alpha f(w'), \quad w, w' \in \mathbb{R}^d,\ \alpha \in [0, 1].$$
Figure 2: Which of these functions are convex? (Left panel: $f$ is not convex; right panel: $f$ is convex.)
Convex functions (2)
- Examples:
  - $f(w) = c$ for $c \in \mathbb{R}$
  - $f(w) = \exp(w)$ (on $\mathbb{R}$)
  - $f(w) = |w|^c$ for $c \ge 1$ (on $\mathbb{R}$)
  - $f(w) = b^\top w$ for $b \in \mathbb{R}^d$
  - $f(w) = \|w\|$ for any norm $\|\cdot\|$
  - $f(w) = w^\top A w$ for any symmetric positive semidefinite matrix $A$
  - $w \mapsto a f(w) + g(w)$ for convex functions $f, g$ and $a \ge 0$
  - $w \mapsto \max\{f(w), g(w)\}$ for convex functions $f, g$
  - $f(w) = \operatorname{logsumexp}(w) = \ln\bigl(\sum_{i=1}^d \exp(w_i)\bigr)$
  - $w \mapsto f(g(w))$ for convex function $f$ and affine function $g$
Verifying convexity of Euclidean norm
- Verify that $f(w) = \|w\|$ is convex (derivation filled in below).
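One way to fill in this exercise, combining the triangle inequality with the absolute homogeneity of norms: for any $w, w' \in \mathbb{R}^d$ and $\alpha \in [0, 1]$,
$$\|(1-\alpha)w + \alpha w'\| \le \|(1-\alpha)w\| + \|\alpha w'\| = (1-\alpha)\|w\| + \alpha\|w'\|,$$
which is exactly the two-point Jensen inequality for $f(w) = \|w\|$.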
Convexity of differentiable functions (1)
- Differentiable function $f$ is convex iff
  $$f(w) \ge f(w_0) + \nabla f(w_0)^\top (w - w_0) \quad \text{for all } w, w_0 \in \mathbb{R}^d.$$
Figure 3: Affine approximation $a(x)$ to $f(x)$ at $x_0$.
- Twice-differentiable function $f$ is convex iff $\nabla^2 f(w)$ is positive semidefinite for all $w \in \mathbb{R}^d$.
Convexity of differentiable functions (2)
- Example: Verify that $f(w) = w^4$ is convex.
  - Use the second-order condition.
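Filling in the computation:
$$f'(w) = 4w^3, \qquad f''(w) = 12w^2 \ge 0 \quad \text{for all } w \in \mathbb{R},$$
so the second derivative is nonnegative everywhere and the second-order condition gives convexity.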
Convexity of differentiable functions (3)
- Example: Verify that $f(w) = e^{b^\top w}$ for $b \in \mathbb{R}^d$ is convex.
  - Use the first-order condition.
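Filling in the argument: here $\nabla f(w_0) = e^{b^\top w_0}\, b$, and the first-order condition reduces to the scalar inequality $e^z \ge 1 + z$ applied with $z = b^\top(w - w_0)$:
$$f(w_0) + \nabla f(w_0)^\top (w - w_0) = e^{b^\top w_0}\bigl(1 + b^\top(w - w_0)\bigr) \le e^{b^\top w_0}\, e^{b^\top(w - w_0)} = e^{b^\top w} = f(w).$$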
Verifying convexity of least squares linear regression
- Verify that $f(w) = \|Aw - b\|_2^2$ is convex (derivation filled in below).
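Filling in the exercise via the second-order condition:
$$\nabla f(w) = 2A^\top(Aw - b), \qquad \nabla^2 f(w) = 2A^\top A, \qquad v^\top (2A^\top A) v = 2\|Av\|_2^2 \ge 0 \ \text{for all } v \in \mathbb{R}^d,$$
so the Hessian is positive semidefinite everywhere and $f$ is convex.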
Verifying convexity of logistic regression MLE problem
- Verify that $f(w) = \frac{1}{n} \sum_{i=1}^n \ln(1 + e^{-y_i x_i^\top w})$ is convex (derivation filled in below).
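Filling in the exercise using the composition rules from the earlier examples: the scalar logistic loss $\ell(z) = \ln(1 + e^{-z})$ has
$$\ell''(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) \ge 0,$$
so it is convex; each term $w \mapsto \ln(1 + e^{-y_i x_i^\top w})$ is a convex function composed with an affine map, hence convex; and a nonnegative average of convex functions is convex.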
Local minimizers
- Say $w^\star \in \mathbb{R}^d$ is a local minimizer of $f : \mathbb{R}^d \to \mathbb{R}$ if there is an "open ball" $U = \{w \in \mathbb{R}^d : \|w - w^\star\|_2 < r\}$ of positive radius $r > 0$ such that $f(w^\star) \le f(w)$ for all $w \in U$.
  - I.e., nothing looks better in the immediate vicinity of $w^\star$.
Figure 4: Local minimizer (marked "locally optimal").
Local minimizers of convex problems
- If $f$ is convex and $w^\star$ is a local minimizer, then it is also a global minimizer.
  - "Local to global" phenomenon
  - Local search is well-motivated for convex optimization problems
Figure 5: Local-to-global phenomenon.
Gradient descent
- Consider the (unconstrained) convex optimization problem
  $$\min_{w \in \mathbb{R}^d} f(w).$$
- Gradient descent: iterative algorithm for (approximately) minimizing $f$.
  - Given initial iterate $w^{(0)} \in \mathbb{R}^d$ and step size $\eta > 0$,
  - For $t = 1, 2, \ldots$:
    $$w^{(t)} := w^{(t-1)} - \eta \nabla f(w^{(t-1)}).$$
- (Lots of things unspecified here; a concrete sketch follows below.)
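A minimal NumPy sketch of the update rule (the function `grad_f`, the step size, and the iteration count here are illustrative choices, not part of the slides):

```python
import numpy as np

def gradient_descent(grad_f, w0, eta, num_iters):
    """Run gradient descent: w^(t) = w^(t-1) - eta * grad_f(w^(t-1))."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - eta * grad_f(w)
    return w

# Example: minimize f(w) = ||w - c||_2^2, whose gradient is 2(w - c).
c = np.array([1.0, -2.0])
w_hat = gradient_descent(lambda w: 2 * (w - c), w0=np.zeros(2),
                         eta=0.1, num_iters=200)
# w_hat is now close to the minimizer c.
```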
Motivation for gradient descent
- Why move in the direction of the (negative) gradient?
- Affine approximation of $f(w + \delta)$ around $w$:
  $$f(w + \delta) \approx f(w) + \nabla f(w)^\top \delta.$$
- Therefore, want $\delta$ such that $\nabla f(w)^\top \delta < 0$.
- Use $\delta := -\eta \nabla f(w)$ for some $\eta > 0$:
  $$\nabla f(w)^\top(-\eta \nabla f(w)) = -\eta \|\nabla f(w)\|_2^2 < 0 \quad \text{as long as } \nabla f(w) \neq 0.$$
- Need $\eta$ to be small enough so we still have improvement given the error of the affine approximation.

Figure 6: Trajectory of gradient descent (iterates $x_0, x_1, x_2, x_3, x_4$ approaching the minimizer, marked $*$).

Example: Gradient of logistic loss
- Negative gradient of the logistic loss on the $i$-th training example: using the chain rule,
  $$-\nabla\{\ell_{\text{logistic}}(y_i x_i^\top w)\} = -\ell'_{\text{logistic}}(y_i x_i^\top w)\, y_i x_i = \left(1 - \frac{1}{1 + \exp(-y_i x_i^\top w)}\right) y_i x_i = \bigl(1 - \sigma(y_i x_i^\top w)\bigr)\, y_i x_i,$$
  where $\sigma$ is the sigmoid function.
- Recall, $\Pr_w(Y = y \mid X = x) = \sigma(y x^\top w)$ for $(X, Y)$ following the logistic regression model.

Example: Gradient descent for logistic regression
- Objective function:
  $$f(w) := \frac{1}{n} \sum_{i=1}^n \ell_{\text{logistic}}(y_i x_i^\top w).$$
- Gradient descent: given initial iterate $w^{(0)} \in \mathbb{R}^d$ and step size $\eta > 0$,
  - For $t = 1, 2, \ldots$:
    $$w^{(t)} := w^{(t-1)} - \eta \nabla f(w^{(t-1)}) = w^{(t-1)} + \eta \cdot \frac{1}{n} \sum_{i=1}^n \bigl(1 - \sigma(y_i x_i^\top w^{(t-1)})\bigr)\, y_i x_i.$$
- Interpretation of the update: how much of $y_i x_i$ to add to $w^{(t-1)}$ is scaled by how far $\sigma(y_i x_i^\top w^{(t-1)})$ currently is from $1$. (A code sketch follows below.)
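A minimal NumPy sketch of this update (variable names, data layout, and hyperparameters are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, eta=0.1, num_iters=500):
    """Gradient descent on the average logistic loss.

    X: (n, d) feature matrix; y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        margins = y * (X @ w)                   # y_i x_i^T w
        coeffs = 1.0 - sigmoid(margins)         # how far sigma(...) is from 1
        w = w + eta * (X.T @ (coeffs * y)) / n  # add average of (1 - sigma) y_i x_i
    return w
```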
Convergence of gradient descent on smooth objectives
- Theorem: Assume $f$ is twice-differentiable and convex, and $\lambda_{\max}(\nabla^2 f(w)) \le \beta$ for all $w \in \mathbb{R}^d$ ("$f$ is $\beta$-smooth"). Then gradient descent with step size $\eta := 1/\beta$ satisfies
  $$f(w^{(t)}) \le f(w^\star) + \frac{\beta \|w^{(0)} - w^\star\|_2^2}{2t}.$$
- The same holds even if $f$ is only once-differentiable, as long as the gradient $\nabla f(w)$ does not change too fast with $w$:
  $$\|\nabla f(w) - \nabla f(w')\|_2 \le \beta \|w - w'\|_2.$$
- Note: it is possible to have convergence even with $\eta > 1/\beta$ in some cases; in practice, treat $\eta$ as a hyperparameter.
Example: smoothness of empirical risk with squared loss
- Empirical risk with squared loss:
  $$\nabla^2 \bigl\{ \|Aw - b\|_2^2 \bigr\} = 2A^\top A.$$
  So the objective function is $\beta$-smooth with $\beta = 2\lambda_{\max}(A^\top A)$.
Example: smoothness of empirical risk with logistic loss
- Empirical risk with logistic loss:
  $$\nabla^2 \Bigl\{ \frac{1}{n} \sum_{i=1}^n \ln(1 + \exp(-y_i x_i^\top w)) \Bigr\}$$
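The slide leaves the computation open; filling it in (using $\ell''(z) = \sigma(z)(1 - \sigma(z))$ and $y_i^2 = 1$):
$$\nabla^2 \Bigl\{ \frac{1}{n} \sum_{i=1}^n \ln(1 + \exp(-y_i x_i^\top w)) \Bigr\} = \frac{1}{n} \sum_{i=1}^n \sigma(y_i x_i^\top w)\bigl(1 - \sigma(y_i x_i^\top w)\bigr)\, x_i x_i^\top \preceq \frac{1}{4n} \sum_{i=1}^n x_i x_i^\top,$$
since $\sigma(z)(1 - \sigma(z)) \le 1/4$. So the objective is $\beta$-smooth with $\beta = \lambda_{\max}\bigl(\frac{1}{4n} \sum_{i=1}^n x_i x_i^\top\bigr)$.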
Figure 7: Gradient descent for logistic regression (contour plot of the objective over $[-10, 10]^2$ with the iterate trajectory).
Analysis of gradient descent for smooth objectives (1)
- By Taylor's theorem, we can upper-bound $f(w + \delta)$ by a quadratic:
  $$f(w + \delta) \le f(w) + \nabla f(w)^\top \delta + \frac{\beta}{2} \|\delta\|_2^2.$$
- Gradient descent is based on forming this local quadratic upper bound and minimizing it:
  $$\min_{\delta \in \mathbb{R}^d} \; f(w) + \nabla f(w)^\top \delta + \frac{\beta}{2} \|\delta\|_2^2,$$
  which is minimized by $\delta := -\frac{1}{\beta} \nabla f(w)$.
- Plugging this value of $\delta$ into the above inequality gives
  $$f\Bigl(w - \frac{1}{\beta} \nabla f(w)\Bigr) - f(w) \le -\frac{1}{2\beta} \|\nabla f(w)\|_2^2.$$
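Filling in the plug-in step (not spelled out on the slide): with $\delta = -\frac{1}{\beta}\nabla f(w)$,
$$f(w) + \nabla f(w)^\top \delta + \frac{\beta}{2}\|\delta\|_2^2 = f(w) - \frac{1}{\beta}\|\nabla f(w)\|_2^2 + \frac{1}{2\beta}\|\nabla f(w)\|_2^2 = f(w) - \frac{1}{2\beta}\|\nabla f(w)\|_2^2.$$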
Analysis of gradient descent for smooth objectives (2)
- If $f$ is convex (in addition to $\beta$-smooth), then repeatedly making such local changes is sufficient to approximately minimize $f$.
Figure 8: Linear and quadratic approximations to a convex function.
Example: Text classification (1)
- Data: articles posted to various internet message boards
- Label: $-1$ for articles from "religion", $+1$ for articles from "politics"
- Features:
  - Vocabulary of $d = 61188$ words
  - Each document is a binary vector $x \in \{0, 1\}^d$, where $x_i = \mathbf{1}\{\text{document contains the } i\text{-th vocabulary word}\}$
- Executed gradient descent with $\eta = 0.25$ for 500 iterations
Example: Text classification (2)
Figure 9: Objective value (empirical risk) as a function of the number of gradient descent iterations.
Example: Text classification (3)
Figure 10: Training and test error rates as a function of the number of gradient descent iterations.
Stochastic gradient method (1)
- Every iteration of gradient descent takes $\Theta(nd)$ time.
  - Pass through all training examples to make a single update.
  - If $n$ is enormous, it is too expensive to make many passes.
- Alternative: stochastic gradient descent (SGD)
  - Another example of the plug-in principle!
  - Use one or a few training examples to estimate the gradient.
- Gradient at $w^{(t)}$:
  $$\frac{1}{n} \sum_{j=1}^n \nabla \ell(y_j x_j^\top w^{(t)}).$$
  (A.k.a. the full batch gradient.)
- Pick a term $J$ uniformly at random and use
  $$\nabla \ell(y_J x_J^\top w^{(t)}).$$
- What is the expected value of this random vector? Since $J$ is uniform over $\{1, \ldots, n\}$, the expectation is exactly the full batch gradient above, so the stochastic gradient is an unbiased estimate.
Stochastic gradient method (2)
- Minibatch
  - To reduce the variance of the estimate, use several random examples $J_1, \ldots, J_B$ and average; this is called the minibatch gradient (a code sketch follows below):
    $$\frac{1}{B} \sum_{b=1}^B \nabla \ell(y_{J_b} x_{J_b}^\top w^{(t)}).$$
  - Rule of thumb: larger batch size $B$ → larger step size $\eta$.
- Alternative: instead of picking examples uniformly at random, shuffle the order of the training examples, and take the next example in this order.
  - Verify that the expected value is the same!
  - This seems to reduce variance as well, but the effect is not fully understood.
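A small NumPy sketch of the minibatch gradient estimate for the logistic loss (the sampling scheme with replacement and all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gradient(X, y, w, B):
    """Unbiased estimate of the full batch gradient of the logistic risk."""
    n = X.shape[0]
    J = rng.integers(0, n, size=B)            # indices J_1, ..., J_B, uniform
    XB, yB = X[J], y[J]
    coeffs = 1.0 - sigmoid(yB * (XB @ w))     # 1 - sigma(y_j x_j^T w)
    return -(XB.T @ (coeffs * yB)) / B        # average of grad l(y_j x_j^T w)
```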
Example: SGD for logistic regression
- Logistic regression MLE for data $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, +1\}$.
- Start with $w^{(0)} \in \mathbb{R}^d$, $\eta > 0$, $t = 1$.
- For epoch $p = 1, 2, \ldots$:
  - For each training example $(x, y)$ in a random order:
    $$w^{(t)} := w^{(t-1)} + \eta \bigl(1 - \sigma(y x^\top w^{(t-1)})\bigr)\, y x, \qquad t := t + 1.$$
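A minimal NumPy sketch of this epoch-based procedure (data, step size, and the number of epochs are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, eta=0.1, num_epochs=10):
    """SGD for logistic regression: one update per training example."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        for i in rng.permutation(n):          # random order each epoch
            xi, yi = X[i], y[i]
            w = w + eta * (1.0 - sigmoid(yi * (xi @ w))) * yi * xi
    return w
```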
Optimization for linear regression
- Back to considering ordinary least squares.
- Gaussian elimination to solve the normal equations can be slow when $d$ is large (time is $O(nd^2)$).
- Alternative: find an approximate solution using gradient descent.
  - Algorithm: start with some $w^{(0)} \in \mathbb{R}^d$ and $\eta > 0$.
  - For $t = 1, 2, \ldots$:
    $$w^{(t)} := w^{(t-1)} - 2\eta A^\top (A w^{(t-1)} - b).$$
  - Time to multiply a matrix by a vector is linear in the matrix size, so each iteration takes time $O(nd)$. (A code sketch follows below.)
- Can describe the behavior of gradient descent for the least squares (empirical risk) objective very precisely.
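A minimal NumPy sketch of this iteration (names illustrative); each step costs two matrix-vector products, i.e., $O(nd)$ time:

```python
import numpy as np

def least_squares_gd(A, b, eta, num_iters):
    """Gradient descent on f(w) = ||Aw - b||_2^2."""
    w = np.zeros(A.shape[1])
    for _ in range(num_iters):
        w = w - 2 * eta * (A.T @ (A @ w - b))   # grad f(w) = 2 A^T (Aw - b)
    return w
```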
Behavior of gradient descent for linear regression
- Theorem: Let $\hat{w}$ be the minimum Euclidean norm solution to the normal equations. Assume $w^{(0)} = 0$. Write the eigendecomposition $A^\top A = \sum_{i=1}^r \lambda_i v_i v_i^\top$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$. Then $w^{(t)} \in \operatorname{range}(A^\top)$ and
  $$v_i^\top w^{(t)} = \Bigl( 2\eta\lambda_i \sum_{k=0}^{t-1} (1 - 2\eta\lambda_i)^k \Bigr) v_i^\top \hat{w}, \quad i = 1, \ldots, r.$$
- Implications:
  - If we choose $\eta$ such that $2\eta\lambda_i < 1$, then
    $$2\eta\lambda_i \sum_{k=0}^{t-1} (1 - 2\eta\lambda_i)^k = 1 - (1 - 2\eta\lambda_i)^t,$$
    which converges to $1$ as $t \to \infty$.
  - So, when $2\eta\lambda_1 < 1$, we have $w^{(t)} \to \hat{w}$ as $t \to \infty$ (see the numeric check below).
  - The rate of convergence is geometric, i.e., "exponentially fast convergence".
  - Algorithmic inductive bias!
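A small numeric sanity check of the theorem (illustrative only): starting from $w^{(0)} = 0$ with $2\eta\lambda_1 \le 1$, the iterates should approach the minimum-norm solution, computed here via the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)

w_hat = np.linalg.pinv(A) @ b          # minimum Euclidean norm solution

eta = 1.0 / (2 * np.linalg.eigvalsh(A.T @ A).max())   # so 2*eta*lambda_1 <= 1
w = np.zeros(10)
for _ in range(5000):
    w = w - 2 * eta * (A.T @ (A @ w - b))

print(np.linalg.norm(w - w_hat))       # should be close to 0
```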
Postscript
- There are many optimization algorithms for convex optimization:
  - Gradient descent, Newton's method, BFGS, coordinate descent, mirror descent, etc.
  - Stochastic variants thereof
- Many are also usable even when the objective function is non-convex.
  - Typically they just converge to a local minimizer or stationary point.
- Can also handle constraints on the optimization variable.
  - E.g., want coordinates of $w$ to lie in a specific range.
- The algorithmic inductive bias is not always well-understood, but it is there!