Gradient descent

Daniel Hsu (COMS 4771)

Smooth functions
Smooth functions are functions whose derivatives (gradients) do not change too quickly. The change in the derivative is measured by the second derivative, so smoothness is a constraint on the second derivatives of a function.
For any β > 0, we say a twice-differentiable function f : Rd → R is β-smooth if the eigenvalues of its Hessian matrix at any point in Rd are at most β.
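For instance, a quadratic function has a constant Hessian, so its smoothness constant is just the largest eigenvalue of that Hessian. The following minimal numpy sketch (the matrix A, the random seed, and the use of numpy are illustrative assumptions, not part of the notes) computes β for f(w) = (1/2) wᵀAw:

    import numpy as np

    # Sketch (not from the notes): f(w) = 0.5 * w' A w has constant Hessian A,
    # so f is beta-smooth with beta = largest eigenvalue of A.
    rng = np.random.default_rng(0)
    M = rng.standard_normal((5, 5))
    A = M @ M.T                        # a symmetric (PSD) Hessian
    beta = np.linalg.eigvalsh(A)[-1]   # eigvalsh returns eigenvalues in ascending order
    print(f"f is beta-smooth with beta = {beta:.3f}")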
Example: logistic regression
Consider the empirical risk under the logistic loss on a training data set (x₁, y₁), …, (xₙ, yₙ) ∈ Rd × {−1, +1}:
R(w) = (1/n) Σᵢ₌₁ⁿ ln(1 + exp(−yᵢxᵢᵀw)).

The Hessian of R at w is

∇²R(w) = (1/n) Σᵢ₌₁ⁿ σ(yᵢxᵢᵀw) σ(−yᵢxᵢᵀw) xᵢxᵢᵀ,

where σ(t) = 1/(1 + exp(−t)) is the sigmoid function. For any unit vector u ∈ Rd, we have

uᵀ∇²R(w)u = (1/n) Σᵢ₌₁ⁿ σ(yᵢxᵢᵀw) σ(−yᵢxᵢᵀw) (xᵢᵀu)²
          ≤ (1/4) · (1/n) Σᵢ₌₁ⁿ (xᵢᵀu)²    (using σ(t)σ(−t) ≤ 1/4 for all t)
          = (1/4) uᵀ ((1/n) Σᵢ₌₁ⁿ xᵢxᵢᵀ) u
          ≤ (1/4) λmax((1/n) Σᵢ₌₁ⁿ xᵢxᵢᵀ),

where λmax(M) is used to denote the largest eigenvalue of a symmetric matrix M. So if λ₁ is the largest eigenvalue of the empirical second moment matrix (1/n) Σᵢ₌₁ⁿ xᵢxᵢᵀ, then R is β-smooth for β = λ₁/4.
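As a quick numerical companion to this example, here is a hedged numpy sketch (the data X, y, the helper names sigmoid and hessian, and the random seed are all made up for illustration) that compares the largest eigenvalue of the Hessian at a random point against the claimed smoothness constant β = λ₁/4:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 5
    X = rng.standard_normal((n, d))            # rows are x_i
    y = rng.choice([-1.0, 1.0], size=n)        # labels in {-1, +1}

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def hessian(w):
        # Hessian of R(w) = (1/n) sum_i ln(1 + exp(-y_i x_i' w))
        s = sigmoid(y * (X @ w)) * sigmoid(-y * (X @ w))
        return (X.T * s) @ X / n

    beta = np.linalg.eigvalsh(X.T @ X / n)[-1] / 4   # lambda_1 / 4
    w = rng.standard_normal(d)
    top = np.linalg.eigvalsh(hessian(w))[-1]
    print(f"largest Hessian eigenvalue at w: {top:.4f} <= beta: {beta:.4f}")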
Quadratic upper bound for smooth functions
A consequence of β-smoothness is the following. Recall that by Taylor’s theorem, for any w, δ ∈ Rd, there exists w̃ ∈ Rd on the line segment between w and w + δ such that

f(w + δ) = f(w) + ∇f(w)ᵀδ + (1/2) δᵀ∇²f(w̃)δ.

If f is β-smooth, then we can bound the third term from above as

(1/2) δᵀ∇²f(w̃)δ ≤ (1/2) ∥δ∥² · max{uᵀ∇²f(w̃)u : u ∈ Rd, ∥u∥₂ = 1}
                ≤ (1/2) ∥δ∥² λmax(∇²f(w̃))
                ≤ (β/2) ∥δ∥².

Therefore, if f is β-smooth, then for any w, δ ∈ Rd,

f(w + δ) ≤ f(w) + ∇f(w)ᵀδ + (β/2) ∥δ∥².
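The inequality can also be checked numerically. The sketch below (again with made-up data and helper names risk and grad; not from the notes) samples a few perturbations δ and verifies the quadratic upper bound for the logistic loss risk, using β = λ₁/4 from the earlier example:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 200, 5
    X = rng.standard_normal((n, d))
    y = rng.choice([-1.0, 1.0], size=n)

    def risk(w):
        return np.mean(np.log1p(np.exp(-y * (X @ w))))

    def grad(w):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # sigma(-y_i x_i' w)
        return -(X.T @ (s * y)) / n

    beta = np.linalg.eigvalsh(X.T @ X / n)[-1] / 4

    w = rng.standard_normal(d)
    for _ in range(5):
        delta = rng.standard_normal(d)
        lhs = risk(w + delta)
        rhs = risk(w) + grad(w) @ delta + 0.5 * beta * (delta @ delta)
        assert lhs <= rhs + 1e-12               # quadratic upper bound holds
    print("quadratic upper bound held for all sampled perturbations")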
Gradient descent on smooth functions

Gradient descent starts with an initial point w(0) ∈ Rd, and for a given step size η, iteratively computes a sequence of points w(1), w(2), … as follows. For t = 1, 2, …:

w(t) := w(t−1) − η∇f(w(t−1)).

Motivation for gradient descent on smooth functions
The motivation for the gradient descent update is the following. Suppose we have a current point w ∈ Rd, and we would like to locally change it from w to w + δ so as to decrease the function value. How should we choose δ?
In gradient descent, we consider the quadratic upper-bound that smoothness grants, i.e.,
f(w + δ) ≤ f(w) + ∇f(w)ᵀδ + (β/2) ∥δ∥²,
and then choose δ to minimize this upper-bound. The upper-bound is a convex quadratic function of δ, so its minimizer can be written in closed form. The minimizer is the value of δ such that
∇f(w) + βδ = 0.
In other words, it is δ⋆(w), defined by

δ⋆(w) := −(1/β) ∇f(w).

Plugging in δ⋆(w) for δ in the quadratic upper-bound gives

f(w + δ⋆(w)) ≤ f(w) + ∇f(w)ᵀδ⋆(w) + (β/2) ∥δ⋆(w)∥²
             = f(w) − (1/β) ∇f(w)ᵀ∇f(w) + (1/(2β)) ∥∇f(w)∥²
             = f(w) − (1/(2β)) ∥∇f(w)∥².
This inequality tells us that this local change to w will decrease the function value as long as the gradient at w is non-zero. It turns out that if the function f is convex (in addition to β-smooth), then repeatedly making such local changes is sufficient to approximately minimize the function.
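Here is a sketch of gradient descent with step size η = 1/β on the logistic loss risk, checking the per-step decrease derived above (the data, seed, and helper names are made up for illustration; this is not the notes' code):

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 200, 5
    X = rng.standard_normal((n, d))
    y = rng.choice([-1.0, 1.0], size=n)

    def risk(w):
        return np.mean(np.log1p(np.exp(-y * (X @ w))))

    def grad(w):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))
        return -(X.T @ (s * y)) / n

    beta = np.linalg.eigvalsh(X.T @ X / n)[-1] / 4
    eta = 1.0 / beta

    w = np.zeros(d)
    for t in range(100):
        g = grad(w)
        w_next = w - eta * g                    # w(t) := w(t-1) - eta * grad f(w(t-1))
        # per-step decrease: f(w + delta_star(w)) <= f(w) - (1/(2*beta)) ||grad f(w)||^2
        assert risk(w_next) <= risk(w) - (g @ g) / (2 * beta) + 1e-12
        w = w_next
    print(f"risk after 100 steps: {risk(w):.4f}")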
Analysis of gradient descent on smooth convex functions
One of the simplest ways to mathematically analyze the behavior of gradient descent on smooth functions (with step size η = 1/β) is to monitor the change in a potential function during the execution of gradient descent. The potential function we will use is the squared Euclidean distance to a fixed vector w⋆ ∈ Rd, which could be a minimizer of f (but need not be):

Φ(w) := (1/(2η)) ∥w − w⋆∥².

The scaling by 1/(2η) is used just for notational convenience.
Let us examine the “drop” in the potential when we change a point w to w + δ⋆(w) (as in gradient descent):

Φ(w) − Φ(w + δ⋆(w)) = (1/(2η)) ∥w − w⋆∥² − (1/(2η)) ∥w + δ⋆(w) − w⋆∥²
                    = (β/2) ∥w − w⋆∥² − (β/2) (∥w − w⋆∥² + 2δ⋆(w)ᵀ(w − w⋆) + ∥δ⋆(w)∥²)
                    = −β δ⋆(w)ᵀ(w − w⋆) − (β/2) ∥δ⋆(w)∥²
                    = ∇f(w)ᵀ(w − w⋆) − (1/(2β)) ∥∇f(w)∥².
In the last step, we have plugged in δ⋆(w) = −(1/β) ∇f(w). Now we use two key facts. The first is the inequality we derived above based on the smoothness of f:

f(w + δ⋆(w)) ≤ f(w) − (1/(2β)) ∥∇f(w)∥²,

which rearranges to

−(1/(2β)) ∥∇f(w)∥² ≥ f(w + δ⋆(w)) − f(w).

The second comes from the first-order definition of convexity:

f(w⋆) ≥ f(w) + ∇f(w)ᵀ(w⋆ − w),

which rearranges to

∇f(w)ᵀ(w − w⋆) ≥ f(w) − f(w⋆).

(We’ll discuss this inequality more later.) So, we can bound the drop in potential as follows:

Φ(w) − Φ(w + δ⋆(w)) = ∇f(w)ᵀ(w − w⋆) − (1/(2β)) ∥∇f(w)∥²
                    ≥ f(w) − f(w⋆) + f(w + δ⋆(w)) − f(w)
                    = f(w + δ⋆(w)) − f(w⋆).
Let us write this inequality in terms of the iterates of gradient descent with η = 1/β:

Φ(w(t−1)) − Φ(w(t)) ≥ f(w(t)) − f(w⋆).
Summing this inequality from t = 1, 2, …, T:

Σₜ₌₁ᵀ (Φ(w(t−1)) − Φ(w(t))) ≥ Σₜ₌₁ᵀ (f(w(t)) − f(w⋆)).

The left-hand side telescopes to Φ(w(0)) − Φ(w(T)). Furthermore, since f(w(t)) ≥ f(w(T)) for all t = 1, …, T (each step of gradient descent with η = 1/β does not increase the function value, by the inequality above), the right-hand side can be bounded from below by

T (f(w(T)) − f(w⋆)).

So we are left with the inequality

f(w(T)) − f(w⋆) ≤ (1/T) (Φ(w(0)) − Φ(w(T))) = (β/(2T)) (∥w(0) − w⋆∥² − ∥w(T) − w⋆∥²).
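Dropping the nonnegative term ∥w(T) − w⋆∥² gives f(w(T)) − f(w⋆) ≤ β ∥w(0) − w⋆∥²/(2T), an O(1/T) rate. A sketch that compares this bound against the actual gap on the logistic loss risk (made-up data; w⋆ approximated by running many extra iterations; not from the notes):

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 200, 5
    X = rng.standard_normal((n, d))
    y = rng.choice([-1.0, 1.0], size=n)

    def risk(w):
        return np.mean(np.log1p(np.exp(-y * (X @ w))))

    def grad(w):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))
        return -(X.T @ (s * y)) / n

    beta = np.linalg.eigvalsh(X.T @ X / n)[-1] / 4
    eta = 1.0 / beta

    def gd(T, w0):
        w = w0.copy()
        for _ in range(T):
            w = w - eta * grad(w)
        return w

    w0 = np.zeros(d)
    w_star = gd(100_000, w0)                    # stand-in for a minimizer of the risk
    for T in (10, 100, 1000):
        gap = risk(gd(T, w0)) - risk(w_star)
        bound = beta * np.dot(w0 - w_star, w0 - w_star) / (2 * T)
        print(f"T={T:5d}  suboptimality={gap:.6f}  bound={bound:.6f}")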
Gradient descent on Lipschitz convex functions
Gradient descent can also be used for non-smooth convex functions as long as the function itself does not change too quickly.
For any L > 0, we say that a differentiable function f : Rd → R is L-Lipschitz if its gradient at any point in Rd is bounded in Euclidean norm by L.
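As a concrete instance (an observation added here, not a claim from the notes): the logistic loss risk from the earlier example satisfies ∥∇R(w)∥ ≤ (1/n) Σᵢ ∥xᵢ∥ for every w, since each per-example gradient is −σ(−yᵢxᵢᵀw) yᵢxᵢ and σ takes values in (0, 1); so it is L-Lipschitz with L = (1/n) Σᵢ ∥xᵢ∥. A small numpy sketch with made-up data:

    import numpy as np

    rng = np.random.default_rng(4)
    n, d = 200, 5
    X = rng.standard_normal((n, d))
    y = rng.choice([-1.0, 1.0], size=n)

    def grad(w):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # sigma(-y_i x_i' w) in (0, 1)
        return -(X.T @ (s * y)) / n

    L = np.mean(np.linalg.norm(X, axis=1))      # (1/n) sum_i ||x_i||
    norms = [np.linalg.norm(grad(10 * rng.standard_normal(d))) for _ in range(5)]
    print(f"max sampled gradient norm {max(norms):.3f} <= L = {L:.3f}")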
The motivation for gradient descent based on minimizing quadratic upper-bounds no longer applies. Indeed, the gradient at w could be very different from the gradient at a nearby w′, so the function value at w−η∇f(w) could be worse than the function value at w. Therefore, we cannot expect to have the same convergence guarantee for non-smooth functions that we had for smooth functions.
Gradient descent, nevertheless, will produce a sequence w(1), w(2), . . . such that the function value at these points is approximately minimal on average.
Motivation for gradient descent on Lipschitz convex functions
A basic motivation for gradient descent for convex functions, one that does not assume smoothness, comes from the first-order condition for convexity:

f(w⋆) ≥ f(w) + ∇f(w)ᵀ(w⋆ − w),

which rearranges to

(−∇f(w))ᵀ(w⋆ − w) ≥ f(w) − f(w⋆).
Suppose f(w) > f(w⋆), so that moving from w to w⋆ would improve the function value. Then, the inequality implies that the negative gradient −∇f(w) at w makes a positive inner product with the direction from w to w⋆. This is the crucial property that makes gradient descent work.
Analysis of gradient descent on Lipschitz convex functions
We again monitor the change in the potential function
Φ(w) := (1/(2η)) ∥w − w⋆∥²,
for a fixed vector w⋆ ∈ Rd.
Again, let us examine the “drop” in the potential when we change a point w to w − η∇f(w) (as in gradient descent):

Φ(w) − Φ(w − η∇f(w)) = (1/(2η)) ∥w − w⋆∥² − (1/(2η)) ∥w − η∇f(w) − w⋆∥²
                     = ∇f(w)ᵀ(w − w⋆) − (η/2) ∥∇f(w)∥²
                     ≥ f(w) − f(w⋆) − L²η/2,

where the inequality uses the convexity and Lipschitzness of f. In terms of the iterates of gradient descent, this reads

Φ(w(t−1)) − Φ(w(t)) ≥ f(w(t−1)) − f(w⋆) − L²η/2.
Summing this inequality from t = 1, 2, …, T:

Φ(w(0)) − Φ(w(T)) ≥ Σₜ₌₁ᵀ (f(w(t−1)) − f(w⋆)) − L²ηT/2.

Rearranging and dividing through by T (and dropping a term):

(1/T) Σₜ₌₁ᵀ (f(w(t−1)) − f(w⋆)) ≤ ∥w(0) − w⋆∥²/(2ηT) + L²η/2.
The left-hand side is the average sub-optimality relative to f(w⋆). Therefore, there exists some t∗ ∈ {0, 1, …, T − 1} such that

f(w(t∗)) − f(w⋆) ≤ (1/T) Σₜ₌₁ᵀ (f(w(t−1)) − f(w⋆)) ≤ ∥w(0) − w⋆∥²/(2ηT) + L²η/2.

The right-hand side is O(1/√T) when we choose η = 1/√T.¹ Alternatively, if we compute the average point

w̄T := (1/T) Σₜ₌₁ᵀ w(t−1),

then by Jensen’s inequality we have

f(w̄T) = f((1/T) Σₜ₌₁ᵀ w(t−1)) ≤ (1/T) Σₜ₌₁ᵀ f(w(t−1)).

So the bound for w(t∗) also applies to w̄T:

f(w̄T) − f(w⋆) ≤ (1/T) Σₜ₌₁ᵀ (f(w(t−1)) − f(w⋆)) ≤ ∥w(0) − w⋆∥²/(2ηT) + L²η/2.
¹A similar guarantee holds when the step size used for the t-th update is ηₜ = 1/√t.
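To close, here is a sketch of this variant: gradient descent with fixed step size η = 1/√T and the averaged iterate w̄T, run on the logistic loss risk (which is convex and Lipschitz, so the averaged-iterate bound applies). The data, seed, and helper names are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(5)
    n, d = 200, 5
    X = rng.standard_normal((n, d))
    y = rng.choice([-1.0, 1.0], size=n)

    def risk(w):
        return np.mean(np.log1p(np.exp(-y * (X @ w))))

    def grad(w):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))
        return -(X.T @ (s * y)) / n

    T = 10_000
    eta = 1.0 / np.sqrt(T)
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t in range(T):
        w_bar += w / T                          # running average of w(0), ..., w(T-1)
        w = w - eta * grad(w)
    print(f"risk at averaged iterate w_bar_T: {risk(w_bar):.4f}")
    print(f"risk at final iterate w(T):       {risk(w):.4f}")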