Gradient descent
Daniel Hsu (COMS 4771)
Smooth functions
Smooth functions are functions whose derivatives (gradients) do not change too quickly. The rate of change of the
derivative is the second derivative, so smoothness is a constraint on the second derivatives of a function.
For any β > 0, we say a twice-differentiable function f : Rd → R is β-smooth if the eigenvalues of its Hessian
matrix at any point in Rd are at most β.
Example: logistic regression
Consider the empirical logistic loss risk on a training data set (x_1, y_1), . . . , (x_n, y_n) ∈ Rd × {−1,+1}:

R̂(w) = (1/n) ∑_{i=1}^n ln(1 + exp(−y_i x_iᵀw)).
The Hessian of R̂ at w is

∇²R̂(w) = (1/n) ∑_{i=1}^n σ(y_i x_iᵀw) σ(−y_i x_iᵀw) x_i x_iᵀ
where σ(t) = 1/(1 + exp(−t)) is the sigmoid function. For any unit vector u ∈ Rd, we have

uᵀ∇²R̂(w)u = (1/n) ∑_{i=1}^n σ(y_i x_iᵀw) σ(−y_i x_iᵀw) (x_iᵀu)²
           ≤ (1/(4n)) ∑_{i=1}^n (x_iᵀu)²
           = (1/4) uᵀ ( (1/n) ∑_{i=1}^n x_i x_iᵀ ) u
           ≤ (1/4) λ_max( (1/n) ∑_{i=1}^n x_i x_iᵀ ),
where the first inequality uses σ(t)σ(−t) = σ(t)(1 − σ(t)) ≤ 1/4 for all t, and λ_max(M) is used to denote the largest
eigenvalue of a symmetric matrix M. So if λ_1 is the largest eigenvalue of the empirical second moment matrix
(1/n) ∑_{i=1}^n x_i x_iᵀ, then R̂ is β-smooth for β = λ_1/4.
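As a concrete check of this calculation, here is a small numpy sketch (illustrative only; the synthetic data and the helper name logistic_hessian are my own, not from the notes) that computes β = λ_1/4 from the empirical second moment matrix and verifies that the largest eigenvalue of the Hessian of R̂ at a random point does not exceed it.

import numpy as np

def logistic_hessian(w, X, y):
    """Hessian of the empirical logistic risk R-hat at w."""
    n = X.shape[0]
    s = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # sigma(y_i x_i^T w)
    weights = s * (1.0 - s)                  # sigma(t) sigma(-t) <= 1/4
    return (X.T * weights) @ X / n

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

second_moment = X.T @ X / n
beta = np.linalg.eigvalsh(second_moment).max() / 4   # beta = lambda_1 / 4

w = rng.normal(size=d)
largest_hessian_eig = np.linalg.eigvalsh(logistic_hessian(w, X, y)).max()
assert largest_hessian_eig <= beta + 1e-12   # R-hat is beta-smooth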
Quadratic upper bound for smooth functions
A consequence of β-smoothness is the following. Recall that by Taylor’s theorem, for any w, δ ∈ Rd, there
exists w̃ ∈ Rd on the line segment between w and w + δ such that

f(w + δ) = f(w) + ∇f(w)ᵀδ + (1/2) δᵀ∇²f(w̃)δ.
If f is β-smooth, then we can bound the third term from above as

(1/2) δᵀ∇²f(w̃)δ ≤ (1/2) ‖δ‖₂² max_{u ∈ Rd : ‖u‖₂ = 1} uᵀ∇²f(w̃)u
                 ≤ (1/2) ‖δ‖₂² λ_max(∇²f(w̃))
                 ≤ (1/2) ‖δ‖₂² β.
Therefore, if f is β-smooth, then for any w, δ ∈ Rd,

f(w + δ) ≤ f(w) + ∇f(w)ᵀδ + (β/2) ‖δ‖₂².
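As a quick illustration (a one-dimensional example of my own, not from the notes), f(t) = ln(1 + exp(t)) satisfies f′′(t) = σ(t)(1 − σ(t)) ≤ 1/4, so it is β-smooth with β = 1/4, and the quadratic upper bound can be checked numerically at random points:

import numpy as np

# f(t) = ln(1 + exp(t)) is beta-smooth with beta = 1/4, since f''(t) <= 1/4.
f = lambda t: np.log1p(np.exp(t))
fprime = lambda t: 1.0 / (1.0 + np.exp(-t))   # f'(t) = sigma(t)
beta = 0.25

rng = np.random.default_rng(0)
for _ in range(1000):
    w, delta = rng.normal(size=2)
    # quadratic upper bound: f(w + delta) <= f(w) + f'(w) delta + (beta/2) delta^2
    assert f(w + delta) <= f(w) + fprime(w) * delta + beta / 2 * delta**2 + 1e-12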
Gradient descent on smooth functions
Gradient descent starts with an initial point w^(0) ∈ Rd, and for a given step size η, iteratively computes a
sequence of points w^(1), w^(2), . . . as follows. For t = 1, 2, . . .:

w^(t) := w^(t−1) − η∇f(w^(t−1)).
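In code, the update rule is just a short loop. The following sketch (the function name and interface are my own, not part of the notes) runs T steps of gradient descent given a gradient oracle:

import numpy as np

def gradient_descent(grad, w0, eta, T):
    """Run T steps of w <- w - eta * grad(w), returning all iterates."""
    iterates = [np.asarray(w0, dtype=float)]
    for _ in range(T):
        iterates.append(iterates[-1] - eta * grad(iterates[-1]))
    return iterates

# Example: minimize f(w) = ||w - 1||^2 / 2, whose gradient is w - 1.
iterates = gradient_descent(grad=lambda w: w - 1.0, w0=np.zeros(3), eta=0.5, T=50)
print(iterates[-1])   # approaches the minimizer (1, 1, 1)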
Motivation for gradient descent on smooth functions
The motivation for the gradient descent update is the following. Suppose we have a current point w ∈ Rd,
and we would like to locally change it from w to w + δ so as to decrease the function value. How should we
choose δ?
In gradient descent, we consider the quadratic upper-bound that smoothness grants, i.e.,

f(w + δ) ≤ f(w) + ∇f(w)ᵀδ + (β/2) ‖δ‖₂²,
and then choose δ to minimize this upper-bound. The upper-bound is a convex quadratic function of δ, so
its minimizer can be written in closed-form. The minimizer is the value of δ such that
∇f(w) + βδ = 0.
In other words, it is δ*(w), defined by

δ*(w) := −(1/β) ∇f(w).
Plugging in δ*(w) for δ in the quadratic upper-bound gives

f(w + δ*(w)) ≤ f(w) + ∇f(w)ᵀδ*(w) + (β/2) ‖δ*(w)‖₂²
             = f(w) − (1/β) ∇f(w)ᵀ∇f(w) + (1/(2β)) ‖∇f(w)‖₂²
             = f(w) − (1/(2β)) ‖∇f(w)‖₂².
This inequality tells us that this local change to w will decrease the function value as long as the gradient
at w is non-zero. It turns out that if the function f is convex (in addition to β-smooth), then repeatedly
making such local changes is sufficient to approximately minimize the function.
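The guaranteed decrease can be checked numerically. The sketch below (my own example, not from the notes) evaluates f(w + δ*(w)) and f(w) − ‖∇f(w)‖₂²/(2β) for a β-smooth convex quadratic:

import numpy as np

# Check f(w + delta*) <= f(w) - ||grad f(w)||^2 / (2 beta) on the quadratic
# f(w) = 0.5 w^T P w - q^T w, which is beta-smooth with beta = lambda_max(P).
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
P = A @ A.T + np.eye(5)                # symmetric positive definite Hessian
q = rng.normal(size=5)
beta = np.linalg.eigvalsh(P).max()

f = lambda w: 0.5 * w @ P @ w - q @ w
grad = lambda w: P @ w - q

w = rng.normal(size=5)
g = grad(w)
delta_star = -g / beta                 # the minimizer of the quadratic upper bound
assert f(w + delta_star) <= f(w) - (g @ g) / (2 * beta) + 1e-12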
Analysis of gradient descent on smooth convex functions
One of the simplest ways to mathematically analyze the behavior of gradient descent on smooth functions
(with step size η = 1/β) is to monitor the change in a potential function during the execution of gradient
descent. The potential function we will use is the squared Euclidean distance to a fixed vector w* ∈ Rd,
which could be a minimizer of f (but need not be):

Φ(w) := (1/(2η)) ‖w − w*‖₂².

The scaling by 1/(2η) is used just for notational convenience.
Let us examine the “drop” in the potential when we change a point w to w + δ*(w) (as in gradient descent):

Φ(w) − Φ(w + δ*(w)) = (1/(2η)) ‖w − w*‖₂² − (1/(2η)) ‖w + δ*(w) − w*‖₂²
                     = (β/2) ‖w − w*‖₂² − (β/2) ( ‖w − w*‖₂² + 2δ*(w)ᵀ(w − w*) + ‖δ*(w)‖₂² )
                     = −βδ*(w)ᵀ(w − w*) − (β/2) ‖δ*(w)‖₂²
                     = ∇f(w)ᵀ(w − w*) − (1/(2β)) ‖∇f(w)‖₂².
In the last step, we have plugged in δ*(w) = −(1/β) ∇f(w). Now we use two key facts. The first is the inequality
we derived above based on the smoothness of f:

f(w + δ*(w)) ≤ f(w) − (1/(2β)) ‖∇f(w)‖₂²,
which rearranges to

−(1/(2β)) ‖∇f(w)‖₂² ≥ f(w + δ*(w)) − f(w).
The second comes from the first-order definition of convexity:

f(w*) ≥ f(w) + ∇f(w)ᵀ(w* − w),

which rearranges to

∇f(w)ᵀ(w − w*) ≥ f(w) − f(w*).
(We’ll discuss this inequality more later.) So, we can bound the drop in potential as follows:

Φ(w) − Φ(w + δ*(w)) = ∇f(w)ᵀ(w − w*) − (1/(2β)) ‖∇f(w)‖₂²
                     ≥ ( f(w) − f(w*) ) + ( f(w + δ*(w)) − f(w) )
                     = f(w + δ*(w)) − f(w*).
Let us write this inequality in terms of the iterates of gradient descent with η = 1/β:

Φ(w^(t−1)) − Φ(w^(t)) ≥ f(w^(t)) − f(w*).
Summing this inequality from t = 1, 2, . . . , T:

∑_{t=1}^T ( Φ(w^(t−1)) − Φ(w^(t)) ) ≥ ∑_{t=1}^T ( f(w^(t)) − f(w*) ).
The left-hand side telescopes and simplifies to Φ(w^(0)) − Φ(w^(T)). Furthermore, since each gradient descent step
does not increase the function value (by the guaranteed decrease shown above), we have f(w^(t)) ≥ f(w^(T)) for all
t = 1, . . . , T, so the right-hand side can be bounded from below by

T ( f(w^(T)) − f(w*) ).
So we are left with the inequality

f(w^(T)) − f(w*) ≤ (1/T) ( Φ(w^(0)) − Φ(w^(T)) ) = (β/(2T)) ( ‖w^(0) − w*‖₂² − ‖w^(T) − w*‖₂² ).
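The following sketch (an illustrative experiment of my own; ridge-regularized least squares is chosen because its minimizer has a closed form) runs gradient descent with η = 1/β and checks this final inequality numerically:

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 10, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# f(w) = (1/2n) ||Xw - y||^2 + (lam/2) ||w||^2 is convex and beta-smooth.
f = lambda w: 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * w @ w
grad = lambda w: X.T @ (X @ w - y) / n + lam * w

H = X.T @ X / n + lam * np.eye(d)          # constant Hessian of f
beta = np.linalg.eigvalsh(H).max()
w_star = np.linalg.solve(H, X.T @ y / n)   # exact minimizer

T, eta = 50, 1.0 / beta
w0 = np.zeros(d)
w = w0.copy()
for _ in range(T):
    w = w - eta * grad(w)

lhs = f(w) - f(w_star)
rhs = beta / (2 * T) * ((w0 - w_star) @ (w0 - w_star) - (w - w_star) @ (w - w_star))
assert lhs <= rhs + 1e-12
print(lhs, rhs)   # lhs is typically much smaller than the worst-case bound rhs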
Gradient descent on Lipschitz convex functions
Gradient descent can also be used for non-smooth convex functions as long as the function itself does not
change too quickly.
For any L > 0, we say that a differentiable function f : Rd → R is L-Lipschitz if its gradient at any point in
Rd is bounded in Euclidean norm by L.
The motivation for gradient descent based on minimizing quadratic upper-bounds no longer applies. Indeed,
the gradient at w could be very different from the gradient at a nearby w′, so the function value at w−η∇f(w)
could be worse than the function value at w. Therefore, we cannot expect to have the same convergence
guarantee for non-smooth functions that we had for smooth functions.
Gradient descent, nevertheless, will produce a sequence w(1),w(2), . . . such that the function value at these
points is approximately minimal on average.
Motivation for gradient descent on Lipschitz convex functions
A basic motivation for gradient descent on convex functions, one that does not assume smoothness, comes from
the first-order condition for convexity:
f(w*) ≥ f(w) + ∇f(w)ᵀ(w* − w),

which rearranges to

(−∇f(w))ᵀ(w* − w) ≥ f(w) − f(w*).

Suppose f(w) > f(w*), so that moving from w to w* would improve the function value. Then, the inequality
implies that the negative gradient −∇f(w) at w makes a positive inner product with the direction from w
to w*. This is the crucial property that makes gradient descent work.
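Here is a tiny numerical illustration (my own example, not from the notes) of this inequality for the 1-Lipschitz convex function f(w) = |w − 3|, whose minimizer is w* = 3:

# Check (-f'(w)) (w* - w) >= f(w) - f(w*) for f(w) = |w - 3| and w* = 3.
f = lambda w: abs(w - 3.0)
fprime = lambda w: 1.0 if w > 3.0 else -1.0   # (sub)gradient away from the kink

w_star = 3.0
for w in (-2.0, 0.5, 7.0):
    # the negative gradient makes a positive inner product with the
    # direction from w toward the minimizer w_star
    assert (-fprime(w)) * (w_star - w) >= f(w) - f(w_star)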
Analysis of gradient descent on Lipschitz convex functions
We again monitor the change in the potential function
Φ(w) := (1/(2η)) ‖w − w*‖₂²,

for a fixed vector w* ∈ Rd.
Again, let us examine the “drop” in the potential when we change a point w to w − η∇f(w) (as in gradient
descent):

Φ(w) − Φ(w − η∇f(w)) = (1/(2η)) ‖w − w*‖₂² − (1/(2η)) ‖w − η∇f(w) − w*‖₂²
                      = (−∇f(w))ᵀ(w* − w) − (η/2) ‖∇f(w)‖₂²
                      ≥ f(w) − f(w*) − L²η/2,
where the inequality uses the convexity and Lipschitzness of f. In terms of the iterates of gradient descent,
this reads

Φ(w^(t−1)) − Φ(w^(t)) ≥ f(w^(t−1)) − f(w*) − L²η/2.
Summing this inequality from t = 1, 2, . . . , T:

Φ(w^(0)) − Φ(w^(T)) ≥ ∑_{t=1}^T ( f(w^(t−1)) − f(w*) ) − L²ηT/2.
Rearranging and dividing through by T (and dropping a term):

(1/T) ∑_{t=1}^T ( f(w^(t−1)) − f(w*) ) ≤ ‖w^(0) − w*‖₂² / (2ηT) + L²η/2.
The left-hand side is the average sub-optimality relative to f(w*). Therefore, there exists some t* ∈
{0, 1, . . . , T − 1} such that

f(w^(t*)) − f(w*) ≤ (1/T) ∑_{t=1}^T ( f(w^(t−1)) − f(w*) ) ≤ ‖w^(0) − w*‖₂² / (2ηT) + L²η/2.
The right-hand side is O(1/√T) when we choose η = 1/√T.¹ Alternatively, if we compute the average point

w̄_T := (1/T) ∑_{t=1}^T w^(t−1),
then by Jensen’s inequality we have

f(w̄_T) = f( (1/T) ∑_{t=1}^T w^(t−1) ) ≤ (1/T) ∑_{t=1}^T f(w^(t−1)).
So the bound for w^(t*) also applies to w̄_T:

f(w̄_T) − f(w*) ≤ (1/T) ∑_{t=1}^T ( f(w^(t−1)) − f(w*) ) ≤ ‖w^(0) − w*‖₂² / (2ηT) + L²η/2.
¹A similar guarantee holds when the step size used for the t-th update is η_t = 1/√t.
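To see the averaged-iterate guarantee in action, the sketch below (my own example, not from the notes; f(w) = ‖w − c‖₂ is chosen because it is convex and 1-Lipschitz but not smooth) runs gradient descent with η = 1/√T and compares f(w̄_T) − f(w*) to the bound with w* = c:

import numpy as np

rng = np.random.default_rng(0)
d = 5
c = rng.normal(size=d)

# f(w) = ||w - c||_2 is convex and L-Lipschitz with L = 1; its minimizer is c.
f = lambda w: np.linalg.norm(w - c)
def grad(w):
    r = np.linalg.norm(w - c)
    return (w - c) / r if r > 0 else np.zeros(d)   # a valid (sub)gradient everywhere

L = 1.0
T = 2500
eta = 1.0 / np.sqrt(T)

w0 = np.zeros(d)
iterates = [w0]
for _ in range(T):
    iterates.append(iterates[-1] - eta * grad(iterates[-1]))

w_bar = np.mean(iterates[:T], axis=0)   # average of w^(0), ..., w^(T-1)
lhs = f(w_bar) - f(c)
rhs = (w0 - c) @ (w0 - c) / (2 * eta * T) + L**2 * eta / 2
assert lhs <= rhs + 1e-9
print(lhs, rhs)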