STAT 513/413: Lecture 20 Walking hills and ravines
(A refreshment of calculus)
Hills and ravines
One-dimensional case: too trivial to bother with
Multi-dimensional cases: too difficult to imagine/picture
The only one that can somewhat be pictured: the two-dimensional case
hills = maximization ravines = minimization
Some map reading experience may be handy – in particular, when I cannot wave my hands in class to enact 3D
Multidimensional calculus
That is, two-dimensional here: f(x1, x2) = (1/64) x1^4 + (1/4) x2^2, say
Well, this is nice, as it is readily generalized into p dimensions…
…but it will be a notational pain when doing iterative methods
(although it could be handled: (x11, x12), (x21, x22), (x31, x32), …)
So, let us make it rather f(x, y) = (1/64) x^4 + (1/4) y^2,
and have a look
> x=seq(-7,7,len=50)                  # grid in each coordinate
> y=x
> xy=expand.grid(x,y)                 # all (x,y) pairs
> z=matrix((1/64)*xy[,1]^4 + (1/4)*xy[,2]^2,
+ nrow=length(x),ncol=length(y))      # f evaluated on the grid
> persp(x,y,z,theta=-30,phi=30)       # perspective view
> contour(x,y,z,nlevels=100)          # contour ("map") view
> points(3,-2)                        # a point...
> arrows(3,-2,3+3^3/16,-2+(-2)/2)     # ...and the gradient arrow at it
Many looks possible, of course
[Figure: perspective plot of z = f(x, y)]
So, now the map view
[Figure: contour ("map") view of f, with the point (3, −2) and its gradient arrow]
Question: where does the gradient point?
Partial derivatives first: good notation is hard to come by here…
∂f/∂x
…good, but where is this taken? ∂f/∂x(x1, y1) or ∂f(x, y)/∂x or ???
I like this more: Dxf(x, y), Dyf(x, y)
(In the general x1, x2 notation, these could be D1f(x1, x2), D2f(x1, x2), …)
We can also handle higher-order partial derivatives this way:
Dxxf(x, y) = Dx(Dxf(x, y))    Dyyf(x, y) = Dy(Dyf(x, y))
Dyxf(x, y) = Dy(Dxf(x, y))
which can be easily confused with Dxyf(x, y) = Dx(Dyf(x, y)),
because in general it is not the same – but very often it is, and so it will also be in all that follows:
Dyxf(x, y) = Dxyf(x, y)
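A small aside, not from the slides: base R can produce such partial derivatives symbolically with D(), shown here for our example f (the names e, dfdx, dfdyx are just made up for this sketch)
> e=quote(1/64*x^4 + 1/4*y^2)     # f as an unevaluated expression
> dfdx=D(e,"x")                   # Dx f, i.e. 1/64 * (4 * x^3)
> dfdyx=D(D(e,"x"),"y")           # Dyx f: it is 0 here, and equals Dxy f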
So, where does the gradient point?
(was just kidding: no bonus, too late)
Gradient: partial derivatives put in a (column) vector
∇f(x, y) = Gf(x, y) = G(x, y) = (Dxf(x, y), Dyf(x, y))^T
For our particular f(x, y) = (1/64) x^4 + (1/4) y^2,
G(x, y) = ((1/16) x^3, (1/2) y)^T
which at the point (3, −2) marked in the earlier plot is (27/16, −1)^T
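As a quick check in R (a sketch; the point (3, −2) is the one marked in the earlier plot):
> G=function(x,y) c((1/16)*x^3, (1/2)*y)    # the gradient as a function
> G(3,-2)                                   # the arrow drawn before
[1]  1.6875 -1.0000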
So, where does the gradient point?
In the direction of maximal slope up!
[Figure: the contour map again, with the gradient arrow from the earlier plot]
Gradient descent algorithm
Also: method of steepest descent
Steepest ascent: in the direction of G(x, y)
Steepest descent: in the direction of −G(x, y)
(xk+1, yk+1)^T = (xk, yk)^T − G(xk, yk)
As this may bomb quite quickly, we may consider a modification
(xk+1, yk+1)^T = (xk, yk)^T − s G(xk, yk)
where s > 0 is some suitable number
As simple as that…
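A minimal sketch of this iteration in R; the starting point and the value of s are purely illustrative here (too large an s makes the iteration blow up, too small makes it crawl)
> G=function(v) c((1/16)*v[1]^3, (1/2)*v[2])   # gradient of f
> v=c(10,6); s=0.1                             # illustrative start and step
> for (k in 1:100) v=v-s*G(v)                  # 100 steps of gradient descent
> v                                            # slowly creeping towards (0,0)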
… but as lousy
Another demo
100 iterations of plain gradient descent
Gradient descent works – if the step is right
Slow, unreliable – but a workhorse in neural networks, for instance
And it works – if the step is set right at each iteration
Instead of simply doing
(xk+1, yk+1)^T = (xk, yk)^T − G(xk, yk)
we do
(xk+1, yk+1)^T = (xk, yk)^T − sk G(xk, yk)
where sk is found using a one-dimensional search, in s, along the half-line
(xk, yk)^T − s G(xk, yk)
(so it is not constant over all the iterations, unlike a fixed step size γ)
The determination of sk is often done by backtracking: we first select
s = 1, and then halve it until we arrive at a suitable value
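A sketch of this in R (the acceptance rule – just requiring that f decreases, with a small floor on s as a safeguard – is one simple variant of backtracking; the starting point is again illustrative):
> f=function(v) (1/64)*v[1]^4 + (1/4)*v[2]^2
> G=function(v) c((1/16)*v[1]^3, (1/2)*v[2])
> v=c(10,6)
> for (k in 1:100) {
+   g=G(v); s=1                                    # start from s = 1
+   while (f(v-s*g) >= f(v) && s > 1e-10) s=s/2    # halve until f decreases
+   v=v-s*g
+ }
> v                                                # approaching (0,0)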
Another improvement of the gradient descent method: method of conjugate gradients (to be discussed later)
And now, the Newton method
First, note that we are now in optimization: minimizing f means looking for f′(x) = 0
and the Newton method then does xk+1 = xk − f′(xk)/f′′(xk)
Now, f′ is likely the gradient. What is f′′?
The Hessian matrix (“the derivative of the gradient”): the element in the i-th row and j-th column is the (second) partial derivative with respect to the i-th and j-th variables (recall: we consider only functions where the order of differentiation is inessential)
In the two-dimensional notation started above, the Hessian is
Hf(x, y) = H(x, y) = [ Dxxf(x, y)   Dxyf(x, y)
                       Dyxf(x, y)   Dyyf(x, y) ]
For our function f(x, y) = (1/64) x^4 + (1/4) y^2,
H(x, y) = [ (3/16) x^2   0
            0            1/2 ]
Convexity via Hessian: if Hf is positive definite everywhere, then f is convex
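For our f this is easy to inspect in R at any chosen point; a quick sketch (the point (3, −2) is again the one from the plot):
> H=function(v) matrix(c((3/16)*v[1]^2, 0, 0, 1/2), 2, 2)  # Hessian of f
> eigen(H(c(3,-2)))$values        # both eigenvalues nonnegative here
[1] 1.6875 0.5000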
The Newton method
Simply, instead of
xk+1 = xk − f′(xk)/f′′(xk)
we do
(xk+1, yk+1)^T = (xk, yk)^T − H^{-1}(xk, yk) G(xk, yk)
Well… we do not invert any matrix, of course. In fact, the Newton method linearizes at the given point; the function being linearized is now the gradient, and the linearization is
H(xk, yk) (x − xk, y − yk)^T + G(xk, yk)
And now we obtain (xk+1, yk+1)^T by solving the linear system
H(xk, yk) (x − xk, y − yk)^T + G(xk, yk) = O
which is the same as
H(xk, yk) (x, y)^T = H(xk, yk) (xk, yk)^T − G(xk, yk)
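A sketch of the resulting iteration in R, with solve() used on the linear system instead of any explicit inverse (the starting point is illustrative):
> G=function(v) c((1/16)*v[1]^3, (1/2)*v[2])                 # gradient
> H=function(v) matrix(c((3/16)*v[1]^2, 0, 0, 1/2), 2, 2)    # Hessian
> v=c(10,6)
> for (k in 1:10) v=v-solve(H(v),G(v))   # solve H d = G, then step by -d
> v                                      # already close to (0,0)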
Newton demo
Newton method: pros and cons
The Newton method, when it works, works very well: the convergence is fast, the precision good
The Newton method derives from the quadratic approximation of the minimized function at (xk, yk):
f(x, y) ≈ f(xk, yk) + (x − xk, y − yk) G(xk, yk) + (1/2) (x − xk, y − yk) H(xk, yk) (x − xk, y − yk)^T
When the approximation is good, it works: for a quadratic function, the Newton method gives the optimum right away
When the approximation is bad, the Newton method can fail completely. For this reason, it is sometimes improved in several ways: for instance
– adjusting the length of the step (similarly as for gradient descent): instead of −H^{-1}(xk, yk) G(xk, yk), taking −s H^{-1}(xk, yk) G(xk, yk)
– forcing H(xk,yk) to be positive definite
This may be done by the eigenvalue decomposition – negative eigenvalues are changed to their absolute values. However, the eigenvalue decomposition at each step can considerably slow down the algorithm; and there may also be eigenvalues equal to zero
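A sketch of that repair in R; make.pd is a made-up name, and the eps floor is one simple way to handle the zero eigenvalues mentioned above.
> make.pd=function(H, eps=1e-8) {               # flip/raise eigenvalues
+   e=eigen(H, symmetric=TRUE)
+   e$vectors %*% diag(pmax(abs(e$values), eps)) %*% t(e$vectors)
+ }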
Compromise methods
Levenberg-Marquardt compromise: originated for the Gauss-Newton method in nonlinear regression
Note:
gradient method: the step is −sG(xk,yk)
Newton method: the step is −s H^{-1}(xk, yk) G(xk, yk)
Levenberg-Marquardt compromise: the step is −s (H(xk, yk) + λI)^{-1} G(xk, yk) (where λ ≥ 0)
For λ large enough, the matrix H(xk, yk) + λI is positive definite
In such a case we have a guarantee: at every iteration, the objective function (the minimized function) decreases
Methods with this guarantee are called descent methods; gradient descent is just one of them
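A sketch of one such step in R for our f (λ and s are illustrative constants here, and lm.step is a made-up name):
> G=function(v) c((1/16)*v[1]^3, (1/2)*v[2])
> H=function(v) matrix(c((3/16)*v[1]^2, 0, 0, 1/2), 2, 2)
> lm.step=function(v, lambda=1, s=1)             # one compromise step
+   v - s*solve(H(v) + lambda*diag(2), G(v))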
Other problems
Another problem with the Newton method: we have to provide those derivatives. In p dimensions it means: we have to provide p partial derivatives for the gradient, and then we need another p^2 second partial derivatives… well, not really: only p(p + 1)/2, but that is still quite a lot for large p
We would like to avoid calculating the Hessian – perhaps via approximating it somehow (recall the secant method in dim 1)
For very large problems, the simple fact of having to keep the whole Hessian in computer memory can be an issue. If there are 1000 variables, for instance, then the Hessian has 1000000 elements…
And other methods then
The first issue (having to provide the Hessian) is mitigated by quasi-Newton methods: like the secant method in dimension 1, they calculate some approximation – or, better, substitute – of the Hessian from past iterations
The method of conjugate gradients eliminates even the need to keep this matrix in computer memory
Both methods provide guarantees when the minimized function is quadratic: in such a case, they converge to the optimum
(Note, however, that in such a case the Newton method converges right away, in one step. On the other hand, these methods often behave better in cases where the Newton method fails.)
Quasi-Newton methods
Newton method step (with adjustable length)
(xk+1, yk+1)^T − (xk, yk)^T = −sk H^{-1}(xk, yk) G(xk, yk)
can be generalized to a general scheme
dk = (xk+1, yk+1)^T − (xk, yk)^T = −sk Qk G(xk, yk)
Firstly, sk is calculated so that it gives the minimum, in s, of
f( (xk, yk)^T − s Qk G(xk, yk) )
Secondly, Qk is defined in such a way that the previous step satisfies
dk−1 = (xk, yk)^T − (xk−1, yk−1)^T = Qk (G(xk, yk) − G(xk−1, yk−1))
Under these requirements, the method is guaranteed to find the optimum of a quadratic function
Quasi-Newton methods: the choice
Updates: idea by Davidon, elaborated by Fletcher and Powell (DFP)
Qk = Qk−1 + (dk−1 ⊗ dk−1) / (dk−1^T ck) − ((Qk−1 ck) ⊗ (Qk−1 ck)) / (ck^T Qk−1 ck)
where ck = G(xk, yk) − G(xk−1, yk−1),
and ⊗ is the “tensor product”: for vectors a and b,
the element of a ⊗ b in the i-th row and j-th column is ai bj
More popular is the BFGS method, named after Broyden, Fletcher, Goldfarb, and Shanno, which also possesses superior engineering details (not discussed here); its updating scheme is
Qk = Qk−1 + (dk−1 ⊗ dk−1) / (dk−1^T ck) − ((Qk−1 ck) ⊗ (Qk−1 ck)) / (ck^T Qk−1 ck) + (ck^T Qk−1 ck) uk ⊗ uk
where
uk = dk−1 / (dk−1^T ck) − (Qk−1 ck) / (ck^T Qk−1 ck)
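In practice, one seldom codes these updates by hand: R's optim() offers BFGS directly. A minimal call on our f (supplying the gradient is optional, but helps):
> f=function(v) (1/64)*v[1]^4 + (1/4)*v[2]^2
> g=function(v) c((1/16)*v[1]^3, (1/2)*v[2])
> optim(c(10,6), f, g, method="BFGS")$par    # ends up near c(0,0)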
Yet another demo
Conjugate gradients: principle
The method of gradient descent has
(xk+1, yk+1)^T = (xk, yk)^T + sk Ck, where Ck = −G(xk, yk)
The method of conjugate gradients selects the next Ck+1 so that
Ck+1^T Q Ck = 0
where Q pertains to some quadratic approximation – which is, however, not calculated; instead, Ck is calculated as a linear combination of Ck−1 and G(xk,yk)
Conjugate gradients: details
Given (xk, yk) and Ck, we calculate sk such that
f( (xk, yk)^T + s Ck )
is minimal in s, and then set
(xk+1, yk+1)^T = (xk, yk)^T + sk Ck   and   Ck+1 = −G(xk+1, yk+1) + tk Ck
Common choices for tk are
tk = G(xk+1, yk+1)^T G(xk+1, yk+1) / ( G(xk, yk)^T G(xk, yk) )
or
tk = (G(xk+1, yk+1) − G(xk, yk))^T G(xk+1, yk+1) / ( G(xk, yk)^T G(xk, yk) )
The method starts with C1 = −G(x1, y1), and is reset – returned to Ck = −G(xk, yk) – periodically, always after a certain fixed number of steps
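R's optim() also provides a conjugate gradient method (its default update corresponds to the first choice of tk above, the Fletcher-Reeves one). A minimal call on our f:
> f=function(v) (1/64)*v[1]^4 + (1/4)*v[2]^2
> g=function(v) c((1/16)*v[1]^3, (1/2)*v[2])
> optim(c(10,6), f, g, method="CG")$par      # again ends up near c(0,0)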
And yet another demo