STAT 513/413: Lecture 19 Optimization
(Fundamental)
So, what is all that root finding good for?
The most frequent application of all the root-finding methods is not so much the roots themselves – but rather finding a minimizer (or maximizer) of some function
F(x) → min! (over x)
Sometimes one speaks just about “finding the minimum or maximum” – but if we adopt more precise language, the minimum (or maximum) is the value
min_x f(x) (or max_x f(x))
while minimizer (or maximizer) is an x that attains it: it is an x0 such that
f(x0) = min_x f(x) (or f(x0) = max_x f(x))
Of course, in an even more precise world we can ask whether such an x0 exists at all – that is, whether the max or min exists at all (as opposed to sup and inf, which always exist, even if sometimes not finite)
Nonetheless, one has to keep in mind that the most precise formalism is often not adopted
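In R terms, min() returns a minimum while which.min() locates a minimizer; a small sketch (the function f and the grid here are illustrative choices):

```r
# minimum vs minimizer on a grid (f and the grid are illustrative)
f <- function(x) (x - 2)^2 + 1
xs <- seq(0, 4, by = 0.5)
fx <- f(xs)
min(fx)                # the minimum: the value 1
xs[which.min(fx)]      # a minimizer: the point 2 that attains it
```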
Maximization vs minimization
We could well focus just on minimization (or maximization): maximization (or minimization) can be done by considering −F instead
F(x) → max! (over x) is equivalent to −F(x) → min! (over x)
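A hedged R sketch of this equivalence: maximizing by minimizing the negated function with optimize() (the function Fex is an arbitrary example with maximizer 1):

```r
# maximizing Fex by minimizing -Fex (Fex is an illustrative example)
Fex <- function(x) -(x - 1)^2          # maximizer at x = 1
res <- optimize(function(x) -Fex(x), interval = c(-5, 5))
res$minimum                            # close to 1, the maximizer of Fex
-res$objective                         # close to 0, the maximum of Fex
```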
Optimization vs equation solving
If we are able to evaluate the derivative f(x) of F(x), then the minimization of F may be replaced by solving the equation
f(x) = 0
(Here, “derivative” is meant in an abstract sense; in the multidimensional case, the equation above is actually a system of equations)
Of course, it is not the same thing: by solving this equation we may, in general, obtain only a local minimum; or even not a minimum at all, but a maximum or something else (a saddle point, say)
(The same is actually true for numerical equation solving methods: they do not guarantee to find all zeros, but just some zero – everything depends on the starting point of the iterations.)
Differentiability implies continuity; if a function is not differentiable but still continuous, there are algorithms that may work; if a function is not even continuous (at least in some sense), the prospects of finding the minimizer are not good
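A small R sketch of the caveat above (the function is an illustrative choice): for F(x) = x^3 − 3x, the derivative f(x) = 3x^2 − 3 has roots ±1, and which root uniroot() delivers – the local minimizer or the local maximizer – depends only on the bracketing interval:

```r
# a root of f = F' need not be a minimizer: F(x) = x^3 - 3x, f(x) = 3x^2 - 3
fprime <- function(x) 3 * x^2 - 3
r1 <- uniroot(fprime, c(0, 2))$root    # ~  1: a local minimizer of F
r2 <- uniroot(fprime, c(-2, 0))$root   # ~ -1: a local maximizer, not a minimizer
c(r1, r2)
```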
General deliberations
So, this is how it is: the majority of optimization methods out there guarantee finding not a global optimum, but only a local one
Sometimes we are fine with those; and sometimes we hope that a good initial point will give us the right result. And sometimes we base our hope on a specific character of the function: if we know somehow that a local minimum found would be a global one, then we are fine
We are not actually very happy when there are several (and distant) minimizers giving the same global minimum (even though in such a case every local minimum is a global one) – as we are typically not after the value of the global minimum itself, but rather after a global minimizer
However, we could accept this situation if all minimizers are somewhat close: such functions are called unimodal (to be made precise later). Every local minimum of such a function is a global one – but not all functions with this property are unimodal
If a function has only one global (and thus local) minimum we call it strictly unimodal
Convex functions
A prominent example of unimodal and strictly unimodal functions are convex functions. A function F is called convex if for any two points z1 and z2, all values of F on the segment connecting z1 and z2 lie on or below the chord connecting (z1, F(z1)) and (z2, F(z2)): for any z1, z2,
F(α1 z1 + α2 z2) ≤ α1 F(z1) + α2 F(z2)   for all α1, α2 ≥ 0 with α1 + α2 = 1
If the sharp inequality < holds for every pair z1 ≠ z2 and every α1, α2 > 0, then F is strictly convex
Important examples: x2 (strictly), |x|; also all functions ax + b, of course – the latter are the only ones with the property that both f and −f are convex
A function F is called (strictly) concave if −F is (strictly) convex
For one-dimensional functions (x ∈ R), convexity can be verified in terms of derivatives as follows: if f′ is nondecreasing, then f is convex; if f′′ is nonnegative, then f is convex. Strict convexity of f follows from f′ being increasing or f′′ strictly positive
The definition and the criteria are analogous in the multidimensional case
Some properties of convex functions
The concept readily generalizes to the multidimensional case: F is convex if for any z1 = (x1, y1)^T, z2 = (x2, y2)^T, and α1, α2 ≥ 0 with α1 + α2 = 1,
F(α1 z1 + α2 z2) ≤ α1 F(z1) + α2 F(z2)
(Strict convexity defined in the completely analogous fashion)
Quadratic functions of the type x^T Ax (possibly also + Bx + c) are convex whenever A is nonnegative definite
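A quick R check of this criterion, with an illustrative matrix A (the names Fq, z1, z2 are ours): x^T Ax is convex when all eigenvalues of A are nonnegative, and the convexity inequality can be spot-checked numerically:

```r
# checking convexity of Fq(x) = t(x) %*% A %*% x (A is an illustrative choice)
A <- matrix(c(2, 1, 1, 2), 2, 2)                         # symmetric
ok_eigen <- all(eigen(A, symmetric = TRUE)$values >= 0)  # nonnegative definite?
Fq <- function(x) as.numeric(t(x) %*% A %*% x)
# spot-check the convexity inequality at one pair of points
z1 <- c(1, -2); z2 <- c(3, 0); a1 <- 0.3; a2 <- 0.7
ok_ineq <- Fq(a1*z1 + a2*z2) <= a1*Fq(z1) + a2*Fq(z2)
c(ok_eigen, ok_ineq)                                     # TRUE TRUE
```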
Every local minimum of a convex function is a global minimum; convex functions are also unimodal
Convex functions have a few other properties that distinguish them among other classes of unimodal functions – in particular, with respect to composition and other aspects
Convexity is a very important property in optimization
So, what does it mean “unimodal”
For instance, it can be this: a function F is unimodal if for every c the set {x : F(x) ≤ c} is convex
(The empty set is convex, by the way)
Then, every convex function is unimodal – but not every unimodal function is convex
Unimodal functions, as defined above, are also called quasiconvex. A more general definition of unimodality replaces “convex” in the definition above by “connected”
– but that is only in case you know what “connected” means; otherwise it is beyond the scope of this course
Global optimization
If we are not happy with a local minimum – because we do not have any guarantees it is a global minimum, and really need the latter – then we have to use a global optimization method
Such methods have been proposed (an example is simulated annealing, available in R), but they are typically not that efficient, not that well developed, and give only partial rather than complete guarantees – thus in practice we often resort to strategies pursuing local minima
One-dimensional case again trivial – but fun
The multimodal (= multiple local maxima) case brings nothing new
The best global method (and feasible in dimension one): direct search
Direct search vs optimize
> fun = function(x) (1/x)*sin(1/x)
> x=seq(0.01,1,len=100000)
> plot(x,fun(x),type="l")
> x[which(fun(x) == max(fun(x)))]
[1] 0.0104356
> fun(x[which(fun(x) == max(fun(x)))])
[1] 95.8233
> abline(h=fun(x[which(fun(x) == max(fun(x)))]),lty=3)
> obj=optimize(fun,lower=0.01,upper=1,maximum=TRUE,tol=.Machine$double.eps)
> obj
$maximum
[1] 0.4929125
$objective
[1] 1.819706
> abline(h=obj$objective,lty=2)
> obj=optimize(fun,lower=0.01,upper=0.02,maximum=TRUE,tol=.Machine$double.eps)
> obj
$maximum
[1] 0.01414428
$objective
[1] 70.69291
> abline(h=obj$objective,lty=2)
A picture
[Figure: fun(x) = (1/x) sin(1/x) plotted on (0.01, 1]; x on the horizontal axis, fun(x) from −100 to 100 on the vertical axis; horizontal lines mark the grid-search maximum (dotted) and the two optimize() objectives (dashed)]
Search methods for unimodal functions
How to make direct search more efficient? It all comes down to how many times we have to evaluate the function: the fewer evaluations, the better the method
Fibonacci numbers, Fn: 1, 1, 2, 3, 5, 8, 13, 21, 34, . . . F1 = F2 = 1; otherwise, Fn = Fn−1 + Fn−2
Discrete search: unimodal function f on 1,2,…,k
i     1      2      3      4      5      6      7
f(i)  0.6377 0.4196 0.1546 0.1010 0.3274 0.5693 0.6803

i     1    2    3    4    5    6    7    8    9    10   11   12
f(i)  0.57 0.54 0.49 0.35 0.21 0.16 0.23 0.37 0.50 0.28 0.58 0.59
Fibonacci search
Allows finding the optimum in the least possible number of evaluations
General rule: we always have k = Fn − 1 for some n (if not, we pad at either end with values larger than every value already involved)
We evaluate at the points Fn−2 and Fn−1:
– if f(Fn−2) < f(Fn−1), we discard all of Fn−1, Fn−1 + 1, . . . , Fn − 1
– if f(Fn−2) > f(Fn−1), then we discard 1, 2, . . . , Fn−2
(If f(Fn−2) = f(Fn−1), then we may discard any one of the above.)
After figuring out the result, we have Fn−1 − 1 points left, and we repeat the procedure with Fn−2 and Fn−3; finding the minimum needs (at most) n − 1 evaluations
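A hedged R sketch of the discrete search just described (fib_search is a name chosen here; the input is assumed to be a unimodal vector of already-computed values f(1), …, f(k)):

```r
# sketch of discrete Fibonacci search for the minimizer of a unimodal vector f
# (hypothetical helper; the range is padded up to Fn - 1 points as on the slide)
fib_search <- function(f) {
  Fib <- c(1, 1)                               # F1 = F2 = 1
  while (Fib[length(Fib)] - 1 < length(f))     # smallest n with Fn - 1 >= k
    Fib <- c(Fib, sum(tail(Fib, 2)))
  # pad with values larger than everything already involved
  f <- c(f, rep(max(f) + 1, Fib[length(Fib)] - 1 - length(f)))
  lo <- 0                                      # left end of the current window
  n <- length(Fib)
  while (n > 3) {                              # window holds Fn - 1 points
    i1 <- lo + Fib[n - 2]                      # evaluate at F(n-2) and F(n-1)
    i2 <- lo + Fib[n - 1]
    if (f[i1] < f[i2]) {
      n <- n - 1                               # discard F(n-1), ..., Fn - 1
    } else {
      lo <- lo + Fib[n - 2]                    # discard 1, ..., F(n-2)
      n <- n - 1
    }
  }
  lo + 1                                       # the single remaining point
}
fib_search(c(0.6377, 0.4196, 0.1546, 0.1010, 0.3274, 0.5693, 0.6803))  # 4
```

On the 7-point table above this finds the minimizer at i = 4; since the values are stored in a vector, the "reuse" of evaluations is just indexing.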
Golden section search
For continuous problems, Fibonacci search is impractical, as we would need to know the number of evaluations in advance. Instead we divide the search interval – which after rescaling we can take to be [0, 1] – at the locations γ² and γ, where
γ = (√5 − 1)/2 = lim_{n→∞} Fn−1/Fn = 0.618034…
is the so-called golden ratio
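A minimal sketch of golden section search in R (the function name golden and its tolerance are illustrative choices here, not R's built-in optimize()):

```r
# golden section search for a minimizer on [a, b] (an illustrative sketch)
golden <- function(f, a, b, tol = 1e-8) {
  g <- (sqrt(5) - 1) / 2                  # gamma = 0.618...
  x1 <- b - g * (b - a); f1 <- f(x1)      # interior points at gamma^2 and gamma
  x2 <- a + g * (b - a); f2 <- f(x2)
  while (b - a > tol) {
    if (f1 < f2) {                        # minimizer lies in [a, x2]
      b <- x2; x2 <- x1; f2 <- f1         # the old x1 becomes the new x2:
      x1 <- b - g * (b - a); f1 <- f(x1)  # only one new evaluation per step
    } else {                              # minimizer lies in [x1, b]
      a <- x1; x1 <- x2; f1 <- f2
      x2 <- a + g * (b - a); f2 <- f(x2)
    }
  }
  (a + b) / 2
}
golden(function(x) (x - 2)^2, 0, 5)       # close to 2
```

The point of γ: after discarding one end, the remaining interior point sits exactly at the γ-division of the new interval, so each iteration costs a single evaluation.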
In R: optimize() – golden section search combined with successive parabolic interpolation
[Figure: a golden-ratio rectangle compared with the A4 and letter paper formats]
Multidimensional case: simplex method
In R as Nelder-Mead method; not to be confused with simplex method in linear programming
Simplex: segment in dim 1, triangle in dim 2, tetrahedron in dim 3, …
If d is the dimension, it has d + 1 vertices; we suppose that f(x1) ≤ f(x2) ≤ … ≤ f(xd) ≤ f(xd+1)
(in fact, we rather like to have <)
We take the centroid of x1, x2, …, xd – which is their mean x̄ –
and reflect xd+1 about x̄, obtaining xr = 2x̄ − xd+1; we can do it
scaled, in which case
xr = x̄ + α(x̄ − xd+1) (typically α = 1)
and then we have four possibilities…
Four possibilities - before repeating all again
1. if f(x1) ≤ f(xr) < f(xd), we accept xr in place of xd+1, reorder points and repeat
2. if f(xr) < f(x1), we expand in the direction of xr; try a new point xe,
xe = x̄ + γ(x̄ − xd+1) (typically γ = 2)
if f(xe) < f(xr), accept xe, reorder points and repeat
if f(xe) ≥ f(xr), accept xr, reorder points and repeat
(Note: we do the expansion only once at a time)
3. if f(xr) ≥ f(xd), then we contract: compute
xc = x̄ + β(x̄ − xd+1) (typically β = 0.5)
if f(xc) < f(xd+1), then we accept xc in place of xd+1, reorder points and repeat
4. if f(xc) ≥ f(xd+1), then we shrink: compute d new vertices, keeping x1 in place and taking, for i = 2, 3, …, d+1,
xi = x1 + δ(xi − x1) (typically δ = 0.5)
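In R these moves are performed by optim() with method = "Nelder-Mead" (its default); a small sketch on the Rosenbrock function, a standard test problem (the name rosen is ours):

```r
# minimizing the Rosenbrock function with R's built-in Nelder-Mead
rosen <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
res <- optim(c(-1.2, 1), rosen, method = "Nelder-Mead")
res$par          # close to the true minimizer (1, 1)
res$convergence  # 0 signals successful convergence
```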
Picturing it
[Figure: the Nelder–Mead moves – reflection, expansion, contraction, shrink – pictured on a triangle]
A demo
[Figure: a Nelder–Mead demo in two dimensions; horizontal axis 0–6, vertical axis −2–10]
Another dubious method: coordinate descent
Minimize in one coordinate at a time: first in x1, then in x2, then in x3, … finally in xd; and when done, start again: first in x1, then in x2, …
Doesn’t work! Well... unless the objective function is differentiable, then it may
(For nondifferentiable function of two variables one can find a counterexample showing that the method gets stuck far from the optimum.)
But it was a breakthrough for minimizing functions, in β, like
Σ_{i=1}^{p} (yi − xi1 β1 − ··· − xid βd)² + λ Σ_{j=1}^{d} |βj|   (λ > 0)
the so-called LASSO, for p and d large
Another demo
[Figure: a coordinate descent demo in two dimensions; horizontal axis 0–6, vertical axis −2–10]
A counterexample
[1] -2 -3
[1] -3.00001 -3.00000
[1] -3.000010 -3.000019
[1] -3.000029 -3.000019
[1] -3.000029 -3.000038
[1] -3.000048 -3.000038
[1] -3.000048 -3.000057
[1] -3.000066 -3.000057
[1] -3.000066 -3.000076
[1] -3.000085 -3.000076
[1] -3.000085 -3.000095
[1] -3.000104 -3.000095
[1] -3.000104 -3.000113
[1] -3.000123 -3.000113
[1] -3.000123 -3.000132
[1] -3.000141 -3.000132
[1] -3.000141 -3.000151
[1] -3.000160 -3.000151
[1] -3.000160 -3.000169
…
[1] -2 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
[1] -3 -3
A demonstration?
A bit of code
fun = function(x,y) abs(x+y) + 3*abs(x-y)
x = -2; y=-3
print(c(x,y))
tst = seq(-5,5,len=1000001)
for (k in 1:100)
{
## x = optimize(function(z) fun(z,y), interval=c(-5,5))$minimum
## print(c(x,y))
## y = optimize(function(z) fun(x,z), interval=c(-5,5))$minimum
## print(c(x,y))
fx = fun(tst,y)                  # grid search in x, with y held fixed
x = tst[which(fx == min(fx))]
print(c(x,y))
fy = fun(x,tst)                  # grid search in y, with x held fixed
y = tst[which(fy == min(fy))]
print(c(x,y))
}
Some afterword
“In earlier years, many optimization methods were simple enough so that it was appropriate for someone with a problem to consult a research paper, or even develop a special-purpose method, and then to write a computer program to execute steps of the method. This approach is no longer feasible today, for several reasons.
…
These developments mean that the typical person who wishes to solve an optimization problem would not (and, in our view, should not) start from scratch to devise his own optimization method or write his own implementation. Rather, he should be able to use selected routines from high-quality mathematical software libraries. However, this does not mean that the user should remain in complete ignorance of the nature of the algorithms that he will use, or the important features of optimization software.”
(Gill, Murray, and Wright, 1981)