Numerical Optimisation
Nonsmooth optimisation
Marta M. Betcke
m.betcke@ucl.ac.uk,
Kiko Rullan
f.rullan@cs.ucl.ac.uk
Department of Computer Science,
Centre for Medical Image Computing,
Centre for Inverse Problems
University College London
Lecture 16
Subgradient
For a convex differentiable function f : R^n → R it holds
f(y) ≥ f(x) + ∇f(x)^T (y − x).
A vector g ∈ R^n is a subgradient of a function f at x ∈ dom f if
f(y) ≥ f(x) + g^T (y − x)  ∀y ∈ dom f.
f(x) + g^T (y − x) is an affine global underestimator of f.
g is a subgradient of f at x if (g, −1) supports the epigraph of f at (x, f(x)).
Figure: ∂f (x1) = {∇f (x1)} = {g1}, ∂f (x2) = [g3, g2]. Fig. from S. Boyd,
EE364b, Stanford University.
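A quick numerical illustration of the defining inequality (a sketch, not part of the original slides): for f(x) = |x| the subdifferential at 0 is [−1, 1], so g = 0.5 passes a sampled check of f(y) ≥ f(x) + g^T (y − x) while g = 1.5 fails. The helper is_subgradient is a name introduced here only for this check.

    import numpy as np

    def is_subgradient(f, g, x, trials=1000, span=5.0, seed=0):
        # Sample random points y and test f(y) >= f(x) + g^T (y - x);
        # a sanity check on finitely many points, not a proof.
        rng = np.random.default_rng(seed)
        for _ in range(trials):
            y = x + rng.uniform(-span, span, size=np.shape(x))
            if f(y) < f(x) + g @ (y - x) - 1e-12:
                return False
        return True

    f = lambda x: np.abs(x).sum()                               # f(x) = |x| on R
    print(is_subgradient(f, np.array([0.5]), np.array([0.0])))  # True: 0.5 lies in [-1, 1] = ∂f(0)
    print(is_subgradient(f, np.array([1.5]), np.array([0.0])))  # False: violated e.g. at y = 1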
Subdifferential
A function f is called subdifferentiable at x if there exists at least one subgradient at x.
The subdifferential of f at x, denoted ∂f(x), is the set of all subgradients of f at x.
∂f(x) is a closed convex set (possibly empty), even if f is not convex.
Proof: This follows since ∂f(x) is an intersection of (infinitely many) halfspaces,
∂f(x) = ⋂_{z ∈ dom f} { g : f(z) ≥ f(x) + g^T (z − x) }.
If f is continuous at x, then ∂f(x) is bounded.
If f is convex and x ∈ relint dom f:
∂f(x) is nonempty (and bounded if x ∈ int dom f)
∂f(x) = {∇f(x)} iff f is differentiable at x (abuse of notation!)
Minimum of a nondifferentiable function (unconstrained)
A point x⋆ is a minimiser of a function f (not necessarily convex) iff f is subdifferentiable at x⋆ and
0 ∈ ∂f(x⋆),
i.e. g = 0 is a subgradient of f at x⋆.
Proof: This follows directly from f(x) ≥ f(x⋆) for all x ∈ dom f: f being subdifferentiable at x⋆ with 0 ∈ ∂f(x⋆) is equivalent to
f(x) ≥ f(x⋆) + 0^T (x − x⋆) for all x ∈ dom f.
The condition 0 ∈ ∂f(x⋆) reduces to ∇f(x⋆) = 0 when f is convex and differentiable at x⋆; note that in this case, too, it is a necessary and sufficient condition.
Minimum of a nondifferentiable function (constrained)
Convex constrained optimisation problem
min_{x ∈ R^n} f(x)  subject to  f_i(x) ≤ 0,  i = 1, . . . , m,   (COP)
where
f, f_i : R^n → R are convex, hence subdifferentiable,
strict feasibility holds (Slater's condition).
Generalised KKT conditions: x⋆ is primal optimal and λ⋆ dual optimal iff
f_i(x⋆) ≤ 0,   (KKT)
λ⋆_i ≥ 0,
0 ∈ ∂f(x⋆) + ∑_{i=1}^{m} λ⋆_i ∂f_i(x⋆),
λ⋆_i f_i(x⋆) = 0.
Directional derivatives and subdifferential
For a convex function the directional derivative at x in the direction v is
f′(x; v) = lim_{t ↓ 0} ( f(x + tv) − f(x) ) / t.
The limit always exists for a convex function, though it can be ±∞. If f is finite in a neighbourhood of x, then f′(x; v) is finite.
f is differentiable at x iff for some g (which is ∇f(x)) and all v ∈ R^n we have f′(x; v) = g^T v, i.e. f′(x; v) is a linear function of v.
The directional derivative f′(x; v) of a convex function f satisfies
f′(x; v) = sup_{g ∈ ∂f(x)} g^T v.
Proof idea: f′(x; v) ≥ sup_{g ∈ ∂f(x)} g^T v follows from the definition of the subgradient, f(x + tv) − f(x) ≥ t g^T v for any t > 0 and g ∈ ∂f(x). For the other direction, show that all affine functions below v ↦ f′(x; v) may be taken to be linear.
Subgradient calculus
Weak subgradient calculus: formulas for finding one g ∈ ∂f(x).
If you can compute f, you can usually compute one subgradient. Many algorithms require only one subgradient (see the sketch after the rules below).
Strong subgradient calculus: formulas for finding the whole subdifferential ∂f(x).
Optimality conditions and some algorithms require the whole subdifferential.
Basic rules:
non-negative scaling: for α > 0, ∂(αf) = α ∂f
addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2
affine transformation: if g(x) = f(Ax + b), then ∂g(x) = A^T ∂f(Ax + b)
finite pointwise maximum: for f = max_{i=1,...,m} f_i,
∂f(x) = Co ⋃ { ∂f_i(x) : f_i(x) = f(x) }
(the convex hull of the union of the subdifferentials of the active functions at x)
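A small illustration of the weak calculus (a sketch; the function names are introduced here, not in the slides): for a finite pointwise maximum of affine pieces any active piece supplies a subgradient, and for ‖x‖_1 separability gives sign(x) componentwise.

    import numpy as np

    def subgrad_pointwise_max(As, bs, x):
        # One subgradient of f(x) = max_i (a_i^T x + b_i): the gradient a_i
        # of any active piece (f_i(x) = f(x)) lies in the convex hull above.
        vals = As @ x + bs
        i = int(np.argmax(vals))
        return As[i]

    def subgrad_l1(x):
        # One subgradient of ||x||_1 by separability; where x_i = 0 any value
        # in [-1, 1] is admissible, here 0 is returned.
        return np.sign(x)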
Subgradient and descent direction
p is a descent direction for f at x if f′(x; p) < 0.
If f is differentiable, −∇f is always a descent direction (except when it is 0).
For a nondifferentiable convex function f, p = −g with g ∈ ∂f(x) need not be a descent direction.
Example: f(x) = |x_1| + 2|x_2|.
Figure: Fig. from S. Boyd, EE364b, Stanford University.

Subgradient and distance to sublevel set
For a convex f, if f(z) < f(x) and g ∈ ∂f(x), then for small t > 0
‖x − tg − z‖_2 < ‖x − z‖_2.
Thus −g is a descent direction for ‖x − z‖_2, for any z with f(z) < f(x).
Proof:
‖x − tg − z‖_2^2 = ‖x − z‖_2^2 − 2t g^T (x − z) + t^2 ‖g‖_2^2
≤ ‖x − z‖_2^2 − 2t (f(x) − f(z)) + t^2 ‖g‖_2^2,
where f(x) − f(z) > 0, so the right-hand side is smaller than ‖x − z‖_2^2 for small enough t, e.g. 0 < t < 2(f(x) − f(z)) / ‖g‖_2^2.
Proximal operator
prox_{λf}(v) = argmin_x ( f(x) + 1/(2λ) ‖x − v‖_2^2 )   (PROX)
Evaluating prox_f involves solving a convex optimisation problem. It can be evaluated numerically via e.g. BFGS, but often the convex problem (PROX) has an analytical solution or at least a specialised linear-time algorithm.
Indicator function of a closed convex set C ≠ ∅:
I_C(x) = 0 if x ∈ C, ∞ otherwise.
The proximal operator of I_C is the Euclidean projection onto C,
prox_{λ I_C}(v) = argmin_{x ∈ C} ‖x − v‖_2 = Π_C(v).
Many properties of the projection carry over to the proximal operator.
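For instance (an illustrative choice of C, not from the slides), for the box C = {x : l ≤ x ≤ u} the prox of I_C is componentwise clipping of v:

    import numpy as np

    def prox_indicator_box(v, lower, upper):
        # prox_{lambda I_C}(v) = Pi_C(v): Euclidean projection onto the box,
        # independent of lambda.
        return np.clip(v, lower, upper)

    print(prox_indicator_box(np.array([-2.0, 0.3, 5.0]), -1.0, 1.0))  # [-1.   0.3  1. ]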
Examples of proximal operators
Important special choices of f, for which prox_{λf} has a closed form:
f(x) = (1/2) ‖Px − q‖_2^2:
prox_{λf}(v) = (P^T P + λ^{−1} I)^{−1} (P^T q + λ^{−1} v),
where one can use (P^T P + Q)^{−1} = Q^{−1} − Q^{−1} P^T (I + P Q^{−1} P^T)^{−1} P Q^{−1}.
f separable, i.e. f(x) = ∑_{i=1}^{n} f_i(x_i): the proximal operator acts componentwise,
(prox_{λf}(v))_i = prox_{λ f_i}(v_i),  i = 1, . . . , n.
f(x) = ‖x‖_1:
prox_{λf}(v) = S_λ(v),
with the elementwise soft thresholding operator
S_δ(x) = x − δ for x > δ,  0 for x ∈ [−δ, δ],  x + δ for x < −δ.
Figure: plot of the soft thresholding operator S_δ(x).
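A direct implementation of S_δ, together with a brute-force check on a grid (a numerical sanity check only, not from the slides) that it solves the scalar prox problem for λ|·|:

    import numpy as np

    def soft_threshold(x, delta):
        # Elementwise soft thresholding S_delta(x).
        return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

    # prox_{lambda |.|}(v) = argmin_x |x| + (1/(2*lambda))*(x - v)^2 should equal S_lambda(v)
    lam, v = 0.7, 1.9
    grid = np.linspace(-5.0, 5.0, 200001)
    brute = grid[np.argmin(np.abs(grid) + (grid - v) ** 2 / (2.0 * lam))]
    assert abs(soft_threshold(v, lam) - brute) < 1e-3   # both are approximately 1.2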
Examples of proximal operators
Another important example which does not admit a closed form is the Total Variation, f(x) = TV(x), defined as
TV(x) := ∑_{i=1}^{m−1} ∑_{j=1}^{n−1} √( (x_{i,j} − x_{i+1,j})^2 + (x_{i,j} − x_{i,j+1})^2 )
       + ∑_{i=1}^{m−1} |x_{i,n} − x_{i+1,n}| + ∑_{j=1}^{n−1} |x_{m,j} − x_{m,j+1}|,
assuming standard reflexive boundary conditions x_{m+1,j} = x_{m,j}, x_{i,n+1} = x_{i,n}.
The proximal operator has to be computed iteratively, using e.g. the Chambolle-Pock algorithm (a primal-dual proximal gradient method).
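A minimal numpy sketch of this discrete TV (assuming x is an m × n array; a direct transcription of the formula above, not code from the lecture):

    import numpy as np

    def tv(x):
        # Isotropic discrete TV as defined above, for an m-by-n array x.
        dv = x[:-1, :] - x[1:, :]    # x_{i,j} - x_{i+1,j}, shape (m-1, n)
        dh = x[:, :-1] - x[:, 1:]    # x_{i,j} - x_{i,j+1}, shape (m, n-1)
        interior = np.sqrt(dv[:, :-1] ** 2 + dh[:-1, :] ** 2).sum()  # i < m, j < n
        last_col = np.abs(dv[:, -1]).sum()   # |x_{i,n} - x_{i+1,n}| terms
        last_row = np.abs(dh[-1, :]).sum()   # |x_{m,j} - x_{m,j+1}| terms
        return interior + last_col + last_row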
Resolvent of subdifferential operator
Proximal operator
prox_{λf}(v) = argmin_x ( f(x) + 1/(2λ) ‖x − v‖_2^2 ).
The first order condition for the minimiser reads
0 ∈ ∂f(x) + (1/λ)(x − v)
v − x ∈ λ ∂f(x)
v ∈ λ ∂f(x) + x
prox_{λf}(v) = (I + λ ∂f)^{−1} v.
The mapping (I + λ ∂f)^{−1} is called the resolvent of the operator ∂f.
x⋆ minimises f iff x⋆ is a fixed point,
x⋆ = prox_f(x⋆).
Moreau-Yosida regularisation
Moreau envelope or Moreau-Yosida regularisation of f:
M_{λf}(v) = inf_x ( f(x) + 1/(2λ) ‖x − v‖_2^2 ).
M_{λf} is a smoothed (regularised) version of f:
always has full domain
always continuously differentiable
has the same minimisers as f
One can show that M_f = (f^∗ + (1/2) ‖·‖_2^2)^∗.
Example: the Moreau envelope of |·| is the Huber function
M_{|·|}(x) = (1/2) x^2 for |x| ≤ 1,  |x| − 1/2 for |x| > 1.
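The Huber identity is easy to check numerically (a sketch with λ = 1; the grid-based evaluation of the envelope is only an approximation):

    import numpy as np

    def moreau_env_abs(v, span=6.0, n_grid=120001):
        # M_{|.|}(v) = inf_x ( |x| + 0.5*(x - v)^2 ), evaluated on a grid.
        x = np.linspace(-span, span, n_grid)
        return np.min(np.abs(x) + 0.5 * (x - v) ** 2)

    def huber(v):
        return 0.5 * v ** 2 if abs(v) <= 1 else abs(v) - 0.5

    for v in (-2.3, -0.4, 0.0, 0.7, 3.1):
        assert abs(moreau_env_abs(v) - huber(v)) < 1e-4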
Moreau decomposition: v = prox_f(v) + prox_{f^∗}(v) is a generalisation of the orthogonal decomposition v = Π_W(v) + Π_{W^⊥}(v).
It follows from the Moreau decomposition that (I_W)^∗ = I_{W^⊥}.
Forward Backward splitting
min_x f(x) + g(x)  subject to  x ∈ E   (1)
E: a finite dimensional Euclidean space with inner product ⟨·, ·⟩ and the self-dual norm ‖·‖ = ⟨·, ·⟩^{1/2} = ‖·‖_∗, e.g. the space of n × m images, R^{n×m}
f : E → R continuously differentiable with Lipschitz continuous gradient,
‖∇f(x) − ∇f(y)‖ ≤ L(f) ‖x − y‖,  ∀x, y ∈ E.
g : E → (−∞, ∞] proper, closed and convex.
From the first order optimality condition we have
0 ∈ ∇f(x⋆) + ∂g(x⋆)
0 ∈ τ ∇f(x⋆) + τ ∂g(x⋆) − x⋆ + x⋆
(I − τ ∇f)(x⋆) ∈ (I + τ ∂g)(x⋆)
x⋆ = (I + τ ∂g)^{−1} (I − τ ∇f)(x⋆)   (2)
Iterative scheme:
x_k = prox_{τ_k g}( x_{k−1} − τ_k ∇f(x_{k−1}) )
    = argmin_x { g(x) + 1/(2τ_k) ‖x − (x_{k−1} − τ_k ∇f(x_{k−1}))‖^2 }.
Gradient Projection: g(x) = I_C(x), smooth constrained minimisation, τ_k ∈ (0, 2/L(f)):
x_k = Π_C( x_{k−1} − τ_k ∇f(x_{k−1}) ).
Proximal Minimization: f ≡ 0, non-smooth convex minimisation:
x_k = argmin_x { g(x) + 1/(2τ_k) ‖x − x_{k−1}‖^2 }.
Iterative Shrinkage Thresholding Algorithm (ISTA): g(x) = ‖x‖_1, f(x) = ‖Ax − b‖_2^2, τ_k ∈ (0, 2/L(f)):
x_k = S_{τ_k}( x_{k−1} − τ_k ∇f(x_{k−1}) ).
Proximal gradient:
x_k = prox_{τ_k g}( x_{k−1} − τ_k ∇f(x_{k−1}) )
    = argmin_x { g(x) + 1/(2τ_k) ‖x − (x_{k−1} − τ_k ∇f(x_{k−1}))‖^2 }.
Gradient Projection: g(x) = I_C(x), smooth constrained minimisation, τ_k ∈ (0, 2/L(f)):
x_k = Π_C( x_{k−1} − τ_k ∇f(x_{k−1}) ).
Iterative Shrinkage Thresholding Algorithm (ISTA): g(x) = ‖x‖_1, f(x) = ‖Ax − b‖_2^2, τ_k ∈ (0, 2/L(f)):
x_k = S_{τ_k}( x_{k−1} − τ_k ∇f(x_{k−1}) ).
Slow convergence: for τ_k = τ = 1/L with L ≥ L(f) and F(x) := f(x) + g(x),
F(x_k) − F⋆ ≤ L ‖x_0 − x⋆‖^2 / (2k).
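A compact ISTA sketch for the problem on the slide, min_x ‖Ax − b‖_2^2 + ‖x‖_1, with the fixed step τ = 1/L, L = 2‖A‖_2^2 ≥ L(f) (the function names are illustrative):

    import numpy as np

    def soft_threshold(x, delta):
        return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

    def ista(A, b, n_iter=500):
        # ISTA: x_k = S_tau( x_{k-1} - tau * grad f(x_{k-1}) ) with f(x) = ||Ax - b||_2^2.
        L = 2.0 * np.linalg.norm(A, 2) ** 2    # Lipschitz constant of grad f
        tau = 1.0 / L
        x = np.zeros(A.shape[1])
        for _ in range(n_iter):
            grad = 2.0 * A.T @ (A @ x - b)
            x = soft_threshold(x - tau * grad, tau)   # prox of tau*||.||_1
        return x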
Fast Iterative Shrinkage Thresholding Algorithm (FISTA):
Initialize: y_1 := x_0 ∈ E, τ_1 = 1.
Step k:
x_k = prox_{(1/L) g}( y_k − (1/L) ∇f(y_k) )
τ_{k+1} = ( 1 + √(1 + 4 τ_k^2) ) / 2
y_{k+1} = x_k + ( (τ_k − 1) / τ_{k+1} ) (x_k − x_{k−1}).
Convergence: for step size 1/L with L ≥ L(f) and F(x) := f(x) + g(x),
F(x_k) − F⋆ ≤ 2L ‖x_0 − x⋆‖^2 / (k + 1)^2.
Fast Projection Gradient [Nesterov'83]: g(x) = I_C(x),
x_k = Π_C( y_k − (1/L) ∇f(y_k) ).
For more details on Nesterov's algorithm see e.g.
http://www.seas.ucla.edu/~vandenbe/236C/lectures/fgrad.pdf
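A generic FISTA sketch following the steps above; grad_f, prox_g and the constant L are supplied by the user (an illustrative interface, not from the slides). With prox_g(v, s) = S_s(v) it gives fast ISTA; with prox_g(v, s) = Π_C(v) it gives the fast projection gradient method.

    import numpy as np

    def fista(grad_f, prox_g, L, x0, n_iter=500):
        # y_1 = x_0, tau_1 = 1; then the three updates from the slide.
        x_prev = x0.copy()
        y = x0.copy()
        tau = 1.0
        for _ in range(n_iter):
            x = prox_g(y - grad_f(y) / L, 1.0 / L)
            tau_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * tau ** 2))
            y = x + ((tau - 1.0) / tau_next) * (x - x_prev)
            x_prev, tau = x, tau_next
        return x_prev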
Review: Optimisation with equality constraints
Let f : R^n → R ∪ {+∞} be closed, proper and convex.
Primal problem
min_x f(x)  subject to  Ax = b
Lagrangian
L(x, y) = f(x) + y^T (Ax − b)
Dual function
g(y) = inf_x L(x, y) = −f^∗(−A^T y) − b^T y
y: dual variable (Lagrange multiplier)
f^∗: convex conjugate of f (f^∗ is convex and closed even if f is not).
Dual problem (always concave; by weak duality g(y⋆) ≤ f(x⋆), with equality if strong duality holds)
max_y g(y).   (3)
Gradient methods
Gradient descent for the primal problem (assuming f continuously differentiable)
x_{k+1} = x_k − τ_k ∇f(x_k).
Gradient ascent for the dual problem (assuming g continuously differentiable)
x_{k+1} = argmin_x L(x, y_k)
y_{k+1} = y_k + τ_k (Ax_{k+1} − b), where Ax_{k+1} − b = ∇g(y_k).
Remark: the primal update is part of the evaluation of ∇g(y_k).
+ for separable f it leads to a parallel algorithm
– various conditions are necessary for convergence, e.g. strict convexity of f, f(x) < ∞ ∀x.
Augmented Lagrangian
Augmented Lagrangian
L_ρ(x, y) = f(x) + y^T (Ax − b) + (ρ/2) ‖Ax − b‖_2^2,  ρ > 0   (AL)
This is the Lagrangian of an equivalent problem (for all feasible x the quadratic term equals 0),
min_x f(x) + (ρ/2) ‖Ax − b‖_2^2  subject to  Ax = b.
Method of multipliers (MM): dual ascent applied to (AL)
x_{k+1} = argmin_x L_ρ(x, y_k)
y_{k+1} = y_k + ρ (Ax_{k+1} − b), where Ax_{k+1} − b = ∇g(y_k).
Using ρ as the step size guarantees dual feasibility of (x_{k+1}, y_{k+1}):
0 = ∇_x L_ρ(x_{k+1}, y_k) = [ ∇f(x) + A^T y_k + ρ A^T (Ax − b) ]_{x = x_{k+1}}
  = ∇f(x_{k+1}) + A^T y_{k+1} =: s_{k+1},
i.e. the dual residual s_{k+1} = 0.
+ converges under more general conditions
– the augmented Lagrangian is non-separable.
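A small MM sketch for an illustrative quadratic instance, min_x (1/2)‖x − x_ref‖_2^2 subject to Ax = b (this particular f is an assumption made here so that the x-update has a closed form); the final assertion checks the dual feasibility property ∇f(x_{k+1}) + A^T y_{k+1} = 0 derived above.

    import numpy as np

    def method_of_multipliers(A, b, x_ref, rho=1.0, n_iter=50):
        n = A.shape[1]
        y = np.zeros(A.shape[0])
        M = np.eye(n) + rho * A.T @ A
        for _ in range(n_iter):
            # x-update: minimise 0.5*||x - x_ref||^2 + y^T(Ax - b) + rho/2*||Ax - b||^2
            x = np.linalg.solve(M, x_ref - A.T @ y + rho * A.T @ b)
            # y-update with step size rho
            y = y + rho * (A @ x - b)
        # dual feasibility by construction: grad f(x_{k+1}) + A^T y_{k+1} = 0
        assert np.allclose((x - x_ref) + A.T @ y, 0.0, atol=1e-8)
        return x, y

    A = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]])
    x, y = method_of_multipliers(A, np.array([1.0, 0.5]), np.array([0.0, 1.0, -2.0]))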
Alternating Direction Method of Multipliers (ADMM)
Blend the separability of dual ascent with the superior convergence of MM:
min_{x ∈ R^n, z ∈ R^m} f(x) + g(z)  subject to  Ax + Bz = c   (4)
with A ∈ R^{p×n}, B ∈ R^{p×m}, c ∈ R^p and g : R^m → R ∪ {+∞} closed, proper and convex.
The equality constraint comes from splitting the variable into x and z, with the objective function separable across the splitting.
Augmented Lagrangian
L_ρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz − c) + (ρ/2) ‖Ax + Bz − c‖_2^2,  ρ > 0
ADMM
x_{k+1} = argmin_x L_ρ(x, z_k, y_k)
z_{k+1} = argmin_z L_ρ(x_{k+1}, z, y_k)
y_{k+1} = y_k + ρ (Ax_{k+1} + Bz_{k+1} − c).
In contrast, dual ascent on L_ρ with joint minimisation (i.e. the method of multipliers applied to (4)) reads
(x_{k+1}, z_{k+1}) = argmin_{x,z} L_ρ(x, z, y_k)
y_{k+1} = y_k + ρ (Ax_{k+1} + Bz_{k+1} − c).
ADMM: scaled form
Augmented Lagrangian, written in terms of the residual r := Ax + Bz − c:
L_ρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz − c) + (ρ/2) ‖Ax + Bz − c‖_2^2,  ρ > 0
             = f(x) + g(z) + y^T r + (ρ/2) ‖r‖_2^2
             = f(x) + g(z) + (ρ/2) ‖r + u‖_2^2 − (ρ/2) ‖u‖_2^2,
with u := (1/ρ) y the scaled dual variable.
ADMM: scaled form
x_{k+1} = argmin_x ( f(x) + (ρ/2) ‖Ax + Bz_k − c + u_k‖_2^2 )
z_{k+1} = argmin_z ( g(z) + (ρ/2) ‖Ax_{k+1} + Bz − c + u_k‖_2^2 )
u_{k+1} = u_k + Ax_{k+1} + Bz_{k+1} − c.
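A scaled-form ADMM sketch for a standard split of the lasso problem, min_x (1/2)‖Dx − b‖_2^2 + λ‖z‖_1 subject to x − z = 0, i.e. A = I, B = −I, c = 0 in the notation above (the lasso instance and the name D for the data matrix are illustrative choices, not from the slides):

    import numpy as np

    def soft_threshold(v, delta):
        return np.sign(v) * np.maximum(np.abs(v) - delta, 0.0)

    def admm_lasso(D, b, lam=0.1, rho=1.0, n_iter=200):
        n = D.shape[1]
        x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
        M = D.T @ D + rho * np.eye(n)
        for _ in range(n_iter):
            # x-update: ridge-type problem, (D^T D + rho I) x = D^T b + rho (z - u)
            x = np.linalg.solve(M, D.T @ b + rho * (z - u))
            # z-update: prox of (lam/rho)*||.||_1 at x + u
            z = soft_threshold(x + u, lam / rho)
            # scaled dual update: u_{k+1} = u_k + x_{k+1} - z_{k+1}
            u = u + x - z
        return x, z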
ADMM convergence
Assume in addition that the unaugmented Lagrangian L has a saddle point. Together with f, g proper, closed and convex, this implies that strong duality holds (no explicit assumptions on A, B, c).
Under these assumptions the ADMM iterates satisfy
Residual convergence: r_k → 0 as k → ∞, i.e. the iterates approach feasibility.
Objective convergence: f(x_k) + g(z_k) → p⋆ as k → ∞, i.e. the objective value at the iterates approaches the optimal value.
Dual variable convergence: y_k → y⋆ as k → ∞, where y⋆ is a dual optimal point.
Note that x_k, z_k need not converge to optimal points, although such a result can be shown under additional assumptions.
Optimality conditions
Necessary and sufficient optimality conditions for ADMM:
Ax⋆ + Bz⋆ − c = 0   (primal feasibility)
0 ∈ ∂f(x⋆) + A^T y⋆   (dual feasibility)
0 ∈ ∂g(z⋆) + B^T y⋆   (dual feasibility)
As for MM, it follows from z_{k+1} = argmin_z L_ρ(x_{k+1}, z, y_k) that z_{k+1} and y_{k+1} always satisfy the last condition.
From x_{k+1} = argmin_x L_ρ(x, z_k, y_k) we have
0 ∈ ∂f(x_{k+1}) + A^T y_k + ρ A^T (Ax_{k+1} + Bz_k − c)
  = ∂f(x_{k+1}) + A^T ( y_k + ρ r_{k+1} + ρ B (z_k − z_{k+1}) )
  = ∂f(x_{k+1}) + A^T y_{k+1} + ρ A^T B (z_k − z_{k+1}),
or equivalently
s_{k+1} := ρ A^T B (z_{k+1} − z_k) ∈ ∂f(x_{k+1}) + A^T y_{k+1},
which can be interpreted as a dual feasibility condition; s_{k+1} is the dual residual at iteration k + 1.
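In practice the two residuals give a natural stopping criterion; a minimal helper sketch (the names and tolerances are illustrative):

    import numpy as np

    def admm_residuals(A, B, c, x_new, z_new, z_old, rho):
        # Primal residual r_{k+1} = A x_{k+1} + B z_{k+1} - c and
        # dual residual s_{k+1} = rho * A^T B (z_{k+1} - z_k), as above.
        r = A @ x_new + B @ z_new - c
        s = rho * A.T @ (B @ (z_new - z_old))
        return np.linalg.norm(r), np.linalg.norm(s)

    # e.g. stop when ||r|| <= eps_pri and ||s|| <= eps_dual for chosen tolerances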
Literature
S. Boyd, Stanford EE364b,
http://stanford.edu/class/ee364b/lectures.html
L. Vandenberghe, UCLA EE236C,
http://www.seas.ucla.edu/~vandenbe/ee236c.html
Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, S. Boyd et al., 2010
A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems, A. Beck, M. Teboulle, 2009
Fast Gradient-Based Algorithms for Constrained Total Variation Image Denoising and Deblurring Problems, A. Beck, M. Teboulle, 2009
A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging, A. Chambolle, T. Pock, 2011