
Numerical Optimisation
Nonsmooth optimisation

Marta M. Betcke
m.betcke@ucl.ac.uk,

Kiko Rullan
f.rullan@cs.ucl.ac.uk

Department of Computer Science,
Centre for Medical Image Computing,

Centre for Inverse Problems
University College London

Lecture 16


Subgradient

For convex differentiable function f : Rn → R it holds
f (y) ≥ f (x) +∇f (x)T(y − x).

A vector g ∈ Rn is a subgradient of a function f at x ∈ dom f if
f (y) ≥ f (x) + gT(y − x) ∀y ∈ dom f .

f (x) + gT(y − x) is affine global underestimator
g is a subgradient of f at x if (g ,−1) supports the epigraph
of f at (x , f (x))

Figure: ∂f (x1) = {∇f (x1)} = {g1}, ∂f (x2) = [g3, g2]. Fig. from S. Boyd,
EE364b, Stanford University.


Subdifferential

A function f is called subdifferentiable at x if there exists at least
one subgradient at x .

Subdifferential of f at x , ∂f (x), is the set of all subgradients of f
at x .

∂f(x) is a closed convex set (which can be empty), even if f is not convex.
Proof: This follows from ∂f(x) being an intersection of an (infinite) family of halfspaces,

∂f(x) = ⋂_{z ∈ dom f} {g : f(z) ≥ f(x) + g^T(z − x)}.

If f is continuous at x, then ∂f(x) is bounded.

If f is convex and x ∈ relint dom f, then ∂f(x) is nonempty and bounded.

∂f (x) = {∇f (x)} iff f differentiable at x (abuse of notation!)

Minimum of nondifferentiable function (unconstrained)

A point x⋆ is a minimiser of a function f (not necessarily convex) iff f is subdifferentiable at x⋆ and

0 ∈ ∂f(x⋆),

i.e. g = 0 is a subgradient of f at x⋆.

Proof: This follows directly from the definition of a minimiser, f(x) ≥ f(x⋆) for all x ∈ dom f: f being subdifferentiable at x⋆ with 0 ∈ ∂f(x⋆) is equivalent to f(x) ≥ f(x⋆) + 0^T(x − x⋆) for all x ∈ dom f.

The condition 0 ∈ ∂f(x⋆) reduces to ∇f(x⋆) = 0 when f is convex and differentiable at x⋆. Note that also in this case the condition is both necessary and sufficient.


Minimum of nondifferentiable function (constrained)

Convex constrained optimisation problem

min_{x ∈ R^n} f(x)   subject to f_i(x) ≤ 0,  i = 1, . . . , m,        (COP)

where

f, f_i : R^n → R are convex, hence subdifferentiable
strict feasibility holds (Slater's condition)

Generalised KKT conditions:
x⋆ is primal optimal and λ⋆ dual optimal iff

f_i(x⋆) ≤ 0,        (KKT)
λ⋆_i ≥ 0,
0 ∈ ∂f(x⋆) + Σ_{i=1}^m λ⋆_i ∂f_i(x⋆),
λ⋆_i f_i(x⋆) = 0.


Directional derivatives and subdifferential

For a convex function the directional derivative at x in the direction v is

f′(x; v) = lim_{t ↘ 0} ( f(x + tv) − f(x) ) / t.

The limit always exists for a convex function, though it can be ±∞. If f is finite in a neighbourhood of x, then f′(x; v) exists.

f is differentiable at x iff for some g (which is ∇f (x)) and all
v ∈ Rn we have f ′(x ; v) = gTv (f ′(x ; v) is a linear function of v).

The directional derivative f′(x; v) of a convex function f satisfies

f′(x; v) = sup_{g ∈ ∂f(x)} g^T v.

Proof idea: f′(x; v) ≥ sup_{g ∈ ∂f(x)} g^T v follows from the definition of the subgradient, f(x + tv) − f(x) ≥ t g^T v for any t > 0 and g ∈ ∂f(x). Other direction: show that all affine functions below v ↦ f′(x; v) may be taken to be linear.


Subgradient calculus

Weak subgradient calculus: formulas for finding one g ∈ ∂f(x).
If you can compute f, you can usually compute one subgradient.
Many algorithms require only one subgradient.

Strong subgradient calculus: formulas for finding the whole subdifferential ∂f(x).
Optimality conditions and some algorithms require the whole subdifferential.

Basic rules:

non-negative scaling: for α > 0, ∂(αf ) = α∂f

addition: ∂(f1 + f2) = ∂(f1) + ∂(f2)

affine transformation: g(x) = f (Ax + b),
∂g(x) = AT∂f (Ax + b)

finite pointwise maximum: f = max_{i=1,...,m} f_i,
∂f(x) = Co ∪ {∂f_i(x) : f_i(x) = f(x)} (the convex hull of the union of the subdifferentials of the active functions at x; a small sketch follows below)
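As an illustration of the weak calculus (a minimal Python sketch, not part of the original slides), one subgradient of f(x) = max_i (a_i^T x + b_i) can be obtained by picking any active index; the data A, b and the test point below are hypothetical.

import numpy as np

def subgradient_max_affine(A, b, x):
    # f(x) = max_i (a_i^T x + b_i); any row a_i with f_i(x) = f(x)
    # (an active function) lies in the convex hull above and is a valid subgradient.
    i = int(np.argmax(A @ x + b))
    return A[i]

A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0]])   # f(x) = max(x1 + 1, -x1, 2*x2)
b = np.array([1.0, 0.0, 0.0])
x = np.array([0.0, 0.5])
g = subgradient_max_affine(A, b, x)

f = lambda z: np.max(A @ z + b)
for y in np.random.randn(100, 2):                      # check f(y) >= f(x) + g^T (y - x)
    assert f(y) >= f(x) + g @ (y - x) - 1e-12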


Subgradient and descent direction

p is a descent direction for f at x if f′(x; p) < 0.

If f is differentiable, −∇f is always a descent direction (except when it is 0).

For a nondifferentiable convex function f, p = −g with g ∈ ∂f(x) need not be a descent direction.
Example: f(x) = |x_1| + 2|x_2|

Figure: Fig. from S. Boyd, EE364b, Stanford University.

Subgradient and distance to sublevel set

For a convex f, if f(z) < f(x) and g ∈ ∂f(x), then for small t > 0

‖x − tg − z‖_2 < ‖x − z‖_2.

Thus −g is a descent direction for ‖x − z‖_2, for any z with f(z) < f(x).

Proof:
‖x − tg − z‖_2^2 = ‖x − z‖_2^2 − 2t g^T(x − z) + t^2 ‖g‖_2^2
≤ ‖x − z‖_2^2 − 2t (f(x) − f(z)) + t^2 ‖g‖_2^2 < ‖x − z‖_2^2,
where f(x) − f(z) > 0 and the last inequality holds for small t, namely t < 2(f(x) − f(z)) / ‖g‖_2^2.

Proximal operator

prox_{λf}(v) = arg min_x ( f(x) + 1/(2λ) ‖x − v‖_2^2 ),   λ > 0        (PROX)

Evaluating proxf involves solving a convex optimisation problem.

Can evaluate numerically via e.g. BFGS, but often the convex
problem (PROX) has an analytical solution or at least a specialised
linear-time algorithm.

Indicator function of a closed convex set C ≠ ∅:

I_C(x) = { 0 if x ∈ C, ∞ otherwise. }

The proximal operator of I_C is the Euclidean projection

prox_{λ I_C}(v) = arg min_{x ∈ C} ‖x − v‖_2 = Π_C(v).

Many properties of projection carry over to proximal operator.
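For instance (a small sketch added for illustration, not from the slides), if C is a box {x : l ≤ x ≤ u} then the projection, and hence prox_{λ I_C} for any λ, is a componentwise clip; the bounds and test vector below are hypothetical.

import numpy as np

def prox_indicator_box(v, lower, upper, lam=1.0):
    # I_C is 0 on C and +inf outside, so prox_{lam * I_C} is the Euclidean
    # projection onto C, independent of lam: clip each component.
    return np.clip(v, lower, upper)

print(prox_indicator_box(np.array([-2.0, 0.3, 5.0]), -1.0, 1.0))   # [-1.   0.3  1. ]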

Examples of proximal operators

Important special choices of f , for which proxλf has a closed form:

f(x) = 1/2 ‖Px − q‖_2^2,

prox_{λf}(v) = (P^T P + λ^{-1} I)^{-1} (P^T q + λ^{-1} v)

(this formula is checked numerically in the sketch below, after the soft thresholding example).

(PTP + Q)−1 = Q−1 − Q−1PT(I + PQ−1PT)−1PQ−1.

f is separable, i.e. f(x) = Σ_{i=1}^n f_i(x_i): the proximal operator acts componentwise,

(prox_{λf}(v))_i = prox_{λ f_i}(v_i),   i = 1, . . . , n

f (x) = ‖x‖1
proxλf (v) = Sλ(v),

with elementwise soft thresholding

S_δ(x) = { x − δ if x > δ;  0 if x ∈ [−δ, δ];  x + δ if x < −δ. }

Figure: the soft thresholding function S_δ(x).
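The two closed forms above are easy to implement and check numerically; the sketch below (not part of the slides, with hypothetical placeholder data) applies soft thresholding and verifies the quadratic prox against its optimality condition P^T(Px − q) + (x − v)/λ = 0.

import numpy as np

def soft_threshold(x, delta):
    # elementwise S_delta(x) = prox_{delta * ||.||_1}(x)
    return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

print(soft_threshold(np.array([1.5, -0.2, 0.7, -3.0]), 1.0))   # [ 0.5 -0.  0. -2. ]

# prox of f(x) = 0.5*||Px - q||_2^2 via the closed form, then check optimality
rng = np.random.default_rng(0)
m, n, lam = 8, 5, 0.7
P, q, v = rng.standard_normal((m, n)), rng.standard_normal(m), rng.standard_normal(n)
x = np.linalg.solve(P.T @ P + np.eye(n) / lam, P.T @ q + v / lam)
print(np.linalg.norm(P.T @ (P @ x - q) + (x - v) / lam))        # ~ 1e-14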
Examples of proximal operators

Another important example, which does not admit a closed form, is Total Variation, f(x) = TV(x), defined as

TV(x) := Σ_{i=1}^{m−1} Σ_{j=1}^{n−1} √( (x_{i,j} − x_{i+1,j})^2 + (x_{i,j} − x_{i,j+1})^2 ) + Σ_{i=1}^{m−1} |x_{i,n} − x_{i+1,n}| + Σ_{j=1}^{n−1} |x_{m,j} − x_{m,j+1}|,

assuming standard reflexive boundary conditions x_{m+1,j} = x_{m,j}, x_{i,n+1} = x_{i,n}. The proximal operator has to be computed iteratively, using e.g. the Chambolle-Pock algorithm (primal-dual proximal gradient).

Resolvent of subdifferential operator

Proximal operator:

prox_{λf}(v) = arg min_x ( f(x) + 1/(2λ) ‖x − v‖_2^2 ).

The first order condition for the minimiser reads

0 ∈ ∂f(x) + (1/λ)(x − v)
v − x ∈ λ ∂f(x)
v ∈ λ ∂f(x) + x
prox_{λf}(v) = (I + λ ∂f)^{-1} v.

The mapping (I + λ ∂f)^{-1} is called the resolvent of the operator ∂f.

x⋆ minimises f iff x⋆ is a fixed point, x⋆ = prox_f(x⋆).

Moreau-Yosida regularisation

The Moreau envelope or Moreau-Yosida regularisation of f is

M_{λf}(v) = inf_x ( f(x) + 1/(2λ) ‖x − v‖_2^2 ).

M_{λf} is a smoothed (regularised) version of f:

always has full domain
always continuously differentiable
has the same minimisers as f

One can show that M_f = (f* + 1/2 ‖·‖_2^2)*.

Example: the Moreau envelope of |·| is the Huber function

M_{|·|}(x) = { x^2/2 if |x| ≤ 1;  |x| − 1/2 if |x| > 1. }

Moreau decomposition: v = prox_f(v) + prox_{f*}(v) is a generalisation of the orthogonal decomposition v = Π_W(v) + Π_{W⊥}(v). The orthogonal decomposition follows from the Moreau decomposition, since (I_W)* = I_{W⊥}.
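A quick numerical illustration of the Moreau decomposition (a sketch, not from the slides), using the standard facts that prox_{‖·‖_1} is soft thresholding and that (‖·‖_1)* is the indicator of the ℓ∞ unit ball, whose prox is the clip onto [−1, 1]:

import numpy as np

def prox_l1(v):                      # prox of ||.||_1 (lambda = 1)
    return np.sign(v) * np.maximum(np.abs(v) - 1.0, 0.0)

def prox_l1_conj(v):                 # prox of the conjugate: projection onto {|x_i| <= 1}
    return np.clip(v, -1.0, 1.0)

v = np.array([2.0, -0.4, 0.9, -1.7])
print(prox_l1(v) + prox_l1_conj(v))  # equals v, as the decomposition predicts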

Forward Backward splitting

min_x f(x) + g(x)   subject to x ∈ E        (1)

E: finite dimensional Euclidean space with inner product ⟨·, ·⟩ and a self-dual norm ‖·‖ = ⟨·, ·⟩^{1/2} = ‖·‖_*, e.g. the space of n × m images, R^{n×m}
f : E→ R continuously differentiable with Lipschitz
continuous gradient,
‖∇f (x)−∇f (y)‖ ≤ L(f )‖x − y‖, ∀x , y ∈ E.
g : E→ (−∞,∞] proper closed convex.

From the first order optimality condition we have

0 ∈ ∇f(x∗) + ∂g(x∗)
0 ∈ τ∇f(x∗) + τ∂g(x∗) − x∗ + x∗
(I − τ∇f)(x∗) ∈ (I + τ∂g)(x∗)
x∗ = (I + τ∂g)^{-1}(I − τ∇f)(x∗)        (2)


Iterative scheme:

x_k = prox_{τ_k g}(x_{k−1} − τ_k ∇f(x_{k−1}))
    = arg min_x { g(x) + 1/(2τ_k) ‖x − (x_{k−1} − τ_k ∇f(x_{k−1}))‖^2 }.

Gradient Projection: g(x) = IC(x): smooth constrained
minimisation, τk ∈ (0, 2/L(f ))

xk = ΠC(xk−1 − τk∇f (xk−1)).
Proximal Minimization: f(x) = 0: non-smooth convex minimisation

x_k = arg min_x { g(x) + 1/(2τ_k) ‖x − x_{k−1}‖^2 }.

Iterative Shrinkage Thresholding Algorithm (ISTA):
g(x) = ‖x‖1, f (x) = ‖Ax − b‖2, τk ∈ (0, 2/L(f ))

xk = Sτk (xk−1 − τk∇f (xk−1)).
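A minimal ISTA sketch in Python (added for illustration, not part of the slides). It uses the common scaling f(x) = 1/2 ‖Ax − b‖_2^2 and g(x) = λ‖x‖_1, so the threshold level is λτ; the data A, b and the parameter λ are hypothetical.

import numpy as np

def ista(A, b, lam, n_iter=500):
    L = np.linalg.norm(A, 2) ** 2                      # Lipschitz constant of grad f
    tau = 1.0 / L                                      # step size in (0, 2/L(f))
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - tau * A.T @ (A @ x - b)                # forward (gradient) step
        x = np.sign(z) * np.maximum(np.abs(z) - lam * tau, 0.0)   # backward (prox) step
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[[3, 20, 77]] = [1.0, -2.0, 0.5]
b = A @ x_true
x_hat = ista(A, b, lam=0.1)
print(np.nonzero(np.abs(x_hat) > 1e-3)[0])             # support should be close to {3, 20, 77}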

Proximal gradient:

x_k = prox_{τ_k g}(x_{k−1} − τ_k ∇f(x_{k−1}))
    = arg min_x { g(x) + 1/(2τ_k) ‖x − (x_{k−1} − τ_k ∇f(x_{k−1}))‖^2 }.

Gradient Projection: g(x) = IC(x): smooth constrained
minimisation, τk ∈ (0, 2/L(f ))

xk = ΠC(xk−1 − τk∇f (xk−1)).

Iterative Shrinkage Thresholding Algorithm (ISTA):
g(x) = ‖x‖1, f (x) = ‖Ax − b‖2, τk ∈ (0, 2/L(f ))

xk = Sτk (xk−1 − τk∇f (xk−1)).

Slow convergence: if τ_k = τ = 1/L with L ≥ L(f) and F(x) := f(x) + g(x), then

F(x_k) − F∗ ≤ L ‖x_0 − x∗‖^2 / (2k).


Fast Iterative Shrinkage Thresholding Algorithm (FISTA):

Initialize: y1 := x0 ∈ E, τ1 = 1.

Step k:
x_k = prox_{g/L}( y_k − (1/L) ∇f(y_k) )
τ_{k+1} = ( 1 + √(1 + 4τ_k^2) ) / 2
y_{k+1} = x_k + ((τ_k − 1)/τ_{k+1}) (x_k − x_{k−1}).

Convergence: if τ_k = τ = 1/L with L ≥ L(f) and F(x) := f(x) + g(x), then

F(x_k) − F∗ ≤ 2L ‖x_0 − x∗‖^2 / (k + 1)^2.

Fast Projection Gradient [Nesterov '83]: g(x) = I_C(x)

x_k = Π_C( y_k − (1/L) ∇f(y_k) ).

For more details on Nesterov's algorithm see e.g. http://www.seas.ucla.edu/~vandenbe/236C/lectures/fgrad.pdf
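For illustration, a minimal FISTA sketch (not from the slides) following the updates above, again for a hypothetical lasso problem min_x 1/2 ‖Ax − b‖_2^2 + λ‖x‖_1:

import numpy as np

def fista(A, b, lam, n_iter=500):
    L = np.linalg.norm(A, 2) ** 2                      # L >= L(f) for f = 0.5*||Ax - b||^2
    x_prev = np.zeros(A.shape[1]); y = x_prev.copy(); t = 1.0
    for _ in range(n_iter):
        z = y - A.T @ (A @ y - b) / L                  # gradient step from y_k
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)    # prox_{g/L}
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0       # tau_{k+1}
        y = x + (t - 1.0) / t_next * (x - x_prev)                # momentum step
        x_prev, t = x, t_next
    return x_prev

# With the same hypothetical (A, b) as in the ISTA sketch, fista(A, b, 0.1)
# typically reaches a given accuracy in far fewer iterations, in line with
# the O(1/k^2) versus O(1/k) rates.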



Review: Optimisation with equality constraints

Let f : Rn → R ∪ {+∞}, closed, proper and convex.

Primal problem

min_x f(x)   subject to Ax = b

Lagrangian
L(x , y) = f (x) + yT (Ax − b)

Dual function

g(y) = inf_x L(x, y) = −f∗(−A^T y) − b^T y

y : dual variable (Lagrange multiplier),
f ∗: convex conjugate of f ( f ∗ is convex and closed even if f is
not).

Dual problem (always concave; its optimal value satisfies g(y∗) ≤ f(x∗), with equality if strong duality holds)

max_y g(y).        (3)


Gradient methods

Gradient descent for primal problem (assuming f continuously
differentiable)

xk+1 = xk − τk∇f (xk).

Gradient ascent for dual problem (assuming g continuously differentiable)

x_{k+1} = arg min_x L(x, y_k)
y_{k+1} = y_k + τ_k (A x_{k+1} − b),   where A x_{k+1} − b = ∇g(y_k)

Remark: the primal update is part of evaluation of ∇g(yk)

+ for separable f it leads to a parallel algorithm.

– various conditions are necessary for convergence, e.g. strict convexity of f, f(x) < ∞ ∀x.

Augmented Lagrangian

L_ρ(x, y) = f(x) + y^T(Ax − b) + ρ/2 ‖Ax − b‖_2^2,   ρ > 0        (AL)
Equivalent to Lagrangian of an equivalent problem (for all feasible
x the quadratic term equals 0)

min_x f(x) + ρ/2 ‖Ax − b‖_2^2   subject to Ax = b.

Method of multipliers (MM): dual ascent applied to (AL)

x_{k+1} = arg min_x L_ρ(x, y_k)
y_{k+1} = y_k + ρ (A x_{k+1} − b),   where A x_{k+1} − b = ∇g(y_k)

Using ρ as the step size guarantees dual feasibility of (x_{k+1}, y_{k+1}):

0 = ∇_x L_ρ(x_{k+1}, y_k) = [ ∇f(x) + A^T y_k + ρ A^T(Ax − b) ]|_{x = x_{k+1}} = ∇f(x_{k+1}) + A^T y_{k+1} =: s_{k+1} = 0.
+ converges under more general conditions
– augmented Lagrangian is non-separable.
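A minimal sketch of the method of multipliers above (added for illustration), assuming f is a strictly convex quadratic so that the x-update has a closed form; all data below are hypothetical.

import numpy as np

def method_of_multipliers(Q, c, A, b, rho=1.0, n_iter=200):
    # min 0.5 x^T Q x + c^T x  subject to  Ax = b
    y = np.zeros(A.shape[0])
    for _ in range(n_iter):
        # x-update: minimise L_rho(x, y), i.e. solve (Q + rho A^T A) x = -(c + A^T(y - rho b))
        x = np.linalg.solve(Q + rho * A.T @ A, -(c + A.T @ (y - rho * b)))
        # y-update with step size rho (keeps (x_{k+1}, y_{k+1}) dual feasible)
        y = y + rho * (A @ x - b)
    return x, y

rng = np.random.default_rng(2)
n, p = 6, 2
M = rng.standard_normal((n, n)); Q = M @ M.T + np.eye(n)
c, A, b = rng.standard_normal(n), rng.standard_normal((p, n)), rng.standard_normal(p)
x, y = method_of_multipliers(Q, c, A, b)
print(np.linalg.norm(A @ x - b))       # primal residual tends to 0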


Alternating Directions Methods of Multipliers (ADMM)

Blend separability of dual ascent with superior convergence of MM:

min_{x ∈ R^n, z ∈ R^m} f(x) + g(z)   subject to Ax + Bz = c        (4)

with A ∈ Rp×n,B ∈ Rp×m, c ∈ Rp and g : Rm → R ∪ {+∞}
closed, proper and convex.

The equality constraint comes from the split of the variable into x
and z with the objective function separable across the splitting.

Augmented Lagrangian

L_ρ(x, z, y) = f(x) + g(z) + y^T(Ax + Bz − c) + ρ/2 ‖Ax + Bz − c‖_2^2,   ρ > 0

ADMM

x_{k+1} = arg min_x L_ρ(x, z_k, y_k)
z_{k+1} = arg min_z L_ρ(x_{k+1}, z, y_k)
y_{k+1} = y_k + ρ (A x_{k+1} + B z_{k+1} − c).


Alternating Directions Methods of Multipliers (ADMM)

Blend separability of dual ascent with superior convergence of MM:

min_{x ∈ R^n, z ∈ R^m} f(x) + g(z)   subject to Ax + Bz = c        (4)

with A ∈ Rp×n,B ∈ Rp×m, c ∈ Rp and g : Rm → R ∪ {+∞}
closed, proper and convex.

The equality constraint comes from the split of the variable into x
and z with the objective function separable across the splitting.

Augmented Lagrangian

L_ρ(x, z, y) = f(x) + g(z) + y^T(Ax + Bz − c) + ρ/2 ‖Ax + Bz − c‖_2^2,   ρ > 0

Dual ascent on Lρ (joint minimisation)

(x_{k+1}, z_{k+1}) = arg min_{x,z} L_ρ(x, z, y_k)
y_{k+1} = y_k + ρ (A x_{k+1} + B z_{k+1} − c).


ADMM: scaled form

Augmented Lagrangian

L_ρ(x, z, y) = f(x) + g(z) + y^T(Ax + Bz − c) + ρ/2 ‖Ax + Bz − c‖_2^2,   ρ > 0
             = f(x) + g(z) + y^T r + ρ/2 ‖r‖_2^2,   with r := Ax + Bz − c,
             = f(x) + g(z) + ρ/2 ‖r + u‖_2^2 − ρ/2 ‖u‖_2^2,

with u = (1/ρ)y the scaled dual variable.

ADMM: scaled form

x_{k+1} = arg min_x f(x) + ρ/2 ‖A x + B z_k − c + u_k‖_2^2
z_{k+1} = arg min_z g(z) + ρ/2 ‖A x_{k+1} + B z − c + u_k‖_2^2
u_{k+1} = u_k + A x_{k+1} + B z_{k+1} − c.
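A minimal scaled-form ADMM sketch (added for illustration) for the lasso problem min 1/2 ‖Px − q‖_2^2 + λ‖z‖_1 with the splitting x − z = 0, i.e. A = I, B = −I, c = 0 in the notation above; the data P, q and λ are hypothetical.

import numpy as np

def admm_lasso(P, q, lam, rho=1.0, n_iter=200):
    n = P.shape[1]
    x = z = u = np.zeros(n)
    PtP_rhoI = P.T @ P + rho * np.eye(n)       # system matrix of the x-update, formed once
    Ptq = P.T @ q
    for _ in range(n_iter):
        x = np.linalg.solve(PtP_rhoI, Ptq + rho * (z - u))                 # x-update
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)    # z-update: soft threshold
        u = u + x - z                                                      # scaled dual update
    return z

rng = np.random.default_rng(3)
P = rng.standard_normal((50, 120))
x_true = np.zeros(120); x_true[[5, 30, 99]] = [1.0, -1.5, 2.0]
q = P @ x_true
print(np.nonzero(np.abs(admm_lasso(P, q, lam=0.1)) > 1e-3)[0])   # support close to {5, 30, 99}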


ADMM convergence

Assume in addition that the unaugmented Lagrangian L has a
saddle point.

For f , g proper, closed, convex, it follows that strong duality holds
(no explicit assumptions on A,B, c).

Under these assumptions the ADMM iterates satisfy

Residual convergence: r_k → 0 as k → ∞, i.e. the iterates approach feasibility.

Objective convergence: f(x_k) + g(z_k) → p⋆ as k → ∞, i.e. the objective value of the iterates approaches the optimal value.

Dual variable convergence: y_k → y⋆ as k → ∞, where y⋆ is a dual optimal point.

Note that x_k, z_k need not converge to optimal points, although such a result can be shown under additional assumptions.


Optimality conditions

Necessary and sufficient optimality conditions for ADMM

A x⋆ + B z⋆ − c = 0        primal feasibility
0 ∈ ∂f(x⋆) + A^T y⋆        dual feasibility
0 ∈ ∂g(z⋆) + B^T y⋆        dual feasibility

As for MM, it follows from zk+1 = arg minz Lρ(xk+1, z , yk) that
zk+1 and yk+1 always satisfy the last equation.

From x_{k+1} = arg min_x L_ρ(x, z_k, y_k) we have

0 ∈ ∂f(x_{k+1}) + A^T y_k + ρ A^T(A x_{k+1} + B z_k − c)
  = ∂f(x_{k+1}) + A^T(y_k + ρ r_{k+1} + ρ B(z_k − z_{k+1}))
  = ∂f(x_{k+1}) + A^T y_{k+1} + ρ A^T B(z_k − z_{k+1}),

or equivalently

s_{k+1} := ρ A^T B(z_{k+1} − z_k) ∈ ∂f(x_{k+1}) + A^T y_{k+1},

which can be interpreted as a dual feasibility condition; s_{k+1} is the dual residual at iteration k + 1.


Literature

S. Boyd, Stanford EE364b,
http://stanford.edu/class/ee364b/lectures.html

L. Vandenberghe, UCLA EE236C,
http://www.seas.ucla.edu/~vandenbe/ee236c.html

Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, S. Boyd et al., 2010

A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems, A. Beck, M. Teboulle, 2009

Fast Gradient-Based Algorithms for Constrained Total Variation Image Denoising and Deblurring Problems, A. Beck, M. Teboulle, 2009

A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging, A. Chambolle, T. Pock, 2011

