代写 R C algorithm math matlab scala parallel graph OptimizationTheoryandPractice

OptimizationTheoryandPractice
Author(s) Imprint ISBN Permalink
Pages
Forst, Wilhelm; Hoffmann, Dieter Springer New York, 2010 9780387789774, 9780387789767
https://books.scholarsportal.info/en/read?id=/ ebooks/ebooks2/ springer/2011-02-17/2/9780387789774
35 to 86
Downloaded from Scholars Portal Books on 2018-08-30 Téléchargé de Scholars Portal Books sur 2018-08-30

2
Optimality Conditions
2.0 Introduction
2.1 Convex Sets, Inequalities
2.2 Local First-Order Optimality Conditions
Karush–Kuhn–Tucker Conditions Convex Functions
Constraint Qualifications
Convex Optimization Problems
2.3 Local Second-Order Optimality Conditions 2.4 Duality
Lagrange Dual Problem
Geometric Interpretation Saddlepoints and Duality Perturbation and Sensitivity Analysis Economic Interpretation of Duality Strong Duality
Exercises
2.0 Introduction
In this chapter we will focus on necessary and sufficient optimality conditions for constrained problems.
As an introduction let us remind ourselves of the optimality conditions for unconstrained and equality constrained problems, which are commonly dealt with in basic Mathematics lectures.
We consider a real-valued function f : D −→ R with domain D ⊂ Rn and define, as usual, for a point x0 ∈ D:
1) f has a local minimum in x0
:⇐⇒∃U∈Ux0 ∀x∈U∩Df(x)≥f(x0)
W. Forst and D. Hoffmann, Optimization—Theory and Practice, 35 Springer Undergraduate Texts in Mathematics and Technology,
DOI 10.1007/978-0-387-78977-4 2, ⃝c Springer Science+Business Media, LLC 2010
Chapter 2

36 Optimality Conditions
2) f has a strict local minimum in x0
:⇐⇒ ∃U∈Ux0 ∀x∈U∩D\{x0} f(x)>f(x0)
3) f has a global minimum in x0
:⇐⇒ ∀x∈D f(x)≥f(x0)
4) f has a strict global minimum in x0
:⇐⇒ ∀x∈D\{x0} f(x)>f(x0)
Here, Ux0 denotes the neighborhood system of x0 .
We often say “x0 is a local minimizer of f” or “x0 is a local minimum point of f” instead of “f has a local minimum in x0” and so on. The minimizer is a point x0 ∈ D, the minimum is the corresponding value f(x0).
Necessary Condition

Suppose that the function f has a local minimum in x0 ∈ D, that is, in an interior point of D. Then:
a) If f is differentiable in x0, then ∇f(x0) = 0 holds.
b) If f is twice continuously differentiable in a neighborhood of x0, then the
2 􏰦∂2f 􏰧
Hessian Hf(x0) = ∇ f(x0) = ∂xν∂xμ (x0) is positive semidefinite.
We will use the notation f′(x0) (to denote the derivative of f at x0; as we know, this is a linear map from Rn to R, read as a row vector) as well as the corresponding transposed vector ∇f(x0) (gradient, column vector).

Points x ∈ D with ∇f(x) = 0 are called stationary points. At a stationary point there can be a local minimum, a local maximum or a saddlepoint. To determine that there is a local minimum at a stationary point, we use the following:
Sufficient Condition
Suppose that the function f is twice continuously differentiable in a neigh- borhood of x0 ∈ D; also suppose that the necessary optimality condition ∇f(x0) = 0 holds and that the Hessian ∇2f(x0) is positive definite. Then f has a strict local minimum in x0.
The proof of this proposition is based on the Taylor theorem and we regard it as known from Calculus. Let us recall that a symmetric (n, n)-matrix A is positive definite if and only if all principal subdeterminants
⎛a …a ⎞ 11 1k
d e t ⎜⎝ . . . . . . ⎟⎠ ( k = 1 , . . . , n ) ak1 …akk
Chapter 2

2.0 Introduction 37
are positive (cf. exercise 3).
Now let f be a real-valued function with domain D ⊂ Rn which we want to
minimize subject to the equality constraints
hj(x) = 0 (j = 1,…,p)
forp0
f > 0
–3 –2 –1 0 1 2

The aim of our further investigations will be to generalize the Lagrange Multiplier Rule to minimization problems with inequality constraints:

2.1 Convex Sets, Inequalities 39
f(x) −→ min subject to the constraints (P) gi(x) ≤ 0 for i ∈ I := {1,…,m}
hj(x)=0 for j∈E:={1,…,p} .
With m,p ∈ N0 (hence, E = ∅ or I = ∅ are allowed), the functions f, g1, . . . , gm, h1, . . . , hp are supposed to be continuously differentiable on an open subset D in Rn and p≤n. The set
􏰤􏰥
F:= x∈D|gi(x)≤0fori∈I,hj(x)=0forj∈E
— in analogy to the above — is called the feasible region or set of feasible
points of (P).
In most cases we state the problem in the slightly shortened form
⎧⎪⎨ f(x) −→ min
(P) ⎪gi(x)≤0fori∈I
⎩hj(x)=0forj∈E .
The optimal value v(P) to problem (P) is defined as v(P) := inf{f(x):x∈F}.
We allow v(P) to attain the extended values +∞ and −∞. We follow the standard convention that the infimum of the empty set is ∞. If there are feasible points xk with f(xk) −→ −∞ (k −→ ∞), then v(P) = −∞ and we say problem (P) — or the function f on F — is unbounded from below.
We say x0 is a minimal point or a minimizer if x0 is feasible and f(x0) = v(P). In order to formulate optimality conditions for (P), we will need some simple
tools from Convex Analysis. These will be provided in the following section. 2.1 Convex Sets, Inequalities
In the following consider the space Rn for n ∈ N with the euclidean norm
and let C be a nonempty subset of Rn. The standard inner product or scalar
productonRnisgivenby⟨x,y⟩:=xTy=􏰚n xνyν for x,y∈Rn.The n ν=1 􏰑
euclidean norm of a vector x ∈ R is defined by ∥x∥ := ∥x∥2 := ⟨x, x⟩.
Definition
a ) C i s c a l l e d c o n v e x : ⇐⇒ ∀ x 1 , x 2 ∈ C ∀ λ ∈ ( 0 , 1 ) ( 1 − λ ) x 1 + λ x 2 ∈ C b) C is called a cone (with apex 0) : ⇐⇒ ∀ x ∈ C ∀ λ > 0 λx ∈ C
Chapter 2

40 Optimality Conditions
Remark
C is a convex cone if and only if: ∀x1,x2∈C∀λ1,λ2>0 λ1×1+λ2×2∈C
Proposition 2.1.1 (Separating Hyperplane Theorem)
Let C be closed and convex, and b∈Rn \C. Then there exist p∈Rn \{0} and α∈R such that ⟨p,x⟩≥α>⟨p,b⟩ for all x∈C, that is, the hyper- plane defined by H := {x ∈ Rn | ⟨p,x⟩ = α} strictly separates C and b. If furthermore C is a cone, we can choose α = 0.
The following two little pictures show that none of the two assumptions that C is convex and closed can be dropped. The set C on the left is convex but not closed; on the right it is closed but not convex.
C
b
Proof: Since C is closed,
δ := δ(b,C) = inf ∥x−b∥:x∈C
is positive, and there exists a sequence (xk ) in C such that ∥xk − b∥ −→ δ . wlog let xk → q for a q ∈ Rn (otherwise use a suitable subsequence). Then q is in C with ∥p∥ = δ > 0 for p := q − b.
For x∈C and 0<τ <1 it holds that ∥p∥2 = δ2 ≤∥(1−τ)q+τx−b∥2 =∥q−b+τ(x−q)∥2 =∥p∥2 +2τ⟨x−q,p⟩+τ2∥x−q∥2. From this we obtain 0 ≤ 2⟨x−q,p⟩+τ∥x−q∥2 and after passage to the limit τ → 0 0 ≤ ⟨x−q,p⟩ . With α := δ2 + ⟨b,p⟩ the first assertion ⟨p,x⟩ ≥ α > ⟨p,b⟩ follows. If C is
a cone, then for all λ>0 and x∈C the vectors 1x and λx are also in C. λ
C
b
􏰤􏰥
Chapter 2

2.1 Convex Sets, Inequalities 41
Therefore ⟨p,x⟩ = λ􏱃p,1x􏱄≥λα holdsandconsequently ⟨p,x⟩≥0. λ
λ⟨p,x⟩ = ⟨p,λx⟩≥α shows 0≥α, hence, ⟨p,b⟩<α≤0. 􏰀 Definition 􏰢􏰣 Chapter 2 C∗:= y∈Rn|∀x∈C⟨y,x⟩≥0 is called the dual cone of C. C* C Remark C∗ is a closed, convex cone. We omit a proof. The statement is an immediate consequence of the definition of the dual cone. As an important application let us now consider the following situation: Let A = (a1,...,an) ∈ Rm×n be an (m,n)-matrix with columns a1,...,an ∈ Rm. Definition cone(A) := cone(a1,...,an) := ARn+ = {Aw | w ∈ Rn+} is called the (positive) conic hull of a1, . . . , an . Lemma 2.1.2 1) cone(A) is a closed, convex cone. 2) 􏰓cone(A)􏰔∗ = 􏰤y∈Rm |ATy ≥ 0􏰥 Proof: 1) It is obvious that Cn := cone(a1,...,an) is a convex cone. We will prove that it is closed by means of induction over n: Forn=1theconeC1 ={ξ1a1 |ξ1≥0}is—inthenontrivialcase— a closed half line. For the induction step from n to n + 1 we assume that 42 Optimality Conditions every conic hull generated by not more than n vectors is closed. Firstly, consider the case that −aj ∈cone(a1,...,aj−1,aj+1,...,an+1) forall j=1,...,n+1. It follows that Cn+1 = span{a1,...,an+1} and therefore obviously that Cn+1 is closed: The inclusion from left to right is trivial, and the other one follows, with ξ1,...,ξn+1 ∈ R from n+1 n+1 ξjaj = j=1 j=1 Otherwise, assume wlog −an+1 ∈/ with ξ1,...,ξn+1 ∈R+. Then ξn+1 ≤ ∥x∥ δ 􏰛􏰛 the induction hypothesis, Cn is closed and therefore δ := δ(−an+1,Cn) is positive. Every x ∈ Cn+1 can be written in the form x = 􏰚n+1 ξj aj ∥x∥ = ξn+1 􏱅 − an+1 − ξ aj 􏱅 ≥ ξn+1 δ . j=1 n+1 Let (x(k)) beasequenceinCn+1 and x∈Rm with x(k) →x for k→∞. |ξj|sign(ξj)aj . cone (a1, . . . , an) = Cn ; because of j=1 holds because in the nontrivial case ξn+1 > 0 this follows directly from 􏱅􏱅 􏰛n ξj 􏱅􏱅
n+1
(k) 􏰛 (k)
x = ξj aj. j=1
As (x(k)) is a convergent sequence, there exists an M > 0 such that ∥x(k)∥≤M forall k∈N,andweget
0≤ξ(k) ≤M. n+1 δ
wlog let the sequence 􏰓ξ(k) 􏰔 be convergent (otherwise, consider a suit- n+1
able subsequence), and set ξn+1 := lim ξ(k) . So we have n+1
Cn ∋x(k) −ξ(k) an+1 −→x−ξn+1an+1 . n+1
By induction, Cn is closed, thus x − ξn+1an+1 is an element of Cn and consequently x is in Cn+1 .
􏰿 􏰾􏰽 􏱀 ∈Cn
We want to show x ∈ Cn+1: For k ∈ N there exist ξ(k),…,ξ(k) 1 n+1
such that
∈ R+
Chapter 2

2.1 Convex Sets, Inequalities 43
2) The definitions of cone(A) and of the dual cone give immediately:
􏰓cone(A)􏰔∗ =􏰤y∈Rm|∀v∈cone(A) ⟨v,y⟩≥0􏰥 =􏰤y∈Rm|∀w∈Rn+ ⟨Aw,y⟩≥0􏰥
=􏰤y∈Rm|∀w∈Rn 􏱃w,ATy􏱄≥0􏰥 +
=􏱈 􏰤 y ∈ R m | A T y ≥ 0 􏰥 􏰀
A crucial tool for the following considerations is the
Theorem of the Alternative (Farkas (1902))
For A ∈ Rm×n and b ∈ Rm the following are strong alternatives:
1 ) ∃ x ∈ R n+ A x = b
2) ∃y∈Rm ATy≥0 ∧ bTy<0 Proof: 1)=⇒¬2): For x∈Rn+ with Ax = b and y∈Rm with ATy≥0 we have bT y = xT AT y ≥ 0 . ¬1) ⇐= 2): C := cone(A) is a closed convex cone which does not contain the vector b: Following the addendum in the Separating Hyperplane Theorem there exists a y ∈ Rm with ⟨y,x⟩ ≥ 0 > ⟨y,b⟩ for all x ∈ C, in particular a Tν y = ⟨ y , a ν ⟩ ≥ 0 , t h a t i s , A T y ≥ 0 . 􏰀
If we illustrate the assertion, the theorem can be memorized easily: 1) means nothing but b ∈ cone(A). With the open ‘half space’
Hb :=􏰤y∈Rm|⟨y,b⟩<0􏰥 the condition 2) states that 􏰓cone(A)􏰔∗ and Hb have a common point. In the two-dimensional case, for example, we can illustrate the theorem with the following picture, which shows case 1): a1 b cone(A) a2 􏰓cone(A)􏰔∗ If you rotate the vector b out of cone(A), you get case 2). Hb Chapter 2 44 Optimality Conditions 2.2 Local First-Order Optimality Conditions We want to take up the minimization problem (P) from page 39 again and use the notation introduced there. For x0 ∈ F , the index set 􏰤􏰥 A(x0):= i∈I|gi(x0)=0 describes the inequality restrictions which are active at x0. The active constraints have a special significance: They restrict feasible corrections around a feasible point. If a constraint is inactive (gi(x0) < 0) at the feasible point x0, it is possible to move from x0 a bit in any direction without violating this constraint. Definition Let d ∈ Rn and x0 ∈ F. Then d is called the feasible direction of F at x0 : ⇐⇒ ∃ δ > 0 ∀ τ ∈ [ 0 , δ ] x 0 + τ d ∈ F .
A ‘small’ movement from x0 along such a direction gives feasible points. The set of all feasible directions of F at x0 is a cone, denoted by
Cfd(x0) .
Let d be a feasible direction of F at x0 . If we choose a δ according to the
definition, then we have
gi(x0 + τ d) = gi(x0) + τ gi′(x0)d + o(τ)
for i∈A(x0) and 0<τ ≤δ.Dividingby τ andpassingtothelimitas τ →0 gives gi′(x0)d ≤ 0. In the same way we get hj′(x0)d = 0 for all j ∈ E. Definition For any x0 ∈ F Cl(P,x0) := d∈Rn |∀i∈A(x0) gi′(x0)d≤0, ∀j∈E hj′(x0)d=0 is called the linearizing cone of (P) at x0. Hence, Cl(x0) := Cl(P,x0) contains at least all feasible directions of F at x0 : Cfd(x0) ⊂ Cl(x0) The linearizing cone is not only dependent on the set of feasible points F but also on the representation of F (compare Example 4). We therefore write more precisely Cl(P,x0). 􏰿 􏰾􏰽 􏱀 􏰿􏰾􏰽􏱀 ≤0 =0 􏰢􏰣 Chapter 2 2.2 Local First-Order Optimality Conditions 45 Definition For any x0 ∈ D 􏰢􏰣 Cdd(x0):= d∈Rn|f′(x0)d<0 is called the cone of descent directions of f at x0 . Note that 0 is not in Cdd (x0); also, for all d ∈ Cdd (x0) f(x0 +τd) = f(x0)+τf′(x0)d + o(τ) 􏰿 􏰾􏰽 􏱀 <0 holds and therefore, f(x0 + τ d) < f(x0) for sufficiently small τ > 0.
Thus, d ∈ Cdd(x0) guarantees that the objective function f can be reduced along this direction. Hence, for a local minimizer x0 of (P) it necessarily holds that Cdd (x0 ) ∩ Cfd (x0 ) = ∅ .
We will illustrate the above definitions with the following
Example 1
Chapter 2
Let
F:= x=(x1,x2)T∈R2|x21+x2−1≤0,−x1≤0,−x2≤0,
􏰢􏰣
and f be defined by f(x) := x1 +x2. Hence, F is the part of the unit disk which lies in the first quadrant. The objective function f evidently attains a (strict, global) minimum at (0, 0)T .
In both of the following pictures F is colored in dark blue.
x0 := (0,0)T 1
0.5
x0 := (1,0)T
∇f(x0)
1
0.5
0
∇f(x0) 12
d
–1 1 d
–0.5
a) Let x0 := (0,0)T. g1(x) := x21 +x2 −1, g2(x) := −x1 and g3(x) := −x2 give A(x0) = {2, 3}. A vector d := (d1, d2)T ∈ R2 is a feasible direction

46
Optimality Conditions
of F at x0 if and only if d1 ≥ 0 and d2 ≥ 0 hold. Hence, the set Cfd(x0) of feasible directions is a convex cone, namely, the first quadrant, and it is represented in the left picture by the gray angular domain.g2′(x0) = (−1,0) and g3′(x0) = (0,−1) produce
Cl(x0) = 􏰤d∈R2 |−d1 ≤0,−d2 ≤0􏰥.
Hence, in this example, the linearizing cone and the cone of feasible direc- tions are the same. Moreover, the cone of descent directions Cdd(x0) — colored in light blue in the picture — is, because of f′(x0)d = (1,1)d = d1 + d2 , an open half space and disjoint to Cl(x0).
b)Ifx0 :=(1,0)T,wehaveA(x0)={1,3}andd:=(d1,d2)T ∈R2 isa feasible direction of F at x0 if and only if d = (0,0)T or d1 < 0 and d2 ≥ 0 hold. The set of feasible directions is again a convex cone. In the right picture it is depicted by the shifted gray angular domain. Because of g1′ (x0) = (2,0) and g3′ (x0) = (0,−1), we get 􏰢􏰣 Cl(x0) = d∈R2 |d1 ≤0,d2 ≥0 . As we can see, in this case the linearizing cone includes the cone of fea- sible directions properly as a subset. In the picture the cone of descent directions has also been moved to x0. We can see that it contains feasible directions of F at x0 . Consequently, f does not have a local minimum in x0. ▹ Proposition 2.2.1 For x0 ∈ F it holds that Cl(x0) ∩ Cdd (x0) = ∅ if and only if there exist λ∈Rm+ andμ∈Rp suchthat and 􏰛m ∇f(x0) + i=1 λi ∇gi(x0) + 􏰛p j=1 μj ∇hj(x0) = 0 (2) λigi(x0) = 0 for all i∈I. (3) Together, these conditions — x0 ∈ F, λ ≥ 0, (2) and (3) — are called Karush–Kuhn–Tucker conditions, or KKT conditions. (3) is called the complementary slackness condition or complementarity condition. This con- dition of course means λi = 0 or (in the nonexclusive sense) gi(x0) = 0 for all Chapter 2 2.2 Local First-Order Optimality Conditions 47 i ∈ I. A corresponding pair (λ,μ) or the scalars λ1,...,λm,μ1,...,μp are called Lagrange multipliers. The function L defined by 􏰛m L(x,λ,μ) := f(x)+ i=1 λigi(x)+ 􏰛p j=1 μjhj(x) = f(x)+λTg(x)+μTh(x) for x ∈ D, λ ∈ Rm+ and μ ∈ Rp is called the Lagrange function or Lagrangian of (P ) . Here we have combined the m functions gi to a vector-valued function g and respectively the p functions hj to a vector-valued function h. Points x0 ∈ F fulfilling (2) and (3) with a suitable λ ∈ Rm+ and μ ∈ Rp play an important role. They are called Karush–Kuhn–Tucker points, or KKT points. Owing to the complementarity condition (3), the multipliers λi corresponding to inactive restrictions at x0 must be zero. So we can omit the terms for i ∈ I \ A(x0) from (2) and rewrite this condition as 􏰛 􏰛p ∇f(x0)+ λi∇gi(x0)+ μj∇hj(x0) = 0. (2′) i∈A(x0) j=1 Proof: By definition of Cl(x0) and Cdd (x0) it holds that: d∈Cl(x0)∩Cdd(x0) ⇐⇒ ⇐ ⇒ ⎧⎪⎨ f ′ ( x 0 ) d < 0 ⎪∀i∈A(x0) gi′(x0)d≤0 ⎩∀j∈E hj′(x0)d=0 ⎧⎪ ⎪ ⎪ f ′ ( x 0 ) d < 0 ⎨∀i∈A(x0) −gi′(x0)d≥0 ⎪ ⎪ ⎪⎩ ∀ j ∈ E − h j ′ ( x 0 ) d ≥ 0 ∀j∈E hj′(x0)d≥0 With that the Theorem of the Alternative from section 2.1 directly provides the following equivalence: Cl(x0)∩Cdd(x0) = ∅ if and only if there exist λi ≥ 0 for i ∈ A(x0) and μ′ ≥ 0, μ′′ ≥ 0 for j ∈ E such that jj with 􏰛m ∇f(x0) + i=1 􏰛p j=1 􏰛 􏰛p 􏰛p ∇f(x ) = λ (−∇g(x ))+ μ′ (−∇h (x ))+ μ′′∇h (x ). 0ii0jj0jj0 i∈A(x0 ) j=1 j=1 If we now set λ := 0 for i ∈ I \ A(x ) and μ := μ′ − μ′′ for j ∈ E, the i0jjj above is equivalent to: There exist λi ≥ 0 for i ∈ I and μj ∈ R for j ∈ E λi ∇gi(x0) + μj ∇hj(x0) = 0 Chapter 2 48 Optimality Conditions and λigi(x0)=0 forall i∈I. 􏰀 So now the question arises whether not just Cfd (x0 ) ∩ Cdd (x0 ) = ∅ , but even Cl(x0) ∩ Cdd (x0) = ∅ is true for any local minimizer x0 ∈ F. The following simple example gives a negative answer to this question: Example 2 (Kuhn–Tucker (1951)) For n=2 and x = (x1,x2)T ∈R2 =: D let f(x) := −x1, g1(x) := x2 +(x1 −1)3, g2(x) := −x1 and g3(x) := −x2. Forx0 :=(1,0)T,m=3andp=0wehave: ∇f(x0) = (−1, 0)T , ∇g1(x0) = (0, 1)T , ∇g2(x0) = (−1, 0)T and ∇g3(x0) = (0,−1)T. 􏰤 T 2 􏰥 SinceA(x0)={1,3},wegetCl(x0)= (d1,d2) ∈R|d2=0,as well as Cdd(x0) = 􏰤(d1,d2)T ∈ R2 | d1 > 0􏰥; evidently, Cl(x0) ∩ Cdd(x0) is nonempty. However, the function f has a minimum at x0 subject to the given constraints.
1
x2
F
0• 0 0.5 1
x1

Cl(x0)∩Cdd(x0)=∅ ⇐⇒ ∀d∈Cl(x0) ⟨∇f(x0),d⟩ = f′(x0)d≥0
⇐⇒ ∇f(x0)∈Cl(x0)∗ 􏰀
The cone Cfd(x0) of all feasible directions is too small to ensure general optimality conditions. Difficulties may occur due to the fact that the boundary of F is curved.
Therefore, we have to consider a set which is less intuitive but bigger and with more suitable properties. To attain this goal, it is useful to state the concept of being tangent to a set more precisely:
Definition
A sequence (xk) converges in direction d to x0
: ⇐⇒ x k = x 0 + α k ( d + r k ) w i t h α k ↓ 0 a n d r k → 0 .
Lemma 2.2.2
For x0 ∈ F it holds that: Cl(x0) ∩ Cdd (x0) = ∅ ⇐⇒ ∇f(x0) ∈ Cl(x0)∗ Proof:
Chapter 2

2.2 Local First-Order Optimality Conditions 49
d
We will use the following notation: xk −→ x0
d
xk −→ x0 simply means: There exists a sequence of positive numbers (αk)
􏰢n Nd􏰣 Ct(M,x0):= d∈R |∃(xk)∈M xk−→x0
is called the tangent cone of M at x0 . The vectors of Ct (M, x0) are called tangents or tangent directions of M at x0 .
Of main interest is the special case
Ct(x0):= Ct(F,x0).
Example 3
a) The following two figures illustrate the cone of tangents for 􏰢􏰣
F:= x=(x1,x2)T∈R2|x1≥0,x21≥x2≥x21(x1−1)
and the points x0 ∈ 􏰤(0, 0)T , (2, 4)T , (1, 0)T 􏰥. For convenience the origin is translated to x0 . The reader is invited to verify this:
such that αk ↓ 0 and
αk
Chapter 2
1 ( x k − x 0 ) −→ d f o r k −→ ∞ . Let M be a nonempty subset of Rn and x0 ∈ M. Then
Definition
4
2
4
2
x0 = (0,0)T and x0 = (2,4)T
x0 = (1,0)T
0201 b)F:=􏰤x∈Rn|∥x∥2=1􏰥: Ct(x0)=􏰤d∈Rn|⟨d,x0⟩=0􏰥
c)F:=􏰤x∈Rn|∥x∥2≤1􏰥:ThenCt(x0)=Rn if∥x0∥2<1holds,and Ct(x0) = 􏰤d∈Rn |⟨d,x0⟩≤0􏰥 if ∥x0∥2 = 1. These assertions have to be proven in exercise 10. ▹ 50 Optimality Conditions Lemma 2.2.3 1) Ct(x0) is a closed cone, 0 ∈ Ct(x0). 2) Cfd(x0) ⊂ Ct(x0) ⊂ Cl(x0) Proof: The proof of 1) is to be done in exercise 9. 2) First inclusion: As the tangent cone Ct(x0) is closed, it is sufficient to show the inclusion Cfd(x0) ⊂ Ct(x0). For d ∈ Cfd(x0) and ‘large’ integers kitholdsthatx0+1d∈F.Withαk :=1 andrk :=0thisshows kk Second inclusion: Let d ∈ Ct (x0 ) and (xk ) ∈ F N be a d ∈ Ct(x0). xk =x0+αk(d+rk),αk↓0andrk→0.Fori∈A(x0) sequence with gi(xk) = gi(x0) +αk gi′(x0)(d + rk) + o(αk) 􏰿 􏰾􏰽 􏱀 􏰿 􏰾􏰽 􏱀 ≤0 =0 p r o d u c e s t h e i n e q u a l i t y g i′ ( x 0 ) d ≤ 0 . I n t h e s a m e w a y w e g e t h j′ ( x 0 ) d = 0 for j ∈ E. 􏰀 Now the question arises whether Ct (x0) = Cl(x0) always holds. The following example gives a negative answer: Example4 􏰤 2 3 􏰥 T a)ConsiderF:= x∈R|−x1+x2≤0,−x2≤0 andx0:=(0,0). In this case A(x0) = {1,2}. This gives Cl(x0) = 􏰤d ∈ R2 | d2 = 0􏰥 and Ct(x0) = 􏰤d ∈ R2 | d1 ≥ 0 , d2 = 0􏰥. The last statement has to be shown in exercise 10. b ) N o w l e t F : = 􏰤 x ∈ R 2 | − x 31 + x 2 ≤ 0 , − x 1 ≤ 0 , − x 2 ≤ 0 􏰥 a n d x0 := (0,0)T. Then A(x0) = {1,2,3} and therefore Cl(x0) = 􏰤d ∈ R2 | d1 ≥ 0 , d2 = 0􏰥 = Ct(x0). Hence, the linearizing cone is dependent on the representation of the set of feasible points F which is the same in both cases! 1.5 1 x2 0.5 0• x0 1 x 1 ▹ Chapter 2 2.2 Local First-Order Optimality Conditions 51 Lemma 2.2.4 For a local minimizer x0 of (P) it holds that ∇f(x0) ∈ Ct(x0)∗, hence Cdd(x0)∩Ct(x0) = ∅. Geometrically this condition states that for a local minimizer x0 of (P) the angle between the gradient and any tangent direction, especially any feasible direction, does not exceed 90◦. Proof: Let d ∈ Ct(x0). Then there exists a sequence (xk) ∈ FN such that xk = x0 +αk(d+rk), αk ↓0 and rk −→0. 0 ≤ f(xk) − f(x0) = αkf ′(x0)(d + rk) + o(αk) gives the result f ′(x0)d ≥ 0. 􏰀 The principal result in this section is the following: Theorem 2.2.5 (Karush–Kuhn–Tucker) Suppose that x0 is a local minimizer of (P), and the constraint qualifica- tion1 Cl(x0)∗ = Ct (x0)∗ is fulfilled. Then there exist vectors λ ∈ Rm+ and μ ∈ Rp such that 􏰚m ∇f(x0) + 􏰚p j=1 λi ∇gi(x0) + λigi(x0) = 0 for i=1,...,m. Proof: If x0 is a local minimizer of (P), it follows from lemma 2.2.4 with the help of the presupposed constraint qualification that ∇f(x0)∈Ct(x0)∗ = Cl(x0)∗; lemma 2.2.2 yields Cl(x0) ∩ Cdd (x0) = ∅ and the latter together with propo- sition 2.2.1 gives the result. 􏰀 In the presence of the presupposed constraint qualification Ct(x0)∗ = Cl(x0)∗ the condition ∇f(x0) ∈ Ct (x0)∗ of lemma 2.2.4 transforms to ∇f(x0) ∈ Cl(x0)∗. This claim can be confirmed with the aid of a simple linear optimization problem: Example 5 (Kleinmichel (1975)) For x=(x1,x2)T ∈R2 weconsidertheproblem f(x):=x1+x2 −→min − x 31 + x 2 ≤ 1 x1 ≤ 1, −x2 ≤ 0 1 Guignard (1969) i=1 μj ∇hj(x0) = 0 and Chapter 2 52 Optimality Conditions 2 x2 1• F •0 –1 1 x1 and ask whether the feasible points x := (−1, 0)T 00 minimizers. (The examination of the picture shows immediately that this is not the case for x􏰷, and that the objective function f attains a (strict, global) 0 minimum at x0. But we try to forget this for a while.) We have A(x0) = {1,3}. In order to show that ∇f(x0) ∈ Cl(x0)∗, hence, f′(x0)d ≥ 0 for all d ∈ Cl(x0), we compute problem: min f ′(x0)d. 
So we have the following linear d∈Cl(x0) d1+d2 −→min −3d1 + d2 ≤ 0 −d2 ≤ 0 Evidently it has the minimal value 0; lemma 2.2.2 gives that Cl(x0) ∩ Cdd (x0) is empty. Following proposition 2.2.1 there exist λ1,λ3 ≥ 0 for x0 satisfying 􏰕1􏰖 􏰕−3􏰖 􏰕 0􏰖 􏰕0􏰖 1+λ1 1+λ3−1=0. The above yields λ1 = 1 , λ3 = 4 . For x􏰷 we have A(x􏰷) = {1}. In the same way as the above this leads to the 00 subproblem d1+d2 −→min d2 ≤ 0 whose objective function is unbounded; therefore C (x􏰷) ∩ Cdd (x􏰷) ̸= ∅. l00 So x􏰷 is not a local minimizer, but the point x remains as a candidate. ▹ 00 Convex Functions Convexity plays a central role in optimization. We already had some simple results from Convex Analysis in section 2.1. Convex optimization problems — the functions f and gi are supposed to be convex and the funcions hj affinely linear — are by far easier to solve than general nonlinear problems. These assumptions ensure that the problems are well-behaved. They have two significant properties: A local minimizer is always a global one. The KKT conditions are sufficient for optimality. A special feature of strictly convex functions is that they have at most one minimal point. But convex functions also play an important role in problems that are not convex. Therefore a simple and short treatment of convex functions is given here: and x􏰷 := (0, 1)T are local 33 Chapter 2 2.2 Local First-Order Optimality Conditions 53 Definition Let D ⊂ Rn be nonempty and convex. A real-valued function f defined on at least D is called convex on D if and only if 􏰓􏰔 f (1−τ)x+τy ≤ (1−τ)f(x)+τf(y) holds for all x,y ∈ D and τ ∈ (0, 1). f is called strictly convex on D if and only if 􏰓􏰔 f (1−τ)x+τy < (1−τ)f(x)+τf(y) for all x,y ∈ D with x ̸= y and τ ∈ (0,1). The addition “on D” will be omitted, if D is the domain of definition. We say f is concave (on D) iff −f is convex, and strictly concave (on D) iff −f is strictly convex. For a concave function the line segment joining two points on the graph is never above the graph. Let D ⊂ Rn be nonempty and convex and f : D −→ R a convex function. Properties 1) If f attains a local minimum at a point x∗ ∈ D, then f(x∗) is the global minimum. ◦ 2) f is continuous in D. 3) The function φ defined by φ(τ) := f(x+τh)−f(x) for x ∈ D◦ , h ∈ Rn and sufficiently small, positive τ is isotone, that is, order-preserving. 4) For D open and a differentiable f it holds that f(y)−f(x) ≥ f ′(x)(y−x) for all x,y ∈ D. With the function f defined by f(x) := 0 for x ∈ [0, 1) and f(1) := 1 we can see that assertion 2) cannot be extended to the whole of D. Proof: 1) If there existed an x ∈ D such that f(x) < f(x∗), then we would have f((1−τ)x∗ +τx)≤(1−τ)f(x∗)+τf(x) 0 such that for
K := {h∈Rn |∥h∥∞ ≤ρ}
τ
Chapter 2

54
Optimality Conditions

itholdsthatx0+K⊂D.Evidently,thereexistm∈Nanda1,…,am ∈Rn with K = conv(a1, . . . , am) (convex hull). Every h ∈ K may be repre- sented as h = 􏰚mμ=1 γμ aμ with γμ ≥ 0 satisfying 􏰚mμ=1 γμ = 1. With
􏰨max{|ψ(aμ)| | μ = 1,…,m}, if positive α := 1 , otherwise
we have ψ(h) ≤ 􏰚mμ=1 γμ ψ(aμ) ≤ α. Now let ε ∈ (0, α]. Then firstly for all h∈Rn with ∥h∥∞ ≤ερ/α we have
􏰦􏰓 ε􏰔 ε􏰓α 􏰔􏰧 ε 􏰓α 􏰔 ψ(h)=ψ1−α0+αεh ≤αψεh≤ε
and therefore with
0=ψ(0)=ψ 2h−2h ≤2ψ(h)+2ψ(−h)
􏰦11􏰧1 1 ψ(h) ≥ −ψ(−h) ≥ −ε , hence, all together |ψ(h)| ≤ ε .
3)
Since f is convex, we have
f(x+τ0h) = f􏰓􏰓1− τ0􏰔x+ τ0(x+τ1h)􏰔
τ1 τ1
≤ 􏰓1− τ0􏰔f(x)+ τ0f(x+τ1h)
4)
f(x+τ0h)−f(x) ≤ f(x+τ1h)−f(x). τ0 τ1
This follows directly from 3) (with h = y − x):
f′(x)h = lim f(x+τh)−f(x) ≤ f(x+h)−f(x) 􏰀
τ1 τ1 for 0 < τ0 < τ1 . Transformation leads to τ→0+ τ 1 Constraint Qualifications The condition Cl(x0)∗ = Ct(x0)∗ is very abstract, extremely general, but not easily verifiable. Therefore, for practical problems, we will try to find regularity assump- tions called constraint qualifications (CQ) which are more specific, easily verifiable, but also somewhat restrictive. For the moment we will consider the case that we only have inequality con- 􏱊􏱌 straints. Hence, and I = {1,...,m} with an m ∈ N0. Linear con- 􏱍 straints pose fewer problems than nonlinear constraints. Therefore, we will assume the partition I = I1 ⊎I2. E=∅ 􏱋 Chapter 2 2.2 Local First-Order Optimality Conditions 55 If and only if i ∈ I2 let gi(x) = aTi x−bi with suitable vectors ai and bi , that is, gi is ‘linear’, more precisely affinely linear. Corresponding to this partition, we will also split up the set of active constraints A(x0 ) for x0 ∈ F into Aj(x0) := Ij ∩A(x0) for j=1,2. We will now focus on the following Constraint Qualifications: (GCQ) Guignard Constraint Qualification: Cl(x0)∗ = Ct (x0)∗ (ACQ) Abadie Constraint Qualification: Cl(x0) = Ct (x0) (MFCQ) Mangasarian–Fromovitz Constraint Qualification: 􏰙gi′(x0)d < 0 for i ∈ A1(x0) gi′(x0)d ≤ 0 for i ∈ A2(x0) (SCQ) Slater Constraint Qualification: The functions gi are convex for all i ∈ I and ∃x ̃∈F gi(x ̃)<0 fori∈I1. The conditions gi′(x0)d < 0 and gi′(x0)d ≤ 0 each define half spaces. (MFCQ) means nothing else but that the intersection of all of these half spaces is nonempty. We will prove (SCQ) =⇒ (MFCQ) =⇒ (ACQ) . The constraint qualification (GCQ) introduced in theorem 2.2.5 is a trivial consequence of (ACQ). Proof: (SCQ) =⇒ (MFCQ): From the properties of convex and affinely linear functions and the definition of A(x0) we get: gi′(x0)(x ̃ − x0) ≤ gi(x ̃) − gi(x0) = gi(x ̃) < 0 for i ∈ A1(x0) gi′(x0)(x ̃ − x0) = gi(x ̃) − gi(x0) = gi(x ̃) ≤ 0 for i ∈ A2(x0). (MFCQ)=⇒(ACQ): Lemma2.2.3givesthatCt(x0)⊂Cl(x0)and0∈Ct(x0) always hold. Therefore it remains to prove that Cl(x0) \ {0} ⊂ Ct (x0). So let d0 ∈ Cl(x0) \ {0}. Take d as stated in (MFCQ). Then for a sufficiently small λ>0wehaved0+λd̸=0.Sinced0 isinCl(x0),itfollowsthat
gi′(x0)(d0 + λd) < 0 for i ∈ A1(x0) and gi′(x0)(d0 + λd) ≤ 0 for i ∈ A2(x0). For the moment take a fixed λ. Setting u := d0+λd produces ∥d0 +λ d∥2 ∃ d ∈ Rn Chapter 2 56 Optimality Conditions gi(x0 + tu) = gi(x0) + t gi′(x0)u + o(t) for i ∈ A1(x0) and 􏰿􏰾􏰽􏱀 􏰿􏰾􏰽􏱀 =0 <0 gi(x0+tu)=gi(x0)+tgi′(x0)u for i∈A2(x0). =0 ≤0 􏰿􏰾􏰽􏱀 􏰿􏰾􏰽􏱀 Thus, we have gi(x0 + tu) ≤ 0 for i ∈ A(x0) and t > 0 sufficiently small.
For the indices i ∈ I \ A(x0) this is obviously true. Hence, there exists a
t0 > 0 such that x0 +tu ∈ F for 0 ≤ t ≤ t0. For the sequence (xk) t0 u
defined by xk := x0 + k u it holds that xk −→ x0. Therefore, u ∈ Ct(x0) and consequently d0 +λd ∈ Ct(x0). Passing to the limit as λ −→ 0 yields d0 ∈ Ct (x0 ) . Lemma 2.2.3 or respectively exercise 9 gives that Ct (x0 ) is closed. Hence, d0 ∈Ct(x0). 􏰀
Now we will consider the general case, where there may also occur equality constraints. In this context one often finds the following linear independence constraint qualification in the literature:
􏰓􏰔􏰓􏰔
(LICQ) The vectors ∇gi(x0) | i ∈ A(x0) and ∇hj(x0) | j ∈ E are linearly independent.
(LICQ) greatly reduces the number of active inequality constraints. Instead of (LICQ) we will now consider the following weaker constraint qualification which is a variant of (MFCQ), and is often cited as the Arrow–Hurwitz– Uzawa constraint qualification:
􏰙gi′(x0)d < 0 for i ∈ A(x0), (AHUCQ) Thereexistsad∈Rn suchthat hj′(x0)d=0 for j∈E, 􏰓􏰔 and the vectors ∇hj(x0) | j ∈ E are linearly independent. We will show: (LICQ) =⇒ (AHUCQ) =⇒ (ACQ) Proof: (LICQ) =⇒ (AHUCQ): (AHUCQ) follows, for example, directly from the solvability of the system of linear equations gi′(x0)d = −1 for i ∈ A(x0), hj′(x0)d=0 forj∈E. (AHUCQ) =⇒ (ACQ): Lemma 2.2.3 gives that again we only have to show d0 ∈ Ct (x0) for all d0 ∈ Cl(x0) \ {0}. Take d as stated in (AHUCQ). Then we have d0 + λd =: w ̸= 0 for a sufficiently small λ > 0 and thus
Denote
gi′(x0)w < 0 for i ∈ A(x0) and h j′ ( x 0 ) w = 0 f o r j ∈ E . Chapter 2 2.2 Local First-Order Optimality Conditions 57 A := 􏰓∇h1(x0),...,∇hp(x0)􏰔∈Rn×p . For that ATA is regular because rank(A) = p. Now consider the following system of linear equations dependent on u ∈ Rp and t ∈ R: φj(u,t) := hj(x0 + Au + tw) = 0 (j = 1,...,p) For the corresponding vector-valued function φ we have φ(0,0) = 0, and because of ∂φj(u,t)=h′(x +Au+tw)∇h(x), ∂ui j0 i0 we are able to solve φ(u, t) = 0 locally for u, that is, there exist a nullneigh- borhood U0 ⊂ R and a continuously differentiable function u: U0 −→ Rp satisfying u(0) = 0, hj(x0+Au(t)+tw)=0 for t∈U0 (j=1,...,p). Differentiation with respect to t at t = 0 leads to hj′(x0)􏰓Au′(0)+w􏰔=0 (j=1,...,p) and consequently — considering that hj′(x0)w = 0 and ATA is regular — to u′(0) = 0. Then for i ∈ A(x0) it holds that gi(x(t)) = gi(x0) + t gi′(x0)x′(0) + o(t) = t gi′(x0)􏰓Au′(0) + w􏰔 + o(t) . With u′(0) = 0 we obtain 􏰦 ′ o(t)􏰧 gi(x(t))=tgi(x0)w+ t and the latter is negative for t > 0 sufficiently small.
Hence, there exists a t1 >0 with x(t)∈F for 0≤t≤t1. From
􏰿 􏰾􏰽 􏱀 =: x(t)
􏰕t 􏰖 t 􏰦 u(t /k) 􏰧 x1 =x0+1w+A1
􏰓t1􏰔 w
by passing to the limit as λ → 0
k k t1/k
􏰿 􏰾􏰽 􏱀
−→0 (k→∞)
fork∈Nwegetx k −→x0;thisyieldsw=d0+λd∈Ct(x0)andalso
d0 ∈Ct(x0) = Ct(x0). 􏰀
Chapter 2

58 Optimality Conditions
Convex Optimization Problems
Firstly suppose that C ⊂ Rn is nonempty and the functions f, gi : C −→ R are arbitrary for i ∈ I . We consider the general optimization problem
􏰨 f(x) −→ min
(P) gi(x)≤0 for i∈I:={1,…,m} .
In the following section the Lagrangian L to (P) defined by
to a vector-valued function g.
Definition
Apair(x∗,λ∗)∈C×Rm+ iscalledasaddlepointofLifandonlyif
L(x∗,λ) ≤ L(x∗,λ∗) ≤ L(x,λ∗)
holds for all x ∈ C and λ ∈ Rm+ , that is, x∗ minimizes L(·,λ∗) and λ∗
maximizes L(x∗, · ). Lemma 2.2.6
If (x∗,λ∗) is a saddlepoint of L, then it holds that:
• x∗ is a global minimizer of (P). • L(x∗,λ∗) = f(x∗) •λ∗igi(x∗)=0 forall i∈I.
Proof: Let x∈C and λ∈Rm+. From
0 ≥ L(x∗,λ)−L(x∗,λ∗) = ⟨λ−λ∗,g(x∗)⟩ (4)
we obtain for λ := 0
⟨λ∗,g(x∗)⟩ ≥ 0. (5) With λ:=λ∗+ei weget—alsofrom(4)—
gi(x∗)≤0 forall i∈I,thatis,g(x∗)≤0. (6) Because of (6), it holds that ⟨λ∗ , g(x∗)⟩ ≤ 0. Together with (5) this produces
⟨λ∗,g(x∗)⟩=0 andhence, λ∗igi(x∗)=0 forall i∈I.
􏰛m i=1
λigi(x) = f(x)+⟨λ,g(x)⟩ for x∈C and λ∈Rm+
L(x,λ) := f(x)+
will play an important role. As usual we have combined the m functions gi
Chapter 2

2.2 Local First-Order Optimality Conditions 59
For x ∈ F it follows that
f(x∗) = L(x∗,λ∗) ≤ L(x,λ∗) = f(x)+⟨λ∗,g(x)⟩ ≤ f(x).
􏰿􏰾􏰽􏱀
Therefore x∗ is a global minimizer of (P ). ≤0 􏰀
We assume now that C is open and convex and the functions f, gi : C −→ R are continuously differentiable and convex for i ∈ I . In this case we write more precisely (CP) instead of (P).
Theorem 2.2.7
If the Slater constraint qualification holds and x∗ is a minimizer of (CP), then there exists a vector λ∗ ∈ Rm+ such that (x∗, λ∗) is a saddlepoint of L.
Proof: Taking into account our observations from page 55, theorem 2.2.5 gives that there exists a λ∗ ∈ Rm+ such that
0 = Lx(x∗,λ∗) and ⟨λ∗,g(x∗)⟩ = 0. With that we get for x ∈ C 1
and
L(x,λ∗)−L(x∗,λ∗) ≥ Lx(x∗,λ∗)(x−x∗) = 0
L(x∗,λ∗)−L(x∗,λ)=−􏱃 λ ,g(x∗)􏱄≥0. 􏰿􏰾􏰽􏱀 􏰿 􏰾􏰽 􏱀
≥0
Hence, (x∗, λ∗) is a saddlepoint of L.
The following example shows that the Slater constraint qualification is es-
sential in this theorem:
Example 6
With n = 1 and m = 1 we regard the convex problem 􏰨 f(x) := −x −→ min
≤0
􏰀
(P) g(x) := x2 ≤ 0 .
The only feasible point is x∗ = 0 with value f(0) = 0. So 0 minimizes f(x)
subject to g(x) ≤ 0.
L(x,λ) := −x+λx2 for λ ≥ 0,x ∈ R. There is no λ∗ ∈ [0,∞) such that
(x∗, λ∗) is a saddlepoint of L. ▹ The following important observation shows that neither constraint qualifica-
tions nor second-order optimality conditions, which we will deal with in the
1 By the convexity of f and gi the function L( · , λ∗) is convex.
Chapter 2

60 Optimality Conditions
next section, are needed for a sufficient condition for general convex optimiza- tion problems:
Suppose that f, gi, hj : Rn −→ R are continuously differentiable functions with f and gi convex and hj (affinely) linear (i ∈ I,j ∈ E), and consider the following convex optimization problem2
⎧⎨ f ( x ) − → m i n (CP) ⎩gi(x)≤0fori∈I
hj(x)=0 for j∈E .
We will show that for this special kind of problem every KKT point already
gives a (global) minimum: Theorem 2.2.8
Suppose x0 ∈ F and there exist vectors λ ∈ Rm+ and μ ∈ Rp such that
f(x) − f(x0)
≥ f ′(x0)(x − x0)
􏰚m ∇f(x0) +
􏰚p j=1
λi ∇gi(x0) +
λigi(x0) = 0 for i=1,…,m,
then (CP) attains its global minimum at x0.
The Proof of this theorem is surprisingly simple: Taking into account 4) on page 53, we get for x ∈ F :
f convex
= −
􏰚m′ 􏰚p′
i=1
λi gi (x0 ) (x − x0 ) − λi(gi(x)−gi(x0)) = −
μj hj (x0 ) (x − x0 ) j=1 􏰿 􏰾􏰽 􏱀
=hj(x)−hj(x0)=0
􏰚m i=1
With n=2,m=2and x=(x1,x2)T ∈D:=R2 weconsider:
2 Since the functions hj are assumed to be (affinely) linear, exercise 6 gives that this problem can be written in the form from page 58 by substituting the two inequalities hj (x) ≤ 0 and −hj (x) ≤ 0 for every equation hj (x) = 0 .
i=1 􏰚m
gi convex
μj ∇hj(x0) = 0 and
≥ −
λigi(x) ≥ 0 􏰀 The following example shows that even if we have convex problems the KKT condi-
tions are not necessary for minimal points:
Example 7
i=1
Chapter 2

2.3 Local Second-Order Optimality Conditions 61
⎧⎨ f ( x ) : = x 1 − → m i n
(P) ⎩ g1(x) := x21+(x2−1)2−1 ≤ 0 g2(x) := x21+(x2+1)2−1 ≤ 0
Obviously, only the point x0 := (0, 0)T is feasible. Hence, x0 is the (global) minimal point. Since ∇f(x0) = (1,0)T, ∇g1(x0) = (0,−2)T and ∇g2(x0) = (0, 2)T , the gradient condition of the KKT conditions is not met. f is linear, the functions gν are convex. Evidently, however, the Slater condition is not fulfilled.
x2 2
1
−1 x0 1
g1 (x)=0 ∇f (x0 ) x1
−1 g2 (x)=0
−2
Of course, one could also argue from proposition 2.2.1: The cones
Cdd(x0) = 􏰤d∈R2 |f′(x0)d < 0􏰥 = 􏰤d∈R2 |d1 <0􏰥 Cl(x0) = 􏰤d∈R2 |∀i∈A(x0) gi′(x0)d≤0􏰥 = 􏰤d∈R2 |d2 =0􏰥 are clearly not disjoint. ▹ 2.3 Local Second-Order Optimality Conditions To get a finer characterization, it is natural to examine the effects of second-order terms near a given point too. The following second-order results take the ‘curvature’ of the feasible region in a neighborhood of a ‘candidate’ for a minimizer into account. The necessary second-order condition sT Hs ≥ 0 and the sufficient second-order condition sT Hs > 0 for the Hessian H of the Lagrangian with respect to x regard only certain subsets of vectors s.
Suppose that the functions f, gi and hj are twice continuously differentiable.
and
∇g2 (x0 )
Chapter 2
∇g1 (x0 )

62 Optimality Conditions
Theorem 2.3.1 (Necessary second-order condition)
Suppose x0 ∈F and there exist λ∈Rm+ and μ∈Rp such that
􏰚p j=1
If (P) has a local minimum at x0, then
􏰦 􏰛m 􏰛p 􏰧
􏰚m ∇f(x0) +
μj∇hj(x0) = 0 and
λi ∇gi(x0) + λigi(x0)=0 forall i∈I.
sT ∇2f(x0) +
holds for all s ∈ Ct+ (x0), where
more clearly respectively
Proof: It holds that
∇xL(x0,λ,μ) = 0,
s T ∇ x2 x L ( x 0 , λ , μ ) s ≥ 0 .
i=1
i=1
λi ∇2gi(x0) + μj ∇2hj(x0) s ≥ 0 j=1
􏰤􏰥
F+ :=F+(x0):= x∈F|gi(x)=0 forall i∈A+(x0) with 􏰤􏰥
A+(x0) := i∈A(x0)|λi >0 and 􏰤nNd􏰥
Ct+(x0):=Ct(F+,x0)= d∈R |∃(xk)∈F+ xk−→x0 .
With the help of the Lagrangian L the second and fifth lines can be written
λigi(x)=0 forall x∈F+
because we have λi = 0 for i∈I\A+(x0) and gi(x) = 0 for i∈A+(x0),
respectively.
With the function φ defined by
􏰛m φ(x) := f(x) +
λigi(x) +
􏰛p
μjhj(x) = L(x,λ,μ)
i=1
for x ∈ D this leads to the following relation:
φ(x)=f(x) for x∈F+.
x0 gives a local minimum of f on F, therefore one of φ on F+ .
Now let s ∈ Ct+(x0). Then by definition of the tangent cone there exists a sequence (x(k)) in F+, such that x(k) = x0 +αk(s+rk), αk ↓ 0 and rk → 0.
j=1
Chapter 2

2.3 Local Second-Order Optimality Conditions 63
By assumption ∇φ(x0) = 0. With the Taylor theorem we get φ(x0) ≤ φ􏰓x(k)􏰔 = φ(x0) +αk φ′(x0)(s+rk)
􏰿 􏰾􏰽 􏱀
1=0 􏰓 􏰔
+2α2k(s+rk)T∇2φ x0 +τk(x(k) −x0) (s+rk) for all sufficiently large k and a suitable τk ∈ (0, 1) .
Dividing by α2k/2 and passing to the limit as k → ∞ gives the result
sT ∇2φ(x0)s ≥ 0. 􏰀
In the following example we will see that x0 := (0,0,0)T is a stationary point. With the help of theorem 2.3.1 we want to show that the necessary condition for a minimum is not met.
Example 8 f(x) := x3 − 1×21 −→ min 2
g 1 ( x ) : = − x 21 − x 2 − x 3 ≤ 0 g 2 ( x ) : = − x 21 + x 2 − x 3 ≤ 0 g3(x) := −x3 ≤ 0
For the point x0 := (0,0,0)T we have f′(x0) = (0,0,1), A(x0) = {1,2,3}
and g1′ (x0) = (0, −1, −1), g2′ (x0) = (0, 1, −1), g3′ (x0) = (0, 0, −1).
We start with the gradient condition:
⎛⎞ ⎛⎞ ⎛⎞ ⎛⎞⎛⎞
00000 ∇x L(x0,λ) = ⎝0⎠+λ1 ⎝−1⎠+λ2 ⎝ 1⎠+λ3 ⎝ 0⎠ = ⎝0⎠
1 −1 −1 −1 0
􏰙
⇐⇒
−λ1−λ2−λ3 =−1
⇐⇒ λ2 =λ1 , λ3 = 1−2λ1
For λ1 := 1/2 we obtain λ = (1/2,1/2,0)T ∈ R3+ and λigi(x0) = 0 for i ∈ I.
Hence, we get A+(x0) = {1, 2},
F+ = 􏰤x∈R3 |g1(x)=g2(x)=0, g3(x)≤0􏰥 = {(0,0,0)T}
and therefore Ct+(x0) = {(0,0,0)T}. In this way no decision can be made!
Setting λ1 := 0 we obtain respectively λ = e3, A+(x0) = {3}, 􏰤􏰥
F+={x∈F|x3=0},Ct+(x0)= αe1|α∈R and ⎛⎞
−1 0 0 H:=∇2f(x0)+∇2g3(x0)=⎝ 000⎠.
000
−λ1+λ2 =0
Chapter 2

64 Optimality Conditions
H is negative definite on Ct+ (x0). Consequently there is no local minimum of (P) at x0 = 0. ▹
In order to expand the second-order necessary condition to a sufficient condi- tion, we will now have to make stronger assumptions.
Before we do that, let us recall that there will remain a ‘gap’ between these two conditions. This fact is well-known (even for real-valued functions of one variable) and is usually demonstrated by the functions f2 , f3 and f4 defined by
fk(x) := xk for x ∈ R,k = 2,3,4,
The following Remark can be proven in the same way as 2) in lemma 2.2.3:
⎧⎪⎨ 􏰟 􏰟 􏰟 g i ′ ( x 0 ) s = 0 f o r i ∈ A + ( x 0 ) ⎫⎪⎬ Ct+(x0) ⊂ Cl+(x0) := ⎪s∈Rn 􏰟􏰟 gi′(x0)s ≤ 0 for i∈A(x0)\A+(x0)⎪ ⎩ h j′ ( x 0 ) s = 0 f o r j ∈ E ⎭
Theorem 2.3.2 (Sufficient second-order condition)
Suppose x0 ∈ F and there exist vectors λ ∈ Rm+ and μ ∈ Rp such that
∇x L(x0,λ,μ) = 0 and λTg(x0) = 0. Furthermore, suppose that
s T ∇ x2 x L ( x 0 , λ , μ ) s > 0
for all s∈Cl+(x0)\{0}. Then (P) attains a strict local minimum at x0.
Proof (indirect): If f does not have a strict local minimum at x0, then there exists a sequence (x(k)) in F \ {x0} with x(k) −→ x0 and f(x(k)) ≤ f(x0).
For s := x(k)−x0 it holds that ∥s ∥ = 1. Hence, there exists a convergent k ∥x(k)−x0∥2 k 2
subsequence. wlog suppose sk −→ s for an s ∈ Rn. With αk := ∥x(k) − x0∥2 we have x(k) = x0 + αksk and wlog αk ↓ 0. From
f(x0) ≥ f(x(k)) = f(x0) + αk f ′(x0)sk + o(αk)
at the point x0 = 0.
it follows that
For i∈A(x0) and j∈E wegetinthesameway:
f′(x0)s ≤ 0.
Chapter 2

2.3 Local Second-Order Optimality Conditions 65
gi(x(k)) = gi(x0) +αkgi′(x0)sk + o(αk) =⇒ gi′(x0)s ≤ 0 􏰿􏰾􏰽􏱀 􏰿􏰾􏰽􏱀
≤0 =0
h j ( x ( k ) ) = h j ( x 0 ) + α k h j′ ( x 0 ) s k + o ( α k ) =⇒ h j′ ( x 0 ) s = 0
=0 =0
With the assumption ∇x L(x0, λ, μ) = 0 it follows that
􏰿 􏰾􏰽 􏱀 􏰿 􏰾􏰽 􏱀
f′(x0)s + 􏰿 􏰾􏰽 􏱀
􏰛m 􏰚p
λigi′(x0)s + μj hj′(x0)s = 0
j=1 􏰿 􏰾􏰽 􏱀 ≤0 􏰿 􏰾􏰽 􏱀 =0
i=1
= 􏰚 λigi′(x0)s
and from that gi′(x0)s = 0 for all i ∈ A+(x0) .
Since ∥s∥2 = 1, we get s ∈ Cl+ (x0) \ {0}. For the function φ defined by
φ(x(k)) = f(x(k)) + λi gi(x(k)) + 􏰿 􏰾􏰽 􏱀 i=1 􏰿 􏰾􏰽 􏱀
i∈A+(x0) 􏰿 􏰾􏰽 􏱀 ≤0
􏰛m φ(x) := f(x) +
􏰛p j=1
μj hj(x(k)) ≤ f(x0) = φ(x0) j=1 􏰿 􏰾􏰽 􏱀
λigi(x) + it holds by assumption that ∇φ(x0) = 0.
μjhj(x) = L(x,λ,μ)
i=1
􏰛m 􏰛p
≤f(x0) ≤0 =0 The Taylor theorem yields
φ ( x ( k ) ) = φ ( x 0 ) + α k φ ′ ( x 0 ) s k + 1 α 2k s Tk ∇ 2 φ 􏰓 x 0 + τ k ( x ( k ) − x 0 ) 􏰔 s k 􏰿 􏰾􏰽 􏱀 2
=0
with a suitable τk ∈ (0, 1). From this we deduce, as usual, sT ∇2φ(x0)s ≤ 0. With s ∈ Cl + (x0 ) \ {0} we get a contradiction to our assumption. 􏰀
The following example gives a simple illustration of the necessary and sufficient second-order conditions of theorems 2.3.1 and 2.3.2:
Example 9 (Fiacco and McCormick (1968)) f(x) := (x1 −1)2 +x2 −→min
g1(x) := x1 − ρx2 ≤ 0
We are looking for a ρ > 0 such that x0 := (0,0)T is a local minimizer of the problem: With ∇f(x0) = (−2,0)T ,∇g1(x0) = (1,0)T the condition ∇xL(x0,λ,μ) = 0 firstly yields λ1 = 2.
In this case (MFCQ) is fulfilled with A1(x0) = A(x0) = {1} = A+(x0). We have
Chapter 2

66 Optimality Conditions
–2
2.4 Duality
–2
Cl+(x0) = {d∈R2 |d1 =0} = Ct+(x0). 􏰕􏰖􏰕􏰖􏰕􏰖
The matrix
∇2f(x0)+2∇2g1(x0) = 2 0 +2 0 0 = 2 1 0
0 2 0 −2ρ 0 1−2ρ
is negative definite on Ct + (x0 ) for ρ > 1/2 . Thus the second-order necessary condition of theorem 2.3.1 is violated and so there is no local minimum at x0 . For ρ < 1/2 the Hessian is positive definite on Cl + (x0 ). Hence, the sufficient conditions of theorem 2.3.2 are fulfilled and thus there is a strict local minimum at x0. When ρ = 1/2, this result is not determined by the second-order conditions; but we can confirm it in the following simple way: f(x) = (x1 −1)2 +x2 = x21 +1+(x2 −2x1). Because of x2 −2x1 ≥0 this yieldsf(x)≥1andf(x)=1onlyforx1 =0andx2−2x1 =0.Hence, there is a strict local minimum at x0. ρ = 1/4 ρ = 1 22 –113 –1013 Chapter 2 Duality plays a crucial role in the theory of optimization and in the development of corresponding computational algorithms. It gives insight from a theoretical point of view but is also significant for computational purposes and economic interpretations, for example shadow prices. We shall concentrate on some of the more basic results and limit ourselves to a particular duality — Lagrange duality — which is the most popular and useful one for many purposes. Given an arbitrary optimization problem, called primal problem, we consider a prob- lem that is closely related to it, called the Lagrange dual problem. Several prop- erties of this dual problem are demonstrated in this section. They help to provide strategies for solving the primal and the dual problem. The Lagrange dual problem of large classes of important nonconvex optimization problems can be formulated as an easier problem than the original one. ▹ 2.4 Duality 67 Lagrange Dual Problem Withn∈N,m,p∈N0,∅̸=C⊂Rn,functionsf:C−→R,g = (g1,...,gm)T: C −→ Rm, h = (h1,...,hp)T: C −→ Rp and the feasible region we regard the primal problem in standard form: gi(x) ≤ 0 or hj(x) = 0 can be included in the definition of the set C. Substituting the two inequalities hj(x) ≤ 0 and −hj(x) ≤ 0 for every equation hj(x) = 0 we can assume wlog p = 0. Then we have 􏰤􏰥 F= x∈C|g(x)≤0. The Lagrangian function L is defined as a weighted sum of the objective 􏰤􏰥 F:= x∈C|g(x)≤0,h(x)=0 􏰨 f(x) −→ min x∈F (P) There is a certain flexibility in defining a given problem: Some of the constraints function and the constraint functions, defined by T 􏰛m L(x,λ) := f(x) + λ g(x) = f(x) + ⟨λ,g(x)⟩ = f(x) + i=1 f o r x ∈ C a n d λ = ( λ 1 , . . . , λ m ) T ∈ R m+ . The vector λ is called the dual variable or multiplier associated with the problem. For i = 1, . . . , m we refer to λi as the dual variable or multiplier associated with the inequality constraint gi(x) ≤ 0. The Lagrange dual function, or dual function, φ is defined by φ(λ) := inf L(x, λ) x∈C on the effective domain of φ 􏰨􏱆 F D : = λ ∈ R m+ | i n f L ( x , λ ) > − ∞ . x∈C
The Lagrange dual problem, or dual problem, then is defined by 􏰨 φ(λ) −→ max
In the general case, the dual problem may not have a solution, even if the primal problem has one; conversely, the primal problem may not have a solution, even if the dual problem has one:
λi gi(x)
(D) λ ∈ FD .
Chapter 2

68 Optimality Conditions
Example 10
Forbothexampleslet C:=R,m:=1andp:=0:
􏰨 f(x) := x+2010 −→ min
a) (P) g(x) := 1×2 ≤0 2
1. x∗ := 0 is the only feasible point. Thus inf{f(x)| x∈F} = f(0) = 2010.
2. L(x,λ) := f(x)+λg(x) = x+2010+λx2 2
b)
(λ≥0,x∈R)
FD = R++ (for λ > 0: parabola opening upwards; for λ = 0: unbounded
from below) : φ(λ) = 2010 − 1 2λ
􏰨 f(x) := exp(−x) −→ min (P) g(x) := −x ≤ 0
1. Wehave inf{f(x)| x∈F}=inf{exp(−x)| x≥0}=0,butthereexists no x ∈ F = R+ with f(x) = 0.
2.L(x,λ):=f(x)+λg(x)=exp(−x)−λx (λ≥0)showsFD ={0} with φ(0) = 0. So we have sup{φ(λ)| λ∈FD} = 0 = φ(0). ▹
The dual objective function φ — as the pointwise infimum of a family of affinely linear functions — is always a concave function, even if the initial problem is not convex. Hence the dual problem can always be written (φ 􏱂→ −φ) as a convex minimum problem:
Remark The set FD is convex, and φ is a concave function on FD . Proof: Letx∈C,α∈[0,1]andλ,μ∈FD:
L(x,αλ+(1−α)μ) = f(x)+⟨αλ+(1−α)μ,g(x)⟩ 􏰓􏰔􏰓􏰔
= α f(x)+⟨λ,g(x)⟩ +(1−α) f(x)+⟨μ,g(x)⟩ = αL(x,λ)+(1−α)L(x,μ)
≥ αφ(λ) + (1 − α)φ(μ)
This inequality has two implications: α λ + (1 − α) μ ∈ FD , and further, φ(αλ+(1−α)μ) ≥ αφ(λ)+(1−α)φ(μ). 􏰀
As we shall see below, the dual function yields lower bounds on the optimal value
p∗ := v(P) := inf(P) := inf{f(x):x∈F}
of the primal problem (P). The optimal value of the dual problem (D) is defined by
Chapter 2

2.4 Duality
69
d∗ := v(D) := sup(D) := sup{φ(λ) : λ ∈ FD}.
We allow v(P) and v(D) to attain the extended values +∞ and −∞ and follow the standard convention that the infimum of the empty set is ∞ and the supremum of the empty set is −∞. If there are feasible points xk with f(xk) → −∞ (k → ∞), then v(P) = −∞ and we say problem (P) — or the function f on F — is un- bounded from below. If there are feasible points λk with φ(λk ) → ∞ (k → ∞), then v(D) = ∞ and we say problem (D) — or the function φ on FD — is unbounded from above. The problems (P) and (D) always have optimal values — possibly ∞ or −∞. The question is whether or not they have optimizers, that is, there ex- ist feasible points achieving these values. If there exists a feasible point achieving inf(P), we sometimes write min(P) instead of inf(P), accordingly max(D) instead of sup(D) if there is a feasible point achieving sup(D). In example 10,a) we had min(P ) = sup(D), in example 10, b) we got inf (P ) = max(D).
What is the relationship between d∗ and p∗? The following theorem gives a first answer:
Weak Duality Theorem
If x is feasible to the primal problem (P ) and λ is feasible to the dual problem (D), then we have φ(λ) ≤ f(x) . In particular
d∗ ≤ p∗ . Proof: Let x∈F and λ∈FD:
φ(λ) ≤ L(x,λ) = f(x)+ λT g(x) ≤ f(x) 􏰿􏰾􏰽􏱀 􏰿􏰾􏰽􏱀
≥0
This implies immediately d∗ ≤ p∗.
Although very easy to show, the weak duality result has useful implications: For instance, it implies that the primal problem has no feasible points if the optimal value of (D) is ∞. Conversely, if the primal problem is unbounded from below, the dual problem has no feasible points. Any feasible point λ to the dual problem provides a lower bound φ(λ) on the optimal value p∗ of problem (P), and any feasible point x to the primal problem (P) provides an upper bound f(x) on the optimal value d∗ of problem (D). One aim is to generate good bounds. This can help to get termination criteria for algorithms: If one has a feasible point x to (P) and a feasible point λ to (D), whose values are close together, then these values must be close to the optima in both problems.
Corollary
If f(x∗) = φ(λ∗) for some x∗ ∈F and λ∗ ∈FD, then x∗ is a minimizer to the primal problem (P) and λ∗ is a maximizer to the dual problem (D).
≤0
􏰀
Chapter 2

70 Optimality Conditions
Proof:
φ(λ∗) ≤ sup{φ(λ)|λ∈FD} ≤ inf{f(x)|x∈F} ≤ f(x∗) = φ(λ∗) Hence, equality holds everywhere, in particular
f(x∗) = inf{f(x)|x∈F} and φ(λ∗) = sup{φ(λ)|λ∈FD}. 􏰀
The difference p∗ − d∗ is called the duality gap. If this duality gap is zero, that is, p∗ = d∗, then we say that strong duality holds. We will see later on: If the functions f and g are convex (on the convex set C ) and a certain constraint qualification holds, then one has strong duality. In nonconvex cases, however, a duality gap
p∗ − d∗ > 0
has to be expected. The following examples illustrate the necessity of making more demands on f,g and C to get a close relation between the problems (P) and (D):
Example 11 With n:=1,m:=1:
a) d∗=−∞,p∗=∞ C:=R+,f(x):=−x,g(x):=π (x∈C):
L(x,λ)=−x+λπ (x∈C,λ∈R+)
F=∅,p∗ =∞; infL(x,λ)=−∞,FD =∅,d∗ =−∞
x∈C
b) d∗=0,p∗=∞ C:=R++,f(x):=x,g(x):=x (x∈C):
L(x,λ)=x+λx = (1+λ)x
F = ∅, p∗ = ∞; FD = R+, φ(λ) = 0, d∗ = 0
c) −∞=d∗ φ(λ∗) = inf L(x,λ∗) = sup L(x∗,λ) = L(x∗,λ∗).
x∈C λ∈Rm+
By lemma 2.2.6 we know already: x∗ is a minimizer of (P ) with f (x∗ ) =
L(x∗,λ∗). b) now follows by the corollary to the weak duality theorem. Conversely, suppose now that b) holds true:
φ(λ∗) = inf {L(x,λ∗) | x ∈ C} ≤ L(x∗,λ∗) = f(x∗)+⟨λ∗,g(x∗)⟩ ≤ f(x∗)
(7) We have φ(λ∗) = f(x∗), by assumption. Therefore, equality holds everywhere
in (7), especially, ⟨λ∗,g(x∗)⟩ = 0. This leads to
L(x∗,λ∗) = f(x∗) ≤ L(x,λ∗) for x∈C and
L(x∗,λ) = f(x∗)+⟨λ,g(x∗)⟩ ≤ f(x∗) = L(x∗,λ∗) for λ∈Rm+. 􏰀
Perturbation and Sensitivity Analysis
In this subsection, we discuss how changes in parameters affect the solution of the primal problem. This is called sensitivity analysis. How sensitive are the minimizer
Chapter 2

2.4 Duality 75
and its value to ‘small’ perturbations in the data of the problem? If parameters change, sensitivity analysis often helps to avoid having to solve a problem again.
For u ∈ Rm we consider the ‘perturbed’ optimization problem 􏰨 f(x) −→ min
(Pu) x ∈ Fu with the feasible region
􏰤􏰥
Fu := x∈C| g(x)≤u .
The vector u is called the ‘perturbation vector’. Obviously we have (P0 ) = (P ).
If a variable ui is positive, this means that we ‘relax’ the i-th constraint gi(x) ≤ 0 to gi(x) ≤ ui; if ui is negative we tighten this constraint.
We define the perturbation or sensitivity function p: Rm −→R∪{−∞,∞}
associated with the problem (P) by
p(u) := inf{f(x) | x∈Fu} = inf{f(x) | x∈C, g(x) ≤ u} for u∈Rm
(with inf ∅ := ∞). Obviously we have p(0) = p∗.
The function p gives the minimal value of the problem (Pu) as a function of ‘per-
turbations’ of the right-hand side of the constraint g(x) ≤ 0. Its effective domain is given by the set
dom(p) := {u∈Rm | p(u)<∞} =􏱈 {u∈Rm | ∃x∈Cg(x) ≤ u}. Obviously the function p is antitone, that is, order-reversing: If the vector u increases, the feasible region Fu increases and so p decreases (in the weak sense). Remark If the original problem (P) is convex, then the effective domain dom(p) is convex and the perturbation function p is convex on it. Since −∞ is possible as a value for p on dom(p), convexity here means the convexity of the epigraph3 epi(p) := {(u,z)∈Rm ×R | u∈dom(p), p(u) ≤ z} 3 The prefix ‘epi’ means ‘above’. A real-valued function p is convex if and only if the set epi(p) is convex (cf. exercise 8). Chapter 2 76 Optimality Conditions Proof: The convexity of dom(p) and p is given immediately by the convexity of the set C and the convexity of the function g: Let u,v ∈ dom(p) and ρ ∈ (0,1). For α,β ∈ R with p(u) < α and p(v) < β there exist vectors x,y ∈ C with g(x) ≤ u, g(y) ≤ v and f ( x ) < α , f ( y ) < β . T h e v e c t o r x􏰶 : = ρ x + ( 1 − ρ ) y b e l o n g s t o C w i t h g(x􏰶) ≤ ρg(x)+(1−ρ)g(y) ≤ ρu+(1−ρ)v =: u􏰶 f(x􏰶) ≤ ρf(x)+(1−ρ)f(y)<ρα+(1−ρ)β. and This shows p(u􏰶) ≤ f(x􏰶) < ρα + (1 − ρ)β, hence, p(u􏰶) ≤ ρp(u) + (1 − ρ)p(v). Remark We assume that strong duality holds and that the dual optimal value is at- tained. Let λ∗ be a maximizer to the dual problem (D). Then we have p(u) ≥ p(0)−⟨λ∗,u⟩ for all u∈Rm . Proof: For a given u ∈ Rm and any feasible point x to the problem (Pu), that is, x ∈ Fu , we have p(0) = p∗ = d∗ = φ(λ∗) ≤ f(x)+⟨λ∗,g(x)⟩ ≤ f(x)+⟨λ∗,u⟩. From this follows p(0) ≤ p(u) + ⟨λ∗ , u⟩ . 􏰀 This inequality gives a lower bound on the optimal value of the perturbed problem (Pu). The hyperplane given by z = p(0) − ⟨λ∗ ,u⟩ ‘supports’ the epigraph of the function p at the point (0,p(0)). For a problem with only one inequality constraint the inequality shows that the affinely linear function u 􏱂→ p∗ − λ∗ u (u ∈ R) lies below the graph of p and is tangent to it at the point (0,p∗). 􏰀 p∗ − λ∗ u We get the following rough sensitivity results: p∗ = p(0) p u Chapter 2 2.4 Duality 77 If λ∗i is ‘small’, relaxing the i-th constraint causes a small decrease of the optimal value p(u). Conversely, if λ∗i is ‘large’, tightening the i-th constraint causes a large increase of the optimal value p(u). Under the assumptions of the foregoing remark we have: Remark If the function p is differentiable4 at the point u = 0, then the maximizer λ∗ of the dual problem (D) is related to the gradient of p at u = 0 : ∇p(0) = −λ∗ Here the Lagrange multipliers λ∗i are exactly the local sensitivities of the function p with respect to perturbations of the constraints. Proof: The differentiability at the point u = 0 gives: p(u) = p(0)+⟨∇p(0),u⟩+r(u)∥u∥ with r(u)→0 for Rm ∋u→0. Hence we obtain − ⟨∇p(0) + λ∗ , u⟩ ≤ r(u) ∥u∥. We set u := −t[∇p(0) + λ∗] for t > 0 and get t ∥∇p(0) + λ∗∥2 ≤ t ∥∇p(0) + λ∗∥ r􏰓−t[∇p(0)+λ∗]􏰔. This
shows: ∥∇p(0) + λ∗∥ ≤ r􏰓−t[∇p(0)+λ∗]􏰔. Passage to the limit t → 0 yields ∇p(0)+λ∗ = 0. 􏰀
For the rest of this section we consider only the special case of a convex opti- mization problem, where the functions f and g are convex and continuously differentiable and the set C is convex.
Economic Interpretation of Duality
The equation or
∇p(0) = −λ∗ −∂p(0)=λ∗i for i=1,…,m
∂ui
leads to the following interpretation of dual variables in economics:
The components λ∗i of the Lagrange multiplier λ∗ are often called shadow prices or attribute costs. They represent the ‘marginal’ rate of change of the optimal value
p∗ = v(P) = inf(P)
4 Subgradients generalize the concept of gradient and are helpful if the function p is not differentiable at the point u = 0. We do not pursue this aspect and its relation to the concept of stability.
Chapter 2

78 Optimality Conditions
of the primal problem (P) with respect to changes in the constraints. They describe the incremental change in the value p∗ per unit increase in the right- hand side of the constraint.
If, for example, the variable x ∈ Rn determines how an enterprise ‘operates’, the objective function f describes the cost for some production process, and the constraint gi(x) ≤ 0 gives a bound on a special resource, for example labor, material or space, then p∗(u) shows us how much the costs (and with it the profit) change when the resource changes. λ∗i determines approximately how much fewer costs the enterprise would have, for a ‘small’ increase in avail- ability of the i-th resource. Under these circumstances λ∗i has the dimension of dollars (or euros) per unit of capacity of the i-th resource and can therefore be regarded as a value per unit resource. So we get the maximum price we should pay for an additional unit of ui .
Strong Duality
Below we will see: If the Slater constraint qualification holds and the original problem is convex, then we have strong duality, that is, p∗ = d∗. We see once more: The class of convex programs is a class of ‘well-behaved’ optimization problems. Convex optimization is relatively ‘easy’.
We need a slightly different separation theorem (compared to proposition 2.1.1). We quote it without proof (for a proof see, for example: [Fra], p. 49f):
Separation Theorem
Given two disjoint nonempty convex sets V and W in Rk, there exist a real α and a vector p∈Rk \{0} with
⟨p,v⟩≥α forall v∈V and ⟨p,w⟩≤α forall w∈W.
In other words: The hyperplane 􏰤x ∈ Rk | ⟨p,x⟩ = α􏰥 separates V and W.
The example
􏰢􏰣
V:= x=(x1,x2)T ∈R2 |x1 ≤0 and 􏰢􏰣
W:= x=(x1,x2)T∈R2|x1>0,x1x2≥1
(with separating ‘line’ x1 = 0) shows that the sets cannot be ‘strictly’ separated.
Strong Duality Theorem
Suppose that the Slater constraint qualification ∃x ̃∈F gi(x ̃)<0 foralli∈I1 holds for the convex problem (P). Then we have strong duality, and the value of the dual problem (D) is attained if p∗ > −∞.
In order to simplify the proof, we verify the theorem under the slightly stronger condition
∃ x̃ ∈ F  gi(x̃) < 0 for all i ∈ I.

For an extension of the proof to the (refined) Slater constraint qualification see, for example, [Rock], p. 277.

Proof: There exists a feasible point, hence we have p∗ < ∞. If p∗ = −∞, then we get d∗ = −∞ by the weak duality theorem. Hence, we can suppose that p∗ is finite. The two sets

V := {(v, w) ∈ Rm × R | ∃ x ∈ C  g(x) ≤ v and f(x) ≤ w}
W := {(0, w) ∈ Rm × R | w < p∗}

are nonempty and convex. By the definition of p∗ they are disjoint: Let (v, w) be in W ∩ V: (v, w) ∈ W shows v = 0 and w < p∗. For (v, w) ∈ V there exists an x ∈ C with g(x) ≤ v = 0 and f(x) ≤ w < p∗, which is a contradiction to the definition of p∗. The quoted separation theorem gives the existence of a pair (λ, μ) ∈ Rm × R \ {(0, 0)} and an α ∈ R such that:

⟨λ, v⟩ + μw ≥ α for all (v, w) ∈ V    (8)
⟨λ, v⟩ + μw ≤ α for all (v, w) ∈ W    (9)

From (8) we get λ ≥ 0 and μ ≥ 0. (9) means that μw ≤ α for all w < p∗, hence μp∗ ≤ α. (8) and the definition of V give for any x ∈ C:

⟨λ, g(x)⟩ + μf(x) ≥ α ≥ μp∗    (10)

Case μ = 0: From (10) we get ⟨λ, g(x)⟩ ≥ 0 for any x ∈ C, especially ⟨λ, g(x̃)⟩ ≥ 0 for a point x̃ ∈ C with gi(x̃) < 0 for all i ∈ I. This shows λ = 0, a contradiction to (λ, μ) ≠ (0, 0). So we have μ > 0.

Case μ > 0: We divide the inequality (10) by μ and obtain L(x, λ/μ) ≥ p∗ for any x ∈ C. From this follows φ(λ/μ) ≥ p∗. By the weak duality theorem we have φ(λ/μ) ≤ d∗ ≤ p∗. This shows strong duality and that the dual value is attained. □

Strong duality can be obtained for some special nonconvex problems too: It holds for any optimization problem with a quadratic objective function and one quadratic inequality constraint, provided Slater’s constraint qualification holds. See, for example, [Bo/Va], Appendix B.
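To make the statement concrete, here is a minimal numerical check on a toy convex program of our own (not an example from the text): minimize f(x) = x² subject to g(x) = 1 − x ≤ 0. Slater’s condition holds (take x̃ = 2), p∗ = 1, and the dual function φ(λ) = inf_x {x² + λ(1 − x)} = λ − λ²/4 attains its maximum d∗ = 1 at λ = 2, so p∗ = d∗. The sketch below evaluates φ numerically on a grid, assuming SciPy is available:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy convex program (our own example): minimize x^2 subject to 1 - x <= 0.
# Primal optimal value p* = 1 (attained at x = 1); Slater point: x = 2.

def phi(lam):
    """Dual function: infimum over x of the Lagrangian x^2 + lam*(1 - x)."""
    return minimize_scalar(lambda x: x**2 + lam * (1.0 - x)).fun

lams = np.linspace(0.0, 5.0, 501)      # grid on lambda >= 0
d_star = max(phi(lam) for lam in lams)
print("p* =", 1.0)
print("d* ≈", d_star)                  # ≈ 1.0, attained at lambda = 2: strong duality
```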
Exercises
1. Orthogonal Distance Line Fitting
Consider the following approximation problem arising from quality control
in manufacturing using coordinate measurement techniques [Ga/Hr]. Let M := {(x1,y1), (x2,y2), …, (xm,ym)}
be a set of m ∈ N given points in R2. The task is to find a line

L = L(c, n1, n2) := {(x, y) ∈ R2 | c + n1x + n2y = 0}

in Hessian normal form with n1² + n2² = 1 which best approximates the point set M such that the sum of squares of the distances of the points from the straight line becomes minimal. If we calculate rj := c + n1xj + n2yj for a point (xj, yj), then |rj| is its distance to L.
a) Formulate the above problem as a constrained optimization problem.
b) Show the existence of a solution and determine the optimal parameters c, n1 and n2 by means of the Lagrange multiplier rule. Explicate when and in which sense these parameters are uniquely defined.
c) Find a (minimal) example which consists of three points and has infinitely many optimizers.
d) Solve the optimization problem with Matlab® or Maple® and test your program with the following data (cf. [Ga/Hr]):
xj  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0  10.0
yj  0.2  1.0  2.6  3.6  4.9  5.3  6.5  7.8  8.0   9.0
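For part d), one possible sketch (in Python rather than Matlab or Maple; the data are those given above) uses the standard total least squares characterization of the solution: the optimal normal vector (n1, n2) is an eigenvector of the centered scatter matrix belonging to its smallest eigenvalue, and c is chosen so that the line passes through the centroid.

```python
import numpy as np

# Orthogonal distance line fitting for the data of exercise 1 d).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([0.2, 1.0, 2.6, 3.6, 4.9, 5.3, 6.5, 7.8, 8.0, 9.0])

# Center the data; the optimal normal is an eigenvector of the scatter
# matrix belonging to its smallest eigenvalue (total least squares).
P = np.column_stack([x - x.mean(), y - y.mean()])
eigvals, eigvecs = np.linalg.eigh(P.T @ P)
n1, n2 = eigvecs[:, 0]                  # eigenvector for the smallest eigenvalue
c = -(n1 * x.mean() + n2 * y.mean())    # the optimal line passes through the centroid

r = c + n1 * x + n2 * y                 # signed distances r_j
print("c, n1, n2:", c, n1, n2)
print("sum of squared distances:", np.sum(r**2))
```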
2. a)
Solve the optimization problem
f(x1, x2) := 2x1 + 3x2 −→ max
√x1 + √x2 = 5
using Lagrange multipliers (cf. [Br/Ti]).
b) Visualize the contour lines of f as well as the set of feasible points, and mark the solution. Explain the result!
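A small plotting sketch for part b) (our own, assuming Matplotlib; it only draws the contour lines and the feasible curve and leaves marking and explaining the solution to the reader):

```python
import numpy as np
import matplotlib.pyplot as plt

# Contour lines of f(x1, x2) = 2*x1 + 3*x2 together with the feasible set
# sqrt(x1) + sqrt(x2) = 5, x1, x2 >= 0.
x1, x2 = np.meshgrid(np.linspace(0, 30, 400), np.linspace(0, 30, 400))
plt.contour(x1, x2, 2 * x1 + 3 * x2, levels=20, cmap="viridis")
plt.contour(x1, x2, np.sqrt(x1) + np.sqrt(x2), levels=[5.0], colors="red")
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("Contour lines of f and the constraint curve")
plt.show()
```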
3. Let n ∈ N and A = (aν,μ) be a real symmetric (n,n)-matrix with the
submatrices Ak,

          ⎛ a11 a12 … a1k ⎞
    Ak := ⎜  ⋮   ⋮      ⋮  ⎟    for k ∈ {1, …, n}.
          ⎝ ak1 ak2 … akk ⎠

Then the following statements are equivalent:
a) A is positive definite.
b) ∃ δ > 0 ∀ x ∈ Rn  xTAx ≥ δ ∥x∥²
c) ∀ k ∈ {1, …, n}  det Ak > 0
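The equivalence of a) and c) is easy to check numerically for a concrete matrix; the following sketch (our own example matrix) compares the leading principal minors with the eigenvalues:

```python
import numpy as np

# Cross-check of criterion c) against the eigenvalue test for a sample
# symmetric matrix (our own example).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

minors = [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]
print("leading principal minors:", minors)    # all positive
print("eigenvalues:", np.linalg.eigvalsh(A))  # all positive as well
```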
4. Consider a function f : Rn −→ R.
a) If f is differentiable, then the following holds:
f convex ⇐⇒ ∀ x, y ∈ Rn  f(y) − f(x) ≥ f′(x)(y − x)
b) If f is twice continuously differentiable, then:
f convex ⇐⇒ ∀ x ∈ Rn  ∇²f(x) positive semidefinite
c) What do the corresponding characterizations of strictly convex functions look like?
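Part a) can be illustrated numerically before proving it; the sketch below (our own, for the convex function f(x) = ∥x∥² with f′(x)d = 2⟨x, d⟩) checks the gradient inequality at random points:

```python
import numpy as np

# Numerical illustration of a): for the convex function f(x) = ||x||^2
# (gradient 2x) check f(y) - f(x) >= f'(x)(y - x) at random points.
rng = np.random.default_rng(0)
f = lambda x: float(x @ x)
for _ in range(5):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lhs = f(y) - f(x)
    rhs = 2 * x @ (y - x)
    print(lhs >= rhs - 1e-12)   # True in every trial
```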
5. In the “colloquial speech” of mathematicians one can sometimes hear the following statement: “Strictly convex functions always have exactly one minimizer.”
However, is it really right to use this term so carelessly? Consider two typical representatives fi : R2 −→ R, i ∈ {1, 2}:
f1(x, y) = x² + y²,   f2(x, y) = x² − y²
Visualize these functions and plot their contour lines. Which function is convex? Show this analytically as well. Is the above statement correct?
Let Dj ⊂ R2 for j ∈ {1, 2, 3, 4, 5} be a region in R2 with
D1 := {(x1, x2) ∈ R2 : x1² + x2² ≤ 0.04},
D2 := {(x1, x2) ∈ R2 : (x1 − 0.55)² + (x2 − 0.7)² ≤ 0.04},
D3 := {(x1, x2) ∈ R2 : (x1 − 0.55)² + x2² ≤ 0.04}.
The outer boundary of the regions D4 and D5 is defined by
x = 0.5(0.5+0.2cos(6θ))cosθ+xc , θ∈[0,2π),
y = 0.5(0.5+0.2cos(6θ))sinθ+yc
where (xc, yc) = (0, 0) for D4 and (xc, yc) = (0, −0.7) for D5.
If we now restrict the above functions fi to Dj (i ∈ {1, 2}, j ∈ {1, 2, 3, 4, 5}), does the statement about the uniqueness of the minimizers still hold? Find all the minimal points, where possible! Where do they lie? What role do the convexity of the region and of the function play?
6. Show that a function f : Rn −→ R is affinely linear if and only if it is convex as well as concave.
7. Let X be a real vector space. For m ∈ N and x1, …, xm ∈ X let
conv(x1, …, xm) := { λ1x1 + … + λmxm | λ1, …, λm > 0, λ1 + … + λm = 1 }.
Verify that the following assertions hold for a nonempty subset A ⊂ X:
a) A convex ⇐⇒ ∀ m ∈ N ∀ a1,…,am ∈ A conv(a1,…,am) ⊂ A
b) Let A be convex and f : A −→ R a convex function. For x1, x2, …, xm ∈ A and x ∈ conv(x1, …, xm) in a representation as given above, it then holds that

f(λ1x1 + … + λmxm) ≤ λ1f(x1) + … + λmf(xm).
c) The intersection of an arbitrary number of convex sets is convex. Consequently there exists the smallest convex superset conv(A) of A, called the convex hull of A.
d) It holds that conv(A) = ⋃ { conv(a1, …, am) | m ∈ N, a1, …, am ∈ A }.
e) Carathéodory’s lemma:
For X = Rn it holds that conv(A) = ⋃ { conv(a1, …, am) | m ≤ n + 1, a1, …, am ∈ A }.
f) In which way does this lemma have to be modified for X = Cn?
g) For X ∈ {Rn , Cn } and A compact the convex hull conv(A) is also compact.
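For a finite set A ⊂ R², the convex hull from d) can be computed directly; a small sketch (our own example data, assuming SciPy) lists the hull vertices. By Carathéodory’s lemma, every point of conv(A) then lies in the convex hull of at most n + 1 = 3 of these points.

```python
import numpy as np
from scipy.spatial import ConvexHull

# conv(A) for a finite random point set A in R^2 (our own example data).
rng = np.random.default_rng(1)
A = rng.random((20, 2))

hull = ConvexHull(A)
print("indices of the hull vertices:", hull.vertices)
print("number of hull vertices:", len(hull.vertices))
```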
8. For a nonempty subset D ⊂ Rn and a function f : D −→ R let

epi(f) := {(x, y) ∈ D × R : f(x) ≤ y}
be the epigraph of f. Show that for a convex set D we have f convex ⇐⇒ epi(f) convex.
9. Prove part 1) of lemma 2.2.3 and additionally show the following assertions for F convex and x0 ∈ F:
a) Cfd(x0) = {μ(x − x0) | μ > 0, x ∈ F}
b) Ct(x0) = cl Cfd(x0)
c) Ct(x0) is convex.
10. Prove for the tangent cones of the following sets
F1 := {x ∈ Rn | ∥x∥2 = 1},
F2 := {x ∈ Rn | ∥x∥2 ≤ 1},
F3 := {x ∈ R2 | −x1³ + x2 ≤ 0, −x2 ≤ 0}:
a) For x0 ∈ F1 it holds that Ct(x0) = {d ∈ Rn | ⟨d, x0⟩ = 0}.
b) For x0 ∈ F2 we have Ct(x0) = Rn if ∥x0∥2 < 1, and Ct(x0) = {d ∈ Rn | ⟨d, x0⟩ ≤ 0} if ∥x0∥2 = 1.
c) For x0 := (0, 0)T ∈ F3 it holds that Ct(x0) = {d ∈ R2 | d1 ≥ 0, d2 = 0}.

11. With f(x) := x1² + x2 for x ∈ R2 consider

    (P)  f(x) −→ min
         −x2 ≤ 0
         x1³(x2 − x1³) ≤ 0
         x1³ − x2 ≤ 0

and determine the linearizing cone, the tangent cone and the respective dual cones at the (strict global) minimal point x0 := (0, 0)T.

12. Let x0 be a feasible point of the optimization problem (P). According to page 56 it holds that (LICQ) =⇒ (AHUCQ) =⇒ (ACQ). Show by means of the following examples (with n = m = 2 and p = 0) that these two implications do not hold in the other direction:
a) f(x) := x1² + (x2 + 1)², g1(x) := −x1³ − x2, g2(x) := −x2, x0 := (0, 0)T
b) f(x) := x1² + (x2 + 1)², g1(x) := x2 − x1², g2(x) := −x2, x0 := (0, 0)T

13. Let the following optimization problem be given: f(x) −→ min, x ∈ R2,
    g1(x1, x2) := 3(x1 − 1)³ − 2x2 + 2 ≤ 0
    g2(x1, x2) := (x1 − 1)³ + 2x2 − 2 ≤ 0
    g3(x1, x2) := −x1 ≤ 0
    g4(x1, x2) := −x2 ≤ 0
a) Plot the feasible region. Solve the optimization problem for the following objective functions:
    (i)   f(x1, x2) := (x1 − 1)² + (x2 − 3)²
    (ii)  f(x1, x2) := (x1 − 1)² + (x2 − 4)²
    (iii) f(x1, x2) := (x1 − 5)² + (x2 − 5)²
    Do the KKT conditions hold at the optimal point?
    Hint: In addition illustrate these problems graphically.
b) Regard the objective function on the ‘upper boundary’ of F.

14. Optimal Location of a Rescue Helicopter (see example 4 of chapter 1)
a) Formulate the minimax problem
    dmax(x, y) := max_{1 ≤ j ≤ m} √((x − xj)² + (y − yj)²) −→ min
    as a quadratic optimization problem
    f(x, y, ρ) −→ min,  gj(x, y, ρ) ≤ 0  (j = 1, …, m)
    (with f quadratic, gj linear). You can find some hints on page 13.
b) Visualize the function dmax by plotting its contour lines for the points (0, 0), (5, −1), (4, 6), (1, 3).
c) Give the corresponding Lagrangian. Solve the problem by means of the Karush–Kuhn–Tucker conditions.

15. Determine a triangle with minimal area containing two disjoint disks with radius 1. W.l.o.g. let (0, 0), (x1, 0) and (x2, x3) with x1, x3 ≥ 0 be the vertices of the triangle; (x4, x5) and (x6, x7) denote the centers of the disks.
    (Figure: a triangle containing two unit disks.)
a) Formulate this problem as a minimization problem in terms of seven variables and nine constraints (see [Pow 1]).
b) x∗ = (4 + 2√2, 2 + √2, 2 + √2, 1 + √2, 1, 3 + √2, 1)T is a solution of this problem; calculate the corresponding Lagrange multipliers λ∗, such that the Karush–Kuhn–Tucker conditions are fulfilled.
c) Check the sufficient second-order optimality conditions for (x∗, λ∗).

16. Find the point x ∈ R2 that lies closest to the point p := (2, 3) under the constraints g1(x) := x1 + x2 ≤ 0 and g2(x) := x1² − 4 ≤ 0.
a) Illustrate the problem graphically.
b) Verify that the problem is convex and fulfills (SCQ).
c) Determine the KKT points by differentiating between three cases: none is active, exactly the first one is active, exactly the second one is active.
d) Now conclude with theorem 2.2.8.
The problem can of course be solved elementarily. We, however, want to practice the theory with simple examples.

17. In a small power network the power r runs through two different channels. Let xi be the power running through channel i for i = 1, 2. The total loss is given by the function f : R2 −→ R with
    f(x1, x2) := x1 + ½ (x1² + x2²).
Determine the current flow such that the total loss stays minimal. The constraints are given by x1 + x2 = r, x1 ≥ 0, x2 ≥ 0.

18. Verify in the linear case that the Lagrange dual problem of (D) (cf. p. 71), transformed into standard form, is again the primal problem.

19. Consider the optimization problem (cf. [Erik]):
    f(x) := ∑_{i=1}^n xi log(xi/pi) −→ min,  x ∈ Rn,
    ATx = b,  x ≥ 0,
where A ∈ Rn×m, b ∈ Rm and p1, p2, …, pn ∈ R++ are given. Let further 0 ln 0 be defined as 0. Prove:
a) The dual problem is given by
    φ(λ) := bTλ − ∑_{i=1}^n pi exp(eiTAλ − 1) −→ max,  λ ∈ Rm.
b) ∇φ(λ) = b − ATx with xi = pi exp(eiTAλ − 1).
c) ∇²φ(λ) = −ATXA, where X = Diag(x) with x from b).

20. Support Vector Machines (cf. [Cr/Sh])
Support vector machines have been extensively used in machine learning and data mining applications such as classification and regression, text categorization as well as medical applications, for example breast cancer diagnosis. Let two classes of patterns be given, i.e., samples of observable characteristics which are represented by points xi in Rn. The patterns are given in the form (xi, yi), i = 1, …, m, with yi ∈ {1, −1}. yi = 1 means that xi belongs to class 1; otherwise xi belongs to class 2.
In the simplest case we are looking for a separating hyperplane described by ⟨w, x⟩ + β = 0 with ⟨w, xi⟩ + β ≥ 1 if yi = 1 and ⟨w, xi⟩ + β ≤ −1 if yi = −1. These conditions can be written as yi(⟨w, xi⟩ + β) ≥ 1 (i = 1, …, m). We aim to maximize the ‘margin’ (distance) 2/√⟨w, w⟩ between the two hyperplanes ⟨w, x⟩ + β = 1 and ⟨w, x⟩ + β = −1. This gives a linearly constrained convex quadratic minimization problem

    ½ ⟨w, w⟩ −→ min
    yi(⟨w, xi⟩ + β) ≥ 1  (i = 1, …, m).    (11)

(Figure: Separable Case and Non-Separable Case.)

In the case that the two classes are not linearly separable (by a hyperplane), we introduce nonnegative penalties ξi for the ‘misclassification’ of xi and minimize both ⟨w, w⟩ and ∑_{i=1}^m ξi. We solve this optimization problem in the following way with soft margins:

    (P)  ½ ⟨w, w⟩ + C ∑_{i=1}^m ξi −→ min
         yi(⟨w, xi⟩ + β) ≥ 1 − ξi,  ξi ≥ 0  (i = 1, …, m).    (12)

Here, C is a weight parameter of the penalty term.
a) Introducing the dual variables λ ∈ Rm+, derive the Lagrange dual problem to (P):

    (D)  −½ ∑_{i,j=1}^m yiyj⟨xi, xj⟩λiλj + ∑_{i=1}^m λi −→ max
         ∑_{i=1}^m yiλi = 0,  0 ≤ λi ≤ C  (i = 1, …, m).    (13)

Compute the coefficients w ∈ Rn and β ∈ R of the separating hyperplane by means of the dual solution λ and show
    w = ∑_{j=1}^m yjλjxj,  β = yj − ⟨w, xj⟩  if 0 < λj < C.
The vectors xj with λj > 0 are called support vectors.
b) Calculate a support vector ‘machine’ for breast cancer diagnosis using the file wisconsin-breast-cancer.data from the Breast Cancer Wisconsin Data Set (cf. http://archive.ics.uci.edu/ml/). The file wisconsin-breast-cancer.names gives information on the data set: It contains 699 instances consisting of 11 attributes. The first attribute gives the sample code number. Attributes 2 through 10 describe the medical status and give a 9-dimensional vector xi. The last attribute is the class attribute (“2” for benign, “4” for malignant). Sixteen samples have a missing attribute, denoted by “?”. Remove these samples from the data set. Now split the data into two portions: The first 120 instances are used as training data. Take software of your choice to solve the quadratic problem (P), using the penalty parameter C = 1000. The remaining instances are used to evaluate the ‘performance’ of the classifier or decision function given by f(x) := sgn(⟨w, x⟩ + β).
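One possible sketch for part b), using pandas and scikit-learn as the ‘software of your choice’: the file name, column layout and C = 1000 are taken from the exercise; everything else, in particular the use of sklearn.svm.SVC with a linear kernel (which solves the soft-margin problem (P)/(D)), is our own assumption about how one might set this up.

```python
import pandas as pd
from sklearn.svm import SVC

# Column layout as described in the exercise: id, nine medical attributes, class (2/4).
cols = ["id"] + [f"a{i}" for i in range(2, 11)] + ["class"]
data = pd.read_csv("wisconsin-breast-cancer.data", header=None, names=cols,
                   na_values="?").dropna()                 # drop the 16 incomplete samples

X = data[cols[1:10]].to_numpy(dtype=float)                 # attributes 2..10
y = ((data["class"] == 4).astype(int) * 2 - 1).to_numpy()  # malignant -> +1, benign -> -1

X_train, y_train = X[:120], y[:120]                        # first 120 instances: training
X_test, y_test = X[120:], y[120:]                          # remaining instances: evaluation

clf = SVC(kernel="linear", C=1000.0).fit(X_train, y_train)
print("test accuracy:", (clf.predict(X_test) == y_test).mean())
```

For a linear kernel the coefficients w and β of the resulting hyperplane are exposed as clf.coef_ and clf.intercept_, so the decision function f(x) = sgn(⟨w, x⟩ + β) can be checked directly against the formulas of part a).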