553.740 Project 2: Optimization. Fall 2020
Due on Wednesday October 21.
Please read the directions below carefully.
• You must type your answers (using, e.g., LaTeX or MS Word). A 5% penalty will be applied otherwise.
• The solution must be written with sufficient explanations and your results must be commented, as if you were writing a report or a paper. Returning a series of numerical results and figures is not enough. Neither a raw program output nor a Jupyter notebook output is acceptable as a solution.
• You must provide your program sources. They must be added as an appendix, separate from your answers to the questions. They will not be graded (so no direct credit for them), but they will be useful in order to understand why results are not correct (and decide whether partial credit can be given) and to ensure that your work is original. You may use any programming language, although Python is strongly recommended.
• Data files. The “.csv” files associated with this project are available on Blackboard. The columns are formatted as (Index, X1, . . . , Xd, Y) and each row corresponds to a sample. Here, X1, . . . , Xd are the d coordinates of the input variable (in this homework, d = 1 or 2), and Y is the output variable (Index is just the line number).
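For instance, such a file can be read with a few lines of Python; this is only a minimal sketch (the function name and the assumption of a header row are illustrative, not part of the project specification):

import numpy as np

def load_csv(filename):
    # Each row is (Index, X1, ..., Xd, Y). Whether the file contains a
    # header line is an assumption here; adjust skiprows accordingly.
    data = np.loadtxt(filename, delimiter=",", skiprows=1)
    X = data[:, 1:-1]   # drop the Index column, keep X1, ..., Xd
    Y = data[:, -1]     # last column is the output variable Y
    return X, Y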
Question 1.
Let X be a real-valued random variable modeled with a normal distribution N(m,σ2). Fixing a positive number ρ, we propose to estimate m and σ2, based on a sample (x1, . . . , xN ) by maximizing the penalized log-likelihood
\[ \sum_{k=1}^{N} \log \varphi_{m,\sigma^2}(x_k) \;-\; N\rho\,\frac{|m|}{\sigma}, \]
where φ_{m,σ²} is the p.d.f. of N(m, σ²).
We will assume in the following that the observed samples are not all equal to the same constant value.
(1.1) Let α = m/σ and β = 1/σ. Let also μ1 and μ2 denote the first- and second-order empirical moments:
\[ \mu_1 = \frac{1}{N}(x_1 + \cdots + x_N), \qquad \mu_2 = \frac{1}{N}(x_1^2 + \cdots + x_N^2). \]
Prove that this maximization problem is equivalent to minimizing (introducing an auxiliary variable ξ)
\[ F(\alpha,\beta,\xi) \;=\; \tfrac{1}{2}\alpha^2 - \mu_1\alpha\beta + \tfrac{1}{2}\mu_2\beta^2 - \log\beta + \rho\xi \]
subject to the constraints ξ − α ≥ 0 and ξ + α ≥ 0.
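For reference, the reduction starts from the identity
\[ \log \varphi_{m,\sigma^2}(x) \;=\; \log\beta \;-\; \tfrac{1}{2}(\beta x - \alpha)^2 \;-\; \tfrac{1}{2}\log(2\pi), \]
so that averaging over the sample and dropping the additive constant leaves a function of α, β and the empirical moments μ1, μ2 only.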
(1.2) Prove that F is a convex function over R × (0, +∞) × R and that any feasible point for this constrained minimization problem is regular.
(1.3) Prove that the KKT conditions for this problem are:
\[
\begin{aligned}
\alpha - \mu_1\beta &= \lambda_2 - \lambda_1\\
-\mu_1\alpha + \mu_2\beta - \frac{1}{\beta} &= 0\\
\rho - \lambda_1 - \lambda_2 &= 0\\
\lambda_1,\ \lambda_2 &\geq 0\\
\lambda_1(\alpha - \xi) &= 0\\
\lambda_2(\alpha + \xi) &= 0
\end{aligned}
\]
where λ1, λ2 are Lagrange multipliers.
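These conditions can be obtained from the Lagrangian
\[ L(\alpha,\beta,\xi,\lambda_1,\lambda_2) \;=\; F(\alpha,\beta,\xi) \;-\; \lambda_1(\xi - \alpha) \;-\; \lambda_2(\xi + \alpha), \]
whose stationarity with respect to α, β and ξ yields the first three equations, the remaining ones expressing dual feasibility and complementary slackness.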
(1.4) Find a necessary and sufficient condition on μ1, μ2 and ρ for a solution to this system to satisfy α = 0, and provide the optimal β in that case.
(1.5) Assume that the previous condition is not satisfied. (a) Prove that in that case, α has the same sign
as μ1. (b) Completely describe the solution of the system, separating the cases μ1 > 0 and μ1 < 0.
(1.6) Summarize this discussion and prove one of the two following statements (with extra credit if you prove both, i.e., prove that they are equivalent).
(a) The optimal parameters are
\[ \hat m = \mathrm{sign}(\mu_1)\,\max\!\big(0,\ |\mu_1| - \rho\hat\sigma\big), \qquad \hat\sigma = \min\!\left(\sqrt{\mu_2},\ \frac{2s^2}{\sqrt{\rho^2\mu_1^2 + 4s^2} - \rho|\mu_1|}\right), \]
with s² = μ2 − μ1².
(b) The optimal parameters can be computed as follows.
• If μ1² ≤ ρ²μ2: m̂ = 0, σ̂ = √μ2.
• If μ1² ≥ ρ²μ2:
\[ \hat\sigma = \frac{2s^2}{\sqrt{\rho^2\mu_1^2 + 4s^2} - \rho|\mu_1|}, \qquad \hat m = \mu_1 - \mathrm{sign}(\mu_1)\,\rho\hat\sigma. \]
(1.7) What happens if ρ ≥ 1?
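For instance, the case distinction in statement (b) translates directly into a few lines of Python; this is only a minimal sketch with illustrative names:

import numpy as np

def penalized_gaussian_mle(x, rho):
    # Closed-form solution of the penalized problem of Question 1,
    # following the case distinction of statement (1.6)(b).
    x = np.asarray(x, dtype=float)
    mu1 = x.mean()
    mu2 = (x ** 2).mean()
    s2 = mu2 - mu1 ** 2
    if mu1 ** 2 <= rho ** 2 * mu2:
        m_hat = 0.0
        sigma_hat = np.sqrt(mu2)
    else:
        sigma_hat = 2 * s2 / (np.sqrt(rho ** 2 * mu1 ** 2 + 4 * s2) - rho * abs(mu1))
        m_hat = mu1 - np.sign(mu1) * rho * sigma_hat
    return m_hat, sigma_hat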
Question 2.
We now consider a second random variable, Y ∈ {0,1}, and assume that, conditionally on Y = y, X ∼ N(my, σ²) for some m0, m1 and σ². Based on samples (x1, . . . , xN) and (y1, . . . , yN), we want to maximize the penalized log-likelihood
\[ \sum_{k=1}^{N} \log \varphi_{m_{y_k},\sigma^2}(x_k) \;-\; N\rho\,\frac{|m_1 - m_0|}{\sigma} \]
with respect to m0, m1 and σ².
Let N0 and N1 denote the numbers of observations with Y = 0 and Y = 1, respectively, and define
\[
\mu_{1,0} = \frac{1}{N_0}\sum_{k=1}^{N} x_k \mathbf{1}_{y_k=0}, \qquad
\mu_{1,1} = \frac{1}{N_1}\sum_{k=1}^{N} x_k \mathbf{1}_{y_k=1}, \qquad
\mu_1 = \frac{1}{N}\big(N_0\mu_{1,0} + N_1\mu_{1,1}\big),
\]
\[
\mu_2 = \frac{1}{N}\sum_{k=1}^{N} x_k^2, \qquad s^2 = \mu_2 - \mu_1^2.
\]
(2.1) Let m = (N0m0 + N1m1)/N, α = (m1 − m0)/σ and β = 1/σ. Express the problem in terms of these new parameters and prove that, denoting the optimal solution by m̂, α̂, β̂:
1. m̂ = μ1;
2. α̂ and β̂ minimize
\[ \frac{s^2\beta^2}{2} \;-\; \frac{N_0N_1}{N^2}(\mu_{1,1} - \mu_{1,0})\,\alpha\beta \;+\; \frac{N_0N_1}{2N^2}\,\alpha^2 \;-\; \log\beta \;+\; \rho|\alpha| \]
with respect to α and β.
(2.2) Adapt the solution found in question 1 to show one of the following statements (with extra credit if you show both).
(a) The optimal solution of the penalized likelihood problem is
\[ \hat\sigma = \min\!\left(s,\ \frac{2\big(s^2 - N_0N_1(\mu_{1,1}-\mu_{1,0})^2/N^2\big)}{\sqrt{\rho^2(\mu_{1,1}-\mu_{1,0})^2 + 4\big(s^2 - N_0N_1(\mu_{1,1}-\mu_{1,0})^2/N^2\big)} - \rho|\mu_{1,1}-\mu_{1,0}|}\right), \]
\[ \hat m_0 = \mu_1 - \frac{N_1}{N}\,\mathrm{sign}(\mu_{1,1}-\mu_{1,0})\,\max\!\Big(0,\ |\mu_{1,1}-\mu_{1,0}| - \frac{N^2}{N_0N_1}\,\rho\hat\sigma\Big), \]
\[ \hat m_1 = \mu_1 + \frac{N_0}{N}\,\mathrm{sign}(\mu_{1,1}-\mu_{1,0})\,\max\!\Big(0,\ |\mu_{1,1}-\mu_{1,0}| - \frac{N^2}{N_0N_1}\,\rho\hat\sigma\Big). \]
(b) The optimal solution of the penalized likelihood problem is computed as follows.
• If N0N1|μ1,1 − μ1,0| ≤ N²ρs: m̂0 = m̂1 = μ1, σ̂ = s.
• If N0N1|μ1,1 − μ1,0| ≥ N²ρs:
\[ \hat\sigma = \frac{2\big(s^2 - N_0N_1(\mu_{1,1}-\mu_{1,0})^2/N^2\big)}{\sqrt{\rho^2(\mu_{1,1}-\mu_{1,0})^2 + 4\big(s^2 - N_0N_1(\mu_{1,1}-\mu_{1,0})^2/N^2\big)} - \rho|\mu_{1,1}-\mu_{1,0}|}, \]
\[ \hat m_0 = \mu_{1,0} + \frac{N}{N_0}\,\mathrm{sign}(\mu_{1,1}-\mu_{1,0})\,\rho\hat\sigma, \qquad \hat m_1 = \mu_{1,1} - \frac{N}{N_1}\,\mathrm{sign}(\mu_{1,1}-\mu_{1,0})\,\rho\hat\sigma. \]
Question 3.
We now assume that the random variable X is multi-dimensional and that, conditionally on Y, it follows a normal distribution N(my, Σ) with m0, m1 ∈ R^d and Σ a diagonal matrix. We will denote by my(i) the ith coefficient of my (y = 0, 1) and by σ²(i) the ith diagonal entry of Σ.
(3.1) Use the previous question to describe the maximizers of the penalized likelihood
\[ \sum_{k=1}^{N} \log \Phi_{m_{y_k},\Sigma}(x_k) \;-\; N\rho \sum_{i=1}^{d} \frac{|m_0(i) - m_1(i)|}{\sigma(i)}, \]
where Φ_{m,Σ} is the p.d.f. of the multivariate Gaussian N(m, Σ).
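Since Σ is diagonal, the density Φ_{m,Σ} factors over coordinates, so the objective splits into d independent one-dimensional problems of the type studied in Question 2:
\[ \sum_{k=1}^{N} \log \Phi_{m_{y_k},\Sigma}(x_k) \;-\; N\rho \sum_{i=1}^{d} \frac{|m_0(i) - m_1(i)|}{\sigma(i)} \;=\; \sum_{i=1}^{d} \left( \sum_{k=1}^{N} \log \varphi_{m_{y_k}(i),\sigma^2(i)}(x_k(i)) \;-\; N\rho\,\frac{|m_0(i) - m_1(i)|}{\sigma(i)} \right). \]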
(3.2) Write a program for training that takes as input an N by d array X, an N-dimensional binary vector Y and a value of ρ, and that returns m0, m1 and σ as vectors.
(a) Test this program on the dataset “OPT FALL2020 train1.csv” with ρ = 0.1 and return the number of coordinates such that m0(i) ≠ m1(i).
(b) Same question, with ρ = 0.15.
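One possible skeleton for such a training routine applies the one-dimensional solution of question (2.2) coordinate by coordinate; this is only an indicative sketch (all names are illustrative, not a required implementation):

import numpy as np

def train(X, Y, rho):
    # X: N x d array, Y: N-dimensional binary vector, rho: penalty weight.
    # Returns m0, m1, sigma as d-dimensional vectors, applying the
    # one-dimensional closed form of question (2.2)(b) to each coordinate.
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y).astype(int)
    N, d = X.shape
    N0 = np.sum(Y == 0)
    N1 = np.sum(Y == 1)
    mu10 = X[Y == 0].mean(axis=0)      # per-coordinate class-0 means
    mu11 = X[Y == 1].mean(axis=0)      # per-coordinate class-1 means
    mu1 = (N0 * mu10 + N1 * mu11) / N
    mu2 = (X ** 2).mean(axis=0)
    s2 = mu2 - mu1 ** 2
    s = np.sqrt(s2)
    delta = mu11 - mu10
    m0 = np.empty(d)
    m1 = np.empty(d)
    sigma = np.empty(d)
    for i in range(d):
        if N0 * N1 * abs(delta[i]) <= N ** 2 * rho * s[i]:
            m0[i] = m1[i] = mu1[i]
            sigma[i] = s[i]
        else:
            w2 = s2[i] - N0 * N1 * delta[i] ** 2 / N ** 2   # within-class variance
            sigma[i] = 2 * w2 / (np.sqrt(rho ** 2 * delta[i] ** 2 + 4 * w2)
                                 - rho * abs(delta[i]))
            m0[i] = mu10[i] + (N / N0) * np.sign(delta[i]) * rho * sigma[i]
            m1[i] = mu11[i] - (N / N1) * np.sign(delta[i]) * rho * sigma[i]
    return m0, m1, sigma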
(3.3) Write a program for prediction that takes as input vectors m0, m1 and σ (such as those returned by the previous training program) and an M by d array X of test data, and that returns an M-dimensional vector Y such that, for k = 1, . . . , M: Y(k) = 0 if
\[ \sum_{j=1}^{d} \frac{(X(k,j) - m_0(j))^2}{\sigma^2(j)} \;<\; \sum_{j=1}^{d} \frac{(X(k,j) - m_1(j))^2}{\sigma^2(j)} \]
and Y(k) = 1 otherwise.
For ρ = 0.01, 0.02, 0.03, . . . , 0.34, 0.35, execute the following operations:
(1) Run the training program of question (3.2) to estimate m0, m1, σ. Compute ν(ρ), the number of coordinates such that m0(i) ≠ m1(i).
(2) Using the values of m0, m1 and σ obtained in step (1), apply the classification program to the dataset provided in the file “OPT FALL2020 test1.csv.” Compute the balanced error rate E(ρ), defined as the average between the fraction of wrong answers among test data with Y = 0 and that among test data with Y = 1, formally defined by
\[ E \;=\; \frac{1}{2}\sum_{y\in\{0,1\}} \frac{\#\{k : Y(k) = y,\ Y'(k) \neq y\}}{\#\{k : Y(k) = y\}}, \]
where Y′ contains the predicted classes and Y the true ones.
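The classification rule and the balanced error rate above translate directly into code; a minimal sketch (names are again illustrative):

import numpy as np

def predict(m0, m1, sigma, X):
    # Assign class 0 when the variance-weighted squared distance to m0
    # is strictly smaller than the distance to m1.
    X = np.asarray(X, dtype=float)
    d0 = np.sum((X - m0) ** 2 / sigma ** 2, axis=1)
    d1 = np.sum((X - m1) ** 2 / sigma ** 2, axis=1)
    return (d0 >= d1).astype(int)   # Y(k) = 0 if d0 < d1, and 1 otherwise

def balanced_error(Y_true, Y_pred):
    # Average of the per-class error fractions over y in {0, 1}.
    Y_true = np.asarray(Y_true).astype(int)
    Y_pred = np.asarray(Y_pred).astype(int)
    errs = [np.mean(Y_pred[Y_true == y] != y) for y in (0, 1)]
    return 0.5 * sum(errs)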
Provide, in your answer,
• The numerical values of E(0.1) and E(0.15).
• The values of ρ for which E is minimal.
• Two plots: of ν(ρ) vs. ρ and of E(ρ) vs. ρ.
(3.4) Instead of using a closed form for the solution of the problem considered in this question, we consider here an iterative approach using proximal gradient descent. Let α0 = (m0(i)/σ(i), i = 1, . . . , d), α1 = (m1(i)/σ(i), i = 1, . . . , d) and β = (1/σ(i), i = 1, . . . , d) be d-dimensional vectors.
(a) Rewrite the penalized likelihood maximization problem considered in this question as an equivalent minimization problem for a function taking the form
F(α0, α1, β) = f(α0, α1, β) + ρ g(α0, α1, β) with g(α0, α1, β) = Σ_{i=1}^d |α1(i) − α0(i)|.
(b) Provide an explicit expression of the proximal operator
\[ \mathrm{prox}_{\gamma\rho g}(\alpha_0,\alpha_1,\beta) \;=\; \mathop{\mathrm{argmin}}_{\alpha_0',\alpha_1',\beta'} \left( \rho\, g(\alpha_0',\alpha_1',\beta') + \frac{1}{2\gamma}\big( |\alpha_0'-\alpha_0|^2 + |\alpha_1'-\alpha_1|^2 + |\beta'-\beta|^2 \big) \right). \]
(c) Give the expression of ∇f(α0,α1,β) for the function f obtained in (a).
(d) Write a new version of the training program that takes the same input X, Y and ρ and returns the optimal m0 , m1 , σ obtained after stabilization of a proximal gradient algorithm
\[ (\alpha_{0,n+1},\ \alpha_{1,n+1},\ \beta_{n+1}) \;=\; \mathrm{prox}_{\gamma\rho g}\big( (\alpha_{0,n},\ \alpha_{1,n},\ \beta_n) - \gamma\nabla f(\alpha_{0,n},\ \alpha_{1,n},\ \beta_n) \big), \]
where stabilization is defined as the maximal change in the optimized variables during one iteration being less than τ = 10⁻⁶.
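As an illustration, the iteration in (d) can be organized around a generic loop, with the proximal map and the gradient passed as functions to be filled in with the answers to (b) and (c); this skeleton (including its argument conventions) is only indicative:

import numpy as np

def proximal_gradient(theta0, grad_f, prox_g, gamma, tau=1e-6, max_iter=100000):
    # Generic proximal gradient iteration:
    #   theta_{n+1} = prox_g(theta_n - gamma * grad_f(theta_n), gamma)
    # stopped when the largest coordinate change in one iteration is below tau.
    # Here theta stacks (alpha0, alpha1, beta) into one array; grad_f and
    # prox_g are callables implementing the answers to (c) and (b).
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_new = prox_g(theta - gamma * grad_f(theta), gamma)
        if np.max(np.abs(theta_new - theta)) < tau:
            return theta_new
        theta = theta_new
    return theta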
Run this program on the dataset “OPT FALL2020 train1.csv” using ρ = 0.1 and γ = 0.1. Provide in your answer the maximum value (over all estimated variables) of the difference between the result obtained after convergence and the exact formula obtained in question (3.1).