Programming Assignment Due Wednesday, May 6th

Sparse (Linear) Logistic Regression

Suppose that we have a collection of vectors $x_i \in \mathbb{R}^n$, $i = 1, 2, \ldots, N$, and that for each $i$ there is a value $y_i \in \{0, 1\}$. Given a new value $x \in \mathbb{R}^n$, we want to predict the value $y$ to be expected. One way of dealing with this in a probabilistic way is to suppose that for $x \in \mathbb{R}^n$ there is a probability $p(x)$ that $y = 1$ and probability $1 - p(x)$ that $y = 0$. For each $i$ there is a likelihood function

$$p(x_i)^{y_i} \, (1 - p(x_i))^{1 - y_i}$$

that given $x = x_i$, $y$ has the value $y_i$. Assuming the data points $(x_i, y_i)$ are obtained independently, the likelihood for the data set as a whole is

$$\prod_{i=1}^{N} p(x_i)^{y_i} \, (1 - p(x_i))^{1 - y_i}.$$

Assuming that we have a formula $p(x) = 1/(1 + \exp(\gamma + c^T x))$, we want to find $\gamma$ and $c$ that maximize the likelihood as a function of $(\gamma, c)$.

Show that this is equivalent to minimizing

$$\ell(\gamma, c) := \sum_{i=1}^{N} \left[ y_i \ln(1 + \exp(\gamma + c^T x_i)) + (1 - y_i) \ln(1 + \exp(-\gamma - c^T x_i)) \right].$$

This function $\ell(\gamma, c)$ is called the negative log-likelihood function. Show that $\ell(\gamma, c)$ is a convex function of $(\gamma, c)$.
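For later testing, here is a minimal MATLAB sketch of evaluating $\ell(\gamma, c)$. The function name nll is our own choice; xvals is the $N \times n$ matrix of data vectors and yvals the $N \times 1$ vector of labels, as in the data-reading code further below. Note that $\ln(1 + e^t)$ is computed in a numerically stable form to avoid overflow for large $|t|$.

     % Negative log-likelihood ell(gamma, c); a sketch.
     % log(1+exp(t)) is evaluated as max(t,0) + log1p(exp(-abs(t))),
     % which is accurate for both large positive and large negative t.
     function val = nll(gamma, c, xvals, yvals)
         z = gamma + xvals*c;                          % z_i = gamma + c'*x_i
         lse = @(t) max(t, 0) + log1p(exp(-abs(t)));   % stable log(1+exp(t))
         val = sum(yvals.*lse(z) + (1 - yvals).*lse(-z));
     end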

If $n > N$ then there cannot be unique minimizers, since if $w$ is perpendicular to all $x_i$, $i = 1, 2, \ldots, N$, then $\ell(\gamma, c + w) = \ell(\gamma, c)$. In this case, it is still appropriate to look for sparse $c$; that is, we want most entries of $c$ to be zero. To help us find suitable sparse $c$ we add a penalty term especially designed to make this happen:

$$f(\gamma, c) := \ell(\gamma, c) + \alpha \|c\|_1$$

where $\|c\|_1 = \sum_{i=1}^{n} |c_i|$ and $\alpha > 0$ is chosen to control how many entries of $c$ are non-zero. We now have a non-smooth optimization problem. Show that $f$ is a convex function. We can avoid the non-smoothness here by introducing some simple constraints:

$$\min_{\gamma,\, c,\, w} \; \ell(\gamma, c) + \alpha \sum_{i=1}^{n} w_i \quad \text{subject to} \quad w_i \ge +c_i, \quad w_i \ge -c_i, \quad i = 1, 2, \ldots, n.$$

Write out the KKT necessary conditions for this problem. Suppose that we have optimal $\gamma$, $c$, $w$. Determine the appropriate Lagrange multipliers in terms of $\gamma$, $c$, $w$ so that the KKT conditions can be verified.

We will carry out this minimization using an active set strategy: for a given $c$, for each $i$ we can put $w_i = s_i c_i$ with $s_i = \pm 1$ chosen so that $w_i \ge 0$. We can use the gradient of $\ell(\gamma, c) + \alpha \sum_{i=1}^{n} s_i c_i$ with respect to $(\gamma, c)$ to determine a direction to move in. However, we should not move so far that $c_i$ changes sign for $i = 1, 2, \ldots, n$; we can allow $c_i = 0$. If $c_i = 0$ and $|\partial\ell/\partial c_i(\gamma, c)| \le \alpha$ then we should keep $c_i = 0$. But if $c_i = 0$ and $\partial\ell/\partial c_i(\gamma, c) > \alpha$ then we should make $c_i$ negative; if $c_i = 0$ and $\partial\ell/\partial c_i(\gamma, c) < -\alpha$ then we should make $c_i$ positive. For each $i$ we keep track of a variable $s_i \in \{-1, 0, +1\}$ so that $s_i = \operatorname{sign}(c_i)$.
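A minimal MATLAB sketch of this sign update, assuming gc holds the current gradient $\partial\ell/\partial c$ and alpha holds $\alpha$ (the variable names are ours):

     % Update the sign vector s in {-1,0,+1} entrywise.
     for i = 1:n
         if c(i) ~= 0
             s(i) = sign(c(i));
         elseif gc(i) > alpha
             s(i) = -1;       % pushing c_i negative decreases the objective
         elseif gc(i) < -alpha
             s(i) = +1;       % pushing c_i positive decreases the objective
         else
             s(i) = 0;        % keep c_i = 0
         end
     end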

Since $\ell(\gamma, c)$ is a convex function of $(\gamma, c)$, $\operatorname{Hess} \ell(\gamma, c)$ is at least positive semi-definite for every $(\gamma, c)$.
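The Newton step described below needs this gradient and Hessian explicitly. A minimal MATLAB sketch, assuming the same xvals/yvals as above (the function name is ours); here $\sigma(z) = 1/(1 + e^{-z})$, and note that $p(x_i) = \sigma(-(\gamma + c^T x_i))$:

     % Gradient and Hessian of ell over (gamma, c); a sketch, not optimized.
     function [g, H] = nll_grad_hess(gamma, c, xvals, yvals)
         z = gamma + xvals*c;
         sig = 1./(1 + exp(-z));        % sigma(z_i); note p(x_i) = sigma(-z_i)
         r = sig - 1 + yvals;           % d ell / d z_i
         w = sig.*(1 - sig);            % d^2 ell / d z_i^2 > 0
         A = [ones(size(xvals,1),1), xvals];
         g = A'*r;                      % [d ell/d gamma; d ell/d c]
         H = A'*(w.*A);                 % = A'*diag(w)*A, positive semi-definite
     end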

Implement a version of the algorithm described below. Test it on the dataset icu.dat (see icu.txt) available on ICON. Note that this dataset is actually copyrighted by J. Wiley & Sons. Read it in using the script read_icu_data_file.m. All files needed are in ICU-data.zip on ICON. Note that column 2 of line $i$ of data is the $y_i$ value; the remainder of line $i$ after column 2 is the row vector $x_i^T$. This can be extracted by the Matlab code

     yvals = data(:,2);
     xvals = data(:,3:end);

In applying your optimization algorithm, use the initial guess $c = 0$ and set $\alpha = 0.05$.

Algorithm

The algorithm you will develop is mostly based on Newton's method using the vector $s = (s_1, s_2, \ldots, s_n)$ of signs ($s_i = 0$ or $s_i = \pm 1$). We first take a Newton step, but keeping $c_j = 0$ if $s_j = 0$, and ignoring the equation $\partial/\partial c_j\left[\ell(\gamma, c) + \alpha \sum_{i=1}^{n} s_i c_i\right] = 0$ for those $j$. This reduces both the number of unknowns and the number of equations to satisfy. Compute the Newton step $(\delta, d)$ where $\delta$ is the Newton step for $\gamma$ and $d$ the Newton step for $c$. (To do this we need to solve a linear system for both $\delta$ and $d$.) Use an Armijo/backtracking line search limited by the condition that the sign of each $c_i$ should not change. That is, if $s_i = +1$ (so $c_i > 0$), we limit the step length $a$ so that $c_i + a d_i \ge 0$, and if $s_i = -1$ we limit $a$ so that $c_i + a d_i \le 0$. This can be implemented using

$$a_0 \leftarrow \min\left(1, \; \min_{i:\, c_i > 0,\, d_i < 0} \frac{-c_i}{d_i}, \; \min_{i:\, c_i < 0,\, d_i > 0} \frac{-c_i}{d_i}\right)$$

before starting the line search. If after the step we have $c_i + a d_i = 0$ but $c_i \ne 0$, then when we update $c_i$, we also need to update $s_i \leftarrow 0$.
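Putting these pieces together, here is a MATLAB sketch of the reduced Newton direction and the step cap $a_0$, using nll_grad_hess from above (all other names are ours). If $n > N$ the reduced Hessian may be singular; adding a small multiple of the identity to it is one common remedy, left out here for brevity.

     % Reduced Newton direction over gamma and the free entries of c; a sketch.
     [g, H] = nll_grad_hess(gamma, c, xvals, yvals);
     F = find(s ~= 0);                       % free entries: s_j ~= 0
     idx = [1; 1 + F];                       % gamma is unknown 1; c_j is 1+j
     gpen = g;
     gpen(1 + F) = gpen(1 + F) + alpha*s(F); % gradient of alpha*sum_i s_i*c_i
     step = -(H(idx,idx) \ gpen(idx));       % Newton step for (gamma, c_F)
     delta = step(1);
     d = zeros(n,1);
     d(F) = step(2:end);                     % d_j = 0 wherever s_j = 0

     % Step cap a0 so that no c_i crosses zero during the line search.
     a0 = 1;
     for i = 1:n
         if (c(i) > 0 && d(i) < 0) || (c(i) < 0 && d(i) > 0)
             a0 = min(a0, -c(i)/d(i));
         end
     end
     % After accepting a step: any c_i that lands exactly on 0 gets s_i = 0.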

If we simply repeated the Newton-type step described above, then once $c_i = 0$ at some iteration, it would remain zero for all future iterations. This is not good. So after each Newton-type step, we need to see if $|\partial\ell/\partial c_j(\gamma, c)| > \alpha$ for some $j$ where $c_j = 0$. If $\partial\ell/\partial c_j(\gamma, c) > \alpha$ and $c_j = 0$ then we want to decrease $c_j$, and we set $s_j = -1$; if $\partial\ell/\partial c_j(\gamma, c) < -\alpha$ and $c_j = 0$ then we want to increase $c_j$, and we set $s_j = +1$. Otherwise we do not change $s_j$ in this step. Then we take a gradient step for $\ell(\gamma, c) + \alpha \sum_{i=1}^{n} s_i c_i$. If $c_j = 0$ and $|\partial\ell/\partial c_j(\gamma, c)| \le \alpha$ then we do not want to change $c_j$. So our gradient step direction is $(\delta, d)$ where

$$\delta = -\,\partial\ell/\partial\gamma(\gamma, c),$$
$$d_i = \begin{cases} -\left(\partial\ell/\partial c_i(\gamma, c) + \alpha s_i\right) & \text{if } c_i \ne 0 \text{ or } |\partial\ell/\partial c_i(\gamma, c)| > \alpha, \\ 0 & \text{otherwise.} \end{cases}$$

Again, use an Armijo/backtracking line search for this gradient step.
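A minimal backtracking sketch usable for either step, assuming the direction (delta, d) and cap a0 computed as above; the constants mu and beta are conventional choices of ours, not values specified by the assignment:

     % Armijo/backtracking line search on ell(gamma,c) + alpha*s'*c; a sketch.
     mu = 1e-4;  beta = 0.5;  a = a0;
     phi0 = nll(gamma, c, xvals, yvals) + alpha*(s'*c);
     gfull = g;  gfull(2:end) = gfull(2:end) + alpha*s;
     slope = gfull'*[delta; d];        % directional derivative; < 0 for descent
     while nll(gamma + a*delta, c + a*d, xvals, yvals) ...
             + alpha*(s'*(c + a*d)) > phi0 + mu*a*slope
         a = beta*a;                   % shrink until sufficient decrease holds
     end
     gamma = gamma + a*delta;  c = c + a*d;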

The overall algorithm involves taking a Newton-type step as described above, then checking if $|\partial\ell/\partial c_j(\gamma, c)| > \alpha$ where $c_j = 0$ for some $j$; if this is so, then a gradient step is taken. The process then repeats until the KKT conditions are (approximately) satisfied.
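Schematically, the outer loop might look as follows. Here newton_step, update_signs, gradient_step, and kkt_residual are hypothetical helpers wrapping the sketches above (the last one measuring how far the KKT conditions are from holding), and tol is a user-chosen tolerance:

     % Outer loop; schematic only, the helper functions are hypothetical.
     while true
         [gamma, c, s] = newton_step(gamma, c, s, alpha, xvals, yvals);
         [g, ~] = nll_grad_hess(gamma, c, xvals, yvals);
         gc = g(2:end);
         if any(c == 0 & abs(gc) > alpha)   % some zero entry wants to move
             s = update_signs(c, gc, alpha);
             [gamma, c, s] = gradient_step(gamma, c, s, alpha, xvals, yvals);
         elseif kkt_residual(gamma, c, s, gc, alpha) < tol
             break                          % (approximate) KKT point reached
         end
     end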

Report

There are a number of things you need to include in your report.

  1. Description of your algorithm and your implementation. You should give pseudo-code or something similar to describe how your algorithm works. The description of the implementation does not need to include the complete source code, unless it is quite compact. You should submit your code as a separate file, perhaps as a zip file. However, you should include documentation showing the main parts of your implementation, and how they are used.
  2. Give evidence of the correctness of your implementation. Describe what tests you do. These tests can (and, in fact, should) include tests of the pieces of your implementation as well as of the complete system. Computations of gradients and Jacobian matrices are well known as sources of bugs and so should be tested before using them; a finite-difference check like the sketch after this list is one simple test. Comment on how to tell if the computed results are locally optimal or a KKT point for the problem.
  3. Describe the performance of the algorithm in terms of the accuracy of the result(s), the number of iterations of your method, and the computer time spent in the calculations. Comment on how any poor performance you identify could be improved, either by improving the algorithm or by improving the implementation. Does the algorithm fail completely? If so, why? And what could be done to remedy the problem?
  4. Report the results for the above problems. When you do this, you should think carefully about how to represent them so that the reader will be able to extract useful information from your report. Consider combining graphs so that results can be easily compared. Report on the effects of changing the value of $\alpha$ as this is a parameter that is under the control of the user.
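As mentioned in item 2, a finite-difference comparison is a simple way to test gradient code. A sketch using the nll and nll_grad_hess functions above; the test point and the coordinate being checked are arbitrary choices of ours:

     % Central-difference check of one gradient component; a sketch.
     n = size(xvals, 2);
     gamma0 = 0.1;  c0 = 0.01*randn(n,1);  h = 1e-6;  k = 3;  % test d ell/d c_k
     [g, ~] = nll_grad_hess(gamma0, c0, xvals, yvals);
     e = zeros(n,1);  e(k) = 1;
     fd = (nll(gamma0, c0 + h*e, xvals, yvals) ...
         - nll(gamma0, c0 - h*e, xvals, yvals)) / (2*h);
     disp(abs(fd - g(1 + k)));    % should be O(h^2) if the gradient is correct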
