COMS 4771-2 Fall-B 2020 HW 3 (due December 7, 2020)
Instructions
Submitting the write-up
• Submit your write-up on Gradescope as a neatly typeset (not scanned nor hand-written) PDF document by 10:00 PM of the due date (New York City local time).
• You may produce the PDF using LaTeX, Word, Markdown, etc.—whatever you like—as long as the output is readable!
• On Gradescope, be sure to select the pages containing your answer for each problem. (If you don’t select pages containing your answer to a problem, you’ll receive a zero for that problem.)
Submitting as a group
• If you are working in a group (of up to three members), make sure all group members’ names and UNIs appear prominently on the first page of the write-up.
• Only one group member should submit the write-up on Gradescope. That student must make sure to add all of the group members to the submission.
Submitting the source code
• Please submit all requested source code on Gradescope in a single ZIP file.
• Use the concatenation of your group members’ UNIs followed by .zip as the filename for the
ZIP file (e.g., abc1234def5678.zip).
• Use the format hwXproblemY for filenames (before the extension), where X is the homework
number and Y is the problem number (e.g., hw1problem3).
• If you are submitting a Jupyter notebook (.ipynb), you must also include a separate Python
source file (.py) with the same basename that contains only the Python code.
• Do not include the data files that we have provided; only include source code.
• Only one group member should submit the code on Gradescope. That student must
make sure to add all of the group members to the submission.
Please consult the Gradescope Help Center if you need help with submitting.
Problem 1 (5 points)
In this problem, you will derive and interpret algorithms for approximately maximizing the log-likelihood objective for logistic regression. (This is a prelude to Problem 2.)
In the logistic regression model for classification data (X_1, Y_1), . . . , (X_n, Y_n) (iid random examples, each taking values in R^d × {−1, +1}), the log-likelihood of the parameter vector w ∈ R^d given data (X_i, Y_i) = (x_i, y_i) for i = 1, . . . , n is

J(w) = ∑_{i=1}^n −ln(1 + e^{−y_i x_i^T w}).
(We omitted terms that don’t involve w).
A way to approximately maximize an objective function is to use a “hill-climbing” algorithm on the objective’s landscape. Suppose we have a candidate solution w ∈ R^d, which has log-likelihood J(w). We seek a “step” δ ∈ R^d such that w + δ has higher log-likelihood than w does, i.e., J(w + δ) > J(w). We then replace w with w + δ and repeat.
The question, of course, is what step δ will lead to a higher log-likelihood.
Here are two possible options, each based on multivariate Taylor series approximations:
1. The first option is motivated by the affine approximation of J(w + δ):

J(w) + ∇J(w)^T δ.

By the Cauchy-Schwarz inequality, this is at most J(w) + ∥∇J(w)∥_2 ∥δ∥_2, and the upper bound is achieved by δ satisfying

δ = η ∇J(w)

for some η > 0. (Well, it is also achieved when η = 0, but that’s not useful.) Of course, the affine approximation may only be accurate if ∥δ∥_2 is small, which means we have to choose η small enough in order for the improvement to be realized.
2. The second option is motivated by the quadratic approximation of J(w + δ):

J(w) + ∇J(w)^T δ + (1/2) δ^T ∇²J(w) δ.

This is a quadratic function of δ. It turns out it is a concave quadratic (i.e., “hill” shaped, as opposed to “bowl” shaped), and hence it is maximized when

δ = −∇²J(w)^{−1} ∇J(w).

This, of course, assumes that ∇²J(w) is invertible. (A “step size” η may not be necessary if the quadratic approximation is very accurate in the neighborhood of w.)
The options described above are generic strategies for (approximately) maximizing numerical objective functions. In this problem, you will consider these strategies in the specific case of the log-likelihood objective for the logistic regression model. This will involve analytically determining details of ∇J and ∇²J, and it will also afford the opportunity to interpret the resulting procedures.
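To make the two generic strategies concrete before specializing them, here is a minimal Python sketch (purely illustrative and not required for the assignment; the toy objective, its gradient, and the step size eta = 0.1 are our own choices, not part of the problem) showing one step of each update on a concave quadratic objective:

import numpy as np

# Toy concave objective J(w) = -||w - c||^2, whose maximizer is c.
c = np.array([1.0, -2.0])

def grad_J(w):
    return -2.0 * (w - c)            # gradient of the toy objective at w

def hess_J(w):
    return -2.0 * np.eye(len(w))     # Hessian of the toy objective (constant here)

w = np.zeros(2)
eta = 0.1

# Option 1: step in the gradient direction (affine approximation).
w_affine = w + eta * grad_J(w)

# Option 2: Newton-style step (quadratic approximation); exact here because J is quadratic.
w_quadratic = w - np.linalg.solve(hess_J(w), grad_J(w))

print(w_affine)      # [ 0.2 -0.4]: a small move toward c
print(w_quadratic)   # [ 1. -2.]: lands exactly on the maximizer c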
(a) Suppose (X_1, Y_1), . . . , (X_n, Y_n) are random examples whose distribution comes from the logistic regression model with parameter vector w ∈ R^d. Derive a formula for the conditional variance of Y_i given X_i = x_i. Your answer should be given in terms of x_i and w. (Below, we’ll refer to this as var_w(Y_i | X_i = x_i).)
(b) Verify the following formulas for ∇J(w) and ∇²J(w):

∇J(w) = ∑_{i=1}^n Pr_w(Y_i = −y_i | X_i = x_i) y_i x_i

∇²J(w) = −(1/4) ∑_{i=1}^n var_w(Y_i | X_i = x_i) x_i x_i^T

where Pr_w(·) and var_w(·) refer to the probability distribution for (X_1, Y_1), . . . , (X_n, Y_n) from the logistic regression model with parameter vector w. (You don’t need to write anything for this part in your write-up.)
(c) The hill-climbing method based on the affine approximation is as follows. Start with w^(1) = 0, and then for t = 1, 2, . . . :

δ^(t) := ∑_{i=1}^n Pr_{w^(t)}(Y_i = −y_i | X_i = x_i) y_i x_i

w^(t+1) := w^(t) + η δ^(t)

Observe that δ^(t) is a linear combination of the vectors y_i x_i for i = 1, . . . , n, where the “weight” on the i-th example is Pr_{w^(t)}(Y_i = −y_i | X_i = x_i). If Pr_{w^(t)}(Y_i = −y_i | X_i = x_i) is close to zero (i.e., Pr_{w^(t)}(Y_i = y_i | X_i = x_i) is close to one), then y_i x_i does not really contribute much to δ^(t). So the largest contributions to δ^(t) are from the y_i x_i for which the parameter vector w^(t) does not have particularly high likelihood given (X_i, Y_i) = (x_i, y_i).
What is a formula for δ^(1) given in terms of (x_1, y_1), . . . , (x_n, y_n)? Keep in mind that w^(1) = 0. Briefly explain your answer.
(d) The hill-climbing method based on the quadratic approximation is as follows. Start with w^(1) = 0, and then for t = 1, 2, . . . :

δ^(t) := ( (1/4) ∑_{i=1}^n var_{w^(t)}(Y_i | X_i = x_i) x_i x_i^T )^{−1} ∑_{i=1}^n Pr_{w^(t)}(Y_i = −y_i | X_i = x_i) y_i x_i

w^(t+1) := w^(t) + δ^(t)

(We assume that the matrix in the definition of δ^(t) is invertible.) Compared to δ^(t) from the method based on the affine approximation, this method applies a linear transformation that is the inverse of a matrix that is a linear combination of the outer product matrices x_i x_i^T. The resulting δ^(t) can thus point in a rather different direction than the vector without this linear transformation.
What is a formula for δ^(1) given in terms of (x_1, y_1), . . . , (x_n, y_n)? Keep in mind that w^(1) = 0. Briefly explain your answer. Does it remind you of a formula you have seen before?
Problem 2 (20 points)
In this problem, you will fit a logistic regression model to labeled text data.
The data set is a subset of the “20 Newsgroups data set”; you can obtain it from Courseworks, and load it into Python with the following commands:
from scipy.io import loadmat
news = loadmat('news_crypt_space.mat')
# news['crypt'] is the (sparse) matrix of training feature vectors from sci.crypt
# news['space'] is the (sparse) matrix of training feature vectors from sci.space
# news['crypt_test'] is the (sparse) matrix of test feature vectors from sci.crypt
# news['space_test'] is the (sparse) matrix of test feature vectors from sci.space
The data are messages posted to two different newsgroups, sci.crypt (“different methods of data en/decryption”) and sci.space (“space, space programs, space related research, etc.”). There are 1187 feature vectors for training, and 787 feature vectors for testing. The representation of each message is a (sparse) binary vector in {0, 1}d (for d = 61188), where the j-th entry in the feature vector is 1 if and only if the corresponding message contains the word in the j-th line of the text file news.vocab.txt (also available on Courseworks). Regard the messages from sci.crypt as negative examples (label −1) and the messages from sci.space as positive examples (label +1).
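For reference, one way (among several) to stack the two newsgroups into a single sparse training matrix with ±1 labels is sketched below; the variable names are just illustrative, and scipy.sparse.vstack keeps everything sparse:

import numpy as np
from scipy.io import loadmat
from scipy.sparse import vstack

news = loadmat('news_crypt_space.mat')

# sci.crypt messages are labeled -1 and sci.space messages are labeled +1.
X_train = vstack([news['crypt'], news['space']]).tocsr()
y_train = np.concatenate([-np.ones(news['crypt'].shape[0]),
                          +np.ones(news['space'].shape[0])])

X_test = vstack([news['crypt_test'], news['space_test']]).tocsr()
y_test = np.concatenate([-np.ones(news['crypt_test'].shape[0]),
                         +np.ones(news['space_test'].shape[0])])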
(a) Implement the hill-climbing algorithm from Problem 1(c) with η = 1/(4n) (where n = 1187 is the number of training examples; this is a somewhat arbitrary choice . . . ), and run the program with the training data as input to compute w^(1), . . . , w^(501). (Of course, w^(1) is just the zero vector.) Plot (1/n) J(w^(t)) as a function of t (in the range {1, . . . , 501}). Label the axes and provide an appropriate caption. Include the plot in your write-up.
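A minimal sketch of one possible implementation is given below (assuming the X_train and y_train arrays from the earlier loading sketch). It uses the fact that, in this model, Pr_w(Y_i = −y_i | X_i = x_i) = 1/(1 + e^{y_i x_i^T w}); the sparse-matrix details are one way to do it, not a prescribed recipe, and plotting the recorded history against t is left to matplotlib:

import numpy as np

def log_likelihood(w, X, y):
    # J(w) = sum_i -ln(1 + exp(-y_i x_i^T w)), computed stably with logaddexp.
    margins = y * (X @ w)
    return -np.sum(np.logaddexp(0.0, -margins))

def hill_climb(X, y, T=500):
    n, d = X.shape
    eta = 1.0 / (4.0 * n)
    w = np.zeros(d)
    history = [log_likelihood(w, X, y) / n]      # value at w^(1) = 0
    for t in range(T):
        margins = y * (X @ w)
        p_wrong = 1.0 / (1.0 + np.exp(margins))  # Pr_w(Y_i = -y_i | X_i = x_i)
        delta = X.T @ (p_wrong * y)              # sum_i Pr_w(...) y_i x_i
        w = w + eta * delta
        history.append(log_likelihood(w, X, y) / n)
    return w, history

w_final, history = hill_climb(X_train, y_train, T=500)   # w_final is w^(501)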
(b) Report the training and test error rates for the classifier corresponding to w^(501). Both should be below 5%.
(c) For each t, the parameter vector w^(t) corresponds to a linear classifier of the following form:

x ↦ −1 if x^T w^(t) ≤ 0;  x ↦ +1 if x^T w^(t) > 0.
Plot both the training error rate of these classifiers and the test error rate (using the test data) of these classifiers as a function of t. Label the axes and provide an appropriate caption and legend. Include the plot in your write-up.
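For parts (b) and (c), a small helper like the following (our own sketch, matching the tie-breaking rule above that predicts −1 when x^T w^(t) = 0) computes the error rate for a given parameter vector:

import numpy as np

def error_rate(w, X, y):
    # Predict +1 when x^T w > 0 and -1 otherwise, then compare against the labels.
    predictions = np.where(X @ w > 0, +1.0, -1.0)
    return np.mean(predictions != y)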
(d) The number of iterations (in the case above, T = 500) to execute the hill-climbing algorithm should also be regarded as a hyperparameter that should be determined with cross validation (though often it is constrained by your patience and/or other factors). If you were willing to regard the test data as validation data in the context of the hold-out validation method (perhaps because you have other data set aside for use as test data. . . ), what would you select for T (from among the possible values of {0, . . . , 500})? Feel free to propose a model averaging
approach instead of model selection, if you like. In either case, write a brief justification.
(e) Consider the weight vector w^(501) = (w_1^(501), . . . , w_d^(501)) ∈ R^d. Report the vocabulary words whose indices j ∈ {1, . . . , d} correspond to the 10 largest (i.e., most positive) w_j^(501) values, and also the vocabulary words whose indices j ∈ {1, . . . , d} correspond to the 10 smallest (i.e., most negative) w_j^(501) values. Don’t report the indices j themselves, but rather the actual vocabulary words (from news.vocab.txt). Please order each list by decreasing |w_j^(501)| value.
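One way to build the two word lists (a sketch assuming w_final holds w^(501) from part (a) and that the line order of news.vocab.txt matches the feature indices):

import numpy as np

with open('news.vocab.txt') as f:
    vocab = [line.strip() for line in f]

order = np.argsort(w_final)                             # indices from most negative to most positive weight
most_positive = [vocab[j] for j in order[-10:][::-1]]   # 10 most positive weights, in decreasing |w_j| order
most_negative = [vocab[j] for j in order[:10]]          # 10 most negative weights, in decreasing |w_j| order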
(f) There are a lot of software packages that implement (or at least purport to implement) maximum likelihood estimation for logistic regression. Find and learn to use one such implementation (e.g., sklearn, statsmodels), and use it to compute an approximate maximum likelihood estimate of the parameter w given the training data. Briefly describe the package and subroutines that you use; report both the training and test error rates of the corresponding linear classifier as in Part (b); and also report the two lists of “high weight” words for this classifier as in Part (e).
Caution: The fitting procedure in some packages may, with default settings, try to optimize an objective other than the log-likelihood. For example, they may attempt to optimize a log-posterior probability based on some prior (as in “MAP”, discussed in lecture). Or, they may use a slightly different model, such as one that includes a y-intercept coefficient. These, of course, may be reasonable choices, but they would not make the results comparable to what you obtained in the other parts of this problem.
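For example, with scikit-learn (one option among many), regularization and the intercept must be switched off to match the plain maximum-likelihood objective described above. A sketch follows; the exact option spelling (penalty='none' vs. penalty=None) depends on your sklearn version, so check the documentation of whichever release you use:

from sklearn.linear_model import LogisticRegression

# No penalty and no intercept, so the fitted objective is the plain log-likelihood.
clf = LogisticRegression(penalty='none', fit_intercept=False, max_iter=1000)
clf.fit(X_train, y_train)

train_error = 1.0 - clf.score(X_train, y_train)
test_error = 1.0 - clf.score(X_test, y_test)
w_mle = clf.coef_.ravel()    # weight vector, for the word lists as in Part (e)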
(g) What is a potential obstacle for implementing the hill-climbing procedure from Problem 1(d)?
Include the source code in your code submission on Gradescope.
Please ALSO include a copy/screenshot of your source code in your write-up. This will make grading easier. If you are using LaTeX, see https://www.overleaf.com/learn/latex/code_listing.
Problem 3 (15 points)
In this problem, you will discover the connection between the pseudoinverse and minimum norm solutions to normal equations.
Definition of pseudoinverse
The (Moore-Penrose) pseudoinverse of a matrix A ∈ R^{n×d}, typically denoted by A†, can be defined via the singular value decomposition of A as follows. Let r denote the rank of A, and let the singular value decomposition of A be written as

A = U S V^T

where U = [u_1 | · · · | u_r] ∈ R^{n×r} is the matrix of (orthonormal) left singular vectors of A, V = [v_1 | · · · | v_r] ∈ R^{d×r} is the matrix of (orthonormal) right singular vectors of A, and S = diag(σ_1, . . . , σ_r) ∈ R^{r×r} is the diagonal matrix of (positive) singular values of A.
(a) Convince yourself that the column space of A is the same as the column space of U, and that the row space of A is the same as the column space of V . (You don’t need to write anything for this part in your write-up.)
The pseudoinverse A† ∈ R^{d×n} of the matrix A is

A† = V S^{−1} U^T.
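As an optional numerical sanity check (not required for this problem), the definition can be computed directly with numpy's reduced SVD and compared against numpy.linalg.pinv; the matrix below is an arbitrary full-rank example:

import numpy as np

A = np.random.randn(5, 3)    # full column rank with probability 1, so r = 3

# Reduced SVD: U is 5 x 3, s holds the positive singular values, Vt is 3 x 3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T    # V S^{-1} U^T

print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True, up to floating-point error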
Minimum Euclidean norm solution to the normal equations
Consider A ∈ R^{n×d} (with SVD A = U S V^T as above) and b ∈ R^n. We are interested in determining the minimum (Euclidean) norm solution to the normal equations:

A^T A w = A^T b.

Recall (from lecture) that solving the normal equations is equivalent to solving Aw = b̂, where b̂ is the orthogonal projection of b onto the column space of A, i.e., b̂ = U U^T b.
(b) Briefly explain why finding a solution to Aw = b̂ is equivalent to finding a solution to V^T w = S^{−1} U^T b.
In light of (a), we seek the minimum norm w such that V^T w = S^{−1} U^T b. Recall (also from lecture) that the minimum norm w must also be in the row space of A, i.e., w must be in the column space of V.
(c) Briefly explain why ŵ := V S^{−1} U^T b is in the column space of V and also satisfies V^T ŵ = S^{−1} U^T b.
(d) Can there exist another vector q (different from ŵ above) in the column space of V that also satisfies V^T q = S^{−1} U^T b? Briefly explain why or why not.
From the arguments above, we see that A†b gives the minimum norm solution to the normal equations.
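This conclusion can also be checked numerically (again optional): numpy.linalg.lstsq returns the minimum-norm least-squares solution as well, so it should agree with A†b; the wide random matrix below makes the normal equations underdetermined:

import numpy as np

A = np.random.randn(3, 6)    # wide matrix: A^T A w = A^T b has infinitely many solutions
b = np.random.randn(3)

w_pinv = np.linalg.pinv(A) @ b
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(w_pinv, w_lstsq))              # True: both are the minimum-norm solution
print(np.allclose(A.T @ (A @ w_pinv), A.T @ b))  # True: the normal equations are satisfied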
Problem 4 (10 points)
In this problem, you will use singular value decomposition for compressing a single image.
Get the Jupyter notebook hw3p4.ipynb from Courseworks and implement the function compressed_image. This function takes as input a positive integer k, as well as an m × n pixel image, which is represented as three m × n matrices R, G, B ∈ [0, 1]^{m×n}, one for each color channel. It should return a new image (i.e., three new matrices R′, G′, B′ ∈ R^{m×n}) such that each of R′, G′, B′ is the best rank k approximation of the corresponding matrix from the input image. By “best rank k approximation”, we mean closest in terms of Frobenius norm.
The Jupyter notebook has code that shows how this function will be used. We recommend you use
numpy library functions like numpy.linalg.svd and numpy.diag in your implementation.
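By the Eckart-Young theorem, the best rank k approximation in Frobenius norm is obtained by truncating the SVD. A sketch of that single-matrix building block (which compressed_image would apply separately to each of R, G, B; the function name here is our own) is:

import numpy as np

def rank_k_approximation(M, k):
    # Keep only the k largest singular values/vectors of M.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]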
Include the source code in your code submission on Gradescope.
Please ALSO include a copy/screenshot of your source code for compressed_image in your write-up. This will make grading easier. If you are using LaTeX, see https://www.overleaf.com/learn/latex/code_listing.