COMS 4771-2 Fall-B 2020 HW 4 (due December 16, 2020)
Instructions
Submitting the write-up
• Submit your write-up on Gradescope as a neatly typeset (not scanned nor hand-written) PDF document by 10:00 PM of the due date (New York City local time).
• You may produce the PDF using LATEX, Word, Markdown, etc.—whatever you like—as long as the output is readable!
• On Gradescope, be sure to select the pages containing your answer for each problem. (If you don’t select pages containing your answer to a problem, you’ll receive a zero for that problem.)
Submitting as a group
• If you are working in a group (of up to three members), make sure all group members’ names and UNIs appear prominently on the first page of the write-up.
• Only one group member should submit the write-up on Gradescope. That student must make sure to add all of the group members to the submission.
Submitting the source code
• Please submit all requested source code on Gradescope in a single ZIP file.
• Use the concatenation of your group members’ UNIs followed by .zip as the filename for the
ZIP file (e.g., abc1234def5678.zip).
• Use the format hwXproblemY for filenames (before the extension), where X is the homework
number and Y is the problem number (e.g., hw1problem3).
• If you are submitting a Jupyter notebook (.ipynb), you must also include a separate Python
source file (.py) with the same basename that contains only the Python code.
• Do not include the data files that we have provided; only include source code.
• Only one group member should submit the code on Gradescope. That student must
make sure to add all of the group members to the submission.
Please consult the Gradescope Help Center if you need help with submitting.
Problem 1 (10 points)
In this problem, you will practice checking the convexity (or concavity) of real-valued functions. We say a function f : Rd → R is concave if −f is convex.
Are the following functions convex, concave, or neither? If convex or concave, provide a brief justification (e.g., by verifying an appropriate definition of convexity or concavity).
(a) Let f : R → R be the function defined by f(x) = −x⁴.
(b) Let f : R → R be the function defined by f(x) = x³.
(c) Let f : R → R be the function defined by f(x) = sin(x).
(d) Let f : R2 → R be the function defined by

f(x) = (1/2) xᵀAx − bᵀx + c,

where

A = [ 6  2
      2  6 ],   b = (1, 2)ᵀ,   and   c = 9.
(e) Let f : R2 → R be the function defined by

f(x) = (1/2) xᵀ(A − 2λ1v1v1ᵀ)x + bᵀx + c,

where λ1 is the largest eigenvalue of A (defined in Part (d)), and v1 is a unit-length eigenvector corresponding to λ1.
(f) Let f : R2 → R be the function defined by f(x) = |bᵀx| where b is defined in Part (d).
(g) Let f : R2 → R be the function defined by f(x) = |xᵀMx| where M = A − 2λ1v1v1ᵀ as defined in Part (e).
(h) Let f : R2 → R be the function given by

f(x) = (σ(bᵀx) − 1)² + (σ(−bᵀx) − 0)²,

where σ(t) = 1/(1 + exp(−t)) is the sigmoid function, and b is defined in Part (d).
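For the quadratic functions in Parts (d) and (e), convexity is governed by the definiteness of the Hessian, which you can sanity-check numerically before writing a justification. Below is a minimal sketch (not required for the write-up) using NumPy with the matrix from Part (d); it simply prints the relevant eigenvalues.

import numpy as np

# Matrix from Part (d).
A = np.array([[6.0, 2.0],
              [2.0, 6.0]])

# For a quadratic f(x) = (1/2) x^T A x - b^T x + c, the Hessian is A, and f is
# convex exactly when A is positive semidefinite (all eigenvalues >= 0).
print("eigenvalues of A:", np.linalg.eigvalsh(A))

# Part (e): the Hessian is A - 2*lam1*v1*v1^T, where lam1 is the largest
# eigenvalue of A and v1 is a corresponding unit-length eigenvector.
eigvals, eigvecs = np.linalg.eigh(A)    # eigenvalues in ascending order
lam1, v1 = eigvals[-1], eigvecs[:, -1]
M = A - 2.0 * lam1 * np.outer(v1, v1)
print("eigenvalues of A - 2*lam1*v1*v1^T:", np.linalg.eigvalsh(M))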
Problem 2 (15 points)
In this problem, you will train neural networks on synthetic and OCR data using PyTorch. Template code (hw4p2template.py) is provided on Courseworks. A PyTorch tutorial is available here: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
One-layer neural network on XOR data
First, start with a one-layer neural network fθ : R2 → R given by

fθ(x) := W1x + b1, for all x ∈ R2,
where θ = (W1, b1) ∈ R1×2 × R1 are the parameters of the function. (We are using an identity activation function for the output node.) This is just an affine function! We shall use it as a binary classifier via
x → 1{fθ(x)>0}.
Implement such a neural network in PyTorch, as well as a gradient descent procedure to fit the
parameters to the training data (x1, y1), . . . , (x4, y4) ∈ R2 × {0, 1},
(x1, y1) = ((−1, −1), 0), (x2, y2) = ((+1, −1), 1), (x3, y3) = ((−1, +1), 1), (x4, y4) = ((+1, +1), 0).
This is the “XOR” pattern that we gave as an example of a non-linearly separable data set. This data set is provided by the XOR_data() function in the template code. Run 25 iterations of gradient descent with step size η = 1.0 on the empirical logistic loss risk:
R(θ) = (1/4) Σ_{i=1}^{4} [ yi ln(1 + exp(−fθ(xi))) + (1 − yi) ln(1 + exp(fθ(xi))) ].
In PyTorch, this objective function is implemented as torch.nn.BCEWithLogitsLoss. Use the default PyTorch initialization scheme to set the initial parameters for gradient descent, but do so immediately after setting the random number seed using torch.manual_seed(0). (This will help reproducibility!) Record the objective values and error rates prior to training, and also after every iteration of gradient descent. Report the initial and final values of each, and separately plot the objective values and error rates as a function of the iteration number.
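As a concrete starting point, here is a minimal sketch of the one-layer network and the gradient descent loop. The class/function names match the items requested in the write-up (OneLayerNet, gradient_descent), but the exact interfaces in hw4p2template.py may differ; in particular, this sketch assumes XOR_data() returns a 4 × 2 float tensor X and a 4 × 1 float tensor y.

import torch
import torch.nn as nn

class OneLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 1)          # computes W1 x + b1 (identity output activation)

    def forward(self, x):
        return self.fc1(x)

def gradient_descent(model, X, y, eta, num_iters):
    loss_fn = nn.BCEWithLogitsLoss()        # the empirical logistic loss risk above
    objectives, error_rates = [], []
    for t in range(num_iters + 1):
        logits = model(X)
        loss = loss_fn(logits, y)
        preds = (logits > 0).float()        # classifier x -> 1{f_theta(x) > 0}
        objectives.append(loss.item())      # recorded before training and after every iteration
        error_rates.append((preds != y).float().mean().item())
        if t == num_iters:
            break
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= eta * p.grad           # one full-batch gradient descent step
    return objectives, error_rates

torch.manual_seed(0)                        # set the seed, then use the default initialization
model = OneLayerNet()
# X, y = XOR_data()                         # provided by the template code
# objs, errs = gradient_descent(model, X, y, eta=1.0, num_iters=25)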
Two-layer neural network on XOR data
Next, consider the two-layer neural network function fθ : R2 → R given by fθ(x) := W2σ(W1x + b1) + b2, for all x ∈ R2,
where θ = (W1, b1, W2, b2) ∈ R2×2 × R2 × R1×2 × R1 are the parameters of the function, and σ: R → R is the ReLU function σ(z) := max{0,z}, which is applied coordinate-wise to vector arguments. (Again, we are using an identity activation function for the output node.)
Implement this network in PyTorch and repeat the tasks from the previous part.
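A minimal sketch of the corresponding TwoLayerNet module is below; the gradient_descent routine sketched earlier can be reused without changes.

import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 2)                      # W1 in R^{2x2}, b1 in R^2
        self.fc2 = nn.Linear(2, 1)                      # W2 in R^{1x2}, b2 in R^1

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))        # ReLU hidden layer, identity output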
Three-layer neural network on handwritten digits
Next, consider a three-layer neural network fθ : R64 → R with the following architecture:
• Layer 1: fully-connected (linear) layer with 64 inputs and 64 outputs, with a bias term. Use the ReLU activation function.
• Layer 2: fully-connected (linear) layer with 64 inputs and 32 outputs, with a bias term. Use the ReLU activation function.
• Layer 3: fully-connected (linear) layer with 32 inputs and 1 output, with a bias term. Use identity (i.e., no) activation function.
In other words,
fθ(x) := W3σ(W2σ(W1x + b1) + b2) + b3, for all x ∈ R64,
where θ = (W1, b1, W2, b2, W3, b3) ∈ R64×64 × R64 × R32×64 × R32 × R1×32 × R1 are the parameters of the function, and σ : R → R is the ReLU function σ(z) := max{0, z}, which is applied coordinate-wise to vector arguments.
Implement this network in PyTorch and repeat the tasks from the previous part with a few changes.
• Instead of the XOR data set, use the “digits” data set that is prepared using the digits_data() function in the template code. This data set is obtained using scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html. The first 180 examples are reserved as test examples; the remaining 1617 examples are used as training examples. The binary label to predict is 1 for odd digits, and 0 for even digits.
• You will have to “reshape” the inputs, as each image is given as an 8 × 8 array, but your network takes 64-dimensional vectors as inputs. (The template code shows how to do this.)
• In your gradient descent implementation, use a step size of η = 0.1 (instead of 1.0), and use 500 iterations (instead of 25).
• Also report the test error rate of the network using the final parameters.
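A minimal sketch of a ThreeLayerNet module is shown below. It reshapes 8 × 8 inputs to 64-dimensional vectors inside forward; the template code may handle the reshaping differently, and the commented test-error computation assumes tensors X_test and y_test whose exact form depends on digits_data().

import torch
import torch.nn as nn

class ThreeLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 64)        # layer 1: 64 -> 64, ReLU
        self.fc2 = nn.Linear(64, 32)        # layer 2: 64 -> 32, ReLU
        self.fc3 = nn.Linear(32, 1)         # layer 3: 32 -> 1, identity

    def forward(self, x):
        x = x.view(x.shape[0], -1)          # flatten 8x8 images to 64-dimensional vectors
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# After training (eta = 0.1, 500 iterations), the test error rate could be computed as:
# with torch.no_grad():
#     test_err = ((model(X_test) > 0).float() != y_test).float().mean().item()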
Convolutional neural network on handwritten digits
Finally, consider a convolutional neural network fθ : R8×8 → R with the following architecture:
• Layer 1: 2D convolutional layer with 1 input channel, 8 output channels, and kernel size 3. Use ReLU activation function.
• Layer 2: 2D max pooling layer, with kernel size 2. Use identity (i.e., no) activation function.
• Layer 3: 2D convolutional layer with 8 input channels and 4 output channels, with kernel size 3. Use ReLU activation function.
• Layer 4: fully-connected (linear) layer with 4 inputs and 1 output, with a bias term. Use identity (i.e., no) activation function.
Both convolutional layers should use the defaults for other PyTorch options (stride=1, padding=0, dilation=1, groups=1, bias=True).
Implement this network in PyTorch and repeat the tasks from the previous part. In the network, you will have to “reshape” the input x ∈ R8×8 to be a third-order tensor in R1×8×8 as expected by the first 2D convolutional layer. You may also have to do some “reshaping” before the final fully-connected layer.
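A minimal sketch of a ConvNet module with this architecture is below. The input is reshaped to a (batch, 1, 8, 8) tensor for the first convolutional layer; after the second convolution the feature map has 4 channels of spatial size 1 × 1, so it is flattened to a 4-dimensional vector before the final linear layer.

import torch
import torch.nn as nn

class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3)     # layer 1: 8x8 -> 6x6, 8 channels
        self.pool = nn.MaxPool2d(kernel_size=2)         # layer 2: 6x6 -> 3x3
        self.conv2 = nn.Conv2d(8, 4, kernel_size=3)     # layer 3: 3x3 -> 1x1, 4 channels
        self.fc = nn.Linear(4, 1)                       # layer 4: 4 -> 1, identity

    def forward(self, x):
        x = x.view(x.shape[0], 1, 8, 8)                 # (batch, channels, height, width)
        x = torch.relu(self.conv1(x))
        x = self.pool(x)
        x = torch.relu(self.conv2(x))
        x = x.view(x.shape[0], -1)                      # flatten to (batch, 4)
        return self.fc(x)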
For extra credit (up to 5 points), design your own neural network architecture and train a classifier with it. Describe your approach for designing/selecting the architecture and the training procedure in sufficient detail so that a classmate will be able to exactly reproduce your entire approach. Report all relevant metrics (e.g., training error rates, validation error rates) involved in this development. Note: You should not involve the test set at all during this development. Only once all of the dust has settled, compute and report the test error rate of the final classifier.
What to submit in your write-up
(a) Source code for gradient_descent
(b) Source code for OneLayerNet
(c) One-layer neural network { objective value, training error rate } plots
(d) One-layer neural network { initial, final } { objective value, training error rate }
(e) Source code for TwoLayerNet
(f) Two-layer neural network { objective value, training error rate } plots
(g) Two-layer neural network { initial, final } { objective value, training error rate }
(h) Source code for ThreeLayerNet
(i) Three-layer neural network { objective value, training error rate } plots
(j) Three-layer neural network { initial, final } { objective value, training error rate }
(k) Three-layer neural network final test error rate
(l) Source code for ConvNet
(m) Convolutional neural network { objective value, training error rate } plots
(n) Convolutional neural network { initial, final } { objective value, training error rate }
(o) Convolutional neural network final test error rate
(p) Extra credit work (optional, of course)
Including your source code in the write-up directly will make grading easier. If you are using LATEX, see https://www.overleaf.com/learn/latex/code_listing.
Also include the complete source code in your code submission on Gradescope.
PyTorch functions/classes/methods that may be useful
• torch.Tensor
• torch.Tensor.backward
• torch.Tensor.view
• torch.nn.BCEWithLogitsLoss
• torch.nn.Conv2d
• torch.nn.Linear
• torch.nn.MaxPool2d
• torch.nn.Module
• torch.nn.Module.parameters
• torch.nn.Module.zero_grad
• torch.relu
Problem 3 (10 points)
In this problem, you will consider the gradient descent algorithm for finding an approximate solution to the normal equations in ordinary least squares.
Recall that the objective function for ordinary least squares is

f(w) = ∥Aw − b∥²,

where A ∈ Rn×d is the matrix of feature vectors, and b ∈ Rn is the vector of labels. Its gradient is

∇f(w) = 2Aᵀ(Aw − b).
The normal equations determine where the gradient of f evaluates to 0 ∈ Rd.
The gradient descent algorithm for finding an approximate minimizer of f is as follows. Given an
initial iterate w(0) ∈ Rd and step size η > 0, for t = 1,2,…:
w(t) = w(t−1) − 2ηAᵀ(Aw(t−1) − b).
For simplicity, assume that w(0) = 0.
(a) Briefly explain why w(t) is in the row space of A for all non-negative integers t.
Let us write the eigendecomposition of AᵀA as

AᵀA = Σ_{j=1}^{d} λj vj vjᵀ,

where λ1 ≥ · · · ≥ λd ≥ 0 are the eigenvalues of AᵀA, and v1, . . . , vd are corresponding orthonormal eigenvectors. Furthermore, let ŵ ∈ Rd denote an arbitrary solution to the normal equations.
We can use mathematical induction to prove that the following statement holds for all non-negative integers t:
vjᵀw(t) = 2ηλj Σ_{k=0}^{t−1} (1 − 2ηλj)^k vjᵀŵ   if j ∈ {1, . . . , r},
vjᵀw(t) = 0                                      if j ∈ {r + 1, . . . , d},

where r is the rank of A. (Assume r ≥ 1.)

(b) Briefly explain why the base case (t = 0) is true.
(c) Verify that the inductive step holds. That is, assuming the statement holds for t = t′ for some non-negative integer t′, verify that it holds for t = t′ + 1. (You don’t need to write anything for this step.)
(d) Assume 0 < 1 − 2ηλ1 < 1. Explain why for all non-negative integers t, we have

w(t) = g_{η,t}(AᵀA) Aᵀb,

where

g_{η,t}(z) = 1{z>0} · (1 − (1 − 2ηz)^t) / z.

(Recall that g(AᵀA) is shorthand for Σ_{j=1}^{d} g(λj) vj vjᵀ for any scalar function g : R → R.) Succinctly describe what w(t) converges to as t → ∞.
(e) How does v1ᵀw(t) evolve with t if 2ηλ1 = 1? Briefly explain your answer.
(f) How does v1ᵀw(t) evolve with t if 1 < 2ηλ1 < 2? Briefly explain your answer.
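The identity in Part (d) is easy to sanity-check numerically before proving it. Below is a quick sketch (not part of the required write-up) that compares the gradient descent iterate with the closed form on a random stand-in matrix A and vector b.

import numpy as np

rng = np.random.default_rng(0)
n, d, t_max = 20, 5, 50
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

lam, V = np.linalg.eigh(A.T @ A)             # eigendecomposition of A^T A
eta = 1.0 / (4.0 * lam.max())                # ensures 0 < 1 - 2*eta*lam_1 < 1

# Gradient descent from w(0) = 0.
w = np.zeros(d)
for _ in range(t_max):
    w = w - 2.0 * eta * A.T @ (A @ w - b)

# Closed form: g_{eta,t}(z) = 1{z > 0} * (1 - (1 - 2*eta*z)^t) / z, applied to A^T A.
safe_lam = np.maximum(lam, 1e-12)            # avoid division by zero for zero eigenvalues
g = np.where(lam > 1e-12, (1.0 - (1.0 - 2.0 * eta * lam) ** t_max) / safe_lam, 0.0)
w_closed = V @ (g * (V.T @ (A.T @ b)))       # g_{eta,t}(A^T A) A^T b

print(np.max(np.abs(w - w_closed)))          # should be numerically tiny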
Problem 4 (15 points)
In this problem, you will fit a logistic regression model to labeled text data using SGD.
The data set to use is the same as that from Homework 3 Problem 2, and the task is again to learn a linear classifier. However, this time you will implement a stochastic gradient method for
minimizing the objective
f(w) = (1/n) Σ_{i=1}^{n} ln(1 + exp(−yi xiᵀ w)).
Your algorithm should make a total of eight passes through the training data; each pass should iterate through the n training examples in a uniformly chosen random order, updating the weight vector w after each training example. The update to the weight vector w based on the training example (x, y) ∈ Rd × {−1, 1} is
w ← w − ηg
where g is the gradient of the function w → ln(1 + exp(−y xᵀw)), evaluated at the current weight vector. You should use the zero vector as the initial weight vector, and use a “step size” of η = 1/4. Save the weight vector w(p) at the end of each pass p ∈ {1, . . . , 8}, as well as the value of f(w(p)), and the training and test error rates of the linear classifier corresponding to w(p).
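For reference, here is a minimal sketch of the method described above. It assumes the Homework 3 Problem 2 data have already been loaded into a NumPy feature matrix X (shape n × d) and a label vector y with entries in {−1, +1}; the data-loading code is not shown, and these variable names are placeholders.

import numpy as np

def sgd_logistic(X, y, eta=0.25, num_passes=8, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                              # initial weight vector
    snapshots = []
    for p in range(num_passes):
        for i in rng.permutation(n):             # uniformly random order for each pass
            margin = y[i] * (X[i] @ w)
            g = -y[i] * X[i] / (1.0 + np.exp(margin))   # gradient of w -> ln(1 + exp(-y x^T w))
            w = w - eta * g
        snapshots.append(w.copy())               # w^(p) at the end of pass p
    return snapshots

def objective(w, X, y):
    # f(w) = (1/n) * sum_i ln(1 + exp(-y_i x_i^T w))
    return float(np.mean(np.log1p(np.exp(-y * (X @ w)))))

def error_rate(w, X, y):
    return float(np.mean(np.sign(X @ w) != y))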
(a) Run your stochastic gradient method from above with the data set from Homework 3 Problem 2. Plot f(w(p)) as a function of p. Separately, plot both the training and test error rates of the linear classifier corresponding to w(p) as functions of p. Label the axes and provide appropriate captions and legends. Include these plots in your write-up.
(b) Note that your algorithm is randomized, so if you run it again (with a different random seed...), you may get slightly different results. Run it a total of 10 times (with different random seeds), each time recording the various quantities described above. Make the same plots as in Part (a) for the average values across the 10 repetitions, along with error bars that correspond to one standard deviation above and below the average.
(c) Report the (average) training and test error rates of the linear classifier corresponding to w(8) (where the average is taken over the 10 repetitions, as above), as well as the linear classifier you obtain from the hill-climbing algorithm from Homework 3 Problem 2 after eight iterations
(rather than 500 iterations). How do they compare?
Include the source code in your code submission on Gradescope.
Please ALSO include a copy/screenshot of your source code in your write-up. This will make grading easier. If you are using LATEX, see https://www.overleaf.com/learn/latex/code_listing.