HOMEWORK 5: NEURAL NETWORKS
10-301/10-601 Introduction to Machine Learning (Spring 2021)
http://www.cs.cmu.edu/ ̃mgormley/courses/10601/
DUE: Monday, March 29, 2021 11:59 PM
Summary In this assignment, you will build a handwriting recognition system using a neural network. In the Written component, you will walk through an on-paper example of how to implement a neural network. Then, in the Programming component, you will implement an end-to-end system that learns to perform handwritten letter classification.
START HERE: Instructions
• Collaboration Policy: Please read the collaboration policy here: http://www.cs.cmu.edu/
̃mgormley/courses/10601/
• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/ ̃mgormley/courses/10601/
• Submitting your work: You will use Gradescope to submit answers to all questions and code. Please follow instructions at the end of this PDF to correctly submit all your code to Gradescope.
– Written: For written problems such as short answer, multiple choice, derivations, proofs, or plots, we will be using Gradescope (https://gradescope.com/). Please use the provided template. Submissions must be written in LaTeX. Regrade requests can be made, however this gives the staff the opportunity to regrade your entire paper, meaning if additional mistakes are found then points will be deducted. Each derivation/proof should be completed in the boxes provided. For short answer questions you should not include your work in your solution. If you include your work in your solutions, your assignment may not be graded correctly by our AI assisted grader.
– Programming: You will submit your code for programming questions on the homework to Gradescope (https://gradescope.com). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM). When you are developing, check that the version number of the programming language environment (e.g. Python 3.6.9, OpenJDK 11.0.5, g++ 7.4.0) and versions of permitted libraries (e.g. numpy 1.17.0 and scipy 1.4.1) match those used on Gradescope. You have a total of 10 Gradescope programming submissions. Use them wisely. In order to not waste Gradescope submissions, we recommend debugging your implementation on your local machine (or the linux servers) and making sure your code is running correctly first before submitting you code to Gradescope.
• Materials: The data that you will need in order to complete this assignment is posted along with the writeup and template on Piazza.
1
Linear Algebra Libraries When implementing machine learning algorithms, it is often convenient to have a linear algebra library at your disposal. In this assignment, Java users may use EJMLa or ND4Jb and
C++ users Eigenc. Details below. (As usual, Python users have NumPy.)
EJML for Java EJML is a pure Java linear algebra package with three interfaces. We strongly
recommend using the SimpleMatrix interface. The autograder will use EJML version 0.38. When compiling and running your code, we will add the additional command line argument -cp “linalg_lib/ejml-v0.38-libs/*:linalg_lib/nd4j-v1.0.0-beta7-libs/*:./” to ensure that all the EJML jars are on the classpath as well as your code.
ND4J for Java ND4J is a library for multidimensional tensors with an interface akin to Python’s NumPy. The autograder will use ND4J version 1.0.0-beta7. When com- piling and running your code, we will add the additional command line argument -cp “linalg_lib/ejml-v0.38-libs/*:linalg_lib/nd4j-v1.0.0-beta7-libs/*:./” to ensure that all the ND4J jars are on the classpath as well as your code.
Eigen for C++ Eigen is a header-only library, so there is no linking to worry about—just #include what- ever components you need. The autograder will use Eigen version 3.3.7. The command line arguments above demonstrate how we will call you code. When compiling your code we will include, the argument -I./linalg_lib in order to include the linalg_lib/Eigen subdirectory, which contains all the head- ers.
We have included the correct versions of EJML/ND4J/Eigen in the linalg_lib.zip posted on the Piazza Resources page for your convenience. It contains the same linalg_lib/ directory that we will include in the current working directory when running your tests. Do not include EJML, ND4J, or Eigen in your homework submission; the autograder will ensure that they are in place.
a https://ejml.org
b https://deeplearning4j.org/docs/latest/nd4j- overview c http://eigen.tuxfamily.org/
Page 2
Written Questions (44 points) 1 Convolutional Neural Networks
In this problem, consider only the convolutional layer of a standard implementation of a CNN as described in Lecture 12.
1. We are given image X and filter F below.
(a) (1 point) Let X be convolved with F using no padding and a stride of 1 to produce an output Y . What is value of j in the output Y ?
Answer
(b) (1 point) Suppose you had an input feature map of size 6×4 and filter size 2×2, using no padding and a stride of 2, what would be the resulting output size? Write your answer in terms of height × width.
Answer
Page 3
2. Parameter sharing is a very important concept for CNN because it drastically reduces the complexity of the learning problem. For the following questions, assume that there is no bias term in our convolutional layer.
(a) (1 point) Which of the following are parameters of a convolutional layer?
Select all that apply:
stride size padding size
image size
filter size
weights in the filter None of above.
(b) (1 point) Which of the following are hyperparameters of a convolutional layer?
Select all that apply:
stride size padding size
image size
filter size
weights in the filter None of above.
(c) (1point) Supposefortheconvolutionallayer,wearegivenblackandwhiteimagesofsize22×22. Using one single 4 × 4 filter with a stride of 2 and no padding, what is the number of parameters you are learning in this layer?
Answer
(d) (1point) Supposeinsteadofsharingthesamefilterfortheentireimage,youlearnanewfiltereach time you move across the image. Using 4 × 4 filters with a stride of 2 and no padding, what is the number of parameters you are learning in this layer?
Answer
(e) (1 point) Now suppose you are given a 40 × 40 colored image, which consists of 3 channels (so your input is a 40 × 40 × 3 tensor), each representing the intensity of one primary color. Without
Page 4
sharing, using 4 × 4 filters with a stride of 2 and no padding, what is the number of parameters you are learning in this layer?
Answer
(f) (1 point) Parameter sharing is not usually used for fully-connected layers, but is usually used for convolutional layers. In a sentence, describe a reason why parameter sharing is a good idea for a convolutional layer, besides reduction in problem complexity. Hint: think about applications of CNNs.
Answer
Page 5
2 Backpropagation
x1
x2 z1 yˆ1
x3 z2 yˆ2 Figure 1: A Directed Acyclic Graph (DAG)
Consider this above Directed Acyclic Graph (DAG). Notice that it looks different than the fully-connected Neural Networks that we have seen before. Recall from lecture that you can perform back propagation on any DAG. We will work through back propagation on this graph.
Let (x,y) be the training example that is considered, where x = x1 All the other nodes in the graph are defined as:
z1 = ReLU(w1,1×1 + w2,1×2)
z2 = w2,2×2 + w3,2×3 + b2
yˆ 1 = σ ( m 1 , 1 z 13 + c 1 )
yˆ2 = m1,2 sin(z1) + m2,2 cos(z2)
x2 x3T and y = y1
y2T.
1 is the sigmoid function, and ReLU(x) = max(0, x). Let θ be the set of all parameters 1+e−x
where σ(x) =
to be learned in this graph. We have that
θ = {w1,1, w2,1, w2,2, w3,2, m1,1, m1,2, m2,2, b2, c1}
For every set of input x = (x1,x2,x3), we will define objective of the problem as minimizing the loss
function
J(θ) = log (y1 − yˆ1)2 + log (y2 − yˆ2)2
Assume that you have already gone through the forward pass with inputs x = (x1, x2, x3) and stored all the relevant values. In the following questions, you will derive the backpropagation algorithm applied to the above DAG.
(a) (1 point) First, we will derive the gradients with repect to the outputs. What are the expressions for ∂J ? Write your solution in terms of yˆ1.
∂ yˆ 1 Answer
Page 6
(b) Now, we will derive the gradients associated with the last layer, ie nodes y1, y2. Note that for the full
backpropagation algorithm, you would need to calculate the gradients of the loss function with respect
to every parameter (m1,1, m1,2, m2,2, c1) as well as every input of the layer (z1, z2), but we are not
asking for all of them in this part. For all of the questions in this part, you should use Chain Rule, and
write your solution in terms of values from the forward pass, or gradients with respect to the outputs
of this layer, ∂J , ∂J , because you have already calculated these values. In addition, use the sigmoid ∂ yˆ 1 ∂ yˆ 2
function σ(x) in your answer instead of its explicit form.
(a) (1 point) What is the expression for
Answer
(b) (1 point) What is the expression for
Answer
(c) (1 point) What is the expression for
Answer
(d) (1 point) What is the expression for
Answer
(e) (1 point) What is the expression for
Answer
∂J ? ∂z1
∂J ? ∂z2
∂J ? ∂ m1,1
∂J ? ∂ m1,2
∂J ? ∂c1
Page 7
(f) (1 point) Lastly, we will derive the gradients associated with the second layer, ie nodes z1, z2. Note
that for the full backpropagation algorithm, you need to calculate the gradients of the loss function with
respect to every parameter (every wi,j,b2). However, we do not need to calculate the gradients with
respect to the inputs of this layer (x1,x2,x3), because they are fixed inputs of the model. For all of
the questions in this part, you should use Chain Rule, and write your solution in terms of values from
the forward pass or gradients with respect to the outputs of this layer, ∂J , ∂J , because you have ∂z1 ∂z2
already calculated these values.
(a) (1 point) What is the expression for ∂J ?
∂ w2,2
Answer
(b) (1 point) What is the expression for ∂J ? ∂b2
Answer
(c) (2 points) Recall that ReLU(x) = max(x, 0). The ReLU function is not differentiable at x = 0, but for backpropagation, we define its derivative as
Now, what is the expression for
Answer
ReLU′(x) =
∂J ? Explicitly write out the cases.
∂ w1,1
0, x < 0
1, otherwise
Page 8
3 Backpropagation Through Time
Consider the following one-layer many-to-many Recurrent neural network (RNN),
o V
W x
Figure 2: A one hidden layer many-to-many Recurrent Neural Network Recall from lecture, this is equivalent to
o(1) o(2) ...... o(n) ......
h(0) h(1) h(2) h(... ) h(n) h(... )
x(1) x(2) ...... x(n) ......
Figure 3: The unfolding of the above RNN in time of the computation involved in its forward computation
Assume h(0) is given. To keep things simple, we consider an RNN with identity activation function and no bias terms. We also assume that W = I, the identity matrix. For each timestep n ≥ 1, the hidden state and output can be expressed as
h(n) =f(h(n−1),x(n),U)=Ux(n) +h(n−1)
o(n) =g(h(n),V)=Vh(n)
where x(n) ∈ Ra, h(n) ∈ Rb, o(n) ∈ Rc, U ∈ Rb×a, and V ∈ Rc×b. Assume finite time N. The loss can
then be expressed as
∂J ∂J ∂x2
∂x11 ∂x12 ∂x1m ∂J ∂J ∂J ... ∂J ∂x21 ∂x22 ∂x2m
= ∂x ...
=..... ∂X . . . .
h U
N
J(θ) = l(y(t), o(t))
t=1
where l(y(t), o(t)) is the loss at time step t.
For the following questions, express your answer in terms of x(n), h(n), o(n), U, V unless specified other-
wise. Some rules of matrix calculus may be helpful: for J ∈ R, x ∈ Rn, X ∈ Rn×m,
∂J ∂x1
∂J ∂J...∂J
∂J ∂J∂J∂J ∂xn ∂xn1 ∂xn2 ... ∂xnm
Page 9
1. Assume N = 1. For the following subparts (a)-(d), you may give your answer in terms of ∂l . ∂ o(1)
(a) (1 point) What is
Answer
(b) (1 point) What is
Answer
(c) (1 point) What is
Answer
∂J ? ∂ h(1)
∂J ? ∂V
∂J ? ∂U
2. Assume N = 2. For the following subparts (a)-(d), you may give your answer in terms of ∂l .
∂l and ∂ o(1)
∂ o(2)
(a) (1 point) What is
Answer
(b) (1 point) What is
Answer
(c) (1 point) What is
Answer
∂J ? ∂ h(1)
∂J ? ∂V
∂J ? ∂U
Page 10
3. Now let’s generalize the result to an arbitrary time N > 1. For the following subparts (a)-(c), you may
give your answer in terms of ∂l ∂ o(n)
(a) (2 points) What is ∂J ? ∂ h(n)
Answer
(b) (2 points) What is ∂J ? ∂V
Answer
(c) (2 points) What is ∂J ? ∂U
Answer
, n = 1,…,N.
Page 11
4 Empirical Questions
The following questions should be completed after you work through the programming portion of this assignment.
For these questions, use the large dataset. Use the following values for the hyperparameters unless other- wise specified:
Parameter
Value
Number of Hidden Units 50
Weight Initialization
Learning Rate 0.01
Please submit computer-generated plots for (a)i and (b)i. Note: we expect it to take about 5 minutes to train each of these networks.
1. Hidden Units
(a) (2 points) Train a single hidden layer neural network using the hyperparameters mentioned in the table above, except for the number of hidden units which should vary among 5, 20, 50, 100, and 200. Run the optimization for 100 epochs each time.
Plot the average training cross-entropy (sum of the cross-entropy terms over the training dataset divided by the total number of training examples) on the y-axis vs number of hidden units on the x-axis. In the same figure, plot the average validation cross-entropy.
Avg. Train and Validation Cross-Entropy Loss
RANDOM
Page 12
(b) (2 points) Examine and comment on the the plots of training and validation cross-entropy. What is the effect of changing the number of hidden units?
Answer
2. Learning Rate
(a) (6 points) Train a single hidden layer neural network using the hyperparameters mentioned in the table above, except for the learning rate which should vary among 0.1, 0.01, and 0.001. Run the optimization for 100 epochs each time.
Plot the average training cross-entropy on the y-axis vs the number of epochs on the x-axis for the mentioned learning rates. In the same figure, plot the average validation cross-entropy loss. Make a separate figure for each learning rate.
Plot LR 0.1
Page 13
Plot LR 0.01
Plot LR 0.001
Page 14
(b) (2 points) Examine and comment on the plots of training and validation cross-entropy. How does adjusting the learning rate affect the convergence of cross-entropy on the datasets?
Answer
3. Weight Initialization
(a) (2 points) For this exercise, you can work on any data set. Initialize α and β to zero and print them out after the first few updates. For example, you may use the following command to begin:
Compare the values across rows and columns in α and β. Comment on what you observed. Do you think it is reasonable to use zero initialization? Why or why not?
Answer
$ python neuralnet.py smallTrain.csv smallValidation.csv \ smallTrain_out.labels smallValidation_out.labels \ smallMetrics_out.txt 1 4 2 0.1
Page 15
Collaboration Questions
After you have completed all other components of this assignment, report your answers to these questions
regarding the collaboration policy. Details of the policy can be found here.
1. Did you receive any help whatsoever from anyone in solving this assignment? Is so, include full
details.
2. Did you give any help whatsoever to anyone in solving this assignment? Is so, include full details.
3. Did you find or come across code that implements any part of this assignment ? If so, include full details.
Your Answer
Page 16
Programming (56 points)
Figure 4: 10 Random Images of Each of 10 Letters in OCR
5 The Task
Your goal in this assignment is to label images of handwritten letters by implementing a Neural Network from scratch. You will implement all of the functions needed to initialize, train, evaluate, and make predic- tions with the network.
The programs you write will be automatically graded using the Gradescope system. You may write your programs in Python, Java, or C++. However, you should use the same language for all parts below.
6 The Datasets
Datasets We will be using a subset of an Optical Character Recognition (OCR) dataset. This data includes images of all 26 handwritten letters; our subset will include only the letters “a,” “e,” “g,” “i,” “l,” “n,” “o,” “r,” “t,” and “u.” The handout contains three datasets drawn from this data: a small dataset with 60 samples per class (50 for training and 10 for validation), a medium dataset with 600 samples per class (500 for training and 100 for validation), and a large dataset with 1000 samples per class (900 for training and 100 for validation). Figure 4 shows a random sample of 10 images of few letters from the dataset.
File Format Each dataset (small, medium, and large) consists of two csv files—train and validation. Each row contains 129 columns separated by commas. The first column contains the label and columns 2 to 129 represent the pixel values of a 16 × 8 image in a row major format. Label 0 corresponds to “a,” 1 to “e,” 2 to “g,” 3 to “i,” 4 to “l,” 5 to “n,” 6 to “o,” 7 to “r,” 8 to “t,” and 9 to “u.” Because the original images are black-and-white (not grayscale), the pixel values are either 0 or 1. However, you should write your code to
Page 17
accept arbitrary pixel values in the range [0, 1]. The images in Figure 4 were produced by converting these pixel values into .png files for visualization. Observe that no feature engineering has been done here; instead the neural network you build will learn features appropriate for the task of character recognition.
7 Model Definition
In this assignment, you will implement a single-hidden-layer neural network with a sigmoid activation function for the hidden layer, and a softmax on the output layer. Let the input vectors x be of length M , the hidden layer z consist of D hidden units, and the output layer yˆ be a probability distribution over K classes. That is, each element yˆk of the output vector represents the probability of x belonging to the class k.
yˆk = Kl=1 exp(bl) D
bk =βk,0+βkjzj j=1
zj= 1
1 + exp(−aj )
M
aj = αj,0 + αjmxm
m=1
We can compactly express this model by assuming that x0 = 1 is a bias feature on the input and that z0 = 1 is also fixed. In this way, we have two parameter matrices α ∈ RD×(M+1) and β ∈ RK×(D+1). The extra 0th column of each matrix (i.e. α·,0 and β·,0) hold the bias parameters.
yˆk = Kl=1 exp(bl) D
bk =βkjzj j=0
zj= 1
1 + exp(−aj )
M
aj = αjmxm
m=0
The objective function we will use for training the neural network is the average cross entropy over the
training dataset D = {(x(i), y(i))}:
J(α, β) = − y(i) log(yˆ(i)) (1)
exp(bk )
exp(bk )
1NK Nkk
i=1 k=1
In Equation 1, J is a function of the model parameters α and β because yˆ(i) is implicitly a function of x(i),
α, and β since it is the output of the neural network applied to x(i). Of course, yˆ(i) and y(i) are the kth
components of yˆ(i) and y(i) respectively.
Page 18
k
kk
To train, you should optimize this objective function using stochastic gradient descent (SGD), where the gradient of the parameters for each training example is computed via backpropagation. Note that SGD has a slight impact on the objective function, where we are “summing” over the current point, i:
7.1 Initialization
K
JSGD(α, β) = − y(i) log(yˆ(i)) (2) kk
k=1
In order to use a deep network, we must first initialize the weights and biases in the network. This is typically done with a random initialization, or initializing the weights from some other training procedure. For this assignment, we will be using two possible initialization:
RANDOM The weights are initialized randomly from a uniform distribution from -0.1 to 0.1. The bias parameters are initialized to zero.
ZERO Allweightsareinitializedto0.
You must support both of these initialization schemes.
8 Implementation
Write a program neuralnet.{py|java|cpp|m} that implements an optical character recognizer using a one hidden layer neural network with sigmoid activations. Your program should learn the parameters of the model on the training data, report the cross-entropy at the end of each epoch on both train and validation data, and at the end of training write out its predictions and error rates on both datasets.
Your implementation must satisfy the following requirements:
• Use a sigmoid activation function on the hidden layer and softmax on the output layer to ensure it forms a proper probability distribution.
• Number of hidden units for the hidden layer should be determined by a command line flag.
• Support two different initialization strategies, as described in Section 7.1, selecting between them
via a command line flag.
• Usestochasticgradientdescent(SGD)tooptimizetheparametersforonehiddenlayerneuralnetwork.
The number of epochs will be specified as a command line flag.
• Set the learning rate via a command line flag.
• Perform stochastic gradient descent updates on the training data in the order that the data is given in the input file. Although you would typically shuffle training examples when using stochastic gra- dient descent, in order to autograde the assignment, we ask that you DO NOT shuffle trials in this assignment.
• In case there is a tie in the output layer yˆ, predict the smallest index to be the label.
• You may assume that the input data will always have the same output label space (i.e. {0, 1, . . . , 9}). Other than this, do not hard-code any aspect of the datasets into your code. We will autograde your programs on multiple data sets that include different examples.
• Do not use any machine learning libraries. You may use supported linear algebra packages. See Section 8.1 for more details.
Page 19
Implementing a neural network can be tricky: the parameters are not just a simple vector, but a collection of many parameters; computational efficiency of the model itself becomes essential; the initialization strat- egy dramatically impacts overall learning quality; other aspects which we will not change (e.g. activation function, optimization method) also have a large effect. These tips should help you along the way:
8.1
•
•
•
Try to “vectorize” your code as much as possible—this is particularly important for Python. For ex- ample, in Python, you want to avoid for-loops and instead rely on numpy calls to perform operations such as matrix multiplication, transpose, subtraction, etc. over an entire numpy array at once. Why? Because these operations are actually implemented in fast C code, which won’t get bogged down the way a high-level scripting language like Python will.
For low level languages such as Java/C++, the use of primitive arrays and for-loops would not pose any computational efficiency problems—however, it is still helpful to make use of a linear algebra library to cut down on the number of lines of code you will write.
Implement a finite difference test to check whether your implementation of backpropagation is cor- rectly computing gradients. If you choose to do this, comment out this functionality once your back- ward pass starts giving correct results and before submitting to Gradescope—since it will otherwise slow down your code.
Command Line Arguments
The autograder runs and evaluates the output from the files generated, using the following command: For Python:
For Java: For C++:
Where above [args…] is a placeholder for nine command-line arguments:
1.
2.
3.
4.
5.
6.
7.
Page 20
$ python3 neuralnet.py [args…]
$ javac -cp “./lib/ejml-v0.38-libs/*:./” neuralnet.java
$ java -cp “./lib/ejml-v0.38-libs/*:./” neuralnet [args…]
$ g++ -g -std=c++11 -I./lib neuralnet.cpp; ./a.out [args…]
8.
9.
As an example, if you implemented your program in Python, the following command line would run your program with 4 hidden units on the small data provided in the handout for 2 epochs using zero initialization and a learning rate of 0.1.
8.2 Output: Labels Files
\
$ python3 neuralnet.py smallTrain.csv smallValidation.csv \ smallTrain_out.labels smallValidation_out.labels smallMetrics_out.txt 2 4 2 0.1
Your program should write two output .labels files containing the predictions of your model on train- ing data (
Your labels should exactly match those of a reference implementation – this will be checked by the auto- grader by running your program and evaluating your output file against the reference solution.
Note: You should output your predicted labels using the same integer identifiers as the original training data. You should also insert an empty line (again using ’\n’) at the end of each sequence (as is done in the input data files). The first few lines of the predicted labels for the validation dataset is given below.
8.3 Output Metrics
Generate a file where you report the following metrics:
crossentropy AftereachStochasticGradientDescent(SGD)epoch,reportmeancrossentropyonthetrain- ing data crossentropy(train) and validationdata crossentropy(validation) (See Equation 1). These two cross-entropy values should be reported at the end of each epoch and prefixed by the epoch number. For example, after the second pass through the training examples, these should be prefixed by epoch=2. The total number of train losses you print out should equal num epoch— likewise for the total number of validation losses.
6
4 8 8
error Afterthefinalepoch(i.e.whentraininghascompletedfully),reportthefinaltrainingerrorerror(train) and validation error error(validation).
A sample output is given below. It contains the train and validation losses for the first 2 epochs and the final error rate when using the command given above.
epoch=1 crossentropy(train): 2.18506276114
epoch=1 crossentropy(validation): 2.18827302588
Page 21
epoch=2 crossentropy(train): 1.90103257727
epoch=2 crossentropy(validation): 1.91363803461
error(train): 0.77
error(validation): 0.78
Take care that your output has the exact same format as shown above. There is an equal sign = between the word epoch and the epoch number, but no spaces. There should be a single space after the epoch number (e.g. a space after epoch=1), and a single space after the colon preceding the metric value (e.g. a space afterepoch=1 likelihood(train):).EachlineshouldbeterminatedbyaUnixlineending\n.
8.4 Tiny Data Set
To help you with this assignment, we have also included a tiny data set, tinyTrain.csv and tinyValidation.csv, and a reference output file tinyOutput.txt for you to use. The tiny dataset is in a format similar to the other datasets, but it only contains two samples with five features. The reference file contains outputs from each layer of one correctly implemented neural network, for both forward and back-propagation steps. We advise you to use this set to help you debug in case your implementation doesn’t produce the same results as in the written part.
There is more information in the README.md file. Do read through the README file if you plan to use it for debugging. For your reference, tinyOutput.txt is generated from the following command line specifications:
The specific output file names are not important, but be sure to keep the other arguments exactly as they are shown above.
9 Gradescope Submission
You should submit your neuralnet.{py|java|cpp} to Gradescope. Please do not use any other file name for your implementation. This will cause problems for the autograder to correctly detect and run your code.
Some additional tips: Make sure to read the autograder output carefully. The autograder for Gradescope prints out some additional information about the tests that it ran. For this programming assignment we’ve specially designed some buggy implementations that you might implement and will try our best to detect those and give you some more useful feedback in Gradescope’s autograder. Make wise use of autograder’s output for debugging your code.
Note: For this assignment, you may make up to 10 submissions to Gradescope before the deadline, but only your last submission will be graded.
$ python3 neuralnet.py tinyTrain.csv tinyValidation.csv \ tinyTrain_out.labels tinyValidation_out.labels tinyMetrics_out.txt \
1 4 2 0.1
Page 22
A Implementation Details for Neural Networks
This section provides a variety of suggestions for how to efficiently and succinctly implement a neural network and backpropagation.
A.1 SGD for Neural Networks
Consider the neural network described in Section 8 applied to the ith training example (x, y) where y is a one-hot encoding of the true label. Our neural network outputs yˆ = hα,β(x), where α and β are the parameters of the first and second layers respectively and hα,β(·) is a one-hidden layer neural network with a sigmoid activation and softmax output. The loss function is negative cross-entropy J = l(yˆ, y) = −yT log(yˆ). J = Jx,y(α, β) is actually a function of our training example (x, y), and our model parameters α, β though we write just J for brevity.
In order to train our neural network, we are going to apply stochastic gradient descent. Because we want the behavior of your program to be deterministic for testing on Gradescope, we make a few simplifications: (1) you should not shuffle your data and (2) you will use a fixed learning rate. In the real world, you would not make these simplifications.
SGD proceeds as follows, where E is the number of epochs and γ is the learning rate. Algorithm 1 Stochastic Gradient Descent (SGD) without Shuffle
1: 2: 3: 4: 5: 6: 7:
8:
9: 10: 11:
12: 13:
procedureSGD(TrainingdataD,testdataDt) Initialize parameters α, β
for e ∈ {1,2,…,E} do
◃ Use either RANDOM or ZERO from Section 7.1 ◃ For each epoch ◃ For each training example (No shuffling)
14:
for (x, y) ∈ D do
Compute neural network layers:
o = object(x, a, b, z, yˆ, J) = NNFORWARD(x, y, α, β) Compute gradients via backprop:
∂J g=
α ∂α
∂J = NNBACKWARD(x, y, α, β, o)
g=
∂β Update parameters:
β
α←α−γgα β←β−γgβ
Evaluate training mean cross-entropy JD (α, β) Evaluate test mean cross-entropy JDt (α, β)
return parameters α, β
At test time, we output the most likely prediction for each example:
Algorithm 2 Prediction at Test Time
1: procedurePREDICT(UnlabeledtrainortestdatasetD′,Parametersα,β)
2: 3: 4:
for x ∈ D′ do
Compute neural network prediction yˆ = h(x)
Predict the label with highest probability l = argmaxk yˆk
The gradients we need above are themselves matrices of partial derivatives. Let M be the number of input
Page 23
features, D the number of hidden units, and K the number of outputs.
αα…α 10 11 1M
dJ dJ…dJ
∂α . . … . dJ dJ…dJ
dαD0 dαD1 dαDM
dJ dJ…dJ
∂J … α= g = = dα20 dα21 dα2M
α20 α21 … α2M ….α…
(3)
(4)
. . .. . αα…α
D0 D1 DM
ββ…β 10 11 1D
dβ10 dβ11 dβ1D dJdJ dJ
…
β20 β21 … β2D
β= g = = dβ20 dβ21 dβ2D
….β…
. . .. . ββ…β
∂β . . … . dJ dJ…dJ
dβK0 dβK1 dβKD
K0 K1 KD
Observe that we have (in a rather tricky fashion) defined the matrices such that both α and gα are D × (M + 1) matrices. Likewise, β and gβ are K × (D + 1) matrices. The +1 comes from the extra columns α·,0 and β·,0 which are the bias parameters for the first and second layer respectively. We will always assume x0 = 1 and z0 = 1. This should greatly simplify your implementation as you will see in Section A.3.
A.2 Recursive Derivation of Backpropagation
In class, we described a very general approach to differentiating arbitrary functions: backpropagation. One way to understand how we go about deriving the backpropagation algorithm is to consider the natural con- sequence of recursive application of the chain rule.
In practice, the partial derivatives that we need for learning are dJ and dJ . dαij dβkj
A.2.1 Symbolic Differentiation
Note In this section, we motivate backpropagation via a strawman: that is, we will work through the wrong approach first (i.e. symbolic differentiation) in order to see why we want a more efficient method (i.e. backpropagation). Do not use this symbolic differentiation in your code.
Suppose we wanted to find dJ using the method we know from high school calculus. That is, we will dαij
analytically solve for an equation representing that quantity.
1. Considering the computational graph for the neural network, we observe that αij has exactly one child ai = Mm=0 αimxm. That is ai is the first and only intermediate quantity that uses αij. Applying the chain rule, we obtain
dJ = dJ dai = dJ xj dαij dai dαij dai
2. So far so good, now we just need to compute dJ . Not a problem! We can just apply the chain dai
rule again. ai just has exactly one child as well, namely zi = σ(ai). The chain rule gives us that
dJ = dJ dzi = dJ zi(1 − zi). Substituting back into the equation above we find that dai dzi dai dzi
dJ = dJ (zi(1 − zi))xj dαij dzi
∂J
dα10 dα11 dα1M dJdJ dJ
Page 24
3. How do we get dJ ? You guessed it: apply the chain rule yet again. This time, however, there are dzi
multiple children of zi in the computation graph; they are b1 , b2 , . . . bK . Applying the chain rule gives us that dJ = K dJ ∂bk = K dJ βki. Substituting back into the equation above gives:
dzi k=1 dbk ∂zi k=1 dbk
dJ = K dJ βki(zi(1 − zi))xj
dαij k=1 dbk
4. Next we need dJ , which we again obtain via the chain rule: dJ = K dJ ∂yˆl = K dJ yˆl(I[k =
dbk dbk l=1 dyˆl ∂bk l=1 dyˆl l] − yˆk). Substituting back in above gives:
dJ K K dJ
= yˆl(I[k = l] − yˆk)βki(zi(1 − zi))xj
dαij k=1 l=1 dyˆl
5. Finally, we know that dJ = − yl which we can again substitute back in to obtain our final result:
dyˆl yˆl
dJ K K yl
dα =−yˆyˆl(I[k=l]−yˆk)βki(zi(1−zi))xj ij k=1 l=1 l
Although we have successfully derived the partial derivative w.r.t. αij , the result is far from satisfying. It is overly complicated and requires deeply nested for-loops to compute.
The above is an example of symbolic differentiation. That is, at the end we get an equation representing the partial derivative w.r.t. αij . At this point, you should be saying to yourself: What a mess! Isn’t there a better way? Indeed there is and its called backpropagation. The algorithm works just like the above symbolic differentiation except that we never subsitute the partial derivative from the previous step back in. Instead, we work “backwards” through the steps above computing partial derivatives in a top-down fashion.
A.3 Matrix / Vector Operations for Neural Networks
Some programming languages are fast and some are slow. Below is a simple benchmark to show this concretely. The task is to compute a dot-product aT b between two vectors a ∈ R500 and b ∈ R500 one thousand times. Table 1 shows the time taken for several combinations of programming language and data structure.
language
Python Python Java C++
data structure time (ms)
list 200.99 numpy array 1.01 float[] 4.00 vector
Table 1: Computation time required for dot-product in various languages.
Notice that Java1 and C++ with standard data structures are quite efficient. By contrast, Python differs dra- matically depending on which data structure you use: with a standard list object
(e.g.a = [float(i) for x in range(500)])thecomputationtimeisanappallinglyslow200+
1Java would approach the speed of C++ if we had given the just-in-time (JIT) compiler inside the JVM time to “warm-up”.
Page 25
milliseconds. Simply by switching to a numpy array (e.g. a = np.arange(500, dtype=float)) we obtain a 200x speedup. This is because a numpy array is actually carrying out the dot-product computa- tion in pure C, which is just as fast as our C++ benchmark, modulo some Python overhead.
Thus, for this assignment, Java and C++ programmers could easily implement the entire neural network using standard data structures and some for-loops. However, Python programmers would find that their code is simply too slow if they tried to do the same. As such, particularly for Python users, one must convert all the deeply nested for-loops into efficient “vectorized” math via numpy. Doing so will ensure efficient code. Java and C++ programmers can also benefit from linear algebra packages since it can cut down on the total number of lines of code you need to write.
A.4 Procedural Method of Implementation
Perhaps the simplest way to implement a 1-hidden-layer neural network is procedurally. Note that this approach has some drawbacks that we’ll discuss below (Section A.4.1).
The procedural method: one function computes the outputs of the neural network and all intermediate quantities o = NNFORWARD(x, y, α, β) = object(x, a, b, z, yˆ, J), where the object is just some struct. Then another function computes the gradients of our parameters gα, gβ = NNBACKWARD(x, y, α, β, o), where o is a data structure that stores all the forward computation.
One must be careful to ensure that functions are vectorized. For example, your Sigmoid function should be able to take a vector input and return a vector output with the Sigmoid function applied to all of its elements. All of these operations should avoid for-loops when working in a high-level language like Python. We can compute the softmax function in a similar vectorized manner.
A.4.1 Drawbacks to Procedural Method
As noted in Section A.6, it is possible to use a finite difference method to check that the backpropaga- tion algorithm is correctly computing the gradient of its corresponding forward computation. We strongly encourage you to do this.
There is a big problem however: what if your finite difference check informs you that the gradient is not being computed correctly. How will you know which part of your NNFORWARD() or NNBACKWARD() functions has a bug? There are two possible solutions here:
1. As usual, you can (and should) work through a tiny example dataset on paper. Compute each inter- mediate quantity and each gradient. Check that your code reproduces each number. The one that does not should indicate where to find the bug.
2. Replace your procedural implementation with a module-based one (as described in Section A.5) and then run a finite-difference check on each layer of the model individually. The finite-difference check that fails should indicate where to find the bug.
Of course, rather than waiting until you have a bug in your procedural implementation, you could jump straight to the module-based version—though it increases the complexity slightly (i.e. more lines of code) it might save you some time in the long run.
A.5 Module-based Method of Implementation
Module-based automatic differentiation (AD) is a technique that has long been used to develop libraries for deep learning. Dynamic neural network packages are those that allow a specification of the computation
Page 26
graph dynamically at runtime, such as Torch2, PyTorch3, and DyNet4—these all employ module-based AD in the sense that we will describe here.5
The key idea behind module-based AD is to componentize the computation of the neural-network into layers. Each layer can be thought of as consolidating numerous nodes in the computation graph (a subset of them) into one vector-valued node. Such a vector-valued node should be capable of the following and we call each one a module:
1. 2.
A.5.1
Forwardcomputationofoutputb=[b1,…,bB]giveninputa=[a1,…,aA]viasomedifferentiable function f. That is b = f(a).
Backward computation of the gradient of the input ga = ∂J = [ dJ , . . . , dJ ] given the gradient of ∂a da1 daA
output gb = ∂J = [ dJ , . . . , dJ ], where J is the final real-valued output of the entire computation ∂b db1 dbB
graph. This is done via the chain rule dJ = M dJ ∂bj for all i ∈ {1,…,A}. dai j=1 dbj ∂ai
Module Definitions
The modules we would define for our neural network would correspond to a Linear layer, a Sigmoid layer, a Softmax layer, and a Cross-Entropy layer. Each module defines a forward function b = *FORWARD(a), and a backward function ga = *BACKWARD(a, b, gb) method. These methods accept parameters if ap- propriate. You’ll want to pay close attention to the dimensions that you pass into and return from your modules.
Linear Module The linear layer has two inputs: a vector a and parameters ω ∈ RB×A. The output b is not used by LINEARBACKWARD, but we pass it in for consistency of form.
1: 2: 3: 4: 5: 6: 7:
procedureLINEARFORWARD(a,α) Compute b
return b procedureLINEARBACKWARD(a,α,b,gb)
Compute gα Compute ga returngα,ga
It’s also quite common to combine the Cross-Entropy and Softmax layers into one. The reason for this is the cancelation of numerous terms that result from the zeros in the cross-entropy backward calculation. (Said trick is not required to obtain a sufficiently fast implementation for Gradescope.)
A.5.2 Module-based AD for Neural Network
Using these modules, we can re-define our functions NNFORWARD and NNBACKWARD as follows.
2 http://torch.ch/
3 http://pytorch.org/
4 https://dynet.readthedocs.io
5Static neural network packages are those that require a static specification of a computation graph which is subsequently
compiled into code. Examples include Theano, Tensorflow, and CNTK. These libraries are also module-based but the particular form of implementation is different from the dynamic method we recommend here.
Page 27
Algorithm 3 Forward Computation
1: procedureNNFORWARD(Trainingexample(x,y),Parametersα,β)
2: a = LINEARFORWARD(x, α)
3: z = SIGMOIDFORWARD(a)
4: b = LINEARFORWARD(z, β)
5: yˆ = SOFTMAXFORWARD(b)
6: J = CROSSENTROPYFORWARD(y, yˆ)
7: o = object(x, a, z, b, yˆ, J)
8: return intermediate quantities o
Algorithm 4 Backpropagation
1: procedureNNBACKWARD(Trainingexample(x,y),Parametersα,β,Intermediateso)
2: Place intermediate quantities x, a, z, b, yˆ, J in o in scope
3: gJ=∂J=1 ◃Basecase ∂J
4: gyˆ = CROSSENTROPYBACKWARD(y, yˆ, gJ )
5: gb = SOFTMAXBACKWARD(b,yˆ,gyˆ)
6: gβ, gz = LINEARBACKWARD(z, β, gb)
7: ga = SIGMOIDBACKWARD(a, z, gz)
8: gα, gx = LINEARBACKWARD(x, α, ga)
9: return parameter gradients gα , gβ
◃ We discard gx
Here’s the big takeaway: we can actually view these two functions as themselves defining another module! It is a 1-hidden layer neural network module. That is, the cross-entropy of the neural network for a single training example is itself a differentiable function and we know how to compute the gradients of its inputs, given the gradients of its outputs.
A.6 Testing Backprop with Numerical Differentiation
Numerical differentiation provides a convenient method for testing gradients computed by backpropagation. The centered finite difference approximation is:
∂ J(θ)≈ (J(θ+ε·di)−J(θ−ε·di)) (5) ∂θi 2ε
where di is a 1-hot vector consisting of all zeros except for the ith entry of di, which has value 1. Unfortu- nately, in practice, it suffers from issues of floating point precision. Therefore, it is typically only appropriate to use this on small examples with an appropriately chosen ε.
In order to apply this technique to test the gradients of your backpropagation implementation, you will need to ensure that your code is appropriately factored. Any of the modules including NNFORWARD and NNBACKWARD could be tested in this way.
For example, you could use two functions: forward(x,y,theta) computes the cross-entropy for a training example. backprop(x,y,theta) computes the gradient of the cross-entropy for a training example via backpropagation. Finally, finite_diff as defined below approximates the gradient by the centered finited difference method. The following pseudocode provides an overview of the entire procedure.
def finite_diff(x, y, theta): epsilon = 1e-5
Page 28
grad = zero_vector(theta.length) for m in [1, …, theta.length]:
d = zero_vector(theta.length)
d[m] = 1
v = forward(x, y, theta + epsilon * d)
v -= forward(x, y, theta – epsilon * d)
v /= 2*epsilon
grad[m] = v
# Compute the gradient by backpropagation
grad_bp = backprop(x, y, theta)
# Approximate the gradient by the centered finite difference method
grad_fd = finite_diff(x, y, theta)
# Check that the gradients are (nearly) the same
diff = grad_bp – grad_fd # element-wise difference of two vectors
print l2_norm(diff) # this value should be small (e.g. < 1e-7)
A.6.1 Limitations
This does not catch all bugs—the only thing it tells you is whether your backpropagation implementation is correctly computing the gradient for the forward computation. Suppose your forward computation is incorrect, e.g. you are always computing the cross-entropy of the wrong label. If your backpropagation is also using the same wrong label, then the check above will not expose the bug. Thus, you always want to separately test that your forward implementation is correct.
A.6.2 Finite Difference Checking of Modules
Note that the above would test the gradient for the entire end-to-end computation carried output by the neural network. However, if you implement a module-based automatic differentiation method (as in Section A.5), then you can test each individual component for correctness. The only difference is that you need to run the finite-difference check for each of the output values (i.e. a double for-loop).
Page 29