EECS4404/5327, Winter 2021 Assignment 3
For both parts, you will need to produce a report which you will submit online through eClass. Both parts are
due Monday, December 6 at 11:59pm. Late submissions will not be accepted.
Mixture Models
Consider the Gaussian Mixture Model which assumes the data has been generated from the distribution
p(y|θ) =
K∑
j=1
πjN (y|µj ,Σj)
where θ = {(πj , µj ,Σj)}Kj=1 are the parameters of the mixture model with constraints: πj > 0 for all j,
∑K
j=1 πj = 1
and that Σj must be a symmetric positive definite matrix for all j.
1. Traditionally a mixture model is fit with something called the Expectation-Maximization (EM) algo-
rithm. However, it’s also possible to fit it directly using gradient-based optimization. This can be
very useful for GMMs when there are a large number of data points and can be used to fit mixtures
of non-Gaussian distributions. However to do this the gradients of the objective function needs to be
computed. In this question you’ll derive the needed gradients and find a way to handle the constraints
on the parameters. For simplicity, use the Maximum Likelihood objective
E(θ) = −
N∑
i=1
log p(y(i)|θ)
(a) Derive the gradient of E(θ) with respect to µj , ∂E∂µj .
(b) Derive the gradient of E(θ) with respect to πj , ∂E∂πj . Ignore the constraints on the values of πj .
(c) A general way to handle constraints on parameters is to reparameterize the function in terms
of unconstrained variables. In the case of the mixing proportions we have the constraints
πj > 0 and
∑K
j=1 πj = 1. One way to reparameterize the mixing proportions is with the
softmax function
πj = sj(w1, . . . , wK) =
exp(wj)∑K
k=1 exp(wk)
where {wj}Kj=1 are the new parameters. For any inputs w1, . . . , wK the outputs of the
softmax function always satisfy the constraints. Show that
∂πj
∂wk
=
{
πj(1− πj) j = k
−πjπk j ̸= k
(d) Use the results from (b) and (c) to derive the gradient of E(θ) with respect to wk. Hint:
Use the chain rule to combine the results to compute ∂E
∂wj
.
2. We will now consider the first and second moments for a Gaussian Mixture Model.
(a) Derive E[y].
(b) Derive E[yyT ] and cov[y] = E[(y − E[y])(y − E[y])T ].
Page 1
EECS4404/5327, Winter 2021 Assignment 3
Neural Networks and Stochastic Gradient Descent
In this assignment you will implement a neural network for multi-class classification, including the forward and
back propagation algorithms, and train it using stochastic gradient descent with the cross-entropy loss. The dataset
we’ll use is called Fashion MNIST. The classic MNIST dataset of handwritten digits that we used in assignment
2 is largely considered to be “solved,” meaning that state-of-the-art methods are effectively getting nearly 100%
accuracy on the test set. As a result it is no longer considered a useful dataset for research but is convenient to
run on due to it’s low dimensionality and relatively small size. To address this, the Fashion MNIST dataset was
created, containing images of clothing objects, e.g., shoes, hats, coats, shirts, pants, etc. The images have the exact
same dimensionality and datatype as MNIST and there are the same number of class labels (10) making it easy
for it to replace MNIST in research. However Fashion MNIST is a harder dataset, currently the state-of-the-art
accuracy is around 95%. You can check out the leaderboard here.
Step 1
Your first step is to copy the Colab starter code and complete the implementation. Throughout the notebook are
several places where you will need to implement things marked with TODO typically followed by a line (or lines)
with “…” to indicate places you should fill in. Look for those places and begin implementing based on the comments
provided in the code. You will need to, roughly, implement the following:
• The ReLU activation function φ(a) = max(0, a) and it’s derivative.
• The softmax activation function φ(a) = exp a∑
i
exp ai
and it’s derivative. Note that a numerically stable implemen-
tation of this function will look more like φ(a) = exp(a−c)∑
i exp(ai−c)
where c = maxi ai. (This is a direct variation
of the log-sum-exp trick mentioned earlier in the class.)
• The cross entropy loss function ℓ(y, ŷ) = −
∑
i yi log ŷi where y is a onehot encoded representation of the
class label where yi = 1 if i is the correct class label and yj = 0 otherwise. Note that I have provided a
onehot_encode function which converts the integer class labels into onehot encoded vectors for you.
• Parts of the backprop algorithm, specifically, the propagation of the derivatives of the loss with respect to the
hidden layer activations back through the activation functions and linear transformations. That is, updating
∂L
∂zi
= ∂L
∂zi+1
∂zi+1
∂ai+1
∂ai+1
∂zi
. You don’t need to implement the derivatives for the linear layer parameters at each
layer ∂L
∂Wi
, ∂L
∂bi
as I have implemented these for you. You may wish to refer to section 13.3 of the textbook and
the course notes for this.
• Parts of the SGD algorithm, specifically, updating the parameters given the gradient and randomly subsam-
pling the training data into a mini-batch.
Step 2
After implementing the above, you will train a network on the Fashion MNIST dataset to explore the impact of
different parameters. To assess performance you will look at the (mini)batch, training and validation losses and
accuracies. Somewhat unusually for practice, the code is setup to compute the loss function and accuracy on the
full training set periodically to allow you to compare the approximate loss function values that SGD is optimizing
versus the loss function that the full gradient descent algorihm would optimize. The loss and accuracy computed
on the full training set is referred to as the “training loss” and “training accuracy” while the loss and accuracy of a
minibatch at each iteration is the “batch loss” and “batch accuracy”. Based on your outputs, answer the following
questions in your report:
1. For a single training run, compare the training loss and batch loss curves. Both curves show a general
decreasing trend during training but have some significant differences. Describe the differences and connect
the differences that you see back to what you know about SGD.
2. Another set of experiments looks at how the optimization behaves with different batch sizes. In these experi-
ments, the amount of computation performed is roughly the same for each run, so versions that run with large
batch sizes take a proportionately smaller number of iterations. A number of other parameters are adjusted
to try to make the different runs comparable. Comment on how the plots differ for different batch sizes and
explain what you see in terms of what you know about SGD. What batch size do you think it best and why?
Page 2
https://github.com/zalandoresearch/fashion-mnist
http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/
https://colab.research.google.com/drive/1gcMdE57VeNdAdVbLFRj4s8eSir_F_sAa?usp=sharing
EECS4404/5327, Winter 2021 Assignment 3
3. A final set of experiments looks at different numbers of hidden units and compares the best achieved training
and validation performance in each case. What happens to validation vs training loss as the number of hidden
units increases? Explain the behaviour you see based on what you know is happening to the model as the
number of hidden units increase. Propose an approach to help alleviate the issues you see as the number of
hidden units increase.
Step 3
There are a large number of hyperparameters that need to be set for a neural network to specify it’s architecture: the
number of hidden layers, the number of units per hidden layer, the activation functions used, the number of training
iterations, the batch size, the initial step size, the step size decay rate, etc. Explore the space of hyperparameter
choices for this dataset to try to find the one which achieves the best accuracy on the validation set. Note, there is
no specific performance that you need to achieve for this question. You may work on this step with other students,
however you must have done your own implementation and work for all previous steps. If you choose to work with
other students on this part of the assignment, everyone must still submit their own report with their own answers
to the questions below.
In your report answer the following questions:
1. What strategy did you take to try to explore the space of hyperparameters? There are many ways to approach
this. You can vary individual parameters, one at a time in a systematic way or in a random way. You can
randomly or systematically vary all of them together. You can try to interpret statistics of the training process
(loss curves, etc) to guess what needs to happen, etc.
2. What set of parameters did you find that gave you the best results on the validation set and what was both
the average loss value and the accuracy on the training and validation sets?
3. [If you worked with other students] List all other students you worked with in this step of the assignment
and describe how you worked together. Failure to list a student you worked with could be considered academic
misconduct.
Submission
You must submit by the deadline a report answer the above questions as a PDF and your code. Detailed submission
instructions will be posted on eClass.
Page 3