EECS4404/5327, Winter 2021 Assignment 2
The second assignment is meant to explore your understanding of multivariate distributions and logistic regression. You are strongly encouraged to typeset your solutions. Handwritten answers will be accepted, but must be scanned into a PDF of reasonable size. However, all submissions must be easily readable; unreadable answers will be given zero marks. Each part is due separately as indicated; late submissions will not be accepted.
Part 1: Due Tuesday, October 26th at 11:59pm (Toronto time)
1. The Naive Bayes assumption for estimation assumes that the dimensions in a given estimation problem
are (conditionally) independent. For classification we used this to dramatically reduce the number of
parameters that we needed to estimate. The Naive Bayes assumption can also be used for continuous
distributions. Assume x ∈ RD is distributed according to a multivariate Gaussian distribution, i.e.,
p(x) = N (x|µ,Σ). Given a set of samples D = {x(1), . . . ,x(N)}, derive the Maximum Likelihood
estimate of µ and Σ under the Naive Bayes assumption (i.e., that $p(\mathbf{x}) = \prod_{i=1}^{D} p(x_i)$). Be sure to show
your derivation in detail. What specific structure does Σ have and why?
2. Computing the mean vector and covariance matrix of a set of samples is a common operation in data
analysis and machine learning. The standard equations for computing the mean and covariance from N samples are

$$\boldsymbol{\mu}_N = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}^{(i)}, \qquad \boldsymbol{\Sigma}_N = \frac{1}{N} \sum_{i=1}^{N} \left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_N\right)\left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_N\right)^{T}$$
The direct implementation of these equations would require two passes through the data: once to compute µN and once to compute ΣN. However, sometimes this isn't possible or convenient, for instance when N is very large or the data is streaming in and can't be stored. In those cases we would like a recursive form of the equations, so that we can compute µN and ΣN from µN−1 and ΣN−1.
a) Prove that the mean vector can be updated as
$$\boldsymbol{\mu}_N = \boldsymbol{\mu}_{N-1} + \frac{1}{N}\left(\mathbf{x}^{(N)} - \boldsymbol{\mu}_{N-1}\right)$$
b) Using the result from a), derive a method for recursively computing ΣN . That is, you should
derive a formula or set of formulas which allow one to compute ΣN from ΣN−1 and x(N)
without needing to use the values of x(1), . . . ,x(N−1). Hint: Consider recursively computing the average $\frac{1}{N}\sum_{i=1}^{N} \mathbf{x}^{(i)}\left(\mathbf{x}^{(i)}\right)^{T}$ as an auxiliary quantity.
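A quick numerical check can build confidence while you work on this question. The sketch below, a minimal illustration assuming only numpy (the array shapes and names are mine, not from any starter code), compares the two-pass batch mean against the recursive update from part a). A check like this does not replace the proof, but it will catch algebra mistakes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # N = 100 samples of dimension D = 3

# Recursive update from part a): mu_N = mu_{N-1} + (x^(N) - mu_{N-1}) / N
mu = np.zeros(3)
for N, x in enumerate(X, start=1):
    mu += (x - mu) / N

# Agrees with the two-pass (batch) mean up to floating point error
print(np.allclose(mu, X.mean(axis=0)))  # True
```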
3. Consider the following model
• x ∈ RN has a multivariate distribution with mean µx and covariance Σx
• n ∈ RM has a multivariate distribution with mean µn and covariance Σn
• y = Ax + b + n where A ∈ RM×N and b ∈ RM are constants
a) Assume that x and n are independent. Derive the mean and covariance of y.
b) Assume x and n are not independent and have covariance
$$\mathrm{Cov}[\mathbf{x},\mathbf{n}] = E\left[(\mathbf{x}-\boldsymbol{\mu}_x)(\mathbf{n}-\boldsymbol{\mu}_n)^{T}\right] = \boldsymbol{\Sigma}_{xn}$$
Derive the mean and covariance of y.
4. Consider the problem of classifying D dimensional inputs x ∈ RD. Suppose we have 2 classes and the
output is denoted y ∈ {0, 1}. If we use a binary logistic regression classifier then the model is:
$$p_b(y|\mathbf{x}) = \mathrm{Bernoulli}\left(y \,|\, \sigma(\mathbf{w}^{T}\mathbf{x} + b)\right)$$

where $\sigma(a) = \frac{1}{1+e^{-a}}$ and $\mathbf{w} \in \mathbb{R}^{D}$, $b \in \mathbb{R}$ are the parameters of the model.
a) The derivative of the sigmoid function has a special form which can be useful in computing
gradients. Show that $\frac{d\sigma}{da} = \sigma(1-\sigma)$. Use this fact to derive the form of $\frac{d^{2}\sigma}{da^{2}}$.
b) Mathematically, $\frac{d\sigma}{da}$ is always greater than zero. However, when implemented on a computer with finite precision arithmetic, it can become zero due to underflow. Underflow of derivatives is particularly dangerous for machine learning models trained by gradient-based parameter updates, because the overall gradient can become numerically zero, causing learning to fail in a variety of ways.
For what range of values of a will $\frac{d\sigma}{da}$ evaluate to zero numerically when computed by the expression $\frac{d\sigma}{da} = \sigma(a)(1 - \sigma(a))$ in single precision floating point arithmetic? What if the mathematically equivalent expression $\frac{d\sigma}{da} = \sigma(a)\sigma(-a)$ is used instead? What relevance, if any, does this have for how we should implement logistic regression?
You should assume that the value of σ has been computed as accurately as possible in single precision floating point arithmetic. For reference, in single precision the smallest representable number larger than zero is $2^{-126} \approx 1.18 \times 10^{-38}$; a numerical operation which results in a value less than this will be rounded to either $0$ or $2^{-126}$, whichever is closer. Similarly, the largest representable number less than one is $1 - 2^{-24} \approx 0.99999994$; a numerical operation which results in a value between this value and one will be rounded to $1$ or $1 - 2^{-24}$, whichever is closer. Hint: You will need to use the inverse of the sigmoid, $\sigma^{-1}(p) = \log\frac{p}{1-p}$. A short numerical sketch following this question illustrates one way to experiment with these expressions.
c) Multiclass logistic regression can also be used when there are only two classes. In that case,
the model is

$$p_m(y|\mathbf{x}) = \mathrm{Categorical}\left(y \,|\, S(A\mathbf{x} + \mathbf{c})\right)$$

where $S(\mathbf{e}) = \frac{\exp(\mathbf{e})}{\sum_i \exp(e_i)}$ is the softmax function and $A \in \mathbb{R}^{2 \times D}$, $\mathbf{c} \in \mathbb{R}^{2}$ are the parameters of the model. Prove that this multiclass logistic regression model is equivalent to the binary
of the model. Prove that this multiclass logistic regression model is equivalent to the binary
logistic regression model. In particular show how, given the parameters w, b of any binary
logistic regression model, you could construct parameters A, c of a multiclass logistic regres-
sion model which would always give the same predictions. Also, show how, given A, c, you
can compute parameters w, b which would always give the same predictions. Finally, are
these transformations unique? Given the values A, c (or w, b), is the value of w, b (or A, c)
unique? If a direction is unique, give an argument why. If it’s not unique, give at least one
example of a different transformation which would be equivalent.
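If you want to experiment with part b) numerically, the following sketch (my own illustration, assuming numpy with float32 throughout; it is not part of the starter code) evaluates the two mathematically equivalent expressions at one test point:

```python
import numpy as np

def sigmoid(a):
    # evaluated entirely in the dtype of the input (float32 here)
    return np.float32(1.0) / (np.float32(1.0) + np.exp(-a))

a = np.float32(20.0)  # try other values to map out where the forms differ

s = sigmoid(a)
print(s * (np.float32(1.0) - s))  # sigma(a)(1 - sigma(a)) -> 0.0 at a = 20
print(s * sigmoid(-a))            # sigma(a)sigma(-a)      -> ~2.06e-09
```

At this test point the first form underflows to exactly zero while the second does not; your written answer should characterize the full range of a for which this happens.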
Part 2: Logistic Regression
Due: Friday, November 5th at 11:59pm (Toronto time)
In this part of the assignment you will implement a classifier of hand-written digits. We will be using the famous MNIST dataset, which consists of 28×28 pixel images of 70,000 handwritten digits; the task is to classify each image into one of 10 classes. MNIST has played a pivotal role in the development of modern machine learning but is now considered "saturated", meaning that performance is effectively perfect with the latest methods, and so it is not widely used except for illustrative purposes. For this assignment, we will focus only on the binary classification problem of distinguishing the digits 4 and 9. You will implement a regularized logistic regression model and the gradient descent algorithm to train it. You will then analyze the model.
Binary logistic regression constructs a (binary) classifier of an output label y ∈ {0, 1} given an input vector
x ∈ RD as follows:
$$p_b(y|\mathbf{x}) = \mathrm{Bernoulli}\left(y \,|\, \sigma(\mathbf{w}^{T}\mathbf{x} + b)\right) = \begin{cases} \sigma(\mathbf{w}^{T}\mathbf{x} + b) & y = 1 \\ 1 - \sigma(\mathbf{w}^{T}\mathbf{x} + b) & y = 0 \end{cases} = \sigma(\mathbf{w}^{T}\mathbf{x} + b)^{y}\left(1 - \sigma(\mathbf{w}^{T}\mathbf{x} + b)\right)^{(1-y)}$$

where $\sigma(a) = \frac{1}{1+e^{-a}}$ is the sigmoid function and the parameters of the model are $\boldsymbol{\theta} = (\mathbf{w}^{T}, b)^{T}$ with $\mathbf{w} \in \mathbb{R}^{D}$ and $b \in \mathbb{R}$. Given a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, a MAP estimate of the parameters can be found by assuming $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \,|\, \mathbf{0}, \tfrac{1}{\lambda}\mathbf{I})$ and deriving the negative log posterior as

$$\mathcal{L}(\boldsymbol{\theta}) = -\log p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}) = -\sum_{i=1}^{N} \left[ y_i \log \sigma(\mathbf{w}^{T}\mathbf{x}_i + b) + (1 - y_i) \log\left(1 - \sigma(\mathbf{w}^{T}\mathbf{x}_i + b)\right) \right] + \frac{\lambda}{2}\,\mathbf{w}^{T}\mathbf{w}$$
where λ controls the strength of the regularization prior. If λ = 0 this is equivalent to a maximum likelihood estimator, and as λ increases the prior becomes stronger. In this assignment you will use λ = 0 and λ = 10. The gradient of this objective (or, more accurately, a very similar one) is derived in Sections 10.2.3.3 and 10.2.7 of the textbook and is given here:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \lambda\mathbf{w} - \sum_{i=1}^{N} \left(y_i - \sigma(\mathbf{w}^{T}\mathbf{x}_i + b)\right)\mathbf{x}_i, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{N} \left(y_i - \sigma(\mathbf{w}^{T}\mathbf{x}_i + b)\right)$$
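As an illustration of how these equations map onto matrix operations (relevant to the efficiency notes below), here is a minimal sketch assuming a design matrix X with one sample per row and a 0/1 label vector y; the function name, signature, and structure are mine, not the starter code's, so treat this as a reference point rather than a drop-in implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logregr_gradients(w, b, X, y, lam):
    """Vectorized form of the gradient equations above (illustrative sketch).

    X:   (N, D) matrix with one sample per row
    y:   (N,) vector of 0/1 labels
    lam: regularization strength (lambda)
    """
    s = sigmoid(X @ w + b)            # sigma(w^T x_i + b) for all i at once
    grad_w = lam * w - X.T @ (y - s)  # dL/dw: X.T @ (y - s) = sum_i (y_i - s_i) x_i
    grad_b = -np.sum(y - s)           # dL/db
    return grad_w, grad_b
```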
Your task is to implement functions which compute this objective and its derivatives. You will start by implementing the sigmoid and log sigmoid functions and their derivatives, and then move on to implementing this objective function. Once that is done, you will need to implement the simple version of gradient descent described in class. Finally, you will implement some code which visualizes your resulting model and its performance to help you answer questions about the results. Questions are given below.
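For reference, gradient descent in its simplest form is just repeated steps in the negative gradient direction. The sketch below is a generic constant step-size version with hypothetical names; the version described in class, and the structure expected by the starter code, may differ in details such as step-size selection and stopping criteria:

```python
import numpy as np

def gradient_descent(grad_fn, theta0, step_size=1e-3, num_iters=5000):
    """Minimal constant step-size gradient descent.

    grad_fn: returns the gradient of the loss at theta
    theta0:  initial parameter vector
    """
    theta = np.array(theta0, dtype=float)
    for _ in range(num_iters):
        theta = theta - step_size * grad_fn(theta)
    return theta

# Toy usage: minimize f(theta) = ||theta - 3||^2, whose gradient is 2(theta - 3)
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), np.zeros(2))
print(theta_star)  # close to [3., 3.]
```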
Code will be marked primarily for correctness and clarity. However, a portion of the mark will be based on efficiency. An otherwise clear and correct implementation which is inefficient will get a maximum mark of B+/80%. First and foremost, efficiency means avoiding redundant computation, e.g., recomputing values unnecessarily. However, more than that, efficient implementations with Python/Numpy should avoid the use of loops. My implementation of this assignment never loops over entries in a vector in functions like sigmoid(), log_sigmoid(), logregr_logprob() or logregr_loss(). Instead it uses matrix operations, vectorization and broadcasting. To learn more about this, see this blog post: https://blog.paperspace.com/numpy-optimization-vectorization-and-broadcasting/. To be clear, you can get 80% of the marks for the implementation without worrying about efficiency. I recommend you focus on correctness and clarity of your code first; only after you've finished that and answered all the questions should you go back and consider efficiency.
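To make the vectorization point concrete, here is a small self-contained comparison (my own illustration, not from the starter code): applying the sigmoid entry by entry with a Python loop versus letting numpy apply it elementwise to the whole array at once. Both produce the same values, but the vectorized version is typically orders of magnitude faster:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5.0, 5.0, 1_000_000)

# Looping over entries (what to avoid)
out_loop = np.empty_like(a)
for i in range(a.size):
    out_loop[i] = sigmoid(a[i])

# Vectorized: np.exp applies elementwise to the whole array in one call
out_vec = sigmoid(a)

print(np.allclose(out_loop, out_vec))  # True
```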
Starter code is available as a Google Colab notebook: https://colab.research.google.com/drive/1R9kTsBUMY7UqmBLM6ZgglpyMlaq511Yo?usp=sharing. You should read through the code and comments in that notebook carefully to find what you need to do. Search for the text TODO for things you need to implement. Note that I have given you sample outputs of my reference implementation for only some of the sections. However, I have
included “test” functions for other parts of the code to help you double check the correctness of your implementations.
For the remainder I will not be releasing sample outputs because they can vary significantly based on random seeds,
implementation details and more. In practice, a major challenge of working with machine learning models is that
you don’t always know what the “right answer” is for any given dataset and you need to use your understanding of
the model and dataset to help guide you and build confidence. I will give one final output check for you here: both
your training and validation accuracies should be able to reach over 90%.
After your implementation is completed and correct, you will need to create a report which answers the questions
below. Each question refers to a specific section/output of your assignment. Search for “QUESTION #” (replace #
with the number of the question) in the starter code to find the relevant section for each question. In these questions
you are asked to think about your results and analyze the model. Feel free to print additional quantities or make
new figures to come up with your answers. Include any figures or printed output in your report answering these
questions. For many of these questions, there may not be a single correct answer or the correct answer will depend
on your specific outputs. You will be marked mostly on how effectively you formulate and justify your answer.
1. Look at the plots that are generated in “Step 1: Explore the data”. Based on these plots, do you think this
classification problem is solvable by logistic regression? Is it even possible to answer this question from these
plots? Explain your answer.
2. Look at the outputs without any regularization (i.e., λ = 0). Based on the validation loss, at what iteration
should you stop training and why? What about based on the training loss? Explain the difference (if any)
between these two stopping points. Hint: you may need to play with how your training curves are visualized
here to answer this well.
3. Compare the outputs with and without regularization (i.e., with λ = 10 and λ = 0). Explain the differences
in the training curves. Do you think it’s easier or harder to determine when you should stop? Why?
4. Looking at the visualization of the weights of the logistic regression model, what, if any, interpretation can
you make about the structure of the weights? How do the weights differ between the model with and without
regularization? Explain the difference in the weights by making a connection to the prior used to regularize
them.
5. Consider the selected images which have the highest probability of the incorrect class. Do these images match your expectations? What about the images which have the highest probability of the correct class? Explain what it means for an image to have a high probability of the correct or incorrect class and use your sample images as illustration. How do the probabilities change when regularization is used?
6. Consider the selected images which have probabilities of either class closest to 0.5. Explain what it means that
these images have a probability of being in either class close to 0.5 and use your sample images as illustration.
7. In the last step, you’re asked to run the regularized logistic regression training again, but this time to classify
the digits 0 and 1. How does the performance and training curves differ between the 0/1 classification problem
and the 4/9 classification problem? Why?
Submission You must submit by the deadline a report answering the above questions as a PDF, along with your code.
Detailed submission instructions will be posted on eClass.