CMPT 726 Machine Learning Spring 2021
Assignment 1
Due: Thursday, February 25 at 11:59 pm Pacific Time
This assignment comes with a zip archive containing essential data and starter code that are required to complete the coding component of this assignment.
This assignment is designed to be substantially more challenging than the quizzes and requires thorough understanding of the course material and extensive thought, so start early! If you are stuck, come to office hours. Make sure to check the discussion board often for updates, clarifications, corrections and/or hints.
Requests for extensions will not be entertained – make sure you know how to submit your assignment on Canvas and properly account for the time difference if you are not in Vancouver.
Warning: Copying others’ solutions, seeking help from others not in this course or posting questions online are considered cheating. Consequences are severe and could lead to suspension or expulsion. If you become aware of such instances, you must report them here: https://forms.gle/mKgkKbujKtQojXCf6
Submission Instructions:
Carefully follow the instructions below when submitting your assignment.
1. Submit a separate PDF for each problem in the assignment to Canvas. You may typeset your assignment in LaTeX or Word (submit in PDF format, not .doc/.docx format) or submit neatly handwritten and scanned solutions. We will not be able to give credit to solutions that are not legible. If there are graphs, include those graphs in the correct sections. Do not put them in an appendix. We need each solution to be self-contained on pages of its own.
• At the start of each problem, please state: (1) who you received help from on the prob- lem, and (2) who you provided help to.
• At the start of the first problem, please copy the following statement and sign your signature next to it. (Mac Preview and FoxIt PDF Reader, among others, have tools to let you sign a PDF file.) We want to make it extra clear so that no one inadvertently cheats. “I certify that all solutions are entirely in my own words and that I have not looked at another student’s solutions. I have given credit to all external sources I consulted.”
2. For all coding-related questions, we recommend working out the solution in the provided Jupyter notebook in Google Colab. Then take a screenshot of your solution and the output and include it in your PDF. In addition, you must (1) save the Jupyter notebook with your solutions and (2) copy your code into the Python script specific to each part of the question. You need to submit both the completed Jupyter notebook and the Python scripts as a zip archive named “CMPT726-419 A1 ⟨Last Name⟩ ⟨Student ID⟩.zip”, whose directory structure
©Simon Fraser University. All Rights Reserved. Sharing this publicly constitutes both academic misconduct and copyright violation.
should be the same as that of the zip archive for the starter code. Do NOT include any data files we provided. Please include a short file named README listing your name and student ID. Please make sure that your code doesn’t take up inordinate amounts of time or memory. If your code cannot be executed, your solution cannot be verified.
1 Python Configuration and Data Loading (Optional)
We recommend using Google Colab to complete the parts of this assignment that require coding. However, if you would like to set up your own Python environment on your computer, follow the instructions below.
Please follow the instructions below to ensure your Python environment is configured properly, and you are able to successfully load the data provided with this homework. No solution needs to be submitted for this question. For all coding questions, we recommend using Anaconda for Python 3.
(a) Either install Anaconda for Python 3, or ensure you’re using Python 3. To ensure you’re running Python 3, open a terminal in your operating system and execute the following command:
python --version
Do not proceed until you’re running Python 3.
(b) Install the following dependencies required for this homework by executing the following command in your operating system’s terminal:
pip install scikit-learn scipy numpy matplotlib
Please use Python 3 with the modules specified above to complete this homework.
To check whether your Python environment is configured properly for this homework, ensure the following Python script executes without error. Pay attention to errors raised when attempting to import any dependencies. Resolve such errors by manually installing the required dependency (e.g. execute pip install numpy for import errors relating to the numpy package).
import sys
if sys.version_info[0] < 3:
    raise Exception("Python 3 not detected.")
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from scipy import io

for data_name in ["a", "b"]:
    data = io.loadmat("data/%s.mat" % data_name)
    print("\nloaded %s data!" % data_name)
2 Probabilities
Consider two discrete random variables A and B, where A can take on the values {2, 4} and B can take on the values {1, 3, 5}. The pmf (probability mass function) of the joint probability distribution of A and B is given by the following table:
a    b    Pr(A = a, B = b)
2    1    0.1
2    3    0.1
2    5    0.1
4    1    0.2
4    3    0.0
4    5    0.5
Using this information, answer the following questions.
(a) Calculate the marginal probability distributions of A and B. Use tables in the following format to summarize your answer. Show your work.

a    Pr(A = a)
2
4

b    Pr(B = b)
1
3
5
(b) Compute E[A], E[B], and E[AB].
(c) Calculate the conditional probability distribution of B given that A = 4. Use a table in the following format to summarize your answer. Show your work.

b    Pr(B = b | A = 4)
1
3
5
(d) Are A and B independent? Why?
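If you want to sanity-check your hand calculations for this problem, the joint pmf table is small enough to verify numerically. The NumPy sketch below is a verification aid only; it does not replace showing your work, and the variable names are illustrative.

```python
import numpy as np

# Joint pmf of (A, B) from the table: rows index a in {2, 4}, columns b in {1, 3, 5}.
a_vals = np.array([2, 4])
b_vals = np.array([1, 3, 5])
joint = np.array([[0.1, 0.1, 0.1],
                  [0.2, 0.0, 0.5]])

p_a = joint.sum(axis=1)            # marginal pmf of A
p_b = joint.sum(axis=0)            # marginal pmf of B

e_a = (a_vals * p_a).sum()         # E[A]
e_b = (b_vals * p_b).sum()         # E[B]
e_ab = (np.outer(a_vals, b_vals) * joint).sum()  # E[AB]

p_b_given_a4 = joint[1] / p_a[1]   # conditional pmf of B given A = 4

# A and B are independent iff the joint pmf factorizes into the product of marginals.
independent = np.allclose(joint, np.outer(p_a, p_b))

print("P(A):", p_a)
print("P(B):", p_b)
print("E[A], E[B], E[AB]:", e_a, e_b, e_ab)
print("P(B | A=4):", p_b_given_a4)
print("independent:", independent)
```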
3 Entropy
Let A, B be discrete random variables with the following pmf for the joint probability distribution:
          B = 0    B = 1
A = 0     1/2      0
A = 1     3/8      1/8

Calculate H(A), H(B), H(A | B), H(A, B), and I(A; B), where all logs should be in base 2.
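The entropy identities used here (the chain rule H(A | B) = H(A, B) − H(B) and I(A; B) = H(A) − H(A | B)) can be checked numerically once you have worked the quantities out by hand. A minimal verification sketch, assuming the joint table above:

```python
import numpy as np

# Joint pmf of (A, B): rows index a in {0, 1}, columns b in {0, 1}.
joint = np.array([[0.5, 0.0],
                  [3/8, 1/8]])

def entropy(p):
    """Shannon entropy in bits; 0 * log 0 is treated as 0 by dropping zero cells."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

h_ab = entropy(joint.ravel())         # H(A, B)
h_a = entropy(joint.sum(axis=1))      # H(A), from the marginal of A
h_b = entropy(joint.sum(axis=0))      # H(B), from the marginal of B
h_a_given_b = h_ab - h_b              # chain rule: H(A | B) = H(A, B) - H(B)
mi = h_a - h_a_given_b                # I(A; B) = H(A) - H(A | B)

print("H(A), H(B), H(A|B), H(A,B), I(A;B):", h_a, h_b, h_a_given_b, h_ab, mi)
```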
4 Short Answer Questions

(a) Let A, B and C be random variables and a, b and c be the realizations of A, B and C. Define p as the pmf (probability mass function) if A, B and C are discrete random variables, and the pdf (probability density function) if they are continuous random variables. Prove that p(a, b | c) = p(a | b, c) p(b | c).

Hint: Use the definitions of conditional probabilities.

(b) Using the fact proved in part (a), prove the conditional version of Bayes’ rule, i.e.:

    p(b | a, c) = p(a | b, c) p(b | c) / p(a | c)    (1)

(c) Give two examples of situations where it makes sense to add a regularizer to a model.

(d) As you increase the degree of polynomial features in ordinary least squares (OLS), which kind of loss usually goes down at first but then goes back up?

(e) You have an ordinary least squares model for estimating the fair market value of cars (in dollars), and your training dataset consists of data on 10K cars. You used all the training data for training the model and obtained an average training loss / mean squared error (MSE) of 100. You have to submit your model to Kaggle (a website that runs machine learning competitions) for evaluation, and it gave you an average testing loss of 1000000. You don’t have access to the testing data that Kaggle uses and can only submit your code and receive the testing loss a limited number of times. However, there are many choices of features you would like to try, and you will quickly use up your submission budget if you submit each one to Kaggle. How can you pick the best choice of features among all possible choices you have in mind without using up your submission budget?
5 SVD and Eigendecomposition

(In this question, you can use a computer to help you compute the eigenvalues/vectors, singular values/vectors and the inverse of a matrix.)
(a) Given the matrix A = (2, −1; −1, 2), find an analytical expression for A^100. Show your work. (You may leave powers in the expression and do not need to compute the numerical value.)

(b) Recall that given a linear model ŷ = w⃗^⊤x⃗, where w⃗ is the parameter vector and X, y⃗ are the data, the optimal parameters w⃗* under the square loss L(w⃗) = ∥y⃗ − Xw⃗∥² are

    w⃗* = argmin_{w⃗} L(w⃗) = (X^⊤X)^{−1}X^⊤y⃗.

The matrix X† = (X^⊤X)^{−1}X^⊤ is known as the pseudoinverse. Let

    X = (1/√2, −1/√2; 0, 3; 2, 2).

Compute the pseudoinverse matrix X† = (X^⊤X)^{−1}X^⊤ using the following approaches. Show your work. (You may use a computer to help you compute the eigenvalues/vectors, singular values/vectors and the inverse of a matrix. You may round numbers to four decimal places of precision.)

(1) Compute X† using the brute force approach, i.e. by evaluating the expression (X^⊤X)^{−1}X^⊤ directly.
(2) Compute X† by applying eigendecomposition to X^⊤X.
(3) Compute X† by applying singular value decomposition (SVD) to X.

(c) Below are four contour plots of 2D Gaussians N(0⃗, Σ_i) with different covariance matrices Σ_i:

    Σ_1 = (1, 0.5; 0.5, 1),  Σ_2 = (1, −0.5; −0.5, 3),  Σ_3 = (3, 0.5; 0.5, 1),  Σ_4 = (3, 1.2; 1.2, 1).

Match each covariance matrix to the corresponding plot and explain why.

[Figure 1: four contour plots, panels (a)–(d)]
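The three routes to the pseudoinverse in part (b) must agree whenever X has full column rank, and that agreement is easy to check numerically. The sketch below uses X_demo, a made-up full-column-rank matrix (not the X from the question), purely to illustrate the three computations.

```python
import numpy as np

# X_demo is an arbitrary full-column-rank 3x2 matrix used only for illustration.
X_demo = np.array([[1.0, 2.0],
                   [0.0, 3.0],
                   [2.0, 2.0]])

# (1) Brute force: (X^T X)^{-1} X^T.
pinv_brute = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T

# (2) Eigendecomposition of the symmetric matrix X^T X = V diag(lam) V^T,
# so (X^T X)^{-1} = V diag(1/lam) V^T.
lam, V = np.linalg.eigh(X_demo.T @ X_demo)
pinv_eig = (V * (1.0 / lam)) @ V.T @ X_demo.T

# (3) Thin SVD: X = U diag(s) V^T, so X^+ = V diag(1/s) U^T.
U, s, Vt = np.linalg.svd(X_demo, full_matrices=False)
pinv_svd = (Vt.T * (1.0 / s)) @ U.T

print(np.allclose(pinv_brute, pinv_eig), np.allclose(pinv_brute, pinv_svd))
```

All three should also match NumPy's built-in np.linalg.pinv on a full-column-rank input.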
(d) Given two vectors x⃗ and y⃗, the Euclidean distance is defined as d_E(x⃗, y⃗) = ∥x⃗ − y⃗∥₂ = √((x⃗ − y⃗)^⊤(x⃗ − y⃗)). Here we introduce a new distance metric called the Mahalanobis distance, defined as d_M(x⃗, y⃗) = √((x⃗ − y⃗)^⊤ S^{−1} (x⃗ − y⃗)), where S is a covariance matrix that is symmetric and positive semi-definite.

In the questions below, consider the two points a⃗ = (10, 5)^⊤ and b⃗ = (7, 9)^⊤, and let ∥M∥₂ be the induced 2-norm of a matrix.

(1) Find a covariance matrix S where ∥S^{−1}∥₂ = 1 and d_M(a⃗, b⃗) = d_E(a⃗, b⃗). Explain why.
(2) Find a covariance matrix S where ∥S^{−1}∥₂ = 1 and d_M(a⃗, b⃗) = 0.1 d_E(a⃗, b⃗). Explain why.

Hint: Think about the geometry of the squared Mahalanobis distance, which is a quadratic form u⃗^⊤S^{−1}u⃗. Along which direction does it grow the fastest?

You are required to explain how you found the answer and why it works, though a formal proof is not required. (You may use a computer to verify your answer.)
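One convenient way to use a computer here is a small checker that, given any candidate S, reports both required quantities: the induced 2-norm of S^{−1} and the two distances. S_demo below is an arbitrary placeholder, not an answer to (1) or (2); it only exercises the checking machinery.

```python
import numpy as np

a = np.array([10.0, 5.0])
b = np.array([7.0, 9.0])

def d_euclid(x, y):
    """Euclidean distance ||x - y||_2."""
    return np.linalg.norm(x - y)

def d_mahalanobis(x, y, S):
    """Mahalanobis distance sqrt((x - y)^T S^{-1} (x - y))."""
    d = x - y
    return np.sqrt(d @ np.linalg.inv(S) @ d)

# Placeholder covariance for demonstration only (symmetric positive definite).
S_demo = np.diag([2.0, 0.5])

inv_norm = np.linalg.norm(np.linalg.inv(S_demo), 2)  # induced 2-norm of S^{-1}
print("d_E(a, b)  =", d_euclid(a, b))
print("d_M(a, b)  =", d_mahalanobis(a, b, S_demo))
print("||S^-1||_2 =", inv_norm)
```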
6 Linear Regression
Making autonomous vehicles involves machine learning for several purposes, one of which is learning how cars actually behave from their data.
(a) For t ∈ {1, 2, . . . , T − 1}, let x_t, u_t ∈ R be scalars that approximately obey the relationship x_{t+1} ≈ Ax_t + Bu_t, where A and B are scalars. An expression of this form arises in control theory, where x_t represents the state and u_t represents the control input. Define the loss function

    L(A, B) = Σ_{t=1}^{T−1} (x_{t+1} − Ax_t − Bu_t)²,

and consider the problem of finding the optimal parameters A and B. Formulate this as an ordinary least squares (OLS) problem, that is, write down what the data matrix X, label vector y⃗ and parameter vector w⃗ in OLS correspond to in the context of this problem. Then write down an expression for the optimal A and B that minimize L.
(b) In the starter code zip archive associated with this assignment, the values of xt and ut are pro- vided in a.mat. Write the code for computing the optimal A and B in the Jupyter notebook included in the zip archive. Then use it to compute the optimal A and B. Take a screenshot of both your code and the output in Jupyter notebook and include it in your PDF submis- sion. Make sure the screenshot resolution is high enough for the code to be readable and that the optimal A and B are printed at a precision of at least eight decimal places in the screenshot. Also save your code in the Jupyter notebook and copy your code to A1 Q6 part b.py provided in the zip archive. Submit both the completed notebook and A1 Q6 part b.py.
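Before working with a.mat, the OLS formulation from part (a) can be exercised on synthetic data. Everything below (the toy generator, the noise level, the variable names) is illustrative and not part of the assignment; row t of the data matrix is (x_t, u_t), the label is x_{t+1}, and the parameter vector is (A, B).

```python
import numpy as np

# Synthetic stand-in for a.mat (the real data is loaded with scipy.io.loadmat).
rng = np.random.default_rng(0)
T = 200
A_true, B_true = 0.9, 0.5
u = rng.normal(size=T)
x = np.zeros(T)
for t in range(T - 1):
    x[t + 1] = A_true * x[t] + B_true * u[t] + 0.01 * rng.normal()

# OLS formulation from part (a): stack one row per transition.
X = np.column_stack([x[:-1], u[:-1]])   # (T-1) x 2 data matrix
y = x[1:]                               # (T-1)-vector of labels
w, *_ = np.linalg.lstsq(X, y, rcond=None)
A_hat, B_hat = w
print("A_hat =", A_hat, " B_hat =", B_hat)
```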
(c) For t ∈ {1, 2, . . . , T − 1}, let x⃗_t, u⃗_t ∈ R³ be vectors that approximately obey the relationship x⃗_{t+1} ≈ Ax⃗_t + Bu⃗_t, where A, B ∈ R^{3×3} are matrices. Define the loss function

    L(A, B) = Σ_{t=1}^{T−1} ∥x⃗_{t+1} − Ax⃗_t − Bu⃗_t∥₂²,

and consider the problem of finding the optimal parameters A and B. Formulate this as a multiple output linear regression problem, that is, write down what the data matrix X, label matrix Y and parameter matrix W in multiple output linear regression correspond to in the context of this problem. Then write down an expression for the optimal A and B that minimize L.
(d) In the starter code zip archive associated with this assignment, the values of ⃗xt and ⃗ut are stored in b.mat. Write the code for computing the optimal A and B using the approach in part (c) in the Jupyter notebook included in the zip archive. Then use it to compute the optimal A and B. Take a screenshot of both your code and the output in Jupyter notebook and include it in your PDF submission. Make sure the screenshot resolution is high enough for the code to be readable and that the optimal A and B are printed at a precision of at least eight decimal places in the screenshot. Also save your code in the Jupyter notebook and copy your code to A1 Q6 part d.py provided in the zip archive. Submit both the completed notebook and A1 Q6 part d.py.
(e) Consider the same problem as in part (c), where we are given x⃗_t, u⃗_t ∈ R³ and would like to find the optimal A and B that minimize the loss function L(A, B) = Σ_{t=1}^{T−1} ∥x⃗_{t+1} − Ax⃗_t − Bu⃗_t∥₂². Whereas part (c) asked you to formulate this as a multiple output linear regression problem, now formulate it as an ordinary least squares problem, where the model parameters form a vector rather than a matrix. Devise a formulation of the problem as an OLS problem, that is, write down what the data matrix X, label vector y⃗ and parameter vector w⃗ in OLS correspond to in the context of this problem. Then write down an expression for the optimal A and B that minimize L.
Hint: Consider each component of ⃗xt+1 separately and express it in terms of the entries of the matrices A and B.
(f) Write the code for computing the optimal A and B using the approach in part (e) in the Jupyter notebook included in the starter code zip archive. Then use it to compute the optimal A and B. Check that it produces the same result as the code in part (d). Take a screenshot of both your code and the output in Jupyter notebook and include it in your PDF submission. Make sure the screenshot resolution is high enough for the code to be readable and that the optimal A and B are printed at a precision of at least eight decimal places in the screenshot. Also save your code in the Jupyter notebook and copy your code to A1 Q6 part f.py provided in the zip archive. Submit both the completed notebook and A1 Q6 part f.py.
(g) Consider the general form of multiple output linear regression: we are given a data matrix X ∈ R^{N×(n+1)} and a label matrix Y ∈ R^{N×m}, where N is the number of observations, n is the number of input dimensions and m is the number of output dimensions. The goal is to find W ∈ R^{m×(n+1)} that minimizes the loss L(W) = ∥Y − XW^⊤∥²_F. Prove that the optimal parameter matrix satisfies W*^⊤ = (X^⊤X)^{−1}X^⊤Y.

Hint: Use the fact that ∥Y − XW^⊤∥²_F = Σ_{i=1}^{m} ∥y⃗_i − Xw⃗_i∥₂², where y⃗_i is the i-th column of Y and w⃗_i is the i-th row of W (represented as a column vector).
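The claim in part (g) can be checked numerically before you prove it: since the Frobenius loss decomposes column by column, the closed form must agree with solving an independent OLS problem per output column. The shapes and random data below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, m = 50, 4, 3
# Data matrix with a bias column appended, shape N x (n+1).
X = np.column_stack([rng.normal(size=(N, n)), np.ones(N)])
# Label matrix, shape N x m.
Y = rng.normal(size=(N, m))

# Closed form from part (g): W*^T = (X^T X)^{-1} X^T Y, shape (n+1) x m.
W_T = np.linalg.inv(X.T @ X) @ X.T @ Y

# Column i of W*^T should equal the OLS solution for column i of Y.
W_T_cols = np.column_stack(
    [np.linalg.lstsq(X, Y[:, i], rcond=None)[0] for i in range(m)]
)
print(np.allclose(W_T, W_T_cols))
```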
(h) Recall the problem in part (c), where we are given x⃗_t, u⃗_t ∈ R³ and would like to find the optimal A and B that minimize the loss function L(A, B) = Σ_{t=1}^{T−1} ∥x⃗_{t+1} − Ax⃗_t − Bu⃗_t∥₂². Suppose we now add a regularizer to encourage the optimal A to be close to the identity matrix. It takes the form λ∥A − I∥²_F, where λ > 0 is a hyperparameter, so the new loss function becomes

    L(A, B) = Σ_{t=1}^{T−1} ∥x⃗_{t+1} − Ax⃗_t − Bu⃗_t∥₂² + λ∥A − I∥²_F.

Derive an expression for the optimal A and B that minimize L.