CS.542 Machine Learning, Fall 2020
1. Math and Probability Basics Q1.1 Definitions
[a] Give the definition of an orthogonal matrix.
[b] Give the definition of an eigenvector and eigenvalue.
[c] How is the probability density function different from the cumulative probability distribution?
[d] What is a ‘singular’ matrix?
[e] Give the definition of Baye’s Rule.
Answer: see Bishop or Wikipedia for these definitions. These and other pre-requisites you should brush up on were discussed in the first week of lectures.
CS.542 Machine Learning, Fall 2020
Q1.2 Short questions
a)
b)
c)
Supervised classification models can be used to predict if an email attachment contains a computer virus [T/F]
Answer: true, eg binary classification with an SVM
Suppose we use polynomial features for linear regression, then the hypothesis is linear in the original features [T/F]
Answer: false, it is linear in the new polynomial features
Find the maximum of the function f(x , 𝑥 ) = 1 − 𝑥2 − 𝑥2 subject to the constraint 𝑔(𝑥 , 𝑥 ) = 𝑥 + 1212 121
𝑥2 − 1 = 0, using Lagrange Multipliers. The 3-D plot shows f and the 2-D plot shows its contours (blue circles) and the line corresponding to 𝑔 = 0 (red line).
Answer: solve the following system of equations:
𝛻𝑓(𝑥1, 𝑥2) = 𝜆𝛻𝑔(𝑥1, 𝑥2) 𝑔(𝑥1, 𝑥2) = 0
In this case the first equation yields −2𝑥1 = 𝜆; −2𝑥2 = 𝜆. Solving for 𝑥1, 𝑥2 and substituting into the second equation, we get 𝜆 = −1, so x = (0.5, 0.5) is the solution. For more info: http://www.math.harvard.edu/archive/21a_spring_09/PDF/11-08-Lagrange-Multipliers.pdf
CS.542 Machine Learning, Fall 2020
Q1.3 Covariance Matrix
Recall that in the derivation of normal equations in class, we used the fact that the data covariance matrix, i.e. a matrix whose element in the k, j position is the covariance between the k-th and j-th elements of the random input vector, is given b
𝑚
∑𝑥(𝑖)𝑥(𝑖)𝑇 =𝑋𝑇𝑋
𝑖=1
where 𝑥(𝑖) is the 𝑖th 𝑛𝑥1 input vector, 𝑚 is the number of input vectors in the dataset, and 𝑋 is the
𝑚𝑥𝑛 design matrix. Show that the above equality is true. Show all your steps.
ing linear algebra rules, we can prove this by re-writing the vector multiplication ins
Answer: using linear algebra rules, we can prove this by re-writing the vector multiplication inside the sum as a matrix, then re-writing the sum of matrices as a single matrix with sums as its elements, and finally as a product of two matrices.
the sum as a matrix, then re-writing the sum of matrices as a single matrix with sums as a product of
CS.542 Machine Learning, Fall 2020
Q1.4 Matrix Norm
The trace of a square matrix 𝐴 ∈ R𝑛𝑥𝑛 is defined as the sum of diagonal entries, or
Prove the following fact
where ‖𝐴‖𝐹 is the matrix Frobenius norm. Show all your steps.
Answer: this is straightforward to show using linear algebra rules. Without providing all the
details here, the idea is to first write down each diagonal element of the matrix 𝐶 = 𝐴𝑇𝐴 as
𝐶 = ∑𝑛 𝐴 𝐴 and then compute the trace as the sum of these diagonal elements, taking 𝑘𝑘 𝑖=1 𝑖𝑘 𝑖𝑘
all sums outside.is straightforward to show using linear algebra rules. Without 𝐶 = 𝐴𝑇𝐴 as
Q1.4 Flu Virus Test
After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a flu virus, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the virus is 0.99, as is the probability of testing negative given that you don’t have the disease). The good news is that this is a rare virus, striking only one in 10,000 people (10−4). What are the chances that you actually have the disease? Show your calculations as well as giving the final result. Hint: use Baye’s Rule.
Answer: This is a straightforward application of the rules of probability. Let T be the binary variable with value 1 if your test is positive and 0 if negative, and D be a variable with value 1 if you have the disease and 0 of you don’t. We are given
p(T=1|D=1)=0.99, p(T=0|D=1)=0.01 p(T=0|D=0)=0.99, p(T=1|D=0)=0.01 p(D=1)=0.0001, p(D=0)=0.9999
We want the probability of having the disease given the test was positive, or p(D=1|T=1), which is given by
𝑝(𝐷 = 1|𝑇 = 1) = 𝑝(𝑇 = 1, 𝐷 = 1) = 𝑝(𝑇 = 1|𝐷 = 1)𝑝(𝐷 = 1)
𝑝(𝑇 = 1) 𝑝(𝑇 = 1)
𝑝(𝑇 = 1) = 𝑝(𝑇 = 1|𝐷 = 1)𝑝(𝐷 = 1) + 𝑝(𝑇 = 1|𝐷 = 0)𝑝(𝐷 = 0)
𝑝(𝑇 = 1) = 0.99 ∗ 0.0001 + 0.01 ∗ 0.9999 = 0.000099 + 0.009999
𝑝(𝐷 = 1|𝑇 = 1) = 0.000099 ≈ 0.0098 0.000099 + 0.009999
An answer that left the fraction un-simplified is given full points.0.9
CS.542 Machine Learning, Fall 2020
2. Gradient Descent
Q2.1 Gradient Descent for a Cost Function
Suppose we have a cost function
𝐽(𝜃)= 1 (∑𝑚 𝑥𝑇𝜃+𝑏𝑦)+1𝜃𝑇𝐴𝜃,
𝑚𝑖=1𝑖 𝑖2
where 𝜃 ∈ R𝑛 is the parameter vector, 𝑥𝑖 ∈ R𝑛, 𝑦𝑖 ∈ R, {𝑥𝑖, 𝑦𝑖} are m training data points, 𝐴 ∈ R𝑛𝑥𝑛
is a symmetric matrix, and 𝑏 ∈ R. We want to find parameters 𝜃 using gradient descent.
a) [3 points] Give the pseudo code for the gradient decent algorithm for a generic cost function
𝐽(𝜃) (not the specific one above).
Answer: initialize 𝜃, then update 𝜃 ∶= 𝜃 − 𝛼 𝜕 𝐽(𝜃) until convergence.
𝜕𝜃
b) [3 points] For the specific function above, what is the vector of partial gradients of the cost function, i.e. the vector with the jth element equal to 𝜕 𝐽(𝜃)?
𝜕𝜃𝑗
c) [3 points] What is the design matrix? Describe its entries and give its dimensions.
Answer: matrix 𝑋 ∈ R𝑚𝑥𝑛 with the ith row equal to 𝑥𝑇. 𝑖
d) [3 points] Re-write the expression for the gradient without using the summation notation ∑. Hint: use the design matrix X.
Answer: 𝐴𝜃 + 1 ∑𝑖 𝑥𝑖 = 𝐴𝜃 + 1 𝑋𝑇𝟏, where 𝟏 is an m-dimensional vector with all entries 𝑚𝑚
equal to 1. As a sanity check, the result is a nx1 vector, as it should be.
e) [3 points] Suppose we run gradient descent for two iterations. Give the expression for 𝜃 after two updates, with step size 𝛼 = 1 and initial value of 𝜃 = 𝟎 (vector of zeros).
Answer: 𝜕 𝐽(𝜃)= 1 ∑𝑖𝑥𝑖 +12𝐴𝜃= 1 ∑𝑖𝑥𝑖 +𝐴𝜃 𝜕𝜃 𝑚 2 𝑚
Answer: 𝜃 ≔ 𝜃 − 𝛼(𝐴𝜃 + 𝑥̅); where 𝑥̅ = 1 𝑋𝑇𝟏
𝑚
𝜃(2) ≔−𝑥̅−1(−𝐴𝑥̅+𝑥̅)=−𝑥̅+𝐴𝑥̅−𝑥̅=(𝐴−2)𝑥̅
𝜃(1) ≔0−1(𝐴0+𝑥̅)=−𝑥̅,
CS.542 Machine Learning, Fall 2020
f) [3 points] How do we know when the algorithm has converged?
Answer: when the cost or parameter is not changing much between iterations.
g) [3 points] Give the closed-form solution for 𝜃. You do not need to prove it is the minimum of the cost.
Answer: Setting the gradient to zero, 𝐴𝜃 + 𝑥̅ = 0 and solving, we get 𝜃 = −𝐴−1𝑥̅
CS.542 Machine Learning, Fall 2020
3. Regression and Classification
Q3.1 Linear Regression: Online Art Auction
Imagine you work for an online art auctioneer. You would like to estimate the price 𝑦 that a piece of art will sell for in an auction (in dollars), based on the following features:
𝑥1= type of art (out of 15 types such as 1:painting, 2:sculpture, etc.), 𝑥2= artist popularity (rank out of 100 artists),
𝑥3= estimated value in dollars,
𝑥4= days in the auction,
𝑥5= previously owned (binary), 𝑥6= is abstract (binary),
…etc.
For example, a feature vector for the 𝑖𝑡h item could be 𝑥(𝑖) = [1, 28, 1700, 5, 1, 0, … ] . You have collected data points from previous auction sales, (𝑥(𝑖), 𝑦(𝑖)), 𝑖 = 1, … , 𝑚.
a. [3 points] You decide to use a linear regression model, 𝑦 = ∑𝑛 𝜃 𝑥 . In what circumstances 𝑗=0 𝑗 𝑗
should you use gradient descent vs normal equations to fit the parameters?
Answer: Use gradient descent when the number of features n is large, i.e. the matrix XTX is too large to invert (O(n3)), otherwise use normal equations.
b. [3 points] Suppose you decide to use gradient descent. How can you tell if it is converging?
Answer: By plotting the cost as a function of the number of iterations: convergence is likely when the decrease in cost diminishes.
c. [3 points] Suppose you’re monitoring convergence and find it is slow. Name two things you
can try to speed it up (be specific).
Answer: 1) Try a larger value for the step size; 2) In this case, some of the features have larger scale which could cause slow convergence, so we could use feature re-scaling to normalize all features, e.g. to [-1 1]
d. [3 points] You want to add new features to improve your predictor. You consider adding total
minutes spent in the auction. Is this a good idea? Why or why not?
Answer: No, because it is redundant with (a constant multiple of) the time in months and would not add new information.
e. [3 points] Your boss does not know how much to trust your prediction 𝑦𝑡𝑒𝑠𝑡 = $15,000 for a certain watercolor painting. She asks you to estimate the probability of the painting selling for more than $20,000. Give the equation for this probability, using a linear regression model that assumes the outputs have Gaussian noise with variance 𝛽−1.
CS.542 Machine Learning, Fall 2020
Answer: Assuming that the outputs y contain Gaussian noise gives us a Normal predictive distribution,
𝑝(𝑦𝑡𝑒𝑠𝑡|𝑥𝑡𝑒𝑠𝑡, 𝜃, 𝛽−1) = 𝑁(𝑦𝑡𝑒𝑠𝑡|h𝜃(𝑥𝑡𝑒𝑠𝑡), 𝛽−1). Therefore the probability of 𝑦𝑡𝑒𝑠𝑡 being more than 20,000 is
𝑝(𝑦𝑡𝑒𝑠𝑡 > 20000) = ∫∞ 𝑁(𝑦𝑡𝑒𝑠𝑡|15000, 𝛽−1)𝑑𝑦. 20000
f. [2 points] What is the probability the painting will sell for more than $15,000?
Answer: 𝑝(𝑦𝑡𝑒𝑠𝑡 > 15000) = ∫∞ 𝑁(𝑦𝑡𝑒𝑠𝑡|15000, 𝛽−1)𝑑𝑦=0.5 15000
g. [3 points] Suppose you now want to estimate the probability that a piece of art does not get any bids, 𝑝(𝑛𝑜_𝑏𝑖𝑑𝑠|𝑥), based on historic data. What sort of features and machine learning method should you use?
Answer: The same features could be used as in the previous example; any classification algorithm could be applied, e.g. logistic regression or LDA or neural net, etc., with a binary class label (got bids or didn’t get bids).
CS.542 Machine Learning, Fall 2020
Q3.2 Softmax Classifier
Typically, we use the softmax function to model the probability of one of several classes given the input. Instead, consider what would happen if a binary classifier with output labels 𝑗 ∈ {0,1} used the softmax function to model the probability of the binary label 𝑦 = 𝑗 given the input:
𝑒𝑤𝑗𝑇𝑥 𝑃(𝑦 = 𝑗|𝑥) = ∑𝑗=0,1 𝑒𝑤𝑗𝑇𝑥
Where 𝑤 is the parameter vector of the jth class. 𝑗
a) [5 points] Show that the solution to this problem can be expressed as a logistic regression classifier with parameter w, and give the expression for w.
Answer: we can use a logistic regression classifier with parameter 𝑤 = 𝑤1 − 𝑤0
𝑒𝑤1𝑇𝑥 1 ∗ (𝑒𝑤1𝑇𝑥) 1 1 ∑𝑗=0,1 𝑒𝑤𝑗𝑇𝑥 = (1 + 𝑒(𝑤0−𝑤1)𝑇𝑥) ∗ (𝑒𝑤1𝑇𝑥) = (1 + 𝑒(𝑤0−𝑤1)𝑇𝑥) = (1 + 𝑒−𝑤𝑇𝑥)
Where we used the fact that 𝑒(𝑤0𝑇𝑥−𝑤1𝑇𝑥) = 𝑒(𝑤0−𝑤1)𝑇𝑥
b) [5 points] Show that the posterior 𝑃(𝑦 = 𝑗|𝑥) is invariant if a constant vector b is added to
both weight vectors 𝑤 . 𝑗
Answer: this is trivial to show using (a), since for any b, 𝑤 = 𝑤0 + b − 𝑤1 − 𝑏 = 𝑤0 − 𝑤1
c) [5 points] (b) implies that the solution 𝑤 is not unique. We can guarantee a unique solution by 𝑗
making the objective function regularized, e.g. using the squared norm regularizer. Write down the objective function and say whether you should minimize or maximize it. Answer: Again, several possibilities, e.g. minimize
𝐽 = − ∑ 𝑙𝑜𝑔𝑃(𝑦 = 𝑦𝑖|𝑥𝑖) + 𝜆‖𝑤‖2 𝑖
CS.542 Machine Learning, Fall 2020
4. Overfitting and Regularization
Q4.1 Bias-Variance and 𝝀
Alice has a binary classification dataset of m points with n-dimensional inputs.
a) [3 points] She has trained several regularized logistic regression models using regularization parameters 𝜆 = 10−1, 10−2, 10−3, 10−4. She computed the cross-validation (CV) and training errors for each value of 𝜆, shown in the table below, but the rows are out of order. Fill in the correct values of 𝜆 for each row.
Train error
CV error
𝜆
80%
85%
40%
45%
70%
76%
35%
50%
Answer: 10−1, 10−3, 10−2, 10−4
b) [3 points] Based on these results, which 𝜆 should she choose, and why?
Answer: 10−3, the one with lowest CV error has best generalization
c) [3 points] Which of the four models will have the highest error due to variance? Why?
Answer: 10−4, it is has the least regularization and is the most complex
𝜆= 𝜆=
Answer: Left: 10−1 because it has high bias, more data doesn’t help; Right: 10−4 because it has high variance, more data may help.
d) [3 points] Alice also plotted learning curves for the models with 𝜆 = 10−1, 10−4. Match each plot with the correct value, and explain why it matches.
CS.542 Machine Learning, Fall 2020
Q4.2 Regularization for Linear Regression
Alice is trying to fit a linear regression model to predict house price based on size using polynomial features. Since her training dataset is very small, she is applying regularization. She fit several models by minimizing the cost function
for λ = 100, 101, 102, 103. The following are sketches of the resulting models.
a) [3 points] Which value of λ goes with each of the plots? (Write it next to the plot)
Answer: 𝑎: 102 𝑏: 100 𝑐: 101 𝑑: 103
b) [3 points] Alice tries her model on a test set. Which model will have the highest error due to
bias?
Answer: d
c) [3 points] Which model will have the highest error due to variance?
Answer: b
d) [3 points] Which model, if any, will always have zero test error?
Answer: none
CS.542 Machine Learning, Fall 2020
5. Maximum Likelihood Principle Q5.1 ML for Probabilistic Linear Regression
where h(𝑥) is the linear regression hypothesis, 𝜃, 𝛽 are parameters and 𝑁(𝑥|𝜇, 𝜎2) is the normal (Gaussian) probability density with mean 𝜇 and variance 𝜎2. Here 𝛽 = 𝜎−2 is the inverse variance of the Gaussian noise that we assume is added to the data.
Recall that probabilistic linear regression defines the likelihood of observing outputs 𝑡(𝑖) ∈ R given inputs 𝑥(𝑖) ∈ R𝑝, where 𝑖 = 1, … , 𝑚 and 𝑚 is the number of samples in the dataset, as
𝑚
𝑝(𝑡1, … , 𝑡𝑚|𝑥1 … , 𝑥𝑚, 𝜃, 𝛽) = ∏ 𝑁(𝑡(𝑖)|h(𝑥(𝑖)), 𝛽−1) 𝑖=1
(a) [8 points] Find 𝛽𝑀𝐿, the maximum likelihood solution for 𝛽. Hint: maximize log likelihood with respect to only 𝛽.
Answer: maximize log likelihood w.r.t. 𝛽. The likelihood function is: 𝑚
ln𝑝(𝒕|𝒙,𝜃,𝛽)=−𝛽∑(h(𝑥(𝑖))−𝑡(𝑖))2 +𝑚ln𝛽−𝑚ln(2𝜋) 2𝑖=1 22
Take partial derivative w.r.t. 𝛽 and set it to 0, then solve for w.r.t. 𝛽 𝑚
− 1 ∑(h(𝑥(𝑖)) − 𝑡(𝑖))2 + 𝑚 1 = 0
2 𝑖=1 1
2 𝛽
𝑚
= 1 ∑(h(𝑥(𝑖)) − 𝑡(𝑖))2 𝑚 𝑖=1
𝛽𝑀𝐿
To get 𝛽𝑀𝐿 we just need to take the inverse of the right-hand side.
Answer: It is the inverse of the average squared error of the prediction.
(b) [2 points] What is the interpretation of the solution 𝛽𝑀𝐿? Explain in one sentence.
CS.542 Machine Learning, Fall 2020
Q5.2 ML for Linear Regression with Multivariate Outputs
Consider a probabilistic linear regression model for a multivariate p-dimensional target variable 𝑡 = [𝑡1, … , 𝑡𝑝]𝑇 that has a Gaussian distribution over t of the form
𝑝(𝑡|𝑊, Σ) = 𝑁(𝑡|𝑦(𝜙(𝑥), 𝑊), Σ) where 𝜙(𝑥) is a basis function representation of the input, and
𝑦(𝑥, 𝑊) = 𝑊𝑇𝜙(𝑥)
𝑊 is a matrix of parameters and Σ is the covariance parameter matrix. We are given a training dataset
of input basis vectors 𝑥𝑛 and corresponding target vectors 𝑡𝑛, 𝑛 = 1, … , 𝑁. Show that the Maximum
Likelihood solution 𝑊 for the parameter matrix 𝑊has the property that each column is given by 𝑀𝐿
𝑤𝑀𝐿 = (Φ𝑇Φ)−1Φ𝑇𝑡
where Φ is the design matrix. You do not need to provide a solution for Σ. Show all your steps..
Hint: the p-dimensional multivariate normal distribution is given by
𝑁(𝑥; 𝜇, Σ) = 1 exp (− 1 (𝑥 − 𝜇)𝑇Σ−1(𝑥 − 𝜇)
Then the likelihood is
𝐿(𝑡1,…,𝑡𝑁|𝑥;𝑊,Σ)=
and the log likelihood is
𝑙𝑛𝐿(𝑡1, … , 𝑡𝑁|𝑥; 𝑊, Σ)
𝑝12 (2𝜋)2 |Σ|2
Hint: you may also find some of the matrix differentiation rules in the appendix helpful.
Answer: First we write down the Likelihood function, or the probability of the output datapoints given the input datapoints and parameters. The probability of a single datapoint is the multivariate normal
𝑝(𝑡|𝑥; 𝑊, Σ) = 1 exp (− 1 (𝑡 − 𝑊𝑇𝜙(𝑥))𝑇Σ−1(𝑡 − 𝑊𝑇𝜙(𝑥))) 𝑝12
(2𝜋)2 |Σ|2
1 𝑁 1 𝑇 𝑇−1 𝑇
𝑝 1 ∏exp(−2(𝑡𝑛 −𝑊 𝜙(𝑥𝑛)) Σ (𝑡𝑛 −𝑊 𝜙(𝑥𝑛)))
(2𝜋)2|Σ|2 𝑛=1
𝑁
2 2 2𝑛=1
Taking the derivative with respect to W, the first two terms vanish, and we set the rest to zero
𝑁
𝛿 𝑙𝑛𝐿 = 𝛿 (− 1 ∑(𝑡𝑛 − 𝑊𝑇𝜙(𝑥𝑛))𝑇Σ−1(𝑡𝑛 − 𝑊𝑇𝜙(𝑥𝑛))) = 0 𝛿𝑊 𝛿𝑊 2𝑛=1
Applying chain rule with 𝑢 = 𝑡𝑛 − 𝑊𝑇𝜙(𝑥𝑛), and using the matrix derivative rules
= − 𝑁𝑝 𝑙𝑛(2𝜋) − 𝑁 𝑙𝑛|Σ| − 1 ∑(𝑡𝑛 − 𝑊𝑇𝜙(𝑥𝑛))𝑇Σ−1(𝑡𝑛 − 𝑊𝑇𝜙(𝑥𝑛))
CS.542 Machine Learning, Fall 2020
1𝑁𝛿 𝑇 𝛿𝑇−1 −2∑𝛿𝑊(𝑡𝑛−𝑊𝜙(𝑥𝑛))𝛿𝑢(𝑢Σ 𝑢)=0
𝑛=1
1𝑁
− 2 ∑(−𝜙(𝑥𝑛))(2𝑢𝑇Σ−1) = 0
𝑛=1 1𝑁
− 2 ∑(−𝜙(𝑥𝑛))(2(𝑡𝑛 − 𝑊𝑇𝜙(𝑥𝑛))𝑇Σ−1) = 0
𝑛=1 Multiplying both sides by Σ−1
𝑁
∑ 𝜙(𝑥𝑛)(𝑡𝑛 − 𝑊𝑇𝜙(𝑥𝑛))𝑇 = 0 𝑛=1
Let Φ be the Nxm data design matrix, where m is the dimension of 𝜙(𝑥𝑛), and 𝑇 be the Nxp output matrix of output vectors 𝑡𝑛, then we can rewrite the above as
Φ𝑇ΦW = ΦT𝑇
Then the ML solution is given by the matrix below, with each column equal to 𝑤𝑀𝐿as required
W = (Φ𝑇Φ)−1ΦT𝑇 ML
Q5.3 ML for Poisson Regression
This problem asks you to derive the maximum likelihood solution for a Poisson hypothesis.
a) [3 points] Given a training set {𝑥𝑖 , 𝑦𝑖 }, Poisson regression models the probability of observing
an output given an input as 𝑝(𝑦𝑖|𝜆) = 1 𝜆𝑦𝑖𝑒−𝜆 where 𝜆 = 𝑒𝜃𝑇𝑥𝑖 for some parameter vector 𝑦𝑖!
𝜃. Derive the cost function 𝐽(𝜃) corresponding to maximizing the log likelihood for a single training example.
Answer: log(𝑝(𝑦𝑖|𝑥𝑖,𝜃)=log(𝑒𝑦𝑖𝜃𝑇𝑥𝑖)+log(𝑒−𝑒𝜃𝑇𝑥𝑖)−log(𝑦𝑖!) = 𝑦𝑖𝜃𝑇𝑥𝑖 − 𝑒𝜃𝑇𝑥𝑖 − log(𝑦𝑖!)
b) [3 points] The cost function in (b) has no closed form solution, so we must use an iterative method. Show the stochastic gradient descent update for this cost function.
Answer: 𝜕 log (𝑝(𝑦𝑖|𝑥𝑖,𝜃) = 𝑦𝑖𝑥𝑖 − 𝑥𝑖𝑒𝜃𝑇𝑥𝑖 = (𝑦𝑖 − 𝑒𝜃𝑇𝑥𝑖)𝑥𝑖, so the update is
𝜕𝜃
𝜃 ≔ 𝜃 + 𝛼(𝑦𝑖 − 𝑒𝜃𝑇𝑥𝑖)𝑥𝑖
Note that we are maximizing the function so we use gradient ascent, not descent.
CS.542 Machine Learning, Fall 2020
6. Unsupervised Learning
Q6.1 Principle Component Analysis
a) PCA assumes a specific relationship between the unobserved latent coordinates 𝑧 and the observed data points 𝑥. Express this relationship as an equation. Clearly identify and name the parameters which are learned.
Answer: 𝑧𝑘 = 𝑢(𝑘)𝑇𝑥̅ where 𝑥̅ is a normalized data point and 𝑢(𝑘), 𝑘 = 1, … , 𝐾 are the principle component vectors, which are the parameters that need to be learned.
b) Name one objective function which could be minimized to learn the parameters of PCA.
Answer: Reconstruction error. Another objective is maximizing the total variance of projected points.
c) For a dataset of arbitrary points 𝑥(1), … , 𝑥(𝑚), specify the steps of the PCA algorithm.
Answer: First, normalize the points to have zero mean and unit standard deviation in each coordinate. Then, compute the covariance matrix of the data, and decompose it with Singular Value Decomposition to obtain its eigenvectors/values. Finally, project each point onto the top k eigenvectors to obtain the lower-dimensional points. The eigenvalues can be used to determine how many components to keep in order to preserve a certain percentage of the total variance.
d) Suppose you are given 2D feature vectors for a classification task which are distributed according to the figure below. You apply PCA to the entire dataset. On the figure, draw all the PCA components.
e) In (d) above, could you use PCA components directly to classify the data (without training a classifier)? Explain.
Answer: yes, we can project the points onto the second eigenvector, 𝑢2, and then threshold the one-dimensional points at 0 to assign the class label. Specifically, the classifier will assign labels 𝑠𝑖𝑔𝑛((𝑢2)𝑇𝑥) where +1 corresponds to class 1.
CS.542 Machine Learning, Fall 2020
Q6.2 Gaussian Mixture Models
a) Describe in words the two main steps of the Expectation Maximization algorithm used to solve Gaussian Mixture Models.
Answer: the E step estimates the values of the latent variables, the M step maximizes the
likelihood of the data given the latent variables computed in the E step.
b) True or False: In the case of fully observed data, i.e. when all latent variables are observed, EM reduces to Maximum Likelihood.
Answer: True
c) True or False: Since the EM algorithm guarantees that the value of its objective function will increase after each iteration, it is guaranteed to eventually reach the global maximum.
Answer: False. The algorithm is guaranteed to converge (as opposed to iterating forever) because the objective function is increased at each step, but it may converge to a local
maximum rather than global one.
d) Sketch a dataset on which K-Means would work poorly but a Gaussian Mixture Model with the same number of clusters would do well. Describe why K-Means wouldn’t work well.
Answer: One advantage of GMM over k-means is that it accounts for data variance (or covariance in multiple dimensions.) One way to think about the k-means model is that it places a circle (or, in higher dimensions, a hyper-sphere) at the center of each cluster, with a radius defined by the most distant point in the cluster, as shown on the left. This works fine when your data is circular. However, when your data takes on different shape, you end up with
something like the middle plot. On the other hand, a GMM can handle even non-circular clusters (right) by using the covariance to calculate the distance.
CS.542 Machine Learning, Fall 2020
7. Neural Networks
Q7.1 Neural Network for XOR
Design a neural network to solve the XOR problem, i.e. the network should output 1 if only one of the two binary input variables is 1, and 0 otherwise (see left figure). Use the hyperbolic tangent, or tanh, activation function in all nodes (right figure), which ranges in [-1, +1].
Note that (A XOR B) can be expressed as (A OR B) AND NOT(A AND B)), as illustrated below:
In the diagrams below, we filled in most of the tanh units’ parameters. Fill in the remaining parameters, keeping in mind that tanh outputs +1/-1, not 0/1. Note that we need to appropriately change the second layer (the AND node) to take +1/-1 as inputs. Also, we must add an extra last layer to convert the final output from +1/-1 to 0/1. Hint: assume tanh outputs −1 for any input 𝑥 ≤ −2, +1 for any input 𝑥 ≥ +2, 0 for 𝑥 = 0.
?
6
-4 2
-2 ??
4
Answer: -2, -4, +2
CS.542 Machine Learning, Fall 2020
Q7.2 Computation Graph and Backpropagation
In class, we learned how to take a complex function that consists of multiple nested functions and represent it with a computation graph, which allows us to write down the forward and backward pass used to compute the function gradient.
a) Practice converting different functions 𝑓 (𝑥) = 𝑓 (𝑓 (… 𝑓 (𝑥)) of input vector 𝑥 𝜃 𝑘𝑘−11
parametrized by 𝜃 to their computation graphs. Answer: see lecture notes for examples.
b) For the computation graphs obtained in (a), write down the forward pass and the backward pass equations.
Answer: see lecture notes for examples.
Q7.3 Neural Network Architectures
a) Draw a convolutional network with input 𝑥 ∈ 𝑅4, one hidden layer with 2×1 filters and 2 channels with stride 2, and a fully-connected output layer with one neuron. How many parameters does this network have?
Answer: 9 parameters, see w’s in the plot. The c’s are the hidden convolutional units, with each channel sharing parameters. Note that the last layer should also have a “dummy” unit set to 1 and thus an extra parameter.
b) What algorithm is used for learning the parameters of a recurrent network? Name the algorithm and sketch out its main steps.
Answer: Backpropagation in time. The steps for computing the gradient for 1 training sequence are as follows. First, we unroll the network by replicating it once for each time step in our training sequence. Then we use regular backprop on the unrolled network to find the gradient. See lecture notes for more detail.
CS.542 Machine Learning, Fall 2020
Appendix: Useful Formulas
Matrix Derivatives
For vectors x, y and matrix A,
𝑦=𝐴𝑥,then𝜕𝑦 =𝐴 𝜕𝑥
If 𝑧 = 𝑥𝑇𝐴𝑥, then 𝜕𝑧 = 𝑥𝑇(𝐴 + 𝐴𝑇). For the special case of a symmetric matrix A, 𝜕𝑧 = 2𝑥𝑇𝐴. 𝜕𝑥 𝜕𝑥
Chain Rule: if 𝑧 is a function of 𝑦, which is a function of 𝐴, then 𝜕𝑧 = 𝜕𝑦 𝜕𝑧 (note the order). 𝜕𝐴 𝜕𝐴 𝜕𝑦
Single-Dimension Normal Distribution
2 1
𝑁(𝑥;𝜇,σ )=𝜎√2𝜋 exp(−
Multivariate Normal Distribution
(𝑥 − 𝜇)2 2𝜎2 )
The p-dimensional multivariate normal distribution with mean 𝜇 and covariance matrix Σ is given by 𝑁(𝑥;𝜇,Σ) = 1 exp (−1(𝑥 − 𝜇)𝑇Σ−1(𝑥 − 𝜇))
𝑝12 (2𝜋)2 |Σ|2