
CS.542 Machine Learning, Fall 2019, Prof. Saenko
Machine Learning Midterm Practice Problems
Some of these sample problems have been used in past exams and are provided for practice, in addition to the homework problems, which you should also review. A typical exam has around 5 questions worth a total of 100 points. The exam is closed book, no electronics. A single 8"x11" sheet of paper with typed or handwritten notes on both sides is allowed.
Contents
1. Math and Probability Basics
   Q1.1 Definitions
   Q1.2 Covariance Matrix
   Q1.3 Matrix Norm
   Q1.4 Flu Virus Test
2. Gradient Descent
   Q2.1 Gradient Descent for a Cost Function
3. Regression and Classification
   Q3.1 Linear Regression: Online Art Auction
   Q3.2 Softmax Classifier
4. Overfitting and Regularization
   Q4.1 Bias-Variance and λ
   Q4.2 Regularization for Linear Regression
5. Maximum Likelihood Principle
   Q5.1 ML for Probabilistic Linear Regression
   Q5.2 ML for Linear Regression with Multivariate Outputs
   Q5.3 ML for Poisson Regression
6. Unsupervised Learning
   Q6.1 Principal Component Analysis
   Q6.2 Gaussian Mixture Models
7. Neural Networks
   Q7.1 Neural Network for XOR
   Q7.2 Computation Graph and Backpropagation
   Q7.3 Neural Network Architectures
Appendix: Useful Formulas

1. Math and Probability Basics
Q1.1 Definitions
[a] Give the definition of an orthogonal matrix.
[b] Give the definition of an eigenvector and eigenvalue.
[c] How is the probability density function different from the cumulative probability distribution?
[d] What is a ‘singular’ matrix?
[e] Give the definition of Bayes' Rule.
Answer: see Bishop or Wikipedia for these definitions. These and other pre-requisites you should brush up on were discussed in the first week of lectures.

Q1.2 Covariance Matrix
Recall that in the derivation of the normal equations in class, we used the fact that the data covariance matrix, i.e. the matrix whose entry in the (k, j) position is the covariance between the k-th and j-th elements of the random input vector, is given by

∑_{i=1}^m x^(i) x^(i)T = X^T X

where x^(i) is the i-th n×1 input vector, m is the number of input vectors in the dataset, and X is the m×n design matrix. Show that the above equality is true. Show all your steps.
Answer: using linear algebra rules, we can prove this by re-writing the vector multiplication inside the sum as a matrix, then re-writing the sum of matrices as a single matrix with sums as its elements, and finally as a product of two matrices.
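The identity is also easy to sanity-check numerically; here is a quick sketch (NumPy, with illustrative shapes and random data):

```python
import numpy as np

# Numeric check of the identity sum_i x_i x_i^T = X^T X.
# Shapes mirror the problem statement: m inputs, each an n x 1 vector.
rng = np.random.default_rng(0)
m, n = 50, 4
X = rng.normal(size=(m, n))          # design matrix: i-th row is x_i^T

# Left-hand side: sum of m outer products, one per data point.
lhs = sum(np.outer(X[i], X[i]) for i in range(m))

# Right-hand side: a single matrix product.
rhs = X.T @ X

assert np.allclose(lhs, rhs)
```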

Q1.3 Matrix Norm
The trace of a square matrix A ∈ R^{n×n} is defined as the sum of its diagonal entries, tr(A) = ∑_{i=1}^n A_ii. Prove the following fact:

tr(A^T A) = ‖A‖_F²

where ‖A‖_F is the matrix Frobenius norm. Show all your steps.
Answer: this is straightforward to show using linear algebra rules. Without providing all the details here, the idea is to first write each diagonal element of the matrix C = A^T A as C_kk = ∑_{i=1}^n A_ik A_ik, and then compute the trace as the sum of these diagonal elements, taking all sums outside: tr(A^T A) = ∑_{k=1}^n ∑_{i=1}^n A_ik² = ‖A‖_F².
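A quick numeric check of the trace identity (NumPy, random matrix):

```python
import numpy as np

# Check tr(A^T A) = ||A||_F^2 on a random matrix.
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))

trace_ATA = np.trace(A.T @ A)            # sum of diagonal entries of C = A^T A
fro_sq = np.linalg.norm(A, "fro") ** 2   # Frobenius norm squared: sum of A_ij^2

assert np.isclose(trace_ATA, fro_sq)
```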
Q1.4 Flu Virus Test
After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a flu virus, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the virus is 0.99, as is the probability of testing negative given that you don't have the disease). The good news is that this is a rare virus, striking only one in 10,000 people (10^-4). What are the chances that you actually have the disease? Show your calculations as well as giving the final result. Hint: use Bayes' Rule.
Answer: This is a straightforward application of the rules of probability. Let T be the binary variable with value 1 if your test is positive and 0 if negative, and let D be a variable with value 1 if you have the disease and 0 if you don't. We are given:
p(T=1|D=1) = 0.99, p(T=0|D=1) = 0.01
p(T=0|D=0) = 0.99, p(T=1|D=0) = 0.01
p(D=1) = 0.0001, p(D=0) = 0.9999
We want the probability of having the disease given that the test was positive, p(D=1|T=1), which is given by

p(D=1|T=1) = p(T=1, D=1) / p(T=1) = p(T=1|D=1) p(D=1) / p(T=1)

p(T=1) = p(T=1|D=1) p(D=1) + p(T=1|D=0) p(D=0) = 0.99 × 0.0001 + 0.01 × 0.9999 = 0.000099 + 0.009999

p(D=1|T=1) = 0.000099 / (0.000099 + 0.009999) ≈ 0.0098
An answer that left the fraction un-simplified is given full points.
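The calculation can be reproduced in a few lines of Python:

```python
# Direct computation of the posterior from the numbers in the problem.
p_t1_d1 = 0.99    # p(T=1 | D=1): test sensitivity
p_t1_d0 = 0.01    # p(T=1 | D=0): false positive rate
p_d1 = 1e-4       # prior p(D=1)
p_d0 = 1 - p_d1

# Total probability of a positive test (law of total probability).
p_t1 = p_t1_d1 * p_d1 + p_t1_d0 * p_d0

# Bayes' rule: p(D=1 | T=1).
posterior = p_t1_d1 * p_d1 / p_t1

print(round(posterior, 4))  # 0.0098
```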

2. Gradient Descent
Q2.1 Gradient Descent for a Cost Function
Suppose we have a cost function

J(θ) = (1/m) (∑_{i=1}^m x_i^T θ + b y_i) + (1/2) θ^T A θ,

where θ ∈ R^n is the parameter vector, x_i ∈ R^n, y_i ∈ R, {x_i, y_i} are m training data points, A ∈ R^{n×n} is a symmetric matrix, and b ∈ R. We want to find parameters θ using gradient descent.
a) [3 points] Give the pseudocode for the gradient descent algorithm for a generic cost function J(θ) (not the specific one above).
Answer: initialize θ, then update θ := θ − α ∂J(θ)/∂θ until convergence.
b) [3 points] For the specific function above, what is the vector of partial derivatives of the cost function, i.e. the vector whose j-th element equals ∂J(θ)/∂θ_j?
Answer: ∂J(θ)/∂θ = (1/m) ∑_i x_i + (1/2)(2Aθ) = (1/m) ∑_i x_i + Aθ
c) [3 points] What is the design matrix? Describe its entries and give its dimensions.
Answer: the matrix X ∈ R^{m×n} whose i-th row is x_i^T.
d) [3 points] Re-write the expression for the gradient without using the summation notation ∑.
Hint: use the design matrix X.
Answer: Aθ + (1/m) ∑_i x_i = Aθ + (1/m) X^T 1, where 1 is an m-dimensional vector with all entries equal to 1. As a sanity check, the result is an n×1 vector, as it should be.

e) [3 points] Suppose we run gradient descent for two iterations. Give the expression for θ after two updates, with step size α = 1 and initial value θ = 0 (the vector of zeros).
Answer: θ := θ − α(Aθ + x̄), where x̄ = (1/m) X^T 1
θ^(1) := 0 − 1·(A·0 + x̄) = −x̄
θ^(2) := −x̄ − 1·(A(−x̄) + x̄) = −x̄ + Ax̄ − x̄ = (A − 2I)x̄
f) [3 points] How do we know when the algorithm has converged?
Answer: when the cost or parameter is not changing much between iterations.
g) [3 points] Give the closed-form solution for 𝜃𝜃. You do not need to prove it is the minimum of the cost.
Answer: Setting the gradient to zero, Aθ + x̄ = 0, and solving, we get θ = −A^{-1} x̄.
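Parts (e) and (g) can be checked numerically; here is a sketch with an arbitrary symmetric, invertible A and synthetic data (NumPy):

```python
import numpy as np

# Two gradient descent steps with alpha = 1, theta0 = 0 should give
# (A - 2I) x_bar, and the closed form -A^{-1} x_bar should zero the gradient.
rng = np.random.default_rng(2)
m, n = 20, 3
X = rng.normal(size=(m, n))
B = rng.normal(size=(n, n))
A = B + B.T + 4 * np.eye(n)          # symmetric (and, here, invertible) matrix
x_bar = X.T @ np.ones(m) / m         # (1/m) X^T 1

grad = lambda th: A @ th + x_bar     # gradient derived in part (d)

theta = np.zeros(n)
for _ in range(2):                   # two updates with step size 1
    theta = theta - 1.0 * grad(theta)
assert np.allclose(theta, (A - 2 * np.eye(n)) @ x_bar)   # part (e)

theta_star = -np.linalg.solve(A, x_bar)                  # part (g)
assert np.allclose(grad(theta_star), 0)
```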

3. Regression and Classification
Q3.1 Linear Regression: Online Art Auction
Imagine you work for an online art auctioneer. You would like to estimate the price y that a piece of art will sell for in an auction (in dollars), based on the following features:
x1 = type of art (out of 15 types such as 1: painting, 2: sculpture, etc.),
x2 = artist popularity (rank out of 100 artists),
x3 = estimated value in dollars,
x4 = days in the auction,
x5 = previously owned (binary),
x6 = is abstract (binary),
… etc.
For example, the feature vector for the i-th item could be x^(i) = [1, 28, 1700, 5, 1, 0, …]. You have collected data points from previous auction sales, (x^(i), y^(i)), i = 1, …, m.
a. [3 points] You decide to use a linear regression model, y = ∑_{j=0}^n θ_j x_j. In what circumstances should you use gradient descent vs. the normal equations to fit the parameters?
Answer: Use gradient descent when the number of features n is large, i.e. when the n×n matrix X^T X is too expensive to invert (O(n³)); otherwise use the normal equations.
b. [3 points] Suppose you decide to use gradient descent. How can you tell if it is converging?
Answer: By plotting the cost as a function of the number of iterations: convergence is likely when the decrease in cost diminishes.
c. [3 points] Suppose you’re monitoring convergence and find it is slow. Name two things you can try to speed it up (be specific).
Answer: 1) Try a larger value for the step size; 2) in this case, some of the features have much larger scale than others, which can cause slow convergence, so we could rescale all features to a common range, e.g. [−1, 1].
d. [3 points] You want to add new features to improve your predictor. You consider adding total minutes spent in the auction. Is this a good idea? Why or why not?
Answer: No, because it is redundant with (a constant multiple of) the days in the auction and would not add new information.

e. [3 points] Your boss does not know how much to trust your prediction y_test = $15,000 for a certain watercolor painting. She asks you to estimate the probability of the painting selling for more than $20,000. Give the equation for this probability, using a linear regression model that assumes the outputs have Gaussian noise with variance β^-1.
Answer: Assuming that the outputs y contain Gaussian noise gives us a normal predictive distribution,

p(y_test | x_test, θ, β^-1) = N(y_test | h_θ(x_test), β^-1).

Therefore the probability of y_test being more than 20,000 is

p(y_test > 20000) = ∫_{20000}^∞ N(y_test | 15000, β^-1) dy_test.
f. [2 points] What is the probability the painting will sell for more than $15,000?
Answer: p(y_test > 15000) = ∫_{15000}^∞ N(y_test | 15000, β^-1) dy_test = 0.5, since the integral covers exactly half the probability mass of a Gaussian centered at 15000.
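The predictive probabilities in (e) and (f) can be computed with the standard normal CDF; a sketch in Python, where the value of the precision β is an illustrative assumption:

```python
import math

# Tail probability of a Gaussian prediction, using the standard normal CDF
# Phi(z) = (1 + erf(z / sqrt(2))) / 2.
def p_greater(threshold, mean, beta):
    sd = 1.0 / math.sqrt(beta)           # beta is the noise precision (1/variance)
    z = (threshold - mean) / sd
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

beta = 1e-6                              # assumed precision (sd = $1000)
print(p_greater(15000, 15000, beta))     # 0.5, independent of beta
print(p_greater(20000, 15000, beta))     # tiny: 5 standard deviations above the mean
```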
g. [3 points] Suppose you now want to estimate the probability that a piece of art does not get any bids, 𝑝𝑝(𝑛𝑛𝑛𝑛_𝑏𝑏𝑖𝑖𝑑𝑑𝑏𝑏|𝑥𝑥), based on historic data. What sort of features and machine learning method should you use?
Answer: The same features could be used as in the previous example; any classification algorithm could be applied, e.g. logistic regression or LDA or neural net, etc., with a binary class label (got bids or didn’t get bids).

Q3.2 Softmax Classifier
Typically, we use the softmax function to model the probability of one of several classes given the input. Instead, consider what would happen if a binary classifier with output labels 𝑗𝑗 ∈ {0,1} used the softmax function to model the probability of the binary label 𝑦𝑦 = 𝑗𝑗 given the input:
P(y = j | x) = e^{w_j^T x} / (e^{w_0^T x} + e^{w_1^T x}),

where w_j is the parameter vector of the j-th class.
a) [5 points] Show that the solution to this problem can be expressed as a logistic regression classifier with parameter w, and give the expression for w.
Answer: we can use a logistic regression classifier with parameter w = w_1 − w_0:

P(y = 1 | x) = e^{w_1^T x} / (e^{w_0^T x} + e^{w_1^T x}) = e^{w_1^T x} / ((1 + e^{(w_0 − w_1)^T x}) · e^{w_1^T x}) = 1 / (1 + e^{(w_0 − w_1)^T x}) = 1 / (1 + e^{−w^T x}),

where we used the fact that e^{w_0^T x − w_1^T x} = e^{(w_0 − w_1)^T x}.
b) [5 points] Show that the posterior 𝑃𝑃(𝑦𝑦 = 𝑗𝑗|𝑥𝑥) is invariant if a constant vector b is added to both weight vectors 𝑤𝑤𝑗𝑗.
Answer: this is trivial to show using (a), since for any b, w = (w_1 + b) − (w_0 + b) = w_1 − w_0.
c) [5 points] (b) implies that the solution 𝑤𝑤𝑗𝑗 is not unique. We can guarantee a unique solution by making the objective function regularized, e.g. using the squared norm regularizer. Write down the objective function and say whether you should minimize or maximize it.
Answer: Again, there are several possibilities; e.g., minimize

J = −∑_i log P(y = y_i | x_i) + λ‖w‖²
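The equivalence in (a) and the shift invariance in (b) are easy to confirm numerically (NumPy, random weights and input):

```python
import numpy as np

# Check that a two-class softmax with weights w0, w1 equals a logistic
# (sigmoid) classifier with w = w1 - w0, and that shifting both weight
# vectors by any b leaves the posterior unchanged.
rng = np.random.default_rng(3)
w0, w1, b, x = (rng.normal(size=4) for _ in range(4))

def softmax_p1(w0, w1, x):
    s = np.exp([w0 @ x, w1 @ x])
    return s[1] / s.sum()

sigmoid_p1 = 1.0 / (1.0 + np.exp(-(w1 - w0) @ x))

assert np.isclose(softmax_p1(w0, w1, x), sigmoid_p1)                   # part (a)
assert np.isclose(softmax_p1(w0 + b, w1 + b, x), softmax_p1(w0, w1, x))  # part (b)
```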

4. Overfitting and Regularization
Q4.1 Bias-Variance and λ
Alice has a binary classification dataset of m points with n-dimensional inputs.
a) [3 points] She has trained several regularized logistic regression models using regularization parameters λ = 10^-1, 10^-2, 10^-3, 10^-4. She computed the cross-validation (CV) and training errors for each value of λ, shown in the table below, but the rows are out of order. Fill in the correct value of λ for each row.

Train error   CV error
80%           85%
40%           45%
70%           76%
35%           50%

Answer: 10^-1, 10^-3, 10^-2, 10^-4

b) [3 points] Based on these results, which λ should she choose, and why?
Answer: 10^-3; the model with the lowest CV error generalizes best.
c) [3 points] Which of the four models will have the highest error due to variance? Why?
Answer: 10^-4; it has the least regularization and is therefore the most complex model.
d) [3 points] Alice also plotted learning curves for the models with λ = 10^-1 and 10^-4. Match each plot with the correct value, and explain why it matches.
[Figure: two learning-curve plots, each labeled λ = ?]
Answer: Left: 10^-1, because it has high bias, so more data doesn't help; Right: 10^-4, because it has high variance, so more data may help.

Q4.2 Regularization for Linear Regression
Alice is trying to fit a linear regression model to predict house price based on size, using polynomial features. Since her training dataset is very small, she is applying regularization. She fit several models by minimizing the regularized cost function (squared error plus λ times the squared norm of the parameters) for λ = 10^0, 10^1, 10^2, 10^3. The following are sketches of the resulting models.

[Figure: four sketches of the fitted models, labeled a-d]

a) [3 points] Which value of λ goes with each of the plots? (Write it next to the plot.)
Answer: a: 10^2, b: 10^0, c: 10^1, d: 10^3
b) [3 points] Alice tries her model on a test set. Which model will have the highest error due to bias?
Answer: d
c) [3 points] Which model will have the highest error due to variance?
Answer: b
d) [3 points] Which model, if any, will always have zero test error?
Answer: none
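The qualitative behavior (larger λ, smoother fit, higher training error and bias) can be reproduced with a small ridge-regression sketch (NumPy, synthetic data; the polynomial degree and λ values here are illustrative, not the ones in the problem):

```python
import numpy as np

# Fit a degree-5 polynomial to a tiny noisy dataset with ridge penalties of
# different strength. lambda = 0 chases the noise; large lambda adds bias.
rng = np.random.default_rng(4)
m = 10
x = np.linspace(0, 1, m)
y = 2 * x + rng.normal(scale=0.3, size=m)    # noisy "house price" data

Phi = np.vander(x, 6, increasing=True)       # polynomial features 1, x, ..., x^5

def ridge_fit(lam):
    # theta = (Phi^T Phi + lam I)^{-1} Phi^T y
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)

# Training error is non-decreasing in lambda for ridge regression.
train_err = [np.mean((Phi @ ridge_fit(lam) - y) ** 2) for lam in (0.0, 1.0, 1e3)]
assert train_err[0] <= train_err[1] <= train_err[2]
```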

5. Maximum Likelihood Principle
Q5.1 ML for Probabilistic Linear Regression
Recall that probabilistic linear regression defines the likelihood of observing outputs t^(i) ∈ R given inputs x^(i) ∈ R^p, where i = 1, …, m and m is the number of samples in the dataset, as

p(t^(1), …, t^(m) | x^(1), …, x^(m), θ, β) = ∏_{i=1}^m N(t^(i) | h(x^(i)), β^-1)

where h(x) is the linear regression hypothesis, θ and β are parameters, and N(x|μ, σ²) is the normal (Gaussian) probability density with mean μ and variance σ². Here β = σ^-2 is the inverse variance of the Gaussian noise that we assume is added to the data.
(a) [8 points] Find β_ML, the maximum likelihood solution for β. Hint: maximize the log likelihood with respect to β only.
Answer: maximize the log likelihood w.r.t. β. The log likelihood is:

ln p(t|x, θ, β) = −(β/2) ∑_{i=1}^m (h(x^(i)) − t^(i))² + (m/2) ln β − (m/2) ln(2π)

Take the partial derivative w.r.t. β and set it to 0:

−(1/2) ∑_{i=1}^m (h(x^(i)) − t^(i))² + (m/2)(1/β) = 0

Solving for β:

1/β_ML = (1/m) ∑_{i=1}^m (h(x^(i)) − t^(i))²

To get β_ML we just need to take the inverse of the right-hand side.
(b) [2 points] What is the interpretation of the solution β_ML? Explain in one sentence.
Answer: It is the inverse of the average squared error of the prediction.
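A numeric check that β_ML, the inverse mean squared error, does maximize the log likelihood (NumPy, synthetic data, with an assumed fixed hypothesis h):

```python
import numpy as np

# Verify that beta_ML = 1 / mean squared error maximizes the log likelihood.
rng = np.random.default_rng(5)
m = 200
x = rng.normal(size=m)
t = 1.5 * x + rng.normal(scale=0.5, size=m)   # outputs with Gaussian noise
h = 1.5 * x                                   # predictions of an assumed fixed hypothesis

def log_lik(beta):
    sq = np.sum((h - t) ** 2)
    return -0.5 * beta * sq + 0.5 * m * np.log(beta) - 0.5 * m * np.log(2 * np.pi)

beta_ml = 1.0 / np.mean((h - t) ** 2)

# beta_ML should score at least as high as nearby candidate values.
candidates = beta_ml * np.array([0.25, 0.5, 0.9, 1.0, 1.1, 2.0, 4.0])
assert all(log_lik(b) <= log_lik(beta_ml) + 1e-9 for b in candidates)
```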

Q5.2 ML for Linear Regression with Multivariate Outputs
Consider a probabilistic linear regression model for a multivariate p-dimensional target variable t = (t_1, …, t_p)^T that has a Gaussian distribution over t of the form

p(t|W, Σ) = N(t | y(φ(x), W), Σ)

where φ(x) is a basis function representation of the input,

y(x, W) = W^T φ(x),

W is a matrix of parameters, and Σ is the covariance parameter matrix. We are given a training dataset of input basis vectors x_n and corresponding target vectors t_n, n = 1, …, N. Show that the maximum likelihood solution W_ML for the parameter matrix W has the property that each column is given by

w_ML = (Φ^T Φ)^{-1} Φ^T t

where Φ is the design matrix. You do not need to provide a solution for Σ. Show all your steps.
Hint: the p-dimensional multivariate normal distribution is given by

N(x; μ, Σ) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp(−(1/2)(x − μ)^T Σ^{-1} (x − μ))

Hint: you may also find some of the matrix differentiation rules in the appendix helpful.
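Before deriving it, the claimed solution can be sanity-checked numerically: each column of W_ML should match an ordinary least-squares fit for the corresponding output dimension (NumPy, synthetic data; the names follow the problem statement):

```python
import numpy as np

# Check W_ML = (Phi^T Phi)^{-1} Phi^T T column by column.
rng = np.random.default_rng(6)
N, mdim, p = 40, 3, 2                 # samples, basis dimension, output dimension
Phi = rng.normal(size=(N, mdim))      # design matrix of basis vectors
T = rng.normal(size=(N, p))           # output matrix, rows are t_n^T

W_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ T)

for j in range(p):                    # column j solves a scalar regression
    w_j, *_ = np.linalg.lstsq(Phi, T[:, j], rcond=None)
    assert np.allclose(W_ml[:, j], w_j)
```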
Answer: First we write down the likelihood function, i.e. the probability of the output datapoints given the input datapoints and parameters. Under the multivariate normal, the probability of a single datapoint is

p(t|x; W, Σ) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp(−(1/2)(t − W^T φ(x))^T Σ^{-1} (t − W^T φ(x)))

Then the likelihood is

L(t_1, …, t_N | x; W, Σ) = ∏_{n=1}^N (1 / ((2π)^{p/2} |Σ|^{1/2})) exp(−(1/2)(t_n − W^T φ(x_n))^T Σ^{-1} (t_n − W^T φ(x_n)))

and the log likelihood is

ln L(t_1, …, t_N | x; W, Σ) = −(Np/2) ln(2π) − (N/2) ln|Σ| − (1/2) ∑_{n=1}^N (t_n − W^T φ(x_n))^T Σ^{-1} (t_n − W^T φ(x_n))

Taking the derivative with respect to W, the first two terms vanish, and we set the rest to zero:

(δ/δW) ln L = (δ/δW) (−(1/2) ∑_{n=1}^N (t_n − W^T φ(x_n))^T Σ^{-1} (t_n − W^T φ(x_n))) = 0

Applying the chain rule with u = t_n − W^T φ(x_n), and using the matrix derivative rules:

−(1/2) ∑_{n=1}^N (δ/δW)(t_n − W^T φ(x_n)) · (δ/δu)(u^T Σ^{-1} u) = 0

−(1/2) ∑_{n=1}^N (−φ(x_n))(2 u^T Σ^{-1}) = 0

−(1/2) ∑_{n=1}^N (−φ(x_n))(2 (t_n − W^T φ(x_n))^T Σ^{-1}) = 0

Multiplying both sides on the right by Σ:

∑_{n=1}^N φ(x_n)(t_n − W^T φ(x_n))^T = 0

Let Φ be the N×m design matrix, where m is the dimension of φ(x_n), and let T be the N×p matrix of output vectors t_n. Then we can rewrite the above as

Φ^T Φ W = Φ^T T

so the ML solution is given by the matrix below, with each column equal to w_ML, as required:

W_ML = (Φ^T Φ)^{-1} Φ^T T

Q5.3 ML for Poisson Regression
This problem asks you to derive the maximum likelihood solution for a Poisson hypothesis.
a) [3 points] Given a training set {x_i, y_i}, Poisson regression models the probability of observing an output given an input as p(y_i|λ) = (1/y_i!) λ^{y_i} e^{−λ}, where λ = e^{θ^T x_i} for some parameter vector θ. Derive the cost function J(θ) corresponding to maximizing the log likelihood for a single training example.
Answer: log p(y_i|x_i, θ) = log(e^{y_i θ^T x_i}) + log(e^{−e^{θ^T x_i}}) − log(y_i!) = y_i θ^T x_i − e^{θ^T x_i} − log(y_i!)
b) [3 points] The cost function in (a) cannot be optimized in closed form, so we must use an iterative method. Show the stochastic gradient descent update for this cost function.
Answer: ∂/∂θ log p(y_i|x_i, θ) = y_i x_i − x_i e^{θ^T x_i} = (y_i − e^{θ^T x_i}) x_i, so the update is

θ := θ + α (y_i − e^{θ^T x_i}) x_i

Note that we are maximizing the function, so we use gradient ascent, not descent.
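The gradient-ascent update can be sketched as follows (NumPy, synthetic Poisson data; the step size and number of epochs are illustrative choices):

```python
import numpy as np

# Stochastic gradient ascent for Poisson regression using the update
# theta := theta + alpha * (y_i - exp(theta^T x_i)) * x_i.
rng = np.random.default_rng(7)
m, n = 500, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])
theta_true = np.array([0.5, 0.8, -0.4])
y = rng.poisson(np.exp(X @ theta_true))         # counts drawn from the model

def log_lik(theta):
    lam = X @ theta
    return np.sum(y * lam - np.exp(lam))        # dropping the constant log(y!)

theta = np.zeros(n)
before = log_lik(theta)
for epoch in range(20):
    for i in rng.permutation(m):                # one example at a time
        theta += 0.01 * (y[i] - np.exp(theta @ X[i])) * X[i]
after = log_lik(theta)

assert after > before                           # ascent increased the likelihood
```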

6. Unsupervised Learning
Q6.1 Principal Component Analysis
a) PCA assumes a specific relationship between the unobserved latent coordinates z and the observed data points x. Express this relationship as an equation. Clearly identify and name the parameters which are learned.
Answer: z_k = u^(k)T x̄, where x̄ is a normalized data point and u^(k), k = 1, …, K, are the principal component vectors, which are the parameters that need to be learned.
b) Name one objective function which could be minimized to learn the parameters of PCA.
Answer: Reconstruction error. Another objective is maximizing the total variance of projected points.
c) For a dataset of arbitrary points x^(1), …, x^(m), specify the steps of the PCA algorithm.
Answer: First, normalize the points to have zero mean and unit standard deviation in each coordinate. Then, compute the covariance matrix of the data, and decompose it with Singular Value Decomposition to obtain its eigenvectors/values. Finally, project each point onto the top k eigenvectors to obtain the lower-dimensional points. The eigenvalues can be used to determine how many components to keep in order to preserve a certain percentage of the total variance.
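The steps above can be sketched as follows (NumPy; the synthetic data and choice of k are illustrative):

```python
import numpy as np

# Minimal PCA: normalize, eigendecompose the covariance, project onto the
# top k principal components.
rng = np.random.default_rng(8)
X = rng.normal(size=(100, 3)) @ np.array([[3, 0, 0], [1, 1, 0], [0, 0, 0.1]])

Xbar = (X - X.mean(axis=0)) / X.std(axis=0)     # zero mean, unit sd per coordinate
C = Xbar.T @ Xbar / Xbar.shape[0]               # covariance matrix
vals, vecs = np.linalg.eigh(C)                  # eigenvalues in ascending order
order = np.argsort(vals)[::-1]                  # sort descending
vals, vecs = vals[order], vecs[:, order]

k = 2
Z = Xbar @ vecs[:, :k]                          # projected k-dimensional points

# The variance captured by the projection equals the top-k eigenvalues.
assert np.allclose(Z.T @ Z / Z.shape[0], np.diag(vals[:k]), atol=1e-8)
print(vals / vals.sum())                        # fraction of variance per component
```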
d) Suppose you are given 2D feature vectors for a classification task which are distributed according to the figure below. You apply PCA to the entire dataset. On the figure, draw all the PCA components.
e) In (d) above, could you use PCA components directly to classify the data (without training a classifier)? Explain.
Answer: yes, we can project the points onto the second eigenvector, u_2, and then threshold the one-dimensional points at 0 to assign the class label. Specifically, the classifier will assign labels sign(u_2^T x), where +1 corresponds to class 1.

Q6.2 Gaussian Mixture Models
a) Describe in words the two main steps of the Expectation Maximization algorithm used to solve Gaussian Mixture Models.
Answer: the E step estimates the values of the latent variables, the M step maximizes the likelihood of the data given the latent variables computed in the E step.
b) True or False: In the case of fully observed data, i.e. when all latent variables are observed, EM reduces to Maximum Likelihood.
Answer: True
c) True or False: Since the EM algorithm guarantees that the value of its objective function will increase after each iteration, it is guaranteed to eventually reach the global maximum.
Answer: False. The algorithm is guaranteed to converge (as opposed to iterating forever) because the objective function is increased at each step, but it may converge to a local maximum rather than global one.
d) Sketch a dataset on which K-Means would work poorly but a Gaussian Mixture Model with the same number of clusters would do well. Describe why K-Means wouldn’t work well.
Answer: One advantage of GMM over k-means is that it accounts for the data variance (or covariance, in multiple dimensions). One way to think about the k-means model is that it places a circle (or, in higher dimensions, a hyper-sphere) at the center of each cluster, with a radius defined by the most distant point in the cluster, as shown on the left. This works fine when the clusters are circular. However, when the data takes on a different shape, you end up with something like the middle plot. A GMM, on the other hand, can handle even non-circular clusters (right) by using the covariance to calculate the distance.

7. Neural Networks
Q7.1 Neural Network for XOR
Design a neural network to solve the XOR problem, i.e. the network should output 1 if exactly one of the two binary input variables is 1, and 0 otherwise (see left figure). Use the hyperbolic tangent, or tanh, activation function in all nodes (right figure), which ranges in [−1, +1].
Note that (A XOR B) can be expressed as (A OR B) AND (NOT (A AND B)), as illustrated below:
In the diagrams below, we filled in most of the tanh units’ parameters. Fill in the remaining parameters, keeping in mind that tanh outputs +1/-1, not 0/1. Note that we need to appropriately change the second layer (the AND node) to take +1/-1 as inputs. Also, we must add an extra last layer to convert the final output from +1/-1 to 0/1. Hint: assume tanh outputs −1 for any input 𝑥𝑥 ≤ −2, +1 for any input 𝑥𝑥 ≥ +2, 0 for 𝑥𝑥 = 0.
[Figure: network diagram of tanh units with weights 6, −4, 2, −2, 4 filled in and the remaining parameters marked "?"]

Answer: -2, -4, +2
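A forward pass confirming the construction (NumPy; the weights below are one consistent assignment in the spirit of the figure, not necessarily the exact values in the diagram, and the final ±1 to 0/1 conversion is done with a threshold for brevity rather than an extra tanh layer):

```python
import numpy as np

# tanh XOR network: an OR node and a NAND node feeding an AND node that
# operates on +/-1 hidden values, per the decomposition
# (A XOR B) = (A OR B) AND (NOT (A AND B)).
def xor_net(a, b):
    h_or = np.tanh(4 * a + 4 * b - 2)           # ~ -1 unless at least one input is 1
    h_nand = np.tanh(-4 * a - 4 * b + 6)        # ~ -1 only when both inputs are 1
    h_and = np.tanh(2 * h_or + 2 * h_nand - 2)  # AND of the +/-1 hidden values
    return int(h_and > 0)                       # convert sign of +/-1 to 0/1

for a in (0, 1):
    for b in (0, 1):
        assert xor_net(a, b) == (a ^ b)
```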

Q7.2 Computation Graph and Backpropagation
In class, we learned how to take a complex function that consists of multiple nested functions and represent it with a computation graph, which allows us to write down the forward and backward pass used to compute the function gradient.
a) Practice converting different functions f_θ(x) = f_k(f_{k−1}(… f_1(x))) of an input vector x, parametrized by θ, into their computation graphs.
Answer: see lecture notes for examples.
b) For the computation graphs obtained in (a), write down the forward pass and the backward pass equations.
Answer: see lecture notes for examples.
Q7.3 Neural Network Architectures
a) Draw a convolutional network with input x ∈ R^4, one hidden layer with 2×1 filters and 2 channels with stride 2, and a fully-connected output layer with one neuron. How many parameters does this network have?
Answer: 9 parameters; see the w's in the plot. The c's are the hidden convolutional units, with each channel sharing parameters. Note that the last layer should also have a "dummy" unit set to 1 and thus an extra parameter.
b) What algorithm is used for learning the parameters of a recurrent network? Name the algorithm and sketch out its main steps.
Answer: Backpropagation through time. The steps for computing the gradient for one training sequence are as follows. First, we unroll the network by replicating it once for each time step in the training sequence. Then we use regular backprop on the unrolled network to find the gradient. See lecture notes for more detail.

Appendix: Useful Formulas
Matrix Derivatives
For vectors x, y and matrix A:
If y = Ax, then ∂y/∂x = A.
If z = x^T A x, then ∂z/∂x = x^T (A + A^T). For the special case of a symmetric matrix A, ∂z/∂x = 2x^T A.
Chain rule: if z is a function of y, which is in turn a function of x, then ∂z/∂x = (∂y/∂x)(∂z/∂y) (note the order).
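The quadratic-form rule can be verified with finite differences (NumPy, random A and x):

```python
import numpy as np

# Finite-difference check of: for z = x^T A x, dz/dx = x^T (A + A^T).
rng = np.random.default_rng(9)
n = 4
A = rng.normal(size=(n, n))
x = rng.normal(size=n)

analytic = x @ (A + A.T)                 # row vector of partial derivatives

# Central differences along each coordinate direction e.
eps = 1e-6
numeric = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - (x - eps * e) @ A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

assert np.allclose(analytic, numeric, atol=1e-5)
```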
Single-Dimension Normal Distribution

N(x; μ, σ²) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))

Multivariate Normal Distribution
The p-dimensional multivariate normal distribution with mean μ and covariance matrix Σ is given by

N(x; μ, Σ) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp(−(1/2)(x − μ)^T Σ^{-1} (x − μ))