CS.542 Machine Learning, Fall 2019, Prof. Saenko
Additional Exam Practice Problems
Note: Some of these sample problems have been used in past exams and are provided for practice in addition to the midterm practice and homework problems, which you should also review. A typical exam would have around 5 questions. The exam is closed book, no electronics (no phones, calculators, laptops, tablets, smart watches, etc.). A single 8"x11" sheet of paper with typed or handwritten notes on both sides is allowed.
Table of Contents
I. Math and Probability Basics
   Q1: Review of Definitions
   Q2: Short questions
II. General Machine Learning Concepts
   Q1. True or False questions
   Q2. Short answer questions
   Q3. Error metrics
III. Bayesian Methods
   Q1. Bias-variance in Bayesian models
   Q2. Bayesian Maximum A Posteriori (MAP)
IV. Unsupervised Learning
   Q1. Short questions
   Q2. Generative adversarial network (GAN) mini-max loss
V. Loss Functions, SVM, Kernels
   Q1. Hinge Loss
   Q2. Loss functions for car sales
   Q3. Kernels and PCA
   Q4. SVMs and Kernels
VI. Reinforcement Learning
   Q1. Markov Decision Process
Appendix: Useful Formulas
I. Math and Probability Basics
Q1: Review of Definitions
(a) For a scalar random variable x, give the definition of its mean and variance.
Answer: see Bishop Ch. 1
(b) For a vector random variable x ∈ R^n, give the definition of its mean and covariance.
Answer: see Bishop Ch. 1
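For quick reference, a sketch of the standard definitions in LaTeX (continuous case; replace integrals with sums for discrete variables):

```latex
\[
  \mathbb{E}[x] = \int x \, p(x)\, dx, \qquad
  \operatorname{var}(x) = \mathbb{E}\!\left[(x - \mathbb{E}[x])^2\right]
\]
\[
  \mathbb{E}[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x}, \qquad
  \operatorname{cov}(\mathbf{x}) = \mathbb{E}\!\left[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^{\top}\right]
\]
```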
Q2: Short questions
(a) Given a vector v ∈ R² and a line (through the origin) orthogonal to the vector u ∈ R², draw the orthogonal distance from v to the line, and write down the equation for the distance.
Answer: v^T u/‖u‖, or the dot product of u and v divided by the length of u.
(b) A hyperplane is a subspace whose dimension is one less than that of its ambient space. Suppose the line in (a) was instead an (n−1)-dimensional hyperplane in the ambient space R^n. What is the orthogonal distance from a point v ∈ R^n to the hyperplane?
Answer: v^T u/‖u‖, same as above but in higher dimensions.
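For illustration, a minimal NumPy sketch of this formula (the example vectors are made up):

```python
import numpy as np

def distance_to_hyperplane(v, u):
    # Signed orthogonal distance from point v to the hyperplane through the
    # origin with normal vector u: d = (v . u) / ||u||
    return v @ u / np.linalg.norm(u)

v = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])               # hyperplane is the y-axis (normal = x-axis)
print(distance_to_hyperplane(v, u))    # 3.0, the x-coordinate of v, as expected
```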
(c) Find the maximum of the function f(x1, x2) = 1 − x1² − x2² subject to the constraint g(x1, x2) = x1 + x2 − 1 = 0, using Lagrange multipliers. The 3-D plot shows f and the 2-D plot shows its contours (blue circles) and the line corresponding to g = 0 (red line).
Answer: solve the following system of equations:
∇f(x1, x2) = λ ∇g(x1, x2)
g(x1, x2) = 0
In this case the first equation yields −2x1 = λ and −2x2 = λ. Solving for x1, x2 and substituting into the second equation, we get λ = −1, so x = (0.5, 0.5) is the solution. For more info: http://www.math.harvard.edu/archive/21a_spring_09/PDF/11-08-Lagrange-Multipliers.pdf
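A quick numerical sanity check (illustrative, not part of the solution): substitute the constraint x2 = 1 − x1 into f and search for the maximizer along the line.

```python
import numpy as np

# On the constraint x1 + x2 = 1, substitute x2 = 1 - x1 into
# f(x1, x2) = 1 - x1^2 - x2^2 and locate the maximum on a fine grid.
x1 = np.linspace(-2.0, 2.0, 400001)
f = 1.0 - x1**2 - (1.0 - x1)**2
i = np.argmax(f)
print(x1[i], 1.0 - x1[i], f[i])   # ~0.5, ~0.5, 0.5 -- matches the Lagrange solution
```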
(d) Show that for a matrix A and vector x, ∂(A⁻¹)/∂x = −A⁻¹ (∂A/∂x) A⁻¹. Use the fact that for any two matrices A and B, ∂(AB)/∂x = (∂A/∂x) B + A (∂B/∂x).
Answer: since ∂(A A⁻¹)/∂x = (∂A/∂x) A⁻¹ + A (∂A⁻¹/∂x) = ∂I/∂x = 0, then A (∂A⁻¹/∂x) = −(∂A/∂x) A⁻¹, and after left-multiplying by A⁻¹ we get the above result.
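A numerical check of this identity (illustrative sketch, assuming A(t) = A0 + t·B for an arbitrary direction B):

```python
import numpy as np

rng = np.random.default_rng(0)
A0 = rng.standard_normal((3, 3)) + 3 * np.eye(3)   # keep A well-conditioned
B = rng.standard_normal((3, 3))                    # dA/dt for A(t) = A0 + t*B
inv = np.linalg.inv

h = 1e-6
# Central finite difference of d/dt [A(t)^{-1}] at t = 0
numeric = (inv(A0 + h * B) - inv(A0 - h * B)) / (2 * h)
analytic = -inv(A0) @ B @ inv(A0)                  # -A^{-1} (dA/dt) A^{-1}
print(np.max(np.abs(numeric - analytic)))          # small, ~1e-8 or less
```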
II. General Machine Learning Concepts
Q1. True or False questions
Circle the correct answer (T or F) and give a one-sentence explanation of each answer; answers without explanation will not be given full points.
a) Supervised classification models can be used to predict if an email attachment contains a computer virus [T/F]
Answer: true, eg binary classification with an SVM
b) Suppose we use polynomial features for linear regression, then the hypothesis is linear in the original features [T/F]
Answer: false, it is linear in the new polynomial features
c) An advantage of normal equations to solve linear regression is that choosing a step size (learning rate) is not necessary [T/F]
Answer: true, it’s a closed form solution so no need for gradient descent
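As an illustration (not part of the answer), the closed-form normal-equation solution in NumPy, with no learning rate anywhere; the data are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, n))])  # add bias column
true_theta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_theta + 0.1 * rng.standard_normal(m)

# Normal equations: theta = (X^T X)^{-1} X^T y (solve is preferred over explicit inverse)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # close to [1, 2, -3, 0.5]
```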
d) Maximum likelihood can be used to derive a closed-form solution to logistic regression [T/F]
Answer: false, it can be used to derive cost, but no closed form solution exists
e) The gradient descent update for logistic regression is identical to linear regression [T/F]
Answer: false, they look similar but the hypothesis functions are different
f) Changing the prior in Linear Discriminant Analysis changes the direction of the decision hyperplane [T/F]
Answer: false, changes only the position, not the direction of the decision hyperplane
Q2. Short answer questions
Answer the following questions in brief one to two sentence answers.
a) For a training dataset D = {x_i, y_i} where x_i are the inputs and y_i are the outputs, explain the difference between discriminative and generative classification models.
Answer: A generative model learns p(x, y), which can generate examples {x, y} and evaluate p(y), p(y|x) and p(x|y), while a discriminative model only learns p(y|x) or learns the decision boundary directly without modeling the output probability (e.g. in the SVM).
b) Give one example of a discriminative classification model.
Answer: logistic regression (others are neural network, SVM)
c) Give one example of a generative model.
Answer: for classification, linear discriminant analysis (LDA) (others we did not cover are Naïve Bayes, Graphical Models such as HMM); for unsupervised learning, GAN, GMM
d) Suppose you want to use training data D to adjust the parameters w of a model where L(D) = p(D; w) is the likelihood of the data. You want to prevent overfitting using a squared norm regularizer. What should your objective function look like? Should you minimize or maximize it?
Answer: minimize the following objective (note, this maximizes L(D)):
J = −L(D) + λ‖w‖², or J = −λL(D) + ‖w‖²
where λ is a hyperparameter; or, equivalently, maximize −J.
e) What is cross-validation?
Answer: When we split training data into training and extra validation set; learn model parameters on the training set, test and tune hyper-parameters on the validation set. We can do this multiple times, taking a different portion of original training data for the validation set each time (known as N-fold cross-validation).
f) How can we use cross-validation to prevent overfitting? Explain the procedure using the setup of (d).
Answer: Train several models using different values of 𝜆𝜆 on the training set, test them on the validation set, and pick best model. This will reduce overfitting compared to tuning 𝜆𝜆 on training data.
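A minimal sketch of the procedure in (d) and (f) with a single train/validation split and a squared-norm (ridge-style) penalty; the data and λ grid are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.5 * rng.standard_normal(200)

# Hold out part of the training data as a validation set
X_tr, y_tr = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

def fit_ridge(X, y, lam):
    # Minimizes ||Xw - y||^2 + lam * ||w||^2 (squared-norm regularizer)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Train with several lambdas, evaluate each on the validation set, keep the best
best = min(
    (np.mean((X_val @ fit_ridge(X_tr, y_tr, lam) - y_val) ** 2), lam)
    for lam in [0.01, 0.1, 1.0, 10.0, 100.0]
)
print("best validation MSE %.3f at lambda=%g" % best)
```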
Q3. Error metrics
a) Give one example each of error metrics that can be used to evaluate: classification, regression, clustering, and reinforcement learning.
Answer: accuracy, squared error (mean squared error), average distance to cluster centers, total reward.
b) Which are the correct definitions of precision and recall? Here 'actual positives' are examples labeled positive (by humans), and 'predicted positives' are examples for which the algorithm predicts a positive label.
1. precision = true positives / predicted positives
2. precision = true positives / actual positives
3. recall = predicted positives / actual positives
4. recall = true positives / actual positives
Answer: 1 and 4
c) Suppose your data has binary labels, where the negative class (y=0) occurs 99% of the time and the positive class (y=1) occurs 1% of the time. Which of the following results obtained by your machine learning algorithm by itself confirms that it is performing well on this data (circle one)? Hint: think about what happens if your algorithm always outputs ‘0’, or always outputs ‘1’.
1. High Recall 2. High Accuracy 3. High F1 score
Answer: 3, because it combines recall and precision. A model can have high recall but low precision, e.g. always predicting ‘1’ will have 100% recall but 0% precision (F1=0). A model can have high accuracy but low F1 score, e.g. always predicting ‘0’ will be 99% accurate but F1=0.
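A small sketch of the hint (illustrative, with synthetic labels): always predicting the majority class gives high accuracy but zero F1.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(10000) < 0.01).astype(int)   # ~1% positives
y_pred = np.zeros_like(y_true)                    # always predict the negative class

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
accuracy = np.mean(y_pred == y_true)
precision = tp / (tp + fp) if tp + fp > 0 else 0.0
recall = tp / (tp + fn) if tp + fn > 0 else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
print(accuracy, precision, recall, f1)            # ~0.99 accuracy, F1 = 0
```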
III. Bayesian Methods
Q1. Bias-variance in Bayesian models
Alice has a dataset of m points with n-dimensional inputs and scalar outputs. She has trained several regularized linear regression models using regularization parameters λ = e⁰, e⁻¹, e⁻², e⁻³.
a) Which parameter will lead to highest bias? To highest variance?
Answer: since λ weights the regularizer, the smallest λ will lead to highest variance and the largest λ will lead to highest bias.
b) Alice then decides to use a Bayesian approach to control the complexity of the model. What is the Bayesian equivalent to changing λ?
Answer: changing the prior on the parameters of the model
c) Which Bayesian model should she use? Explain what makes the model Bayesian.
Answer: Bayesian linear regression, it puts a prior distribution over the linear parameters
d) Alice fit the parameters of her model and wants to use the predictive distribution on new inputs. Explain what the predictive distribution is.
Answer: it models the conditional probability of the output given an input and training data.
e) What is the difference between the predictive distribution of the ML-based and the Bayesian linear regression models?
Answer: they are both Normal distributions, but their means and covariances are computed differently; ML finds a point estimate of the parameters while the Bayesian solution integrates over parameter values using the posterior distribution. See Bishop for more details.
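For concreteness, a sketch of the Bayesian linear regression predictive distribution in Bishop's notation (prior precision α, noise precision β); the design matrix and targets below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                       # prior precision, noise precision
Phi = np.hstack([np.ones((30, 1)), rng.uniform(-1, 1, (30, 1))])   # features [1, x]
t = 0.5 * Phi[:, 1] - 0.3 + rng.normal(0, 1 / np.sqrt(beta), 30)   # noisy targets

# Posterior over the weights: N(m_N, S_N)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive distribution at a new input x*: mean m_N^T phi(x*),
# variance 1/beta + phi(x*)^T S_N phi(x*). (ML would give only a point estimate.)
phi_star = np.array([1.0, 0.7])
pred_mean = m_N @ phi_star
pred_var = 1 / beta + phi_star @ S_N @ phi_star
print(pred_mean, pred_var)
```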
Q2. Bayesian Maximum A Posteriori (MAP)
Suppose we are estimating the probability of seeing 'heads' (x = 1) or 'tails' (x = 0) after tossing a coin, with μ being the probability of seeing 'heads'. The probability distribution of a single binary variable x ∈ {0,1} that takes value 1 with probability μ is given by the Bernoulli distribution
Bern(x|μ) = μ^x (1 − μ)^(1−x)
Suppose we have a dataset of independent coin flips D = {x^(1), x^(2), …, x^(m)} and we would like to estimate μ using the Bayesian MAP solution. Recall that we can write down the data likelihood as
p(x^(i)|μ) = μ^(x^(i)) (1 − μ)^(1−x^(i))
Consider the following Bayesian prior on μ, which believes that the coin is either fair, or slightly biased towards 'tails':
p(μ) = { 0.5 if μ = 0.5;  0.5 if μ = 0.4;  0 otherwise }
Write down the MAP estimate for μ under this prior as a function of the likelihood and the prior. (Hint: use the argmax function.)
Answer: a MAP estimate is obtained by maximizing the posterior; using Bayes' rule and dropping the p(D) denominator,
μ_MAP = argmax_μ p(μ|D) = argmax_μ p(D|μ) p(μ),
where
p(D|μ) = ∏_{i=1}^m p(x^(i)|μ) = ∏_{i=1}^m μ^(x^(i)) (1 − μ)^(1−x^(i))
and
p(D|μ) p(μ) = { p(D|μ) · 0.5 if μ = 0.5;  p(D|μ) · 0.5 if μ = 0.4;  0 otherwise }
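A tiny numeric sketch of this MAP rule (illustrative; the dataset of 10 flips with 3 heads is made up):

```python
import numpy as np

x = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # hypothetical flips: 3 heads, 7 tails

def likelihood(mu, x):
    # p(D | mu) = prod_i mu^{x_i} (1 - mu)^{1 - x_i}
    return np.prod(mu ** x * (1 - mu) ** (1 - x))

# The prior puts mass 0.5 on mu = 0.5 and 0.5 on mu = 0.4 and zero elsewhere,
# so the MAP estimate is simply the better of these two candidates.
posteriors = {mu: likelihood(mu, x) * 0.5 for mu in (0.5, 0.4)}
mu_map = max(posteriors, key=posteriors.get)
print(posteriors, mu_map)    # mu = 0.4 wins here since the data favor 'tails'
```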
IV. Unsupervised Learning
Q1. Short questions
a) For which problems below would anomaly detection be a suitable algorithm? Explain.
1. Given an image of a face, determine whether or not it is the face of a particular famous individual.
2. From a large set of hospital patient records, predict which patients have a particular disease (eg. flu)
3. Given a dataset of credit card transactions, identify transactions to flag them as possibly fraudulent.
4. From a large set of primary care patient records, identify individuals who might have unusual health conditions.
Answer: 3 and 4, because in both cases the non-anomaly class is very large and the anomaly class cannot be easily defined but rather is “everything else”.
b) Suppose you have trained an anomaly detection system for intruder detection in a security camera, and your system flags anomalies when p(x) is less than ε. You find on the cross-validation set that it is missing many intruder events. What should you do?
Answer: Try increasing ε to flag more examples as anomalies; can also collect more training data to get a better model of p(x).
c) Which of the following approaches use density estimation? Explain.
1. Linear discriminant analysis
2. Logistic Regression
3. Anomaly detection
4. Generative Adversarial Networks
Answer: 1,3 and 4. LDA and anomaly detection use explicit density estimation (Gaussian density) but GANs use implicit density estimation i.e. they can sample from the density but cannot estimate its value.
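A minimal sketch of density-estimation-based anomaly detection as in (b) and (c) (illustrative; 1-D data and the threshold ε are made up): fit a Gaussian to normal events and flag points with p(x) < ε.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=1000)     # "normal" events only
mu, sigma = train.mean(), train.std()
eps = 0.01                                            # density threshold

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def is_anomaly(x):
    # Flag x as anomalous when its estimated density falls below epsilon
    return gaussian_pdf(x, mu, sigma) < eps

print(is_anomaly(0.2), is_anomaly(4.5))               # False, True
```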
Q2. Generative adversarial network (GAN) mini-max loss
A GAN consists of a generator and a discriminator: the generator network G(z) has parameters θ_G, takes a random noise vector z as input, and outputs a sample of data x (e.g., an image); the discriminator network D(x) has parameters θ_D, takes in x, and outputs a binary label y ∈ {0,1}, where 1 indicates that the input is real and 0 that it is fake. The discriminator's cost function is the cross-entropy loss for the binary classification task over the real and the generated inputs:
J_D(θ_D) = E_x[−log D(x)] + E_z[−log(1 − D(G(z)))]
Suppose the generator's loss is the opposite (negative) of the discriminator's cross-entropy loss.
a) The generator's loss is shown in the plot above. Write down the equation for it.
Answer: J_G(θ_G) = E_z[log(1 − D(G(z)))]. Note that the term that doesn't depend on G is constant, so we omit it here.
b) Explain why the generator's loss J_G suffers from slow learning.
Answer: if the discriminator is good at predicting that the generator’s samples are fake (the
input to the loss is close to zero), gradients are flat, which slows down learning.
c) Why doesn’t this happen in the regular cross entropy loss?
Answer: in regular cross-entropy, the term involving samples with label 0 corresponds to the negative of the generator’s loss, but in that case, if the classifier is incorrect its outputs are close to 1 and the loss is not flat there.
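To make the flat-gradient argument in (b) and (c) concrete, a small sketch (illustrative; D is reduced to a sigmoid of a scalar logit a) comparing gradients of the saturating loss log(1 − D) with the non-saturating alternative −log D:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Discriminator output on a generated sample: D = sigmoid(a).
# Saturating generator loss:   J = log(1 - D)  ->  dJ/da = -sigmoid(a)
# Non-saturating alternative:  J = -log(D)     ->  dJ/da = sigmoid(a) - 1
for a in [-6.0, 0.0, 6.0]:
    d = sigmoid(a)
    grad_saturating = -d
    grad_nonsaturating = d - 1
    print(f"D={d:.3f}  saturating grad={grad_saturating:+.3f}  "
          f"non-saturating grad={grad_nonsaturating:+.3f}")
# When D ~ 0 (the discriminator easily spots fakes) the saturating gradient
# vanishes, which is the slow-learning problem described above.
```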
V. Loss Functions, SVM, Kernels
Q1. Hinge Loss
Given the SVM decision function f(x) = w^T x + b and labels y ∈ {−1, +1}, the unconstrained primal SVM problem is to minimize the following w.r.t. w over the training points i:
‖w‖² + C Σ_i max(0, 1 − y_i f(x_i))
The loss function max(0, 1 − y_i f(x_i)) is called the hinge loss. Which of these four loss functions implements the SVM hinge loss? Explain in words what the hinge loss does.
Answer: (c). The input to the loss is the distance to the decision boundary times the label: y f(x) = y(w^T x + b). The loss assigns no penalty to examples that are on the correct side with distance more than +1 from the decision boundary; otherwise the penalty increases linearly with decreasing distance from the decision boundary. The y factor controls the direction, i.e. the loss is reflected around 0 for negative y. (d) is correct if we instead use −y((w^T x + b) − 1) as the input. By default, (c) is correct.
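For reference, a short NumPy version of the hinge loss (illustrative; the labels and scores are made up):

```python
import numpy as np

def hinge_loss(y, f_x):
    # max(0, 1 - y * f(x)): zero for points on the correct side with margin >= 1,
    # growing linearly as y * f(x) drops below 1.
    return np.maximum(0.0, 1.0 - y * f_x)

y = np.array([+1, +1, -1, -1])
f_x = np.array([2.0, 0.3, -2.0, 0.5])
print(hinge_loss(y, f_x))   # [0.  0.7 0.  1.5]
```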
Q2. Loss functions for car sales
Alice works for a used car dealership. Her boss wants her to estimate the price y to charge for a car (in dollars) based on features such as: x1 = car manufacturer, x2 = model, x3 = distance driven in miles, x4 = age in years, etc. She collects data points from previous car sales, (x^(i), y^(i)), i = 1, …, m, where y^(i) is the price the car was sold for, and decides to use a linear regression model, y = Σ_{j=0}^n θ_j x_j.
a) Alice decides to minimize a sum of squares loss function. Does it make sense in this case? What effect will it have?
Answer: Yes; it will encourage the predicted sale price to be close to the prices from previous sales.
b) Her boss tells Alice she strongly prefers that the dealership not lose money on a sale. Given that the ith car cost the dealer z^(i), suggest a way to pre-process the training data to encourage this.
Answer: Set y^(i) to be the maximum of y^(i) and z^(i), then apply the SSE loss function as before.
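A one-line sketch of this pre-processing step (illustrative; the prices and costs below are made up):

```python
import numpy as np

y = np.array([5000.0, 7200.0, 3000.0])   # observed sale prices y^(i)
z = np.array([4500.0, 8000.0, 2500.0])   # what each car cost the dealer z^(i)

# Clip the targets from below at the dealer's cost, then train with the usual
# sum-of-squares loss on the adjusted targets.
y_adjusted = np.maximum(y, z)
print(y_adjusted)                        # [5000. 8000. 3000.]
```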
Q3. Kernels and PCA
Consider the following dataset of 1-dimensional datapoints:
a) Which placement of Gaussian basis functions corresponds to a kernel feature representation for this dataset?
[Figure: four candidate placements of Gaussian basis functions, labeled (a)-(d)]
Answer: (a); we place a Gaussian distribution centered at each data point.
b) For k(x1, x2) to be a valid kernel, there must be a feature basis function φ(.) such that we can write k(x1, x2) = φ(x1)^T φ(x2). Prove that k = x1^T C x2 is a valid kernel, where C = XX^T is the data covariance obtained from the design matrix X. Hint: use (AB)^T = B^T A^T.
Answer: x1^T C x2 = x1^T X X^T x2 = (X^T x1)^T (X^T x2), so φ(x) = X^T x.
c) Typically we use kernels to project data into a higher dimensional space. Principal Component Analysis projects data into a lower dimensional space. Suggest a kernel that uses PCA to project data into a lower, k-dimensional space, and prove that it is a valid kernel. You can assume the data already has a mean of zero. Hint: use the result of (b).
Answer: k = x1^T U U^T x2, where U consists of the top k eigenvectors of the covariance matrix. Then use (b), replacing X with U, to prove the kernel is valid. The feature function corresponds to φ(x) = U^T x, i.e. the projection onto the eigenvectors. Note that since the data is assumed to have zero mean, we do not need to subtract the mean before computing U^T x.
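A numeric sketch of (c) (illustrative; rows of X are examples here, so the covariance is written X^T X rather than the document's column-wise X X^T): build k(x1, x2) = x1^T U U^T x2 from the top-k eigenvectors and confirm it is an inner product of features φ(x) = U^T x.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
X -= X.mean(axis=0)                     # zero-mean data, as assumed in the question

C = X.T @ X                             # (unnormalized) covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
U = eigvecs[:, -2:]                     # top k = 2 eigenvectors (eigh sorts ascending)

x1, x2 = rng.standard_normal(5), rng.standard_normal(5)
k = x1 @ U @ U.T @ x2                   # kernel value
phi = lambda x: U.T @ x                 # feature map phi(x) = U^T x
print(np.isclose(k, phi(x1) @ phi(x2))) # True: k is an inner product of features
```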
Q4. SVMs and Kernels
Consider the following dataset of 2-dimensional datapoints:
a) Which placement of Gaussian basis functions corresponds to a kernel feature representation for this dataset? Explain your answer in one sentence below.
[Figure: candidate placements (A), (B), (C); choice (D) is "none of the above"]
Answer: (D); to compute a kernel representation, we place a separate Gaussian distribution centered at **each** data point.
b) How does increasing the variance σ² of the Gaussian affect the bias and variance of the resulting Gaussian Kernel SVM classifier? Explain.
Answer: it makes the boundary smoother, so the classifier has higher bias and lower variance
c) Explain what slack variables are used for when training SVMs.
Answer: slack variables are used to handle a problem that is not linearly separable, by adding 'slack' values to the constraints.
d) For k(x1, x2) to be a valid kernel, there must be a feature basis function φ(.) such that we can write k(x1, x2) = φ(x1)^T φ(x2). Suppose k1(x1, x2) and k2(x1, x2) are valid kernels. Prove that the following is also a valid kernel:
k(x1, x2) = k1(x1, x2) + k2(x1, x2)
Answer:
k(x1, x2) = k1(x1, x2) + k2(x1, x2) = φ1(x1)^T φ1(x2) + φ2(x1)^T φ2(x2) = φ(x1)^T φ(x2)
where φ(x) = [φ1(x), φ2(x)]
e) Both SVMs with Gaussian kernels and Neural Networks with at least one hidden layer can be used to learn non-linear decision boundaries, such as the boundary between positive and negative examples in the dataset above. Describe the main similarity and the main difference in how these two approaches achieve this.
Answer: The similarity is that both can be thought of as mapping the input features 𝑥𝑥 into a new feature space: the SVM kernel maps 𝑥𝑥 to a new feature vector 𝜙𝜙(𝑥𝑥), and the hidden layer of the neural network maps 𝑥𝑥 to the activations of the last hidden layer, 𝑎𝑎. The difference is in how the mapping is done, where SVM uses training data as landmarks, but neural network learns the feature mapping through layer parameters.
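A sketch of the "training points as landmarks" view (illustrative; the data and bandwidth are made up): the Gaussian-kernel feature vector for an input x is its similarity to each training point.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.standard_normal((20, 2))    # training points act as landmarks
sigma = 1.0

def rbf_features(x, landmarks, sigma):
    # phi_i(x) = exp(-||x - l_i||^2 / (2 sigma^2)), one feature per landmark
    d2 = np.sum((landmarks - x) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

x = np.array([0.3, -0.5])
phi = rbf_features(x, X_train, sigma)
print(phi.shape)   # (20,): one similarity value per training point
# A neural network instead learns its feature map (the hidden-layer weights) from data.
```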
VI. Reinforcement Learning
Q1. Markov Decision Process
Consider an agent that learns to navigate a simple grid world, shown below. Suppose the Markov Decision Process is given as follows:
States: locations (x,y) on the map
Actions: move right →, left ←, up↑, down↓ or the
“do nothing” action ○
Transitions: P_(state,action)(state′) is given as
P_((x,y),→)(s′) = { 0.9 if s′ = (x+1, y);  0.1 if s′ = (x, y);  0 otherwise }
P_((x,y),←)(s′) = { 0.9 if s′ = (x−1, y);  0.1 if s′ = (x, y);  0 otherwise }
P_((x,y),↑)(s′) = { 0.9 if s′ = (x, y+1);  0.1 if s′ = (x, y);  0 otherwise }
P_((x,y),↓)(s′) = { 0.9 if s′ = (x, y−1);  0.1 if s′ = (x, y);  0 otherwise }
Reward: +1 in state (4,4), −0.01 in all other states
(a) Consider the policy π shown above (right). If the agent starts in state (1,1) and takes actions according to this policy, what is the probability that it will end up in state (4,4) after 6 steps? Explain your answer.
Answer: the probability of transitioning to the intended state is 0.9, so the probability that the agent reaches the final state in 6 steps is (0.9)^6.
(b) For the same policy π, suppose the agent transitions into state (4,4) for the first time on step 9. What is the total reward collected by the agent? Assume the agent collects reward in the initial state and there is no discounting of rewards.
Answer: 1 − 8·0.01 = 0.92 if s1 = (1,1), s2 = (2,1), …, s9 = (4,4) and we count the first state as step 1. If we instead do not count the first state as step 1 but still collect its reward, the total is 1 − 9·0.01 = 0.91.
(c) Is there a better policy (i.e. one with strictly higher reward) for the agent than π? If so, what is it?
Answer: no
(d) What is the main difference between supervised learning and reinforcement learning?
Answer: supervised algorithms learn from input and output pairs, whereas reinforcement learning algorithms learn from observations that depend on the model's actions and from delayed rewards.
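A Monte-Carlo sketch of part (a) (illustrative; it assumes the pictured policy moves the agent along a shortest path, e.g. three rights then three ups): the empirical success rate should be close to 0.9^6 ≈ 0.531.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_episode(actions, n_steps=6):
    # Each intended move succeeds with probability 0.9; otherwise the agent stays put.
    x, y = 1, 1
    for a in actions[:n_steps]:
        if rng.random() < 0.9:
            dx, dy = {"right": (1, 0), "up": (0, 1)}[a]
            x, y = x + dx, y + dy
    return (x, y) == (4, 4)

policy = ["right", "right", "right", "up", "up", "up"]   # assumed path to (4,4)
successes = sum(run_episode(policy) for _ in range(100_000))
print(successes / 100_000, 0.9 ** 6)   # ~0.53 vs 0.531441
```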
Appendix: Useful Formulas
Matrix Derivatives
For vectors x, y and matrix A, if y = Ax, then ∂y/∂x = A.
If z = x^T A x, then ∂z/∂x = x^T (A + A^T). For the special case of a symmetric matrix A, ∂z/∂x = 2 x^T A.
Single Dimension Normal Distribution
N(x; μ, σ²) = (1 / (σ√(2π))) exp( −(x − μ)² / (2σ²) )
Multivariate Normal Distribution
The p-dimensional multivariate normal distribution is given by
N(x; μ, Σ) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp( −½ (x − μ)^T Σ⁻¹ (x − μ) )