CS.542 Machine Learning, Fall 2020
I. Math and Probability Basics
Q1: Review of Definitions
(a) For a scalar random variable 𝑥, give the definition of its mean and variance.
(b) For a vector random variable 𝑥 ∈ ℝ^n, give the definition of its mean and covariance.
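As a numerical companion to these definitions (not part of the original question), here is a minimal NumPy sketch that estimates the mean and covariance of a vector random variable from synthetic samples; the particular distribution below is an arbitrary illustration.

```python
import numpy as np

# Illustrative only: draw m samples of a 3-dimensional random vector.
rng = np.random.default_rng(0)
m, n = 1000, 3
X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                            cov=[[1.0, 0.3, 0.0],
                                 [0.3, 2.0, 0.4],
                                 [0.0, 0.4, 0.5]],
                            size=m)            # shape (m, n)

mu_hat = X.mean(axis=0)                        # sample estimate of E[x]
# Sample covariance: average of (x - mu)(x - mu)^T over the data.
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / (m - 1)

print("sample mean:", mu_hat)
print("sample covariance:\n", Sigma_hat)
print("matches np.cov:", np.allclose(Sigma_hat, np.cov(X, rowvar=False)))
```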
Q2: Short questions
(a) Given a vector 𝑣 ∈ ℝ^2 and a line orthogonal to the vector 𝑢 ∈ ℝ^2, draw the orthogonal distance from 𝑣 to the line, and write down the equation for the distance.
(b) A hyperplane is a subspace whose dimension is one less than that of its ambient space. Suppose the line in (a) was instead an (n−1)-dimensional hyperplane in the ambient space ℝ^n. What is the orthogonal distance from a point 𝑣 ∈ ℝ^n to the hyperplane?
(c) Show that for a matrix 𝐴 and vector 𝑥, ∂(𝐴^(−1))/∂𝑥 = −𝐴^(−1) (∂𝐴/∂𝑥) 𝐴^(−1). Use the fact that for any two matrices 𝐴 and 𝐵, ∂(𝐴𝐵)/∂𝑥 = (∂𝐴/∂𝑥)𝐵 + 𝐴(∂𝐵/∂𝑥).
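One way the identity in (c) can be derived (a sketch only, using nothing beyond the product rule stated in the hint), written out in LaTeX:

```latex
% Sketch: differentiate both sides of A A^{-1} = I with respect to x.
\[
\frac{\partial}{\partial x}\left(A A^{-1}\right)
  = \frac{\partial A}{\partial x} A^{-1} + A \frac{\partial A^{-1}}{\partial x}
  = \frac{\partial I}{\partial x} = 0
\]
\[
\Rightarrow\quad
A \frac{\partial A^{-1}}{\partial x} = -\frac{\partial A}{\partial x} A^{-1}
\quad\Rightarrow\quad
\frac{\partial A^{-1}}{\partial x} = -A^{-1}\frac{\partial A}{\partial x} A^{-1}.
\]
```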
II. General Machine Learning Concepts
Q1. True or False questions
Circle the correct answer (T or F) and give a one-sentence explanation of each answer; answers without explanation will not be given full points.
a) An advantage of using the normal equations to solve linear regression is that choosing a step size (learning rate) is not necessary [T/F]
b) Maximum likelihood can be used to derive a closed-form solution to logistic regression [T/F]
c) The gradient descent update for logistic regression is identical to linear regression [T/F]

d) Changing the prior in Linear Discriminant Analysis changes the direction of the decision hyperplane [T/F]
Q2. Short answer questions
Answer the following questions in brief one to two sentence answers.
a) For a training dataset D = {𝑥_𝑖, 𝑦_𝑖} where 𝑥_𝑖 are the inputs and 𝑦_𝑖 are the outputs, explain the difference between discriminative and generative classification models.
b) Give one example of a discriminative classification model.
c) Give one example of a generative model.
d) Suppose you want to use training data D to adjust the parameters w of a model where L(D) = p(D; w) is the likelihood of the data. You want to prevent overfitting using a squared norm regularizer. What should your objective function look like? Should you minimize or maximize it?
e) What is cross-validation?
f) How can we use cross-validation to prevent overfitting? Explain the procedure using the setup of (d).
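For context on (e) and (f), here is a minimal NumPy sketch of k-fold cross-validation used to choose the strength of a squared-norm (ridge) regularizer; the synthetic data and the grid of λ values are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)   # synthetic data

def ridge_fit(X, y, lam):
    # Regularized normal equations: w = (X^T X + lam I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, k=5):
    """Average held-out squared error over k folds."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return np.mean(errs)

# Pick the regularizer with the lowest cross-validation error.
lambdas = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
best = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print("selected lambda:", best)
```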

Q3. Error metrics
a) Give one example each of error metrics that can be used to evaluate: classification, regression, clustering, and reinforcement learning.
b) Which are the correct definitions of precision and recall? Here ‘actual positives’ are examples labeled positive (by humans), and ‘predicted positives’ are examples for which the algorithm predicts a positive label.
1. precision = true positives / predicted positives
2. precision = true positives / actual positives
3. recall = predicted positives / actual positives
4. recall = true positives / actual positives
c) Suppose your data has binary labels, where the negative class (y=0) occurs 99% of the time and the positive class (y=1) occurs 1% of the time. Which of the following results, obtained by your machine learning algorithm, by itself confirms that it is performing well on this data (circle one)? Hint: think about what happens if your algorithm always outputs ‘0’, or always outputs ‘1’.
1. High Recall 2. High Accuracy 3. High F1 score
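To make the hint concrete, here is a small sketch with synthetic labels (assumed, not from the exam) showing that always predicting the majority class yields high accuracy but zero recall and F1 on imbalanced data.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)   # ~1% positives, ~99% negatives
y_pred = np.zeros_like(y_true)                     # classifier that always says "0"

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy  = np.mean(y_pred == y_true)
precision = tp / (tp + fp) if tp + fp > 0 else 0.0
recall    = tp / (tp + fn) if tp + fn > 0 else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0

print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  "
      f"recall={recall:.3f}  F1={f1:.3f}")
# Accuracy is ~0.99 even though the model never finds a single positive example.
```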
III. Bayesian Methods
Q1. Bias-variance in Bayesian models
Alice has a dataset of m points with n-dimensional inputs and scalar outputs. She has trained several regularized linear regression models using regularization parameters 𝜆 = 𝑒^0, 𝑒^(−1), 𝑒^(−2), 𝑒^(−3).
a) Which regularization parameter will lead to the highest bias? To the highest variance?
b) Alice then decides to use a Bayesian approach to control the complexity of the model. What is the Bayesian equivalent of changing 𝜆?

c) Which Bayesian model should she use? Explain what makes the model Bayesian.
d) Alice has fit the parameters of her model and wants to use the predictive distribution on new inputs. Explain what the predictive distribution is.
e) What is the difference between the predictive distribution of the maximum likelihood (ML) linear regression model and that of the Bayesian linear regression model?

Q2. Bayesian Maximum A Posteriori (MAP)
Suppose we are estimating the probability of seeing ‘heads’ (x = 1) or ‘tails’ (x = 0) after tossing a coin, with 𝜇 being the probability of seeing ‘heads’. The probability distribution of a single binary variable 𝑥 ∈ {0,1} that takes value 1 with probability 𝜇 is given by the Bernoulli distribution
Bern(𝑥 | 𝜇) = 𝜇^𝑥 (1 − 𝜇)^(1−𝑥)
Suppose we have a dataset of independent coin flips 𝐷 = {𝑥^(1), 𝑥^(2), …, 𝑥^(𝑚)} and we would like to estimate 𝜇 using the Bayesian MAP solution. Recall that we can write down the data likelihood as
𝑝(𝑥^(𝑖) | 𝜇) = 𝜇^(𝑥^(𝑖)) (1 − 𝜇)^(1−𝑥^(𝑖))
Consider the following Bayesian prior on 𝜇, which believes that the coin is either fair, or slightly biased towards ‘tails’:
p(𝜇) = 0.5 if 𝜇 = 0.5
       0.5 if 𝜇 = 0.4
       0   otherwise
Write down the MAP estimate for 𝜇 under this prior as a function of the likelihood and the prior. (Hint: use the argmax function).
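A minimal numerical sketch of the MAP computation described above, assuming a small hypothetical set of coin flips (the flips below are made up for illustration; the question itself only asks for the symbolic argmax expression):

```python
import numpy as np

flips = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])   # hypothetical dataset D

def log_likelihood(mu, x):
    # log p(D | mu) for Bernoulli data: sum_i x_i log(mu) + (1 - x_i) log(1 - mu)
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

prior = {0.5: 0.5, 0.4: 0.5}   # the discrete prior from the question

# MAP: argmax over the prior's support of  log p(D | mu) + log p(mu).
posterior_scores = {mu: log_likelihood(mu, flips) + np.log(p)
                    for mu, p in prior.items()}
mu_map = max(posterior_scores, key=posterior_scores.get)
print("MAP estimate of mu:", mu_map)
```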

IV. Unsupervised Learning
Q1. Short questions
a) For which problems below would anomaly detection be a suitable algorithm? Explain.
1. Given an image of a face, determine whether or not it is the face of a particular famous individual.
2. From a large set of hospital patient records, predict which patients have a particular disease (e.g., flu).
3. Given a dataset of credit card transactions, identify transactions to flag as possibly fraudulent.
4. From a large set of primary care patient records, identify individuals who might have unusual health conditions.
b) Suppose you have trained an anomaly detection system for intruder detection in a security camera, and your system flags anomalies when p(x) is less than ε. You find on the cross-validation set that it is missing many intruder events. What should you do?
c) Which of the following approaches use density estimation? Explain.
1. Linear discriminant analysis
2. Logistic Regression
3. Anomaly detection
4. Generative Adversarial Networks
Q2. Generative adversarial network (GAN) mini-max loss
A GAN consists of a generator and a discriminator: the generator network 𝐺(𝑧) has parameters 𝜃𝐺, takes a random noise vector 𝑧 as input, and outputs a sample of data 𝑥 (e.g., an image); the discriminator network 𝐷(𝑥) has parameters 𝜃𝐷, takes in 𝑥, and outputs a binary label 𝑦 ∈ {0,1} where 1 indicates that the input is real and 0 that it is fake. The discriminator’s cost function is the cross-entropy loss for the binary classification task over the real and the generated inputs:
𝐽_𝐷(𝜃_𝐷) = 𝔼_𝑥[−log 𝐷(𝑥)] + 𝔼_𝑧[−log(1 − 𝐷(𝐺(𝑧)))]
Suppose the generator’s loss is the opposite (negative) of the discriminator’s cross-entropy loss.
a) The generator’s loss is shown in the plot above. Write down the equation for it.

b) Explain why the generator’s loss 𝐽𝐺 suffers from slow learning.
c) Why doesn’t this happen in the regular cross entropy loss?
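For intuition on (b) and (c), here is a small NumPy sketch (not part of the exam) comparing the gradient, taken with respect to the discriminator's logit on a fake sample, of the minimax generator loss log(1 − D(G(z))) against the commonly used non-saturating alternative −log D(G(z)).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# 'a' is the discriminator's logit on a generated sample; a very negative 'a'
# means D is confident the sample is fake (D(G(z)) = sigmoid(a) is near 0).
a = np.linspace(-8, 2, 6)
d = sigmoid(a)

# Minimax generator loss  J_G = log(1 - D(G(z)))  (to be minimized):
#   dJ_G/da = -sigmoid(a)  -> vanishes exactly when D confidently rejects fakes.
grad_minimax = -d

# Non-saturating alternative  J_G = -log D(G(z)):
#   dJ_G/da = -(1 - sigmoid(a))  -> stays close to -1 in that same regime.
grad_nonsat = -(1.0 - d)

for ai, g1, g2 in zip(a, grad_minimax, grad_nonsat):
    print(f"logit a={ai:5.1f}  D(G(z))={sigmoid(ai):.4f}  "
          f"minimax grad={g1:8.4f}  non-saturating grad={g2:8.4f}")
```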
Q4. [20 points] Neural Network for Binary Addition
(a) [5 points] Construct a simple neural network that computes the NOT operation on a single binary input variable x1 (0 or 1). Draw a diagram and indicate the values of the weights. Use the sigmoid activation function (shown in the figure on the right), which ranges in [0, 1].
(b) [15 points] Design a neural network that adds two binary digits. The inputs are two single-digit binary variables, x1 and x2, and the outputs are the two digits y1, y2 of the sum of x1 and x2, i.e. x1 + x2 = s, where s has two digits, y1 and y2. The table below shows the correct outputs for each pair of inputs:
input        output
x1   x2      y1   y2
0    0       0    0
0    1       0    1
1    0       0    1
1    1       1    0

i.e. y2 is the sum and y1 is the “carry.” We have given sigmoid networks corresponding to the AND and OR functions above. Note that (A XOR B) can be expressed as ((A OR B) AND NOT(A AND B)). Draw the complete addition network with all weights, indicating clearly the nodes for x1, x2, y1, y2. Be careful to include the bias unit(s).
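A NumPy sketch of the construction suggested by the hint, composing sigmoid AND/OR/NOT gates into a one-bit adder; the specific weight values below are illustrative assumptions, not the ones from the original figure.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One possible choice of (bias, w1, w2) for sigmoid logic gates;
# large magnitudes push the sigmoid output close to 0 or 1.
def AND(x1, x2):  return sigmoid(-30 + 20 * x1 + 20 * x2)
def OR(x1, x2):   return sigmoid(-10 + 20 * x1 + 20 * x2)
def NOT(x):       return sigmoid(10 - 20 * x)

def add_one_bit(x1, x2):
    # y1 (carry) = x1 AND x2
    y1 = AND(x1, x2)
    # y2 (sum)   = (x1 OR x2) AND NOT(x1 AND x2)  -- i.e. XOR
    y2 = AND(OR(x1, x2), NOT(AND(x1, x2)))
    return y1, y2

for x1 in (0, 1):
    for x2 in (0, 1):
        y1, y2 = add_one_bit(x1, x2)
        print(f"x1={x1} x2={x2} -> y1={y1:.3f} y2={y2:.3f}")
```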
Q4. Gradient Descent
For a general hypothesis with parameters 𝜃 and cost function 𝐽(𝜃), say whether each statement regarding gradient descent below is TRUE or FALSE, and give a one-sentence explanation.
a) [3 points] The cost function 𝐽(𝜃) is guaranteed to decrease with every iteration, regardless of the step size 𝛼. Circle one: TRUE / FALSE
b) [3 points] Convergence can be determined by looking at the change in the cost function across iterations. Circle one: TRUE / FALSE
c) [3 points] If gradient descent is converging very slowly, a smaller step size 𝛼 should be used. Circle one: TRUE / FALSE
d) [3 points] Convergence of gradient descent to a global minimum is always guaranteed for convex cost functions. Circle one: TRUE / FALSE
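A toy sketch (an assumed quadratic cost, not from the exam) illustrating the role the step size 𝛼 plays in several of the statements above:

```python
import numpy as np

# Toy convex cost J(theta) = theta^2, with gradient 2*theta.
def run_gd(alpha, theta0=5.0, iters=10):
    theta = theta0
    history = [theta ** 2]
    for _ in range(iters):
        theta = theta - alpha * 2 * theta      # gradient step
        history.append(theta ** 2)
    return history

for alpha in (0.01, 0.1, 1.1):                 # small, moderate, too large
    costs = run_gd(alpha)
    trend = "diverges" if costs[-1] > costs[0] else "decreases"
    print(f"alpha={alpha:4}  J after 10 steps={costs[-1]:.3e}  ({trend})")
```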
V. Loss Functions, SVM, Kernels
Q1. Hinge Loss
Given the SVM decision function 𝑓(𝑥) = 𝑤𝑇𝑥 + 𝑏 and labels 𝑦 ∈ {−1, +1}, which of the following four loss functions implements the SVM hinge loss? Explain in words what the hinge loss does.
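For reference, a minimal NumPy sketch of the standard hinge-loss formula max(0, 1 − y·f(x)); deciding whether this matches one of the plotted options is the point of the question.

```python
import numpy as np

def hinge_loss(y, f):
    """Standard SVM hinge loss: zero when y*f(x) >= 1, linear in the margin otherwise."""
    return np.maximum(0.0, 1.0 - y * f)

y = np.array([+1, +1, -1, -1])          # true labels
f = np.array([2.0, 0.3, -1.5, 0.2])     # decision values w^T x + b
print(hinge_loss(y, f))                 # [0.  0.7 0.  1.2]
```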

Q2. Loss functions for car sales
Alice works for a used car dealership. Her boss wants her to estimate the price 𝑦 to charge for a car (in dollars) based on features such as: 𝑥_1 = car manufacturer, 𝑥_2 = model, 𝑥_3 = distance driven in miles, 𝑥_4 = age in years, etc. She collects data points from previous car sales, (𝑥^(𝑖), 𝑦^(𝑖)), 𝑖 = 1, …, 𝑚, where 𝑦^(𝑖) is the price the car was sold for, and decides to use a linear regression model, 𝑦 = ∑_{𝑗=0}^{𝑛} 𝜃_𝑗 𝑥_𝑗.
a) Alice decides to minimize a sum of squares loss function. Does it make sense in this case? What effect will it have?
b) Her boss tells Alice she strongly prefers that the dealership not lose money on a sale. Given that the ith car cost the dealer 𝑧(𝑖), suggest a way to pre-process the training data to encourage this.
Q3. Kernels and PCA
Consider the following dataset of 1-dimensional datapoints:
a) Which placement of Gaussian basis functions corresponds to a kernel feature representation for this dataset?
[Figure: four candidate placements of Gaussian basis functions, labeled a–d.]
For 𝑘(𝑥_1, 𝑥_2) to be a valid kernel, there must be a feature basis function 𝜑(·) such that we can write 𝑘(𝑥_1, 𝑥_2) = 𝜑(𝑥_1)^𝑇 𝜑(𝑥_2).
b) Prove that 𝑘(𝑥_1, 𝑥_2) = 𝑥_1^𝑇 𝐶 𝑥_2 is a valid kernel, where 𝐶 = 𝑋𝑋^𝑇 is the data covariance obtained from the design matrix 𝑋. Hint: use (𝐴𝐵)^𝑇 = 𝐵^𝑇 𝐴^𝑇.
c) Typically we use kernels to project data into a higher-dimensional space. Principal Component Analysis projects data into a lower-dimensional space. Suggest a kernel that uses PCA to project data into a lower, k-dimensional space, and prove that it is a valid kernel. You can assume the data already has a mean of zero. Hint: use the result of (b).
Q4. SVMs and Kernels
Consider the following dataset of 2-dimensional datapoints:
a) Which placement of Gaussian basis functions corresponds to a kernel feature representation for this dataset? Explain your answer in one sentence below.

[Figure: four candidate placements of Gaussian basis functions, labeled (A)–(D); a fifth option is “none of the above.”]
b) How does increasing the variance 𝜎^2 of the Gaussian affect the bias and variance of the resulting Gaussian kernel SVM classifier? Explain.
c) For 𝑘(𝑥_1, 𝑥_2) to be a valid kernel, there must be a feature basis function 𝜑(·) such that we can write 𝑘(𝑥_1, 𝑥_2) = 𝜑(𝑥_1)^𝑇 𝜑(𝑥_2). Suppose 𝑘_1(𝑥_1, 𝑥_2) and 𝑘_2(𝑥_1, 𝑥_2) are valid kernels. Prove that the following is also a valid kernel:
𝑘(𝑥_1, 𝑥_2) = 𝑘_1(𝑥_1, 𝑥_2) + 𝑘_2(𝑥_1, 𝑥_2)
d) Both SVMs with Gaussian kernels and neural networks with at least one hidden layer can be used to learn non-linear decision boundaries, such as the boundary between positive and negative examples in the dataset above. Describe the main similarity and the main difference in how these two approaches achieve this.

e) Explain what slack variables are used for when training SVMs.

VI. Reinforcement Learning
Q1. Markov Decision Process
Consider an agent that learns to navigate a simple grid world, shown below. Suppose the Markov Decision Process is given as follows:
States: locations (x,y) on the map
Actions: move right →, left ←, up ↑, down ↓, or the “do nothing” action ○
Reward: +1 in state (4,4), -0.01 in all other states
Transitions: 𝑃_(𝑠𝑡𝑎𝑡𝑒, 𝑎𝑐𝑡𝑖𝑜𝑛)(𝑠𝑡𝑎𝑡𝑒′) is given as
𝑃_((𝑥,𝑦), →)(𝑠′) = 0.9 if 𝑠′ = (𝑥+1, 𝑦);  0.1 if 𝑠′ = (𝑥, 𝑦);  0 otherwise
𝑃_((𝑥,𝑦), ←)(𝑠′) = 0.9 if 𝑠′ = (𝑥−1, 𝑦);  0.1 if 𝑠′ = (𝑥, 𝑦);  0 otherwise
𝑃_((𝑥,𝑦), ↑)(𝑠′) = 0.9 if 𝑠′ = (𝑥, 𝑦+1);  0.1 if 𝑠′ = (𝑥, 𝑦);  0 otherwise
𝑃_((𝑥,𝑦), ↓)(𝑠′) = 0.9 if 𝑠′ = (𝑥, 𝑦−1);  0.1 if 𝑠′ = (𝑥, 𝑦);  0 otherwise
(a) Consider the policy 𝜋 shown above (right). If the agent starts in state (1,1) and takes actions according to this policy, what is the probability that it will end up in state (4,4) after 6 steps? Explain your answer.
(b) For the same policy 𝜋, suppose the agent transitions into state (4,4) for the first time on step 9. What is the total reward collected by the agent? Assume the agent collects reward in the initial state and there is no discounting of rewards.
(c) Is there a more optimal policy (i.e. with strictly higher reward) for the agent than 𝜋? If so, what is it?
(d) What is the main difference between supervised learning and reinforcement learning?
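For intuition, here is a Monte Carlo sketch of rollouts in this MDP. Since the figure showing 𝜋 is not reproduced here, the policy below (move right until column 4, then move up) is only a placeholder assumption; the transition model follows the definition above.

```python
import random

GOAL = (4, 4)
STEP_REWARD, GOAL_REWARD = -0.01, 1.0

def policy(state):
    """Placeholder policy: go right until x=4, then go up (an assumption, not the figure's pi)."""
    x, y = state
    if state == GOAL:
        return "stay"
    return "right" if x < 4 else "up"

def step(state, action):
    x, y = state
    moves = {"right": (x + 1, y), "left": (x - 1, y),
             "up": (x, y + 1), "down": (x, y - 1), "stay": (x, y)}
    # With prob 0.9 the move succeeds, with prob 0.1 the agent stays put.
    return moves[action] if random.random() < 0.9 else state

def rollout(start=(1, 1), horizon=6):
    state = start
    total = GOAL_REWARD if state == GOAL else STEP_REWARD   # reward in initial state
    for _ in range(horizon):
        state = step(state, policy(state))
        total += GOAL_REWARD if state == GOAL else STEP_REWARD
    return state, total

random.seed(0)
n, hits = 100_000, 0
for _ in range(n):
    final, _ = rollout()
    hits += (final == GOAL)
# Under this placeholder policy, reaching (4,4) in 6 steps needs all 6 moves
# to succeed, so the estimate should be close to 0.9**6 ≈ 0.53.
print(f"P(reach (4,4) after 6 steps) ≈ {hits / n:.4f}")
```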

Appendix: Useful Formulas
Matrix Derivatives
For vectors 𝑥, 𝑦 and matrix 𝐴, if 𝑦 = 𝐴𝑥, then ∂𝑦/∂𝑥 = 𝐴.
If 𝑧 = 𝑥^𝑇 𝐴𝑥, then ∂𝑧/∂𝑥 = 𝑥^𝑇(𝐴 + 𝐴^𝑇). For the special case of a symmetric matrix 𝐴, ∂𝑧/∂𝑥 = 2𝑥^𝑇 𝐴.
Single Dimension Normal Distribution
𝑁(𝑥; 𝜇, 𝜎^2) = (1 / (𝜎√(2𝜋))) exp(−(𝑥 − 𝜇)^2 / (2𝜎^2))
Multivariate Normal Distribution
The p-dimensional multivariate normal distribution is given by
𝑁(𝑥; 𝜇, Σ) = (1 / ((2𝜋)^(𝑝/2) |Σ|^(1/2))) exp(−(1/2)(𝑥 − 𝜇)^𝑇 Σ^(−1) (𝑥 − 𝜇))
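A quick numerical check of the multivariate normal formula above, with arbitrary illustration values for 𝜇, Σ, and 𝑥 (assumes SciPy is available for the comparison):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Hand-coded p-dimensional normal density, matching the formula above."""
    p = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Arbitrary illustration values.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, 0.0])

print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match
```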