CS 542 – Machine Learning Midterm Exam Spring 2020

Instructions:
1- Log onto the lecture Zoom link
2- Share your video
3- Set an alarm for 1:35pm. You have 1 hour and 15 minutes to solve the exam.
4- Solve the exam using paper and pen
5- Print your name and BU ID clearly on the top right of the first page (the page that will have the solution to Problem 1)
6- Start a new page for each question
7- Stop solving at 1:35pm
8- Take photos of your solutions using your phone
9- Sort the photos based on question number
10- Convert the photos into a single pdf named [BUusername]_[first]_[last]_MT.pdf
11- Submit your pdf file using this submission link
12- You will receive a confirmation message from us that we received your submission. It is your responsibility to make sure the file contains solutions to all the problems you solved.
Notes:
* There are six questions. The last page has some common formulas
* If you have any inquiries during the exam, please message us in the Zoom chat
* Total points: 85
Good luck

Q1. [20 points] Short Questions
Answer the following questions in brief one or two sentence answers.
a) [4 points] Suppose you have a training dataset $D = \{x_i, y_i\}$, where $x_i$ are the inputs and $y_i$ are the class labels, and you are designing a discriminative classifier. Which, if any, marginal, joint, or conditional probability distribution function(s) over these variables would you model, and why?

b) [4 points] Suppose you work for a startup that has just designed a new electric vehicle, and you are asked to build a machine learning system to detect engine failure. You have collected data from 1000 tests, only 5 of which led to the engine failing. What family of machine learning methods would you use, and why?

c) [4 points] What is the effect of the class prior on the decision boundary of Linear Discriminant Analysis?

d) [4 points] Alice decides to use Principal Component Analysis to implement image compression. Briefly explain how she should train the algorithm and how she should use it to compress a new image. What parameter controls the amount of compression?

e) [4 points] The following is the formulation for linearly separable SVMs. The formulation maximizes the margin and ensures data points are correctly classified.
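(The formulation itself appeared as an image in the original exam; presumably it is the standard hard-margin SVM:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1, \quad i = 1, \dots, m)$$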
Write the corresponding mathematical formulation for a relaxed version that allows some violation of the constraints for a problem that is not strictly linearly separable.

Q2. [10 points] Regularization and Bias/Variance
Suppose you want to fit a Logistic Regression model to predict whether an email is spam ($y = 1$) or not spam ($y = 0$) based on the frequency of the words "buy" (feature $x_1$) and "click" (feature $x_2$). You decide to use polynomial basis functions to represent the input features and to apply regularization. You have fit three models by minimizing the regularized Logistic Regression cost function

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right) - (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=2}^{n}\theta_j^2$$

for $\lambda = 10^{-2}, 10^{0}, 10^{2}$. The following are sketches of the resulting decision boundaries (plots A, B, and C, not reproduced here).
a) [3 points] Which value of λ goes with each of the plots?
A: B: C:
b) [2 points] You try your models on a test set. Which of the three models will have the highest error due to bias?
c) [2 points] Which model will have the highest error due to variance?
d) [3 points] You plot model complexity M (the number of polynomial basis functions) versus the cost function computed on the test data, and a similar curve for the training data. Explain how you can use these curves to detect when the model is overfitting, and draw an example to illustrate.
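For reference, a minimal Python sketch of the cost function above, assuming a sigmoid hypothesis and excluding the bias weight from the penalty, matching the sum starting at $j = 2$ (function names are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Regularized logistic regression cost J(theta).

    X: (m, n) design matrix whose first column is all ones (bias),
    y: (m,) labels in {0, 1}, lam: the regularization strength lambda.
    """
    m = X.shape[0]
    h = sigmoid(X @ theta)                               # predicted probabilities
    cross_entropy = -(y * np.log(h) + (1 - y) * np.log(1 - h)).mean()
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)   # bias term not penalized
    return cross_entropy + penalty
```

Refitting with lam swept over $10^{-2}, 10^{0}, 10^{2}$ traces out the under/overfitting trade-off that the three decision-boundary plots illustrate.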

Q3. [15 points] Backpropagation
Suppose we want to compute the gradients for the function $h(x) = q(w_0 x_0 + w_1 x_1)$, where $x = [x_0 \ \ x_1]^T$ is the input vector, $w = [w_0 \ \ w_1]^T$ is the parameter vector, and $q$ is the tanh function: $q(u) = \tanh(u) = \frac{e^u - e^{-u}}{e^u + e^{-u}}$. The tanh function is also plotted in the appendix.
a) [2 points] Complete the computational graph for this function below by adding two nodes 𝑓1(𝑢, 𝑐) = 𝑐 ∗ 𝑢 (multiplication), one node 𝑓2(𝑢, 𝑐) = 𝑢 + 𝑐 (addition), and one node 𝑓3(𝑢) = 𝑞(𝑢). Label the nodes clearly with 𝑓1, 𝑓2 or 𝑓3 and leave plenty of space between them.
b) [3 points] Write down the gradient $\frac{\partial f}{\partial u}$ for each of the functions $f_1$, $f_2$, $f_3$.
Hint: note that $\tanh'(u) = 1 - \tanh(u)^2$.
$\frac{\partial f_1}{\partial u} =$
$\frac{\partial f_2}{\partial u} =$
$\frac{\partial f_3}{\partial u} =$
c) [4 points] Perform a forward pass for $x = [5 \ \ 5]^T$ and $w = [1 \ \ {-1}]^T$, writing values on top of the arrows in your computational graph. What is the output of the forward pass, i.e. $h(x)$?
d) [6 points] Perform a backward pass for the example in (c), writing values below the arrows in the graph. What are the gradients of $h$ with respect to $x_0$, $x_1$, $w_0$, $w_1$?
$\frac{\partial h}{\partial x_0} =$
$\frac{\partial h}{\partial w_0} =$
$\frac{\partial h}{\partial x_1} =$
$\frac{\partial h}{\partial w_1} =$
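Answers to (c) and (d) can be self-checked numerically. A minimal sketch that computes the chain-rule gradients of $h$ and compares one of them against a finite difference (values taken from part (c)):

```python
import numpy as np

def h(w, x):
    return np.tanh(w @ x)

def grads(w, x):
    """Chain rule: dh/dw_i = tanh'(u) * x_i and dh/dx_i = tanh'(u) * w_i, u = w.x."""
    g = 1.0 - np.tanh(w @ x) ** 2        # tanh'(u) = 1 - tanh(u)^2
    return g * x, g * w                  # dh/dw, dh/dx

w = np.array([1.0, -1.0])
x = np.array([5.0, 5.0])
dh_dw, dh_dx = grads(w, x)

# Finite-difference check on dh/dw_0.
eps = 1e-6
step = np.array([eps, 0.0])
numeric = (h(w + step, x) - h(w - step, x)) / (2 * eps)
print(dh_dw[0], numeric)                 # the two values should agree closely
```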

Q4. [20 points] Neural Networks and Deep Learning
Answer the following questions in brief one or two sentence answers.
a) [4 points] How would you design a deep learning architecture for hand gesture recognition? Assume the input to your system is short video clips, and the desired output is a predicted class of hand gesture.
b) [3 points] Name the regularization technique specifically designed for deep neural networks and briefly describe how it achieves such regularization.
c) [3 points] Is it recommended to do feature engineering first and then apply deep learning? Contrast deep learning with other machine learning algorithms in terms of feature engineering.
d) [3 points] Suppose we have a neural network with ReLU activation functions. Say we replace the ReLU activations with linear activations. Would this new neural network be able to approximate a non-linear function? Why or why not?
e) [3 points] Name an example of a data augmentation approach and explain how it helps improve model generalization.
f) [2 points] What is the main difference between a fully-connected and a convolutional network?
g) [2 points] List two ways to downsize feature maps in convolutional neural networks.
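As a concrete illustration for g), a minimal numpy sketch of one common way to downsize a feature map, 2x2 max pooling (strided convolution is another; the input shape here is arbitrary):

```python
import numpy as np

def max_pool_2x2(fmap):
    """Downsize an (H, W) feature map by taking the max over each 2x2 block."""
    H, W = fmap.shape
    blocks = fmap[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(fmap).shape)   # (2, 2): each spatial dimension is halved
```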

Q5. [8 points] Maximum A Posteriori (MAP) Solution
To deal with bias, we can use a Bayesian model and obtain a posterior solution for the parameter. Suppose we are estimating the probability of seeing ‘heads’ (x = 1) or ‘tails’ (x = 0) after tossing a coin, with 𝜇 being the probability of seeing ‘heads’. The probability distribution of a single binary variable
𝑥 ∈ {0,1} that takes value 1 with probability 𝜇 is given by the Bernoulli distribution
$$\mathrm{Bern}(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}$$
Suppose we have a dataset of independent coin flips $D = \{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$ and we would like to estimate $\mu$ by maximizing the posterior probability. Recall that we can write down the data likelihood as
$$p(x^{(i)} \mid \mu) = \mu^{x^{(i)}}(1-\mu)^{1-x^{(i)}}$$
Consider the following prior on $\mu$, which believes that the coin is either fair or slightly biased towards 'tails':
$$p(\mu) = \begin{cases} 0.5 & \text{if } \mu = 0.5 \\ 0.5 & \text{if } \mu = 0.4 \\ 0 & \text{otherwise} \end{cases}$$
a) [3 points] Write down the formulation for the likelihood 𝑝(𝐷|𝜇).
b) [5 points] Write down the posterior estimate for 𝜇 under this prior as a function of the likelihood and the prior. (Hint: use the argmax function).
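For intuition, a minimal sketch of the MAP computation under this two-point prior (the flip data here is hypothetical):

```python
import numpy as np

def log_likelihood(mu, flips):
    """log p(D | mu) for i.i.d. Bernoulli flips x in {0, 1}."""
    flips = np.asarray(flips)
    return np.sum(flips * np.log(mu) + (1 - flips) * np.log(1 - mu))

flips = [1, 0, 0, 1, 0]          # hypothetical dataset D
prior = {0.5: 0.5, 0.4: 0.5}     # the prior p(mu) given above

# MAP estimate: argmax over mu of log p(D | mu) + log p(mu).
mu_map = max(prior, key=lambda mu: log_likelihood(mu, flips) + np.log(prior[mu]))
print(mu_map)
```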

Q6. [12 points] Loss Functions for Classification
For this question, suppose that we have a classification problem with inputs $x \in \mathbb{R}^n$ and corresponding outputs $y \in \{-1, +1\}$. We would like to learn a classifier that computes a linear function of the input, $f(x) = w^T x$, and predicts $y = +1$ if $f(x) \ge 0$, or $y = -1$ otherwise.
a) [3 points]
(i) What can be said about the correctness of the classifier's prediction if the expression $y f(x) > 0$ is true?
(ii) What if $y f(x) = 0$?
(iii) What if $y f(x) < 0$?

b) [4 points] Suppose we use a loss function $L(z) = L(y f(x))$, where $y f(x)$ is defined as above, to compute the loss for a given training pair of input and output $\{x, y\}$. What effect does the loss function shown below have on the resulting classifier? Will this loss function produce a reasonable solution? Explain why or why not.
[Plot of the loss function not reproduced here.]

c) [5 points] Suppose you decide to use the loss $L(z) = \exp(-2z - 1)$. Write down the gradient descent algorithm for minimizing this loss on a set of training examples $\{x_i, y_i\}$, $i = 1, \dots, m$, using a squared L2-norm regularizer $R(w) = \|w\|^2$ on the parameter vector.
(i) What is the objective function?
(ii) Should it be minimized or maximized?
(iii) What is the corresponding gradient descent update step for $w$?

Appendix: Common Formulas

Matrix Derivatives
For vectors $x$, $y$ and matrix $A$:
If $y = Ax$, then $\frac{\partial y}{\partial x} = A$.
If $z = x^T A x$, then $\frac{\partial z}{\partial x} = x^T(A + A^T)$. For the special case of a symmetric matrix $A$, $\frac{\partial z}{\partial x} = 2x^T A$.

Chain Rule
If $z = f(y)$ and $y = g(x)$, then $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$.

Hyperbolic Tangent Function (tanh)
[Plot of tanh not reproduced here.]
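Returning to Q6 c): a minimal sketch of gradient descent, assuming one natural form of the regularized objective, $\frac{1}{m}\sum_{i=1}^{m}\exp(-2\,y_i w^T x_i - 1) + \lambda\|w\|^2$ (the data, learning rate, and $\lambda$ below are hypothetical):

```python
import numpy as np

def objective(w, X, y, lam):
    """Mean exponential loss exp(-2*y_i*w^T x_i - 1) plus lam * ||w||^2."""
    z = y * (X @ w)                          # margins y_i * f(x_i)
    return np.mean(np.exp(-2 * z - 1)) + lam * (w @ w)

def gradient(w, X, y, lam):
    z = y * (X @ w)
    # d/dw exp(-2z - 1) = -2 * y_i * x_i * exp(-2z - 1), averaged over examples
    per_example = (-2 * y * np.exp(-2 * z - 1))[:, None] * X
    return per_example.mean(axis=0) + 2 * lam * w

# Hypothetical linearly separable data: four points in R^2, labels in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, lr, lam = np.zeros(2), 0.1, 0.01
for _ in range(200):
    w -= lr * gradient(w, X, y, lam)         # update: w <- w - lr * grad J(w)
print(w, objective(w, X, y, lam))
```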