
IN2064 Machine Learning Exam
Problem 1: Probabilistic Inference (Version A) (4 credits)
Imagine a disease spreading across a small village. To make accurate forecasts of the number of hospital beds needed,
you want to estimate the severity of the disease with the following model. Let $s$ be a measure for the severity of the
disease. We assume a priori that $s$ follows a standard normal distribution,
$$ s \sim \mathcal{N}(0, 1), \qquad p(s) \propto \exp\!\left(-\tfrac{s^2}{2}\right). $$
The severity probabilistically influences the required days of hospital care $t$ for a patient according to
$$ t \mid s \sim \operatorname{Exp}(\lambda), \qquad p(t \mid \lambda) = \lambda \exp(-\lambda t), \quad \text{where } \lambda = s^2. $$
After some time, you were able to collect N = 5 data points. The respective patients left the hospital after 3, 7, 3, 2
and 4 days.
Derive the a-posteriori most likely (MAP) value of the severity $s$ given these observations. Justify your answer.
Note: You can assume that $s \neq 0$, that is, you can safely divide by $s$ if necessary.
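The derivation itself is pen and paper, but as a sanity check the following sketch (not part of the exam answer) numerically maximizes the log-posterior implied by the model and data above on a grid.

    import numpy as np

    # Observed hospital stays in days (from the problem statement).
    t = np.array([3.0, 7.0, 3.0, 2.0, 4.0])

    # Log-posterior up to an additive constant, using lambda = s^2:
    # log p(s | t) = -s^2/2 + N*log(s^2) - s^2 * sum_i t_i + const.
    def log_posterior(s):
        return -0.5 * s**2 + len(t) * np.log(s**2) - s**2 * t.sum()

    # The posterior is symmetric in s, so searching s > 0 is enough.
    grid = np.linspace(1e-3, 3.0, 200_000)
    s_map = grid[np.argmax(log_posterior(grid))]
    print(s_map)  # numerical approximation of the MAP severity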

Problem 2: k-nearest neighbors (Version A) (4 credits)
Consider a k-nearest neighbor classifier using Euclidean distance on a binary classification task. The prediction for
an instance is the majority class of its k nearest neighbors. The training set is shown in Figure 4.1.
Figure 4.1: A red + denotes instances from class 1, and a blue – denotes instances from class 2.
a) Specify one value of k that minimizes the leave-one-out cross-validation (LOOCV) error. Please consider only
odd values of k (e.g. 1, 3, 5, 7, ...) to avoid ties. What is the resulting error, i.e. the number of misclassified data points?
b) Imagine that the training dataset contains one additional point with coordinates (1,2) and label +. Would this
change the decision boundary for a 1-nearest neighbor classifier? Why or why not?
c) If your answer above was no, what is the shortest distance that you need to move (1, 2) such that it changes the
decision boundary? If your answer above was yes, what is the shortest distance that you need to move (1, 2) so that it
does not change the decision boundary?
Write down the new coordinates after moving and specify the shortest distance.
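Since the points of Figure 4.1 are not reproduced here, the sketch below uses a small hypothetical training set only to illustrate how the LOOCV error would be computed for odd values of k.

    import numpy as np

    # Hypothetical 2-D training set (the actual points are those in Figure 4.1).
    X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
    y = np.array([1, 1, 1, 2, 2, 2])  # class labels

    def loocv_error(X, y, k):
        """Number of misclassified points under leave-one-out CV for k-NN."""
        errors = 0
        for i in range(len(X)):
            # Euclidean distances from the held-out point to all points.
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf  # exclude the held-out point itself
            neighbors = np.argsort(d)[:k]
            # Majority vote among the k nearest neighbors.
            if np.bincount(y[neighbors]).argmax() != y[i]:
                errors += 1
        return errors

    for k in (1, 3, 5):
        print(k, loocv_error(X, y, k))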

Problem 3: Linear Regression (Version A) (6 credits)
Consider a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ with $x^{(i)} \in \mathbb{R}^{D}$, $y^{(i)} \in \mathbb{R}$, and centered features, so that $\frac{1}{N} \sum_{i=1}^{N} x^{(i)} = 0$. During
preprocessing, we absorb the bias as a constant feature to produce the transformed dataset $\tilde{\mathcal{D}} = \{(\tilde{x}^{(i)}, \tilde{y}^{(i)})\}_{i=1}^{N}$,
where we map each $(x^{(i)}, y^{(i)})$ to
$$ \tilde{x}^{(i)} = \begin{bmatrix} x^{(i)} \\ 1 \end{bmatrix} \in \mathbb{R}^{D+1} \quad \text{and} \quad \tilde{y}^{(i)} = y^{(i)}. $$
We want to perform ridge regression with regularization strength $\lambda$ to find the optimal weight vector $\tilde{w}^{*} \in \mathbb{R}^{D+1}$.
Reminder: In ridge regression we optimize the loss function
$$ L(\tilde{w}) = \sum_{i=1}^{N} \left( \tilde{w}^{\top} \tilde{x}^{(i)} - \tilde{y}^{(i)} \right)^{2} + \lambda \lVert \tilde{w} \rVert_{2}^{2}. $$
a) Derive the closed-form solution for the last element $\tilde{w}^{*}_{b}$ of the weight vector, corresponding to the absorbed bias,
obtained from ridge regression on $\tilde{\mathcal{D}}$.
b) Propose an alternative preprocessing step, i.e. define an alternative transformed dataset $\hat{\mathcal{D}}$,
such that the optimal ridge regression vector $\hat{w}^{*} \in \mathbb{R}^{D}$ on $\hat{\mathcal{D}}$ is equivalent to $\tilde{w}^{*}$ obtained on $\tilde{\mathcal{D}}$. Justify that ridge
regression on your preprocessed data $\hat{\mathcal{D}}$ finds an optimal $\hat{w}^{*} \in \mathbb{R}^{D}$ that is equal to $\tilde{w}^{*}$ in the first $D$ elements, i.e.
$\hat{w}^{*}_{i} = \tilde{w}^{*}_{i}$ for $i \in \{1, \ldots, D\}$. You do not need to derive the closed-form solution for $\hat{w}^{*}$.
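A minimal numerical sketch, assuming the loss stated above and using synthetic data, that computes the closed-form ridge solution with an absorbed bias; it can be used to check a candidate expression for the bias element from part a).

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, lam = 50, 3, 2.0  # hypothetical sizes and regularization strength

    # Centered features plus an absorbed constant bias feature.
    X = rng.normal(size=(N, D))
    X -= X.mean(axis=0)                        # enforce (1/N) sum_i x^(i) = 0
    X_tilde = np.hstack([X, np.ones((N, 1))])  # append the constant 1 feature
    y = rng.normal(size=N)

    # Closed-form ridge solution: w = (X^T X + lam I)^{-1} X^T y.
    w_tilde = np.linalg.solve(X_tilde.T @ X_tilde + lam * np.eye(D + 1),
                              X_tilde.T @ y)

    # The last element is the absorbed bias from part a); compare it
    # against your derived closed-form expression.
    print(w_tilde[-1])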

Problem 4: Naive Bayes (Version A) (6 credits)
We have collected the dataset shown in the table below, with the continuous feature $x_1$ and the discrete feature $x_2$
that can take either of the values yes or no. Each data point is labeled as one of three classes 1, 2 or 3, denoted by $y$.
Table 10.1: Naive Bayes Data (each column is one data point)
a) Set up a naive Bayes classifier (choose likelihoods and their parameterization) for the data in Table 10.1 using
Normal distributions with a fixed variance of 1 and Bernoulli distributions. Compute the maximum likelihood estimate
of all parameters $\theta$ required for naive Bayes.
b) You observe a new data point $x^{(b)} = (1, \text{yes})$. Compute the unnormalized posterior over classes $p(y^{(b)} \mid x^{(b)}, \theta)$
for $x^{(b)}$. Simplify as far as possible without evaluating exponential functions, square roots, logarithms, etc. Briefly
justify how you arrive at your solution. What is the most likely class for this data point?
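A hedged sketch of the same model (Normal likelihood with fixed variance 1, Bernoulli likelihood, categorical prior); the data below is hypothetical and only stands in for Table 10.1.

    import numpy as np

    # Hypothetical data standing in for Table 10.1 (each column is one data point):
    # x1 is continuous, x2 is binary (yes = 1, no = 0), y is the class in {1, 2, 3}.
    x1 = np.array([0.0, 2.0, 1.0, 3.0, 1.0, 2.0])
    x2 = np.array([1, 0, 1, 1, 0, 0])
    y = np.array([1, 1, 2, 2, 3, 3])

    params = {}
    for c in np.unique(y):
        mask = y == c
        params[c] = {
            "prior": mask.mean(),      # MLE of p(y = c)
            "mu": x1[mask].mean(),     # MLE mean of Normal(mu_c, 1) for x1
            "p_yes": x2[mask].mean(),  # MLE of the Bernoulli parameter for x2
        }

    def unnormalized_posterior(x1_new, x2_new):
        """p(y = c) * p(x1 | c) * p(x2 | c) for every class c, not normalized."""
        out = {}
        for c, p in params.items():
            gauss = np.exp(-0.5 * (x1_new - p["mu"]) ** 2) / np.sqrt(2 * np.pi)
            bern = p["p_yes"] if x2_new == 1 else 1 - p["p_yes"]
            out[c] = p["prior"] * gauss * bern
        return out

    post = unnormalized_posterior(1.0, 1)  # new point x = (1, yes)
    print(post, max(post, key=post.get))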

Problem 5: Optimization (Version A) (2 credits)
Suppose that $a \in \mathbb{R}^{d}$ is some fixed vector. We define the function $f : \mathbb{R}^{d} \to \mathbb{R}$ as
$$ f(x) = \exp\!\left( e^{a^{\top} x} + e^{-a^{\top} x} \right). $$
Prove or disprove that $f$ is convex on $\mathbb{R}^{d}$.
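The proof has to be analytical, but an empirical check of the defining inequality of convexity can catch a wrong guess; the sketch below uses an arbitrarily chosen vector $a$ and random test points.

    import numpy as np

    rng = np.random.default_rng(0)
    a = np.array([0.5, -1.0, 0.25])  # an arbitrary fixed vector a in R^3

    def f(x):
        s = a @ x
        return np.exp(np.exp(s) + np.exp(-s))

    # Empirical check of f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2)
    # at random points; 0 violations is consistent with convexity but
    # does not prove it.
    violations = 0
    for _ in range(10_000):
        x1, x2 = rng.uniform(-1, 1, size=3), rng.uniform(-1, 1, size=3)
        t = rng.uniform()
        if f(t * x1 + (1 - t) * x2) > t * f(x1) + (1 - t) * f(x2) + 1e-8:
            violations += 1
    print(violations)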

Problem 6: Convolutional Neural Networks (Version D) (3 credits)
Recall that a convolutional layer is defined by the following parameters:
• C_in – number of input channels
• C_out – number of output channels
• K – kernel (sliding window) size
• P – padding size
• S – stride
Suppose that x is an image with height 64 and width 32 (in pixels) that has 3 channels, which we can represent as
a tensor (three dimensional array) of shape [3, 64, 32].
We passed x through a neural network consisting of two convolutional layers conv1 and conv2 (in this order). As
output we obtained a tensor of shape [16, 16, 8] (i.e., 16 channels, height 16 and width 8).
We know that the first convolutional layer conv1 has the following parameters
We also know that the second convolutional layer conv2 has kernel size K = 3.
What are the remaining parameters Cin, Cout, P, S of the second convolutional layer conv2?
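The question reduces to the standard output-size formula of a convolution. The sketch below implements it; conv1's parameters are not reproduced above, so the values used here are placeholders to be swapped for the ones on the exam sheet.

    # Spatial output size of a convolutional layer:
    # out = floor((in + 2*P - K) / S) + 1
    def conv_out(size, K, P, S):
        return (size + 2 * P - K) // S + 1

    # Placeholder parameters for conv1 (replace with the actual values).
    K1, P1, S1 = 3, 1, 2
    h1, w1 = conv_out(64, K1, P1, S1), conv_out(32, K1, P1, S1)

    # conv2 has K = 3; try candidate P, S so that the final shape is [16, 16, 8].
    K2, P2, S2 = 3, 1, 2  # hypothetical candidates to test
    print(h1, w1, conv_out(h1, K2, P2, S2), conv_out(w1, K2, P2, S2))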

Problem 7: SVMs (Version A) (5 credits)
Consider a soft-margin SVM trained on a binary classification task with some fixed and finite penalty $C$, i.e.
$0 < C < \infty$. Assume that the training set contains at least two instances from each class. After training you observe
that the optimal slack variables are zero for all instances except for a single instance $q$. Specifically, $\xi_q > 2$ for $q$
and $\xi_i = 0$ for all $i \neq q$. Let $m_{\text{soft}} = \frac{1}{\lVert w_{\text{soft}} \rVert}$ be the optimal margin, where $w_{\text{soft}}$ are the optimal weights.
a) Now you remove the instance $q$ from the dataset and you train a new hard-margin SVM. Let $m_{\text{hard}}$ be the resulting
optimal margin. Which one of the following holds: $m_{\text{hard}} < m_{\text{soft}}$, $m_{\text{hard}} = m_{\text{soft}}$, $m_{\text{hard}} \geq m_{\text{soft}}$? Justify your answer.
b) Now instead of removing it, you relabel the instance $q$ (if before it was class $-1$ you set it to class $+1$ and vice
versa) and you train a new hard-margin SVM. Let $m_{\text{hard}}$ be the resulting optimal margin. Which one of the following
holds: $m_{\text{hard}} < m_{\text{soft}}$, $m_{\text{hard}} = m_{\text{soft}}$, $m_{\text{hard}} \geq m_{\text{soft}}$? Justify your answer.

Problem 8: Low-rank Approximation and Regression (Version A) (6 credits)
Consider a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, with $x^{(i)} \in \mathbb{R}^{D}$, $y^{(i)} \in \mathbb{R}$. Let $X \in \mathbb{R}^{N \times D}$ be the matrix of features and
$y \in \mathbb{R}^{N}$ be the vector of regression targets. We construct the matrix $M = [X, y] \in \mathbb{R}^{N \times (D+1)}$, where $y$ has been
appended as an additional column to $X$. Let $M' \in \mathbb{R}^{N \times (D+1)}$ be the best rank-$K$ approximation of $M$ for some chosen
constant $K$. Define $M' = [X', y']$, where $X'$ is the matrix containing the first $D$ columns of $M'$ and $y'$ is the last column of $M'$.
We will fit a standard linear regression model using $X'$ as the feature matrix and $y'$ as the target, i.e. we find the
optimal weight vector $w^{*} \in \mathbb{R}^{D}$ and the bias $b^{*} \in \mathbb{R}$. Careful: we do regression on $X'$ and $y'$, not on $X$ and $y$!
a) Consider the case where $D = 1$ and $K = 1$, i.e. our features are one-dimensional and $M'$ is the best rank-1
approximation of $M$. Assuming the features $X'$ contain at least two distinct elements, what is the training error
achieved by the optimal $w^{*}$ and $b^{*}$? Justify your answer.
b) For the same case as in (a) where $D = 1$ and $K = 1$, write down the optimal $w^{*}$ and $b^{*}$ in terms of the singular
value decomposition of $M = U \Sigma V^{\top}$, i.e. write down $w^{*}$ and $b^{*}$ as a function of the singular values and singular vectors.

Problem 10: Differential Privacy (Version A) (6 credits)
You are given a dataset with $n$ instances $\{x_1, \ldots, x_n\}$, with $x_i \in \mathcal{X}$. The instances are randomly split into disjoint
groups $G_1, G_2, \ldots, G_m$, each of size $\frac{n}{m}$ (assume that $m$ divides $n$, i.e. $\frac{n}{m}$ is an integer).
a) First you apply an arbitrary function $f : \mathcal{X}^{n/m} \to [a, b]$ (where $a$ and $b$ are given constants) to each of the groups,
i.e. you compute $g_1 = f(G_1), g_2 = f(G_2), \ldots, g_m = f(G_m)$. Then you compute the final output by aggregating the
per-group outputs, computing $\operatorname{mean}(g_1, \ldots, g_m)$. Derive the global $\Delta_1$ sensitivity of the function
$f' := \operatorname{mean}(f(G_1), \ldots, f(G_m))$.
b) How does the global sensitivity of $f'$ change if we increase $n$ keeping $m$ fixed? How does the global sensitivity
of $f'$ change if we increase $m$ keeping $n$ fixed?
c) Can you make the function $f'$ differentially private for any function $f$? If yes, specify the noise distribution from
which we have to sample to obtain an $\varepsilon$-DP mechanism. If no, why not?
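For Problem 8, a minimal sketch (with synthetic one-dimensional data) of the construction: it forms $M = [X, y]$, takes the best rank-1 approximation via the SVD, and fits ordinary least squares on $X'$, $y'$; the printed training error can be compared against the answer derived in part a).

    import numpy as np

    rng = np.random.default_rng(0)
    N = 20
    X = rng.normal(size=(N, 1))              # D = 1, hypothetical features
    y = 2.0 * X[:, 0] + rng.normal(size=N)   # hypothetical targets

    # Best rank-1 approximation of M = [X, y] via the SVD.
    M = np.hstack([X, y[:, None]])
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    M1 = S[0] * np.outer(U[:, 0], Vt[0])     # keep only the largest singular value
    X1, y1 = M1[:, :1], M1[:, 1]

    # Ordinary least squares (with bias) on X', y'.
    A = np.hstack([X1, np.ones((N, 1))])
    coef, *_ = np.linalg.lstsq(A, y1, rcond=None)
    w_star, b_star = coef[0], coef[1]

    # Training error of the fit on (X', y').
    residual = A @ coef - y1
    print(w_star, b_star, np.sum(residual**2))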
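For Problem 10, a short sketch of the Laplace mechanism, the standard way to turn a function with bounded global sensitivity into an $\varepsilon$-DP mechanism; the sensitivity value used below is a placeholder to be replaced by the expression derived in part a).

    import numpy as np

    rng = np.random.default_rng(0)

    def laplace_mechanism(value, sensitivity, epsilon):
        """Release value plus Laplace noise with scale sensitivity / epsilon."""
        return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

    # Example: per-group outputs g_1, ..., g_m lie in [a, b]; f' is their mean.
    a, b, m = 0.0, 1.0, 5
    g = rng.uniform(a, b, size=m)
    sensitivity = (b - a) / m  # placeholder; use the expression from part a)
    print(laplace_mechanism(g.mean(), sensitivity, epsilon=1.0))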