
ESE 545 : Data Mining: Learning from Massive Datasets
Instructor:
Spring 2020
Final Examination


Note: Each Multiple Choice (MC) question has 4 marks and each Short Answer (SA) question has 10 marks. For the multiple choice questions, you are only required to write the item that you’ve chosen (e.g. Question 1: (d) or Question 5: (e)), and for short answer questions you are required to write a short explanation for your answer.
                     Grade (y/n)    Score    Max. Score
Multiple Choice                                  60
Short Answer                                     30
TOTAL                                            90

MC-Question 1. Let S be a collection of text documents represented by the 5-shingle representation (i.e. each point in S is a set of 5-shingles). We consider the Jaccard distance on the set S. Let F be the family of all the min-hash functions. Recall that choosing a hash function uniformly at random from F is equivalent to the (random) min-hashing procedure covered in the first few lectures of the course. Which of the following is correct about F?

(a) F is (d1, d2, 1 − d1, 1 − d2)-sensitive for any choice of d1, d2 ∈ [0, 1] such that d1 < d2.
(b) F is (d1, d2, 1 − d1/2, 1 − d2/2)-sensitive for any choice of d1, d2 ∈ [0, 1] such that d1 < d2.
(c) F is (d1, d2, 1 − 2d1, 1 − 2d2)-sensitive for any choice of d1, d2 ∈ [0, 1] such that d1 < d2.
(d) There is no choice of the parameters d1, p1, d2, p2 such that F is (d1, d2, p1, p2)-sensitive.
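For reference, a minimal sketch of the min-hashing procedure this question refers to (the shingling, the affine hash family, and all names below are our own simplifying assumptions): a single random min-hash maps two sets to the same value with probability equal to their Jaccard similarity.

import random

def jaccard(a, b):
    # Jaccard similarity of two shingle sets.
    return len(a & b) / len(a | b)

def minhash_signature(shingles, num_hashes=100, prime=2147483647, seed=0):
    # One random affine hash (a*x + b) mod prime per signature coordinate;
    # each coordinate stores the minimum hash value over the set.
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    return [min((a * hash(s) + b) % prime for s in shingles) for a, b in coeffs]

def estimated_similarity(sig1, sig2):
    # Fraction of agreeing coordinates; its expectation is the Jaccard similarity.
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)

# Toy usage with 5-shingles over characters.
shingle = lambda doc: {doc[i:i + 5] for i in range(len(doc) - 4)}
s1, s2 = shingle("minhash approximates jaccard"), shingle("minhash estimates jaccard")
print(jaccard(s1, s2), estimated_similarity(minhash_signature(s1), minhash_signature(s2)))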
MC-Question 2. Consider a collection of n data points. What is the complexity of finding all the near-duplicate pairs naively without using hashing, and what is the complexity of approximately finding the near-duplicate pairs using locality-sensitive hashing?

(a) O(n^2), O(n√n)
(b) O(n), O(√n)
(c) O(n^2), O(n)
(d) O(n^2), O(n^2)

MC-Question 3. Consider the active learning procedure. The following figure shows the version space after querying the labels for data points a, b, c, d, e (the version space is the area surrounded by the solid lines a, b, c, d, e). We would like to choose one of the data points f, g, h (the dashed lines) to query next. Which of the data points will be queried by the uncertainty sampling procedure, and which one will be queried by the max-min margin procedure?

(a) Uncertainty sampling: f, Max-min margin: h
(b) Uncertainty sampling: g, Max-min margin: f
(c) Uncertainty sampling: g, Max-min margin: h
(d) Uncertainty sampling: f, Max-min margin: g
(e) Uncertainty sampling: h, Max-min margin: f

MC-Question 4. We would like to minimize the function f(x) = (1/2)x^2 + e^x over the interval (−∞, 1] using the projected gradient descent procedure. Which of the following is the update rule of the projected gradient descent procedure, and do the iterates xt converge to the global minimum (assuming ηt is chosen appropriately)?

(a) xt+1 = (1 − ηt)xt − ηt e^{xt}; no.
(b) xt+1 = (1 + ηt)xt + ηt e^{xt}; yes.
(c) xt+1 = min{(1 − ηt)xt − ηt e^{xt}, 1}; yes.
(d) xt+1 = min{(1 − ηt)xt − ηt e^{xt}, 1}; no.
(e) xt+1 = (yt/|yt|) min{|yt|, 1}, where yt = (1 − ηt)xt − ηt e^{xt}; no.
(f) xt+1 = (yt/|yt|) min{|yt|, 1}, where yt = (1 − ηt)xt − ηt e^{xt}; yes.
(g) xt+1 = min{(1 + ηt)xt − ηt e^{xt}, 1}; no.
(h) xt+1 = (yt/|yt|) min{|yt|, 1}, where yt = (1 + ηt)xt − ηt e^{xt}; yes.
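For reference, a minimal sketch of projected gradient descent on f(x) = (1/2)x^2 + e^x over (−∞, 1]; the step-size schedule and starting point below are our own assumptions.

import math

def projected_gd(x0=0.5, steps=500):
    # Gradient of f(x) = 0.5*x**2 + exp(x) is x + exp(x);
    # the projection onto (-inf, 1] is simply min(., 1).
    x = x0
    for t in range(1, steps + 1):
        eta = 0.1 / math.sqrt(t)            # assumed diminishing step size
        x = min(x - eta * (x + math.exp(x)), 1.0)
    return x

print(projected_gd())   # tends toward the minimizer of f on (-inf, 1]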
MC-Question 5. Which of the following is a strongly convex function?

(a) f(x) = |x| + x^4
(b) f(x) = e^{−x} + x^4
(c) f(x) = e^x
(d) f(x) = x^3 + x^4

MC-Question 6. Which of the following is a wrong statement?

(a) SVM can handle separable and inseparable (noisy) data sets.
(b) SVM aims at finding a classifier with maximum margin, which leads to a better generalization performance.
(c) The PEGASOS algorithm is the fastest solver for the SVM problem.
(d) ADAGRAD may converge to a local minimum of the SVM problem which is different from the global minimum.

MC-Question 7. Consider the stochastic version of the multi-armed bandit problem with four arms. The reward of each arm is a Bernoulli random variable with mean μi, i = 1, 2, 3, 4. We are using the UCB algorithm. Assume that the rewards we have seen so far are as follows:

arm 1: 0, 1, 1
arm 2: 0
arm 3: 0, 0, 1, 1
arm 4: 0, 0, 0, 0

Which arm would be chosen next by the UCB algorithm?

(a) arm 1
(b) arm 2
(c) arm 3

MC-Question 8. Consider the multi-armed bandit problem with 2 arms and adversarial losses (or equivalently, adversarial rewards). We would like to use the Thompson sampling algorithm for this setting. What do you think about the normalized regret of this algorithm (RT/T)?

(a) RT/T will converge to zero as Thompson sampling is a randomized algorithm.
(b) RT/T will not converge to zero if the losses are chosen carefully, because Thompson sampling is designed for stochastic rewards.
(c) RT/T will converge to zero as Thompson sampling has lower regret than EXP3.
(d) RT/T will not converge to zero as Thompson sampling is a deterministic algorithm and we are considering adversarial losses.

MC-Question 9. Consider the multi-armed bandit problem with 2 arms. The loss of arm 1 at time t is denoted by l1,t and the loss of arm 2 is denoted by l2,t. We further know that l1,t and l2,t have the following form:

l1,t = 0.1 if t is odd, 0.8 if t is even;    l2,t = 0.5 if t is odd, 0.3 if t is even.

Assume that we use the EXP3 algorithm for this setting and let it ∈ {1, 2} be the choice of the algorithm at time t. Define LT = Σ_{t=1}^T l_{it,t}. What is lim_{T→∞} LT/T?

(a) 0.1
(b) 0.2
(c) 0.3
(d) 0.4
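For reference, a minimal sketch of the EXP3 update used in Questions 8 and 9; the fixed learning rate and the rescaling step are our own simplifications.

import math
import random

def exp3(losses, eta=0.05, seed=0):
    # EXP3 for adversarial bandit losses in [0, 1]: sample an arm from the
    # exponential weights, observe only that arm's loss, and update the pulled
    # arm's weight with an importance-weighted loss estimate.
    rng = random.Random(seed)
    k = len(losses[0])
    weights = [1.0] * k
    total = 0.0
    for round_losses in losses:
        z = sum(weights)
        probs = [w / z for w in weights]
        arm = rng.choices(range(k), weights=probs)[0]
        loss = round_losses[arm]
        total += loss
        estimate = loss / probs[arm]                  # unbiased loss estimate
        weights[arm] *= math.exp(-eta * estimate)
        m = max(weights)
        weights = [w / m for w in weights]            # rescale to avoid underflow
    return total / len(losses)                        # average loss per round

# The alternating losses of MC-Question 9.
T = 20000
losses = [[0.1, 0.5] if t % 2 == 1 else [0.8, 0.3] for t in range(1, T + 1)]
print(exp3(losses))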
MC-Question 10. Consider the stochastic version of the multi-armed bandit problem with two arms. The reward of each arm is a Bernoulli random variable with mean μi, i = 1, 2. Assume that μ1 > μ2. We use the ε-greedy algorithm with εt = 1/√t. What can we say about the expected regret E[RT] of this algorithm?
(a) E[RT] ≥ c(μ1 − μ2)√T, where c is a constant independent of T.
(b) E[RT] ≤ c log T, where c is a constant independent of T.
(c) E[RT] ≥ (μ1 − μ2)T / 2.
(d) E[RT] ≤ μ1 − μ2.
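For reference, a minimal sketch of the ε-greedy rule with εt = 1/√t from Question 10; the Bernoulli means and horizon below are our own assumptions.

import math
import random

def epsilon_greedy_regret(mu, horizon, seed=0):
    # Explore a uniformly random arm with probability eps_t = 1/sqrt(t),
    # otherwise pull the arm with the best empirical mean so far.
    rng = random.Random(seed)
    k = len(mu)
    pulls, sums = [0] * k, [0.0] * k
    reward = 0.0
    for t in range(1, horizon + 1):
        if rng.random() < 1.0 / math.sqrt(t) or 0 in pulls:
            arm = rng.randrange(k)                                  # explore
        else:
            arm = max(range(k), key=lambda i: sums[i] / pulls[i])   # exploit
        r = 1.0 if rng.random() < mu[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += r
        reward += r
    return max(mu) * horizon - reward   # realized regret vs. always playing the best arm

print(epsilon_greedy_regret([0.7, 0.4], horizon=10000))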
MC-Question 11. Consider the problem of maximizing a monotone submodular function F over a ground set V and under a cardinality constraint. In class we explained two algorithms for this problem, namely the greedy algorithm and the lazy-greedy algorithm. Which of the following is a wrong statement?
(a) The lazy-greedy algorithm has the same guarantee as the greedy algorithm.
(b) The lazy-greedy algorithm can be significantly faster than the greedy algorithm.
(c) The lazy-greedy algorithm may output a different solution (i.e. a different set) than the greedy algorithm.
(d) The lazy-greedy algorithm performs at least |V| function computations. That is, the lazy-greedy algorithm needs to compute the value of F over at least |V| different subsets of V.
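For reference, a minimal sketch of lazy-greedy selection for a monotone submodular F under a cardinality constraint; the coverage function in the example and all names are our own assumptions.

import heapq

def lazy_greedy(F, ground_set, k):
    # Keep (possibly stale) marginal gains in a max-heap; re-evaluate an element's
    # gain only when it reaches the top, and accept it if it still beats the rest.
    S = set()
    base = F(S)
    heap = [(-(F({e}) - base), e) for e in ground_set]   # at least |V| evaluations here
    heapq.heapify(heap)
    while len(S) < k and heap:
        _, e = heapq.heappop(heap)
        fresh = F(S | {e}) - F(S)                        # gain w.r.t. the current S
        if not heap or -heap[0][0] <= fresh:
            S.add(e)                                     # still the best: keep it
        else:
            heapq.heappush(heap, (-fresh, e))            # stale: push back updated gain
    return S

# Example with a simple (monotone, submodular) coverage function.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}
coverage = lambda S: len(set().union(*(sets[e] for e in S))) if S else 0
print(lazy_greedy(coverage, list(sets), k=2))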
MC-Question 12. Which of the following is a wrong statement?
(a) Sum of two submodular functions is submodular.
(b) Sum of two monotone functions is monotone.

(c) If F : 2^V → R is a submodular function and c is a fixed constant, then the function g : 2^V → R defined as g(S) = max{F(S), c} is also a submodular function.
(d) If F : 2^V → R is a submodular function and a, b are positive constants, then the function g(S) = aF(S) + b is also a submodular function.
MC-Question 13. Consider the following five statements:
1. K-means++ requires K full passes over the data.
2. K-means++ will always find a better local minimum than K-means.
3. K-means++ is a variant of the gradient descent algorithm.
4. We can scale the K-means algorithm to massive data sets by using ideas similar to the stochastic gradient descent method.
5. Both K-means and K-means++ will converge to a local minimum of the loss function of the k-means problem.
Which of the above five statements are wrong?
(a) 1, 5
(b) 2, 5
(c) 1, 4
(e) None of the above contains all the wrong statements
MC-Question 14. What is the purpose of using K-means++?
(a) It has significantly less computational complexity than K-means.
(b) It provides a better solution than K-means.
(c) It provides a good initialization for K-means

(d) It scales to massive data sets
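For reference, a minimal sketch of the K-means++ seeding step discussed in Questions 13 and 14; the toy points below are our own.

import random

def kmeans_pp_init(points, k, seed=0):
    # First center uniformly at random; each further center is sampled with
    # probability proportional to the squared distance to its nearest chosen center.
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        centers.append(rng.choices(points, weights=d2)[0])
    return centers

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.0, 0.1)]
print(kmeans_pp_init(points, k=3))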
MC-Question 15. What is a GAN and how do we train it using data?
(a) A GAN is a mechanism that generates samples from a distribution which is close to the true distribution of data; A GAN is trained via the approximate maximum likelihood procedure.
(b) A GAN is a probability distribution trained using data via the approximate maximum likelihood procedure.
(c) A GAN is a mechanism that generates samples from a distribution which is close to the true distribution of data; A GAN is a neural network that is trained using an additional neural network called the discriminator.
(d) A GAN is a mechanism that provides an explicit probability distribution over the data (with an explicit density function); A GAN is a neural network that is trained using an additional neural network called the discriminator.
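For background, the standard GAN training objective (standard notation, included here only for reference; G is the generator, D the discriminator, p_data the data distribution and p_z the noise distribution) is the minimax game

min_G max_D  E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 − D(G(z)))],

in which the discriminator is trained to distinguish real samples from generated ones and the generator is trained to fool it.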
SA-Question 1. Assume you are given a data set D = {(x1, y1), · · · , (xn, yn)} with xi ∈ R^d, and yi ∈ {−1, 1}. In class, we discussed how to optimize the hinge loss using stochastic gradient descent, where
l_hinge(x, y; w) = max{0, 1 − y w^T x}.
We will consider here a different loss function l defined as follows:
l(x, y; w) = 1 / (1 + exp{y w^T x}).

1. Write the update rule of the stochastic gradient descent (with the details of the gradient computation, etc.) to minimize the following objective:
min_{w : ||w||_2 ≤ 1/λ}  Σ_{i=1}^n l(xi, yi; w).
2. Is the problem convex? (hint: plot the loss function)
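For reference, a minimal sketch of one projected stochastic gradient step for this loss, assuming the constraint set is the ball ||w||_2 ≤ 1/λ and using ∇w l(x, y; w) = −y · l · (1 − l) · x with l = 1/(1 + exp{y w^T x}); the data, step size, and λ below are made up.

import math
import random

def sgd_step(w, x, y, eta, lam):
    # One stochastic gradient step on l(x, y; w) = 1 / (1 + exp(y * <w, x>)),
    # followed by projection onto the ball ||w||_2 <= 1/lam.
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    l = 1.0 / (1.0 + math.exp(margin))
    w = [wi - eta * (-y * xi * l * (1.0 - l)) for wi, xi in zip(w, x)]
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm > 1.0 / lam:
        w = [wi / (norm * lam) for wi in w]   # rescale back onto the constraint set
    return w

random.seed(0)
data = [([1.0, -0.5], 1), ([-0.8, 0.3], -1)]
w = [0.0, 0.0]
for t in range(1, 101):
    x, y = random.choice(data)
    w = sgd_step(w, x, y, eta=1.0 / math.sqrt(t), lam=0.5)
print(w)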
SA-Question 2. In class we learned how to perform sub-linear active learning by hashing hyperplane queries. Specifically, we designed a two-bit hashing mechanism with the following property: Given a hyperplane with normal vector w and a data point x, we have
Pr(h(x) = h(w)) = 1/4 − (1/4)(1 − 2θw,x/π)^2,
where θw,x is the angle between the two vectors w and x.

1. Show that the family of two-bit hash keys has the following property: If θw,x ≤ π/4, then Pr(h(x) = h(w)) ≤ 3/16.
2. Assume δ is a small and fixed value (e.g. δ = 0.05). Explain how to use multiple hash functions to provide a new hashing mechanism h′ with the following property: If θw,x ≤ π/4 then Pr(h′(x) = h′(w)) ≤ δ.
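One standard construction for part 2 (sketched here only as a starting point, under an independence assumption, not necessarily the intended solution): draw k two-bit hash functions h1, ..., hk independently and let h′(x) = (h1(x), ..., hk(x)), declaring a match only when all k components agree. Then, for θw,x ≤ π/4,

Pr(h′(x) = h′(w)) = Pr(h(x) = h(w))^k ≤ (3/16)^k,

so any k ≥ log(1/δ)/log(16/3) suffices; e.g. for δ = 0.05, k = 2 already gives (3/16)^2 ≈ 0.035 ≤ δ.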
SA-Question 3. In class we learned about the K-means algorithm. Indeed, the K-means algorithm can be written as a variant of the gradient descent procedure. Reformulate each update of the K-means algorithm as a gradient descent update.
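One common way to see the connection (sketched here as a hint, not necessarily the intended derivation): for a fixed assignment of the points into clusters C1, ..., CK, the k-means objective is L(μ1, ..., μK) = Σ_k Σ_{x∈Ck} ||x − μk||^2, with gradient ∇_{μk} L = −2 Σ_{x∈Ck} (x − μk). A gradient step μk ← μk − η ∇_{μk} L with the cluster-specific step size η = 1/(2|Ck|) gives μk ← (1/|Ck|) Σ_{x∈Ck} x, which is exactly the center-update step of K-means.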
