
The Hong Kong University of Science and Technology Page: 1 of 3 EXAMINATION, 2020
Course Code: Sample    Section No.: 01    Time Allowed: 3 Hour(s)
Course Title: Statistical Machine Learning    Total Number of Pages: 3
INSTRUCTIONS:
1. Answer ALL of the following questions.
2. The full mark for this examination is 100.
3. Answers without sufficient explanations/steps receive no marks or only partial marks.
4. A calculator is allowed during the exam. Personal computers are NOT allowed.
5. Closed book, but an appendix with the necessary formulas is provided.
1. (10 marks)
Suppose that we observe a quantitative response $Y$ and a set of $p$ predictors, $X = (X_1, \ldots, X_p)$, and their relationship can be written as $Y = f(X) + \epsilon$. Here $f$ is some fixed but unknown function of $X_1, \ldots, X_p$, and $\epsilon$ is a random error term, which is independent of $X$ and has mean zero. In statistical learning, we are interested in finding $\hat{f}(X)$ which can well approximate the underlying true function $f(X)$. To do so, the Bias-Variance Trade-off should be kept in mind. Please explain what the "Bias-Variance Trade-off" is.
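As a sketch of the decomposition behind the trade-off (assuming squared-error loss, a fixed test point $x_0$, and $\epsilon$ independent of $X$ with mean zero as stated above), the expected test error can be written as
\[
E\big[(Y - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\epsilon),
\]
so that increasing model flexibility typically decreases the squared-bias term while increasing the variance term, and vice versa.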
2. (10 marks)
Given a data set $D = \{x_i, y_i\}$, $x_i = [x_{i1}, \ldots, x_{ip}] \in \mathbb{R}^p$, $i = 1, \ldots, n$, a ridge regression model needs to be fitted to this data set,
\[
\min_{\beta_0, \beta_1, \ldots, \beta_p} \; \frac{1}{2n} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,
\]
with a sequence of $\lambda$ values given as $\lambda_1 > \lambda_2 > \cdots > \lambda_{100}$. Please describe how to use 5-fold cross-validation to choose a good model, such that it can be used to predict the outcome of new data $x_{new}$.
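One way to make the procedure concrete is the following minimal Python sketch (assuming the data are available as NumPy arrays X, y and the $\lambda$ sequence as lams; scikit-learn's Ridge is used only for convenience, and its penalty parameterization may differ from the objective above by a constant factor):

    import numpy as np
    from sklearn.linear_model import Ridge

    def ridge_cv(X, y, lams, n_folds=5, seed=0):
        """Return the lambda with the smallest 5-fold cross-validation error."""
        n = len(y)
        rng = np.random.default_rng(seed)
        idx = rng.permutation(n)                      # shuffle, then split into 5 folds
        folds = np.array_split(idx, n_folds)
        cv_err = np.zeros(len(lams))
        for k, lam in enumerate(lams):
            for fold in folds:
                train = np.setdiff1d(idx, fold)       # train on the other 4 folds
                model = Ridge(alpha=lam).fit(X[train], y[train])
                pred = model.predict(X[fold])         # validate on the held-out fold
                cv_err[k] += np.mean((y[fold] - pred) ** 2) / n_folds
        return lams[int(np.argmin(cv_err))]

    # Refit on the full data with the chosen lambda, then predict for new data, e.g.:
    # best = ridge_cv(X, y, lams); y_new = Ridge(alpha=best).fit(X, y).predict(x_new.reshape(1, -1))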
3. (10 marks)
Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e., $K = 1$) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?

4. (10 marks)
Consider the following dataset with three observations, i.e., $Y = (2.2, 2.8, 4.2)$ and $X = (0.4, 0.8, 1.2)$, and the linear regression model $Y = \beta_0 + \beta_1 X$. Calculate the LOOCV (Leave-One-Out Cross-Validation) error.
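A minimal numerical sketch of the computation (assuming ordinary least squares fits and squared-error loss; NumPy only):

    import numpy as np

    X = np.array([0.4, 0.8, 1.2])
    Y = np.array([2.2, 2.8, 4.2])

    errs = []
    for i in range(len(X)):                          # leave observation i out
        keep = np.arange(len(X)) != i
        b1, b0 = np.polyfit(X[keep], Y[keep], 1)     # fit Y = b0 + b1 * X on the remaining two points
        errs.append((Y[i] - (b0 + b1 * X[i])) ** 2)  # squared error on the held-out point
    loocv = np.mean(errs)                            # LOOCV error = average of the three squared errors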
5. (10 marks)
Consider the linear regression model $Y = \beta X$, where we have a simple linear regression with one predictor and no intercept. For the data set with $n = 2$ observations, $Y = (1, 3)$ and $X = (1, 2)$, find $\beta$ such that
\[
\frac{1}{2} \sum_{i=1}^{n} (y_i - \beta x_i)^2 + \lambda |\beta|
\]
is minimized, where $\lambda = 1$.
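A minimal sketch of how the minimizer could be computed (assuming the objective above; the closed-form line uses the standard soft-thresholding solution for a single coefficient with no intercept, and a brute-force grid search is included as a cross-check):

    import numpy as np

    X = np.array([1.0, 2.0])
    Y = np.array([1.0, 3.0])
    lam = 1.0

    # Closed form: beta = sign(x'y) * max(|x'y| - lam, 0) / (x'x)
    rho, z = X @ Y, X @ X
    beta = np.sign(rho) * max(abs(rho) - lam, 0.0) / z

    # Cross-check by grid search over the objective 0.5 * sum((y - b*x)^2) + lam * |b|
    grid = np.linspace(-2.0, 3.0, 100001)
    obj = 0.5 * ((Y[:, None] - grid[None, :] * X[:, None]) ** 2).sum(axis=0) + lam * np.abs(grid)
    beta_grid = grid[np.argmin(obj)]                 # should agree with beta up to grid resolution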
6. (15 marks)
Suppose $x_i \sim N(\mu, \sigma^2)$ for all $i = 1, \ldots, B$, and $x_i$ and $x_j$ ($i \neq j$) are correlated with a correlation coefficient $\rho > 0$, i.e., $\frac{1}{\sigma^2} E[(x_i - \mu)(x_j - \mu)] = \rho$.
(a) Show that
\[
E\Big[\frac{1}{B} \sum_{i=1}^{B} x_i\Big] = \mu.
\]
(5 marks)
(b) Show that
\[
\mathrm{Var}\Big(\frac{1}{B} \sum_{i=1}^{B} x_i\Big) = \rho \sigma^2 + \frac{(1 - \rho) \sigma^2}{B}.
\]
(10 marks)
Remark: this property motivates the design of random forests.
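A small simulation sketch that can be used to sanity-check the variance identity in (b) (equicorrelated Gaussians generated through a shared latent component; the values of mu, sigma, rho and B below are illustrative):

    import numpy as np

    mu, sigma, rho, B = 0.0, 1.0, 0.3, 50            # illustrative parameter values
    n_rep = 200_000
    rng = np.random.default_rng(0)

    z0 = rng.standard_normal(n_rep)                  # shared component induces correlation rho
    zi = rng.standard_normal((n_rep, B))             # independent components
    x = mu + sigma * (np.sqrt(rho) * z0[:, None] + np.sqrt(1.0 - rho) * zi)

    avg = x.mean(axis=1)                             # (1/B) * sum_i x_i for each replication
    print(avg.var(), rho * sigma**2 + (1.0 - rho) * sigma**2 / B)   # the two values should roughly agree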
7. (10 marks)
Suppose we collect data for a group of students in a statistics class with variables $X_1$ = hours studied, $X_2$ = undergrad GPA, and $Y$ = receive an A. We fit a logistic regression and produce estimated coefficients $\hat{\beta}_0 = -6$, $\hat{\beta}_1 = 0.05$ and $\hat{\beta}_2 = 1$.
(a) Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class. (5 marks)
(b) How many hours would the student in part (a) need to study to have a 60% chance of getting an A in the class? (5 marks)
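A minimal numerical sketch (assuming the logistic model $\Pr(Y = 1 \mid X) = 1 / (1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2)})$ with the estimated coefficients above):

    import numpy as np

    b0, b1, b2 = -6.0, 0.05, 1.0                     # estimated coefficients from the question

    # (a) probability of an A for 40 hours of study and a GPA of 3.5
    eta = b0 + b1 * 40 + b2 * 3.5
    p_a = 1.0 / (1.0 + np.exp(-eta))

    # (b) hours needed for a 60% chance: solve b0 + b1*h + b2*3.5 = log(0.6 / 0.4) for h
    hours = (np.log(0.6 / 0.4) - b0 - b2 * 3.5) / b1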
8. (25 marks)
Consider a data set $D = \{x_1, \ldots, x_M\}$, where $x_j \in (0, 1)$ and M = 1,000,000. The observed $x$ values independently come from a mixture of the Uniform distribution on $(0, 1)$ (denoted as component 0) and a distribution with density function $\alpha x^{\alpha - 1}$ with unknown parameter $\alpha$ (denoted as component 1). Let $z_j \in \{0, 1\}$ be the latent variable indicating whether $x_j$
is from component 0 ($z_j = 0$) or component 1 ($z_j = 1$). Then the probabilistic model can be written as:
\[
\pi_0 = \Pr(z_j = 0): \;\; x_j \sim U[0, 1] \text{ if } z_j = 0, \qquad
\pi_1 = \Pr(z_j = 1): \;\; x_j \sim \alpha x^{\alpha - 1} \text{ if } z_j = 1.
\]
(a) Let $\Theta = \{\pi_0, \pi_1, \alpha\}$ be the set of unknown parameters to be estimated. Write down the incomplete log-likelihood function $L(\Theta)$ for this problem. (5 marks)
(b) Derive an EM algorithm for parameter estimation, where $\Theta = \{\pi_0, \pi_1, \alpha\}$ is the parameter set to be estimated and $\{z_1, \ldots, z_M\}$ are considered as missing data. (10 marks)
(c) Suppose we have some additional information collected in a vector $A = [A_1, \ldots, A_M]$, where $A_j \in \{0, 1\}$. We model the relationship between $A_j$ and $z_j$ as $q_0 = \Pr(A_j = 1 \mid z_j = 0)$ and $q_1 = \Pr(A_j = 1 \mid z_j = 1)$, respectively. Derive an EM algorithm to estimate all the parameters $\{\pi_0, \pi_1, \alpha, q_0, q_1\}$. Again, $\{z_1, \ldots, z_M\}$ are considered as missing data. (10 marks)
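For part (b), one possible numerical sketch of the EM iteration (a sketch only, assuming the mixture model above with the data stored in a NumPy array x of values in (0, 1); the updates follow from maximizing the expected complete-data log-likelihood):

    import numpy as np

    def em_uniform_power_mixture(x, n_iter=100, alpha=2.0, pi1=0.5):
        """EM for a mixture of Uniform(0,1) (component 0) and density alpha * x**(alpha - 1) (component 1)."""
        logx = np.log(x)
        for _ in range(n_iter):
            # E-step: responsibility that each x_j came from component 1
            f1 = alpha * x ** (alpha - 1.0)               # component-1 density at x_j
            gamma = pi1 * f1 / (pi1 * f1 + (1.0 - pi1))   # Uniform(0,1) density equals 1
            # M-step: update the mixing weight and alpha
            pi1 = gamma.mean()
            alpha = -gamma.sum() / (gamma * logx).sum()
        return 1.0 - pi1, pi1, alpha                      # (pi_0, pi_1, alpha)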
— END —