Midterm Exam
Statistical Machine Learning (GR5241)
Instructor: As
Note: Please submit your solutions in two files: one .Rmd file for Q1, and one .pdf file for all the questions (including the knitted file for Q1).
1. (4 points) Load the package datasets and load the Iris data set using the data("iris") command. We will try to predict the species of iris from the sepal length and width and the petal length and width using k-nearest neighbors. We will use a pseudo-random number generator to "randomly" divide the data. This produces a deterministic split with the properties of a random split. Pseudo-random numbers are often helpful for debugging. To set the seed, use the command set.seed(13), where 13 is the seed. Please read about how to write user-defined functions in R (https://www.w3schools.com/r/r_functions.asp) to answer the following questions.
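For reference, a minimal sketch of the setup this question assumes (loading the data, fixing the seed, and writing a simple user-defined function in the style of the linked tutorial); the function name add.one is purely illustrative:

# Load the built-in iris data and fix the pseudo-random seed
library(datasets)
data("iris")
set.seed(13)

# A trivial user-defined function, following the syntax from the tutorial above
add.one <- function(x) {
  x + 1
}
add.one(5)  # returns 6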
a. Write a function named 'split.data' that divides the iris data into training and testing sets in the following way: Use the function sample to make a new ordering for your data. Use the first 100 reordered observations as your training set and the last 50 as your testing set. Output a named list where the names are "train" and "test" and the values are the corresponding datasets.
b. Write a function named 'misclassification.knn', using the function knn from the package class, that takes the following arguments as inputs:
• data: a named list containing training and testing data.
• type: a string which is either "train" or "test", determining whether the function outputs the misclassification rate on the training data or on the test data.
• K: a sequence of k values for the k-nearest neighbor method.
This function should output a vector with values corresponding to the misclassification rates for each value in K. As an example, 'misclassification.knn(data = data, type = "train", K = c(1,2,3))' should output a vector of length three with the corresponding training misclassification errors.
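For reference, the knn function from the package class takes a matrix (or data frame) of training predictors, a matrix of test predictors, a factor of training labels, and a value of k, and returns predicted labels for the test rows. A minimal sketch of a single call; the deterministic split below is only illustrative, not the random split required in part a:

library(class)

train.x <- iris[1:100, 1:4]          # illustrative fixed split, for demonstration only
test.x  <- iris[101:150, 1:4]
train.y <- iris$Species[1:100]
test.y  <- iris$Species[101:150]

pred <- knn(train = train.x, test = test.x, cl = train.y, k = 3)
mean(pred != test.y)                 # misclassification rate on this particular split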
c. In this part we want to plot the misclassification rates for training and test against k = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50 using the functions developed above; however, this splitting is subject to randomness. To account for that, repeat this procedure 4 times and plot the results on a SINGLE graph. Distinguish the lines by changing the color and point type, and include a legend. This plot should have 8 lines in total: 4 for misclassification rates on the training datasets, and 4 for misclassification rates on the testing datasets.
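As a reminder of the base-R plotting mechanics that part c calls for (overlaying several curves with distinct colors and point types, plus a legend), here is a minimal sketch with made-up error values; the real curves should come from the functions written above:

k.values  <- c(1:10, 20, 30, 40, 50)

# Dummy error vectors standing in for actual training/test misclassification rates
train.err <- runif(length(k.values), 0, 0.10)
test.err  <- runif(length(k.values), 0.05, 0.20)

plot(k.values, train.err, type = "b", col = "blue", pch = 1, ylim = c(0, 0.3),
     xlab = "k", ylab = "Misclassification rate")
lines(k.values, test.err, type = "b", col = "red", pch = 2)
legend("topright", legend = c("train", "test"), col = c("blue", "red"), pch = c(1, 2))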
2. (5 points) Consider a regression problem where we have n observations and p predictors, and where we know that only a small subset of the predictors impact the response. We know that for such problems we can use lasso regression:
\[
\hat{\beta}^{L}(y, X, \lambda) = \mathop{\mathrm{arg\,min}}_{\beta \in \mathbb{R}^{p \times 1}} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1, \tag{1}
\]
where λ > 0 is a fixed constant, X ∈ R^{n×p} is a fixed matrix of predictor variables, y ∈ R^n is a vector of observations, and βˆL(y, X, λ) ∈ R^{p×1} is our estimated vector of coefficients. Due to the ‖β‖₁ regularization term, we know that βˆ is a vector where some coefficients are set to zero. Note that this estimate depends on y, X, and λ. There are many scenarios in practice where we want to perform this kind of lasso regression while controlling for some covariates, such as demographics. In other words, there is a set of predictors that we want to keep in the model, and we do not want to penalize the corresponding estimated coefficients. In this problem we will reduce this problem to the original lasso problem above. Specifically, imagine that we want to keep the first p1 predictors in the model and we want to select a subset of the remaining p2 = p − p1 predictors. In that vein, we want to solve the following problem
\[
(\hat{\beta}_1, \hat{\beta}_2) = \mathop{\mathrm{arg\,min}}_{(\beta_1, \beta_2)} \; \|y - X_1\beta_1 - X_2\beta_2\|_2^2 + \lambda \|\beta_2\|_1, \tag{2}
\]
where X1 ∈ R^{n×p1} is composed of the first p1 predictors, X2 ∈ R^{n×p2} is composed of the remaining p2 = p − p1 predictors, βˆ1 ∈ R^{p1×1} is the estimated coefficient vector of the first p1 predictors, and βˆ2 ∈ R^{p2×1} is the estimated coefficient vector of the remaining p2 predictors. Our objective is to reduce the problem presented in (2) to the problem presented in (1).
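As an aside on how this is handled in practice, the glmnet package supports unpenalized covariates via its penalty.factor argument, which sets the penalty weight to zero for coefficients that should never be shrunk (glmnet uses its own internal scaling, so this is an illustration of the idea rather than an exact match to equation (2)). A minimal sketch on simulated data; the dimensions and variable names are illustrative only:

library(glmnet)

set.seed(1)
n <- 100; p1 <- 2; p2 <- 20
X <- matrix(rnorm(n * (p1 + p2)), n, p1 + p2)
y <- X[, 1] + 0.5 * X[, 3] + rnorm(n)

# Penalty factor 0 for the first p1 columns (kept unpenalized), 1 for the remaining p2
fit <- glmnet(X, y, alpha = 1, penalty.factor = c(rep(0, p1), rep(1, p2)))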
a. Imagine βˆ2 is given. Using the least squares approach find βˆ1 in terms of X1, X2, y, βˆ2.
b. Recall that the hat matrix corresponding to the columns of X1 is given by H1 = X1(X1^T X1)^{-1} X1^T. By proving that for any θ ∈ R^{p1×1} we have H1 X1 θ = X1 θ, show that H1 is a projection onto the subspace spanned by the columns of X1.
c. Rewrite βˆ1 in terms of H1, X2, y, βˆ2.
d. Substitute what you found for βˆ1 into the problem presented in (2) and simplify it so that it resembles the lasso problem presented in (1). In other words, show that for some ỹ and X̃, βˆ2 can be written as βˆL(ỹ, X̃, λ). Write X̃ in terms of H1 and X2. Write ỹ in terms of y and H1.
e. In words, interpret the way ỹ depends on y and H1, and the way X̃ depends on H1 and X2.
3. (7 points) Suppose we observe a vector y ∈ R^n of observations from the model
\[
y = X\beta^* + \varepsilon,
\]
where X ∈ R^{n×p} is a fixed matrix of predictor variables, β* ∈ R^p is the true unknown coefficient vector that we would like to learn, and ε ∈ R^n is a random error vector with independently and identically distributed entries, with
\[
\mathbb{E}[\varepsilon] = 0, \qquad \mathrm{Cov}(\varepsilon) = \sigma^2 I.
\]
In other words, we can say
\[
y_i = x_i^\top \beta^* + \varepsilon_i,
\]
for i = 1, ..., n, where x_i^T is the ith row of X, E[ε_i] = 0, and Var[ε_i] = σ². Ridge regression is a modified version of least squares, especially useful for p > n where no unique least squares estimate exists. Ridge regression solves a penalized least squares problem,
\[
\hat{\beta}_\lambda = \mathop{\mathrm{arg\,min}}_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2, \tag{3}
\]
where λ > 0 is a fixed constant.
a. Show that βˆλ is simply the vector of linear regression coefficients from regressing the response
\[
\tilde{y} = \begin{pmatrix} y \\ 0 \end{pmatrix} \in \mathbb{R}^{n+p}
\]
onto the predictor matrix
\[
\tilde{X} = \begin{pmatrix} X \\ \sqrt{\lambda}\, I \end{pmatrix} \in \mathbb{R}^{(n+p) \times p},
\]
where 0 ∈ R^p and I ∈ R^{p×p} is the identity matrix.
b. Show that the matrix X̃ always has full column rank, i.e., its columns are always linearly independent, regardless of the columns of X. Hence argue that the ridge regression estimate is always unique, for any predictor matrix X.
c. Write out an explicit formula for βˆλ involving X, y, and λ.
d. Show that βˆλ is a biased estimate of β*, for any λ > 0. Write the bias, biasλ = β* − E[βˆλ], in terms of λ, σ, X, and β*.
e. Find biasλ for λ = 0 and λ = ∞. In a sentence (or two), interpret how the bias is related to λ.
f. Find varλ = Σ_{j=1}^{p} Var(βˆ_j) in terms of X, σ, and λ.
g. Find varλ for λ = 0 and λ = ∞. In a sentence (or two), interpret how the variance is related to λ.
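For intuition, a quick numerical check of the augmented-data view in part a: an ordinary least squares fit on (ỹ, X̃) should agree with the closed-form ridge solution that the parts above ask you to derive. The simulated data below are purely illustrative:

set.seed(2)
n <- 50; p <- 5; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n)

# Augmented response and predictor matrix: ytilde = (y, 0), Xtilde = (X, sqrt(lambda) I)
y.tilde <- c(y, rep(0, p))
X.tilde <- rbind(X, sqrt(lambda) * diag(p))

beta.aug   <- coef(lm(y.tilde ~ X.tilde - 1))                   # OLS on augmented data
beta.ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)  # closed-form ridge estimate
max(abs(beta.aug - beta.ridge))                                 # should be essentially zero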
4. (points) Let y ∈ {0,1} be a Bernoulli(q) random variable, and let x ∈ R^p follow a Gaussian mixture: if y = 1, then x ∼ N(μ1, σ²I), and if y = 0, then x ∼ N(μ0, σ²I), where I ∈ R^{p×p} is the identity matrix, and μ0 ∈ R^p and μ1 ∈ R^p are the two mean vectors.
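For concreteness, a minimal sketch of simulating from this generative model (the parameter values below are arbitrary and only illustrative):

set.seed(3)
n <- 500; p <- 2; q <- 0.3; sigma <- 1
mu0 <- c(-1, 0); mu1 <- c(1, 1)

y <- rbinom(n, size = 1, prob = q)          # y ~ Bernoulli(q)
x <- t(sapply(y, function(yi) {
  mu <- if (yi == 1) mu1 else mu0
  rnorm(p, mean = mu, sd = sigma)           # x | y ~ N(mu_y, sigma^2 I)
}))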
a. Find Pr[y = 1|x] in terms of the parameters of the prior and the likelihood described above.
b. Find the set of x such that Pr[y = 0|x] = Pr[y = 1|x] = 0.5. This is the optimal Bayes decision boundary. Show that this boundary is a hyperplane.
c. Show that for ‖μ0‖₂ = ‖μ1‖₂ and q = 0.5, the set of x such that Pr[y = 0|x] = Pr[y = 1|x] is a hyperplane passing through the origin.
d. Show that for a properly selected vector β ∈ R^p and scalar β0 ∈ R,
\[
\Pr[y = 1 \mid x] = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}}.
\]
Find β0 and β in terms of q, σ², μ0, and μ1.