(1) (18 points) For the questions below, give 2-5 sentences explaining your answer and rationale.
(i) (3 points) Describe a situation in which we would expect principal components regression to outperform the lasso.
(ii) (3 points) Is PCA always helpful as a dimension reduction tool, or are there situations in which it does not provide benefit? Either explain why it is always beneficial, or provide a situation in which it would not be.


(iii) (3 points) Suppose I use an SVM with a radial kernel for a number of γ values, and obtain the error rates on the training data given in the table below. Do you think that overfitting is occurring?
training data error rates:

γ = 0.01: 0.40
γ = 0.1:  0.37
γ = 1:    0.31
γ = 5:    0.25
γ = 10:   0.21
(iv) (3 points) Describe one situation where partial least squares should outperform principal components regression and explain why it will work better in that scenario.
(v) (3 points) Suppose I do not care about minimizing MSE. Instead my goal is to minimize P(|Y − Ŷ| > 2). How would I choose tuning parameters or select the model that gives me the best performance with respect to this metric?
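As an illustration of how such a criterion can drive model selection (a sketch only, not part of the question: the data, the ridge-style model, and the λ grid below are all hypothetical), one can simply replace MSE with the empirical exceedance rate on held-out data:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data set and model; only the selection criterion matters here
n, p = 200, 5
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0, 1.0, 0.0, 0.0])
y = X @ beta + rng.normal(scale=2.0, size=n)
train, val = slice(0, 150), slice(150, n)

def ridge_fit(X, y, lam):
    # closed-form ridge solution (X'X + lam I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
# score each candidate by the empirical P(|Y - Yhat| > 2), not by MSE
rates = [np.mean(np.abs(y[val] - X[val] @ ridge_fit(X[train], y[train], lam)) > 2)
         for lam in lambdas]
best = lambdas[int(np.argmin(rates))]
print(best, rates)
```

In practice one would average this exceedance rate over cross-validation folds rather than a single split.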
(vi) (3 points) Suppose I am interested in doing classification where the outcome can take one of two categories, i.e. Y ∈ {1, 2}. Further suppose we only have one covariate that has the same variance when both Y = 1 and Y = 2, and the distribution of this covariate within each of the two groups is plotted below:

[Figure: within-group densities of the covariate for Y = 1 and Y = 2, each plotted over the range −2 to 2.]
Do you expect linear discriminant analysis to work well in terms of classification error for this particular data? Explain why or why not.
(2) (36 points) In this question we are going to investigate the use of the bootstrap. Suppose we have independent and identically distributed data Xi for i = 1, . . . , n, such that E(Xi) = μ and Var(Xi) = σ2. Suppose that I want to perform inference on f(μ) for some function f(·).
(i) (4 points) Give two reasons why the bootstrap would be useful for this particular problem.
(ii) (2 points) Would you choose to use the nonparametric bootstrap or the parametric bootstrap here? Explain your decision.
For the remaining parts of this question, assume that we are interested in performing inference on μ², i.e. f(μ) = μ². Further, assume that the data are normally distributed so that Xi ∼ N(μ, σ²) and σ² is known. We will utilize X̄² as our estimator of the unknown μ², where X̄ = (1/n) Σ_{i=1}^n Xi.
(iii) (5 points) Is X̄² an unbiased estimator of μ²? Prove why or why not.
(iv) (3 points) Calculate Var(X̄²). [Note that the table in section 2.2 of the following link may be useful to you: https://en.wikipedia.org/wiki/Normal_distribution]
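As a sanity-check device (a sketch, not part of the question), a candidate closed form can be compared against simulation. If the derivation via the normal moment table yields Var(X̄²) = 4μ²σ²/n + 2σ⁴/n² (using X̄ ∼ N(μ, σ²/n)), a quick Monte Carlo check looks like:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n = -10.0, 50.0, 20   # illustrative parameter choices
reps = 200_000

# X̄ ~ N(mu, sigma^2 / n), so simulate it directly and look at Var(X̄²)
xbar = rng.normal(mu, np.sqrt(sigma2 / n), size=reps)
mc_var = np.var(xbar ** 2)

# candidate closed form from the normal moment identities
closed_form = 4 * mu**2 * sigma2 / n + 2 * (sigma2 / n) ** 2
print(mc_var, closed_form)
```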
Now let’s utilize the parametric bootstrap to try and estimate the variability in our estimator of μ². As a reminder, the parametric bootstrap first estimates the parameters that govern the distribution of the data, and then generates new data sets from that distribution. So in this case, we assume the data come from a normal distribution with known σ². We then estimate the unknown μ from the data using μ̂ = X̄, and generate new (bootstrap) data sets from this distribution. We can refer to the bth bootstrap data set as Xi^(b) for i = 1, . . . , n. Further, we can refer to the bth bootstrap estimate of the mean as X̄^(b).
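The procedure just described can be sketched directly (illustrative values of μ, σ², n, and B; in R, rnorm would play the role of the normal draws below):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, B = -10.0, 50.0, 20, 50_000   # illustrative settings; sigma^2 known

X = rng.normal(mu, np.sqrt(sigma2), size=n)   # the original sample
mu_hat = X.mean()                             # estimate mu by X̄

# parametric bootstrap: B new data sets from N(mu_hat, sigma^2), each of size n
boot_means = rng.normal(mu_hat, np.sqrt(sigma2), size=(B, n)).mean(axis=1)
boot_var = np.var(boot_means ** 2)            # estimates Var((X̄^(b))² | X)
print(mu_hat, boot_var)
```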
(v) (8 points) Let X = [X1, . . . , Xn] represent the original sample we are given. Utilizing the parametric bootstrap above, what is Var(f(X̄^(b)) | X) = Var((X̄^(b))² | X)?
(vi) (4 points) What is the expected value of the expression in (v)? In other words, calculate E[Var((X̄^(b))² | X)].
Note that if you can’t solve this expectation or did not get the answer to (v), you can receive partial credit for telling me what the expectation is with respect to, i.e. what is the random variable inside of the expectation.

For this last part, we will use simulation to confirm our results from the previous parts.
(vii) (8 points) Suppose that the normality assumption is true and Xi ∼ N(μ, σ²) with μ = −10 and σ² = 50. For sample sizes of n ∈ {5, 10, 15, 20, 25}, calculate the following values:
1. The true variance of X̄² given in part (iv)
2. The true average variance estimate from the bootstrap that is given in part (vi)
3. The empirical variance of X̄² from your simulated data sets
4. The average (over many simulated data sets) estimate of the variance from the parametric bootstrap
5. The average (over many simulated data sets) estimate of the variance from the nonparametric bootstrap
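A condensed sketch of this comparison for a single sample size (extend the loop over n ∈ {5, 10, 15, 20, 25} and plot for the full answer; the two closed forms below are candidates for what parts (iv) and (vi) should produce, so treat them as assumptions if your derivations differ):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2 = -10.0, 50.0
sigma = np.sqrt(sigma2)
n, sims, B = 20, 500, 500   # illustrative sizes; increase for smoother estimates

est, para, nonpara = [], [], []
for _ in range(sims):
    X = rng.normal(mu, sigma, size=n)
    est.append(X.mean() ** 2)
    # parametric bootstrap: resample from N(X̄, sigma^2)
    pb = rng.normal(X.mean(), sigma, size=(B, n)).mean(axis=1) ** 2
    para.append(np.var(pb))
    # nonparametric bootstrap: resample the observed data with replacement
    idx = rng.integers(0, n, size=(B, n))
    nonpara.append(np.var(X[idx].mean(axis=1) ** 2))

true_var = 4 * mu**2 * sigma2 / n + 2 * (sigma2 / n) ** 2   # candidate for (iv)
avg_boot = 4 * mu**2 * sigma2 / n + 6 * (sigma2 / n) ** 2   # candidate for (vi)
print(true_var, avg_boot, np.var(est), np.mean(para), np.mean(nonpara))
```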
(viii) (2 points) Repeat the previous question, but now generate your data from a gamma distribution with parameters (α = 0.1, β = 0.001) that is scaled to have the same mean and variance. Note that for gamma distributed data, we do not have the true variance or true average bootstrap variance, so you can simply estimate quantities 3-5 above. Code to generate gamma distributed variables with such a mean and variance is given below:
n = 100              # sample size (set as needed)
mu = -10             # target mean
sigma = sqrt(50)     # target standard deviation
alpha = 0.1
beta = 0.001
# center the gamma draws, then rescale to mean mu and variance sigma^2
X = (rgamma(n, alpha, beta) - alpha/beta) * sigma / sqrt(alpha / beta^2) + mu
Your answer for the previous two parts should be a plot with the sample size on the x-axis and lines for each value (five lines for part (vii) and three lines for part (viii)). Note that you can still complete these two parts even if you did not complete any of the earlier parts of the question! If you did not get the answer to parts (iv) or (vi), you will still receive full credit if you correctly calculate the other three quantities that rely solely on simulation and not on the derivations above.
(3) (22 points) For this problem, read in the data sets titled problem3training.csv and problem3testing.csv. All models should be fit only on the first of these two data sets (training data) and the testing data should only be used if I ask you to evaluate the predictive performance of a fitted model on testing data. In both data sets there is an outcome Y that is binary and a set of continuous predictors (X1, X2, . . . , Xp).
(i) (3 points) Fit a probit regression model of the form
P(Y = 1 | X = x) = Φ(β0 + Σ_{j=1}^p xj βj)

and calculate the classification error rate on the testing data.
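In R this is glm(Y ~ ., family = binomial(link = "probit"), data = train); since the actual CSV layout is unknown here, the sketch below fits the same probit likelihood by hand on synthetic stand-in data:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
# synthetic stand-in for problem3training.csv; its real columns are assumed
n, p = 500, 3
X = rng.normal(size=(n, p))
Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept
beta_true = np.array([0.5, 1.0, -1.0, 0.5])
y = (rng.uniform(size=n) < norm.cdf(Xd @ beta_true)).astype(float)

def negloglik(beta):
    # probit log-likelihood: P(Y = 1 | x) = Phi(x'beta)
    pr = np.clip(norm.cdf(Xd @ beta), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(pr) + (1 - y) * np.log(1 - pr))

fit = minimize(negloglik, np.zeros(p + 1), method="BFGS")
pred = (norm.cdf(Xd @ fit.x) > 0.5).astype(float)
err = np.mean(pred != y)   # training error here; use the testing file in the real answer
print(fit.x, err)
```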
(ii) (6 points) Using the probit regression model, estimate the following odds ratio and provide a 95% confidence interval for it.
θ = [P(Y = 1 | X1 = 1, X2 = 0, . . . , Xp = 0) · P(Y = 0 | X1 = 0, X2 = 0, . . . , Xp = 0)] / [P(Y = 0 | X1 = 1, X2 = 0, . . . , Xp = 0) · P(Y = 1 | X1 = 0, X2 = 0, . . . , Xp = 0)]
(iii) (5 points) Suppose you don't know whether to include the covariates linearly or with quadratic terms. Explain two ways to evaluate this using the data, and implement one of your two approaches.
(iv) (4 points) Now use a support vector machine with a radial kernel. What tuning parameters do you need to choose and how do you choose them? What is the classification error on the testing data set?

(v) (4 points) Make a plot similar to the plot from problem 2 (i) on homework 2 that illustrates the decision boundary for covariates X1 and X2 for your support vector machine. To make the plot, fix the remaining covariates (X3 through Xp) to zero.
(4) (24 points) For this problem, read in the data sets titled problem4training.csv and problem4testing.csv. All models should be fit only on the first of these two data sets (training data) and the testing data should only be used if I ask you to evaluate the predictive performance of a model on testing data. In both data sets there is an outcome Y that is continuous and a set of continuous predictors (X1, X2, . . . , Xp).
(i) (3 points) Do you think that penalized approaches are useful for this data set? Explain why or why not.
(ii) (4 points) Fit both the ridge regression and lasso regression models using the training data, and make predictions on the testing data set. What are the prediction errors (e.g., test MSE) of the two models?
(iii) (4 points) Can you use cross-validation to choose between ridge regression and lasso regres- sion using the training data? If yes, explain how and implement it. If not, explain why you can not do this.
(iv) (6 points) Suppose now that you are interested in identifying predictors in (X1, X2, …, Xp) that are associated with the outcome. I want you to identify the important variables using three different variable selection techniques. These must be three distinct approaches, not three variations of the same approach. Discuss the pros and cons of each approach, and interpret your findings on the main data set.
(v) (7 points) Suppose you fit the ridge regression estimator that is given by
β̂ = argmin_β { Σ_{i=1}^n (Yi − Xiβ)² + λ Σ_{j=1}^p βj² }.
Suppose that there is no intercept and that Xi contains only the p predictors for observation i. Further assume that we use λ = 2.5, and that the residual variance σ2 = 1, since we did not discuss how to estimate this. We are interested in specifically testing the null hypothesis
H0: β1 = 0 vs. Ha: β1 ≠ 0
using this ridge regression model. Propose a hypothesis test for this null hypothesis and implement it on the training data. For this question, you cannot use the bootstrap to help you perform inference. You may assume that the ridge regression estimator is normally distributed. I also want you to program the ridge regression solution manually here, and therefore you cannot use the glmnet package or any other software.
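One way to carry this out (sketched on synthetic stand-in data, since the real file layout is unknown): compute the closed-form ridge solution β̂ = (XᵀX + λI)⁻¹XᵀY, use the sampling covariance Var(β̂) = σ²(XᵀX + λI)⁻¹XᵀX(XᵀX + λI)⁻¹ implied by the normality assumption, and form a Wald-type z statistic for β1. Note that ridge shrinkage biases β̂1 toward zero, so this test is approximate:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
# synthetic stand-in for problem4training.csv; real columns are assumed
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, 0.5, 0.0]) + rng.normal(size=n)  # sigma^2 = 1

lam, sigma2 = 2.5, 1.0
A = np.linalg.inv(X.T @ X + lam * np.eye(p))
beta_hat = A @ X.T @ y                    # manual ridge solution, no intercept
cov = sigma2 * A @ X.T @ X @ A            # covariance of beta_hat given fixed X
z = beta_hat[0] / np.sqrt(cov[0, 0])      # Wald statistic for H0: beta_1 = 0
pval = 2 * norm.sf(abs(z))
print(beta_hat[0], z, pval)
```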
