
MAST90138 Week 7 Lab
Problems:
Consider iid data $(X_1, Z_1), \ldots, (X_n, Z_n)$, where $X_i \in \mathbb{R}^p$ and $Z_i \in \mathbb{R}$, coming from the linear regression model
$$E(Z_i \mid X_i = x) = \alpha + \beta^T x,$$
where $\alpha \in \mathbb{R}$ and $\beta \in \mathbb{R}^p$ are unknown parameters.
1. As seen in last week's tutorial, in the standard linear regression model, to predict a new value $Z_{new}$ from a new value $X_{new}$, we can take
$$\hat{Z}_{new} = \bar{Z} + \hat{\beta}^T (X_{new} - \bar{X}), \tag{1}$$
where
$$\hat{\beta} = (X_C^T X_C)^{-1} X_C^T Z_C$$
is the LS estimator of $\beta$ in the model for centered data
$$Z_{C,i} = \beta^T X_{C,i} + \varepsilon_i.$$
2. Also from last time, when the data are centered, we have
$$\beta^T X_C \approx \beta_{PC}^T Y_{(1)},$$
where $Y_{(1)} = (Y_1, \ldots, Y_q)^T$ is the vector of the first $q$ PCs of $X$ (recall that $Y_{(1)}$ is centered); a short derivation of this approximation is sketched after question 4. This suggests that a good prediction can also be obtained by taking
$$\hat{Z}_{new,PC} = \bar{Z} + \hat{\beta}_{PC}^T Y_{(1),new},$$
where $\hat{\beta}_{PC}$ is the LS estimator of $\beta_{PC}$ and
$$Y_{(1),new} = \Gamma_{(1)}^T (X_{new} - \bar{X}),$$
with $\Gamma_{(1)} = (\gamma_1, \ldots, \gamma_q)$.
The data Xtrainphoneme and Ztrainphoneme on LMS come from a linear model as above, with p = 256 and n = 300. These are in fact the phoneme data used in class to illustrate PLS and PCA regression. Predict the value of Z for the 1417 new X observations in Xtestphoneme, using the LS predictor for the standard linear regression with all the components of X, and using the PC predictor for q = 2, ..., 20 [recall that last time we only fitted the model for q = 2, ..., 10]. The true values of Z for those test data are in the file Ztestphoneme. For each predictor computed above, compare your predicted values with the true values using boxplots of the log of the squared prediction errors. Also, for the standard LS estimator and for the best of the PC predictors, plot the predicted values against the true values of Z. What can you conclude?
3. In the previous question you could see which q worked best because I gave you the true values of the variable you set out to predict. In real life, when you need to predict a variable, you do not actually observe it, so you have to choose q without knowing those values. One way to do this is the cross-validation method seen in class. Find the q between 2 and 20 that minimises the cross-validation criterion (this requires you to write code for computing the cross-validation criterion). Then compute the predicted values of Z as in the previous question, but this time using the value of q you just found.
4. Do the same as in question 3 but instead of choosing q that minimises CV, choose q that minimises the residual sum of squares. What happens?
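To see where the approximation in question 2 comes from, here is a short derivation, using the fact that the matrix $\Gamma = (\gamma_1, \ldots, \gamma_p)$ of all $p$ eigenvectors is orthogonal, so that $X_{C,i} = \Gamma Y_i$ with $Y_i$ the vector of all $p$ PCs:
$$\beta^T X_{C,i} = (\Gamma^T \beta)^T Y_i \approx (\Gamma_{(1)}^T \beta)^T Y_{(1),i} = \beta_{PC}^T Y_{(1),i},$$
where the approximation keeps only the first $q$ PCs, which carry most of the variability of $X$; in particular $\beta_{PC} = \Gamma_{(1)}^T \beta$.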

Solution to question 2:
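The code below assumes the training data (X, Z) and the test data (XXnew, Znew) have already been loaded into R. A minimal sketch of how one might do this, assuming the LMS files are plain-text tables (the file names and format used here are assumptions):
X=as.matrix(read.table("Xtrainphoneme.txt"))
Z=as.numeric(unlist(read.table("Ztrainphoneme.txt")))
XXnew=as.matrix(read.table("Xtestphoneme.txt"))
Znew=as.numeric(unlist(read.table("Ztestphoneme.txt")))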
n=300      # training sample size
nnew=1417  # test sample size
Xcent=scale(X,scale=F)  # centre the columns of X
Zcent=Z-mean(Z)         # centre Z
#1 Prediction by LS
betahat= solve(t(Xcent)%*%Xcent)%*%t(Xcent)%*%Zcent  # LS estimator from the centred data
repbarX=matrix(rep(colMeans(X),nnew),nrow=nnew,byrow=T)  # each row equals colMeans(X)
ZpredLS=mean(Z)+ (XXnew-repbarX)%*%betahat  # predictor (1)
#2 Prediction by PC
compo=prcomp(X,retx=T)  # PCA of X (prcomp centres the data by default)
phi=(compo$rotation)    # matrix of eigenvectors
Y=compo$x               # matrix of (centred) PC scores
Zpred=c()
YNew=(XXnew-repbarX)%*%phi  # PCs of the new observations (computed once, outside the loop)
for(q in 2:20)
{
YDATA=Y[,1:q]
betahatPC= solve(t(YDATA)%*%YDATA)%*%t(YDATA)%*%Zcent  # LS estimator on the first q PCs
Zpred=cbind(Zpred,mean(Z)+YNew[,1:q]%*%betahatPC)      # PC predictor for this q
}
#3 Plot prediction errors
log_pred_error_list = list() # list for log of squared prediction errors
pred_error_list = list() # list for just prediction errors
for (i in 1:19){  # column i of Zpred corresponds to q = i+1
pred_error_list[[i]] =(Zpred[,i]-Znew)
log_pred_error_list[[i]] = log((Zpred[,i]-Znew)^2)
}
log_pred_error_list[[20]] = log((ZpredLS-Znew)^2)  # entry 20: standard LS predictor
pred_error_list[[20]] = (ZpredLS-Znew)
boxplot(log_pred_error_list)
boxplot(pred_error_list)
## From the boxplots one can see that the 13th model (q = 14) has the lowest prediction errors;
## one can also get a sense of this by computing the variance of the prediction errors, as below.
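# One possible numeric summary (an optional check, not part of the required answer):
# the variance of each set of prediction errors; entries 1-19 are the PC predictors
# with q = 2,...,20, and entry 20 is the standard LS predictor
sapply(pred_error_list, function(e) var(as.numeric(e)))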
plot(Znew,ZpredLS)    # LS predictions against true values
abline(0,1)           # reference line: predicted = true
plot(Znew,Zpred[,13]) # best PC predictions (q = 14) against true values
abline(0,1)
We conclude that the PC predictor works better than the standard LS predictor, especially if q is well chosen.

Solution to question 3:
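The criterion computed below is the leave-one-out cross-validation criterion for the PC regression with $q$ components,
$$CV(q) = \sum_{i=1}^{n} \left( Z_{C,i} - \hat{\beta}_{PC}^{(-i)T} Y_{(1),i} \right)^2,$$
where $\hat{\beta}_{PC}^{(-i)}$ is the LS estimator computed from the sample with the $i$th observation removed.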
#Choose q by leave-one-out cross-validation
CV=rep(0,20)
for(k in 1:20)
{
for(i in 1:n)
{
YDATACV=Y[-i,1:k]  # first k PCs without observation i
ZcentCV=Zcent[-i]  # centred Z without observation i
betahatCV= solve(t(YDATACV)%*%YDATACV)%*%t(YDATACV)%*%ZcentCV
CV[k]=CV[k]+(Zcent[i]-Y[i,1:k]%*%betahatCV)^2  # squared error at the left-out observation
}
}
#We only consider q=2,…,20 so we remove the first element of the vector CV
CV=CV[-1]
#Find the argmin
qCV=which(CV==min(CV))+1
#Then compute predicted values
YDATA=Y[,1:qCV]
betahatPC= solve(t(YDATA)%*%YDATA)%*%t(YDATA)%*%Zcent
YNew=(XXnew-repbarX)%*%phi
ZpredCV=mean(Z)+YNew[,1:qCV]%*%betahatPC
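A quick visual check of the CV criterion (optional, not part of the required answer):
plot(2:20,CV,type="b",xlab="q",ylab="CV criterion")  # CV[j] corresponds to q = j+1
abline(v=qCV,lty=2)  # the chosen q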
Solution to question 4:
#Choose q by minimising the residual sum of squares (RSS):
RSS=rep(0,20)
for(k in 1:20)
{
YDATA=Y[,1:k]
betahat= solve(t(YDATA)%*%YDATA)%*%t(YDATA)%*%Zcent  # LS fit on the first k PCs
for(i in 1:n)
{
RSS[k]=RSS[k]+(Zcent[i]-Y[i,1:k]%*%betahat)^2  # in-sample squared residual
}
}
RSS=RSS[-1]
qRSS=which(RSS==min(RSS))+1
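To see what is going on, one can plot the RSS against q (a quick check; for nested LS fits the RSS can only decrease as components are added):
plot(2:20,RSS,type="b",xlab="q",ylab="RSS")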
If we choose q by minimising the RSS, we find too large a q (overfitting: this criterion will always give us the largest q. Why?).