MAST90138 Week 10 Lab Problems:
1. Recall the iris data, which contain various measurements (sepal length, sepal width, petal length and petal width) of 50 flowers from each of 3 species of iris flowers (thus the data contain 150 "individuals"). Load the iris data in R by typing data(iris).
(a) Create from those data two samples (one training sample of size n = 120 and one test sample of size n = 30) by drawing randomly and without replacement 120 individuals that you assign to the training sample, while assigning the remaining 30 individuals to the test sample. The functions sample and setdiff will help you do this. Sample each individual with equal probability.
TrainIndices=sample(1:150, 120)
TestIndices=setdiff(1:150,TrainIndices)
Xtrain=iris[TrainIndices,]
Xtest=iris[TestIndices,]
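For reproducibility you may fix the random seed before calling sample (the seed value below is arbitrary and not part of the question), and it is worth checking how the three species are spread over the two samples:
# set.seed(2023) # uncomment and run before sample() to make the split reproducible
table(Xtrain$Species) # species counts in the training sample
table(Xtest$Species) # species counts in the test sample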
(b) Use the package rpart to create a classification tree from the iris data, where the goal is to classify an iris flower to its iris variety, using the four explanatory variables. Apart from the argument that tells rpart to do a classification, keep the default values of all other parameters. Then, plot your tree using the command rpart.plot from the package rpart.plot and examine the output by changing the value of type (try 1, 2, 3 and 4). Also, try the arguments under = T or F and clip.right.labs = T or F.
library(rpart)
library(rpart.plot)
# Create a decision tree model
tree <- rpart(Species ~ ., data=Xtrain, method="class")
# Visualize the decision tree with rpart.plot
rpart.plot(tree, clip.right.labs = FALSE,type=1,under=T)
rpart.plot(tree, clip.right.labs = FALSE,type=2,under=T)
rpart.plot(tree, clip.right.labs = FALSE,type=3,under=T)
rpart.plot(tree, clip.right.labs = FALSE,type=4,under=T)
rpart.plot(tree, clip.right.labs = FALSE,type=4,under=F)
rpart.plot(tree,clip.right.labs =T,type=4,under=F)
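Besides the plots, it can be instructive to look at the fitted tree in text form (not required by the question; this simply calls rpart's print method, which lists the splits, node counts and class probabilities):
print(tree)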
# Class proportions in the training sample (compare with the proportions shown at the root node of the tree)
table(Xtrain$Species)/length(Xtrain$Species)
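Although part (b) does not ask for it, you may also apply the tree to the test sample and compute its misclassification rate, to compare later with the random forest of part (c); a minimal sketch:
TreePred=predict(tree, newdata=Xtest, type="class")
mean(TreePred!=Xtest$Species) # proportion of misclassified test observations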
(c) Using the training data, compute a random forest classifier using the function randomForest from the package randomForest, where the goal is to classify an iris flower to its iris variety, using the four explanatory variables. Use B = 1000 trees and keep the default values of other parameters. Compute the variable importance using the two measures seen in class and represent them graphically. Compute the OOB class predictors of the training data. Finally, apply the classifier to the test data and compute the test error.
library(randomForest)
m.rf <- randomForest(formula=Species ~ ., data=Xtrain, ntree=1000, importance=TRUE)
varImpPlot(m.rf)
importance(m.rf)
# note that these values will change if we run the RF again
# the scaled mean decrease in accuracy can also be computed by hand:
m.rf$importance[, 4] / m.rf$importanceSD[, 4]
# this should be the same as the fourth column of importance(m.rf)
# OOB class predictions of the training data:
OOBpred=predict(m.rf)
# Class predictions for the test data:
ClassPred=predict(m.rf, newdata=Xtest)
# Number of misclassified test observations and the test error rate
sum(ClassPred != Xtest$Species)
mean(ClassPred != Xtest$Species)
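The OOB predictions can be compared with the true training labels to obtain an OOB estimate of the classification error; a minimal sketch (the value should agree with the OOB error rate stored in the fitted object):
table(OOBpred, Xtrain$Species) # OOB confusion matrix
mean(OOBpred != Xtrain$Species) # OOB error estimate
m.rf$err.rate[m.rf$ntree, "OOB"] # OOB error rate reported by randomForest after all 1000 trees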
2. Take the data Xtrainphoneme and Ztrainphoneme from week 7. You can compute the PLS components of X using the function plsr from the package pls. Compute the PLS components of Xtrainphoneme. The matrix Φ seen in class is obtained from the projection output of the plsr function. The PLS components given in the scores output are computed from the centered data. Check this by comparing these scores with the PLS components you compute yourself using the original data, centered, and the projection matrix.
Xtrain=read.table(file="Xtrainphoneme.txt",sep=",")
Xtrain=Xtrain[,1:256]
Ztrain=scan(file="Ztrainphoneme.txt",sep=",")
XZtrain=cbind(Xtrain,Ztrain)
library(pls)
PLX=plsr(Ztrain~.,data=XZtrain)
#PLS variables
PLStrain=PLX$scores
#Compare with the result obtained by computing the scores ourselves
#from the centered data and the projection matrix:
PLScomp=as.matrix(scale(Xtrain, center = T, scale = F))%*%PLX$projection
PLScomp-PLStrain
max(abs(PLScomp-PLStrain)) # should be numerically zero
# may also check how the first column of the projection matrix is computed
## See the updated part in Week 6A slides regarding PLS
Xtrain_center = scale(Xtrain, center = T, scale = F) # column-center X
Ztrain_center = Ztrain - mean(Ztrain) # center Z
proj_vec_1_unnorm = t(Xtrain_center)%*%Ztrain_center
proj_vec_1_unnorm_length = sqrt(sum(proj_vec_1_unnorm^2))
# after normalization, this vector should be the same as the first column of PLX$projection
PLX$projection[, 1] - as.numeric(proj_vec_1_unnorm/proj_vec_1_unnorm_length)
head(PLX$projection[, 1])
head(proj_vec_1_unnorm/proj_vec_1_unnorm_length)
# they are the same
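Putting the two checks together, the first PLS component (the first column of the scores) can be recovered directly from the centered data and the normalized weight vector; a minimal sketch:
score1 = Xtrain_center %*% (proj_vec_1_unnorm/proj_vec_1_unnorm_length)
head(cbind(score1, PLStrain[, 1])) # the two columns should coincide, up to numerical rounding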