
Model Selection
Mariia Okuneva
In the second tutorial we discussed linear models and estimated them with least squares. Despite its simplicity, the linear model has a clear advantage in terms of interpretability, and it often shows good predictive performance.
Today we try to improve on linear models by selecting a subset of the coefficients or by shrinking them, which makes the model more interpretable and, in some cases, improves prediction.
1. Subset Selection: Best Subset Selection
We search among all possible combinations of features for the subset that is most predictive.
Here we apply the best subset selection approach to the Hitters data. We wish to predict a baseball player's Salary on the basis of various statistics associated with his performance in the previous year.
library(ISLR)
fix(Hitters)
attach(Hitters)
summary(Hitters)

##      AtBat            Hits           HmRun            Runs
##  Min.   : 16.0   Min.   :  1.0   Min.   : 0.00   Min.   :  0.00
##  1st Qu.:255.2   1st Qu.: 64.0   1st Qu.: 4.00   1st Qu.: 30.25
##  Median :379.5   Median : 96.0   Median : 8.00   Median : 48.00
##  Mean   :380.9   Mean   :101.0   Mean   :10.77   Mean   : 50.91
##  3rd Qu.:512.0   3rd Qu.:137.0   3rd Qu.:16.00   3rd Qu.: 69.00
##  Max.   :687.0   Max.   :238.0   Max.   :40.00   Max.   :130.00
## (remaining columns of the summary omitted; note the 59 NA's reported for Salary)

# is.na() helps to identify the missing observations: it returns a logical vector of the
# same length as its argument. sum() is then used to count all the missing elements.
sum(is.na(Hitters))  # the Salary variable is missing for some of the players

## [1] 59

names(Hitters)

##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"
##  [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"
## [13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"
## [19] "Salary"    "NewLeague"

help(Hitters)

## starting httpd help server ... done

Hitters = na.omit(Hitters)  # remove all of the rows that have missing values in any variable

For best subset selection we are going to use the regsubsets() function (part of the leaps library), so let's load the leaps library.
The regsubsets() function performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS.

#install.packages('leaps')
library('leaps')
regfit.full = regsubsets(Salary ~ ., data = Hitters)
summary(regfit.full)  # outputs the best set of variables for each model size

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters)
## 19 Variables (and intercept)
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## (no variables forced in or out; selection matrix omitted: for each model size
##  from 1 to 8, an asterisk marks the variables included in the best model of that size)
An asterisk indicates that a given variable is included in the corresponding model. By default, regsubsets() reports results only up to the best eight-variable model.
Use the nvmax argument to return as many variables as you want.
regfit.full = regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
summary(regfit.full)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
## 19 Variables (and intercept)
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
## (no variables forced in or out; selection matrix omitted: for each model size
##  from 1 to 19, an asterisk marks the variables included in the best model of that size)

reg.summary = summary(regfit.full)
names(reg.summary) # summary() function returns also R^2, RSS, adjusted R^2, Cp, and BIC.
## [1] “which” “rsq” “rss” “adjr2” “cp” “bic” “outmat” “obj”
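Each of these components can be inspected directly. For example (a small addition, not shown in the original output), the which component is a logical matrix whose m-th row flags the variables contained in the best m-variable model:

reg.summary$which[6, ]   # variables in the best six-variable model (first column is the intercept)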
The R^2 statistic increases monotonically as more variables are included, because it makes no adjustment for the number of variables in the model.

reg.summary$rsq
## [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
## [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
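Before comparing them, a quick reminder of how the selection criteria are defined (standard definitions as in ISLR, added here for reference and stated up to scaling constants that do not change the ranking of models; d is the number of predictors in a model, n the number of observations, and σ̂² an estimate of the error variance):

$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\,d\,\hat\sigma^2\right), \qquad \mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\,\hat\sigma^2\right), \qquad \text{adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}.$$

Cp and BIC are minimized, adjusted R^2 is maximized; because log(n) > 2 whenever n > 7, BIC penalizes model size more heavily and tends to select smaller models, which is exactly what we will see below.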
Let's plot RSS, adjusted R^2, Cp, and BIC for all of the models to select the best one. Which criterion would you use?

plot(reg.summary$adjr2,
     xlab = 'Number of Variables', ylab = 'Adjusted RSq',
     type = 'l')
ar2.max = which.max(reg.summary$adjr2)
# put the point on the plot that is already created with points()
points(ar2.max, reg.summary$adjr2[ar2.max],
       col = 'red', cex = 2, pch = 20)

[Figure: adjusted R^2 against the number of variables, maximum marked in red]

ar2.max

## [1] 11

plot(reg.summary$rss,
     xlab = 'Number of Variables', ylab = 'RSS',
     type = 'l')
rss.min = which.min(reg.summary$rss)
points(rss.min, reg.summary$rss[rss.min],
       col = 'red', cex = 2, pch = 20)

[Figure: RSS against the number of variables, minimum marked in red]

rss.min
## [1] 19
plot(reg.summary$cp,
     xlab = 'Number of Variables', ylab = 'Cp',
     type = 'l')
cp.min = which.min(reg.summary$cp)
points(cp.min, reg.summary$cp[cp.min],
       col = 'red', cex = 2, pch = 20)

[Figure: Cp against the number of variables, minimum marked in red]

cp.min
## [1] 10
plot(reg.summary$bic,
     xlab = 'Number of Variables', ylab = 'bic',
     type = 'l')
bic.min = which.min(reg.summary$bic)
points(bic.min, reg.summary$bic[bic.min],
       col = 'red', cex = 2, pch = 20)

[Figure: BIC against the number of variables, minimum marked in red]

bic.min
## [1] 6
The model with the lowest BIC is the six-variable model. Which regressors?
coef(regfit.full, 6)
##  (Intercept)        AtBat         Hits        Walks         CRBI
##   91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169
##    DivisionW      PutOuts
## -122.9515338    0.2643076
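As an aside (not part of the original code), the leaps package also has a plot() method for regsubsets objects that displays, for every model size, which variables are selected, with the rows ordered by the chosen criterion:

plot(regfit.full, scale = 'bic')     # black squares mark the variables included in each model
plot(regfit.full, scale = 'adjr2')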
Forward and Backward Stepwise Selection
For computational reasons, best subset selection cannot be applied when the number of predictors p is very large. In addition, the enormous search space can lead to overfitting. A possible solution is to use stepwise methods, which explore a far more restricted set of models; a quick comparison of the number of fitted models is sketched below.
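For a rough sense of the difference (an illustrative count, with p = 19 as in the Hitters data):

p = 19
2^p                   # best subset selection examines every subset: 524288 candidate models
1 + p * (p + 1) / 2   # forward stepwise fits only 1 + p + (p-1) + ... + 1 = 191 models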
We can also use the regsubsets() function to perform forward stepwise or backward stepwise selection, using the argument method = “forward” or method = “backward”.
regfit.fwd = regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept)
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## (no variables forced in or out; selection matrix omitted: an asterisk marks the
##  variables included in the best model of each size from 1 to 19)
summary.fwd = summary(regfit.fwd)
ar2.max = which.max(summary.fwd$adjr2)
ar2.max

## [1] 11

max(summary.fwd$adjr2)

## [1] 0.5225706
regfit.bwd = regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
summary(regfit.bwd)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
## 19 Variables (and intercept)
## 1 subsets of each size up to 19
## Selection Algorithm: backward
## (no variables forced in or out; selection matrix omitted)

summary.bwd = summary(regfit.bwd)
ar2.max = which.max(summary.bwd$adjr2)
ar2.max

## [1] 11

max(summary.bwd$adjr2)

## [1] 0.5225706
These procedures may lead to different results. E.g., the best seven-variable models identified by forward stepwise selection, backward stepwise selection, and best subset selection are different:
coef(regfit.full, 7)

##  (Intercept)         Hits        Walks       CAtBat        CHits       CHmRun
##   79.4509472    1.2833513    3.2274264   -0.3752350    1.4957073    1.4420538
##    DivisionW      PutOuts
## -129.9866432    0.2366813

coef(regfit.bwd, 7)

##  (Intercept)        AtBat         Hits        Walks        CRuns       CWalks
##  105.6487488   -1.9762838    6.7574914    6.0558691    1.1293095   -0.7163346
##    DivisionW      PutOuts
## -116.1692169    0.3028847

coef(regfit.fwd, 7)

##  (Intercept)        AtBat         Hits        Walks         CRBI       CWalks
##  109.7873062   -1.9588851    7.4498772    4.9131401    0.8537622   -0.3053070
##    DivisionW      PutOuts
## -127.1223928    0.2533404
In contrast, the eleven-variable models identified by the three procedures coincide:

coef(regfit.full, 11)

##  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns
##  135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914    1.4553310
##         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists
##    0.7852528   -0.8228559   43.1116152 -111.1460252    0.2894087    0.2688277

coef(regfit.bwd, 11)

##  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns
##  135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914    1.4553310
##         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists
##    0.7852528   -0.8228559   43.1116152 -111.1460252    0.2894087    0.2688277

coef(regfit.fwd, 11)

##  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns
##  135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914    1.4553310
##         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists
##    0.7852528   -0.8228559   43.1116152 -111.1460252    0.2894087    0.2688277
We can also implement forward selection ourselves, adding at each step the variable that gives the largest adjusted R^2:

opt.x <- c()                                       # indices of the variables selected so far
x <- Hitters[, !(names(Hitters) %in% c("Salary"))]
y <- Hitters$Salary
concat <- cbind(x, y)
adj.r2 <- matrix(nrow = ncol(x), ncol = ncol(x))
for (j in 1:ncol(x)) {                             # step j of the forward search
  for (i in 1:ncol(x)) {                           # candidate variable i
    if (i %in% opt.x) {
      adj.r2[j, i] <- -1                           # already selected: exclude it
    } else {
      adj.r2[j, i] <- summary(lm(y ~ ., data = concat[, c(opt.x, i, ncol(x) + 1)]))$adj.r.squared
    }
  }
  opt.x <- c(opt.x, which.max(adj.r2[j, ]))        # add the best candidate
}
which(adj.r2 == max(adj.r2), arr.ind = TRUE)

##      row col
## [1,]  11  14

opt.n <- which(adj.r2 == max(adj.r2), arr.ind = TRUE)[1, ][1]   # optimal number of variables
opt.x <- opt.x[1:opt.n]
summary(lm(y ~ ., data = x[, c(opt.x)]))

##
## Call:
## lm(formula = y ~ ., data = x[, c(opt.x)])
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -932.2 -175.4  -29.2  130.4 1897.2
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  135.75122   71.34623   1.903 0.058223 .
## CRBI           0.78525    0.20978   3.743 0.000225 ***
## Hits           6.92370    1.64612   4.206 3.62e-05 ***
## PutOuts        0.28941    0.07478   3.870 0.000139 ***
## DivisionW   -111.14603   39.21835  -2.834 0.004970 **
## AtBat         -2.12775    0.53746  -3.959 9.81e-05 ***
## Walks          5.62028    1.59064   3.533 0.000488 ***
## CWalks        -0.82286    0.26361  -3.121 0.002010 **
## CRuns          1.45533    0.39270   3.706 0.000259 ***
## CAtBat        -0.13899    0.05609  -2.478 0.013870 *
## Assists        0.26883    0.15816   1.700 0.090430 .
## LeagueN       43.11162   39.96612   1.079 0.281755
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 311.7 on 251 degrees of freedom
## Multiple R-squared:  0.5426, Adjusted R-squared:  0.5226
## F-statistic: 27.07 on 11 and 251 DF,  p-value: < 2.2e-16

Wrapping the same procedure in a function:

forward <- function(x, y) {
  concat <- cbind(x, y)
  opt.x <- c()
  adj.r2 <- matrix(nrow = ncol(x), ncol = ncol(x))
  for (j in 1:ncol(x)) {
    for (i in 1:ncol(x)) {
      if (i %in% opt.x) {
        adj.r2[j, i] <- -1
      } else {
        adj.r2[j, i] <- summary(lm(y ~ ., data = concat[, c(opt.x, i, ncol(x) + 1)]))$adj.r.squared
      }
    }
    opt.x <- c(opt.x, which.max(adj.r2[j, ]))
  }
  opt.n <- which(adj.r2 == max(adj.r2), arr.ind = TRUE)[1, ][1]
  opt.x <- opt.x[1:opt.n]
  return(summary(lm(y ~ ., data = x[, c(opt.x)])))
}

x <- Hitters[, !(names(Hitters) %in% c("Salary"))]
y <- Hitters$Salary
opt.forward <- forward(x, y)
opt.forward

##
## Call:
## lm(formula = y ~ ., data = x[, c(opt.x)])
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -932.2 -175.4  -29.2  130.4 1897.2
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  135.75122   71.34623   1.903 0.058223 .
## CRBI           0.78525    0.20978   3.743 0.000225 ***
## Hits           6.92370    1.64612   4.206 3.62e-05 ***
## PutOuts        0.28941    0.07478   3.870 0.000139 ***
## DivisionW   -111.14603   39.21835  -2.834 0.004970 **
## AtBat         -2.12775    0.53746  -3.959 9.81e-05 ***
## Walks          5.62028    1.59064   3.533 0.000488 ***
## CWalks        -0.82286    0.26361  -3.121 0.002010 **
## CRuns          1.45533    0.39270   3.706 0.000259 ***
## CAtBat        -0.13899    0.05609  -2.478 0.013870 *
## Assists        0.26883    0.15816   1.700 0.090430 .
## LeagueN       43.11162   39.96612   1.079 0.281755
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 311.7 on 251 degrees of freedom
## Multiple R-squared:  0.5426, Adjusted R-squared:  0.5226
## F-statistic: 27.07 on 11 and 251 DF,  p-value: < 2.2e-16

opt.forward$adj.r.squared

## [1] 0.5225706

This matches what forward stepwise selection with regsubsets() delivers:

regfit.fwd = regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept)
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## (selection matrix omitted; identical to the forward stepwise output shown earlier)

coef(regfit.fwd, 11)

##  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns
##  135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914    1.4553310
##         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists
##    0.7852528   -0.8228559   43.1116152 -111.1460252    0.2894087    0.2688277

2. Shrinkage: Ridge and Lasso

Here we do not select features explicitly; instead we add a penalty that punishes the model for the size of its coefficients, in various ways. The glmnet package helps to fit ridge and lasso models.

#install.packages('glmnet')
library('glmnet')

## Loading required package: Matrix
## Loaded glmnet 4.0
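As a reminder of the two penalized criteria that glmnet() fits (standard definitions, added here for context; glmnet internally scales the residual sum of squares by 1/(2n), which only amounts to a reparameterization of lambda):

$$\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \quad \text{(ridge)}, \qquad
\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| \quad \text{(lasso)}.$$

The squared (l2) penalty shrinks coefficients smoothly towards zero, while the absolute-value (l1) penalty can set some of them exactly to zero.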
We will perform ridge regression and the lasso in order to predict Salary on the Hitters data. The glmnet() function requires the data to be in matrix form. The model.matrix() function is particularly useful for creating x: it produces a matrix corresponding to the 19 predictors and transforms the qualitative variables into dummy variables. This matters because glmnet() only accepts numerical inputs.

x = model.matrix(Salary ~ ., Hitters)[,-1]   # regressor matrix with dummy variables
x1 = x[, c(14, 15, 19)]                      # the dummy columns (LeagueN, DivisionW, NewLeagueN)
x2 = x[, -c(14, 15, 19)]
x2 = scale(x2, scale = TRUE)                 # divide columns by their standard deviation
apply(x2, 2, sd)

##  AtBat   Hits  HmRun   Runs    RBI  Walks  Years CAtBat  CHits CHmRun
##      1      1      1      1      1      1      1      1      1      1
##  CRuns   CRBI CWalks PutOuts Assists Errors
##      1      1      1      1       1      1

x = cbind(x1, x2)
x = model.matrix(Salary ~ ., Hitters)[,-1]   # regressor matrix with dummy variables
y = Hitters$Salary                           # response variable

The glmnet() function has an alpha argument that determines what type of model is fit:
alpha = 0: ridge regression
alpha = 1: lasso regression

We first fit a ridge regression model. Important: glmnet() standardizes the variables by default so that they are on the same scale. For each value of lambda you get a vector of ridge regression coefficients, stored in a matrix that can be accessed by coef().

grid = 10^seq(10, -2, length = 100)  # a grid of values ranging from lambda = 10^10 to lambda = 10^-2
ridge.mod = glmnet(x, y, alpha = 0, lambda = grid)
dim(coef(ridge.mod))                 # 100 models with 20 coefficients

## [1]  20 100

Find the 50th value of lambda in the ridge.mod output:

ridge.mod$lambda[50]

## [1] 11497.57

Display the estimated coefficients of the model with lambda = 11497.57:

coef(ridge.mod)[,50]

##   (Intercept)         AtBat          Hits         HmRun          Runs
## 407.356050200   0.036957182   0.138180344   0.524629976   0.230701523
##           RBI         Walks         Years        CAtBat         CHits
##   0.239841459   0.289618741   1.107702929   0.003131815   0.011653637
##        CHmRun         CRuns          CRBI        CWalks       LeagueN
##   0.087545670   0.023379882   0.024138320   0.025015421   0.085028114
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN
##  -6.215440973   0.016482577   0.002612988  -0.020502690   0.301433531

Display the corresponding l2 norm:

sqrt(sum(coef(ridge.mod)[-1,50]^2))

## [1] 6.360612

Let's plot the estimated coefficients as a function of lambda:

plot(ridge.mod, xvar = 'lambda', label = TRUE)

[Figure: ridge coefficient paths against log(lambda)]

Remember: ridge regression is penalized by the sum of squares of the coefficients, and that penalty is controlled by lambda. If lambda is big, the sum of squared coefficients must be small, so the coefficients are shrunk towards zero; if lambda is very big, all the coefficients are essentially zero. As we relax lambda, the coefficients grow away from 0 in a smooth way, and the sum of squared coefficients gets bigger and bigger, until lambda is effectively zero and the coefficients are unregularized/unpenalized.
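As a quick sanity check on that shrinkage story (a small addition, not part of the original code; it reuses the ridge.mod object fitted on the full data above), we can compute the l2 norm of the coefficient vector for every value of lambda on the grid and confirm that it grows as lambda decreases:

beta.path = as.matrix(coef(ridge.mod))[-1, ]   # 19 x 100 matrix of coefficients, intercept dropped
l2.norm = sqrt(colSums(beta.path^2))           # one l2 norm per value of lambda
plot(log(ridge.mod$lambda), l2.norm, type = 'l',
     xlab = 'Log Lambda', ylab = 'l2 norm of the coefficients')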
If you want to go the hard way to draw a similar picture (and to better understand loops and matrices in R):

index = dim(coef(ridge.mod))              # remember: you have 100 models with 20 parameters
coef.ridge = matrix(nrow = 20, ncol = 100) # create a matrix where you will store all the coefficients
row.names(coef.ridge) =                   # row names of the new matrix:
  row.names(coef(ridge.mod))              # the names of the 20 regressors (plus intercept)

# Fill in the matrix of estimated coefficients, one column per lambda
for(i in 1:index[1]){            # for each row (each parameter estimate)
  for(lambda in 1:index[2]){     # for each column (each lambda)
    coef.ridge[i, lambda] = coef(ridge.mod)[,lambda][i]
  }
}

# Plot the coefficients (excluding the intercept) in different colors.
# AtBat parameter estimate for the different models (with different lambda);
# x axis: models from 1 to 100
plot(1:100, coef.ridge[2,], col = 1, lwd = 1, type = 'l',
     ylim = c(min(coef.ridge[-1,]), max(coef.ridge[-1,])),
     ylab = 'Coefficients', xlab = 'Lambda: decreasing')
for(i in 3:20){
  lines(coef.ridge[i,], col = i + 10, lwd = 1)
}

[Figure: hand-drawn coefficient paths, one line per regressor, lambda decreasing along the x axis]

You can use the command predict() to get the parameter estimates for the model with a particular value of lambda. The parameter s is the value of the penalty parameter lambda.

grid[50]

## [1] 11497.57

coef.ridge[,50]

## (output identical to coef(ridge.mod)[,50] shown above)

predict(ridge.mod, s = 11497.57, type = 'coefficients')

## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) 407.356050519
## AtBat         0.036957182
## Hits          0.138180343
## HmRun         0.524629975
## Runs          0.230701522
## RBI           0.239841458
## Walks         0.289618740
## Years         1.107702926
## CAtBat        0.003131815
## CHits         0.011653637
## CHmRun        0.087545670
## CRuns         0.023379882
## CRBI          0.024138320
## CWalks        0.025015421
## LeagueN       0.085028112
## DivisionW    -6.215440955
## PutOuts       0.016482577
## Assists       0.002612988
## Errors       -0.020502690
## NewLeagueN    0.301433530

We require a method for selecting a value of the tuning parameter lambda.

1) Validation set approach

We now split the samples into a training set and a test set in order to estimate the test error of ridge regression.
set.seed(100)  # set a random seed, so that the results are reproducible
train = sample(1:nrow(x), nrow(x)/2)  # randomly choose a subset of numbers between 1 and nrow(x)
# These numbers are used as indices for the training observations
test = (-train)
y.test = y[test]
# we should not use the same set for model selection and for estimating the test error:
train_fit = train[1:66]
train_validate = train[67:131]

# train 100 models with different values of lambda, using 50% of the training set:
ridge.mod = glmnet(x[train_fit,], y[train_fit], alpha = 0, lambda = grid, thresh = 1e-12)
# thresh: convergence threshold for coordinate descent

# make predictions for the observations in the validation set (for all 100 trained models):
ridge.pred = predict(ridge.mod, s = grid, newx = x[train_validate,])

# compute the MSE for all 100 models:
mse.valid = rep(0, 100)
for(i in 1:100){
  mse.valid[i] = mean((ridge.pred[,i] - y[train_validate])^2)
}

Now let's plot the validation MSE for the 100 models:

plot(1:100, mse.valid, col = 1, lwd = 1, type = 'l',
     ylab = 'MSE', xlab = 'Models, lambda decreasing')

[Figure: validation MSE against model index, lambda decreasing from left to right]

Which model would you choose?

which.min(mse.valid)

## [1] 60

grid[60]

## [1] 705.4802

mse.valid[60]

## [1] 100540.2

predict(ridge.mod, s = grid[60], type = 'coefficients')

## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  45.168855395
## AtBat         0.156627202
## Hits          0.649163323
## HmRun        -1.950359682
## Runs          1.222283332
## RBI          -0.074048460
## Walks         1.946881746
## Years         3.846078283
## CAtBat        0.011924431
## CHits         0.037666963
## CHmRun        0.159345518
## CRuns         0.092262057
## CRBI          0.087936080
## CWalks        0.136987710
## LeagueN      27.332619700
## DivisionW   -51.245863615
## PutOuts       0.006266396
## Assists       0.163357489
## Errors       -1.678171179
## NewLeagueN   23.861624897

How well does the model with no shrinkage (lambda = 0) perform?

ridge.pred = predict(ridge.mod, s = 0, newx = x[train_validate,])
mean((ridge.pred - y[train_validate])^2)

## [1] 178445

And a model with only an intercept?

mean((mean(y[train_fit]) - y[train_validate])^2)

## [1] 169136.9

Now compute the test error of the selected model, refit on the full training set with the chosen lambda:

ridge.mod = glmnet(x[train,], y[train], alpha = 0, lambda = grid[60], thresh = 1e-12)
ridge.pred = predict(ridge.mod, s = grid[60], newx = x[test,])
mean((ridge.pred - y.test)^2)  # test MSE

## [1] 147908.8

2) Cross-validation

Use your training data more efficiently to choose the model.
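Before calling cv.glmnet(), here is a rough sketch of the idea it automates (an illustration only, assuming 10 folds and reusing the objects x, y, grid and train defined above; cv.glmnet() additionally reports standard errors and the lambda.1se rule):

set.seed(1)
K = 10
folds = sample(rep(1:K, length.out = length(train)))   # assign each training row to a fold
cv.mse = matrix(NA, K, length(grid))
for(k in 1:K){
  fit = glmnet(x[train[folds != k], ], y[train[folds != k]], alpha = 0, lambda = grid)
  pred = predict(fit, s = grid, newx = x[train[folds == k], ])
  cv.mse[k, ] = colMeans((pred - y[train[folds == k]])^2)  # held-out MSE of fold k, per lambda
}
grid[which.min(colMeans(cv.mse))]   # lambda with the smallest average CV error

The built-in cv.glmnet() call below performs the same kind of computation over its own lambda sequence.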
set.seed(1)  # the choice of the cross-validation folds is random
cv.out = cv.glmnet(x[train,], y[train], alpha = 0)
plot(cv.out)

[Figure: cross-validated mean squared error against log(lambda) for ridge regression]

# Best value of lambda for the training set
bestlam = cv.out$lambda.min
bestlam

## [1] 506.3855

ridge.mod = glmnet(x[train,], y[train], alpha = 0, lambda = bestlam, thresh = 1e-12)
ridge.pred = predict(ridge.mod, s = bestlam, newx = x[test, ])
mean((ridge.pred - y.test)^2)  # test MSE

## [1] 145854.5

out = glmnet(x, y, alpha = 0)
predict(out, type = 'coefficients', s = bestlam)

## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  32.68292956
## AtBat         0.10318577
## Hits          0.73633996
## HmRun         0.97925280
## Runs          0.99817171
## RBI           0.86986640
## Walks         1.45072544
## Years         2.19917786
## CAtBat        0.01118213
## CHits         0.05111039
## CHmRun        0.36643692
## CRuns         0.10220469
## CRBI          0.10728513
## CWalks        0.06701095
## LeagueN      17.17766693
## DivisionW   -65.11474283
## PutOuts       0.13887405
## Assists       0.02047837
## Errors       -0.95658604
## NewLeagueN    9.36341159

Lasso

Same algorithm as for ridge regression, but with alpha = 1:

lasso.mod = glmnet(x[train,], y[train], alpha = 1, lambda = grid)
plot(lasso.mod)

## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

[Figure: lasso coefficient paths against the L1 norm]

# Cross-validation
set.seed(1)
cv.out = cv.glmnet(x[train, ], y[train], alpha = 1, lambda = grid)
plot(cv.out)

[Figure: cross-validated mean squared error against log(lambda) for the lasso]

bestlam = cv.out$lambda.min
bestlam

## [1] 14.17474

lasso.pred = predict(lasso.mod, s = bestlam, newx = x[test,])
mean((lasso.pred - y.test)^2)

## [1] 141581.8

The test MSE of the lasso is very similar to the test MSE of ridge regression with lambda chosen by cross-validation. However, the lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse: here 12 of the 19 coefficient estimates are exactly zero.

out = glmnet(x, y, alpha = 1, lambda = grid)
lasso.coef = predict(out, type = 'coefficients', s = bestlam)[1:20,]
lasso.coef[lasso.coef != 0]

##  (Intercept)         Hits        Walks        CRuns         CRBI      LeagueN
##   10.4228870    1.9049368    2.2467777    0.2088215    0.4175651    7.6134308
##    DivisionW      PutOuts
## -108.1562844    0.2264672

3. Dimension Reduction

Extracting important combinations of variables, and then using those combinations as the features in a regression.

3.1. Principal Components Regression

Principal components regression can be performed using the pcr() function, which is part of the pls library.

#install.packages("pls")
library(pls)

##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
##     loadings

set.seed(2)
pcr.fit = pcr(Salary ~ ., data = Hitters, subset = train, scale = TRUE, validation = "CV")
# scale = TRUE: standardize each predictor prior to generating the principal components.
# Why? So that the scale on which each variable is measured has no effect.
# validation = "CV" causes pcr() to compute the ten-fold cross-validation error for each possible value of M.

The resulting fit can be examined using summary():

summary(pcr.fit)

## Data:    X dimension: 131 19
##          Y dimension: 131 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           409.2    320.8    322.4    317.3    319.1    324.8    327.9
## adjCV        409.2    319.7    321.2    315.9    317.6    323.0    326.0
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       327.6    329.4    332.5     331.5     339.2     337.9     342.1
## adjCV    325.3    327.0    329.9     328.2     335.9     334.0     338.4
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        345.0     346.7     339.2     360.8     359.9       356
## adjCV     340.6     342.0     334.6     354.9     353.9       350
##
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         37.06    58.99    69.98    78.63    83.96    88.49    92.03    94.67
## Salary    44.97    45.27    48.70    48.70    48.93    49.18    50.51    51.02
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         96.03     97.01     97.87     98.59     99.20      99.5     99.76
## Salary    51.15     53.22     53.34     54.62     55.16      56.0     58.93
##         16 comps  17 comps  18 comps  19 comps
## X          99.91     99.97     99.99    100.00
## Salary     60.54     60.67     61.01     62.27

The CV score is provided for each possible number of components M, ranging from M = 0 onwards. Note that pcr() reports RMSEP, the root mean squared error, so MSE = RMSEP^2.

To plot the cross-validation scores, use the validationplot() function. Setting val.type = "MSEP" causes the cross-validation MSE to be plotted.

validationplot(pcr.fit, val.type = "MSEP")

[Figure: cross-validated MSEP for Salary against the number of components]

We see that the smallest cross-validation error occurs when M = 3 components are used. We compute the test MSE as follows:

pcr.pred = predict(pcr.fit, x[test,], ncomp = 3)
mean((pcr.pred - y.test)^2)

## [1] 151374.3

This test MSE is competitive with the results obtained using the ridge and lasso regressions, but the final model is more difficult to interpret.

3.2. Partial Least Squares

We implement partial least squares (PLS) using the plsr() function, also in the pls library.

set.seed(1)
pls.fit = plsr(Salary ~ ., data = Hitters, subset = train, scale = TRUE, validation = "CV")
summary(pls.fit)

## Data:    X dimension: 131 19
##          Y dimension: 131 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           409.2    306.9    311.4    321.9    332.9    337.9      343
## adjCV        409.2    306.4    310.3    319.9    329.6    333.7      338
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       339.4    336.3    337.2     335.5     336.9     337.9     342.9
## adjCV    333.8    331.4    332.5     331.2     332.5     333.5     338.0
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        343.3     349.5     347.2     342.9     341.8     339.7
## adjCV     338.3     343.9     341.8     337.9     336.9     334.8
##
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         36.83    48.21    62.91    72.76    75.32    79.03    81.30    83.88
## Salary    47.67    51.41    53.26    55.37    58.05    58.92    59.86    60.45
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         87.44     93.99     96.67     97.54     98.10     98.59     98.90
## Salary    60.61     60.68     60.81     60.97     61.14     61.32     61.65
##         16 comps  17 comps  18 comps  19 comps
## X          99.54     99.90     99.98    100.00
## Salary     61.74     61.84     62.09     62.27

The lowest cross-validation error occurs when only M = 2 partial least squares directions are used. The corresponding test MSE:

pls.pred = predict(pls.fit, x[test,], ncomp = 2)
mean((pls.pred - y.test)^2)

## [1] 145505.4

The test MSE is comparable to, but slightly higher than, the test MSEs obtained using ridge regression, the lasso, and PCR. Notice that the percentage of variance in Salary that the two-component PLS fit explains, 51.41%, is about as much as that explained by the nine-component PCR fit, 51.15%.
This is because PLS searches for directions that explain variance in both the predictors and the response.
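To make that statement a bit more concrete (a reminder of ISLR's construction of the first PLS direction, added here for context and not part of the original text): with standardized predictors, the first PLS direction

$$Z_1 = \sum_{j=1}^{p} \phi_{j1} X_j, \qquad \phi_{j1} \propto \mathrm{Cor}(X_j, Y),$$

weights each predictor by its simple-regression relationship with the response, whereas the principal component directions used by PCR are computed from the predictors alone, without looking at Y.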