Model Selection
Mariia Okuneva
In the second tutorial we discussed linear models and estimated them with least squares. Despite its simplicity, the linear model has an advantage in terms of interpretability and often shows good predictive performance.
Today we try to improve on linear models by selecting or shrinking the coefficients, to make the model more interpretable and, in some cases, to predict better.
1. Subset Selection: Best Subset Selection
We search among all possible combinations of features for the ones that are the most predictive.
Here we apply the best subset selection approach to Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.
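Before looking at the code, it helps to keep the size of the search in mind. For each model size k, best subset selection compares all models of that size by RSS, so (as a rough count, with the 19 predictors in this data set) the total number of models examined is

```latex
\sum_{k=0}^{p} \binom{p}{k} = 2^{p},
\qquad p = 19 \;\Rightarrow\; 2^{19} = 524{,}288 .
```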
library(ISLR)
fix(Hitters)
attach(Hitters)
summary(Hitters)

##      AtBat            Hits         HmRun            Runs
##  Min.   : 16.0   Min.   :  1   Min.   : 0.00   Min.   :  0.00
##  1st Qu.:255.2   1st Qu.: 64   1st Qu.: 4.00   1st Qu.: 30.25
##  Median :379.5   Median : 96   Median : 8.00   Median : 48.00
##  Mean   :380.9   Mean   :101   Mean   :10.77   Mean   : 50.91
##  3rd Qu.:512.0   3rd Qu.:137   3rd Qu.:16.00   3rd Qu.: 69.00
##  Max.   :687.0   Max.   :238   Max.   :40.00   Max.   :130.00
##
##       RBI             Walks            Years            CAtBat
##  Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19.0
##  1st Qu.: 28.00   1st Qu.: 22.00   1st Qu.: 4.000   1st Qu.:  816.8
##  Median : 44.00   Median : 35.00   Median : 6.000   Median : 1928.0
##  Mean   : 48.03   Mean   : 38.74   Mean   : 7.444   Mean   : 2648.7
##  3rd Qu.: 64.75   3rd Qu.: 53.00   3rd Qu.:11.000   3rd Qu.: 3924.2
##  Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053.0
##
##      CHits            CHmRun           CRuns             CRBI
##  Min.   :   4.0   Min.   :  0.00   Min.   :   1.0   Min.   :   0.00
##  1st Qu.: 209.0   1st Qu.: 14.00   1st Qu.: 100.2   1st Qu.:  88.75
##  Median : 508.0   Median : 37.50   Median : 247.0   Median : 220.50
##  Mean   : 717.6   Mean   : 69.49   Mean   : 358.8   Mean   : 330.12
##  3rd Qu.:1059.2   3rd Qu.: 90.00   3rd Qu.: 526.2   3rd Qu.: 426.25
##  Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.00
##
##      CWalks        League  Division    PutOuts          Assists
##  Min.   :   0.00   A:175   E:157    Min.   :   0.0   Min.   :  0.0
##  1st Qu.:  67.25   N:147   W:165    1st Qu.: 109.2   1st Qu.:  7.0
##  Median : 170.50                    Median : 212.0   Median : 39.5
##  Mean   : 260.24                    Mean   : 288.9   Mean   :106.9
##  3rd Qu.: 339.25                    3rd Qu.: 325.0   3rd Qu.:166.0
##  Max.   :1566.00                    Max.   :1378.0   Max.   :492.0
##
##      Errors          Salary       NewLeague
##  Min.   : 0.00   Min.   :  67.5   A:176
##  1st Qu.: 3.00   1st Qu.: 190.0   N:146
##  Median : 6.00   Median : 425.0
##  Mean   : 8.04   Mean   : 535.9
##  3rd Qu.:11.00   3rd Qu.: 750.0
##  Max.   :32.00   Max.   :2460.0
##                  NA's   :59

names(Hitters)

##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"
##  [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"
## [13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"
## [19] "Salary"    "NewLeague"

help(Hitters)

## starting httpd help server ... done

# is.na() helps to identify the missing observations: it returns a TRUE/FALSE vector of the
# same length as its input. sum() is then used to count all the missing elements.
sum(is.na(Hitters)) # Salary variable is missing for some of the players

## [1] 59

Hitters = na.omit(Hitters) # remove all of the rows that have missing values in any variable

For Best Subset Selection we are going to use the regsubsets() function (part of the leaps library), so let's load the leaps library first.

The regsubsets() function performs best subset selection by identifying the best model that contains a given number of predictors, where "best" is quantified using RSS.

#install.packages('leaps')
library('leaps')
regfit.full = regsubsets(Salary ~ ., data = Hitters)
summary(regfit.full) # outputs the best set of variables for each model size

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"
## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"
## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"
## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*"
## 7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " "
## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " "
##          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) " "    " "     " "       " "     " "     " "    " "
## 2  ( 1 ) " "    " "     " "       " "     " "     " "    " "
## 3  ( 1 ) " "    " "     " "       "*"     " "     " "    " "
## 4  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "
## 5  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "
## 6  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "
## 7  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "
## 8  ( 1 ) "*"    " "     "*"       "*"     " "     " "    " "
An asterisk indicates that a given variable is included in the corresponding model. By default, regsubsets() reports results up to the best eight-variable model; use the nvmax argument to return as many variables as you want.
regfit.full = regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
summary(regfit.full)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
## 19 Variables  (and intercept)
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
[Forced in/Forced out table (all FALSE) and selection matrix for model sizes 1-19 omitted; an asterisk marks each variable included in the best model of each size]
reg.summary = summary(regfit.full)
names(reg.summary) # summary() also returns R^2, RSS, adjusted R^2, Cp, and BIC

## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

The R^2 statistic increases monotonically as more variables are included: it makes no adjustment for the number of variables in the model.

reg.summary$rsq

##  [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
##  [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159

Let's plot RSS, adjusted R^2, Cp, and BIC for all of the models to select the best one. Which criterion points to which model size?

plot(reg.summary$adjr2,
     xlab = 'Number of Variables', ylab = 'Adjusted RSq',
     type = 'l')
ar2.max = which.max(reg.summary$adjr2)
# put the point on the plot that is already created with points()
points(ar2.max, reg.summary$adjr2[ar2.max],
       col = 'red', cex = 2, pch = 20)

[plot: adjusted R^2 against the number of variables]

ar2.max

## [1] 11

plot(reg.summary$rss,
     xlab = 'Number of Variables', ylab = 'RSS',
     type = 'l')
rss.min = which.min(reg.summary$rss)
points(rss.min, reg.summary$rss[rss.min],
       col = 'red', cex = 2, pch = 20)

[plot: RSS against the number of variables]

rss.min

## [1] 19

plot(reg.summary$cp,
     xlab = 'Number of Variables', ylab = 'Cp',
     type = 'l')
cp.min = which.min(reg.summary$cp)
points(cp.min, reg.summary$cp[cp.min],
       col = 'red', cex = 2, pch = 20)

[plot: Cp against the number of variables]

cp.min

## [1] 10

plot(reg.summary$bic,
     xlab = 'Number of Variables', ylab = 'bic',
     type = 'l')
bic.min = which.min(reg.summary$bic)
points(bic.min, reg.summary$bic[bic.min],
       col = 'red', cex = 2, pch = 20)

[plot: BIC against the number of variables]

bic.min

## [1] 6
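For reference, with n observations, d predictors, and an estimate of the error variance, the criteria plotted above can be written (in the ISLR formulation) as

```latex
C_p = \frac{1}{n}\left(\mathrm{RSS} + 2 d \hat\sigma^{2}\right), \qquad
\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d\, \hat\sigma^{2}\right), \qquad
\text{Adjusted } R^{2} = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)} .
```

Since log(n) > 2 for n > 7, BIC penalizes model size more heavily than Cp, which is why it tends to select smaller models, here the six-variable one.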
The model with the lowest BIC is the six-variable model. Which regressors?
coef(regfit.full, 6)
##  (Intercept)        AtBat         Hits        Walks         CRBI    DivisionW
##   91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 -122.9515338
##      PutOuts
##    0.2643076
Forward and Backward Stepwise Selection
For computational reasons, best subset selection cannot be applied with very large p. Additionally, the enormous search space can lead to overfitting. A possible solution: stepwise methods.
We can also use the regsubsets() function to perform forward stepwise or backward stepwise selection, using the argument method = “forward” or method = “backward”.
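The computational savings are easy to quantify: forward stepwise fits one null model and then p - k candidate models at each step k, for a total of

```latex
1 + \sum_{k=0}^{p-1} (p - k) = 1 + \frac{p(p+1)}{2},
\qquad p = 19 \;\Rightarrow\; 191 \text{ models instead of } 2^{19} = 524{,}288 .
```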
regfit.fwd = regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables  (and intercept)
## 1 subsets of each size up to 19
## Selection Algorithm: forward
[Forced in/Forced out table (all FALSE) and selection matrix for model sizes 1-19 omitted; an asterisk marks each variable included in the model of each size]
summary.fwd = summary(regfit.fwd)
ar2.max = which.max(summary.fwd$adjr2)
ar2.max

## [1] 11

max(summary.fwd$adjr2)

## [1] 0.5225706
regfit.bwd = regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
summary(regfit.bwd)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
## 19 Variables  (and intercept)
## 1 subsets of each size up to 19
## Selection Algorithm: backward
[Forced in/Forced out table (all FALSE) and selection matrix for model sizes 1-19 omitted; an asterisk marks each variable included in the model of each size]

summary.bwd = summary(regfit.bwd)
ar2.max = which.max(summary.bwd$adjr2)
ar2.max

## [1] 11

max(summary.bwd$adjr2)

## [1] 0.5225706
These procedures may lead to different results. E.g., the best seven-variable models identified by forward stepwise selection, backward stepwise selection, and best subset selection are different:
coef(regfit.full, 7)

##  (Intercept)         Hits        Walks       CAtBat        CHits       CHmRun
##   79.4509472    1.2833513    3.2274264   -0.3752350    1.4957073    1.4420538
##    DivisionW      PutOuts
## -129.9866432    0.2366813

coef(regfit.bwd, 7)

##  (Intercept)        AtBat         Hits        Walks        CRuns       CWalks
##  105.6487488   -1.9762838    6.7574914    6.0558691    1.1293095   -0.7163346
##    DivisionW      PutOuts
## -116.1692169    0.3028847

coef(regfit.fwd, 7)

##  (Intercept)        AtBat         Hits        Walks         CRBI       CWalks
##  109.7873062   -1.9588851    7.4498772    4.9131401    0.8537622   -0.3053070
##    DivisionW      PutOuts
## -127.1223928    0.2533404

All three procedures, however, identify the same best eleven-variable model:

coef(regfit.full, 11)

##  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns
##  135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914    1.4553310
##         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists
##    0.7852528   -0.8228559   43.1116152 -111.1460252    0.2894087    0.2688277

coef(regfit.bwd, 11)

[identical coefficients]

coef(regfit.fwd, 11)

[identical coefficients]
opt.x <- c()
x <- Hitters[, !(names(Hitters) %in% c("Salary"))]
y <- Hitters$Salary
concat <- cbind(x, y)
adj.r2 <- matrix(nrow = ncol(x), ncol = ncol(x))

for (j in 1:ncol(x)) {       # j: step of the forward selection
  for (i in 1:ncol(x)) {     # i: candidate variable to add
    if (i %in% opt.x) {
      adj.r2[j, i] <- -1     # already selected: exclude from the search
    } else {
      adj.r2[j, i] <- summary(lm(y ~ ., data = concat[, c(opt.x, i, ncol(x) + 1)]))$adj.r.squared
    }
  }
  opt.x <- c(opt.x, which.max(adj.r2[j, ])) # add the variable with the highest adjusted R^2
}

which(adj.r2 == max(adj.r2), arr.ind = TRUE)

##      row col
## [1,]  11  14

opt.n <- which(adj.r2 == max(adj.r2), arr.ind = TRUE)[1, ][1]
opt.x <- opt.x[1:opt.n]
summary(lm(y ~ ., data = x[, c(opt.x)]))

##
## Call:
## lm(formula = y ~ ., data = x[, c(opt.x)])
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -932.2 -175.4  -29.2  130.4 1897.2
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  135.75122   71.34623   1.903 0.058223 .
## CRBI           0.78525    0.20978   3.743 0.000225 ***
## Hits           6.92370    1.64612   4.206 3.62e-05 ***
## PutOuts        0.28941    0.07478   3.870 0.000139 ***
## DivisionW   -111.14603   39.21835  -2.834 0.004970 **
## AtBat         -2.12775    0.53746  -3.959 9.81e-05 ***
## Walks          5.62028    1.59064   3.533 0.000488 ***
## CWalks        -0.82286    0.26361  -3.121 0.002010 **
## CRuns          1.45533    0.39270   3.706 0.000259 ***
## CAtBat        -0.13899    0.05609  -2.478 0.013870 *
## Assists        0.26883    0.15816   1.700 0.090430 .
## LeagueN       43.11162   39.96612   1.079 0.281755
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 311.7 on 251 degrees of freedom
## Multiple R-squared:  0.5426, Adjusted R-squared:  0.5226
## F-statistic: 27.07 on 11 and 251 DF,  p-value: < 2.2e-16
forward <- function(x, y) {
  concat <- cbind(x, y)
  opt.x <- c()
  adj.r2 <- matrix(nrow = ncol(x), ncol = ncol(x))
  for (j in 1:ncol(x)) {
    for (i in 1:ncol(x)) {
      if (i %in% opt.x) {
        adj.r2[j, i] <- -1
      } else {
        adj.r2[j, i] <- summary(lm(y ~ ., data = concat[, c(opt.x, i, ncol(x) + 1)]))$adj.r.squared
      }
    }
    opt.x <- c(opt.x, which.max(adj.r2[j, ]))
  }
  opt.n <- which(adj.r2 == max(adj.r2), arr.ind = TRUE)[1, ][1]
  opt.x <- opt.x[1:opt.n]
  return(summary(lm(y ~ ., data = x[, c(opt.x)])))
}

x <- Hitters[, !(names(Hitters) %in% c("Salary"))]
y <- Hitters$Salary
opt.forward <- forward(x, y)
opt.forward
##
## Call:
## lm(formula = y ~ ., data = x[, c(opt.x)])
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -932.2 -175.4  -29.2  130.4 1897.2
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  135.75122   71.34623   1.903 0.058223 .
## CRBI           0.78525    0.20978   3.743 0.000225 ***
## Hits           6.92370    1.64612   4.206 3.62e-05 ***
## PutOuts        0.28941    0.07478   3.870 0.000139 ***
## DivisionW   -111.14603   39.21835  -2.834 0.004970 **
## AtBat         -2.12775    0.53746  -3.959 9.81e-05 ***
## Walks          5.62028    1.59064   3.533 0.000488 ***
## CWalks        -0.82286    0.26361  -3.121 0.002010 **
## CRuns          1.45533    0.39270   3.706 0.000259 ***
## CAtBat        -0.13899    0.05609  -2.478 0.013870 *
## Assists        0.26883    0.15816   1.700 0.090430 .
## LeagueN       43.11162   39.96612   1.079 0.281755
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 311.7 on 251 degrees of freedom
## Multiple R-squared:  0.5426, Adjusted R-squared:  0.5226
## F-statistic: 27.07 on 11 and 251 DF,  p-value: < 2.2e-16

opt.forward$adj.r.squared

## [1] 0.5225706
Let's compare this with the forward stepwise selection performed by regsubsets():

regfit.fwd = regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables  (and intercept)
## 1 subsets of each size up to 19
## Selection Algorithm: forward
[Forced in/Forced out table and selection matrix omitted, as above]

coef(regfit.fwd, 11)

##  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns
##  135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914    1.4553310
##         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists
##    0.7852528   -0.8228559   43.1116152 -111.1460252    0.2894087    0.2688277

The eleven-variable model found by our handwritten function matches the one selected by regsubsets().
2. Shrinkage: Ridge and Lasso
Here we do not select features explicitly; instead, we add a penalty that punishes the model for the size of its coefficients, in various ways.
The glmnet package helps to fit ridge and lasso models.
#install.packages('glmnet')
library('glmnet')

## Loading required package: Matrix
## Loaded glmnet 4.0

We will perform ridge regression and the lasso in order to predict Salary on the Hitters data. The glmnet() function requires the data in matrix form. The model.matrix() function is particularly useful for creating x: it produces a matrix of the 19 predictors and transforms qualitative variables into dummy variables. This matters because glmnet() accepts only numerical inputs.

x = model.matrix(Salary ~ ., Hitters)[,-1] # regressor matrix with dummy variables
x1 = x[, c(14, 15, 19)] # the dummy columns
x2 = x[, -c(14, 15, 19)]
x2 = scale(x2, scale = TRUE) # divide columns by their standard deviation
apply(x2, 2, sd)

##   AtBat    Hits   HmRun    Runs     RBI   Walks   Years  CAtBat   CHits  CHmRun
##       1       1       1       1       1       1       1       1       1       1
##   CRuns    CRBI  CWalks PutOuts Assists  Errors
##       1       1       1       1       1       1

x = cbind(x1, x2)

The glmnet() function has an alpha argument that determines what type of model is fit: alpha = 0 gives ridge regression, alpha = 1 gives the lasso.

We first fit a ridge regression model.

Important: by default, the glmnet() function standardizes the variables so that they are on the same scale.

For each value of lambda you get a vector of ridge regression coefficients, stored in a matrix that can be accessed by coef().

x = model.matrix(Salary ~ ., Hitters)[,-1] # regressor matrix with dummy variables
y = Hitters$Salary # response variable
grid = 10^seq(10, -2, length = 100) # grid of values ranging from lambda = 10^10 to lambda = 10^-2
ridge.mod = glmnet(x, y, alpha = 0, lambda = grid)
dim(coef(ridge.mod)) # 100 models with 20 coefficients

## [1]  20 100
Find the 50th value of lambda in the ridge.mod output:

ridge.mod$lambda[50]

## [1] 11497.57

Display the estimated coefficients of the model with lambda = 11497.57:

coef(ridge.mod)[,50]

##   (Intercept)         AtBat          Hits         HmRun          Runs
## 407.356050200   0.036957182   0.138180344   0.524629976   0.230701523
##           RBI         Walks         Years        CAtBat         CHits
##   0.239841459   0.289618741   1.107702929   0.003131815   0.011653637
##        CHmRun         CRuns          CRBI        CWalks       LeagueN
##   0.087545670   0.023379882   0.024138320   0.025015421   0.085028114
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN
##  -6.215440973   0.016482577   0.002612988  -0.020502690   0.301433531

Display the corresponding l2 norm:

sqrt(sum(coef(ridge.mod)[-1,50]^2))

## [1] 6.360612

Let's plot the estimated coefficients as a function of lambda!

plot(ridge.mod, xvar = 'lambda', label = TRUE)

[plot: ridge coefficient paths against log lambda; the numbers along the top axis give the number of nonzero coefficients]
Remember: ridge regression is penalized by the sum of squares of the coefficients, and the amount of penalization is controlled by lambda.
If lambda is big, you want the sum of squares of the coefficients to be small, so the coefficients are shrunk towards zero. If lambda is very big, all the coefficients end up very close to zero.
As we relax lambda, the coefficients grow away from zero in a smooth way, and the sum of squares of the coefficients gets bigger and bigger, until we reach a point where lambda is effectively zero and the coefficients are unregularized/unpenalized.
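Formally, for a penalty parameter lambda >= 0, ridge regression solves

```latex
\hat\beta^{\mathrm{ridge}} = \arg\min_{\beta}\;
\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2}
+ \lambda \sum_{j=1}^{p} \beta_j^{2} .
```

Note that the intercept is not penalized, and lambda = 0 recovers ordinary least squares.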
If you want to go the hard way to draw a similar picture (and to better understand loops and matrices in R):

index = dim(coef(ridge.mod)) # remember: you have 100 models with 20 parameters
coef.ridge = matrix(nrow = 20, ncol = 100) # matrix to store all the coefficients
row.names(coef.ridge) = row.names(coef(ridge.mod)) # names of the 20 regressors

# Filling in the matrix of estimated coefficients depending on lambda
for (i in 1:index[1]) {          # for each row (each parameter estimate)
  for (lambda in 1:index[2]) {   # for each column (each lambda)
    coef.ridge[i, lambda] = coef(ridge.mod)[, lambda][i]
  }
}

# Plotting the coefficients (excluding the intercept) in different colors
# AtBat parameter estimate for different models (with different lambda);
# X axis: models from 1 to 100
plot(1:100, coef.ridge[2,], col = 1, lwd = 1,
     ylim = c(min(coef.ridge[-1,]), max(coef.ridge[-1,])),
     'l', ylab = 'Coefficients', xlab = 'Lambda: decreasing')
for (i in 3:20) {
  lines(coef.ridge[i,], col = i + 10, lwd = 1)
}

[plot: coefficient paths, with lambda decreasing from left to right]
You can use the predict() command to get parameter estimates for the model with a particular value of lambda. The parameter s sets the penalty parameter lambda.

grid[50]

## [1] 11497.57

coef.ridge[,50]

##   (Intercept)         AtBat          Hits         HmRun          Runs
## 407.356050200   0.036957182   0.138180344   0.524629976   0.230701523
##           RBI         Walks         Years        CAtBat         CHits
##   0.239841459   0.289618741   1.107702929   0.003131815   0.011653637
##        CHmRun         CRuns          CRBI        CWalks       LeagueN
##   0.087545670   0.023379882   0.024138320   0.025015421   0.085028114
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN
##  -6.215440973   0.016482577   0.002612988  -0.020502690   0.301433531

predict(ridge.mod, s = 11497.57, type = 'coefficients')

## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) 407.356050519
## AtBat         0.036957182
## Hits          0.138180343
## HmRun         0.524629975
## Runs          0.230701522
## RBI           0.239841458
## Walks         0.289618740
## Years         1.107702926
## CAtBat        0.003131815
## CHits         0.011653637
## CHmRun        0.087545670
## CRuns         0.023379882
## CRBI          0.024138320
## CWalks        0.025015421
## LeagueN       0.085028112
## DivisionW    -6.215440955
## PutOuts       0.016482577
## Assists       0.002612988
## Errors       -0.020502690
## NewLeagueN    0.301433530
We need a method for selecting a value of the tuning parameter lambda. 1) Validation set approach. We now split the samples into a training set and a test set in order to estimate the test error of ridge regression.
set.seed(100) # set a random seed, so that the results are reproducible
train = sample(1:nrow(x), nrow(x)/2) # randomly choose a subset of numbers between 1 and nrow(x)
# These numbers are used as indices for the training observations
test = (-train)
y.test = y[test]
train_fit = train[1:66]
train_validate = train[67:131] # we should not use the same set for model selection and test error

# train 100 models with different values of lambda, using 50% of the training set:
ridge.mod = glmnet(x[train_fit,], y[train_fit], alpha = 0, lambda = grid,
                   thresh = 1e-12)
# thresh: convergence threshold for coordinate descent

# make predictions for observations from the validation set (for all 100 trained models):
ridge.pred = predict(ridge.mod, s = grid,
                     newx = x[train_validate,])

# compute MSE for all 100 models:
mse.valid = rep(0, 100)
for (i in 1:100) {
  mse.valid[i] = mean((ridge.pred[,i] - y[train_validate])^2)
}

Now let's plot the validation MSE for the 100 models:

plot(1:100, mse.valid, col = 1, lwd = 1, 'l', ylab = 'MSE',
     xlab = 'Models, lambda decreasing')

[plot: validation MSE against model index, lambda decreasing]
Which model would you choose?
which.min(mse.valid)

## [1] 60

grid[60]

## [1] 705.4802

mse.valid[60]

## [1] 100540.2

predict(ridge.mod, s = grid[60], type = 'coefficients')

## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  45.168855395
## AtBat         0.156627202
## Hits          0.649163323
## HmRun        -1.950359682
## Runs          1.222283332
## RBI          -0.074048460
## Walks         1.946881746
## Years         3.846078283
## CAtBat        0.011924431
## CHits         0.037666963
## CHmRun        0.159345518
## CRuns         0.092262057
## CRBI          0.087936080
## CWalks        0.136987710
## LeagueN      27.332619700
## DivisionW   -51.245863615
## PutOuts       0.006266396
## Assists       0.163357489
## Errors       -1.678171179
## NewLeagueN   23.861624897
How well does the model with no shrinkage perform?

ridge.pred = predict(ridge.mod, s = 0, newx = x[train_validate,])
mean((ridge.pred - y[train_validate])^2)

## [1] 178445

And the model with only an intercept?

mean((mean(y[train_fit]) - y[train_validate])^2)

## [1] 169136.9

Now compute the test error of the best model:

ridge.mod = glmnet(x[train,], y[train], alpha = 0, lambda = grid[60],
                   thresh = 1e-12)
ridge.pred = predict(ridge.mod, s = grid[60], newx = x[test,])
mean((ridge.pred - y.test)^2) # test MSE

## [1] 147908.8

2) Cross-validation. Use your training data more efficiently to choose the model.

set.seed(1) # the choice of the cross-validation folds is random
cv.out = cv.glmnet(x[train,], y[train], alpha = 0)
plot(cv.out)

[plot: cross-validated MSE against log(lambda)]
# Best value of lambda for the training set
bestlam = cv.out$lambda.min
bestlam

## [1] 506.3855

ridge.mod = glmnet(x[train,], y[train], alpha = 0, lambda = bestlam,
                   thresh = 1e-12)
ridge.pred = predict(ridge.mod, s = bestlam, newx = x[test,])
mean((ridge.pred - y.test)^2) # test MSE

## [1] 145854.5
out = glmnet(x, y, alpha = 0)
predict(out, type = 'coefficients', s = bestlam)

## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  32.68292956
## AtBat         0.10318577
## Hits          0.73633996
## HmRun         0.97925280
## Runs          0.99817171
## RBI           0.86986640
## Walks         1.45072544
## Years         2.19917786
## CAtBat        0.01118213
## CHits         0.05111039
## CHmRun        0.36643692
## CRuns         0.10220469
## CRBI          0.10728513
## CWalks        0.06701095
## LeagueN      17.17766693
## DivisionW   -65.11474283
## PutOuts       0.13887405
## Assists       0.02047837
## Errors       -0.95658604
## NewLeagueN    9.36341159

Lasso
The lasso uses the same algorithm as ridge regression, but with alpha = 1.
lasso.mod = glmnet(x[train,], y[train], alpha = 1, lambda = grid)
plot(lasso.mod)

## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

[plot: lasso coefficient paths against the L1 norm; the top axis gives the number of nonzero coefficients]
# Cross-validation
set.seed(1)
cv.out = cv.glmnet(x[train, ], y[train],
24
Coefficients
−50 0 50 100
plot(cv.out)
alpha = 1, lambda=grid)
19 19 17 17 8 4 0 0 0 0 0 0 0 0 0 0 0 0 0
bestlam=cv.out$lambda.min bestlam
## [1] 14.17474
## [1] 141581.8
−5 0 5 10 15 20
Log(λ)
lasso.pred = predict(lasso.mod, s=bestlam, newx = x[test,])
mean((lasso.pred-y.test)^2)
The test MSE of the lasso is very similar to that of ridge regression with lambda chosen by cross-validation. However, the lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse: 12 out of 19 coefficient estimates are exactly zero.
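This sparsity comes from the soft-thresholding operation at the heart of the lasso: with orthonormal predictors, each least-squares coefficient is shrunk by λ and set exactly to zero when its magnitude falls below λ. A small base-R illustration with hypothetical coefficient values:

```r
# Soft-thresholding operator S(z, lambda) = sign(z) * max(|z| - lambda, 0),
# the elementary update inside lasso coordinate descent
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

ols <- c(1.9, 0.05, -2.2, 0.3, -0.08)  # hypothetical least-squares estimates
soft_threshold(ols, lambda = 0.5)
# small coefficients become exactly 0; large ones are shrunk by 0.5
```

Ridge, by contrast, shrinks coefficients proportionally and never maps them exactly to zero, which is why it performs no variable selection.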
out = glmnet(x,y, alpha=1, lambda=grid)
lasso.coef = predict(out, type='coefficients',
                     s=bestlam)[1:20,]
lasso.coef[lasso.coef!=0]

##  (Intercept)         Hits        Walks        CRuns         CRBI
##   10.4228870    1.9049368    2.2467777    0.2088215    0.4175651
##      LeagueN    DivisionW      PutOuts
##    7.6134308 -108.1562844    0.2264672
3. Dimension Reduction
The idea is to extract important combinations of the variables and then use those combinations as the features in a regression.
3.1. Principal Components Regression
Principal components regression can be performed using the pcr() function, which is part of the pls library.
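Under the hood, PCR amounts to ordinary least squares on the leading principal-component scores. A minimal base-R sketch on synthetic data (pcr() adds cross-validation and the back-transformation of coefficients on top of this):

```r
set.seed(2)
n <- 60; p <- 6
X <- matrix(rnorm(n * p), n, p)          # synthetic predictors
y <- as.vector(X[, 1] - X[, 2] + rnorm(n))

M <- 3                                   # number of components to keep
pc <- prcomp(X, scale. = TRUE)           # standardize, then rotate
scores <- pc$x[, 1:M]                    # first M principal-component scores
pcr_fit <- lm(y ~ scores)                # OLS on the scores

length(coef(pcr_fit))                    # intercept + M coefficients
```

Note that the components are built from X alone, without looking at y; this is the point PLS improves on later.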
#install.packages("pls")
library(pls)

##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
##     loadings

set.seed(2)
pcr.fit = pcr(Salary~., data=Hitters, subset = train, scale = TRUE, validation = "CV")
# scale = TRUE: standardizing each predictor prior to generating principal components
# Why? The scale on which each variable is measured will then have no effect.
# validation = "CV" causes pcr() to compute the ten-fold cross-validation error for each possible value of M

The resulting fit can be examined using summary():

summary(pcr.fit)

## Data:    X dimension: 131 19
##  Y dimension: 131 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           409.2    320.8    322.4    317.3    319.1    324.8    327.9
## adjCV        409.2    319.7    321.2    315.9    317.6    323.0    326.0
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       327.6    329.4    332.5     331.5     339.2     337.9     342.1
## adjCV    325.3    327.0    329.9     328.2     335.9     334.0     338.4
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        345.0     346.7     339.2     360.8     359.9       356
## adjCV     340.6     342.0     334.6     354.9     353.9       350
##
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         37.06    58.99    69.98    78.63    83.96    88.49    92.03    94.67
## Salary    44.97    45.27    48.70    48.70    48.93    49.18    50.51    51.02
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         96.03     97.01     97.87     98.59     99.20      99.5     99.76
## Salary    51.15     53.22     53.34     54.62     55.16      56.0     58.93
##         16 comps  17 comps  18 comps  19 comps
## X          99.91     99.97     99.99    100.00
## Salary     60.54     60.67     61.01     62.27
The CV score is provided for each possible number of components, from M=0 onwards. Note that pcr() reports the root mean squared error of prediction (RMSEP); to obtain the MSE, square it: MSE = RMSE^2.
To plot the cross-validation scores, use the validationplot() function. Setting val.type = "MSEP" causes the cross-validation MSE to be plotted.
validationplot(pcr.fit, val.type="MSEP")
[Figure: cross-validated MSEP for Salary against the number of components.]
We see that the smallest cross-validation error occurs when M=3 components are used. We compute the test MSE as follows:

pcr.pred = predict(pcr.fit, x[test,], ncomp=3)
mean((pcr.pred-y.test)^2)

## [1] 151374.3

This test MSE is competitive with the results obtained using the ridge and lasso regressions. But the final model is more difficult to interpret!

3.2. Partial Least Squares

We implement partial least squares (PLS) using the plsr() function, also from the pls library.

set.seed(1)
pls.fit = plsr(Salary~., data=Hitters, subset = train, scale = TRUE, validation = "CV")
summary(pls.fit)

## Data:    X dimension: 131 19
##  Y dimension: 131 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           409.2    306.9    311.4    321.9    332.9    337.9      343
## adjCV        409.2    306.4    310.3    319.9    329.6    333.7      338
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       339.4    336.3    337.2     335.5     336.9     337.9     342.9
## adjCV    333.8    331.4    332.5     331.2     332.5     333.5     338.0
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        343.3     349.5     347.2     342.9     341.8     339.7
## adjCV     338.3     343.9     341.8     337.9     336.9     334.8
##
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         36.83    48.21    62.91    72.76    75.32    79.03    81.30    83.88
## Salary    47.67    51.41    53.26    55.37    58.05    58.92    59.86    60.45
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         87.44     93.99     96.67     97.54     98.10     98.59     98.90
## Salary    60.61     60.68     60.81     60.97     61.14     61.32     61.65
##         16 comps  17 comps  18 comps  19 comps
## X          99.54     99.90     99.98    100.00
## Salary     61.74     61.84     62.09     62.27

The lowest cross-validation error here actually occurs at M=1, but the error at M=2 is almost as low, so we use M=2 partial least squares directions:

pls.pred = predict(pls.fit, x[test,], ncomp=2)
mean((pls.pred-y.test)^2)

## [1] 145505.4

The test MSE is comparable to the other methods: slightly higher than the lasso, and slightly lower than ridge regression and PCR. Notice that the percentage of variance in Salary explained by the two-component PLS fit, 51.41%, is about as much as that explained by the nine-component PCR fit, 51.15%. This is because PLS searches for directions that explain variance in both the predictors and the response!
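The contrast with PCR is easiest to see in how the first direction is constructed: PLS weights each standardized predictor by its inner product with the response, so the direction is supervised. A base-R sketch of the first PLS direction on synthetic data (plsr() then deflates X and iterates to produce further components):

```r
set.seed(1)
n <- 80; p <- 4
X <- scale(matrix(rnorm(n * p), n, p))   # standardized predictors
y <- as.vector(scale(X[, 1] + 0.5 * X[, 2] + rnorm(n)))

# First PLS weight vector: each predictor weighted by its covariance with y
w <- crossprod(X, y)          # p x 1 vector of inner products X_j' y
w <- w / sqrt(sum(w^2))       # normalize to unit length
t1 <- X %*% w                 # first PLS component (scores)

# Regress y on the supervised component
fit <- lm(y ~ t1)
```

Replacing y with the leading eigenvector step of prcomp would recover PCR's first component instead; the only difference is whether the response guides the direction.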