Model Selection Techniques
Model Selection
We are primarily concerned with prediction accuracy versus model interpretability. In some cases, we may have to trade off one for the other; in other cases, we can achieve both objectives.
1. Prediction Accuracy
If the true relationship is linear, then OLS will have low bias; if n is very large relative to p, the estimates are also precise and model fit is good. Otherwise, model fit has high variance; in this case, we may want to accept some bias in exchange for lower variance.
Model Selection
2. Model Interpretability
Often we include variables in our regressions that are not associated with the response variable. This holds true, in particular, if we control for various interaction terms or polynomials of variables. The more variables, the more difficult it is to interpret a model… but how do we identify "irrelevant variables"?
Model Selection
We are going to discuss two different methods.
1. Subset selection
Different (algorithmic) approaches to identify a subset of our p predictors which we believe to be driving the variation in y.
I Best subset selection
I Forward / backward stepwise selection
Model Selection
We are going to discuss two different methods.
2. Shrinkage
Fitting a model involving all p predictors, but penalizing coefficients on predictors that are "not important". Depending on the shrinkage method, we can remove some predictors completely, as their coefficient $\beta_i$ is estimated to be exactly zero.
I Ridge regression
I Lasso
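As a brief preview of what shrinkage looks like in practice, here is a minimal sketch using the glmnet package; the data frame dat and response y are hypothetical placeholders, not objects used elsewhere in these slides.

library(glmnet)

x <- model.matrix(y ~ ., data = dat)[, -1]   # predictor matrix without the intercept column
ridge <- glmnet(x, dat$y, alpha = 0)         # alpha = 0: ridge regression (shrinks, never zeroes)
lasso <- glmnet(x, dat$y, alpha = 1)         # alpha = 1: lasso (can set coefficients exactly to zero)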
Subset selection
Suppose you have p predictors, that is $X_1, \ldots, X_p$, and you want to identify the best subset of predictors M, possibly containing fewer than p predictors, that achieves the "best performance".
This is a difficult task... why?
I There are p models containing exactly 1 predictor.
I There are $\binom{p}{2} = p(p-1)/2$ possible ways to choose 2 predictors. [Remember: $\binom{p}{k} = \frac{p!}{k!(p-k)!}$]
I There are $\binom{p}{3} = \frac{p!}{3!(p-3)!} = p(p-1)(p-2)/6$ ways to choose 3 predictors.
I In total there are $\sum_{k=0}^{p} \binom{p}{k} = 2^p$ possible models.
How would we proceed to find the "best" possible model?
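These counts are easy to verify directly in R (a quick check; p = 11 matches the hedonic pricing example used later in this section):

p <- 11
choose(p, 2)          # p(p-1)/2 = 55 models with exactly 2 predictors
choose(p, 3)          # p(p-1)(p-2)/6 = 165 models with exactly 3 predictors
sum(choose(p, 0:p))   # total number of possible models: 2^p = 2048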
Algorithm (Best subset selection)
1. Let M0 denote the null model containing no predictors except for a constant (i.e. predicting the mean).
2. For k = 1,...,p
a) Fit all $\binom{p}{k}$ models that contain exactly k predictors.
b) Pick the best among these k-dimensional models, calling it Mk. The best model is the one that has smallest RSS or largest R2.
3. Select the single best model from the set M0, ..., Mp. Here best is determined using prediction performance in terms of MSE on a validation set, or some measure of goodness of fit that adjusts for the fact that R2 monotonically increases as k gets larger.
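For reference, the leaps package implements best subset selection efficiently and performs steps 2a)–2b) internally for every k. A minimal sketch, assuming a hypothetical data frame dat whose column y is the response and whose remaining columns are the candidate predictors:

library(leaps)

# exhaustive search over all subsets of the predictors in 'dat'
fit <- regsubsets(y ~ ., data = dat, nvmax = ncol(dat) - 1, method = "exhaustive")
sm  <- summary(fit)

sm$which             # predictors included in the best model of each size k
which.max(sm$adjr2)  # model size selected by adjusted R^2 (step 3)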
Some comments on the different steps...
2. b) You know that R2 monotonically increases as you add more control variables. Since you hold k fixed, i.e. you compare only the performance of models with the same number of predictors, you can choose the one with the largest value of R2, since you compare "like with like".
3. Here, we are not comparing "like with like", but rather we want to penalize models that are mechanically bound to perform better in the training sample; we can do this by computing their prediction performance in terms of MSE on a validation set, or, in the absence of a validation set, we can look at various statistics that adjust the measure of goodness of fit.
We will consider AIC or adjusted R2 as alternative estimates of test error.
Selection: Optimization Problem
I We can express the best subset selection problem as a nonconvex and combinatorial optimization problem.
I The objective is to find the optimal β's:
$$\min_{\beta} \underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2}_{\text{Residual Sum of Squares}} \quad \text{subject to} \quad \sum_{j=1}^{p} I(\beta_j \neq 0) \le s$$
I This requires that the optimal solution involves finding a vector β such that the RSS is minimal and no more than s coefficients are non-zero.
I The algorithm presented above solves this optimization problem for every value of s and then picks among the optimal models for the different values of s.
Selection: Optimization Problem
I Focus on the constraint; suppose p = 2, then it becomes $I(\beta_1 \neq 0) + I(\beta_2 \neq 0) \le s$.
I Think of the constraint as a function
$$C(\beta_1, \beta_2) = I(\beta_1 \neq 0) + I(\beta_2 \neq 0)$$
I For s = 0, the constraint can only be satisfied if both $\beta_1 = 0$ and $\beta_2 = 0$.
I For s = 1, the constraint holds with equality if either $\beta_1 \neq 0$ or $\beta_2 \neq 0$ (but not both).
I For s = 2, the constraint holds with equality if both $\beta_1 \neq 0$ and $\beta_2 \neq 0$.
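To make the constraint concrete, a tiny illustration in R with hypothetical coefficient values, mirroring the function C(β1, β2) above:

# counts the non-zero coefficients, i.e. C(b1, b2) = I(b1 != 0) + I(b2 != 0)
count_nonzero <- function(b1, b2) as.integer(b1 != 0) + as.integer(b2 != 0)
count_nonzero(0, 0)      # 0: feasible even when s = 0
count_nonzero(1.5, 0)    # 1: requires s >= 1
count_nonzero(1.5, -2)   # 2: requires s >= 2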
Let's do "best subset selection" in our hedonic pricing example...
I We have a range of variables to consider, they are:
I sqft, bathrooms, bedrooms, built year, County (Fixed Effects), Year Fixed Effect, latitude (y), longitude (x), distance to nearest natural gas well in 2009, 2010 and 2011
I I worked on this to study whether natural gas extraction in the neighbourhood adversely affects house prices.
I So in total, we have 11 variables to consider; in total there are $2^{11} = 2,048$ models to estimate.
vars <- c("sqft", "baths", "bedrooms", "builtyryr", "distgas2009",
          "distgas2010", "distgas2011", "lat", "lon", "factor(NAME)", "factor(year)")
curlyM <- list()   # best k-predictor model (M_k) for each k
R2.df <- NULL      # R^2 of every fitted model, together with its k
for (k in 1:length(vars)) {
  varcombs <- combn(vars, k)   # all combinations of k predictors
  # fit each combination on the training sample and record its R^2
  R2 <- apply(varcombs, 2, function(x)
    summary(lm(paste("soldprice ~ ", paste(x, collapse = "+")),
               data = HOUSES[SAMPLE]))$r.squared)
  R2.df <- rbind(R2.df, cbind("k" = k, R2))
  # keep the combination with the largest R^2 as M_k
  curlyM[[k]] <- varcombs[, which(R2 == max(R2))]
}
Step 2a,b: How does R2 evolve for di↵erent values of k?
I Plotting the values for R2 for the 2,048 models we have fit. At k=1, there are exactly 11 points, at k=11, there is exactly 1 point.
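A sketch of how such a plot can be produced from the R2.df object constructed above (base graphics; cosmetic options such as axis limits and colours omitted):

# scatter the R^2 of every fitted model against the number of predictors k
plot(R2.df[, "k"], R2.df[, "R2"], xlab = "k", ylab = expression(R^2))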
Step 2a,b: How does R2 evolve for different values of k?
I R2 increases monotonically throughout.
Step 2a,b: How does R2 evolve for different values of k?
I "red" points indicate those models with maximal R2, chosen in each step 2b)
Let's explore Mk.
> curlyM
[Console output listing, for each k = 1, ..., 11, the variable names of the best k-predictor model Mk.]
Let's pick the "best" Mk.
As discussed, R2 is adequate for choosing the best model when we compare "like with like", i.e. models with the same number of parameters; however, we cannot compare them across different values of k. We want to select the model with the lowest test error, i.e. the model that provides the most robust predictions and does not suffer from "overfitting". There are two approaches…
1. We can add a penalty term to a measure of goodness of fit for the training set, to account for the bias that is likely to arise due to overfitting. The statistics that do this are AIC and adjusted R2.
2. Directly estimate the test error, by computing test MSE for the different models based on a validation set.
Approach (1) is usually followed when no test data set is available.
AIC and Adjusted R2
I From the first set of lectures, we know that the training set MSE is an underestimate of the test MSE [remember: MSE = RSS/n].
I That is, the MSE estimated from the training set is too low relative to the true test MSE.
I We can correct for this "optimism" directly.
I The first one we discuss is the Akaike information criterion for
a model with k predictors.
$$AIC = \frac{1}{n\hat{\sigma}^2}\left(RSS + 2k\hat{\sigma}^2\right)$$
I It's beyond this course to theoretically derive where this comes from… but it is derived in the context of maximum likelihood estimation.
I $\hat{\sigma}^2$ is an estimate of the variance of the residuals $\epsilon$.
I As we increase k, the second term becomes larger. The smaller the AIC, the better the model fit.
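A minimal sketch of this formula for a single fitted lm object fit (a hypothetical object; here $\hat{\sigma}^2$ is taken from the fitted model itself, which is one common choice, though it can also be estimated from the full model containing all predictors). Note that R's built-in AIC(), used for the plots below, is computed from the log-likelihood and is not numerically identical, but it serves the same purpose of penalizing model size.

aic_rss <- function(fit) {
  rss        <- sum(residuals(fit)^2)
  n          <- length(residuals(fit))
  k          <- length(coef(fit)) - 1     # number of predictors, excluding the intercept
  sigma2_hat <- rss / (n - k - 1)         # estimate of Var(eps)
  (rss + 2 * k * sigma2_hat) / (n * sigma2_hat)
}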
AIC and Adjusted R2
$$\text{Adjusted } R^2 = 1 - \frac{RSS/(n-k-1)}{TSS/(n-1)}$$
I Note that the denominator, $TSS/(n-1)$, is constant in k. Maximizing adjusted R2 is therefore equivalent to minimizing $RSS/(n-k-1)$.
I However, the numerator, $RSS/(n-k-1)$, changes non-linearly in k.
I As k ↑, RSS ↓ and (n − k − 1) ↓.
I So adjusted R2 keeps increasing if the decrease in RSS is proportionally larger than the decrease in (n − k − 1).
Intuition: Once all the "correct" variables have been included in the model, additional noise variables lead to only a small decrease in RSS, while (n − k − 1) still decreases, so the ratio RSS/(n − k − 1) increases and adjusted R2 goes down.
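One way to compare the selected models M1, ..., M11 on both criteria is to refit each of them and extract the two statistics; a sketch building on the curlyM list and HOUSES data from the code above (these are the calls used for the plots on the following slides):

# refit the best k-predictor model for every k and keep the lm objects
RES <- lapply(curlyM, function(m)
  lm(paste("soldprice ~ ", paste(m, collapse = "+")), data = HOUSES[SAMPLE]))

aic    <- unlist(lapply(RES, AIC))                                  # built-in AIC of each M_k
adj.r2 <- unlist(lapply(RES, function(x) summary(x)$adj.r.squared)) # adjusted R^2 of each M_k

which.min(aic)      # k with the smallest AIC
which.max(adj.r2)   # k with the largest adjusted R^2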
Plotting AIC for our models in Mk, zooming in on AIC in the panels…
[Figure: unlist(lapply(RES, AIC)) plotted against k, with panels zooming in on k = 4–11 and k = 7–11.]
it looks like the best model is the one with k = 9, which includes
> curlyM[[9]]
[1] “sqft” “baths” “builtyryr” “distgas2009” “distga
[6] “lat” “lon” “factor(NAME)” “factor(year)”
Plotting adjusted R2 for our models in Mk, zooming in on adjusted R2 in the panels…
[Figure: unlist(lapply(RES, function(x) summary(x)$adj.r.squared)) plotted against k, with panels zooming in on k = 4–11 and k = 7–11.]
it looks like the best model is the one with k = 9, which includes
> curlyM[[9]]
[1] “sqft” “baths” “builtyryr” “distgas2009” “distga
[6] “lat” “lon” “factor(NAME)” “factor(year)”
Forward Stepwise Selection
Algorithm (Forward stepwise selection)
1. Let M0 denote the null model containing no predictors except for a constant (i.e. predicting the mean).
2. For k = 0, …, p − 1
a) Consider all p − k models that augment the predictors of Mk with one additional predictor.
b) Choose the best among these p − k models and call it Mk+1. The best model is the one that has smallest RSS or largest R2.
3. Select the single best model from the set M0, …, Mp. Here best is determined using prediction performance in terms of MSE on a validation set, or some measure of goodness of fit that adjusts for the fact that R2 monotonically increases as k gets larger.
Forward Stepwise Selection
Where or how do we save time here?
I Each forward step involves fitting only p − k models in iteration k, rather than $\binom{p}{k}$.
I So in the first iteration we fit p models, in the second p − 1, in the third p − 2, and so on.
I This means that, in total, we fit $1 + \sum_{k=0}^{p-1}(p-k) = 1 + p(p+1)/2$ models.
I In case p = 40, this amounts to 821 models.
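A minimal sketch of forward stepwise selection using step(), reusing the vars vector and HOUSES data from the best subset example above. Note that step() adds, at each iteration, the predictor that most improves AIC, which is one common variant of the algorithm described here:

null_model   <- lm(soldprice ~ 1, data = HOUSES[SAMPLE])
full_formula <- as.formula(paste("soldprice ~", paste(vars, collapse = " + ")))
fwd <- step(null_model, scope = full_formula, direction = "forward", trace = FALSE)
summary(fwd)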
Backward Stepwise Selection
Algorithm (Backward stepwise selection)
1. Let Mp denote the full model containing all p predictors.
2. For k = p, p − 1, …, 1
a) Consider all k models that contain all but one of the predictors in Mk, i.e. that contain a total of k − 1 predictors.
b) Choose the best among these k models and call it Mk−1. The best model is the one that has smallest RSS or largest R2.
3. Select the single best model from the set M0, …, Mp. Here best is determined using prediction performance in terms of MSE on a validation set, or some measure of goodness of fit that adjusts for the fact that R2 monotonically increases in more complex models.
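The backward analogue starts from the full model and removes one predictor at a time; a sketch using step() with the objects defined in the forward stepwise example above:

full_model <- lm(full_formula, data = HOUSES[SAMPLE])
bwd <- step(full_model, direction = "backward", trace = FALSE)
summary(bwd)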
Another application example: which places voted Leave?
Becker, S. O., Fetzer, T., Novy, D. (2017). Who voted for Brexit? A comprehensive district-level analysis. Economic Policy, 32(92), 601–650.
A simple empirical model fits the data extremely well
Example presentation of BSS
Predictive power of individual groups of variables