Model Selection
Bowei Chen, Deema Hafeth and Jingmin Huang
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M
Data Science 2016 – 2017 Workshop
Today’s Objective
• Complete Exercises 1-4 below.
• The exercises include hints about relevant R packages and built-in R
functions. Look these up online or read the materials in the reference list.
Exercise 1/4
a) Download and import the “Housing.csv” dataset into R
b) Explore the variables by:
1) Providing descriptive statistics
2) Showing the distributions of the variables and their relationships
(see the figure on the next slide) [Hint: use function ‘ggpairs()’ from ‘GGally’ library]
c) Use Mallows’ C_p criterion to select the best regression model where the first
variable in the dataset is the response variable, and the second and third
variables are predictors. [Hint: use function ‘leaps()’ from ‘leaps’ library]
d) Use the adjusted R^2 criterion to select the best regression model where the first
variable in the dataset is the response variable, and the second and third
variables are predictors. [Hint: use function ‘leaps()’ from ‘leaps’ library]
e) Implement the adjusted R^2 criterion yourself using the formula given in the
lecture slides, and compare your results with those from question d)
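As a starting point for parts c) to e), here is a minimal sketch. It assumes the ‘leaps’ package is installed and runs on simulated data standing in for Housing.csv; the column names price, lotsize and bedrooms are placeholders, not taken from the real file. The adjusted R^2 helper uses the standard definition 1 - (1 - R^2)(n - 1)/(n - p - 1), which is assumed to match the lecture formula. Picking the smallest C_p is one common rule; the lecture may state the criterion differently.

```r
library(leaps)  # assumes the 'leaps' package is installed

# Simulated stand-in for Housing.csv (column names are hypothetical).
# With the real file, use: housing <- read.csv("Housing.csv")
set.seed(1)
n <- 200
lotsize  <- runif(n, 2000, 10000)
bedrooms <- sample(1:5, n, replace = TRUE)
price    <- 20000 + 5 * lotsize + 4000 * bedrooms + rnorm(n, sd = 5000)
housing  <- data.frame(price, lotsize, bedrooms)

y <- housing[[1]]               # first variable: response
x <- as.matrix(housing[, 2:3])  # second and third variables: predictors

# c) Mallows' C_p for every subset of the two predictors
cp_fit  <- leaps(x, y, method = "Cp", names = colnames(x))
best_cp <- cp_fit$which[which.min(cp_fit$Cp), ]

# d) adjusted R^2 for the same subsets
adj_fit  <- leaps(x, y, method = "adjr2", names = colnames(x))
best_adj <- adj_fit$which[which.max(adj_fit$adjr2), ]

# e) adjusted R^2 by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
#    where p is the number of predictors (intercept excluded)
adj_r2 <- function(fit) {
  n  <- length(residuals(fit))
  p  <- length(coef(fit)) - 1
  r2 <- summary(fit)$r.squared
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Refit each subset reported by leaps() and apply the manual formula
manual <- apply(adj_fit$which, 1, function(keep) {
  adj_r2(lm(y ~ x[, keep, drop = FALSE]))
})

print(best_cp)   # predictors kept under the C_p rule
print(best_adj)  # predictors kept under the adjusted R^2 rule
print(all.equal(unname(manual), adj_fit$adjr2))
```

Each row of the ‘which’ matrix marks the predictors included in one candidate model, so the manual values can be compared with the leaps() output row by row.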
Exercise 1/4 b)-2)
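A plot of this kind can be produced along the following lines. This is only a sketch: it assumes the ‘GGally’ package (with ‘ggplot2’) is installed, and it uses simulated data with placeholder column names instead of the real Housing.csv.

```r
library(GGally)  # assumes 'GGally' and its dependency 'ggplot2' are installed

# Simulated stand-in for Housing.csv (column names are hypothetical).
# With the real file, use: housing <- read.csv("Housing.csv")
set.seed(1)
n <- 200
lotsize  <- runif(n, 2000, 10000)
bedrooms <- sample(1:5, n, replace = TRUE)
price    <- 20000 + 5 * lotsize + 4000 * bedrooms + rnorm(n, sd = 5000)
housing  <- data.frame(price, lotsize, bedrooms)

# b-1) descriptive statistics for every variable
print(summary(housing))

# b-2) distributions on the diagonal, pairwise relationships elsewhere
pairs_plot <- ggpairs(housing)
print(pairs_plot)
```

With many columns the matrix becomes hard to read; the ‘columns’ argument of ggpairs() can restrict the plot to a subset.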
Exercise 2/4
a) Download and import the “Housing.csv” dataset into R
b) Use the forward selection method to select the best model
[Hint: use function ‘step()’]
c) Use the backward elimination method to select the best model
[Hint: use function ‘step()’]
d) Use the bidirectional elimination method to select the best model
[Hint: use function ‘step()’]
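The three searches in b) to d) can be sketched with step() as follows. The data are simulated placeholders for Housing.csv (the column names, including the irrelevant ‘noise’ predictor, are hypothetical). Note that step() compares models by AIC by default.

```r
# Simulated stand-in for Housing.csv (column names are hypothetical).
# With the real file, use: housing <- read.csv("Housing.csv")
set.seed(1)
n <- 200
housing <- data.frame(
  lotsize  = runif(n, 2000, 10000),
  bedrooms = sample(1:5, n, replace = TRUE),
  noise    = rnorm(n)  # predictor unrelated to the response
)
housing$price <- 20000 + 5 * housing$lotsize + 4000 * housing$bedrooms +
  rnorm(n, sd = 5000)

null_model <- lm(price ~ 1, data = housing)  # intercept only
full_model <- lm(price ~ ., data = housing)  # all predictors
search_scope <- list(lower = formula(null_model), upper = formula(full_model))

# b) forward selection: start empty, add one term at a time
forward <- step(null_model, scope = search_scope,
                direction = "forward", trace = 0)

# c) backward elimination: start full, drop one term at a time
backward <- step(full_model, direction = "backward", trace = 0)

# d) bidirectional: terms may be added or dropped at each step
both <- step(null_model, scope = search_scope,
             direction = "both", trace = 0)

print(formula(forward))
print(formula(backward))
print(formula(both))
```

Setting trace = 1 prints the AIC table at every step, which helps when explaining why a term was added or removed.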
Exercise 3/4
a) Download and import the “Housing.csv” dataset into R
b) Randomly select 80% of the dataset observations to be the training set
c) Build a linear model based on the training set where price is the response
variable and any other variables are predictors
d) Apply 3-fold cross-validation to the best fitted model
[Hint: use the ‘DAAG’ library]
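Parts b) and c) can be sketched in base R as below. The data are simulated placeholders for Housing.csv; only the name price comes from the exercise, the other column names are hypothetical. The test-set RMSE at the end is an optional check, not something the exercise asks for.

```r
# Simulated stand-in for Housing.csv ('price' is from the exercise;
# the predictor names are hypothetical).
# With the real file, use: housing <- read.csv("Housing.csv")
set.seed(1)
n <- 200
housing <- data.frame(
  lotsize  = runif(n, 2000, 10000),
  bedrooms = sample(1:5, n, replace = TRUE)
)
housing$price <- 20000 + 5 * housing$lotsize + 4000 * housing$bedrooms +
  rnorm(n, sd = 5000)

# b) draw 80% of the row indices at random for the training set
train_idx <- sample(nrow(housing), size = round(0.8 * nrow(housing)))
train <- housing[train_idx, ]
test  <- housing[-train_idx, ]

# c) linear model on the training set, price against all other variables
fit <- lm(price ~ ., data = train)
print(summary(fit))

# Optional: error on the held-out 20%
test_rmse <- sqrt(mean((test$price - predict(fit, newdata = test))^2))
print(test_rmse)
```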
Exercise 3/4 d)
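To make the mechanics of 3-fold cross-validation explicit, the sketch below first does it by hand in base R and then, if the ‘DAAG’ package is installed, calls CVlm() on the same model. The data are simulated placeholders (hypothetical predictor names); in the exercise you would pass your training set and your chosen formula instead.

```r
# Simulated stand-in for the housing data (predictor names are hypothetical)
set.seed(1)
n <- 200
housing <- data.frame(
  lotsize  = runif(n, 2000, 10000),
  bedrooms = sample(1:5, n, replace = TRUE)
)
housing$price <- 20000 + 5 * housing$lotsize + 4000 * housing$bedrooms +
  rnorm(n, sd = 5000)

# Manual 3-fold cross-validation: assign each row to one of k folds,
# fit on the other folds, predict the held-out fold
k <- 3
fold <- sample(rep(1:k, length.out = nrow(housing)))
cv_pred <- numeric(nrow(housing))
for (i in 1:k) {
  fit <- lm(price ~ lotsize + bedrooms, data = housing[fold != i, ])
  cv_pred[fold == i] <- predict(fit, newdata = housing[fold == i, ])
}
cv_mse <- mean((housing$price - cv_pred)^2)
print(cv_mse)

# The same idea with DAAG, as hinted in the exercise (skipped if not installed).
# The fold assignment differs, so the value will not match cv_mse exactly.
if (requireNamespace("DAAG", quietly = TRUE)) {
  cv_out <- DAAG::CVlm(data = housing,
                       form.lm = formula(price ~ lotsize + bedrooms),
                       m = 3, plotit = FALSE, printit = FALSE)
  print(mean((cv_out$price - cv_out$cvpred)^2))
}
```

Leaving plotit at its default draws the observed values against the cross-validated predictions for each fold.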
Exercise 4/4
a) Download and import the “Worcester Heart Attack Trial.csv” dataset into R
b) Use the 10-fold cross validation method to estimate a multiple linear model
where lenfol is the response variable and other variables, such as los, fstat,
age and gender, can be possible predictors [Hint: use the ‘caret’ library]
c) Set the random seed to 1 and run repeated 10-fold cross validation for the
model in question b)
d) Use the repeated 10-fold cross validation method to estimate a logistic
model where fstat is the response variable and other variables, such as los,
lenfol, age and gender, can be possible predictors
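Parts b) to d) could be approached with ‘caret’ roughly as follows. This is a sketch under assumptions: the ‘caret’ package and its dependencies are installed, the data are simulated (only the variable names lenfol, los, fstat, age and gender come from the exercise), fstat is assumed to be coded 0/1, and three repeats is an arbitrary choice since the exercise does not give a number.

```r
library(caret)  # assumes 'caret' and its dependencies are installed

# Simulated stand-in for the Worcester data (values are made up).
# With the real file: whas <- read.csv("Worcester Heart Attack Trial.csv")
set.seed(1)
n <- 300
whas <- data.frame(
  age    = round(rnorm(n, mean = 68, sd = 12)),
  gender = rbinom(n, 1, 0.4),
  los    = rpois(n, 5) + 1
)
whas$fstat  <- rbinom(n, 1, plogis(-5 + 0.07 * whas$age))
whas$lenfol <- round(rexp(n, rate = 1 / (2500 - 1000 * whas$fstat)))

# b) 10-fold cross-validation of a multiple linear model
ctrl_cv <- trainControl(method = "cv", number = 10)
lm_cv <- train(lenfol ~ los + fstat + age + gender, data = whas,
               method = "lm", trControl = ctrl_cv)
print(lm_cv$results)

# c) repeated 10-fold cross-validation with the seed set to 1
#    (repeats = 3 is an assumption)
ctrl_rep <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(1)
lm_rep <- train(lenfol ~ los + fstat + age + gender, data = whas,
                method = "lm", trControl = ctrl_rep)
print(lm_rep$results)

# d) logistic model: caret needs a factor response for classification
whas$fstat_f <- factor(whas$fstat)
set.seed(1)
glm_rep <- train(fstat_f ~ los + lenfol + age + gender, data = whas,
                 method = "glm", family = binomial, trControl = ctrl_rep)
print(glm_rep$results)
```

For the linear models the results table reports RMSE, R-squared and MAE averaged over the resamples; for the logistic model it reports accuracy and kappa.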
References
• G. James, D. Witten, T. Hastie, and R. Tibshirani. (2014). An Introduction to
Statistical Learning. Springer. (Chapter 6)
• The Caret Package: http://topepo.github.io/caret/index.html
• GGally Github: http://ggobi.github.io/ggally/#quick_coefficients_plot
• H. Kang. Lecture on Model Selection. Stanford University Notes.
http://web.stanford.edu/~hskang/stat431/ModelSelection.pdf
Thank You
bchen@lincoln.ac.uk
dabdalhafeth@lincoln.ac.uk
jhua8590@gmail.com