Introduction to information system

Model Selection

Bowei Chen, Deema Hafeth and Jingmin Huang

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M

Data Science 2016 – 2017 Workshop

Today’s Objective

• Complete Exercises 1-4 below.

• Hints point to the relevant R packages and built-in R functions. Look them up online or read the materials in the reference list.

Exercise 1/4

a) Download and import the “Housing.csv” dataset into R

b) Explore the variables by:

1) Providing descriptive statistics

2) Showing the distributions of the variables and the relationships between them (see the figure on the next slide) [Hint: use the function ‘ggpairs’]

c) Use Mallows' Cp criterion to select the best regression model, where the first variable in the dataset is the response variable and the second and third variables are predictors. [Hint: use the function ‘leaps()’ from the ‘leaps’ library]

d) Use the adjusted R² criterion to select the best regression model, where the first variable in the dataset is the response variable and the second and third variables are predictors. [Hint: use the function ‘leaps()’ from the ‘leaps’ library]

e) Implement the adjusted R² criterion using the formula given in the lecture slides and compare your results with those from question d)
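As a starting point for parts c) to e), here is a minimal base-R sketch that computes both criteria by hand. It makes several assumptions. The data frame is simulated rather than read from “Housing.csv”, and the column names `price`, `lotsize` and `bedrooms` are stand-ins. The adjusted R² formula is the standard textbook one, 1 - (1 - R²)(n - 1)/(n - k - 1), which may be written differently in the lecture slides. Mallows' Cp is taken as RSS/σ̂² - n + 2(k + 1), with σ̂² estimated from the full model.

```r
# Simulated stand-in for Housing.csv (column names are assumptions).
set.seed(1)
n <- 200
housing <- data.frame(lotsize  = runif(n, 2000, 12000),
                      bedrooms = sample(1:5, n, replace = TRUE))
housing$price <- 20000 + 6 * housing$lotsize + 8000 * housing$bedrooms +
  rnorm(n, sd = 15000)
# Put the response first, matching the layout described in the exercise.
housing <- housing[, c("price", "lotsize", "bedrooms")]

# Candidate models: every non-empty subset of the two predictors.
formulas <- list(price ~ lotsize,
                 price ~ bedrooms,
                 price ~ lotsize + bedrooms)

# Error variance estimate from the full model, used by Mallows' Cp.
full <- lm(price ~ lotsize + bedrooms, data = housing)
sigma2_full <- sum(resid(full)^2) / full$df.residual

tss <- sum((housing$price - mean(housing$price))^2)

criteria <- do.call(rbind, lapply(formulas, function(f) {
  fit <- lm(f, data = housing)
  rss <- sum(resid(fit)^2)
  k   <- length(coef(fit)) - 1          # number of predictors (no intercept)
  r2  <- 1 - rss / tss
  data.frame(
    model         = paste(deparse(f), collapse = ""),
    adj_r2_manual = 1 - (1 - r2) * (n - 1) / (n - k - 1),
    adj_r2_lm     = summary(fit)$adj.r.squared,   # R's own value, for comparison
    cp            = rss / sigma2_full - n + 2 * (k + 1)
  )
}))

print(criteria)
```

Under these definitions the preferred model has the largest adjusted R² or a small Cp close to its number of parameters. The same table can then be checked against the output of ‘leaps()’.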

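Exercise 1/4 b)-2)

[Figure: example ‘ggpairs’ plot showing the distributions of the variables and their pairwise relationships]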

Exercise 2/4

a) Download and import the “Housing.csv” dataset into R

b) Use the forward selection method to select the best model [Hint: use the function ‘step()’]

c) Use the backward elimination method to select the best model [Hint: use the function ‘step()’]

d) Use the bidirectional elimination method to select the best model [Hint: use the function ‘step()’]
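The three searches in this exercise can be sketched with ‘step()’ as follows. This is an illustration under assumptions, not a model answer: the data are simulated instead of being read from “Housing.csv”, and the column names (`price`, `lotsize`, `bedrooms`, `bathrms`, `noise`) are stand-ins. Note that ‘step()’ compares models by AIC by default.

```r
# Simulated stand-in for Housing.csv (column names are assumptions).
set.seed(1)
n <- 200
housing <- data.frame(lotsize  = runif(n, 2000, 12000),
                      bedrooms = sample(1:5, n, replace = TRUE),
                      bathrms  = sample(1:3, n, replace = TRUE),
                      noise    = rnorm(n))          # unrelated to price
housing$price <- 20000 + 6 * housing$lotsize + 8000 * housing$bedrooms +
  12000 * housing$bathrms + rnorm(n, sd = 15000)

null_fit <- lm(price ~ 1, data = housing)   # intercept only
full_fit <- lm(price ~ ., data = housing)   # all predictors

# Largest model the forward and bidirectional searches may reach.
upper <- ~ lotsize + bedrooms + bathrms + noise

# Forward selection: start empty, add one term at a time.
fwd <- step(null_fit, scope = list(lower = ~ 1, upper = upper),
            direction = "forward", trace = 0)

# Backward elimination: start full, drop one term at a time.
bwd <- step(full_fit, direction = "backward", trace = 0)

# Bidirectional: terms may be added or dropped at each step.
both <- step(null_fit, scope = list(lower = ~ 1, upper = upper),
             direction = "both", trace = 0)

print(formula(fwd))
print(formula(bwd))
print(formula(both))
```

Setting `trace = 1` (the default) prints the AIC table at each step, which helps when explaining why a term was added or dropped.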

Exercise 3/4

a) Download and import the “Housing.csv” dataset into R

b) Randomly select 80% of the dataset observations to be the training set

c) Build a linear model on the training set, where price is the response variable and any of the other variables may serve as predictors

d) Apply 3-fold cross-validation to the best-fitting model [Hint: use the ‘DAAG’ library]
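To make the mechanics of parts b) to d) concrete, the sketch below does the 80% split and the 3-fold cross-validation by hand in base R. It is a minimal illustration under assumptions: the data are simulated rather than read from “Housing.csv”, the predictor names are stand-ins, and mean squared error is chosen here as the validation measure. The hinted ‘DAAG’ library offers a ready-made routine that can replace the manual loop.

```r
# Simulated stand-in for Housing.csv (column names are assumptions).
set.seed(42)
n <- 300
housing <- data.frame(lotsize  = runif(n, 2000, 12000),
                      bedrooms = sample(1:5, n, replace = TRUE))
housing$price <- 20000 + 6 * housing$lotsize + 8000 * housing$bedrooms +
  rnorm(n, sd = 15000)

# b) Randomly assign 80% of the rows to the training set.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- housing[train_idx, ]
test  <- housing[-train_idx, ]

# c) Linear model on the training set, price as the response.
fit <- lm(price ~ ., data = train)

# d) 3-fold cross-validation on the training set.
k <- 3
fold <- sample(rep(seq_len(k), length.out = nrow(train)))  # random fold labels

cv_mse <- sapply(seq_len(k), function(i) {
  fit_i <- lm(price ~ ., data = train[fold != i, ])        # fit on k - 1 folds
  pred  <- predict(fit_i, newdata = train[fold == i, ])    # predict held-out fold
  mean((train$price[fold == i] - pred)^2)
})

print(cv_mse)        # one MSE per fold
print(mean(cv_mse))  # cross-validated MSE
```

The held-out `test` set is not touched during cross-validation, so it remains available for a final check of the chosen model.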

Exercise 3/4 d)

Exercise 4/4

a) Download and import the “Worcester Heart Attack Trial.csv” dataset into R

b) Use 10-fold cross-validation to estimate a multiple linear regression model, where lenfol is the response variable and other variables, such as los, fstat, age and gender, are possible predictors [Hint: use the ‘caret’ library]

c) Set the random seed to 1 and run repeated 10-fold cross-validation for question b)

d) Use repeated 10-fold cross-validation to estimate a logistic regression model, where fstat is the response variable and other variables, such as los, lenfol, age and gender, are possible predictors
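The idea behind parts b) to d) can be sketched in base R before moving to ‘caret’. Everything below is an assumption made for illustration: the data are simulated rather than read from the Worcester file, `repeated_cv` is a hypothetical helper written for this example, the number of repeats (5) is arbitrary, and RMSE and classification accuracy at a 0.5 threshold are chosen here as the performance measures. With ‘caret’, the equivalent resampling scheme is requested through `trainControl(method = "repeatedcv")`.

```r
set.seed(1)  # part c) asks for seed 1

# Simulated stand-in for the Worcester data (values are not real).
n <- 500
whas <- data.frame(age    = round(rnorm(n, 70, 12)),
                   gender = rbinom(n, 1, 0.4),
                   los    = rpois(n, 6) + 1)
whas$fstat  <- rbinom(n, 1, plogis(-6 + 0.08 * whas$age))
whas$lenfol <- pmax(1, round(1500 - 10 * whas$age - 400 * whas$fstat +
                               rnorm(n, sd = 300)))

# Hypothetical helper: repeats k-fold CV and returns one score per repeat.
repeated_cv <- function(data, k, repeats, fit_fun, score_fun) {
  replicate(repeats, {
    fold <- sample(rep(seq_len(k), length.out = nrow(data)))
    mean(sapply(seq_len(k), function(i) {
      fit <- fit_fun(data[fold != i, ])     # train on k - 1 folds
      score_fun(fit, data[fold == i, ])     # score on the held-out fold
    }))
  })
}

# b), c) Multiple linear regression for lenfol, scored by RMSE.
lm_rmse <- repeated_cv(
  whas, k = 10, repeats = 5,
  fit_fun   = function(d) lm(lenfol ~ los + fstat + age + gender, data = d),
  score_fun = function(fit, d)
    sqrt(mean((d$lenfol - predict(fit, newdata = d))^2))
)

# d) Logistic regression for fstat, scored by accuracy at a 0.5 threshold.
logit_acc <- repeated_cv(
  whas, k = 10, repeats = 5,
  fit_fun   = function(d) glm(fstat ~ los + lenfol + age + gender,
                              data = d, family = binomial),
  score_fun = function(fit, d)
    mean((predict(fit, newdata = d, type = "response") > 0.5) == d$fstat)
)

print(mean(lm_rmse))
print(mean(logit_acc))
```

Averaging over repeats reduces the dependence of the estimate on any single random partition into folds.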

References

• G. James, D. Witten, T. Hastie, and R. Tibshirani. (2014). An Introduction to Statistical Learning. Springer. (Chapter 6)

• The caret Package: http://topepo.github.io/caret/index.html

• GGally GitHub: http://ggobi.github.io/ggally/#quick_coefficients_plot

• H. Kang. Lecture on Model Selection. Stanford University Notes. http://web.stanford.edu/~hskang/stat431/ModelSelection.pdf


Thank You

bchen@lincoln.ac.uk

dabdalhafeth@lincoln.ac.uk

jhua8590@gmail.com
