
Model Selection

Bowei Chen

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M Data Science

Today’s Objectives

• Basic Setup of the Learning from Data

• Cross-Validation Methods

• Appendix A: Testing-Based/Stepwise Procedures

• Appendix B: Criterion-Based Procedures

Limitation of Linear Regression

Housing dataset (first 10 observations):

  Obs   Price   Fullbase
   1     420       1
   2     385       0
   3     495       0
   4     605       0
   5     610       0
   6     660       1
   7     660       1
   8     690       0
   9     838       1
  10     885       0
   …      …        …

Here the binary variable Fullbase is the response variable and Price is the predictor. Ordinary linear regression is not well suited to a 0/1 response: its fitted values are not constrained to lie between 0 and 1.
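As a quick illustration of why this is a problem (a minimal R sketch, not from the original slides; the data frame housing and its column names are assumptions, with the binary fullbase taken as the response):

# housing: assumed data frame with columns price and fullbase (coded 0/1)
fit <- lm(fullbase ~ price, data = housing)   # ordinary least squares on a 0/1 response

range(fitted(fit))   # fitted values need not stay within [0, 1], so they are hard to
                     # read as probabilities; this motivates logistic regression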

Logit Function and Odds Ratio

The logit function of p, where p is between 0 and 1, can be expressed as

logit(p) = log( p / (1 − p) ) = log(p) − log(1 − p).

The quantity p / (1 − p) is called the odds ratio.

As p → 0, logit(p) → −∞; as p → 1, logit(p) → ∞.
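For concreteness (a minimal R sketch, not from the original slides), the logit and the odds ratio can be computed directly:

# logit maps a probability p in (0, 1) onto the whole real line
logit <- function(p) log(p / (1 - p))
odds  <- function(p) p / (1 - p)

logit(0.5)   # 0: the odds are even
logit(0.9)   # about  2.197 (odds of 9 to 1)
logit(0.1)   # about -2.197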

Logistic Function

The logit function is the inverse of the logistic function. If we let α = logit(p), then

p = logistic(α) = logit⁻¹(α) = 1 / (1 + e^(−α)) = e^α / (1 + e^α).
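Continuing the sketch above (again an illustrative assumption, not part of the slides), we can check in R that the logistic function undoes the logit:

logistic <- function(a) 1 / (1 + exp(-a))

p <- 0.3
a <- logit(p)    # logit() as defined in the previous sketch
logistic(a)      # returns 0.3, recovering p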

Simple Logistic Regression

The logit of the underlying probability p_i is a linear function of the predictor:

logit(p_i) = β_0 + β_1 x_i ,

and then

p_i = 1 / (1 + e^(−(β_0 + β_1 x_i))) = e^(β_0 + β_1 x_i) / (1 + e^(β_0 + β_1 x_i)).

We should predict y_i = 1 when p_i ≥ 0.5 and y_i = 0 when p_i < 0.5. This means predicting 1 whenever β_0 + β_1 x_i is non-negative, and 0 otherwise. So logistic regression gives us a linear classifier. The decision boundary separating the two predicted classes is the solution of β_0 + β_1 x_i = 0.

Simple Logistic Regression: Parameter Estimation

We use the maximum likelihood estimation (MLE) method to estimate the parameters. For n observations with y_i = 0 or 1, the likelihood function can be written as

ℒ(β_0, β_1) = ℙ(y | x, β_0, β_1) = ∏_{i=1}^{n} p_i^{y_i} (1 − p_i)^{1 − y_i},   where p_i = e^(β_0 + β_1 x_i) / (1 + e^(β_0 + β_1 x_i)).

Then

(β_0, β_1) = argmax log ℒ(β_0, β_1).

We can quickly obtain β_0, β_1 in R for logistic regression. The algorithmic solution of the parameter estimation is not required, but I hope you will look it up after class.

Multiple Logistic Regression

The logit of the underlying probability p_i is a linear function of the predictors:

logit(p_i) = x_i^T β,

where β = (β_0, β_1, …, β_p)^T and x_i = (1, x_{i,1}, …, x_{i,p})^T, so the design matrix has rows (1, x_{i,1}, …, x_{i,p}) for i = 1, …, n. Then

p_i = e^(x_i^T β) / (1 + e^(x_i^T β)).

"Model selection is a process of seeking the model in a set of candidate models that gives the best balance between model fit and complexity."
— Burnham and Anderson, Model Selection and Multimodel Inference

Basic Setup of the Learning from Data

Source: Y. Abu-Mostafa, M. Magdon-Ismail and H. Lin. Learning from Data. AMLbook.com, 2012, Chapter 1

A Simple Regression Problem

Housing dataset:

  Obs   Price   House size
   1     420       5850
   2     385       4000
   3     495       3060
   …      …          …

Which Regression Model Fits Data Best?

• Simple Linear Regression: y_i = β_0 + β_1 x_i + ε_i
• Quadratic Regression: y_i = β_0 + β_1 x_i + β_2 x_i^2 + ε_i
• Piecewise Linear Nonparametric Regression (or Join-the-Dots)

How Well Can the "Best Fit" Model Be Used for Predicting Future Data Drawn from the Same Distribution?

The same three candidate models are compared on new data drawn from the same distribution.

Cross Validation (CV) Methods

Test Set Method

1) Randomly choose 30% of the data to be in a test set
2) The remainder is a training set
3) Perform your regression on the training set
4) Estimate your future performance with the test set

Test Set Method Results

• Simple Linear Regression (y_i = β_0 + β_1 x_i + ε_i): Mean Squared Error = 120
• Quadratic Regression (y_i = β_0 + β_1 x_i + β_2 x_i^2 + ε_i): Mean Squared Error = 98
• Piecewise Linear Nonparametric Regression (Join-the-Dots): Mean Squared Error = 130

LOOCV (Leave-One-Out Cross Validation)

For k = 1 to n:
1) Let (x_k, y_k) be the k-th record
2) Temporarily remove (x_k, y_k) from the dataset
3) Train on the remaining n − 1 data points
4) Note your error on (x_k, y_k)

When you’ve done all points, report the mean error.
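A minimal R sketch of LOOCV for the simple linear regression case (illustrative only: the data frame housing and its column names are assumptions, and price is taken as the response):

# housing: assumed data frame with columns price (response) and size (predictor)
n <- nrow(housing)
errors <- numeric(n)

for (k in 1:n) {
  train <- housing[-k, ]                        # temporarily remove the k-th record
  fit   <- lm(price ~ size, data = train)       # train on the remaining n - 1 points
  pred  <- predict(fit, newdata = housing[k, ])
  errors[k] <- (housing$price[k] - pred)^2      # squared error on the held-out point
}

mean(errors)   # LOOCV estimate of the mean squared error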
k-Fold Cross Validation

Break the dataset into k partitions randomly. In this example, we’ll have k = 3 partitions, coloured blue, green and violet.

• For the blue partition, train on all the points not in the blue partition, then find the test-set sum of errors on the blue points.
• For the green partition, train on all the points not in the green partition, then find the test-set sum of errors on the green points.
• For the violet partition, train on all the points not in the violet partition, then find the test-set sum of errors on the violet points.

Then report the mean error.

Comparison of Different Kinds of Cross Validation

• Test-set — Advantage: cheap. Disadvantage: unreliable estimate of future performance.
• Leave-one-out — Advantage: doesn’t waste data. Disadvantage: computationally expensive, and has some weird behaviour.
• k-fold — Advantage: only wastes a certain percentage of the data while still giving a reliable estimate of future performance. Disadvantage: wastes a certain percentage of the data, and is computationally expensive if k is large.
• n-fold — identical to leave-one-out (the case k = n).

References

• Y. Abu-Mostafa, M. Magdon-Ismail and H. Lin. Learning from Data. AMLbook.com, 2012, Chapter 1.
• A. Moore. Cross-Validation for Detecting and Preventing Overfitting. Carnegie Mellon University lecture slides.
• M. Ugarte, A. Militino, and A. Arnholt. Probability and Statistics with R. CRC Press, 2002, Chapter 12.

Thank You!
bchen@Lincoln.ac.uk

Appendix A: Testing-Based/Stepwise Procedures

• Backward Elimination
• Forward Selection
• Stepwise Selection

Backward Elimination

1) Start with all the predictors in the model.
2) Remove a predictor if it doesn’t contribute to the model.
3) Refit the model and go to step 2).
4) Stop when all the remaining predictors make a contribution.

Forward Selection

This just reverses the backward method.

1) Start with no variables in the model.
2) For all predictors not in the model, check their contribution/relationship with the response variable, and add the predictor that contributes most.
3) Continue until no new predictors can be added.

Stepwise Selection/Bidirectional Elimination

It is a combination of backward elimination and forward selection. It addresses the situation where variables are added or removed early in the process and we want to change our mind about them later. At each stage a variable may be added or removed, and there are several variations on exactly how this is done.

Appendix B: Criterion-Based Procedures

• Mallows’ C_p
• AIC & BIC
• R² Adjusted

Mallows’ C_p

It is used to assess the fit of a regression model that has been estimated using (ordinary) least squares. It is defined by

C_p = SSE / σ̂² + 2p − n,

where σ̂² is the error-variance estimate from the model with all predictors and SSE is for the model with p parameters. When all the predictors are used in the model, C_p = p. A model with a bad fit will produce a C_p much bigger than p. Desirable models have small p and C_p less than or equal to p. It is common practice to plot C_p against p.

AIC & BIC

The Akaike Information Criterion (AIC) and the Bayes Information Criterion (BIC) are other commonly used criteria. In general, they are defined by

AIC = −2 log ℒ + 2p,
BIC = −2 log ℒ + p log(n),

where ℒ is the maximised value of the likelihood function. We want to minimise AIC or BIC. Larger models will fit better, and so have smaller RSS, but they use more parameters. Thus the best choice of model will balance fit with model size: the lower the AIC or BIC value, the better the model.

R_a² (or R² Adjusted)

R_a² is defined by

R_a² = 1 − [(n − 1) / (n − p)] · SSE / SST = 1 − [(n − 1) / (n − p)] · Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)².

R_a² is used instead of R², since R² will always increase with the addition of more predictors to the model.
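As a brief illustration of Appendices A and B (a minimal R sketch, not from the original slides; the data frame housing and the predictor names are assumptions), stepwise procedures driven by AIC can be run with step(), and criterion values compared directly:

# housing: assumed data frame with response price and several candidate predictors
full <- lm(price ~ size + lotsize + bedrooms + fullbase, data = housing)  # all predictors
null <- lm(price ~ 1, data = housing)                                     # intercept only

# Backward elimination, forward selection and stepwise (bidirectional) selection by AIC
backward <- step(full, direction = "backward")
forward  <- step(null, scope = formula(full), direction = "forward")
both     <- step(null, scope = formula(full), direction = "both")

# Criterion values for two candidate models
AIC(full, backward)
BIC(full, backward)
summary(backward)$adj.r.squared   # adjusted R-squared of the selected model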