Regression Analysis with Linear Models
Dr. Lan Du and Dr. Ming Liu
Faculty of Information Technology, Monash University, Australia
FIT5149 week 3
(Monash) FIT5149 1 / 35
Outline
1 Simple Linear Regression
2 Multiple Linear Regression
3 Linear Regression with Qualitative Predictors
4 Extensions of Linear Models
5 Summary
Simple Linear Regression
Linear Regression for the Advertising Data in ISL
Consider the Advertising data used in “An Introduction to Statistical Learning” (ISL).
Figure: Scatterplot matrix of TV, Radio, Newspaper and Sales for the Advertising data.
Questions we might ask:
Is there a relationship between advertising budget and sales?
How strong is the relationship between advertising budget and sales?
Which media contribute to sales?
How accurately can we predict future sales?
Is the relationship linear?
Is there synergy among the advertising media?
Simple Linear Regression
Simple linear regression is a statistical method that allows us to predict a quantitative response Y on the basis of a single predictor variable X. It assumes the relationship between Y and X can be modelled by a straight line:
Y = β0 + β1X + ε
where
β0: the expected value of Y when X = 0.
β1: the average change in Y for a 1-unit change in X.
ε: the error term, which describes the random component of the linear relationship.
Assumptions:
Linearity: The response variable Y has a linear relationship to the predictor variable X.
Nearly normal residuals: The errors must be independent and normally distributed,
$$ \varepsilon \sim N(0, \sigma^2 \cdot I_{n \times n}) $$
Constant variability: The variance of the residuals is constant.
Example: Advertising data
Figure: Scatterplot of Sales versus TV for the Advertising data.
Sales ≈ βˆ0 + βˆ1 × TV
Given some estimates βˆ0 and βˆ1 for the model coefficients,
Inference: describe the linear dependency between sales and budgets for TV advertisement.
Prediction: predict future sales given a budget plan for TV advertisement,
$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x $$
where yˆ indicates the prediction of Y on the basis of X = x.
The “Ordinary Least Squares” Regression.
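Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for Y based on the ith value of X. The ith residual (i.e., error) is defined as
$$ e_i = y_i - \hat{y}_i $$
We define the residual sum of squares (RSS) as
$$ RSS = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^{n} e_i^2 $$
or equivalently as
$$ RSS = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 $$
The least squares approach chooses βˆ0 and βˆ1 to minimise the RSS.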
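For simple linear regression the minimiser of the RSS has the standard closed form βˆ1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and βˆ0 = ȳ − βˆ1 x̄. The sketch below computes these by hand and cross-checks them against lm(). The data are simulated stand-ins, not the Advertising data, and all variable names and coefficient values are assumptions made for illustration.

```r
# Simulated stand-in for the Advertising data (hypothetical values).
set.seed(1)
n     <- 200
TV    <- runif(n, min = 0, max = 300)        # budget, in $1,000s
Sales <- 7 + 0.05 * TV + rnorm(n, sd = 3)    # sales, in 1,000 units

# Closed-form least squares estimates for one predictor.
beta1_hat <- sum((TV - mean(TV)) * (Sales - mean(Sales))) / sum((TV - mean(TV))^2)
beta0_hat <- mean(Sales) - beta1_hat * mean(TV)

# RSS as a function of the two coefficients.
rss <- function(b0, b1) sum((Sales - b0 - b1 * TV)^2)
rss_min <- rss(beta0_hat, beta1_hat)

# Cross-check against R's built-in least squares fit.
fit <- lm(Sales ~ TV)
print(rbind(by_hand = c(beta0_hat, beta1_hat), lm = unname(coef(fit))))
```

Moving either coefficient away from the closed-form value, for example rss(beta0_hat + 0.1, beta1_hat), gives a larger RSS than rss_min.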
How to Fit a Regression model with RSS in R
The lm() function performs a least squares regression and creates a linear model object:
where:
Models for lm() are specified symbolically: response ∼ predictor
The intercept βˆ0 = 7.033 and the slope βˆ1 = 0.0475
The linear model object contains much more information than just the coefficients!
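The original code listing for this slide did not survive extraction. As a rough illustration of the lm() workflow described above, the sketch below fits a model on simulated data. The data frame name advertising and the simulated values are assumptions; only lmfit and the formula Sales ~ TV echo names that appear elsewhere in these slides.

```r
# Simulated stand-in data; the real Advertising data would be read from a file.
set.seed(1)
advertising <- data.frame(TV = runif(200, min = 0, max = 300))
advertising$Sales <- 7 + 0.05 * advertising$TV + rnorm(200, sd = 3)

# Formula interface: response ~ predictor.
lmfit <- lm(Sales ~ TV, data = advertising)

print(coef(lmfit))                    # intercept and slope estimates
print(names(lmfit))                   # components stored in the model object
print(head(residuals(lmfit)))         # y_i - y_hat_i for the first few rows
print(summary(lmfit)$coefficients)    # estimates, std. errors, t values, p-values
```

The coefficient table returned by summary() is the part that the following slides interpret column by column.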
Interpret Simple Linear Regression Model
Assessing the Accuracy of the Coefficient Estimates
Coefficient – Std. Error: measures how precisely the model estimates the coefficient’s unknown value.
SE(βˆ0) = 0.457843: in the absence of any advertising, the average sales can vary by 457.843 units.
SE(βˆ1) = 0.002691: for each $1,000 increase in television advertising, the average increase in sales can vary by 2.691 units.
These standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. For β1 it has approximately the form
$$ \hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1) $$
That is, there is approximately a 95% chance that the interval
$$ \left[ \hat{\beta}_1 - 2 \cdot SE(\hat{\beta}_1),\ \hat{\beta}_1 + 2 \cdot SE(\hat{\beta}_1) \right] $$
will contain the true value of β1.
In the case of the advertising data
− The 95% confidence interval for β0 is [6.130, 7.935]
− The 95% confidence interval for β1 is [0.042, 0.053]
Use the confidence interval to assess the reliability of the estimate of the coefficient.
Standard errors can also be used to perform hypothesis tests on the coefficient.
− H0: There is no relationship between X and Y, i.e., β1 = 0
− Ha: There is some relationship between X and Y, i.e., β1 ≠ 0
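The confidence intervals quoted above can be reproduced from the reported estimates and standard errors. The sketch below compares the ±2·SE rule of thumb with the interval based on the exact t quantile. The sample size n = 200 is an assumption (the size of the ISL Advertising data); it is not stated on these slides.

```r
# Estimates and standard errors as reported for the regression of Sales on TV.
est <- c(Intercept = 7.032594, TV = 0.047537)
se  <- c(Intercept = 0.457843, TV = 0.002691)

n  <- 200       # assumption: number of markets in the Advertising data
df <- n - 2     # two coefficients are estimated

# Rule of thumb: estimate +/- 2 standard errors.
ci_approx <- cbind(lower = est - 2 * se, upper = est + 2 * se)

# Using the 97.5% quantile of the t distribution with n - 2 degrees of freedom.
tq <- qt(0.975, df)
ci_exact <- cbind(lower = est - tq * se, upper = est + tq * se)

print(round(ci_approx, 3))
print(round(ci_exact, 3))
```

Under the n = 200 assumption the t-quantile intervals agree with the intervals listed above to three decimals; the ±2·SE version is slightly wider.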
Coefficient – t statistic
$$ t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} $$
which measures the number of standard deviations that βˆ1 is away from 0.
A large |t| value indicates that the null hypothesis can be rejected.
A small |t| value indicates that rejecting the null hypothesis could cause a Type I error.
Question: How large is large?
Coefficient – Pr(>|t|) (i.e., p-value): tests the predictive power of the predictor variable, i.e., TV
Small p-value (Pr(> |t|) < α = 0.001): reject the null hypothesis
− Changes in the predictor’s value are related to changes in the response variable.
Use the coefficient p-values to determine which terms to keep in the regression model.
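To make "how large is large" concrete, the t statistic and its two-sided p-value for the TV coefficient can be computed from the reported estimate and standard error. This is a sketch only: the degrees of freedom assume n = 200 observations, which is not stated on these slides.

```r
beta1_hat <- 0.047537    # reported slope for TV
se_beta1  <- 0.002691    # reported standard error
df        <- 200 - 2     # assumption: n = 200 markets, two estimated coefficients

# Number of standard errors between the estimate and the null value 0.
t_stat <- (beta1_hat - 0) / se_beta1

# Two-sided p-value: probability of |T| >= |t_stat| when H0 is true.
p_value <- 2 * pt(-abs(t_stat), df)

# Smallest |t| that would be rejected at alpha = 0.001.
t_crit <- qt(1 - 0.001 / 2, df)

print(c(t = t_stat, p = p_value, t_crit = t_crit))
```

The observed t value is far beyond the critical value, which is why the summary output reports the p-value only as being below 2e-16.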
Assessing the Accuracy of the Model
Residual standard error (RSE): an estimate of the standard deviation of the error term ε.
$$ RSE = \sqrt{\frac{1}{n-2} RSS} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$
A measure of the quality of a linear regression fit, or a measure of the lack of fit of the model.
The advertising data: RSE = 3.259.
− Actual sales in each market deviate from the true regression line by approximately 3,259 units, on average.
− The percentage error: 3,259/14,000 ≈ 23%, where 14,000 is the approximate mean value of sales.
The Coefficient of Determination (i.e., the R² statistic): measures the proportion of variability in Y that can be explained using X.
$$ R^2 = 1 - \frac{RSS}{TSS} $$
where the total sum of squares is $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
0 ≤ R² ≤ 1: the larger R² is, the better the model fits the actual data.
The advertising data: R² = 0.6119.
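Both accuracy measures follow directly from the residuals. The sketch below computes the RSE and R² by hand and compares them with the values stored in summary(). The data are simulated, so the numbers will not match the Advertising results; the variable names are illustrative.

```r
# Simulated stand-in data (hypothetical values).
set.seed(1)
n     <- 200
TV    <- runif(n, min = 0, max = 300)
Sales <- 7 + 0.05 * TV + rnorm(n, sd = 3)
fit   <- lm(Sales ~ TV)

rss <- sum(residuals(fit)^2)            # residual sum of squares
tss <- sum((Sales - mean(Sales))^2)     # total sum of squares

rse <- sqrt(rss / (n - 2))              # residual standard error
r2  <- 1 - rss / tss                    # coefficient of determination
pct_error <- rse / mean(Sales)          # RSE relative to the mean response

print(c(RSE = rse, R2 = r2, pct_error = pct_error))
print(c(RSE = summary(fit)$sigma, R2 = summary(fit)$r.squared))
```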
Normality: are residuals normally distributed?
Residuals are the differences between the observed response values and the response values predicted by the model.
Ideally, residuals should be normally distributed, with
E(ε) = 0
When assessing how well the model fits the data, look for residuals that are distributed symmetrically around the mean value 0.
Figure: Scatterplot of Sales versus TV for the Advertising data.
Figure: Distribution of studentized residuals (density of sresid).
Figure: QQ plot of lmfit$residuals against normal quantiles.
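The two normality diagnostics can be sketched with base R. The code below is an illustration on simulated data, not the slides' own listing: sresid and lmfit follow the labels visible in the plots, while the data and the summary measure qq_cor are assumptions made here.

```r
# Simulated stand-in data with normal errors (hypothetical values).
set.seed(1)
n     <- 200
TV    <- runif(n, min = 0, max = 300)
Sales <- 7 + 0.05 * TV + rnorm(n, sd = 3)
lmfit <- lm(Sales ~ TV)

# Studentized residuals and their estimated density.
sresid <- rstudent(lmfit)
dens   <- density(sresid)

# QQ data without drawing: theoretical (x) versus sample (y) quantiles.
qq     <- qqnorm(sresid, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)   # close to 1 when the points lie near a straight line

# Draw the plots only in an interactive session.
if (interactive()) {
  hist(sresid, freq = FALSE, main = "Distribution of Studentized Residuals")
  lines(dens)
  qqnorm(lmfit$residuals)
  qqline(lmfit$residuals)
}
print(qq_cor)
```

Heavy tails or skewness in the residuals show up as points bending away from the reference line at the ends of the QQ plot.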
Linearity: is the relationship between predictor and response variables linear?
If residuals are randomly spread around a horizontal line without distinct patterns, that is a good indication that there is no unmodelled non-linear relationship.
What is the difference in the plots between linear models trained on datasets where
− the relationship between predictors and the response variable is linear?
− the relationship between predictors and the response variable is not linear?
Figure: Plots of residuals versus predicted (or fitted) values for the Advertising dataset (Residuals vs Fitted, lm(Sales ~ TV)).
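One way to explore the question above is to fit a straight line to one dataset that is truly linear and one that is quadratic, then look for leftover structure in the residuals. This is a sketch on simulated data; the helper curvature() is a hypothetical summary introduced here, not a standard diagnostic from the slides.

```r
set.seed(1)
n <- 200
x <- seq(0, 10, length.out = n)

y_lin  <- 1 + 2 * x + rnorm(n)        # truly linear relationship
y_quad <- 1 + 0.5 * x^2 + rnorm(n)    # quadratic relationship

fit_lin  <- lm(y_lin ~ x)
fit_quad <- lm(y_quad ~ x)

# Hypothetical helper: correlation between the residuals and the squared,
# centred fitted values. A value near 0 suggests no U-shaped pattern.
curvature <- function(fit) {
  f <- fitted(fit)
  cor(residuals(fit), (f - mean(f))^2)
}

print(c(linear = curvature(fit_lin), quadratic = curvature(fit_quad)))

if (interactive()) {
  plot(fit_lin,  which = 1)   # Residuals vs Fitted: no pattern expected
  plot(fit_quad, which = 1)   # Residuals vs Fitted: clear U shape expected
}
```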
Constant variance (homoscedasticity)
Are residuals spread equally along the ranges of predictors?
Figure: Scale-Location plot (standardized residuals versus fitted values) for lm(Sales ~ TV).
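The idea behind the Scale-Location plot can be sketched numerically: if the variance is constant, the size of the standardized residuals should not trend with the fitted values. The data below are simulated, and spread_trend() is a hypothetical helper defined only for this illustration.

```r
set.seed(1)
n <- 200
x <- seq(0.1, 10, length.out = n)

y_const <- 1 + 2 * x + rnorm(n, sd = 2)         # constant error variance
y_fan   <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)   # error spread grows with x

fit_const <- lm(y_const ~ x)
fit_fan   <- lm(y_fan ~ x)

# Hypothetical helper: trend of sqrt(|standardized residual|) along the fit.
spread_trend <- function(fit) cor(fitted(fit), sqrt(abs(rstandard(fit))))

print(c(constant = spread_trend(fit_const), fan = spread_trend(fit_fan)))

if (interactive()) {
  plot(fit_const, which = 3)   # Scale-Location: roughly flat
  plot(fit_fan,   which = 3)   # Scale-Location: upward trend
}
```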
Residuals vs. Leverage: which data samples are influential in the fitting?
Figure: Residuals vs Leverage plot (standardized residuals versus leverage, with Cook's distance contours) and Cook's distance by observation number for lm(Sales ~ TV); observations 26, 36 and 179 are labelled.
Watch out for outlying values at the upper right corner or at the lower right corner.
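Leverage, standardized residuals and Cook's distance are all available from a fitted lm object. The sketch below computes them on simulated data (hypothetical values) and lists the three most influential observations, analogous to the labelled points in the plots.

```r
# Simulated stand-in data (hypothetical values).
set.seed(1)
n     <- 200
TV    <- runif(n, min = 0, max = 300)
Sales <- 7 + 0.05 * TV + rnorm(n, sd = 3)
fit   <- lm(Sales ~ TV)

lev  <- hatvalues(fit)         # leverage h_i of each observation
sres <- rstandard(fit)         # standardized residuals r_i
cook <- cooks.distance(fit)    # Cook's distance D_i

# Cook's distance combines both: D_i = r_i^2 / k * h_i / (1 - h_i),
# where k is the number of estimated coefficients (here 2).
cook_by_hand <- sres^2 / 2 * lev / (1 - lev)

# Indices of the three most influential observations.
top3 <- order(cook, decreasing = TRUE)[1:3]
print(data.frame(obs = top3, leverage = lev[top3], std_resid = sres[top3], cook = cook[top3]))
```

A point needs both a large residual and high leverage to obtain a large Cook's distance, which is why the corners on the right of the plot deserve attention.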
Multiple Linear Regression
Example: the Advertising Data
Figure: Scatterplots of Sales against TV, Radio and Newspaper.
How can we extend our analysis of the advertising data in order to accommodate these two additional predictors?
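Problems with fitting a separate simple linear regression for each of TV, Radio and Newspaper:
− It is unclear how to predict sales given the three advertising media budgets.
− Each regression ignores the correlation between the predictors TV, Radio and Newspaper.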
Multiple Linear Regression
The multiple linear regression model:
$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon, \quad \text{where } \varepsilon \sim N(0, \sigma^2 I_{n \times n}) $$
βj : the average effect on Y of a one unit increase in Xj , holding all other predictors fixed.
In the advertising example, the model becomes
Sales = β0 + β1 × TV + β2 × Radio + β3 × Newspaper + ε
Estimating the Regression Coefficients
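Given estimates βˆ0, βˆ1, ..., βˆp, we can make predictions using the formula
$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p $$
We estimate β0, β1, ..., βp as the values that minimise the sum of squared residuals
$$ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_p x_{ip})^2 $$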
This can be done using standard statistical software.
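Behind the software, the minimiser of the RSS solves the normal equations (XᵀX) β = Xᵀy, where X is the design matrix with a leading column of ones. The sketch below solves them directly and compares the result with lm(). The data are simulated with hypothetical coefficients; in practice lm() uses a numerically more stable QR decomposition rather than this direct solve.

```r
# Simulated stand-in for the three advertising budgets (hypothetical values).
set.seed(1)
n         <- 200
TV        <- runif(n, min = 0, max = 300)
Radio     <- runif(n, min = 0, max = 50)
Newspaper <- runif(n, min = 0, max = 100)
Sales     <- 3 + 0.045 * TV + 0.19 * Radio + rnorm(n, sd = 1.7)

# Design matrix with an intercept column.
X <- cbind(Intercept = 1, TV, Radio, Newspaper)

# Solve the normal equations (X'X) beta = X'y.
beta_hat <- solve(crossprod(X), crossprod(X, Sales))

# Fitted values and RSS from the hand-computed coefficients.
y_hat <- X %*% beta_hat
rss   <- sum((Sales - y_hat)^2)

fit <- lm(Sales ~ TV + Radio + Newspaper)
print(cbind(normal_eq = as.vector(beta_hat), lm = unname(coef(fit))))
```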
Results for Advertising Data
Results:
Compare simple linear regression with multiple linear regression:
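Simple linear regression of Sales on TV:
            Coefficient   Std. error   t value   p-value
Intercept   7.032594      0.457843     15.36     <2e-16
TV          0.047537      0.002691     17.67     <2e-16

Simple linear regression of Sales on Radio:
            Coefficient   Std. error   t value   p-value
Intercept   9.31164       0.56290      16.542    <2e-16
Radio       0.20250       0.02041      9.921     <2e-16

Simple linear regression of Sales on Newspaper:
            Coefficient   Std. error   t value   p-value
Intercept   12.35141      0.62142      19.88     <2e-16
Newspaper   0.05469       0.01658      3.30      0.00115

Multiple linear regression of Sales on TV, Radio and Newspaper:
            Coefficient   Std. error   t value   p-value
Intercept   2.938889      0.311908     9.422     <2e-16
TV          0.045765      0.001395     32.809    <2e-16
Radio       0.188530      0.008611     21.893    <2e-16
Newspaper   -0.001037     0.005871     -0.177    0.86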
The multiple linear regression suggests that there is no relationship between sales and newspaper while the simple linear regression implies the opposite.
Correlation matrix for TV, Radio, Newspaper, and sales.
The correlation between radio and newspaper is 0.354.
This indicates a tendency to spend more on newspaper advertising in markets where more is spent on radio advertising.
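The disagreement between the simple and the multiple regression for Newspaper can be reproduced in a small simulation. In the sketch below Sales depends only on TV and Radio, while Newspaper is merely correlated with Radio. All values are hypothetical, and a larger sample (n = 1000) is used so that the effect is stable across random seeds.

```r
set.seed(1)
n     <- 1000
TV    <- runif(n, min = 0, max = 300)
Radio <- runif(n, min = 0, max = 50)

# Newspaper spend tends to be higher where Radio spend is higher.
# (Simulated budgets may be negative; that is harmless for this illustration.)
Newspaper <- 0.6 * Radio + rnorm(n, sd = 20)

# Sales are generated WITHOUT any direct Newspaper effect.
Sales <- 3 + 0.045 * TV + 0.19 * Radio + rnorm(n, sd = 1.7)

cor_rn   <- cor(Radio, Newspaper)
simple   <- lm(Sales ~ Newspaper)
multiple <- lm(Sales ~ TV + Radio + Newspaper)

b_simple   <- coef(simple)["Newspaper"]     # picks up the effect of Radio
b_multiple <- coef(multiple)["Newspaper"]   # close to the true value 0

print(c(cor_radio_newspaper = cor_rn, simple = unname(b_simple), multiple = unname(b_multiple)))
```

In the simple regression Newspaper acts as a surrogate for Radio; once Radio is held fixed in the multiple regression, the apparent Newspaper effect disappears.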
Some Important Questions
Is at least one of the predictors X1, X2, . . . , Xp useful in predicting the response?
Do all the predictors help to explain Y , or is only a subset of the predictors useful?
How well does the model fit the data?
Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
F-Statistics
Is there a relationship between the response and predictors?
Hypothesis testing
− Null hypothesis: There is no relationship between Y and X1, X2, ..., Xp.
$$ H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 $$
− The alternative: There is at least one Xj related to Y.
$$ H_a: \beta_j \neq 0 \text{ for at least one } j \in \{1, \ldots, p\} $$
F-statistic: a good indicator of whether there is a relationship between the predictors and the response variable.
$$ F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)} $$
− F-value close to 1: no relationship between Y and X1, X2, ..., Xp.
− F-value greater than 1: a relationship between the predictors and the response variable.
For the F-value, how large is large?
− n is large: an F-value only slightly larger than 1 may still provide evidence against H0.
− n is small: a larger F-value is needed to reject H0.
Example: multiple linear regression on the advertising dataset
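The F-statistic formula can be sketched in a few lines of Python. The function names are illustrative. Since R² = 1 − RSS/TSS, the same statistic can also be written in terms of R². The example uses the R² of 0.8972 that these slides report for Sales ∼ TV + Radio and assumes n = 200 markets, the size of the ISL Advertising data; that sample size is an assumption here, as it is not stated in this section.

```python
def f_statistic(tss, rss, n, p):
    """F = [(TSS - RSS)/p] / [RSS/(n - p - 1)] for a model with p predictors."""
    return ((tss - rss) / p) / (rss / (n - p - 1))

def f_from_r2(r2, n, p):
    """Same statistic in terms of R^2, using R^2 = 1 - RSS/TSS."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

# R^2 = 0.8972 is taken from the slides (Sales ~ TV + Radio);
# n = 200 is an assumption about the size of the Advertising data.
f_additive = f_from_r2(0.8972, n=200, p=2)
print(round(f_additive, 1))
```

Under these assumptions the result should land close to the F-statistic of 859.6 that the slides report for the same model; small differences come from the rounding of R².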
The Analysis of Variance (ANOVA)
ANOVA tests whether the reduction in the residual sum of squares between two models is statistically significant.
Note that this makes sense only if lm.1 and lm.2, the two fitted models being compared, are nested models.
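The comparison of two nested models can be sketched as a partial F-test on their residual sums of squares. This is a minimal sketch rather than the output of any particular software; the function name and the RSS values are hypothetical.

```python
def partial_f(rss_small, rss_big, n, p_big, q):
    """Partial F-statistic for two nested least-squares models.

    rss_small: RSS of the smaller (restricted) model
    rss_big:   RSS of the larger model, which has p_big predictors
    q:         number of extra coefficients in the larger model
    """
    return ((rss_small - rss_big) / q) / (rss_big / (n - p_big - 1))

# Hypothetical numbers: adding one predictor reduces RSS from 120 to 100.
f_value = partial_f(rss_small=120.0, rss_big=100.0, n=50, p_big=3, q=1)
print(f_value)
```

The statistic is then compared against an F distribution with q and n − p_big − 1 degrees of freedom. The test is only meaningful when the smaller model's predictors are a subset of the larger model's, which is the nesting requirement noted above.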
Confidence interval vs. Prediction interval
Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Prediction: given the estimated coefficients β̂0, β̂1, ..., β̂p,
Ŷ = β̂0 + Σ_{i=1}^{p} β̂i Xi
To determine how close Ŷ will be to f(X):
− Confidence interval: used to quantify the uncertainty around the expected value of predictions (the average of a group of predictions), e.g. the uncertainty of predicting the average sales over a number of markets.
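− Prediction interval: used to quantify the uncertainty around a single prediction, e.g. the uncertainty of predicting sales given the TV and Radio advertising budgets for a particular market. Because it also accounts for the irreducible error ε, it is always wider than the confidence interval.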
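The difference between the two intervals can be illustrated with a small sketch. To stay short it makes several assumptions not in the slides: a single predictor, a made-up data set, and a normal quantile in place of the t quantile (a large-sample approximation). The function name is hypothetical.

```python
import math
from statistics import NormalDist

def intervals(x, y, x0, level=0.95):
    """Approximate confidence and prediction intervals at x0 for simple OLS."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (n - 2))             # residual standard error
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation to the t quantile
    y_hat = b0 + b1 * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = s * math.sqrt(leverage)        # uncertainty of the average response
    se_pred = s * math.sqrt(1 + leverage)    # adds the irreducible error
    conf = (y_hat - z * se_mean, y_hat + z * se_mean)
    pred = (y_hat - z * se_pred, y_hat + z * se_pred)
    return y_hat, conf, pred

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
y_hat, conf_int, pred_int = intervals(x, y, x0=4.5)
print(y_hat, conf_int, pred_int)
```

Both intervals are centred on the same prediction; the prediction interval is wider because its standard error carries the extra term for the noise in a single observation.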
Linear Regression with Qualitative Predictors
Some predictors are not quantitative but are qualitative, taking a discrete set of values.
These are also called categorical predictors or factor variables.
Figure: The credit card dataset, which contains both quantitative variables (e.g., income, limit, rating, and age) and qualitative variables (e.g., gender, student, married, and ethnicity).
Dummy coding: making many variables out of one
A categorical variable with k levels is transformed into k − 1 variables, each with two levels.
For example, for the ethnicity variable we create two dummy variables. The first could be
xi,1 = 1 if the ith person is Asian
xi,1 = 0 if the ith person is not Asian
and the second could be
xi,2 = 1 if the ith person is Caucasian
xi,2 = 0 if the ith person is not Caucasian
Then both of these variables can be used in the regression equation, in order to obtain the model
yi = β0 + β1 xi,1 + β2 xi,2 + εi
   = β0 + β1 + εi if the ith person is Asian
   = β0 + β2 + εi if the ith person is Caucasian
   = β0 + εi if the ith person is African American
Example: the Credit dataset.
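The dummy coding described above can be sketched in plain Python. The helper name and the four-person sample are hypothetical; the baseline level is the one that receives no dummy variable, so its mean response is absorbed by the intercept β0.

```python
def dummy_code(values, baseline):
    """Turn a categorical column with k levels into k - 1 indicator columns."""
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical sample of the ethnicity variable.
ethnicity = ["Asian", "Caucasian", "African American", "Asian"]
levels, rows = dummy_code(ethnicity, baseline="African American")
print(levels)  # ['Asian', 'Caucasian']
print(rows)    # [[1, 0], [0, 1], [0, 0], [1, 0]]
```

Each row holds (xi,1, xi,2). The African American row is all zeros, which matches the β0 + εi case in the model above.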
Extension of Linear models
Additive and Linear assumptions
Two of the most important assumptions on the relationship between predictors and response:
Sales = β0 + β1 × TV + β2 × Radio
Additive: the effect of changes in Xj on Y is independent of Xi for i ≠ j.
Linear: the change in Y due to a one-unit change in Xj is constant, regardless of the value of Xj.
Can we remove the additive assumption?
Interaction between variables
Synergy effect (or interaction effect):
For example, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio.
Spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.
Figure: Over-estimate vs. under-estimate without considering interaction between predictors (axes: TV, Radio, Sales)
A model with an interaction term takes the form
Sales = β0 + β1 × TV + β2 × Radio + β3 × (TV × Radio) + ε
      = β0 + (β1 + β3 × Radio) × TV + β2 × Radio + ε
β3: the increase in the effectiveness of TV advertising for a one-unit increase in radio advertising (or vice versa).
Results: strong evidence for Ha: β3 ≠ 0, i.e. the true relationship is not additive.
The R² and F-statistic:
Model                               R²       F-statistic
Sales ∼ TV + Radio                  0.8972   859.6
Sales ∼ TV + Radio + TV ∗ Radio     0.9678   1963
(0.9678 − 0.8972)/(1 − 0.8972) ≈ 69% of the variability in sales that remains after fitting the additive model is explained by the interaction term.
Interpret coefficients:
− An increase in TV advertising of $1,000 is associated with increased sales of (β̂1 + β̂3 × Radio) × 1000 = 19 + 1.1 × Radio units.
− An increase in radio advertising of $1,000 is associated with an increase in sales of (β̂2 + β̂3 × TV) × 1000 = 29 + 1.1 × TV units.
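To see how the interaction term makes each slope depend on the other budget, the two expressions above can be evaluated directly. This is only a sketch: the function names are hypothetical, the constants 19, 29 and 1.1 are the rounded values quoted on the slide, and budgets are assumed to be in the dataset's own units.

```python
def sales_gain_per_1000_tv(radio):
    """Extra sales units per additional $1,000 of TV spend, at a given radio budget."""
    return 19.0 + 1.1 * radio

def sales_gain_per_1000_radio(tv):
    """Extra sales units per additional $1,000 of radio spend, at a given TV budget."""
    return 29.0 + 1.1 * tv

# The TV slope grows with the radio budget: the synergy effect.
for radio in (0.0, 10.0, 50.0):
    print(radio, sales_gain_per_1000_tv(radio))
```

In a purely additive model these functions would return a constant; here the gain from TV advertising is larger in markets with a larger radio budget, and vice versa.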
The hierarchy principle: if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
Summary
What we have covered:
Simple linear regression with ordinary least squares
Various regression diagnostics
− Assess the accuracy of the estimated coefficients
− Assess the accuracy of the model
− Residual analysis
Multiple linear regression
Categorical variables in regression
Extension of linear regression: interaction between variables
What we haven’t covered:
Outliers
High leverage points
Collinearity
Comparison of linear regression with K-Nearest Neighbors
See sections 3.3.3, 3.4 and 3.5 of "Introduction to Statistical Learning"
Reference
Reading materials:
"Linear Regression", Chapter 3 of "Introduction to Statistical Learning", 6th edition
"Linear Regression and ANOVA", Chapter 11 of "R Cookbook" by Paul Teetor, available online from Monash Library.
Some figures in this presentation were taken from
"An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
https://data.library.virginia.edu/diagnostic-plots/
Some of the slides are reproduced based on the slides from T. Hastie and R. Tibshirani