Regression Analysis with Linear Models
Dr. Lan Du and Dr. Ming Liu
Faculty of Information Technology, Monash University, Australia FIT5149 week 3
(Monash) FIT5149 1 / 35

Outline
1 Simple Linear Regression
2 Multiple Linear Regression
3 Linear Regression with Qualitative Predictors
4 Extension of Linear Models
5 Summary

Simple Linear Regression
Linear Regression for the Advertising Data in ISL
Consider the advertising data used in “Introduction to Statistical Learning”.
[Figure: scatterplot matrix of TV, Radio, Newspaper and Sales for the Advertising data.]
Questions we might ask:
Is there a relationship between advertising budget and sales?
How strong is the relationship between advertising budget and sales?
Which media contribute to sales?
How accurately can we predict future sales?
Is the relationship linear?
Is there synergy among the advertising media?

Simple Linear Regression
Simple linear regression is a statistical method that allows us to predict a quantitative response Y on the basis of a single predictor variable X. It assumes the relationship between Y and X can be modelled by a straight line:
\[Y = \beta_0 + \beta_1 X + \varepsilon\]
where
• \(\beta_0\): the expected value of Y when X = 0.
• \(\beta_1\): the average change in Y for a one-unit change in X.
• \(\varepsilon\): the error term, which describes the random component of the linear relationship.
Assumptions:
• Linearity: the response variable Y has a linear relationship to the predictor variable X.
• Nearly normal residuals: the errors must be independent and normally distributed,
\[\varepsilon \sim N(0, \sigma^2 I_{n \times n})\]
• Constant variability: the variance of the residuals is constant.

Example: Advertising data
[Figure: scatter plot of Sales against TV.]
\[\text{Sales} \approx \hat{\beta}_0 + \hat{\beta}_1 \times \text{TV}\]
Given some estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) for the model coefficients,
• Inference: describe the linear dependency between sales and the TV advertising budget.
• Prediction: predict future sales given a budget plan for TV advertising,
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]
where \(\hat{y}\) indicates the prediction of Y on the basis of X = x.

The "Ordinary Least Squares" Regression
Let \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\) be the prediction for Y based on the ith value of X. The ith residual (i.e., error) is defined as
\[e_i = y_i - \hat{y}_i\]
We define the residual sum of squares (RSS) as
\[\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^{n} e_i^2\]
or equivalently as
\[\mathrm{RSS} = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\]
The least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimise the RSS.
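The minimisation above has a well-known closed-form solution for a single predictor: \(\hat{\beta}_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). The following is a minimal sketch in base R, assuming simulated data as a stand-in for the Advertising data; the variable names `tv` and `sales`, the helper `rss()`, and the generating coefficients are illustrative assumptions, not values from the slides.

```r
# Simulated stand-in for the Advertising data (values are illustrative only).
set.seed(1)
n     <- 200
tv    <- runif(n, min = 0, max = 300)        # hypothetical TV budgets
sales <- 7 + 0.05 * tv + rnorm(n, sd = 3)    # hypothetical sales

# Residual sum of squares for a candidate intercept b0 and slope b1.
rss <- function(b0, b1) sum((sales - b0 - b1 * tv)^2)

# Closed-form least squares estimates for simple linear regression.
b1_hat <- sum((tv - mean(tv)) * (sales - mean(sales))) / sum((tv - mean(tv))^2)
b0_hat <- mean(sales) - b1_hat * mean(tv)

# Any other line gives an RSS at least as large as the least squares line.
rss_min       <- rss(b0_hat, b1_hat)
rss_perturbed <- rss(b0_hat + 0.5, b1_hat - 0.001)

print(c(b0_hat = b0_hat, b1_hat = b1_hat))
print(c(rss_min = rss_min, rss_perturbed = rss_perturbed))
```

Perturbing either coefficient away from the estimates can only increase the RSS, which is exactly what "least squares" means here.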

How to Fit a Regression Model with RSS in R
The lm() function performs a least squares regression and creates a linear model object, where:
• Models for lm() are specified symbolically: response ~ predictor
• For the advertising data, the intercept is \(\hat{\beta}_0 = 7.033\) and the slope is \(\hat{\beta}_1 = 0.0475\)
The linear model object contains much more information than just the coefficients!
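As a hedged illustration of the `lm()` call described above, the sketch below fits `Sales ~ TV` on simulated data. The data frame name `advertising` and its generating values are assumptions made so the example is self-contained; with the real Advertising data the estimates would differ from those printed here.

```r
# Simulated stand-in for the Advertising data (hypothetical values).
set.seed(1)
n <- 200
advertising <- data.frame(TV = runif(n, min = 0, max = 300))
advertising$Sales <- 7 + 0.05 * advertising$TV + rnorm(n, sd = 3)

# Fit Sales on TV; the formula is written as response ~ predictor.
lmfit <- lm(Sales ~ TV, data = advertising)

print(coef(lmfit))     # intercept and slope estimates
print(summary(lmfit))  # standard errors, t values, p-values, RSE, R-squared, F-statistic
print(names(lmfit))    # other components stored in the model object
```

`summary()` is the usual entry point for the quantities discussed on the following slides, while `names()` shows what else the fitted object carries (residuals, fitted values, and so on).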

Interpret Simple Linear Regression Model

Assessing the Accuracy of the Coefficient Estimates
Coefficient – Std. Error: measures how precisely the model estimates the coefficient's unknown value.
• \(\mathrm{SE}(\hat{\beta}_0) = 0.457843\): in the absence of any advertising, the average sales can vary by 457.843 units.
• \(\mathrm{SE}(\hat{\beta}_1) = 0.002691\): for each $1,000 increase in television advertising, the average increase in sales can vary by 2.691 units.

• These standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that, with 95% probability, the range will contain the true unknown value of the parameter. It has the form
\[\hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1)\]
That is, there is approximately a 95% chance that the interval
\[\left[\hat{\beta}_1 - 2 \cdot \mathrm{SE}(\hat{\beta}_1),\ \hat{\beta}_1 + 2 \cdot \mathrm{SE}(\hat{\beta}_1)\right]\]
will contain the true value of \(\beta_1\).

• In the case of the advertising data
− The 95% confidence interval for \(\beta_0\) is [6.130, 7.935]
− The 95% confidence interval for \(\beta_1\) is [0.042, 0.053]
• Use the confidence interval to assess the reliability of the coefficient estimate.
• Standard errors can also be used to perform hypothesis tests on the coefficients.
− \(H_0\): There is no relationship between X and Y, i.e., \(\beta_1 = 0\)
− \(H_a\): There is some relationship between X and Y, i.e., \(\beta_1 \neq 0\)
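For the slope reported above, the rule of thumb gives 0.047537 ± 2 × 0.002691, i.e. roughly [0.042, 0.053]. The sketch below shows how such intervals could be computed in base R, comparing the "± 2 SE" approximation with the exact t quantile. It assumes simulated data (`tv`, `sales` and the generating coefficients are hypothetical), so its numbers will not match the advertising results.

```r
# Simulated stand-in for the Advertising data (hypothetical values).
set.seed(1)
n     <- 200
tv    <- runif(n, min = 0, max = 300)
sales <- 7 + 0.05 * tv + rnorm(n, sd = 3)
lmfit <- lm(sales ~ tv)

est <- coef(summary(lmfit))[, "Estimate"]
se  <- coef(summary(lmfit))[, "Std. Error"]

# Rule-of-thumb interval: estimate +/- 2 standard errors.
ci_approx <- cbind(lower = est - 2 * se, upper = est + 2 * se)

# Interval using the exact t quantile with n - 2 degrees of freedom.
t_crit   <- qt(0.975, df = n - 2)
ci_exact <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

print(ci_approx)
print(ci_exact)
print(confint(lmfit, level = 0.95))  # built-in equivalent of ci_exact
```

With a couple of hundred observations the t quantile is close to 2, which is why the two versions nearly coincide.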

Coefficient – t statistic
\[t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}\]
which measures the number of standard deviations that \(\hat{\beta}_1\) is away from 0.
• A large t value indicates the null hypothesis could be rejected.
• A small t value indicates rejecting the null hypothesis could cause a type-I error.
Question: How large is large?
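One way to make "large" concrete is to compare |t| with a quantile of the t distribution with n − 2 degrees of freedom, or equivalently to convert t into a two-sided tail probability. With the advertising estimates, t = 0.047537 / 0.002691 ≈ 17.67. The sketch below reproduces the calculation on simulated data; the names `tv`, `sales` and the generating values are assumptions for illustration.

```r
# Simulated stand-in for the Advertising data (hypothetical values).
set.seed(1)
n     <- 200
tv    <- runif(n, min = 0, max = 300)
sales <- 7 + 0.05 * tv + rnorm(n, sd = 3)
lmfit <- lm(sales ~ tv)

coefs  <- coef(summary(lmfit))
t_stat <- (coefs["tv", "Estimate"] - 0) / coefs["tv", "Std. Error"]

# Two-sided critical value at the 5% level with n - 2 degrees of freedom.
t_crit <- qt(0.975, df = n - 2)

# Two-sided tail probability of a |t| at least this large when beta_1 = 0.
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

print(c(t_stat = t_stat, t_crit = t_crit, p_value = p_value))
```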

Coefficient – Pr(>|t|) (i.e., p-value): tests the predictive power of the predictor variable, i.e., TV
• Small p-value (Pr(>|t|) < α = 0.001): reject the null hypothesis
− Changes in the predictor's value are related to changes in the response variable.
• Use the coefficient p-values to determine which terms to keep in the regression model.

Assessing the Accuracy of the Model
Residual standard error (RSE): an estimate of the standard deviation of the error term \(\varepsilon\).
\[\mathrm{RSE} = \sqrt{\frac{1}{n-2}\,\mathrm{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\]
• A measure of the quality of a linear regression fit, or a measure of the lack of fit of the model
• The advertising data: RSE = 3.259.
− Actual sales in each market deviate from the true regression line by approximately 3,259 units, on average.
− The percentage error: 3,259/14,000 = 23%, where 14,000 is the mean value of sales.
The Coefficient of Determination (i.e., the \(R^2\) statistic): measures the proportion of variability in Y that can be explained using X.
\[R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}\]
where the total sum of squares (TSS) is \(\sum_{i=1}^{n}(y_i - \bar{y})^2\), and RSS is \(\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\).
\(0 \le R^2 \le 1\): the larger \(R^2\) is, the better the model fits the actual data.
The advertising data: \(R^2 = 0.6119\).

Normality: are residuals normally distributed?
Residuals are essentially the difference between the observed response values and the response values predicted by the model.
Ideally, residuals should be normally distributed, with \(E(\varepsilon) = 0\).
When assessing how well the model fits the data, you should look for a symmetrical distribution of these points around the mean value 0.
[Figure: scatter plot of Sales against TV.]
[Figure: distribution of studentized residuals (density against sresid).]
[Figure: QQ plot of lmfit$residuals against normal quantiles.]

Linearity: is the relationship between predictor and response variables linear?
Residuals vs Fitted
If residuals are randomly spread around a horizontal line without distinct patterns, that is a good indication you don't have non-linear relationships.
What is the difference in the plots between linear models trained on datasets where
• the relationship between predictors and the response variable is linear;
• the relationship between predictors and the response variable is not linear?
Figure: Plots of residuals versus predicted (or fitted) values for the Advertising dataset (Residuals vs Fitted for lm(Sales ~ TV); observations 26, 36 and 179 are labelled).
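The accuracy measures and residual checks above all derive from the fitted model's residuals. As a minimal sketch, assuming simulated data in place of the Advertising data (`tv`, `sales` and the generating values are hypothetical), RSE and \(R^2\) can be computed by hand and the residuals prepared for the normality and linearity plots:

```r
# Simulated stand-in for the Advertising data (hypothetical values).
set.seed(1)
n     <- 200
tv    <- runif(n, min = 0, max = 300)
sales <- 7 + 0.05 * tv + rnorm(n, sd = 3)
lmfit <- lm(sales ~ tv)

# RSE and R-squared from their definitions.
res <- residuals(lmfit)
rss <- sum(res^2)
tss <- sum((sales - mean(sales))^2)
rse <- sqrt(rss / (n - 2))
r2  <- 1 - rss / tss
print(c(RSE = rse, R2 = r2, pct_error = rse / mean(sales)))

# Studentized residuals and their normal quantiles (the data behind a QQ plot).
sresid <- rstudent(lmfit)
qq     <- qqnorm(sresid, plot.it = FALSE)
print(cor(qq$x, qq$y))  # near 1 when the points lie close to a straight line

# Built-in diagnostic plots (commented out so the sketch runs without a graphics device):
# plot(lmfit, which = 1)   # Residuals vs Fitted
# plot(lmfit, which = 2)   # Normal Q-Q
```

The QQ correlation is only a rough numeric summary; the plots themselves remain the primary tool for judging normality and linearity.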
Constant variance (homoscedasticity)
Are residuals spread equally along the ranges of predictors?
[Figure: Scale-Location plot against fitted values for lm(Sales ~ TV); observations 26, 36 and 179 are labelled.]

Residuals vs. Leverage: which data samples are influential in the fit?
[Figure: Residuals vs Leverage plot (standardized residuals against leverage, with Cook's distance contours) and Cook's distance by observation number for lm(Sales ~ TV); observations 26, 36 and 179 are labelled.]
Watch out for outlying values at the upper right corner or at the lower right corner.
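The quantities behind the Scale-Location and Residuals vs Leverage plots are available directly in base R. The sketch below extracts them from a model fitted to simulated data; `tv`, `sales`, `scale_loc` and the generating values are illustrative assumptions, not part of the original analysis.

```r
# Simulated stand-in for the Advertising data (hypothetical values).
set.seed(1)
n     <- 200
tv    <- runif(n, min = 0, max = 300)
sales <- 7 + 0.05 * tv + rnorm(n, sd = 3)
lmfit <- lm(sales ~ tv)

std_res  <- rstandard(lmfit)       # standardized residuals
leverage <- hatvalues(lmfit)       # diagonal of the hat matrix
cooks    <- cooks.distance(lmfit)  # influence of each observation on the fit

# Scale-Location quantity: sqrt(|standardized residual|) against fitted values.
scale_loc <- data.frame(fitted = fitted(lmfit), y = sqrt(abs(std_res)))
print(head(scale_loc))

# The three observations with the largest Cook's distance.
print(head(sort(cooks, decreasing = TRUE), 3))

# Built-in diagnostic plots (commented out so the sketch runs without a graphics device):
# plot(lmfit, which = 3)   # Scale-Location
# plot(lmfit, which = 4)   # Cook's distance
# plot(lmfit, which = 5)   # Residuals vs Leverage
```

An observation is worth a closer look when it combines high leverage with a large standardized residual, which is what a large Cook's distance summarises.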
Multiple Linear Regression
Example: the Advertising Data
[Figure: scatter plots of Sales against TV, Radio and Newspaper.]
How can we extend our analysis of the advertising data in order to accommodate these two additional predictors?
Problems with fitting a separate simple regression for each medium:
• It is unclear how to predict sales given all three advertising media budgets.
• Each regression ignores the correlation between the predictors TV, Radio and Newspaper.

Multiple Linear Regression
The multiple linear regression model:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon\]
where \(\varepsilon \sim N(0, \sigma^2 I_{n \times n})\)
\(\beta_j\): the average effect on Y of a one-unit increase in \(X_j\), holding all other predictors fixed.
In the advertising example, the model becomes
\[\text{Sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{Radio} + \beta_3 \times \text{Newspaper} + \varepsilon\]

Estimating the Regression Coefficients
Given estimates \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p\), we can make predictions using the formula
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p\]
We estimate \(\beta_0, \beta_1, \ldots, \beta_p\) as the values that minimise the sum of squared residuals
\[\mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_p x_{ip}\right)^2\]
This can be done using standard statistical software.
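In matrix form, the standard least squares solution is \(\hat{\beta} = (X^\top X)^{-1} X^\top y\), where X is the design matrix with a leading column of ones. The sketch below solves these normal equations directly and compares the result with `lm()`. It assumes simulated data: the data frame `ads` and its generating coefficients are hypothetical and only loosely styled after the advertising example.

```r
# Simulated stand-in for the Advertising data (hypothetical values).
set.seed(1)
n   <- 200
ads <- data.frame(
  TV        = runif(n, min = 0, max = 300),
  Radio     = runif(n, min = 0, max = 50),
  Newspaper = runif(n, min = 0, max = 100)
)
ads$Sales <- 3 + 0.045 * ads$TV + 0.19 * ads$Radio + rnorm(n, sd = 1.7)

# Design matrix with a leading column of ones for the intercept.
X <- cbind(Intercept = 1, as.matrix(ads[, c("TV", "Radio", "Newspaper")]))
y <- ads$Sales

# Least squares estimate: solve (X'X) beta = X'y.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# The same fit through lm().
lmfit <- lm(Sales ~ TV + Radio + Newspaper, data = ads)
print(cbind(normal_eq = drop(beta_hat), lm = coef(lmfit)))
```

`lm()` uses a QR decomposition internally rather than forming \(X^\top X\), which is numerically safer; the explicit solve is shown only to connect the software output to the RSS criterion.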
Results for Advertising Data
Compare simple linear regression with multiple linear regression:

Model                             Term        Coefficient   Std. error   t value   p-value
Sales ~ TV                        Intercept    7.032594     0.457843     15.36     <2e-16
                                  TV           0.047537     0.002691     17.67     <2e-16
Sales ~ Radio                     Intercept    9.31164      0.56290      16.542    <2e-16
                                  Radio        0.20250      0.02041       9.921    <2e-16
Sales ~ Newspaper                 Intercept   12.35141      0.62142      19.88     <2e-16
                                  Newspaper    0.05469      0.01658       3.30     0.00115
Sales ~ TV + Radio + Newspaper    Intercept    2.938889     0.311908      9.422    <2e-16
                                  TV           0.045765     0.001395     32.809    <2e-16
                                  Radio        0.188530     0.008611     21.893    <2e-16
                                  Newspaper   -0.001037     0.005871     -0.177    0.86

The multiple linear regression suggests that there is no relationship between sales and newspaper, while the simple linear regression implies the opposite.
Correlation matrix for TV, Radio, Newspaper, and Sales: the correlation between radio and newspaper is 0.354.
• A tendency to spend more on newspaper advertising in markets where more is spent on radio advertising.

Some Important Questions
• Is at least one of the predictors \(X_1, X_2, \ldots, X_p\) useful in predicting the response?
• Do all the predictors help to explain Y, or is only a subset of the predictors useful?
• How well does the model fit the data?
• Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

F-Statistics
Is there a relationship between the response and predictors?
• Hypothesis testing
− Null hypothesis: There is no relationship between Y and \(X_1, X_2, \ldots, X_p\).
\[H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\]
− The alternative: There is at least one \(X_j\) related to Y.
\[H_a: \exists j \in [1, p],\ \beta_j \neq 0\]
• F-statistic: a good indicator of whether there is a relationship between the predictors and the response variable.
(TSS − RSS)/p F = RSS/(n−p−1) − F-value close to 1: no relationship between Y and X1, X2, . . . , Xp. − F-value greater than 1: a relationship between our predictor and the response variables. (Monash) FIT5149 24 / 35 Multiple Linear Regression F-Statistics Is there a relationship between the response and predictors? 􏰀 For the F-value, how large is large? − n is large: small F-value provides strong evidence against H0. − n is small: large F-value is needed. 􏰀 Example: multiple linear regression on the advertising dataset (Monash) FIT5149 24 / 35 Multiple Linear Regression The Analysis of Variance (ANOVA) It tests whether reduction in the residual sum of squares are statistically significant or not). Note that this makes sense only if lm.1 and lm.2 are nested models. (Monash) FIT5149 25 / 35 Multiple Linear Regression The Analysis of Variance (ANOVA) It tests whether reduction in the residual sum of squares are statistically significant or not). Note that this makes sense only if lm.1 and lm.2 are nested models. (Monash) FIT5149 25 / 35 Multiple Linear Regression Confidence interval v.s. Prediction interval Given a set of predictor values, what response value should be predict, and how accurate is our prediction? 􏰀 Prediction: given the estimated coefficients, βˆ0,βˆ1,...,βˆp, p Yˆ = βˆ 0 + 􏰊 βˆ i X i i=1 (Monash) FIT5149 26 / 35 Multiple Linear Regression Confidence interval v.s. Prediction interval Given a set of predictor values, what response value should be predict, and how accurate is our prediction? 􏰀 To determine how close Yˆ will be close to f (X ). − Confidence interval: use to quantify the uncertainty around the expected value of predictions (average of a group of predictions) — the uncertainty of predicting the average sales over a number of markets. (Monash) FIT5149 26 / 35 Multiple Linear Regression Confidence interval v.s. Prediction interval Given a set of predictor values, what response value should be predict, and how accurate is our prediction? 
  - Prediction interval: quantifies the uncertainty around a single prediction, e.g. the uncertainty of predicting sales given the TV and radio advertising budgets for a particular market. It is wider than the confidence interval because it also accounts for the irreducible error \(\varepsilon\).

Linear Regression with Qualitative Predictors
Some predictors are not quantitative but qualitative, taking a discrete set of values. These are also called categorical predictors or factor variables.
Figure: The Credit dataset, which contains both quantitative variables (e.g., income, limit, rating, and age) and qualitative variables (e.g., gender, student, married, and ethnicity).

Linear Regression with Qualitative Predictors (continued)
Dummy coding: making many variables out of one
• A categorical variable with \(k\) levels is transformed into \(k - 1\) variables, each with two levels.
• For example, for the ethnicity variable we create two dummy variables. The first could be
  \[ x_{i,1} = \begin{cases} 1 & \text{if the } i\text{th person is Asian} \\ 0 & \text{if the } i\text{th person is not Asian} \end{cases} \]
  and the second could be
  \[ x_{i,2} = \begin{cases} 1 & \text{if the } i\text{th person is Caucasian} \\ 0 & \text{if the } i\text{th person is not Caucasian} \end{cases} \]
  Both of these variables can then be used in the regression equation to obtain the model
  \[ y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \varepsilon_i = \begin{cases} \beta_0 + \beta_1 + \varepsilon_i & \text{if the } i\text{th person is Asian} \\ \beta_0 + \beta_2 + \varepsilon_i & \text{if the } i\text{th person is Caucasian} \\ \beta_0 + \varepsilon_i & \text{if the } i\text{th person is African American} \end{cases} \]
• Example: the Credit dataset.
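The dummy-coding scheme for ethnicity can be sketched in a few lines of Python. This is a minimal illustration, not the course's R workflow: the helper names `dummy_code` and `group_mean` are hypothetical, and the six data points are made up for the example (they are not taken from the Credit dataset). It relies on the fact that, when the only predictors are the dummies of one categorical variable, the least-squares coefficients are the baseline group mean and the differences of the other group means from it.

```python
def dummy_code(values, baseline):
    """Encode a categorical variable with k levels as k - 1 indicator (0/1) columns.

    The baseline level gets no column of its own: it is the row of all zeros.
    """
    levels = [lvl for lvl in sorted(set(values)) if lvl != baseline]
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, rows


# Hypothetical toy data (NOT the Credit dataset): ethnicity and a made-up response.
ethnicity = ["Asian", "Caucasian", "African American",
             "Asian", "Caucasian", "African American"]
balance = [500.0, 520.0, 530.0, 510.0, 518.0, 532.0]

# Two dummy columns (x_i1 for Asian, x_i2 for Caucasian); African American is the baseline.
levels, X = dummy_code(ethnicity, baseline="African American")


def group_mean(level):
    """Average response of all observations in the given level."""
    ys = [y for e, y in zip(ethnicity, balance) if e == level]
    return sum(ys) / len(ys)


# With only these dummies as predictors, least squares reproduces the group means:
beta0 = group_mean("African American")       # baseline mean
beta1 = group_mean("Asian") - beta0          # Asian vs. baseline
beta2 = group_mean("Caucasian") - beta0      # Caucasian vs. baseline

print(levels)
print(X)
print(beta0, beta1, beta2)
```

Choosing a different baseline changes the individual coefficients and their p-values, but not the fitted values of the model.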
Extension of Linear Models

Additive and Linear Assumptions
Two of the most important assumptions on the relationship between predictors and response:
\[ \text{Sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{Radio} \]
• Additive: the effect of a change in \(X_j\) on \(Y\) is independent of the values of the other predictors \(X_i\), \(i \neq j\).
• Linear: the change in \(Y\) due to a one-unit change in \(X_j\) is constant, regardless of the value of \(X_j\).
Can we remove the additive assumption?

Interaction between variables
Synergy effect (or interaction effect):
• For example, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or radio.
• Spending money on radio advertising actually increases the effectiveness of TV advertising, so the slope term for TV should increase as radio increases.
Figure: Over-estimate vs. under-estimate without considering interaction between predictors (axes: Sales, TV, Radio).

Interaction between variables (continued)
The model with an interaction term takes the form
\[ \text{Sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{Radio} + \beta_3 \times (\text{TV} \times \text{Radio}) + \varepsilon = \beta_0 + (\beta_1 + \beta_3 \times \text{Radio}) \times \text{TV} + \beta_2 \times \text{Radio} + \varepsilon \]
• \(\beta_3\): the increase in the effectiveness of TV advertising for a one-unit increase in radio advertising (or vice versa).

Results:
• Strong evidence for \(H_a: \beta_3 \neq 0\): the true relationship is not additive.
• \(R^2\) and F-statistic:

  Model                              R^2      F-statistic
  Sales ~ TV + Radio                 0.8972   859.6
  Sales ~ TV + Radio + TV*Radio      0.9678   1963

  \((0.9678 - 0.8972)/(1 - 0.8972) \approx 69\%\) of the variability in sales that remains after fitting the additive model is explained by the interaction term.
• Interpreting the coefficients:
  - An increase in TV advertising of $1,000 is associated with increased sales of \((\hat{\beta}_1 + \hat{\beta}_3 \times \text{Radio}) \times 1000 = 19 + 1.1 \times \text{Radio}\) units.
  - An increase in radio advertising of $1,000 is associated with increased sales of \((\hat{\beta}_2 + \hat{\beta}_3 \times \text{TV}) \times 1000 = 29 + 1.1 \times \text{TV}\) units.
• The hierarchy principle: if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
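The \(R^2\) and F-statistic values reported for the additive and interaction models can be cross-checked against each other. The sketch below is a rough check, not part of the original slides: it uses the identity \(R^2 = 1 - RSS/TSS\) to rewrite the F-statistic in terms of \(R^2\), and it assumes a sample size of n = 200 markets (the size of the ISL Advertising data, which these slides do not state). The function name `f_from_r2` and the constant names are hypothetical.

```python
def f_from_r2(r2, n, p):
    """Overall F-statistic computed from R^2.

    Since R^2 = 1 - RSS/TSS,
    F = ((TSS - RSS)/p) / (RSS/(n - p - 1)) = (R^2/p) / ((1 - R^2)/(n - p - 1)).
    """
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


N_MARKETS = 200        # ASSUMPTION: sample size of the Advertising data, not given in the slides
R2_ADDITIVE = 0.8972   # Sales ~ TV + Radio (value from the slide)
R2_INTERACT = 0.9678   # Sales ~ TV + Radio + TV*Radio (value from the slide)

f_additive = f_from_r2(R2_ADDITIVE, N_MARKETS, p=2)
f_interact = f_from_r2(R2_INTERACT, N_MARKETS, p=3)

# Share of the variability left over by the additive model
# that is explained once the interaction term is added.
share_explained = (R2_INTERACT - R2_ADDITIVE) / (1.0 - R2_ADDITIVE)

# Partial F-statistic for the nested comparison (one extra coefficient, beta_3),
# the kind of test an ANOVA between the two nested models performs.
f_partial = (R2_INTERACT - R2_ADDITIVE) / ((1.0 - R2_INTERACT) / (N_MARKETS - 3 - 1))

print(f"F (additive model):    {f_additive:.1f}")
print(f"F (interaction model): {f_interact:.1f}")
print(f"Share explained:       {share_explained:.1%}")
print(f"Partial F (beta_3):    {f_partial:.1f}")
```

Under the n = 200 assumption, the two overall F-values agree with the reported 859.6 and 1963 up to the rounding of \(R^2\), and the share comes out at roughly 69%. The large partial F-value is consistent with the strong evidence against the additive model.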
Summary
What we have covered:
• Simple linear regression with ordinary least squares
• Various regression diagnostics
  - Assess the accuracy of the estimated coefficients
  - Assess the accuracy of the model
  - Residual analysis
• Multiple linear regression
• Categorical variables in regression
• Extension of linear regression: interaction between variables
What we haven't covered:
• Outliers
• High leverage points
• Collinearity
• Comparison of linear regression with K-Nearest Neighbors
See Sections 3.3.3, 3.4 and 3.5 of "Introduction to Statistical Learning".

Reference
Reading materials:
• "Linear Regression", Chapter 3 of "Introduction to Statistical Learning", 6th edition
• "Linear Regression and ANOVA", Chapter 11 of "R Cookbook" by Paul Teetor, available online from Monash Library.
Some figures in this presentation were taken from
• "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
• https://data.library.virginia.edu/diagnostic-plots/
Some of the slides are reproduced based on the slides from T. Hastie and R. Tibshirani.