COMP20008
Elements of Data Processing
Semester 1, 2021
Lecture 7, Part 1: Regression analysis
Contact: pauline.lin@unimelb.edu.au
© University of Melbourne 2021
Learning Objectives
• Use regression analysis to predict the value of a dependent variable based on independent variables
• Make inferences about the slope and correlation coefficient
• Evaluate the assumptions of regression analysis and know what to do if the assumptions are violated
Correlation vs. Regression
• A scatter diagram can be used to show the relationship between two variables
• Correlation analysis is used to measure strength / direction of the relationship between two variables
• No causal effect is implied with correlation
Introduction to Regression Analysis
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable
• Dependent variable: the variable we wish to predict or explain
• Independent variable: the variable used to explain the dependent variable
Linear Regression
Simple Linear Regression Model
• Only one independent variable, X
• Relationship between X and Y is described by a linear function
• Changes in Y are assumed to be caused by changes in X
Types of Relationships (1)
Linear relationships vs. non-linear relationships
[Figure: scatter plots of Y vs. X illustrating linear (straight-line) and non-linear (curved) relationships]
Types of Relationships (2)
Strong, weak, and no relationships
[Figure: scatter plots of Y vs. X illustrating strong relationships (points close to the trend), weak relationships (points loosely scattered around the trend) and no relationship]
Simple Linear Regression Model
The simple linear regression model relates the dependent variable to the independent variable through the population regression line:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
where $Y_i$ is the dependent variable, $X_i$ is the independent variable, $\beta_0$ is the intercept, $\beta_1$ is the slope coefficient, and $\varepsilon_i$ is the error term. $\beta_0 + \beta_1 X_i$ is the linear component; $\varepsilon_i$ is the error component.
Simple Linear Regression Model (2)
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
[Figure: scatter of Y vs. X with the population regression line; intercept $\beta_0$, slope $\beta_1$; for a given $X_i$, the error $\varepsilon_i$ is the vertical distance between the observed value of Y and the predicted value of Y]
Least Squares Method
The estimates $b_0$ and $b_1$ are the values that minimize the sum of the squared differences between the observed $Y_i$ and the predicted $\hat{Y}_i$:
$$\min \sum_i \left(Y_i - \hat{Y}_i\right)^2 = \min \sum_i \left(Y_i - (b_0 + b_1 X_i)\right)^2$$
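As a concrete illustration, here is a minimal NumPy sketch of the closed-form least-squares computation; the x and y arrays are made-up values for illustration only.

```python
import numpy as np

# Illustrative data (made-up values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares estimates:
# b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```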
Interpretation of Slope and Intercept
• b0 is the estimated average value of Y when the value of X is zero (intercept)
• b1 is the estimated change in the average value of Y as a result of a one-unit change in X (slope)
Simple Linear Regression Example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
A random sample of 10 houses is selected
• Dependent variable (Y) = house price in $1000s
• Independent variable (X) = square feet
Sample Data for House Price Model

House Price in $1000s (Y)   Square Feet (X)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700
Graphical Presentation
House price model: scatter plot
[Figure: scatter plot of House Price ($1000s) vs. Square Feet]
Calculation Output

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

Regression Statistics
Multiple R         0.76211
R Square           0.58082
Adjusted R Square  0.52842
Standard Error     41.33032
Observations       10

ANOVA
            df  SS          MS          F        Significance F
Regression  1   18934.9348  18934.9348  11.0848  0.01039
Residual    8   13665.5652  1708.1957
Total       9   32600.5000

             Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept    98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet  0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
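The output above comes from a spreadsheet-style regression tool. As a sketch, the same coefficients can be reproduced in Python with scipy.stats.linregress on the sample data:

```python
from scipy import stats

# House price data from the sample table
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

res = stats.linregress(sqft, price)
print(f"intercept = {res.intercept:.5f}")   # ~ 98.24833
print(f"slope     = {res.slope:.5f}")       # ~ 0.10977
print(f"r-squared = {res.rvalue ** 2:.5f}") # ~ 0.58082

# Predicted price (in $1000s) for a 2000-square-foot house
# ~ 317.8 (the slide's 317.85 uses rounded coefficients)
print(f"prediction at 2000 sq. ft: {res.intercept + res.slope * 2000:.2f}")
```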
Graphical Presentation (2)
House price model: scatter plot and regression line
[Figure: scatter plot of House Price ($1000s) vs. Square Feet with the fitted line; slope = 0.10977, intercept = 98.248]
house price = 98.24833 + 0.10977 (square feet)
Interpretation of the Intercept b0
house price = 98.24833 + 0.10977 (square feet)
• b0 is the estimated average value of Y when the value of X is zero
• Here, no houses had 0 square feet, so b0 = 98.24833 ($1000) just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet
Interpretation of the Slope Coefficient b1
house price = 98.24833 + 0.10977 (square feet)
• b1 measures the estimated change in the average value of Y as a result of a one-unit change in X
• Here, b1 = 0.10977 tells us that the average value of a house increases by 0.10977 × $1000 = $109.77 for each additional square foot of size
Predictions using Regression
Predict the price for a house with 2000 square feet:
house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098 (2000)
            = 317.85
The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850
Interpolation vs. Extrapolation
When using a regression model for prediction, only predict within the relevant range of data
Relevant range for interpolation
[Figure: house price scatter plot with the relevant range of Square Feet (roughly 1100 to 2450) highlighted]
Do not try to extrapolate beyond the range of observed X’s
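As an illustrative sketch (the helper function and its warning behaviour are our own, not from the slides), a prediction routine can refuse to extrapolate beyond the observed range:

```python
import numpy as np

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
b1, b0 = np.polyfit(sqft, price, 1)  # degree-1 fit returns [slope, intercept]

def predict_price(x):
    """Predict house price ($1000s), only within the observed range of X."""
    if not (sqft.min() <= x <= sqft.max()):
        raise ValueError(f"{x} sq. ft is outside the observed range "
                         f"[{sqft.min()}, {sqft.max()}]; refusing to extrapolate")
    return b0 + b1 * x

print(predict_price(2000))   # interpolation: OK, ~ 317.8
# predict_price(3500)        # extrapolation: raises ValueError
```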
Lecture 7, Part 2: Regression — Residual analysis
Regression – Residual Analysis
Measures of Variation (1)
Total variation is made up of two parts:
$$SST = SSR + SSE$$
where SST is the total sum of squares, SSR the regression sum of squares, and SSE the error sum of squares:
$$SST = \sum_i \left(Y_i - \bar{Y}\right)^2 \qquad SSR = \sum_i \left(\hat{Y}_i - \bar{Y}\right)^2 \qquad SSE = \sum_i \left(Y_i - \hat{Y}_i\right)^2$$
where:
$\bar{Y}$ = average value of the dependent variable
$Y_i$ = observed values of the dependent variable
$\hat{Y}_i$ = predicted value of Y for the given $X_i$ value
Measures of Variation (2)
§ SST = total sum of squares: measures the variation of the $Y_i$ values around their mean $\bar{Y}$
§ SSR = regression sum of squares: explained variation attributable to the relationship between X and Y
§ SSE = error sum of squares: variation attributable to factors other than the relationship between X and Y
Measures of Variation (3)
[Figure: for an observation $(X_i, Y_i)$, the total deviation $Y_i - \bar{Y}$ splits into the explained part $\hat{Y}_i - \bar{Y}$ (SSR) and the unexplained part $Y_i - \hat{Y}_i$ (SSE)]
Coefficient of Determination, r²
§ The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
§ The coefficient of determination is also called r-squared and is denoted as r²
$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}, \qquad 0 \le r^2 \le 1$$
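A minimal NumPy sketch of this decomposition, using the house-price data from the running example (np.polyfit is just one convenient way to obtain the fitted line):

```python
import numpy as np

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

# Fit by least squares (np.polyfit returns [slope, intercept] for degree 1)
b1, b0 = np.polyfit(sqft, price, 1)
pred = b0 + b1 * sqft

sst = np.sum((price - price.mean()) ** 2)  # total sum of squares
ssr = np.sum((pred - price.mean()) ** 2)   # regression sum of squares
sse = np.sum((price - pred) ** 2)          # error sum of squares
print(f"SST = {sst:.4f}  (SSR + SSE = {ssr + sse:.4f})")  # ~ 32600.5
print(f"r^2 = {ssr / sst:.5f}")                           # ~ 0.58082
```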
Examples of approximate r² values (1)
r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X
[Figure: scatter plots in which every point lies exactly on the regression line]
Examples of approximate r² values (2)
0 < r² < 1: weaker linear relationships between X and Y; some but not all of the variation in Y is explained by variation in X
[Figure: scatter plots with points scattered loosely around the regression line]
Examples of approximate r² values (3)
r² = 0: no linear relationship between X and Y; the value of Y does not depend on X
[Figure: scatter plot with no linear trend]
Calculation Output (r²)
From the regression output shown earlier:
$$r^2 = \frac{SSR}{SST} = \frac{18934.9348}{32600.5000} = 0.58082$$
58.08% of the variation in house prices is explained by variation in square feet
Assumptions of Regression
• Linearity: the underlying relationship between X and Y is linear
• Independence of residuals: the residual values (also known as fitting errors) are statistically independent of one another
Assumptions of Regression (2)
• The residual for observation i, $e_i$, is the difference between the observed and predicted value:
$$e_i = Y_i - \hat{Y}_i$$
• Check the assumptions of regression by examining the residuals:
  • Examine for linearity assumption
  • Evaluate independence assumption
  • Examine for constant variance for all levels of X
• Graphical analysis of residuals: plot residuals vs. X, as in the sketch below
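A minimal matplotlib sketch of such a residual plot for the house-price example:

```python
import matplotlib.pyplot as plt
import numpy as np

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

b1, b0 = np.polyfit(sqft, price, 1)
residuals = price - (b0 + b1 * sqft)  # e_i = Y_i - Y_hat_i

plt.scatter(sqft, residuals)
plt.axhline(0, linestyle="--")  # reference line at zero
plt.xlabel("Square Feet")
plt.ylabel("Residuals")
plt.title("House Price Model Residual Plot")
plt.show()
```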
Residual Analysis for Linearity
[Figure: residuals plotted against X; a curved pattern indicates the relationship is not linear, while a random scatter around zero is consistent with linearity]
Residual Analysis for Independence
[Figure: residuals plotted against X; a systematic pattern indicates the residuals are not independent, while a patternless scatter is consistent with independence]
Residual Analysis for Equal Variance
[Figure: residuals plotted against X; a spread that changes with X indicates non-constant variance, while a uniform spread is consistent with constant variance]
House Price Residual Output

Observation  Predicted House Price  Residuals
1            251.92316              -6.923162
2            273.87671              38.12329
3            284.85348              -5.853484
4            304.06284              3.937162
5            218.99284              -19.99284
6            268.38832              -49.38832
7            356.20251              48.79749
8            367.17929              -43.17929
9            254.66740              64.33264
10           284.85348              -29.85348

[Figure: House Price Model residual plot, residuals vs. Square Feet]
Does not appear to violate any regression assumptions
Lecture 7, Part 3: Multiple regression
Multiple Regression
Multiple Regression
• Multiple regression is an extension of simple linear regression
• It is used when we want to predict the value of a variable based on the value of two or more other variables
• The variable we want to predict is called the dependent variable
• The variables we are using to predict the value of the dependent variable are called the independent variables
Multiple Regression Example
A researcher may be interested in the relationship between the weight of a car, its horsepower and its petrol consumption.
• Independent Variable 1: weight
• Independent Variable 2: horsepower
• Dependent Variable: miles per gallon
Multiple Regression Fitting
• Linear regression is based on fitting a line as close as possible to the plotted coordinates of the data on a two-dimensional graph
• Multiple regression with two independent variables is based on fitting a plane as close as possible to the plotted coordinates of your data on a three-dimensional graph: Y = b0 + b1X1 + b2X2
• More independent variables extend this into higher dimensions
• The plane (or higher-dimensional shape) will be placed so that it minimises the distance (sum of squared errors) to every data point, as in the sketch below
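A short scikit-learn sketch of fitting such a plane; the car measurements below are made-up illustrative values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical car data: weight (1000s of lbs) and horsepower -> miles per gallon
X = np.array([[2.5, 100], [3.0, 130], [3.5, 160],
              [4.0, 190], [2.2, 90], [4.5, 220]])
y = np.array([30, 26, 22, 18, 33, 15])

# Fits Y = b0 + b1*X1 + b2*X2 by least squares
model = LinearRegression().fit(X, y)
print("intercept b0:", model.intercept_)
print("coefficients b1, b2:", model.coef_)
print("predicted mpg for a 3.2 (1000 lbs), 140 hp car:",
      model.predict([[3.2, 140]]))
```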
Multiple Regression Graphic
[Figure: data points in three dimensions with the fitted regression plane]
Multiple Regression Assumptions
• Multiple regression assumes that the independent variables are not highly correlated with each other
• Use scatter plots to check
Multiple Regression Assumption
Multiple regression assumes that the independent variables are not highly correlated with each other; use scatter plots to check.
[Figure: scatter plots of X2 vs. X1, contrasting highly correlated independent variables (BAD) with uncorrelated independent variables (GOOD)]
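A minimal pandas sketch of this check; the X1/X2 values are made up (and deliberately correlated, so the check would flag them):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical independent variables
X = pd.DataFrame({
    "X1": [2.5, 3.0, 3.5, 4.0, 2.2, 4.5],
    "X2": [100, 130, 160, 190, 90, 220],
})

# A pairwise correlation near +1 or -1 between X1 and X2 signals trouble
print(X.corr())

# Scatter plots of each pair of independent variables
pd.plotting.scatter_matrix(X)
plt.show()
```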
Pitfalls of Regression Analysis
• Lacking an awareness of the assumptions underlying least-squares regression
• Using a regression model without knowledge of the subject matter
• Extrapolating outside the relevant range
• For multiple regression, remember the importance of independence assumptions
Avoiding the Pitfalls of Regression
• Start with a scatter plot of X vs. Y to observe possible relationships
• Perform residual analysis to check the assumptions: plot residuals vs. X to check for violations
• If there is no evidence of assumption violation, test for significance of the regression coefficients
• Avoid making predictions or forecasts outside the relevant range
Acknowledgements
• Materials are partially adapted from:
• Previous COMP20008 slides, including material produced by James Bailey, Pauline Lin, Chris Ewin, Uwe Aickelin and others
• Sutikno, Department of Statistics, Faculty of Mathematics and Natural Sciences, Sepuluh Nopember Institute of Technology (ITS), Surabaya
© University of Melbourne 2021