COMP20008 Elements of Data Processing: Regression
Semester 1, 2020
Contact: uwe.aickelin@unimelb.edu.au
Learning Objectives
• How to use regression analysis to predict the value of a dependent variable based on independent variables
• Make inferences about the slope and correlation coefficient
• Evaluate the assumptions of regression analysis and know what to do if the assumptions are violated
Correlation vs. Regression
• A scatter diagram can be used to show the relationship between two variables
• Correlation analysis is used to measure strength / direction of the relationship between two variables
• No causal effect is implied with correlation
Introduction to Regression Analysis
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable
• Dependent variable: the variable we wish to predict or explain
• Independent variable: the variable used to explain the dependent variable
Linear Regression
Simple Linear Regression Model
• Only one independent variable, X
• Relationship between X and Y is described by a linear function
• Changes in Y are assumed to be caused by changes in X
Types of Relationships

[Scatter plots: linear relationships (straight-line trends) vs. non-linear relationships (curved trends)]
Types of Relationships (2)

[Scatter plots: strong relationships (points tight around the trend), weak relationships (points loosely scattered), and no relationship]
Simple Linear Regression Model

Yi = β0 + β1Xi + εi

• Yi: dependent variable
• Xi: independent variable
• β0: intercept
• β1: slope coefficient
• εi: error term
• β0 + β1Xi is the linear component; εi is the error component

The simple linear regression equation provides an estimate of the population regression line
Simple Linear Regression Model (2)

[Diagram: the line Yi = β0 + β1Xi + εi plotted against X, marking for a given Xi the observed value of Y, the predicted value of Y, the intercept β0, the slope β1, and the error εi for this Xi value]
Least Squares Method

b0 and b1 are obtained by finding the values that minimize the sum of the squared differences between the observed Y and the predicted Ŷ:

min Σ(Yi − Ŷi)² = min Σ(Yi − (b0 + b1Xi))²
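These estimates have a closed form: b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and b0 = Ȳ − b1X̄. A minimal sketch (not from the original slides) computing them with NumPy, using the house-price sample introduced on a later slide:

```python
import numpy as np

# House-price sample used later in this deck: size (sq ft) and price ($1000s)
X = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
Y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

# Closed-form least-squares estimates
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
print(b0, b1)  # approx. 98.24833 and 0.10977, matching the worked example
```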
Interpretation of Slope and Intercept
• b0 is the estimated average value of Y when the value of X is zero (intercept)
• b1 is the estimated change in the average value of Y as a result of a one-unit change in X (slope)
Simple Linear Regression Example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
A random sample of 10 houses is selected
• Dependent variable (Y) = house price in $1000s
• Independent variable (X) = square feet
Sample Data for House Price Model

House Price in $1000s (Y)   Square Feet (X)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700
Graphical Presentation

House price model: scatter plot
[Scatter plot: house price ($1000s) vs. square feet]
Calculation Output

The regression equation is:

house price = 98.24833 + 0.10977 (square feet)

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
             df   SS           MS           F         Significance F
Regression    1   18934.9348   18934.9348   11.0848   0.01039
Residual      8   13665.5652    1708.1957
Total         9   32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet    0.10977        0.03297         3.32938   0.01039     0.03374     0.18580
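Output like the table above can be reproduced in Python; a minimal sketch using statsmodels (one of several libraries that print a comparable summary):

```python
import numpy as np
import statsmodels.api as sm

square_feet = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

# Add an explicit intercept column, then fit ordinary least squares
X = sm.add_constant(square_feet)
model = sm.OLS(price, X).fit()

print(model.params)     # intercept ~98.24833, slope ~0.10977
print(model.rsquared)   # ~0.58082
print(model.summary())  # full table: t stats, p-values, 95% CIs, F statistic
```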
Graphical Presentation (2)

House price model: scatter plot and regression line
[Scatter plot of house price ($1000s) vs. square feet with the fitted line; slope = 0.10977, intercept = 98.248]

house price = 98.24833 + 0.10977 (square feet)
Interpretation of the Intercept b0

house price = 98.24833 + 0.10977 (square feet)

• b0 is the estimated average value of Y when the value of X is zero
• Here, no houses had 0 square feet, so b0 = 98.24833 ($1000s) just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet
Interpretation of the Slope Coefficient b1

house price = 98.24833 + 0.10977 (square feet)

• b1 measures the estimated change in the average value of Y as a result of a one-unit change in X
• Here, b1 = 0.10977 tells us that the average value of a house increases by 0.10977 ($1000s) = $109.77, on average, for each additional square foot of size
Predictions using Regression
Predict the price for a house with 2000 square feet:
house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098 (2000)
            = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850
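In code, the prediction is just the fitted equation evaluated at the new X value; a sketch using the full-precision coefficients (which give 317.79 rather than the rounded 317.85):

```python
b0, b1 = 98.24833, 0.10977  # fitted intercept and slope of the house model

def predict_price(square_feet):
    """Predicted house price in $1000s for a house of the given size."""
    return b0 + b1 * square_feet

print(predict_price(2000))  # 317.78833; the slide's 317.85 uses the
                            # rounded coefficients 98.25 and 0.1098
```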
Interpolation vs. Extrapolation

When using a regression model for prediction, only predict within the relevant range of the data: do not try to extrapolate beyond the range of observed X values.

[Scatter plot of house price ($1000s) vs. square feet, with the relevant range for interpolation marked between the smallest and largest observed X]
Regression – Residual Analysis
Measures of Variation

Total variation is made up of two parts:

SST = SSR + SSE

SST = Σ(Yi − Ȳ)²   (Total Sum of Squares)
SSR = Σ(Ŷi − Ȳ)²   (Regression Sum of Squares)
SSE = Σ(Yi − Ŷi)²   (Error Sum of Squares)

where:
Ȳ = average value of the dependent variable
Yi = observed values of the dependent variable
Ŷi = predicted value of Y for the given Xi value
Measures of Variation (2)

• SST (total sum of squares): measures the variation of the Yi values around their mean Ȳ
• SSR (regression sum of squares): explained variation attributable to the relationship between X and Y
• SSE (error sum of squares): variation attributable to factors other than the relationship between X and Y
Measures of Variation (3)

[Diagram: for an observation (Xi, Yi), the total deviation Yi − Ȳ splits into the explained part Ŷi − Ȳ (mean to regression line) and the unexplained part Yi − Ŷi (regression line to observed point), giving SST = Σ(Yi − Ȳ)², SSR = Σ(Ŷi − Ȳ)², SSE = Σ(Yi − Ŷi)²]
Coefficient of Determination, r2

The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. It is also called r-squared and is denoted r2:

r2 = SSR / SST = regression sum of squares / total sum of squares

Note: 0 ≤ r2 ≤ 1
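A sketch verifying the decomposition and r2 numerically for the house-price example (the values match the regression output shown earlier, up to coefficient rounding):

```python
import numpy as np

X = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
Y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
Y_hat = 98.24833 + 0.10977 * X  # fitted values from the estimated line

SST = np.sum((Y - Y.mean()) ** 2)      # total variation,       ~32600.5
SSR = np.sum((Y_hat - Y.mean()) ** 2)  # explained variation,   ~18934.9
SSE = np.sum((Y - Y_hat) ** 2)         # unexplained variation, ~13665.6
print(SSR + SSE)   # ~SST (exact only with the unrounded coefficients)
print(SSR / SST)   # r2 ~0.58
```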
Examples of approximate r2 values

r2 = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X

[Scatter plots: points lying exactly on a line with positive slope, and on a line with negative slope]
Examples of approximate r2 values (2)

0 < r2 < 1: weaker linear relationships between X and Y; some but not all of the variation in Y is explained by variation in X

[Scatter plots: points loosely scattered around upward and downward trends]
Examples of approximate r2 values (3)

r2 = 0: no linear relationship between X and Y; the value of Y does not depend on X

[Scatter plot: points with no trend]
Calculation Output (2)

r2 = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

58.08% of the variation in house prices is explained by variation in square feet

(Regression Statistics: Multiple R = 0.76211, R Square = 0.58082, Adjusted R Square = 0.52842, Standard Error = 41.33032, Observations = 10; the ANOVA and coefficients tables are as in the Calculation Output shown earlier.)
Assumptions of Regression

• Linearity: the underlying relationship between X and Y is linear
• Independence of errors: residual values (also known as fitting errors) are statistically independent of one another
• Equal variance: the variability of the errors is the same for all levels of X
Residual Analysis

ei = Yi − Ŷi

• The residual for observation i, ei, is the difference between its observed and predicted value
• Check the assumptions of regression by examining the residuals:
  • Examine for linearity
  • Evaluate independence
  • Examine for constant variance at all levels of X
• Graphical analysis of residuals: plot residuals vs. X (see the sketch below)
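A sketch of this graphical check for the house-price model, plotting residuals against X with matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
Y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
residuals = Y - (98.24833 + 0.10977 * X)  # e_i = Y_i - Y_hat_i

plt.scatter(X, residuals)
plt.axhline(0, color="grey", linewidth=1)  # residuals should scatter evenly around 0
plt.xlabel("Square Feet")
plt.ylabel("Residuals")
plt.title("House Price Model Residual Plot")
plt.show()
```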
Residual Analysis for Linearity

[Residual plots vs. X: residuals scattered randomly around zero support the linearity assumption; a curved (e.g. U-shaped) pattern indicates the relationship is not linear]
Residual Analysis for Independence

[Residual plots vs. X: residuals with no systematic pattern are consistent with independence; residuals that trend or cycle with X are not independent]
Residual Analysis for Equal Variance

[Residual plots vs. X: an even band around zero indicates constant variance; a fan or funnel shape indicates non-constant variance]
House Price Residual Output

Obs   Predicted House Price   Residuals
 1    251.92316                -6.923162
 2    273.87671                38.12329
 3    284.85348                -5.853484
 4    304.06284                 3.937162
 5    218.99284               -19.99284
 6    268.38832               -49.38832
 7    356.20251                48.79749
 8    367.17929               -43.17929
 9    254.6674                 64.33264
10    284.85348               -29.85348

[House price model residual plot: residuals vs. square feet]

The residual plot does not appear to violate any of the regression assumptions
Multiple Regression
Multiple Regression

• Multiple regression is an extension of simple linear regression
• It is used when we want to predict the value of a variable based on the values of two or more other variables
• The variable we want to predict is called the dependent variable
• The variables we use to predict the value of the dependent variable are called the independent variables
Multiple Regression Example

A researcher may be interested in the relationship between the weight of a car, its horsepower and its petrol consumption.

• Independent Variable 1: weight
• Independent Variable 2: horsepower
• Dependent Variable: miles per gallon
Multiple Regression Fitting

• Linear regression is based on fitting a line as close as possible to the plotted coordinates of the data on a two-dimensional graph
• Multiple regression with two independent variables is based on fitting a plane as close as possible to the plotted coordinates of the data on a three-dimensional graph: Y = a + b1X1 + b2X2
• More independent variables extend this into higher dimensions
• The plane (or higher-dimensional shape) is placed so that it minimises the distance (sum of squared errors) to every data point (a sketch in code follows)
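A minimal sketch of fitting such a plane with scikit-learn; the weight/horsepower/mpg values below are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical car data: columns are [weight (1000 lbs), horsepower]
X = np.array([[2.6, 110], [3.2, 150], [3.4, 165],
              [2.2,  95], [3.8, 200], [2.9, 130]])
y = np.array([30, 23, 21, 34, 17, 26])  # miles per gallon (illustrative)

model = LinearRegression().fit(X, y)  # fits the plane Y = a + b1*X1 + b2*X2
print(model.intercept_, model.coef_)  # a and (b1, b2)
print(model.predict([[3.0, 140]]))    # predicted mpg for a new car
```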
Multiple Regression Graphic

[3D plot: a regression plane fitted through the data points]
Multiple Regression Assumptions
• Multiple regression assumes that the independent variables are not highly correlated with each other
• Use scatter plots to check
Multiple Regression Assumptions (2)

[Scatter plots of the independent variables against each other: X1 vs. X2 highly correlated (BAD) vs. X1 vs. X2 uncorrelated (GOOD)]
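Alongside the scatter plots, a quick numerical check is the pairwise correlation between the independent variables; a sketch with pandas, reusing the invented car data from above (where weight and horsepower happen to be highly correlated, i.e. the BAD case):

```python
import pandas as pd

cars = pd.DataFrame({
    "weight":     [2.6, 3.2, 3.4, 2.2, 3.8, 2.9],  # 1000 lbs (illustrative)
    "horsepower": [110, 150, 165, 95, 200, 130],
})

print(cars.corr())  # off-diagonal values near +/-1 signal collinearity

# The scatter-plot check from the slide
cars.plot.scatter(x="weight", y="horsepower")
```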
Pitfalls of Regression Analysis
• Lacking an awareness of the assumptions underlying least-squares regression
• Using a regression model without knowledge of the subject matter
• Extrapolating outside the relevant range
• For multiple regression, remember the additional assumption that the independent variables are not highly correlated with each other
Avoiding the Pitfalls of Regression
• Start with a scatter plot of X vs. Y to observe possible relationships
• Perform residual analysis to check the assumptions: plot residuals vs. X to check for violations
• If there is no evidence of assumption violation, test for significance of the regression coefficients
• Avoid making predictions or forecasts outside the relevant range
Acknowledgements

• Materials are partially adapted from ...
• Previous COMP20008 slides, including material produced by James Bailey, Pauline Lin, Chris Ewin, Uwe Aickelin and others
• Sutikno, Department of Statistics, Faculty of Mathematics and Natural Sciences, Sepuluh Nopember Institute of Technology (ITS), Surabaya