
COMP20008
Elements of Data Processing
Semester 1, 2021
Lecture 7, Part 1: Regression analysis
Contact: pauline.lin@unimelb.edu.au
© University of Melbourne 2021

Learning Objectives
• How to use regression analysis to predict the value of a dependent variable based on independent variables
• Make inferences about the slope and correlation coefficient
• Evaluate the assumptions of regression analysis and know what to do if the assumptions are violated

Correlation vs. Regression
• A scatter diagram can be used to show the relationship between two variables
• Correlation analysis is used to measure strength / direction of the relationship between two variables
• No causal effect is implied with correlation
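As a quick sketch of the correlation analysis mentioned above, the Pearson correlation coefficient can be computed directly from two samples. The helper `pearson_r` and the data below are hypothetical, purely for illustration:

```python
import math

# Pearson correlation: covariance normalised by the two standard deviations.
def pearson_r(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    var_y = sum((yi - y_bar) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))  # 1.0 (perfectly linear data)
```

Values near +1 or −1 indicate a strong linear relationship; values near 0 indicate a weak or absent one.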

Introduction to Regression Analysis
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable
• Dependent variable: the variable we wish to predict or explain
• Independent variable: the variable used to explain the dependent variable

Linear Regression

Simple Linear Regression Model
• Only one independent variable, X
• Relationship between X and Y is described by a linear function
• Changes in Y are assumed to be caused by changes in X

Types of Relationships (1)
Linear relationships vs. non-linear relationships

[Figure: scatter plots illustrating linear (straight-line) and non-linear (curved) X–Y relationships]

Types of Relationships (2)
Strong relationships, weak relationships, and no relationship

[Figure: scatter plots illustrating strong, weak, and absent linear X–Y relationships]

Simple Linear Regression Model
The simple linear regression equation provides an estimate of the population regression line
Yi = β0 + β1Xi + εi

where Yi is the dependent variable, Xi the independent variable, β0 the intercept, and β1 the slope coefficient; β0 + β1Xi is the linear component and the error term εi is the error component.

Simple Linear Regression Model (2)
Yi = β0 + β1Xi + εi

[Figure: fitted line with intercept β0 and slope β1; for a given Xi, the error εi is the vertical distance between the observed value of Y and the predicted value of Y]

Least Squares Method
b0 and b1 are the values that minimize the sum of the squared differences between the observed Y and the predicted Ŷ:

min Σi (Yi − Ŷi)² = min Σi (Yi − (b0 + b1Xi))²
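The minimisation above has a closed-form solution: b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and b0 = Ȳ − b1·X̄. A minimal pure-Python sketch on hypothetical data (not the lecture's house dataset):

```python
# Closed-form least-squares estimates for simple linear regression.
# Illustrative sketch using hypothetical data.
def least_squares(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
    # Intercept: the fitted line always passes through (x_bar, y_bar)
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```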

Interpretation of Slope and Intercept
• b0 is the estimated average value of Y when the value of X is zero (intercept)
• b1 is the estimated change in the average value of Y as a result of a one-unit change in X (slope)

Simple Linear Regression Example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
A random sample of 10 houses is selected
• Dependent variable (Y) = house price in $1000s
• Independent variable (X) = square feet

Sample Data for House Price Model
House Price in $1000s (Y)    Square Feet (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700
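Fitting this sample with, for example, NumPy's polynomial fit (a degree-1 fit is exactly simple linear regression) recovers an intercept of about 98.248 and a slope of about 0.10977; any least-squares routine gives the same estimates:

```python
import numpy as np

# The ten sample houses from the slide (price in $1000s, size in square feet).
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
sqft = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])

# Degree-1 polynomial fit == simple linear regression (returns slope, intercept).
b1, b0 = np.polyfit(sqft, price, 1)
print(round(b0, 5), round(b1, 5))  # 98.24833 0.10977
```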

Graphical Presentation
House price model: scatter plot

[Figure: scatter plot of house price ($1000s, vertical axis) against square feet (0–3000, horizontal axis)]

Calculation Output
The regression equation is:

    house price = 98.24833 + 0.10977 (square feet)

Regression Statistics
    Multiple R          0.76211
    R Square            0.58082
    Adjusted R Square   0.52842
    Standard Error      41.33032
    Observations        10

ANOVA
                 df    SS            MS            F         Significance F
    Regression    1    18934.9348    18934.9348    11.0848   0.01039
    Residual      8    13665.5652     1708.1957
    Total         9    32600.5000

                  Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
    Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
    Square Feet    0.10977        0.03297         3.32938   0.01039     0.03374     0.18580

Graphical Presentation
House price model: scatter plot and regression line

[Figure: scatter plot with fitted regression line; intercept = 98.248, slope = 0.10977; square feet on the horizontal axis, house price ($1000s) on the vertical axis]

house price = 98.24833 + 0.10977 (square feet)

Interpretation of the Intercept b0
house price = 98.24833 + 0.10977 (square feet)

• b0 is the estimated average value of Y when the value of X is zero
• Here, no houses had 0 square feet, so b0 = 98.24833 ($1000s) just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet

Interpretation of the Slope Coefficient b1
house price = 98.24833 + 0.10977 (square feet)

• b1 measures the estimated change in the average value of Y as a result of a one-unit change in X
• Here, b1 = 0.10977 tells us that the average value of a house increases by 0.10977 ($1000s) = $109.77, on average, for each additional square foot of size

Predictions using Regression
Predict the price for a house with 2000 square feet:
house price = 98.25 + 0.1098 (sq.ft.) = 98.25 + 0.1098 (2000)
= 317.85
The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850
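The arithmetic above, as a one-line check using the rounded coefficients from the slide:

```python
# Predict the price of a 2000-square-foot house from the rounded estimates.
b0, b1 = 98.25, 0.1098
predicted = b0 + b1 * 2000
print(round(predicted, 2))  # 317.85, i.e. $317,850
```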

Interpolation vs. Extrapolation
When using a regression model for prediction, only predict within the relevant range of data
Relevant range for interpolation

[Figure: house price ($1000s) vs. square feet scatter plot with the relevant range (the span of observed X values) highlighted]

Do not try to extrapolate beyond the range of observed X's

Break
From the time before, 2019

Lecture 7, Part 2: Regression — Residual analysis

Regression – Residual Analysis

Measures of Variation (1)
Total variation is made up of two parts:
SST = SSR + SSE

Total sum of squares:       SST = Σ (Yi − Ȳ)²
Regression sum of squares:  SSR = Σ (Ŷi − Ȳ)²
Error sum of squares:       SSE = Σ (Yi − Ŷi)²

where:
Ȳ = average value of the dependent variable
Yi = observed values of the dependent variable
Ŷi = predicted value of Y for the given Xi value
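As a sketch, the decomposition can be verified numerically on the house-price fit from the earlier slides; the totals match the ANOVA output (SST = 32600.5, SSR ≈ 18934.93, SSE ≈ 13665.57):

```python
import numpy as np

# House-price sample from the earlier slide.
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
sqft = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])

b1, b0 = np.polyfit(sqft, price, 1)
y_hat = b0 + b1 * sqft               # predicted values
y_bar = price.mean()

sst = ((price - y_bar) ** 2).sum()   # total variation
ssr = ((y_hat - y_bar) ** 2).sum()   # explained variation
sse = ((price - y_hat) ** 2).sum()   # unexplained variation
print(round(sst, 1), round(ssr, 2), round(sse, 2))
assert abs(sst - (ssr + sse)) < 1e-6  # SST = SSR + SSE
```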

Measures of Variation (2)
• SST = total sum of squares: measures the variation of the Yi values around their mean Ȳ
• SSR = regression sum of squares: the explained variation, attributable to the relationship between X and Y
• SSE = error sum of squares: the variation attributable to factors other than the relationship between X and Y

Measures of Variation (3)
[Figure: for an observation (Xi, Yi), vertical distances to the fitted line and to the mean illustrate SST = Σ (Yi − Ȳ)², SSR = Σ (Ŷi − Ȳ)², and SSE = Σ (Yi − Ŷi)²]

Coefficient of Determination, r2
§ The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
§ The coefficient of determination is also called r-squared and is denoted as r2
r² = SSR / SST = regression sum of squares / total sum of squares

0 ≤ r² ≤ 1
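For the house-price model, r² follows directly from the ANOVA sums of squares reported earlier, and its square root matches the "Multiple R" value; a quick check of the arithmetic:

```python
import math

# Sums of squares from the lecture's regression output.
ssr = 18934.9348
sst = 32600.5000
r2 = ssr / sst
print(round(r2, 5))               # 0.58082 (R Square)
print(round(math.sqrt(r2), 5))    # 0.76211 (Multiple R)
```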

Examples of approximate r2 values
r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X

[Figure: two scatter plots (positive and negative slope) with every point exactly on the line, each with r² = 1]

Examples of approximate r² values (2)

0 < r² < 1: weaker linear relationships between X and Y; some but not all of the variation in Y is explained by variation in X

[Figure: scatter plots with points loosely clustered around the fitted line]

Examples of approximate r² values (3)

r² = 0: no linear relationship between X and Y; the value of Y does not depend on X

[Figure: scatter plot with no visible trend]

Calculation Output

r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

58.08% of the variation in house prices is explained by variation in square feet

[Regression output as shown earlier: Multiple R 0.76211, R Square 0.58082, Adjusted R Square 0.52842, Standard Error 41.33032, Observations 10]

Assumptions of Regression

• Linearity: the underlying relationship between X and Y is linear
• Independence of residuals: residual values (also known as fitting errors) are statistically independent and sum to zero

Assumptions of Regression (2)

• The residual for observation i, ei = Yi − Ŷi, is the difference between the observed and predicted value
• Check the assumptions of regression by examining the residuals:
  • Examine for the linearity assumption
  • Evaluate the independence assumption
  • Examine for constant variance for all levels of X
• Graphical analysis of residuals: plot residuals vs. X

Residual Analysis for Linearity

[Figure: a curved pattern in the residuals-vs-X plot indicates a non-linear relationship; a patternless horizontal band indicates linearity]

Residual Analysis for Independence

[Figure: a systematic pattern in the residuals-vs-X plot indicates dependent residuals; a random scatter indicates independence]

Residual Analysis for Equal Variance

[Figure: a funnel-shaped spread in the residuals-vs-X plot indicates non-constant variance; an even band indicates constant variance]

House Price Residual Output

RESIDUAL OUTPUT
      Predicted House Price    Residuals
  1   251.92316                 -6.923162
  2   273.87671                 38.12329
  3   284.85348                 -5.853484
  4   304.06284                  3.937162
  5   218.99284                -19.99284
  6   268.38832                -49.38832
  7   356.20251                 48.79749
  8   367.17929                -43.17929
  9   254.66740                 64.33264
 10   284.85348                -29.85348

[Figure: residual plot vs. square feet; residuals scatter randomly between roughly -60 and +80]

Does not appear to violate any regression assumptions

Break

From the time before, 2019

Lecture 7, Part 3: Multiple regression

Multiple Regression

• Multiple regression is an extension of simple linear regression
• It is used when we want to predict the value of a variable based on the values of two or more other variables
• The variable we want to predict is called the dependent variable
• The variables we use to predict the value of the dependent variable are called the independent variables

Multiple Regression Example

A researcher may be interested in the relationship between the weight of a car, its horsepower and its petrol consumption.
• Independent variable 1: weight
• Independent variable 2: horsepower
• Dependent variable: miles per gallon

Multiple Regression Fitting

• Linear regression is based on fitting a line as close as possible to the plotted coordinates of the data on a two-dimensional graph
• Multiple regression with two independent variables is based on fitting a plane as close as possible to the plotted coordinates of the data on a three-dimensional graph:
  Y = b0 + b1X1 + b2X2
• More independent variables extend this into higher dimensions
• The plane (or higher-dimensional shape) is placed so that it minimises the distance (sum of squared errors) to every data point

Multiple Regression Graphic

[Figure: three-dimensional scatter of data points with a fitted regression plane]

Multiple Regression Assumptions

• Multiple regression assumes that the independent variables are not highly correlated with each other
• Use scatter plots of the independent variables against each other to check

[Figure: X1-vs-X2 scatter plots; highly correlated pairs (bad) vs. uncorrelated pairs (good)]

Pitfalls of Regression Analysis

• Lacking an awareness of the assumptions underlying least-squares regression
• Using a regression model without knowledge of the subject matter
• Extrapolating outside the relevant range
• For multiple regression, remember the importance of the independence assumptions

Avoiding the Pitfalls of Regression

• Start with a scatter plot of X vs. Y to observe possible relationships
• Perform residual analysis to check the assumptions: plot residuals vs. X to check for violations
• If there is no evidence of assumption violation, test for significance of the regression coefficients
• Avoid making predictions or forecasts outside the relevant range

Acknowledgements

• Materials are partially adopted from ...
• Previous COMP20008 slides, including material produced by James Bailey, Pauline Lin, Chris Ewin, Uwe Aickelin and others
• Sutikno, Department of Statistics, Faculty of Mathematics and Natural Sciences, Sepuluh Nopember Institute of Technology (ITS), Surabaya
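The plane-fitting step described in the multiple-regression slides can be sketched with an ordinary least-squares solve. The car data below are hypothetical, constructed to lie exactly on the plane y = 50 − 5·x1 − 0.05·x2, purely for illustration:

```python
import numpy as np

# Hypothetical car data: weight (1000s of lbs), horsepower, miles per gallon.
# Values lie exactly on y = 50 - 5*x1 - 0.05*x2 so the fit is easy to check.
weight = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
hp = np.array([130, 90, 150, 110, 170])
mpg = np.array([33.5, 33.0, 27.5, 27.0, 21.5])

# Design matrix with an intercept column; solve min ||A b - y||^2.
A = np.column_stack([np.ones(len(mpg)), weight, hp])
(b0, b1, b2), *_ = np.linalg.lstsq(A, mpg, rcond=None)
print(round(b0, 2), round(b1, 2), round(b2, 3))  # 50.0 -5.0 -0.05
```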