
COMP20008 Elements of Data Processing Regression
Semester 1, 2020
Contact: uwe.aickelin@unimelb.edu.au

Where are we now?

Learning Objectives
• How to use regression analysis to predict the value of a dependent variable based on independent variables
• Make inferences about the slope and correlation coefficient
• Evaluate the assumptions of regression analysis and know what to do if the assumptions are violated

Correlation vs. Regression
• A scatter diagram can be used to show the relationship between two variables
• Correlation analysis is used to measure strength / direction of the relationship between two variables
• No causal effect is implied with correlation

Introduction to Regression Analysis
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable
• Dependent variable: the variable we wish to predict or explain
• Independent variable: the variable used to explain the dependent variable

Linear Regression

Simple Linear Regression Model
• Only one independent variable, X
• Relationship between X and Y is described by a linear function
• Changes in Y are assumed to be caused by changes in X

Types of Relationships

[Scatter plot panels: linear relationships and non-linear relationships between X and Y]

Types of Relationships (2)

[Scatter plot panels: strong relationships, weak relationships, and no relationship between X and Y]

Simple Linear Regression Model

The simple linear regression equation provides an estimate of the population regression line:

Yi = β0 + β1 Xi + εi

where Yi is the dependent variable, Xi is the independent variable, β0 is the intercept, β1 is the slope coefficient, and εi is the error term; β0 + β1 Xi is the linear component and εi is the error component.
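To make these components concrete, here is a minimal sketch in Python (the parameter values and variable names are invented for illustration) that generates data from a known population line plus a random error term:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

beta0, beta1 = 98.0, 0.11             # hypothetical intercept and slope
X = rng.uniform(1000, 2500, size=50)  # independent variable
eps = rng.normal(0, 40, size=50)      # error component
Y = beta0 + beta1 * X + eps           # linear component + error component
```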

Simple Linear Regression Model (2)

[Annotated scatter plot of Yi = β0 + β1 Xi + εi: for a given Xi, the observed value of Y, the predicted value of Y on the line, the intercept β0, the slope β1, and the error εi for that X value]

Least Squares Method
b0 and b1 are obtained by finding the values that minimize the sum of the squared differences between Yi and Ŷi:

min Σ (Yi − Ŷi)² = min Σ (Yi − (b0 + b1 Xi))²
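As a minimal sketch (assuming numpy, and using the house-price sample introduced on the following slides), the estimates can be computed directly from this formula:

```python
import numpy as np

# House-price sample from the worked example (price in $1000s, size in sq ft)
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])

# Closed-form least-squares estimates of slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # ≈ 98.248 and 0.110
```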

Interpretation of Slope and Intercept
• b0 is the estimated average value of Y when the value of X is zero (intercept)
• b1 is the estimated change in the average value of Y as a result of a one-unit change in X (slope)

Simple Linear Regression Example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
A random sample of 10 houses is selected
• Dependent variable (Y) = house price in $1000s
• Independent variable (X) = square feet

Sample Data for House Price Model
House Price in $1000s (Y)    Square Feet (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700

Graphical Presentation
House price model: scatter plot
[Scatter plot: House Price ($1000s) vs. Square Feet]

Calculation Output

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

ANOVA
             df   SS           MS           F         Significance F
Regression   1    18934.9348   18934.9348   11.0848   0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)
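As a sketch, the key numbers in this output can be reproduced in Python with scipy.stats.linregress (any least-squares routine gives the same fit):

```python
import numpy as np
from scipy import stats

y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])

fit = stats.linregress(x, y)
print(fit.intercept, fit.slope)  # ≈ 98.24833, 0.10977
print(fit.rvalue ** 2)           # ≈ 0.58082 (R Square)
print(fit.pvalue)                # ≈ 0.01039 (p-value for the slope)
```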

Graphical Presentation
House price model: scatter plot and regression line
[Scatter plot with the fitted regression line: House Price ($1000s) vs. Square Feet; intercept = 98.248, slope = 0.10977]

house price = 98.24833 + 0.10977 (square feet)

Interpretation of the Intercept b0
house price = 98.24833 + 0.10977 (square feet)
• b0 is the estimated average value of Y when the value of X is zero
• Here, no houses had 0 square feet, so b0 = 98.24833 ($1000) just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet

Interpretation of the Slope Coefficient b1
house price = 98.24833 + 0.10977 (square feet)

• b1 measures the estimated change in the average value of Y as a result of a one-unit change in X
• Here, b1 = 0.10977 tells us that the average house price increases by 0.10977 ($1000s) = $109.77 for each additional square foot of size

Predictions using Regression
Predict the price for a house with 2000 square feet:
house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098 (2000)
            = 317.85
The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850

Interpolation vs. Extrapolation
When using a regression model for prediction, only predict within the relevant range of data
Do not try to extrapolate beyond the range of observed X’s.

[Scatter plot: House Price ($1000s) vs. Square Feet, with the relevant range for interpolation marked]
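A small sketch of this rule in code (the function name and the hard-coded observed range are illustrative): predict with the fitted line, but refuse to extrapolate beyond the observed X’s:

```python
def predict_price(sqft, b0=98.24833, b1=0.10977, x_min=1100, x_max=2450):
    """Predict house price in $1000s, refusing to extrapolate."""
    if not x_min <= sqft <= x_max:
        raise ValueError(f"{sqft} sq ft lies outside the observed range "
                         f"[{x_min}, {x_max}]; prediction would be extrapolation")
    return b0 + b1 * sqft

print(predict_price(2000))  # ≈ 317.79 (the slide's 317.85 uses rounded coefficients)
```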

Regression – Residual Analysis

Measures of Variation
Total variation is made up of two parts:
SST = SSR + SSE
SST = Σ (Yi − Ȳ)²     Total Sum of Squares
SSR = Σ (Ŷi − Ȳ)²     Regression Sum of Squares
SSE = Σ (Yi − Ŷi)²    Error Sum of Squares

where:
Ȳ  = average value of the dependent variable
Yi = observed values of the dependent variable
Ŷi = predicted value of Y for the given Xi value
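A sketch of this decomposition on the house-price data (fitted values computed from the coefficients estimated earlier; numbers match the output slide up to rounding):

```python
import numpy as np

y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
y_hat = 98.24833 + 0.10977 * x         # predicted values

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression (explained) sum of squares
sse = np.sum((y - y_hat) ** 2)         # error sum of squares
print(sst, ssr + sse)                  # both ≈ 32600.5, confirming SST = SSR + SSE
```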

Measures of Variation (2)
• SST = total sum of squares: measures the variation of the Yi values around their mean Ȳ
• SSR = regression sum of squares: the explained variation, attributable to the relationship between X and Y
• SSE = error sum of squares: the variation attributable to factors other than the relationship between X and Y

Measures of Variation (3)
[Annotated scatter plot at a point (Xi, Yi): the vertical distances illustrating SST = Σ (Yi − Ȳ)², SSR = Σ (Ŷi − Ȳ)² and SSE = Σ (Yi − Ŷi)²]

Coefficient of Determination, r2
 The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
 The coefficient of determination is also called r-squared and is denoted as r2
r² = SSR / SST = regression sum of squares / total sum of squares

Note: 0 ≤ r² ≤ 1
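For simple linear regression, r² also equals the squared sample correlation coefficient between X and Y, which gives a quick way to compute it (a sketch):

```python
import numpy as np

y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])

# With a single predictor, SSR / SST equals the squared correlation of X and Y
r2 = np.corrcoef(x, y)[0, 1] ** 2
print(r2)  # ≈ 0.58082
```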

Examples of approximate r2 values
[Scatter plots with r² = 1]

r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X

Examples of approximate r2 values (2)
[Scatter plots with 0 < r² < 1]

0 < r² < 1: weaker linear relationships between X and Y; some but not all of the variation in Y is explained by variation in X

Examples of approximate r2 values (3)

[Scatter plots with r² = 0]

r² = 0: no linear relationship between X and Y; the value of Y does not depend on X

Calculation Output (2)

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

58.08% of the variation in house prices is explained by variation in square feet

ANOVA
             df   SS           MS           F         Significance F
Regression   1    18934.9348   18934.9348   11.0848   0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

Assumptions of Regression

• Linearity: the underlying relationship between X and Y is linear
• Independence of residuals: the residual values (also known as fitting errors) are independent of one another and sum to zero

Residual Analysis

ei = Yi − Ŷi

• The residual for observation i, ei, is the difference between the observed and predicted value
• Check the assumptions of regression by examining the residuals:
  – examine the linearity assumption
  – evaluate the independence assumption
  – examine for constant variance at all levels of X
• Graphical analysis of residuals: plot residuals vs. X

Residual Analysis for Linearity

[Residuals vs. X plots: a curved pattern in the residuals indicates a non-linear relationship; a patternless band around zero indicates linearity]

Residual Analysis for Independence

[Residuals vs. X plots: systematic patterns indicate residuals that are not independent; random scatter indicates independence]

Residual Analysis for Equal Variance

[Residuals vs. X plots: a fan shape indicates non-constant variance; an even band indicates constant variance]

House Price Residual Output

     Predicted House Price   Residuals
1    251.92316               -6.923162
2    273.87671               38.12329
3    284.85348               -5.853484
4    304.06284               3.937162
5    218.99284               -19.99284
6    268.38832               -49.38832
7    356.20251               48.79749
8    367.17929               -43.17929
9    254.66740               64.33264
10   284.85348               -29.85348

[Residual plot: residuals vs. Square Feet]

Does not appear to violate any regression assumptions
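A sketch reproducing this check with matplotlib (axis labels follow the slide; the residuals match the table above up to coefficient rounding):

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])

y_hat = 98.24833 + 0.10977 * x   # predicted house prices
residuals = y - y_hat            # ei = Yi - Yhat_i

plt.scatter(x, residuals)
plt.axhline(0, color="grey")     # residuals should scatter evenly around zero
plt.xlabel("Square Feet")
plt.ylabel("Residuals")
plt.show()
```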
Multiple Regression

• Multiple regression is an extension of simple linear regression
• It is used when we want to predict the value of a variable based on the values of two or more other variables
• The variable we want to predict is called the dependent variable
• The variables we use to predict the value of the dependent variable are called the independent variables

Multiple Regression Example

A researcher may be interested in the relationship between the weight of a car, its horsepower and its petrol consumption.

• Independent Variable 1: weight
• Independent Variable 2: horsepower
• Dependent Variable: miles per gallon

Multiple Regression Fitting

• Linear regression is based on fitting a line as close as possible to the plotted coordinates of the data on a two-dimensional graph
• Multiple regression with two independent variables is based on fitting a plane as close as possible to the plotted coordinates of the data on a three-dimensional graph: Y = a + b1 X1 + b2 X2
• More independent variables extend this into higher dimensions
• The plane (or higher-dimensional shape) is placed so that it minimises the distance (sum of squared errors) to every data point

Multiple Regression Graphic

[3-D plot: a regression plane fitted through the data points]

Multiple Regression Assumptions

• Multiple regression assumes that the independent variables are not highly correlated with each other
• Use scatter plots to check

Multiple Regression Assumptions (2)

[Scatter plots of X1 vs. X2: BAD — the independent variables are highly correlated with each other; GOOD — the independent variables are uncorrelated]

Pitfalls of Regression Analysis

• Lacking an awareness of the assumptions underlying least-squares regression
• Using a regression model without knowledge of the subject matter
• Extrapolating outside the relevant range
• For multiple regression, remember the importance of the independence assumptions

Avoiding the Pitfalls of Regression

• Start with a scatter plot of X vs. Y to observe possible relationships
• Perform residual analysis to check the assumptions: plot residuals vs. X to check for violations
• If there is no evidence of assumption violation, test for significance of the regression coefficients
• Avoid making predictions or forecasts outside the relevant range

Acknowledgements

• Materials are partially adopted from ...
• Previous COMP20008 slides including material produced by James Bailey, Pauline Lin, Chris Ewin, Uwe Aickelin and others
• Sutikno, Department of Statistics, Faculty of Mathematics and Natural Sciences, Sepuluh Nopember Institute of Technology (ITS), Surabaya