BUSANA 7001 – Predictive and Visual Analytics for Business
Week 4: Predictive analytics using multiple regressions
Basics
Simple regression
Regression assumptions
Multiple regression
Introduction
The purpose of quantitative analysis is to find or test certain relations.
Correlation coefficients shed some light on the direction of the linear relation:
• positive
• negative, or
• no linear relation.
Let’s investigate the relation between car price and car length using the SAS-provided data set SASHELP.CARS.
/* Creating data file: */
DATA work.car_data;
SET SAShelp.Cars;
RUN;
/* Correlation coefficient: */
proc corr data=work.car_data;
var invoice length;
run;
Example II
[SAS output: Pearson correlation coefficients for INVOICE and LENGTH]
Example III
The correlation coefficient between car price and length is 0.16659 (p-value = 0.0005).
⇒ The relation is positive and statistically significant.
However, the correlation coefficient does not let us estimate or predict the car price for a given car length:
• e.g., what is the approximate price of a 180-inch-long car?
Introduction II
One needs to use regression analysis in order to answer this question!
Regressions:
Simple regression: y = β0 + β1x
Multiple linear regression: y = β0 + β1x1 + β2x2 + · · · + βnxn
where:
y is the dependent variable
x, x1, x2, …, xn are the independent variables
β0 is the intercept
β1, β2, …, βn are the slopes.
Simple regression
Let’s regress car price on car length using SAS:
• INVOICE = f(LENGTH) = intercept + slope × LENGTH.
/* OLS regression: */
PROC REG DATA=work.car_data;
MODEL invoice=length;
RUN;
Simple regression II
[SAS PROC REG output: ANOVA table and parameter estimates]
Simple regression III
[SAS PROC REG output: fit diagnostics]
Simple regression IV
[SAS plot: residuals]
No obvious trends or patterns in the residuals.
Simple regression V
[SAS fit plot: regression line with 95% confidence and prediction limits]
95% confidence interval vs. 95% prediction interval
Confidence intervals tell you how well you have determined the mean. Assume that the data are randomly sampled from a normal distribution and you are interested in determining the mean. If you sample many times, and calculate a confidence interval of the mean from each sample, you’d expect 95% of those intervals to include the true value of the population mean.
Prediction intervals tell you where you can expect to see the next data point sampled. Assume that the data are randomly sampled from a normal distribution. Collect a sample of data and calculate a prediction interval. Then sample one more value from the population. If you repeat this process many times, you’d expect the prediction interval to capture the individual value 95% of the time.
Source: https://www.graphpad.com/support/faqid/1506/.
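In PROC REG, both intervals can be requested directly. A minimal sketch (work.intervals and the CI_/PI_ variable names are illustrative, not from the slides):
PROC REG DATA=work.car_data;
MODEL invoice=length / CLM CLI; /* print confidence and prediction limits */
OUTPUT OUT=work.intervals P=Pred
LCLM=CI_low UCLM=CI_high /* 95% confidence limits for the mean */
LCL=PI_low UCL=PI_high; /* 95% prediction limits for an individual car */
RUN;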
Simple regression: Interpretation of the results
We got that:
• intercept = -8131.76
• slope = 204.69
• INVOICE = -8131.76 + 204.69 × LENGTH.
Car price increases by $204.69 for each additional inch of car length.
Simple regression: Predictions
Suppose we would like to estimate the price of a 180-inch-long car:
• INVOICE = -8131.76 + 204.69 × LENGTH
• INVOICE = -8131.76 + 204.69 × 180 = $28,712.4
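The same prediction can be obtained from SAS by scoring a new observation: PROC REG drops rows with a missing dependent variable from the estimation but still fills in their predicted values. A minimal sketch (work.to_score, work.car_plus, and work.scored are illustrative names):
DATA work.to_score;
length=180; /* invoice left missing on purpose */
RUN;
DATA work.car_plus;
SET work.car_data work.to_score;
RUN;
PROC REG DATA=work.car_plus;
MODEL invoice=length;
OUTPUT OUT=work.scored P=Pred; /* Pred is about 28,712 for the new row */
RUN;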
Simple regression: Predictions II
Let’s check the actual prices of 180-inch-long cars:
PROC PRINT DATA=work.car_data (obs=20);
var make model length invoice;
where length = 180;
run;
Simple regression: Predictions III
[SAS PROC PRINT output: 180-inch cars with their invoice prices]
⇒ most of the cars are more expensive than our predicted value.
R-squared
R-squared is a goodness-of-fit or accuracy measure.
The higher the R-squared, the better the model.
R-squared is the ratio of the variation explained by the model to the total variation of the dependent variable:
• R² = explained variation / total variation = 1 − SSE/SST.
0 ≤ R-squared ≤ 1.
R-squared and correlation
• R-squared of the regression model = 0.0278
• Correlation coefficient = 0.16659.
Correlation² = R-squared: 0.16659² ≈ 0.0278.
R-squared and correlation II
R-squared of the regression model = 0.0278
⇒ Car length can explain only 2.78% of the variation in car prices:
• the explanatory power of the model is weak
• there should be other factors that better explain car prices.
Predicted values and residuals
Predicted values of car prices (INVOICEpred) can be computed from:
• INVOICEpred = -8131.76 + 204.69 × LENGTH, where LENGTH is the actual car length.
Residuals (INVOICEres) are then computed as:
• INVOICEres = INVOICE − INVOICEpred, where INVOICE is the actual car price.
Predicted values and residuals II
Consider the following 180-inch-long cars:
• LX: $18,630
• Chevrolet Corvette 2dr: $39,068.
The predicted price is the same for both cars: $28,712.4.
The residuals are:
• LX: −$10,082.4
• Chevrolet Corvette 2dr: $10,355.6.
Predicted values and residuals III
We can compute predicted values and residuals manually, or we can ask SAS to do this:
PROC REG DATA=work.car_data;
MODEL invoice=length;
OUTPUT OUT=work.reg_results r=Res p=Pred;
RUN;
`Pred’ holds the predicted values and `Res’ holds the residuals.
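A quick way to inspect the new columns (a sketch; the variable list is illustrative):
PROC PRINT DATA=work.reg_results (obs=5);
var make model invoice Pred Res;
RUN;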
Regression analysis using SAS Visual Analytics
One should use the `Linear regression’ object (from the `SAS Visual Statistics’ list).
Assumptions of linear regression
1. Linear relationship
2. Normality
3. Independence
4. Homoscedasticity
1. Linear relationship
The relation between the independent variable (x) and the dependent variable (y) is linear.
Detecting a linear relationship is fairly simple.
In most cases, linearity is clear from the scatterplot.
Relevant SAS code:
/* Scatterplot: */
proc gplot data=work.car_data;
title 'Scatter plot of Invoice and Length';
plot Invoice * Length=1;
run;
quit;
1. Linear relationship II
[Scatterplot: INVOICE vs. LENGTH]
⇒ no obvious non-linear relation found.
2. Normality
The dependent variable y is distributed normally for each value of the independent variable x.
Outliers are the main reason for the violation of this assumption.
To check for normality, one can use (see the sketch below):
• scatterplots (y vs. x)
• a histogram of standardized residuals
• a normal probability plot of the residuals (P–P plot).
⇒ there are a few outliers.
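A minimal sketch of the residual-based checks, reusing the work.reg_results data set created earlier (HISTOGRAM, PROBPLOT, and the NORMAL option are standard PROC UNIVARIATE features):
proc univariate data=work.reg_results normal;
var Res; /* residuals from the earlier OUTPUT statement */
histogram Res / normal; /* histogram with a fitted normal curve */
probplot Res / normal(mu=est sigma=est); /* normal probability plot */
run;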
3. Independence
The values of y should depend on the independent variables but not on their own previous values.
The violation of this assumption is observed mostly in time-series data (e.g., gross domestic product (GDP)).
Autocorrelation coefficients for different lags can help detect dependencies (see the sketch below):
• correlation between GDP and GDP lagged by 1 period
• correlation between GDP and GDP lagged by 2 periods
• correlation between GDP and GDP lagged by 3 periods
• correlation between GDP and GDP lagged by 4 periods.
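A minimal sketch of this check (work.gdp_data and the variable gdp are hypothetical; LAG1–LAG4 are standard SAS functions):
DATA work.gdp_lags;
SET work.gdp_data;
gdp_lag1=lag1(gdp); /* GDP lagged by 1 period */
gdp_lag2=lag2(gdp);
gdp_lag3=lag3(gdp);
gdp_lag4=lag4(gdp);
RUN;
proc corr data=work.gdp_lags;
var gdp gdp_lag1 gdp_lag2 gdp_lag3 gdp_lag4;
run;
Alternatively, the DW option on PROC REG’s MODEL statement reports the Durbin–Watson statistic for first-order autocorrelation.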
4. Homoscedasticity
The variance in y is the same at each level of x.
There is no special segment or interval in x where the dispersion in y is distinct.
Scatterplots (y vs. x) can be used to detect heteroscedasticity (the opposite of homoscedasticity).
In our example, the variance is higher when the car is 175–205 inches long, but statistical tests imply that the model does not suffer from heteroscedasticity.
Plots of the residuals versus the predicted values can also be used to detect heteroscedasticity (see the sketch below).
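A quick way to draw this diagnostic from the earlier output data set (a sketch reusing work.reg_results):
proc gplot data=work.reg_results;
title 'Residuals vs. predicted values';
plot Res * Pred=1;
run;
quit;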
4. Homoscedasticity II
The simple way to deal with heteroscedasticity is to segment the data and build different regression lines for different intervals.
In general, if the first three assumptions are satisfied, then heteroscedasticity might not even exist.
As a rule of thumb, the first three assumptions need to be fixed before attempting to fix heteroscedasticity.
4. Homoscedasticity III
To detect heteroscedasticity, one can use White’s and Breusch–Pagan tests.
Procedure MODEL (rather than REG) includes them.
Relevant SAS code (both procedures give the same results):
PROC REG DATA=work.car_data;
MODEL invoice=EngineSize;
RUN;
PROC MODEL DATA=work.car_data;
PARMS a1 b1;
invoice = a1 + b1 * EngineSize;
FIT invoice / WHITE PAGAN=(1 EngineSize);
RUN;
QUIT;
4. Homoscedasticity IV
Tests’ results:
[SAS output: White’s and Breusch–Pagan test statistics]
⇒ H0 of no heteroscedasticity is rejected.
4. Homoscedasticity V
Solutions:
• adjust the standard errors (a.k.a. (heteroskedasticity-)robust standard errors, White–Huber standard errors, etc.)
• transform non-normally distributed variables (e.g., using the natural logarithm).
Adjusted standard errors
Option ACOV adjusts the standard errors:
PROC REG DATA=work.car_data;
MODEL invoice=EngineSize / ACOV;
RUN;
This option can be used in SAS procedure REG only.
Adjusted standard errors II
The robust standard errors can be used even under homoskedasticity.
In that case, the robust standard errors simply become the regular standard errors.
Log-transformed variables
Variables with positive values can be log-transformed:
DATA work.car_data;
SET work.car_data;
log_MSRP=log(MSRP);
log_length=log(length);
RUN;
Log-transformed variables II
[SAS output]
Log-transformed variables III
[SAS output]
Log-transformed variables IV
Let’s estimate 3 regression models and predict the MSRP of a 180-inch-long car.
PROC REG DATA=work.car_data;
MODEL log_MSRP=length;
RUN;
PROC REG DATA=work.car_data;
MODEL MSRP=log_length;
RUN;
PROC REG DATA=work.car_data;
MODEL log_MSRP=log_length;
RUN;
Presentation of results
SAS does not present regression results in a publication-ready format.
We should manually make tables.
We should also write table descriptions that include (see the sketch below for extracting the numbers):
• variable definitions
• a note that t-statistics based on standard errors are reported in brackets
• a note that ***, **, and * indicate significance at the 1%, 5%, and 10% levels, respectively.
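One way to pull the numbers for such a table out of SAS (a sketch; ParameterEstimates is the standard ODS table name produced by PROC REG, and work.pe is an illustrative data set name):
ODS OUTPUT ParameterEstimates=work.pe;
PROC REG DATA=work.car_data;
MODEL log_MSRP=length;
RUN;
QUIT;
proc print data=work.pe;
var Variable Estimate tValue Probt;
run;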
Log-transformed variables V
We get the following results:

                        Dependent variable:
                        ln(MSRP)     MSRP          ln(MSRP)
Independent variables   Model 1      Model 2       Model 3
LENGTH                  0.0095***
                        [6.05]
ln(LENGTH)                           44,467***     1.8039***
                                     [3.70]        [6.16]
Intercept               8.4922***    -199,551***   0.8446
                        [28.83]      [3.18]        [0.55]
R²                      0.079        0.031         0.082
Adjusted R²             0.077        0.029         0.080
Number of observations  428          428           428
Description for the previous table
Table 1: Determinants of car prices
This table presents the results of OLS regressions where the dependent variable is the manufacturer suggested retail price (MSRP) or its natural logarithm. LENGTH is the car length in inches. The absolute values of t-statistics based on standard errors are reported in brackets. ***, **, and * indicate significance at the 1%, 5%, and 10% levels, respectively.
The description above should be placed above the table.
Log-transformed variables VI
Let’s compute the predicted values of MSRP for a 180-inch-long car.
ln(180) ≈ 5.1930
Model 1: ln(MSRP) = 8.4922 + 0.0095 × 180 ≈ 10.2022; MSRP = exp(10.2022) ≈ 26,962.44
Model 2: MSRP = -199,551 + 44,467 × 5.1930 ≈ 31,364.21
Model 3: ln(MSRP) = 0.8446 + 1.8039 × 5.1930 ≈ 10.2122; MSRP = exp(10.2122) ≈ 27,232.73
Models 1 and 3 are preferred (one of the reasons is their higher R²). The difference between the predictions of Models 1 and 3 is 1%, whereas the difference between the predictions of Models 2 and 3 is 15%.
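A minimal sketch that reproduces these back-transformed predictions in SAS (the coefficients are hard-coded from the table above; exp() undoes the natural log):
DATA work.predictions;
len=180;
ln_len=log(len);
msrp1=exp(8.4922 + 0.0095*len); /* Model 1: about 26,962 */
msrp2=-199551 + 44467*ln_len; /* Model 2: about 31,364 */
msrp3=exp(0.8446 + 1.8039*ln_len); /* Model 3: about 27,233 */
RUN;
proc print data=work.predictions; run;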
When linear regression can’t be used
If any of the 4 assumptions is not satisfied, linear regression should not be used.
Linear regression can’t be used when:
• the relation between y and x is nonlinear
• the errors are not normally distributed
• there is a dependency within the values of the dependent variable
• the variance of y is not the same over the entire range of x.
Multiple regression
Let’s investigate the relation between the discount and car length.
It is likely that the discount is affected by many other factors besides car length (a sketch of the resulting multiple regression follows this list):
• car manufacturer (`make’)
• car model (`model’)
• car type (`type’)
• drivetrain (`drivetrain’)
• production place (`origin’)
• engine (`enginesize’, `cylinders’, `horsepower’)
• fuel efficiency (`MPG_City’, `MPG_Highway’)
• and maybe car weight (`weight’) and wheel base (`wheelbase’).
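A minimal sketch of such a multiple regression (the DISCOUNT variable is created later in these slides; categorical variables such as `type’, `drivetrain’, and `origin’ would need dummy variables or a CLASS statement in PROC GLM, so only numeric regressors are shown):
PROC REG DATA=work.car_data;
MODEL discount=length enginesize horsepower MPG_City weight;
RUN;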
Multiple regression II
If our regression model does not include an important independent variable, then the model suffers from omitted variable bias.
The OLS estimator is likely to be biased and inconsistent due to omitted variable bias.
⇒ coefficient estimates might become unreliable.
One should include as many relevant variables as possible in the regression model if the data are available.
I found that some variables on the previous slide cannot be included in the model together with car length (due to multicollinearity).
Multicollinearity
Multicollinearity is a phenomenon due to high interdependency between the independent variables.
If we include highly correlated independent variables in the same regression model, then this could cause multicollinearity.
Implications of multicollinearity (a detection sketch follows this list):
• it increases the variance of the coefficient estimates and makes the estimates very sensitive to minor changes in the model
• coefficient estimates might become unstable and difficult to interpret
• t-test results are not trustworthy, etc.
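A common way to quantify multicollinearity is the variance inflation factor, available through PROC REG’s VIF option (a sketch; the regressor list is illustrative, and VIF values above roughly 10 are often read as a warning sign):
PROC REG DATA=work.car_data;
MODEL discount=length enginesize horsepower weight / VIF;
RUN;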
Multiple regression III
First, let’s check the summary statistics of the discount.
/* Creating discount variable: */
DATA work.car_data;
SET work.car_data;
discount=msrp/invoice-1;
RUN;
proc univariate data=work.car_data plots;
var discount;
run;
Properties of DISCOUNT
[SAS PROC UNIVARIATE output]
Properties of DISCOUNT II
[SAS PROC UNIVARIATE output]
Properties of DISCOUNT III
[SAS PROC UNIVARIATE output]
Properties of DISCOUNT IV
[SAS PROC UNIVARIATE output]
Correlation matrix
Let’s look at the correlation matrix of the discount and the other numerical variables:
proc corr data=work.car_data;
var discount invoice enginesize cylinders horsepower
MPG_City MPG_Highway Length weight wheelbase;
run;
Correlation matrix II
[SAS output: correlation matrix]
Determinants of DISCOUNT
Let’s consider engine size as a potential determinant of the discount.
SAS code for the scatterplot:
proc gplot data=work.car_data;
title 'Scatter plot of discount and engine size';
plot Discount * Enginesize=1;
run;
quit;
Determinants of DISCOUNT II
[Scatterplot: DISCOUNT vs. ENGINESIZE]