DSME5110F: Statistical Analysis
Lecture 9: Linear Regression
Outline
• Linear Regression
  – Simple Linear Regression
    • Regression Summary
    • Assumptions
    • Application
  – Multiple Linear Regression
    • Regression Summary
    • Assumptions
    • …
Example: Bordeaux Equation
• Large differences in price and quality between years, although wine is produced in a similar way
• Meant to be aged, so hard to tell if wine will be good when it is on the market
• Expert tasters predict which ones will be good
• Orley Ashenfelter, a Princeton economics professor, claimed he could predict wine quality without tasting the wine (March 1990)
• Reaction from the world’s most influential wine expert:
– “Ashenfelter is an absolute total sham”
– “rather like a movie critic who never goes to see the movie but tells you how good it is based on the actors and the director”
Example: Moneyball
• A 2011 American sports film
• An account of the Oakland Athletics baseball team’s 2002 season and their general manager Billy Beane’s attempts to assemble a competitive team.
• Beane (Brad Pitt) and assistant GM Peter Brand (Jonah Hill), faced with the franchise’s limited budget for players, build a team of undervalued talent by taking a sophisticated sabermetric approach to scouting and analyzing players.
• Nominated for six Academy Awards
• A former Green Bay Packers vice president stated that the film “persuasively exposed front office tension between competing scouting applications: the old school ‘eye-balling’ of players and newer models of data-driven statistical analysis … Moneyball—both the book and the movie—will become a time capsule for the business of sports”.
What Data Looks Like
      X₁   X₂   ⋯   Xₖ   Y
x₁    ⋯    ⋯    ⋯   ⋯    ⋯
x₂    ⋯    ⋯    ⋯   ⋯    ⋯
x₃    ⋯    ⋯    ⋯   ⋯    ⋯
• Observations/samples/records/…: rows
• Variables: Columns
– Predictors/Independent Variables/…
– Outcome/dependent variable/output/response/…
• Continuous: quantitative, a number like weight or length
• Discrete: qualitative, a symbol, like ‘cat’ or ‘dog’, ‘0’ or ‘1’, {0, 1, 2, 3}, {small, medium, large}
Linear Regression
• An approach to model the linear relationship between a scalar outcome variable y and one or more predictors denoted by x
  – The case of one independent variable is called simple linear regression: y = β₀ + β₁x + ε
  – With more than one independent variable, it is called multiple linear regression: y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₖxₖ + ε
• A tool for predicting a quantitative response
  – A useful and widely used statistical learning method
Simple Linear Regression
Example of Linear Regression
Deco is going to open a new store with 10 thousand babies nearby and wants to predict the daily demand. Deco collected data (the number of babies nearby and daily demand) from stores opened elsewhere.
# Babies xᵢ:   7    2    6    4   14   15   16   12   14   20   15    7
Demand  yᵢ:  150  100  130  150  250  270  240  200  270  440  340  170
(Scatterplot: daily demand against the number of babies around the store)
Simple Linear Regression Model
• Assume the simple linear regression model (an approximately linear relationship):
  y = β₀ + β₁x + ε
  – y: dependent variable (criterion variable); what we want to predict/explain
  – x: independent variable; what we use to make the prediction
  – ε: error term (positive or negative)
  – β₀: y-intercept
  – β₁: slope, or regression coefficient for the predictor
    • Interpret β₁ as the average effect on y of a one-unit increase in x
• Simple linear regression equation (linear):
  E[y|x] = β₀ + β₁x
The Estimated Regression Equation
• Use data to produce estimates b₀ and b₁; the estimated regression equation (regression line) is
  ŷ = b₀ + b₁x
• ŷ is the estimator of E[y|x], or the mean of the dependent variable for a given level of the independent variable x
• b₀ is the point estimator of the y-intercept term β₀
• b₁ is the point estimator of the regression coefficient β₁
Method of Least Squares
• Ideas of linear regression
  – Find a linear function that represents the dependent variable y (the daily demand) as a function of an independent variable x (the number of babies nearby) and best fits the observed data
  – The best model (choice of coefficients) has the smallest error terms
• Mathematical treatment (least squares method): find the coefficients b₀ and b₁ of the linear function
  ŷ = b₀ + b₁x
such that the sum of squared errors (SSE)
  SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ εᵢ²
is minimized for the observed data
Method of Least Squares: Formula of Linear Regression
The best estimates b₀ and b₁ can be written as:
  b₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²
  b₀ = ȳ − b₁x̄
where x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ and ȳ = (1/n) Σᵢ₌₁ⁿ yᵢ are the sample means.
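To connect the formula to practice, the following minimal base R sketch (not part of the original slides) computes b₀ and b₁ by hand for the Deco store data and checks them against lm():
# Least squares estimates computed directly from the formulas
x <- c(7, 2, 6, 4, 14, 15, 16, 12, 14, 20, 15, 7)                  # babies nearby
y <- c(150, 100, 130, 150, 250, 270, 240, 200, 270, 440, 340, 170) # daily demand
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)    # slope
b0 <- mean(y) - b1 * mean(x)                                       # intercept
c(b0 = b0, b1 = b1)   # approximately 50.6 and 15.9
coef(lm(y ~ x))       # lm() reproduces the same estimates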
The Estimated Regression Equation
• Expresses how the predicted value of the dependent variable is related to the independent variable
• Two purposes:
  – Description
  – Prediction, but only within the range of the independent variable(s)
• Regression does not establish causality
  – We say that the independent variable is associated with, or related to, the dependent variable
  – Normally, regression does not deal with issues of causality but only association.
Example Continued
ŷ = b₀ + b₁x = 50.6 + 15.9x
What demand would we expect from investing in a store with 10 thousand babies nearby?
ŷ = b₀ + b₁x = 50.6 + 15.9 × 10 = 209.6
Linear Regression with R
• Linear Regression Syntax:
– model <- lm(dv ~ iv, data = mydata)
– dv ~ iv is an R formula
  • dv is the dependent variable in the mydata data frame to be modeled
  • iv is an R formula specifying the independent variable(s) in the mydata data frame to use in the model
– data specifies the data frame in which the dv and iv variables can be found
– Could also be expressed: model <- lm(mydata$dv ~ mydata$iv)
– Example:
• ins_model <- lm(expenses ~ age + smoker, data = insurance)
– The model object (here, model) captures and stores a lot of useful information and results about the regression model, including the regression coefficients and other information that can be used to check model validity and assumptions.
– We can recover the information and results by using extractor functions, such as summary().
Linear Regression with R
# Data
N_babies <- c(7, 2, 6, 4, 14, 15, 16, 12, 14, 20, 15, 7)
Demd <- c(150, 100, 130, 150, 250, 270, 240, 200, 270, 440, 340, 170)
# Fit the linear regression
relation <- lm(Demd ~ N_babies)
summary(relation)
# Plot the data and add the fitted line
plot(N_babies, Demd)
abline(relation)
# Predict demand for stores with 10 and 20 (thousand) babies nearby
result <- predict(relation, data.frame(N_babies = c(10, 20)))
result
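For reference (a hand calculation from the data above, not shown on the slide): the fitted line is approximately ŷ = 50.6 + 15.9x, so predict() should return values near 209.9 for N_babies = 10 and 369.2 for N_babies = 20; the earlier calculation with rounded coefficients gave 209.6.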
Regression Summary
Understanding the Regression Summary: The p-value
• Hypothesis test:
  – Null hypothesis H₀: no relationship between X and Y, i.e., β₁ = 0
  – Alternative hypothesis Hₐ: a relationship between X and Y, i.e., β₁ ≠ 0
• p-value: denoted by Pr(>|t|) in the R output
  – A small p-value suggests that the true coefficient is unlikely to be zero, hence the feature is unlikely to have no relationship with the dependent variable.
  – p-values less than the significance level are considered statistically significant.
  – Under normal circumstances, any statistic that is accompanied by a p-value with one or more asterisks is considered statistically significant; if it has a period next to it, or an empty space, then it is said to be non-significant.
The Test Statistic (Optional)
• Three properties of the sampling distribution of b₁:
  1. E[b₁] = β₁
  2. The standard deviation of b₁ is σ_b₁ = σ / √(Σᵢ (xᵢ − x̄)²), where σ is the population standard deviation of the error term ε
  3. The shape of the sampling distribution of b₁ is normal
• Since the population parameter σ_b₁ is unknown, we need to estimate it:
  s_b₁ = s_y|x / √(Σᵢ (xᵢ − x̄)²), with s_y|x = √( Σᵢ (yᵢ − ŷᵢ)² / (n − 2) )
• The test statistic we use (because β₁ = 0 in the null hypothesis):
  t = (b₁ − β₁) / s_b₁ = b₁ / s_b₁
• It follows a t-distribution with n − 2 degrees of freedom
• The rejection region is t ≥ t_(α/2, n−2) or t ≤ −t_(α/2, n−2)
• A more convenient way is to check the p-value
Should We Also Test β₀?
• No. Usually we are not concerned with testing the significance of the intercept term
• β₀ is simply the predicted value of the dependent variable when the independent variable equals 0
  – It does not have a clear meaning: it simply comes from the linear function that fits the data best
  – In some cases, it makes no sense for the independent variable to be 0
• Do not try to make predictions for values of x that fall outside the range of values that make up the sample
  – Consider the relationship between price and camera megapixels: would it make sense to say a 2-megapixel camera would cost −50 HKD?
Understanding the Regression Summary: Goodness of Fit (the Coefficient of Determination)
• Coefficient of determination, or Multiple R-squared: R²
  r² = 1 − SS_res / SS_yy
  – Here SS_yy = Σᵢ₌₁ⁿ (yᵢ − ȳ)² (total sum of squares) and SS_res is the sum of squared errors
  – Measures the proportion of variability in y that can be explained by the regression
  – The value of r² is between 0 and 1
    • r² ≈ 0 means the regression did not explain much of the variability in the outcome
    • r² ≈ 1 means that a large proportion of the variability in the outcome has been explained by the regression
• Generally, a higher r² value implies a better regression model
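As a quick numerical check (a sketch assuming the relation model and the Demd vector from the earlier Deco code are still in the workspace), r² can be computed directly from its definition and compared with the summary output:
sse  <- sum(residuals(relation)^2)    # SS_res: sum of squared errors
ssyy <- sum((Demd - mean(Demd))^2)    # SS_yy: total sum of squares
1 - sse / ssyy                        # should match ...
summary(relation)$r.squared           # ... the Multiple R-squared in the summary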
Remark on the Model Assumptions
• The relationship between the predictors and the outcomes is linear.
  – If the relationship is not linear, then the model will make systematic errors.
• See the toy model in the R code, sketched below.
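The toy model itself is not reproduced in these notes; the following is a plausible reconstruction of the idea: generate data from a curved relationship, fit a straight line anyway, and watch the residuals err systematically.
set.seed(1)
x <- seq(-3, 3, length.out = 100)
y <- x^2 + rnorm(100, sd = 0.5)   # true relationship is quadratic, not linear
toy <- lm(y ~ x)                  # a straight-line fit anyway
plot(x, residuals(toy))           # residuals form a clear U-shape:
abline(h = 0)                     # the linear model is systematically wrong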
Example 9.1: Electricity
• A researcher wants to estimate the relationship between the electricity usage and size of a house. Random samples were collected and the data is shown on the right, where
– y is the electricity usage of the house (measured in kilowatt-hours), and
– 𝑥𝑥 is size of the house (measured in square feet)
• See the file: electricity_demo.xlsm
• The scatterplot of y against x shows that these two variables have quite strong positive linear relationship (Correlation = 0.912).
The Regression Equation
• To fit a simple regression model of y (electricity usage) against x (square footage of the house), the R code and output are shown below.
• The result shows that the regression equation is:
– y = 578.9278 + 0.5403x
• This estimated regression equation may not be meaningful unless it survives some model usefulness and fitness tests. After all, given any data for y and x, an estimated regression equation can always be obtained with the least squares method.
Summary of the Regression Model
• To get the summary statistics of the regression model needed to perform model validity checks, we can use the R function, summary(m), and the output is shown below:
Illustration of Model Assumptions
• When X = xᵢ, the Y values will be independently and normally distributed around the mean of Y at xᵢ, with a constant standard deviation. This is equivalent to saying that the random error εᵢ for any given xᵢ will be independently and normally distributed with mean 0 and a constant standard deviation.
(Figure: the regression line, with the normal distribution of Y drawn around its mean at each xᵢ)
Validating Regression Analysis Assumptions
• Scatterplot: a graphical presentation of two quantitative variables.
  – In regression analysis, the dependent variable is often along the vertical axis and the independent variable along the horizontal axis
  – The general pattern of the plotted points suggests the nature of the relationship between the two variables.
• However, a simple scatterplot and p-value can mislead us.
Validating Regression Analysis Assumptions
• We use residual analysis to validate the assumptions underlying the application of regression analysis
  – An important step in any simple regression analysis
  – It should be done once the regression model has been estimated and the residuals calculated
• Residual analysis typically involves a residual plot
  – The purpose is to understand whether the assumptions underlying the correct usage of the regression model are valid
  – Do the residuals suggest that the relationship is linear?
  – Does the variance of the residuals around the line seem constant?
  – Are there any conspicuous outliers?
• Residual plot (see the sketch below)
  – Use the plot() function: plot(mydata$iv, model$residuals)
  – Simple linear regression: a plot of residual values against x values
  – (Multiple linear regression: a plot of residual values against the predicted dependent variable ŷ)
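For the Deco example, the residual plot takes two lines (assuming relation and N_babies from the earlier code are in the workspace):
plot(N_babies, relation$residuals)   # residuals against the independent variable
abline(h = 0)                        # points should scatter randomly around zero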
Example 9.2: Predicting the Price of Used Toyota Corolla Cars
• A large Toyota car dealership offers purchasers of new Toyota cars the option to buy their used car as part of a trade-in.
• A new promotion promises to pay high prices for used Toyota Corolla cars for purchasers of a new car.
• The dealer then sells the used cars for a small profit.
• To ensure a reasonable profit, the dealer needs to be able to predict the price that the dealership will get for the used cars. For that reason, data were collected on all previous sales of used Toyota Corollas at the dealership.
Example 9.2: Predicting the Price of Used Toyota Corolla Cars
• A dataset:
– Previous sales of used Toyota Corollas during the summer of 2004 at a retailer in the Netherlands
– 1,436 records
– Variables (selected): see the list on the slide
• Goal: predict the price of a used Toyota Corolla car
  – Dependent variable (outcome): Price
  – Independent variable(s) (predictor(s)): the others
Linear Regressions
• Many different predictors could be used
  (Scatterplot panels: Price using Kilometers, using Horsepower, using Weight)
• Using simple linear regression with each predictor on its own yields a separate R² for each predictor (values reported on the slide)
• Multiple linear regression allows us to use all these variables to improve our predictive ability
Multiple Regression
• Multiple regression model (with k independent variables):
  y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₖxₖ + ε
  – y: dependent or criterion variable
  – x₁, x₂, ⋯, xₖ: the independent or predictor variables
  – ε: error term
  – β₀: intercept
  – β₁, β₂, ⋯, βₖ: partial regression coefficients
    • βⱼ is interpreted as the average effect on y of a one-unit increase in xⱼ, holding all other predictors fixed.
• Best model coefficients selected to minimize SSE
Remarks on Multiple Regression
• Since there are at least two independent variables involved, we no longer refer to the regression line but rather to the regression plane
– Difficult (and often impossible) to render a graphical image of the regression plane on a two-dimensional page
• Partial regression coefficients: the regression coefficients that form the multiple regression model.
  – Unless the independent variables are perfectly uncorrelated with one another, a partial regression coefficient is not equal to the simple linear regression coefficient on the same independent variable
  – Nor, in general, does a partial regression coefficient remain constant when it is accompanied by different sets of other independent variables
Multiple Regression Equations
• Multiple regression equation: expresses how the expected value of y is related to the independent variables
  E[y | x₁, x₂, ⋯, xₖ] = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₖxₖ
• Estimated multiple regression equation: a general expression that specifies how the predicted value of the dependent variable ŷ is related to the independent variables
  ŷ = b₀ + b₁x₁ + b₂x₂ + ⋯ + bₖxₖ
  – R can do it (see the sketch below)
  – We do not propose to solve for b₀ and the bⱼ through simple equations as in simple regression, because the mathematical solution is considerably more tedious to work out
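A sketch of how R does it for the used car example. The file and column names below (ToyotaCorolla.csv, Price, Age, KM, HP, Weight) are assumptions for illustration, not necessarily those of the course data set:
toyota <- read.csv("ToyotaCorolla.csv")                 # hypothetical file name
m <- lm(Price ~ Age + KM + HP + Weight, data = toyota)  # fits b0 + b1*Age + ...
summary(m)   # coefficients, R-squared, F-statistic, p-values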
Adding More Predictors
• Models fitted (R² values reported on the slide):
  – Age, Kilometers
  – Age, Kilometers, Horsepower
  – Age, Kilometers, Horsepower, Weight
  – All predictors (~ . ; see the note after this list)
• Adding more predictors can always improve R²
• Diminishing returns as more predictors are added
• Questions:
  – Are these models good?
  – Which one is the best? How do we compare different models?
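The ~ . in the last model is standard R formula shorthand for "regress on all other columns in the data frame"; for example, reusing the hypothetical toyota data frame from the sketch above:
m_all <- lm(Price ~ ., data = toyota)   # every column except Price as a predictor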
Multiple Coefficient of Determination
• Multiple coefficient of determination (multiple R-squared): an important measure of the goodness of fit of the regression model
  – Models with more predictors always explain more variation, and hence the multiple coefficient of determination is larger
  – Any model with a high coefficient of determination may contain independent variables that shouldn’t be included
• The adjusted multiple coefficient of determination compensates for the number of independent variables:
  r²_adj = r² − k(1 − r²) / (n − k − 1)
  where the subtracted term is the penalty for the number of predictors
  – The adjusted R-squared value corrects R-squared by penalizing models with more predictors
  – The adjusted R-squared is useful for comparing the performance of models with different numbers of predictors
  – Otherwise it is calculated and interpreted the same way as r²
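Both statistics can be read off a fitted model's summary; a sketch using the hypothetical model m from the earlier Toyota example:
s <- summary(m)
s$r.squared        # multiple R-squared
s$adj.r.squared    # adjusted R-squared
# The same value from the slide's formula (assuming no rows were dropped):
n <- nrow(toyota); k <- length(coef(m)) - 1
s$r.squared - k * (1 - s$r.squared) / (n - k - 1)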
Tests of Significance
• For multiple regression, there are two types of significance tests:
– One for the significance of the overall regression model itself
– The other for the significance of each independent variable, taken one at a time
Tests of Significance: Overall Regression Model
• To determine if the overall regression model is significant
• Hypothesis test:
  – Null hypothesis H₀: β₁ = β₂ = ⋯ = βₖ = 0
  – Alternative hypothesis Hₐ: at least one coefficient is nonzero
• An alternative expression of this:
  – Null hypothesis H₀: r² = 0, i.e., the overall regression model offers no benefit in terms of explanatory or predictive power over simply modelling the dependent variable y using its own mean ȳ
  – Alternative hypothesis Hₐ: r² > 0
• The test statistic is the F-statistic and its distribution is the F-distribution.
  – The R function associated with this distribution is qf().
• Not required: the particular F-distribution has df₁ = k numerator degrees of freedom and df₂ = n − k − 1 denominator degrees of freedom
• Not required: the rejection region is F ≥ qf(0.05, df₁, df₂, lower.tail = FALSE)
• One can also calculate the p-value
  – It is reported in the output
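A sketch of the overall F test done by hand, using the F statistic stored in the model summary (hypothetical model m as before):
fs <- summary(m)$fstatistic   # named vector: value, numdf (= k), dendf (= n - k - 1)
qf(0.05, fs["numdf"], fs["dendf"], lower.tail = FALSE)          # critical value at alpha = 0.05
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)   # the overall p-value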
Tests of Significance: The Independent Variables
• Even if the significance test for the overall regression model indicates that it is significant, you need to test each independent variable one by one to identify which are significant and which are not
• These significance tests are almost the same as the significance tests used in simple linear regression; the only difference concerns the test statistic distribution (a t-distribution with df = n − k − 1)
• For example, for β₁, the hypothesis test is
  – Null hypothesis H₀: β₁ = 0
  – Alternative hypothesis Hₐ: β₁ ≠ 0
• The test statistic is a t-statistic and its distribution is the t-distribution.
  – Recall the R function associated with this distribution: qt()
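A sketch of reproducing one coefficient's t test by hand from the summary table (hypothetical model m as before):
ct <- summary(m)$coefficients                   # Estimate, Std. Error, t value, Pr(>|t|)
t1 <- ct[2, "Estimate"] / ct[2, "Std. Error"]   # t statistic for the first predictor
2 * pt(abs(t1), df = df.residual(m), lower.tail = FALSE)   # two-sided p-value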