Statistical Inference STAT 431
Lecture 11: Simple Regression (I) Fitting Equations to Data
Regression Analysis
• What is regression analysis?
– A body of methods for constructing equations to describe the relationship between a response variable and a set of explanatory variables
• Goals:
– Description of the relationship
– Inference of the relationship
– Prediction of the value of the response variable
• Starting point: one explanatory variable (simple regression)
• Later on: a group of explanatory variables (multiple regression)
Example: Real Estate Appraisal
• Data: 47 recent home sales in Newton, MA, from 9/16/2013 through 9/27/2013. Source: zillow.com
• How can we model and predict sale prices using sqft alone?
[Scatter plot: 46 home sales in Newton, MA, Sept 2013]
Equal-means model

E(Y) = μ_Y
Equal-means model: predicted average prices
Separate-means model: 4 groups

μ_1, μ_2, μ_3, μ_4
How many parameters do we have to estimate?
Separate-means model: 10 groups
How many parameters do we have to estimate?
Regression line

E[Price | Sqft] = β_0 + β_1 × Sqft
When is it appropriate to consider a regression line?
• Different groups correspond to different levels of a quantitative explanatory variable X (e.g., sqft.)
• Predicted means of Y (e.g., sale prices) in consecutive groups fall along a line.
Example: Price of a Diamond Ring
• How is the price of a diamond ring affected by the size of the diamond?
• The data set [diamond.txt] contains the price (in Singapore $) and weight (in carats) of 48 diamond rings. The scatter plot of the data set is given below.
• Scatter plot reveals the relationship between weight and price
• The sample correlation coefficient r = 0.989
• A regression line seems appropriate
[Scatter plot: Price (Singapore dollars, 200–1000) vs. Weight (carats, 0.15–0.35)]
The Basic Setup for Simple Regression
• Observe n independent pairs (x_1, y_1), …, (x_n, y_n) where x explains/predicts y
– x is called the predictor, explanatory variable, or independent variable
– y is called the response, outcome, or dependent variable
• Examples:

  x (predictor)       y (response)
  Weight of diamond   Price of ring
  Square footage      House price
  Ad spending         Revenue

• Goal: to construct an equation describing how y changes with x
The Least Squares (LS) Regression Line

• The simplest equation is a linear equation

  y = β_0 + β_1 x

• However, the observed data usually do not fall perfectly on any particular line
• A regression analysis computes the line, called the least squares (LS) regression line, that minimizes the sum of squared vertical distances from the line to the data.
• In other words, we find (β_0, β_1) that minimize

  Σ_{i=1}^n [y_i − (β_0 + β_1 x_i)]²

• Using calculus, one can show that the solutions are

  β̂_1 = r · s_y / s_x,   β̂_0 = ȳ − β̂_1 x̄

– β_0 is the intercept
– β_1 is the slope
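The slope and intercept formulas above can be sketched numerically. A minimal Python sketch (the toy data values below are made up for illustration):

```python
# Illustrative sketch of the LS formulas; the toy data below are made up.
def ls_line(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx          # slope; algebraically equal to r * s_y / s_x
    b0 = ybar - b1 * xbar   # intercept: y-bar minus slope times x-bar
    return b0, b1

b0, b1 = ls_line([1, 2, 3, 4], [2.1, 3.9, 6.1, 8.0])
print(b0, b1)  # roughly 0.05 and 1.99
```

Note that the slope can be computed either as r·s_y/s_x or, equivalently, as the ratio of the centered cross-products to the centered sum of squares, as done here.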
Computing the LS Line in R
In R, the key command for computing the LS line is lm.
E.g., for the diamond data:
> diamond.fit <- lm(Price ~ Weight, data = diamond)
> diamond.fit
Call:
lm(formula = Price ~ Weight, data = diamond)
Coefficients:
(Intercept)       Weight
     -259.6       3721.0
(In the output, the (Intercept) estimate is β̂_0 and the Weight estimate is β̂_1.)
We can also add the line to the scatter plot
> abline(diamond.fit)
[Scatter plot with fitted line: Price (Singapore dollars, 200–1000) vs. Weight (carats, 0.15–0.35)]
Interpreting the LS Line
Using R, we have obtained the following equation for the LS regression line:
Price (Singapore dollars) = -259.6 + 3721 × Weight (carats)

• Interpreting the coefficients
– Slope: the predicted change in response per unit change in the explanatory variable
  • Caution: magnitude depends on the units of both variables
– Intercept
  • Much less interesting: the predicted response value when the explanatory variable equals 0
• Using the equation
– How much should I expect to pay for a diamond ring with a 0.3 carat diamond?

  Answer: β̂_0 + β̂_1 × 0.3 = −259.6 + 3721 × 0.3 = 856.70
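The prediction step is just arithmetic with the fitted coefficients. A minimal sketch, using the coefficient values from the R output above:

```python
# Predicted price from the fitted diamond-ring line
# (coefficients taken from the lm output: intercept -259.6, slope 3721.0).
def predict_price(weight_carats, b0=-259.6, b1=3721.0):
    return b0 + b1 * weight_carats

print(round(predict_price(0.3), 2))  # 856.7
```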
Example: Real Estate Appraisal
E[Price | Sqft] = 26 + 0.36 × Sqft
Prediction
E[Price | Sqft] = 26 + 0.36 × Sqft

Interpolation: What is the predicted average price of a house with 4000 sqft in Newton, MA?

  E[Price | Sqft = 4000] = ?

Extrapolation: What is the predicted average price of a “shed” with 72.2 sqft?

  E[Price | Sqft = 72.2] = ?
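Both questions above amount to plugging a Sqft value into the fitted line. A sketch (price units as in the fitted model; note that the 72.2 sqft case extrapolates far below the observed range, so its output should not be trusted):

```python
# Fitted line from the slide: E[Price | Sqft] = 26 + 0.36 * Sqft.
def predict_mean_price(sqft):
    return 26 + 0.36 * sqft

print(predict_mean_price(4000))  # interpolation: about 1466
print(predict_mean_price(72.2))  # extrapolation: about 52 -- mechanically valid, statistically dubious
```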
Example: Sales and Display Footage
• A large chain of liquor stores would like to know the relationship between sales of a new wine (y) and the display footage devoted to it (x).
• Once the relationship is understood, the chain can optimize the amount of display footage in order to maximize sales.
• The data set [display.txt] contains sales (dollars) and display footage (linear shelf-feet) per month collected from 47 stores of the chain
[Scatter plot: Sales (dollars, 100–400) vs. Display footage (feet, 1–7)]
Using a linear equation to predict sales for a given amount of display footage seems unreasonable. A LS regression line does not capture the relationship between the response and the predictor when not much is on display.
Transformation in LS Regression
• The shape of the relationship is similar to the shape of y = log x
  – (In this course, log x always denotes the natural logarithm)
• So, we could consider fitting a curve of the form y = β_0 + β_1 log x
• How to fit such a curve?
  1. Transform the data: (x_i, y_i) → (log x_i, y_i)
  2. Obtain (β̂_0, β̂_1) by fitting a LS regression line on the transformed data
• What is the optimality of (β̂_0, β̂_1) in this case?

[Scatter plot: Sales (dollars, 100–400) vs. Display footage (feet, 1–7)]
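The transform-then-fit recipe can be sketched in Python. The data below are synthetic, constructed so that y = 5 + 2·log x exactly, which lets us see the fit recover the curve:

```python
import math

# Sketch of the transform-then-fit recipe: regress y on log(x).
# Synthetic data, built so that y = 5 + 2*log(x) exactly.
def ls_line(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

xs = [1.0, math.e, math.e ** 2]   # log(x) = 0, 1, 2
ys = [5.0, 7.0, 9.0]              # y = 5 + 2*log(x)
b0, b1 = ls_line([math.log(x) for x in xs], ys)
print(b0, b1)  # recovers 5 and 2 (up to floating-point error)
```

This is exactly what R's `lm(Sales ~ log(Display))` does internally: transform the predictor, then run ordinary least squares.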
To obtain (β̂_0, β̂_1) in R, after pre-processing …

> display.logfit <- lm(Sales ~ log(Display), data = display)
> display.logfit
Call:
lm(formula = Sales ~ log(Display), data = display)
Coefficients:
(Intercept) log(Display)
83.56 138.62
(In the output, the (Intercept) estimate is β̂_0 and the log(Display) estimate is β̂_1.)
So, the fitted curve is y = 83.56 + 138.62 log x. Visually, it describes the relationship better than the regression line on the original scale.
[Scatter plot with fitted line: Sales (dollars, 100–400) vs. Log display footage (log feet, 0.0–2.0)]
Interpreting the Coefficients
• The fitted equation is
y = 83.56 + 138.62 log x
• Note that for any value x , we have
log(1.1x) = log(1.1) + log x ≈ 0.1 + log x
• This leads to the following interpretation of β̂_1 = 138.62:

  If the display footage were 10% larger, the sales would be about $14 higher.

• The interpretation of β̂_0 = 83.56:

  The predicted sales are $83.56 when the display footage x satisfies log x = 0, i.e., when x = 1.
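The "10% larger" interpretation can be checked numerically: the exact change in predicted sales is β̂_1·log(1.1), which the slide rounds via log(1.1) ≈ 0.1. A quick sketch:

```python
import math

# Check of the "10% larger display" interpretation: the change in
# predicted sales is b1 * log(1.1), with b1 = 138.62 from the fitted curve.
b1 = 138.62
increase = b1 * math.log(1.1)
print(round(increase, 2))  # about 13.2; the slide's $14 uses log(1.1) ~ 0.1
```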
Typical Transformations: Tukey’s Bulging Rule
• Bulge up and to the left:    x → √x, log x, 1/x  or  y → y²
• Bulge up and to the right:   x → x²  or  y → y²
• Bulge down and to the left:  x → √x, log x, 1/x  or  y → √y, log y, 1/y
• Bulge down and to the right: x → x²  or  y → √y, log y, 1/y

Shapes of scatter plots and suggested transformations
A Probabilistic Model for Simple Regression

Y_i = β_0 + β_1 x_i + ε_i,   i = 1, …, n
      (signal)    (noise)

• The ε_i values are noises (errors) satisfying the following assumptions:
  1. Independence: the ε_i's are mutually independent random variables
  2. Homoscedasticity: the ε_i's have common mean 0 and common variance σ²
  3. Normality: the ε_i's are normally distributed
• Equivalently, we can write

  Y_i ~ N(β_0 + β_1 x_i, σ²), independently
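The model above can be simulated to see the assumptions in action. A sketch with made-up parameter values (β_0 = 2, β_1 = 1.5, σ = 1), generating data from the model and refitting by least squares:

```python
import random

# Simulation sketch of the model Y_i = b0 + b1*x_i + eps_i with
# eps_i ~ N(0, sigma^2); parameter values below are made up.
random.seed(0)
b0_true, b1_true, sigma = 2.0, 1.5, 1.0
x = [i / 10 for i in range(200)]  # fixed design: x_i treated as deterministic
y = [b0_true + b1_true * xi + random.gauss(0, sigma) for xi in x]

# Refit by least squares; estimates should land near the true values.
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1_hat = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
b0_hat = ybar - b1_hat * xbar
print(b0_hat, b1_hat)  # close to 2.0 and 1.5
```

Only the ε_i's are random here; rerunning with a different seed jitters the estimates around the true (β_0, β_1), which previews the inference questions of the next lecture.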
Y_i = β_0 + β_1 x_i + ε_i,   i = 1, …, n
      (signal)    (noise)

• The simple regression model:
  – Assumes a true regression line E(Y_i) = β_0 + β_1 x_i
  – The x_i's are usually treated as deterministic, and the randomness only comes from the noises ε_i's. [Also called fixed design]
  – There are three parameters in total: β_0, β_1, σ²
Class Summary

• Key points of this class:
  – Simple regression summarizes the relationship between a predictor and a response
  – The LS regression line
    • Optimization problem and solution
    • Interpretation of the regression coefficients
  – Transformation to new coordinates allows LS regression to capture nonlinear trends as well
    • Appropriate transformations often suggested by the shape of the scatter plot
  – A probabilistic model for simple regression
• Reading parts of Sections 10.1, 10.2 and 10.4 of the textbook
• Next class: Simple Regression (II) (parts of Ch.10.1—10.3)
– Probabilistic model and basic inferences