Predictive Analytics – Week 2: Linear Regression and Statistical Thinking
Semester 2, 2018
Discipline of Business Analytics, The University of Sydney Business School
QBUS2820 content structure
1. Statistical and Machine Learning foundations and
applications.
2. Advanced regression methods.
3. Classification methods.
4. Time series forecasting.
Before Lecture 2, review linear algebra, especially matrix
multiplication, rank, determinants and inverses.
Week 2: Linear Regression and Statistical Thinking
1. Introduction
2. The least squares algorithm
3. The MLR model
4. Statistical properties
5. Interpreting a linear regression model
6. Regression modelling
Introduction
Linear regression
Linear regression is a simple and widely used method for
supervised learning. There are several important reasons for
developing an in-depth understanding of this method.
• It is very useful for prediction in many settings.
• It is extremely useful conceptually. Many advanced statistical
learning methods can be understood as extensions and
generalisations of linear regression.
• Due to its simplicity, linear regression is often a useful
jumping-off point for model building and analysis.
• Interpretability.
Example: advertisement data
Consider, for example, the advertisement data from the ISL
textbook (see next slide). Possible questions:
• Is there a relationship between advertising budget and sales?
• How strong is the relationship between advertising budget and
sales?
• Which media contribute to sales?
• How accurately can we predict future sales?
• Is there synergy among the advertisement media?
Example: advertisement data
[Figure from ISL: scatter plots of Sales against the TV, Radio and Newspaper advertising budgets.]
Example: advertisement data
To answer our questions, we can use a model such as
sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε
Statistical thinking
Statistical thinking is using statistical models, statistical theory,
and critical thinking to learn from data.
• How do I design a study to answer a certain question?
• How relevant and representative are my data?
• What is the variability in my data? Can I reliably draw
conclusions in light of this variability?
• How do I correctly interpret my results?
• Can I generalise my conclusions in the way that I would like
to?
• What are the limitations of my analysis?
The least squares algorithm
Linear regression
In the linear regression method for prediction, we consider a
regression function of the form
f(x) = β0 + β1 x1 + β2 x2 + ... + βp xp
We learn the prediction coefficients β̂0, β̂1, . . . , β̂p by fitting the
model to the training data using the least squares method.
Least squares
Let D = {(yi, xi)}_{i=1}^{N} be the training data. We define the residual
sum of squares as a function of the parameter values β as

RSS(β) = ∑_{i=1}^{N} (yi − f(xi; β))^2
       = ∑_{i=1}^{N} (yi − β0 − ∑_{j=1}^{p} βj xij)^2
Least squares
The ordinary least squares (OLS) method selects the coefficient
values that minimise the residual sum of squares
β̂ = argmin_β ∑_{i=1}^{N} (yi − β0 − ∑_{j=1}^{p} βj xij)^2
Least squares
(Figure from ISL)
Interpretation
If our loss function L(y, f(x)) is the squared error loss, the OLS
algorithm consists of minimising the empirical loss for our choice of
predictive function:
β̂ = argmin_β (1/N) ∑_{i=1}^{N} L(yi, f(xi))
Least squares and linear algebra
In order to obtain a solution to the OLS minimisation problem, we
need linear algebra.
https://xkcd.com/1838/
Design matrix
X =
  [ 1  x11  x12  ...  x1p ]
  [ 1  x21  x22  ...  x2p ]
  [ ⋮    ⋮    ⋮    ⋱   ⋮  ]
  [ 1  xN1  xN2  ...  xNp ]
Least squares and linear algebra
We equivalently write the RSS as
RSS(β) = (y − Xβ)^T (y − Xβ)
       = y^T y − 2 β^T X^T y + β^T X^T X β
We optimise the RSS by taking the p+ 1 partial derivatives and
setting them to zero.
Vector differentiation rules
Let x and a be vectors of equal dimension and A a square matrix
conformable with x. Then:
d(x^T a)/dx = a

d(x^T A x)/dx = (A + A^T) x
Partial derivatives
RSS(β) = y^T y − 2 y^T Xβ + β^T X^T X β

The vector of partial derivatives is

d RSS(β)/dβ = d(y^T y)/dβ − d(2 β^T X^T y)/dβ + d(β^T X^T X β)/dβ
            = 0 − 2 X^T y + 2 X^T X β
OLS estimates
The first order condition is:
d RSS(β)/dβ = −2 X^T y + 2 X^T X β = 0

The least squares estimate β̂ therefore satisfies the normal equations

X^T X β̂ = X^T y

If X^T X is invertible, left-multiplying both sides by (X^T X)^{-1} gives
the unique solution

β̂ = (X^T X)^{-1} X^T y.
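A minimal sketch (not part of the original slides) of computing the OLS estimate in Python with NumPy. The data are simulated purely for illustration; in practice a least squares solver such as np.linalg.lstsq is preferred to forming (X^T X)^{-1} explicitly.

import numpy as np

# Simulated training data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
N, p = 100, 3
X_raw = rng.normal(size=(N, p))
beta_true = np.array([2.0, 0.5, -1.0, 0.3])        # beta_0, beta_1, ..., beta_p
y = beta_true[0] + X_raw @ beta_true[1:] + rng.normal(scale=0.5, size=N)

# Design matrix: a column of ones for the intercept plus the predictors
X = np.column_stack([np.ones(N), X_raw])

# OLS via the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically more stable alternative that avoids forming X^T X
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)        # close to beta_true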
OLS for big data?
The OLS solution is β̂ = (X^T X)^{-1} X^T y, given that you can
compute the matrix X^T X.

X is a matrix of size N × (p + 1). When N is very large, X may be too
large to hold in memory, so computing X^T X directly is expensive or
infeasible; X^T X may also be close to singular.
For big data, it’s challenging to compute this matrix! Solutions?
XTX non-invertible?
• Reason: multicollinearity or redundant predictors.
Rank and determinant of X^T X = ?
• Solution: drop one or more highly correlated predictors from
the model, or collect more data.
• Reason: the number of predictors is too large, e.g. N ≪ p.
Rank and determinant of X^T X = ?
• Solution: drop some predictors or collect more data; add a
regularisation term to the model. More details later.
• For real matrices X:
rank(X^T X) = rank(X X^T) = rank(X) = rank(X^T)
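A small sketch (illustrative, not from the slides) of why a redundant predictor makes X^T X rank deficient: with a perfectly collinear column, the rank drops below p + 1 and the determinant is numerically zero, so the inverse does not exist.

import numpy as np

rng = np.random.default_rng(1)
N = 50
x1 = rng.normal(size=N)
x2 = 2 * x1                               # redundant predictor: perfectly collinear with x1
X = np.column_stack([np.ones(N), x1, x2])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))         # 2 rather than 3: rank deficient
print(np.linalg.det(XtX))                 # essentially zero
# np.linalg.inv(XtX) would raise LinAlgError or return numerically meaningless values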
Fitted values
The fitted values based on the training inputs are
ŷi = β̂0 + ∑_{j=1}^{p} β̂j xij

The vector of fitted values for the entire sample is:

ŷ = Xβ̂ = X(X^T X)^{-1} X^T y.

We refer to H = X(X^T X)^{-1} X^T as the hat matrix.
Residuals
The regression residuals are:
ei = yi − ŷi = yi − β̂0 − ∑_{j=1}^{p} β̂j xij

The vector of residuals is:

e = y − ŷ
  = y − X(X^T X)^{-1} X^T y
  = (I − X(X^T X)^{-1} X^T) y.
Measuring fit
We can show that
TSS = RegSS + RSS
∑_i (yi − ȳ)^2 = ∑_i (ŷi − ȳ)^2 + ∑_i ei^2
• TSS: total sum of squares.
• RegSS: regression sum of squares.
• RSS: residual sum of squares.
Measuring fit
R^2 = RegSS / TSS = 1 − RSS / TSS
Interpretation:
• The R2 measures the proportion of the variation in the
response data that is accounted for by the estimated linear
regression model.
• The R2 can only increase when you add another variable to
the model.
• The R2 is a useful part of the regression toolbox, but it does
not measure the predictive accuracy of the estimated
regression, or more generally how good the model is.
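A short sketch (simulated data, illustrative names) verifying the decomposition TSS = RegSS + RSS and the two equivalent expressions for R2 after an OLS fit.

import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=(N, 2))
y = 1.0 + x @ np.array([0.5, -1.0]) + rng.normal(scale=0.5, size=N)

X = np.column_stack([np.ones(N), x])                # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS fit

y_hat = X @ beta_hat                                # fitted values
e = y - y_hat                                       # residuals

TSS = np.sum((y - y.mean()) ** 2)
RegSS = np.sum((y_hat - y.mean()) ** 2)
RSS = np.sum(e ** 2)

print(TSS, RegSS + RSS)                             # TSS = RegSS + RSS
print(RegSS / TSS, 1 - RSS / TSS)                   # two equivalent ways to compute R^2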
Prediction
Let β̂ be the OLS coefficients obtained from the training sample.
ŷ0 = β̂0 + ∑_{j=1}^{p} β̂j x0j
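A minimal illustration of the prediction formula; the coefficient values and the new input x0 below are made up for the example.

import numpy as np

beta_hat = np.array([2.0, 0.5, -1.0, 0.3])   # estimated beta_0, ..., beta_p (illustrative values)
x0 = np.array([1.2, -0.7, 0.4])              # a new input vector (x_01, ..., x_0p)

y0_hat = beta_hat[0] + x0 @ beta_hat[1:]     # y0_hat = beta_0 + sum_j beta_j x_0j
print(y0_hat)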
The MLR model
Models and algorithms
So far, we have talked about the least squares algorithm and even
arrived at predictions without reference to a model. The current
practice of data science places a large emphasis on algorithmic
thinking for problem solving.
https://xkcd.com/1831/
Statistical models
A statistical model is a description of a data generating process
based on a set of mathematical assumptions about the population
and the sampling process.
A regression model is a description of the relationship between a
response variable Y and predictors X1, . . . , Xp. More formally, it is
a model of the form p(y|x; θ).
Formulating statistical models and making assumptions allow us to
say more about a problem.
The Multiple Linear Regression (MLR) model
1. Linearity: if X = x, then
Y = β0 + β1 x1 + ... + βp xp + ε
for some population parameters β0, β1, . . . , βp and a random
error ε.
2. The conditional mean of ε given X is zero: E(ε|X) = 0.
3. Constant error variance: Var(ε|X) = σ2.
4. Independence: all the error pairs εi and εj (i ≠ j) are
independent.
5. The distribution of X1, . . . , Xp is arbitrary.
6. There is no perfect multicollinearity.
Checking the assumptions
It is fundamental to check the assumptions with data. We do this
with residual diagnostics. The following plots are often useful:
• Fitted values against residuals.
• Predictors against residuals.
• Fitted values against squared or absolute residuals.
• Predictors against squared or absolute residuals.
• Residual distribution.
• If the observations are ordered: residuals against coordinates
(time and/or space).
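As a sketch of two of the diagnostic plots listed above (fitted values against residuals, and the residual distribution), using simulated data; with real data, y_hat and e would come from your fitted model.

import numpy as np
import matplotlib.pyplot as plt

# Simulated data and OLS fit (illustrative only)
rng = np.random.default_rng(0)
N = 200
x = rng.uniform(0, 10, size=N)
y = 1 + 2 * x + rng.normal(scale=1.0, size=N)
X = np.column_stack([np.ones(N), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
e = y - y_hat

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(y_hat, e, s=10)                  # fitted values against residuals
axes[0].axhline(0, color="grey", linewidth=1)
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Residuals")
axes[1].hist(e, bins=30)                         # residual distribution
axes[1].set_xlabel("Residuals")
plt.tight_layout()
plt.show()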
Statistical properties
Sampling distribution of an estimator
In classical statistics, the population parameter β is fixed and the
data is a random sample from the population. We estimate β by
applying an estimator β̂(D) to data (in our case the OLS
algorithm).
We study the uncertainty of an estimate by computing the
sampling distribution of the estimator.
Sampling distribution of an estimator
Imagine that we draw many different datasets D(s) (s = 1, . . . , S)
from the true model p(y|X;β). Each dataset has size N .
For each of these datasets, we apply the estimator β̂(·) and obtain
a set of estimates {β̂(D(s))}. The sampling distribution is the
induced distribution on β̂(·) as S →∞.
This concept is not necessarily intuitive, since it refers to
hypothetical datasets rather than the data that we actually have.
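The thought experiment above can be mimicked by simulation. The sketch below (illustrative values, not from the slides) draws S datasets from a known linear model, applies OLS to each, and inspects the spread of the slope estimates, which approximates the sampling distribution.

import numpy as np

rng = np.random.default_rng(0)
N, S = 50, 2000
beta_true = np.array([1.0, 2.0])                   # true intercept and slope

slope_estimates = np.empty(S)
for s in range(S):
    x = rng.normal(size=N)
    y = beta_true[0] + beta_true[1] * x + rng.normal(size=N)
    X = np.column_stack([np.ones(N), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    slope_estimates[s] = beta_hat[1]

# The empirical distribution of the S estimates approximates the sampling distribution
print(slope_estimates.mean(), slope_estimates.std())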
Sampling distribution of an estimator
Establishing the sampling distribution allows us to answer
questions such as:
• Is there a significant relationship between the response and
the predictors?
• Are all the predictors related to the response, or only a subset?
• How accurate are our coefficient estimates?
• How accurate are our predictions?
Sampling distribution
Under the Gaussian MLR model with ε ∼ N(0, σ^2), we can obtain
an exact sampling distribution for the OLS estimator,

β̂ ∼ N(β, σ^2 (X^T X)^{-1})

When estimating σ^2, we have

(β̂j − βj) / SE(β̂j) ∼ ?
We can then rely on this distribution for hypothesis testing.
Review your study notes from previous units or the reference book
for: the sampling distribution of the OLS estimator, significance
testing for regression coefficients, confidence intervals, ANOVA, etc.
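A hedged sketch of the quantities referred to above: standard errors from the estimated σ^2 (X^T X)^{-1}, t statistics for H0: βj = 0, and 95% confidence intervals. The data and dimensions are illustrative, and scipy is assumed to be available for the t distribution.

import numpy as np
from scipy import stats

# Simulated data (illustrative only)
rng = np.random.default_rng(0)
N, p = 100, 2
x = rng.normal(size=(N, p))
y = 1.0 + x @ np.array([0.5, 0.0]) + rng.normal(size=N)

X = np.column_stack([np.ones(N), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

df = N - p - 1
sigma2_hat = e @ e / df                             # estimate of sigma^2
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)      # estimated covariance of beta_hat
se = np.sqrt(np.diag(cov_beta))

t_stats = beta_hat / se                             # test statistics for H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df)
t_crit = stats.t.ppf(0.975, df)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])

print(np.column_stack([beta_hat, se, t_stats, p_values]))
print(ci)                                           # 95% confidence intervals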
Interpreting a linear regression model
Advertisement data
We now estimate the linear regression model
sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε
To interpret the results, we need to note the following:
• The observational units in the data are markets.
• The response variable (sales) is in thousands of units.
• The predictors are in thousands of dollars.
What is the population of interest? (You always need to be able to
answer this question)
Advertisement data
OLS Regression Results
==============================================================================
Dep. Variable: Sales R-squared: 0.897
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 570.3
Date: Prob (F-statistic): 1.58e-96
Time: Log-Likelihood: -386.18
No. Observations: 200 AIC: 780.4
Df Residuals: 196 BIC: 793.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0458 0.001 32.809 0.000 0.043 0.049
Radio 0.1885 0.009 21.893 0.000 0.172 0.206
Newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011
==============================================================================
Omnibus: 60.414 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241
Skew: -1.327 Prob(JB): 1.44e-33
Kurtosis: 6.332 Cond. No. 454.
==============================================================================
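A regression table like the one above can be produced with the statsmodels formula interface. The sketch below assumes the advertising data have been loaded into a pandas DataFrame with columns TV, Radio, Newspaper and Sales; the file name is hypothetical.

import pandas as pd
import statsmodels.formula.api as smf

advertising = pd.read_csv("Advertising.csv")        # hypothetical file name

model = smf.ols("Sales ~ TV + Radio + Newspaper", data=advertising)
results = model.fit()
print(results.summary())                            # produces an output table like the one above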
Interpreting coefficients
ŝales = 2.9389 (0.312) + 0.0458 (0.001) × TV + 0.1885 (0.009) × radio − 0.0010 (0.006) × newspaper

(standard errors in parentheses)
Interpretation (TV):
If we select two markets from the population, where the radio and
newspaper budgets are the same, but the TV budget differs by 100
dollars, we would expect 4.58 more units sold in the market with
higher TV budget.
Interpreting coefficients
Mathematically:
βj = E(Y | Xj = xj + 1, X≠j = x≠j) − E(Y | Xj = xj, X≠j = x≠j)
For example, with p = 2 and focusing on the first predictor:
E(Y | X1 = x1 + 1, X2 = x2) − E(Y | X1 = x1, X2 = x2)
  = E[β0 + β1(x1 + 1) + β2 x2 + ε] − E[β0 + β1 x1 + β2 x2 + ε]
  = [β0 + β1(x1 + 1) + β2 x2] − [β0 + β1 x1 + β2 x2]
  = β1
Omitted variables
With observational data, the assumption that E(ε|X) = 0 is
generally not satisfied. In this case, there are omitted variables:
variables that are correlated with both the predictors and the
response. This leads to omitted variable bias when estimating
regression coefficients.
Here is an example: if we regress wealth on the number of luxury
cars owned, the slope is positive (luxury cars predict wealth).
However, we can imagine that buying more luxury cars will not
make you richer.
Example: education and wages
OLS Regression Results
==============================================================================
Dep. Variable: Hourly wage R-squared: 0.162
Model: OLS Adj. R-squared: 0.162
Method: Least Squares F-statistic: 1729.
Date: Prob (F-statistic): 0.00
Time: Log-Likelihood: -57425.
No. Observations: 17919 AIC: 1.149e+05
Df Residuals: 17916 BIC: 1.149e+05
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -7.5017 0.337 -22.278 0.000 -8.162 -6.842
Education 1.1937 0.024 50.255 0.000 1.147 1.240
Experience 0.4511 0.011 40.772 0.000 0.429 0.473
==============================================================================
Omnibus: 10774.032 Durbin-Watson: 0.744
Prob(Omnibus): 0.000 Jarque-Bera (JB): 237446.384
Skew: 2.484 Prob(JB): 0.00
Kurtosis: 20.128 Cond. No. 117.
==============================================================================
Causal analysis
Causal analysis means to estimate a model of the type
E(Y | do(X = x)). This is an explicit intervention: "if we do
X = x, then we predict E(Y | do(X = x))".

This is different from predictive modelling: "if we observe X = x,
then we predict E(Y | X = x)".
Causal analysis requires an appropriate study design (such as A/B
testing).
Study designs
Ramsey and Schafer (2002).
Study note
• For our purposes, the textbook is not sufficiently rigorous
regarding the interpretation of linear regression coefficients.
• While our interpretation is less simple than the one provided
by most textbooks, it is the correct one for observational data
that is prevalent in business.
Regression modelling
Regression modelling
• All the material from Statistical Modelling for Business
continues to be relevant in Predictive Analytics.
• In particular, constructing a helpful set of predictor variables is
extremely important for supervised learning, as it is often
essential to improving performance. This is known in machine
learning and data science as feature engineering.
• It is also useful to build models that satisfy the assumptions on
the data as closely as possible (for example, constant error
variance).
Regression modelling
• Data transformation (in particular log and power
transformations).
• Categorical predictors.
• Interactions.
• Polynomial regression.
• Regression splines.
• Robust regression.
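As an illustration of how several of the techniques listed above can be expressed with the statsmodels formula interface (the data and column names below are hypothetical):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "price": rng.uniform(1, 10, n),
    "adspend": rng.uniform(0, 100, n),
    "region": rng.choice(["North", "South"], n),
})
df["sales"] = np.exp(2 - 0.8 * np.log(df["price"]) + 0.01 * df["adspend"]
                     + 0.2 * (df["region"] == "North") + rng.normal(scale=0.1, size=n))

formula = ("np.log(sales) ~ np.log(price)"          # log transformations
           " + C(region)"                           # categorical predictor (dummy coding)
           " + adspend + I(adspend ** 2)"           # polynomial term
           " + adspend:np.log(price)")              # interaction
results = smf.ols(formula, data=df).fit()
print(results.summary())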
Potential problems
• Nonlinearity.
• Non-constant error variance.
• Correlated errors.
• Outliers and high leverage points.
• Multicollinearity.
• Non-Gaussianity.
Review questions
• How do we obtain the OLS estimates? Go through the full
process.
• What is a sampling distribution?
• We formulated several questions about the advertisement
data. Answer some of these questions based on the Python
output in the slides.
• What is the correct interpretation of a linear regression model
coefficient with observational data?
• What is the difference between predictive and causal analysis?