Data mining
Prof. Dr. Matei Demetrescu Summer 2020
Statistics and Econometrics (CAU Kiel)
Today’s outline
Linear Regression
1 Linear regression
2 Adding flexibility
3 Further regression details
4 Up next
Linear regression
Outline
1 Linear regression
2 Adding flexibility
3 Further regression details
4 Up next
Linear regression
The simplest approach
Linear regression is a simple approach to supervised learning. It assumes that the dependence of E(Y |X) on X1, X2, . . . , Xp is linear.1
Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically. It gives first answers to questions like
Is there some relation between variables? Which of them is a potential predictor? How accurately can we predict prices?
Is the relation linear?
And: It is interpretable, and easy to fit and predict with.
(You may complicate your life with other methods, but only if you must!)
1This is seldom true in practice; but, as we’ll see, it is a good start.
Linear regression
Real estate again

[Figure: scatterplot matrix of price, house.age, house.loc and nr.stores. Pairwise correlations: price–house.age −0.21, price–house.loc −0.67, price–nr.stores 0.57, house.age–house.loc 0.03, house.age–nr.stores 0.05, house.loc–nr.stores −0.60.]
Linear regression
Simple linear regression
Let’s assume a simple linear model (i.e. with one predictor),
Y = β0 + β1X + ε.
The error term ε has zero conditional mean if the model is true; but is in
any case uncorrelated with X.
Given some estimates βˆ0 and βˆ1 for the model coefficients, we predict
prices using
Yˆ = βˆ0 + βˆ1X.
Of course, Yˆ ≠ Y, so there are prediction errors.2
2In sample, we call predictions fitted values and errors are residuals. OLS estimation minimizes the sum of squared residuals.
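A minimal sketch of simple-regression OLS in Python on simulated data (numpy only; the data and variable names are illustrative, not the lecture's real estate set):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)                  # single predictor
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)    # true beta0 = 2, beta1 = 0.5

# OLS closed form: beta1 = cov(X, Y) / var(X), beta0 = mean(Y) - beta1 * mean(X)
beta1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()

y_hat = beta0_hat + beta1_hat * x            # in sample: fitted values
resid = y - y_hat                            # residuals; OLS minimizes (resid**2).sum()
print(beta0_hat, beta1_hat)
```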
Linear regression
Assessing accuracy of coefficient estimates
We’ve seen that forecast errors depend on estimation error.
Standard errors reflect how estimators vary under repeated sampling. (You should know how they look for OLS estimation.)
These standard errors can be used to compute confidence intervals, βˆ1 ± 2 · SE(βˆ1).
Recall, a 95% confidence interval “covers” the true unknown value of the parameter with 95% probability. I.e. there is approximately a 95% chance that the interval
[βˆ1 − 2 · SE(βˆ1), βˆ1 + 2 · SE(βˆ1)]
will contain the true value of β1 (regularity conditions assumed).
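Continuing the simulated example, a sketch of SE(βˆ1) and the ±2 · SE interval, using the classical formula under homoskedasticity (an assumption of this illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)

n = len(x)
beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()
resid = y - (beta0 + beta1 * x)

sigma2 = (resid**2).sum() / (n - 2)                      # residual variance estimate
se_beta1 = np.sqrt(sigma2 / ((x - x.mean())**2).sum())   # SE of the slope

ci = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)        # approx. 95% interval
print(se_beta1, ci)
```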
Linear regression
Hypothesis Testing I
Standard errors can also be used to perform hypothesis tests. E.g., we quite often test the null
H0 : There is no (linear) relation between X and Y against the alternative
HA : There is some relation between X and Y
This corresponds to testing
H0 : β1 = 0 against HA : β1 ≠ 0,
since if β1 = 0 then the model reduces to Y = β0 + ε, and X is not (linearly) associated with Y .
Linear regression
Hypothesis Testing II
To test this null hypothesis, we compute the t-statistic

    t = (βˆ1 − 0) / SE(βˆ1),
which is approximately standard normal (regularity conditions assumed).
Using statistical software, it is easy to compute the probability under the null of observing any value equal to |t| or larger, i.e. the p-value.
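A sketch of the t-statistic and its two-sided p-value under the normal approximation (scipy; simulated data as before):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)

n = len(x)
beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()
resid = y - (beta0 + beta1 * x)
se = np.sqrt((resid**2).sum() / (n - 2) / ((x - x.mean())**2).sum())

t = (beta1 - 0.0) / se                       # test H0: beta1 = 0
p = 2 * (1 - stats.norm.cdf(abs(t)))         # approx. standard normal under H0
print(t, p)
```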
Results for the real estate prices:

               Coefficient   Std. Error   t-statistic   p-value
intercept        45.8514       0.6526       70.26       < 10−4
Location         −7.2621       0.3925      −18.50       < 10−4
Linear regression
Assessing overall model accuracy
Compute the Residual Standard Error – this gives an indication of prediction error.
Compute R2 – this quantifies the linear dependence and is scale-free.3

Real estate results (Price on Location only):

Quantity                    Value
Residual Standard Error     10.07
R2                          0.4538
F-statistic                 342.2

This is Price on Location only – what about all available predictors?
3For simple linear regression, R2 = r2 where r is the sample correlation of X and Y.
Linear regression
More inputs/features/predictors
Here our model is (finally!)
Y = β0 + β1X1 + β2X2 + ··· + βpXp + ε.
We interpret βj as the average effect on Y of a one unit increase in Xj,
holding all other predictors fixed.
And of course we look for better predictions!
Given estimates βˆ0, βˆ1, . . . , βˆp we predict
Yˆ = βˆ0 + βˆ1X1 + βˆ2X2 + ··· + βˆpXp.
One typically uses OLS estimation (recall what this implies for the estimated quantity).
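A sketch of OLS with several predictors via statsmodels; the column names mimic the real estate example, but the data below are simulated with made-up coefficients:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 414
df = pd.DataFrame({
    "age": rng.uniform(0, 40, n),
    "location": rng.uniform(0, 7, n),
    "stores": rng.integers(0, 11, n).astype(float),
})
df["price"] = (43 - 0.25 * df["age"] - 5.4 * df["location"]
               + 1.3 * df["stores"] + rng.normal(0, 9, n))

X = sm.add_constant(df[["age", "location", "stores"]])   # adds the intercept column
fit = sm.OLS(df["price"], X).fit()
print(fit.summary())     # coefficients, SEs, t-statistics, p-values, R^2, F
y_hat = fit.predict(X)   # fitted values / predictions
```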
Linear regression
Interpreting regression coefficients
The ideal scenario is when the predictors are uncorrelated.4 Each coefficient can then be estimated and tested separately.
Interpretations such as “a unit change in Xj is associated with a βj change in Y, while all the other variables stay fixed” are possible.
In any case, claims of causality should be avoided for observational data.
Correlations amongst predictors cause interpretation problems:
Omitted variable bias (though not a killer when focussing on prediction)
Interpretations become hazardous — when Xj changes, other features do not stay fixed on the average.
4This is called a balanced design.
Linear regression
Results for real estate data
               Coefficient   Std. Error   t-statistic   p-value
intercept        42.9772       1.3845       31.041      < 10−4
Age              −0.2529       0.0401       −6.305      < 10−4
Location         −5.3791       0.4530      −11.874      < 10−4
Nr.Stores         1.2974       0.1943        6.678      < 10−4

Follow-up questions:
1 Is at least one of the predictors X1, X2, . . . , Xp useful for prediction?
2 Do all the predictors help to explain Y , or is only a subset of the predictors useful? Which subset?
3 How well does the model fit the data?
4 How accurate is our prediction?
Linear regression
... and some (linear) answers
1 Use the F statistic, which is approx. χ2(p) distributed under the null;
2 Use individual t statistics (see model selection later on);
3 Compute R2;
4 Compute the residual standard error.
(More Q&A in due time.) For the real estate data:
Quantity                    Value
Residual Standard Error     9.251
R2                          0.5411
F-statistic                 161.1
Adding flexibility
Outline
1 Linear regression
2 Adding flexibility
3 Further regression details
4 Up next
Adding flexibility
Discrete-valued inputs
Some predictors are not quantitative but are qualitative, taking a discrete set of values.
These are also called categorical predictors or factor variables. Don't confuse them with count variables – these can be treated as usual (unless they're targets).
The linearity assumption is usually understood as Y changing at a
constant rate as X increases in value.
But Y = β0 + β1 × category is not very meaningful, so
we look at the marginal effect of other (metric) predictors for each category.
Adding flexibility
Example: Credit rating data I

[Figure 3.6 of James et al.: scatterplot matrix of the Credit data set, which contains information about balance, age, cards, education, income, limit, and rating for a number of potential customers.]

Four additional qualitative variables: gender, student (student status), status (marital status), and ethnicity (Caucasian, African American (AA) or Asian).

(Data available from the companion website of the James et al. textbook.)
Adding flexibility
Example: Credit rating data II
Investigate differences in credit card balance between males and females, ignoring the other variables. We create a new variable

    Xi = { 1 if ith person is female; 0 if ith person is male }.

Resulting model:

    Yi = β0 + β1Xi + εi = { β0 + β1 + εi if ith person is female; β0 + εi if ith person is male }.

Results for gender model:

                  Coefficient   Std. Error   t-statistic   p-value
intercept           509.80        33.13        15.389      < 10−4
gender[Female]       19.73        46.05         0.429      0.6690
Adding flexibility
More than two categories/levels I
With more than two levels, we create additional dummy variables.

For example, for the ethnicity variable we create two dummy variables. The first could be

    Xi1 = { 1 if ith person is Asian; 0 if ith person is not Asian },

and the second could be

    Xi2 = { 1 if ith person is Caucasian; 0 if ith person is not Caucasian }.

(And so on for more categories.)
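A sketch of the corresponding dummy coding with pandas (the tiny data frame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"ethnicity": ["Asian", "Caucasian", "AA", "Asian", "AA"]})

# One dummy per level, then drop one level: the dropped level (AA) becomes
# the baseline, avoiding perfect collinearity with the intercept.
dummies = pd.get_dummies(df["ethnicity"], prefix="ethnicity")
dummies = dummies.drop(columns="ethnicity_AA")
print(dummies)   # columns: ethnicity_Asian, ethnicity_Caucasian
```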
Adding flexibility
More than two categories/levels II
Then both of these variables can be used in the regression equation:

    Yi = β0 + β1Xi1 + β2Xi2 + εi = { β0 + β1 + εi if ith person is Asian;
                                     β0 + β2 + εi if ith person is Caucasian;
                                     β0 + εi      if ith person is AA }.
There will always be one fewer dummy variable than the number of levels.
The level with no dummy variable (here, AA) is the baseline.

Results for ethnicity:

                        Coef.    Std. Err.   t-stat.   p-val.
intercept              531.00      46.32     11.464    < 10−4
ethnicity[Asian]       −18.69      65.02     −0.287    0.7740
ethnicity[Caucasian]   −12.50      56.68     −0.221    0.8260
Adding flexibility
Interactions
In analysing the real estate data, we assumed additivity: the slope coefficient (i.e. effect) of Location does not depend on other variables.
This is what the linear model states!
Price = βˆ0 + βˆ1 · Age + βˆ2 · Location + βˆ3 · Nr.Stores
But what if Location has a different effect on prices depending on Age? Taking this into account would increase prediction accuracy.
In marketing, this is known as a synergy effect (in statistics: interaction).
Adding flexibility
Modelling interactions — real estate data
The model now takes the form
Price = β0 + β1 · Age + β2 · Location + β3 · Nr.Stores + β4 · (Age · Location) + ε
      = β0 + β1 · Age + (β2 + β4 · Age) · Location + β3 · Nr.Stores + ε.

Results:
                  Coefficient   Std. Error   t-statistic   p-value
Intercept           45.20927      1.51872      29.768      < 10−4
Age                 −0.35166      0.04924      −7.141      < 10−4
Location            −8.24568      0.95964      −8.592      < 10−4
Nr.Stores            1.24911      0.19240       6.492      < 10−4
Age · Location       0.14428      0.04273       3.377      0.0008
The interpretation of coefficients of both Age and Location changes.
For the sake of interpretability, always include all linear terms entering an
interaction even if insignificant (hierarchy principle).
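A sketch of fitting such an interaction model with the statsmodels formula API (data simulated to loosely resemble the real estate example; names and coefficients are illustrative). Note how 'age*location' in a formula would expand to include the linear terms, matching the hierarchy principle:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 414
df = pd.DataFrame({
    "age": rng.uniform(0, 40, n),
    "location": rng.uniform(0, 7, n),
    "stores": rng.integers(0, 11, n).astype(float),
})
df["price"] = (45 - 0.35 * df["age"]
               + (-8.2 + 0.14 * df["age"]) * df["location"]
               + 1.25 * df["stores"] + rng.normal(0, 9, n))

# 'age:location' is the interaction term alone; 'age*location' would expand
# to age + location + age:location (the hierarchy principle for free).
fit = smf.ols("price ~ age + location + stores + age:location", data=df).fit()
print(fit.params)
```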
Adding flexibility
Interactions between qualitative and quantitative variables
Consider the credit ratings data set again, and suppose that we wish to predict balance using income (quantitative) and student (qualitative).
Without an interaction term, the model takes the form

    balance ≈ β0 + β1 · income + { β2 if person is a student; 0 if person is not a student }

            = β1 · income + { β0 + β2 if person is a student; β0 if person is not a student }.
Adding flexibility
Slopes also vary between groups
With interactions, it takes the form

    balance ≈ β0 + β1 · income + { β2 + β3 · income if student; 0 if not student },

so intercept and slope can both differ between students and non-students.

[Figure 3.7 of James et al., Credit data: least squares lines for balance against income, for students and non-students. Left: no interaction between income and student. Right: with the interaction term, the slopes differ.]
Adding flexibility
(Non-)Linearity and feature engineering
A linear relation can easily turn out to be too simplistic.
(Use e.g. residual plots & model selection to judge this.)
The linear regression model is linear in the predictors,
... but there’s no law stopping us from adding suitable features.
Interaction terms are actually such artificial predictors
... and we recall experimenting with squares.
May consider other transformations of predictors5 in addition to (low degree!) polynomials.
5Logs are very popular, but: when working with log Y rather than Y , OLS fits the conditional expectation of the logs, which is not the log of E(Y |X), and interpretation of coefficients changes of course.
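A sketch of such feature engineering in a statsmodels formula: I() protects arithmetic like squaring, and numpy functions can be applied inside the formula (simulated data, illustrative names):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 414
df = pd.DataFrame({"age": rng.uniform(1, 40, n),
                   "location": rng.uniform(0.1, 7, n)})
df["price"] = (49 - 1.05 * df["age"] + 0.018 * df["age"]**2
               - 6.4 * df["location"] + rng.normal(0, 9, n))

# A squared feature (low-degree polynomial) and a log-transformed predictor.
fit = smf.ols("price ~ age + I(age**2) + np.log(location)", data=df).fit()
print(fit.params)
```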
Adding flexibility
All together
                        Coefficient   Std. Error   t-statistic   p-value
Intercept                49.321393     1.659967      29.712      < 10−4
Age                      −1.051326     0.146640      −7.169      < 10−4
Age2                      0.017914     0.003515       5.096      < 10−4
Location                 −6.408751     0.939301      −6.823      < 10−4
Nr.Stores                 1.580103     0.197932       7.983      < 10−4
Nr.Stores · Location     −1.178413     0.236816      −4.976      < 10−4
Age · Location            0.123380     0.040284       3.063      0.0023
We only kept significant coefficients; this is a form of feature selection coming on top of the feature engineering (squares, interactions).
Adding flexibility
Some standard diagnostics

[Figure: the four standard least squares diagnostic plots (Residuals vs Fitted, Normal Q–Q, Scale–Location, and Residuals vs Leverage with Cook's distance contours); observations 271, 313 and 114 stand out.]
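A sketch reproducing two of these diagnostics by hand from a statsmodels fit (the full set of four is what R's plot(lm(...)) shows; data simulated):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, 1, 200)
fit = sm.OLS(y, sm.add_constant(x)).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(fit.fittedvalues, fit.resid, s=10)       # Residuals vs Fitted
ax1.axhline(0, linestyle="--")
ax1.set(xlabel="Fitted values", ylabel="Residuals", title="Residuals vs Fitted")
sm.qqplot(fit.resid, line="45", fit=True, ax=ax2)    # Normal Q-Q
ax2.set_title("Normal Q-Q")
plt.tight_layout()
plt.show()
```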
Further regression details
Outline
1 Linear regression
2 Adding flexibility
3 Further regression details
4 Up next
Further regression details
Linearity is not the only assumption
What else could go wrong?
correlation of errors
non-constant variance of errors
outliers
high-leverage points
collinearity
Further regression details
Correlation of error terms
What are correlated errors?
Regression errors are not independent anymore: εi provides information about εj
This is a form of model misspecification
When should we expect correlated errors?
time series data
when there are hidden factors
Why are they bad?
correlated errors lead to biased estimation of standard errors
this means confidence intervals are too narrow or too wide
Fortunately, there is no bias in the slope parameter estimators (unless you have a dynamic model).
May resort to robust standard errors (but better to treat the cause).
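A sketch of robust standard errors in statsmodels: HAC (Newey–West) for autocorrelated errors; cov_type="HC1" would instead target heteroskedasticity. The AR(1) errors and the maxlags choice below are illustrative:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0, 10, n)
e = np.zeros(n)
for t in range(1, n):                    # AR(1) errors: correlated over "time"
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 2 + 0.5 * x + e

X = sm.add_constant(x)
naive = sm.OLS(y, X).fit()               # its standard errors are biased here
robust = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
print(naive.bse)                         # compare: the SEs differ,
print(robust.bse)                        # the slope estimates are identical
```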
Further regression details
Non-constant error variance: Simulated data set
[Figure: two simulated data sets of Y against X; in one of them, the spread of Y around the fitted line changes with the predictor.]
Error variance may change with the predictor too;
This also biases standard errors... but not the LS estimators.
Further regression details
Outliers
Outliers are values far away from model predictions:

[Figure: simulated data with a fitted regression line and one observation lying far from the line.]

Remove if bad data, otherwise model!
Alternatively, use robust regression methods such as Least Absolute Deviations estimation... but beware of differences.
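A sketch of Least Absolute Deviations as median regression (quantile regression at q = 0.5 in statsmodels), with one planted outlier to compare against OLS:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
df = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
df["y"] = 2 + 0.5 * df["x"] + rng.normal(0, 1, 100)
df.loc[0, "y"] += 30                              # plant one bad-data outlier

ols = smf.ols("y ~ x", data=df).fit()
lad = smf.quantreg("y ~ x", data=df).fit(q=0.5)   # LAD = median regression
print(ols.params)    # pulled towards the outlier
print(lad.params)    # much less affected
```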
Further regression details
High-leverage points
Outliers: unusual Y for usual X values
Leverage: unusual X values
A problem even if the corresponding Y is only a bit outlying

[Figure: simulated data including one point with an extreme X value.]

Worth taking a closer look at (typically far away from the centroid of the X distribution)
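A sketch computing leverage values, i.e. the diagonal of the hat matrix H = X(X′X)⁻¹X′, with one planted high-leverage point (simulated data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.uniform(0, 6, 50)
x[0] = 14.0                                   # one unusual X value
y = 2 + 0.5 * x + rng.normal(0, 1, 50)

fit = sm.OLS(y, sm.add_constant(x)).fit()
leverage = fit.get_influence().hat_matrix_diag
print(leverage[0], leverage.mean())           # leverage values average (p+1)/n
```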
Further regression details
(Multi-)Collinearity
High correlations among the regressors:
reduced predictive power for certain combinations of feature values
unstable estimates

How to detect? (see the sketch below)
look at the correlations between the covariates
look at the eigenvalues of X′X
Ways out?
select a subset of predictors (model selection)
put some constraints on βˆ (regularization & dimensionality reduction)
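A sketch of the detection step: correlations, eigenvalues of X′X (near-zero eigenvalues signal near-collinearity), plus variance inflation factors as a common extra check (simulated data with x2 nearly a copy of x1):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(10)
n = 200
x1 = rng.normal(size=n)
X = pd.DataFrame({"x1": x1,
                  "x2": x1 + 0.05 * rng.normal(size=n),  # nearly a copy of x1
                  "x3": rng.normal(size=n)})

print(X.corr())                                   # pairwise correlations
print(np.linalg.eigvalsh(X.values.T @ X.values))  # one eigenvalue near zero
print([variance_inflation_factor(X.values, i) for i in range(X.shape[1])])
```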
Up next
Outline
1 Linear regression
2 Adding flexibility
3 Further regression details
4 Up next
Up next
Coming up
(Almost) Linear classification