Predictive Analytics
Week 1: Introduction to Predictive Modelling
Semester 2, 2018
Discipline of Business Analytics, The University of Sydney Business School
Week 1: Introduction to Predictive Modelling
1. Content structure
2. Introduction
3. Business examples and data
4. Notation
5. Statistical decision theory
6. Evaluating model performance
7. Key concepts and themes
Content structure
QBUS2820 content structure
1. Statistical and Machine Learning foundations and
applications.
2. Advanced regression methods.
3. Classification methods.
4. Time series forecasting.
Content structure
1. Statistical Machine Learning foundations and applications:
key concepts in predictive modelling, statistical thinking,
K-nearest neighbours, model evaluation, model selection, and
model inference, etc.
2. Regression: subset selection, ridge regression, LASSO,
principal components regression, etc.
3. Classification: key concepts, evaluating classification models,
logistic regression, regularised logistic regression, linear and
quadratic discriminant analysis, etc.
4. Forecasting: key concepts, time series, exponential smoothing
and ARIMA models, etc.
Learning outcomes
By successfully completing this unit, you are expected to:
1. Understand the conceptual and theoretical foundations of
predictive modelling.
2. Develop an in-depth knowledge of basic methods for
regression, classification, and forecasting for business
applications.
3. Be able to conduct a complete data analysis project based on
these foundations and methods.
4. Know how to use Python for your practical workflow under
realistic data complexity (including tasks such as data
manipulation and visualisation).
5. Effectively communicate your results to guide decision making.
Comments
• This unit is designed as training for real-world predictive
analytics, which requires a range of skills.
• Practical work in this area involves more than knowing the
methods in the lectures: professionals typically spend a
substantial amount of time on tasks such as data
management, exploratory data analysis, feature engineering,
and implementing methods.
• All of this is generally done through coding. Therefore, Python
is your bridge between knowledge and practice.
• For these reasons, please note that this unit requires
independent work and a higher than average workload (within
the university guidelines).
Introduction
Introduction
Predictive modelling is a set of methods for detecting patterns in
data and using these patterns for predicting future data and
informing decision making. In this unit, we will draw on methods
from the fields of statistics, econometrics, and machine learning.
Introduction
Two trends bring predictive modelling to the forefront of successful
business decision making:
• We are in the era of big data. The Internet and the increasing
presence of data-capturing devices (such as mobile phones,
cameras, sensors, and card readers), combined with large
reductions in the cost of storage, have brought an unprecedented
availability of data and continued dramatic growth in the size
of data sets.
• Advancing computing power (realising Moore’s law)
increases the scope for exploring complex patterns in data.
Types of prediction
Different types of data lead to different types of prediction
problems:
• In cross sectional prediction, we work with data collected by
observing subjects (such as individuals, firms, assets, etc).
Our objective is to predict the value of a response variable for
a new subject.
• In forecasting, we want to predict the value of a response
variable at a specific point in the future, based on past and
current information. Forecasting can be based on time series
data for the response variable only.
Types of learning
• Supervised learning
• Unsupervised learning
Supervised learning
In the context of statistical learning, supervised learning is the
task of learning a function to predict an output variable Y based
on observed input variables x1, . . . , xp. We develop methods that
learn this function based on labelled data {(xi, yi) : i = 1, . . . , N}, which we
call the training data.
Supervised learning
In supervised learning, the output or response variable can be of
any type. We will study methods that address two main classes of
supervised learning problems:
• In regression, the response is a quantitative scalar (such as
the income of a worker).
• In classification, the response is a nominal or categorical
variable Y ∈ {1, . . . , C}, where C is the number of classes.
When C = 2, this is called binary classification; if C > 2, this
is called multiclass classification.
Example: handwritten digit recognition
A view of the MNIST dataset.
Unsupervised learning
Unsupervised Learning: no distinction is made between Y and
X. “Unlabelled” data is used to uncover hidden patterns, clusters,
relationships, or distributions.
• E.g. Principal Component Analysis: aiming to find the key
factors determining data patterns.
• Goal: hypothesis generation, to be tested later in supervised
learning.
Learner: a learner is a (mathematical) model for learning, e.g.
a regression model estimated on a training data set.
Data science
Data science is a multidisciplinary field that combines knowledge
and skills from statistics, machine learning, software engineering,
data visualisation, and domain expertise (in our case, business
expertise) to uncover value from large and diverse data sets.
Data scientists often work directly with stakeholders (say, product
managers) to link their analysis to actionable results. A common
objective is to create data products.
The data science process: a real-world perspective
https://docs.microsoft.com/en-us/azure/machine-learning/data-science-process-overview
Data analysis process in this unit
1. Problem formulation.
2. Data collection and preparation.
3. Exploratory data analysis (EDA).
4. Model building, estimation, and selection.
5. Model evaluation.
6. Communication of results.
Business examples and data
Examples
• Credit card fraud detection: collect data from multiple sources
to learn typical customer behaviour, then use this model to
detect suspicious transactions for further investigation.
• Customer risk analysis: instead of denying sales (say, auto
loans, credit cards, and insurance policies) to higher risk
customers, it is usually a better strategy to price risk
accordingly using available data.
• Advertising: making online ads more relevant to users by
predicting click-through rates.
Zillow Kaggle competition
• Kaggle is a crowdsourcing platform that allows organisations
to post data prediction problems to be solved by public
competition.
• Zillow’s Home Value Prediction is a current competition (with
a 1.2 million dollar cash prize) that invites participants to
make predictions about the future sale prices of homes (a
regression problem).
• In this competition, the goal is to improve on Zillow’s home
valuation estimates (“ZEstimates”), which are based on 7.5
million statistical and machine learning models that analyze
hundreds of data points on each property.
https://www.kaggle.com/c/zillow-prize-1
Customer relationship management
• Customer relationship management (CRM) is a set of
practices that involve collecting and studying customer
information with the objective of maximising customer
lifetime value (CLV), the net value of a customer to a firm
over his/her entire lifetime.
• CRM may be part of a customer-centric (as opposed to
brand-centric) business strategy, which focuses on customer
satisfaction and loyalty towards the acquisition and retention
of profitable customers.
• CRM has four main areas: customer acquisition, retention,
churn, and win-back. Statistical models and machine learning
algorithms play a central role in each of these areas.
Customer relationship management
The data is from Kumar and Petersen (2012), and refers to
corporate clients.
Customer relationship management
Kumar and Petersen (2012) estimate a model to predict the
response
\[
Y = \begin{cases} 1 & \text{if the customer was acquired,} \\ 0 & \text{if the customer was not acquired,} \end{cases}
\]
based on predictors such as the dollars spent on marketing efforts to
acquire the prospect, and characteristics of the prospect’s firm
such as industry, revenue, and number of employees.
This is a binary classification problem.
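To make this concrete, here is a minimal Python sketch of fitting such a classifier with scikit-learn. The data are simulated stand-ins: the column names and the rule generating the response are hypothetical placeholders, not the actual Kumar and Petersen (2012) data.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simulated stand-in for acquisition data (hypothetical columns)
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    'acq_expense': rng.uniform(0, 1000, n),  # marketing dollars spent on the prospect
    'revenue': rng.uniform(1, 100, n),       # prospect firm revenue
    'employees': rng.integers(10, 5000, n),  # prospect firm size
})
# Hypothetical acquisition mechanism, for illustration only
prob = 1 / (1 + np.exp(-(0.004 * data['acq_expense'] - 2)))
data['acquired'] = rng.binomial(1, prob)

features = ['acq_expense', 'revenue', 'employees']
model = LogisticRegression(max_iter=1000)
model.fit(data[features], data['acquired'])        # learn from labelled data
print(model.predict_proba(data[features].head()))  # predicted acquisition probabilities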
Notation
Study tips
• Always start by making sure that you understand the notation
and definitions. Focus first on meaning, then connections.
• If there is a learning challenge, is the root of the problem in
understanding notation, concepts, reasoning, or algebra?
• When reading an equation, you should be able to identify
parameters and constants, distinguish between random
variables and observed values, and distinguish between scalars,
vectors, and matrices.
• When there is an expectation or variance operator, what
distribution is it over? That is, what random variables does it
refer to?
Notation
• We use upper case letters such as Y to denote random
variables, regardless of dimension.
• Lower case letters denote observed values. For example, y
denotes the realised value of the random variable Y .
• We use i to index the observations, j to index the inputs. For
example, yi is the observed response for sample i, while xij is
the value of predictor j for observation i.
• We use the hat notation (e.g. β̂) for estimators and
estimates. The notation may not distinguish between the two
(refer to context).
• Vectors are in lower case bold letters. Matrices are in upper
case bold letters.
Vector and matrix notation
Response vector:
\[
\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}
\]
Review the provided materials on linear algebra.
Vector and matrix notation
Vector of predictor (features, attributes, covariates, regressors,
independent variables) values for observation i:
\[
\mathbf{x}_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix}
\]
Vector of observed values for predictor j:
\[
\mathbf{x}_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{Nj} \end{pmatrix}
\]
Vector and matrix notation
Design matrix:
\[
\mathbf{X} = \begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1p} \\
x_{21} & x_{22} & \dots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \dots & x_{Np}
\end{pmatrix}
\]
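In NumPy, the design matrix is just a two-dimensional array with one row per observation and one column per predictor; a small illustrative sketch:

import numpy as np

# N = 4 observations, p = 2 predictors (illustrative values)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])

x_i = X[0, :]   # row i: predictor values for observation 1
x_j = X[:, 1]   # column j: observed values of predictor 2
print(X.shape)  # (N, p) = (4, 2)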
Statistical decision theory
Prediction
We define prediction as follows:
1. Train a predictive function f̂(x) using data D = {(yi, xi) : i = 1, . . . , N}.
2. Upon observing a new input point x0, make the prediction
f̂(x0), the predictive function evaluated at x0.
How should we perform this prediction task? How do we define our
objective? How do we measure success in achieving this objective?
To answer these questions, we turn to decision theory. We mostly
focus on regression problems for simplicity.
Loss function
A loss function or cost function L(y, f(x)) measures the cost of
predicting f(x) when the truth is y. The most common loss
function for regression is the squared loss:
\[ L(y, f(x)) = (y - f(x))^2 \]
For binary classification, a typical loss function is the 0-1 loss:
\[
L(y, \hat{y}) = \begin{cases} 1 & \text{if } y \neq \hat{y} \\ 0 & \text{if } y = \hat{y}, \end{cases}
\]
where ŷ is the prediction.
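A minimal Python sketch of these two loss functions:

import numpy as np

def squared_loss(y, f_x):
    # Squared error loss for regression
    return (y - f_x) ** 2

def zero_one_loss(y, y_hat):
    # 0-1 loss for classification: 1 when the prediction is wrong
    return np.where(y != y_hat, 1, 0)

print(squared_loss(3.0, 2.5))                             # 0.25
print(zero_one_loss(np.array([1, 0]), np.array([1, 1])))  # [0 1]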
Expected loss
Let Y and X have a joint probability distribution P(X, Y). The
idea of decision theory is that we take the action that minimises
our expected loss or risk:
\[ R(f) = \mathbb{E}\left[ L(Y, f(X)) \right], \]
where the expectation is over P(X, Y). Here, the risk is for a
given function f(·).
We can use the law of iterated expectations to rewrite the
expected loss as
\[ R(f) = \mathbb{E}\left[ \mathbb{E}\left( (Y - f(X))^2 \mid X \right) \right]. \]
Optimal prediction
The optimal action is to choose the prediction function δ(.) that
minimises the expected loss. This is equivalent to minimising the
expected loss at every input point x:
\[ \delta(x) = \underset{f(\cdot)}{\operatorname{arg\,min}} \; \mathbb{E}\left( L(Y, f(x)) \mid X = x \right) \]
The solution for the squared loss (see module notes) is the
conditional expectation:
\[ \delta(x) = \mathbb{E}(Y \mid X = x) \]
Concept: under the squared error loss, the optimal prediction of Y
at any point X = x is the conditional mean E(Y |X = x).
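Why the conditional mean? A sketch of the argument (full details are in the module notes): for any candidate prediction f(x), adding and subtracting the conditional mean gives
\[
\mathbb{E}\left[ (Y - f(x))^2 \mid X = x \right]
= \operatorname{Var}(Y \mid X = x) + \left( \mathbb{E}(Y \mid X = x) - f(x) \right)^2,
\]
because the cross term has zero conditional expectation. The first term does not depend on f(x), so the expected loss is minimised by setting f(x) = E(Y | X = x).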
Statistical modelling
• Our regression problem reduces to the estimation of the
conditional expectation function E(Y |X = x). In order to
learn this function, we need to introduce assumptions.
• Assumptions lead to statistical models.
• For example, the linear regression model assumes that
E(Y |X = x) is linear:
\[ \mathbb{E}(Y \mid X = x) = \mathbf{x}^T \boldsymbol{\beta} \]
Additive error model
The additive error model is our basic general model for
regression. It assumes that the relationship between Y and X is
described as
Y = f(X) + ε,
where f(.) is an unknown regression function, and ε is a random
error with mean zero (E(ε) = 0).
Under this model,
E(Y |X = x) = E(f(x) + ε) = f(x),
since E(ε) = 0.
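A small simulation sketch (the function f and error distribution are arbitrary choices) makes this concrete: averaging Y at a fixed x recovers f(x).

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return 2.0 + 3.0 * x  # an arbitrary "true" regression function

x0 = 1.5
y = f(x0) + rng.normal(loc=0.0, scale=1.0, size=100_000)  # Y = f(x0) + eps

print(f(x0))     # 6.5
print(y.mean())  # close to 6.5, since E(eps) = 0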
Example: linear regression
In the special case of the linear regression model, we assume that
\[ f(X) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p, \]
leading to the model
\[ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon, \]
and predictions
\[ \hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_p x_p, \]
where \(\hat{\boldsymbol{\beta}} = (\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p)\) is the vector of least squares estimates
of the model parameters.
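A minimal scikit-learn sketch of least squares estimation and prediction, using simulated data so it is self-contained:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # N = 200 observations, p = 3 predictors
beta = np.array([1.0, -2.0, 0.5])
y = 0.3 + X @ beta + rng.normal(size=200)  # additive error model

model = LinearRegression().fit(X, y)       # least squares estimates
print(model.intercept_, model.coef_)       # estimated beta_0 and beta_1, ..., beta_p
x0 = np.array([[0.1, 0.2, 0.3]])
print(model.predict(x0))                   # prediction at a new input point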
Statistical decision theory
Our discussion of statistical decision theory lays the foundation for
the rest of our discussion.
• Evaluating model performance: estimating the expected loss
of a trained model.
• Choosing a learning method: finding and estimating an
appropriate model such that we minimise our expected loss.
Evaluating model performance
Evaluating model performance
Model evaluation consists of estimating the expected loss of a
trained model. To incorporate model assessment into our analysis,
we split the dataset into three parts.
• Training set: for exploratory data analysis, model building,
model estimation, model selection, etc.
• Validation set: for appropriate model selection.
• Test set: for model evaluation.
Training, validation and test data
• Because we are interested in estimating how well a model
will predict future data, the test set should be kept in a
“vault” and brought in strictly at the end of the analysis. The
test set does not lead to model revisions.
• We generally allocate 50-80% of the data to the training
sample.
• A higher proportion of training data leads to more accurate
model estimation, but higher variance in estimating the
expected loss.
• The split of the data into the training, validation and test sets
is often random, but sometimes there are reasons to consider
alternative schemes.
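A common way to implement the three-way split in Python is two successive calls to scikit-learn's train_test_split (a sketch; the 60/20/20 proportions are one reasonable choice within the guideline above):

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # placeholder data
y = rng.normal(size=1000)

# 60% training, then split the remaining 40% evenly into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.6, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, train_size=0.5, random_state=1)

print(len(y_train), len(y_val), len(y_test))  # 600 200 200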
Evaluating test performance
Suppose that we have test observations {(ỹi, x̃i) : i = 1, . . . , M} and
corresponding predictions f̂(x̃i) for i = 1, . . . , M. We evaluate
model performance by computing the empirical risk for the test
set:
\[ \widehat{R}_{\text{test}} = \frac{1}{M} \sum_{i=1}^{M} L\left( \tilde{y}_i, \hat{f}(\tilde{x}_i) \right) \]
Below, we drop the specific notation for test observations for
simplicity.
Mean squared error
The choice of loss function leads to a measure of predictive
accuracy. Suppose that we have observations yi and
predictions ŷi = f̂(xi) for an arbitrary sample, i = 1, . . . , n. The
mean squared error is:
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
The test mean squared error is the MSE evaluated for the test set.
Mean squared error
The root mean squared error and the prediction R² are derived
from the MSE and may be a better way to report test
results:
\[ \text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 } \]
\[ \text{Prediction } R^2 = 1 - \frac{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }{ \sum_{i=1}^{n} (y_i - \bar{y})^2 } \]
Mean absolute error
Another common measure of performance is the mean absolute
error (MAE):
\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \]
• Implicit in the use of the MAE is the absolute error loss
function. The absolute error setting is less mathematically
tractable, which is one of the reasons why we focus on the
squared error loss.
• In this case the optimal prediction is the conditional median,
not the mean.
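A NumPy sketch computing the accuracy measures above, assuming an array y of test observations and an array y_hat of predictions (the numbers are placeholders):

import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])     # placeholder test observations
y_hat = np.array([2.5, 0.0, 2.0, 8.0])  # placeholder predictions

mse = np.mean((y - y_hat) ** 2)
rmse = np.sqrt(mse)
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)  # prediction R^2
mae = np.mean(np.abs(y - y_hat))

print(mse, rmse, r2, mae)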
Generalisation error
The test or generalisation error is the expected loss for the model
estimated with the training data D. We define it as
\[ \text{Err} = \mathbb{E}\left[ L\left( Y, \hat{f}(X) \right) \,\middle|\, \mathcal{D} \right], \]
where the expectation is over P(X, Y).
Concept: the test MSE estimates the test error (under the
squared error loss).
Standard error
As always, you should report a measure of sample uncertainty for
every important estimate in your analysis. The test MSE is a
sample average, so obtaining a standard error is straightforward.
The formula for a general sample is:
\[ \text{SE}(\text{MSE}) = \frac{1}{\sqrt{n}} \sqrt{ \frac{ \sum_{i=1}^{n} \left( \left( y_i - \hat{f}(x_i) \right)^2 - \text{MSE} \right)^2 }{ n - 1 } } \]
Inference for the test errors is possible, but we do not pursue this
here.
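Continuing the sketch above: the test MSE is a sample mean of squared errors, so its standard error is the sample standard deviation of the squared errors divided by √n.

import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])     # placeholder test observations
y_hat = np.array([2.5, 0.0, 2.0, 8.0])  # placeholder predictions

sq_errors = (y - y_hat) ** 2
mse = sq_errors.mean()
se_mse = sq_errors.std(ddof=1) / np.sqrt(len(y))  # matches the formula above
print(mse, se_mse)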
Key concepts and themes
Key concepts and themes
• Underfitting and Overfitting.
• Parametric vs non-parametric models.
• No free lunch theorem.
• Accuracy vs interpretability.
Overfitting
• We say that there is overfitting when an estimated model is
excessively flexible, incorporating minor variations in the
training data that are likely to be noise rather than predictive
patterns.
• An overfit model has small training errors, but may predict
poorly. In essence, it has memorised the training set.
• Not being misled by overfitting is an important reason why we
use a test set.
• We will present more details about the bias-variance
decomposition later; the simulation sketch below illustrates the
basic phenomenon.
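A small simulation sketch of overfitting: a high-degree polynomial tracks the training data almost perfectly but predicts new data worse than a simple quadratic fit (the true function and noise level are arbitrary choices).

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 20)
y_train = x_train ** 2 + rng.normal(scale=0.1, size=20)  # true f(x) = x^2 plus noise
x_test = rng.uniform(-1, 1, 1000)
y_test = x_test ** 2 + rng.normal(scale=0.1, size=1000)

for degree in (2, 12):
    coefs = np.polyfit(x_train, y_train, degree)  # least squares polynomial fit
    train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(degree, train_mse, test_mse)  # degree 12: lower training MSE, higher test MSE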
Illustration: predicting fuel economy
• This example uses data extracted from the fueleconomy.gov
website run by the US government, which lists different
estimates of fuel economy for passenger cars and trucks.
• For each vehicle in the dataset, we have information on
various characteristics such as engine displacement and
number of cylinders, along with laboratory measurements for
the city and highway miles per gallon (MPG) of the car.
• We here consider the unadjusted highway MPG for 2010 cars
as the response variable, and a single predictor, engine
displacement.
Illustration: predicting fuel economy
A scatter plot reveals a nonlinear association between the two
variables. We therefore need a model that is sufficiently flexible to
capture this nonlinearity.
Illustration: predicting fuel economy
(Figure: scatter plot of unadjusted highway MPG against engine displacement for 2010 cars.)
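A sketch of how such a scatter plot might be produced in Python; the file name and column names are hypothetical placeholders, not the actual dataset used in the lecture.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names for illustration only
cars = pd.read_csv('fueleconomy2010.csv')

plt.scatter(cars['displacement'], cars['highway_mpg'], alpha=0.5)
plt.xlabel('Engine displacement (litres)')
plt.ylabel('Unadjusted highway MPG')
plt.show()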
Parametric vs nonparametric models
There are many ways to define statistical models, but the most
important distinction is the following:
• A parametric model has a fixed number of parameters.
Parametric models are faster to use, and more interpretable,
but have the disadvantage of making stronger assumptions
about the data.
• In a nonparametric model, the number of parameters grows
with the size of the training data. Nonparametric models are more
flexible, but have larger variance and can be computationally
infeasible for large datasets. An example is the K-nearest
neighbours method, which we will study in the next module (a
small sketch follows below).
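A minimal scikit-learn sketch of K-nearest neighbours regression, the nonparametric example above (simulated data; K = 5 is an arbitrary choice):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

knn = KNeighborsRegressor(n_neighbors=5)  # predict by averaging the 5 nearest training responses
knn.fit(X, y)
print(knn.predict([[0.5]]))               # roughly 0.25 for this simulated data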
No free lunch theorem
All models are wrong, but some are useful. – George Box
• The field of machine learning proposes a large range of models
and algorithms to solve supervised and unsupervised learning
problems.
• However, there is no single model or approach that works
optimally for all problems. This is sometimes called the no
free lunch theorem.
• Therefore, applied statistical learning requires awareness of
speed-accuracy-complexity trade-offs and data-driven
consideration of different approaches for every problem.
Accuracy vs interpretability
Particularly in data mining, interpretability is an important
consideration in addition to predictive accuracy. Highly flexible
nonparametric methods tend to be less interpretable than simpler
methods.
(Figure: interpretability versus flexibility for selected methods, roughly ordered from most interpretable/least flexible to least interpretable/most flexible: Subset Selection and Lasso; Least Squares regression; Generalized Additive Models and Trees; Bagging, Boosting, and Support Vector Machines.)
Study guide
• Recall three important concepts from these slides, and explain
them in your own words.
• Use the review questions in the next slide to self-test on key
concepts.
• Study the mathematical details in the module notes.
• Study (or revise) Chapters 1 and 2 of ISL. Read Chapter 3
before the next module.
Review questions (1/2)
• What is predictive modelling?
• What is the difference between cross-sectional prediction and
forecasting?
• What is supervised learning?
• What is a loss function?
• What do we learn from statistical decision theory for
regression problems?
• How do we evaluate model performance with data?
Review questions (2/2)
• What is the difference between the generalisation error and
the expected prediction error?
• What is the bias-variance trade-off and why is it important for
predictive modelling?
• What is model selection? How is it different from model
evaluation?
• What is overfitting?
• What is the difference between parametric and nonparametric
models? What are the advantages and disadvantages of each
approach?