Data mining
Prof. Dr. Matei Demetrescu Summer 2020
Statistics and Econometrics (CAU Kiel)
Summer 2020 1 / 30
Today’s outline
Statistical learning
1 The data mining process
2 Supervised learning
3 Unsupervised learning
4 Up next
The data mining process
A bit of history
Let’s start with a small warning:
Data mining used to be understood (and in certain disciplines still is) in a negative sense, describing misuses of statistics where one searches the data for as many potential findings as possible, until something "significant" turns up.1
One purpose of this course is to make sure you know how to avoid this, while of course still looking for patterns, trends and dependencies in data coming from various sources.
For now, let's give some structure to the data mining process.
1Think of tossing coins until you get heads 10 times in a row.
CRISP-DM
The Cross Industry Standard Process for Data Mining (CRISP-DM) is a standard intended to ease the implementation (and the auditing) of data mining projects; it specifies six steps:
1 Business understanding
2 Data understanding
3 Data preparation
4 Modelling
5 Evaluation
6 Deployment
Needless to emphasize: in practice they’re all equally important.
Business understanding
In this step, one sets concrete goals and requirements for the data mining process:
What problem does the management want to tackle? Are there constraints? What impact would possible solutions have?
Assess the situation (available resources); what are the project's costs and benefits?
Define the data mining goals
Outcome: concrete tasks and (rough) road map.
Btw., although statisticians are (usually) able to focus on the numbers, knowledge of where the data come from is of great help.
(Think of what makes a good econometrician.)
Data understanding
Once we agree on what to do, follow-up questions pop up:
What data are available?
Are they informative enough to allow completing the tasks decided above?
What concrete data problems are there? Can these be overcome?
(There may be some iterations between business and data understanding if required data is not available.)
Data preparation I
After sorting out what is needed and available, get down to earth:
Generate the data set to be used in the modelling stage:
Clean data (remove inaccuracies, outliers, mistakes)
Merge datasets from different sources
Transform variables if necessary
Make sure data is tidy…
Data preparation II
Tidy data:
Each variable must have its own column;
Each observation must have its own row;
Each value must have its own cell.
These may be intuitive requirements, but
recall that data may have many sources, and
tidiness needs to be imposed when setting up the “final” data set.
Data preparation is of huge importance in practice (even if it is not very rewarding intellectually).
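As a small illustration (the course itself works with R, but a plain-Python sketch makes the same point; the table and variable names below are made up), here is a "wide" table, with one column per year, being reshaped into tidy form:

```python
# Hypothetical "wide" data: the year is a variable, so one column per year
# violates the tidy requirement that each variable has its own column.
wide = [
    {"country": "DE", "2019": 3.4, "2020": 3.9},
    {"country": "FR", "2019": 8.4, "2020": 8.0},
]

# Tidy version: one variable per column, one observation (country, year)
# per row, one value per cell.
tidy = [
    {"country": row["country"], "year": int(year), "unemployment": value}
    for row in wide
    for year, value in row.items()
    if year != "country"
]

print(tidy[0])  # {'country': 'DE', 'year': 2019, 'unemployment': 3.4}
```

In R, the same reshaping would typically be done with tidyverse tools; the point is only that tidiness is imposed by restructuring, not by the data source.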
Modelling
Use suitable methods (from classical statistics, statistical learning, machine learning, etc.) to
obtain information (be exploratory, in a sense), or
set up predictions based on the available data set.
This is the fun part!
Evaluation
After setting up candidate models and predictions, one must:
Compare model performance
1 we need error estimates for this, btw.,
2 which is something well-studied in statistics
Match modelling outcomes with the goals of the data mining project,
Assess the practical utility of the modelling results, etc.
(There may be some iterations between modelling and evaluation if the data mining goals are not achieved.)
Deployment
To conclude, set up a project summary and report to management (or whoever initiated the project).
One must ensure that the data analysis in particular is readable, reproducible, and maintainable.
Here’s for instance where a programming language like R has decisive advantages over Excel.2
2You would be surprised how much data and how many data analyses in enterprises are available only as Excel data sheets (ok, sometimes enhanced with VBA scripts).
Statistics and statistical learning I
Our focus is on the modelling and (partly) the evaluation stages, with emphasis on statistics and statistical learning. Methods may look similar, so there is some overlap between statistics, statistical learning, and machine learning.
Classical statistics has a well-defined flow of operations:
1 Set up statistical model;
2 Estimate (few) parameters;
3 Check model; if not "good" enough, return to step 1;
4 Use the resulting model for prediction (though causality and explanation are important aspects too).
Statistics and statistical learning II
Statistical learning (supervised or not)
has a focus on more complex models and datasets, but inherits statistical thinking and tools.
Importantly, we still assume that the data Yi ∈ R and Xi ∈ Rp are a (random) sample from some population.
Choosing the appropriate degree of complexity is an essential task.3
Finally, machine learning focuses entirely on results (and often has the luxury of resorting to huge, even generated, datasets).
3Nonparametric statistics come closest to statistical learning from this point of view.
Dictionary Statistics – Learning

Statistics                    | Learning              | Meaning
Population/Statistical model  | (no explicit use)     | Sets of assumptions about the joint distribution of Y and X
Estimation                    | Learning              | Using data to make an informed guess about an unknown quantity
Prediction/Classification     | Supervised learning   | Predicting (discrete) outcomes for Y when only X is available
–                             | Predictor/Classifier  | A mapping from X to outcomes of Y
Covariates/Predictors         | Features/Inputs       | The Xs
Variable of interest          | Target/Response       | The Ys
Clustering/Latent components  | Unsupervised learning | Putting data into groups (using statistical models or not); no specific target variable
Supervised learning
A targeted view
We are primarily interested in prediction (or classification). What we need is a predictor function, say f(x), such that

Ŷ = f(X)

is a prediction (i.e. an educated guess) of the outcome of Y.
For classification, f takes discrete values (the predicted labels), and we call it a classifier; at times, we may denote it by C(x).
Even conditionally on x, Y is random (for both prediction and classification), while Ŷ is just a number
– so prediction/classification mistakes are unavoidable.
Learning from data
The point is to make as few prediction/classification errors as possible. (This implies certain properties of f which we discuss separately.)
In some situations, we may specify f from prior knowledge.
In many other situations, we must learn about f from available data.
In supervised learning, values (or labels) of Yi are paired with Xi, and we have n such training cases which allow us to learn a suitable f.
If e.g. f is a regression function, this amounts to estimating a regression model from the available sample.
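A minimal sketch of this learning step, with made-up training pairs and plain Python (no libraries), assuming f is a simple linear regression function estimated by ordinary least squares:

```python
# Hypothetical training cases (x_i, y_i); in supervised learning the
# labels y_i are observed and paired with the features x_i.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# OLS slope and intercept from the usual closed-form solution.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

def f(x):
    """The learned predictor: an educated guess for Y given X = x."""
    return a + b * x

print(round(b, 2))  # estimated slope, close to the trend in the data
```

Whatever method replaces the closed-form OLS step (a polynomial fit, a tree, a neural network), the structure is the same: training pairs go in, a predictor f comes out.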
There are limits to learning
We know from Advanced Statistics II that estimates are never 100% precise in finite samples.
Working with large and complex data sets and models does not improve on this.
Let’s look at the instructive case of polynomial regression.
Predict real estate prices in New Taipei City (per unit of surface) given age of building
Data available from the Machine Learning Repository of the University of California, Irvine
Polynomial and piecewise polynomial fits

[Figure: price against house.age, four panels — a linear fit, a quadratic fit, a piecewise quadratic fit, and an extremely flexible piecewise quadratic fit.]
Lessons
Flexibility of the fitted model is "good":
Polynomials can approximate unknown smooth functions arbitrarily well (the Weierstrass approximation theorem);
No need to search for suitable functional forms with flexible models.
But there is such a thing as too much of it:
We need to find an optimal degree of flexibility!
Dimensionality also plays a role (though not obvious from the example): the number of coefficients to be estimated depends on both the number of features p and the order of the polynomial r (roughly p^{r+1}/r)!
The need for substantially more data to fit flexible models in higher dimensions is known as "the curse of dimensionality"; we'll get back to it later.
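The rough count above can be made concrete. For a full polynomial of order r in p features, with all interaction terms included, the exact number of monomials of total degree at most r (and hence of coefficients) is the binomial coefficient C(p + r, r); the numbers below show how quickly it explodes:

```python
# Exact coefficient count for a full order-r polynomial in p variables:
# one coefficient per monomial of total degree <= r, i.e. C(p + r, r).
from math import comb

def n_coefficients(p, r):
    """Number of monomials of total degree <= r in p variables."""
    return comb(p + r, r)

# One feature, a quadratic: intercept, x, x^2.
print(n_coefficients(1, 2))   # 3
# Twenty features, a fifth-order polynomial: already 53130 coefficients.
print(n_coefficients(20, 5))  # 53130
```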
… and the most important one
Data broken down by distance to public transportation:

[Figure: price against house.age, with observations grouped by distance to public transportation.]

State-of-the-art learning algorithms are useless without informative predictors!
Sideline remark: interpretability
(Low-degree) polynomial regression is still a useful tool, since people love a good story!
For instance, model averaging (combining predictions from different models) usually outperforms a linear prediction, but it does not immediately clarify what role a particular predictor plays.
To be fair, some are happy with good predictive performance alone, but, in general, there is a trade-off between performance and interpretability.
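As a toy sketch of the averaging idea (both component models below are invented for illustration, not estimated from data), model averaging simply combines the predictions of several fitted models:

```python
# Two hypothetical fitted models for the same target.
def model_linear(x):
    return 1.0 + 2.0 * x                  # e.g. a linear fit

def model_quadratic(x):
    return 0.5 + 1.5 * x + 0.1 * x * x    # e.g. a quadratic fit

def averaged(x, w=0.5):
    # Equal-weight average; in practice the weights could be chosen
    # by validation performance.
    return w * model_linear(x) + (1 - w) * model_quadratic(x)

print(averaged(2.0))  # halfway between the two models' predictions at x = 2
```

The combined forecast may predict well, but it no longer has a single coefficient one can point to when asked "what does x do to y?" — exactly the interpretability trade-off above.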
Unsupervised learning
No target in view
In unsupervised learning, one does not focus on predicting or classifying conditional on feature values.
Statistical view:
Set up a statistical model with (few) latent variables;
This allows us to understand how the data depend on one another, and also to represent the data in lower dimensionality.
One could say that
supervised learning focuses on the conditional distribution of Y given X, while
unsupervised learning deals with the unconditional distribution of X alone.
Learning view:
focus directly (and in a less model-based way) on the two main tasks, i.e.
finding clusters of "similar" observations and reducing data dimensionality.
Clustering: missing labels
Let Yi ∈ {0, 1} and Xi ∈ Rp, but with the Y s latent (unobserved). Pooling observations leads to a mixture distribution for Xi.
Contrary to classification, you do not observe Yi, so no supervised training is available.
The task is to group observations with similar values of Xi.
(As a side effect, you get estimates of the latent Yi for each observation).
Statistical approaches may make use of the mixture distribution, specifying the model more concretely (e.g. a normal distribution of Xi given Yi) and estimating its parameters;
Learning approaches are a bit more agnostic; they may use the same algorithms, but don't necessarily interpret them statistically.
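One such algorithm, used by both camps, is k-means. A minimal one-dimensional sketch (the data and starting centers below are made up for illustration) groups observations with similar feature values without ever seeing a label:

```python
# Lloyd's algorithm (k-means) in one dimension with fixed initial centers.
def kmeans_1d(xs, centers, n_iter=20):
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for x in xs:
            j = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            clusters[j].append(x)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

data = [0.1, 0.3, 0.2, 5.0, 5.2, 4.9]        # two visible groups
centers, clusters = kmeans_1d(data, [0.0, 1.0])
print(sorted(round(c, 2) for c in centers))  # [0.2, 5.03]
```

The cluster assignments are, in effect, estimates of the latent Yi; a statistician would obtain something very similar by fitting a two-component normal mixture.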
Example: Cluster analysis

[Figure: scatterplot of data with two features, X1 and X2.]

How does one group the data points (i.e. form clusters)?
How many clusters should one build?
Again: it matters which features you take

[Figure: marginal densities of X1 and X2.]

Here, X2 seems to allow for better clustering, although X1 may still have a contribution.
Up next
Coming up
Prediction and classification