Data mining
Prof. Dr. Matei Demetrescu Summer 2020
Statistics and Econometrics (CAU Kiel)
Today’s outline
A brief overview
1 Find a needle in a haystack
2 The course
3 Practical details
4 Up next
Find a needle in a haystack
Data are pervasive
Data availability is becoming less and less of a problem
Naturally, one tries to get the most out of them:
Analyze past customer data to predict future behavior
Analyze past figures to predict current demand
Use past customer correspondence to automatically classify incoming letters (or e-mails, or instant messages…)
So let’s put them to use!
Data mining refers to “extracting” “information” from some data set.
The more data, the more to extract…
But the widespread availability of data poses statistical and computational problems, among others.
We’ll be focussing on statistical aspects here.
Related fields and buzzwords
Data science: mathematics, statistics and computer science working together to deliver insights from data
Machine learning: variety of computational methods implementing learning algorithms (often with origins in computer science). Here, the term “learning” emphasizes that learning systems are not rigidly programmed from scratch but are modified over time so as to improve performance; “fitting a flexible model” (a model being the mathematical/statistical relationships describing how a sample of data is generated) may be a better description.
Analytics (descriptive/diagnostic/predictive/prescriptive): the use of data to answer various (business-related) questions
Big data
… etc.
(Almost) Getting it right
“I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.”
Hal Varian
Google chief economist (Quote of the Day, New York Times, August 5, 2009)
(But stats alone can be misleading: https://assets.amuniversal.com/c0864e106d6401301d80001dd8b71c47)
The course
Aim of the course
This course gives an introduction to complex methods useful for the analysis of potentially large data sets and for related predictions.
Compared to the methods of previous courses, these may look more like statistical heuristics. This impression is deceptive.
After successfully participating, you will become …
… familiar with flexible fitting methods for (nonlinear) models
and, in particular,
… with computer implementations in R.
A warning
The focus of this class is not on efficient implementation.
Implementation is a serious issue in practice; do not underestimate it. Working with large data sets and complex models requires
fast data access: databases, suitable software
a lot of computation: fast computers, parallelization strategies and suitable visualization strategies.
… but we focus on the statistical aspects.
Outline
1 Statistical learning
2 Prediction and classification
3 Using linear models
4 Model selection and error estimation
5 Dealing with many features: Shrinkage and dimensionality reduction
6 Getting nonlinear: Local regression, trees and more
7 Ensemble methods: Bagging, boosting, and model averaging
8 Interpretable models
9 Unsupervised learning
Practical details
Materials
Lecture notes and slides will be made available via OLAT
The basic textbook is
Hastie, T., R. Tibshirani and J. Friedman (2011, 2nd ed.) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer
A pdf copy is freely available from the authors: https://web.stanford.edu/~hastie/ElemStatLearn/
The introductory version is also useful
James, G. et al. (2013) An Introduction to Statistical Learning, Springer
Further interesting books:
Bishop, C. (2006) Pattern Recognition and Machine Learning, Springer
Han, J., M. Kamber and J. Pei (2012, 3rd ed.) Data Mining: Concepts and Techniques, Elsevier
Tutorial
1 PC tutorial: introduction to R and some case studies
2 Postponed to the 2nd half of the semester
3 Non-compulsory home assignments, essentially programming selected methods yourselves; you may earn bonus points for the exam
Exam & grades
Written exam
Use of lecture notes allowed
You may earn bonus points during the semester via R home assignments: at most 15, on top of the maximum of 100.
Office hours & more
We’ll use video conferences to have Q&A sessions during the original time slots.
When you have questions outside class:
First try the OLAT forum of this course
E-mails also work: mdeme@stat-econ.uni-kiel.de, mokuneva@stat-econ.uni-kiel.de
Otherwise via Zoom (and, as soon as possible, also in person), by appointment.
Up next
Coming up
Statistical learning: Basic notions