Prof. Dr. Matei Demetrescu
University of Kiel
Institute for Statistics and Econometrics
Summer 2020
Data Mining
Course description
The course provides a statistical introduction to methods for analyzing large and complex data sets and relationships. The focus is on regression and classification methods. We start in a parametric, linear setup but move on quickly to issues that arise in practice, such as regressor (feature) selection, and look at model selection techniques like cross-validation. The course concludes with a brief look at specific nonparametric techniques such as regression and decision trees. Selected case studies are discussed in a computer class using R. After completing the course, you will be able to conduct complex data analyses on your own.
Prerequisites
• (Advanced) Statistics I+II or equivalent
Outline
1. Statistical learning
2. Prediction and classification
3. Using linear models
4. Model selection and error estimation
5. Dealing with many features: Shrinkage and dimensionality reduction
6. Getting nonlinear: Local regression, trees and more
7. Ensemble methods: Bagging, boosting, and model averaging
8. Interpretable models
9. Unsupervised learning
Schedule
• The course will begin on April 8th as an OLAT-based online course; it is not yet clear when or whether we can revert to normal. Please see the following generic information regarding online teaching at the Institute for Statistics and Econometrics.
Concerning this course, the plan is to upload slides and video tutorials each weekend, followed by live Q&A sessions during the original time slot (i.e. Thursday 2:15pm to 3:45pm) using suitable video conference software.
Please note that plans may change as we go. Any changes will be communicated via OLAT.
• The PC tutorials will be re-scheduled towards the end of the semester; we will let you know as soon as we have more information.
Materials
• Slides and lecture notes will be made available in due time via OLAT
• The basic textbook is
– Hastie, T., R. Tibshirani and J. Friedman (2009, 2nd ed.) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer¹
(A pdf copy is freely available from the authors: https://web.stanford.edu/~hastie/ElemStatLearn/download.html)
• More focused:
– Bishop, C. (2006) Pattern Recognition and Machine Learning, Springer
• A different perspective:
– Han, J., M. Kamber and J. Pei (2012, 3rd ed.) Data Mining: Concepts and Techniques, Elsevier
Exam
• written exam
• you may use the slides
• you can earn some bonus points by solving R assignments
¹ An introductory version we may sometimes use is James, G., D. Witten, T. Hastie and R. Tibshirani (2013) An Introduction to Statistical Learning: With Applications in R, Springer. A pdf copy is also freely available from the authors; feel free to search for it.
Contact:
• mdeme@stat-econ.uni-kiel.de, mokuneva@stat-econ.uni-kiel.de
Office hours:
• Office hours are only available online: by email and video call (the latter by appointment).