Admin Overview Basic concepts
Introduction
MAST90083 Computational Statistics and Data Mining
Dr Karim Seghouane
School of Mathematics & Statistics The University of Melbourne
Introduction 1/25
Outline
§i. Admin
§ii. Introduction & overview
§iii. Basic concepts
Admin
Lectures – Dr Karim Seghouane, Mon. 14:15-16:15, Theatre or via Zoom
Practical lab – Jiadong Mao, 8 groups
f2f: G2/Tues. 9:00-10:00, G3/Tues. 14:15-15:15 in PAR- -G70 ( ), G4/Fri. 10:00-11:00 in PAR- -G69 (Thompson Lab)
Online: G1/Wed. 15:15-16:15, G5/Tues. 16:15-17:15, G6/Thur. 16:15-17:15, G7/Wed. 16:15-17:15 and G8/Thur. 12:00-13:00
Consultation time: Wed. 08:00-09:00, Fri. 08:00-09:00
Admin – Assessment
Problem solving assignments 45%
Three assignments, due early, mid and late semester.
Each written assignment is worth 15%.
Exam 55%
Admin – LMS
All relevant teaching material will be posted on LMS (including the supplementary and the additional material)
Because lecture time is limited, you will need to work through some of the mathematical details and deepen your knowledge outside lecture time.
Your assignments must be submitted via LMS.
Discussion board?
Admin – References
The Elements of Statistical Learning by T. Hastie, R. Tibshirani & J. Friedman (2009).
An Introduction to Statistical Learning by G. James, D. Witten, T. Hastie & R. Tibshirani (2013).
Introducing Monte Carlo Methods with R by C.P. Robert & G. Casella (2010).
Computational Statistics by G.H. Givens & J.A. Hoeting (2005).
Academic articles, links to blogs & videos on LMS.
Admin – Communication
Emails regarding the course material, laboratories and assignments will be addressed during office hours, and only if time permits.
It is expected that questions regarding these matters will be asked during consultation hours or during the laboratories.
There will be no consultation hours during non-teaching periods.
I will be out of office during SWOT Vac week and the first two weeks of the examination period.
I will provide an extra consultation time in week 12.
Students should plan ahead with any queries regarding assignments or course material.
Lecture schedule (provisional)
Data mining (7 weeks): linear model selection and regularisation; kernel and local regression; basis expansion and spline regression; generalised additive models (GAM); classification and regression trees; bagging, random forests and boosting; support vector machines (SVM); component analysis and deep learning.
Computational statistics (5 weeks): EM algorithm; Bayesian computing; Monte Carlo methods; bootstrap methods.
Question to the class…
How would you explain in a few sentences to a general audience
What a statistician / data scientist does?
What is the main purpose of that?
Some answers
What a statistician / data scientist does?
Create a model (box) to understand the relationship between several variables.
Make sense of data and extract important patterns and trends; we can call this learning from data.
What is the main purpose of that?
Predict/decide or describe/understand
The power of a (statistical) model
Sample Population
Once upon a time …
Having the data → holding the power !!!
Image source: http://freshlearners.blogspot.com.au/2015/07/most-recent-communications-technology.html
Today
The real power is in KNOWING WHAT TO DO with the data
Image source: http://psmit.com/about.html
Taxonomy of covered methods
Data Mining
  Unsupervised learning
    Hierarchical clustering
    K-means
    Principal component analysis
  Supervised learning
    Classification
      Parametric: Logistic regression, Discriminant analysis
      Nonparametric: Trees, SVM, Neural nets
    Regression
      Parametric: Linear models regression
      Nonparametric: Kernel regression, Splines / Basis expansion, MARS, Trees, GAM
Regression vs. Classification
Regression – The aim is to predict a numerical outcome for a subject based on its features.
Classification – The aim is to predict the class membership of a subject based on its features.
Parametric vs. Nonparametric
Parametric approach – Makes an explicit assumption about the functional form of the model, which restricts its shape.
Nonparametric approach – Does not assume any functional form for the underlying model structure.
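To make the distinction concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not a method taken from the course material: the data are simulated from a sine curve, the parametric model is a straight line fitted by least squares, and the nonparametric model is a k-nearest-neighbour average. The function names and the choice k = 5 are hypothetical.

```python
import math
import random

def fit_line(xs, ys):
    """Parametric: assume f(x) = a + b*x and estimate (a, b) by least squares."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(xs, ys, k=5):
    """Nonparametric: predict by averaging the k nearest training responses."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)

def simulate(n):
    # Assumed truth for illustration only: y = sin(x) + Gaussian noise.
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.2) for x in xs]
    return xs, ys

x_train, y_train = simulate(200)
x_test, y_test = simulate(200)

mse_line = mse(fit_line(x_train, y_train), x_test, y_test)
mse_knn = mse(fit_knn(x_train, y_train, k=5), x_test, y_test)
print(f"test MSE, straight line (parametric): {mse_line:.3f}")
print(f"test MSE, 5-NN average (nonparametric): {mse_knn:.3f}")
```

Because the assumed truth is not linear, the straight line is restricted to the wrong shape and the nearest-neighbour fit should do better here. With a truly linear relationship and few observations, the parametric fit would be expected to win instead.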
Supervised vs. Unsupervised
Supervised learning – When the data set contains response (outcome) measurements, the fitted model relates the features to the response.
Unsupervised learning – When the data set contains only the features of the subjects, the fitted model aims to segment the subjects into groups or to learn about the relationships among the features.
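A small sketch of the difference, assuming a hypothetical one-feature data set with two groups centred at 0 and 5. The supervised step uses the labels to compute class means; the unsupervised step is a two-cluster k-means iteration that never sees the labels. The function names are illustrative only.

```python
import random

rng = random.Random(0)

# Hypothetical data: one feature, two groups centred at 0 and 5.
features = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(5, 1) for _ in range(100)]
labels = [0] * 100 + [1] * 100

def class_means(xs, ys):
    """Supervised: the labels say which observations belong together."""
    means = {}
    for c in set(ys):
        members = [x for x, y in zip(xs, ys) if y == c]
        means[c] = sum(members) / len(members)
    return means

def two_means(xs, iters=20):
    """Unsupervised: 1-D k-means with k = 2, using the features only."""
    c0, c1 = min(xs), max(xs)  # simple initialisation at the extremes
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return c0, c1

supervised = class_means(features, labels)
cluster_low, cluster_high = two_means(features)
print("class means using labels:", supervised)
print("cluster centres without labels:", (cluster_low, cluster_high))
```

When the groups are well separated, as assumed here, the cluster centres found without labels land close to the class means computed with them.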
How models are fitted
By optimization
We minimize some criterion (e.g. squared error with some additional penalty)
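As one possible instance of such a criterion, the sketch below minimises squared error plus a squared (ridge) penalty for a one-feature model with no intercept, `sum((y - b*x)^2) + lam * b^2`. The data, the penalty weight `lam = 5.0`, the step size and the function names are all assumptions made for illustration. The minimiser is computed twice: from the closed form, and by plain gradient descent to show the optimisation view.

```python
import random

def ridge_closed_form(xs, ys, lam):
    """Minimiser of sum((y - b*x)^2) + lam*b^2, obtained by setting the derivative to zero."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def ridge_gradient_descent(xs, ys, lam, lr=0.01, steps=500):
    """Same criterion, minimised iteratively by gradient descent."""
    b = 0.0
    for _ in range(steps):
        grad = -2 * sum(x * (y - b * x) for x, y in zip(xs, ys)) + 2 * lam * b
        b -= lr * grad
    return b

rng = random.Random(0)
# Hypothetical data: true slope 2, Gaussian noise.
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [2 * x + rng.gauss(0, 0.3) for x in xs]

b_ols = ridge_closed_form(xs, ys, lam=0.0)   # no penalty: ordinary least squares
b_ridge = ridge_closed_form(xs, ys, lam=5.0)
b_gd = ridge_gradient_descent(xs, ys, lam=5.0)
print(f"least squares slope: {b_ols:.4f}")
print(f"ridge slope (closed form): {b_ridge:.4f}")
print(f"ridge slope (gradient descent): {b_gd:.4f}")
```

The penalty shrinks the estimated slope towards zero relative to least squares, and both routes to the minimiser agree.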
Trade-offs
Flexibility (predictability) – Interpretability trade-off.
Bias – Variance trade-off.
Variance – how much the fitted model f̂ changes when it is fitted to a different data set.
Bias – the error from approximating the true relationship by a simpler model.
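The two terms can be written down explicitly for squared error loss. The decomposition below is the standard one and assumes a data-generating model that the slides do not state: $y_0 = f(x_0) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, where the expectation is taken over training sets and over the noise at the new point $x_0$.

```latex
\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
  = \underbrace{\operatorname{Var}\big(\hat{f}(x_0)\big)}_{\text{variance}}
  + \underbrace{\big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{bias}^2}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

More flexible models typically lower the bias term and raise the variance term, which is why the two are traded off against each other.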
Complexity
Complexity trade-offs involve sample size, dimensionality and empirical performance; they are a characteristic feature of supervised learning methods.
Curse of dimensionality: for a fixed sample size, the expected classification error initially decreases as the number of features increases, but eventually increases. This is a consequence of the large size of high-dimensional spaces, which require correspondingly large training sample sizes.
Scissors effect: the expected error typically decreases as sample size increases, and more complex classification rules achieve smaller error for large sample sizes; however, simpler classification rules can perform better under small sample sizes, by virtue of needing less data.
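The "large size of high-dimensional spaces" can be illustrated with a short simulation. This is a sketch under assumptions not found in the slides: points are drawn uniformly in the unit cube, the sample size is fixed at 200, and sparsity is measured by the average distance from each point to its nearest neighbour. The function name is hypothetical.

```python
import math
import random

def mean_nearest_neighbour_distance(n, d, rng):
    """Average distance from each of n uniform points in [0, 1]^d to its nearest neighbour."""
    points = [[rng.random() for _ in range(d)] for _ in range(n)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n

rng = random.Random(1)
dims = [1, 2, 5, 10, 20]
# Fixed sample size; only the number of features changes.
distances = [mean_nearest_neighbour_distance(200, d, rng) for d in dims]
for d, dist in zip(dims, distances):
    print(f"d = {d:2d}: mean nearest-neighbour distance = {dist:.3f}")
```

With the sample size held fixed, the nearest neighbour moves further away as features are added, so local information about any given point becomes scarcer unless the training sample grows accordingly.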
Complexity
Expected accuracy in a discrete classification problem for various training sample sizes as a function of the number of predictors.
Complexity
Expected error as a function of sample size for two classification rules. There is a problem-dependent critical sample size N0, below which one should use the simpler classification rule.