5.0 – Statistical Learning
5.1 – Basic concepts
“Machine learning”?
Essentially, it is statistical learning
We are looking for a pattern that has so far gone unseen
If there is no pattern, then ML will be counter-productive, as it is likely to produce one!
It assumes the availability of relevant data
Examples: consumer tastes/habits, online advertising, election forecasts, risk prediction
Warning: linear regression is now referred to as “artificial intelligence” by a lot of people
Statistical learning: regression
Variable of interest: Wage (continuous)
(figure from ISLR)
http://www-bcf.usc.edu/~gareth/ISL/
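The data behind this kind of figure can be explored directly in R; a minimal sketch, assuming the ISLR package that accompanies the book is installed:
# Wage data from the ISLR package: wage against age, with a smoothed trend
library(ISLR)
data(Wage)
plot(Wage$age, Wage$wage, col = "grey", xlab = "Age", ylab = "Wage")
lines(lowess(Wage$age, Wage$wage), col = "blue", lwd = 2)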
Statistical learning: classification
Variable of interest: Default (categorical)
(figure from ISLR)
http://www-bcf.usc.edu/~gareth/ISL/
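Likewise, the Default data can be loaded from the ISLR package (a sketch, assuming that package is installed):
# Default data from the ISLR package: credit card balance by default status
library(ISLR)
data(Default)
boxplot(balance ~ default, data = Default, xlab = "Default", ylab = "Balance")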
Statistical learning: supervised learning
Variable of interest: Species (categorical)
With prior experience available (supervised classification)
(figure: iris dataset, Sepal.Width vs Sepal.Length, points labelled by species: setosa, versicolor, virginica)
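The figure can be reproduced approximately with base R graphics; this is a sketch, not the original plotting code of the slide:
# Approximate reconstruction of the supervised view of the iris data
data(iris)
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = as.integer(iris$Species), pch = 19,
     xlab = "Sepal.Length", ylab = "Sepal.Width",
     main = "Iris dataset")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)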
Statistical learning: unsupervised learning
Variable of interest: Species (categorical)
With no prior experience available (clustering)
(figure: iris dataset, Sepal.Width vs Sepal.Length, with no species labels shown)
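For the unsupervised view, a sketch (an assumption made for illustration, not the slides' code) would cluster the same two measurements while ignoring the Species labels:
# Cluster the measurements without using the Species labels (k = 3 chosen here)
set.seed(1)
cl <- kmeans(iris[, c("Sepal.Length", "Sepal.Width")], centers = 3)
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = cl$cluster, pch = 19,
     xlab = "Sepal.Length", ylab = "Sepal.Width",
     main = "Iris dataset")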
5.2 – Learning framework
Data splitting
A dataset is randomly split into a training set and a test set
The training set is used to fit (or train) the model
The test set is used to validate this model
Model is both tuned and fitted during training
We expect MSE_train < MSE_test
Overfitting occurs when too much emphasis is put on fitting the training data:
the training set yields a much smaller MSE than the test set
the model describes the training data too well and is unable to adapt to new data
this yields poor prediction performance

General framework with large data
Example in model selection: random split into training, validation and test sets (e.g. 50%, 25%, 25% resp.)
Training set: used to fit the model
Validation set: used to measure prediction error and choose the best model
Test set: used to measure the generalization error of the final model (i.e. its ability to predict from new data)
[The Elements of Statistical Learning, T. Hastie, R. Tibshirani, J. Friedman, Springer]

General framework with small data
Cross-validation is usually used when dealing with small samples.
Example: model selection for classification with N = 50, p = 5000
1. Randomly divide the samples into K cross-validation folds
2. For each fold k = 1, 2, ..., K:
   a. find a subset of predictors with the highest (univariate) correlation with the class labels, using all data except fold k
   b. using just this subset of predictors, build a multivariate classifier, using all data except fold k
   c. use the classifier to predict the class labels for fold k and compute the corresponding prediction error
[The Elements of Statistical Learning, T. Hastie, R. Tibshirani, J. Friedman, Springer]

General framework: summary
Randomly split the data into a training set and a test set
Training set: model calibration (tuning); fit the model on the whole training set
Test set: model validation (test)

Key aspects of the data (typical challenges)
Data scales: beware of scale effects in heterogeneous data (illustrative example: iris dataset)
Dimensionality (too many covariates): many variables may be “aligned” / redundant; data pre-filtering must be done independently of the class labels (illustrative example: iris dataset)
Dimensionality (too many dimensions in the observed data): pre-process the data using dimension reduction techniques; factor analysis and PCA are predominant and common choices (illustrative examples: iris and EuStockMarkets datasets)

5.3 – Performance assessment

Performance indicators for regression
Mean Square Error (MSE): $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - \hat{f}(X_i)\bigr)^2$
LOO CV test MSE: $\mathrm{MSE}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - \hat{f}_{-i}(X_i)\bigr)^2$
k-fold CV test MSE: $\mathrm{MSE}_{(k)} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{MSE}_{-i}$, where $\mathrm{MSE}_{-i}$ is computed on fold $i$ for a model fitted without fold $i$

Performance indicators for classification
Prediction accuracy (error rate): $\mathrm{Err} = \frac{1}{n}\sum_{i=1}^{n} I(Y_i \neq \hat{Y}_i)$
LOO CV error rate: $\mathrm{Err}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} I(Y_i \neq \hat{Y}_i^{-i})$
k-fold CV error rate: $\mathrm{Err}_{(k)} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{Err}_{-i}$

Performance indicators for classification
Recall: one seeks to retain or reject a null hypothesis H0 on the basis of evidence. Let us denote by H1 the alternative hypothesis.
                 H0 is true         H1 is true
H0 is accepted   Correct decision   Type II error
H1 is accepted   Type I error       Correct decision

The null hypothesis can never be proven
A Type I error occurs when H0 is true but rejected
P(Type I error) = significance level of the test
P(Type II error) = false negative rate
1 - P(Type II error) = statistical power of the test (sensitivity)

Performance indicators for classification
Confusion matrix (rows: prediction outcome; columns: actual value):

              Actual +   Actual -
Predicted +   TP         FP
Predicted -   FN         TN
TOTAL         P          N

False positive rate = FP/N = 1 - specificity (linked to the Type I error)
True positive rate = TP/P = sensitivity = recall (linked to 1 - the Type II error rate)
AUC of the ROC curve
(figure: ROC curve for some dataset, sensitivity vs specificity, AUC = 0.7314)

5.4 – Some techniques of reference

Logistic regression
Use a scrambled subsample x of the iris dataset:
is = sample(1:150); x = iris[is[1:100], ]
Recode x$Species into is.virginica with values in {0, 1} (we are changing the problem formulation slightly):
x$is.virginica = as.numeric(x$Species == "virginica"); x$Species = NULL
Fit the model:
fit <- glm(is.virginica ~ ., data = x, family = binomial(logit))
Use fit to predict the species of the remaining data points:
testset = iris[is[101:150], ]   # note: is is a permutation of 1:150, so iris[-is, ] would be empty
y <- testset[, 1:4]
pred <- predict(fit, newdata = y, type = "response")
Assess the prediction performance

Naive Bayes classification
Recall: find the unknown true label $l_i$ of each observation $Y_i$, given the observed predictor vector $x_0$
Naive Bayes classifier: for all $i = 1, \dots, n$, $\hat{l}_i = \arg\max_j \Pr(Y_i = j \mid X_i = x_0)$
Example: for a two-class problem, $\hat{l}_i = 1$ if $\Pr(Y = 1 \mid X = x_0) > 0.5$
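To make the split / error-rate machinery above concrete, here is a minimal sketch in R, assuming the e1071 package (not mentioned in the slides) for its naiveBayes() implementation:
# Minimal sketch: random training/test split, naive Bayes fit, test error rate
# Assumption: the e1071 package provides the naiveBayes() classifier used here
library(e1071)
set.seed(1)
is    <- sample(1:150)              # random permutation of the iris row indices
train <- iris[is[1:100], ]          # training set: fit the model
test  <- iris[is[101:150], ]        # test set: validate the model
fit  <- naiveBayes(Species ~ ., data = train)
pred <- predict(fit, newdata = test)
# test error rate: Err = (1/n) * sum(Y_i != Yhat_i)
mean(pred != test$Species)
table(predicted = pred, actual = test$Species)   # confusion matrix
The same split could of course be reused for the logistic-regression example above.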
Regression and Classification Trees
Ex: ISLR::Hitters (http://www-bcf.usc.edu/~gareth/ISL)
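As a hedged sketch of what such a tree looks like in practice (rpart is one possible implementation; the package used for the original figures is not stated in the slides):
# Regression tree for log(Salary) on the ISLR::Hitters data
library(ISLR)
library(rpart)
data(Hitters)
hit <- na.omit(Hitters)                       # Salary contains missing values
tree.fit <- rpart(log(Salary) ~ Years + Hits, data = hit)
plot(tree.fit); text(tree.fit, use.n = TRUE)  # draw the fitted tree
print(tree.fit)                               # splits and fitted values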
Clustering
Idea: arrange n individuals into groups with respect to a set of measures
The choice of (dis)similarity measure is key and determines the resulting classification
Variables should be rescaled first (and possibly weighted); see the sketch below
High-dimensional data may require prior dimension reduction
(PCA is not necessarily pertinent here)
http://cran.r-project.org/web/views/Cluster.html
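A minimal sketch of the rescaling step mentioned above (Euclidean distance on standardised iris variables is an assumption made for illustration, not a prescription of the slides):
# Standardise each variable before computing pairwise dissimilarities
x <- scale(iris[, 1:4])   # centre and rescale each variable to unit variance
d <- dist(x)              # Euclidean distances between observations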
Hierarchical clustering
Hierarchical clustering: successively merge (or split) groups to construct a dendrogram
M = matrix(c(0,3,5,8,4,
             3,0,2,6,8,
             5,2,0,3,4,
             8,6,3,0,1,
             4,8,4,1,0), nrow=5)
dM = data.frame(M, row.names=c("A","B","C","D","E"))
dmat = dist(dM)        # Euclidean distances between the rows of dM
plot(hclust(dmat))     # plclust() is defunct in current R
(figure: Cluster Dendrogram of dmat, hclust (*, "complete"); leaves A, B, C, D, E, with Height on the vertical axis)
Hierarchical clustering
Example: data(eurodist)
Creating a dendrogram:
hc = hclust(eurodist, method="ward.D")   # "ward" was renamed "ward.D" in recent R
Plotting dendrograms:
plot(hc)                 # plclust() is defunct in current R
plot(hc, hang=-1)
rect.hclust(hc, k=3)     # highlight a partition into k = 3 clusters
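To extract the actual group memberships rather than only drawing boxes, one might add (a small sketch, not shown on the slide):
# Cut the dendrogram into k = 3 groups and inspect their sizes
groups <- cutree(hc, k = 3)
table(groups)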
k-means clustering
k-means clustering is another popular clustering method
It is comparable to an Expectation-Maximisation algorithm
?kmeans…
Initialise with k clusters (hierarchical clustering may help choose k)
Move individuals between clusters whenever this improves the criterion (within-cluster sum of squares)
Risk of convergence to a local optimum, hence the use of several random starts (see the sketch below)
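A minimal k-means sketch on the rescaled iris measurements; the number of clusters and the number of random starts are assumptions made for illustration:
# k-means with 25 random starts to reduce the risk of a poor local optimum
set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
km$tot.withinss                   # criterion being minimised (within-cluster SS)
table(km$cluster, iris$Species)   # compare the clusters with the known labels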
Principal component analysis
Project data according to highest variance components
Linear orthogonal transformation:
each component is uncorrelated with preceding ones
Reveals internal structure of the data, explaining its variance
PC1 = the direction of greatest variance achievable by any single projection of the data
Theoretically optimal transform of given data in the least-squares sense
PCA is widely used e.g. to reduce problem dimensionality
?prcomp…
Principal component analysis
Example: ?USArrests
Statistics, in arrests per 100,000 residents for assault, murder,
and rape in each of the 50 US states in 1973
Also contains the % of the population living in urban areas
plot(USArrests, main="USArrests data")   # scatterplots
pairs(USArrests, panel = panel.smooth,
      main = "USArrests data")
Principal component analysis
(figure: pairs plot of the USArrests data: Murder, Assault, UrbanPop, Rape)
Principal component analysis
data(USArrests)
cor(USArrests)                   # correlation matrix
eigen(cor(USArrests))            # eigenvalue decomposition
# compare result with prcomp:
prcomp(USArrests, scale = TRUE)  # same vectors!
prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
plot(prcomp(USArrests, scale = TRUE))
# equiv. cov(prcomp(USArrests, scale = TRUE)$x)
# i.e. (prcomp(USArrests, scale = TRUE)$sdev)^2,
# i.e. the eigenvalues of the cov/correl matrix
summary(prcomp(USArrests, scale = TRUE))
Principal component analysis
(figure: scree plot of prcomp(USArrests, scale = TRUE): the variances of the principal components)
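As a short follow-up (not on the slides), the first components can be used as a lower-dimensional representation of the data:
# Scores of the 50 states on the first two principal components
pca    <- prcomp(USArrests, scale = TRUE)
scores <- pca$x[, 1:2]
head(scores)
biplot(pca)    # states and variables in the PC1-PC2 plane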