Supervised versus Unsupervised Learning
Supervised vs. Unsupervised Learning
Machine learning problems can generally be categorized as supervised or unsupervised.
The regression and classification problems that we have discussed so far are examples of supervised learning.
What does it mean to be supervised? For each observation of the predictor measurement(s) xi , i = 1, 2, · · · , N, there is an associated response variable yi .
We wish to fit a model that relates the response to the predictors either for predicting the response based on future observations (prediction) or for understanding the relationship between the response and the predictors (inference).
Supervised vs. Unsupervised Learning (cont.)
In contrast, unsupervised learning describes a more challenging scenario: We observe predictor variables xi but there is no associated response variable yi .
It is not possible to fit a regression model or to do classification, since there is no response variable to predict.
In this setting, we are working blind; the situation is referred to as unsupervised because we lack a response variable that can supervise our analysis.
Supervised vs. Unsupervised Learning (cont.)
In statistics and machine learning, we seek to find and understand relationships between variables; in the case of unsupervised learning, this means relationships among the x_i's themselves (there are N of them).
One statistical learning tool that we can use here is cluster analysis, or clustering.
The goal of cluster analysis is to ascertain, on the basis of x_1, x_2, · · · , x_N, whether the observations fall into relatively distinct groups.
For example, in a market segmentation study, we might observe multiple characteristics (variables) for potential customers, such as zip code, family income, and shopping habits. We might believe that the customers fall into different groups, such as big spenders versus low spenders.
Supervised vs. Unsupervised Learning (cont.)
The following figure provides a simple illustration of the clustering problem: a plot of 50 measurements on two variables (X_1, X_2).
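The dataset behind this plot is not reproduced in these slides. As a hedged sketch, data with a similar group structure could be simulated in R as follows; the three-group layout, the shift values, and the seed are assumptions for illustration, not the actual data used here.

#Simulated illustration data (assumed): 50 observations on two variables
set.seed(1)
x <- matrix(rnorm(50*2), ncol=2)     # 50 x 2 matrix of standard normal noise
x[1:17, 1] <- x[1:17, 1] + 3         # shift the first group in X_1
x[18:34, 2] <- x[18:34, 2] - 4       # shift the second group in X_2
# rows 35-50 are left near the origin and form a third group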
Supervised vs. Unsupervised Learning (cont.)
We wish to see how well the observations cluster into groups and how well the groups are separated from each other.
But should we cluster the observations into 2, 3, or more groups? How do we determine this?
Idea and Algorithm of K-means Clustering
Suppose we have N d-dimensional observations
\[
x_i \equiv (x_{i1}, x_{i2}, \cdots, x_{id}) \in \mathbb{R}^d, \qquad i = 1, 2, \cdots, N.
\]
The k-means clustering algorithm requires a predetermined number of clusters (and hence of centroids), say K.
Given K, randomly initialize the K centroids, say C_1, C_2, · · · , C_K, where C_k ≡ (C_{k1}, C_{k2}, · · · , C_{kd}) ∈ R^d for k = 1, 2, · · · , K. Given C_1, C_2, · · · , C_K, compute the distance of x_i from each centroid C_k.
[Figure: two observations x_n, x_m and two centroids C_1, C_2, illustrating the distance computation.]
K-means (cont.)
Mathematically, this means we compute the distance Δ_i corresponding to observation x_i, defined by
\[
\Delta_i = \min_{1 \le k \le K} \sum_{j=1}^{d} (x_{ij} - C_{kj})^2,
\]
and assign x_i to cluster k∗ if the minimum is attained at k = k∗. This is STEP I.
STEP I (cluster assignment): The cluster label assignment for x_i is taken as L(x_i) = k∗, where
\[
k^* = \arg\min_{1 \le k \le K} \sum_{j=1}^{d} (x_{ij} - C_{kj})^2.
\]
K-means (cont.)
STEP II (centroid update): Given the cluster assignment for each x_i, recalculate the new centroids as
\[
C_k = \frac{1}{n_k} \sum_{i:\, L(x_i) = k} x_i,
\]
where n_k is the number of observations x_i with L(x_i) = k.
Cycle between STEPS I and II, that is, the cluster assignment and
centroid computation, until convergence.
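As a sketch of how STEPS I and II fit together, here is a minimal from-scratch R implementation; the function name kmeans_sketch and the initialization by sampling K observations are assumptions for illustration, and it ignores the empty-cluster edge case. In practice the built-in kmeans() function, used in the R code below, does this for us.

#From-scratch sketch of the K-means cycle (illustration only)
kmeans_sketch <- function(x, K, max.iter = 100) {
  N <- nrow(x)
  C <- x[sample(N, K), , drop = FALSE]   # random initialization: K observations as centroids
  labels <- rep(0, N)
  for (iter in 1:max.iter) {
    # STEP I: assign each x_i to the nearest centroid (squared Euclidean distance)
    new.labels <- apply(x, 1, function(xi) which.min(colSums((t(C) - xi)^2)))
    if (all(new.labels == labels)) break # converged: assignments no longer change
    labels <- new.labels
    # STEP II: recompute each centroid as the mean of its assigned observations
    for (k in 1:K) C[k, ] <- colMeans(x[labels == k, , drop = FALSE])
  }
  list(cluster = labels, centers = C)
}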
R codes
plot(x, col="black", main="Scatterplot of X_1 and X_2",
     xlab="", ylab="", pch=20, cex=2)
[Figure: scatterplot of X_1 and X_2.]
R codes (cont.)
#Start with two clusters
km.out=kmeans(x,2,nstart=20)
plot(x, col=(km.out$cluster+1),
     main="Two clusters for Kmeans",
     xlab="", ylab="", pch=20, cex=2)
[Figure: K-means clustering of the data with K = 2.]
R codes (cont.)
#Three clusters
km.out=kmeans(x,3,nstart=20)
plot(x, col=(km.out$cluster+1), main="Three clusters for Kmeans",
     xlab="", ylab="", pch=20, cex=2)
[Figure: K-means clustering of the data with K = 3.]
R codes (cont.)
#Four clusters
km.out=kmeans(x,4,nstart=20)
plot(x, col=(km.out$cluster+1), main="Four clusters for Kmeans",
     xlab="", ylab="", pch=20, cex=2)
[Figure: K-means clustering of the data with K = 4.]
R codes (cont.)
#Five clusters
km.out=kmeans(x,5,nstart=20)
plot(x, col=(km.out$cluster+1), main="Five clusters for Kmeans",
     xlab="", ylab="", pch=20, cex=2)
[Figure: K-means clustering of the data with K = 5.]
Where to stop? Which is the optimal cluster number K?
Choosing an optimal K
K acts like a flexibility parameter: the larger the value of K, the better the k-means clustering algorithm "fits" the data.
The maximum value K = N results in each observation x_i forming its own cluster, with itself as the centroid.
To avoid overfitting, consider the within-cluster variance
\[
W = \sum_{k=1}^{K} \sum_{i:\, L(x_i) = k} \sum_{j=1}^{d} (x_{ij} - C_{kj})^2.
\]
Note that as K ↑ N, W ↓ 0.
As K increases, the drop in W will be large initially until a point where subsequent drops in W will no longer be that significant.
We choose the optimal K, denoted K∗, as the point where the decrease in W starts to become insignificant. This criterion for choosing K∗ is called the elbow criterion.
R codes for optimal K∗
#Choosing optimal K
Kmax = 6; W <- vector("numeric", Kmax)
for (k in 1:Kmax){W[k] = kmeans(x, k, nstart=20)$tot.withinss}
plot(c(1:Kmax), W, main="Plot of W(k) versus k",
     type="b", xlab="", ylab="", pch=20, cex=2)
[Figure: plot of W(k) versus k.]
Seems like K∗ = 3.
Hard versus Soft Thresholding for Clustering
Kmeans is a hard thresholding clustering procedure. Each observation is clustered into one and only one cluster according to
\[
k^* = \arg\min_{1 \le k \le K} \sum_{j=1}^{d} (x_{ij} - C_{kj})^2.
\]
However, an observation located at an equidistant point between two centroids arguably should not be forced into one cluster or the other.
Soft thresholding means that there is a probability associated with each observation of being included into each cluster.
Observations lying at an equidistant point between two centroids should then have probabilities 0.5 and 0.5 of belonging to each of the two clusters.
How to achieve this?
Gaussian Mixture Models
The Gaussian mixture model (GMM) is a model-based clustering technique that uses a likelihood-based approach to estimate the probability of each observation belonging to a particular cluster.
The GMM assumes that the observations belonging to each class follow a normal (Gaussian) density with unknown mean and covariance.
The GMM likelihood with K components looks like this:
\[
\pi(x) = \sum_{k=1}^{K} p_k \, \phi_d(x \,;\, \mu_k, \Sigma_k),
\]
where p_k ≥ 0 are mixture probabilities summing to 1, and φ_d(x ; μ, Σ) is the d-variate normal pdf with mean μ and covariance matrix Σ.
The mixture probabilities p_k, k = 1, 2, · · · , K, and the means and covariances, μ_k and Σ_k, of each class k are unknown and have to be estimated based on a training dataset.
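To see where the "soft" cluster probabilities come from once the parameters are estimated, here is a minimal sketch based on Bayes' rule; it assumes estimated values p (mixture probabilities), mu (a list of mean vectors) and Sigma (a list of covariance matrices) are available, uses dmvnorm() from the mvtnorm package, and the function name gmm_posterior is an assumption for illustration.

#Posterior (soft) membership probabilities for one observation xi
library(mvtnorm)
gmm_posterior <- function(xi, p, mu, Sigma) {
  K <- length(p)
  # Bayes' rule numerator: p_k * phi_d(xi ; mu_k, Sigma_k) for each component k
  num <- sapply(1:K, function(k) p[k] * dmvnorm(xi, mean = mu[[k]], sigma = Sigma[[k]]))
  num / sum(num)                       # normalize so the K probabilities sum to one
}

An observation equidistant (in this likelihood sense) between two equally weighted components then receives probability 0.5 for each, which is exactly the soft thresholding described above.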
Selecting the optimal K∗
The optimal K∗ is determined using the Bayesian Information Criterion (BIC). This is a model selection tool that penalizes overfitting. The formula for the BIC is
\[
\mathrm{BIC}(K) = -2 \log \hat{l}_K + m_K \log(N),
\]
where m_K is the total number of unknown parameters for a GMM with K components, and \hat{l}_K is the estimate of the GMM likelihood function based on parameter values learned from a training dataset of size N.
This should be a positive number, and the optimal K∗ will minimize the BIC. However, the mclust R package calls −BIC the BIC, so it finds the optimal K∗ by maximizing its version of the BIC.
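As a check of this sign convention, plugging in the values reported by mclust later in these slides (log-likelihood −115.3423, m_K = 11 parameters, N = 30) gives
\[
\mathrm{BIC}(3) = -2(-115.3423) + 11 \log(30) \approx 230.68 + 37.41 \approx 268.10,
\]
which matches the value −268.0977 reported by mclust up to the sign flip.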
R codes for GMM fitting
#GMM fitting
library(mclust)
BIC <- mclustBIC(x, modelNames = "VII") #Determine K using the BIC criteria
plot(BIC)
[Figure: BIC of the VII model versus the number of components.]
summary(BIC)
## Best BIC values:
## VII,3 VII,2 VII,4
## BIC -268.0977 -272.09910 -280.08661
## BIC diff 0.0000 -4.00136 -11.98887
mod1 <- Mclust(x, x = BIC)
summary(mod1, parameters = TRUE)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VII (spherical, varying volume) model with 3 components:
##
## log-likelihood n df BIC ICL
## -115.3423 30 11 -268.0977 -269.0544
## Clustering table:
##  1  2  3
##  7 11 12
##
## Mixing probabilities:
## 1 2 3
## 0.2375402 0.3583167 0.4041431
##
## Means:
## [,1] [,2] [,3]
## [1,] 2.596463 4.123592 1.157236
## [2,] -1.416490 -4.198373 2.818566
##
## Variances:
## [,,1]
## [,1] [,2]
## [1,] 0.388412 0.000000
## [2,] 0.000000 0.388412
## [,,2]
## [,1] [,2]
## [1,] 0.8079786 0.0000000
## [2,] 0.0000000 0.8079786
## [,,3]
## [,1] [,2]
## [1,] 1.991936 0.000000
## [2,] 0.000000 1.991936
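Since the VII model is spherical with varying volume, each cluster's covariance matrix is a multiple of the identity, Σ_k = σ_k² I_2; this is why each 2 × 2 variance matrix above is diagonal with equal entries, with a different value (0.388, 0.808, 1.992) for each of the three clusters.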
R codes for classification labels
plot(mod1, what = "classification")
[Figure: scatterplot of the data colored by the GMM classification into three clusters.]