Slide 1
BUSINESS SCHOOL
Discipline of Business Analytics
QBUS6850 Team
Topics covered
k-Nearest-Neighbours Algorithm
Unsupervised learning introduction
K-means
References
Friedman et al. (2001), Chapters 13.1–13.3, 14.3
James et al. (2014), Chapter 10.3
Bishop (2006), Chapters 2.5.2 and 9.1
Learning Objectives
Understand the intuition of k-NN
Understand how k-NN works
Understand the model specification of k-NN
Understand the classification confusion matrix
Understand precision and recall
Understand the intuition of K-means
Understand the loss function of K-means and how it
works
Understand why and how to do random initialization for
K-means
k-Nearest-Neighbours Algorithm
k-Nearest-Neighbours is a classification algorithm that determines the class of a point by combining the classes of the k nearest points.
k-NN requires no distributional assumptions.
It is a supervised learning algorithm: it classifies a point based on the known classes of other points.
k-NN Classifier
Find k nearest neighbours of x.
Predict the class of x to be the majority class of the k
nearest neighbours.
Parameter k can be determined using validation.
k-NN Classifier
How does it work?
Training set:
The k-NN algorithm identifies the k nearest neighbours of the blue data point, regardless of their labels.
k-NN Intuition
[Figure: scatter plot of customers with x1 = income on the horizontal axis and x2 = saving on the vertical axis]
D = {(x_1, t_1), (x_2, t_2), (x_3, t_3), ..., (x_N, t_N)}
where each x_n = (x_{n1}, x_{n2}) = (income, saving) and t_n is the class label (red or green).
k-NN Intuition
Training Set:
Suppose: k = 3
We have red and green labelled
data points
Target:
Predict the class of the blue
data point.
D = {(x_1, t_1), (x_2, t_2), (x_3, t_3), ..., (x_N, t_N)}
Training Set:
Suppose: k = 3
We have red and green labelled
data points
Target:
Predict the class of the blue
data point: red class
Always choose ODD values of k for a 2-class problem.
Why?
k-NN Intuition
D = {(x_1, t_1), (x_2, t_2), (x_3, t_3), ..., (x_N, t_N)}
Training Set:
Now k = 5
Target:
Predict the class of the blue
data point: green class
k-NN Intuition
D = {(x_1, t_1), (x_2, t_2), (x_3, t_3), ..., (x_N, t_N)}
k-NN Model
This is a nonparametric model: there is no need for parameter estimation.
Training Set:
D = {(x_1, t_1), (x_2, t_2), (x_3, t_3), ..., (x_N, t_N)}
The model is defined as
f(x, β) = Mode{ t_n : x_n ∈ N_k(x) }, where x_n ∈ D
where N_k(x) is the neighbourhood of x defined by the k closest points in the training dataset, and Mode() is the mode function (taken over the t_n values of the k closest points).
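To make the rule concrete, here is a minimal from-scratch sketch in Python/NumPy (the function name knn_predict and the toy income/saving data are illustrative, not from the lecture code):

import numpy as np
from collections import Counter

def knn_predict(X_train, t_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points: the neighbourhood N_k(x_new)
    nearest = np.argsort(dists)[:k]
    # majority vote (mode) over the labels of those k neighbours
    return Counter(t_train[nearest]).most_common(1)[0][0]

# toy data: x = (income, saving), t = class label (0 or 1)
X_train = np.array([[30, 5], [35, 6], [80, 40], [85, 42], [90, 38]])
t_train = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_train, t_train, np.array([33, 7]), k=3))   # -> 0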
k Closest Points
The Euclidean distance between x_i and x_j is
‖x_i − x_j‖ = √( (x_i − x_j)^T (x_i − x_j) )
Other distance measures can also be used.
The k-NN prediction rule for regression:
For a new point x_0, the model predicts the value at x_0 as the average of all the values at the k neighbours in N_k(x_0).
The k-NN prediction rule for classification:
For a new point x_0, the model predicts the class of x_0 as the majority vote (mode) among the k neighbours in N_k(x_0).
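In scikit-learn, these two rules correspond to KNeighborsClassifier (majority vote) and KNeighborsRegressor (average of the neighbours). A minimal sketch with made-up data:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[30, 5], [35, 6], [80, 40], [85, 42], [90, 38]])   # (income, saving)
t_class = np.array([0, 0, 1, 1, 1])                              # class labels
t_value = np.array([1.0, 1.2, 4.0, 4.5, 4.2])                    # numeric target for regression

clf = KNeighborsClassifier(n_neighbors=3).fit(X, t_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, t_value)

x0 = np.array([[33, 7]])
print(clf.predict(x0))   # majority class among the 3 nearest neighbours
print(reg.predict(x0))   # average target value of the 3 nearest neighbours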
If x is one of the training data points, what is its neighbourhood? x is its own neighbourhood.
If k = 1, there is no misclassification on the training data.
1-NN is more accurate on the training data, but it causes overfitting. The algorithm is also quite sensitive to outliers.
"Kernel" methods (later in the course) use a generalized notion of neighbourhood, but are in essence the same idea as k-NN.
k = 1, 1-Nearest-Neighbor
k = 1, 1-Nearest-Neighbor
Each training vector defines a region in space, resulting in a Voronoi partition of the space.
Always choose ODD values of k for a 2-class problem.
More generally, k should not be a multiple of the number of classes, to avoid ties when assigning the class.
k=?
Loss Function for Classification / Confusion Matrix
k-NN Objective
• The target of k-NN?
• Minimize the number of misclassifications on the training set? This is minimized at k = 1, which might not be a good approach.
• There is a k for which bias and variance are balanced optimally.
• How to decide k?
• We can use model and feature selection techniques, including training/validation/test sets, CV, etc., to see which k is best with respect to the target metric, e.g. the misclassification rate.
• Is misclassification rate always a good metric for all applications?
The misclassification rate is: (FP + FN) / Total # of predictions
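As a sketch of how k could be chosen by cross-validation in scikit-learn (the synthetic data below is a placeholder; any labelled feature matrix X and label vector t would do):

import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# placeholder data: replace with your own features X and labels t
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.3, random_state=0)

# 5-fold cross-validation over odd values of k
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
grid.fit(X_train, t_train)

print("best k:", grid.best_params_["n_neighbors"])
print("test misclassification rate:", 1 - grid.score(X_test, t_test))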
Skewed Data
• In the customer default example, we predict t =1 if default,
and t = 0 if not default.
• Suppose using logistic regression or k-NN algorithm, we get
99% accuracy rate, or 1% misclassification rate (error rate).
• In reality, only approximately 0.5% of customers would actually default (t = 1).
Is this a good classifier / machine learning model?
• What if we build a simpler classifier that predicts everyone to
be t = 0 (not default)?
• This classifier will only have 0.5% misclassification rate.
• Is it a better classifier?
Skewed data!
Confusion Matrix
                  Predicted: 0 (-)       Predicted: 1 (+)
Actual: 0 (-)     True negative (TN)     False positive (FP)
Actual: 1 (+)     False negative (FN)    True positive (TP)
The recall, also called the sensitivity rate or true positive rate (TPR), is:
TPR = TP / (TP + FN) = TP / Total # of ACTUAL positives
For all customers who actually default, what percentage did we correctly predict as t = 1 (default)? The higher, the better.
t = 1 represents the rare class that we want to capture, e.g. default.
The denominator TP + FN is the second ROW sum of the confusion matrix.
The precision (positive predictive rate) is:
PPR = TP / (TP + FP) = TP / Total # of PREDICTED positives
For all customers we predict as t = 1 (default), what percentage of them actually default? The higher, the better.
The denominator TP + FP is the second COLUMN sum of the confusion matrix.
Why not build a simpler classifier that predicts everyone to be t = 0 (not default)? Its recall would be 0. We check both precision and recall instead of just evaluating one, to avoid this kind of cheating.
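A small sketch, with invented labels, of how these quantities can be computed in scikit-learn, including the "predict everyone as t = 0" cheat that achieves a low error rate but zero recall:

import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

t_actual = np.array([0] * 95 + [1] * 5)    # skewed data: only 5% of cases are t = 1
t_cheat = np.zeros(100, dtype=int)          # "cheating" classifier: predict everyone as t = 0

print(confusion_matrix(t_actual, t_cheat))                                # rows: actual, columns: predicted
print("error rate:", np.mean(t_actual != t_cheat))                        # 0.05, looks good
print("recall:", recall_score(t_actual, t_cheat, zero_division=0))        # 0.0, the cheat is exposed
print("precision:", precision_score(t_actual, t_cheat, zero_division=0))  # 0.0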
Recall — how good a test is at detecting the
positives. A test can cheat and maximize this by
always returning “positive”.
Precision – how many of the positively classified were relevant. A test can cheat and maximize this by returning positive only on the one result it is most confident about.
The cheating is resolved by looking at both metrics
instead of just one.
Confusion Matrix
t actual:      0 0 0 0 0 1 0 1 0 0
t predicted-1: 1 1 1 1 1 1 1 1 1 1
t predicted-2: 0 0 0 0 0 1 0 0 0 0
Calculate the precision and recall for each model.
t actual:      0 0 0 0 0 1 0 1 0 1
t predicted-A: 0 1 1 1 1 1 1 1 1 1
t predicted-B: 0 0 0 0 0 0 0 1 0 0
Precision and Recall Tradeoff
[Figure: precision-recall tradeoff curve, with precision on one axis and recall on the other]
For model A: we predict t = 1 even if we are NOT that confident, e.g. using a low probability threshold in logistic regression.
Recall=3/3= 100%
Precision=3/9=33.33%
High recall, low precision.
For model B: we predict t = 1 only if we are VERY confident, e.g. using a high probability threshold in logistic regression.
Recall= 1/3= 33.33%.
Precision=1/1=100%
Low recall, high precision. (Demo only.)
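The numbers above can be verified directly; a quick check with scikit-learn's metrics:

from sklearn.metrics import precision_score, recall_score

t_actual = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]
t_pred_A = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]   # low threshold: predicts 1 readily
t_pred_B = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]   # high threshold: predicts 1 rarely

print(recall_score(t_actual, t_pred_A), precision_score(t_actual, t_pred_A))   # 1.0, 0.333...
print(recall_score(t_actual, t_pred_B), precision_score(t_actual, t_pred_B))   # 0.333..., 1.0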
F1-Score
F1-Score = 2 / (1/Precision + 1/Recall) = 2 × Precision × Recall / (Precision + Recall)
(the harmonic mean of Precision and Recall)
If a classifier is cheating, then precision or recall is likely to be close to 0, so the F1-Score will be close to 0.
To evaluate a classifier:
Incorporate the model and feature selection techniques, including
training & validation & test sets and CV
Evaluate misclassification rate, precision & recall and F1-Score.
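A one-line check of the harmonic-mean formula against scikit-learn's f1_score, reusing the model A predictions from the previous slide:

from sklearn.metrics import f1_score, precision_score, recall_score

t_actual = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]
t_pred_A = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

p = precision_score(t_actual, t_pred_A)   # 3/9
r = recall_score(t_actual, t_pred_A)      # 3/3
print(2 * p * r / (p + r))                # harmonic mean formula: 0.5
print(f1_score(t_actual, t_pred_A))       # same value from sklearn: 0.5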
The confusion matrix for a binary classifier again
Define the true positive rate (TPR) and the false positive rate
(FPR)
Both jointly assess the performance of a classifier. The best case is a high TPR and a low FPR, which means the classifier won't make many errors on positive and negative examples, respectively.
Other Measures for Binary Classifiers
                  Predicted: 0 (-)       Predicted: 1 (+)
Actual: 0 (-)     True negative (TN)     False positive (FP)
Actual: 1 (+)     False negative (FN)    True positive (TP)
The Receiver Operating Characteristic (ROC) is a metric used to evaluate classifier output quality. It is particularly useful for binary classifiers.
ROC curves typically feature the true positive rate (TPR) on the Y axis and the false positive rate (FPR) on the X axis. This means that the top left corner of the plot is the "ideal" point: a false positive rate of zero and a true positive rate of one.
This is not very realistic, but it does mean that a larger Area
Under the Curve (AUC) is usually better.
ROC curves are typically used in binary classification to study the output of a classifier. In order to extend the ROC curve and ROC area to multi-class or multi-label classification, it is necessary to binarize the output. One ROC curve can then be drawn per label.
Receiver Operating Characteristic (ROC)
Take the logistic regression classifier as an example.
For all the test/validation examples, the logistic regression
model produces the probability for each case/example to be
class A (or Class 1).
A typical decision is if the probability (or score) is larger than
0.5, then we classify the case as Class A (or Class 1),
otherwise Class B (or Class 0).
Why do we use 0.5? In fact, taking any threshold between 0.0 and 1.0, we can make decisions based on the same trained model.
For each threshold between 0.0 and 1.0, we obtain decisions for all the examples, construct a confusion matrix, calculate the TPR and FPR, and finally plot a point at those coordinates.
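scikit-learn's roc_curve performs exactly this threshold sweep over the predicted scores. A sketch with synthetic data (the data and model below are illustrative only):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# synthetic binary data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
t = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

model = LogisticRegression().fit(X, t)
scores = model.predict_proba(X)[:, 1]        # probability of class 1 for each example

fpr, tpr, thresholds = roc_curve(t, scores)  # TPR and FPR at every threshold
print("AUC:", roc_auc_score(t, scores))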
Receiver Operating Characteristic (ROC)
In the test dataset: 100 positive cases (Class 1) and 100 negative cases (Class 0).
Take the threshold as 0.0, then all the testing cases will be classified as positive. Why? Hence TPR=1.0 and FPR=1.0.
Take the threshold as 1.0, then all the testing cases will be classified as negative. Why? Hence TPR=0.0 and FPR=0.0.
Threshold = 0.25 produces TPR=0.9 and FPR=0.60
ROC: Example
Threshold = 0.25
                  Predicted: 0 (-)    Predicted: 1 (+)
Actual: 0 (-)     40                  60
Actual: 1 (+)     10                  90
Threshold = 0.50 produces TPR=0.80 and FPR=0.20
Threshold = 0.75 produces TPR=0.40 and FPR=0.10
ROC: Example
Threshold = 0.50
                  Predicted: 0 (-)    Predicted: 1 (+)
Actual: 0 (-)     80                  20
Actual: 1 (+)     20                  80

Threshold = 0.75
                  Predicted: 0 (-)    Predicted: 1 (+)
Actual: 0 (-)     90                  10
Actual: 1 (+)     60                  40
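As a quick arithmetic check of the tables above (the counts are taken straight from the slides):

def rates(tn, fp, fn, tp):
    # TPR = TP / (TP + FN), FPR = FP / (FP + TN)
    return tp / (tp + fn), fp / (fp + tn)

print(rates(tn=40, fp=60, fn=10, tp=90))   # threshold 0.25 -> (0.9, 0.6)
print(rates(tn=80, fp=20, fn=20, tp=80))   # threshold 0.50 -> (0.8, 0.2)
print(rates(tn=90, fp=10, fn=60, tp=40))   # threshold 0.75 -> (0.4, 0.1)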
This finally gives us the ROC curve. sklearn can automate the process of computing the ROC. See Lecture04_Example00.py.
ROC: Example
[Figure: ROC curve through the points obtained at thresholds 0.0, 0.25, 0.5, 0.75, and 1.0, with TPR on the Y axis and FPR on the X axis, both from 0 to 1.0]
Python Example
(Lecture04_Example01.py)
Unsupervised Learning
Supervised Learning
Training Set:
D = {(x_1, t_1), (x_2, t_2), (x_3, t_3), ..., (x_N, t_N)}
N = ?
[Figure: scatter plot of labelled points, x1 = income vs x2 = saving]
[Figure: scatter plot of unlabelled points, x1 = income vs x2 = saving]
Unsupervised Learning
Training Set:
In unsupervised learning
we do not make the
distinction between the
response/target variable
(t) and the feature (x).
d=?
Number of features=?
Clustering
D = {x_1, x_2, x_3, ..., x_N}
Applications of Clustering
Customer segmentation
Social network analysis
Market segmentation
Image clustering
K-means Algorithm
K-means is a clustering algorithm that tries to partition a
set of points into K sets (clusters) such that the points in
each cluster tend to be near each other.
It is an unsupervised learning algorithm, since the data are not labelled.
K-means
K-means Input
K (number of clusters)
Training Set: D = {x_1, x_2, x_3, ..., x_N}
How to choose K?
How to select the starting values of cluster centroids?
K-means is an iterative algorithm containing two steps:
i. cluster assignment step
ii. centroid move step.
How K-means Works
Randomly pick two (why?) cluster centroids.
The first of the two steps in the K-means loop is the cluster assignment step: go through each of the red observations and decide whether each of them is closer to the green cluster centroid or the blue cluster centroid.
Loop 1
Cluster Assignment Step.
This step goes through the data set and colours each of the points either green or blue, depending on whether it is closer to the green cluster centroid or the blue cluster centroid.
Loop 1
Centroid Move Step.
Take the two cluster centroids (green and blue crosses) and move them to the average of the points of the same colour.
Specifically: take all the green points, compute their average, and move the green cluster centroid to that average. Do the same for the blue points and centroid.
Loop 1
New centroids' locations.
Loop 1
Cluster Assignment Step.
Loop 2
Loop 2
Centroid Move Step.
New Centroids’ locations.
Loop 2
Loop 3
Cluster Assignment Step.
Loop 3
Centroid Move Step.
Loop 3
New Centroids’ locations.
Keep running additional iterations of K-means until the cluster centroids no longer change and the cluster assignments of the observations no longer change.
K-means Loss Function
C_n: the index of the cluster (1, 2, ..., K) to which data point x_n is currently assigned.
μ_k: cluster centroid k. Total number of clusters: K.
μ_{C_n}: the cluster centroid to which data point x_n is assigned. For example, if x_n is allocated to cluster 2 (C_n = 2), then μ_{C_n} = μ_2.
β: the collection of cluster assignments and cluster centroids.
Loss function (distortion), to be minimized:
L(β) = Σ_{n=1}^{N} ‖x_n − μ_{C_n}‖²
where ‖x_n − μ_{C_n}‖ is the Euclidean distance between x_n and its assigned centroid μ_{C_n}.
K-means Algorithm
Randomly initialize K cluster centroids μ_1, μ_2, μ_3, ..., μ_K.
Loop the following two steps until convergence:
Cluster Assignment Step: for n in 1:N, set C_n by assigning observation x_n to the closest centroid μ_k. In this step, we fix μ_1, μ_2, ..., μ_K and minimize the loss function L(β) with respect to C_1, C_2, C_3, ..., C_N.
Centroid Move Step: for k in 1:K, set μ_k to the mean of the data points assigned to cluster k; this becomes the new cluster centroid. In this step, we fix C_1, C_2, ..., C_N and minimize the loss function L(β) with respect to μ_1, μ_2, μ_3, ..., μ_K.
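A minimal NumPy sketch of these two alternating steps (the initialization choice and the function name kmeans are illustrative; in practice sklearn.cluster.KMeans implements the same algorithm):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # random initialization: pick K distinct training points as the initial centroids
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # cluster assignment step: C_n = index of the centroid closest to x_n
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # N x K distances
        C = dists.argmin(axis=1)
        # centroid move step: mu_k = mean of the points currently assigned to cluster k
        new_mu = np.array([X[C == k].mean(axis=0) if np.any(C == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # centroids stopped moving: converged
            break
        mu = new_mu
    distortion = np.sum((X - mu[C]) ** 2)   # the loss L(beta)
    return C, mu, distortion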
How to initialize K-means?
Random initialization:
K