

BUSINESS SCHOOL

 Discipline of Business Analytics

QBUS6850 Team

 Topics covered
 k-Nearest-Neighbours Algorithm
 Unsupervised learning introduction
 K-means

 References
 Friedman et al., (2001), Chapters 13.1 – 13.3, 14.3
 James et al., (2014), Chapter 10.3
 Bishop, (2006), Chapter 2.5.2, 9.1

2

Learning Objectives

 Understand the intuition of k-NN
 Understand how k-NN works
 Understand the model specification of k-NN
 Understand the classification confusion matrix
 Understand precision and recall
 Understand the intuition of K-means
 Understand the loss function of K-means and how it works
 Understand why and how to do random initialization for K-means

3

k-Nearest-Neighbours Algorithm

4

k-Nearest-Neighbours is a classification algorithm that, in order to determine the class of a point, combines the classes of the k nearest points.

k-NN requires no distribution assumptions.

It is a supervised learning algorithm: classify a point
based on the known classification of other points.

k-NN Classifier

5

 Find k nearest neighbours of x.

 Predict the class of x to be the majority class of the k nearest neighbours.

 Parameter k can be determined using validation.
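As a concrete illustration, here is a minimal scikit-learn sketch of such a classifier. The toy income/saving values and colour labels are made up for illustration only (the course's own scripts are the Lecture04 examples):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: columns are (income, saving); t holds the known class labels
X_train = np.array([[50, 10], [60, 15], [40, 5], [80, 40], [85, 45], [90, 50]])
t_train = np.array(["red", "red", "red", "green", "green", "green"])

# k = 3 nearest neighbours; the majority class (mode) among them is the prediction
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, t_train)

x_new = np.array([[70, 30]])     # a new point to classify
print(knn.predict(x_new))        # predicted class of the new point
```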

k-NN Classifier

How does it work?

6

Training set:

The k-NN algorithm identifies the k nearest neighbours of the blue data point, regardless of their labels.

k-NN Intuition
[Scatter plot of the training data: x1 = income, x2 = saving]

$\mathcal{D} = \{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), (\mathbf{x}_3, t_3), \ldots, (\mathbf{x}_N, t_N)\}$
where each $\mathbf{x}_n = (x_{n1}, x_{n2}) = (\text{income}, \text{saving})$ and each $t_n$ is red or green.

7

k-NN Intuition

Training Set:

Suppose: k = 3
We have red and green labelled
data points

Target:
Predict the class of the blue
data point.

[Scatter plot: x1 = income, x2 = saving]

$\mathcal{D} = \{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), (\mathbf{x}_3, t_3), \ldots, (\mathbf{x}_N, t_N)\}$

8

Training Set:

Suppose: k = 3
We have red and green labelled
data points

Target:
Predict the class of the blue
data point: red class

Always choose ODD values of k for a 2-class problem. Why?

k-NN Intuition
[Scatter plot: x1 = income, x2 = saving]

$\mathcal{D} = \{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), (\mathbf{x}_3, t_3), \ldots, (\mathbf{x}_N, t_N)\}$

9

Training Set:

Now k = 5

Target:
Predict the class of the blue
data point: green class

k-NN Intuition
[Scatter plot: x1 = income, x2 = saving]

$\mathcal{D} = \{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), (\mathbf{x}_3, t_3), \ldots, (\mathbf{x}_N, t_N)\}$

10

k-NN Model

This is a nonparametric model. There is no need for parameter estimation.

Training Set: $\mathcal{D} = \{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), (\mathbf{x}_3, t_3), \ldots, (\mathbf{x}_N, t_N)\}$

The model is defined as

$f(\mathbf{x}, \beta) = \mathrm{Mode}\{\, t_n : \mathbf{x}_n \in N_k(\mathbf{x}) \,\}, \quad \mathbf{x}_n \in \mathcal{D}$

where $N_k(\mathbf{x})$ is the neighbourhood of $\mathbf{x}$ defined by the $k$ closest points in the training dataset, and Mode() is the mode function (taken over the $t_n$ values of those $k$ closest points).

11

k Closest Points

Euclidean distance between $\mathbf{x}_i$ and $\mathbf{x}_j$:

$\left\| \mathbf{x}_i - \mathbf{x}_j \right\| = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T (\mathbf{x}_i - \mathbf{x}_j)}$

There are other distance measures that can be used.
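A small numpy sketch of finding the k closest points with this Euclidean distance (the training array and query point are placeholders):

```python
import numpy as np

def k_nearest_indices(X_train, x_new, k):
    # Euclidean distance sqrt((x_i - x_j)^T (x_i - x_j)) from x_new to every training point
    diffs = X_train - x_new
    dists = np.sqrt(np.sum(diffs ** 2, axis=1))
    return np.argsort(dists)[:k]          # indices of the k closest points, i.e. N_k(x_new)

X_train = np.array([[50.0, 10.0], [60.0, 15.0], [80.0, 40.0], [85.0, 45.0], [90.0, 50.0]])
print(k_nearest_indices(X_train, np.array([70.0, 30.0]), k=3))
```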

12

The k-NN prediction rule for regression:
For a new point $\mathbf{x}_0$, the model predicts the value at $\mathbf{x}_0$ as the average of the values at the $k$ neighbours in $N_k(\mathbf{x}_0)$.

The k-NN prediction rule for classification:
For a new point $\mathbf{x}_0$, the model predicts the class of $\mathbf{x}_0$ as the majority vote (mode) among the $k$ neighbours in $N_k(\mathbf{x}_0)$.
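Both rules can be sketched in a few lines of numpy on top of the distance computation above; the data, the helper name knn_predict, and the use of collections.Counter for the mode are illustrative assumptions, not the course implementation:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, t_train, x_new, k, task="classification"):
    dists = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    idx = np.argsort(dists)[:k]                 # the neighbourhood N_k(x_new)
    if task == "regression":
        return np.mean(t_train[idx])            # regression rule: average of the neighbours' values
    return Counter(t_train[idx]).most_common(1)[0][0]   # classification rule: majority vote (mode)

X = np.array([[50.0, 10.0], [60.0, 15.0], [80.0, 40.0], [85.0, 45.0], [90.0, 50.0]])
t = np.array(["red", "red", "green", "green", "green"])
print(knn_predict(X, t, np.array([70.0, 30.0]), k=3))
```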

13

 If x is one of the training data points, then what is its neighbourhood? x is its own nearest neighbour.

 If k = 1, there is no misclassification on the training data.

 1-NN is more accurate on the training data, but it causes overfitting. The algorithm is also quite sensitive to outliers.

 “Kernel” methods (later in the course) use a generalized notion of neighbourhood but are in essence the same idea as k-NN.

k = 1, 1-Nearest-Neighbor

14

k = 1, 1-Nearest-Neighbor

Each training vector defines a region in space, resulting in a Voronoi partition of the space.

15

Always choose ODD values of k for a 2-class problem.

More generally, k should not be a multiple of the number of classes, to avoid ties when assigning the class.

k=?

16

Loss Function for Classification / Confusion Matrix

17

k-NN Objective

• The target of k-NN?
• Minimize the number of misclassifications on the training set? This can be minimized at k = 1, which might not be a good way.
• There is a k for which bias and variance are balanced optimally.
• How to decide k?
• We can still use model and feature selection techniques, including training & validation & test sets, CV, etc., to see which k is best regarding the target metric, e.g. the misclassification rate (a code sketch follows below).
• Is the misclassification rate always a good metric for all applications?

The misclassification rate is:

$\text{Misclassification rate} = \dfrac{FP + FN}{\text{Total number of predictions}}$
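As the bullets above suggest, k can be chosen by cross-validation over the target metric. A hedged scikit-learn sketch (the synthetic data and the candidate grid of odd k values are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic placeholder data: 200 points with 2 features and a binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(int)

# Try only odd k (2-class problem) and pick the best by 5-fold cross-validation
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, t)
print(search.best_params_, search.best_score_)   # accuracy = 1 - misclassification rate
```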

18

Skewed Data

• In the customer default example, we predict t =1 if default,
and t = 0 if not default.

• Suppose using logistic regression or k-NN algorithm, we get
99% accuracy rate, or 1% misclassification rate (error rate).

• In reality, only approximately 0.5% of customers would default (t = 1).

Is this a good classifier / machine learning model?

19

• What if we build a simpler classifier that predicts everyone to be t = 0 (not default)?

• This classifier will only have a 0.5% misclassification rate.

• Is it a better classifier?

Skewed data!
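A quick numerical check of this point, assuming 1000 customers with a 0.5% default rate (the arrays below are constructed just to match those proportions):

```python
import numpy as np

t_actual = np.zeros(1000, dtype=int)
t_actual[:5] = 1                       # 0.5% of the customers actually default (t = 1)

t_pred = np.zeros(1000, dtype=int)     # "classifier" that predicts nobody defaults
print("accuracy =", np.mean(t_pred == t_actual))   # 0.995, i.e. a 0.5% misclassification rate
```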

20

Confusion Matrix

                        Predicted values
                        0 (-)                  1 (+)
Actual value   0 (-)    True negative (TN)     False positive (FP)
               1 (+)    False negative (FN)    True positive (TP)

The recall, or sensitivity rate, or true positive rate (TPR) is:

$\text{TPR} = \dfrac{TP}{TP + FN} = \dfrac{TP}{\text{Total number of ACTUAL positives}}$

For all customers who actually default, what percentage did we correctly predict as t = 1 (default)? The higher the better.

t =1 represents the rare class that we want to capture, e.g. default

(The denominator TP + FN is the sum of the second ROW of the confusion matrix.)

21

The precision (positive predictive rate) is:

$\text{PPR} = \dfrac{TP}{TP + FP} = \dfrac{TP}{\text{Total number of PREDICTED positives}}$

For all customers we predict as t = 1 (default), what percentage of them actually default? The higher the better.

(The denominator TP + FP is the sum of the second COLUMN of the confusion matrix.)

Why not build a simpler classifier that predicts everyone to be t = 0 (not default)? Its recall will be 0, which is how looking at recall prevents this kind of cheating.

We check both precision and recall instead of just evaluating one.
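A short scikit-learn sketch of computing the confusion matrix, recall and precision; the label vectors are placeholders, not course data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

t_actual = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])   # placeholder actual labels
t_pred   = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1])   # placeholder predictions

# sklearn lays the matrix out as rows = actual (0 then 1), columns = predicted (0 then 1):
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(t_actual, t_pred))
print("recall    =", recall_score(t_actual, t_pred))      # TP / (TP + FN)
print("precision =", precision_score(t_actual, t_pred))   # TP / (TP + FP)
```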

22

 Recall — how good a test is at detecting the
positives. A test can cheat and maximize this by
always returning “positive”.

 Precision – how many of the positively classified points were relevant. A test can cheat and maximize this by returning positive only on the single result it is most confident about.

 The cheating is resolved by looking at both metrics
instead of just one.

Confusion Matrix

23

t actual 0 0 0 0 0 1 0 1 0 0

t predicted-1 1 1 1 1 1 1 1 1 1 1

t predicted-2 0 0 0 0 0 1 0 0 0 0

Calculate the precision and recall for each model.

24

t actual 0 0 0 0 0 1 0 1 0 1
t predicted-A 0 1 1 1 1 1 1 1 1 1
t predicted-B 0 0 0 0 0 0 0 1 0 0

Precision and Recall Tradeoff

[Plot: precision versus recall tradeoff curve, both axes from 0 to 1.0]

For model A: we predict t = 1 even if we are NOT that confident, e.g. a low probability threshold in logistic regression.
Recall = 3/3 = 100%
Precision = 3/9 = 33.33%
High recall, low precision.

For model B: we predict t = 1 only if we are VERY confident, e.g. a high probability threshold in logistic regression.
Recall = 1/3 = 33.33%
Precision = 1/1 = 100%
Low recall, high precision. (Demo only.)
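The recall and precision figures above can be verified directly, using the actual and predicted label vectors from this slide (scikit-learn is assumed to be available):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

t_actual = np.array([0, 0, 0, 0, 0, 1, 0, 1, 0, 1])
t_pred_A = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1])   # low threshold: predicts t = 1 readily
t_pred_B = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])   # high threshold: predicts t = 1 only when very confident

for name, t_pred in [("A", t_pred_A), ("B", t_pred_B)]:
    print(name,
          "recall =", recall_score(t_actual, t_pred),        # A: 1.0,       B: 0.333...
          "precision =", precision_score(t_actual, t_pred))  # A: 0.333..., B: 1.0
```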

25

F1-Score

$\text{F1-Score} = \dfrac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

(the harmonic mean of Precision and Recall)

If a classifier is cheating, then precision or recall is likely to be close to 0, and then the F1-Score will be close to 0.
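A small sketch checking the formula against scikit-learn's f1_score, reusing model A from the precision/recall tradeoff example (the expected value 0.5 follows from Precision = 1/3 and Recall = 1):

```python
import numpy as np
from sklearn.metrics import f1_score

precision, recall = 3 / 9, 3 / 3                 # model A from the previous example
print(2 / (1 / precision + 1 / recall))          # harmonic mean: 0.5

t_actual = np.array([0, 0, 0, 0, 0, 1, 0, 1, 0, 1])
t_pred_A = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1])
print(f1_score(t_actual, t_pred_A))              # sklearn gives the same value, 0.5
```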

To evaluate a classifier:
 Incorporate the model and feature selection techniques, including training & validation & test sets and CV.
 Evaluate the misclassification rate, precision & recall, and the F1-Score.

26

 The confusion matrix for a binary classifier again

 Define the true positive rate (TPR) and the false positive rate
(FPR)

 Both jointly assess the performance of a classifier. The best case would be a high TPR and a low FPR. This means the classifier won't make many errors on positive and negative examples, respectively.

Other Measures for Binary Classifiers

27

                         Predicted values
                         0 (-)                  1 (+)
Actual values   0 (-)    True negative (TN)     False positive (FP)
                1 (+)    False negative (FN)    True positive (TP)

 The Receiver Operating Characteristic (ROC) is a metric to evaluate classifier output quality. It is particularly useful for binary classifiers.

 ROC curves typically feature the true positive rate (TPR) on the Y axis and the false positive rate (FPR) on the X axis. This means that the top left corner of the plot is the “ideal” point: a false positive rate of zero and a true positive rate of one.

 This is not very realistic, but it does mean that a larger Area
Under the Curve (AUC) is usually better.

 ROC curves are typically used in binary classification to study
the output of a classifier. In order to extend ROC curve and
ROC area to multi-class or multi-label classification, it is
necessary to binarize the output. One ROC curve can be
drawn per label.

Receiver Operating Characteristic (ROC)

28

 Take the logistic regression classifier as an example.
 For all the test/validation examples, the logistic regression model produces the probability for each case/example to be Class A (or Class 1).

 A typical decision rule is: if the probability (or score) is larger than 0.5, we classify the case as Class A (or Class 1), otherwise as Class B (or Class 0).

 Why do we use 0.5? In fact, taking any threshold between 0.0 and 1.0, we can make decisions based on the same trained model.

 For each threshold between 0.0 and 1.0, we can make decisions for all the examples, construct a confusion matrix, calculate TPR and FPR, and finally draw a point on the (FPR, TPR) coordinates.
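A minimal sketch of this threshold sweep; the score vector, labels and helper function tpr_fpr are made-up placeholders rather than the course example:

```python
import numpy as np

def tpr_fpr(t_actual, scores, threshold):
    t_pred = (scores >= threshold).astype(int)          # classify as Class 1 if the score >= threshold
    tp = np.sum((t_pred == 1) & (t_actual == 1))
    fp = np.sum((t_pred == 1) & (t_actual == 0))
    fn = np.sum((t_pred == 0) & (t_actual == 1))
    tn = np.sum((t_pred == 0) & (t_actual == 0))
    return tp / (tp + fn), fp / (fp + tn)                # (TPR, FPR)

t_actual = np.array([0, 0, 1, 0, 1, 1, 0, 1])                      # placeholder labels
scores   = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.45, 0.6])    # placeholder predicted probabilities

for thr in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(thr, tpr_fpr(t_actual, scores, thr))           # one (TPR, FPR) point per threshold
```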

Receiver Operating Characteristic (ROC)

29

 In the test dataset: 100 positive cases (Class 1) and 100 negative cases (Class 0).

 Take the threshold as 0.0; then all the testing cases will be classified as positive. Why? Hence TPR = 1.0 and FPR = 1.0.

 Take the threshold as 1.0; then all the testing cases will be classified as negative. Why? Hence TPR = 0.0 and FPR = 0.0.

 Threshold = 0.25 produces TPR = 0.9 and FPR = 0.60.

ROC: Example

30

Threshold = 0.25

                      Predicted
                      0 (-)   1 (+)
True (test)   0 (-)    40      60
              1 (+)    10      90

 Threshold = 0.50 produces TPR=0.80 and FPR=0.20

 Threshold = 0.75 produces TPR=0.40 and FPR=0.10

ROC: Example

31

Threshold = 0.50

                      Predicted
                      0 (-)   1 (+)
True (test)   0 (-)    80      20
              1 (+)    20      80

Threshold = 0.75

                      Predicted
                      0 (-)   1 (+)
True (test)   0 (-)    90      10
              1 (+)    60      40

 This finally gives us the ROC curve. sklearn can automate the process of constructing it. See Lecture04_Example00.py.
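Lecture04_Example00.py is the course's own script; a generic sketch of the scikit-learn calls it presumably relies on (the labels and scores here are placeholders) looks like this:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

t_actual = np.array([0, 0, 1, 0, 1, 1, 0, 1])                      # placeholder labels
scores   = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.45, 0.6])    # placeholder predicted probabilities

fpr, tpr, thresholds = roc_curve(t_actual, scores)   # one (FPR, TPR) point per candidate threshold
print("AUC =", auc(fpr, tpr))                         # area under the ROC curve
```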

ROC: Example

32

[ROC curve plot: FPR on the X axis (0 to 1.0), TPR on the Y axis (0 to 1.0), with the points for thresholds 0.0, 0.25, 0.5, 0.75 and 1.0 marked]

Python Example
(Lecture04_Example01.py)

33

Unsupervised Learning

34

Supervised Learning

Training Set: $\mathcal{D} = \{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), (\mathbf{x}_3, t_3), \ldots, (\mathbf{x}_N, t_N)\}$

N = ?

[Scatter plot of the labelled training data: x1 = income, x2 = saving]

35

[Scatter plot of unlabelled data: x1 = income, x2 = saving]

Unsupervised Learning

Training Set:

In unsupervised learning we do not make the distinction between the response/target variable (t) and the feature (x).

d = ?
Number of features = ?

Clustering

$\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_N\}$

36

Applications of Clustering

 Customer segmentation

 Social network analysis

 Market segmentation

 Image clustering

37

K-means Algorithm

38

K-means is a clustering algorithm that tries to partition a
set of points into K sets (clusters) such that the points in
each cluster tend to be near each other.

It is an unsupervised learning algorithm, since the data are not labelled.

K-means

39

K-means Input

K (number of clusters)

Training Set:

How to choose K?
How to select the starting values of cluster centroids?

K-means is an iterative algorithm that contains two steps:
i. the cluster assignment step
ii. the centroid move step

$\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_N\}$

40

How K-means works

[Scatter plot of unlabelled data: x1 = income, x2 = saving]

41

[Scatter plot: x1 = income, x2 = saving]

Randomly pick two (why two?) cluster centroids.

The first of the two steps in the K-means loop is the cluster assignment step.

Go through each of the red observations and decide whether each of them is closer to the green cluster centroid or to the blue cluster centroid.

Loop 1

42

Cluster Assignment Step.
This step goes through the data set and colours each of the points either green or blue, depending on whether it is closer to the green cluster centroid or to the blue cluster centroid.

[Scatter plot: x1 = income, x2 = saving]

Loop 1

43

Centroid move step.

[Scatter plot: x1 = income, x2 = saving]

Take the two cluster centroids (green and blue crosses), and move them to the average of the points coloured with the same colour.

Specifically:
Check all the green points and
compute the average, and move the
green cluster centroid to the
average. Do this for the blue points
and centroid as well.

Loop 1

44

[Scatter plot: x1 = income, x2 = saving]

New Centroids’ locations.

Loop 1

45

Cluster Assignment Step.

[Scatter plot: x1 = income, x2 = saving]

Loop 2

46

Loop 2

Centroid Move Step.

[Scatter plot: x1 = income, x2 = saving]

47

New Centroids’ locations.

Loop 2

[Scatter plot: x1 = income, x2 = saving]

48

Loop 3
Cluster Assignment Step.

[Scatter plot: x1 = income, x2 = saving]

49

Loop 3
[Scatter plot: x1 = income, x2 = saving]

Centroid Move Step.

50

Loop 3

[Scatter plot: x1 = income, x2 = saving]

New Centroids’ locations.
Keep running additional iterations of K-means until the cluster centroids no longer change and the cluster assignments of the observations no longer change.

51

$C_n$: the index of the cluster (1, 2, …, K) to which data point $\mathbf{x}_n$ is currently assigned.
$\mu_k$: cluster centroid k. Total number of clusters: K.
$\mu_{C_n}$: the cluster centroid to which data point $\mathbf{x}_n$ was assigned.

Suppose $\mathbf{x}_n$ is allocated to cluster 2 ($C_n = 2$); then $\mu_{C_n} = \mu_2$.

K-means Loss Function

Loss function (distortion), to be minimized:

$L(\boldsymbol{\beta}) = \sum_{n=1}^{N} \left\| \mathbf{x}_n - \mu_{C_n} \right\|^2$

where $\left\| \mathbf{x}_n - \mu_{C_n} \right\|$ is the Euclidean distance between $\mathbf{x}_n$ and its assigned centroid, and $\boldsymbol{\beta}$ collects the cluster indices and the cluster centroids.
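A small numpy sketch of evaluating this distortion for a given assignment; the data points, centroids and assignment vector are illustrative:

```python
import numpy as np

def distortion(X, C, mu):
    # L(beta) = sum_n || x_n - mu_{C_n} ||^2, where C[n] is the cluster index of point n
    return np.sum(np.sum((X - mu[C]) ** 2, axis=1))

X  = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])   # data points
mu = np.array([[1.0, 2.0], [8.5, 9.0]])                            # two cluster centroids
C  = np.array([0, 0, 1, 1])                                        # assignment C_n for each point
print(distortion(X, C, mu))
```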

52

K-means Algorithm

Randomly initialize K cluster centroids $\mu_1, \mu_2, \mu_3, \ldots, \mu_K$.
Loop the two-step process until convergence:

Cluster Assignment Step:
for n in 1:N
    $C_n$: assign observation $\mathbf{x}_n$ to the closest centroid $\mu_k$.

Centroid Move Step:
for k in 1:K
    $\mu_k$: the mean of the data points assigned to cluster k becomes the new cluster centroid.

In the cluster assignment step, we fix $\mu_1, \mu_2, \mu_3, \ldots, \mu_K$ and minimize the loss function $L(\boldsymbol{\beta})$ with respect to $C_1, C_2, C_3, \ldots, C_N$.

In the centroid move step, we fix $C_1, C_2, C_3, \ldots, C_N$ and minimize the loss function $L(\boldsymbol{\beta})$ with respect to $\mu_1, \mu_2, \mu_3, \ldots, \mu_K$.
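A compact numpy sketch of this two-step loop. The random data, the fixed iteration cap, and the choice of initializing the centroids at K randomly chosen training points are assumptions for illustration; in practice sklearn.cluster.KMeans implements the same algorithm:

```python
import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # One common random initialization: pick K distinct training points as the initial centroids
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: with mu fixed, set C_n to the index of the closest centroid
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # N x K distance matrix
        C = np.argmin(dists, axis=1)
        # Centroid move step: with C fixed, move each centroid to the mean of its assigned points
        # (keeping the old centroid when a cluster is empty is an extra assumption here)
        new_mu = np.array([X[C == k].mean(axis=0) if np.any(C == k) else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):    # converged: centroids no longer change
            break
        mu = new_mu
    return C, mu

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=0.0, size=(20, 2)), rng.normal(loc=5.0, size=(20, 2))])
C, mu = k_means(X, K=2)
print(mu)   # two centroids, one near (0, 0) and one near (5, 5)
```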

53

How to initialize the K-means?

Random initialization:
K