Where are we now?
Classification
• Introduction to classification
• Comparing Classification and Regression
• Decision tree classification
• Will make connections to entropy
• k nearest neighbour classification
3
Classification vs Regression
4
Classification
• Classification
• Making predictions about the future, based on historical data
• Predictive modelling/classification
• The sexy part of data science ?!
• A foundation for machine intelligence, AI, machines replacing humans and taking our jobs ….
5
Classification – Example 1
Predicting disease from microarray data.

Training data:

Person    Gene 1  Gene 2  Gene 3  ...  Gene n  Develop cancer <1 year
Person 1  2.3     1.1     0.3     ...  2.1     1
Person 2  3.2     0.2     1.2     ...  1.1     1
Person 3  1.9     3.8     2.7     ...  0.2     1
...       ...     ...     ...     ...  ...     ...
Person m  2.8     3.1     2.5     ...  3.4     0

Test data:

Person    Gene 1  Gene 2  Gene 3  ...  Gene n  Develop cancer <1 year
Person X  2.1     0.9     0.6     ...  1.9     ?
6
Classification – Example 2
Animal classification
Test data
https://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
7
Classification – Example 3
Banking: classifying borrowers

Test data:

Tid  Home Owner  Marital Status  Annual Income  Defaulted Borrower
11   No          Single          55K            ?
8
Classification – Example 4
Detecting tax cheats
Training data (Refund and Marital Status are categorical attributes, Taxable Income is continuous, Cheat is the class label):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test data:

Tid  Refund  Marital Status  Taxable Income  Tax Cheat
11   Yes     Married         125K            ?
9
Classification: Definition
• Given a collection of records (the training set)
• Each record contains a set of attributes; one of them is the class label
• Find a predictive model for each class label as a function of the values of all attributes, i.e. y = f(x1, x2, ..., xn)
  • y: discrete value, the target variable
  • x1, ..., xn: attributes, predictors
  • f: the predictive model (a tree, a rule, a mathematical formula)
• Goal: previously unseen records should be assigned a class as accurately as possible
• A test set is used to determine the accuracy of the model, i.e. the full data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
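The train/test workflow above can be sketched in a few lines of Python. The rule-based model f below is entirely hypothetical, written by hand only to illustrate y = f(x1, ..., xn) and test-set accuracy; a real model would be learned from the training set.

```python
# Hypothetical hand-written model f for the tax-cheat example (illustration
# only -- a real f would be learned from the training set).
def f(refund, marital_status, taxable_income):
    """Predict the class label ('Yes'/'No' for Cheat)."""
    if refund == "Yes" or marital_status == "Married":
        return "No"
    return "Yes" if taxable_income > 80 else "No"

# Held-out test records: (Refund, Marital Status, income in K, true class).
test_set = [
    ("No", "Single", 90, "Yes"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Married", 75, "No"),
]

correct = sum(f(r, m, inc) == y for r, m, inc, y in test_set)
accuracy = correct / len(test_set)
print(accuracy)  # 1.0 on these three records
```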
10
Classification framework
11
Regression: Recall Definition
• Given a collection of records (the training set)
• Each record contains a set of attributes and one target variable
• Learn a predictive model from the data, i.e. y = f(x1, x2, ..., xn)
  • y: the target variable is continuous (the key difference from classification)
  • x1, ..., xn: attributes, predictors
12
Regression example 1
Predicting ice-creams consumption from temperature: y = f(x)
?
13
Regression example 2
Predicting the activity level of a target gene
Training data (target: Gene n+1):

Person    Gene 1  Gene 2  Gene 3  ...  Gene n  Gene n+1
Person 1  2.3     1.1     0.3     ...  2.1     3.2
Person 2  3.2     0.2     1.2     ...  1.1     1.1
Person 3  1.9     3.8     2.7     ...  0.2     0.2
...       ...     ...     ...     ...  ...     ...
Person m  2.8     3.1     2.5     ...  3.4     0.9

Test data:

Person      Gene 1  Gene 2  Gene 3  ...  Gene n  Gene n+1
Person m+1  2.1     0.9     0.6     ...  1.9     ?
14
Quiz
• Is it classification or regression?
• Prediction of rain or no rain tomorrow based on today’s rain data
• Prediction of the amount of rain tomorrow based on today’s rain data
• Prediction of exam mark based on study hours
• Prediction of exam pass or fail based on study hours
• Prediction of salary based on degree results
• Prediction of complications of surgery based on pre-op medical data
15
Decision Trees
16
Decision Tree example: Prediction of tax cheats
Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (the splitting attributes are Refund, MarSt and TaxInc):

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
17
Decision Tree example: Prediction of tax cheats 2

Training Data: the same tax-cheat table as on the previous slide.

Model: Decision Tree

MarSt?
  Married -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There can be more than one tree that fits the same data!
18
Decision Tree Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: a tree induction algorithm learns a model (a decision tree) from the training set.

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: apply the learned model to the test set to predict each unknown class label.
19
Apply DT Model to Test Data (1)
Start from the root of the tree.

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
20
Apply DT Model to Test Data (2)
Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

(Same tree as on the previous slide; the traversal advances one node per slide.)
21
Apply DT Model to Test Data (3)
Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

(Same tree as on the previous slide; the traversal advances one node per slide.)
22
Apply DT Model to Test Data (4)
Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

(Same tree as on the previous slide; the traversal advances one node per slide.)
23
Apply DT Model to Test Data (5)
Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

(Same tree as on the previous slide; the traversal advances one node per slide.)
24
Apply DT Model to Test Data (6)
Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

Assign Cheat to “No”.
25
Apply DT Model to Test Data (7)
Start from the root of the tree.

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Single          100K            ?

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
26
Decision Trees
Decision tree
• A flow-chart-like tree structure
• Internal nodes denote a test on an attribute
• Branches represent the outcomes of the test
• Leaf nodes represent class labels
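One way to make the flow-chart structure concrete is to encode internal nodes, branches and leaves as nested dictionaries. This encoding is our own (not code from the lecture); the tree is the tax-cheat model from the earlier slides, with the continuous test represented as a hypothetical boolean attribute "TaxInc>80K".

```python
# Tax-cheat decision tree as nested dicts: internal nodes test an attribute,
# branches are the test outcomes, leaves are class labels.
subtree_income = {"attr": "TaxInc>80K", "branches": {True: "Yes", False: "No"}}
tree = {
    "attr": "Refund",
    "branches": {
        "Yes": "No",                       # leaf: Cheat = No
        "No": {
            "attr": "MarSt",
            "branches": {
                "Married": "No",
                "Single": subtree_income,
                "Divorced": subtree_income,
            },
        },
    },
}

def classify(node, record):
    while isinstance(node, dict):          # follow branches until a leaf
        node = node["branches"][record[node["attr"]]]
    return node

record = {"Refund": "No", "MarSt": "Married", "TaxInc>80K": False}
print(classify(tree, record))  # No
```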
27
Decision Tree Algorithm
Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: a tree induction algorithm learns a model (a decision tree) from the training set.

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: apply the learned model to the test set to predict each unknown class label.
28
Decision Tree Algorithm (2)
• Many Algorithms:
• We will look at a representative one
(Hunt’s algorithm)
• How would you code an algorithm that automatically constructs a decision tree for the tax cheat data?
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
29
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach node t
• General Procedure:
• If Dt contains only records that belong to the same class yt, then t is a leaf node labeled as yt
• If Dt is an empty set, then t is a leaf node labeled by the default class yd
• If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
• Recursively apply the procedure to each subset
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Initially Dt is the full training set at the root; the first attribute test splits it on Refund: Yes / No.)
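The general procedure above can be sketched as a recursive function. This is a simplified Hunt's algorithm: it splits on attributes in a fixed order instead of choosing the best split, and all helper names are our own.

```python
# Simplified Hunt's algorithm: records are (attribute-dict, label) pairs.
def hunt(records, attributes, default_class="No"):
    if not records:                          # empty D_t: default-class leaf
        return default_class
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                # pure D_t: leaf labelled y_t
        return labels[0]
    if not attributes:                       # no tests left: majority class
        return max(set(labels), key=labels.count)
    attr = attributes[0]                     # naive; real trees pick the best split
    node = {"attr": attr, "children": {}}
    for value in {rec[attr] for rec, _ in records}:
        subset = [(rec, lab) for rec, lab in records if rec[attr] == value]
        node["children"][value] = hunt(subset, attributes[1:], default_class)
    return node

def predict(node, record):
    while isinstance(node, dict):
        node = node["children"][record[node["attr"]]]
    return node

records = [
    ({"Refund": "Yes", "MarSt": "Single"}, "No"),
    ({"Refund": "No", "MarSt": "Married"}, "No"),
    ({"Refund": "No", "MarSt": "Single"}, "Yes"),
]
tree = hunt(records, ["Refund", "MarSt"])
print(predict(tree, {"Refund": "No", "MarSt": "Single"}))  # Yes
```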
30
Hunt’s Algorithm Example

If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(First split: Refund? Yes / No, with Dt the set of records reaching each child.)
31
Stopping condition: leaf node
• If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
• If Dt is an empty set, then t is a leaf node labeled as the default class yd

(Illustrated on the split Refund? Yes / No.)
32
Decision Tree splitting example
Splitting attributes:

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
33
Tree induction (creation)
34
Tree Induction
• Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
• Determine when to stop splitting
  • When a node only has instances of a single class
35
How to Specify Test Conditions?
• Depends on attribute type
  • Nominal
  • Ordinal
  • Continuous
• Depends on number of ways to split
  • 2-way split
  • Multi-way split
36
Splitting Nominal Attributes
• Multi-way split: use as many partitions as there are distinct values

  CarType -> {Sports}, {Family}, {Luxury}

• Binary split: divide the values into two subsets, then find the optimal partitioning

  CarType -> {Sports, Luxury} vs {Family}   OR   CarType -> {Family, Luxury} vs {Sports}
37
Splitting Ordinal Attributes
• Multi-way split: use as many partitions as there are distinct values

  Size -> {Small}, {Medium}, {Large}

• Binary split: divide the values into two subsets, then find the optimal partitioning

  Size -> {Small, Medium} vs {Large}   OR   Size -> {Medium, Large} vs {Small}

• What about this split? Size -> {Small, Large} vs {Medium}
  (It does not respect the ordering of the ordinal attribute.)
38
Splitting Continuous Attributes
• Different ways of handling
  • Discretisation to form an ordinal categorical attribute
    • Static – discretise once at the beginning
    • Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles) or clustering
  • Binary decision: (A < v) or (A ≥ v)
    • consider all possible splits and find the best cut
    • can be more computationally intensive
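Equal-frequency bucketing, one of the dynamic discretisation options listed above, can be sketched as follows (the helper name and the use of the tax-cheat incomes are our own illustration):

```python
# Split sorted values into b buckets of (roughly) equal size.
def equal_frequency_bins(values, b):
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // b:((i + 1) * n) // b] for i in range(b)]

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # taxable incomes (K)
bins = equal_frequency_bins(incomes, 2)
print(bins)  # [[60, 70, 75, 85, 90], [95, 100, 120, 125, 220]]
```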
39
Splitting Continuous Attributes
(i) Binary split:   Taxable Income > 80K?  ->  Yes / No

(ii) Multi-way split:   Taxable Income?  ->  < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

(Both splits are illustrated on the tax-cheat training table.)
40
Tree Induction
• Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
• Determine when to stop splitting
  • When a node only has instances of a single class
41
How to determine the best split?

Before splitting: 10 records of class 0, 10 records of class 1.

Three candidate test conditions:
• Own Car?    Yes: C0: 6, C1: 4    No: C0: 4, C1: 6
• Car Type?   Family: C0: 1, C1: 3    Sports: C0: 8, C1: 0    Luxury: C0: 1, C1: 7
• Student ID?   c1 … c10: C0: 1, C1: 0 each    c11 … c20: C0: 0, C1: 1 each

Which test condition is better?
42
How to determine the best split
• Greedy approach: nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:

  C0: 5, C1: 5   (non-homogeneous, high degree of impurity)
  C0: 9, C1: 1   (homogeneous, low degree of impurity)
43
Measures of Node Impurity
• Entropy
  • We have seen entropy in the feature correlation section, where it was used to measure the amount of uncertainty in an outcome
  • Entropy can also be viewed as an impurity measure
  • The set {A,B,C,A,A,A,A,A} has low entropy: low uncertainty and low impurity
  • The set {A,B,C,D,B,E,A,F} has high entropy: high uncertainty and high impurity
44
Node Impurity Criteria based on entropy
Entropy (H) at a given node t:

  H(t) = − Σ_j p(j|t) log₂ p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t.)

Measures the homogeneity of a node:
• Maximum (log₂ nc) when records are equally distributed amongst all classes (nc is the number of classes)
• Minimum (0.0) when all records belong to one class
45
Examples of computing Entropy
Using H(t) = − Σ_j p(j|t) log₂ p(j|t):

C1: 0, C2: 6    P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                Entropy = − 0 log₂ 0 − 1 log₂ 1 = − 0 − 0 = 0

C1: 1, C2: 5    P(C1) = 1/6, P(C2) = 5/6
                Entropy = − (1/6) log₂ (1/6) − (5/6) log₂ (5/6) = 0.65

C1: 2, C2: 4    P(C1) = 2/6, P(C2) = 4/6
                Entropy = − (2/6) log₂ (2/6) − (4/6) log₂ (4/6) = 0.92
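A small helper reproduces these numbers (a sketch of our own: per-class record counts in, entropy out, with 0 · log 0 treated as 0):

```python
import math

def entropy(counts):
    """H(t) = -sum_j p(j|t) * log2 p(j|t), from per-class record counts."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))            # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```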
46
Question: What is the entropy?
For a node with C1: 13, C2: 20:

P(C1) = ?    P(C2) = ?

Entropy = H(t) = − Σ_j p(j|t) log₂ p(j|t) = ?
47
Question: What is the entropy?
For a node with C1: 13, C2: 20 (33 records in total):

P(C1) = 13/33    P(C2) = 20/33

Entropy = − ((13/33) log₂ (13/33) + (20/33) log₂ (20/33)) = 0.97
48
How good is a Split?
Compare the impurity (entropy) of the parent node (before splitting) with the weighted impurity (entropy) of the children nodes (after splitting):

  Gain = H(Parent) − H(Parent | Children)
       = H(Parent) − Σ_{j=1}^{k} (N(v_j) / N) · H(v_j)

where H(v_j) is the impurity measure of child node v_j, j indexes the children, N(v_j) is the number of data points in child node v_j, and N is the number of data points in the parent node. The larger the gain, the better.
49
How good is a Split?
  Gain = H(Parent) − H(Parent | Children)
       = H(Parent) − Σ_{j=1}^{k} (N(v_j) / N) · H(v_j)

Note: the information gain is equivalent to the mutual information between the class feature and the feature being split on. Thus splitting using the information gain chooses the feature that shares the most information with the class variable.
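The gain formula translates directly to code. The example below reproduces the Own Car vs Car Type comparison from the surrounding slides; the helper names are our own.

```python
import math

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Gain = H(parent) - sum_j (N(v_j)/N) * H(v_j)."""
    n = sum(parent_counts)
    remainder = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - remainder

# Parent: 10 records of class 0, 10 of class 1. Children given as [C0, C1].
gain_own_car = information_gain([10, 10], [[6, 4], [4, 6]])
gain_car_type = information_gain([10, 10], [[1, 3], [8, 0], [1, 7]])
print(round(gain_own_car, 3), round(gain_car_type, 2))  # 0.029 0.62
```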
50
How good is a Split?
Three candidate test conditions (10 records of class 0 and 10 of class 1 before splitting):
• Own Car?    Yes: C0: 6, C1: 4    No: C0: 4, C1: 6
• Car Type?   Family: C0: 1, C1: 3    Sports: C0: 8, C1: 0    Luxury: C0: 1, C1: 7
• Student ID?   c1 … c10: C0: 1, C1: 0 each    c11 … c20: C0: 0, C1: 1 each
51
How to determine the best split?

Before splitting: 10 records of class 0, 10 records of class 1 (C0: 10, C1: 10).

• Own Car?    Yes: C0: 6, C1: 4    No: C0: 4, C1: 6
• Car Type?   Family: C0: 1, C1: 3    Sports: C0: 8, C1: 0    Luxury: C0: 1, C1: 7
• Student ID?   c1 … c10: C0: 1, C1: 0 each    c11 … c20: C0: 0, C1: 1 each

Which test condition is better?
– Compute the gain of both splits
– Choose the one with the larger gain

52
How to determine the best split?

Before splitting: 10 records of class 0, 10 records of class 1.

• Own Car?    Yes: C0: 6, C1: 4    No: C0: 4, C1: 6    Information gain = 0.029
• Car Type?   Family: C0: 1, C1: 3    Sports: C0: 8, C1: 0    Luxury: C0: 1, C1: 7    Information gain = 0.62

We should choose Car Type as the better split.

53
Creating a decision tree
• Calculate the information gain of [Left Child], [Right Child] for all candidate splits:
  • Refund: [Yes], [No]
  • Marital Status: [Single], [Married], [Divorced]
  • Taxable Income:
    • [60, 60], (60, 220]
    • [60, 70], (70, 220]
    • [60, 75], (75, 220]
    • [60, 85], (85, 220]
    • [60, 90], (90, 220]
    • [60, 95], (95, 220]
    • [60, 100], (100, 220]
    • [60, 120], (120, 220]
    • [60, 125], (125, 220]
• Choose the feature + split with the highest information gain and use this as the root node and its split
• Recurse, terminating when a node consists only of Cheat = No or only of Cheat = Yes.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
54
Question 2c) from 2016 exam
Given a dataset with two classes, A and B, suppose the root node of a decision tree has 50 instances of class A and 150 instances of class B. Consider a candidate split of this root node into two children, the first with (25 class A and 25 class B), the second with (25 class A and 125 class B). Write a formula to measure the utility of this split using the entropy criterion. Explain how this formula helps measure split utility.
55
Solution:

Root: A: 50, B: 150.   Left child: A: 25, B: 25.   Right child: A: 25, B: 125.

Entropy(root) = −(50/200) log(50/200) − (150/200) log(150/200)

Entropy(left child) = −(25/50) log(25/50) − (25/50) log(25/50)

Entropy(right child) = −(25/150) log(25/150) − (125/150) log(125/150)
Split utility = Information Gain
  = Entropy(root) − Entropy(root | split)
  = Entropy(root) − [(50/200) × Entropy(left child) + (150/200) × Entropy(right child)]

This measures the reduction in impurity achieved by the split: the larger the gain, the better the split separates the two classes.
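Plugging the numbers in (a sketch using log base 2; the exam answer itself only requires the formula):

```python
import math

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

h_root = entropy([50, 150])
h_split = (50 / 200) * entropy([25, 25]) + (150 / 200) * entropy([25, 125])
gain = h_root - h_split
print(round(h_root, 3), round(gain, 3))  # 0.811 0.074
```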
61
Decision trees: advantages and disadvantages
• Advantages
• Easy to interpret
• Relatively efficient to construct
• Fast for deciding about a test instance
• Disadvantages
  • A simple greedy construction strategy, producing a set of if-then rules. Sometimes this is too simple for data with complex structure:
    • “Everything should be as simple as possible, but not simpler”
  • For complex datasets, the tree might grow very big and difficult to understand

62
K-nearest neighbour classifier
63
Nearest Neighbour Classifier
Basic idea: “If it walks like a duck and quacks like a duck, then it’s probably a duck”
(Figure: given a test record, compute its distance to all training records, then choose the k “nearest” records.)
64
Nearest-Neighbour Classifier
Classifying an unknown record requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbours to retrieve

To classify an unknown record:
1. Compute its distance to the training records
2. Identify the k nearest neighbours
3. Use the class labels of the nearest neighbours to determine the class label of the unknown record (e.g., by taking a majority vote)
65
Definition of Nearest Neighbour
(Figure: (a) the 1-nearest neighbour, (b) the 2 nearest neighbours, (c) the 3 nearest neighbours of a record x.)

The k-nearest neighbours of a record x are the data points that have the k smallest distances to x.
66
Distance measure
• Compute the distance between two points p = (p1, p2, …) and q = (q1, q2, …)
  • Euclidean distance: d(p, q) = √( Σ_i (p_i − q_i)² )
  • Can also use the Pearson coefficient (a similarity measure)
• Determine the class from the nearest-neighbour list
  • take the majority vote of class labels among the k nearest neighbours
  • weigh each vote according to distance, e.g. w = 1/d²
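The whole classifier fits in a few lines. This is a sketch with Euclidean distance and an unweighted majority vote; the toy data and function names are our own.

```python
import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    """train: list of (feature_vector, label) pairs."""
    def dist(p, q):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    neighbours = sorted(train, key=lambda rec: dist(rec[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]       # majority class among the k nearest

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((9, 8), "B")]
print(knn_predict(train, (1.5, 1.5), k=3))  # A
```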
67
1-nearest neighbour

The Voronoi diagram defines the classification boundary: each region takes the class of the training point it contains (e.g. the green area takes the class of the green point).
68
K-Nearest Neighbour classifier
Choosing a value for k:
• If k is too small, the classifier is sensitive to noise points
• If k is too large, the neighbourhood may include points from other classes
http://2.bp.blogspot.com/-EK4tA2525EM/U-c6Q4jJuwI/AAAAAAAADjw/hdqRXuunpnQ/s1600/knn.png
69
Points to remember
• Understand what is the difference between classification and regression and why it is useful to build models for these tasks
• Understand how a decision tree may be used to make predictions about the class of a test instance
• Understand the key steps in building a decision tree: • how to split the instances
• how to specify the attribute test condition • how to determine the best split
• how to decide when to stop splitting
70
Points to remember
• Understand the operation and rationale of the k nearest neighbour algorithm for classification
• Understand the advantages and disadvantages of using the k nearest neighbour or decision tree for classification
71
Where are we now?
2
Outline
• Clustering algorithms
  • K-means
  • Visualisation of clustering tendency (VAT)
  • Hierarchical clustering
3
Data in higher dimensions
• Difficult to visualise
• How can we determine what the significant groups / segments / communities are?
• Can understand the data better
• Apply separate interventions to each group (e.g. marketing campaign)
4
What is Cluster Analysis?
We will be looking at two classic clustering algorithms • K-means
• Hierarchical clustering
Intra-cluster distances are minimized
Inter-cluster distances are maximized
Figure from Tan, Steinbach and Kumar 2004
5
Quality: What Is a Good Clustering?
• A good clustering method will produce high-quality clusters
  • Objects within the same cluster are close together
  • Objects in different clusters are far apart
• Clustering is a major task in data analysis and visualisation, useful not just for outlier detection, e.g.
  • Market segmentation
  • Search engine result presentation
  • Personality types
6
Cambridge Analytica
• Infer people’s personality from their Facebook likes (correlating likes with an online personality test) – psychographics
• “Kosinski proved that on the basis of an average of 68 Facebook “likes” by a user, it was possible to predict their skin color (with 95 percent accuracy), their sexual orientation (88 percent accuracy), and their affiliation to the Democratic or Republican party (85 percent). But it didn’t stop there. Intelligence, religious affiliation, as well as alcohol, cigarette and drug use, could all be determined.”
7
Cambridge Analytica
• “before long, he was able to evaluate a person better than the average work colleague, merely on the basis of ten Facebook “likes”. Seventy “likes” were enough to outdo what a person’s friends knew, 150 what their parents knew, and 300 “likes” what their partner knew. More “likes” could even surpass what a person thought they knew about themselves.”
• Might then target clusters (segments) of people with certain personality traits, with appropriate political messages
• “For a highly neurotic and conscientious audience the threat of a burglary—and the insurance policy of a gun.”
• Quotes taken from following article https://motherboard.vice.com/en_us/article/how-our-likes-helped-trump-win
8
Pre-requisite: Distance functions
• Clustering methods are typically distance based. Represent each object / instance as a vector (a row in the data) and then compute Euclidean distance between pairs of vectors
• Commonly normalise (scale) each attribute into range [0,1] via a pre-processing step before computing distances
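A sketch of that pre-processing step: min-max scaling each attribute (column) into [0, 1]. The function name is our own.

```python
def min_max_scale(rows):
    """Scale each attribute (column) of a list-of-rows table into [0, 1]."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1                # guard against constant columns
        scaled_cols.append([(x - lo) / span for x in col])
    return [list(row) for row in zip(*scaled_cols)]

data = [[5, 10, 15], [10, 5, 10], [20, 20, 20]]
scaled = min_max_scale(data)
print(scaled[2])  # [1.0, 1.0, 1.0]
```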
9
How to obtain the clusters?
• Need to assign each object to exactly one cluster
• Each cluster can be summarised by its centroid (the average of all its objects)
10
K-Means Clustering
11
K-Means: The best-known clustering method
Given parameter k, the k-means algorithm is implemented in four steps:
1. Randomly select k seed points as the initial cluster centroids
2. Assign each object to the cluster with the nearest centroid
3. Compute the centroids of the clusters of the current partitioning (the centroid is the centre, i.e., mean point, of the cluster)
4. Go back to Step 2; stop when the assignment does not change
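The four steps map directly onto a short function. A sketch of our own for 2-D points: random seeding as in step 1, so different seeds give different results.

```python
import math
import random

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # 1. random seed points
    assignment = None
    while True:
        new = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
               for p in points]                # 2. assign to nearest centroid
        if new == assignment:                  # 4. stop when assignments settle
            return centroids, assignment
        assignment = new
        for c in range(k):                     # 3. recompute the centroids
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)  # e.g. [0, 0, 0, 1, 1, 1] (cluster ids depend on the seeding)
```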
12
K-means – Example
1. Ask user how many clusters they would like (e.g. K=5)
(Example from http://www.autonlab.org/tutorials/kmeans11.pdf)
13
K-means – Example
1. Ask user how many clusters they would like (e.g. K=5)
2. Randomly pick K cluster centroids
14
K-means – Example
1. Ask user how many clusters they would like (e.g. K=5)
2. Randomly pick K cluster centroids
3. For each data point find out which centroid it is closest to. (Thus each centroid “owns” a set of datapoints)
15
K-means – Example
1. Ask user how many clusters they would like (e.g. K=5)
2. Randomly pick K cluster centroids
3. For each data point find out which centroid it is closest to. (Thus each centroid “owns” a cluster of datapoints)
4. Each cluster finds the new centroid of the points in it
16
K-means – Example
1. Ask user how many clusters they would like (e.g. K=5)
2. Randomly pick K cluster centroids
3. For each data point find out which centroid it is closest to. (Thus each centroid “owns” a cluster of datapoints)
4. Each cluster finds the new centroid of the points in it
5. New centroids => new boundaries
6. Repeat 3-5 until no further changes
17
Understanding the Algorithm
For which dataset does k-means require fewer iterations? (k = 2)
Dataset 1 Dataset 2
18
K-means: Further details
• Typically choose the initial seed points randomly
  • Different runs of the algorithm will produce different results
• Closeness is measured by Euclidean distance
  • Can also use other distance functions (e.g. correlation)
• The algorithm can be shown to converge (to a local optimum)
  • Typically does not require many iterations
19
K-Means Interactive demos
• http://syskall.com/kmeans.js/
• http://tech.nitoyon.com/en/blog/2013/11/07/k-means/
• https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
20
An application:
Cluster-based outlier detection
• An outlier is expected to be far away from any groups of normal objects
• Each object is associated with exactly one cluster and its outlier score is equal to the distance from its cluster centre
• Score = size of circle in following example
21
Clustering based outlier detection 3 clusters
22
Clustering based outlier detection 6 clusters
23
Clustering based outlier detection 9 clusters
24
Visual Assessment of Clustering Tendency (VAT)
25
Visual Assessment of Clustering Tendency (VAT)
• How many clusters are in the data? How big are they? What is the likely membership of objects in each cluster?
• The k parameter for k-means
• One solution: visually determine the clustering structure by inspecting a heat map
• Represent datasets in an n*n image format
• Applicable for many different types of object data
26
What is a (dis)similarity matrix?
Object id  Feature1  Feature2  Feature3
1          5         10        15
2          10        5         10
3          20        20        20

Compute all pairwise distances between objects. This gives a dissimilarity matrix:

Object  1     2     3
1       0     8.7   18.7
2       8.7   0     20.6
3       18.7  20.6  0
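Computing the matrix above directly (a sketch; math.dist needs Python 3.8+):

```python
import math

def dissimilarity_matrix(rows):
    """All pairwise Euclidean distances between the feature vectors."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]

features = [(5, 10, 15), (10, 5, 10), (20, 20, 20)]
D = dissimilarity_matrix(features)
print([[round(v, 1) for v in row] for row in D])
# [[0.0, 8.7, 18.7], [8.7, 0.0, 20.6], [18.7, 20.6, 0.0]]
```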
27
Visualising a dissimilarity matrix
We can visualise a dissimilarity matrix as a heat map, where the colour of each cell indicates that cell’s value.

Object  1     2     3
1       0     8.7   18.7
2       8.7   0     20.6
3       18.7  20.6  0
28
Properties of a dissimilarity matrix D
• The diagonal of D is all zeros
• D is symmetric about its leading diagonal
• D(i,j)=D(j,i) for all i and j
• Objects follow the same order along rows and columns
• Visualising the (raw) dissimilarity matrix may not reveal enough useful information
• Further processing often needed
29
Reordering a Dissimilarity matrix
Example dataset with 16 objects; random ordering of the 16 objects in the dissimilarity matrix.
30
Reordering the matrix reveals the clusters
A better ordering of the 16 objects. Nearby objects in the ordering are similar to each other, producing large dark blocks. We can see four clusters along the diagonal
31
Good VAT images
• A good VAT image suggests both the number of clusters and approximate members of each cluster
• A diagonal dark block appears in the VAT image only when a tight group exists in the data (low within-cluster dissimilarities)
32
Another example:
dissimilarity matrix for Iris dataset
Random order for 150 objects: Where are the clusters???
33
Dissimilarity matrix for Iris data (reordered)
34
Exercise – Match the datasets with the VAT images
(Datasets 1 and 2; VAT images a and b.)

From: An Algorithm for Clustering Tendency Assessment, Hu, Y. and Hathaway, R., WSEAS Transactions on Mathematics, 2008.

35
VAT ordering example
36
(The VAT ordering is built up step by step over the following slides, adding one object at a time until all objects, here numbered 1 to 7, have been ordered.)

44
The VAT algorithm:
Visual assessment for clustering tendency (Bezdek and Hathaway 2002)

Given an N×N dissimilarity matrix D:

  Let K = {1, …, N}; I = J = {}            # {} is a set (collection of objects)
  Pick the two least similar objects o_a and o_b from D
  P(1) = a; I = {a}; J = K − {a}
  For r = 2, …, N:
      Select (i, j): the pair of most similar objects o_i and o_j from D
      such that i ∈ I, j ∈ J                # ∈ means ‘member of’
      P(r) = j; I = I ∪ {j}; J = J − {j}    # ∪ means ‘union’, i.e. add the two collections together

• Obtain the reordered dissimilarity matrix D* from the permutation (ordering) P
  • E.g. P(1) = 2, P(2) = 5, P(3) = 7: the first object in the ordering is object 2, the second is 5, the third is 7, …
• VAT algorithm python code – you will practice in workshops
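A compact Python sketch of the reordering (our own list-of-lists implementation, not the workshop code; the starting object is taken as one endpoint of the largest dissimilarity, matching the "two least similar objects" step):

```python
def vat(D):
    """Return the VAT permutation P and the reordered matrix D*."""
    n = len(D)
    a = max(range(n), key=lambda i: max(D[i]))   # endpoint of least similar pair
    P, J = [a], set(range(n)) - {a}
    while J:
        # most similar pair (i, j) with i already ordered and j not yet
        i, j = min(((i, j) for i in P for j in J), key=lambda ij: D[ij[0]][ij[1]])
        P.append(j)
        J.remove(j)
    D_star = [[D[P[r]][P[c]] for c in range(n)] for r in range(n)]
    return P, D_star

# Two tight pairs {0, 2} and {1, 3}, deliberately interleaved in the input:
D = [[0, 9, 1, 9],
     [9, 0, 9, 1],
     [1, 9, 0, 8],
     [9, 1, 8, 0]]
P, D_star = vat(D)
print(P)  # [0, 2, 3, 1] -- similar objects become adjacent in the ordering
```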
45
The VAT algorithm – problems
VAT algorithm is not effective in every situation
• For complex shaped datasets (either significant overlap or irregular geometries between different clusters), the quality of the VAT image may significantly degrade
46
Hierarchical Clustering
47
Hierarchical Clustering
• Produces a set of nested clusters organised as a hierarchical tree
• Can be visualised as a dendrogram
  • A tree-like diagram that records the sequences of merges (the y-axis is distance)
48
Strengths of Hierarchical Clustering
• No assumption of any particular number of clusters
  • Any desired number of clusters can be obtained by cutting the dendrogram at the appropriate level
• The clusters may correspond to meaningful taxonomies
  • Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
49
Tree of Life (http://www.talkorigins.org/faqs/comdesc/phylo.html)
50
Hierarchical Clustering
• Two main types of hierarchical clustering
• Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) left
• Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a single point (or there are k clusters)
• Traditional hierarchical algorithms use a (dis)similarity or distance matrix
• Merge or split one cluster at a time
51
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward
• Compute the similarity matrix
• Let each data point be a cluster
• Repeat
• Merge the two closest clusters
• Update the proximity matrix
• Until only a single cluster remains
• Key operation is the computation of the similarity of two clusters
• Different approaches to defining the distance between clusters distinguish the different algorithms
52
Agglomerative Clustering Example (1)
Start with clusters of individual points and a similarity matrix
[Figure: five points p1–p5 and their similarity matrix]
53
Agglomerative Clustering Example (2)
After some merging, we have clusters
[Figure: clusters C1–C5 and their similarity matrix]
54
Agglomerative Clustering Example (3)
We merge the two closest clusters (C2 and C5)
and update the similarity matrix
[Figure: clusters C1, C3, C4 and the merged cluster C2 ∪ C5, with the updated similarity matrix]
55
Agglomerative Clustering Example (4)
How do we update the similarity matrix?
[Figure: similarity matrix with the row and column for the merged cluster C2 ∪ C5 marked '?']
56
Agglomerative Clustering Example (4)
Similarity?
• We define the similarity to be the minimum distance between the clusters, also known as single linkage
• Other choices also possible, e.g. maximum distance, also known as complete linkage
57
Agglomerative Clustering Example (5)
• MIN (Single Linkage)
• MAX (Complete Linkage)
[Figure: similarity matrix]
58
Agglomerative Clustering Example (6)
MIN (Single Linkage): the similarity of two clusters is based on the two most similar (closest) points in the different clusters
• Determined by one pair of points, i.e., by one link in the proximity graph.
[Figure: nested clusters for points 1–6 and the corresponding dendrogram]
59
Clustering – Reflections
60
Clustering Reflections: K-means (1)
Limitations of K-means – result quality suffers when clusters differ in size and density
Original Points K-means (3 Clusters)
Original Points K-means (3 Clusters)
61
Clustering Reflections: K-means (2)
Limitations of K-means – not strong for non-globular shapes
Original Points K-means (2 Clusters)
62
Clustering Reflections: K-means (3)
An idea: Use many clusters, merge together
63
Clustering Reflections: Hierarchical (1)
Strength of MIN (single linkage): can handle non-elliptical (non-globular) shapes
[Figure: original points and the two clusters found]
Limitation of MIN (single linkage): sensitive to noise and outliers
[Figure: original points and the two clusters found]
64
Clustering Reflections: Hierarchical (2)
Strength of MAX (complete linkage): less susceptible to noise and outliers
Limitation of MAX (complete linkage): tends to break large clusters
[Figure: original points and the two clusters found]
65
Clustering Reflections: Hierarchical (3)
• Once a decision is made to combine two clusters, it cannot be undone, i.e. an object that is in the wrong cluster will always stay there.
• No objective function is directly minimised, i.e. no ‘relative’ quality measure
66
Where are we now?
2
Plan
• Discuss correlations between pairs of features in a dataset
• Why useful and important
• Pitfalls
• Methods for computing correlation
• Euclidean distance
• Pearson correlation
• Mutual information (another method to compute correlation)
3
Correlation
4
What is Correlation?
Correlation is used to detect pairs of variables that might have some relationship
https://www.mathsisfun.com/data/correlation.html
5
What is Correlation?
Visually can be identified via inspecting scatter plots
https://www.mathsisfun.com/data/correlation.html
6
What is Correlation?
Linear relations
7
Example of Correlated Variables
• Can hint at potential causal relationships (change in one variable is the result of change in the other)
• Business decision based on correlation: increase electricity production when temperature increases
8
Example of non-linear correlation
It gets so hot that people aren’t going near the shop, and sales start dropping
https://www.mathsisfun.com/data/correlation.html
9
Climate change [https://climate.nasa.gov/evidence/]
11
Climate Change? http://en.wikipedia.org/wiki/File:2000_Year_Temperature_Comparison.png
12
Climate Change?
13
Climate Change?
14
Climate Change?
15
Salt Causes High Blood Pressure
Intersalt: an international study of electrolyte excretion and blood pressure. Results for 24 hour urinary sodium and potassium excretion.
British Medical Journal; 297: 319-328, 1988.
16
Or Does It!?
Intersalt: an international study of electrolyte excretion and blood pressure. Results for 24 hour urinary sodium and potassium excretion.
British Medical Journal; 297: 319-328, 1988.
If we exclude these four ‘outliers’, which are non-industrialised countries with non-salt diets, we get a quite different result!
17
Example of Correlated Variables
Correlation does not necessarily imply causality!
18
Example: Predicting Sales
https://www.kaggle.com/ c/rossmann-store- sales/data
20
Example: Predicting Sales
21
Example: Predicting Sales
22
Example: Predicting Sales
Other correlations
• Sales vs. holiday
• Sales vs. day of the week
• Sales vs. distance to competitors
• Sales vs. average income in area
23
Factors correlated with human performance
• Bedtime consistency
• High-intensity exercise
• Intermittent fasting
• Optimism
24
Why is correlation important?
• Discover relationships
• One step towards discovering causality (A causes B)
• Example: smoking causes lung cancer
• Feature ranking: select the best features for building better predictive models
• A good feature to use is one that has high correlation with the outcome we are trying to predict
25
Microarray data
Each chip contains thousands of tiny probes corresponding to the genes (20k – 30k genes in humans). Each probe measures the activity (expression) level of a gene
Gene 1 expression | Gene 2 expression | … | Gene 20K expression
0.3               | 1.2               | … | 3.1
26
Microarray dataset
       | Gene 1 | Gene 2 | Gene 3 | … | Gene n
Time 1 | 2.3    | 1.1    | 0.3    | … | 2.1
Time 2 | 3.2    | 0.2    | 1.2    | … | 1.1
Time 3 | 1.9    | 3.8    | 2.7    | … | 0.2
…      | …      | …      | …      | … | …
Time m | 2.8    | 3.1    | 2.5    | … | 3.4
• Each row represents measurements at some time
• Each column represents levels of a gene
27
Correlation analysis on Microarray data
Can reveal genes that exhibit similar patterns, and hence possibly similar or related functions – helping to discover the functions of unknown genes
28
Compare people
         | Gene 1 | Gene 2 | Gene 3 | … | Gene n
Person 1 | 2.3    | 1.1    | 0.3    | … | 2.1
Person 2 | 3.2    | 0.2    | 1.2    | … | 1.1
Person 3 | 1.9    | 3.8    | 2.7    | … | 0.2
…        | …      | …      | …      | … | …
Person m | 2.8    | 3.1    | 2.5    | … | 3.4
• Each row represents measurements about a person
• Each column represents levels of a gene
29
Genetic network
Connect genes with high correlation – understand behaviour of groups of genes
http://www.john.ranola.org/?page_id=116
30
Euclidean Distance
31
Problems of Euclidean distance
• Variables can be measured on very different scales
             | Day 1  | Day 2  | Day 3 | … | Day m
Temperature  | 20     | 22     | 16    | … | 33
#Ice-creams  | 50223  | 55223  | 45098 | … | 78008
#Electricity | 102034 | 105332 | 88900 | … | 154008
d(temp, ice-cr) = 540324    d(temp, elect) = 12309388
• Euclidean distance: does not give a clear intuition about how well variables are correlated
32
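One common remedy (an illustration, not part of the slides) is to standardise each variable to z-scores before computing Euclidean distance, so that shape rather than magnitude drives the comparison. The sample values are the first four days from the table above.

```python
import numpy as np

temp = np.array([20.0, 22.0, 16.0, 33.0])
ice = np.array([50223.0, 55223.0, 45098.0, 78008.0])

def zscore(x):
    # Standardise: subtract the mean, divide by the standard deviation
    return (x - x.mean()) / x.std()

# Raw Euclidean distance is dominated by the larger scale of ice-cream counts
d_raw = np.linalg.norm(temp - ice)

# After standardising, the two series (which rise and fall together) are very close
d_std = np.linalg.norm(zscore(temp) - zscore(ice))
```

Here `d_raw` is in the tens of thousands, while `d_std` is a small number, reflecting that the two variables move together.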
Problems of Euclidean distance
Cannot discover variables with similar behaviours/dynamics but at different scale
33
Problems of Euclidean distance
Cannot discover variables with similar behaviours/dynamics but in the opposite direction (negative correlation)
34
Pearson
35
Assessing linear correlation – Pearson correlation
• We will define a correlation measure rxy, assessing samples from two features x and y
• Assess how close their scatter plot is to a straight line (a linear relationship)
• Range of rxy lies within [-1,1]:
• 1 for perfect positive linear correlation
• -1 for perfect negative linear correlation
• 0 means no correlation
• Absolute value |r| indicates strength of linear correlation
• http://www.bc.edu/research/intasc/library/correlation.shtml
36
Pearson’s correlation coefficient (r)
37
Pearson coefficient example
Height (x) | Weight (y)
1.6        | 50
1.7        | 66
1.8        | 77
1.9        | 94
• How do the values of x and y move (vary) together?
• Big values of x with big values of y?
• Small values of x with small values of y?
38
Pearson coefficient example
39
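Pearson's coefficient is r = Σ(xᵢ−x̄)(yᵢ−ȳ) / √(Σ(xᵢ−x̄)² · Σ(yᵢ−ȳ)²). A quick sketch computing it for the height/weight table above:

```python
import numpy as np

def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

height = np.array([1.6, 1.7, 1.8, 1.9])
weight = np.array([50.0, 66.0, 77.0, 94.0])
r = pearson(height, weight)  # close to +1: strong positive linear correlation
```

The result agrees with NumPy's built-in `np.corrcoef`.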
Example rank correlation
“If a university has a higher-ranked football team, then is it likely to have a higher-ranked basketball team?”
40
Interpreting Pearson correlation values
In general it depends on your domain of application. It has been suggested that:
• 0.5 is large
• 0.3-0.5 is moderate
• 0.1-0.3 is small
• less than 0.1 is trivial
41
Properties of Pearson’s correlation
• Range within [-1,1]
• Scale invariant: r(x,y) = r(x, Ky) for K > 0
• Multiplying a feature's values by a positive constant K makes no difference
• Location invariant: r(x,y) = r(x, K+y)
• Adding a constant K to one feature's values makes no difference
• Can only detect linear relationships: y = a·x + b + noise
• Cannot detect non-linear relationships, e.g. y = x³ + noise
42
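These properties can be checked numerically. The sketch below uses synthetic data; for the non-linear case it uses a symmetric quadratic (y = x²), a relationship that is strong but yields a near-zero Pearson value.

```python
import numpy as np

rng = np.random.default_rng(0)

def pearson(x, y):
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

x = rng.normal(size=1000)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=1000)  # y = a.x + b + noise

r = pearson(x, y)
r_scaled = pearson(x, 5.0 * y)     # scale invariance (K > 0)
r_shifted = pearson(x, y + 100.0)  # location invariance

# Strong but non-linear relationship: Pearson is close to zero
r_quad = pearson(x, x**2)
```

`r`, `r_scaled`, and `r_shifted` all agree and are near +1, while `r_quad` is near 0 despite y being completely determined by x.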
Examples
https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
43
2017 Exam question 3a
• http://www.bc.edu/research/intasc/library/correlation.shtml
44
2016 exam question 2a)
45
Mutual Information
46
Recap: Pearson correlation – assess linear correlation between two features
https://www.mathsisfun.com/data/correlation.html
47
What about non-linear correlation?
Pearson correlation is not suitable for this scenario (value less than 0.1)
https://www.mathsisfun.com/data/correlation.html
48
Introduction to Mutual Information
A correlation measure that can detect non-linear relationships
• It operates with discrete features
• Pre-processing: continuous features are first discretised into bins (categories). E.g. small [0,1.4], medium (1.4,1.8), large [1.8,3.0]
Object | Height | Discretised Height
1      | 2.03   | large
2      | 1.85   | large
3      | 1.23   | small
4      | 1.31   | small
5      | 1.72   | medium
6      | 1.38   | small
7      | 0.94   | small
49
Variable discretisation: Techniques
• Domain knowledge: assign thresholds manually
• Car speed:
• 0–40: slow
• 40–60: medium
• >60: high
• Equal-length bins
• Divide the range of the continuous feature into equal-length intervals (bins). If speed ranges from 0–100, then the 10 bins are [0,10), [10,20), [20,30), … [90,100]
• Equal-frequency bins
• Divide the range of the continuous feature into equal-frequency intervals (bins). Sort the values and divide so that each bin has the same number of objects
50
Discretisation example
• Given the values 2, 2, 3, 10, 13, 15, 16, 17, 19, 19, 20, 20, 21
• Show a 3-bin equal-length discretisation
• Show a 3 bin equal frequency discretisation
51
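The exercise can be sketched in NumPy. Note that with 13 values a 3-bin equal-frequency split cannot be exactly equal; `np.array_split` gives bins of sizes 5, 4, and 4.

```python
import numpy as np

vals = np.array([2, 2, 3, 10, 13, 15, 16, 17, 19, 19, 20, 20, 21])

# Equal-length: split the range [2, 21] into 3 bins of equal width
edges = np.linspace(vals.min(), vals.max(), 4)  # [2, 8.33, 14.67, 21]
bins = np.digitize(vals, edges[1:-1])           # bin index 0, 1 or 2 for each value
counts = np.bincount(bins)                      # number of objects per bin

# Equal-frequency: sort the values, then split into 3 (near-)equal chunks
chunks = np.array_split(np.sort(vals), 3)
```

Equal-length gives very uneven bin counts (3, 2, and 8 objects), while equal-frequency balances the counts but produces bins of very different widths.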
A recap on logarithms (to the base 2)
• y = log2 x (y is the solution to the question “To what power do I need to raise 2, in order to get x?”)
• 2×2×2×2 = 16, which means log2 16 = 4 (16 is 2 to the power 4)
• log2 32 = 5
• log2 30 ≈ 4.9
• log2 1.2 ≈ 0.26
• log2 0.5 = −1
• In what follows, we’ll write log instead of log2
52
Entropy
• Entropy is a measure used to assess the amount of uncertainty in an outcome
• E.g. Randomly select an element from {1,1,1,1,1,1,1,2}, versus randomly select an element from {1,2,2,3,3,4,5}
• In which case is the value selected more “predictable”? Why?
53
Entropy
• E.g. Randomly select an element from {1,1,1,1,1,1,1,2}, versus randomly select an element from {1,2,2,3,3,4,5}
• In which case is the value selected more “predictable”? Why?
• The former case more certain => low entropy
• The latter case is less certain => higher entropy
• Entropy is used to quantify this degree of uncertainty (surprisingness)
54
Another example
• Consider the sample of all people in this lecture theatre. Each person is labelled young (<30 years) or old (≥30 years)
• Randomly select a person and inspect whether they are young or old
• How surprised am I likely to be by the outcome?
• Suppose I repeat the experiment using a random sample of people catching the train to the city in peak hour?
• How surprised am I likely to be by the outcome?
55
Entropy
• Given a feature X, H(X) is its entropy. Assuming X takes values in a number of categories (bins), with p(i) the proportion of objects in bin i:
H(X) = −Σi p(i) log p(i)
• May sometimes write pi instead of p(i)
56
Entropy of a Random Variable
• E.g. Suppose there are 3 bins, and each bin contains exactly one third of the objects (points)
• H(X) = −[ (1/3)·log(1/3) + (1/3)·log(1/3) + (1/3)·log(1/3) ] = log 3 ≈ 1.58
57
Another Entropy example
Feature X values for 9 objects: A, B, B, A, C, C, C, C, A
We have 3 categories/bins (A, B, C) for a feature X; each of the 9 objects is in exactly one of the 3 bins
What is the entropy of this sample of 9 objects? Answer: H(X) ≈ 1.53
58
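The entropy calculation above can be sketched as a small function (a minimal sketch using base-2 logarithms, as in the slides):

```python
import numpy as np

def entropy(values):
    """H(X) = -sum_i p(i) * log2(p(i)), where p(i) is the proportion in category i."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

h = entropy(list("ABBACCCCA"))  # 3 A's, 2 B's, 4 C's out of 9 objects
```

This reproduces the slide's answer of approximately 1.53, and a uniform 3-category sample gives log2(3) ≈ 1.58 as on the previous slide.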
How would you compute the entropy for the “Likes to sleep” feature?
Person | Likes to sleep
1      | Yes
2      | No
3      | Maybe
4      | Never
5      | Yes
6      | Yes
7      | Never
8      | Yes
59
Properties of entropy
• H(X) ≥ 0
• Entropy – when using log base 2 – measures uncertainty of the outcome in bits. This can be viewed as the information associated with learning the outcome
• Entropy is maximized for uniform distribution (highly uncertain what value a randomly selected object will have)
60
Conditional entropy – intuition
Suppose I randomly sample a person. I check if they wear glasses – how surprised am I by their age?
Person | Wear Glasses (X) | Age (Y)
1      | No               | young
2      | No               | young
3      | No               | young
4      | No               | young
5      | Yes              | old
6      | Yes              | old
7      | Yes              | old
61
Conditional entropy H(Y|X)
Measures how much information is needed to describe outcome Y, given that outcome X is known. Suppose X is Height and Y is Weight.
Object | Height (X) | Weight (Y)
1      | big        | light
2      | big        | heavy
3      | small      | light
4      | small      | light
5      | small      | light
6      | small      | light
7      | small      | heavy
62
Conditional entropy
Object | Height (X) | Weight (Y)
1      | big        | light
2      | big        | heavy
3      | small      | light
4      | small      | light
5      | small      | light
6      | small      | light
7      | small      | heavy
H(Y|X) = 2/7 × H(Y|X=big) + 5/7 × H(Y|X=small)
= 2/7 × (−0.5 log 0.5 − 0.5 log 0.5) + 5/7 × (−0.8 log 0.8 − 0.2 log 0.2) ≈ 0.801
63
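The conditional-entropy calculation can be sketched as a weighted sum of per-group entropies, using the height/weight table above:

```python
import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(x, y):
    """H(Y|X) = sum over values v of X of p(X=v) * H(Y | X=v)."""
    x, y = np.asarray(x), np.asarray(y)
    return sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))

height = ["big", "big", "small", "small", "small", "small", "small"]
weight = ["light", "heavy", "light", "light", "light", "light", "heavy"]
h_cond = conditional_entropy(height, weight)
```

This reproduces the slide's value of approximately 0.801.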
Mutual information definition
• Where X and Y are features (columns) in a dataset
• MI (mutual information) is a measure of correlation
• The amoun