http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics
Classification
Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech
Partly based on materials by
Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Parikshit Ram (GT PhD alum; SkyTree), Alex Gray
Songs              Label
Some nights        …
Skyfall            …
Comfortably numb   …
We are young       …
…                  …
Chopin’s 5th       ???
How will I rate “Chopin’s 5th Symphony”?
Classification
What tools do you need for classification?
1. Data: S = {(x_i, y_i)}, i = 1, …, n
   x_i represents each example with d attributes; y_i represents the label of each example
2. Classification model: f(a, b, c, …) with some parameters a, b, c, …
   a model/function that maps examples to labels
3. Loss function: L(y, f(x))
   how to penalize mistakes
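To make these three ingredients concrete, here is a minimal Python sketch; the toy song data and the one-parameter threshold model are illustrative assumptions, not something from the slides.

```python
import numpy as np

# 1. Data S = {(x_i, y_i)}: each x_i has d attributes, y_i is its label
X = np.array([[4.2, 1.0], [4.0, 0.5], [6.2, 0.1], [3.8, 0.9]])  # d = 2 attributes
y = np.array([1, 1, 0, 1])                                      # labels

# 2. Classification model f with parameters (here a single threshold a on attribute 0)
def f(a, x):
    return 1 if x[0] < a else 0

# 3. Loss function L(y, f(x)): 0-1 loss penalizes each mistake by 1
def L(y_true, y_pred):
    return int(y_true != y_pred)

# Training error for one particular parameter setting
a = 5.0
print(sum(L(yi, f(a, xi)) for xi, yi in zip(X, y)) / len(y))  # -> 0.0
```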
Features

Song name        Label   Artist     Length   …
Some nights      …       Fun        4:23     …
Skyfall          …       Adele      4:00     …
Comf. numb       …       Pink Fl.   6:13     …
We are young     …       Fun        3:50     …
…                …       …          …        …
Chopin’s 5th     ??      Chopin     5:32     …
Training a classifier (building the "model")
Q: How do you learn appropriate values for the parameters a, b, c, … such that
• y_i = f_(a,b,c,…)(x_i), i = 1, …, n
  o Low/no error on "training data" (songs)
• y = f_(a,b,c,…)(x), for any new x
  o Low/no error on "test data" (songs)
Possible A: Minimize the total training loss Σ_i L(y_i, f_(a,b,c,…)(x_i)) with respect to a, b, c, …
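Continuing the same toy setup as the earlier sketch (still an assumption, not the slides' model), "Possible A" can be sketched as searching for the parameter value that minimizes the total training loss; grid search stands in for whatever optimizer a real model would use.

```python
import numpy as np

X = np.array([[4.2, 1.0], [4.0, 0.5], [6.2, 0.1], [3.8, 0.9]])
y = np.array([1, 1, 0, 1])

def f(a, x):                      # model with a single parameter a
    return 1 if x[0] < a else 0

def total_loss(a):                # sum_i L(y_i, f_a(x_i)) with 0-1 loss
    return sum(int(yi != f(a, xi)) for xi, yi in zip(X, y))

# Minimize the training loss with respect to the parameter a (here by grid search)
best_a = min(np.linspace(3.5, 6.5, 31), key=total_loss)
print(best_a, total_loss(best_a))  # e.g., ~4.3 with training loss 0
```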
Classification loss function
Most common loss: the 0-1 loss function, L(y, f(x)) = 0 if y = f(x), and 1 otherwise.
More general loss functions are defined by an m × m cost matrix C such that L(y, f(x)) = C_ab, where y = a (true class) and f(x) = b (predicted class).
T0 (true class 0), T1 (true class 1); P0 (predicted class 0), P1 (predicted class 1)

Class   T0     T1
P0      0      C10
P1      C01    0
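A small sketch of both losses in Python; the cost values in C are made up, and the indexing follows the table above (C_ab with a = true class, b = predicted class).

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    return int(y_true != y_pred)

# Cost matrix C[a][b] = cost of predicting class b when the true class is a
# (diagonal = 0; the off-diagonal values 1 and 5 are illustrative)
C = np.array([[0.0, 1.0],    # true class 0: predicting 1 costs C01 = 1
              [5.0, 0.0]])   # true class 1: predicting 0 costs C10 = 5

def cost_loss(y_true, y_pred):
    return C[y_true, y_pred]

print(zero_one_loss(1, 0), cost_loss(1, 0))  # -> 1 5.0
```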
k-Nearest-Neighbor Classifier
The classifier:
f(x) = majority label of the k nearest neighbors (NN) of x
Model parameters:
• Number of neighbors k
• Distance/similarity function d(.,.)
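A minimal brute-force sketch of this classifier; Euclidean distance is assumed as d(.,.), and the data is made up.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k, d=lambda a, b: np.linalg.norm(a - b)):
    dists = np.array([d(xi, x) for xi in X_train])   # distance to every training example
    nearest = y_train[np.argsort(dists)[:k]]         # labels of the k nearest neighbors
    return Counter(nearest).most_common(1)[0][0]     # majority label

X_train = np.array([[4.2, 1.0], [4.0, 0.5], [6.2, 0.1], [3.8, 0.9]])
y_train = np.array([1, 1, 0, 1])
print(knn_predict(X_train, y_train, np.array([5.5, 0.2]), k=3))  # -> 1
```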
But KNN is so simple!
It can work really well! Pandora uses it:
https://goo.gl/foLfMP
(from the book “Data Mining for Business Intelligence”)
k-Nearest-Neighbor Classifier
If k and d(.,.) are fixed
  Things to learn: ?
  How to learn them: ?
If d(.,.) is fixed, but you can change k
  Things to learn: ?
  How to learn them: ?
k-Nearest-Neighbor Classifier
If k and d(.,.) are fixed
  Things to learn: nothing
  How to learn them: N/A
If d(.,.) is fixed, but you can change k
  Things to learn: k
  Selecting k: try different values of k on some hold-out set
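A sketch of that hold-out procedure using scikit-learn; the dataset and the candidate values of k are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

# Try different values of k; keep the one with the lowest hold-out error
errors = {k: 1 - KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_hold, y_hold)
          for k in (1, 3, 5, 7, 9)}
best_k = min(errors, key=errors.get)
print(best_k, errors[best_k])
```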
How to find the best k in k-NN?
Use cross-validation.
Example: evaluate k = 1 (in k-NN) using 5-fold cross-validation.
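For example, with scikit-learn (the dataset is a placeholder; the slide only illustrates the procedure):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5)  # accuracy per fold
print(1 - scores.mean())   # 5-fold cross-validation test error for k = 1
```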
Cross-validation (C.V.)
1. Divide your data into n parts
2. Hold out 1 part as the "test set" or "hold-out set"
3. Train the classifier on the remaining n-1 parts (the "training set")
4. Compute the test error on the test set
5. Repeat the above steps n times, once for each n-th part
6. Compute the average test error over all n folds (i.e., the cross-validation test error)
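The same six steps written out as a sketch; KFold supplies the n-part split and 1-NN is an arbitrary choice of classifier.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
fold_errors = []

# Steps 1-5: divide into n parts, hold one out, train on the rest, compute test error, repeat
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = KNeighborsClassifier(n_neighbors=1).fit(X[train_idx], y[train_idx])
    fold_errors.append(1 - clf.score(X[test_idx], y[test_idx]))

# Step 6: average test error over all folds = cross-validation test error
print(np.mean(fold_errors))
```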
Cross-validation variations
Leave-one-out cross-validation (LOO-CV)
• Test sets of size 1
K-fold cross-validation
• Test sets of size (n / K)
• K = 10 is most common (i.e., 10-fold CV)
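A quick sketch contrasting the two variations via scikit-learn's splitters; the tiny array is only there to make the fold counts and test-set sizes visible.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, KFold

X = np.arange(20).reshape(10, 2)                   # n = 10 examples
print(sum(1 for _ in LeaveOneOut().split(X)))      # n folds, each test set of size 1
print(sum(1 for _ in KFold(n_splits=5).split(X)))  # K folds, each test set of size n / K = 2
```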
k-Nearest-Neighbor Classifier
If k is fixed, but you can change d(.,.)
  Things to learn: ?
  How to learn them: ?
  Cross-validation: ?
Possible distance functions:
• Euclidean distance
• Manhattan distance
• …
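Minimal sketches of the two distance functions listed above; the helper names are ours, the formulas are the standard ones.

```python
import numpy as np

def euclidean(a, b):
    # d(a, b) = sqrt(sum_j (a_j - b_j)^2)
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # d(a, b) = sum_j |a_j - b_j|
    return np.sum(np.abs(a - b))

a, b = np.array([4.2, 1.0]), np.array([4.0, 0.5])
print(euclidean(a, b), manhattan(a, b))  # -> ~0.54 and ~0.7
```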
k-Nearest-Neighbor Classifier
If k is fixed, but you can change d(.,.)
  Things to learn: the distance function d(.,.)
  How to learn them: optimization
  Cross-validation: any regularizer you have on your distance function
Summary on k-NN classifier
• Advantages
  o Little learning (unless you are learning the distance functions)
  o Quite powerful in practice (and has theoretical guarantees as well)
• Caveats
  o Computationally expensive at test time
Reading material:
• ESL book, Chapter 13.3
  http://www-stat.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
• Le Song’s slides on the kNN classifier
  http://www.cc.gatech.edu/~lsong/teaching/CSE6740/lecture2.pdf
Points about cross-validation
Requires extra computation, but gives you information about the expected test error
LOO-CV:
• Advantages
  o Unbiased estimate of test error (especially for small n)
  o Low variance
• Caveats
  o Extremely time consuming
Points about cross-validation
K-fold CV:
• Advantages
  o More efficient than LOO-CV
• Caveats
  o K needs to be large for low variance
  o Too small a K leads to under-use of data, leading to higher bias
• Usually accepted value: K = 10
Reading material:
• ESL book, Chapter 7.10
  http://www-stat.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
• Le Song’s slides on CV
  http://www.cc.gatech.edu/~lsong/teaching/CSE6740/lecture13-cv.pdf
Decision trees (DT)
Visual introduction to decision trees:
http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
The classifier:
  f_T(x): the majority class in the leaf of the tree T that contains x
Model parameters: the tree structure and size
(Figure: example decision tree with root node "Weather?")
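A sketch of f_T(x) with scikit-learn's decision tree (a library choice of ours, not prescribed by the slides); max_depth is one handle on the tree structure and size.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)  # model parameters: structure/size (here via max_depth)
print(tree.predict(X[:1]))                            # majority class of the leaf containing x
print(tree.tree_.node_count, tree.get_depth())        # size of the learned tree
```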
Decision trees
Things to learn: ?
How to learn them: ?
Cross-validation: ?
(Figure: example decision tree with root node "Weather?")
Learning the Tree Structure
Things to learn: the tree structure
How to learn them: (greedily) minimize the overall classification loss
Cross-validation: finding the best-sized tree with K-fold cross-validation
Decision trees
Pieces:
1. Find the best split on the chosen attribute
2. Find the best attribute to split on
3. Decide on when to stop splitting
4. Cross-validation
Highly recommended lecture slides from CMU:
http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15381-s06/www/DTs.pdf
Choosing the split
Split types for a selected attribute j:
1. Categorical attribute (e.g., "genre"): x_1j = Rock, x_2j = Classical, x_3j = Pop
2. Ordinal attribute (e.g., "achievement"): x_1j = Platinum, x_2j = Gold, x_3j = Silver
3. Continuous attribute (e.g., song duration): x_1j = 235, x_2j = 543, x_3j = 378
(Figure: example splits on genre, on achievement, and on duration)
Choosing the split
At a node T, for a given attribute d, select the split s as follows:
  min_s [ loss(T_L) + loss(T_R) ], where loss(T) is the loss at node T
Common node loss functions:
• Misclassification rate
• Expected loss
• Normalized negative log-likelihood (= cross-entropy)
For more details on loss functions, see Chapter 3.3:
http://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf
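A sketch (an assumed implementation, not from the slides) of picking the split threshold s on one continuous attribute by minimizing loss(T_L) + loss(T_R), with the misclassification count as the node loss:

```python
import numpy as np

def misclassification(labels):
    # node loss: number of points not in the node's majority class
    return 0 if len(labels) == 0 else len(labels) - np.bincount(labels).max()

def best_split(x, y):
    # candidate thresholds: midpoints between consecutive sorted attribute values
    xs = np.unique(x)
    candidates = (xs[:-1] + xs[1:]) / 2
    return min((misclassification(y[x <= s]) + misclassification(y[x > s]), s)
               for s in candidates)

durations = np.array([235, 543, 378, 263, 410])  # continuous attribute (seconds)
labels    = np.array([1, 0, 0, 1, 0])            # class labels at this node
loss, threshold = best_split(durations, labels)
print(loss, threshold)                           # -> 0 320.5
```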
Choosing the attribute
Choice of attribute:
1. The attribute providing the maximum improvement in training loss
2. The attribute with the highest information gain (mutual information)
Intuition: an attribute with the highest information gain helps most rapidly describe an instance (i.e., most rapidly reduces "uncertainty")
More details about information gain:
https://en.wikipedia.org/wiki/Information_gain_in_decision_trees
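A minimal sketch of information gain for a categorical attribute: the entropy of the labels minus the weighted entropy after splitting on the attribute (toy data; helper names are ours).

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, labels):
    gain = entropy(labels)
    for v in np.unique(attribute):
        mask = attribute == v
        gain -= mask.mean() * entropy(labels[mask])   # weighted entropy of each child
    return gain

genre  = np.array(["Rock", "Classical", "Pop", "Rock"])
rating = np.array([1, 0, 1, 1])
print(information_gain(genre, rating))  # -> ~0.81
```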
When to stop splitting?
1. Homogeneous node (all points in the node belong to the same class, OR all points in the node have the same attributes)
2. Node size is less than some threshold
3. Further splits provide no improvement in training loss (i.e., loss(T) <= loss(T_L) + loss(T_R))
Controlling tree size
In most cases, you can drive the training error to zero (how? is that a good thing?)
What is wrong with really deep trees?
• Really high "variance"
What can be done to control this?
• Regularize the tree complexity
  o Penalize complex models and prefer simpler models
Look at Le Song’s slides on the decomposition of error into the bias and variance of the estimator:
http://www.cc.gatech.edu/~lsong/teaching/CSE6740/lecture13-cv.pdf
Summary on decision trees
• Advantages
  o Easy to implement
  o Interpretable
  o Very fast test time
  o Can work seamlessly with mixed attributes
  o Works quite well in practice
• Caveats
  o Can be too simplistic (but OK if it works)
  o Training can be very expensive
  o Cross-validation is hard (node-level CV)
Final words on decision trees
Reading material:
• ESL book, Chapter 9.2
  http://www-stat.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
• Le Song’s slides
  http://www.cc.gatech.edu/~lsong/teaching/CSE6740/lecture6.pdf