Machine learning lecture slides
COMS 4771 Fall 2020
0/24
Nearest neighbor classification
Outline
I Optical character recognition (OCR) example
I Nearest neighbor rule
I Error rate, test error rate
I k-nearest neighbor rule
I Hyperparameter tuning via cross-validation
I Distance functions, features
I Computational issues
1/24
Example: OCR for digits
I Goal: Automatically label images of handwritten digits
I Possible labels are {0,1,2,3,4,5,6,7,8,9}
I Start with a large collection of already-labeled images
I D := {(x1,y1),…,(xn,yn)}
I xi is the i-th image; yi ∈ {0,1,…,9} is the corresponding label.
I The National Institute of Standards and Technology (NIST) has amassed such a data set.
2/24
Figure 1: Some images of handwritten digits from MNIST data set
3/24
Nearest neighbor (NN) classifier
I Nearest neighbor (NN) classifier NN_D:
I Represented using a collection of labeled examples D := ((x1, y1), …, (xn, yn)), plus a snippet of code
I Input: x
I Find xi in D that is “closest” to x (the nearest neighbor)
I (Break ties in some arbitrary fixed way)
I Return yi (see the sketch below)
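A minimal sketch of this rule (not the course's reference code), assuming NumPy is available and the labeled examples are stored as rows of an array X with labels y:

```python
import numpy as np

class NNClassifier:
    def __init__(self, X, y):
        # D := ((x1, y1), ..., (xn, yn)); X has shape (n, d), y has shape (n,)
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)

    def predict(self, x):
        # Euclidean distance from x to every stored example
        dists = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        # argmin returns the first minimizer: an arbitrary but fixed tie-breaking rule
        return self.y[np.argmin(dists)]
```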
4/24
Naïve distance between images of handwritten digits (1)
I Treat (grayscale) images as vectors in Euclidean space R^d
I d = 28² = 784
I Generalizes physical 3-dimensional space
I Each point x = (x1, …, xd) ∈ R^d is a vector of d real numbers
I Euclidean distance, also called ℓ2 distance (see the sketch below):
$\|x - z\|_2 = \sqrt{\sum_{j=1}^{d} (x_j - z_j)^2}$
I WARNING: Here, xj refers to the j-th coordinate of x
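A small sketch of this vector view, using hypothetical 28×28 arrays img_a and img_b in place of real MNIST images:

```python
import numpy as np

# Hypothetical 28x28 grayscale images with pixel values in [0, 1]
rng = np.random.default_rng(0)
img_a, img_b = rng.random((28, 28)), rng.random((28, 28))

x = img_a.reshape(-1)                 # vector in R^784
z = img_b.reshape(-1)
dist = np.sqrt(np.sum((x - z) ** 2))  # Euclidean (ℓ2) distance
# equivalently: np.linalg.norm(x - z)
```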
Figure 2: Grayscale pixel representation of an image of a handwritten “4”
5/24
Naïve distance between images of handwritten digits (2)
I Why use this for images?
I Simple
I Why not use this for images?
I Doesn't really treat inputs as images (which have spatial significance)
6/24
Recap: OCR via NN
I What is the core prediction problem?
I What features (i.e., predictive variables) are available?
I Will these features be available at time of prediction?
I Is there enough information (“training data”) to learn the relationship between the features and label?
I What are the modeling assumptions?
I Is high-accuracy prediction a useful goal for the application?
7/24
Error rate
I Error rate (on a collection of labeled examples S)
I Fraction of labeled examples in S that have incorrect label prediction from f̂
I Written as err(f̂, S):
$\mathrm{err}(\hat{f}, S) = \frac{\#\{(x,y) \in S : \hat{f}(x) \neq y\}}{|S|}$
I (Often, the word "rate" is omitted)
I OCR via NN: err(NN_D, D) = 0 (see the sketch below)
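A minimal sketch of this computation, assuming labeled examples are given as (x, y) pairs and predictions come from a function predict (a hypothetical name standing in for f̂):

```python
def error_rate(predict, examples):
    # err(f, S): fraction of labeled examples (x, y) in S with predict(x) != y
    mistakes = sum(1 for x, y in examples if predict(x) != y)
    return mistakes / len(examples)
```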
8/24
Test error rate (1)
I Better evaluation: test error rate
I Train/test split, S ∩ T = ∅
I S is training data, T is test data
I Classifier f̂ only based on S
I Training error rate: err(f̂, S)
I Test error rate: err(f̂, T)
I On OCR data: test error rate is err(NN_S, T) = 3.09%
I Is this good?
I What is the test error rate of uniformly random predictions?
I What is the test error rate of a constant prediction? (see the sketch below)
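To put 3.09% in context, the two baseline predictors asked about above can be scored the same way. A rough sketch with stand-in data (y_test is a hypothetical array of test labels; with 10 roughly balanced classes both baselines land near 90% error):

```python
import numpy as np

rng = np.random.default_rng(0)
y_test = rng.integers(0, 10, size=1000)        # stand-in for true test labels

# Uniformly random predictions over the 10 digit classes
random_preds = rng.integers(0, 10, size=len(y_test))
print(np.mean(random_preds != y_test))         # about 0.9 in expectation

# Constant prediction: always output the same label
constant_preds = np.full(len(y_test), 1)       # e.g., always predict "1"
print(np.mean(constant_preds != y_test))       # about 0.9 if classes are balanced
```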
9/24
Test error rate (2)
I Why is test error rate meaningful?
I Training error rate is not interesting (it always seems too optimistic)
I Test error rate feels more "honest" (need a model of the data to say more)
I What are the drawbacks of evaluation via test error rate?
10/24
Figure 3: A test example and its nearest neighbor in training data (2, 8)
11/24
Figure 4: A test example and its nearest neighbor in training data (3, 5)
12/24
Figure 5: A test example and its nearest neighbor in training data (5, 4)
13/24
Figure 6: A test example and its nearest neighbor in training data (4, 1)
14/24
More on the modeling assumptions
I Modeling assumption: Nearby images are more likely to have the same label than different labels.
I This is an assumption about the choice of distance function
I In our OCR example, this is an assumption about the choice of
features
15/24
Diagnostics
I What are the kinds of errors made by NN_S?
Figure 7: A test example and its nearest neighbor in training data (2, 8)
Figure 8: Three nearest neighbors of the test example (8,2,2)
16/24
Upgrade: k-NN
I k-nearest neighbor (k-NN) classifier NN_{k,D} (see the sketch below)
I Input: x
I Find the k nearest neighbors of x in D
I Return the plurality of the corresponding labels
I As before, break ties in some arbitrary fixed way
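A minimal sketch of the k-NN prediction rule, under the same assumptions as the 1-NN sketch earlier (NumPy arrays X_train, y_train are hypothetical names):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Distances from x to all training points, then the indices of the k closest
    dists = np.linalg.norm(np.asarray(X_train, dtype=float) - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # Plurality vote among the neighbors' labels
    # (ties broken in a fixed way by Counter's insertion order)
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]
```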
17/24
Typical effect of k
I Smaller k: smaller training error rate
I Larger k: higher training error rate, but predictions more
“stable” due to voting
I On OCR data: lowest test error rate achieved at k = 3
k                  1        3        5        7        9
err(NN_{k,S}, T)   0.0309   0.0295   0.0312   0.0306   0.0341
18/24
Hyperparameter tuning
I k is a hyperparameter of k-NN
I How to choose hyperparameters?
I Bad idea: Choosing k that yields lowest training error rate (degenerate choice: k = 1)
I Better idea: Simulate train/test split on the training data
I Hold-out validation
I Randomly split S into A and B
I Compute validation error rate for all k ∈ {1, 3, 5, 7, 9}:
$V_k := \mathrm{err}(NN_{k,A}, B)$
I Let k̂ be the value of k for which V_k is smallest
I Classifier to use is NN_{k̂,S} (see the sketch below)
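A sketch of this hold-out procedure, assuming the knn_predict helper above and NumPy arrays X, y holding the training set S; the split fraction and candidate values of k here are illustrative choices:

```python
import numpy as np

def tune_k(X, y, ks=(1, 3, 5, 7, 9), holdout_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    n_val = int(holdout_frac * len(y))
    val_idx, train_idx = perm[:n_val], perm[n_val:]   # split S into B (validation) and A (training)
    X_A, y_A = X[train_idx], y[train_idx]
    X_B, y_B = X[val_idx], y[val_idx]

    errs = {}
    for k in ks:
        preds = np.array([knn_predict(X_A, y_A, x, k=k) for x in X_B])
        errs[k] = np.mean(preds != y_B)               # V_k := err(NN_{k,A}, B)
    k_hat = min(errs, key=errs.get)                   # k with smallest validation error
    return k_hat, errs
```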
19/24
Upgrade: Distance functions (1)
I Specialize to input types
I Edit distance for strings
I Shape distance for images
I Time warping distance for audio waveforms
20/24
Upgrade: Distance functions (2)
I Generic distances for vectors of real numbers
I ℓp distances (see the sketch below):
$\|x - z\|_p = \left( \sum_{j=1}^{d} |x_j - z_j|^p \right)^{1/p}$
I What are the unit balls for these distances (in R²)?
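A small sketch of the ℓp distance for vectors in R^d (NumPy assumed):

```python
import numpy as np

def lp_distance(x, z, p=2):
    # ℓp distance: (sum_j |x_j - z_j|^p)^(1/p)
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(z)) ** p) ** (1.0 / p))

print(lp_distance([0, 0], [3, 4], p=2))  # 5.0 (Euclidean / ℓ2)
print(lp_distance([0, 0], [3, 4], p=1))  # 7.0 (ℓ1)
```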
21/24
Upgrade: Distance functions (3)
I Distance functions for images of handwritten digits
distance          ℓ2       ℓ3       tangent   shape
test error rate   0.0309   0.0283   0.0110    0.0063
22/24
Features
I When using numerical features (arranged in a vector in R^d):
I Scale of features matters
I Noisy features can ruin NN
I E.g., consider what happens in the OCR example if you have
I another 10000 additional features that are pure "noise"
I or a single pure noise feature whose scale is 10000× the nominal scale of pixel values (see the sketch below)
I "Curse of dimension"
I Weird effects in R^d for large d
I Can find $2^{\Omega(d)}$ points that are approximately equidistant
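An illustrative sketch (toy random vectors, not MNIST) of the second scenario above: a single pure-noise coordinate at 10000× the pixel scale swamps the Euclidean distance:

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.random(784), rng.random(784)       # two "images" with pixels in [0, 1]
print(np.linalg.norm(x - z))                  # distance determined by the pixels (~11)

scale = 10000.0
x_noisy = np.append(x, scale * rng.random())  # append one pure-noise feature
z_noisy = np.append(z, scale * rng.random())
print(np.linalg.norm(x_noisy - z_noisy))      # now dominated by the noise coordinate
```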
23/24
Computation for NN
I Brute force search: Θ(dn) time for each prediction (using Euclidean distance in R^d)
I Clever data structures: "improve" to 2^d log(n) time (see the sketch below)
I Approximate nearest neighbors: sub-linear time to get "approximate" answers
I E.g., find point among the top-1% of closest points?
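One practical way to speed up the search, sketched below assuming scikit-learn is available; note that tree indexes lose their advantage as d grows, which is why approximate methods are used in high dimensions:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.random((10000, 50))   # n = 10000 training points in R^50

tree = KDTree(X_train)              # one-time index construction
x = rng.random((1, 50))             # query point
dist, idx = tree.query(x, k=3)      # exact 3 nearest neighbors via the tree
```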
24/24