Machine learning lecture slides
COMS 4771 Fall 2020
Nearest neighbor classification
Outline
- Optical character recognition (OCR) example
- Nearest neighbor rule
- Error rate, test error rate
- k-nearest neighbor rule
- Hyperparameter tuning via cross-validation
- Distance functions, features
- Computational issues
Example: OCR for digits
- Goal: Automatically label images of handwritten digits
- Possible labels are {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
- Start with a large collection of already-labeled images
  - D := {(x1, y1), ..., (xn, yn)}
  - xi is the i-th image; yi ∈ {0, 1, ..., 9} is the corresponding label.
- The National Institute of Standards and Technology (NIST) has amassed such a data set.
Figure 1: Some images of handwritten digits from MNIST data set
Nearest neighbor (NN) classifier
- Nearest neighbor (NN) classifier NND (sketched in code below):
  - Represented using a collection of labeled examples
    D := ((x1, y1), ..., (xn, yn)), plus a snippet of code
  - Input: x
  - Find the xi in D that is “closest” to x (the nearest neighbor)
    - (Break ties in some arbitrary fixed way)
  - Return yi
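A minimal sketch of this rule (not from the slides; it assumes the training images are stored as the rows of a NumPy array X with a matching label vector y, and uses Euclidean distance):

```python
import numpy as np

def nn_predict(X, y, x):
    """1-NN rule: return the label of the training point closest to x."""
    dists = np.linalg.norm(X - x, axis=1)  # distance from x to every row of X
    i = int(np.argmin(dists))              # nearest neighbor; argmin breaks ties by first index
    return y[i]
```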
Naïve distance between images of handwritten digits (1)
- Treat (grayscale) images as vectors in Euclidean space Rd
  - d = 28^2 = 784
  - Generalizes physical 3-dimensional space
- Each point x = (x1, ..., xd) ∈ Rd is a vector of d real numbers
- ‖x − z‖2 = √( Σ_{j=1}^{d} (xj − zj)^2 )
  - Also called ℓ2 distance (see the code sketch after the figure)
- WARNING: Here, xj refers to the j-th coordinate of x
Figure 2: Grayscale pixel representation of an image of a handwritten “4”
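As a concrete illustration (hypothetical arrays, not part of the slides), two 28×28 grayscale images can be flattened into vectors in R^784 and compared with the ℓ2 distance:

```python
import numpy as np

# Hypothetical 28x28 grayscale images with pixel intensities in [0, 255].
rng = np.random.default_rng(0)
img_a = rng.integers(0, 256, size=(28, 28)).astype(float)
img_b = rng.integers(0, 256, size=(28, 28)).astype(float)

# Flatten each image into a vector in R^784 and take the l2 (Euclidean) distance.
a, b = img_a.ravel(), img_b.ravel()
dist = np.sqrt(np.sum((a - b) ** 2))   # equivalently: np.linalg.norm(a - b)
```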
Naïve distance between images of handwritten digits (2)
- Why use this for images?
- Why not use this for images?
Recap: OCR via NN
- What is the core prediction problem?
- What features (i.e., predictive variables) are available?
- Will these features be available at time of prediction?
- Is there enough information (“training data”) to learn the relationship between the features and label?
- What are the modeling assumptions?
- Is high-accuracy prediction a useful goal for the application?
Error rate
- Error rate (on a collection of labeled examples S)
  - Fraction of labeled examples in S that have an incorrect label prediction from f̂
  - Written as err(f̂, S) (see the sketch below)
  - (Often, the word “rate” is omitted)
- OCR via NN: err(NND, D) = 0
  - (Each training example is its own nearest neighbor, so the NN rule makes no errors on its own training data, assuming no duplicate images with conflicting labels.)
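A direct translation of this definition into code (a sketch; it assumes the classifier is given as a Python function predict and the labeled examples as NumPy arrays):

```python
import numpy as np

def error_rate(predict, X, y):
    """err(f, S): fraction of labeled examples (X, y) on which the
    classifier's prediction disagrees with the given label."""
    y_hat = np.array([predict(x) for x in X])
    return float(np.mean(y_hat != y))
```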
Test error rate (1)
- Better evaluation: test error rate (see the sketch below)
  - Train/test split: S ∩ T = ∅
  - S is training data, T is test data
  - Classifier f̂ is based only on S
  - Training error rate: err(f̂, S)
  - Test error rate: err(f̂, T)
- On OCR data: test error rate is err(NNS, T) = 3.09%
  - Is this good?
  - What is the test error rate of uniformly random predictions?
  - What is the test error rate of a constant prediction?
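A self-contained sketch of the split-and-evaluate procedure (the data here are hypothetical stand-ins, not the actual OCR data):

```python
import numpy as np

def nn_predict(X_S, y_S, x):
    """1-NN prediction that only consults the training data (X_S, y_S)."""
    return y_S[int(np.argmin(np.linalg.norm(X_S - x, axis=1)))]

# Hypothetical labeled data standing in for the OCR data set.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1000, 784))
y = rng.integers(0, 10, size=1000)

# Random split into disjoint training data S and test data T.
perm = rng.permutation(len(X))
S, T = perm[:800], perm[800:]
X_S, y_S, X_T, y_T = X[S], y[S], X[T], y[T]

# Test error rate err(NN_S, T): fraction of test examples predicted incorrectly.
test_err = np.mean([nn_predict(X_S, y_S, x) != label for x, label in zip(X_T, y_T)])
print(test_err)
```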
Test error rate (2)
- Why is test error rate meaningful?
- What are the drawbacks of evaluation via test error rate?
Figure 3: A test example and its nearest neighbor in training data (2, 8)
Figure 4: A test example and its nearest neighbor in training data (3, 5)
Figure 5: A test example and its nearest neighbor in training data (5, 4)
Figure 6: A test example and its nearest neighbor in training data (4, 1)
More on the modeling assumptions
- Modeling assumption: Nearby images are more likely to have the same label than different labels.
  - This is an assumption about the choice of distance function
  - In our OCR example, this is an assumption about the choice of features
Diagnostics
- What are the kinds of errors made by NNS?
Figure 7: A test example and its nearest neighbor in training data (2, 8)
Figure 8: Three nearest neighbors of the test example (8, 2, 2)
Upgrade: k-NN
- k-nearest neighbor (k-NN) classifier NNk,D (sketched in code below)
  - Input: x
  - Find the k nearest neighbors of x in D
  - Return the plurality (most common) label among the corresponding labels
  - As before, break ties in some arbitrary fixed way
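A minimal sketch of the k-NN rule (same hypothetical layout as before: training points in the rows of a NumPy array X, labels in y):

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, x, k=3):
    """k-NN rule: return the most common label among the k training
    points closest to x (Euclidean distance)."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k nearest neighbors
    votes = Counter(y[nearest].tolist())
    return votes.most_common(1)[0][0]      # plurality label; ties broken in a fixed way
```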
Typical effect of k
- Smaller k: smaller training error rate
- Larger k: higher training error rate, but predictions more “stable” due to voting
- On OCR data: lowest test error rate achieved at k = 3

  k              1        3        5        7        9
  err(NNk,S, T)  0.0309   0.0295   0.0312   0.0306   0.0341
Hyperparameter tuning
- k is a hyperparameter of k-NN
- How to choose hyperparameters?
- Bad idea: Choose the k that yields the lowest training error rate (degenerate choice: k = 1)
- Better idea: Simulate a train/test split on the training data
- Hold-out validation (see the sketch below)
  - Randomly split S into A and B
  - Compute the validation error rate for all k ∈ {1, 3, 5, 7, 9}: Vk := err(NNk,A, B)
  - Let k̂ be the value of k for which Vk is smallest
  - Classifier to use is NNk̂,S
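A sketch of hold-out validation under the same hypothetical setup (NumPy arrays X_S, y_S holding the training data S; the 80/20 split fraction and the random seed are arbitrary choices, not from the slides):

```python
import numpy as np

def holdout_select_k(X_S, y_S, ks=(1, 3, 5, 7, 9), frac=0.8, seed=0):
    """Randomly split S into A and B, compute V_k = err(NN_{k,A}, B) for
    each candidate k, and return the k with the smallest validation error."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X_S))
    cut = int(frac * len(X_S))
    A, B = perm[:cut], perm[cut:]
    X_A, y_A, X_B, y_B = X_S[A], y_S[A], X_S[B], y_S[B]

    def knn_predict(x, k):
        nearest = np.argsort(np.linalg.norm(X_A - x, axis=1))[:k]
        labels, counts = np.unique(y_A[nearest], return_counts=True)
        return labels[np.argmax(counts)]   # plurality vote among the k neighbors

    V = {k: np.mean([knn_predict(x, k) != label for x, label in zip(X_B, y_B)])
         for k in ks}
    return min(V, key=V.get)               # k-hat; the deployed classifier is then NN_{k-hat,S}
```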
Upgrade: Distance functions (1)
- Specialize to input types
  - Edit distance for strings
  - Shape distance for images
  - Time warping distance for audio waveforms
Upgrade: Distance functions (2)
- Generic distances for vectors of real numbers
- ℓp distances (see the sketch below):
  ‖x − z‖p = ( Σ_{j=1}^{d} |xj − zj|^p )^{1/p}
- What are the unit balls for these distances (in R2)?
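A one-function sketch of this family of distances (hypothetical helper; NumPy assumed):

```python
import numpy as np

def lp_distance(x, z, p=2.0):
    """l_p distance: ||x - z||_p = (sum_j |x_j - z_j|^p)^(1/p)."""
    return float(np.sum(np.abs(x - z) ** p) ** (1.0 / p))
```

With p = 2 this recovers the Euclidean distance used earlier; p = 1 gives the "Manhattan" distance.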
Upgrade: Distance functions (3)
- Distance functions for images of handwritten digits

  distance         ℓ2       ℓ3       tangent   shape
  test error rate  0.0309   0.0283   0.0110    0.0063
Features
- When using numerical features (arranged in a vector from Rd):
  - Scale of features matters
  - Noisy features can ruin NN (see the sketch below)
    - E.g., consider what happens in the OCR example if you have another 10000 additional features that are pure “noise”
    - Or a single pure-noise feature whose scale is 10000× the nominal scale of pixel values
- “Curse of dimension”
  - Weird effects in Rd for large d
  - Can find 2^Ω(d) points that are approximately equidistant
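A small illustration of the scale issue (hypothetical numbers, not from the slides): a single pure-noise coordinate whose scale dwarfs the pixel values ends up dominating the Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "images" that are genuinely close in their 784 informative coordinates.
x = rng.uniform(0.0, 1.0, size=784)
z = x + 0.01 * rng.normal(size=784)
print(np.linalg.norm(x - z))            # small: the informative features say x and z are near

# Append one pure-noise feature at 10000x the nominal pixel scale.
x_big = np.append(x, 10000.0 * rng.normal())
z_big = np.append(z, 10000.0 * rng.normal())
print(np.linalg.norm(x_big - z_big))    # large: the single noisy coordinate dominates the distance
```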
Computation for NN
- Brute-force search: Θ(dn) time for each prediction (using Euclidean distance in Rd); see the sketch below
- Clever data structures: “improve” to 2^d log(n) time
- Approximate nearest neighbors: sub-linear time to get “approximate” answers
  - E.g., find a point among the top 1% of closest points?
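A vectorized version of the brute-force query (a sketch; each call still does the Θ(dn) work of comparing the query against every training point):

```python
import numpy as np

def nn_index_bruteforce(X, x):
    """Index of the training point nearest to x: about n*d operations per query."""
    # (X - x) broadcasts to an (n, d) array; summing squares over axis 1 gives all
    # n squared distances, whose argmin is the same as the argmin of the distances.
    sq_dists = np.sum((X - x) ** 2, axis=1)
    return int(np.argmin(sq_dists))
```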