
Machine learning lecture slides
COMS 4771 Fall 2020

Nearest neighbor classification

Outline
- Optical character recognition (OCR) example
- Nearest neighbor rule
- Error rate, test error rate
- k-nearest neighbor rule
- Hyperparameter tuning via cross-validation
- Distance functions, features
- Computational issues

Example: OCR for digits
- Goal: Automatically label images of handwritten digits
- Possible labels are {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
- Start with a large collection of already-labeled images
- D := {(x_1, y_1), ..., (x_n, y_n)}
- x_i is the i-th image; y_i ∈ {0, 1, ..., 9} is the corresponding label
- The National Institute of Standards and Technology (NIST) has amassed such a data set

Figure 1: Some images of handwritten digits from the MNIST data set

Nearest neighbor (NN) classifier
- Nearest neighbor (NN) classifier NN_D:
- Represented using a collection of labeled examples D := ((x_1, y_1), ..., (x_n, y_n)), plus a snippet of code
- Input: x
- Find the x_i in D that is "closest" to x (the nearest neighbor)
- (Break ties in some arbitrary fixed way)
- Return y_i (a minimal code sketch follows this list)
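A minimal sketch of the NN rule in Python (not part of the original slides; it assumes the labeled examples are stored as rows of a hypothetical NumPy array X_train with labels y_train):

```python
import numpy as np

def nn_classify(x, X_train, y_train):
    """Label x with the label of its nearest neighbor in (X_train, y_train)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every training point
    return y_train[np.argmin(dists)]             # argmin breaks ties by taking the first minimizer
```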

Naïve distance between images of handwritten digits (1)
- Treat (grayscale) images as vectors in Euclidean space R^d
- d = 28^2 = 784
- Generalizes physical 3-dimensional space
- Each point x = (x_1, ..., x_d) ∈ R^d is a vector of d real numbers
- \|x - z\|_2 = \sqrt{\sum_{j=1}^{d} (x_j - z_j)^2}
- Also called l_2 distance
- WARNING: Here, x_j refers to the j-th coordinate of x (a numerical sketch follows Figure 2 below)
Figure 2: Grayscale pixel representation of an image of a handwritten “4”
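A small illustration of this flattening and the l_2 distance, using synthetic pixel values rather than actual MNIST images:

```python
import numpy as np

rng = np.random.default_rng(0)
img_x = rng.integers(0, 256, size=(28, 28))      # stand-in 28x28 grayscale images
img_z = rng.integers(0, 256, size=(28, 28))

x = img_x.reshape(-1).astype(float)              # vectors in R^784 (d = 28^2)
z = img_z.reshape(-1).astype(float)
dist = np.sqrt(np.sum((x - z) ** 2))             # l_2 distance; same as np.linalg.norm(x - z)
```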

Naïve distance between images of handwritten digits (2)
- Why use this for images?
- Why not use this for images?

Recap: OCR via NN
- What is the core prediction problem?
- What features (i.e., predictive variables) are available?
- Will these features be available at time of prediction?
- Is there enough information ("training data") to learn the relationship between the features and label?
- What are the modeling assumptions?
- Is high-accuracy prediction a useful goal for the application?

Error rate
- Error rate (on a collection of labeled examples S)
- Fraction of labeled examples in S that have an incorrect label prediction from f̂
- Written as err(f̂, S)
- (Often, the word "rate" is omitted)
- OCR via NN: err(NN_D, D) = 0, since each training example is its own nearest neighbor (assuming no duplicate images with conflicting labels); see the sketch below
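A minimal sketch of the error-rate computation (hypothetical helper, reusing nn_classify from the earlier snippet):

```python
import numpy as np

def error_rate(classify, X, y):
    """Fraction of examples in (X, y) that classify() labels incorrectly."""
    predictions = np.array([classify(x) for x in X])
    return float(np.mean(predictions != y))

# On the training set itself, 1-NN makes no mistakes (barring conflicting duplicates):
# error_rate(lambda x: nn_classify(x, X_train, y_train), X_train, y_train) == 0.0
```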

Test error rate (1)
- Better evaluation: test error rate
- Train/test split, S ∩ T = ∅
- S is training data, T is test data
- Classifier f̂ is based only on S
- Training error rate: err(f̂, S)
- Test error rate: err(f̂, T) (a split-and-evaluate sketch follows this list)
- On OCR data: test error rate is err(NN_S, T) = 3.09%
- Is this good?
- What is the test error rate of uniformly random predictions?
- What is the test error rate of a constant prediction?
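A sketch of a random train/test split and the resulting test error rate; X and y here are synthetic stand-ins (not MNIST), and nn_classify/error_rate come from the earlier snippets:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 784))                       # synthetic "images" in R^784
y = rng.integers(0, 10, size=1000)                # synthetic digit labels

perm = rng.permutation(len(X))                    # random split: S and T are disjoint
cut = int(0.8 * len(X))
X_train, y_train = X[perm[:cut]], y[perm[:cut]]   # S (training data)
X_test, y_test = X[perm[cut:]], y[perm[cut:]]     # T (test data)

test_err = error_rate(lambda x: nn_classify(x, X_train, y_train), X_test, y_test)
print(f"err(NN_S, T) = {test_err:.4f}")
```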

Test error rate (2)
- Why is test error rate meaningful?
- What are the drawbacks of evaluation via test error rate?

Figure 3: A test example and its nearest neighbor in training data (2, 8)

Figure 4: A test example and its nearest neighbor in training data (3, 5)

Figure 5: A test example and its nearest neighbor in training data (5, 4)

Figure 6: A test example and its nearest neighbor in training data (4, 1)

More on the modeling assumptions
- Modeling assumption: Nearby images are more likely to have the same label than different labels.
- This is an assumption about the choice of distance function
- In our OCR example, this is an assumption about the choice of features

Diagnostics
- What are the kinds of errors made by NN_S?
Figure 7: A test example and its nearest neighbor in training data (2, 8)
Figure 8: Three nearest neighbors of the test example (8,2,2)

Upgrade: k-NN
- k-nearest neighbor (k-NN) classifier NN_{k,D}
- Input: x
- Find the k nearest neighbors of x in D
- Return the plurality vote of the corresponding labels
- As before, break ties in some arbitrary fixed way (see the sketch below)
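A minimal k-NN sketch along the same lines as nn_classify above (illustrative, not the course's reference implementation):

```python
from collections import Counter

import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Label x by a plurality vote among its k nearest training examples."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k closest training points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]               # plurality label; ties resolved in a fixed way
```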

Typical effect of k
- Smaller k: smaller training error rate
- Larger k: higher training error rate, but predictions more "stable" due to voting
- On OCR data: lowest test error rate achieved at k = 3

  k                   1        3        5        7        9
  err(NN_{k,S}, T)    0.0309   0.0295   0.0312   0.0306   0.0341

Hyperparameter tuning
- k is a hyperparameter of k-NN
- How to choose hyperparameters?
- Bad idea: Choose the k that yields the lowest training error rate (degenerate choice: k = 1)
- Better idea: Simulate a train/test split on the training data
- Hold-out validation (sketched in code below):
- Randomly split S into A and B
- Compute the validation error rate for all k ∈ {1, 3, 5, 7, 9}: V_k := err(NN_{k,A}, B)
- Let k̂ be the value of k for which V_k is smallest
- The classifier to use is NN_{k̂,S}
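A hold-out validation sketch under the same assumptions as before (knn_classify and error_rate from the earlier snippets; X_train, y_train play the role of S):

```python
import numpy as np

rng = np.random.default_rng(1)
perm = rng.permutation(len(X_train))
cut = int(0.8 * len(X_train))                      # split S into A (80%) and B (20%)
X_A, y_A = X_train[perm[:cut]], y_train[perm[:cut]]
X_B, y_B = X_train[perm[cut:]], y_train[perm[cut:]]

V = {k: error_rate(lambda x, k=k: knn_classify(x, X_A, y_A, k), X_B, y_B)
     for k in [1, 3, 5, 7, 9]}                     # V_k = err(NN_{k,A}, B)
k_hat = min(V, key=V.get)                          # k with the smallest validation error
# Final classifier: k-NN with k = k_hat, trained on all of S (X_train, y_train).
```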

Upgrade: Distance functions (1)
- Specialize to input types
- Edit distance for strings
- Shape distance for images
- Time warping distance for audio waveforms

Upgrade: Distance functions (2)
- Generic distances for vectors of real numbers
- l_p distances (a small sketch follows below):

  \|x - z\|_p = \left( \sum_{j=1}^{d} |x_j - z_j|^p \right)^{1/p}

- What are the unit balls for these distances (in R^2)?
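A one-function sketch of the generic l_p distance (for p = 2 it reduces to the Euclidean distance used earlier):

```python
import numpy as np

def lp_distance(x, z, p):
    """l_p distance: (sum_j |x_j - z_j|^p)^(1/p)."""
    return float(np.sum(np.abs(x - z) ** p) ** (1.0 / p))
```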

Upgrade: Distance functions (3)
- Distance functions for images of handwritten digits

  distance          l2       l3       tangent   shape
  test error rate   0.0309   0.0283   0.0110    0.0063

Features
- When using numerical features (arranged in a vector in R^d):
- Scale of features matters
- Noisy features can ruin NN
- E.g., consider what happens in the OCR example if you add another 10000 features that are pure "noise" (see the sketch below)
- Or a single pure-noise feature whose scale is 10000× the nominal scale of pixel values
- "Curse of dimension"
- Weird effects in R^d for large d
- Can find 2^{Ω(d)} points that are approximately equidistant
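A tiny synthetic demonstration of the scale issue: one pure-noise coordinate at 10000× the nominal scale swamps the l_2 distance between otherwise comparable vectors (illustrative numbers, not MNIST):

```python
import numpy as np

rng = np.random.default_rng(2)
x, z = rng.random(784), rng.random(784)          # two "images" with features on a [0, 1] scale
print(np.linalg.norm(x - z))                     # distance driven by the informative features

x_big = np.append(x, 10000 * rng.random())       # add one pure-noise feature at 10000x scale
z_big = np.append(z, 10000 * rng.random())
print(np.linalg.norm(x_big - z_big))             # now the single noise coordinate dominates
```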

Computation for NN
- Brute-force search: Θ(dn) time for each prediction (using Euclidean distance in R^d); a vectorized sketch follows this list
- Clever data structures: "improve" to 2^d · log(n) time
- Approximate nearest neighbors: sub-linear time to get "approximate" answers
- E.g., find a point among the top 1% of closest points?
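A vectorized brute-force sketch: for each query it does Θ(dn) distance work against the n training points (same assumptions on X_train, y_train as above):

```python
import numpy as np

def nn_classify_batch(X_query, X_train, y_train):
    """Brute-force NN labels for a batch of queries (Theta(d*n) work per query)."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = (np.sum(X_query ** 2, axis=1)[:, None]
          - 2.0 * X_query @ X_train.T
          + np.sum(X_train ** 2, axis=1)[None, :])
    return y_train[np.argmin(d2, axis=1)]
```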