Machine learning lecture slides

COMS 4771 Fall 2020


Nearest neighbor classification

Outline

- Optical character recognition (OCR) example
- Nearest neighbor rule
- Error rate, test error rate
- k-nearest neighbor rule
- Hyperparameter tuning via cross-validation
- Distance functions, features
- Computational issues


Example: OCR for digits

- Goal: Automatically label images of handwritten digits
- Possible labels are {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
- Start with a large collection of already-labeled images
  - D := {(x1, y1), . . . , (xn, yn)}
  - xi is the i-th image; yi ∈ {0, 1, . . . , 9} is the corresponding label
- The National Institute of Standards and Technology (NIST) has amassed such a data set


Figure 1: Some images of handwritten digits from MNIST data set


Nearest neighbor (NN) classifier

- Nearest neighbor (NN) classifier NN_D:
  - Represented using collection of labeled examples
    D := ((x1, y1), . . . , (xn, yn)), plus a snippet of code
  - Input: x
    - Find xi in D that is “closest” to x (the nearest neighbor)
    - (Break ties in some arbitrary fixed way)
    - Return yi
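A minimal code sketch of this rule (not from the slides), assuming the examples are stored as numpy arrays and “closest” means the Euclidean distance defined on the next slide, with ties broken by taking the first minimizer:

import numpy as np

def nn_predict(X_train, y_train, x):
    """1-NN: return the label of the training point closest to x."""
    # Squared Euclidean distance from x to every training example
    dists = np.sum((X_train - x) ** 2, axis=1)
    # np.argmin returns the first (lowest-index) minimizer, a fixed tie-breaking rule
    return y_train[np.argmin(dists)]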


Naïve distance between images of handwritten digits (1)

- Treat (grayscale) images as vectors in Euclidean space R^d
  - d = 28^2 = 784
  - Generalizes physical 3-dimensional space
- Each point x = (x1, . . . , xd) ∈ R^d is a vector of d real numbers
  - ‖x − z‖2 = √( ∑_{j=1}^{d} (xj − zj)^2 )
  - Also called ℓ2 distance
  - WARNING: Here, xj refers to the j-th coordinate of x

Figure 2: Grayscale pixel representation of an image of a handwritten “4”
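As a concrete illustration (a sketch, not from the slides), with each image stored as a 28×28 numpy array, the ℓ2 distance amounts to flattening both images into vectors in R^784 and taking the usual Euclidean distance:

import numpy as np

def l2_distance(img_a, img_b):
    """Euclidean (ell_2) distance between two 28x28 grayscale images."""
    a = img_a.reshape(-1).astype(float)  # flatten to a vector in R^784
    b = img_b.reshape(-1).astype(float)
    return np.sqrt(np.sum((a - b) ** 2))  # same value as np.linalg.norm(a - b)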


Naïve distance between images of handwritten digits (2)

- Why use this for images?
- Why not use this for images?


Recap: OCR via NN

- What is the core prediction problem?
- What features (i.e., predictive variables) are available?
- Will these features be available at time of prediction?
- Is there enough information (“training data”) to learn the relationship between the features and label?
- What are the modeling assumptions?
- Is high-accuracy prediction a useful goal for the application?


Error rate

- Error rate (on a collection of labeled examples S)
  - Fraction of labeled examples in S that have incorrect label prediction from f̂
  - Written as err(f̂, S)
  - (Often, the word “rate” is omitted)
- OCR via NN: err(NN_D, D) = ?
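A minimal sketch of this computation (not from the slides), assuming the classifier is given as a Python function and the examples as parallel numpy arrays:

import numpy as np

def error_rate(classifier, X, y):
    """Fraction of examples in (X, y) whose predicted label differs from y."""
    predictions = np.array([classifier(x) for x in X])
    return np.mean(predictions != y)

For example, error_rate(lambda x: nn_predict(X_train, y_train, x), X_train, y_train) would compute err(NN_D, D) using the nn_predict sketch from the earlier slide.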


Test error rate (1)

- Better evaluation: test error rate
- Train/test split, S ∩ T = ∅
  - S is training data, T is test data
  - Classifier f̂ only based on S
  - Training error rate: err(f̂, S)
  - Test error rate: err(f̂, T)
- On OCR data: test error rate is err(NN_S, T) = 3.09%
- Is this good?
  - What is the test error rate of uniformly random predictions?
  - What is the test error rate of a constant prediction?


Test error rate (2)

- Why is test error rate meaningful?
- What are the drawbacks of evaluation via test error rate?


Figure 3: A test example and its nearest neighbor in training data (2, 8)

Figure 4: A test example and its nearest neighbor in training data (3, 5)

Figure 5: A test example and its nearest neighbor in training data (5, 4)

Figure 6: A test example and its nearest neighbor in training data (4, 1)

More on the modeling assumptions

- Modeling assumption: Nearby images are more likely to have the same label than different labels.
  - This is an assumption about the choice of distance function
  - In our OCR example, this is an assumption about the choice of features


Diagnostics

- What are the kinds of errors made by NN_S?

Figure 7: A test example and its nearest neighbor in training data (2, 8)

Figure 8: Three nearest neighbors of the test example (8,2,2)

Upgrade: k-NN

- k-nearest neighbor (k-NN) classifier NN_{k,D}:
  - Input: x
    - Find the k nearest neighbors of x in D
    - Return the plurality of the corresponding labels
      - As before, break ties in some arbitrary fixed way
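A minimal sketch (not from the slides) extending the earlier 1-NN code to k neighbors with a plurality vote, again assuming numpy arrays and Euclidean distance:

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """k-NN: plurality vote among the k closest training points."""
    dists = np.sum((X_train - x) ** 2, axis=1)      # squared Euclidean distances
    nearest = np.argsort(dists)[:k]                 # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                # ties broken toward the smallest label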


Typical effect of k

- Smaller k: smaller training error rate
- Larger k: higher training error rate, but predictions more “stable” due to voting
- On OCR data: lowest test error rate achieved at k = 3

  k                  1       3       5       7       9
  err(NN_{k,S}, T)   0.0309  0.0295  0.0312  0.0306  0.0341


Hyperparameter tuning

- k is a hyperparameter of k-NN
- How to choose hyperparameters?
  - Bad idea: Choosing k that yields lowest training error rate (degenerate choice: k = 1)
  - Better idea: Simulate train/test split on the training data
- Hold-out validation
  - Randomly split S into A and B
  - Compute validation error rate for all k ∈ {1, 3, 5, 7, 9}:
    V_k := err(NN_{k,A}, B)
  - Let k̂ be the value of k for which V_k is smallest
  - Classifier to use is NN_{k̂,S}
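A minimal sketch of this hold-out procedure (not from the slides), reusing the knn_predict and error_rate sketches above; the 80/20 split and the random seed are illustrative assumptions:

import numpy as np

def tune_k(X_train, y_train, ks=(1, 3, 5, 7, 9), holdout_fraction=0.2, seed=0):
    """Hold-out validation: split S into A (train) and B (validation); pick the k minimizing V_k."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X_train))
    n_b = int(holdout_fraction * len(X_train))
    b_idx, a_idx = perm[:n_b], perm[n_b:]
    X_a, y_a, X_b, y_b = X_train[a_idx], y_train[a_idx], X_train[b_idx], y_train[b_idx]
    # V_k = err(NN_{k,A}, B) for each candidate k
    v = {k: error_rate(lambda x, k=k: knn_predict(X_a, y_a, x, k), X_b, y_b) for k in ks}
    return min(v, key=v.get)  # k-hat; the final classifier is then k-NN with this k on all of S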


Upgrade: Distance functions (1)

- Specialize to input types
  - Edit distance for strings
  - Shape distance for images
  - Time warping distance for audio waveforms
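As one concrete example (not from the slides), a standard dynamic-programming computation of the edit (Levenshtein) distance between two strings:

def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))                # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]                                # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]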


Upgrade: Distance functions (2)

- Generic distances for vectors of real numbers
  - ℓp distances

    ‖x − z‖p = ( ∑_{j=1}^{d} |xj − zj|^p )^{1/p}

- What are the unit balls for these distances (in R^2)?
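A brief sketch (not from the slides) of this formula; for p ≥ 1 it matches numpy's built-in np.linalg.norm with the ord argument:

import numpy as np

def lp_distance(x, z, p=2):
    """ell_p distance between two vectors x and z."""
    return np.sum(np.abs(x - z) ** p) ** (1.0 / p)

x, z = np.array([1.0, 2.0]), np.array([4.0, 6.0])
assert np.isclose(lp_distance(x, z, p=3), np.linalg.norm(x - z, ord=3))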


Upgrade: Distance functions (3)

- Distance functions for images of handwritten digits

  distance         ℓ2      ℓ3      tangent  shape
  test error rate  0.0309  0.0283  0.0110   0.0063


Features

- When using numerical features (arranged in a vector from R^d):
  - Scale of features matters
  - Noisy features can ruin NN
    - E.g., consider what happens in OCR example if you have another 10000 additional features that are pure “noise”
    - Or a single pure noise feature whose scale is 10000× the nominal scale of pixel values
- “Curse of dimension”
  - Weird effects in R^d for large d
  - Can find 2^Ω(d) points that are approximately equidistant
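A small illustration of the scale issue (not from the slides; the data here are made up): one pure-noise feature at 10000× the scale dominates the Euclidean distance, so the nearest neighbor is decided by noise alone:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 784))                  # stand-in "images" with features in [0, 1]
noise = rng.random((1000, 1)) * 10000.0      # one pure-noise feature at 10000x the scale
X_bad = np.hstack([X, noise])

x, x_bad = X[0], X_bad[0]
nn_clean = 1 + np.argmin(np.sum((X[1:] - x) ** 2, axis=1))
nn_noisy = 1 + np.argmin(np.sum((X_bad[1:] - x_bad) ** 2, axis=1))
print(nn_clean, nn_noisy)                    # typically different: the noise column decides nn_noisy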


Computation for NN

- Brute force search: Θ(dn) time for each prediction (using Euclidean distance in R^d)
- Clever data structures: “improve” to 2^d log(n) time
- Approximate nearest neighbors: sub-linear time to get “approximate” answers
  - E.g., find point among the top-1% of closest points?
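A brief sketch (not from the slides) of how this trade-off appears in practice: scikit-learn's KNeighborsClassifier exposes brute-force search and tree-based search via its algorithm parameter (the data below are made up):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.random((5000, 784))
y_train = rng.integers(0, 10, size=5000)     # made-up labels, just for illustration

# algorithm='brute' scans all n training points per query (the Theta(dn) cost above);
# algorithm='kd_tree' or 'ball_tree' answers queries with a tree data structure instead.
clf = KNeighborsClassifier(n_neighbors=3, algorithm='brute').fit(X_train, y_train)
print(clf.predict(X_train[:5]))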

