
Lecture 3: K-Nearest Neighbors
Introduction to Machine Learning Semester 2, 2021
Copyright © University of Melbourne 2021. All rights reserved. No part of this publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.

Last time… Machine Learning concepts
• data, features, classes
• models, training
• practical considerations
Today… Our first machine learning algorithm
• K-nearest neighbors
• Application to classification
• Application to regression
Also: the topic of your first assignment!
• Released on Canvas on Wed 4th August at 7pm!
• Questions: assignment1 discussion board (don’t share solutions!)

Introduction

K-Nearest Neighbors: Example
Your ’photographic memory’ of all handwritten digits you’ve ever seen:
Given a new drawing, determine the digit by comparing it to all digits in your ’memory’.

K-Nearest Neighbors: Visualization
K nearest neighbors = K closest stored data points

K-Nearest Neighbors: Visualization
1 nearest neighbor = single closest stored data point

K-Nearest Neighbors: Visualization
4 nearest neighbors = 4 closest stored data points

K-Nearest Neighbors: Algorithm
• Store all training examples
• Compute the distance of the test instance to all stored training data points
• Find the K closest training data points (the nearest neighbors)
• Compute the target concept of the test instance based on the labels of these neighbors (a minimal sketch follows below)
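A minimal sketch of these steps in Python, assuming numeric feature vectors and Euclidean distance (introduced later in this lecture); the function name and toy data are illustrative only:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_test, k=3):
    """Predict the class of x_test by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # distance to every stored point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# toy example: two 'cat' points near (1, 1) and two 'dog' points near (5, 5)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["cat", "cat", "dog", "dog"])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 'cat'
```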

K-Nearest Neighbors: Target Concepts
KNN Classification
• Return the most common class label among the neighbors
• Example: cat vs dog images; text classification; …
KNN Regression
• Return the average value among the K nearest neighbors (a regression sketch follows below)
• Example: housing price prediction
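For regression, a minimal sketch that reuses the same neighbor search but averages the numeric targets (again assuming Euclidean distance; names and toy data are illustrative only):

```python
import numpy as np

def knn_regress(X_train, y_train, x_test, k=3):
    """Predict the target of x_test as the mean of its k nearest neighbours' targets."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    return float(np.mean(y_train[nearest]))            # average their target values

# toy housing example: feature = floor area, target = price
X_train = np.array([[50.0], [60.0], [80.0], [120.0]])
y_train = np.array([300_000, 350_000, 420_000, 600_000])
print(knn_regress(X_train, y_train, np.array([70.0]), k=2))  # mean of 350k and 420k
```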

Four problems
1. How to represent each data point?
2. How to measure the distance between data points?
3. What if the neighbors disagree?
4. How to select K?

Feature Vectors
A data set of 6 instances (a…f) with 4 features and a label
We can represent each instance as a feature vector

Feature (or attribute) Types
Recall from last lecture:
1. Nominal
• set of values with no intrinsic ordering
• possibly boolean
2. Ordinal
• explicitly ordered
3. Numerical
• real-valued, often with no upper bound, easily manipulated mathematically
• vector valued

1. How to represent each data point?
2. How to measure the distance between data points?
3. What if the neighbors disagree?
4. How to select K?

Comparing Nominal Feature Vectors
First, we convert the nominal features into numeric features.
instance   colour   shape    taste   size
apple      red      round    sweet   –
banana     yellow   curved   sweet   –
cherry     red      round    sweet   small

instance   red   yellow   round   sweet   curved   small
apple      1     0        1       1       0        ?
banana     0     1        0       1       1        ?
cherry     1     0        1       1       0        1

Comparing Nominal Features: Hamming Distance

instance   red   yellow   round   sweet   curved   small
apple      1     0        1       1       0        ?
banana     0     1        0       1       1        ?
cherry     1     0        1       1       0        1

The number of differing elements in two ‘strings’ of equal length.

d(apple, banana) = 4
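A minimal sketch in Python; the unknown ‘small’ values are dropped here, which does not change the result for this pair:

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length feature vectors differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

apple  = [1, 0, 1, 1, 0]   # red, yellow, round, sweet, curved
banana = [0, 1, 0, 1, 1]
print(hamming_distance(apple, banana))  # 4
```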

Comparing Nominal Features: Simple Matching Distance

instance   red   yellow   round   sweet   curved   small
apple      1     0        1       1       0        ?
banana     0     1        0       1       1        ?
cherry     1     0        1       1       0        1

One minus the number of matching features divided by the total number of features:

d = 1 − k/m

• d: distance
• k: number of matching features
• m: total number of features

d(apple, banana) = 1 − 2/6 = 4/6
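A minimal sketch; following the slide's count, the two unknown ‘small’ values are treated as a match:

```python
def simple_matching_distance(a, b):
    """1 minus the fraction of feature positions where the two vectors agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return 1 - matches / len(a)

apple  = [1, 0, 1, 1, 0, None]   # None stands in for the unknown 'small' value
banana = [0, 1, 0, 1, 1, None]
print(simple_matching_distance(apple, banana))  # 1 - 2/6 ≈ 0.667
```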

Comparing Nominal Feature Vectors: Jaccard Distance

instance   red   yellow   round   sweet   curved   small
apple      1     0        1       1       0        ?
banana     0     1        0       1       1        ?
cherry     1     0        1       1       0        1

Jaccard similarity J: the intersection of two sets divided by their union.

d = 1 − J
  = 1 − |A ∩ B| / |A ∪ B|
  = 1 − |A ∩ B| / (|A| + |B| − |A ∩ B|)

d(apple, banana) = 1 − 1/5 = 4/5
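A minimal sketch treating each instance as the set of features it has (value 1):

```python
def jaccard_distance(a, b):
    """1 minus |A ∩ B| / |A ∪ B| for two sets of present features."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

apple  = {"red", "round", "sweet"}
banana = {"yellow", "curved", "sweet"}
print(jaccard_distance(apple, banana))  # 1 - 1/5 = 0.8
```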

Comparing Numerical Feature Vectors: Manhattan Distance
Manhattan Distance (or: L1 distance)
• Given two instances a and b, each with a set of numerical features, e.g., a = [2.0, 1.4, 4.6, 5.5]
b = [1.0, 2.4, 6.6, 2.5]
• Their distance d is the sum of absolute differences of each feature
  d(a, b) = Σ_i |a_i − b_i|

  d(a, b) = |2.0 − 1.0| + |1.4 − 2.4| + |4.6 − 6.6| + |5.5 − 2.5|
          = 1 + 1 + 2 + 3 = 7
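The same computation as a short sketch (NumPy assumed):

```python
import numpy as np

a = np.array([2.0, 1.4, 4.6, 5.5])
b = np.array([1.0, 2.4, 6.6, 2.5])
manhattan = np.sum(np.abs(a - b))   # |1| + |-1| + |-2| + |3| = 7.0
print(manhattan)
```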

Comparing Numerical Feature Vectors: Euclidean Distance
Euclidean Distance (or: L2 distance)
• Given two instances a and b, each with a set of numerical features, e.g., a = [2.0, 1.4, 4.6, 5.5]
b = [1.0, 2.4, 6.6, 2.5]
• Their distance d is the straight-line distance in Euclidean space, defined as the square root of the sum of squared differences of each feature

  d(a, b) = √( Σ_i (a_i − b_i)² )

  d(a, b) = √( (2.0 − 1.0)² + (1.4 − 2.4)² + (4.6 − 6.6)² + (5.5 − 2.5)² )
          = √(1 + 1 + 4 + 9) = √15 ≈ 3.87
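And as a sketch (equivalent to np.linalg.norm(a - b)):

```python
import numpy as np

a = np.array([2.0, 1.4, 4.6, 5.5])
b = np.array([1.0, 2.4, 6.6, 2.5])
euclidean = np.sqrt(np.sum((a - b) ** 2))   # sqrt(15) ≈ 3.873
print(euclidean)
```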

Comparing Numerical Feature Vectors: Cosine distance
Cosine Distance
• Cosine similarity = cosine of angle between two vectors (= inner product of the normalized vectors)
• Cosine distance d: one minus cosine similarity

  cos(a, b) = (a · b) / (|a| |b|) = Σ_i a_i b_i / ( √(Σ_i a_i²) · √(Σ_i b_i²) )

  d(a, b) = 1 − cos(a, b)
• Cosine distance is normalized by the magnitude of both feature vectors, i.e., we can compare instances of different magnitude
→ word counts: compare long vs short documents
→ pixels: compare high vs low resolution images

Comparing Numerical Feature Vectors: Cosine distance
cos(a, b) = (a · b) / (|a| |b|);   d(a, b) = 1 − cos(a, b)

feature   doc1   doc2   doc3
word1     200    300    50
word2     300    200    40
word3     200    100    25

cos(doc1, doc2) = (200×300 + 300×200 + 200×100) / ( √(200² + 300² + 200²) · √(300² + 200² + 100²) )
d(doc1, doc2) ≈ 0.09

cos(doc2, doc3) = (300×50 + 200×40 + 100×25) / ( √(300² + 200² + 100²) · √(50² + 40² + 25²) )
d(doc2, doc3) ≈ 0.01
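A minimal sketch reproducing these numbers:

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two vectors."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1 - cos

doc1 = np.array([200, 300, 200])
doc2 = np.array([300, 200, 100])
doc3 = np.array([50, 40, 25])
print(round(cosine_distance(doc1, doc2), 2))  # 0.09
print(round(cosine_distance(doc2, doc3), 2))  # 0.01
```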

Comparing Ordinal Feature Vectors

Normalized Ranks
• sort the values, and return a rank r ∈ {0…m}
• map the ranks to evenly spaced values between 0 and 1
• compute a distance function for the resulting numeric features (e.g., Euclidean distance)

Example: Customer ratings (a sketch of this mapping follows below)
1. Sorted ratings: {−2, −1, 1, 2}
2. Ranks: {0, 1, 2, 3}
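A minimal sketch of the rank-and-normalize step, assuming the four rating levels shown above:

```python
def normalized_ranks(values):
    """Map ordinal values to evenly spaced numbers in [0, 1] by their rank."""
    levels = sorted(set(values))               # sorted distinct ordinal levels
    m = len(levels) - 1
    rank = {v: i / m for i, v in enumerate(levels)}
    return [rank[v] for v in values]

print(normalized_ranks([-2, -1, 1, 2]))  # [0.0, 0.333..., 0.666..., 1.0]
```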

Four problems
1. How to represent each data point?
2. How to measure the distance between data points?
3. What if the neighbors disagree?
4. How to select K?

Majority Voting

Equal weights (= majority vote)

Voting Example (k=4)
• w1 = w2 = w3 = w4 = 1
• red: 1
• blue: 1 + 1 + 1 = 3

Weighted KNN: Inverse Distance

Inverse Distance

  w_j = 1 / (d_j + ε),   with ε ≈ 0, e.g., ε = 1e−10

Voting Example (k=4)
• d1 = 0;  d2 = 1;  d3 = d4 = 1.5
• ε = 1e−5

red:  1 / (0 + ε) = 100000
blue: 1/(1 + ε) + 1/(1.5 + ε) + 1/(1.5 + ε) ≈ 1.0 + 0.67 + 0.67 = 2.34
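A minimal sketch of inverse-distance voting on this example (labels assumed from the figure: the closest neighbor is red, the other three blue):

```python
def inverse_distance_weights(dists, eps=1e-5):
    """Weight each neighbour by 1 / (distance + eps)."""
    return [1.0 / (d + eps) for d in dists]

dists  = [0.0, 1.0, 1.5, 1.5]
labels = ["red", "blue", "blue", "blue"]

votes = {}
for label, w in zip(labels, inverse_distance_weights(dists)):
    votes[label] = votes.get(label, 0.0) + w
print(votes)  # {'red': 100000.0, 'blue': ~2.33} -> red wins
```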

Weighted K-NN: Inverse Linear Distance

Inverse Linear Distance

  w_j = (d_k − d_j) / (d_k − d_1)

  d_1 = min distance among the neighbors
  d_k = max distance among the neighbors
  d_j = distance of the j-th neighbor

Voting Example (k=4)
• d1 = 0;  d2 = 1;  d3 = d4 = 1.5

red:  (1.5 − 0) / (1.5 − 0) = 1
blue: (1.5 − 1)/(1.5 − 0) + (1.5 − 1.5)/(1.5 − 0) + (1.5 − 1.5)/(1.5 − 0) ≈ 0.33 + 0 + 0 = 0.33
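And the corresponding sketch (same assumed labels as above):

```python
def inverse_linear_weights(dists):
    """w_j = (d_max - d_j) / (d_max - d_min), so the closest neighbour gets weight 1."""
    d_min, d_max = min(dists), max(dists)
    return [(d_max - d) / (d_max - d_min) for d in dists]

dists  = [0.0, 1.0, 1.5, 1.5]
labels = ["red", "blue", "blue", "blue"]

votes = {}
for label, w in zip(labels, inverse_linear_weights(dists)):
    votes[label] = votes.get(label, 0.0) + w
print(votes)  # {'red': 1.0, 'blue': ~0.33} -> red wins
```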

Four problems
1. How to represent each data point?
2. How to measure the distance between data points?
3. What if the neighbors disagree?
4. How to select K?

Selecting the value of K

Small K:
• jagged decision boundary
• we capture noise
• lower classifier performance

Large K:
• smooth decision boundary
• danger of grouping together unrelated classes
• also: lower classifier performance!
• what if K == N? (N = number of training instances)

Draw the validation error as a function of K:
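One common way to choose K is to hold out a validation set and look at the validation error for a range of K values. A hedged sketch using scikit-learn's KNeighborsClassifier on toy data (the library choice, data, and K grid are assumptions, not prescribed by the lecture):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# toy data: two noisy 2-D clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 3, 5, 11, 25, 69):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"K={k:>2}  validation error={1 - model.score(X_val, y_val):.2f}")
```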

Breaking Ties
What if more than K neighbors have the same (smallest) distance?
• select at random
• change the distance metric
What if two classes are equally likely given the current neighborhood?
• avoid an even K
• random tie breaking
• include the (K+1)-th neighbor
• use class with highest prior probability


Strengths
• Intuitive and simple
• No assumptions about the data
• Supports both classification and regression
• No training: with new data, the model can evolve and adapt immediately

Weaknesses
• How to decide on the best distance function?
• How to combine multiple neighbors?
• How to select K?
• Expensive with large (or growing) data sets

Lazy vs Eager Learning
Lazy Learning (also known as Instance-based Learning)
• store the training data
• fixed distance function
• fixed prediction rule (majority, weighting, …)
• compare test instances with stored instances
• no learning

Lazy vs Eager Learning
Eager Learning
• train a model using labelled training instances
• the model will generalize from seen data to unseen data
• use the model to predict labels for test instances
• we will look at a variety of eager models and their learning algorithms over the next couple of weeks

Today… Our first machine learning algorithm
• K-nearest neighbors
• Application to classification
• Application to regression
Also: the topic of your first assignment!
Next: Probabilities (recap) and probabilistic modeling

Further Reading
• Data Mining: Concepts and Techniques, 2nd ed., Jiawei Han and Micheline Kamber, 2006. Chapter 2, Chapter 9.5.
• The Elements of Statistical Learning, 2nd ed., Trevor Hastie, Robert Tibshirani, and Jerome Friedman. New York: Springer Series in Statistics, 2009. Chapter 2.3.2.