COMP9321:
Data services engineering
Week 7: Classification
Term1, 2021
By Mortada Al-Banna, CSE UNSW
2
Machine Learning for Data Analytics
1. Define and Initialize a Model
2. Train your Model (using your training dataset)
3. Validate the Model (by prediction using your test dataset)
4. Use it: Explore or Deploy as a web service 5. Update and Revalidate
Supervised Learning
We are given input samples (X) and output samples (y) of a function y = f(X).
We would like to “learn” f, and evaluate it on new data.
• Classification: y is discrete (class labels).
• Regression: y is continuous, e.g. linear regression.
4
Classification
• SupervisedLearning
• You need the data labelled with the correct answer to train the algorithm
• Trained classifiers then can map input data to a category.
Classification Examples
5
Credit: Foundations of Data Science: Prediction and Machine Learning course offered by Berkley https://courses.edx.org/courses/course-v1:BerkeleyX+Data8.3x+2T2018/course/
6
https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/
k-Nearest Neighbour (k-NN)
The KNN classifier is a non parametric and instance-based learning
algorithm.
Non-parametric means it makes no explicit assumptions about the functional form of how the prediction is made, avoiding the dangers of mismodeling the underlying distribution of the data.
Instance-based learning means that our algorithm doesn’t explicitly learn a model. Instead, it chooses to memorize the training instances which are subsequently used as “knowledge” for the prediction phase. Concretely, this means that only when a query to our database is made (i.e. when we ask it to predict a label given an input), will the algorithm use the training instances to spit out an answer.
https://www.section.io/engineering-education/parametric-vs-nonparametric/
k-Nearest Neighbors
Given a query item:
Find k closest matches in a labeled dataset ↓
k-Nearest Neighbors
Given a query item: Return the most Find k closest matches Frequent label
k-Nearest Neighbors
k = 3 votes for “cat”
k-Nearest Neighbors
2 votes for cat,
1 each for Buffalo, Cat wins… Deer, Lion
k-Nearest Neighbour (k-NN)
11
Nearest neighbor classifiers
Basic idea:
– If it walks like a duck, quacks like a duck, then
it’s probably a duck
training samples
compute distance
test sample
choose k of the “nearest” samples
13
k- Nearest Neighbour Classifier Algorithm
Give a training set X_train with lables y_train and given a new instance x_test to be classified:
1. Find the most similar instances (let’s call then X_NN) to x_test that are in X_train.
2. Get the labels y_NN for the instances in X_NN.
3. Predict the label for x_test by combining the labels y_NN (e.g., using majority rule)
14
Nearest Neighbour Need Four things Specified
1. A distance Metric (e.g., Euclidean)
2. How many nearest neighbours to look at (e.g., Five)
3. Optional Weighting function on the neighbours points (e.g., closer points are weighted higher than farther points)
4. How to aggregate the classes of neighbours points (e.g., simple majority voting)
15
Click to add text
16
Click to add text
17
Click to add text
18
Example taken from Applied Machine Learning in Python course by the University of Michigan https://www.coursera.org/learn/python-machine-learning
Classification Data Set Example
19
Example taken from Applied Machine Learning in Python course by the University of Michigan https://www.coursera.org/learn/python-machine-learning
Data As a Table
20
Example taken from Applied Machine Learning in Python course by the University of Michigan https://www.coursera.org/learn/python-machine-learning
Training Set and Test Set
21
Always Remember to inspect your Data
Example taken from Applied Machine Learning in Python course by the University of Michigan https://www.coursera.org/learn/python-machine-learning
22
Example taken from Applied Machine Learning in Python course by the University of Michigan https://www.coursera.org/learn/python-machine-learning
Plot your Data
Choosing the k in k-Nearest Neighbors
Howdowechoose k?
Larger k may lead to better performance
But if we set k too large we may end up looking at samples that are not neighbors (are far away from the query)
We can use cross-validation to find k
Rule of thumb is k < sqrt(n), where n is the number of training examples
ZCeSmCe4l1, 1U:rt0a5s-uNne,aFreidslteNre(UigohbfTo)rs
[Slide credit: O. Veksler]
Choosing k in k-Nearest Neighbors
• The training error rate and the validation error rate are two parameters we need to assess on different K-value
ZCeSmCe4l1, 1U:rt0a5s-uNne,aFreidslteNre(UigohbfTo)rs
https://medium.com/30-days-of-machine-learning/day-3-k-nearest-neighbors-and-bias-variance-tradeoff- 75f84d515bdb
25
Generalization, Overfitting and Underfitting
• Generalization ability refers to an algorithm’s ability to give accurate predictions for new, previously unseen data.
• Assumptions:
▪ Future unseen data (test set) will have the same properties as the current training set
▪ This, models that are accurate on the training set are expected to be accurate on the test set
▪ But that may not happen if the trained model is tuned too specifically to the training set.
▪ Models that are too complex for the amount of training data available are said to overfit and are not likely to generalize well to new data instances.
▪ Models that are too simple, that do not even do well on the training data, are said to underfit and also not likely to generalize well.
26
Example taken from Applied Machine Learning in Python course by the University of Michigan https://www.coursera.org/learn/python-machine-learning
27
Nearest Neighbor
When to Consider
– Instance map to points in Rn
– Less than 20 attributes per instance – Lots of training data
Advantages
– Training is very fast
– Learn complex target functions – Do not lose information
Disadvantages
– Slow at query
– Easily fooled by irrelevant attributes
k-Nearest Neighbors is Sensitive
ZCeSmCe4l1, 1U:rt0a5s-uNne,aFreidslteNre(UigohbfTo)rs
Nearest neighbors sensitive to mis-labeled data (“class noise”). Solution? Smooth by having k nearest neighbors vote
[Pic by Olga Veksler]
k-Nearest Neighbors: Complexity
Expensive at test time: To find one nearest neighbor of a query point x, we must compute the distance to all N training examples. Complexity: O(kdN) for kNN
30/ 22
ZCeSmCe4l1, 1U:rt0a5s-uNne,aFreidslteNre(UigohbfTo)rs
Use subset of dimensions
Compute only an approximate distance (e.g., LSH)
Remove redundant data (e.g., condensing)
[Slide credit: David Claus]
k-Nearest Neighbors: Complexity
31/ 22
ZCeSmCe4l1, 1U:rt0a5s-uNne,aFreidslteNre(UigohbfTo)rs
Storage Requirements: Must store all training data Remove redundant data (e.g., condensing)
Pre-sorting often increases the storage requirements
High Dimensional Data: “Curse of Dimensionality”
Required amount of training data increases exponentially with dimension
Computational cost also increases
[Slide credit: David Claus]
Fun Example:
Where on Earth is this Photo From?
Problem: Where (e.g., which country or GPS location) was this picture taken?
32/ 22
ZCeSmCe4l1, 1U:rt0a5s-uNne,aFreidslteNre(UigohbfTo)rs
[Paper: James Hays, Alexei A. Efros. im2gps: estimating geographic information from a single image.CVPR’08.Projectpage: http://graphics.cs.cmu.edu/projects/im2gps/]
Fun Example:
Where on Earth is this Photo From?
Problem: Where (e.g., which country or GPS location) was this picture taken?
Get 6M images from Flickr with GPs info (dense sampling across world) Represent each image with meaningful features
Do kNN!
33/ 22
ZCeSmCe4l1, 1U:rt0a5s-uNne,aFreidslteNre(UigohbfTo)rs
[Paper: James Hays, Alexei A. Efros. im2gps: estimating geographic information from a single image.CVPR’08.Projectpage: http://graphics.cs.cmu.edu/projects/im2gps/]
Fun Example:
Where on Earth is this Photo From?
Problem: Where (eg, which country or GPS location) was this picture taken?
Get 6M images from Flickr with gps info (dense sampling across world) Represent each image with meaningful features
Do kNN (large k better, they use k = 120)!
34/ 22
ZCeSmCe4l1, 1U:rt0a5s-uNne,aFreidslteNre(UigohbfTo)rs
[Paper: James Hays, Alexei A. Efros. im2gps: estimating geographic information from a single image.CVPR’08.Projectpage: http://graphics.cs.cmu.edu/projects/im2gps/]
35
Machine Learning Evaluation
• There are various metrics and methods to evaluate machine learning algorithms
• They differ according to the algorithm being supervised or unsupervised and they differ according to the task
• Let’s look at some of the metrics and concepts regarding evaluation
Accuracy
• This is the simplest metric
• Number of correct predictions divided by the total number of predictions, multiplied by 100.
36
Accuracy with Imbalanced Classes
• Suppose you have two classes: o The positive class
o The negative class
• Out of 1000 randomly selected items, on average:
• One item belong to the positive class
• The rest of items (999 of them) belong to the negative class
• The Accuracy will be
37
38
Accuracy with Imbalanced Classes
• When you build a classifier to predict the items (positive or negative), you may find out that the accuracy on the test set is 99.9%.
• Be aware that this is not an actually presentation of how good your classifier is.
• For comparison, if we have a “dummy” classifier that do not consider the features at all but rather blindly predict according to the most frequent class
39
Accuracy with Imbalanced Classes
• If we use the same dataset mentioned in the previous slide (the 1000 data instance with 999 negative and 1 positive). What do you think the accuracy of the dummy classifier would be?
Answer:
Accuracy Dummy= 999/1000=99.9%
• Hence the accuracy alone sometime not a good metric to measure how good the model is
40
Dealing with Imbalanced Classes
• Data pre-processing
➢ Random Under Sampling
➢ Random Over Sampling
➢ Cluster-Based Over Sampling
➢ Synthetic Minority Over-sampling
• Select More suitable Metrics to Evaluate Imbalanced Classes
➢ Precession and Recall ➢ F1-Score
Precision and Recall
TP: True Positive FP: False Positive FN: False Negative
41
https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
Precision and Recall
42
https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
43
https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c
Precision and Recall
Confusion Matrix
https://www.analyticsvidhya.com/blog/2020/04/confusion-matrix-machine-
44 learning/#:~:text=A%20Confusion%20matrix%20is%20an,by%20the%20machine%20learning%20model.
F1 Score
• A metric which combines precision and recall
• Harmonic mean of precision and recall
• Instead of balancing precision and recall, we can just aim for a good F1-score and that would be indicative of a good Precision and a good Recall value as well
45
Useful Resources
• Mathematician hacking dating site https://www.wired.com/2014/01/how-to-hack- okcupid/
• https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors- algorithm-62214cea29c7
• https://towardsdatascience.com/building-improving-a-k-nearest-neighbors-algorithm-in- python-3b6b5320d2f8
• https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/
• https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-
clustering/
• https://heartbeat.fritz.ai/introduction-to-machine-learning-model-evaluation- fa859e1b2d7f
46
47
Questions?