
DATA7703 Machine Learning for Data Scientists
Lecture 2 – Supervised Learning
Classification
Anders Eriksson Aug 11, 2020

Lecture Schedule

Week  Date    Topic
1     Aug-4   Introduction – Basic Concepts of Machine Learning
2     Aug-11  Classification – Sorting Fish
3     Aug-18  Regression – Predicting House Prices
4     Aug-25  PCA & LDA – Recognising Faces
5     Sep-1   Clustering and Similarity
6     Sep-8   Support Vector Machines & Kernel Methods
7     Sep-15  Ensemble Methods: Random Forest and Boosting
8     Sep-22  The Perceptron
              Mid-Semester Break
9     Oct-6   Convolutional Neural Networks
10    Oct-13  Adversarial Examples
11    Oct-20  Interpreting ML Algorithms
12    Oct-27  Course Summary

Assessment items during the semester: Assignment 1 due, Projects announced, Assignment 2 due, Mid-term exam, Assignment 3 due, Assignment 4 due.
Lecture 2 – Supervised Learning – Classification 2

Last Week
Introduction

Modern Machine Learning
[Figure: example result of a modern machine learning method – original photo, reference photo, result.]
Lecture 2 – Supervised Learning – Classification

Machine Learning is…
"Machine learning is about predicting the future based on the past." — Hal Daumé III
[Diagram: Training – a model/predictor is learned from training data (the past). Testing – the learned model/predictor is then used to predict on testing data (the future).]
Lecture 2 – Supervised Learning – Classification

Why “Learn”?
• Learning is used when:
  • human expertise does not exist (navigating on Mars),
  • humans are unable to explain their expertise (speech recognition),
  • the solution changes over time (routing on a computer network),
  • the solution needs to be adapted to particular cases (user biometrics).
What is a cat? What makes a 2?
Lecture 2 – Supervised Learning – Classification 7

Types of Machine Learning Problems
Supervised
Classification: output is a discrete variable (e.g., cat/dog)
Regression: output is continuous (e.g., price, temperature)
Unsupervised
Reinforcement
Lecture 2 – Supervised Learning – Classification

Types of Machine Learning Problems
Supervised Classification
Unsupervised
Example: Recognising Faces (training images and test images)
ORL dataset, AT&T Laboratories, Cambridge UK
Lecture 2 – Supervised Learning – Classification

Types of Machine Learning Problems
Supervised
Useful for learning structure in the data (clustering), finding hidden correlations, reducing dimensionality, etc.
Unsupervised
Lecture 2 – Supervised Learning – Classification 10

Supervised Learning
Classification

Supervised Machine Learning
• Training samples (or examples): x1, x2, …, xn
• Each example xi is typically multi-dimensional
  • xi1, xi2, …, xid are called features; xi is often called a feature vector
  • Example: x1 = [3, 7, 35], x2 = [5, 9, 47], …
  • How many and which features do we take?
• We know the desired output for each example: y1, y2, …, yn
  • This learning is supervised (a "teacher" gives the desired outputs)
  • yi are often one-dimensional
  • Example: y1 = 1 ("face"), y2 = 0 ("not a face")

Lecture 2 – Supervised Learning – Classification

Two Types of Supervised Machine Learning
• Classification
○ yi takes values in a finite set; each value is called a label or a class
○ Example: yi ∈ {"sunny", "cloudy", "raining"}
• Regression
○ yi is continuous, called an output value
○ Example: yi = temperature ∈ [−60, 60]
Lecture 2 – Supervised Learning – Classification

Example : Fish sorting
[Figure: the two fish classes to be sorted, Salmon and Sea Bass.]
Lecture 2 – Supervised Learning – Classification

Example : Fish sorting – Classifier design
• Notice that salmon tend to be shorter than sea bass.
• Use fish length as the discriminating feature.
• Count the number of bass and salmon of each length.
Lecture 2 – Supervised Learning – Classification

Example : Fish sorting – Length Classifier
• Find the best length threshold L.
• For example, at L = 5, we misclassify:
  • 1 sea bass
  • 16 salmon
• Classification error (total error): 17/50 = 34%
Lecture 2 – Supervised Learning – Classification

Example : Fish sorting – Length Classifier
After searching through all possible thresholds L, the best is L = 9, and still 20% of the fish are misclassified.
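To make the threshold search concrete, here is a minimal sketch (not from the slides) that scans candidate length thresholds and reports the one with the lowest training error; the fish lengths and labels below are made-up toy values, not the 50 fish from the figure:

```python
import numpy as np

# Hypothetical toy data: fish lengths and labels (0 = salmon, 1 = sea bass).
lengths = np.array([2, 3, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12])
labels  = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1,  1,  1,  1])

def error_at_threshold(L):
    """Classify fish with length < L as salmon (0), otherwise sea bass (1)."""
    predictions = (lengths >= L).astype(int)
    return np.mean(predictions != labels)

# Try every observed length as a candidate threshold and keep the best one.
candidates = np.unique(lengths)
errors = [error_at_threshold(L) for L in candidates]
best = candidates[int(np.argmin(errors))]
print(f"best threshold L = {best}, training error = {min(errors):.0%}")
```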
Lecture 2 – Supervised Learning – Classification

Example : Fish sorting – Lesson Learned?
• Length alone is a poor feature.
• What to do?
• Try another feature.
• Salmon tend to be lighter.
• Try average fish lightness.
Lecture 2 – Supervised Learning – Classification

Example : Fish sorting – Lightness classifier
Now the fish are best classified at a lightness threshold of L = 3.5, with a classification error of 8%.
Lecture 2 – Supervised Learning – Classification

Example : Fish sorting – Combining Features
• Use both length and lightness features
• Feature vector: [length, lightness]
• Classification error: 4%
Lecture 2 – Supervised Learning – Classification

Example : Fish sorting – Even better
• Decision boundary (wiggly) with 0% classification error.
Lecture 2 – Supervised Learning – Classification

Example : Fish sorting – On New Data
• The goal is for the classifier to perform well on new data.
• Test the "wiggly" classifier on new data: 25% error.
Lecture 2 – Supervised Learning – Classification

Example : Fish sorting – What went wrong?
• We have only a limited amount of data for training.
• We should ensure the decision boundary does not adapt too closely to the particulars of the training data, but grasps the "big picture".
• Complex boundaries overfit the data, i.e. they are too tuned to the training data.
Lecture 2 – Supervised Learning – Classification

Generalization
• The ability to produce correct outputs on previously unseen examples is called generalization
• The big question of learning theory: how to get good generalization with a limited number of examples
• Intuitive idea: favour simpler classifiers
• William of Occam (1284-1347): “entities are not to be multiplied without necessity”
• A simpler decision boundary may not fit the training data perfectly, but it tends to generalize better to new data
Lecture 2 – Supervised Learning – Classification

Underfitting -> Overfitting
• You can also underfit the data.
Lecture 2 – Supervised Learning – Classification

Sketch of Supervised Machine Learning
• Choose a function f(x, w)
  • w are tunable weights
  • x is the input sample
  • f(x, w) should output the correct class of sample x
  • use labeled samples to tune the weights w so that f(x, w) gives the correct label for sample x
• Which function f(x, w) do we choose?
  • different choices will lead to decision boundaries of different complexities
  • it has to be expressive enough to model our problem well, i.e. to avoid underfitting
  • yet not too complicated, to avoid overfitting
  • other issues: computational requirements, speed, …
• f(x, w) is sometimes called a learning machine
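As one concrete, deliberately simple choice of f(x, w), the sketch below (illustrative only, not the specific learning machine used in the lecture) thresholds a weighted sum and tunes the weights w with a perceptron-style update on hypothetical labelled samples:

```python
import numpy as np

def f(x, w):
    """A simple learning machine: predict class 1 if w . [1, x1, x2] > 0, else class 0."""
    return int(np.dot(w, np.concatenate(([1.0], x))) > 0)

# Hypothetical, already-centred 2D feature vectors and their labels.
X = np.array([[-2.0, -1.0], [-1.0, -2.0], [2.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

w = np.zeros(X.shape[1] + 1)            # tunable weights, including a bias term

# Tune w on the labelled samples so that f(x, w) gives the correct label
# (perceptron update rule; converges here because the data are separable).
for _ in range(20):
    for xi, yi in zip(X, y):
        w += (yi - f(xi, w)) * np.concatenate(([1.0], xi))

print(w, [f(xi, w) for xi in X])        # learned weights and training predictions
```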
Lecture 2 – Supervised Learning – Classification

Classifiers
• Simple Classifiers
• K-Nearest Neighbour (kNN)
• Naïve Bayes
• Decision trees
Lecture 2 – Supervised Learning – Classification

Simple Classifiers
k-Nearest Neighbours (kNN)

k-Nearest Neighbors (kNN)
• kNN is one of the simplest classifiers available.
• Classify an unknown example with the most common class among its k closest examples.
• "Tell me who your neighbors are, and I'll tell you who you are."
Example: k = 3, the 3 nearest neighbours are 2 salmon and 1 sea bass, so classify as salmon.
[Figure: lightness vs. length scatter plot with the unknown example marked "?".]
Lecture 2 – Supervised Learning – Classification
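A minimal kNN implementation matching this description (an illustrative sketch; the [length, lightness] training points are hypothetical, not the fish from the figure):

```python
import numpy as np
from collections import Counter

def knn_classify(query, X, y, k=3):
    """Return the most common label among the k training points closest to query."""
    distances = np.linalg.norm(X - query, axis=1)      # Euclidean distances
    nearest = np.argsort(distances)[:k]                # indices of the k closest examples
    return Counter(y[nearest]).most_common(1)[0][0]    # majority vote

# Hypothetical training data: [length, lightness] feature vectors.
X_train = np.array([[4, 2], [5, 3], [6, 2.5], [9, 6], [10, 7], [11, 6.5]])
y_train = np.array(["salmon", "salmon", "salmon", "sea bass", "sea bass", "sea bass"])

print(knn_classify(np.array([5.5, 3.5]), X_train, y_train, k=3))  # -> "salmon"
```

With k = 3 the three closest training points here are all salmon, so the majority vote returns salmon, as in the example above.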

kNN – Multiple Classes
• Easy to implement for multiple classes.
Example: k = 5, with 3 fish species (salmon, sea bass, eel).
• The 5 nearest neighbours are 3 sea bass, 1 eel and 1 salmon ⇒ classify as sea bass (the majority class).
[Figure: lightness vs. length scatter plot with the unknown example marked "?".]
Lecture 2 – Supervised Learning – Classification

kNN – Visualisation
• For k = 1, the Voronoi tessellation is a useful visualisation.
Lecture 2 – Supervised Learning – Classification

kNN: How to Choose k?
• In theory, if an infinite number of samples is available, the larger k is, the better the classification
• But the caveat is that all k neighbours have to be close:
  • possible when an infinite number of samples is available,
  • impossible in practice, since the number of samples is finite.
Lecture 2 – Supervised Learning – Classification

kNN: How to Choose k?
• A rule of thumb is k = sqrt(n), where n is the number of examples
  o this has interesting theoretical properties
• In practice, k = 1 is often used for efficiency, but it can be sensitive to "noise"
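In practice, k is usually chosen by trying several values and measuring validation accuracy; cross-validation, mentioned on a later slide, automates this. A minimal sketch, assuming scikit-learn is available and using synthetic data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic two-class data: two Gaussian blobs in 2D.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Evaluate a few candidate values of k with 5-fold cross-validation.
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k = {k:2d}: mean accuracy = {scores.mean():.3f}")
```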
Lecture 2 – Supervised Learning – Classification

kNN: How to Choose k?
• Larger k gives smoother boundaries, which is better for generalization.
• But only if locality is preserved: locality is not preserved if we end up looking at samples too far away, not from the same class.
• Interesting theoretical properties if k < sqrt(n), where n is the number of examples.
• k can be chosen through a method called cross-validation.
Lecture 2 – Supervised Learning – Classification

kNN: How Well Does it Work?
• kNN is simple and intuitive, but does it work?
• If we have lots of samples, kNN works well!
• Imagine that you have access to all past, current and future samples...
Lecture 2 – Supervised Learning – Classification

kNN – Summary
Advantages
• Can be applied to data from any distribution
  o for example, the data does not have to be separable with a linear boundary
• Very simple and intuitive
• Good classification if the number of samples is large enough
Disadvantages
• Choosing k may be tricky
• The test stage is computationally expensive
  o There is no training stage; all the work is done during the test stage.
  o This is the opposite of what we want: usually we can afford a training step that takes a long time, but we want a fast test step.
• Needs a large number of samples for accuracy
Lecture 2 – Supervised Learning – Classification

Classifiers
• Simple Classifiers
  • K-Nearest Neighbour (kNN)
  • Naïve Bayes
  • Decision trees
Lecture 2 – Supervised Learning – Classification

Simple Classifiers
Naïve Bayes

Naïve Bayes Classifier
A simple ("naïve") classification method based on Bayes rule.
Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier – Example
[Figure: two insect classes, Grasshoppers and Katydids.]
Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier – Example
With a lot of data, we can build a histogram. Let us just build one for "Antenna Length" for now...
Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier – Example
[Figure: histograms of antenna length for the two classes.]
Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier – Example
• We want to classify an insect we have found. Its antennae are 3 units long. How can we classify it?
• We can just ask ourselves: given the distributions of antennae lengths we have seen, is it more probable that our insect is a Grasshopper or a Katydid (bush cricket)?
• There is a formal way to discuss the most probable classification...
• p(cj | d) = probability of class cj, given that we have observed d.
P(Grasshopper | 3) = ?   P(Katydid | 3) = ?
Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier – Example
• P(Grasshopper | 3) = 10 / (10 + 2) = 0.833
• P(Katydid | 3) = 2 / (10 + 2) = 0.166
Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier – Example
• P(Grasshopper | 7) = 3 / (3 + 9) = 0.250
• P(Katydid | 7) = 9 / (3 + 9) = 0.750
Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier – Example
• P(Grasshopper | 5) = 6 / (6 + 6) = 0.500
• P(Katydid | 5) = 6 / (6 + 6) = 0.500
Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier
• A basic example of the Naïve Bayes classifier, also called Idiot Bayes or Simple Bayes.
• Find the probability of the previously unseen instance belonging to each class, then simply pick the most probable class.
Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier
Example: cj = "bush cricket", d = "antennae length = 3 cm"
Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier – Drew?
Assume that we have two classes, c1 = male and c2 = female.
We have a person named "Drew" whose gender we do not know.
Classifying Drew as male or female is equivalent to asking: is it more probable that Drew is male or female, i.e. which is greater, p(male | Drew) or p(female | Drew)?
[Figure: Drew Barrymore and Drew Carey.]
What is the probability of being called "Drew" given that you are a male?

  p(male | Drew) = p(Drew | male) p(male) / p(Drew)

What is the probability of being a male?
What is the probability of being named "Drew"? (Actually irrelevant, since it is the same for all classes.)
Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier – Drew?
This is Officer Drew. Is Officer Drew a Male or Female?
Luckily, we have a small database with names and gender. We can use it to apply Bayes rule:

  p(cj | d) = p(d | cj) p(cj) / p(d)

  Name     Gender
  Drew     Male
  Claudia  Female
  Drew     Female
  Drew     Female
  Alberto  Male
  Karin    Female
  Nina     Female
  Sergio   Male

  p(male | Drew)   = p(Drew | male) p(male) / p(Drew)     = (1/3 * 3/8) / (3/8) = 0.125 / (3/8)
  p(female | Drew) = p(Drew | female) p(female) / p(Drew) = (2/5 * 5/8) / (3/8) = 0.250 / (3/8)

Officer Drew is more likely to be a Female.
Lecture 2 – Supervised Learning – Classification
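The Officer Drew calculation can be reproduced directly from the name/gender table; the sketch below (not part of the slides) compares the numerators p(Drew | c) p(c), since p(Drew) is the same for both classes:

```python
from collections import Counter

names   = ["Drew", "Claudia", "Drew", "Drew", "Alberto", "Karin", "Nina", "Sergio"]
genders = ["Male", "Female", "Female", "Female", "Male", "Female", "Female", "Male"]

n = len(names)
gender_counts = Counter(genders)          # {"Male": 3, "Female": 5}

# p(Drew) is the same for both classes, so comparing the numerators
# p(Drew | c) * p(c) is enough to pick the more probable class.
for c in ["Male", "Female"]:
    drew_in_class = sum(1 for nm, g in zip(names, genders) if nm == "Drew" and g == c)
    p_drew_given_c = drew_in_class / gender_counts[c]    # 1/3 for male, 2/5 for female
    p_c = gender_counts[c] / n                           # 3/8 for male, 5/8 for female
    print(f"p(Drew | {c}) * p({c}) = {p_drew_given_c * p_c:.3f}")
# -> 0.125 for Male, 0.250 for Female: Officer Drew is more likely to be female.
```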
Naïve Bayes Classifier – Multiple Features
• So far we have only considered Bayes classification when we have one attribute (the "antennae length", or the name).
• But we may have many features ("antennae length", "colour", "weight").
• How do we use all the features?
Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier – Multiple Features
Officer Drew is blue-eyed, over 170 cm tall, and has long hair.
Is Officer Drew a Male or Female?

  Name     Over 170cm  Eye    Hair length  Sex
  Drew     No          Blue   Short        Male
  Claudia  Yes         Brown  Long         Female
  Drew     No          Blue   Long         Female
  Drew     No          Blue   Long         Female
  Alberto  Yes         Brown  Short        Male
  Karin    No          Blue   Long         Female
  Nina     Yes         Brown  Short        Female
  Sergio   Yes         Blue   Long         Male

Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier – Multiple Features
To simplify the task, naïve Bayesian classifiers assume the attributes have independent distributions, and thereby estimate

  p(d | cj) = p(d1 | cj) * p(d2 | cj) * ... * p(dn | cj)

Officer Drew is blue-eyed, over 170 cm tall, and has long hair:
  d = {"Drew", "blue eyes", "taller than 170 cm", "long hair"}
  p(d | cj) = p(over_170cm = yes | cj) * p(eyes = blue | cj) * ...
  p(d | female) = 2/5 * 3/5 * ...
  p(d | male)   = 2/3 * 2/3 * ...
Lecture 2 – Supervised Learning – Classification

Naïve Bayes Classifier – Multiple Features
Assume the attributes have independent distributions, and thereby estimate, for example:
  cj = "bush cricket", d = {"antennae length = 3 cm", "colour = green", "weight = 25 grams"}
Lecture 2 – Supervised Learning – Classification

Add-1 Laplace Smoothing
What if there had been no training cases where a female had blue eyes?
• That would have led to P(blue | female) = 0, which would have eliminated any other evidence.
Solution: pretend you saw every outcome once more than you actually did:

  P_LAP(x | c) = (count(X = x, C = c) + 1) / Σ_i (count(X = xi, C = c) + 1)
               = (count(X = x, C = c) + 1) / (count(C = c) + |X|)

Slightly more general version:

  P_LAP,α(x | c) = (count(X = x, C = c) + α) / (count(C = c) + α |X|),   α: extent of smoothing
Lecture 2 – Supervised Learning – Classification

Add-1 Laplace Smoothing
What if there had been no training cases where a female had blue eyes?
• That would have led to P(blue | female) = 0, which would have overwhelmed any other evidence.
Solution: pretend you saw every outcome once more than you actually did:

  P(blue | female) = count(blue-eyed female) / [count(blue-eyed female) + count(brown-eyed female)]

  P_LAP(blue | female) = (count(blue-eyed female) + 1) / [(count(blue-eyed female) + 1) + (count(brown-eyed female) + 1)]   (> 0)

Lecture 2 – Supervised Learning – Classification 61
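Combining the independence assumption with add-1 smoothing, the sketch below (not from the slides) scores Officer Drew's features against the table above; note that with smoothing the per-feature probabilities differ slightly from the unsmoothed 2/5 and 3/5 shown on the slide:

```python
from collections import Counter

# The database from the slide: (over_170cm, eye colour, hair length, sex).
data = [
    ("No",  "Blue",  "Short", "Male"),
    ("Yes", "Brown", "Long",  "Female"),
    ("No",  "Blue",  "Long",  "Female"),
    ("No",  "Blue",  "Long",  "Female"),
    ("Yes", "Brown", "Short", "Male"),
    ("No",  "Blue",  "Long",  "Female"),
    ("Yes", "Brown", "Short", "Female"),
    ("Yes", "Blue",  "Long",  "Male"),
]
feature_names = ["over_170cm", "eye", "hair"]
classes = Counter(row[-1] for row in data)                 # {"Male": 3, "Female": 5}

# Distinct values per feature, needed for the |X| term in Laplace smoothing.
values = [set(row[i] for row in data) for i in range(3)]

def smoothed(feature_idx, value, c, alpha=1):
    """P_LAP(X = value | C = c) with add-alpha smoothing."""
    count = sum(1 for row in data if row[feature_idx] == value and row[-1] == c)
    return (count + alpha) / (classes[c] + alpha * len(values[feature_idx]))

# Officer Drew: over 170 cm, blue-eyed, long hair.
query = {"over_170cm": "Yes", "eye": "Blue", "hair": "Long"}

for c in classes:
    # Naive Bayes: p(c | d) is proportional to p(c) * product of p(d_i | c).
    score = classes[c] / len(data)
    for i, name in enumerate(feature_names):
        score *= smoothed(i, query[name], c)
    print(f"score({c}) = {score:.4f}")   # Female scores higher than Male
```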

Naïve Bayes Classifier – Summary
• Advantages
• Fast to train (single scan)
• Fast to classify
• Not sensitive to irrelevant features
• Handles real and discrete data
• Disadvantages
• Assumes independence of features
Lecture 2 – Supervised Learning – Classification

Classifiers
• Simple Classifiers
• K-Nearest Neighbour (kNN)
• Naïve Bayes
• Decision trees
Lecture 2 – Supervised Learning – Classification

Simple Classifiers
Decision Trees

Decision Tree
• A hierarchical tree structure used as a classifier.
• Based on a series of questions (or rules) about the attributes of the class.
• The attributes can be any type of variable: binary, nominal, ordinal, or quantitative.
Example: Predict fuel efficiency of a car
Lecture 2 – Supervised Learning – Classification

Decision Tree
• Easy to understand the decision process.
• Can handle both discrete and continuous features.
• Highly flexible: decision trees can represent increasingly complex decision boundaries as the depth (or number of nodes) of the tree increases.
Example: Predict fuel efficiency of a car
Lecture 2 – Supervised Learning – Classification


Decision Tree
A decision tree has 2 kinds of nodes:
o Each leaf node has a class label, determined by majority vote of the training examples reaching that leaf.
o Each internal node is a question on features; it branches out according to the answers.
Example: Predict fuel efficiency of a car
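As a data-structure sketch of these two kinds of nodes (illustrative only; the tiny fuel-efficiency tree below is made up, not the one from the slides):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Internal nodes ask a question on a feature; leaves carry a class label.
    feature: Optional[str] = None          # e.g. "cylinders"
    threshold: Optional[float] = None      # question: x[feature] < threshold ?
    left: Optional["Node"] = None          # branch for "yes"
    right: Optional["Node"] = None         # branch for "no"
    label: Optional[str] = None            # set only on leaf nodes

def predict(node, x):
    """Follow the questions down the tree until a leaf label is reached."""
    if node.label is not None:
        return node.label
    branch = node.left if x[node.feature] < node.threshold else node.right
    return predict(branch, x)

# A hypothetical two-level tree: small engines -> "good" fuel efficiency.
tree = Node(feature="cylinders", threshold=5,
            left=Node(label="good"),
            right=Node(feature="weight", threshold=1500,
                       left=Node(label="medium"),
                       right=Node(label="bad")))

print(predict(tree, {"cylinders": 4, "weight": 1200}))   # -> "good"
print(predict(tree, {"cylinders": 8, "weight": 1800}))   # -> "bad"
```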
Lecture 2 – Supervised Learning – Classification

Decision Tree – Choosing the attributes
• How do we find a decision tree that agrees with the training data?
• We could just choose a tree that has one path to a leaf for each example, but this just memorizes the observations (assuming the data are consistent).
• We want it to generalize to new examples.
• Ideally, the best attribute would partition the data into positive and negative examples.
• Learning decision trees is hard: finding the optimal tree for arbitrary data is NP-hard.
• Strategy (greedy):
  • choose the attributes that give the best partition first
  • we want correct classification with the fewest tests
Lecture 2 – Supervised Learning – Classification

Decision Tree – Choosing the attributes
• How do we decide which attribute or value to split on?
• When should we stop splitting?
• What do we do when we can’t achieve perfect classification?
• What if tree is too large? Can we approximate with a smaller tree?
Lecture 2 – Supervised Learning – Classification

Decision Tree – Basic Algorithm
Basic algorithm for learning decision trees:
1. Start with the whole training data.
2. Select the attribute or value along the dimension that maximises the Information Gain.
3. Create child nodes based on the split.
4. Recurse on each child using the child's data, until a stopping criterion is reached:
   ■ all examples have the same class,
   ■ the amount of data is too small,
   ■ the tree is too large.
Lecture 2 – Supervised Learning – Classification

Decision Tree – Measuring information
• A convenient measure to use is based on information theory.
• How much "information" does an attribute give us about the class?
  • attributes that partition the data perfectly should give maximal information
  • unrelated attributes should give no information
• Entropy H(Y) of a random variable Y:
  H(Y) = - Σ_y P(Y = y) log2 P(Y = y)
• More uncertainty, more entropy!
Lecture 2 – Supervised Learning – Classification

Decision Tree
• High Entropy
○ Y is from a uniform like distribution
○ Flat histogram
○ Values sampled from it are less predictable
• Low Entropy
○ Y is from a varied (peaks and valleys) distribution
○ Histogram has many lows and highs
○ Values sampled from it are more predictable
Lecture 2 – Supervised Learning – Classification

Decision Tree
• Idea: select the attribute that decreases the entropy (uncertainty) the most after splitting
• Maximise the Information Gain
Lecture 2 – Supervised Learning – Classification

From Entropy to Information Gain
Entropy H(X) of a random variable X:
  H(X) = - Σ_x P(X = x) log2 P(X = x)
Specific conditional entropy H(X | Y = v) of X given Y = v:
  H(X | Y = v) = - Σ_x P(X = x | Y = v) log2 P(X = x | Y = v)
Conditional entropy H(X | Y) of X given Y:
  H(X | Y) = Σ_v P(Y = v) H(X | Y = v)
Mutual information (aka Information Gain) of X and Y:
  I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)
Lecture 2 – Supervised Learning – Classification
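These quantities can be computed directly from counts; a minimal sketch (not from the slides), using the base-2 logarithm convention:

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_i p_i * log2(p_i) over the empirical distribution of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(x_labels, y_values):
    """I(X; Y) = H(X) - H(X | Y), with H(X | Y) = sum_v P(Y = v) * H(X | Y = v)."""
    n = len(x_labels)
    h_x_given_y = 0.0
    for v in set(y_values):
        subset = [x for x, y in zip(x_labels, y_values) if y == v]
        h_x_given_y += (len(subset) / n) * entropy(subset)
    return entropy(x_labels) - h_x_given_y

# Toy example: class labels X and a candidate attribute Y.
X = ["bad", "bad", "good", "good", "good", "good"]
Y = ["heavy", "heavy", "light", "light", "light", "heavy"]
print(f"H(X) = {entropy(X):.3f}, IG = {information_gain(X, Y):.3f}")
```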

Decision Tree – Example
Lecture 2 – Supervised Learning – Classification

Decision Tree – Basic Algorithm
Basic algorithm for learning decision trees:
1. Start with the whole training data.
2. Select the attribute or value along the dimension that maximises the Information Gain.
3. Create child nodes based on the split.
4. Recurse on each child using the child's data, until a stopping criterion is reached:
   ■ all examples have the same class,
   ■ the amount of data is too small,
   ■ the tree is too large.
Lecture 2 – Supervised Learning – Classification

ID3 – Iterative Dichotomiser 3
node = root of decision tree
Main loop:
1. A ← the “best” decision attribute for the next node.
2. Assign A as decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to the leaf nodes.
5. If training examples are perfectly classified, stop. Else, recurse over new leaf nodes.
Lecture 2 – Supervised Learning – Classification 77
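A compact sketch of this loop for categorical attributes (illustrative only; it grows the tree recursively using information gain and stops when a node is pure or no attributes remain):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain of splitting `rows` (dicts of attribute values) on `attr`."""
    n = len(labels)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    # Stop if perfectly classified or no attributes left: leaf with majority label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Choose the "best" attribute (maximum information gain) for this node.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    # One descendant per value of the chosen attribute; sort examples to branches.
    for v in set(r[best] for r in rows):
        branch_rows = [r for r in rows if r[best] == v]
        branch_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        tree[best][v] = id3(branch_rows, branch_labels,
                            [a for a in attributes if a != best])
    return tree

# Hypothetical toy data: predict fuel efficiency from two categorical attributes.
rows = [{"cylinders": "4", "weight": "light"}, {"cylinders": "4", "weight": "heavy"},
        {"cylinders": "8", "weight": "heavy"}, {"cylinders": "8", "weight": "light"}]
labels = ["good", "good", "bad", "bad"]
print(id3(rows, labels, ["cylinders", "weight"]))
# -> {'cylinders': {'4': 'good', '8': 'bad'}}
```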

Decision Tree – Real-valued Inputs
What should we do if some of the inputs are real-valued?
Lecture 2 – Supervised Learning – Classification
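The next slide answers this with binary threshold splits; as an illustration (a sketch, not from the slides), candidate thresholds can be enumerated as midpoints between consecutive sorted values whose class labels differ:

```python
def candidate_thresholds(values, labels):
    """Midpoints x_i + (x_{i+1} - x_i)/2 between consecutive sorted values,
    keeping only splits that separate examples of different classes."""
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]):
        if x1 != x2 and y1 != y2:            # only class changes matter
            thresholds.append(x1 + (x2 - x1) / 2)
    return thresholds

# Hypothetical fish lengths and classes.
print(candidate_thresholds([2, 4, 5, 7, 8, 10],
                           ["salmon", "salmon", "sea bass", "salmon", "sea bass", "sea bass"]))
# -> [4.5, 6.0, 7.5]
```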


Decision Tree – Real-valued Inputs
• Binary tree: split on attribute X at value t
  ○ One branch: X < t
  ○ Other branch: X ≥ t
• Allow repeated splits on the same variable
• How to find t?
  ○ Search through all possible values of t (seems hard!)
  ○ But only a finite number of t's are important:
    ○ sort the data according to X into {x1, ..., xm}
    ○ consider split points of the form xi + (xi+1 - xi)/2
    ○ moreover, only splits between examples of different classes matter!
Lecture 2 – Supervised Learning – Classification

Decision Tree – Summary
• Decision trees are one of the most popular ML tools
• Easy to understand, implement and use
• Usually interpretable
• Computationally cheap (to solve heuristically)
• Presented for classification, but can be used for regression and density estimation too
• Decision trees will overfit!
  o Must use tricks to find "simple trees", e.g.,
    ❑ fixed depth / early stopping
    ❑ pruning
    ❑ hypothesis testing
Lecture 2 – Supervised Learning – Classification

Classifiers
• Simple Classifiers
  • K-Nearest Neighbour (kNN)
  • Naïve Bayes
  • Decision trees
Lecture 2 – Supervised Learning – Classification

This week
• Tutorials begin
• Pracs begin
Next week
• Regression – Predicting House Prices
Lecture 2 – Supervised Learning – Classification
Binary tree: split on attribute X at value t ○ One branch : X < t ○ Other branch : X ≥ t Allow repeated splits on same variable How to find t? ○ Search through all possible values of t (seems hard!) ○ But only a finite number of t’s are important: ○ Sort data according to X Into {x1,...,xm} ○ Consider Split Points Of The Form xi + (xi+1 – xi)/2 ○ Moreover, only splits between examples of different classes matter! Lecture 2 – Supervised Learning - Classification Decision Tree - Summary • Decision trees are one of the most popular ML tools • Easy to understand, implement and use • Usually interpretable • Computationally cheap (to solve heuristically) • Presented for classification, can be used for regression and density estimation too • Decision trees will overfit! o Must use tricks to find “simple trees”, e.g., ❑ Fixed depth/Early stopping ❑ Pruning ❑ Hypothesis testing Lecture 2 – Supervised Learning - Classification Classifiers • Simple Classifiers • K-Nearest Neighbour (kNN) • Naïve Bayes • Decision trees Lecture 2 – Supervised Learning - Classification This week • Tutorials begin • Pracs begin Next week • Regression - Predicting House Prices Lecture 2 – Supervised Learning - Classification 82 Lecture 2 – Supervised Learning - Classification 83