
Lecture 6: Classification with Naive Bayes
Introduction to Machine Learning Semester 1, 2022
Copyright © University of Melbourne 2022. All rights reserved. No part of this publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.


Last time…
• Machine learning concepts and approaches
• Review of probability
• Review of (basic) optimization
• Back to Machine learning: Naive Bayes Classification
  • Deriving the classifier (drawing on foundations in lecture 4)
  • Finding the optimal parameters (drawing on foundations in lecture 5)
  • Example and implementation

Naive Bayes Theory

A little thought experiment…
Given the following dataset:
Temp   Humidity   Windy   Class
cool   normal     false   yes
cool   normal     false   yes
cool   normal     false   yes
cool   normal     false   yes
cool   normal     false   yes
cool   normal     false   yes
cool   normal     false   yes
cool   normal     false   yes
cool   high       true    no
What (do you think) is the class of sunny, cool, normal, false?

A little thought experiment…
Given the following dataset:
Temp   Humidity   Windy   Class
hot    normal     true    yes
hot    normal     true    no
hot    normal     true    yes
hot    normal     true    no
hot    normal     true    yes
hot    normal     true    no
cool   normal     false   yes
mild   high       false   no
cool   high       true    no
What (do you think) is the class of rainy, hot, normal, true?

A little thought experiment…
Given the following dataset:
Temp   Humidity   Windy   Class
mild   normal     true    yes
mild   normal     false   yes
hot    high       true    yes
cool   high       false   yes
cool   normal     true    no
hot    normal     true    no
hot    normal     false   no
mild   normal     true    no
cool   high       true    no
What (do you think) is the class of overcast, mild, high, false?

Notation
• y: label (e.g., spam, play, …)
• x: observation (e.g., email, day, …)
• x_m, m ∈ {1, 2, …, M}: features of x (e.g., temperature, word, …)
• (x_i, y_i): observation-label pair; the ith data point
• θ, φ, ψ, …: parameters
• f(x, y; θ): function of x and y with parameters θ; equivalently, f_θ(x, y)

A Probabilistic Learner I
• Let’s come up with a supervised machine learning method
• We build a probabilistic model of the training data D_train,

  P_θ(x, y) = ∏_{i ∈ D_train} P_θ(x_i, y_i)
• We learn our model parameters θ such that they maximize the data likelihood (or: log likelihood)
• We subsequently use that trained model to predict the class labels of the test data
• So, given a test instance x ∈ D_test, which class y is most likely?

  ŷ = argmax_{y∈Y} P(y|x)

A Probabilistic Learner II
The obvious way of doing this:
• For each class y:
• Find the instances in the training data labelled as y
• Count the number of times x has been observed
• Choose yˆ with the greatest frequency of observed x

A Probabilistic Learner III
The obvious way of doing this:
• Would require an enormous amount of data
• A test instance x is a bundle of attribute values: to classify an (as-yet) unseen instance would require that every possible combination of attribute values has been attested in the training data a non-trivial number of times
• For m attributes, each taking k different values, and |Y| classes, this means O(|Y| · k^m) instances
• Weather example: perhaps 100s of instances
• 2-class problem, 20 binary attributes: at least 2M instances
• 4 classes, 60 ternary attributes: at least 10^28 instances
• Would only be meaningful for the instances that we’ve actually seen

Bayes’ Rule
Reformulate the probability of class under features as probability of features under class
P(x, y) = P(y|x)P(x) = P(x|y)P(y)

P(y|x) = P(x|y)P(y) / P(x)

Recall our objective
ŷ = argmax_{y∈Y} P(y|x)
  = argmax_{y∈Y} P(x|y)P(y) / P(x)
  = argmax_{y∈Y} P(x|y)P(y)
Recall that each observation consists of many features x = x1, x2, …xM
ŷ = argmax_{y∈Y} P(x_1, x_2, …, x_M|y) P(y)
That is still infeasible!

Putting ‘Naive’ in Naive Bayes
To arrive at a more feasible solution, we make a naive assumption
P(x_1, x_2, …, x_M|y)P(y) ≈ P(x_1|y)P(x_2|y)…P(x_M|y)P(y)
                           = P(y) ∏_{m=1}^{M} P(x_m|y)
• The conditional independence assumption: Conditioned on the class y, the features are assumed to be independent
• Intuitively: if I know that the class of the email is spam, none of the words depend on their surrounding words
• Clearly, this is nonsense. But the model works surprisingly well, anyway!

The Naive Bayes Model: Generative Story
The complete Naive Bayes Classifier
ŷ = argmax_{y∈Y} P(y) P(x_1, x_2, …, x_M|y)
  = argmax_{y∈Y} P(y) ∏_{m=1}^{M} P(x_m|y)

The Underlying Probabilistic Model

P(x, y) = ∏_{i=1}^{N} P(y_i) ∏_{m=1}^{M} P(x_m^i|y_i)

Algorithm 1: Generative Story of Naive Bayes
1: for observation i ∈ {1, 2, …, N} do
2:     generate the label y_i from P(y)
3:     for feature m ∈ {1, 2, …, M} do
4:         generate the feature value x_m^i, given label y_i, from P(x_m^i|y_i)
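To make the generative story concrete, here is a minimal Python sketch (not part of the lecture material) that samples observations by first drawing a label from P(y) and then drawing each feature value from P(x_m|y); the class names, feature names and probabilities below are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, for illustration only.
classes = ["yes", "no"]
p_y = [0.6, 0.4]                                   # P(y)
p_x_given_y = {                                    # P(x_m | y) for each feature m
    "yes": {"Temp":  (["hot", "mild", "cool"], [0.2, 0.4, 0.4]),
            "Windy": (["true", "false"],       [0.3, 0.7])},
    "no":  {"Temp":  (["hot", "mild", "cool"], [0.4, 0.4, 0.2]),
            "Windy": (["true", "false"],       [0.6, 0.4])},
}

def generate(n):
    """Sample n observations by following the generative story."""
    data = []
    for _ in range(n):                             # for observation i = 1..N
        y = rng.choice(classes, p=p_y)             # generate label y_i from P(y)
        x = {m: rng.choice(values, p=probs)        # generate each x_m^i from P(x_m | y_i)
             for m, (values, probs) in p_x_given_y[y].items()}
        data.append((x, y))
    return data

print(generate(3))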

Naive Bayes Assumptions
P(x, y) = ∏_{i=1}^{N} P(y_i) ∏_{m=1}^{M} P(x_m^i|y_i)
• Features of an instance are conditionally independent given the class
• Instances are independent of each other
• The distribution of data in the training instances is the same as the distribution of data in the test instances

Gaussian Naive Bayes with 2 classes
Observations: real-valued feature vectors of length M, labelled with a binary class (0, 1)
Example Model: y drawn from a Bernoulli distribution; x_m drawn from a Gaussian distribution

  p(x, y) = p_{φ,ψ}(x_1, x_2, ⋯, x_M, y) = p_φ(y) ∏_{m=1}^{M} p_ψ(x_m|y)
          = BN(y|φ) ∏_{m=1}^{M} N(x_m | ψ = {μ_{m,y}, σ_{m,y}})
          = φ^y (1−φ)^{(1−y)} ∏_{m=1}^{M} 1/√(2π σ²_{m,y}) · exp( −(x_m − μ_{m,y})² / (2 σ²_{m,y}) )

Prediction

  ŷ = argmax_y p(y|x)
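As a rough illustration (not the lecture's code), the sketch below scores a 2-class Gaussian Naive Bayes model under the joint p(x, y) = p(y) ∏_m N(x_m|μ_{m,y}, σ_{m,y}), assuming the parameters have already been estimated; the numbers are invented and scipy's norm.pdf supplies the Gaussian density.

import numpy as np
from scipy.stats import norm

# Hypothetical learned parameters for a 2-class model with M = 2 real-valued features.
phi = 0.6                                    # P(y = 1); P(y = 0) = 1 - phi
mu = {0: [0.2, 37.8], 1: [0.67, 38.4]}       # mu[y][m]
sigma = {0: [0.2, 0.5], 1: [0.19, 0.8]}      # sigma[y][m]

def joint(x, y):
    """p(x, y) = p(y) * prod_m N(x_m | mu_{m,y}, sigma_{m,y})."""
    p_y = phi if y == 1 else 1 - phi
    p_x_given_y = np.prod([norm.pdf(xm, loc=mu[y][m], scale=sigma[y][m])
                           for m, xm in enumerate(x)])
    return p_y * p_x_given_y

x = [0.4, 38.0]
# argmax_y p(y|x) is the same as argmax_y p(x, y), since p(x) does not depend on y.
y_hat = max([0, 1], key=lambda y: joint(x, y))
print(y_hat)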

Bernoulli Naive Bayes with 2 classes
Observations: binary feature vectors of length M, labelled with a binary class (0, 1)
Example Model: y drawn from a Bernoulli distribution; x_m drawn from a Bernoulli distribution

  p(x, y) = p_{φ,ψ}(x_1, x_2, ⋯, x_M, y) = p_φ(y) ∏_{m=1}^{M} p_ψ(x_m|y)
          = BN(y|φ) ∏_{m=1}^{M} BN(x_m|ψ_{m,y})
          = φ^y (1−φ)^{1−y} ∏_{m=1}^{M} (ψ_{y,m})^{x_m} (1−ψ_{y,m})^{(1−x_m)}

Prediction

  ŷ = argmax_y p(y|x)

Categorical Naive Bayes with C classes
Observations: categorical feature vectors of length M, labelled with one of C classes (C > 2)
Example Model: y drawn from a Categorical distribution with C classes; x_m drawn from a Categorical distribution over K values

  p(x, y) = p_{φ,ψ}(x_1, x_2, ⋯, x_M, y) = p_φ(y) ∏_{m=1}^{M} p_ψ(x_m|y)
          = Cat(y|φ) ∏_{m=1}^{M} Cat(x_m|ψ_{m,y})
          = φ_y ∏_{m=1}^{M} ∏_{k=1}^{K} (ψ_{y,m,k})^{[x_m = k]}

Prediction

  ŷ = argmax_y p(y|x)

Maximum Likelihood Estimation for Categorical Naive Bayes
But where do the parameters come from?

• Parameters φ of the Categorical distribution over class labels are the relative frequencies of the classes observed in the training data

  φ_y = count(y) / N

• Parameters ψ of the Categorical distributions over features given a class label are the observed relative frequencies of each feature value among all instances with that class

  ψ_{y,m} = count(y, m) / count(y)

• These parameters maximize the probability of the observed dataset P({(x_i, y_i)}_{i=1}^N; φ, ψ). They are the maximum likelihood estimate of φ and ψ.
• You are invited to derive this result using the optimization techniques we learnt in the last lecture!

Activity: What are the four steps we would follow in finding the optimal parameters?
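A minimal counting sketch of these estimates, assuming each training instance is a (feature-dict, label) pair; here ψ is stored per (class, feature) and maps each observed feature value to its relative frequency.

from collections import Counter, defaultdict

def mle_categorical_nb(data):
    """data: list of (x, y) pairs, where x is a dict {feature_name: value}."""
    N = len(data)
    class_counts = Counter(y for _, y in data)
    phi = {y: c / N for y, c in class_counts.items()}          # phi_y = count(y) / N

    value_counts = defaultdict(Counter)                        # (y, m) -> counts of values j
    for x, y in data:
        for m, j in x.items():
            value_counts[(y, m)][j] += 1
    # psi[(y, m)][j] = count(y, x_m = j) / count(y)
    psi = {(y, m): {j: c / class_counts[y] for j, c in counts.items()}
           for (y, m), counts in value_counts.items()}
    return phi, psi

# Toy usage (a subset of the flu/cold example used later in this lecture).
data = [({"Headache": "severe", "Cough": "yes"}, "Flu"),
        ({"Headache": "no",     "Cough": "yes"}, "Cold"),
        ({"Headache": "mild",   "Cough": "yes"}, "Flu")]
phi, psi = mle_categorical_nb(data)
print(phi)                         # {'Flu': 0.666..., 'Cold': 0.333...}
print(psi[("Flu", "Headache")])    # {'severe': 0.5, 'mild': 0.5}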

Maximum Likelihood Estimation for Gaussian Naive Bayes
For each class y and each feature x_m, we learn an individual Gaussian distribution parameterized by a mean μ_{y,m} and a standard deviation σ_{y,m}.

Mean: the average of all observed values of feature x_m under class y

  μ_{y,m} = (1 / count(y)) ∑_{i: y_i = y} x_m^i

Standard deviation: the square root of the averaged squared differences of the observed values from the mean

  σ_{y,m} = √( ∑_{i: y_i = y} (x_m^i − μ_{y,m})² / count(y) )
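A minimal numpy sketch of these per-class, per-feature estimates; note that the formula above divides by count(y), which is the population standard deviation (numpy's default, ddof=0). The example array is the numeric flu/cold table used later in the implementation section.

import numpy as np

def fit_gaussian_nb_params(X, y):
    """X: (N, M) array of real-valued features, y: length-N array of labels.
    Returns {class: (mu_vector, sigma_vector)}, one (mu, sigma) per feature."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]                 # all instances labelled with class k
        mu = Xk.mean(axis=0)           # mean of each feature under class k
        sigma = Xk.std(axis=0)         # population std: divide by count(y), i.e. ddof=0
        params[k] = (mu, sigma)
    return params

X = np.array([[0.8, 0.4, 39.5],
              [0.0, 0.8, 37.8],
              [0.4, 0.4, 37.8],
              [0.4, 0.0, 37.8],
              [0.8, 0.8, 37.8]])
y = np.array(["Flu", "Cold", "Flu", "Cold", "Flu"])
print(fit_gaussian_nb_params(X, y)["Cold"])   # approx. ([0.2, 0.4, 37.8], [0.2, 0.4, 0.0])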

Naive Bayes Example I
Given a training data set, what probabilities do we need to estimate?

Headache   Sore     Temperature   Cough   Diagnosis
severe     mild     high          yes     Flu
no         severe   normal        yes     Cold
mild       mild     normal        yes     Flu
mild       no       normal        no      Cold
severe     severe   normal        yes     Flu

We need P(y = k) and P(x = f|y = k), for every possible value k for y and every possible value f for x:

P(Flu) = 3/5                           P(Cold) = 2/5
P(Headache = severe|Flu) = 2/3         P(Headache = severe|Cold) = 0/2
P(Headache = mild|Flu) = 1/3           P(Headache = mild|Cold) = 1/2
P(Headache = no|Flu) = 0/3             P(Headache = no|Cold) = 1/2
P(Sore = severe|Flu) = 1/3             P(Sore = severe|Cold) = 1/2
P(Sore = mild|Flu) = 2/3               P(Sore = mild|Cold) = 0/2
P(Sore = no|Flu) = 0/3                 P(Sore = no|Cold) = 1/2
P(Temp = high|Flu) = 1/3               P(Temp = high|Cold) = 0/2
P(Temp = normal|Flu) = 2/3             P(Temp = normal|Cold) = 2/2
P(Cough = yes|Flu) = 3/3               P(Cough = yes|Cold) = 1/2
P(Cough = no|Flu) = 0/3                P(Cough = no|Cold) = 1/2

Naive Bayes Example II
Ann comes to the clinic with a mild headache, severe soreness, normal temperature and no cough. Is she more likely to have a Cold, or the Flu?
P(Co) × P(H = m|Co) P(S = s|Co) P(T = n|Co) P(C = n|Co)
  = 2/5 × (1/2)(1/2)(2/2)(1/2) = 0.05

P(Fl) × P(H = m|Fl) P(S = s|Fl) P(T = n|Fl) P(C = n|Fl)
  = 3/5 × (1/3)(1/3)(2/3)(0/3) = 0

So we predict Cold: the single zero for P(Cough = no|Flu) drives the entire Flu score to 0.

Naive Bayes Example III
Bob comes to the clinic with a severe headache, mild soreness, high temperature and no cough. Is he more likely to have a cold, or the flu?
P(Co) × P(H = s|Co) P(S = m|Co) P(T = h|Co) P(C = n|Co)
  = 2/5 × (0/2)(0/2)(0/2)(1/2) = 0

P(Fl) × P(H = s|Fl) P(S = m|Fl) P(T = h|Fl) P(C = n|Fl)
  = 3/5 × (2/3)(2/3)(1/3)(0/3) = 0

Both scores are 0, so the classifier cannot separate the two classes at all.

Smoothing categorical features: Introduction
The problem with unseen features
• If any term P(xm|y) = 0 then the class probability P(y|x) = 0
• But, we already established that in any realistic scenario we won’t see
every class-feature combination during training
• A single zero renders many additional meaningful observations irrelevant
• Solution: no event is impossible: P(x_m|y) > 0 for all x_m and all y
• We need to readjust the remaining model parameters to maintain valid probability distributions (∑_i ψ_i = 1)

Epsilon Smoothing
Simplest approach
• if we calculate P(x_m|y) = 0, we replace the 0 with a very (!) small constant, typically called ε
• ε needs to be smaller (preferably much smaller) than 1/N (N = the number of training instances). Why?
• Effectively it reduces most comparisons to the number of ε factors (fewest εs wins)
• We assume that ε is so small that 1 + ε ≈ 1, so we do not need to renormalize or adjust the other probabilities in the model
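A minimal sketch of the idea, assuming the learned conditional probabilities are stored in a dictionary keyed by (class, feature); an unseen (value, class) combination simply contributes ε instead of 0.

EPS = 1e-9   # intended to be much smaller than 1/N for the training set at hand

def likelihood(psi, y, feature, value):
    """Return P(x_m = value | y), or epsilon if the combination was never observed."""
    p = psi.get((y, feature), {}).get(value, 0.0)
    return p if p > 0 else EPS

# e.g. score(y) = prior[y] * prod(likelihood(psi, y, m, x[m]) for m in x)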

Epsilon Smoothing
Bob comes to the clinic with a severe headache, mild soreness, high temperature and no cough. Is he more likely to have a cold, or the flu?
P(Co) × P(H = s|Co) P(S = m|Co) P(T = h|Co) P(C = n|Co)
  = 2/5 × (ε)(ε)(ε)(1/2) = ε³/5

P(Fl) × P(H = s|Fl) P(S = m|Fl) P(T = h|Fl) P(C = n|Fl)
  = 3/5 × (2/3)(2/3)(1/3)(ε) = 12ε/135 = 4ε/45

Flu now wins: with ε ≪ 1, the Cold score contains three ε factors and the Flu score only one.

Laplace Smoothing
Add a “pseudocount” α to each feature count observed during training

  P(x_m = j|y = k) = (α + count(y = k, x_m = j)) / (Mα + count(y = k))

• the value of α is a parameter; very often α = 1
• all counts are incremented, which maintains monotonicity (for α = 1: 0 becomes 1, 1 becomes 2, 2 becomes 3, …)
• M is the number of values x_m can take on
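A minimal sketch of the smoothed estimate, assuming the raw counts and the number of possible feature values have already been collected during training.

def laplace_estimate(count_kj, count_k, num_values, alpha=1.0):
    """P(x_m = j | y = k) = (alpha + count(y=k, x_m=j)) / (M*alpha + count(y=k))."""
    return (alpha + count_kj) / (num_values * alpha + count_k)

# Headache takes M = 3 values; P(Headache = no | Flu) with 0 of 3 Flu instances observed:
print(laplace_estimate(0, 3, 3))   # (0 + 1) / (3 + 3) = 1/6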

Laplace Smoothing
Add a “pseudocount” α to each feature count observed during training
  P(x_m = j|y = k) = (α + count(y = k, x_m = j)) / (Mα + count(y = k))

Headache   Sore     Temperature   Cough   Diagnosis
severe     mild     high          yes     Flu
no         severe   normal        yes     Cold
mild       mild     normal        yes     Flu
mild       no       normal        no      Cold
severe     severe   normal        yes     Flu

                             original estimate   smoothed estimate (α = 1)
P(Headache = severe|Flu)     2/3                 (2+1)/(3+3) = 3/6
P(Headache = mild|Flu)       1/3                 (1+1)/(3+3) = 2/6
P(Headache = no|Flu)         0/3                 (0+1)/(3+3) = 1/6
P(Cough = yes|Flu)           3/3                 (3+1)/(3+2) = 4/5
P(Cough = no|Flu)            0/3                 (0+1)/(3+2) = 1/5

Laplace Smoothing
Add a “pseudocount” α to each feature count observed during training

  P(x_m = j|y = k) = (α + count(y = k, x_m = j)) / (Mα + count(y = k))
• Probabilities are changed drastically when there are few instances; with a large number of instances, the changes are small
• Laplace smoothing (and smoothing in general) reduces variance of the NB classifier because it reduces sensitivity to individual (non-)observations in the training data
• Laplace smoothing (and smoothing in general) adds bias to the NB classifier. We no longer have a true maximum likelihood estimator.
• How to choose α?
• There are other smoothing methods, including Good-Turing, Kneser-Ney, regression, … (outside the scope of this class)

“Smoothing” Continuous Features
No problem with unseen features: we can always compute p(x|μ, σ), but…

• Recall the Gaussian PDF:

  p(x_m|y) = 1/√(2π σ²_{m,y}) · exp( −(x_m − μ_{m,y})² / (2 σ²_{m,y}) )

What if a feature (under a class) has zero variance, i.e., all observed values are the same?

Solution 1: Ignore the feature
• Might lose information if there is zero variance only for some classes
• Safe to do if the feature has the same value across all classes: it is just a constant that carries no class information

Solution 2: Add a small “smoothing” value to the PDF
• p(x = j|μ, σ) → p(x = j|μ, σ + ε)
• Add a small fraction of the largest observed variance to all variances (the fraction is 1e−9 in sklearn’s implementation)
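A small numpy sketch of Solution 2, following the convention described above: a fixed fraction of the largest observed variance is added to every variance.

import numpy as np

def smooth_variances(variances, fraction=1e-9):
    """Add fraction * (largest variance) to every per-(class, feature) variance."""
    variances = np.asarray(variances, dtype=float)
    return variances + fraction * variances.max()

print(smooth_variances([0.04, 0.0, 2.5]))   # the zero variance is no longer exactly 0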

Implementation of Gaussian Naive Bayes

Implementing a Naive Bayes Classifier
Naive Bayes is a supervised machine learning method:
• We need to build a model (“training phase”)
• We need to make predictions using that model and evaluate the predictions against the ground truth (“testing phase”)

Training a NB Classifier I
Our model consists of two kinds of probabilities:
• priors P(Y = k) (one per class)
• likelihoods P(X = j|Y = k) (one per attribute value, per class)

Calculating priors by counting I
There is one prior P(Y = k) per class

X1 (Headache)   X2 (Sore)   X3 (Temperature)   Y (Diagnosis)
0.8             0.4         39.5               Flu
0.0             0.8         37.8               Cold
0.4             0.4         37.8               Flu
0.4             0.0         37.8               Cold
0.8             0.8         37.8               Flu

Class counts:   Cold: 2    Flu: 3

We need to normalize these counts by the total number of training instances N. Options:
• divide each entry by the sum of the entries in the list
• keep a separate counter for the total number of instances N, which is often useful
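A minimal sketch of this counting step using collections.Counter, taking the second option (keeping the total N separately).

from collections import Counter

labels = ["Flu", "Cold", "Flu", "Cold", "Flu"]

class_counts = Counter(labels)           # Counter({'Flu': 3, 'Cold': 2})
N = len(labels)                          # keep the total separately
priors = {k: c / N for k, c in class_counts.items()}
print(priors)                            # {'Flu': 0.6, 'Cold': 0.4}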

Calculating likelihoods I
There is one likelihood P(x = j|y = k) per attribute, per class: 2D array?

Calculating likelihood parameters
There is one likelihood P(x = j|y = k) per attribute, per class: a 2D array? A 3D array?
• Each likelihood is a Gaussian distribution parameterized by a mean and standard deviation

Calculating likelihoods II

X1 (Headache)   X2 (Sore)   X3 (Temperature)   Y (Diagnosis)
0.8             0.4         39.5               Flu
0.0             0.8         37.8               Cold
0.4             0.4         37.8               Flu
0.4             0.0         37.8               Cold
0.8             0.8         37.8               Flu

Collect the observed values of each feature under each class:

Headache       Cold: {0.0, 0.4}       Flu: {0.8, 0.4, 0.8}
Sore           Cold: {0.8, 0.0}       Flu: {0.4, 0.4, 0.8}
Temperature    Cold: {37.8, 37.8}     Flu: {39.5, 37.8, 37.8}

We need to turn these numbers into a mean and a standard deviation parameter for each (label, feature) pair:

  μ_{cold,headache} = (0.0 + 0.4) / 2 = 0.2

  σ_{cold,headache} = √( ((0.0 − 0.2)² + (0.4 − 0.2)²) / 2 ) = 0.2

Store C × M parameter tuples, e.g., in a dictionary of dictionaries
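A minimal sketch of turning the collected value lists into the "dictionary of dictionaries" of (μ, σ) tuples; statistics.pstdev is the population standard deviation used in the calculation above.

from statistics import mean, pstdev

# Observed feature values grouped by feature and class, as on the slide.
observed = {
    "Headache":    {"Cold": [0.0, 0.4],     "Flu": [0.8, 0.4, 0.8]},
    "Sore":        {"Cold": [0.8, 0.0],     "Flu": [0.4, 0.4, 0.8]},
    "Temperature": {"Cold": [37.8, 37.8],   "Flu": [39.5, 37.8, 37.8]},
}

# params[class][feature] = (mu, sigma): C x M parameter tuples.
params = {}
for feature, by_class in observed.items():
    for cls, values in by_class.items():
        params.setdefault(cls, {})[feature] = (mean(values), pstdev(values))

print(params["Cold"]["Headache"])   # (0.2, 0.2)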

Making predictions using a NB Classifier i
ŷ = argmax_{k} P(y = k) ∏_{m=1}^{M} p(x_m = j|y = k; μ_{k,m}, σ_{k,m})

• P(y = k) can be read off the data structures from the training phase.
• p(x_m = j|y = k; μ_{k,m}, σ_{k,m}) can be computed using the likelihood function of the Gaussian distribution:

  p(x_m = j|y = k) = 1/√(2π σ²_{m,k}) · exp( −(j − μ_{m,k})² / (2 σ²_{m,k}) )
• We only care about the class corresponding to the maximal value, so as we progress through the classes, we can keep track of the greatest value so far.

Making predictions using a NB Classifier ii
We’re multiplying a bunch of numbers in (0, 1] together; because of our floating-point number representation, we tend to get underflow.
One common solution is a log-transformation:

  ŷ = argmax_{k} P(y = k) ∏_{m=1}^{M} P(x_m = j|y = k)
    = argmax_{k} [ log P(y = k) + ∑_{m=1}^{M} log P(x_m = j|y = k) ]
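A minimal sketch of prediction in log space, assuming the priors and params dictionaries built in the earlier sketches; scipy's norm.logpdf returns the log of the Gaussian likelihood directly, and the loop keeps track of the best class seen so far, as described above.

import math
from scipy.stats import norm

def predict(x, priors, params):
    """x: dict {feature: value}. Return the class with the highest log joint score."""
    best_class, best_score = None, -math.inf
    for cls, prior in priors.items():
        score = math.log(prior)                               # log P(y = k)
        for feature, value in x.items():
            mu, sigma = params[cls][feature]                  # sigma must be > 0 (smooth first)
            score += norm.logpdf(value, loc=mu, scale=sigma)  # + log P(x_m = j | y = k)
        if score > best_score:                                # keep the greatest value so far
            best_class, best_score = cls, score
    return best_class

# e.g. predict({"Headache": 0.6, "Sore": 0.4, "Temperature": 38.2}, priors, params)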

Evaluating a NB classifier
Evaluation in a supervised ML context (for NB and other methods):
• fundamentally based around comparing predicted labels with the actual labels
We’ll talk about this in much more detail in the upcoming lectures.
