Generative versus Discriminative Models
We need to obtain estimates of \hat{P}(Y = y|X) for each value that y can take and then, following the Maximum A Posteriori (MAP) decision rule, which minimizes overall test error, assign a label such that
\hat{Y} = \arg\max_{y \in C} \hat{P}(Y = y|X)
Bayes' Rule says that you can write
P(Y|X) = \frac{P(Y, X)}{P(X)} = \frac{P(X|Y)P(Y)}{P(X)}
1. Logistic Regression and KNN are discriminative models, which directly compute P(Y|X).
2. Naive Bayes is a generative model, which computes P(Y|X) indirectly, exploiting the factorization due to Bayes' Theorem, i.e. P(X|Y)P(Y) (a small illustration in code follows below).
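As a rough illustration of the distinction, here is a minimal sketch assuming scikit-learn is available; the tiny binary dataset is made up purely for illustration. Logistic regression is fit directly to P(Y|X), while Bernoulli Naive Bayes stores estimates of P(Y) and P(X|Y) and combines them via Bayes' Rule:

    # Sketch: a discriminative vs. a generative classifier in scikit-learn.
    # The tiny binary dataset below is made up purely for illustration.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import BernoulliNB

    X = np.array([[1, 0, 1],
                  [0, 1, 1],
                  [1, 1, 0],
                  [0, 0, 1]])        # binary features
    y = np.array([1, 0, 1, 0])       # binary labels

    # Discriminative: models P(Y|X) directly.
    logit = LogisticRegression().fit(X, y)
    print(logit.predict_proba(X))            # estimated P(Y|X)

    # Generative: models P(Y) and P(X|Y), then applies Bayes' Rule.
    nb = BernoulliNB().fit(X, y)
    print(np.exp(nb.class_log_prior_))       # estimated P(Y)
    print(np.exp(nb.feature_log_prob_))      # estimated P(x_i = 1 | Y)
    print(nb.predict_proba(X))               # P(Y|X) obtained indirectly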
Naive Bayes
We introduce the Naive Bayes classifier in the context of classification; our applications will focus on text features.
This is a setting where it is most powerful, since there are many features (i.e. X has many columns, many words to consider), but any given feature only has a small effect on P(Y|X).
We begin with a simple motivating example to illustrate Bayes' Rule, which allows us to rewrite the classification problem as:
\hat{Y} = \arg\max_{y \in C} \hat{P}(Y = y|X) = \arg\max_{y \in C} \frac{\hat{P}(X|y)\hat{P}(y)}{\hat{P}(X)}
Naive Bayes Classifier
The Classification Problem and Bayes Rule
- The classification problem is still the same, i.e.
  \hat{Y} = \arg\max_{y \in C} \hat{P}(Y = y|X) = \arg\max_{y \in C} \frac{\hat{P}(X|y)\hat{P}(y)}{\hat{P}(X)}
- Note that the denominator does not change for different classes: \hat{P}(X) is constant for each value in C, so we can just drop the denominator.
  \hat{Y} = \arg\max_{y \in C} \hat{P}(Y = y|X) = \arg\max_{y \in C} \hat{P}(X|y)\hat{P}(y)
- In reality, we have many features in X.
The “Naive” Bayes Classifier
- Typically you have many features, i.e. X has many columns.
- Suppose you can write X = (x_1, \dots, x_p); then we can write:
  \hat{Y} = \arg\max_{y \in C} P(x_1, \dots, x_p|y)P(y)
- It is very difficult to estimate the joint probability \hat{P}(x_1, \dots, x_p|y), so we make a simplifying assumption.
Assumption (Naive Bayes Assumption)
Within a class Y, the distributions of the features x_i, x_j are independent of one another.
- The simplifying assumption allows us to write
  P(x_1, \dots, x_p|Y = y) = \prod_{i=1}^{p} P(x_i|Y = y)
The “Naive” Bayes Classifier
- The Naive Bayes Classifier assigns labels such that
  \hat{Y} = \arg\max_{y \in C} P(y) \prod_{i=1}^{p} P(x_i|y)
- This is equivalent to
  \hat{Y} = \arg\max_{y \in C} \log(P(y)) + \sum_{i=1}^{p} \log(P(x_i|y))
- This is still a linear classifier: it uses a linear combination of the inputs to make a classification decision (a minimal sketch of the rule in code follows below).
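A minimal sketch of this decision rule in plain Python; the function and argument names are illustrative, and it assumes the class priors and class-conditional probabilities have already been estimated and are strictly positive:

    import math

    def naive_bayes_predict(x, priors, cond_probs):
        """Return the class y maximizing log P(y) + sum_i log P(x_i | y).

        x:          observed feature values, e.g. ["Red", "SUV", "Domestic"]
        priors:     dict mapping class -> estimated P(y)
        cond_probs: dict mapping class -> {feature value -> estimated P(x_i | y)}
        All probabilities are assumed to be strictly positive.
        """
        best_class, best_score = None, -math.inf
        for y, prior in priors.items():
            score = math.log(prior)
            for xi in x:
                score += math.log(cond_probs[y][xi])
            if score > best_score:
                best_class, best_score = y, score
        return best_class

    # Illustrative call with made-up estimates:
    print(naive_bayes_predict(
        ["Red", "SUV"],
        priors={"Yes": 0.4, "No": 0.6},
        cond_probs={"Yes": {"Red": 0.5, "SUV": 0.25},
                    "No":  {"Red": 0.4, "SUV": 0.6}}))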
An Example “Naive” Bayes Classifier
- You are asked to build a predictive model that, based on a set of features, predicts whether a car is likely to be stolen.
- You intend to use this information to target resources towards policing.
- What are our features here? X has dimensions 9 \times 3.
- Note that all features are binary, so the presence (or absence) of a feature may tell you something about the underlying probability.
An Example “Naive” Bayes Classifier
- How would we classify a new observation x_i of a Red Domestic SUV?
  x_i = (Red, SUV, Domestic)
- The Naive Bayes assumption allows us to factorize the joint distribution as:
  P(x_1, \dots, x_p|y) = \prod_{i=1}^{p} P(x_i|y)
- I.e. we need to estimate for y_i = Yes:
  P(Yes), P(Red|Yes), P(SUV|Yes), P(Domestic|Yes)
- Similarly, we need to estimate for y_i = No:
  P(No), P(Red|No), P(SUV|No), and P(Domestic|No)
Estimating Parameters Using Training Data
Prior probability: \hat{P}(Yes) = 4/9
Since all features are binary, the class-conditional probabilities are easy to compute given the training data:
\hat{P}(Red|Yes)        \hat{P}(Red|No)
\hat{P}(Sports|Yes)     \hat{P}(Sports|No)
\hat{P}(Domestic|Yes)   \hat{P}(Domestic|No)
We estimate these looking at the training data as simple ratios, e.g.
\hat{P}(Red|Yes) = \frac{\text{Number of stolen red cars}}{\text{Number of stolen cars}} = \frac{2}{4}
You can fill out this table from the training data (columns: Stolen?, Color, Type, Origin) as follows; a count-ratio sketch in code appears after the table:
\hat{P}(Red|Yes) = 2/4        \hat{P}(Red|No) = 2/5
\hat{P}(Sports|Yes) = 3/4     \hat{P}(Sports|No) = 2/5
\hat{P}(Domestic|Yes) = 2/4   \hat{P}(Domestic|No) = 3/5
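A sketch of how such count ratios could be computed from row-level records; the rows below are hypothetical stand-ins, not the actual 9-observation training table:

    # Hypothetical training records (Color, Type, Origin, Stolen?) --
    # stand-ins for illustration only, not the actual 9-row training table.
    data = [
        ("Red",    "Sports", "Domestic", "Yes"),
        ("Red",    "SUV",    "Imported", "No"),
        ("Yellow", "Sports", "Domestic", "Yes"),
    ]

    def cond_prob(feature_index, value, label, rows):
        """Estimate P(feature = value | Stolen? = label) as a count ratio."""
        in_class = [r for r in rows if r[3] == label]
        return sum(1 for r in in_class if r[feature_index] == value) / len(in_class)

    print(cond_prob(0, "Red", "Yes", data))   # analogue of P-hat(Red|Yes)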
Computing the Naive Bayes Scores
For our new data point x_i = (Red, SUV, Domestic), we need to compute
\hat{Y} = \arg\max_{y \in C} \hat{P}(y) \prod_{i} \hat{P}(x_i|y)
For y_i = Yes:
\hat{P}(Yes)\hat{P}(Red|Yes)\hat{P}(SUV|Yes)\hat{P}(Domestic|Yes) = \frac{4}{9} \cdot \frac{2}{4} \cdot \left(1 - \frac{3}{4}\right) \cdot \frac{2}{4} \approx 0.027
For y_i = No:
\hat{P}(No)\hat{P}(Red|No)\hat{P}(SUV|No)\hat{P}(Domestic|No) = \frac{5}{9} \cdot \frac{2}{5} \cdot \left(1 - \frac{2}{5}\right) \cdot \frac{3}{5} = 0.08
So we would classify this instance x_i as not stolen. Why don't the probabilities add up to 1? Because we dropped the common denominator \hat{P}(X), these are unnormalized scores rather than posterior probabilities (the arithmetic is reproduced in the sketch below).
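A minimal sketch reproducing the arithmetic above, using the estimates from this example:

    # Unnormalized Naive Bayes scores for x_i = (Red, SUV, Domestic),
    # using the estimates from the training data above.
    score_yes = (4/9) * (2/4) * (1 - 3/4) * (2/4)   # P(Yes) P(Red|Yes) P(SUV|Yes) P(Domestic|Yes)
    score_no  = (5/9) * (2/5) * (1 - 2/5) * (3/5)   # P(No)  P(Red|No)  P(SUV|No)  P(Domestic|No)

    print(score_yes, score_no)                      # approx. 0.0278 vs. 0.08
    print("Yes" if score_yes > score_no else "No")  # prints "No": classified as not stolen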
Evaluating the Naive Bayes assumption
- In general, we cannot directly test the Naive Bayes assumption of class-conditional independence of individual features.
- However, we can look for evidence in the data on whether features appear to be independent.
- How do we do that? The Naive Bayes assumption states that
  P(Red, SUV|Yes) = P(Red|Yes)P(SUV|Yes)
- We can see whether this holds approximately in the training data (a small check in code follows below):
  \hat{P}(Red, SUV|Yes) = \frac{\text{No. of stolen Red SUVs}}{\text{No. of stolen cars}} = \frac{0}{4} = 0
- Versus \hat{P}(Red|Yes)\hat{P}(SUV|Yes) = \frac{2}{4} \times \left(1 - \frac{3}{4}\right) \neq 0
- This is NOT a statistical test, but it suggests that the Naive Bayes assumption does not hold.
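This comparison can be written out directly, using the counts reported above (a minimal sketch):

    # Empirical joint vs. product of marginals for the stolen ("Yes") class,
    # using the counts reported on the slide.
    p_joint   = 0 / 4                # P(Red, SUV | Yes): no stolen Red SUVs among 4 stolen cars
    p_product = (2/4) * (1 - 3/4)    # P(Red|Yes) * P(SUV|Yes)
    print(p_joint, p_product)        # 0.0 vs. 0.125: the factorization does not hold exactly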
Some implicit assumptions made
- We treated the individual features x_j as a sequence of Bernoulli-distributed random variables.
- This highlights why Naive Bayes is called a generative model, since we model the underlying probability distributions of the individual features contained in the data matrix X.
- We estimate P(Red|Yes) using \frac{\text{No. of stolen Red cars}}{\text{No. of stolen cars}}; this is actually a maximum likelihood estimator for the population probability P(Red|Yes) of a sequence of Bernoulli-distributed random variables.
- Why? Suppose you have a sequence of iid coin tosses. The joint likelihood of observing such a sequence of length n, x = (x_1, \dots, x_n), where x_j = 1 if a head occurs, can be written as:
  L(p) = \prod_{j=1}^{n} p^{x_j}(1 - p)^{1 - x_j}
Some implicit assumptions made
- Taking logs,
  \log L(p) = \sum_{j=1}^{n} \left[ x_j \log(p) + (1 - x_j) \log(1 - p) \right]
- A maximum likelihood estimate \hat{p} satisfies the first-order condition (FOC)
  \sum_{j=1}^{n} \frac{x_j}{p} - \frac{1 - x_j}{1 - p} = 0
- This is solved by \hat{p} = \frac{1}{n} \sum_{j=1}^{n} x_j (a quick numerical check follows after this list).
- So our intuitive choice of estimator for the class-conditional probabilities is actually theoretically well founded, but only if our features follow a Bernoulli distribution.
- In reality, the p features in our data matrix X could come from different generating distributions (i.e. some may be Bernoulli, Multinomial, Poisson, Normal, etc.).
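A quick numerical check that the sample mean recovers the Bernoulli parameter; a sketch assuming NumPy, with a made-up value of p:

    import numpy as np

    rng = np.random.default_rng(0)
    p_true = 0.3                               # made-up population probability
    x = rng.binomial(1, p_true, size=10_000)   # iid Bernoulli(p_true) draws

    p_hat = x.mean()                           # the MLE: (1/n) * sum of x_j
    print(p_hat)                               # close to 0.3 for large n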
Naive Bayes is very powerful…
- Naive Bayes is a classification method that tends to be used for discretely distributed data, and mainly for text; we present the Bernoulli and Multinomial language models in the next section.
- The next section introduces the idea of representing text as data for economists and political scientists to work with, and presents an example of a Naive Bayes classifier applied to text data.
- Most often, Naive Bayes classifiers are used to work with text, such as spam filters, sentiment categorization, and many other use cases.