Naïve Bayes
AIMA 12.6
CMPSC 442
Week 7, Meeting 21, Three Segments
Outline
● Deriving Naive Bayes
● Applying Naive Bayes using the Maximum a Posteriori Rule (see AIMA Ch. 20)
● Naive Bayes for Text Classification
Week 7, Meeting 21, Segment 1 of 3: Deriving Naïve Bayes
Example of Conditional Independence (from Mtg 20)
● Assume Toothache and Catch are conditionally independent, given Cavity
● Use conditional independence to compute the probability of Cavity
Deriving Naïve Bayes
1. Product rule applied to a joint probability distribution:
a. P(X1, X2, …, Xn, C) = P(X1, X2, …, Xn | C) P(C)
2. Conditional independence:
a. P(X1, X2, …, Xn | C) = P(X1|C) P(X2|C) … P(Xn|C)
3. Substitution:
a. P(X1, X2, …, Xn, C) = P(X1|C) P(X2|C) … P(Xn|C) P(C)
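A minimal Python sketch of the factorization in step 3. The class labels and probabilities below are hypothetical, chosen only to make the product concrete:

# Naive Bayes factorization: P(x1, ..., xn, C=c) = P(c) * prod_i P(xi | c)
from math import prod

prior = {"c0": 0.6, "c1": 0.4}            # hypothetical P(C)
cond = {                                  # hypothetical P(Xi = true | C)
    "c0": [0.9, 0.2, 0.5],
    "c1": [0.3, 0.7, 0.1],
}

def joint(c, x):
    """P(X1=x1, ..., Xn=xn, C=c) under the conditional independence assumption."""
    return prior[c] * prod(p if xi else 1 - p for p, xi in zip(cond[c], x))

print(joint("c0", [True, False, True]))   # 0.6 * 0.9 * (1 - 0.2) * 0.5 = 0.216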
Generalization of Conditional Independence
● X and Y are conditionally independent given Z
● Any number of random variables can be conditionally independent
given a single random variable distinct from the rest: the Cause
Derivation of Naïve Bayes
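One standard way to state the result, applying Bayes' rule and then the conditional independence assumption (notation follows AIMA 12.6; α is the usual normalizing constant):

P(C \mid x_1, \ldots, x_n)
  = \frac{P(x_1, \ldots, x_n \mid C)\, P(C)}{P(x_1, \ldots, x_n)}
  = \alpha \, P(C) \prod_{i=1}^{n} P(x_i \mid C)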
Visualizing Naïve Bayes
● Effects e1, …, en assumed conditionally independent given the cause
● Total number of parameters is linear in n
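A quick counting sketch, assuming (my assumption, for concreteness) a Boolean cause and n Boolean effects:

# Parameter counts: Naive Bayes vs. the full joint, Boolean cause and effects.

def nb_params(n):
    # P(Cause=true) plus P(e_i = true | Cause = c) for each effect and each
    # of the two cause values: 1 + 2n, i.e., linear in n.
    return 1 + 2 * n

def full_joint_params(n):
    # One entry per assignment to (Cause, e_1, ..., e_n), minus one because
    # the entries sum to 1: 2**(n+1) - 1, i.e., exponential in n.
    return 2 ** (n + 1) - 1

for n in (5, 10, 20):
    print(n, nb_params(n), full_joint_params(n))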
What Makes Naïve Bayes Naïve?
● Often used when the effects are not conditionally independent
● Can often work well in cases that violate the conditional independence
assumption
○ Text classification
○ Why use a more complicated model that seems more correct if a simpler
(more naïve) model performs as well or better?
Week 7, Meeting 21, Segment 2 of 3: Applying Naïve Bayes
Learning the Naïve Bayes Parameters
● Maximum Likelihood estimates
○ Use counts from a training set of data
○ P̂ stands for the probability estimate (a counting sketch follows)
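A minimal counting sketch with a made-up toy training set (the data and feature are illustrative assumptions, not course data):

from collections import Counter

# Each example pairs one feature value with a class label (made-up toy data).
data = [("sunny", "yes"), ("sunny", "no"), ("rain", "yes"),
        ("rain", "yes"), ("overcast", "yes"), ("sunny", "no")]

class_counts = Counter(c for _, c in data)
pair_counts = Counter(data)

def p_hat_class(c):
    # MLE estimate of P(C = c): count(c) / N
    return class_counts[c] / len(data)

def p_hat_cond(x, c):
    # MLE estimate of P(X = x | C = c): count(x, c) / count(c)
    return pair_counts[(x, c)] / class_counts[c]

print(p_hat_class("yes"))          # 4/6
print(p_hat_cond("sunny", "no"))   # 2/2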
Example of Naïve Bayes used for Prediction
● NB can be used to “classify”
even when there is no causal
relationship
● Four random variables
conditionally independent, given
PlayTennis
○ Outlook (3 values)
○ Temperature (3 values)
○ Humidity (2 values)
○ Wind (2 values)
● Full joint probability over the four evidence variables: 3² × 2² = 36 cases
● Conditional independence gives 20 parameters (arithmetic check below)
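Checking the counts on this slide:

# Number of value combinations of the four evidence variables vs. the number
# of conditional probabilities Naive Bayes needs.
values = {"Outlook": 3, "Temperature": 3, "Humidity": 2, "Wind": 2}

full_joint = 1
for v in values.values():
    full_joint *= v
print(full_joint)   # 3 * 3 * 2 * 2 = 36 combinations (72 joint entries
                    # once PlayTennis's two values are included)

nb = 2 * sum(values.values())
print(nb)           # 2 * (3 + 3 + 2 + 2) = 20 conditional probabilities,
                    # plus the prior P(PlayTennis)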
Learning (Training) Phase: MLE Estimates
Using the Learned Parameters
● Given a new instance:
P(Play|e) ∝ P(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong | Play) P(Play)
● Consult conditional probabilities for both cases (Play = Yes; Play = No)
● What does the model predict for P(Play|e)? (A worked sketch with assumed numbers follows.)
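A minimal sketch of the comparison, using assumed (made-up) parameter values in place of the learned table:

# Score each class value with P(class) * prod P(evidence_i | class) and pick
# the larger unnormalized posterior.  All numbers below are assumptions.
prior = {"yes": 0.64, "no": 0.36}
cond = {
    "yes": {"Outlook=Sunny": 0.22, "Temp=Cool": 0.33, "Humidity=High": 0.33, "Wind=Strong": 0.33},
    "no":  {"Outlook=Sunny": 0.60, "Temp=Cool": 0.20, "Humidity=High": 0.80, "Wind=Strong": 0.60},
}
evidence = ["Outlook=Sunny", "Temp=Cool", "Humidity=High", "Wind=Strong"]

scores = {}
for c in prior:
    score = prior[c]
    for e in evidence:
        score *= cond[c][e]
    scores[c] = score                    # proportional to P(Play = c | e)

print(scores)                            # roughly {'yes': 0.005, 'no': 0.021}
print(max(scores, key=scores.get))       # 'no' under these assumed numbers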
What Does the Evidence Predict?
Maximum a Posteriori Decision Rule (MAP)
● Approximately Bayesian
● Foundation for Naïve Bayes classifiers
● Find the most probable hypothesis hi, given the data d (stated compactly below)
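Stated compactly (the standard form, consistent with AIMA Ch. 20):

h_{\mathrm{MAP}} \;=\; \operatorname*{arg\,max}_{h_i} P(h_i \mid \mathbf{d})
               \;=\; \operatorname*{arg\,max}_{h_i} P(\mathbf{d} \mid h_i)\, P(h_i)

P(d) is dropped from the second expression because it is the same for every hypothesis, so it does not affect the arg max.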
Week 7, Meeting 21, Segment 3 of 3: Naïve Bayes for Text Classification
Naïve Bayes for Text Classification
Classify sentences from journalism into Category = {news, sports, business,
weather, entertainment}
● Compute the probability of Category conditioned on the words Wi
● MLE estimates of prior probabilities of each category, e.g.,
P(Category=weather) = 0.09 if 9% of news sentences pertain to weather
● MLE estimates of conditional probabilities of words, e.g.,
P(Wordi=true|Category = business) for all words Wi where |Wi| > threshold
○ P(W=stock|business) is probably much higher than P(W=stock|weather)
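A minimal bag-of-words sketch with made-up priors and word probabilities; summing logs instead of multiplying probabilities is a standard trick to avoid underflow, not something the slide prescribes:

import math

priors = {"business": 0.20, "weather": 0.09}      # made-up P(Category)
word_probs = {                                     # made-up P(word | Category)
    "business": {"stock": 0.02,   "rain": 0.0005},
    "weather":  {"stock": 0.0002, "rain": 0.03},
}

def score(category, words):
    # log P(Category) + sum_i log P(word_i | Category)
    s = math.log(priors[category])
    for w in words:
        s += math.log(word_probs[category][w])
    return s

sentence = ["stock", "rain"]
print(max(priors, key=lambda c: score(c, sentence)))   # 'business' here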
Performance of Naïve Bayes for Text Classification
● Words in running text are not conditionally independent given the category,
so applying NB to text classification is naïve
● Nevertheless, NB works well on many text classification tasks
○ The observed probability may be quite different from the NB estimate
○ The NB classification can still be correct even if the estimated probabilities
are wrong, as long as the relative order of the class values' scores is correct
Zero Conditional Probability Problem
● What if we have seen no training cases where patient had flu and
muscle aches?
● Then the value of any product using this conditional probability is zero;
none of the other evidence can be considered
Smoothing
● Many words are relatively rare
○ A relatively infrequent word might not occur in one of the text categories
○ A word in the test data might not have been seen in the training data
● If any word is assigned a zero conditional probability for any category,
then the product of conditional probabilities is zero
● Ensure that a conditional probability is assigned for every word in
every class
○ Smooth all conditional probabilities: add a small number to all counts
○ Use a special UNKNOWN token during training and testing; all words not
in the training data can be treated as instances of UNKNOWN
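A minimal sketch combining both ideas, add-one smoothing and an UNKNOWN token; the function, token name, and counts are illustrative assumptions:

def smoothed_cond_prob(word, category, counts, vocab):
    # Estimate P(word | category) as (count + 1) / (total + |V| + 1), where the
    # extra +1 in the denominator reserves probability mass for <UNK>.
    cat_counts = counts[category]
    total = sum(cat_counts.values())
    if word not in vocab:
        word = "<UNK>"                   # map unseen test words to UNKNOWN
    return (cat_counts.get(word, 0) + 1) / (total + len(vocab) + 1)

counts = {"weather": {"rain": 30, "sun": 20}}    # made-up training counts
vocab = {"rain", "sun"}
print(smoothed_cond_prob("stock", "weather", counts, vocab))   # nonzero: 1/53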
Laplace Smoothing
● Smoothing is necessary in many applications of NB or other
probabilistic models, especially to text classification
○ Avoid zero counts using one of many smoothing methods
○ Laplace (m-estimate): add mp to the numerator and m to the denominator of all
parameter estimates, where p is a prior estimate for an unseen value (e.g., 1/t
for t values of Xi) and m is a weight on the prior (written out below)
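Written out (standard m-estimate form; n_{c,x} and n_c are my symbols for the counts):

\hat{P}(X_i = x \mid C = c) \;=\; \frac{n_{c,x} + m\,p}{n_c + m}

Here n_{c,x} is the number of training examples of class c with X_i = x, n_c is the number of training examples of class c, p is the prior estimate for an unseen value (e.g., 1/t for t values of X_i), and m weights the prior. Setting m = t and p = 1/t gives add-one (Laplace) smoothing.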
Summary
● Naïve Bayes is a classification method for data represented using N
random variables, where N-1 of them are conditionally independent
given the Nth
● The MAP decision rule is used to predict the value of the conditioning
variable, given a new observed example
● Although NB assumes conditional independence in cases where the
random variables are not conditionally independent (e.g., text
classification), it can still perform well
● For text classification, smoothing is critical