Naïve Bayes for SPAM Classification
AIMA 12.6, and Additional Readings
CMPSC 442
Week 8, Meeting 22, 3 Segments
Outline
● Naive Bayes as a Generative Model to Classify Text
● Practical Issues in Applying Naive Bayes to Classify Text
● Naive Bayes for SPAM Classification
Naïve Bayes for SPAM Classification
AIMA 12.6, and Additional Readings
CMPSC 442
Week 8, Meeting 22, Segment 1 of 3: Naive Bayes as a
Generative Model to Classify Text
Machine Learning in General
● Types of ML
○ Replicate a pattern given by a supervision signal (supervised learning)
○ Discover new patterns (unsupervised learning)
■ Infer a pattern given an indirect supervision signal (semi-supervised)
○ Learn through trial and error by interacting with environment and receiving
a reinforcement signal (reward)
● Supervised machine learning types:
○ Classification (e.g., Naive Bayes)
○ Regression
○ Sequence prediction
Naive Bayes as a Machine Learning Algorithm
● Performs classification: given an example, produces a hypothesis as to
which class the example belongs in
○ Relies on the Maximum a Posteriori decision rule (Mtg 21, slide 16; see the
formula after this list)
● NB is a form of supervised statistical machine learning: a training dataset
that has already been classified provides the supervision
○ The parameters of the model are the prior probabilities of each class value,
and the conditional probabilities of each conditionally independent variable
○ The parameter values are determined empirically, using maximum likelihood
estimation (counts; Mtg 21, slide 11)
○ A model with learned parameters is tested on previously unseen data
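Concretely, for class variable Y and conditionally independent features X1, . . ., Xn, the MAP decision rule takes the standard form:

\hat{y} = \arg\max_{y}\; P(y)\,\prod_{i=1}^{n} P(x_i \mid y)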
Representation of the Data
● Real world data is converted to learning examples by defining random
variables that occur in the data
○ Values for the random variables are derived from observations
○ The values of random variables for an example represent the example
● Assumptions:
○ The random variables are conditionally independent, given the class
○ The training and test data are drawn from the same distribution
(stationarity)
Naive Bayes as a Classifier for Classes of Texts
● Words that occur in any of the documents can serve as random
variables, conditionally independent of one another, given the random
variable for the class (Mtg 21, slide 18)
○ Observed relative frequencies of each class provide the class priors
○ The set of all distinct words that occur is the Vocabulary
○ NB relies on a Bag-Of-Words (BOW) representation of documents
○ Observed counts of the words “represent” the example texts
○ The conditional probability of a word given a class represents an expectation
of how strongly the word supports belief in the posterior probability of that
class
Why a Naive Bayes Model is Generative
● ML models that rely on joint probability distributions are generative
○ Given a full joint probability distribution, P(X1, . . ., Xn, Y), the hypothesis
P(Y|X) depends on known priors and conditional probabilities: it’s all
probabilities
○ Conceptually, new data can be generated from the model
■ P(X|Y) = P(X,Y) / P(Y)
● ML models can be discriminative rather than generative
○ Discriminative models do not use full joint probability distributions
○ P(Y|X) depends on features X that discriminate among the outcomes Y
○ Logistic regression is a discriminative modeling method
Likelihood Function
A likelihood function has the form of a probability function (e.g., a
probability mass function), viewed as a function of the parameters
● The likelihood L is treated as a function of the parameters, L(θ), for fixed x
and y
○ where θ is a vector of parameters
○ x is the fixed observed data (e.g., Naïve Bayes features)
○ y is the fixed class labels (from which the class priors are estimated)
● Bayes’ rule: Posterior ∝ Likelihood × Prior
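For a Naive Bayes text classifier, assuming the training documents are drawn independently, the likelihood of the training data can be written as:

L(\theta) = \prod_{j} P(y_j;\theta)\,\prod_{i} P(x_{ij} \mid y_j;\theta)

where j ranges over training documents, y_j is the class of document j, x_{ij} are its features (words), and θ collects the class priors and the conditional word probabilities.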
Observations, Counterfactuals, & Random Variables
● Assume we observe data X = x1, x2, . . . , xn
● Counterfactual nature of Bayesian statistics
○ Statistical modeling assumes that even though x is observed, the values
could have been different
○ John Stuart Mill, A System of Logic, Ratiocinative and Inductive (1843)
● Essentially the same notion as Kolmogorovian probability: X is a
random variable
○ Observations could have been different
○ Assume each xi drawn from a particular distribution, e.g., Binomial
Maximum Likelihood Estimation
● The statistical inference problem is to estimate the parameters θ given
observations xi for the classes yj
● Maximum likelihood estimation (MLE) chooses the estimate θ* that
maximizes the likelihood function, i.e.:
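In the notation of the previous slide (fixed data x, labels y):

\theta^{*} = \arg\max_{\theta}\; L(\theta; x, y)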
● That is, the MLE parameters are those that maximize the probability of
the observed data distribution in the classes
MLE Estimates for Text Classification: Step 1
● Given a binary text classification task with classes C = {pos, neg}
● MLE for the prior probability of pos and neg
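Counting documents in the training set, the standard MLE is:

\hat{P}(c) = \frac{N_c}{N_{doc}}, \qquad c \in \{pos, neg\}

where N_c is the number of training documents labeled c, and N_doc is the total number of training documents.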
MLE Estimates for Text Classification: Step 2
● Identify the set of unique words wi in the documents, V
● Calculate MLEs for the conditional probability of each word wi ∈ V,
given the Class
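The standard MLE, using word counts pooled over all training documents of class c:

\hat{P}(w_i \mid c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)}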
Naïve Bayes for SPAM Classification
AIMA 12.6, and Additional Readings
CMPSC 442
Week 8, Meeting 22, Segment 2 of 3: Practical Issues in
using Naive Bayes to Classify Text
Cromwell’s Rule
In Bayesian approaches
● Use of prior probabilities of 0 or 1 should be avoided
● Because: if the prior probability is 0 or 1, then so is the posterior, and
no evidence can be considered
Named by the statistician Dennis Lindley in reference to Oliver Cromwell,
based on a well-known quote from Cromwell’s 1650 letter to the General
Assembly of the Church of Scotland:
I beseech you, in the bowels of Christ, think it possible that you may
be mistaken.
How do We Apply Cromwell’s Rule
● Assume we know how many types never occur in the data
● Steal probability mass from types that occur at least once
● Distribute this probability mass over the types that never occur
Or, emulate the fictional Robin Hood
● Redistribute the wealth
● Steal from the rich to give to the poor
Add-One Laplace Smoothing
● Add pseudo counts for words that were not observed with a class but could
have occurred there (as if you had more data); see the formula below:
○ Add 1 to the numerator in the conditional probabilities for every word
○ Increment the denominator (the total count of words in the class) by the size of
the vocabulary, since the class total is expanded by all the pseudo counts
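With add-one smoothing, the MLE conditional probability from Step 2 becomes:

\hat{P}(w_i \mid c) = \frac{count(w_i, c) + 1}{\left(\sum_{w \in V} count(w, c)\right) + |V|}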
More General Form of Laplace Smoothing
● Add α, a constant much smaller than 1
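That is, for a smoothing constant α (add-one smoothing is the case α = 1):

\hat{P}(w_i \mid c) = \frac{count(w_i, c) + \alpha}{\left(\sum_{w \in V} count(w, c)\right) + \alpha\,|V|}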
Pros and Cons of Laplace Smoothing
● Pros:
○ Very simple technique
○ Addresses the key idea that smoothing compensates for not having
enough data: Cromwell’s rule
● Cons:
○ Probability of frequent words is underestimated
○ Probability of rare (or unseen) words is overestimated
○ Therefore, too much probability mass is shifted towards unseen words
○ All unseen words are smoothed in the same way
● Many more sophisticated smoothing methods exist; for this class, use
Laplace
Smoothing Can Cause Underflow
● Smoothing, especially with low α, leads to values close to 0
● Multiplying lots of probabilities, which are between 0 and 1 by
definition, can result in floating-point underflow
● Mitigation: calculate in log space
○ Given that log(xy) = log(x) + log(y): perform all computations by summing
logs of probabilities rather than multiplying probabilities
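A minimal Python sketch of this log-space computation (the function and variable names are illustrative, not from the slides):

import math

def log_score(prior, cond_prob, words):
    # Work in log space: since log(xy) = log(x) + log(y), sum the logs of
    # the probabilities instead of multiplying the probabilities themselves.
    score = math.log(prior)
    for w in words:
        if w in cond_prob:            # skip words outside the vocabulary
            score += math.log(cond_prob[w])
    return score

# The MAP hypothesis is the class with the highest log score; comparing
# log scores is equivalent to comparing the original products.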
Example: Sentiment Classification, with Smoothing
● Application: reviews of CMPSC 442 at PSU
○ Should the student take the class?
○ How should the student interpret a single review?
● Classes: {Positive, Negative}
● Data: |V| = 25; |Vneg| = 17; |Vpos| = 12
Neg: Just plain boring
Neg: Entirely predictable and lacking in content
Neg: No interesting information and very little relevance to my life
Pos: Amazing relevance to my life
Pos: The best class I’ve ever taken
Example adapted from Jurafsky & Martin, Speech & Language Processing,
3rd edition draft (website), Chapter 4, Naive Bayes and Sentiment Classification
Predicting the Sentiment from New Evidence
● New review: “a somewhat predictable class”
● Priors: estimated from the class counts in the training data
● Conditional probabilities for previously observed words, with smoothing
(worked through in the sketch below)
● Maximum probability hypothesis: negative
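A small Python sketch of this example, with the five training reviews above and add-one smoothing; the helper names are illustrative, and the counts are recomputed from the review texts rather than copied from the slide:

from collections import Counter
import math

train = [
    ("neg", "just plain boring"),
    ("neg", "entirely predictable and lacking in content"),
    ("neg", "no interesting information and very little relevance to my life"),
    ("pos", "amazing relevance to my life"),
    ("pos", "the best class i've ever taken"),
]

# Class priors come from document counts; word counts give the BOW model.
docs_per_class = Counter(label for label, _ in train)
word_counts = {c: Counter() for c in docs_per_class}
for label, text in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(c, words):
    # log P(c) plus the sum of log P(w | c) with add-one (Laplace) smoothing.
    score = math.log(docs_per_class[c] / len(train))
    total = sum(word_counts[c].values())
    for w in words:
        if w in vocab:  # unknown words ("a", "somewhat") are ignored
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return score

review = "a somewhat predictable class".split()
print(max(docs_per_class, key=lambda c: log_posterior(c, review)))  # prints "neg"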
Feature Selection
Should all words (features) be used to represent the documents?
● Use terms above some frequency threshold
○ No particular foundation
○ Can work well in practice
● Feature selection using Mutual Information (MI)
○ Clear information-theoretic interpretation
○ The amount of information (in bits) obtained about one random variable by
observing another random variable (see next slide)
○ May select rare uninformative terms
● Other methods: Bayes factor good at selecting informative rare terms
Mutual Information
In the training set, choose the k words that best predict the categories:
the words X with maximal Mutual Information with the class Y:
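The mutual information between a word variable X and the class Y has the standard definition:

I(X; Y) = \sum_{x}\sum_{y} P(x, y)\,\log_2 \frac{P(x, y)}{P(x)\,P(y)}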
Evaluation Metrics
● Accuracy
● Information Retrieval metrics
○ Recall
○ Precision
○ F1: their harmonic mean
● A confusion matrix is a square matrix of n classes by n hypotheses; for two
classes (rows = actual class, columns = hypothesized class):
TP FN
FP TN
● The same confusion matrix is used for accuracy and for the IR metrics
● Accuracy is preferred if all cell values can be determined
● Recall and precision are often used when TN is not known
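In terms of the confusion-matrix cells, the standard definitions are:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad
Recall = \frac{TP}{TP + FN} \qquad
Precision = \frac{TP}{TP + FP} \qquad
F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}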
Naïve Bayes for SPAM Classification
AIMA 12.6, and Additional Readings
CMPSC 442
Week 8, Meeting 22, Segment 3 of 3: Naive Bayes for
SPAM Classification
SpamCop (Pantel & Lin, 1998)
● Used a Naïve Bayes classifier
○ Training data: 160 spams, 466 non-spams
○ Test data: 277 spams, 346 non-spams
● Results: error rate of 4.33%
○ To be truly useful, error rate should be smaller
○ Still, good error rate, considering the small training set
SpamCop Details
● Tokenized all the words in each email and stemmed them (Porter stemmer)
○ Stemming normalizes different forms of a word (e.g., past tense, present tense)
○ Used only unigram frequencies (individual word frequencies)
■ Bigrams did not help much (frequencies of 2-word sequences)
■ Trigrams degraded results
● Calculated the conditional probabilities of words given {+spam, -spam}
using a form of smoothing called the m-estimate (see the formula below)
● Omitted words that do not satisfy certain conditions
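The m-estimate is not spelled out on the slide; a common formulation (given here as general background, not as SpamCop’s exact settings) is:

\hat{P}(w \mid c) = \frac{count(w, c) + m \cdot p}{N_c + m}

where N_c is the total word count in class c, p is a prior estimate of P(w|c) (often 1/|V|), and m is the equivalent sample size; add-one Laplace smoothing is the special case p = 1/|V|, m = |V|.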
More Evidence that Naïve Bayes is not so Naïve
● Naïve Bayes: First and Second place in KDD-CUP 97 competition,
among 16 state-of-the-art algorithms (for that time)
○ Test case: Direct mail response prediction model
○ Predict whether the mail recipient will respond to the advertisement: 750,000 records
● Optimal if the independence assumptions hold
● Very fast:
○ Learning requires one pass over the training data
○ Testing is linear in the number of attributes and in the document set size
● Low storage requirements
Paul Graham Reimplemented SpamCop (2002)
● Size of training dataset
○ Pantel & Lin: 526 (160 spam/466 non-spam)
○ Graham’s dataset was much larger
● Stemming used by Pantel & Lin, not by Graham
● Document size
○ Pantel & Lin used all tokens, missed “long” spams disguised as genuine
○ Graham used only 15 most significant tokens
● Adding bias to decrease FP rate: Graham added a small weight to counts of
non-spam tokens
Improvements on SpamCop
Pantel & Lin
● 92% recall
● 1.16% false positives (FP)
Re-implementation by Paul Graham (2002)
● 0.03% FP
Other Observations on Paul Graham’s Naive Bayes
● A powerful text classifier: CRM114 (Bill Yerazunis)
○ Straight text classification using multiple classifiers (no naïve Bayes)
○ Accuracy ≥ 99.9%
● Graham’s later improvements perform as well as CRM114
○ Improved treatment of punctuation
○ Better tokenization of compounds (e.g., $20-$25)
○ Distinct token symbols for words in To/From/Subject/Body
■ “free” has a probability in spam body of 65%
■ “free” has a probability in spam Subject line of 98%
○ Increased the size of vocabulary from 23,000 to 187,000
Summary – Page One
● Naive Bayes can be used in classification tasks by computing MLE
estimates of the model parameters from observed data, and using the
MAP decision rule
● The random variables in a Naive Bayes model other than the Class
variable can be considered features that represent the data
● Naive Bayes assumptions
○ Conditional independence of the features given the class
○ Stationarity of the probability distributions of all the random variables
Summary – Page Two
● MLE estimates improve with more data
● Smoothing is consistent with the observation that more data, or a
different data set, might provide parameter estimates for cases that
were not observed due to accidental properties of the sample
● Applying Naive Bayes in log space mitigates floating-point underflow
● A Naive Bayes model (or any classifier) can benefit from feature
selection
● Text classification can improve significantly through reliance on domain
knowledge or careful study of the data