
Naïve Bayes for SPAM Classification
AIMA 12.6, and Additional Readings

CMPSC 442
Week 8, Meeting 22, 3 Segments

Outline

● Naive Bayes as a Generative Model to Classify Text
● Practical Issues in Applying Naive Bayes to Classify Text
● Naive Bayes for SPAM Classification


Naïve Bayes for SPAM Classification
AIMA 12.6, and Additional Readings

CMPSC 442
Week 8, Meeting 22, Segment 1 of 3: Naive Bayes as a
Generative Model to Classify Text

Machine Learning in General

● Types of ML
○ Replicate a pattern given by a supervision signal (supervised learning)
○ Discover new patterns (unsupervised learning)
■ Infer a pattern given an indirect supervision signal (semi-supervised)
○ Learn through trial and error by interacting with the environment and receiving a reinforcement signal (reward)
● Supervised machine learning types:

○ Classification (e.g., Naive Bayes)
○ Regression
○ Sequence prediction


Naive Bayes as a Machine Learning Algorithm

● Performs classification: given an example, produces a hypothesis as to which class the example belongs to
○ Relies on the Maximum a Posteriori (MAP) decision rule (Mtg 21, slide 16); see the formula below

● NB is a form of supervised statistical machine learning: a training dataset that has already been classified provides the supervision
○ The parameters of the model are the prior probabilities of each class value, and the conditional probabilities of each conditionally independent variable
○ The parameter values are determined empirically, using maximum likelihood estimation (counts; Mtg 21, slide 11)
○ A model with learned parameters is tested on previously unseen data
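The MAP decision rule for Naive Bayes, stated here in its standard form as a reminder (wi denotes the i-th word of the document being classified):

c_MAP = argmax_c P(c | w1, ..., wn) = argmax_c [ P(c) × Π_i P(wi | c) ]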


Representation of the Data

● Real-world data is converted to learning examples by defining random variables that occur in the data
○ Values for the random variables are derived from observations
○ The values of the random variables for an example represent the example

● Assumptions:
○ The random variables are conditionally independent, given the class
○ The training and test data are drawn from the same distribution (stationarity)


Naive Bayes as a Classifier for Classes of Texts

● Words that occur in any of the documents can serve as random variables, conditionally independent of one another given the random variable for the class (Mtg 21, slide 18)
○ Normalized observed counts of each class give the class priors
○ The set of all distinct words that occur is the vocabulary V
○ NB relies on a Bag-Of-Words (BOW) representation of documents (see the sketch after this list)
○ Observed counts of words “represent” the example texts
○ The conditional probability of a word represents an expectation of how strongly the word supports the belief in (i.e., the posterior probability of) each class
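A minimal Python sketch of the Bag-Of-Words counts referred to above (the tiny corpus and all names below are invented for illustration, not taken from the slides):

from collections import Counter

# Hypothetical labeled training documents: (class, text)
training_data = [
    ("pos", "amazing class amazing content"),
    ("neg", "boring class no content"),
]

class_doc_counts = Counter()   # used later for the class priors
class_word_counts = {}         # Bag-Of-Words: per-class word counts, order ignored
for label, text in training_data:
    class_doc_counts[label] += 1
    class_word_counts.setdefault(label, Counter()).update(text.lower().split())

vocabulary = {w for counts in class_word_counts.values() for w in counts}
print(class_word_counts["pos"]["amazing"])   # 2
print(len(vocabulary))                       # 5 distinct words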


Why a Naive Bayes Model is Generative

● ML models that rely on joint probability distributions are generative
○ Given a full joint probability distribution P(X1, ..., Xn, Y), the hypothesis P(Y|X) depends on known priors and conditional probabilities: it’s all probabilities
○ Conceptually, new data can be generated from the model
■ P(X|Y) = P(X,Y) / P(Y)

● ML models can be discriminative rather than generative
○ Discriminative models do not use full joint probability distributions
○ P(Y|X) depends on features X that discriminate among the outcomes Y
○ Logistic regression is a discriminative modeling method


Likelihood Function

A likelihood function is built from a probability function (e.g., a probability mass function), viewed as a function of the parameters rather than of the data

● Likelihood L is considered a function of the parameters, L(θ), for fixed x and y
○ where θ is a vector of parameters
○ x is the fixed data (e.g., Naïve Bayes features)
○ y is the fixed class labels observed with x

● Bayes rule: Posterior ∝ Likelihood × Prior
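Concretely, for a Naive Bayes model the likelihood of the training set factors into priors and conditionals (a standard formulation; here x_ij is the i-th feature of the j-th example and y_j its class):

L(θ) = P(x, y | θ) = Π_j [ P(y_j) × Π_i P(x_ij | y_j) ]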


Observations, Counterfactuals, & Random Variables

● Assume we observe data X = x1, x2, . . . , xn
● Counterfactual nature of Bayesian statistics
○ Statistical modeling assumes even though x is observed, the values could have been different
○ John Stuart Mill, A System of Logic, Ratiocinative and Inductive (1843)

● Essentially the same notion as Kolmogorovian probability: X is a random variable
○ Observations could have been different
○ Assume each xi is drawn from a particular distribution, e.g., Binomial


Maximum Likelihood Estimation

● The statistical inference problem is to estimate the parameters θ given observations xi for the classes yj

● Maximum likelihood estimation (MLE) chooses the estimate θ* that maximizes the likelihood function (see the formula below)

● That is, the MLE parameters are those that maximize the probability of the observed data distribution in the classes
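The formula referred to above, in standard notation:

θ* = argmax_θ L(θ), where L(θ) = P(x, y | θ) for the fixed training data x and class labels y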


MLE Estimates for Text Classification: Step 1

● Given a binary text classification C = {pos, neg}
● MLE for the prior probability of pos and neg
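The standard MLE estimates for the class priors, written with counts (Ndoc is the total number of training documents; Nc is the number of documents of class c):

P(pos) = Npos / Ndoc        P(neg) = Nneg / Ndoc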


MLE Estimates for Text Classification: Step 2

● Identify the set of unique words wi in the documents, V
● Calculate MLEs for the conditional probability of each word wi ∈ V, given the class
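The standard MLE estimate, using raw counts over the documents of each class:

P(wi | c) = count(wi, c) / Σ_{w in V} count(w, c)

where count(wi, c) is the number of occurrences of wi in the training documents labeled c.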


Naïve Bayes for SPAM Classification
AIMA 12.6, and Additional Readings

CMPSC 442
Week 8, Meeting 22, Segment 2 of 3: Practical Issues in
using Naive Bayes to Classify Text

Cromwell’s Rule

In Bayesian approaches

● Use of prior probabilities of 0 or 1 should be avoided
● Because: if the prior probability is 0 or 1, then so is the posterior, and no evidence can be considered

Named by the statistician Dennis Lindley in reference to Oliver Cromwell, based on a well-known quote before the General Assembly of the Church of Scotland in 1650:

“I beseech you, in the bowels of Christ, think it possible that you may be mistaken.”


How do We Apply Cromwell’s Rule

● Assume we know how many types never occur in the data
● Steal probability mass from types that occur at least once
● Distribute this probability mass over the types that never occur


Or, emulate the fictional Robin Hood
● Redistribute the wealth
● Steal from the rich to give to the poor


Add-One Laplace Smoothing

● Add pseudo counts for unobserved class members where the missing word could have occurred (as if you had more data); see the formula below:
○ Add 1 to the numerator in the conditional probability estimate for every word
○ Increment the denominator (the total count of words in the class) by the size of the vocabulary, since the class total is expanded by all the pseudo counts
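With add-one (Laplace) smoothing, the conditional probability estimate from Step 2 becomes:

P(wi | c) = ( count(wi, c) + 1 ) / ( Σ_{w in V} count(w, c) + |V| )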


More General Form of Laplace Smoothing

● Add a pseudo count α, typically a number much smaller than 1, instead of adding 1
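The same estimate with a pseudo count of α instead of 1:

P(wi | c) = ( count(wi, c) + α ) / ( Σ_{w in V} count(w, c) + α|V| )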


Pros and Cons of Laplace Smoothing

● Pros:
○ Very simple technique
○ Addresses the key idea that smoothing compensates for not having enough data: Cromwell’s rule

● Cons:
○ Probability of frequent words is underestimated
○ Probability of rare (or unseen) words is overestimated
○ Therefore, too much probability mass is shifted towards unseen words
○ All unseen words are smoothed in the same way

● Many more sophisticated smoothing methods exist; for this class, use Laplace smoothing


Smoothing Can Cause Underflow

● Smoothing, especially with low α, leads to values close to 0
● Multiplying many probabilities, which are between 0 and 1 by definition, can result in floating-point underflow
● Mitigation: calculate in log space (see the sketch below)
○ Given that log(xy) = log(x) + log(y): perform all computations by summing logs of probabilities rather than multiplying probabilities
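A minimal Python sketch of log-space Naive Bayes scoring with add-one smoothing (the toy model below is an illustrative assumption, not the slides’ own data or code):

import math
from collections import Counter

# Illustrative smoothed model: class priors and per-class word counts
priors = {"pos": 2 / 5, "neg": 3 / 5}
word_counts = {
    "pos": Counter({"class": 1, "amazing": 1}),
    "neg": Counter({"predictable": 1, "boring": 1}),
}
vocabulary = {"class", "amazing", "predictable", "boring"}

def log_score(words, label):
    # Sum log probabilities instead of multiplying probabilities (avoids underflow)
    denom = sum(word_counts[label].values()) + len(vocabulary)  # add-one denominator
    score = math.log(priors[label])
    for w in words:
        if w in vocabulary:  # out-of-vocabulary words are dropped
            score += math.log((word_counts[label][w] + 1) / denom)
    return score

review = "a somewhat predictable class".split()
print(max(priors, key=lambda c: log_score(review, c)))  # -> neg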


Example: Sentiment Classification, with Smoothing
● Application: reviews of CMPSC 442 at PSU

○ Should the student take the class?
○ How should the student interpret a single review?

● Classes: {Positive, Negative}
● Data: |V| = 25; |Vneg| = 17; |Vpos| = 12

Neg  Just plain boring
Neg  Entirely predictable and lacking in content
Neg  No interesting information and very little relevance to my life
Pos  Amazing relevance to my life
Pos  The best class I’ve ever taken


Example adapted from Jurafsky & Martin, Speech & Language Processing, website for the 3rd edition draft, Chapter 4, Naive Bayes and Sentiment Classification

Predicting the Sentiment from New Evidence

● New review: “a somewhat predictable class”
● Priors: P(neg) = 3/5, P(pos) = 2/5
● Conditional probabilities for previously observed words, with smoothing (worked through below)

● Maximum probability hypothesis: negative (see the arithmetic below)
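One plausible reconstruction of the arithmetic behind this conclusion, using the counts from the previous slide and assuming that “predictable” occurs once in the negative training texts, “class” occurs once in the positive texts, and the out-of-vocabulary words “a” and “somewhat” are dropped:

P(predictable | neg) = (1 + 1) / (17 + 25) = 2/42        P(predictable | pos) = (0 + 1) / (12 + 25) = 1/37
P(class | neg) = (0 + 1) / (17 + 25) = 1/42              P(class | pos) = (1 + 1) / (12 + 25) = 2/37

Score(neg) = 3/5 × 2/42 × 1/42 ≈ 6.8 × 10^-4
Score(pos) = 2/5 × 1/37 × 2/37 ≈ 5.8 × 10^-4

Score(neg) > Score(pos), so the review is classified as negative.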


Feature Selection

Should all words (features) be used to represent the documents?

● Use terms above some frequency threshold
○ No particular foundation
○ Can work well in practice

● Feature selection using Mutual Information (MI)
○ Clear information-theoretic interpretation
○ The amount of information (bits) obtained about one random variable from another random variable (see next slide)
○ May select rare uninformative terms

● Other methods: the Bayes factor is good at selecting informative rare terms


Mutual Information

In the training set, choose the k words that best predict the categories: the words X with maximal Mutual Information I(X; Y) with the class Y
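The standard definition of the mutual information between a word-occurrence variable X and the class variable Y:

I(X; Y) = Σ_x Σ_y P(x, y) log [ P(x, y) / ( P(x) P(y) ) ]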


Evaluation Metrics

● Confusion matrix (rows: actual class; columns: predicted class):

             Predicted +   Predicted −
Actual +         TP             FN
Actual −         FP             TN

● Accuracy


● Information Retrieval Metrics
○ Recall
○ Precision
○ F1: their harmonic mean (see the formulas below)


● A confusion matrix is a square matrix of n classes by n hypotheses
● The same confusion matrix is used for accuracy, and IR metrics
● Accuracy is preferred if all cell values can be determined
● Recall and precision are often used when TN is not known
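In terms of the confusion matrix cells, the standard definitions are:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F1 = 2 × Precision × Recall / (Precision + Recall)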

Naïve Bayes for SPAM Classification
AIMA 12.6, and Additional Readings

CMPSC 442
Week 8, Meeting 22, Segment 3 of 3: Naive Bayes for
SPAM Classification

SpamCop (Pantel & Lin, 1998)

● Used a Naïve Bayes classifier
○ Training data: 160 spams, 466 non-spams
○ Test data: 277 spams, 346 non-spams

● Results: error rate of 4.33%
○ To be truly useful, error rate should be smaller
○ Still, good error rate, considering the small training set


SpamCop Details

● Tokenized all the words in each email and stemmed them (Porter Stemmer)
○ Stemming normalizes different forms of a word (e.g., past tense, present tense)
○ Used only unigram frequencies (individual word frequencies)
■ Bigrams (frequencies of 2-word sequences) did not help much
■ Trigrams degraded results

● Calculated the conditional probabilities of words given {+spam, -spam}, using a form of smoothing called the m-estimate (see below)

● Omitted words that do not satisfy certain conditions
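For reference, the m-estimate in its commonly cited general form (e.g., Mitchell, Machine Learning; the exact variant Pantel & Lin used may differ), with prior estimate p and equivalent sample size m:

P(w | c) = ( count(w, c) + m × p ) / ( Σ_{w'} count(w', c) + m )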


More Evidence that Naïve Bayes is not so Naïve

● Naïve Bayes: First and Second place in the KDD-CUP 97 competition, among 16 state-of-the-art algorithms (for that time)
○ Test case: direct mail response prediction model
○ Predict whether the mail recipient will respond to the advertisement: 750,000 records

● Optimal if the independence assumptions hold:
○ Very fast:
■ Learning requires one pass over the training data
■ Testing is linear in the number of attributes and in the document set size
○ Low storage requirements


Paul Graham Reimplemented SpamCop (2002)

● Size of training dataset
○ Pantel & Lin: 626 (160 spam/466 non-spam)
○ Graham’s dataset was much larger

● Stemming was used by Pantel & Lin, not by Graham
● Document size
○ Pantel & Lin used all tokens, and missed “long” spams disguised as genuine mail
○ Graham used only the 15 most significant tokens

● Adding bias to decrease the FP rate: Graham added a small weight to counts of non-spam tokens


Improvements on SpamCop

Pantel & Lin

● 92% recall
● 1.16% false positives (FP)

Re-implementation by Paul Graham (2002)

● 0.03% FP


Other Observations on Paul Graham’s Naive Bayes

● A powerful text classifier: CRM114 (Bill Yerazunis)
○ Straight text classification: multiple classifiers (no naïve Bayes)
○ Accuracy ≥ 99.9%

● Graham’s later improvements perform as well as CRM114
○ Improved treatment of punctuation
○ Better tokenization of compounds (e.g., $20-$25)
○ Distinct token symbols for words in To/From/Subject/Body
■ “free” has a probability in a spam body of 65%
■ “free” has a probability in a spam Subject line of 98%
○ Increased the size of the vocabulary from 23,000 to 187,000


Summary – Page One

● Naive Bayes can be used in classification tasks by computing MLE estimates of the model parameters from observed data, and using the MAP decision rule

● The random variables in a Naive Bayes model, other than the Class variable, can be considered features that represent the data

● Naive Bayes assumptions:
○ Conditional independence of the features given the class
○ Stationarity of the probability distributions of all the random variables


Summary – Page Two

● MLE estimates improve with more data
● Smoothing is consistent with the observation that more data, or a different data set, might provide parameter estimates for cases that were not observed, due to accidental properties of the sample

● Applying Naive Bayes in log space guards against underflow
● A Naive Bayes model (or any classifier) can benefit from feature selection
● Text classification can improve significantly through reliance on domain knowledge or careful study of the data
