
COMP9414 Text Classification 1

This Lecture

• Probabilistic Formulation of Text Classification

• Rule-Based Text Classification

• Bayesian Text Classification

◮ Bernoulli Model

◮ Multinomial Naive Bayes

• Evaluating Classifiers


COMP9414: Artificial Intelligence

Lecture 6b: Text Classification

Wayne Wobcke

e-mail:w. .au

UNSW ©W. Wobcke et al. 2019–2021

COMP9414 Text Classification 3

Example Movie Reviews/Ratings

. . . unbelievably disappointing . . .

Full of zany characters and richly applied satire, and some great

plot twists.

The greatest screwball comedy ever filmed.

It was pathetic. The worst part about it was the boxing scenes.


COMP9414 Text Classification 2

Text Classification Applications

• Spam Detection

• Authorship Analysis

• E-Mail Classification/Prioritization

• News/Scientific Article Topic Classification

• Event Extraction (Event Type Classification)

• Sentiment Analysis

• Recommender Systems (using Product Reviews)


COMP9414 Text Classification 5

Help User Define Rules


COMP9414 Text Classification 4

Rule-Based Method


COMP9414 Text Classification 7

Supervised Learning

• Input: A document (e-mail, news article, review, tweet)

• Output: One class drawn from a fixed set of classes

◮ So text classification is a multi-class classification problem

◮ . . . and sometimes a multi-label classification problem

• Learning Problem

◮ Input: Training set of labelled documents {(d1, c1), · · ·}

◮ Output: Learned classifier that maps d to predicted class c


COMP9414 Text Classification 6

Suggest Features using Naive Bayes


COMP9414 Text Classification 9

Feature Engineering

Example SpamAssassin (Spam E-Mail)

• Mentions Generic Viagra

• Online Pharmacy

• Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)

• Phrase: impress . . . girl

• From: starts with many numbers

• Subject is all capitals

• HTML has a low ratio of text to image area

• One hundred percent guaranteed

• Claims you can be removed from the list

http://spamassassin.apache.org/old/tests_3_3_x.html


COMP9414 Text Classification 8

Probabilistic Formulation

• Events: Occurrence of features x, occurrence of document of class c

• Given document x1, · · · , xn, choose c so that P(c|x1, · · · , xn) is maximized

• Apply Bayes’ Rule

◮ P(c|x1, · · · , xn) = P(x1, · · · , xn|c).P(c) / P(x1, · · · , xn)

◮ The denominator P(x1, · · · , xn) does not depend on c, so it suffices to maximize P(x1, · · · , xn|c).P(c)


COMP9414 Text Classification 11

Naive Bayes Classification

Training data (presence/absence of words w1–w4, one document per row):

    w1  w2  w3  w4  Class
     1   0   0   1    1
     0   0   0   1    0
     1   1   0   1    0
     1   0   1   1    1
     0   1   1   0    0
     1   0   0   0    0
     1   0   1   0    1
     0   1   0   0    1
     0   1   0   1    0
     1   1   1   0    0

Estimated probabilities:

                  Class = 1   Class = 0
    P(Class)        0.40        0.60
    P(w1|Class)     0.75        0.50
    P(w2|Class)     0.25        0.67
    P(w3|Class)     0.50        0.33
    P(w4|Class)     0.50        0.50

To classify document with w2, w3, w4 (w1 absent)

• P(Class = 1|¬w1, w2, w3, w4) ∝ ((1−0.75) ∗ 0.25 ∗ 0.5 ∗ 0.5) ∗ 0.4 = 0.00625

• P(Class = 0|¬w1, w2, w3, w4) ∝ ((1−0.5) ∗ 0.67 ∗ 0.33 ∗ 0.5) ∗ 0.6 ≈ 0.03333

• Choose Class 0
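A small sketch that recomputes these two scores directly from the training table (Bernoulli word-presence features, probabilities estimated by simple counting; written from scratch rather than with any library):

    from collections import Counter

    # Sketch: estimate P(Class) and P(wi|Class) by counting, then score the
    # document {w2, w3, w4} (w1 absent) for each class.
    data = [  # rows of (w1, w2, w3, w4, Class)
        (1, 0, 0, 1, 1), (0, 0, 0, 1, 0), (1, 1, 0, 1, 0), (1, 0, 1, 1, 1),
        (0, 1, 1, 0, 0), (1, 0, 0, 0, 0), (1, 0, 1, 0, 1), (0, 1, 0, 0, 1),
        (0, 1, 0, 1, 0), (1, 1, 1, 0, 0),
    ]
    doc = (0, 1, 1, 1)  # w1 absent; w2, w3, w4 present

    class_counts = Counter(row[-1] for row in data)
    for c in (1, 0):
        docs_c = [row for row in data if row[-1] == c]
        score = class_counts[c] / len(data)                      # P(Class = c)
        for i, present in enumerate(doc):
            p_wi = sum(row[i] for row in docs_c) / len(docs_c)   # P(wi|c)
            score *= p_wi if present else (1 - p_wi)
        print(c, round(score, 5))   # 1 -> 0.00625, 0 -> 0.03333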


COMP9414 Text Classification 10

Bernoulli Model

Maximize P(x1, · · · ,xn|c).P(c)

• Features are presence or absence of word wi in document

• Apply independence assumptions

◮ P(x1, · · · , xn|c) = P(x1|c). · · · .P(xn|c)

◮ Probability of word w (not) occurring in a document of class c is independent of context

• Estimate probabilities

◮ P(w|c) = #(documents in class c containing w)/#(documents in class c)

◮ P(¬w|c) = 1 − P(w|c)

◮ P(c) = #(documents in class c)/#(documents)
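These estimators are straightforward to compute; a minimal sketch over a hypothetical labelled corpus (document-level presence only, so repeated words in a document count once):

    # Sketch: Bernoulli estimates from a hypothetical toy corpus.
    corpus = [
        (["cheap", "viagra", "offer"], "spam"),
        (["meeting", "agenda", "offer"], "ham"),
        (["cheap", "meeting"], "ham"),
    ]

    def bernoulli_estimates(corpus, c):
        docs_c = [set(words) for words, label in corpus if label == c]
        vocab = {w for words, _ in corpus for w in words}
        p_c = len(docs_c) / len(corpus)                        # P(c)
        p_w = {w: sum(w in d for d in docs_c) / len(docs_c)    # P(w|c)
               for w in vocab}
        return p_c, p_w                                        # P(not w|c) = 1 - p_w[w]

    p_spam, p_word_given_spam = bernoulli_estimates(corpus, "spam")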


COMP9414 Text Classification 13

Naive Bayes Classification

Maximize P(x1, · · · ,xn|c).P(c)

• Features are occurrences of words at positions in the document

• Apply independence assumptions

◮ P(w1, · · · , wn|c) = P(w1|c). · · · .P(wn|c)

◮ Position of word w in document doesn’t matter

• Estimate probabilities

◮ Let V be the vocabulary

◮ Let “document” c = concatenation of documents in class c

◮ P(w|c) = #(w in document c)/Σw′∈V #(w′ in document c)

◮ P(c) = #(documents in class c)/#(documents)
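A minimal sketch of these multinomial estimates, treating class c as the concatenation of its documents (hypothetical toy corpus; smoothing of unseen words is handled on a later slide):

    from collections import Counter

    # Sketch: multinomial estimates over the concatenation of each class's documents.
    corpus = [
        (["chinese", "beijing", "chinese"], "c"),
        (["tokyo", "japan", "chinese"], "j"),
    ]

    def multinomial_estimates(corpus, c):
        words_c = [w for words, label in corpus if label == c for w in words]
        counts = Counter(words_c)                  # #(w in "document" c)
        total = sum(counts.values())               # sum of counts over the vocabulary
        p_c = sum(label == c for _, label in corpus) / len(corpus)
        return p_c, {w: n / total for w, n in counts.items()}

    p_c, p_word_given_c = multinomial_estimates(corpus, "c")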


COMP9414 Text Classification 12

Bag of Words Model

I love this movie! It’s sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I’ve seen it several times, and I’m always happy to see it again whenever I have a friend who hasn’t seen it yet!

Word counts (selected words):

    it          6
    I           5
    the         4
    to          3
    and         3
    seen        2
    yet         1
    would       1
    whimsical   1
    times       1
    sweet       1
    satirical   1
    adventure   1
    genre       1
    fairy       1
    humor       1
    have        1
    great       1
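A bag-of-words representation keeps only word counts and discards order; a minimal sketch (the crude whitespace tokenisation and lower-casing here mean the counts will not exactly match the hand-built list above):

    from collections import Counter

    review = ("I love this movie! It's sweet, but with satirical humor. "
              "The dialogue is great and the adventure scenes are fun...")
    bag = Counter(review.lower().split())   # bag of words: order is thrown away
    print(bag.most_common(5))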


COMP9414 Text Classification 15

MNB Example

         Words                                  Class
    d1   Chinese Beijing Chinese                c
    d2   Chinese Chinese Shanghai               c
    d3   Chinese Macao                          c
    d4   Tokyo Japan Chinese                    j
    d5   Chinese Chinese Chinese Tokyo Japan    ?

P(Chinese|c) = (5+1)/(8+6) = 3/7

P(Tokyo|c) = (0+1)/(8+6) = 1/14

P(Japan|c) = (0+1)/(8+6) = 1/14

P(Chinese| j) = (1+1)/(3+6) = 2/9

P(Tokyo| j) = (1+1)/(3+6) = 2/9

P(Japan| j) = (1+1)/(3+6) = 2/9

To classify document d5

• P(c|d5) ∝ [(3/7)^3 . 1/14 . 1/14] . 3/4 ≈ 0.0003

• P( j|d5) ∝ [(2/9)^3 . 2/9 . 2/9] . 1/4 ≈ 0.0001

• Choose Class c
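A short sketch reproducing this example end to end, with the add-one smoothing already built into the probabilities above (|V| = 6 for this training set):

    from collections import Counter
    from math import prod

    train = [
        ("Chinese Beijing Chinese", "c"), ("Chinese Chinese Shanghai", "c"),
        ("Chinese Macao", "c"), ("Tokyo Japan Chinese", "j"),
    ]
    d5 = "Chinese Chinese Chinese Tokyo Japan"
    vocab = {w for doc, _ in train for w in doc.split()}   # |V| = 6

    def score(cls):
        words_c = [w for doc, label in train if label == cls for w in doc.split()]
        counts, total = Counter(words_c), len(words_c)
        prior = sum(label == cls for _, label in train) / len(train)
        # add-one smoothed P(w|cls) for each word of d5
        return prior * prod((counts[w] + 1) / (total + len(vocab)) for w in d5.split())

    print(score("c"), score("j"))   # approx. 0.0003 and 0.0001 -> choose class c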


COMP9414 Text Classification 14

Laplace Smoothing

• What if a word in the test document has not occurred in training?

• Then P(w|c) = 0 and so the estimate for class c is 0

• Laplace smoothing

◮ Assign small probability to unseen words

◮ P(w|c) = (#(w in document c) + 1)/(Σw′∈V #(w′ in document c) + |V|)

◮ Don’t have to add 1, can be 0.05 or some parameter α
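A minimal sketch of the general add-α estimate; α = 1 gives the formula above, and for other α the |V| term is scaled by α (an assumption of the standard Lidstone form, so the smoothed probabilities still sum to one):

    # Sketch: add-alpha (Lidstone) smoothing of multinomial word probabilities.
    def smoothed_prob(word, counts, vocab_size, alpha=1.0):
        total = sum(counts.values())
        return (counts.get(word, 0) + alpha) / (total + alpha * vocab_size)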


COMP9414 Text Classification 17

Evaluating Classifiers

2×2 Contingency Table (single class c)

                       Class c           not Class c
    Predicted c        True Positive     False Positive
    Predicted not c    False Negative    True Negative

• Precision (P) = TP/(TP+FP) – you want what you get

◮ · · · but may not get much

• Recall (R) = TP/(TP+FN) – you get what you want

◮ · · · but you might get a lot more (junk)

• F1 = 2PR/(P+R) – harmonic mean of precision and recall
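These three measures are one-liners; a small sketch with hypothetical counts:

    # Sketch: precision, recall and F1 from one class's contingency counts.
    def precision_recall_f1(tp, fp, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    print(precision_recall_f1(tp=30, fp=10, fn=20))   # (0.75, 0.6, 0.666...)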


COMP9414 Text Classification 16

Graphical Model for Example


COMP9414 Text Classification 19

Multiple Classes: Micro/Macro-Averaging

n 2×2 Contingency Tables (one per class)

• Micro-average = Aggregated measure over all classes

◮ micro-precision = Σc TPc/Σc(TPc + FPc)

◮ micro-recall = Σc TPc/Σc(TPc + FNc)

◮ Same when each instance has and is given one and only one label

◮ Dominated by larger classes

• Macro-average = Average of per-class measures

◮ macro-precision = (1/n) Σc TPc/(TPc + FPc)

◮ macro-recall = (1/n) Σc TPc/(TPc + FNc)

◮ Dominated by smaller classes

◮ Fairer for imbalanced data, e.g. sentiment analysis
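A small sketch contrasting the two averages over hypothetical per-class (TP, FP, FN) counts; the two large classes pull the micro-average up while the small class drags the macro-average down:

    # Sketch: micro vs. macro averaging from per-class (TP, FP, FN) counts.
    per_class = {"pos": (50, 10, 5), "neg": (40, 20, 10), "neu": (5, 5, 20)}

    tp_sum = sum(tp for tp, fp, fn in per_class.values())
    micro_p = tp_sum / sum(tp + fp for tp, fp, fn in per_class.values())
    micro_r = tp_sum / sum(tp + fn for tp, fp, fn in per_class.values())

    macro_p = sum(tp / (tp + fp) for tp, fp, fn in per_class.values()) / len(per_class)
    macro_r = sum(tp / (tp + fn) for tp, fp, fn in per_class.values()) / len(per_class)

    print(micro_p, micro_r)   # approx. 0.73, 0.73
    print(macro_p, macro_r)   # approx. 0.67, 0.64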


COMP9414 Text Classification 18

Multiple Classes: Per-Class Metrics

n×n Confusion Matrix (each instance in one class)

                Predicted c1   Predicted c2   · · ·
    Class c1    c11            c12            c13
    Class c2    c21            c22            c23
    · · ·       c31            c32            c33

• Precision (class ci) = cii/Σj cji

◮ Proportion of items predicted as ci correctly classified (as ci)

• Recall (class ci) = cii/Σj cij

◮ Proportion of items in class ci predicted correctly (as ci)

• Accuracy = Σi cii/Σi Σj cij
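A minimal sketch of these definitions over a hypothetical 3×3 confusion matrix (rows are true classes, columns are predicted classes):

    # Sketch: per-class precision/recall and overall accuracy from a confusion matrix.
    C = [[8, 1, 1],
         [2, 6, 2],
         [0, 2, 8]]

    n = len(C)
    for i in range(n):
        precision_i = C[i][i] / sum(C[j][i] for j in range(n))   # column sum
        recall_i = C[i][i] / sum(C[i][j] for j in range(n))      # row sum
        print(i, precision_i, recall_i)

    accuracy = sum(C[i][i] for i in range(n)) / sum(map(sum, C))
    print(accuracy)   # (8 + 6 + 8) / 30 = 0.733...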


COMP9414 Text Classification 20

Summary: Naive Bayes

• Very fast, low storage requirements

• Robust to irrelevant features

• Irrelevant features cancel each other out without affecting results

• Very good in domains with many equally important features

◮ Decision Trees suffer from fragmentation in such cases – especially if there is little data

• Optimal if the independence assumptions hold

◮ If the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem

• Good dependable baseline for text classification
