COMP9414 Text Classification 1
This Lecture
• Probabilistic Formulation of Text Classification
• Rule-Based Text Classification
• Bayesian Text Classification
◮ Bernoulli Model
◮ Multinomial Naive Bayes
• Evaluating Classifiers
COMP9414: Artificial Intelligence
Lecture 6b: Text Classification
Wayne Wobcke
COMP9414 Text Classification 3
Example Movie Reviews/Ratings
. . . unbelievably disappointing . . .
Full of zany characters and richly applied satire, and some great
plot twists.
The greatest screwball comedy ever filmed.
It was pathetic. The worst part about it was the boxing scenes.
COMP9414 Text Classification 2
Text Classification Applications
• Spam Detection
• Authorship Analysis
• E-Mail Classification/Prioritization
• News/Scientific Article Topic Classification
• Event Extraction (Event Type Classification)
• Sentiment Analysis
• Recommender Systems (using Product Reviews)
COMP9414 Text Classification 5
Help User Define Rules
COMP9414 Text Classification 4
Rule-Based Method
COMP9414 Text Classification 7
Supervised Learning
• Input: A document (e-mail, news article, review, tweet)
• Output: One class drawn from a fixed set of classes
◮ So text classification is a multi-class classification problem
◮ . . . and sometimes a multi-label classification problem
• Learning Problem
◮ Input: Training set of labelled documents {(d1, c1), · · ·}
◮ Output: Learned classifier that maps d to predicted class c
COMP9414 Text Classification 6
Suggest Features using Naive Bayes
COMP9414 Text Classification 9
Feature Engineering
Example: SpamAssassin rules for detecting spam e-mail
• Mentions Generic Viagra
• Online Pharmacy
• Mentions millions of dollars ($NN,NNN,NNN.NN)
• Phrase: impress . . . girl
• From: starts with many numbers
• Subject is all capitals
• HTML has a low ratio of text to image area
• One hundred percent guaranteed
• Claims you can be removed from the list
http://spamassassin.apache.org/old/tests_3_3_x.html
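Rules like these amount to hand-written predicates over the message. Below is a minimal illustrative sketch in Python of two such checks; the function names and the exact regular expression are assumptions for illustration, not the actual SpamAssassin tests.

import re

def subject_all_capitals(subject):
    # Rough stand-in for the "Subject is all capitals" rule
    letters = [ch for ch in subject if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)

def mentions_millions_of_dollars(body):
    # Rough stand-in for the "mentions millions of dollars" rule ($NN,NNN,NNN.NN)
    return re.search(r"\$\d{1,3}(,\d{3}){2,}\.\d{2}", body) is not None

A rule-based classifier then combines many such checks, typically with weights and a decision threshold.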
COMP9414 Text Classification 8
Probabilistic Formulation
• Events: Occurrence of features x, occurrence of a document of class c
• Given document x1, · · · , xn, choose c so that P(c|x1, · · · , xn) is maximized
• Apply Bayes’ Rule
◮ P(c|x1, · · · , xn) = P(x1, · · · , xn|c) · P(c) / P(x1, · · · , xn)
◮ The denominator does not depend on c, so it suffices to maximize P(x1, · · · , xn|c) · P(c)
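Written as code, the decision rule is an argmax over classes. A minimal sketch, assuming hypothetical helpers log_likelihood(c, x) returning log P(x1, · · · , xn|c) and prior(c) returning P(c):

import math

def classify(x, classes, log_likelihood, prior):
    # Choose the class c maximizing P(x1, ..., xn | c) * P(c);
    # working with logs avoids underflow when many small factors are multiplied
    return max(classes, key=lambda c: log_likelihood(c, x) + math.log(prior(c)))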
COMP9414 Text Classification 11
Naive Bayes Classification
w1  w2  w3  w4  Class
 1   0   0   1    1
 0   0   0   1    0
 1   1   0   1    0
 1   0   1   1    1
 0   1   1   0    0
 1   0   0   0    0
 1   0   1   0    1
 0   1   0   0    1
 0   1   0   1    0
 1   1   1   0    0

Estimated probabilities:
              Class = 1   Class = 0
P(Class)         0.40        0.60
P(w1|Class)      0.75        0.50
P(w2|Class)      0.25        0.67
P(w3|Class)      0.50        0.33
P(w4|Class)      0.50        0.50
To classify a document containing w2, w3, w4 but not w1:
• P(Class = 1|¬w1, w2, w3, w4) ∝ ((1−0.75) ∗ 0.25 ∗ 0.5 ∗ 0.5) ∗ 0.4 = 0.00625
• P(Class = 0|¬w1, w2, w3, w4) ∝ ((1−0.5) ∗ 0.67 ∗ 0.33 ∗ 0.5) ∗ 0.6 ≈ 0.03333
• Choose Class = 0
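These scores are easy to check with a few lines of arithmetic; a minimal sketch (exact fractions 4/6 and 2/6 stand in for the rounded 0.67 and 0.33):

# Unnormalized Bernoulli Naive Bayes scores for a document with w2, w3, w4 but not w1
score_1 = 0.40 * (1 - 0.75) * 0.25 * 0.50 * 0.50      # = 0.00625
score_0 = 0.60 * (1 - 0.50) * (4/6) * (2/6) * 0.50     # approx. 0.03333
print("Class 1" if score_1 > score_0 else "Class 0")   # prints "Class 0"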
COMP9414 Text Classification 10
Bernoulli Model
Maximize P(x1, · · · , xn|c) · P(c)
• Features are the presence or absence of each word wi in the document
• Apply independence assumptions
◮ P(x1, · · · , xn|c) = P(x1|c) · . . . · P(xn|c)
◮ Probability of word w occurring (or not) in class c is independent of context
• Estimate probabilities
◮ P(w|c) = #(documents in class c containing w) / #(documents in class c)
◮ P(¬w|c) = 1 − P(w|c)
◮ P(c) = #(documents in class c) / #(documents)
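A minimal sketch of these estimates in Python, assuming each training document is represented as the set of words it contains (the function name train_bernoulli is illustrative):

from collections import Counter, defaultdict

def train_bernoulli(docs, labels):
    # docs: list of sets of words present in each document
    # labels: parallel list of class labels
    n_docs = len(docs)
    class_counts = Counter(labels)              # #(documents in class c)
    word_doc_counts = defaultdict(Counter)      # #(documents in class c containing w)
    for words, c in zip(docs, labels):
        word_doc_counts[c].update(set(words))
    priors = {c: class_counts[c] / n_docs for c in class_counts}       # P(c)
    cond = {c: {w: word_doc_counts[c][w] / class_counts[c]             # P(w|c)
                for w in word_doc_counts[c]}
            for c in class_counts}
    return priors, cond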
COMP9414 Text Classification 13
Naive Bayes Classification
Maximize P(x1, · · · , xn|c) · P(c)
• Features are the occurrences of words at each position in the document
• Apply independence assumptions
◮ P(w1, · · · , wn|c) = P(w1|c) · . . . · P(wn|c)
◮ The position of word w in the document doesn’t matter
• Estimate probabilities
◮ Let V be the vocabulary
◮ Let “document” c = the concatenation of all documents in class c
◮ P(w|c) = #(w in document c) / Σw′∈V #(w′ in document c)
◮ P(c) = #(documents in class c) / #(documents)
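A minimal sketch of the multinomial estimates, this time counting word tokens in the concatenated class “document” rather than documents containing the word (train_multinomial is an illustrative name; smoothing is deferred to the Laplace smoothing slide):

from collections import Counter, defaultdict

def train_multinomial(docs, labels):
    # docs: list of lists of word tokens; labels: parallel list of class labels
    n_docs = len(docs)
    priors = {c: n / n_docs for c, n in Counter(labels).items()}   # P(c)
    class_tokens = defaultdict(Counter)   # token counts of the concatenated class "document"
    for tokens, c in zip(docs, labels):
        class_tokens[c].update(tokens)
    cond = {}
    for c, counts in class_tokens.items():
        total = sum(counts.values())
        cond[c] = {w: n / total for w, n in counts.items()}        # P(w|c)
    return priors, cond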
COMP9414 Text Classification 12
Bag of Words Model
I love this movie! It’s sweet, but with satirical
humor. The dialogue is great and the adventure
scenes are fun… It manages to be whimsical and
romantic while laughing at the conventions of
the fairy tale genre. I would recommend it to
just about anyone. I’ve seen it several times,
and I’m always happy to see it again whenever
I have a friend who hasn’t seen it yet!
Selected word counts:
it 6
I 5
the 4
to 3
and 3
seen 2
yet 1
would 1
whimsical 1
times 1
sweet 1
satirical 1
adventure 1
genre 1
fairy 1
humor 1
have 1
great 1
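A bag-of-words representation can be produced directly with a Counter; a minimal sketch (the crude tokenisation below is an assumption, so the exact counts, e.g. for “it” versus “It’s”, may differ slightly from those above):

from collections import Counter
import re

review = ("I love this movie! It's sweet, but with satirical humor. The dialogue is great and "
          "the adventure scenes are fun... It manages to be whimsical and romantic while "
          "laughing at the conventions of the fairy tale genre. I would recommend it to just "
          "about anyone. I've seen it several times, and I'm always happy to see it again "
          "whenever I have a friend who hasn't seen it yet!")

tokens = re.findall(r"[a-z']+", review.lower())   # word order is discarded
print(Counter(tokens).most_common(6))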
COMP9414 Text Classification 15
MNB Example
     Words                                   Class
d1   Chinese Beijing Chinese                 c
d2   Chinese Chinese Shanghai                c
d3   Chinese Macao                           c
d4   Tokyo Japan Chinese                     j
d5   Chinese Chinese Chinese Tokyo Japan     ?
With add-1 smoothing (|V| = 6; class c has 8 tokens, class j has 3):
P(Chinese|c) = (5+1)/(8+6) = 3/7
P(Tokyo|c) = (0+1)/(8+6) = 1/14
P(Japan|c) = (0+1)/(8+6) = 1/14
P(Chinese|j) = (1+1)/(3+6) = 2/9
P(Tokyo|j) = (1+1)/(3+6) = 2/9
P(Japan|j) = (1+1)/(3+6) = 2/9
Priors: P(c) = 3/4, P(j) = 1/4
To classify document d5
• P(c|d5) ∝ (3/7)^3 · (1/14) · (1/14) · (3/4) ≈ 0.0003
• P(j|d5) ∝ (2/9)^3 · (2/9) · (2/9) · (1/4) ≈ 0.0001
• Choose class c
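A quick numerical check of the two scores; a minimal sketch using the smoothed estimates above:

# Unnormalized scores for d5 = "Chinese Chinese Chinese Tokyo Japan"
p_c = (3/4) * (3/7)**3 * (1/14) * (1/14)   # approx. 0.0003
p_j = (1/4) * (2/9)**3 * (2/9) * (2/9)     # approx. 0.0001
print("c" if p_c > p_j else "j")           # prints "c"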
COMP9414 Text Classification 14
Laplace Smoothing
• What if a word in a test document has not occurred in the training data?
• Then P(w|c) = 0, so the estimate for class c becomes 0
• Laplace smoothing
◮ Assign a small probability to unseen words
◮ P(w|c) = (#(w in document c) + 1) / (Σw′∈V #(w′ in document c) + |V|)
◮ The added count need not be 1: it can be a small value such as 0.05, or a tuned parameter α
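A minimal sketch of the smoothed estimate, with the added count generalised to a parameter alpha (smoothed_prob is an illustrative name; with a general alpha the denominator adds alpha·|V|):

from collections import Counter

def smoothed_prob(word, class_counts, vocab_size, alpha=1.0):
    # class_counts: token counts of the concatenated "document" for one class
    # P(w|c) = (count(w, c) + alpha) / (total tokens in c + alpha * |V|)
    total = sum(class_counts.values())
    return (class_counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

# Reproduces the MNB example: class c has tokens Chinese x5, Beijing, Shanghai, Macao
counts_c = Counter({"Chinese": 5, "Beijing": 1, "Shanghai": 1, "Macao": 1})
print(smoothed_prob("Chinese", counts_c, vocab_size=6))   # 6/14 = 3/7
print(smoothed_prob("Tokyo", counts_c, vocab_size=6))     # 1/14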
COMP9414 Text Classification 17
Evaluating Classifiers
2×2 Contingency Table (single class c)

                   Class c           not Class c
Predicted c        True Positive     False Positive
Predicted not c    False Negative    True Negative

• Precision (P) = TP/(TP+FP) – you want what you get
◮ · · · but may not get much
• Recall (R) = TP/(TP+FN) – you get what you want
◮ · · · but you might get a lot more (junk)
• F1 = 2PR/(P+R) – harmonic mean of precision and recall
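A minimal sketch of these three measures computed from the counts in the table (the zero checks are a defensive assumption for empty cells):

def precision_recall_f1(tp, fp, fn):
    # Precision: of the items predicted as class c, the fraction actually in c
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the items actually in class c, the fraction predicted as c
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1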
COMP9414 Text Classification 16
Graphical Model for Example
COMP9414 Text Classification 19
Multiple Classes: Micro/Macro-Averaging
n 2×2 contingency tables (one per class)
• Micro-average = aggregated measure over all classes
◮ micro-precision = Σc TPc / Σc (TPc + FPc)
◮ micro-recall = Σc TPc / Σc (TPc + FNc)
◮ Micro-precision and micro-recall coincide when each instance has, and is assigned, exactly one label
◮ Dominated by larger classes
• Macro-average = average of per-class measures
◮ macro-precision = (1/n) Σc [TPc / (TPc + FPc)]
◮ macro-recall = (1/n) Σc [TPc / (TPc + FNc)]
◮ Dominated by smaller classes
◮ Fairer for imbalanced data, e.g. sentiment analysis
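A minimal sketch of both averages from per-class counts (it assumes every class has at least one predicted and one true instance, so no denominator is zero):

def micro_macro(tables):
    # tables: list of (TP, FP, FN) tuples, one 2x2 contingency table per class
    micro_p = sum(tp for tp, fp, fn in tables) / sum(tp + fp for tp, fp, fn in tables)
    micro_r = sum(tp for tp, fp, fn in tables) / sum(tp + fn for tp, fp, fn in tables)
    macro_p = sum(tp / (tp + fp) for tp, fp, fn in tables) / len(tables)
    macro_r = sum(tp / (tp + fn) for tp, fp, fn in tables) / len(tables)
    return micro_p, micro_r, macro_p, macro_r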
COMP9414 Text Classification 18
Multiple Classes: Per-Class Metrics
n×n Confusion Matrix (each instance in exactly one class)

              Predicted c1    Predicted c2    · · ·
Class c1          c11             c12             c13
Class c2          c21             c22             c23
· · ·             c31             c32             c33

• Precision (class ci) = cii / Σj cji
◮ Proportion of items predicted as ci that are correctly classified (i.e. actually in ci)
• Recall (class ci) = cii / Σj cij
◮ Proportion of items in class ci that are predicted correctly (as ci)
• Accuracy = Σi cii / Σi Σj cij
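A minimal sketch of these per-class metrics from a confusion matrix given as a list of rows (it assumes no class has an all-zero row or column):

def per_class_metrics(cm):
    # cm[i][j] = number of items whose true class is i and whose predicted class is j
    n = len(cm)
    precision = [cm[i][i] / sum(cm[j][i] for j in range(n)) for i in range(n)]  # column sums
    recall    = [cm[i][i] / sum(cm[i][j] for j in range(n)) for i in range(n)]  # row sums
    accuracy  = sum(cm[i][i] for i in range(n)) / sum(sum(row) for row in cm)
    return precision, recall, accuracy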
COMP9414 Text Classification 20
Summary: Naive Bayes
• Very fast, low storage requirements
• Robust to irrelevant features
◮ Irrelevant features cancel each other out without affecting results
• Very good in domains with many equally important features
◮ Decision Trees suffer from fragmentation in such cases, especially with little data
• Optimal if the independence assumptions hold
◮ If the assumed independence is correct, it is the Bayes Optimal Classifier for the problem
• A good, dependable baseline for text classification
UNSW ©W. Wobcke et al. 2019–2021