COMP90042 Workshop Week 3
Text Classification & N-gram Language Model
Zenan Zhai
The University of Melbourne
16 March 2014
Table of Contents
Text Classification
N-gram Language Model
Definition
Think of an application of text classification.
Sentiment analysis
Author identification
Fact checking
⋯
What makes text classification challenging?
How to learn a document representation?
How to do feature selection?
How to deal with data sparsity?
Text Classification Models
K-Nearest Neighbors (KNN)
Decision Tree
Naive Bayes
Logistic Regression
Support Vector Machine
⋯
K-Nearest Neighbors
Vote by the labels of the K nearest instances in the training set.
Similarity Measure
Euclidean distance: d(A, B) = \sqrt{\sum_i (a_i - b_i)^2}
Not ideal: the measure is greatly affected by document length.
Cosine similarity: \cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
Better than Euclidean distance, but still suffers from the curse of dimensionality.
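As a concrete illustration (not from the slides), here is a minimal Python sketch of K-nearest-neighbour classification with cosine similarity over bag-of-words vectors; the toy documents and labels are invented for the example.

```python
# Minimal sketch: K-nearest-neighbour text classification with cosine similarity
# over bag-of-words counts. Toy documents/labels are invented for illustration.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # dot product over the shared vocabulary, divided by the two vector norms
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

train = [("this movie was great great fun", "pos"),
         ("terrible plot and terrible acting", "neg"),
         ("a fun and great film", "pos")]

def knn_predict(doc: str, k: int = 1) -> str:
    q = Counter(doc.split())
    # rank training documents by similarity to the query, take the top k, then vote
    neighbours = sorted(train, key=lambda dl: cosine(q, Counter(dl[0].split())), reverse=True)[:k]
    labels = [label for _, label in neighbours]
    return Counter(labels).most_common(1)[0][0]

print(knn_predict("great fun movie"))   # -> "pos" on this toy data
```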
Decision Tree
Entropy: H(Y) = -\sum_{y \in Y} p(y) \log p(y)
Conditional Entropy: H(Y \mid X) = \sum_{x \in X} p(x) \, H(Y \mid X = x)
Information Gain: IG(Y \mid a) = H(Y) - H(Y \mid a)
Splitting on information gain tends to prefer rare features which might only appear in a few documents.
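To make the formulas concrete, a minimal sketch (with an invented toy feature/label table) computing entropy, conditional entropy and information gain:

```python
# Minimal sketch: entropy, conditional entropy and information gain
# for a toy list of (feature value, class label) pairs (invented numbers).
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # IG(Y | a) = H(Y) - H(Y | a), with H(Y | a) = sum_x p(x) * H(Y | a = x)
    total = len(labels)
    cond = 0.0
    for x in set(feature_values):
        subset = [y for v, y in zip(feature_values, labels) if v == x]
        cond += (len(subset) / total) * entropy(subset)
    return entropy(labels) - cond

feature = ["yes", "yes", "no", "no", "yes", "no"]   # e.g. "document contains the word 'great'"
label   = ["pos", "pos", "neg", "neg", "pos", "neg"]
print(information_gain(feature, label))             # 1.0: this feature splits the classes perfectly
```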
Naive Bayes
Every feature is assumed to be conditionally independent given the class:
P(y \mid x_1, x_2, \ldots, x_n) \propto P(y, x_1, x_2, \ldots, x_n) = P(y) \prod_{i=1}^{n} P(x_i \mid y)
Still works with a large set of features, but suffers from biases caused by uninformative features.
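A minimal sketch of the Naive Bayes decision rule over bag-of-words counts, with add-one smoothing of the per-class word probabilities; the toy training documents are invented:

```python
# Minimal sketch: multinomial Naive Bayes with add-one smoothing (toy data invented).
import math
from collections import Counter, defaultdict

train = [("great fun film", "pos"), ("great acting", "pos"), ("terrible boring plot", "neg")]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)                        # word_counts[label][word]
for doc, label in train:
    word_counts[label].update(doc.split())
vocab = {w for _, c in word_counts.items() for w in c}

def predict(doc):
    scores = {}
    for label in class_counts:
        # log P(y) + sum_i log P(x_i | y), with Laplace smoothing for unseen words
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in doc.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("fun plot"))                                # -> "pos" on this toy data
```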
Logistic Regression
Put a linear combination of the features through the logistic function:
P(y) = \sigma(WX + b) = \frac{1}{1 + \exp(-(WX + b))}
Relaxes the independence assumption and mitigates the problem caused by uninformative features by taking a weighted sum.
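A minimal sketch of the scoring step only, with invented (not learned) weights, showing a weighted sum of features passed through the logistic function:

```python
# Minimal sketch: logistic regression scoring, P(y = 1 | x) = sigmoid(w . x + b).
# Weights and features are invented for illustration; training (e.g. gradient
# descent on the cross-entropy loss) is not shown.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

weights = {"great": 1.5, "fun": 0.8, "terrible": -2.0}    # invented feature weights
bias = -0.1

def prob_positive(doc: str) -> float:
    z = bias + sum(weights.get(w, 0.0) for w in doc.split())
    return sigmoid(z)

print(prob_positive("a great fun movie"))     # well above 0.5
print(prob_positive("a terrible movie"))      # well below 0.5
```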
Support Vector Machine
Select the decision boundary which maximizes the margin, i.e. the distance to the support vectors.
A very effective method, but with no natural support for multi-class classification.
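One common workaround is to combine several binary SVMs one-vs-rest; the sketch below makes this explicit with scikit-learn (assumed to be installed), on invented toy data:

```python
# Minimal sketch (assumes scikit-learn is installed): binary linear SVMs combined
# one-vs-rest to handle a multi-class text classification task. Toy data invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

docs = ["great fun film", "terrible boring plot", "the match ended in a draw"]
labels = ["pos", "neg", "sport"]

X = CountVectorizer().fit_transform(docs)                 # bag-of-words features
clf = OneVsRestClassifier(LinearSVC()).fit(X, labels)
print(clf.predict(X))                                     # predictions on the training docs
```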
Table of Contents
Text Classification
N-gram Language Model
Uni-gram model
Corpus
1. how much wood would a wood chuck chuck if a wood chuck would chuck wood
2. a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood
Word counts
a   chuck   could   he   how   if   much   the   wood   would   </s>   Total
4   9       1       1    1     2    1      1     8      4       2      34
Why is <s> left out?
Uni-gram model
Word counts
a   chuck   could   he   how   if   much   the   wood   would   </s>   Total
4   9       1       1    1     2    1      1     8      4       2      34
Sentences
A: a wood could chuck
B: wood would a chuck
P(A) = P(a) P(wood) P(could) P(chuck) P(</s>)
     = (4/34) × (8/34) × (1/34) × (9/34) × (2/34) ≈ 1.27 × 10^{-5}
P(B) = P(wood) P(would) P(a) P(chuck) P(</s>)
     = (8/34) × (4/34) × (4/34) × (9/34) × (2/34) ≈ 5.07 × 10^{-5}
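The uni-gram numbers above can be checked with a short script; this sketch (not part of the original slides) counts words over the two corpus sentences, appends </s> to each sentence, and multiplies relative frequencies:

```python
# Minimal sketch: maximum-likelihood uni-gram probabilities for sentences A and B,
# reproducing the numbers on the slide (</s> marks the end of a sentence).
from collections import Counter

corpus = ["how much wood would a wood chuck chuck if a wood chuck would chuck wood",
          "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood"]

counts = Counter(w for sent in corpus for w in sent.split() + ["</s>"])
total = sum(counts.values())                         # 34 tokens

def unigram_prob(sentence):
    p = 1.0
    for w in sentence.split() + ["</s>"]:
        p *= counts[w] / total
    return p

print(unigram_prob("a wood could chuck"))            # ~1.27e-05
print(unigram_prob("wood would a chuck"))            # ~5.07e-05
```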
Uni-gram model with Laplacian smoothing
Word counts
a   chuck   could   he   how   if   much   the   wood   would   </s>   Total
4   9       1       1    1     2    1      1     8      4       2      34
Sentences
A: a wood could chuck
B: wood would a chuck
P_L(A) = P_L(a) P_L(wood) P_L(could) P_L(chuck) P_L(</s>)
       = (5/45) × (9/45) × (2/45) × (10/45) × (3/45) ≈ 1.46 × 10^{-5}
P_L(B) = P_L(wood) P_L(would) P_L(a) P_L(chuck) P_L(</s>)
       = (9/45) × (5/45) × (5/45) × (10/45) × (3/45) ≈ 3.66 × 10^{-5}
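The same sketch with add-one (Laplacian) smoothing, adding the vocabulary size of 11 types (including </s>) to the denominator:

```python
# Minimal sketch: add-one smoothed uni-gram probabilities, reproducing the slide.
from collections import Counter

corpus = ["how much wood would a wood chuck chuck if a wood chuck would chuck wood",
          "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood"]

counts = Counter(w for sent in corpus for w in sent.split() + ["</s>"])
total, V = sum(counts.values()), len(counts)         # 34 tokens, 11 types

def laplace_unigram_prob(sentence):
    p = 1.0
    for w in sentence.split() + ["</s>"]:
        p *= (counts[w] + 1) / (total + V)           # (count + 1) / (34 + 11)
    return p

print(laplace_unigram_prob("a wood could chuck"))    # ~1.46e-05
print(laplace_unigram_prob("wood would a chuck"))    # ~3.66e-05
```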
Bi-gram model
Corpus
1. how much wood would a wood chuck chuck if a wood chuck would chuck wood
2. a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood
Sentences
A: a wood could chuck
B: wood would a chuck
P(A) = P(a|<s>) P(wood|a) P(could|wood) P(chuck|could) P(</s>|chuck)
     = (1/2) × (4/4) × (0/8) × (1/1) × (0/9) = 0
P(B) = P(wood|<s>) P(would|wood) P(a|would) P(chuck|a) P(</s>|chuck)
     = (0/2) × (1/8) × (1/4) × (0/4) × (0/9) = 0
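A minimal sketch of the bi-gram calculation, padding each sentence with <s> and </s>; as on the slide, both test sentences receive probability 0 because some bi-grams are unseen:

```python
# Minimal sketch: maximum-likelihood bi-gram probabilities; unseen bi-grams make
# both sentence probabilities 0, matching the slide.
from collections import Counter

corpus = ["how much wood would a wood chuck chuck if a wood chuck would chuck wood",
          "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(tokens[:-1])                     # context counts (</s> is never a context)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(bigram_prob("a wood could chuck"))             # 0.0
print(bigram_prob("wood would a chuck"))             # 0.0
```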
Bi-gram model with Laplacian smoothing
Corpus
1. how much wood would a wood chuck chuck if a wood chuck would chuck wood
2. a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood
Sentences
A: a wood could chuck
B: wood would a chuck
P_L(A) = P_L(a|<s>) P_L(wood|a) P_L(could|wood) P_L(chuck|could) P_L(</s>|chuck)
       = (2/13) × (5/15) × (1/19) × (2/12) × (1/20) ≈ 2.25 × 10^{-5}
P_L(B) = P_L(wood|<s>) P_L(would|wood) P_L(a|would) P_L(chuck|a) P_L(</s>|chuck)
       = (1/13) × (2/19) × (2/15) × (1/15) × (1/20) ≈ 3.60 × 10^{-6}
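The same bi-gram sketch with add-one smoothing (V = 11, the vocabulary including </s> but not <s>):

```python
# Minimal sketch: add-one smoothed bi-gram probabilities, reproducing the slide
# (V = 11 word types, including </s> but not <s>).
from collections import Counter

corpus = ["how much wood would a wood chuck chuck if a wood chuck would chuck wood",
          "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens, tokens[1:]))
V = len(set(w for sent in corpus for w in sent.split()) | {"</s>"})   # 11

def laplace_bigram_prob(sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
    return p

print(laplace_bigram_prob("a wood could chuck"))      # ~2.25e-05
print(laplace_bigram_prob("wood would a chuck"))      # ~3.60e-06
```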
Tri-gram model
Corpus
1. how much wood would a wood chuck chuck if a wood chuck would chuck wood
2. a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood
Sentences
A: a wood could chuck
B: wood would a chuck
P(A) = P(a|<s> <s>) P(wood|<s> a) ⋯ P(</s>|could chuck)
     = (1/2) × (1/1) × (0/4) × (0/0) × (0/1) = ?
P(B) = P(wood|<s> <s>) P(would|<s> wood) ⋯ P(</s>|a chuck)
     = (0/2) × (0/0) × (1/1) × (0/1) × (0/0) = ?
Some contexts never occur in the corpus, so the maximum-likelihood estimates are undefined (0/0).
Tri-gram model with Laplacian smoothing
Corpus
1. how much wood would a wood chuck chuck if a wood chuck would chuck wood
2. a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood
Sentences
A: a wood could chuck
B: wood would a chuck
P_L(A) = P_L(a|<s> <s>) P_L(wood|<s> a) ⋯ P_L(</s>|could chuck)
       = (2/13) × (2/12) × (1/15) × (1/11) × (1/12) ≈ 1.30 × 10^{-5}
P_L(B) = P_L(wood|<s> <s>) P_L(would|<s> wood) ⋯ P_L(</s>|a chuck)
       = (1/13) × (1/11) × (2/12) × (1/12) × (1/11) ≈ 8.83 × 10^{-6}
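The smoothed calculation generalises to any order; this sketch pads with n − 1 start tokens and, with n = 3, reproduces the tri-gram numbers above (same assumptions: add-one smoothing, V = 11):

```python
# Minimal sketch: add-one smoothed n-gram model of any order; with n = 3 it
# reproduces the tri-gram numbers on the slide.
from collections import Counter

corpus = ["how much wood would a wood chuck chuck if a wood chuck would chuck wood",
          "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood"]
V = len(set(w for sent in corpus for w in sent.split()) | {"</s>"})   # 11

def train_counts(n):
    ngrams, contexts = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
            contexts[tuple(tokens[i:i + n - 1])] += 1
    return ngrams, contexts

def laplace_ngram_prob(sentence, n):
    ngrams, contexts = train_counts(n)
    tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
    p = 1.0
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        p *= (ngrams[gram] + 1) / (contexts[gram[:-1]] + V)
    return p

print(laplace_ngram_prob("a wood could chuck", 3))    # ~1.30e-05
print(laplace_ngram_prob("wood would a chuck", 3))    # ~8.83e-06
```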
Kneser-Ney Smoothing
What is the continuation count?
The number of distinct word types in the vocabulary which appear before a word w:
|\{w_{i-1} : C(w_{i-1}, w_i) > 0\}|
What is Kneser-Ney smoothing?
Use the continuation probability instead of the plain uni-gram probability.
It can be used in either back-off or interpolation.
P_{cont}(w_i) = \frac{|\{w_{i-1} : C(w_{i-1}, w_i) > 0\}|}{\sum_{w_i \in V} |\{w_{i-1} : C(w_{i-1}, w_i) > 0\}|}
KN Smoothing – Example
Corpus
1. how much wood would a wood chuck chuck if a wood chuck would chuck wood
2. a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood
Continuation counts
a = 2   chuck = 4   could = 1   he = 1   how = 0   if = 1
much = 1   the = 1   wood = 4   would = 2   </s> = 1
Continuation probabilities
P_cont(chuck) = #cont(chuck) / (#cont(a) + … + #cont(</s>) + #cont(chuck) + #cont(wood))
              = 4 / (2+1+1+0+1+1+1+2+1+4+4) = 4/18
P_cont(wood)  = #cont(wood) / (#cont(a) + … + #cont(</s>) + #cont(chuck) + #cont(wood))
              = 4 / (2+1+1+0+1+1+1+2+1+4+4) = 4/18
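The continuation counts can be checked with a short script; this sketch assumes, as in the example above, that <s> is not counted as a preceding word type and that </s> is part of the vocabulary:

```python
# Minimal sketch: Kneser-Ney continuation counts and continuation probabilities.
# <s> is not counted as a preceding word type; </s> is part of the vocabulary.
from collections import defaultdict

corpus = ["how much wood would a wood chuck chuck if a wood chuck would chuck wood",
          "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood"]

predecessors = defaultdict(set)                       # word -> set of distinct previous words
vocab = set()
for sent in corpus:
    tokens = ["<s>"] + sent.split() + ["</s>"]
    vocab.update(tokens[1:])                          # everything except <s>
    for prev, cur in zip(tokens, tokens[1:]):
        if prev != "<s>":
            predecessors[cur].add(prev)

cont_count = {w: len(predecessors[w]) for w in vocab}
total = sum(cont_count.values())                      # 18

print(cont_count["chuck"], cont_count["wood"])        # 4 4
print(cont_count["chuck"] / total)                    # P_cont(chuck) = 4/18 ~ 0.22
```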
Back-off and Interpolation
Back-off
Use the lower-order n-gram model if the higher-order n-gram is unseen.
P(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{c(w_{i-1}, w_i) - D}{c(w_{i-1})} & \text{if } c(w_{i-1}, w_i) > 0 \\
\alpha(w_{i-1}) \times \dfrac{P(w_i)}{\sum_{w_j : C(w_{i-1}, w_j) = 0} P(w_j)} & \text{otherwise}
\end{cases}
Interpolation
Take a weighted sum over all orders:
P_{interp}(w_i \mid w_{i-1}) = \lambda P(w_i \mid w_{i-1}) + (1 - \lambda) P(w_i)
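A minimal sketch of bi-gram/uni-gram interpolation on the workshop corpus; λ = 0.5 is an arbitrary choice here, normally tuned on held-out data:

```python
# Minimal sketch: interpolating a bi-gram and a uni-gram estimate with weight lambda.
# lambda_ = 0.5 is an arbitrary choice, normally tuned on held-out data.
from collections import Counter

corpus = ["how much wood would a wood chuck chuck if a wood chuck would chuck wood",
          "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood"]

unigrams, contexts, bigrams = Counter(), Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(tokens[1:])                       # uni-gram counts (no <s>)
    contexts.update(tokens[:-1])                      # bi-gram contexts
    bigrams.update(zip(tokens, tokens[1:]))
total = sum(unigrams.values())

def interp_prob(word, prev, lambda_=0.5):
    p_bigram = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
    p_unigram = unigrams[word] / total
    return lambda_ * p_bigram + (1 - lambda_) * p_unigram

print(interp_prob("could", "wood"))                   # non-zero even though "wood could" is unseen
```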
Evaluation
Recall the objective of a language model:
modelling the probability of an arbitrary sequence of m words.
Evaluate based on the probability of all sequences in the test set:
PP(w_1, w_2, w_3, \ldots, w_m) = \sqrt[m]{\frac{1}{P(w_1, w_2, w_3, \ldots, w_m)}}
Inverted probability: lower perplexity → better model.
Normalization: take the m-th root of the sequence probability, where m = length(S).
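A minimal sketch of perplexity for the add-one uni-gram model above; here m counts the </s> token as well, which is one possible convention:

```python
# Minimal sketch: perplexity of an add-one uni-gram model on a test sentence.
# Here m counts the </s> token as well; conventions for sentence length vary.
import math
from collections import Counter

corpus = ["how much wood would a wood chuck chuck if a wood chuck would chuck wood",
          "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood"]
counts = Counter(w for sent in corpus for w in sent.split() + ["</s>"])
total, V = sum(counts.values()), len(counts)

def perplexity(sentence):
    tokens = sentence.split() + ["</s>"]
    log_prob = sum(math.log((counts[w] + 1) / (total + V)) for w in tokens)
    return math.exp(-log_prob / len(tokens))          # PP = P(w_1 .. w_m)^(-1/m)

print(perplexity("a wood could chuck"))               # lower is better
```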
Takeaways
Text classification
▸ Applications
▸ Models (Pros & Cons)
n-gram language model (calculation)
Smoothing
▸ Problems they address
▸ Pros & Cons