Linear Text Classification

Problem definition: Given a text document, assign it a discrete label $y \in \mathcal{Y}$, where $\mathcal{Y}$ is the set of possible labels. Many possible applications:

- Spam filtering: $\mathcal{Y}$ = {spam, non-spam}
- Sentiment: $\mathcal{Y}$ = {positive, negative, neutral}
- Genre classification: $\mathcal{Y}$ = {sports, fiction, news, ...}
Bag-of-words representation of a document

A typical representation of a document is a bag of words, which is mathematically a vector of word counts:

$$\mathbf{x} = [1, 5, 0, 3, 19, 0, \cdots]$$

where $x_j$ is the count of word $j$. The length of the vector is the size of the vocabulary $|V|$, so that the vectors for all documents in a collection have the same length and can be compared directly.

- The "vocabulary" here should not be understood narrowly as the words of a language. For instance, it could include all bigrams in a collection of documents.
- Alternatives to word counts include the simple presence (1) or absence (0) of a word, the tf-idf score of a word, etc.
- By using word counts, we drop all word-order information (see the sketch below).
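A minimal sketch of such a featurizer, assuming a simple regex tokenizer and a vocabulary built from the training documents (names like build_vocab are illustrative, not from any particular library):

```python
from collections import Counter
import re

def build_vocab(documents):
    """Map each word type in the training documents to a column index."""
    vocab = {}
    for doc in documents:
        for word in re.findall(r"\w+", doc.lower()):
            vocab.setdefault(word, len(vocab))
    return vocab

def bag_of_words(doc, vocab):
    """Return a count vector of length |V|; word order is discarded."""
    counts = Counter(re.findall(r"\w+", doc.lower()))
    return [counts.get(word, 0) for word in sorted(vocab, key=vocab.get)]

docs = ["good story , good jokes", "not funny at all"]
vocab = build_vocab(docs)
print(bag_of_words(docs[0], vocab))  # [2, 1, 1, 0, 0, 0, 0]
```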
Feature function

- Not all words are equally important for predicting a particular label. To predict the label of a document, we assign a score to each word in the vocabulary to indicate its "compatibility" with the label: e.g., "basketball" has a high compatibility with sports, and "Gryffindor" has a high compatibility with fiction.
- These compatibility scores are called weights, and they are arranged in a vector $\theta$.
- Given a bag of words $\mathbf{x}$ and a weight vector $\theta$, we predict the label $y$ by computing the total compatibility score between $\mathbf{x}$ and $y$.

In a linear model, this compatibility score is the inner product between the weights $\theta$ and a feature function:

$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}} \Psi(\mathbf{x}, y)$$

$$\Psi(\mathbf{x}, y) = \theta \cdot f(\mathbf{x}, y) = \sum_j \theta_j f_j(\mathbf{x}, y)$$
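A minimal sketch of this decision rule; predict and featurize are illustrative names, and featurize is assumed to return a vector the same length as theta:

```python
import numpy as np

def predict(x, theta, labels, featurize):
    """Return the label with the highest linear score theta · f(x, y)."""
    scores = {y: float(np.dot(theta, featurize(x, y))) for y in labels}
    return max(scores, key=scores.get)
```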
More on the feature function

- The feature function has two arguments: the word counts $\mathbf{x}$ and the label $y$.
- It returns a feature vector, where each element of the vector might be, for example:

$$f_j(\mathbf{x}, y) = \begin{cases} x_{\text{whale}}, & \text{if } y = \text{Fiction} \\ 0, & \text{otherwise} \end{cases}$$

- In this case the size of the feature vector is tied to the size of the vocabulary, but it doesn't have to be.
- The output of the feature function also doesn't have to be a word count.
Shape of the feature vector
f (x , y = 1) = [x ; 0; 0; · · · ; 0] | {z }
(K 1)⇥V
f (x , y = 2) = [0; 0; · · · ; 0 x ; 0; 0; · · · ; 0]
| {z } | {z }
V (K 2)⇥V f (x , y = K ) = [0; 0; · · · ; 0; x ]
(K 1)⇥V
where K is size of the label set, 0;0;··· ;0 is a vector of
| {z }
| {z }
(K 1)⇥V
(K 1) ⇥ V zeros and the semicolon indicates vertical
concatenation.
Note: Think of a feature vector this way is good for mathematical presentation. It doesn’t have to be implemented this way.
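A minimal sketch of this construction as a dense NumPy vector (real implementations typically use sparse vectors or dictionaries of feature indices instead):

```python
import numpy as np

def featurize(x, y, K):
    """Place the count vector x in the block for label y (0-indexed), zeros elsewhere."""
    V = len(x)
    f = np.zeros(K * V)
    f[y * V:(y + 1) * V] = x
    return f

x = np.array([1, 5, 0, 3])      # toy count vector, V = 4
print(featurize(x, y=1, K=3))   # [0. 0. 0. 0. 1. 5. 0. 3. 0. 0. 0. 0.]
```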
Bias

It is common to add an offset feature, or bias, at the end of the vector of word counts; it is always 1. We then also add one zero to each block of zeros so that the vector lengths still match. The entire vector $f(\mathbf{x}, y)$ has length $(V+1) \times K$:

$$f(\mathbf{x}, y=1) = [\,\mathbf{x}; 1;\ \underbrace{0; 0; \cdots; 0}_{(K-1) \times (V+1)}\,]$$

$$f(\mathbf{x}, y=2) = [\,\underbrace{0; 0; \cdots; 0}_{V+1};\ \mathbf{x}; 1;\ \underbrace{0; 0; \cdots; 0}_{(K-2) \times (V+1)}\,]$$

$$f(\mathbf{x}, y=K) = [\,\underbrace{0; 0; \cdots; 0}_{(K-1) \times (V+1)};\ \mathbf{x}; 1\,]$$

Bias: what is its effect on classification if it is the only feature?
Example "vocabulary" V

Vocabulary for a collection of documents:

  A (NEG): not funny at all
  B (NEG): painful, not funny
  C (NEU): ok overall
  D (POS): funny story
  E (POS): good story, good jokes

V = { not, funny, painful, ok, overall, story, good, jokes }
From vocabulary to feature function

(Same example documents and vocabulary as above.)

$$f_1(\mathbf{x}, y) = \begin{cases} x_{\text{not}}, & \text{if } y = \text{NEG} \\ 0, & \text{otherwise} \end{cases}$$
From vocabulary to feature function

(Same example documents and vocabulary as above.)

$$f_2(\mathbf{x}, y) = \begin{cases} x_{\text{funny}}, & \text{if } y = \text{NEG} \\ 0, & \text{otherwise} \end{cases}$$
From vocabulary to feature function

(Same example documents and vocabulary as above.) Feature vector for A ("not funny at all", NEG):

$$f(\mathbf{x} = \text{featurize}(A),\ y = \text{NEG}) = [\,1; 1; 0; 0; 0; 0; 0; 0; 1;\ \underbrace{0; 0; \cdots; 0}_{(3-1) \times (8+1)}\,]$$
From vocabulary to feature function

(Same example documents and vocabulary as above.) Feature vector for C ("ok overall", NEU):

$$f(\mathbf{x} = \text{featurize}(C),\ y = \text{NEU}) = [\,\underbrace{0; 0; \cdots; 0}_{8+1};\ 0; 0; 0; 1; 1; 0; 0; 0; 1;\ \underbrace{0; 0; \cdots; 0}_{8+1}\,]$$
From vocabulary to feature function

(Same example documents and vocabulary as above.) Feature vector for E ("good story, good jokes", POS), where "good" appears twice:

$$f(\mathbf{x} = \text{featurize}(E),\ y = \text{POS}) = [\,\underbrace{0; 0; \cdots; 0}_{(3-1) \times (8+1)};\ 0; 0; 0; 0; 0; 1; 2; 1; 1\,]$$

(A sanity check with code follows below.)
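A minimal sketch that reproduces these vectors for the toy corpus above (the label order NEG, NEU, POS and the helper names are illustrative assumptions):

```python
import numpy as np

VOCAB = ["not", "funny", "painful", "ok", "overall", "story", "good", "jokes"]
LABELS = ["NEG", "NEU", "POS"]

def featurize_doc(text):
    """Count vector over VOCAB, with a trailing bias feature of 1."""
    tokens = text.replace(",", " ").split()
    counts = [tokens.count(w) for w in VOCAB]
    return np.array(counts + [1])

def feature_vector(text, label):
    """Copy the (V+1)-length count vector into the block for `label`."""
    x = featurize_doc(text)
    f = np.zeros(len(LABELS) * len(x))
    k = LABELS.index(label)
    f[k * len(x):(k + 1) * len(x)] = x
    return f

print(feature_vector("not funny at all", "NEG"))
print(feature_vector("good story, good jokes", "POS"))
```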
The importance of feature functions

- The performance of a model depends to a large extent on the use of appropriate feature functions.
- A lot of research has gone into finding the most effective features when developing machine learning systems.
- Models that can't handle a large number of features usually don't perform as well.
- The most effective features differ from task to task, so feature design relies on a good understanding of the problem at hand and on domain knowledge.
Assigning weights to features

Now we know about features. What about the weights $\theta$?

$$\Psi(\mathbf{x}, y) = \theta \cdot f(\mathbf{x}, y)$$

- There are many different ways to estimate the weights $\theta$; that is why we have different machine learning models.
- We find the optimal value of $\theta$ with a set of training samples of size $N$: $\{\mathbf{x}^{(1:N)}, y^{(1:N)}\}$.
Probability preliminaries

- Joint probability: $P(X=a, Y=b)$, or $P(a,b)$ for short, where $X$ and $Y$ are random variables and $a$ and $b$ are values assigned to them
- Conditional probability: $P(X=a \mid Y=b) = \frac{P(X=a,\, Y=b)}{P(Y=b)}$
- Bayes' theorem: $P(X=a \mid Y=b) = \frac{P(Y=b \mid X=a)\,P(X=a)}{P(Y=b)}$
- Marginalization (sum rule): $P(Y) = \sum_X P(X, Y)$
- Independence: $P(Y \mid X) = P(Y)$, $P(X \mid Y) = P(X)$, $P(X, Y) = P(X)\,P(Y)$
- Conditional independence: $P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$
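A tiny numeric sanity check of these identities on a made-up 2x2 joint distribution:

```python
import numpy as np

# Made-up joint distribution P(X, Y) over X in {0, 1}, Y in {0, 1}
P = np.array([[0.1, 0.3],
              [0.2, 0.4]])

P_X = P.sum(axis=1)                  # marginalization: P(X) = sum_Y P(X, Y)
P_Y = P.sum(axis=0)                  # P(Y) = sum_X P(X, Y)
P_X_given_Y0 = P[:, 0] / P_Y[0]      # conditional: P(X | Y=0)

# Bayes' theorem: P(X=1 | Y=0) = P(Y=0 | X=1) P(X=1) / P(Y=0)
bayes = (P[1, 0] / P_X[1]) * P_X[1] / P_Y[0]
assert np.isclose(bayes, P_X_given_Y0[1])
```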
Naïve Bayes: the objective

The objective is to maximize the joint probability of a set of labeled training documents, $p(\mathbf{x}^{(1:N)}, y^{(1:N)})$, where $N$ is the number of documents. This is known as maximum likelihood estimation. The goal of the training process is to find the weights $\theta$ that maximize this likelihood:

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\ p(\mathbf{x}^{(1:N)}, y^{(1:N)}; \theta)
= \operatorname*{argmax}_{\theta}\ \prod_{i=1}^{N} p(\mathbf{x}^{(i)}, y^{(i)}; \theta)
= \operatorname*{argmax}_{\theta}\ \sum_{i=1}^{N} \log p(\mathbf{x}^{(i)}, y^{(i)}; \theta)$$

(The product decomposition assumes the samples are i.i.d., as discussed on the next slide; taking the log does not change the argmax because the log is monotonic.)

The notation $p(\mathbf{x}^{(1:N)}, y^{(1:N)}; \theta)$ indicates that $\theta$ is a parameter of the probability function. Symbols in bold indicate a vector of variables rather than a single variable.
"Independent and Identically Distributed"

- "A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent." (from Wikipedia)
- Often abbreviated as i.i.d.
- This is the basis on which we can break down the joint probability of all samples into the product of the probabilities of the individual samples.
The generative story of Naïve Bayes

The probability $p(\mathbf{x}^{(1:N)}, y^{(1:N)}; \theta)$ is defined through a generative model, an idealized random process that is assumed to have generated the observed data. The generative model underlying the Naïve Bayes classifier, with parameters $\theta = \{\mu, \phi\}$, is described by the following algorithm:

Algorithm 1: Generative process for the Naïve Bayes classification model
  for instance $i \in \{1, 2, \cdots, N\}$ do
    Draw the label $y^{(i)} \sim \text{Categorical}(\mu)$;
    Draw the word counts $\mathbf{x}^{(i)} \mid y^{(i)} \sim \text{Multinomial}(\phi_{y^{(i)}})$.
  end for
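A minimal sketch of this generative process with NumPy; the parameter values and the per-document length L are made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.5, 0.2, 0.3])           # categorical over K = 3 labels (assumed values)
phi = np.array([[0.7, 0.2, 0.1],          # phi[y] is a distribution over V = 3 words
                [0.1, 0.8, 0.1],
                [0.2, 0.2, 0.6]])
L = 10                                    # word tokens per document (assumption)

def generate(N):
    """Draw N (label, word-count vector) pairs from the Naive Bayes generative story."""
    data = []
    for _ in range(N):
        y = rng.choice(len(mu), p=mu)     # y ~ Categorical(mu)
        x = rng.multinomial(L, phi[y])    # x | y ~ Multinomial(L, phi_y)
        data.append((y, x))
    return data

print(generate(3))
```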
Multinomial distribution

$$P_{X,Y}(\mathbf{x}^{(i)}, y^{(i)}) = P_{X \mid Y}(\mathbf{x}^{(i)} \mid y^{(i)}) \times P_Y(y^{(i)})$$

$P_{X \mid Y}$ is a multinomial, which is a probability distribution over vectors of non-negative counts. The probability mass function for this distribution is:

$$P_{\text{mult}}(\mathbf{x}; \phi) = B(\mathbf{x}) \prod_{j=1}^{V} \phi_j^{x_j}$$

$$B(\mathbf{x}) = \frac{\left(\sum_{j=1}^{V} x_j\right)!}{\prod_{j=1}^{V} x_j!}$$

Crucially, $B(\mathbf{x})$ is a multinomial coefficient that does not depend on $\phi$, and can usually be ignored.
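As a sanity check, the log of this pmf can be computed directly from the formula (scipy.stats.multinomial implements the same distribution); this is a sketch, not a library API:

```python
import numpy as np
from math import lgamma

def log_pmult(x, phi):
    """log P_mult(x; phi) = log B(x) + sum_j x_j log phi_j."""
    x = np.asarray(x, dtype=float)
    log_B = lgamma(x.sum() + 1) - sum(lgamma(xj + 1) for xj in x)
    return log_B + float(np.sum(x * np.log(phi)))

print(log_pmult([2, 1, 0], [0.5, 0.3, 0.2]))
```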
Parameter estimation

The generative story above allows us to decompose

$$L(\phi, \mu) = \sum_{i=1}^{N} \log p(\mathbf{x}^{(i)}, y^{(i)}; \theta)$$

into

$$L(\phi, \mu) = \sum_{i=1}^{N} \left[ \log P_{\text{mult}}(\mathbf{x}^{(i)}; \phi_{y^{(i)}}) + \log P_{\text{cat}}(y^{(i)}; \mu) \right]
= \sum_{i=1}^{N} \left[ \log B(\mathbf{x}^{(i)}) + \sum_{j=1}^{V} x_j^{(i)} \log \phi_{y^{(i)},j} \right] + \sum_{i=1}^{N} \log \mu_{y^{(i)}}$$

Maximum-likelihood estimation chooses the $\phi$ and $\mu$ that maximize the log-likelihood $L$. Because we want these parameters to be probabilities, the solution must obey the following constraints:

$$\sum_{j=1}^{V} \phi_{y,j} = 1 \quad \forall y$$
Parameter estimation

After incorporating the constraints by adding a set of Lagrange multipliers, we get this new objective (we focus on $\phi_y$ for now):

$$\ell(\phi_y) = \sum_{i: y^{(i)}=y} \sum_{j=1}^{V} x_j^{(i)} \log \phi_{y,j} - \lambda \left( \sum_{j=1}^{V} \phi_{y,j} - 1 \right)$$

Differentiating with respect to the parameter $\phi_{y,j}$ yields:

$$\frac{\partial \ell(\phi_y)}{\partial \phi_{y,j}} = \sum_{i: y^{(i)}=y} x_j^{(i)} / \phi_{y,j} - \lambda$$
Parameter estimation

There is a closed-form solution, which can be obtained by setting each element in the vector of derivatives equal to zero:

$$\phi_{y,j} = \frac{1}{\lambda} \sum_{i: y^{(i)}=y} x_j^{(i)}$$

$$\phi_{y,j} \propto \sum_{i: y^{(i)}=y} x_j^{(i)} = \sum_{i=1}^{N} \delta\!\left(y^{(i)} = y\right) x_j^{(i)} = \text{count}(y, j)$$

where $\delta(y^{(i)}=y)$ is an indicator function that returns one if $y^{(i)} = y$ and zero otherwise. The symbol $\propto$ indicates that $\phi_{y,j}$ is proportional to the right-hand side of the equation.
Parameter estimation

- Recall the constraint that $\phi_y$ is a vector of probabilities: $\sum_{j=1}^{V} \phi_{y,j} = 1$. We then have an exact solution:

$$\phi_{y,j} = \frac{\text{count}(y, j)}{\sum_{j'=1}^{V} \text{count}(y, j')}$$

- Similarly, we can arrive at:

$$\mu_y = \frac{\text{count}(y)}{\sum_{y' \in \mathcal{Y}} \text{count}(y')}$$

As is often the case, the final result of the mathematical derivation (the part that needs to be turned into code) is much simpler than the derivation itself (see the sketch below).
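A minimal sketch of these two estimators, assuming X is an N x V matrix of word counts and y is an array of integer labels 0..K-1 (no smoothing yet; see the next slide):

```python
import numpy as np

def estimate_nb(X, y, K):
    """MLE: phi[k, j] = count(k, j) / sum_j' count(k, j'), mu[k] = count(k) / N."""
    y = np.asarray(y)
    N, V = X.shape
    phi = np.zeros((K, V))
    mu = np.zeros(K)
    for k in range(K):
        counts = X[y == k].sum(axis=0)   # count(k, j) for every word j
        phi[k] = counts / counts.sum()
        mu[k] = (y == k).sum() / N
    return phi, mu
```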
Smoothing

Laplace smoothing: add $\alpha$ to every count:

$$\phi_{y,j} = \frac{\alpha + \text{count}(y, j)}{V\alpha + \sum_{j'=1}^{V} \text{count}(y, j')}$$

The bias-variance trade-off:

- Unbiased classifiers may overfit the training data, yielding poor performance on unseen data.
- But if the smoothing is too large, the resulting classifier can underfit instead. In the limit $\alpha \to \infty$, there is zero variance: you get the same classifier, regardless of the data.

How do we determine the best $\alpha$?
Grid Search

Setting hyperparameters, such as the smoothing value $\alpha$, with grid search (hyperparameters are distinct from the model parameters $\phi$ and $\mu$):

- Try a set of values and pick the one that maximizes accuracy. But on which data set?
- The goal is to maximize performance on unseen data, so we shouldn't tune on the training data. Instead, we should tune on a development (or tuning) set.
- We should also set aside another set, called the test set, on which we measure the final system performance.
- If the data set is too small, use cross-validation (see the sketch below).
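A minimal sketch of tuning alpha on a development set, with Laplace smoothing folded into the estimator sketched earlier (the candidate values for alpha are arbitrary):

```python
import numpy as np

def estimate_nb_smoothed(X, y, K, alpha):
    """phi[k, j] = (alpha + count(k, j)) / (V * alpha + sum_j' count(k, j'))."""
    y = np.asarray(y)
    N, V = X.shape
    phi = np.zeros((K, V))
    mu = np.zeros(K)
    for k in range(K):
        counts = X[y == k].sum(axis=0)
        phi[k] = (alpha + counts) / (V * alpha + counts.sum())
        mu[k] = (y == k).sum() / N
    return phi, mu

def accuracy(X, y, phi, mu):
    """Fraction of documents whose argmax_y log p(x, y) equals the true label."""
    scores = X @ np.log(phi).T + np.log(mu)   # N x K matrix of log p(x, y), up to log B(x)
    return np.mean(scores.argmax(axis=1) == np.asarray(y))

def grid_search(X_train, y_train, X_dev, y_dev, K, alphas=(0.01, 0.1, 1.0, 10.0)):
    """Pick the alpha with the best accuracy on the development set."""
    return max(alphas,
               key=lambda a: accuracy(X_dev, y_dev,
                                      *estimate_nb_smoothed(X_train, y_train, K, a)))
```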
Prediction with Naïve Bayes

$$\hat{y} = \operatorname*{argmax}_{y}\ \log p(\mathbf{x}, y; \mu, \phi)
= \operatorname*{argmax}_{y}\ \left[ \log p(\mathbf{x} \mid y; \phi) + \log p(y; \mu) \right]$$

Plugging in the distributions from the generative story, we get:

$$\log p(\mathbf{x} \mid y; \phi) + \log p(y; \mu) = \log \left[ B(\mathbf{x}) \prod_{j=1}^{V} \phi_{y,j}^{x_j} \right] + \log \mu_y
= \log B(\mathbf{x}) + \sum_{j=1}^{V} x_j \log \phi_{y,j} + \log \mu_y
= \log B(\mathbf{x}) + \theta \cdot f(\mathbf{x}, y),$$

where

$$\theta = [\theta^{(1)}; \theta^{(2)}; \cdots; \theta^{(K)}]$$
$$\theta^{(y)} = [\log \phi_{y,1}; \log \phi_{y,2}; \cdots; \log \phi_{y,V}; \log \mu_y]$$

In other words, Naïve Bayes is a linear classifier whose weights are log-probabilities (see the sketch below).
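A minimal sketch of this correspondence: theta is assembled from the estimated log-probabilities, and prediction is the same linear scoring rule as before (the layout matches the (V+1)-block feature vectors sketched earlier):

```python
import numpy as np

def build_theta(phi, mu):
    """theta^(y) = [log phi_{y,1}; ...; log phi_{y,V}; log mu_y], concatenated over labels."""
    K, V = phi.shape
    blocks = [np.concatenate([np.log(phi[k]), [np.log(mu[k])]]) for k in range(K)]
    return np.concatenate(blocks)          # length K * (V + 1)

def predict(x, theta, K):
    """argmax_y theta · f(x, y), where f places [x; 1] in the block for label y."""
    x1 = np.concatenate([x, [1.0]])        # append the bias feature
    V1 = len(x1)
    scores = [theta[k * V1:(k + 1) * V1] @ x1 for k in range(K)]
    return int(np.argmax(scores))
```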
Relation between mathematical models and computer science tools

- Mathematics provides the justification for why a model works the way it does; computer science focuses on realizing it with efficient algorithms and appropriate data structures.
- Computational algorithms often resort to caching to avoid repeated computation, making the implementation more efficient, e.g., Viterbi, Forward-Backward, CKY.
- It's useful to think about where the mathematical justification ends and the computational realization starts.
- The relation between a mathematical expression and its implementation in a programming language can be thought of as a translation process: it's not always word for word.
Advantages of Naïve Bayes

With a joint likelihood objective, the estimated parameters of a Naïve Bayes model can be used both for classification ($p(y \mid \mathbf{x})$) and for generation ($p(\mathbf{x} \mid y)$):

$$P(\mathbf{x}, y) = P(\mathbf{x} \mid y) \times P(y) = P(y \mid \mathbf{x}) \times P(\mathbf{x})$$

In practice, we rarely use Naïve Bayes for generation; it is mostly used as a classification model.
Problems of Naïve Bayes

Let's say we want to include some subword units (e.g., morphemes) as additional features:

$$P(\text{word}=\textit{unfit},\, \text{prefix}=\textit{un-} \mid y) = P(\text{prefix}=\textit{un-} \mid \text{word}=\textit{unfit}, y) \times P(\text{word}=\textit{unfit} \mid y) = 1 \times P(\text{word}=\textit{unfit} \mid y)$$

If we instead assume conditional independence,

$$P(\text{word}=\textit{unfit},\, \text{prefix}=\textit{un-} \mid y) \approx P(\text{word}=\textit{unfit} \mid y) \times P(\text{prefix}=\textit{un-} \mid y)$$

Since $P(\text{word}=\textit{unfit} \mid y) \geq P(\text{word}=\textit{unfit} \mid y) \times P(\text{prefix}=\textit{un-} \mid y)$, the conditional independence assumption under-estimates the true probability of a conjunction of positively correlated features.
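As a quick illustration with made-up numbers: if $P(\text{word}=\textit{unfit} \mid y) = 0.01$ and $P(\text{prefix}=\textit{un-} \mid y) = 0.05$, the true joint probability is $0.01$ (the prefix is implied by the word), but the conditional-independence approximation gives $0.01 \times 0.05 = 0.0005$, twenty times too small.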