What is Text Classification Statistical Classification with NLTK and Scikit-Learn Advice on Machine Learning
COMP3220 — Document Processing and the Semantic Web
Week 03 Lecture 1: Introduction to Text Classification
Diego Mollá
Department of Computer Science Macquarie University
COMP3220 2021H1
W03L1: Text Classification 1/32
Programme
1 What is Text Classification
2 Statistical Classification with NLTK and Scikit-Learn
3 Advice on Machine Learning
Reading
NLTK Book Chapter 6 “Learning to Classify Text”
Some Useful Extra Reading
Jurafsky & Martin (draft), Chapter 4, “Naive Bayes and Sentiment Classification”.
Text Classification
What is Text Classification?
Classify documents into one of a fixed, predetermined set of categories.
The number of categories is predetermined.
The actual categories are predetermined.
Examples
Spam detection.
Email filtering.
Classification of text into genres.
Classification of names by gender.
Classification of questions.
Example: Spam Filtering
Distinguish this
Date: Mon, 24 Mar 2008
From: XXX YYY
Subj: Re: Fwd: MSc
To: Mark Dras
Hi, Thanks for that. It would
fit me very well to start
2009, its actually much better
for me and I’m planning to
finish the project in one year
(8 credit points).
from this
Date: Mon, 24 Mar 2008
From: XXX YYY
Subj: HELLO
To: madras@ics.mq.edu.au
HELLO, MY NAME IS STEPHINE IN
SEARCH OF A MAN WHO
UNDERSTANDS THE MEANING OF
LOVE AS TRUST AND FAITH IN
EACH OTHER RATHER THAN ONE WHO
SEES LOVE AS THE ONLY WAY OF
FUN …
Classification Methods
Manual
Web portals. Wikipedia.
Automatic
Hand coded rules
e.g. ‘Viagra’ == SPAM.
e.g. email filter rules.
Fragile, breaks on new data.
Supervised learning
Use an annotated corpus.
Apply statistical methods.
Greater flexibility.
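The fragility of hand-coded rules can be illustrated with a minimal sketch. The function name `is_spam_by_rule` and the two sample messages are hypothetical, invented only for this illustration; a real filter would combine many such rules.

```python
def is_spam_by_rule(text):
    """Hand-coded rule: flag a message as spam if it mentions 'viagra'."""
    return "viagra" in text.lower()


# Hypothetical messages, invented for this illustration.
print(is_spam_by_rule("Cheap Viagra here"))  # the rule fires
print(is_spam_by_rule("Cheap V1agra here"))  # a trivial respelling evades the rule
```

Each new spelling needs a new rule, which is why a method that learns from annotated examples tends to be more flexible.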
Supervised Learning
Given
Training data annotated with class information.
Goal
Build a model which will allow classification of new data.
Method
1 Feature extraction: Convert samples into vectors.
2 Training: Automatically learn a model.
3 Classification: Apply the model on new data.
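The three steps can be sketched in plain Python. This is a deliberately simplistic stand-in for a statistical classifier: the "model" only remembers the most frequent label for each feature value, and the feature is a single value rather than a full vector. The function names and the toy training data are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def extract_features(name):
    """Step 1, feature extraction: represent a name by its last letter."""
    return name[-1].lower()


def train(labelled_names):
    """Step 2, training: record the most frequent label for each feature value."""
    counts = defaultdict(Counter)
    for name, label in labelled_names:
        counts[extract_features(name)][label] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}


def classify(model, name, default="unknown"):
    """Step 3, classification: look up the label learnt for the feature value."""
    return model.get(extract_features(name), default)


# Hypothetical annotated training data.
training_data = [("Anna", "female"), ("Maria", "female"), ("Laura", "female"),
                 ("John", "male"), ("Ethan", "male"), ("Kevin", "male")]
model = train(training_data)
print(classify(model, "Sophia"))  # unseen name ending in 'a'
print(classify(model, "Martin"))  # unseen name ending in 'n'
```

Note that the model never sees the names themselves, only the extracted feature.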
NLTK Features
Statistical classifiers are not able to make sense of raw text.
We need to feed them with our interpretation of the text.
NLTK classifiers expect a dictionary of features and values.
Example of a Simple Feature Extractor
def gender_features(word):
    return {'last_letter': word[-1]}
Example: Gender Classification
>>> import nltk
>>> from nltk.corpus import names
>>> import random
>>> random.seed(1234)  # Fixed random seed to facilitate replicability
>>> m = names.words('male.txt')    # male and female name lists from the corpus
>>> f = names.words('female.txt')
>>> names = ([(name, 'male') for name in m] +
...          [(name, 'female') for name in f])
>>> random.shuffle(names)
>>> def gender_features(word):
...     return {'last_letter': word[-1]}
...
>>> featuresets = [(gender_features(n), g) for n, g in names]
>>> train_set, devtest_set, test_set = (
...     featuresets[1000:], featuresets[500:1000], featuresets[:500])
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> classifier.classify(gender_features('Neo'))
'male'
>>> classifier.classify(gender_features('Trinity'))
'female'
>>> nltk.classify.accuracy(classifier, test_set)
0.776
Note that the classifier is fed with the gender features and not with the actual names.
The Development Set I
Important
Always test your system with data that has not been used for development (Why?).
Development and Test Sets
Put aside a test set and don’t even look at its contents.
Use the remaining data as a development set.
Separate the development set into training and dev-test sets.
Use the training set to train the statistical classifiers.
Use the dev-test set to fine-tune the classifiers and do error analysis.
Use the test set for the final system evaluation once all decisions and fine-tuning have been completed.
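The procedure above can be sketched as a small helper that shuffles the data once and then cuts it into the three sets. The name `split_corpus`, its parameters and the toy corpus are assumptions made for this illustration.

```python
import random


def split_corpus(items, test_size, devtest_size, seed=1234):
    """Shuffle a copy of the data, then split into train, dev-test and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    devtest = shuffled[test_size:test_size + devtest_size]
    train = shuffled[test_size + devtest_size:]
    return train, devtest, test


# Hypothetical corpus of 100 labelled items.
corpus = [("item%d" % i, i % 2) for i in range(100)]
train, devtest, test = split_corpus(corpus, test_size=10, devtest_size=10)
print(len(train), len(devtest), len(test))  # 80 10 10
```

Because the three slices do not overlap, nothing used for training or tuning can leak into the final evaluation.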
The Development Set II
Error Analysis in our Gender Classifier
>>> nltk.classify.accuracy(classifier, devtest_set)
0.756
>>> devtest_names = names[500:1000]  # the names behind devtest_set
>>> false_males, false_females = [], []
>>> for name, tag in devtest_names:
...     guess = classifier.classify(gender_features(name))
...     if tag == 'female' and guess == 'male':
...         false_males.append(name)
...     elif tag == 'male' and guess == 'female':
...         false_females.append(name)
...
>>> len(false_males)
59
>>> len(false_females)
63
>>> for m in false_females[:5]:
...     print(m)
...
Emmery
Winny
Alaa
Nate
Barrie
A Revised Gender Classifier
>>> def gender_features2(word):
...     return {'suffix1': word[-1:], 'suffix2': word[-2:]}
...
>>> train_names = names[1000:]  # the names behind train_set
>>> train_set2 = [(gender_features2(n), g) for n, g in train_names]
>>> devtest_set2 = [(gender_features2(n), g) for n, g in devtest_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set2)
>>> nltk.classify.accuracy(classifier, devtest_set2)
0.77
Beware of Over-fitting
If there are many features on a small corpus the system may over-fit.
>>> def gender_features3(name):
...     features = {}
...     features['firstletter'] = name[0].lower()
...     features['lastletter'] = name[-1]
...     for letter in 'abcdefghijklmnopqrstuvwxyz':
...         features['count(%s)' % letter] = name.lower().count(letter)
...         features['has(%s)' % letter] = (letter in name.lower())
...     return features
...
>>> gender_features3('John')
{'count(u)': 0, 'has(d)': False, 'count(b)': 0, 'count(w)': 0, …}
>>> train_set3 = [(gender_features3(n), g) for n, g in train_names]
>>> devtest_set3 = [(gender_features3(n), g) for n, g in devtest_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set3)
>>> nltk.classify.accuracy(classifier, devtest_set3)
0.758
>>> classifier2b = nltk.NaiveBayesClassifier.train(train_set2)
>>> nltk.classify.accuracy(classifier2b, devtest_set2)
0.77
Some types of classifiers are more sensitive to over-fitting than others.
Identifying Over-fitting
(see this week’s lecture Jupyter notebook for the code that created this plot)
Text Classification in Scikit-Learn
Scikit-learn includes a large number of statistical classifiers.
All of these classifiers have a common interface.
The features of a document set are represented as a matrix.
Each row represents a document.
Each column represents a feature.
Scikit-learn provides several useful feature extractors for text:
1 CountVectorizer returns a (sparse) matrix of word counts.
2 TfidfVectorizer returns a (sparse) matrix of tf.idf values.
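A minimal sketch of the two feature extractors, assuming scikit-learn is installed. The three-document corpus is hypothetical and only serves to make the matrix shape easy to check by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus: one string per document.
docs = ["the cat sat", "the dog sat", "the cat saw the dog"]

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)  # sparse matrix, one row per document
print(counts.shape)                            # (3, 5): 3 documents, 5 distinct words
print(sorted(count_vectorizer.vocabulary_))    # the words behind the columns
print(counts.toarray()[2])                     # word counts for the third document

tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)   # same shape, tf.idf weights instead of counts
print(tfidf.shape)
```

Both extractors follow the same `fit_transform` pattern, so one can be swapped for the other without changing the rest of the pipeline.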
Gender Classifier in Scikit-Learn – Take 1
>>> from sklearn.naive_bayes import MultinomialNB
>>> def gender_features(word):
...     "Return the ASCII value of the last two characters"
...     return [ord(word[-2]), ord(word[-1])]
...
>>> featuresets = [(gender_features(n), g) for n, g in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> train_X, train_y = zip(*train_set)
>>> classifier = MultinomialNB()
>>> classifier.fit(train_X, train_y)
>>> test_X, test_y = zip(*test_set)
>>> classifier.predict(test_X[:5])
array(['female', 'female', 'male', 'female', 'female'], dtype='|S6')
Gender Classification – Take 2
In the previous slide we used this code to encode the last two characters of a name:
def gender_features(word):
    "Return the ASCII value of the last two characters"
    return [ord(word[-2]), ord(word[-1])]
This code is not entirely correct, since it represents characters as numbers: the classifier treats the character codes as quantities, although letters have no meaningful numerical order or magnitude.
In general, non-numerical information is best represented using one-hot encoding.
sklearn provides the following classes to produce one-hot encoded vectors:
preprocessing.OneHotEncoder: from categorical feature values (e.g. integers) to one-hot vectors.
preprocessing.LabelBinarizer: from labels to one-hot vectors.
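As a minimal sketch, assuming scikit-learn is installed, `LabelBinarizer` can be fitted on the set of possible labels and then used to transform individual labels. The five single-letter labels are chosen only for illustration.

```python
from sklearn.preprocessing import LabelBinarizer

binarizer = LabelBinarizer()
binarizer.fit(["a", "b", "c", "d", "e"])      # learn the set of possible labels
encoded = binarizer.transform(["a", "b"])     # one row per input label
print(binarizer.classes_)                     # the label behind each column
print(encoded)
```

With more than two classes each row contains a single 1, in the column of the corresponding label.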
One-hot Encoding
Suppose you want to encode five labels: ’a’, ’b’, ’c’, ’d’, ’e’.
Each label corresponds to one element of the one-hot vector. Thus:
’a’ is represented as (1, 0, 0, 0, 0).
’b’ is represented as (0, 1, 0, 0, 0).
and so on.
This is also called binarization or categorical encoding.
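The encoding can be written by hand in a few lines. This is a minimal sketch; the helper name `one_hot` is an assumption made for this illustration, and it raises `ValueError` for a label that is not in the list.

```python
def one_hot(label, labels):
    """Return a one-hot list for `label`, given the ordered list of all labels."""
    vector = [0] * len(labels)
    vector[labels.index(label)] = 1
    return vector


labels = ["a", "b", "c", "d", "e"]
print(one_hot("a", labels))  # [1, 0, 0, 0, 0]
print(one_hot("b", labels))  # [0, 1, 0, 0, 0]
```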
One-hot Encoding for Gender Classification
def one_hot_character(c):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    result = [0] * (len(alphabet) + 1)
    i = alphabet.find(c.lower())
    if i >= 0:
        result[i] = 1
    else:
        result[len(alphabet)] = 1  # out of the alphabet
    return result

def gender_features(word):
    last = one_hot_character(word[-1])
    secondlast = one_hot_character(word[-2])
    return secondlast + last
Possible Problems with Machine Learning
ML methods are typically seen as black boxes.
Some methods are better than others for specific tasks but people tend to just try several and choose the one with best results.
Possible problems/mistakes you might face
1 Train and test are the same dataset (don’t do this!).
2 The results on the test set are much worse than those on the dev-test set.
3 The results of both the test set and the training set are bad.
4 The train/test partition is not random.
5 The results on the test set are good but then the results on your real application problem are bad.
Partition into Training and Testing Set
What’s wrong with this partition?
from nltk.corpus import names
m = names.words('male.txt')
f = names.words('female.txt')
names = ([(name, 'm') for name in m] +
         [(name, 'f') for name in f])
trainset = names[1000:]
devtest = names[500:1000]
testset = names[:500]
Advice
1 Make sure that the train and test sets have no bias.
2 Make sure that the train and test sets are representative of your problem.
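One way to see the problem is to count the labels in each partition. The sketch below does not load the NLTK corpus; it uses a hypothetical stand-in list with the same ordering (all male names first, then all female names), and the sizes 3000 and 5000 are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical stand-in for the corpus: all male names first, then all female names.
names = ([("m_name%d" % i, "m") for i in range(3000)] +
         [("f_name%d" % i, "f") for i in range(5000)])

trainset = names[1000:]
devtest = names[500:1000]
testset = names[:500]

# Count the labels in each partition.
print(Counter(label for _, label in testset))   # only 'm'
print(Counter(label for _, label in devtest))   # only 'm'
print(Counter(label for _, label in trainset))  # 2000 'm' and 5000 'f'
```

Because the list is ordered by label, the dev-test and test sets contain a single class and are not representative of the problem.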
Randomised Partition
A better partition
from nltk.corpus import names
import random
m = names.words('male.txt')
f = names.words('female.txt')
names = ([(name, 'm') for name in m] +
         [(name, 'f') for name in f])
random.seed(1234)
random.shuffle(names)
trainset = names[1000:]
devtest = names[500:1000]
testset = names[:500]
Why is machine learning hard?
There are an infinite number of curves that fit the data, and even more if we don’t require the curves to fit exactly (e.g., if we assume there’s noise in our data).
In general, more data would help us identify the correct curve better.
[Figure: plot of Height against Age.]
Over-fitting the Training Data
[Figure: plot of Height against Age, showing the true curve and our estimate.]
Over-fitting occurs when an algorithm learns a function that is fitting noise in the data.
Diagnostic of over-fitting: performance on training data is much higher than performance on dev or test data.
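The diagnostic can be illustrated with an extreme case: a "classifier" that simply memorises its training data. This is a minimal sketch, not a realistic model; the function names, the default label and the toy data are assumptions made for this illustration.

```python
def train_memoriser(labelled_names):
    """Over-fitting extreme: memorise the label of every training name."""
    return dict(labelled_names)


def accuracy(model, labelled_names, default="female"):
    """Fraction of names whose memorised (or default) label is correct."""
    correct = sum(model.get(name, default) == label
                  for name, label in labelled_names)
    return correct / len(labelled_names)


# Hypothetical toy data, invented for this illustration.
train_data = [("Anna", "female"), ("John", "male"),
              ("Maria", "female"), ("Kevin", "male")]
dev_data = [("Laura", "female"), ("Ethan", "male"),
            ("Sophia", "female"), ("Martin", "male")]

model = train_memoriser(train_data)
train_acc = accuracy(model, train_data)
dev_acc = accuracy(model, dev_data)
print(train_acc, dev_acc)           # 1.0 0.5
print("gap:", train_acc - dev_acc)  # a large gap suggests over-fitting
```

The model is perfect on the data it has seen and no better than guessing on new data, which is the pattern the diagnostic looks for.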
Take-home Messages
1 Explain and demonstrate the need for separate training and test sets.
2 Implement feature extractors for statistical classifiers in NLTK and Scikit-Learn.
3 Use NLTK’s and Scikit-Learn’s statistical classifiers.
4 Detect over-fitting.
What’s Next
Week 4
Deep Learning.
Reading
Deep Learning book chapters 2 and 3.