Document Classification 2: Using Wordlists
This time:
Words as features
Using vocabulary lists:
Sentiment classification
Relevance detection
Document filtering
Scoring documents
Decision criteria
Avoiding hand-crafted lists
Words, sparseness and Zipf’s Law!
Words as Features
Words provide evidence of being in a particular class:
excellent — evidence that a review is positive
MacDonalds — evidence that a document is relevant
Viagra — evidence that a document is SPAM
IDEA: just treat a document as a ‘bag of words’
— ignore word order, grammatical structure, etc.
Wordlist based classification:
Naïve approach: use hand-crafted vocabulary lists
Or use ML techniques to derive lists from data?
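A minimal Python sketch of the bag-of-words idea, assuming crude whitespace tokenisation and lower-casing (nothing here is prescribed by the slides):

```python
from collections import Counter

def bag_of_words(text):
    """Treat a document as a bag of words: keep counts, ignore order and structure."""
    tokens = text.lower().split()   # crude whitespace tokenisation
    return Counter(tokens)          # word -> number of occurrences

bag = bag_of_words("Excellent food excellent service slightly slow service")
print(bag["excellent"], bag["slow"])   # 2 1
```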
Using Vocabulary Lists for Sentiment Classification
Decide whether opinion expressed is +ve or -ve (or neutral)
[Diagram: document → feature extractor → tokens → classifier (using +ve and -ve word lists) → positive / negative / neutral]
Using Vocabulary Lists for Relevance Classification
Decide whether or not document is relevant (wrt some topic):
[Diagram: document → feature extractor → tokens → classifier (using relevant word list) → relevant / not relevant]
Using Vocabulary Lists for Document Filtering
Decide whether document is acceptable or problematic:
[Diagram: document → feature extractor → tokens → classifier (using word lists WL A and WL B) → acceptable / problem A / problem B]
Word Lists for Sentiment Classification
List of words that typically indicate positive sentiment
good, great, excellent, fast …
List of words that typically indicate negative sentiment
bad, poor, broken, slow, …
Mostly adjectives/modifiers
How can these be used for classification?
Scoring Documents
A document is treated as a bag of words, d
Number of occurrences of word w in d denoted count(d, w)
Generalise to total occurrences of words from list L in d:
count(d, L) = Σ_{w∈L} count(d, w)
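These counts drop straight out of a bag-of-words representation; a small Python sketch (assuming d is a Counter over tokens, as above):

```python
from collections import Counter

def count_word(d, w):
    """count(d, w): number of occurrences of word w in the bag of words d."""
    return d[w]

def count_list(d, L):
    """count(d, L): total occurrences in d of any word from the list L."""
    return sum(d[w] for w in L)

d = Counter("good good fast but slightly slow".split())
print(count_list(d, ["good", "great", "excellent", "fast"]))   # 3
```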
Example: Scoring Documents for Sentiment Analysis
Positive sentiment word list: L+
Negative sentiment word list: L−
Decision for document, d:
positive if count(d, L+) > count(d, L−) + δ class(d)= negative ifcount(d,L−)>count(d,L+)+δ
neutral otherwise δ is some appropriate margin
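The same decision rule as a Python sketch (d is assumed to be a Counter; delta plays the role of the margin δ):

```python
def classify_sentiment(d, pos_list, neg_list, delta=0):
    """Margin-based sentiment decision over a bag of words d."""
    pos = sum(d[w] for w in pos_list)   # count(d, L+)
    neg = sum(d[w] for w in neg_list)   # count(d, L-)
    if pos > neg + delta:
        return "positive"
    if neg > pos + delta:
        return "negative"
    return "neutral"
```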
Example: Scoring Documents for Relevance
Relevant word list: L
Decision for document, d:
class(d) = relevant      if count(d, L) > δ
           not relevant  otherwise
δ is some appropriate threshold
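The corresponding thresholding rule in Python (again a sketch, with d a Counter and delta the threshold δ):

```python
def classify_relevance(d, relevant_list, delta=0):
    """Label a document relevant if enough words from the relevant word list occur."""
    return "relevant" if sum(d[w] for w in relevant_list) > delta else "not relevant"
```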
Alternative Decision Criteria
Do not distinguish frequencies above 1:
count(d, w) = 1 if w occurs at least once in d
              0 otherwise
Associate different weights with different words — where do weights come from?
Effectiveness of these options will be explored in lab sessions
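Minimal sketches of both alternatives (the weights mapping is purely illustrative; where the weights come from is left open above):

```python
def binary_count(d, w):
    """Presence/absence version of count(d, w): 1 if w occurs in d at least once, else 0."""
    return 1 if d[w] > 0 else 0

def weighted_score(d, weights):
    """Weighted variant: 'weights' maps each list word to its importance."""
    return sum(weight * binary_count(d, w) for w, weight in weights.items())
```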
Avoiding Reliance on Hand-Crafted Lists
Problems with hand-crafting word lists:
Hard to build comprehensive lists
Hard to build balanced lists
Variation in the way people express themselves
Variation in the way vocabulary is used in different domains
Relevance of words not always obvious
etc., etc.
Can we derive word lists from data?
Deriving Word Lists Automatically
Use a sample of documents to derive word lists
Documents in sample need to be labelled

Task                       Document Labels
sentiment analysis         positive, negative, neutral
relevance testing          relevant, irrelevant
acceptability filtering    acceptable, unacceptable
Deriving Word Lists Automatically (cont.)
Such labelled data can be expensive to produce
Vocabulary lists highly domain dependent
Difficult to adapt classifier to new domains
Selection of Words for Word Lists
Most common terms
Positive word list:
— most common words in those documents with positive label
Negative word list:
— most common words in those documents with negative label
Relevant word list:
— most common words in those documents with relevant label
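A sketch of this most-common-terms criterion, assuming labelled data given as (tokens, label) pairs and an arbitrary list size k:

```python
from collections import Counter

def most_common_words(labelled_docs, label, k=50):
    """Word list for a class: the k most common words in documents with that label."""
    counts = Counter()
    for tokens, doc_label in labelled_docs:
        if doc_label == label:
            counts.update(tokens)
    return [w for w, _ in counts.most_common(k)]
```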
Selection of Words for Word Lists (cont.)
Greatest frequency difference
For sentiment classification, a term’s positive score is the count in positive labelled documents minus the count in negative labelled documents
In general, for a class c, a term’s score for class c is the count in documents labelled as c minus the total count in documents not labelled as c
Lists for a class formed from terms with highest scores for that class
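A sketch of the greatest-frequency-difference criterion, under the same assumed (tokens, label) data format:

```python
from collections import Counter

def frequency_difference_scores(labelled_docs, c):
    """Score for class c: count in documents labelled c minus count in all other documents."""
    in_c, not_in_c = Counter(), Counter()
    for tokens, label in labelled_docs:
        (in_c if label == c else not_in_c).update(tokens)
    return {w: in_c[w] - not_in_c[w] for w in in_c}

def word_list_for_class(labelled_docs, c, k=50):
    """Word list for class c: the k terms with the highest scores for that class."""
    scores = frequency_difference_scores(labelled_docs, c)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```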
Proper use of Labelled Data
You have a collection of labelled data
You want to use the labelled data to build a model
You want to find out how well your model works
Important to use separate training and testing partitions of the labelled data
Need to be careful to avoid over-fitting the training data
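One simple way to make the partition, as a sketch (the 80/20 split and random shuffling are assumptions, not something the slides fix):

```python
import random

def train_test_partition(labelled_docs, test_fraction=0.2, seed=0):
    """Split the labelled data into disjoint training and testing partitions."""
    docs = list(labelled_docs)
    random.Random(seed).shuffle(docs)           # reproducible shuffle
    cut = int(len(docs) * (1 - test_fraction))
    return docs[:cut], docs[cut:]               # derive word lists from the training part only
```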
Size of Vocabulary Lists
How large should the vocabulary lists be?
Empirically determine optimal vocabulary size
Use a held-back development set to evaluate options
It would be cheating to determine this using the test data
Example of a hyperparameter
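A sketch of tuning the list size on the development set; build_lists(train_docs, k) and evaluate(lists, dev_docs) are hypothetical task-specific helpers, not defined on these slides:

```python
def choose_list_size(train_docs, dev_docs, candidate_sizes, build_lists, evaluate):
    """Pick the vocabulary-list size (a hyperparameter) that scores best on the dev set."""
    best_k, best_score = None, float("-inf")
    for k in candidate_sizes:
        score = evaluate(build_lists(train_docs, k), dev_docs)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```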
Exploit Machine Learning
Learn from the data
Don’t need to second guess how important each feature is
Do need to choose what kind of features will be used
Do need to choose a learning method
Machine Learning Approach
[Diagram: labelled data → classifier learner → parameters; parameters + unlabelled data → classifier → labelled data]
Supervised Machine Learning: Data Sparseness
Typically models are trained on a set of examples
Trained models are then applied to new, unseen data
But many of the events in training and new data are rare (data is sparse)
— Zipf’s Law
Many events in new data will not have been present in training data
Zipf’s Law
Suppose we have n types (kinds of events, e.g. words) ordered by frequency: t1, …, tn
— t1 is the most frequent type, so has rank 1
— tn is the least frequent type, so has rank n
If Nr denotes the frequency of tr (the type with rank r) then
Nr ∼ 1 / r^a
The frequency of a word is inversely proportional to its rank (a is close to 1)
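A quick empirical check, as a Python sketch ('corpus.txt' is a stand-in for any reasonably large plain-text corpus):

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs: Nr for the type with rank r."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

# Under Zipf's Law Nr ~ 1/r^a with a close to 1, so r * Nr should stay roughly constant.
tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
for r, n in rank_frequency(tokens)[:10]:
    print(r, n, r * n)
```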
Next Topic: The Naïve Bayes Classifier
Some elementary probability theory
Bayes’ Law
Naïve Bayes classification
Learning model parameters from data
Multinomial Bayes v. Bernoulli Bayes
The zero probability problem and smoothing