Document Similarity
This time:
Characterising document topic
Stop words
The role of word frequency TF-IDF term weighting
From words to phrasal terms
Measuring document similarity
The vector space model Cosine similarity
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 1 / 26
Characteristics of a document
Consider problem of characterising what a document is about (its “topic”)
Characteristic terminology: words and phrases
Can then think about how to measure document similarity
Are two documents about the same topic? Are two document about related topics?
Folksonomies versus taxonomies
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 2 / 26
Stop Words
Some words are so common that they occur everywhere
Not characteristic of document content or topic Referred to as stop words
Can be ignored in determining document topic
People have constructed stop word lists e.g. NLTK provides such a list
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 3 / 26
Stop Word Lists: Example
From NLTK stopwords.words(’english’):
a about against all both but my myself on once she should than that
above after am an
by can no nor only or
so some the their
again and… did… not… other… such… theirs…
…. and so on and so on…..!
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 4 / 26
Identifying Topic Terms
Consider counting number of occurrences of a term
Intuitively, more frequent terms may be more important
IDEA: assign a weight to terms based on frequency Weight should reflect importance of the term
The higher the frequency the greater the weight/importance? Choose k most important terms to characterise document topic How well would this work?
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 5 / 26
Using Raw Frequency
Using raw frequency alone is not enough
Some high-frequency terms are not discriminating
stop words: high frequency but not characteristic of topic
Some content terms evenly distributed across many documents
e.g country in collection of documents to do with travel
such terms have little value in characterising document content
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 6 / 26
Measuring Value of a Term
So what kind of term best characterises document content?
Frequency of occurrence within a document is important: infrequent terms less likely to be characteristic of document content —> give weight to higher frequency terms
BUT…..
Spread of occurrences across documents is also important:
terms that occur everywhere not characteristic of document topic —> give weight to terms occurring in few documents (‘burstiness’)
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 7 / 26
Term Frequency and Document Frequency
Term Frequency
tft ,d
— the number of occurrences of term t in document d
Document Frequency
dft
— number of documents in collection containing term t
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 8 / 26
Term Frequency-Inverse Document Frequency
Inverse Document Frequency
idft = log N dft
— N is total number of documents
TF-IDF: Term Frequency-Inverse Document Frequency
tf-idft,d = tft,d × idft
TF-IDF weighting widely used to identify characteristic terms
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 9 / 26
A Look at TF-IDF
How does TF-IDF work?
TF: Proportional to term frequency
— the greater the term frequency the higher the score
IDF: dampens score for terms with high document frequency — the higher the document frequency, the lower the IDF
— determines impact of increasing raw frequency of term
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 10 / 26
Words and Phrases
So far assumes that topic terms are single words
— unigrams
— this is too restrictive
Often a phrase or multiword term is characteristic of topic: — n-grams
— hedge fund, black hole, tablet device, surface to air missile Same TF-IDF weighting approach can be applied to phrases
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 11 / 26
Phrasal Topic Labels
There are a lot of unigrams
There are many, many more bigrams
Some useful phrases are even longer
Size of space from which topic phrases need to found is huge Challenge is to find candidates that are plausible topic labels
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 12 / 26
Phrasal Topic Labels
Narrowing down the search:
Only consider collocations
— words that occur together far more often than chance
Only consider grammatically plausible sequences
— characterise in terms of plausible part of speech sequences — (consider parts of speech in next lecture)
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 13 / 26
Finding Collocations
Consider a two word phrase w1w2 Some questions:
how frequently do unigrams w1 and w2 occur? how frequently does the bigram w1w2 occur? how frequently might we expect w1w2 to occur?
Some answers:
p(w1): how probable it is that we’ll see w1
p(w2): how probable it is that we’ll see w2
p(w1, w2): how probable it is that we’ll see w1w2 p(w1)p(w2): how probable we might expect w1w2 to occur i.e. if p(w1) and p(w1) are completely independent
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 14 / 26
Point-wise Mutual Information
PMI value tell us how surprising it is to see the bigram w1w2 given how likely w1 and w2 are to show up in general
Not all bigrams are equally surprising PMI(w1w2) = log p(w1, w2)
Gives zero when no surprise:
i.e. when p(w1, w2) = p(w1)p(w2)
Probabilities estimated based on observed frequencies
p(w1)p(w2)
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 15 / 26
Document Similarity: Two Scenarios
1. Navigating a document collection
We have a large collection of documents
— the web
— a large collection of online textbooks
Want to find other documents similar to ones we know about
2 Searching a large document collection
We have a search query
Interested in finding the documents most relevant to the query
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 16 / 26
Vector Space Model for Document Collections
Identify a fixed set of topic terms
— perhaps 100s of thousands of terms
One dimension in the space for each of these terms
Every document associated with a vector in this space
Topic of every document captured within a common vector space
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 17 / 26
Vectors for Documents
Use TF-IDF weights or some variant of TF-IDF
A document d represented as a vector of weights
V ( d )
A bag-of-words representation of a document
Captures topics but not the propositions within document
Word order not taken into account
— Google acquired YouTube versus YouTube acquired Google
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 18 / 26
Queries as Documents
A query can be treated as a bag of words
A query is just a (short) document
A query lies in the same vector space as the documents
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 19 / 26
Question
How can we measure how far apart documents are in this space?
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 20 / 26
Measuring Vector Similarity
Cosine similarity
Quantifying similarity between documents d1 and d2
V (d1) · V (d2) sim(d1, d2) = V (d1) × V (d2)
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 21 / 26
Inside Cosine: Numerator
Dot product
V (d1) · V (d2) sim(d1, d2) = V (d1) × V (d2)
In general, suppose X = (x1,…,xm) and Y = (y1,…,ym) m
Also known as inner product
X · Y = xi yi i=1
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 22 / 26
Inside Cosine: Denominator
Euclidean length
V (d1) · V (d2) sim(d1, d2) = V (d1) × V (d2)
Ingeneral,supposeX =(x1,…,xm)
m
X = x i 2
i=1
A way to normalise vectors
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 23 / 26
Using Cosine
Two uses
Measure similarity of query and documents Measure similarity of two documents
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 24 / 26
Challenges
Requiring exact match not always desirable Allow for spelling errors and spelling alternatives
Allow for morphological variation — stemming or lemmatisation
Documents can be topically similar but not use same vocabulary — synonyms or near synonyms
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 25 / 26
Next Topic: Part-of-Speech Tagging
Parts of Speech (POS)
Open and closed POS categories POS Tagsets
Performing POS tagging
A simple unigram tagger
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 26 / 26