Data Mining and Machine Learning
Lecture 3
Stopping, Stemming & TF-IDF Similarity
Peter Jančovič
Slide 1
Objectives
Understand definition and use of Stop Lists
Understand motivation and methods of Stemming
Understand how to calculate the TF-IDF Similarity between two documents
Slide 2
Zipf’s Law
[Figure: word frequency plotted against word rank – actual statistics from “Jane Eyre”; y-axis: Frequency (0 to 20,000); x-axis: Word (ranked by frequency)]
Slide 3
‘Resolving Power’ of words
[Figure: schematic plot of frequency against word rank showing the Zipf’s Law curve together with the ‘resolving power’ of each word; an upper cutoff excludes words that are too common and a lower cutoff excludes words that are too rare; y-axis: Frequency (0 to 3,000); x-axis: Word (ranked by frequency)]
Slide 4
Text Pre-Processing
Stop Word Removal: Simple techniques to remove ‘noise words’ from texts
– Remove common ‘noise’ words which contribute no information to the IR process (e.g. “the”)
Stemming: Remove irrelevant differences between different ‘versions’ of the same word
– Identify different forms of the same word (e.g. “run” and “ran”) and map them to a common stem
(Later) Exploit semantic relationships between words
– If two words have the same meaning, treat them as the same word
Slide 5
Stemming (morphology)
Basic idea: If a query and document contain different forms of the same word, then they are related
Remove surface markings from words to reveal their basic form:
– forms → form, forming → form
– formed → form, former → form
“form” is the stem of forms, forming, formed, former
Slide 6
Stemming (morphology)
Stemming replaces tokens (words) with equivalence classes of tokens (words)
Equivalence classes are stems
– Reduces the number of different words in a corpus
– Increases the number of instances of each token
Slide 7
Stemming
Of course, not all words obey simple, regular rules:
– running → run
– runs → run
– women → woman
– leaves → leaf
– ferries → ferry
– alumni → alumnus
– data → datum
– crises → crisis
[Belew, chapter 2]
Slide 8
Stemming
Linguists distinguish between different types of morphology:
– Minor changes, such as plurals and tense
– Major changes, e.g. incentive → incentivize, which change the grammatical category of a word
A common solution is to identify sub-patterns of letters within words and devise rules for dealing with these patterns
Slide 9
Stemming
Example rules [Belew, p 45]
– (.*)SSES → \1SS
– Any string ending SSES is stemmed by replacing SSES with SS
– E.g. “classes” → “class”
– (.*[AEIOU].*)ED → \1
– Any string containing a vowel and ending in ED is stemmed by removing the ED
– E.g. “classed” → “class”
Slide 10
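Rewrite rules of this kind map directly onto regular-expression machinery. The sketch below applies the two rules above using POSIX regex.h; the function names, buffer sizes and uppercase test words are illustrative choices, not part of the lecture or of any particular stemmer.

/* Illustrative sketch of the two stemming rules above using POSIX regexes */
#include <stdio.h>
#include <string.h>
#include <regex.h>

/* Rule (.*)SSES -> \1SS: replace a trailing SSES with SS */
static int rule_sses(char *word)
{
    regex_t re;
    regmatch_t m[2];
    int fired = 0;

    regcomp(&re, "^(.*)SSES$", REG_EXTENDED);
    if (regexec(&re, word, 2, m, 0) == 0) {
        word[m[1].rm_eo] = '\0';   /* keep the captured prefix */
        strcat(word, "SS");
        fired = 1;
    }
    regfree(&re);
    return fired;
}

/* Rule (.*[AEIOU].*)ED -> \1: strip a trailing ED if the rest has a vowel */
static int rule_ed(char *word)
{
    regex_t re;
    regmatch_t m[2];
    int fired = 0;

    regcomp(&re, "^(.*[AEIOU].*)ED$", REG_EXTENDED);
    if (regexec(&re, word, 2, m, 0) == 0) {
        word[m[1].rm_eo] = '\0';   /* drop the ED suffix */
        fired = 1;
    }
    regfree(&re);
    return fired;
}

int main(void)
{
    char w1[32] = "CLASSES", w2[32] = "CLASSED";
    rule_sses(w1);                 /* CLASSES -> CLASS */
    rule_ed(w2);                   /* CLASSED -> CLASS */
    printf("%s %s\n", w1, w2);     /* prints: CLASS CLASS */
    return 0;
}

A full stemmer such as Porter’s chains dozens of rules like these, applying them in a fixed order with conditions on what remains of the word.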
Stemmers
A stemmer is a piece of software which implements a stemming algorithm
The Porter stemmer is a standard stemmer which is available as a free download (see Canvas)
The Porter stemmer implements a set of about 60 rules
Use of a stemmer typically reduces vocabulary size by 10% to 50%
Slide 11
Example
Apply the Porter stemmer to the ‘Jane Eyre’ and ‘Alice in Wonderland’ texts
[Figure: bar chart of the number of different words in each text before and after stemming; stemming reduces the vocabulary of one text by 34% and of the other by 22%]
Slide 12
Example
Examples of results of Porter stemmer:
– form → form
– former → former
– formed → form
– forming → form
– formal → formal
– formality → formal
– formalism → formal
– formica → formica
– formic → formic
– formant → formant
– format → format
– formation → format
Slide 13
Example: First paragraph from ‘Alice in Wonderland’
Before
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’
After
alic wa begin to get veri tire of sit by her sister on the bank, and of have noth to do: onc or twice she had peep into the book her sister wa read, but it had no pictur or convers in it, ‘and what is the us of a book,’ thought alic ‘without pictur or convers?’
Slide 14
Noise Words – “Stop words”
There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further out-door exercise was now out of the question.
Noise words
– Vital for the grammatical structure of a text
– Of little use in the ‘bundle of words’ approach to identifying what a text is “about”
Slide 15
Stop Lists
In Information Retrieval, these words are often referred to as Stop Words
Rather than detecting stop words using rules, stop words are simply specified to the system in a text file: the Stop List
Stop Lists typically consist of the most common words from some large corpus
There are lots of candidate stop lists online
Slide 16
Example 1: Short Stop List (50 wds)
the, it, not, of, with, are, and, as, but, to, his, from, a, on, or, her, who, all, will, she, more, there, if, would, out, their, so, in, be, have, that, at, an, we, is, by, they, him, was, i, which, been, he, this, you, has, for, had, were, when
Slide 17
Example 2: 300 Word Stop List
the, on, of, be, and, at, to, by, a, i, in, this, that, had, is, not, was, are, he, but, for, from, it, or, with, have, as, an, his, they, one, more, you, no, were, if, her, out, all, so, she, said, there, what, would, up, their, its, we, about, him, into, been, than, has, them, when, can, who, only, which, will, other, ...
held, keep, sure, probably, free, real, seems, behind, cannot, miss, political, air, question, making, office, brought, whose, special, heard, major, problems, ago, became, federal, moment, study, available, known, result, street, economic, boy
(the 300 most common words from the Brown Corpus)
Slide 18
The text matters
Alice vs Brown: Most Frequent Words
[Table: the most frequent words in ‘Alice in Wonderland’ compared with the Brown Corpus, in descending order of frequency. Alice: the, and, to, a, she, it, of, said, i, alice, in, you, was, that, ... Brown: the, of, and, to, a, in, that, is, was, he, for, it, with, as, ...]
Slide 19
stop.c
C program on course Canvas page
– Reads in a stop list file (text file, one word per line)
– Stores the stop words in char **stopList
– Reads the text file one word at a time
– Compares each word with each stop word
– Prints out the words not in the stop list
Usage: stop stopListFile textFile > opFile
Slide 20
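The program on Canvas is the reference implementation; the following is only a minimal sketch of the behaviour described above, with fixed-size buffers and no error recovery, to show the shape of the algorithm.

/* Minimal sketch of a stop-word filter in the style described above */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORDS 1000   /* maximum stop-list size (sketch simplification) */
#define MAX_LEN   64     /* maximum word length */

int main(int argc, char *argv[])
{
    char *stopList[MAX_WORDS];
    char word[MAX_LEN];
    int nStop = 0, i, isStop;
    FILE *fp;

    if (argc != 3) {
        fprintf(stderr, "usage: %s stopListFile textFile\n", argv[0]);
        return 1;
    }

    /* Read the stop list: one word per line */
    fp = fopen(argv[1], "r");
    if (fp == NULL) { perror(argv[1]); return 1; }
    while (nStop < MAX_WORDS && fscanf(fp, "%63s", word) == 1)
        stopList[nStop++] = strdup(word);
    fclose(fp);

    /* Read the text one word at a time; print words not in the stop list */
    fp = fopen(argv[2], "r");
    if (fp == NULL) { perror(argv[2]); return 1; }
    while (fscanf(fp, "%63s", word) == 1) {
        isStop = 0;
        for (i = 0; i < nStop; i++)
            if (strcmp(word, stopList[i]) == 0) { isStop = 1; break; }
        if (!isStop)
            printf("%s\n", word);
    }
    fclose(fp);
    return 0;
}

The linear scan costs one comparison per stop word for every token; a hash table or binary search over a sorted stop list would be the natural improvement for large stop lists.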
Examples
Original first paragraph
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’
Stop list 50 removed
alice beginning get very tired sitting sister bank having nothing do once twice peeped into book sister reading no pictures conversations what use book thought alice without pictures conversation
Stop list Brown removed
alice beginning tired sitting sister bank twice peeped book sister reading pictures conversations book alice pictures conversation
Slide 21
Matching
Given a query q and a document d we want to define a number:
Sim(q,d)
which defines the similarity between q and d
Given the query q we will then return the documents d1, d2, …, dN such that:
– d1 is the document for which Sim(q,d) is biggest
– d2 has the next biggest value of Sim(q,d)
– etc.
Slide 22
Similarity
The similarity between q and d will depend on the number of terms which are common to q and d
But we also need to know how useful each common term is for discriminating between different documents.
For example,
– It is probably not significant if q and d share “the”
– But it probably is significant if they share “magnesium”
Slide 23
IDF weighting
One commonly used measure of the significance of a term for discriminating between documents is the Inverse Document Frequency (IDF)
For a token t define:
$$IDF_t = \log\frac{N_D}{N_{D_t}}$$
where $N_D$ is the total number of documents in the corpus and $N_{D_t}$ is the number of those documents that include t
Slide 24
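As a sketch only (the function name is mine, and the base of the logarithm is a free choice that merely rescales all weights), the definition translates directly into C:

#include <math.h>

/* Inverse Document Frequency of a token t (illustrative sketch).
 * nDocs      = ND,  total number of documents in the corpus
 * nDocsWithT = NDt, number of those documents that include t */
double idf(int nDocs, int nDocsWithT)
{
    if (nDocsWithT == 0)
        return 0.0;   /* t occurs in no document; give it zero weight */
    return log((double)nDocs / (double)nDocsWithT);
}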
Why is IDF weighting useful?
$$IDF_t = \log\frac{N_D}{N_{D_t}}$$
Case 1: t occurs equally often in all documents
– $N_D = N_{D_t}$, hence $IDF_t = 0$
Case 2: t occurs in just a few documents
– $N_D > N_{D_t}$, hence $IDF_t > 0$
Note that $IDF_t$ ignores how often the term t occurs within a document
Slide 25
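A quick worked example with invented numbers: in a corpus of $N_D = 1000$ documents, a term occurring in all 1000 documents gets $IDF_t = \log(1000/1000) = 0$, while a term occurring in only 10 documents gets $IDF_t = \log(1000/10) = \log 100$ (= 2 with base-10 logarithms), so the rarer term carries far more discriminating weight.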
Effect of Document Length
Suppose query q consists only of term t
Suppose document d1 also consists only of t
– Number of shared terms is 1
– Match is ‘perfect’
Suppose document d2 has 100 terms, including t
– Number of shared terms is 1
– But in this case co-occurrence of t appears less significant
Intuitively the similarity measure Sim(q,d) needs to include normalisation by some function of the sizes of q and d
Slide 26
TF-IDF weight
Let t be a term and d a document
TF-IDF: Term Frequency – Inverse Document Frequency
The TF-IDF weight $w_{td}$ of term t for document d is:
$$w_{td} = f_{td} \times IDF_t$$
where $f_{td}$ is the term frequency – the number of times t occurs in d
Slide 27
TF-IDF weight (continued)
$$w_{td} = f_{td} \times IDF_t$$
For $w_{td}$ to be large:
– $f_{td}$ must be large, so t must occur often in d
– $IDF_t$ must be large, so t must only occur in relatively few documents
Slide 28
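Combining the two factors is then a one-liner; this sketch reuses the hypothetical idf() shown earlier:

/* TF-IDF weight w_td = f_td * IDF_t (illustrative sketch).
 * f_td is the number of times term t occurs in document d. */
double tfidf_weight(int f_td, int nDocs, int nDocsWithT)
{
    return f_td * idf(nDocs, nDocsWithT);
}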
Query weights
Now suppose t is a term and q is a query
If q is a long query, we can treat q as a document:
$$w_{tq} = f_{tq} \times IDF_t$$
where $f_{tq}$ is the (query) term frequency – the number of times the term t occurs in the query q
If q is a short query, define the TF-IDF weight as $w_{tq} = IDF_t$
Slide 29
TF-IDF Similarity
Define the similarity between query q and document d as:
$$Sim(q,d) = \frac{\sum_{t \in q \cap d} w_{tq}\, w_{td}}{|q|\,|d|}$$
where the sum is over all terms t in both q and d, $|q|$ is the ‘length’ of the query q and $|d|$ is the ‘length’ of the document d
Slide 30
Document length
Suppose d is a document
For each term t in d we can define the TF-IDF weight $w_{td}$
The length of document d is defined by:
$$Len(d) = \sqrt{\sum_{t \in d} w_{td}^2}$$
Slide 31
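To make the three definitions concrete, here is a minimal sketch computing Sim(q,d), assuming the query and document have already been reduced to term-frequency vectors over a shared vocabulary; the data layout and function name are my assumptions, not part of the lecture.

#include <math.h>

/* TF-IDF similarity between query q and document d (illustrative sketch).
 * f_q[t], f_d[t] = term frequencies over a shared vocabulary of V terms
 * docFreq[t]     = number of corpus documents containing term t (NDt)
 * nDocs          = total number of documents in the corpus (ND)
 * Uses the long-query weight w_tq = f_tq * IDF_t. */
double tfidf_sim(const int *f_q, const int *f_d, const int *docFreq,
                 int V, int nDocs)
{
    double dot = 0.0, lenQ = 0.0, lenD = 0.0;
    int t;

    for (t = 0; t < V; t++) {
        double idf = docFreq[t] > 0 ? log((double)nDocs / docFreq[t]) : 0.0;
        double wq = f_q[t] * idf;    /* w_tq */
        double wd = f_d[t] * idf;    /* w_td */
        dot  += wq * wd;             /* non-zero only for shared terms */
        lenQ += wq * wq;             /* accumulates Len(q)^2 */
        lenD += wd * wd;             /* accumulates Len(d)^2 */
    }
    if (lenQ == 0.0 || lenD == 0.0)
        return 0.0;                  /* no weighted terms to compare */
    return dot / (sqrt(lenQ) * sqrt(lenD));
}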
Comments on Document Length
This definition of Len(d) may not seem very intuitive at first
It will become more intuitive when we study vector representations of documents and Latent Semantic Indexing (LSI)
For now, just remember that if $x = (x_1, x_2, x_3)$ is a vector in 3-dimensional space, then the length of x is given by:
$$|x| = \sqrt{x_1^2 + x_2^2 + x_3^2}$$
Slide 32
Summary
Understand definition and use of Stop Lists
Understand motivation and methods of Stemming
Understand how to calculate the TF-IDF Similarity between two documents
Slide 33