
Data Mining and Machine Learning
Lecture 3
Stopping, Stemming & TF-IDF Similarity
Peter Jančovič Slide 1

Objectives
 Understand definition and use of Stop Lists
 Understand motivation and methods of Stemming
 Understand how to calculate the TF-IDF Similarity between two documents
Slide 2

Zipf’s Law
[Figure: word frequency (y-axis, 0 to 20,000) plotted against word rank (x-axis), showing actual statistics from “Jane Eyre”. Frequency falls off steeply with rank, as Zipf’s law predicts.]
Slide 3

‘Resolving Power’ of words
[Figure: the Zipf’s law frequency curve overlaid with the ‘resolving power’ curve of words, with an upper and a lower cutoff marked on word rank. Words above the upper cutoff are too common to discriminate between documents; words below the lower cutoff are too rare. Axes: frequency (y) against word rank (x).]
Slide 4

Text Pre-Processing
 Stop word removal: simple techniques to remove ‘noise words’ from texts
– Remove common ‘noise’ words which contribute no information to the IR process (e.g. “the”)
 Stemming: remove irrelevant differences between different ‘versions’ of the same word
– Identify different forms of the same word (e.g. “run” and “ran”) and map them to a common stem
 (Later) Exploit semantic relationships between words
– If two words have the same meaning, treat them as the same word
Slide 5

Stemming (morphology)
 Basic idea: if a query and a document contain different forms of the same word, then they are related
 Remove surface markings from words to reveal their basic form:
– forms → form, forming → form
– formed → form, former → form
 “form” is the stem of forms, forming, formed and former
Slide 6

Stemming (morphology)
 Stemming replaces tokens (words) with equivalence classes of tokens (words)
 Equivalence classes are stems
– Reduces the number of different words in a corpus
– Increases the number of instances of each token
Slide 7

Stemming
 Of course, not all words obey simple, regular rules:
– running → run
– runs → run
– women → woman
– leaves → leaf
– ferries → ferry
– alumni → alumnus
– data → datum
– crises → crisis
[Belew, chapter 2]
Slide 8

Stemming
 Linguists distinguish between different types of morphology:
– Minor changes, such as plurals and tense
– Major changes, e.g. incentive → incentivize, which change the grammatical category of a word
 A common solution is to identify sub-patterns of letters within words and devise rules for dealing with these patterns
Slide 9

Stemming
 Example rules [Belew, p 45]:
– (.*)SSES → \1SS
– Any string ending in SSES is stemmed by replacing SSES with SS
– E.g. “classes” → “class”
– (.*[AEIOU].*)ED → \1
– Any string containing a vowel and ending in ED is stemmed by removing the ED
– E.g. “classed” → “class”
Slide 10
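As a concrete illustration of rule-based stemming (this is not the actual Porter code; the function names are made up), the two rules above can be coded directly in C:

```c
#include <stdio.h>
#include <string.h>

/* Returns non-zero if c is an uppercase vowel. */
static int is_vowel(char c) {
    return strchr("AEIOU", c) != NULL;
}

/* Apply the two example rules, in place, to an uppercase word:
 *   (.*)SSES        -> \1SS   e.g. CLASSES -> CLASS
 *   (.*[AEIOU].*)ED -> \1     e.g. CLASSED -> CLASS      */
static void stem(char *w) {
    size_t n = strlen(w);
    if (n >= 4 && strcmp(w + n - 4, "SSES") == 0) {
        w[n - 2] = '\0';                      /* drop the final ES */
        return;
    }
    if (n > 2 && strcmp(w + n - 2, "ED") == 0) {
        for (size_t i = 0; i + 2 < n; i++) {  /* vowel before the ED? */
            if (is_vowel(w[i])) {
                w[n - 2] = '\0';              /* drop the ED */
                return;
            }
        }
    }
}

int main(void) {
    char a[] = "CLASSES", b[] = "CLASSED";
    stem(a);
    stem(b);
    printf("%s %s\n", a, b);                  /* prints: CLASS CLASS */
    return 0;
}
```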

Stemmers
 A stemmer is a piece of software which implements a stemming algorithm
 The Porter stemmer is a standard stemmer which is available as a free download (see Canvas)
 The Porter stemmer implements a set of about 60 rules
 Use of a stemmer typically reduces vocabulary size by 10% to 50%
Slide 11

Example
 Apply the Porter stemmer to the ‘Jane Eyre’ and ‘Alice in Wonderland’ texts
[Figure: bar chart of the number of different words in each text before and after stemming; stemming reduces the two vocabularies by 34% and 22%.]
Slide 12

Example
 Examples of results of the Porter stemmer:
– form → form
– former → former
– formed → form
– forming → form
– formal → formal
– formality → formal
– formalism → formal
– formica → formica
– formic → formic
– formant → formant
– format → format
– formation → format
Slide 13

Example: First paragraph from ‘Alice in Wonderland’
Before
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’
After
alic wa begin to get veri tire of sit by her sister on the bank, and of have noth to do: onc or twice she had peep into the book her sister wa read, but it had no pictur or convers in it, ‘and what is the us of a book,’ thought alic ‘without pictur or convers?’
Slide 14

Noise Words – “Stop words”
There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further outdoor exercise was now out of the question.
 Noise words
– Vital for the grammatical structure of a text
– Of little use in the ‘bundle of words’ approach to identifying what a text is “about”
Slide 15

Stop Lists
 In Information Retrieval, these words are often referred to as Stop Words
 Rather than detecting stop words using rules, stop words are simply specified to the system in a text file: the Stop List
 Stop Lists typically consist of the most common words from some large corpus
 There are lots of candidate stop lists online
Slide 16

Example 1: Short Stop List (50 words)
the, it, not, of, with, are, and, as, but, to, his, from, a, on, or, her, who, all, will, she, more, there, if, would, out, their, so, in, be, have, that, at, an, we, is, by, they, him, was, i, which, been, he, this, you, has, for, had, were, when
Slide 17

Example 2: 300-Word Stop List
the, on, of, be, and, at, to, by, a, i, in, this, that, had, is, not, was, are, he, but, for, from, it, or, with, have, as, an, his, they, one, more, you, no, were, if, her, out, all, so, she, said, there, what, would, up, their, its, we, about, him, into, been, than, has, them, when, can, who, only, which, will, other, ...
held, keep, sure, probably, free, real, seems, behind, cannot, miss, political, air, question, making, office, brought, whose, special, heard, major, problems, ago, became, federal, moment, study, available, known, result, street, economic, boy
(the 300 most common words from the Brown Corpus)
Slide 18

The text matters
[Table: the most frequent words in ‘Alice in Wonderland’ alongside the most frequent words in the Brown Corpus. Both lists are headed by the same function words (the, and, to, a, of, ...), but the Alice list also contains text-specific content words such as “alice”, “said”, “thought”, “know”, “little” and “herself” near the top.]
Slide 19

stop.c
 C program on the course Canvas page:
– Reads in a stop list file (text file, one word per line)
– Stores the stop words in char **stopList
– Reads the text file one word at a time
– Compares each word with each stop word
– Prints out words not in the stop list
stop stopListFile textFile > opFile
Slide 20
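The actual stop.c is on Canvas; the sketch below is a minimal reimplementation written from the bullet points above rather than from the real source:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD 256

/* Usage: stop stopListFile textFile > opFile */
int main(int argc, char *argv[]) {
    char w[MAX_WORD];

    if (argc != 3) {
        fprintf(stderr, "usage: %s stopListFile textFile\n", argv[0]);
        return 1;
    }

    /* Read the stop list (one word per line) into char **stopList */
    FILE *fp = fopen(argv[1], "r");
    if (!fp) { perror(argv[1]); return 1; }
    char **stopList = NULL;
    size_t nStop = 0;
    while (fscanf(fp, "%255s", w) == 1) {
        stopList = realloc(stopList, (nStop + 1) * sizeof *stopList);
        stopList[nStop] = malloc(strlen(w) + 1);
        strcpy(stopList[nStop], w);
        nStop++;
    }
    fclose(fp);

    /* Read the text one word at a time; print words not in the stop list */
    fp = fopen(argv[2], "r");
    if (!fp) { perror(argv[2]); return 1; }
    while (fscanf(fp, "%255s", w) == 1) {
        int isStop = 0;
        for (size_t i = 0; i < nStop && !isStop; i++)
            if (strcmp(w, stopList[i]) == 0)
                isStop = 1;
        if (!isStop)
            printf("%s\n", w);
    }
    fclose(fp);
    return 0;
}
```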

Examples
Original first paragraph
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’
Stop list 50 removed
alice beginning get very tired sitting sister bank having nothing do once twice peeped into book sister reading no pictures conversations what use book thought alice without pictures conversation
Stop list Brown removed
alice beginning tired sitting sister bank twice peeped book sister reading pictures conversations book alice pictures conversation
Slide 21

Matching
 Given a query q and a document d we want to define a number:
Sim(q,d)
which measures the similarity between q and d
 Given the query q we will then return the documents d1, d2, …, dN such that:
– d1 is the document for which Sim(q,d) is biggest
– d2 has the next biggest value of Sim(q,d)
– etc.
Slide 22
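A minimal sketch of the ranking step in C, assuming Sim(q,d) has already been computed for every document (the Scored type and function names are illustrative, not from the course code):

```c
#include <stdlib.h>

/* A document id paired with its similarity to the query. */
typedef struct { int docId; double sim; } Scored;

/* qsort comparator: descending by similarity. */
static int bySimDesc(const void *a, const void *b) {
    double sa = ((const Scored *)a)->sim;
    double sb = ((const Scored *)b)->sim;
    return (sa < sb) - (sa > sb);
}

/* After ranking, results[0] is d1, the document with the largest Sim(q,d). */
void rank(Scored *results, int nDocs) {
    qsort(results, (size_t)nDocs, sizeof *results, bySimDesc);
}
```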

Similarity
 The similarity between q and d will depend on the number of terms which are common to q and d
 But we also need to know how useful each common term is for discriminating between different documents
 For example:
– It is probably not significant if q and d share “the”
– But it probably is significant if they share “magnesium”
Slide 23

IDF weighting
 One commonly used measure of the significance of a term for discriminating between documents is the Inverse Document Frequency (IDF)
 For a token t, define:
$$\mathrm{IDF}(t) = \log\!\left(\frac{N_D}{N_{D_t}}\right)$$
 ND is the total number of documents in the corpus
 NDt is the number of those documents that include t
Slide 24
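In C this is a one-liner (the function name and the use of the natural logarithm are my choices; the slide does not fix the base of the log):

```c
#include <math.h>

/* Inverse document frequency.
 * nDocs         = ND,  the total number of documents in the corpus
 * nDocsWithTerm = NDt, the number of those documents that include t */
double idf(int nDocs, int nDocsWithTerm) {
    if (nDocsWithTerm == 0)
        return 0.0;   /* unseen term: no evidence either way */
    return log((double)nDocs / nDocsWithTerm);
}
```

For example, with ND = 1000 and NDt = 10, IDF(t) = log(100); a term occurring in every document gives log(1) = 0, which is exactly the case analysis on the next slide.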

Why is IDF weighting useful?
IDFt log ND   ND 
t
 Case 1: t occurs equally often in all documents – ND = NDt,
– hence IDF(t) = 0
 Case 2: t occurs in just a few documents – ND>NDt
– hence IDF(t) > 0
 Note that IDF(t) ignores how often term t occurs in
a document Slide 25
Data Mining and Machine Learning

Effect of Document Length
 Suppose query q consists only of term t
 Suppose document d1 also consists only of t
– Number of shared terms is 1
– Match is ‘perfect’
 Suppose document d2 has 100 terms, including t
– Number of shared terms is still 1
– But in this case the co-occurrence of t appears less significant
 Intuitively, the similarity measure Sim(q,d) needs to include normalisation by some function of the lengths of q and d
Slide 26

TF-IDF weight
 Let t be a term and d a document
 TF-IDF – Term Frequency – Inverse Document Frequency
 The TF-IDF weight wtd of term t for document d is:
$$w_{td} = f_{td} \times \mathrm{IDF}(t)$$
where ftd = term frequency, the number of times t occurs in d
Slide 27
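Reusing the idf() sketch above, the weight is simply (illustrative names):

```c
/* TF-IDF weight w_td of a term for a document.
 * ftd = number of times the term occurs in the document. */
double tfidf(int ftd, int nDocs, int nDocsWithTerm) {
    return ftd * idf(nDocs, nDocsWithTerm);
}
```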

TF-IDF weight (continued)
w f IDFt td td
 For wtd to be large:
– ftd must be large, so t must occur often in d
– IDF(t) must be large, so t must only occur in relatively few documents
Slide 28
Data Mining and Machine Learning

Query weights
 Now suppose t is a term and q is a query.
 If q is a long query, we can treat q as a document:
$$w_{tq} = f_{tq} \times \mathrm{IDF}(t)$$
where ftq is the (query) term frequency, the number of times the term t occurs in the query q
 If q is a short query, define the TF-IDF weight as:
$$w_{tq} = \mathrm{IDF}(t)$$
Slide 29

TF-IDF Similarity
 Define the similarity between query q and document d as:
$$\mathrm{Sim}(q,d) = \frac{\sum_{t \in q \cap d} w_{td}\, w_{tq}}{\mathrm{Len}(d)\, \mathrm{Len}(q)}$$
where the sum runs over all terms t in both q and d, and Len(q) and Len(d) are the ‘lengths’ of the query and document, defined on the next slide
Slide 30

Document length
 Suppose d is a document
 For each term t in d we can define the TF-IDF weight wtd
 The length of document d is defined by:
$$\mathrm{Len}(d) = \sqrt{\sum_{t \in d} w_{td}^2}$$
Slide 31
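Putting the last two slides together, here is a minimal sketch in C, assuming the TF-IDF weights are stored in dense arrays indexed by vocabulary term (real systems would use a sparse representation; all names are illustrative):

```c
#include <math.h>

/* 'Length': square root of the sum of squared TF-IDF weights. */
double length(const double *w, int vocabSize) {
    double sumSq = 0.0;
    for (int i = 0; i < vocabSize; i++)
        sumSq += w[i] * w[i];
    return sqrt(sumSq);
}

/* TF-IDF similarity Sim(q,d). wq[i] and wd[i] are the weights of
 * vocabulary term i in the query and document (0 when the term is
 * absent), so wq[i]*wd[i] is non-zero only for shared terms.       */
double sim(const double *wq, const double *wd, int vocabSize) {
    double dot = 0.0;
    for (int i = 0; i < vocabSize; i++)
        dot += wq[i] * wd[i];
    double lq = length(wq, vocabSize);
    double ld = length(wd, vocabSize);
    if (lq == 0.0 || ld == 0.0)
        return 0.0;   /* guard against an empty query or document */
    return dot / (ld * lq);
}
```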

Comments on Document Length
 This definition of Len(d) may not seem very intuitive at first
 It will become more intuitive when we study vector representations of documents and Latent Semantic Indexing (LSI)
 For now, just remember that if x = (x1, x2, x3) is a vector in 3-dimensional space, then the length of x is given by:
$$\|x\| = \sqrt{x_1^2 + x_2^2 + x_3^2}$$
Slide 32

Summary
 Understand definition and use of Stop Lists
 Understand motivation and methods of Stemming
 Understand how to calculate the TF-IDF Similarity between two documents
Slide 33