
IR H/M Course
Recall: Best-Match Algorithm (Co-ordinated Search)
For each document I, Score(I) = 0
For each query term:
    Search the vocabulary list
    Pull out the postings list
    For each document J in the postings list:
        Score(J) = Score(J) + 1
Known as Best-Match approach or Co-ordinated Search
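A minimal Python sketch of this co-ordinated search, assuming a toy in-memory inverted index (a dict mapping each term to a postings list of document IDs); the index contents below are invented for illustration.

# Best-Match (co-ordinated search) over a toy inverted index.
from collections import defaultdict

def best_match(query_terms, inverted_index):
    """Score each document by the number of query terms it contains."""
    scores = defaultdict(int)                    # Score(I) = 0 until a term matches
    for term in query_terms:                     # for each query term
        postings = inverted_index.get(term, [])  # vocabulary lookup -> postings list
        for doc_id in postings:                  # for each document J in the list
            scores[doc_id] += 1                  # Score(J) = Score(J) + 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical example: document 2 contains both query terms, so it ranks first.
index = {"retrieval": [1, 2], "model": [2, 3], "ranking": [3]}
print(best_match(["retrieval", "model"], index))    # [(2, 2), (1, 1), (3, 1)]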
Contrasting Assumptions
• A term is present in a document or not – 0 or 1 (A binary flag)
• It doesn’t consider the degree of association between a term and a document
– How much does a document talk ‘about’ the ‘term’?
• The aboutness notion
– This captures the semantics
– If a term appears often in a document, then the document is likely to be about that ‘term’
– Indexing should capture this information
The basic Best-Match approach, in contrast, just counts how many terms a document and a query have in common.

Text Statistics
• Huge variety of words used in text but:
• Many statistical characteristics of word occurrences are predictable
– e.g., distribution of word counts
• Retrieval models and ranking algorithms depend heavily on statistical properties of words
– Document representation: Transforming raw text of documents into a form that represents their meaning
– Determination of a word’s semantic utility based on its statistical properties
– How can we find meaning in text? What does the distribution of frequency occurrences tell us about the pattern of their use?
• e.g., important words occur often in documents but are not high frequency in the collection
Plotting Word Frequency by Rank
• Main idea: Count (Frequency)
– How many times tokens occur in the text
• Over all documents in the collection
• Now rank these according to how often they occur.
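A short sketch of this counting step using Python's collections.Counter; the two example documents are made up.

# Count token occurrences over a small collection and rank terms by frequency.
from collections import Counter

docs = [
    "the system is a knowledge base system",
    "the expert system uses a rule base",
]

counts = Counter()
for doc in docs:
    counts.update(doc.lower().split())    # naive whitespace tokenisation

# most_common() returns (term, frequency) pairs in decreasing order of frequency
for rank, (term, freq) in enumerate(counts.most_common(), start=1):
    print(rank, term, freq)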

Most and Least Frequent Terms
[Table: rank and frequency of the most frequent stemmed terms (system, knowledg, base, problem, abstract, model, languag, implem, reason, inform, expert, analysi, rule, program, oper, evalu, comput, case, gener, form) and of rare terms (enhanc, energi, emphasi, detect, desir, critic, content, consider, concern, compon, compar, commerci, clause, aspect, area, aim, affect); the rank and frequency values are not reproduced here.]
The Corresponding Curve
[Figure: frequency plotted against rank for the same terms.]

Zipf’s Law
• Distribution of word frequencies is very skewed
– A few words occur very often, many words hardly ever
– e.g., two most common words (“the”, “of”) make up about 10% of all word occurrences in English documents
• Zipf’s“law”:
– Observation that rank (r) of a word times its frequency (f) is approximately a constant (k)
• Assuming words are ranked in order of decreasing frequency
– i.e., r · f ≈ k, or r · Pr ≈ A, where Pr is the probability of word occurrence and A ≈ 0.1 for English
Zipf’s Law
• A law of the form y = k·x^c is called a power law.
• Zipf’s law is a power law with c = −1
• On a log-log plot, power laws give a straight line with slope c:
  log(y) = log(k·x^c) = log(k) + c·log(x)
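A small sketch that estimates the slope c from frequency data: if the data follow Zipf's law, a least-squares fit of log(frequency) against log(rank) should give a slope near −1. The frequency list here is illustrative only, not taken from a real collection.

import math

freqs = [1000, 480, 320, 250, 200, 160, 140, 120, 110, 100]   # already in rank order
xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
ys = [math.log(freq) for freq in freqs]

# Least-squares estimate of c in log(y) = log(k) + c * log(x)
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
c = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
print("estimated slope c =", round(c, 2))    # close to -1 for Zipf-like data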

Zipf’s Distribution
• The important points:
– A few elements occur very frequently
– A medium number of elements have medium frequency
– Many elements occur very infrequently
News Collection (AP89) Statistics
Total documents                 84,678
Total word occurrences          39,749,179
Vocabulary size                 198,763
Words occurring > 1000 times    4,169
Words occurring once            70,064

Word         Freq.   r         Pr(%)         r.Pr
assistant    5,095   1,021     .013          0.13
sewers       100     17,110    2.56 × 10−4   0.04
toothbrush   10      51,555    2.56 × 10−5   0.01
hazmat       1       166,945   2.56 × 10−6   0.04
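As a quick check on the r.Pr column, the same values can be recomputed from the raw counts in the table (Pr is a word's frequency divided by the total word occurrences, 39,749,179):

total_occurrences = 39_749_179
for word, freq, rank in [("assistant", 5095, 1021),
                         ("sewers", 100, 17110),
                         ("toothbrush", 10, 51555)]:
    pr = freq / total_occurrences
    print(word, round(rank * pr, 2))    # roughly constant, as Zipf's law predicts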

Top 50 Words from AP89
[Table of the 50 most frequent words in AP89, not reproduced here.]
Zipf’s Law for AP89
• Note problems at high and low frequencies
– See the Mandelbrot (1954) correction: (r + β)^α · f = γ, where β, α, and γ are parameters
Zipf’s law is quite accurate except at very high and very low ranks.

Zipf’s Law
• What is the proportion of words with a given frequency?
– A word that occurs n times has rank r_n = k/n
– The number of words with frequency n is
• r_n − r_(n+1) = k/n − k/(n+1) = k/(n(n+1))
– The proportion can be found by dividing by the total number of words = the highest rank = k/1 = k
– So, the proportion of words with frequency n is 1/(n(n+1))
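A one-line check of the predicted proportions (these follow directly from the formula, not from any collection):

# Proportion of distinct words predicted to occur exactly n times: 1 / (n * (n + 1))
for n in range(1, 6):
    print(n, round(1 / (n * (n + 1)), 3))
# n = 1 -> 0.5: about half of all distinct words are predicted to occur only once.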
Zipf’s Law
• Example word frequency ranking
• To compute the number of words with frequency 5,099: take the rank of “chemical” minus the rank of “summit”
– 1006 − 1002 = 4

• Proportions of words occurring n times in 336,310 TREC documents
• Vocabulary size is 508,209
Figure from: Newman, M. E. J. (2005) “Power laws, Pareto distributions and Zipf’s law.” Contemporary Physics 46:323–351.
Power laws are everywhere!

Consequences of Zipf’s Law
• There are always a few very frequent tokens that are not good discriminators. Referred to as “stopwords” in IR
– They correspond to the linguistic notion of “closed-class” words
• English examples: to, from, on, and, the, …
• Grammatical classes that don’t take on new members
– Eliminating them greatly reduces inverted-index storage costs
– Postings lists for most remaining words in the inverted index will be short since those words are rarer, making retrieval fast
• There are always a large number of tokens that occur only once
– These words do not describe the content of documents
• Medium-frequency words are the most descriptive
Vocabulary Growth
• How big is the term vocabulary?
– How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
• This determines how the size of the lexicon within the inverted index will scale with the size of the corpus.
• Vocabulary not really upper-bounded due to proper names, typos, invented words (e.g. product, company names), email addresses, etc.

Heaps’ Law
• If V is the size of the vocabulary (number of unique words) and n is the number of words in the corpus:
  V = K·n^β, with constants K and 0 < β < 1
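A minimal sketch of Heaps' law as a function; the constants K = 50 and β = 0.5 are purely illustrative choices, not values measured from any corpus.

def heaps_vocabulary(n, K=50.0, beta=0.5):
    """Predicted vocabulary size after n word occurrences (Heaps' law)."""
    return K * n ** beta

for n in (10_000, 1_000_000, 100_000_000):
    print(n, int(heaps_vocabulary(n)))    # vocabulary keeps growing, but sub-linearly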
• Let’s revisit what sorts of words are useful for representing document aboutness
Word Frequency vs. Resolving Power
[Figure (hypothetical curve): frequency of words plotted against words in rank order, with an upper cut-off and a lower cut-off; words outside the cut-offs are less discriminative or less likely to be used in queries.]
“The frequency of a word occurrence in an article furnishes a useful measurement of word significance” [Luhn 1957]

Resolving Power
• Why do some words occur more frequently, and how can such statistics be exploited when automatically measuring aboutness?
– …. the frequency of a word occurrence in an article furnishes a useful measurement of word significance [Luhn 1957]
• Two critical factors
– Word frequency within a document
– Collection frequency
Term Weighting
• More effective approaches score documents based on:
• How many query terms they contain
• How discriminative each of those terms is
• Not all terms are equally useful, so we weight them
• Weight describes/quantifies the relationship between a keyword and a document.
• Generality
– Use terms with significant weights
– Binary is a special case
– A retrieval method can exploit these weights.

• w_kd = weight of the kth word in document d
• f_kd = number of occurrences of the kth word in document d (term frequency)
• N = number of documents in the collection
• D_k = number of documents containing the kth word
How to Weight Terms
• Two demands on the weight
– The degree to which a particular document is about a topic (or a particular keyword)
• Repetition is an indication of emphasis
– Degree to which a keyword discriminates
documents in the collection
• Let us call it discrimk
wkd ” fkd !discrimk

Inverse Document Frequency (IDF)
• From a discriminating point of view:
• Queries use rather broadly defined, frequently occurring terms
• It is the more specific terms that are particularly important in identifying relevant material
idf(t_k) = log(N / D_k)
K. Sparck Jones, “A statistical interpretation of term specificity and its application in retrieval”, Journal of Documentation, 1972.
[Figure: log(N / D_k) plotted on a linear and a log scale.]
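A toy illustration of the idf formula, showing how rarer terms receive higher weights; N and the document counts are invented for the example.

import math

N = 1000    # documents in the collection
for term, Dk in [("the", 1000), ("system", 200), ("hazmat", 2)]:
    print(term, round(math.log(N / Dk), 3))
# "the" -> 0.0 (appears everywhere, no discriminating power); "hazmat" -> high idf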

TF – IDF Weighting Schemes
&N# wkd =fkd$log ! % Dk”
Problem if Dk is zero
& N +1 # ** wkd =fkd$log !
& (N’D )+0.5# Wkd = (1 +wlog (=fkd)f) $log k !
kd kd% Dk+0.5 ”
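A short sketch of the three weighting schemes shown above, using the same notation; the counts in the demo call at the bottom are made up.

import math

def w_basic(fkd, N, Dk):
    return fkd * math.log(N / Dk)                       # undefined if Dk = 0

def w_smoothed(fkd, N, Dk):
    return fkd * math.log((N + 1) / (Dk + 0.5))         # safe even when Dk = 0

def w_logtf(fkd, N, Dk):
    return (1 + math.log(fkd)) * math.log((N - Dk + 0.5) / (Dk + 0.5))

N, Dk, fkd = 1000, 20, 3
print(round(w_basic(fkd, N, Dk), 2),
      round(w_smoothed(fkd, N, Dk), 2),
      round(w_logtf(fkd, N, Dk), 2))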
Retrieval Algorithm
(with weights)
• Given what we know now about term weighting, we can update the Best-Match retrieval algorithm from before:
• Best-Match
for each document I, Score(I) = 0, I = 1 to N
for each query term t_k:
    Search the vocabulary list
    Pull out the postings list
    for each document J in the postings list:
        Score(J) = Score(J) + w_kd
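A minimal sketch of the weighted variant, assuming an index that stores a precomputed weight with each posting (term -> list of (doc_id, w_kd) pairs); the index contents are hypothetical.

from collections import defaultdict

def best_match_weighted(query_terms, weighted_index):
    """Score documents by summing the stored term weights instead of adding 1."""
    scores = defaultdict(float)                              # Score(I) = 0
    for term in query_terms:                                 # for each query term t_k
        for doc_id, w_kd in weighted_index.get(term, []):    # postings with weights
            scores[doc_id] += w_kd                           # Score(J) = Score(J) + w_kd
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical index with tf-idf style weights computed at indexing time
index = {"retrieval": [(1, 2.0), (2, 0.5)], "model": [(2, 1.75)]}
print(best_match_weighted(["retrieval", "model"], index))    # [(2, 2.25), (1, 2.0)]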
