CIS 455/555: Internet and Web Systems
Information Retrieval November 11, 2020
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
Announcements

- H2M2 and H2M1 regrades due
- Homework 3 out
  - Again, start early!
- Final project will be released soon
  - Time to start thinking about project groups!
    - Mostly teams of four
    - Possibly up to 3 teams of five
    - https://piazza.com/class/kdpept1jdhr3yg?cid=5
    - Questions to ask: Time zone? Goals? Performance on HWs so far?
  - Project plan will be due on Nov. 23
Plan for today

- Web services
- Information retrieval
  - Basics  ← NEXT
  - Precision and recall
  - Taxonomy of IR models
- Classic IR models
  - Boolean model
  - Vector model
  - TF/IDF
Basic model

[Figure: a user's information need is expressed as a query; each document is
represented by its index terms; matching the query against the index terms
produces a ranking of the documents.]
Information Retrieval as a field

- IR has addressed many issues over the last 30 years:
  - Classification and categorization of documents
  - Systems and languages for searching
  - User interfaces and visualization of results
- The area was long seen as of narrow interest – mainly to libraries
- And then – the advent of the web:
  - A universal "library"
  - Free (low-cost) universal access
  - No central editorial board
  - Many problems in finding information: IR is seen as key to finding the solutions!
The full Information Retrieval process

[Figure: pipeline. A crawler / data-access layer fetches documents (from the
web or a DB); text processing and modeling produces a logical view of each
document; indexing builds an inverted index. On the query side, the user's
interest becomes a query, refined by query operations and user feedback;
searching the inverted index yields retrieved docs, which ranking turns into
ranked docs displayed in the browser / UI.]
Precision and recall

- How good is our IR system?
- Two common metrics:
  - Precision: What fraction of the returned documents is relevant?
  - Recall: What fraction of the relevant documents is returned?

[Figure: precision (p) vs. recall (r) plot – the ideal system achieves both;
typical systems trade one against the other, and better systems push the
curve outward.]

- How can you build trivial systems that optimize one of them?
- Tradeoff: Increasing precision will usually lower recall, and vice versa
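The two metrics above can be sketched directly from their definitions. This is a minimal illustration; the document IDs in `retrieved` and `relevant` are hypothetical:

```python
# Precision: fraction of returned documents that are relevant.
# Recall: fraction of relevant documents that are returned.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# A system returning every document trivially gets recall = 1.0; one that
# returns a single known-relevant document trivially gets precision = 1.0.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={3, 4, 5, 6, 7, 8})
print(p, r)  # 0.5 0.3333333333333333
```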
What is a meaningful result?

- Matching keywords is quite imprecise
  - Users are frequently dissatisfied
- One problem: users are generally poor at formulating queries
  - Frequent dissatisfaction of Web users (who often give single-keyword queries)
- Deciding relevance is a critical issue for IR systems: ranking
  - Show more relevant documents first
  - May leave out documents with low relevance
Rankings

- A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query
- A ranking is based on fundamental premises regarding the notion of relevance, such as:
  - common sets of index terms
  - sharing of weighted terms
  - likelihood of relevance
- Each set of premises leads to a distinct IR model
Types of IR Models

[Diagram: taxonomy of IR models, organized by user task.]

- User task: Retrieval (ad hoc, filtering)
  - Classic Models: boolean, vector, probabilistic
    - Set Theoretic: Fuzzy, Extended Boolean
    - Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
    - Probabilistic: Inference Network, Belief Network
  - Structured Models: Non-Overlapping Lists, Proximal Nodes
- User task: Browsing
  - Flat, Structure Guided, Hypertext
Classic IR models – Basic concepts

- Each document is represented by a set of representative keywords or index terms
- An index term is a document word useful for remembering the document's main themes
- Traditionally, index terms were nouns, because nouns have meaning by themselves
- Search engines assume that all words are index terms (full-text representation)
Classic IR Models – Weights

- Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
- The importance of the index terms is represented by weights associated with them
- Let
  - ki be an index term
  - dj be a document
  - wij be a weight associated with (ki, dj)
- The weight wij quantifies the importance of the index term for describing the document contents
Classic IR Models – Notation

  dj                        is a document
  ki                        is an index term (keyword)
  t                         is the total number of index terms
  K = (k1, k2, …, kt)       is the set of all index terms
  wij ≥ 0                   is a weight associated with (ki, dj)
  wij = 0                   indicates that the term does not belong to the doc
  gi(dj) = wij              is a function which returns the weight associated with the pair (ki, dj)
  dj = (w1j, w2j, …, wtj)   is a weighted vector associated with the document dj
Plan for today

- Web services
- Information retrieval
  - Basics
  - Precision and recall
  - Taxonomy of IR models
- Classic IR models
  - Boolean model  ← NEXT
  - Vector model
  - TF/IDF
- HITS and PageRank
Boolean model

- Simple model based on set theory
- Queries specified as boolean expressions
  - precise semantics
  - neat formalism
- Terms are either present or absent; thus, wij ∈ {0, 1}
- An example query:
  q = ka ∧ (kb ∨ ¬kc)
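The boolean model can be sketched in a few lines: a document is just the set of terms it contains, and the example query q = ka ∧ (kb ∨ ¬kc) becomes a predicate over that set. The term names are the slide's placeholders:

```python
# Boolean model: a document either matches the query or it does not.
# Encodes the example query q = ka ∧ (kb ∨ ¬kc).
def matches(doc_terms):
    return 'ka' in doc_terms and ('kb' in doc_terms or 'kc' not in doc_terms)

print(matches({'ka', 'kb', 'kc'}))  # True
print(matches({'ka', 'kc'}))        # False: contains kc but not kb
print(matches({'kb'}))              # False: ka is required
```

Note there is no notion of "how well" a document matches; the answer is strictly yes/no, which is the drawback discussed two slides below.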
Boolean model for similarity

Query:
  q = ka ∧ (kb ∨ ¬kc)

In disjunctive normal form (a disjunction of conjunctive components):
  q = (ka ∧ kb ∧ kc) ∨ (ka ∧ kb ∧ ¬kc) ∨ (ka ∧ ¬kb ∧ ¬kc)

[Figure: Venn diagram over Ka, Kb, Kc; the three conjunctive components
correspond to the binary weight vectors (1,1,1), (1,1,0), and (1,0,0).]

Similarity:
  sim(dj, q) = 1   if ∃ qcc such that (qcc ∈ qdnf) ∧ (∀ ki : gi(dj) = gi(qcc))
               0   otherwise
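The sim(dj, q) definition above amounts to checking whether the document's binary term vector equals one of the query's conjunctive components. A minimal sketch, with term order (ka, kb, kc) and the three components from the DNF on this slide:

```python
# Conjunctive components of q = ka ∧ (kb ∨ ¬kc) in DNF,
# as binary weight vectors over (ka, kb, kc).
QDNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_vector):
    # sim(dj, q) = 1 iff dj's weight vector matches some conjunctive component
    return 1 if tuple(doc_vector) in QDNF else 0

print(sim((1, 1, 0)))  # 1: matches the component ka ∧ kb ∧ ¬kc
print(sim((0, 1, 1)))  # 0: no component allows ka to be absent
```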
Drawbacks of the boolean model

- Retrieval is based on a binary decision criterion with no notion of partial matching
- No ranking of the documents is provided (absence of a grading scale)
- The information need has to be translated into a Boolean expression, which most users find awkward
- The Boolean queries formulated by users are most often too simplistic
- As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
Plan for today

- Web services
- Information retrieval
  - Basics
  - Precision and recall
  - Taxonomy of IR models
- Classic IR models
  - Boolean model
  - Vector model  ← NEXT
  - TF/IDF
- HITS and PageRank
Vector model

- A refinement of the boolean model that does not focus strictly on exact matching
- Non-binary weights provide consideration for partial matches
  - These term weights are used to compute a degree of similarity between a query and each document
- A ranked set of documents provides for better matching
Vector model

- Define:
    wij > 0 whenever ki ∈ dj
    wiq ≥ 0 is a weight associated with the pair (ki, q)
    dj = (w1j, w2j, …, wtj)
    q = (w1q, w2q, …, wtq)
- With each term ki, associate a vector vec(i)
- These vectors (e.g., vec(i) and vec(j)) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
  - Does this assumption (the "independence assumption") hold in practice?
  - What influence do you think this has on performance?
- The t vectors vec(i) form an orthonormal basis for a t-dimensional space
- In this space, queries and documents are represented as weight vectors
Bag of words

- In this model, wij > 0 whenever ki ∈ dj
- The exact ordering of terms in the document is ignored
  - This is called the "bag of words" model
- What will be the vectors for the following two documents?
  - "Ape eats banana"
  - "Banana eats ape"
- What needs to be done to fix this?
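The two example documents above make the point concrete: under a bag-of-words representation, only term counts matter, so the documents become indistinguishable. A quick sketch:

```python
from collections import Counter

# Bag of words: a document reduces to its term counts; order is discarded.
d1 = Counter("ape eats banana".split())
d2 = Counter("banana eats ape".split())

print(d1 == d2)  # True: the model cannot tell the two documents apart
```

Recovering word order requires something beyond this model, e.g. phrase indexing or n-gram terms.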
Similarity

  Term      | Sense and Sensibility | Pride and Prejudice | Wuthering Heights
  ----------|-----------------------|---------------------|------------------
  affection | 0.191956              | 0.95983             | 0.82407
  jealous   | 0.01807               | 0.1270              | 0.41616
  gossip    | 0.0127                | 0                   | 0.2564

  (From: An Introduction to Information Retrieval, Cambridge UP)

- In the vector model, queries may return documents that are not a 'perfect match'
- Hence, we need a metric for the similarity between different documents, or between a document and a query
  - Could we simply subtract the vectors? (L1 norm)
  - Could we use a dot product?
  - Does normalization help?
Cosine similarity

[Figure: document vector dj and query vector q, separated by angle θ.]

  sim(dj, q) = cos θ
             = (dj · q) / (|dj| × |q|)
             = Σi=1..t (wi,j × wi,q) / ( sqrt(Σi=1..t wi,j²) × sqrt(Σi=1..t wi,q²) )

- All weights are nonnegative; hence, 0 ≤ sim(dj, q) ≤ 1
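The formula above translates directly to code. A minimal sketch for plain lists of term weights (one entry per index term):

```python
import math

# Cosine similarity: dot product of the weight vectors, normalized by
# their Euclidean lengths.
def cosine_sim(d, q):
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_d * norm_q)

print(cosine_sim([3.0, 4.0], [3.0, 4.0]))  # 1.0: identical directions
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal vectors
```

Because it normalizes by vector length, a long document does not automatically dominate a short one with the same term mix.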
Plan for today

- Web services
- Information retrieval
  - Basics
  - Precision and recall
  - Taxonomy of IR models
- Classic IR models
  - Boolean model
  - Vector model
  - TF/IDF  ← NEXT
- HITS and PageRank
An example

Query: The University of Pennsylvania

- What would be a good match for this query?
Weights in the vector model

  sim(dj, q) = cos θ
             = (dj · q) / (|dj| × |q|)
             = Σi=1..t (wi,j × wi,q) / ( sqrt(Σi=1..t wi,j²) × sqrt(Σi=1..t wi,q²) )

- How do we compute the weights wij and wiq?
- A good weight must take into account two effects:
  - quantification of intra-document contents (similarity)
    - the tf factor: the term frequency within a document
  - quantification of inter-document separation (dissimilarity)
    - the idf factor: the inverse document frequency
- wij = tf(i,j) * idf(i)
Common TF and IDF preprocessing

- Let:
  - N be the total number of docs in the collection
  - ni be the number of docs which contain ki
  - freq(i,j) be the raw frequency of ki within dj
- A normalized tf factor is given by
    f(i,j) = a + (1 - a) * freq(i,j) / maxl freq(l,j)
  where the maximum is computed over all terms l which occur within the document dj (a is usually set to 0.4 or 0.5)
- The idf factor is computed as
    idf(i) = log(N / ni)
  - The log is used to make the values of tf and idf comparable
  - It can also be interpreted as the amount of information associated with the term ki
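The two formulas above can be sketched as follows. The toy collection `docs` is hypothetical, used only to exercise the definitions:

```python
import math

# Normalized tf: f(i,j) = a + (1 - a) * freq(i,j) / max_l freq(l,j)
def tf(term, doc_terms, a=0.5):
    counts = {t: doc_terms.count(t) for t in doc_terms}
    max_freq = max(counts.values())
    return a + (1 - a) * counts.get(term, 0) / max_freq

# idf(i) = log(N / n_i), where n_i is the number of docs containing the term
def idf(term, docs):
    n_i = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_i) if n_i else 0.0

docs = [["ape", "eats", "banana", "banana"],
        ["banana", "split"],
        ["ape", "news"]]

# w_ij = tf(i,j) * idf(i) for term "banana" in the first document
w = tf("banana", docs[0]) * idf("banana", docs)
print(round(w, 4))  # tf = 1.0 (most frequent term), idf = log(3/2) ≈ 0.4055
```

Note how "banana", frequent in the first document but present in two of three documents, gets a modest weight: the idf factor dampens terms that do little to separate documents.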