IR H/M Course
Information Needs
• An information need is the underlying cause of the query that a person submits to a search engine
– Sometimes called information problem to emphasise that information need is generally related to a task
• Categorised using a variety of dimensions
– e.g., number of relevant documents being sought
– Type of information that is needed
– Type of task that led to the requirement for information
Queries and Information Needs
• A query can represent very different information needs
– May require different search techniques and ranking algorithms to produce the best rankings
• A query can be a poor representation of the information need
– Users may find it difficult to express the information need
– Users are encouraged to enter short queries both by the search engine interface, and by the fact that long queries often don’t work
– Ambiguity: the same query string may represent different information needs
Query Ambiguity
• 16+% of user queries are ambiguous (Song et al., IP&M 2009)
– e.g., what is a user issuing the query ‘ash’ after?
[Figure: which ‘ash’? e.g. the ash tree, volcanic ash, or Ash the character]
Query Formulation Problem
• Difficult to generate well formulated queries without
– Knowledge of collection
• How terms are distributed, type of docs, etc.
– Retrieval environment
• Term weighting, query language, retrieval strategy, etc.
• First query is a trial run
– Used in practice to retrieve a few useful items from the collection
– Learn from those relevant items
• Query term modification
IR is an iterative process
Why is it difficult to formulate a query?
User Interaction
• Interaction with the system occurs
– During query formulation and reformulation
– While browsing the search results
• Key aspect of effective retrieval
– Users can’t change the ranking algorithm but can change the results through interaction
– Helps refine the description of the information need
• e.g., same initial query, different information needs
• How do users describe what they don’t know?
Interaction & Query Reformulation
• The user needs to find some information. Why?
– To do a task
– ‘Anomalous’ state of knowledge [Belkin 1982]
– Maybe the user is uncertain about something
– The user formulates a query (information need is transformed into a query)
• The system helps the user to refine the query and … accomplish the task
[Figure: the information need is successively expressed as refined queries Q1, Q2, Q3]
ASK Hypothesis
• Belkin et al. (1982) proposed a model called the Anomalous State of Knowledge (ASK)
• ASK hypothesis:
– Difficult for people to define exactly what their information need is, because that information is a gap in their knowledge
– Search engine should look for information that fills those gaps
• Interesting idea, little practical impact (yet)
(Explicit) Interactions
“System helps user to refine the query”
Examples of explicit interaction:
•Relevance Feedback: the search engine permits the user to say what documents already retrieved are relevant or non-relevant
•Query Expansion/Query Term Suggestion: the search engine suggests additional query terms for the user
•Query Suggestions: the search engine suggests related queries for the user
Relevance Feedback
• After initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents
• Use this feedback information to reformulate the query
• Produce new results based on reformulated query
• Allows more interactive, multi-pass process
Permit user to select relevant documents
Relevance Feedback Example
Top 10 documents for “tropical fish”
Relevance Feedback Example
• Document 7 (“Breeding tropical fish”) has explicitly been indicated to be relevant
• In this document, the most frequent terms are:
breeding (4), fish (4), tropical (4), marine (2), pond (2), coldwater (2), keeping (1), interested (1)
• We can add these terms back into the query
– This makes the query more similar to the relevant document
– E.g. {tropical fish} -> {tropical^5 fish^5 breeding^4 marine^2 …}, as sketched below
• Specific weights and scoring methods used for relevance feedback depend on retrieval model
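As a rough illustration (not the weighting scheme of any particular retrieval model), a minimal Python sketch of boosting a query with the most frequent terms of a document marked relevant; the term counts for Document 7 are taken from the slide above:

```python
from collections import Counter

def expand_query(query_terms, relevant_doc_terms, top_k=5):
    """Add the top_k most frequent terms of a relevant document to the query,
    using raw term frequency as an (illustrative) weight."""
    counts = Counter(relevant_doc_terms)
    weights = {t: 1.0 for t in query_terms}          # original query terms
    for term, tf in counts.most_common(top_k):
        weights[term] = weights.get(term, 0.0) + tf  # boost by frequency
    return weights

# Document 7 ("Breeding tropical fish"): term frequencies from the slide
doc7 = (["breeding"] * 4 + ["fish"] * 4 + ["tropical"] * 4 +
        ["marine"] * 2 + ["pond"] * 2 + ["coldwater"] * 2 +
        ["keeping", "interested"])
print(expand_query(["tropical", "fish"], doc7))
# {'tropical': 5.0, 'fish': 5.0, 'breeding': 4.0, 'marine': 2.0, 'pond': 2.0}
```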
Relevance Feedback Using Vectors
• Documents relevant to a particular query resemble each other in the sense that they are represented by reasonably similar vectors
Relevance Feedback within a Vector Space Model is due to J. J. Rocchio, 1971
Moving a given query towards the relevant items and away from the non-relevant ones
Conceptual View of Relevance Feedback
Relevance Feedback
• Usually both of the following occur using the feedback:
– Expand query with new terms
– Re-weight terms in query
• There are many variations
– Usually positive weights for terms from relevant docs
• Found to be much more valuable
– Sometimes negative weights for terms from non-relevant docs
– Remove terms that ONLY appear in non-relevant documents
Query Reformulation for VSM
• Change query vector using vector algebra
• Add the vectors for the relevant documents to the query vector
• Subtract the vectors for the non-relevant docs from the query vector
• This adds both positively and negatively weighted terms to the query, as well as re-weighting the initial terms
Optimal Query
• Assume that the set of relevant documents Cr is known
• Then the best query that ranks all and only the relevant documents at the top is:

$$\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j \;-\; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$$

where N is the total number of documents.
Standard Rocchio Method
• Since all relevant documents are unknown, just use the known relevant (Dr) and non-relevant (Dn) sets of documents and include the initial query q0
$$\vec{q}_{new} = \alpha\,\vec{q}_0 + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j \;-\; \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$$

α: Tunable weight for the initial query.
β: Tunable weight for relevant documents.
γ: Tunable weight for non-relevant documents.
Usually α = 1, β = 0.75 and γ = 0.25
Example: Rocchio Calculation
Relevant docs:
Dr1 = (0.030, 0.000, 0.000, 0.025, 0.025, 0.050, 0.000, 0.000, 0.120)
Dr2 = (0.020, 0.009, 0.020, 0.002, 0.050, 0.025, 0.100, 0.100, 0.120)
Non-relevant doc:
Dn1 = (0.030, 0.010, 0.020, 0.000, 0.005, 0.025, 0.000, 0.020, 0.000)
Original query:
q0 = (0.000, 0.000, 0.000, 0.000, 0.500, 0.000, 0.450, 0.000, 0.950)

$$\vec{q}_{new} = \alpha\,\vec{q}_0 + \left[\frac{\beta}{2}\,(\vec{D}_{r1}+\vec{D}_{r2})\right] - \left[\frac{\gamma}{1}\,\vec{D}_{n1}\right]$$
Rocchio Calculation
Resulting Feedback Query
qnew = (0.011, 0.000875, 0.002, 0.010, 0.527, 0.022, 0.488, 0.033, 1.04)
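For concreteness, a minimal Python sketch of this calculation with plain lists (α = 1, β = 0.75, γ = 0.25); the printed values match the vector above after rounding:

```python
# Standard Rocchio update: q_new = a*q0 + (b/|Dr|)*sum(Dr) - (g/|Dn|)*sum(Dn)
def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.25):
    dims = len(q0)
    rel_centroid = [sum(d[i] for d in rel_docs) / len(rel_docs) for i in range(dims)]
    nonrel_centroid = [sum(d[i] for d in nonrel_docs) / len(nonrel_docs) for i in range(dims)]
    return [alpha * q0[i] + beta * rel_centroid[i] - gamma * nonrel_centroid[i]
            for i in range(dims)]

q0  = [0.000, 0.000, 0.000, 0.000, 0.500, 0.000, 0.450, 0.000, 0.950]
dr1 = [0.030, 0.000, 0.000, 0.025, 0.025, 0.050, 0.000, 0.000, 0.120]
dr2 = [0.020, 0.009, 0.020, 0.002, 0.050, 0.025, 0.100, 0.100, 0.120]
dn1 = [0.030, 0.010, 0.020, 0.000, 0.005, 0.025, 0.000, 0.020, 0.000]

print([round(w, 6) for w in rocchio(q0, [dr1, dr2], [dn1])])
# [0.01125, 0.000875, 0.0025, 0.010125, 0.526875, 0.021875, 0.4875, 0.0325, 1.04]
```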
Rocchio method vs optimal query? What is the difference? Why is it needed?
Other forms of Explicit Interaction (1)
Query (Term) Suggestions / Explicit Query Expansion
Query Suggestions
Other forms of Explicit Interaction (2)
Helping the user define the query on the fly
Query Auto-completions
Similar Pages
Relevance Feedback (RF) Performance
• RF generally improves retrieval performance (recall and precision)
• RF is most useful for increasing recall in situations where recall is important
– Users can be expected to review results and to take time to iterate
• Positive feedback is more valuable than negative feedback
– E.g. set γ < β; e.g. γ = 0.25, β = 0.75
– Many systems only allow positive feedback (γ = 0)
Why is Relevance Feedback not Widely Used?
• Users are in general reluctant to provide explicit feedback (see Excite & Altavista experiments)
• Results in long queries that require more computation to retrieve
– Search engines process lots of queries and allow little time for each one
• Makes it harder to understand why a particular document was retrieved
Types of User Feedback
• Explicit user feedback (What did the user say was relevant?): Relevance Feedback, Query Suggestions, Similar Pages
• Implicit feedback (What did the user click/look at/etc.?): Click tracking, Mouse tracking, User Behaviour
• No user feedback (What would the user likely have clicked?): Pseudo-Relevance Feedback
Pseudo-Relevance Feedback
• Use relevance feedback methods without explicit user feedback
• Just assume the top m retrieved documents are relevant, and use them to reformulate the query:
– Look at the statistics of the terms in the top retrieved documents
– Add the terms with the highest weights to the query
– Do relevance feedback (e.g. Rocchio) and carry out the retrieval with the new query, as sketched below
• Aka (automatic) Query Expansion (QE)
– A number of such QE models are implemented in Terrier
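A minimal sketch of this idea, assuming a hypothetical search(query, k) function that returns the top-k documents as lists of terms (this is not Terrier's API, just an illustration of the two-pass process):

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, search, m=10, n_expansion_terms=10):
    """Assume the top-m documents are relevant and expand the query with the
    terms that occur most frequently in them (ignoring the original query terms)."""
    top_docs = search(query_terms, k=m)               # hypothetical first-pass retrieval
    counts = Counter(t for doc in top_docs for t in doc
                     if t not in query_terms)
    expansion = [t for t, _ in counts.most_common(n_expansion_terms)]
    return list(query_terms) + expansion               # run this expanded query in a second pass

# expanded = pseudo_relevance_feedback(["scottish", "highland", "games"], search)
```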
Pseudo Relevance Feedback: Illustrative Example
• TREC Query: Scottish highland games
• Retrieve top 3 documents using the original query, and use the index to determine what other terms occur in those documents
• Using Terrier’s Bo1 QE mechanism and Weak Stemming
– Bo1 generalises Rocchio’s approach by using a refined parameter-free statistical term weighting model (no model parameters are needed)
• The expanded query:
– Scottish highland games Ligonier kilt caber clan toss Scot tartan grandfather artist heavy tradition dance Celtic dancer athlete heather competitor
• In the expanded query (using the relevance assessment)
– These terms are helpful: Ligonier kilt caber clan toss Scot tartan
– These terms bring noise: grandfather artist heavy
– The rest of the added query terms are neutral, e.g. dancer, tradition
Query Re-weighting
• Assume that in the expanded query, the terms are not re-weighted
• Assume that qtw is the frequency of the query term
• For the expanded query:
“Scottish highland games Ligonier kilt caber clan toss scot tartan tradition dance Celtic dancer athlete heather competitor grandfather artist heavy”
qtw = 1 for each query term
We can’t differentiate the informative terms from the others
• Therefore, there is a need for re-weighting the query terms, including the expanded ones
Pseudo-Relevance Feedback Works
• Pseudo-relevance feedback automates the “manual” part of true (explicit) relevance feedback
• Works well on average in many experimental settings
• Example:
– TREC 2005 Terabyte Track adhoc task
– Using Terrier’s TF-IDF, MAP=0.3024
– Using Terrier’s TF-IDF+Bo1, MAP=0.3428
– A significant improvement (p-value=0.008169)
Problems with Pseudo-Relevance Feedback
• If the initial query is hard
– That means the initial results are poor
• Then pseudo-RF can cause problems, often drifting the query away from the optimal query
• May not be rewarding if the number of relevant documents is small
– Adding more query terms cannot bring many relevant documents
– e.g. the query ‘Homepage of Glasgow Information Retrieval Group’ has only a single relevant document; it is not helpful to expand such a query
Other Sources of Evidence for Query Expansion
• Query Expansion can examine different sources of evidence to automatically reformulate a query:
– Top-ranked documents: based on term co-occurrence with query terms (pseudo-relevance feedback)
– Also the entire corpus (incl. an external corpus), using a similar term co-occurrence analysis
– Previous query reformulations of users (query logs)
– Thesauri & dictionaries (e.g. WordNet) & word embeddings
• However, they are not necessarily effective, as they do not take context into account
Thesauri
• A thesaurus provides information on synonyms and semantically related words and phrases.
• Example: physician
syn: ||croaker, doc, doctor, MD, medical, mediciner, medico, ||sawbones
rel: medic, general practitioner, surgeon, ...
Thesaurus-based Query Expansion
• For each term, t, in a query, expand the query with synonyms and related words of t from the thesaurus
• You might weight the added terms less than the original query terms
• Generally increases recall
• It could significantly decrease precision, particularly with ambiguous terms
– interest rate → interest rate fascinate evaluate
WordNet
• A more detailed database of semantic relationships between English words
• Developed by a well-known cognitive psychologist (George Miller) and a team at Princeton University
• About 155,000 English words
• Nouns, adjectives, verbs, and adverbs grouped into about 175k synonym sets called synsets
https://wordnet.princeton.edu/
WordNet Synset Relationships
• Antonym: front → back
• Attribute: benevolence → good (noun to adjective)
• Pertainym: alphabetical → alphabet (adjective to noun)
• Similar: unquestioning → absolute
• Cause: kill → die
• Entailment: breathe → inhale
• Holonym: chapter → text (part to whole)
• Meronym: computer → cpu (whole to part)
• Hyponym: plant → tree (specialization)
• Hypernym: apple → fruit (generalization)
Query Expansion with WordNet
• The expanded query terms are usually synonyms of the original query terms (i.e. words in the same synset)
• Example: for ‘Scottish highland games’, we have the following expanded terms:
– Scotch, Gaelic, upland, hilly, mountainous, bet, gage, stake, punt etc.
• It is also possible to add hyponyms (specialised terms) or to add hypernyms to generalise a query
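A minimal sketch of synonym-based expansion using NLTK's WordNet interface (assumes NLTK is installed and the WordNet data has been downloaded via nltk.download('wordnet')). No sense disambiguation is attempted, which is exactly why such expansion can add noisy terms:

```python
from nltk.corpus import wordnet as wn

def synonym_expansion(query_terms):
    """Collect lemma names from every synset of each query term (all senses, no disambiguation)."""
    expanded = set()
    for term in query_terms:
        for synset in wn.synsets(term):
            for lemma in synset.lemma_names():
                if lemma.lower() != term.lower():
                    expanded.add(lemma.replace("_", " "))
    return expanded

print(synonym_expansion(["scottish", "highland", "games"]))
# prints synonyms drawn from all senses, e.g. 'Scots', 'Scotch', 'upland', ...
```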
Statistical Thesaurus
• Existing human-developed thesauri are not easily available in all languages
– Human thesauri are limited in the type and range of synonymy and semantic relations they represent
• Semantically related terms can be discovered from the statistical analysis of corpora, e.g. via term co-occurrence:
– If term t1 occurs in X documents, term t2 occurs in Y documents, and they co-occur in C documents:
• Cosine(t1, t2) = C / sqrt(X * Y)
• Dice(t1, t2) = 2C / (X + Y)
• Other approaches involve distributional semantics
– Word embeddings: word2vec, GloVe, etc.
• Does not usually work well compared to pseudo-relevance feedback
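A minimal sketch of these co-occurrence scores over a toy corpus represented as one set of terms per document:

```python
from math import sqrt

def cooccurrence_scores(t1, t2, docs):
    """Compute Cosine and Dice association between two terms from document-level co-occurrence."""
    x = sum(1 for d in docs if t1 in d)                   # docs containing t1
    y = sum(1 for d in docs if t2 in d)                   # docs containing t2
    c = sum(1 for d in docs if t1 in d and t2 in d)       # docs containing both
    cosine = c / sqrt(x * y) if x and y else 0.0
    dice = 2 * c / (x + y) if (x + y) else 0.0
    return cosine, dice

docs = [{"whisky", "distillery", "highland"},
        {"whisky", "scotch", "cask"},
        {"highland", "games", "caber"}]
print(cooccurrence_scores("whisky", "highland", docs))    # (0.5, 0.5)
```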
Query Expansion Using External Resource
• A high-quality external resource can bring useful information that improves the quality of the pseudo-relevance set
• [Kwok 2003]
– The pseudo-relevance set contains top-ranked documents returned by the external resource
– Aka Collection Enrichment or External Query Expansion
• Works well on large collections, including Web & Microblog corpora
– The larger the external resource is, the better the terms follow a random distribution
– Advanced term weighting models are based on randomness statistical laws
– A large external resource provides better collection statistics for the term
[Figure: Example of external QE for an enterprise search engine. The query Q is first run against a Wikipedia index; the top-ranked Wikipedia articles (e.g. Article 5, Article 14) are used for Wikipedia-based QE, and the expanded query is then run against the enterprise index to produce the enterprise ranking (e.g. Doc 9, Doc 12).]
Selective Query Expansion
• Idea: Disable query expansion if the pseudo-relevance set is predicted to be poor
– Various query performance predictors have been proposed [He & Ounis, 2004]
– For a given query, estimate how well it will do using a statistical analysis of the query/corpus and/or the retrieved set of documents
• E.g. are the query terms not informative enough, or is the query ambiguous (e.g. very topically dissimilar terms)
• Works well, and particularly appropriate for retrieval tasks that require early precision (e.g. Web search)
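As a toy illustration of the idea (not the actual predictors of He & Ounis), a sketch that disables expansion when the average IDF of the query terms is low, i.e. when the query terms look uninformative; the statistics and threshold below are made up:

```python
from math import log

def average_idf(query_terms, doc_freqs, num_docs):
    """Average inverse document frequency of the query terms;
    low values suggest an uninformative query for which expansion may be risky."""
    idfs = [log(num_docs / doc_freqs.get(t, 1)) for t in query_terms]
    return sum(idfs) / len(idfs)

def should_expand(query_terms, doc_freqs, num_docs, threshold=2.0):
    # Disable query expansion when the query looks too uninformative (illustrative threshold).
    return average_idf(query_terms, doc_freqs, num_docs) >= threshold

# Hypothetical document frequencies and collection size
doc_freqs = {"scottish": 12000, "highland": 8000, "games": 150000, "the": 900000}
print(should_expand(["scottish", "highland", "games"], doc_freqs, num_docs=1_000_000))  # True
```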
Wisdom of the Crowds
• Consider the query: Scotch whisky shop Bellevue
• There are no such shops in Bellevue, so the user may reformulate to Scotch whisky shop Seattle
• If many users use similar query reformulations, from the query logs we can understand that Bellevue → Seattle is a possible reformulation/ expansion:
[Table: query-log reformulations of ‘bellevue’, e.g. ‘bellevue ne’ and ‘bellevue washington’, each with a frequency/score]
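A minimal sketch of mining such reformulations from a query log, assuming the log is simply a time-ordered list of (session_id, query) pairs:

```python
from collections import Counter

def reformulation_counts(log):
    """Count consecutive query pairs within the same session as candidate reformulations."""
    counts = Counter()
    prev_session, prev_query = None, None
    for session_id, query in log:
        if session_id == prev_session and query != prev_query:
            counts[(prev_query, query)] += 1
        prev_session, prev_query = session_id, query
    return counts

log = [("s1", "scotch whisky shop bellevue"), ("s1", "scotch whisky shop seattle"),
       ("s2", "scotch whisky shop bellevue"), ("s2", "scotch whisky shop seattle")]
print(reformulation_counts(log).most_common(1))
# [(('scotch whisky shop bellevue', 'scotch whisky shop seattle'), 2)]
```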
Summary
• Relevance feedback is an effective means for user-directed query modification
• Modification can be done with either direct or indirect user input
• Modification can be done based on an individual’s or a group’s past input