IR H/M Course
Task Statement:
Build a system that retrieves documents that users are likely to find relevant to their queries.
• Relevance
– What is it?
– Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine
– Many factors influence a person’s decision about what is relevant: e.g. task, context, novelty
– Topical relevance (same topic) vs. user relevance (everything else)
– Related to type of query
Relevance in Practice
• Retrieval models define a view of relevance
– Ranking algorithms used in search engines are based on retrieval models that aim to estimate relevance
– Typically, these models use statistical properties of text rather than linguistic ones (e.g. counting simple text features such as words instead of parsing and analysing sentences)
– Most well-known/classical retrieval models focus on topical relevance
– User relevance requires more features, different types of evidence
See IR Models lectures later
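To illustrate how a ranking can be built from simple statistical text features, here is a minimal sketch. It is a toy scoring function, not one of the retrieval models covered in the later lectures: the names `score` and `rank`, the example documents, and the choice of "sum of query-term counts" as the relevance estimate are all assumptions made for illustration.

```python
from collections import Counter


def score(query: str, document: str) -> int:
    """Toy relevance estimate: total occurrences of the query terms in the document."""
    # Assumption: lowercasing and whitespace splitting are adequate text processing.
    doc_counts = Counter(document.lower().split())
    return sum(doc_counts[term] for term in set(query.lower().split()))


def rank(query: str, documents: dict[str, str]) -> list[tuple[str, int]]:
    """Return (doc_id, score) pairs ordered by decreasing estimated relevance."""
    scored = [(doc_id, score(query, text)) for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical mini-collection
docs = {
    "d1": "search engines rank documents",
    "d2": "retrieval models estimate relevance of documents to a query",
    "d3": "cooking pasta at home",
}
ranking = rank("relevance of documents", docs)
print(ranking)
```

The score only reflects word overlap, so it captures (at best) topical relevance; nothing about the user's task, context or novelty enters the estimate.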
The Retrieval Process (Conceptual)
Information need
Formulation
Document representation
Relevance estimation
Retrieval functions
Retrieved documents
Ranked in order of relevance
The Retrieval Process (In Practice)
Input Documents
Text Processing
Information Need
Text Processing
Top Results
Relevance feedback
Indexing: Represent each document; identify its meaning. What is the document about?
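The practical pipeline above (process the documents, index them, process the query, return the top results) can be sketched roughly as follows. An inverted index is one common way to represent documents for retrieval; the slides do not prescribe it here, so treat the data structure, the function names (`process_text`, `build_index`, `search`) and the match-count ordering as assumptions.

```python
from collections import defaultdict


def process_text(text: str) -> list[str]:
    """Text processing step (assumption: lowercase + whitespace split only)."""
    return text.lower().split()


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Indexing step: map each term to the set of documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in process_text(text):
            index[term].add(doc_id)
    return index


def search(query: str, index: dict[str, set[str]], top_k: int = 10) -> list[str]:
    """Return up to top_k doc ids, ordered by number of distinct query terms matched."""
    matches: dict[str, int] = defaultdict(int)
    for term in set(process_text(query)):
        for doc_id in index.get(term, ()):
            matches[doc_id] += 1
    # Ties are broken by doc id so the output is deterministic.
    return sorted(matches, key=lambda d: (-matches[d], d))[:top_k]


# Hypothetical mini-collection
documents = {
    "d1": "information retrieval systems",
    "d2": "information need and query formulation",
    "d3": "text processing",
}
index = build_index(documents)
results = search("information retrieval", index)
print(results)
```

The same `process_text` function is applied to both documents and queries, mirroring the two "Text Processing" boxes in the diagram. Relevance feedback is not modelled in this sketch.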
Building a retrieval system
PART 1: TEXT PROCESSING
How Do We Represent Text?
• Remember: Typically, IR models use statistical properties of text rather than linguistic ones
• “Bag of words”
– Treat all the words in a document as index terms
– Assign a “weight” to each term based on “importance”
(e.g. term frequency or, in simplest case, presence/absence of term)
– Disregard order, structure, meaning, etc. of the words
– Simple, yet effective!
• Assumptions
– Term occurrence is independent
– Document relevance is independent
– “Words” are well-defined
Let’s also assume that documents have been collected and converted into plain text
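The bag-of-words representation described above can be sketched in a few lines. This is a minimal illustration, assuming that lowercasing plus whitespace splitting is an acceptable way to find the "words"; the function name `bag_of_words` and the two example documents are hypothetical.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map each index term to its term frequency, ignoring order and structure."""
    # Assumption: "words" are whatever whitespace splitting produces.
    return Counter(text.lower().split())


doc_a = "the cat chased the dog"
doc_b = "the dog chased the cat"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

# Weight = term frequency
print(bow_a)

# Weight = presence/absence of the term (the simplest case)
binary_a = {term: 1 for term in bow_a}
print(binary_a)

# Word order is disregarded, so the two documents get identical representations
print(bow_a == bow_b)
```

The last comparison shows the cost of the simplification: two sentences with opposite meanings are indistinguishable once order is discarded.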
Documents
• Unit of retrieval
– Web page? Email? Tweet? …
Information vs.
• Passage of free text
– Composed of text, strings of characters from an alphabet
– Composed of words of natural language
• Newspaper article, a journal paper, a dictionary definition, email messages
– Size of documents
• Arbitrary
• Email vs. newspaper article vs. journal paper
Effect on Retrieval … More later
What’s a Word?
وقال مارك ريجيف - الناطق باسم الخارجية الإسرائيلية - إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982
Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России.
भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फ़ीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है
…
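A small sketch of why the assumption that “words” are well-defined is fragile: two equally plausible tokenisation rules give different index terms for the same text. The sample string is a short fragment in the style of the Russian example, and both tokeniser functions are hypothetical illustrations rather than the course's text processing pipeline.

```python
import re


def whitespace_tokens(text: str) -> list[str]:
    """Rule 1: a word is anything between whitespace."""
    return text.split()


def word_char_tokens(text: str) -> list[str]:
    """Rule 2: a word is a maximal run of letters/digits (Unicode-aware \\w)."""
    return re.findall(r"\w+", text)


sample = "экс-глава ЮКОСа 2005-06"

by_whitespace = whitespace_tokens(sample)
by_word_chars = word_char_tokens(sample)

print(by_whitespace)   # hyphenated forms stay whole
print(by_word_chars)   # hyphenated forms are split at the hyphen
```

Under the first rule the sample has three terms; under the second it has five, so the same document would be indexed differently depending on the rule chosen.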