
MULTIMEDIA RETRIEVAL
Semester 1, 2022
Text Retrieval
 Background


 Textual Information Retrieval
Slides adapted from C. Manning and W. Hürst
School of Computer Science

Background
Finding a needle in a haystack!
Copyright TREC 2003
Information is of no use unless you can actually access it.
Unstructured (text) vs. structured (database) data in 1996
[Bar chart: data volume and market capitalization, unstructured vs. structured data.]

Unstructured (text) vs. structured (database) data in 2006
[Bar chart: data volume and market capitalization, unstructured vs. structured data.]
Background
 Information Retrieval (IR) is concerned with developing algorithms and models for retrieving information from document repositories according to an information need.
 With the remarkable growth in the use of computers to process multimedia data, there is also an ever-increasing demand for retrieval of such data.
 Retrieval targets: text documents and multimedia documents

Some Historical Remarks on IR
 1950s: Basic idea of searching text with a computer
 1960s: Key developments, e.g.
The SMART system (G. Salton, Harvard/Cornell)
The Cranfield evaluations
 1970s and 1980s: Advancements of the basic ideas, but mainly with small test collections
 1990s: Establishment of the TREC (Text REtrieval Conference) series (from 1992 to today)
 Large text collections
 Expansion to other fields and areas, e.g. spoken document retrieval, non-English or multi-lingual retrieval, information filtering, user interactions, WWW, video retrieval, etc.
A. Singhal, "Modern Information Retrieval: A Brief Overview", IEEE Data Engineering Bulletin, 2001.
Major Issues in IR
Information Retrieval (IR) deals with the representation, storage, organization of, and access to information items.
Baeza-Yates and Ribeiro-Neto
 Content analysis
 Indexing
 Interaction
 Results presentation (visualization)

Retrieval Process
[Diagram, refined over three slides: the USER expresses an INFORMATION NEED through the User Interface of the INFORMATION RETRIEVAL SYSTEM. On the data side, DOCUMENTS are selected for indexing and pass through parsing & term processing to build the index. The query passes through query processing (parsing & term processing) to form the logical view of the information need, is matched against the index, and the RESULT REPRESENTATION is returned to the user.]
Document Representation
hello, world, information, retrieval, multimedia, …
 Boolean Vector
 Dictionary / Vocabulary
 The i-th dimension indicates whether the i-th term in the vocabulary appears in the given document
Query: capital AND France (Boolean Query)
Doc. 1: The capital of France is called Paris.
Doc. 2: Paris is the capital of France.
Doc. 3: The capitals of France and England are called Paris and London, respectively.
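The Boolean query above can be sketched in a few lines of Python. This is illustrative only: the document IDs and the simple regex tokenizer are assumptions for this example, not the system described in the slides.

```python
import re

# The three example documents from the slide.
docs = {
    1: "The capital of France is called Paris.",
    2: "Paris is the capital of France.",
    3: "The capitals of France and England are called Paris and London, respectively.",
}

def postings(term):
    """Doc IDs whose text contains the term (case-insensitive, exact word match)."""
    return {d for d, text in docs.items()
            if term in re.findall(r"[a-z]+", text.lower())}

def and_query(t1, t2):
    """Boolean AND: intersect the two document sets."""
    return sorted(postings(t1.lower()) & postings(t2.lower()))

print(and_query("capital", "France"))  # [1, 2] — doc 3 has "capitals", not "capital"
```

Note how doc 3 is missed because it contains "capitals" rather than "capital": exact-match Boolean retrieval does no stemming.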

Index (Inverted Index)
 For each term T, we must store a set of all documents that contain T.
 Do we use an array or a list for this?
Term-Document Incidence Matrix
1 = word appears, 0 = word does not appear
[Matrix: rows are terms of the example documents (capital, France, …, respectively), columns are the three documents.]

Query: capital AND France
[The rows of the incidence matrix for "capital" and "France" are combined with a bitwise AND to find the matching documents.]

Term-Document Incidence Matrix
 Search: Very easy, but not very efficient (e.g. 1 000 000 docs × 1 000 terms = a matrix with 1 000 000 000 cells)
 The good news: This matrix is very sparse (i.e. lots of 0’s, only few 1’s)
 Idea: Just store the ‘hits’ (term incidences) using lists
Inverted File
[Figure: each term's matrix row becomes a postings list — a sorted list of the IDs of the documents containing that term.]
Inverted File
 Main advantage
 Easy, efficient search
 Disadvantages
 Storage (10%–100% of doc. size)
 Modifications (updates, …)
 Often other information is stored as well:
 to support advanced queries (e.g. position for phrases)
 to speed up the search process (e.g. frequency for query optimization)
 Bag-of-Words approach:
 Ignores word order
 What terms / tokens should go into the dictionary?
Inverted File
Dictionary Postings
Dictionary: Usually kept in memory (fast!) Postings: Kept on disks, access via offset

Inverted Index Construction
[Pipeline: documents to be indexed (e.g. "Friends, Romans, countrymen") → tokenizer → token stream (friends, romans, countrymen) → linguistic modules → modified tokens (friend, roman, countryman) → indexer → inverted index.]
Indexer steps: Example
 Sequence of (Modified token, Document ID) pairs.
Doc 1 Doc 2
I did enact I was killed i’ the Capitol; Brutus killed me.
So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
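The indexer steps above — emit (token, docID) pairs, sort them, then merge duplicates within a document while recording term frequency — can be sketched as follows. The regex tokenizer is a simplifying assumption, not the course's actual term processing.

```python
import re
from collections import defaultdict

# The two example documents from the slide.
docs = {
    1: "I did enact I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: emit (token, docID) pairs; Step 2: sort them.
pairs = sorted(
    (tok, doc_id)
    for doc_id, text in docs.items()
    for tok in re.findall(r"[a-z']+", text.lower())
)

# Step 3: merge duplicate entries per document and record term frequency.
index = defaultdict(dict)  # term -> {docID: term frequency}
for tok, doc_id in pairs:
    index[tok][doc_id] = index[tok].get(doc_id, 0) + 1

print(index["brutus"])  # once in each document: {1: 1, 2: 1}
print(index["killed"])  # twice in document 1: {1: 2}
```

The docID keys of `index[term]` are the postings list; the values are the frequency information added alongside each posting.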

 Multiple term entries in a single document are merged.
 Frequency information is added.
Why frequency? Will discuss later.
Query processing: AND
 Consider processing the query: Brutus AND Caesar
Locate Brutus in the Dictionary;  Retrieve its postings.
Locate Caesar in the Dictionary;  Retrieve its postings.
“Merge” the two postings:
Brutus → 2, 4, 8
Caesar → 1, 2, 3, 5, 8

 Walk through the two postings simultaneously, in time linear in the total number of postings entries
Brutus → 2, 4, 8
Caesar → 1, 2, 3, 5, 8
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.
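A minimal sketch of this linear-time merge, using the Brutus and Caesar postings from the slide:

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists: a match
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer on the smaller docID
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8], [1, 2, 3, 5, 8]))  # [2, 8]
```

Each step advances at least one pointer, which is why sorting the postings by docID is crucial.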
Boolean queries: Exact match
 The Boolean retrieval model allows queries in the form of Boolean expressions:
 Boolean Queries are queries using AND, OR and NOT to join query terms
 Views each document as a set of words
 Is precise: document matches condition or not.
 Primary commercial retrieval tool for 3 decades.
 Professional searchers (e.g., lawyers) still like Boolean queries:
 You know exactly what you’re getting.

Example: WestLaw http://www.westlaw.com/
 Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
 Tens of terabytes of data; 700,000 users
 Majority of users still use boolean queries
 Example query:
 What is the statute of limitations in cases involving the federal tort claims act?
 LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
 /3 = within 3 words, /S = in same sentence
Example: WestLaw http://www.westlaw.com/
 Another example query:
 Requirements for disabled people to be able to access a workplace
 disabl! /p access! /s work-site work-place (employment /3 place)
 Note that SPACE is disjunction, not conjunction!
 Long, precise queries; proximity operators; incrementally developed; not like web search
 Professional searchers often like Boolean search:
 Precision, transparency and control
 But that doesn’t mean they actually work better….

Beyond Boolean Retrieval
 Boolean retrieval is accurate for precise information needs, but it is not flexible.
 Either match or no match; either too few results or too many
 Position information
 Phrases: Stanford University
 Proximity: Find Gates NEAR Microsoft.
 Need index to capture position information in docs.
 Zones/clusters in documents: Find documents with (author = Ullman) AND (text contains automata).
 Ranking: which is more relevant to the given query?
 Most users don’t want to wade through 1000s of results.
 This is particularly true of web search.
 Need to measure proximity from query to each doc.
 Need to decide whether documents presented to user are singletons, or a group of documents covering various aspects of the query.
Evidence Accumulation
 1 vs. 0 occurrences of a search term
 2 vs. 1 occurrences
 3 vs. 2 occurrences, etc.
Usually, more occurrences seems better
 Need term frequency information in docs
 If the query term does not occur in the document:
score should be 0
 The more frequent the query term in the document, the higher the score (should be)

Term-document Count Matrices
 Consider the number of occurrences of a term in a document:
 Each document is a count vector in N^|V|: a column of the matrix
[Table: term–document count matrix; rows are terms, columns are documents such as Antony and Cleopatra.]
Bag of Words Model
 Vector representation doesn’t consider the ordering of words in a document
 “John is quicker than Mary” and “Mary is quicker than John” have the same vectors
 This is called the bag of words model.
 In a sense, this is a step back: The positional index was able to distinguish these two documents.

Term Frequency tf
 The term frequency tft,d of term t in document d is defined as the number of times that t occurs in d.
 We want to use tf when computing query- document match scores. But how?
 Raw term frequency is not what we want:
 A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term.
 But not 10 times more relevant.
 Relevance does not increase proportionally with term frequency.
Log-frequency Weighting
 The log-frequency weight of term t in d is:

w_t,d = 1 + log10(tf_t,d)  if tf_t,d > 0
w_t,d = 0                  otherwise

 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
 Score for a document-query pair: sum over terms t in both q and d:
score tqd(1logtft,d)
 The score is 0 if none of the query terms is present in the document.
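A small sketch of the log-frequency weight and the resulting query–document score. The example term counts are made up for illustration.

```python
import math

def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf) when tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    """Sum the log-tf weights over terms occurring in both query and document."""
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

print(log_tf_weight(1))    # 1.0
print(log_tf_weight(10))   # 2.0
print(log_tf_weight(1000)) # 4.0

doc = {"brutus": 10, "caesar": 2}          # hypothetical term counts
print(score(["brutus", "capitol"], doc))   # 2.0 — only "brutus" matches
```

If no query term occurs in the document, every summand is 0 and the score is 0, as the slide states.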

Document Frequency
 Rare terms are more informative than frequent terms
 Frequent terms: a, the, to, …
 Consider a term in the query that is rare in the collection (e.g., arachnocentric)
 A document containing this term is very likely to be relevant to the query arachnocentric
 → We want a high weight for rare terms like arachnocentric.
Document Frequency
 Consider a query term that is frequent in the collection (e.g., high, increase, line)
 A document containing such a term is more likely to be relevant than a document that doesn’t, but it’s not a sure indicator of relevance.
 For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
 We will use document frequency (df) to capture this in the score.
 df ( N) is the number of documents that contain the term

idf Weight
 dft is the document frequency of t: the number of documents that contain t
 df is a measure of the informativeness of t
 We define the idf (inverse document frequency) of t by
idft log10 N/dft
 We use log N/dft instead of N/dft to “dampen” the effect of idf.
Will turn out the base of the log is immaterial.
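A quick sketch of the idf computation for a few document frequencies, assuming a collection of N = 1 million documents:

```python
import math

def idf(N, df):
    """idf_t = log10(N / df_t); rare terms get a high idf."""
    return math.log10(N / df)

N = 1_000_000
print(idf(N, 1))      # 6.0 — a term occurring in a single document
print(idf(N, 1000))   # 3.0
print(idf(N, N))      # 0.0 — a term occurring in every document
```

A term that appears in every document carries no discriminating power, and its idf is 0.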
idf example, suppose N = 1 million
[Table: idf values for terms of varying df; by the formula, df = 1 gives idf = 6, df = 1000 gives idf = 3, and df = 1 000 000 gives idf = 0.]
There is one idf value for each term t in a collection.

Collection vs. Document Frequency
 The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
 Example:
[Table: two example terms with nearly equal collection frequency but very different document frequency.]
 Which word is a better search term (and should get a higher weight)?
tf-idf Weighting
 The tf-idf weight of a term is the product of its tf weight and its idf weight.
wt,d (1logtft,d)logN/dft
 Best known weighting scheme in information retrieval
 Note: the “-” in tf-idf is a hyphen, not a minus sign!
 Alternative names: tf.idf, tf x idf
 Increases with the number of occurrences within a document
 Increases with the rarity of the term in the collection
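Combining the two weights gives a minimal tf-idf sketch; the tf, df, and N values below are illustrative only.

```python
import math

def tf_idf(tf, df, N):
    """w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t); 0 when tf = 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# A term occurring 10 times in the document, in 1000 of 1 000 000 docs:
print(tf_idf(10, 1000, 1_000_000))  # (1 + 1) * 3 = 6.0
```

The weight increases with both factors: more occurrences in the document (tf) and greater rarity in the collection (idf).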

Binary → Count → Weight matrix
[Table: the term–document matrix from before, with tf-idf weights as entries.]
Each document is now represented by a real-valued vector of tf-idf weights in R^|V|
Documents as vectors
 So we have a |V|-dimensional vector space
 Terms are axes of the space
 Documents are points or vectors in this space
 Very high-dimensional: hundreds of millions of dimensions when you apply this to a web search engine
 This is a very sparse vector – most entries are zero.

Queries as vectors
 Key idea 1: Do the same for queries: represent them as vectors in the space
 Key idea 2: Rank documents according to their proximity to the query in this space
 proximity = similarity of vectors
 proximity ≈ inverse of distance
 Recall: We do this because we want to get away from the you’re-either-in-or-out Boolean model.
 Instead: rank more relevant documents higher than less relevant documents
Formalizing Vector Space Proximity
 First cut: distance between two points (= distance between the end points of the two vectors)
 Euclidean distance?
 Euclidean distance is a bad idea . . .
 . . . because Euclidean distance is large for vectors of different lengths.

Why distance is a bad idea
The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
From Angles to Cosines
 The following two notions are equivalent.
 Rank documents in increasing order of the angle between query and document
 Rank documents in decreasing order of cosine(query, document)
 Cosine is a monotonically decreasing function on the interval [0°, 180°]

Cosine (query, document)

cos(q, d) = (q · d) / (|q| |d|) = Σ_{i=1..|V|} q_i d_i / ( √(Σ_{i=1..|V|} q_i²) · √(Σ_{i=1..|V|} d_i²) )

The numerator is the dot product of q and d; dividing by the vector lengths is equivalent to taking the dot product of the unit vectors q/|q| and d/|d|.
q_i is the tf-idf weight of term i in the query
d_i is the tf-idf weight of term i in the document
cos(q, d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
Cosine similarity amongst 3 documents
How similar are the novels
SaS: Sense and Sensibility
PaP: Pride and Prejudice, and WH: Wuthering Heights?
Term frequencies (counts)
[Table: counts of a few selected terms in each of the three novels.]

Computing Cosine Scores
[Pseudocode: CosineScore(q) — for each query term, accumulate weighted contributions from its postings into per-document scores, normalize by document length, and return the top K.]
tf-idf weighting has many variants
[Table: SMART notation — columns of letter-coded variants for term frequency, document frequency, and normalization.]
Columns headed ‘n’ are acronyms for weight schemes.
Why is the base of the log in idf immaterial?

Summary – vector space ranking
 Represent the query as a weighted tf-idf vector
 Represent each document as a weighted tf-idf vector
 Compute the cosine similarity score for the query vector and each document vector
 Rank documents with respect to the query by score
 Return the top K (e.g., K = 10) to the user
Beyond TF-IDF
 More advanced methods are available:
Probabilistic models
Language models
 IR is still a very active area:
SIGIR is the annual conference.
IR performance keeps improving.

Relevance Feedback & Query Expansion
 Personalized service
 Focus on relevance feedback
 The complete landscape
Global methods
 Query expansion
 Thesauri
 Automatic thesaurus generation
Local methods
 Relevance feedback
 Pseudo relevance feedback
Relevance Feedback
 Relevance feedback: user feedback on relevance of docs in initial set of results
 User issues a (short, simple) query
 The user marks some results as relevant or non-relevant
 The system computes a better representation of the information need based on feedback.
 Relevance feedback can go through one or more iterations.
 Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate

Key concept: Centroid
 The centroid is the center of mass of a set of points
 Recall that we represent documents as points in a high-dimensional space
 Definition:

μ(C) = (1/|C|) Σ_{d ∈ C} d

where C is a set of documents.
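The centroid definition translates directly to code; representing documents as plain lists of weights is a simplifying assumption for illustration.

```python
def centroid(C):
    """mu(C) = (1/|C|) * component-wise sum of the vectors in C."""
    n = len(C)
    dim = len(C[0])
    return [sum(v[i] for v in C) / n for i in range(dim)]

# Three 2-dimensional document vectors (hypothetical weights):
print(centroid([[1, 0], [3, 2], [2, 4]]))  # [2.0, 2.0]
```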
Rocchio Algorithm
 The Rocchio algorithm uses the vector space model to pick a relevance-fed-back query
 Rocchio seeks the query q_opt that maximizes

q_opt = argmax_q [ cos(q, μ(C_r)) − cos(q, μ(C_nr)) ]

 It tries to separate docs marked relevant and non-relevant:

q_opt = (1/|C_r|) Σ_{d_j ∈ C_r} d_j − (1/|C_nr|) Σ_{d_j ∉ C_r} d_j

 Problem: we don’t know the truly relevant docs

The Theoretically Optimal Query
[Figure: relevant (o) and non-relevant (x) documents in vector space, with the optimal query vector separating the two groups.]
x non-relevant documents, o relevant documents
Rocchio 1971 Algorithm (SMART)
 Used in practice:

q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j

 D_r = set of known relevant doc vectors
 D_nr = set of known non-relevant doc vectors
 Different from C_r and C_nr!
 q_m = modified query vector; q_0 = original query vector; α, β, γ: weights (hand-chosen or set empirically)
 New query moves toward relevant documents and away from irrelevant documents
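A minimal sketch of the Rocchio update. The α, β, γ defaults and the clipping of negative components to zero are common conventions, not values prescribed by the slide; the vectors below are hypothetical.

```python
def rocchio(q0, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(D_r) - gamma*centroid(D_nr)."""
    dim = len(q0)

    def centroid(vecs):
        if not vecs:
            return [0.0] * dim
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

    c_r, c_nr = centroid(rel), centroid(nonrel)
    # Negative term weights are usually clipped to 0 in practice.
    return [max(0.0, alpha * q0[i] + beta * c_r[i] - gamma * c_nr[i])
            for i in range(dim)]

q0 = [1.0, 0.0, 0.0]                       # original query vector
rel = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]   # known relevant doc vectors
nonrel = [[0.0, 0.0, 1.0]]                 # known non-relevant doc vector
print(rocchio(q0, rel, nonrel))            # ≈ [1.0, 0.75, 0.225]
```

The modified query keeps the original term, gains weight on the dimensions shared by the relevant documents, and is pushed away from the non-relevant one.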

Relevance Feedback on Initial Query
[Figure: known relevant (o) and known non-relevant (x) documents in vector space; the revised query vector moves from the initial query toward the relevant documents.]
x known non-relevant documents, o known relevant documents
Relevance Feedback
