CE306/CE706 Information Retrieval
Ad-hoc Search & Processing Pipeline
Dr Alba García Seco de Herrera
Brief Module Outline (Reminder)
§ Motivation + introduction
§ Processing pipeline / indexing + query processing
§ Large-scale open-source search tools
§ Information Retrieval models
§ Evaluation
§ Log analysis
§ User profiles / personalisation / contextualisation
§ IR applications, e.g. enterprise search
Main Information Retrieval Tasks
§ Ad-hoc retrieval:
§ relatively static document collection
§ different user queries
§ examples: library catalogues, Web search engines
§ Filtering/Routing:
§ relatively static user requests
§ constantly updated document collection
§ examples: news filtering, classification, spam filter …
Ad-hoc Retrieval § Text-based retrieval
§ Given a query and a corpus, find relevant items
§ query: textual description of information need § corpus: a collection of textual documents
§ relevance: satisfaction of the user’s information need
Reminder…
Or…
Ad-hoc Retrieval
§ Start with ‘traditional’ document collection, i.e. treat document as plain text and not HTML document
§ Look at how we match a query against a collection of documents
§ To do so we need to process the document collection into an index structure that allows fast access and search
So it’s actually more like this:
… or as Moodle would put it:
Why start with plain Documents?
§ Main processing/indexing pipeline is the same
§ Transparent process:
§ not constantly tweaked (e.g., links) or tuned using user-interaction data (e.g., clicks) as in Web search
§ Very common:
§ government and corporate intranets (enterprise search), social media, your own personal computers (e.g., Word and PDF documents) …
Basic IR Process
Basic IR Process
Document representation
1
Most Basic View of a Search
Engine
§ A search engines does not scan each document to see if it satisfies the query
§ It uses an index to quickly locate the relevant documents
§ Index: a list of concepts with pointers to documents that discuss them
Index from Manning et al., 2008
Most Basic View of a Search Engine
§ So, what goes in the index is important! § How might we combine concepts, e.g.:
§ “patent search + A/B test’’ ? § “A/B test or user test’’ ? …
Most Basic View of a Search Engine
Most Basic View of a Search Engine
Most Basic View of a Search Engine
Document Representation
§ Document representation: deciding what concepts should go in the index
§ Option 1 (controlled vocabulary): a set a manually constructed concepts that describe the major topics covered in the collection
§ Option 2 (free-text indexing): the set of individual terms that occur in the collection
Document representation
Index from Manning et al., 2008
§ If we view option 1 and option 2 as two extremes, where does this particular index fit in?
Document Representation: (option 1: controlled vocabulary)
§ Think of examples in which controlled vocabularies are being used
§ Why would this be done?
§ What are the pros and cons compared to indexing?
Document = Bag of Words
(i.e., option 2: free-text indexing)
§ Need to locate important information in a document
§ Deep natural language understanding too expensive
§ Common assumption: document is a list of words
§ Documents represented by a set of index terms
§ The search engine will determine which terms are important (depends on the “retrieval model” applied)
Free-text Indexing:
Retrieval Process
1. Define text database, i.e. data sources, operations on text and text structure
2. Process document collection into index database for fast access
3. Parse user query using same text processing operations
4. Apply query operations to processed query
5. Generate system query to retrieve matching
documents
6. Rank and display results
Free-text Indexing
Free-text Indexing what you see
Free-text Indexing what your computer sees
Gerard Salton (8 March 1927 in Nuremberg – 28 August 1995), also known as Gerry Salton, was a Professor of Computer Science at Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrieval during his time. His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard.
Free-text Indexing mark-up vs. content
Gerard Salton (8 March 1927 in Nuremberg – 28 August 1995), also known as Gerry Salton, was a Professor of Computer Science at Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrieval during his time. His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard.
Free-text Indexing
mark-up
§ Describes how the content should be presented
§ e.g., your browser interprets html mark-up and presents the page as intended by the author
§ Can also define relationships with other documents (e.g., hyperlinks)
§ Can provide evidence of what text is important for search
§ It may also provide useful, “unseen” information!
§ Remember: we ignore mark-up for now
Free-text Indexing Step 1: mark-up removal
Gerard Salton (8 March 1927 in Nuremberg – 28 August 1995), also known as Gerry Salton, was a Professor of Computer Science at Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrieval during his time. His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard.
§ Remove HTML mark-up
Free-text Indexing Step 1: mark-up removal
Gerard Salton (8 March 1927 in Nuremberg – 28 August 1995), also known as Gerry Salton, was a Professor of Computer Science at Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrieval during his time. His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard.
§ Result is plain text
Free-text Indexing Step 2: tokenization
Gerard Salton ( 8 March 1927 in Nuremberg – 28 August 1995 ) , also known as Gerry Salton , was a Professor of Computer Science at Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrieval during his time . His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard .
§ Splitting text into white-space separated `words’ § (Possibly) remove punctuation altogether
§ Not that simple (e.g., ambiguity of punctuation)! § Canada++ —> Canada? Ph.D —> Ph D ?
Free-text Indexing Step 3: case-folding
gerard salton ( 8 March 1927 in Nuremberg – 28 August 1995 ) , also known as gerry salton , was a professor of computer science at cornell university. salton was perhaps the leading computer scientist working in the field of information retrieval during his time . his group at cornell developed the smart information retrieval system, which he initiated when he was at harvard .
§ Typical representation: lower case
§ London —-> london LONDON —> london
§ But: SMART —-> smart March —> march
Free-text Indexing Step 4: stop word removal
gerard salton 8 march 1978 nuremberg 28 august 1995 known gerry salton professor computer science cornell university salton leading computer scientist working field information retrieval time group cornell developed smart information retrieval system initiated harvard
§ Stopwords: words that we choose to ignore because we expect them to not be useful in distinguishing between relevant/non-relevant documents for any query
§ Very frequent words for example (see Zipf’s law later on)
§ Closed class vs. open class words
§ (e.g. determiners and prepositions vs. nouns and verbs)
§ But: `To be or not to be’
Free-text Indexing Final step
§ Apply the pipeline of processing steps to every document in the collection and create an index using the union of all remaining terms
§ We will look at the structure of the index another week
§ As we will see, different structures are possible
§ Different models to match a query against the index, e.g. Boolean, Vector Space, Probabilistic … (next time)
§ Note: the same processing steps need to be applied to the query that gets submitted!
Free-text Indexing Other possible steps
§ Morphological analysis § e.g. initiated —> initiate
§ Stemming
§ e.g. computer/computing —> comput
§ Identification of phrases
§ e.g. ‘leading computer scientist’
Free-text Indexing Other possible steps
§ Named-entity recognition § e.g. ‘cornell university’
§ Relation extraction
§ e.g. is(‘gerry salton’, ‘computer scientist’)
§ … let’s look at some issues in more detail (more to come … watch out for the assignments!)
Free-text Indexing Steps Expanded Zipf’s Law
Free-text Indexing Steps Expanded Zipf’s Law
§ Distribution of word frequencies very skewed
§ A few words occur very often, many words hardly ever occur,
§ e.g., two most common words (“the”, “of”) make up about 10% of all word occurrences in text documents
§ Rank (r) of a word times its frequency (f) is approximately a constant (k), i.e.
f*r=k
Free-text Indexing Steps Expanded Zipf’s Law
Top 50 Words from a collection of news articles (AP89)
Free-text Indexing Steps Expanded Zipf’s Law
AP89: distribution of words (rank-ordered) on a logarithmic scale
Free-text Indexing Steps Expanded Zipf’s Law
Alice in Wonderland
(text available via Project Gutenberg)
Free-text Indexing Steps Expanded Zipf’s Law
European Parliament: Hungarian (transcribed speech, see http://homepages.inf.ed.ac.uk/pkoehn/ )
Free-text Indexing Steps Expanded Zipf’s Law
European Parliament: Hungarian (transcribed speech, see http://homepages.inf.ed.ac.uk/pkoehn/ )
Free-text Indexing Steps Expanded Morphology
§ Many morphological variations of words (with same or similar meaning)
§ inflectional: ‘buy’, ‘bought’, ‘buying’
§ derivational: ‘computer’, ‘compute’, ’computable’ § But more complicated in languages other
than English, e.g.
§ Helsingin sanomat —> Helsinki +GEN + sanoma
+PLURAL
§ Gaststättenneueröffnungsuntergangsgewissheit
(Restaurant+new+opening+failure+certainty)
(Ben Schott “Schottenfreude: German Words for the Human Condition”, John Murray (Publishers), 2013).
Free-text Indexing Steps Expanded Stemming
§ Generally a small but significant effectiveness improvement
§ Can be crucial for some languages, e.g., 5-10% improvement for English, up to 50% in Arabic
Words with the Arabic root ktb
Free-text Indexing Steps Expanded Extracting phrases
§ Many queries are 2-3 word phrases
§ Phrases are more meaningful, less ambiguous and more precise than single words
§ e.g. ‘exam timetable’ vs. ‘timetable’
§ Often via part-of-speech (POS) tagging as
the initial step
§ Alternatively simply word n-grams
Free-text Indexing Steps Expanded Part-of-speech Tagging
§ POS taggers use statistical models of text to predict syntactic tags of words
§ Phrases can then be defined as simple noun groups, for example
Free-text Indexing Steps Expanded Part-of-speech Tagging
§ Examples Noun Phrases
Free-text Indexing
Some possible implementations
§ Mark-up removal
§ e.g., jsoup, Beautiful soup
§ Tokenization:
§ regular expression matching
§ Case-folding:
§ regular expression (matching+substitution)
§ Stopword removal: § stopword lists
Free-text Indexing
Some possible implementations
§ Stemming:
§ e.g. Porter stemmer
§ Part-of-speech tagging, NER: § e.g., Stanford tools
§ Phrase extraction:
§ POS tagger + rule-based POS patterns
§ Relation extraction:
§ e.g., Stanford tools, GATE
Document Representation controlled vocabulary vs. free-text indexing
Cost of assigning index terms
Ambiguity of index terms
Detail of
representation
Controlled Vocabularies
High/Low?
Ambiguous/ Unambiguous?
Can represent arbitrary level of detail?
Free- text Indexing
High/Low?
Ambiguous/ Unambiguous?
Can represent arbitrary level of detail?
Document Representation controlled vocabulary vs. free-text indexing
Cost of assigning index terms
Ambiguity of index terms
Detail of
representation
Controlled Vocabularies
High
Not ambiguous
Can’t represent arbitrary detail
Free- text Indexing
Low
Can be ambiguous?
Any level of detail
§ Both are effective and used often
§ We will focus on free-text indexing
§ cheap and easy
§ most search engines use it (even those that adopt a
controlled vocabulary)
Document representation
1
Document representation
2
Document representation
3
Document representation
4
Summary
§ We focus on ad hoc retrieval
§ Documents need to be indexed before they can be searched
§ Two approaches to indexing:
§ controlled vocabulary vs. free-text indexing
§ Free-text indexing is composed of sequence of typical processing steps (same steps applied to query)
§ Then compare (processed) query with index database
§ Next: how do we match query against index? Ranking?
Summary
§ Here is a “recent” example that combines controlled vocabulary and free-text indexing (read up about it and find out exactly how the two approaches are combined … and why):
Reading
§ Chapter 2
§ Chapter 4.1 and 4.3 § CE706:
§ John Justeson and Slava Katz:”Technical Terminology: some linguistic properties and an algorithm for identification in text”, Natural Language Engineering, 1(1):9–27, 1995. http://journals.cambridge.org/action/displayAb stract?fromPage=online&aid=1313044
Acknowledgements
§ Thanks again to contain substantial material prepared by Udo Kruschwitz and Jaime Arguello
§ Additional material as provided by Croft et al. (2015)