
Data Mining and Machine Learning
Lecture 5
Query Expansion
Peter Jančovič

Objectives
• To understand how the use of semantic relationships between words can improve the performance of a text IR system
– Query expansion
– Generalisation
– Synonyms, hypernyms & hyponyms
– WordNet

Query Processing
• Remember how we previously processed a query (a code sketch follows below):
• Example:
– “I need information on distance running”
• Stop word removal
– information, distance, running
• Stemming
– information, distance, run
• But what about:
– “The London marathon will take place…”
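Below is a minimal Python sketch of this pipeline. It is illustrative only: the stop-word list is a tiny hand-made one, and NLTK's Porter stemmer (an assumed dependency, installed with `pip install nltk`) stems more aggressively than the example above ("information" becomes "inform").

```python
# Minimal sketch of the query-processing pipeline above (not the exact
# stemmer/stop list used in the lecture).
from nltk.stem import PorterStemmer

STOP_WORDS = {"i", "need", "on", "the", "a", "an", "of", "to", "will"}  # tiny illustrative list

def process(text):
    stemmer = PorterStemmer()
    terms = [w.strip('.,?!“”').lower() for w in text.split()]
    terms = [w for w in terms if w and w not in STOP_WORDS]   # stop word removal
    return [stemmer.stem(w) for w in terms]                   # stemming

print(process("I need information on distance running"))
# -> ['inform', 'distanc', 'run']  (Porter stems more aggressively than the slide)
```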

Query Expansion
• Add terms to the query to increase the overlap between it and potentially relevant documents…
• …but not irrelevant documents
• Two approaches:
– User feedback
– Linguistic knowledge

Feedback-based Query Expansion
• User provides feedback on the results of retrieval
– Which of the returned documents are particularly relevant
– Which are irrelevant

Query reformulation
• Revise the query in response to the user feedback
– Query expansion: Find terms in the ‘relevant’ documents that are not in the query. Add them to the query (or maybe just those with large TF-IDF weights)
– Term reweighting: Increase the weight of query terms in relevant documents and decrease the weight of query terms in irrelevant documents. For example (see the sketch below):
$w_{td} = \lambda\, f_{td}\, \mathrm{IDF}(t)$
– Various methods for determining λ have been proposed
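The sketch below illustrates one way such reweighting could be implemented, assuming the weight $w_{td} = \lambda\, f_{td}\, \mathrm{IDF}(t)$ above. The particular update rule for λ (boost terms that appear in relevant documents, shrink terms that appear in irrelevant ones) is a made-up illustration, not a method from the lecture.

```python
# Illustrative feedback-based term reweighting: w_td = lambda_t * f_td * IDF(t).
# The boost/shrink update for lambda_t is an assumption made for this sketch.
def feedback_lambdas(query_terms, relevant_docs, irrelevant_docs,
                     boost=1.5, shrink=0.7):
    lam = {t: 1.0 for t in query_terms}
    for t in query_terms:
        if any(t in d for d in relevant_docs):      # term occurs in a relevant doc
            lam[t] *= boost
        if any(t in d for d in irrelevant_docs):    # term occurs in an irrelevant doc
            lam[t] *= shrink
    return lam

def reweighted_weight(term, doc_terms, idf, lam):
    f_td = doc_terms.count(term)                    # term frequency in document d
    return lam.get(term, 1.0) * f_td * idf.get(term, 0.0)
```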


Knowledge-Based Query Expansion
• Recall:
• q = “I need information on distance running”
• d = “The London marathon will take place…”
• We know there is a relationship between
– run, distance, and marathon
• Words with the same meaning are synonyms
• If a query q contains w1 and w2 is a synonym of w1, then add w2 to q
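A small sketch of this rule using NLTK's WordNet interface (one possible source of synonyms; the lecture does not prescribe a toolkit, and the WordNet data must have been downloaded with nltk.download('wordnet')):

```python
# Sketch of synonym-based expansion with NLTK's WordNet interface
# (requires nltk.download('wordnet') to have been run).
from nltk.corpus import wordnet as wn

def expand_with_synonyms(query_terms):
    expanded = set(query_terms)
    for t in query_terms:
        for synset in wn.synsets(t):              # all senses of the term
            for lemma in synset.lemmas():         # all words in each synset
                expanded.add(lemma.name().replace("_", " ").lower())
    return expanded

print(expand_with_synonyms(["marathon", "running"]))
```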

Thesaurus
• A thesaurus is a ‘dictionary’ of synonyms and semantically related words and phrases
• E.g.: Roget’s Thesaurus
• Example:
physician
syn: || croaker, doc, doctor, MD, medical, mediciner, medico ||
rel: medic, general practitioner, surgeon

Peter Mark Roget 1779–1869
• Born London 1779
• Founder of the Royal Society of Medicine
• Invented the log-log slide rule
• Professor of Physiology at the Royal Institution, 1834
• Retired 1840
• Roget’s Thesaurus of English Words and Phrases Classified and Arranged so as to Facilitate the Expression of Ideas and Assist in Literary Composition appeared in 1852.
• Died 1869. Buried St James’ Church, West Malvern, Worcestershire.

Hyponyms
• Synonyms are not the only words that are useful for query expansion
• Query q = “Tell me about England”
• Document d = “A visit to London should be on everyone’s itinerary”
• ‘London’ is a hyponym of ‘England’
• Hyponym ~ subordinate ~ subset
• If a query q contains a word w1 and w2 is a hyponym of w1, then w2 should be added to q

Hypernyms
• Hypernyms are also useful for query expansion
• Query q = “Tell me about England”
• Document d = “Places to visit in the British Isles”
• ‘British Isles’ is a hypernym of ‘England’
• Hypernym ~ generalisation ~ superset
• If a query q contains a word w1 and w2 is a hypernym of w1, then w2 should be added to q

WordNet
• Online lexical database for the English Language
• http://www.cogsci.princeton.edu/~wn
Category     Forms     Meanings (syn sets)
Nouns        57,000    48,800
Adjectives   19,500    10,000
Verbs        21,000    8,400
See Belew, chapter 6

WordNet
• Organised as a set of hierarchical trees
• For example, 25 trees for nouns
• ‘Children’ of a node are hyponyms
• Words become more specific as you move deeper into the tree
[Figure: fragment of the noun hierarchy: ‘British Isles’ (hypernym) → ‘England’ → ‘London’, ‘Birmingham’ (hyponyms)]
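The hierarchy can be explored programmatically. The sketch below uses NLTK's WordNet interface (an assumption; any WordNet API would do) with a common noun, since place names such as ‘England’ and ‘London’ are typically linked through instance_hypernyms() in current WordNet releases.

```python
# Sketch of walking WordNet's noun hierarchy with NLTK (WordNet data assumed
# downloaded).  hypernyms() moves up to more general synsets, hyponyms()
# moves down to more specific ones.
from nltk.corpus import wordnet as wn

car = wn.synsets("car", pos=wn.NOUN)[0]
print(car.hypernyms())    # more general, e.g. a motor-vehicle synset
print(car.hyponyms())     # more specific kinds of car
```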

Noun Categories
act, action, activity
natural object
animal, fauna
natural phenomenon
artefact
person, human being
attribute, property
plant, flora
body, corpus
possession
cognition, knowledge
process
communication
quantity, amount
event, happening
relation
feeling, emotion
shape
food
state, condition
group, collection
substance
location, place
time
motive

Query-document scoring
• A query q is expanded to include synonyms, hyponyms and hypernyms
• Recall that for a document d
w f IDFt
td td
w w
Slide 15
Simq,d tqd
Data Mining and Machine Learning
td tq dq

Query expansion
• Suppose:
– t is the original term in the query,
– t’ is a synonym or hyponym of t which occurs in d
• Then we could define:
$w_{t'd} = \lambda_{tt'}\, f_{t'd}\, \mathrm{IDF}(t)$, with $0 \le \lambda_{tt'} \le 1$
• Where $\lambda_{tt'}$ is a weighting depending on how ‘far’ t and t’ are apart according to WordNet ($\lambda_{tt} = 1$)
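One way the λ weighting could enter the similarity numerator is sketched below. The expansion table mapping each query term to (related term, λ) pairs is hypothetical; in practice it would be derived from WordNet.

```python
# Sketch of lambda-weighted expansion in the similarity numerator:
# an expansion term t' occurring in d contributes lambda_tt' * f_t'd * IDF(t).
# The 'expansions' table is hypothetical -- it would come from WordNet.
def expanded_numerator(query_weights, doc_tf, idf_table, expansions):
    """query_weights: {t: w_tq}; doc_tf: {t: f_td};
    expansions: {t: [(t_prime, lam), ...]} including (t, 1.0) itself."""
    total = 0.0
    for t, w_tq in query_weights.items():
        for t_prime, lam in expansions.get(t, [(t, 1.0)]):
            if t_prime in doc_tf:
                w_t_prime_d = lam * doc_tf[t_prime] * idf_table.get(t, 0.0)
                total += w_t_prime_d * w_tq
    return total
```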

Example
• Query q is:
– Is the Dark Knight on at the town cinema?
– q becomes: dark knight town cinema
• Document d is:
– The latest Batman movie places the caped crusader in a dark urban environment
– d becomes: late batman move cape crusade dark urban environment

Example (continued)
• In the similarity calculation, q ∩ d = {dark}
• But:
– move and cinema are synonyms (compare “go to the cinema” with “go to the movies”)
– crusader is a hyponym of knight
– urban is a hypernym of town
• Therefore, after query expansion,
q ∩ d = {dark, move (syn(cinema)), crusade (hypo(knight)), urban (hyper(town))}

Example (continued)
• So, if λ = 1, λ_syn = 0.8, λ_hypo = 0.5 and λ_hyper = 0.3, then the numerator in the calculation of sim(q, d) becomes:
$w_{dark,d}\, w_{dark,q} + 0.8\, w_{movie,d}\, w_{cinema,q} + 0.5\, w_{crusader,d}\, w_{knight,q} + 0.3\, w_{urban,d}\, w_{town,q}$
Note: this is just a ‘made up’ example. I haven’t consulted WordNet for synonym, hyponym or hypernym information and the weights λ are just for illustration
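For concreteness, a toy numerical check of that numerator with made-up weights (both the weights and the λ values are purely illustrative, as the note above says):

```python
# Toy numerical check of the numerator above, with made-up weights w.
w_d = {"dark": 2.0, "movie": 1.5, "crusader": 1.2, "urban": 1.0}   # document side
w_q = {"dark": 1.0, "cinema": 1.0, "knight": 1.0, "town": 1.0}     # query side

numerator = (1.0 * w_d["dark"]     * w_q["dark"]
             + 0.8 * w_d["movie"]    * w_q["cinema"]     # synonym
             + 0.5 * w_d["crusader"] * w_q["knight"]     # hyponym
             + 0.3 * w_d["urban"]    * w_q["town"])      # hypernym
print(numerator)   # 2.0 + 1.2 + 0.6 + 0.3 = 4.1
```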

Example (continued)
• The drawback of query expansion is that as well as increasing the overlap between a query q and a relevant document d, it may also increase the overlap with an irrelevant document
• Consider:
• The crusades were a dark period in our history when knights moved from across Europe to join crusades to the holy land
• This becomes: crusade dark period history knight move europe crusade holy land

Example (continued)
• In this case
q ∩ d = {dark, knight, move (syn(cinema)), 2 × crusade (hypo(knight)), land (hyper(town))}
• This document is likely to score a higher similarity than the previous one
• So, the challenge is:
• Expand queries enough to promote overlap with relevant documents…
• …but not so much that they overlap with irrelevant documents

Summary
• Query expansion
– Feedback-based
– Knowledge-based: synonyms, hyponyms and hypernyms
• Goal is to increase the overlap between the query and relevant documents
• WordNet
• Generalisation
• Example “toy” calculation