IR H/M Course
Taxonomy of Web Search [Broder 2002]
There are three main classes of queries:
• Navigational queries: to reach a particular site that the user has in mind (aka known-item search)
– Reach a particular webpage/URL
• Informational queries: to acquire some information assumed to be present on one or more webpages.
– Reading/bookmarking/printing are the only further actions the user takes
• Transactional queries: to reach a site where further interaction will happen (e.g. shopping, downloading, gaming)
– User further engages/interacts with websites in the results list
It is often hard to infer intent from a query
Tailored Results Depending on Intent
Source: Searchengineland.com
Queries on the Web
• Commercial web search engines do not disclose their search logs
• However, there have been a number of research studies/reports showing that:
– Queries: 20% (navigational), 48% (informational), 30% (transactional) [Broder 2002]
– Bing also reported in 2011 that ~30% queries are navigational
– 33% of a user’s queries are repeat queries; 87% of the time the user
would click on the same result [Teevan et al., 2005]
– Up to 40% of queries are re-finding [Teevan et al., 2007]
– The average length of queries (in words) has increased from 2.4 (2001) to 3.08 (2011) [Taghavi et al., 2011]
– 16% of queries are ambiguous [Song et al., 2009]
– 12%-16% of queries have local intent [Gravano et al., 2003]
In 2013, Google responded to 2 trillion+ queries
What Makes Web Search Difficult?
Size Diversity Dynamicity
• All of these three characteristics can be observed in:
– The Web (in 2013, Google reported it had indexed 30 trillion pages)
– Web users (estimated at ~2 billion at the end of 2012; ~4.3 billion users in early 2019)
Web Search Engine Costs
• Quality and performance requirements imply large amounts of compute resources, i.e., very large data centers
• Hundreds of thousands of computers; Heavy consumption of energy:
– Between 1.1% and 1.5% of the world’s total energy use in 2010
– Still ~1% in 2018 (following efforts in energy consumption optimization)
– One Google search ≈ 1 kJ ≈ turning on a 60 W light bulb for ~17 secs
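The light-bulb comparison is just energy divided by power: t = E/P = 1000 J / 60 W ≈ 16.7 s.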
Actors in Web Search
• User’s perspective: accessing information
– Relevance
• Search engine’s perspective: monetisation
– Attract more users
– Increase the ad revenue
– Reduce the operational costs
• Advertiser’s perspective: publicity
– Attract more users, c.f. Search Engine Optimisation
– Pay little
Web Search
SEARCH IN CONTEXT
Web Dynamics
• Web is changing over time in many aspects, e.g., size, content, structure and how it is accessed by user interactions or queries
– Size: web pages are added/deleted all the time
– Content: web pages are edited/modified
– Query: users’ information needs change
– Usage: users’ behaviour changes over time
Implications: Crawling, Indexing, Ranking
[Dumais 2012; 2010] [Ke et al., 2006]
Temporal Web Dynamics
• Time is pervasive in information systems
– New documents appear all the time
– Document content changes over time
– Query volume changes over time (e.g. lower on weekends)
• What’s relevant to a query changes over time
– E.g., U.S. Open 2019 (in June vs. Sept)
– E.g., U.S. Open 2019 (before, during, after the event)
• User interaction changes over time
– E.g., anchor text, “likes”, query-click streams, social networks
• Relations between entities change over time
– E.g., President of the U.S. is <> [in 2020 vs. 2012]
Temporal aspects are of considerable importance in Web Search
Content Dynamics
• 34% of pages do not change [Adar et al., 2009]
– avg. change among the rest: every 123 hours
– popular pages change more often
• Previous degree of content change is a good predictor of future change [Fetterly et al. 2003]
Query Dynamics
Search queries exhibit temporal patterns:
– Spikes or seasonality
Description: “Seasonality”
Challenges: resolving ambiguities (e.g. us open) and demoting older pages
Description: “Single Spike”
Challenges: Detecting spikes, crawling and ranking fresh documents (often news)
http://www.google.com/insights/search
Kulkarni et al. 2011
Temporal Reformulations
Users’ information needs change over time
Location in Web Search
• Query volume varies across different locations
• Query popularity varies differently in different locations
• The meaning of a query could be different in different locations
Frequency distribution of different queries during Thanksgiving
Location Ambiguity
Source: Overell 2009
Personalised Search
• Relevance of a document depends on the user’s
context (e.g. time, location, demographics)
• Users have different interests (e.g. sports) which are
reflected in their short and long-term search history
• Queries could be ambiguous for the search engine; personalisation signals help to resolve that
Personalisation in Search
Web Search
RANKING IN WEB SEARCH
Ranking in Web Search
Given a corpus and a query, rank relevant webpages (documents) before non-relevant webpages
corpus: a subset of indexed webpages
query: a short keyword query plus any other user/context information
rank: document scoring function
Goal is to estimate: p(r | d, x)
r: relevance; d: document; x: context (e.g. query, user, location)
Ranking in Web Search
p(r | d, x) ∝ s(fi, M)
• fi: set of ranking features for document i
• M: model parameters
• s: document scoring function (typically a learning to rank method)
fi = [ query term matches in title, query term matches in body, …, document length, popularity ]
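As a rough sketch (not any engine’s actual ranker), the scoring function s(fi, M) could be as simple as a weighted sum of such features; the feature names and weight values below are invented for illustration:

```python
# Minimal sketch: a linear scoring function s(f_i, M) over ranking features.
# Feature names and weight values are illustrative only.

FEATURES = ["title_match", "body_match", "doc_length", "popularity"]

def score(f_i: dict, M: dict) -> float:
    """s(f_i, M): weighted sum of a document's feature values."""
    return sum(M[name] * f_i.get(name, 0.0) for name in FEATURES)

# Hypothetical model parameters M (in practice learned by a learning-to-rank method).
M = {"title_match": 2.0, "body_match": 1.0, "doc_length": -0.01, "popularity": 0.5}

doc_features = {"title_match": 1, "body_match": 3, "doc_length": 120, "popularity": 0.7}
print(score(doc_features, M))
```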
Ranking in Web Search
• The recent history of web search has focused on:
– Developing better ranking features (fi)
– Improving the optimisation of model parameters (M)
– Exploring alternative learning to rank methods (s)
End-to-end deep learning architectures (e.g. dense retrieval approaches) are also just emerging
Recall: Ranking Features
• Query dependent features: query term matches in title, query term matches in body, …
• Query features: number of query terms, user id, query class, …
• Query independent features: popularity, pagerank, …
Learning to rank models determine the importance and combination of features (see Lecture 9)
Ranking Features
• Important to select features likely to be correlated with relevance
– Term matching scores (e.g. BM25, language model, PL2)
– Field matching scores (e.g. PL2F)
• Web environment invites unique features:
– Link structure
– User signals
– Dynamics
Query Independent Features
• There is a lot of junk on the web (e.g. spam, irrelevant forums)
• Knowing what users are reading is a valuable source for knowing what is not junk
• Ideally, we would be able to monitor everything the user is reading and use that information for ranking; this is achieved through toolbars, browsers, operating systems, DNS
• In 1998, no search companies had browsing data.
How did they address this lack of data?
Basic Concepts
Page A Hyperlink Page B
Intuition 1: A hyperlink connotes a conferral of authority on page B, by the author of page A (quality signal) [means of recommendation]
Intuition 2: The anchor of the hyperlink describes the target page (textual context) [means of content enhancement]
Link Analysis
• Link-based ranking: Use hyperlinks to rank web documents
– Use link counts as simple measures of popularity/authority
– In-degree: counts the number of in-links to a given webpage
In-degree (D1) = 3 Out-degree (D1) = 1
• The 2nd generation web search engines ranked webpages by combining:
– Content-based evidence: e.g. scores obtained using vector space model, probabilistic model, etc.
– Authoritativeness: link-based ranking
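A minimal sketch of the simplest link-based measure mentioned above (in-degree / out-degree counts); the toy edge list is invented so that D1 gets in-degree 3 and out-degree 1, matching the example:

```python
# Sketch: in-degree / out-degree as simple link-based popularity measures.
# The toy graph is illustrative only; edges are (source, target) pairs.
from collections import Counter

edges = [("D2", "D1"), ("D3", "D1"), ("D4", "D1"), ("D1", "D2")]

in_degree = Counter(dst for _, dst in edges)
out_degree = Counter(src for src, _ in edges)

print(in_degree["D1"], out_degree["D1"])   # 3 1, as in the example above
```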
PageRank: Random Surfer Model [Brin and Page 1998]
• Claimed to be less susceptible to link spam than simpler link analysis (e.g. in-degree)
• Simulates a very large number of users browsing the entire web
• Let users browse randomly. This is a naïve assumption but works okay in practice
• Observe how often pages get visited
• The authoritativeness of a page is a function of its popularity in the simulation
PageRank (Simple Concepts)

Adjacency matrix A of an example web graph with 4 pages (Aij = 1 iff page i links to page j):

A = | 0 1 1 1 |
    | 0 0 1 1 |
    | 1 0 0 0 |
    | 1 0 1 0 |

Transition matrix T (each row of A divided by that page's number of outgoing links):

T = |  0  1/3 1/3 1/3 |
    |  0   0  1/2 1/2 |
    |  1   0   0   0  |
    | 1/2  0  1/2  0  |

M = T^T, i.e.
Mij = 0 if there is no link from page j to page i
Mij = 1/nj otherwise, with nj the number of outgoing links of page j

M = |  0   0   1  1/2 |
    | 1/3  0   0   0  |
    | 1/3 1/2  0  1/2 |
    | 1/3 1/2  0   0  |
PageRank Computation

With M from the previous slide, the PageRank scores x satisfy M·x = x, i.e.:

x1 = x3 + x4/2
x2 = x1/3
x3 = x1/3 + x2/2 + x4/2
x4 = x1/3 + x2/2

The scores can be obtained by the power iteration method:

x = lim (k→∞) M^k x0, where x0 is some initial column vector with non-zero entries.

x converges to the principal eigenvector of M.
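A minimal power-iteration sketch in Python (using numpy) for the 4-page example above; the iteration count, tolerance and sum-to-one normalisation are just one possible set of choices:

```python
# Sketch: power iteration x <- M x for the 4-page example above.
import numpy as np

M = np.array([
    [0,   0,   1, 1/2],
    [1/3, 0,   0, 0  ],
    [1/3, 1/2, 0, 1/2],
    [1/3, 1/2, 0, 0  ],
])

x = np.full(4, 1/4)           # start from a uniform distribution
for _ in range(100):          # fixed maximum number of iterations for simplicity
    x_next = M @ x
    x_next /= x_next.sum()    # keep scores summing to 1
    if np.allclose(x, x_next, atol=1e-10):
        break
    x = x_next

print(np.round(x, 3))         # (a scaling of) the principal eigenvector of M
```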
[Figure: example PageRank scores for the four pages]
Possible Problems

• Dangling nodes: a node with no outgoing links (e.g. a pdf page or a music file)
• Disconnected graphs
• Rank sinks: some nodes accumulate inflated PageRank scores

Example (disconnected graph):

M = | 0 1 0 0  0  |
    | 1 0 0 0  0  |
    | 0 0 0 1 1/2 |
    | 0 0 1 0 1/2 |
    | 0 0 0 0  0  |

M·x = x is satisfied by
x(1) = [1/2 1/2 0 0 0]^T
x(2) = [0 0 1/2 1/2 0]^T
and by any linear combination of x(1) and x(2), so the PageRank scores are not unique.
Solution: Teleportation (1)

G = (1 − λ)M + λS

where S ∈ R^(n×n) with Sij = 1/n, and 0 < λ < 1 is a teleportation parameter.
G is the Google matrix.

For the example above:

G = (1 − λ) | 0 1 0 0  0  |  +  λ | 1/5 1/5 1/5 1/5 1/5 |
            | 1 0 0 0  0  |       | 1/5 1/5 1/5 1/5 1/5 |
            | 0 0 0 1 1/2 |       | 1/5 1/5 1/5 1/5 1/5 |
            | 0 0 1 0 1/2 |       | 1/5 1/5 1/5 1/5 1/5 |
            | 0 0 0 0  0  |       | 1/5 1/5 1/5 1/5 1/5 |
Solution: Teleportation (2)

For example, setting λ = 0.15:

G = | 0.03 0.88 0.03 0.03 0.03  |
    | 0.88 0.03 0.03 0.03 0.03  |
    | 0.03 0.03 0.03 0.88 0.455 |
    | 0.03 0.03 0.88 0.03 0.455 |
    | 0.03 0.03 0.03 0.03 0.03  |

The unique PageRank score vector is given by
x = [0.2 0.2 0.285 0.285 0.03]

G is the Google matrix after teleportation.
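A minimal sketch (using numpy) that builds G for this 5-page example with λ = 0.15 and runs power iteration; it should recover the PageRank vector quoted above:

```python
# Sketch: G = (1 - lam) * M + lam * S for the 5-page example, lam = 0.15.
import numpy as np

M = np.array([
    [0, 1, 0, 0, 0  ],
    [1, 0, 0, 0, 0  ],
    [0, 0, 0, 1, 1/2],
    [0, 0, 1, 0, 1/2],
    [0, 0, 0, 0, 0  ],
], dtype=float)

n = M.shape[0]
lam = 0.15
S = np.full((n, n), 1 / n)            # uniform teleportation matrix
G = (1 - lam) * M + lam * S           # the Google matrix

x = np.full(n, 1 / n)
for _ in range(1000):                 # power iteration on G
    x = G @ x
    x /= x.sum()

print(np.round(x, 3))                 # approx [0.2, 0.2, 0.285, 0.285, 0.03]
```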
PageRank-based Ranking
[Brin and Page 1998]
Extensions and Issues
• Extensions
– Personalized PageRank [Haveliwala 2003]
– PageRank-directed crawling [Cho and Schonfeld 2007]
– PageRank without links [Kurland and Lee 2005]
– Build a non-random surfer [Meiss et al. 2010]
• Issues
– Need a web graph stored in an efficient data structure
– PageRank requires taking powers of a very, very large matrix
– PageRank is an approximation of visitation
How Important is PageRank? [Najork et al. 2007]
• Claim: PageRank is a very important ranking feature reflecting authority
• Test 1: Compare the empirical performance of PageRank as a ranking feature to:
– Classic text match features
– Simpler models (e.g. number of links pointing to a page)
• Test 2: Compare the empirical performance of PageRank when combined with other ranking signals
Isolated Signals
Corpus: 463 billion webpages
Relevance: 28k queries, with 17 relevance judgments each
Experiment: for each query, for each feature, rank documents, then compare average performance
[Najork et al. 2007]
Signals Combined with BM25F
[Najork et al. 2007]
So, How Important is PageRank?
• Evidence from [Najork et al. 2007] does not support the claim that PageRank is superior to simpler models of usage (e.g. in-degree)
• Similar conclusions were reached using TREC web test collections, with no clear benefits in using PageRank
– In general, link-based ranking features are weak features
Other Query Independent Features
• URL features: number of slashes in URL, length of URL
• Document spam score [Cormack et al. 2011]
• Text quality score: score combining various text quality features (e.g. readability)
• Click-through rate: observed CTR of the page in search results [Dupret and Liao 2010]
• Dwell time: average time spent by the users on the page
• Page load time: average time it takes to receive the page
from its server
• Social media links [Dong et al. 2010]
Query Dependent Features
Using document structure
• Vector space model scores
• BM25, PL2, LM scores
• Proximity scores
• BM25F, PL2F
Query Dependent Features
• Include anchor text as a source of text information [Robertson et al. 2004]
• Similar to manual document keyword assignment
Anchor Text Importance
• Anchor text tends to be short, descriptive, and similar to query text
• Helps when descriptive text in destination page is embedded in image logos rather than in accessible text
• Increases content for popular pages with many incoming links, increasing recall of these pages
Armonk, NY-based computer giant IBM announced today
www.ibm.com
Joe’s computer hardware links
Dell HP IBM
Big Blue today announced record profits for the quarter
Indexing Anchor Text
• Retrieval experiments have shown that anchor text has significant impact on effectiveness for some types of queries
– i.e., more than PageRank
• However, it can have unexpected side effects:
– Sometimes anchor text is not useful: “click here”
– Orchestrated campaigns: e.g., “evil empire” as in: Evil Empire
– IBM’s copyright page (high term freq. for ‘ibm’)
– Spam links pointing to itself
• Solution: Anchor text field can have less weight than the body field
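A minimal sketch of this field-weighting idea, loosely in the spirit of BM25F: anchor-text term frequencies are folded into a combined frequency with a smaller weight than body text. The weight values are invented for illustration, not the formula of any particular engine.

```python
# Sketch: give the anchor-text field a lower weight than the body field.
# Weight values are illustrative only.

def combined_tf(tf_body: int, tf_anchor: int,
                w_body: float = 1.0, w_anchor: float = 0.3) -> float:
    """Fold per-field term frequencies into one weighted frequency (BM25F-style)."""
    return w_body * tf_body + w_anchor * tf_anchor

# A page whose body mentions the term 3 times vs. a page that only receives
# the term through many (possibly orchestrated) anchors:
print(combined_tf(tf_body=3, tf_anchor=0))    # 3.0
print(combined_tf(tf_body=0, tf_anchor=8))    # 2.4 -> anchors help, but count less
```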
HITS: Hyperlink-Induced Topic Search [Kleinberg 1998]
• A query-dependent feature: link analysis is carried out over a query-induced graph
• In response to a query, instead of a single ranked list of webpages, compute two inter-related scores:
– Hub scores are high for pages that provide lots of useful links to content pages (topic authorities)
– Authority scores are high for pages that have many incoming
links from good hubs for the subject
• Mutually recursive process: good hubs point to good authorities; good authorities are pointed to by good hubs
Hubs Authorities
What are the issues in incorporating anchor text as an index feature for web retrieval?
HITS Process
1. Send query q to an IR system to obtain the root set R
of retrieved top-k pages
2. Expand the root set by radius one to obtain an expanded graph S (base set)
– Add pages with outgoing links to pages in the root set
– Add pages with incoming links from pages in the root set
3. Maintain for each page p ∈ S:
– Authority score: ap (vector a)
– Hub score: hp (vector h)
with Σp (ap)² = 1 and Σp (hp)² = 1
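A minimal sketch of step 2 (expanding the root set by radius one); the out_links/in_links dictionaries stand in for look-ups into a web-graph index and are invented for illustration:

```python
# Sketch: expand the root set R by radius one to obtain the base set S.
# 'out_links' and 'in_links' stand in for look-ups into a web-graph index.

def expand_root_set(R, out_links, in_links):
    S = set(R)
    for p in R:
        S.update(out_links.get(p, []))   # pages the root pages link to
        S.update(in_links.get(p, []))    # pages linking to the root pages
    return S

out_links = {"p1": ["p2"], "p2": ["p3"]}
in_links = {"p1": ["p4"], "p2": ["p1"]}
print(expand_root_set({"p1", "p2"}, out_links, in_links))
```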
HITS Iterative Algorithm

Initialize for all p ∈ S: ap = hp = 1
For i = 1 to k:
– For all p ∈ S: ap = Σ{q: q→p} hq   (update authority scores)
– For all p ∈ S: hp = Σ{q: p→q} aq   (update hub scores)
– For all p ∈ S: ap = ap/c, with c chosen so that Σp (ap/c)² = 1   (normalize a)
– For all p ∈ S: hp = hp/c, with c chosen so that Σp (hp/c)² = 1   (normalize h)

a converges to the principal eigenvector of A^T A and h converges to that of A A^T, where A is the adjacency matrix of S: Aij = 1 for i ∈ S, j ∈ S iff i→j
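A minimal sketch of these updates (using numpy) on a toy adjacency matrix; the matrix and the iteration count are invented for illustration:

```python
# Sketch: HITS updates a = A^T h, h = A a, with L2 normalisation each round.
import numpy as np

# Toy adjacency matrix of the base set S: A[i, j] = 1 iff page i links to page j.
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
])

n = A.shape[0]
a = np.ones(n)                    # authority scores
h = np.ones(n)                    # hub scores
for _ in range(50):
    a = A.T @ h                   # a_p = sum of hub scores of pages linking to p
    h = A @ a                     # h_p = sum of authority scores of pages p links to
    a /= np.linalg.norm(a)        # keep sum of squares equal to 1
    h /= np.linalg.norm(h)

print(np.round(a, 3), np.round(h, 3))
```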
Comparative Example
[Figure: authority, hub, and PageRank scores compared on an example graph]
Other Query Dependent Features
• Matches with a user’s browsing history [Shen 2005]
• Matches with a user’s reading level [Kim et al. 2012]
• Number of previous clicks on a document for a given query
Query Features
• Number of query words: query length often suggests different retrieval strategies
– short queries: navigational (e.g. facebook)
– long queries: topic-based (e.g. ‘tutorial on expectation-maximization algorithms’)
• Trigger words: certain words in the query suggest different retrieval strategies
– e.g. ‘time in mumbai’, ‘brooklyn weather’
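A toy sketch of turning these observations into query features; the trigger-word list and the length threshold are invented for illustration:

```python
# Sketch: derive simple query features that can hint at different retrieval strategies.
# Trigger-word list and length threshold are illustrative only.

TRIGGER_WORDS = {"time", "weather", "map", "news"}

def query_features(query: str) -> dict:
    terms = query.lower().split()
    return {
        "num_terms": len(terms),
        "has_trigger_word": any(t in TRIGGER_WORDS for t in terms),
        "looks_navigational": len(terms) <= 2,   # e.g. 'facebook'
    }

print(query_features("facebook"))
print(query_features("time in mumbai"))
print(query_features("tutorial on expectation-maximization algorithms"))
```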
Modern Rankers
• Modern rankers take 1000s of features (including Neural Net approaches like BERT)
• PageRank and/or other link analysis-based features are just a few of them
• Text matching is not everything
• Anchor text can be regarded as a reference
• Clicks can be considered as votes
• Context of user (location, time, etc.) is very important.
Hand-tuning this many features is infeasible, but machine learning algorithms can be used to tune a ranking function
Training Data in Learning to Rank
• Training query sets (ideally representative of real search logs)
– Need to sample both prevalent queries in the query logs and less frequent queries
– The machine learning algorithm will learn only for those types of queries it is trained on
• Several documents collected for each query via “pooling”
– e.g. take the top m documents from k different ranking features (or systems).
• Query-document pairs are “judged” by professional/paid editors
– Recruit editors who will be able to understand the task and relevance
– Inappropriate editors = bad data = wasted time/effort
• (Absolute) relevance judgments are often multi-graded (e.g. Bad, Good, Perfect)
– Could be a preference judgment (doci is more relevant than docj for the query)
– Bad guidelines = bad data = wasted time/effort
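A minimal sketch of the pooling step described above: take the union of the top-m documents from k different rankings. The example rankings are hypothetical:

```python
# Sketch: build a judgment pool as the union of top-m documents from k rankings.

def pool(rankings, m=3):
    """rankings: list of ranked document-id lists, one per system/feature."""
    pooled = set()
    for ranking in rankings:
        pooled.update(ranking[:m])
    return pooled

rankings = [
    ["d1", "d2", "d3", "d4"],    # e.g. a BM25 ranking
    ["d3", "d5", "d1", "d6"],    # e.g. a PageRank-weighted ranking
]
print(pool(rankings, m=3))       # documents sent to editors for judging
```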
Relevance Judgments in TREC
Retrieval Strategy (s)
• There are many ways to structure a relationship between input features and relevance scores
– linear models
– decision trees
– support vector machines
– …
• Each learning to rank strategy carries a unique set of parameters and methods to learn them
• C.f. Lecture 9 (Learning to Rank) for more details on learning to rank methods
Validating the System
• How well is the system doing and/or how well is a newly introduced feature doing?
Baseline Ranking Algorithm Proposed Ranking Algorithm
Which is better?
Radlinski, 2015
See Evaluation Lecture