CIS 455/555: Internet and Web Systems
Google
November 18, 2020
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
Plan for today
n Classic IR models
n HITS
n Hubs and authorities
n PageRank
n Iterative computation
n Random-surfer model
n Refinements: Sinks and Hogs
n Google
Google’s PageRank (Brin/Page 98)
n A technique for estimating page quality
n Based on web link graph, just like HITS
n Like HITS, relies on a fixpoint computation
n Important differences to HITS:
n No hubs/authorities distinction; just a single value per page
n Query-independent
n Results are combined with IR score
n Think of it as: TotalScore = IR score * PageRank
n In practice, search engines use many other factors (for example, Google says it uses more than 200)
PageRank: Intuition
[Figure: example web graph with pages A through J. Callouts: "Shouldn't E's vote be worth more than F's?" and "How many levels should we consider?"]
n Imagine a contest for The Web’s Best Page
n Initially, each page has one vote
n Each page votes for all the pages it has a link to
n To ensure fairness, pages voting for more than one page must split their vote equally between them
n Voting proceeds in rounds; in each round, each page has the number of votes it received in the previous round
n In practice, it’s a little more complicated – but not much!
PageRank
n Each page i is given a rank xi
n Goal: Assign the x_i such that the rank of each page is governed by the ranks of the pages linking to it:
x_i = Σ_{j ∈ B_i} x_j / N_j

where x_i is the rank of page i, B_i is the set of pages j that link to i, and N_j is the number of links out from page j.

How do we compute the rank values?
Iterative PageRank (simplified)
Initialize all ranks to be equal, e.g.:
x_i^(0) = 1/n
Iterate until convergence
x_i^(k+1) = Σ_{j ∈ B_i} x_j^(k) / N_j
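A minimal Python sketch of this iteration (the graph encoding, tolerance, and names are illustrative, not from the slides):

def pagerank_simple(links, tol=1e-6):
    """Simplified iterative PageRank. `links` maps each page to the pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}          # x_i^(0) = 1/n
    while True:
        new_rank = {p: 0.0 for p in pages}
        for j, outs in links.items():                     # page j splits its rank over its out-links
            for i in outs:
                new_rank[i] += rank[j] / len(outs)
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank

# A three-page graph consistent with the example on the following slides (the exact
# link structure is an assumption reconstructed from the figures):
print(pagerank_simple({'A': ['B'], 'B': ['A', 'C'], 'C': ['A']}))
# -> roughly {'A': 0.4, 'B': 0.4, 'C': 0.2}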
Example: Step 0
Initialize all ranks to be equal
x_i^(0) = 1/n

[Figure: three-page example graph; each page starts with rank 0.33]
Example: Step 1
Propagate weights across out-edges
x_i^(k+1) = Σ_{j ∈ B_i} x_j^(k) / N_j

[Figure: each page sends its rank along its out-edges, split evenly; weights of 0.33, 0.33, 0.17, and 0.17 flow along the edges]
Example: Step 2
Compute weights based on in-edges
x_i^(1) = Σ_{j ∈ B_i} x_j^(0) / N_j

[Figure: after one step, the pages have ranks 0.5, 0.33, and 0.17]
Example: Convergence
x_i^(k+1) = Σ_{j ∈ B_i} x_j^(k) / N_j

[Figure: the ranks converge to 0.4, 0.4, and 0.2]
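As a check, here are the fixpoint equations for one link structure that is consistent with these figures (pages A, B, C, where A links to B, B links to A and C, and C links to A; this exact structure is a reconstruction, since the original figure is not reproduced here):

x_A = x_B/2 + x_C = 0.2 + 0.2 = 0.4
x_B = x_A = 0.4
x_C = x_B/2 = 0.2

Each page's rank equals the rank flowing in from its back-links, so the iteration has converged.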
Naïve PageRank Algorithm Restated
n Let
n N(p) = number of outgoing links from page p
n B(p) = set of pages with back-links to page p
PageRank(p) = Σ_{b ∈ B_p} PageRank(b) / N(b)
n Each page b distributes its importance to all of the pages it points to (so we scale by 1/N(b))
n Page p’s importance is increased by the importance of its back set
Linear Algebra formulation
n Create an m x m matrix M to capture links:
n M(i,j) = 1/n_j if page i is pointed to by page j and page j has n_j outgoing links
n M(i,j) = 0 otherwise
n Initialize all PageRanks to 1, multiply by M repeatedly until all values converge:
[ PageRank(p_1') ]         [ PageRank(p_1) ]
[ PageRank(p_2') ]  =  M * [ PageRank(p_2) ]
[      ...       ]         [      ...      ]
[ PageRank(p_m') ]         [ PageRank(p_m) ]
n Computes principal eigenvector via power iteration
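A minimal sketch of this power iteration with NumPy (the iteration count is illustrative; the matrix below matches the three-page example on the next slide as reconstructed there):

import numpy as np

def pagerank_power(M, iters=50):
    """Power iteration: repeatedly multiply the rank vector by the link matrix M."""
    r = np.ones(M.shape[0])        # initialize all PageRanks to 1
    for _ in range(iters):
        r = M @ r                  # r' = M r
    return r

# Column j holds 1/n_j for every page that page j links to (g, y, a ordering):
M = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5],
              [1.0, 0.5, 0.0]])
print(pagerank_power(M))           # approaches [1.0, 0.67, 1.33]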
A Brief Example
[Figure: three pages, Google (g), Yahoo (y), and Amazon (a), linking to each other]

[ g' ]   [ 0    0.5   0.5 ]   [ g ]
[ y' ] = [ 0    0     0.5 ] * [ y ]
[ a' ]   [ 1    0.5   0   ]   [ a ]

Running for multiple iterations:

[ g ]   [ 1 ]   [ 1   ]   [ 1    ]         [ 1    ]
[ y ] = [ 1 ] , [ 0.5 ] , [ 0.75 ] , … ,   [ 0.67 ]
[ a ]   [ 1 ]   [ 1.5 ]   [ 1.25 ]         [ 1.33 ]

Total rank sums to the number of pages
Oops #1 – PageRank sinks
[ g' ]   [ 0     0   0.5 ]   [ g ]
[ y' ] = [ 0.5   0   0.5 ] * [ y ]
[ a' ]   [ 0.5   0   0   ]   [ a ]

Yahoo is a 'dead end' (no outgoing links), so PageRank is lost after each round.

Running for multiple iterations:

[ g ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]         [ 0 ]
[ y ] = [ 1 ] , [ 1   ] , [ 0.5  ] , … ,   [ 0 ]
[ a ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]         [ 0 ]
Oops #2 – PageRank hogs
[ g' ]   [ 0     0   0.5 ]   [ g ]
[ y' ] = [ 0.5   1   0.5 ] * [ y ]
[ a' ]   [ 0.5   0   0   ]   [ a ]

Yahoo links only to itself, so PageRank cannot flow out and accumulates in a PageRank 'sink'.

Running for multiple iterations:

[ g ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]         [ 0 ]
[ y ] = [ 1 ] , [ 2   ] , [ 2.5  ] , … ,   [ 3 ]
[ a ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]         [ 0 ]
Random Surfer Model
n PageRank has an intuitive basis in random walks on graphs
n Imagine a random surfer, who starts on a random page and, in each step,
n with probability d, clicks on a random link on the page
n with probability 1-d, jumps to a random page (bored?)
n The PageRank of a page can be interpreted as the fraction of steps the surfer spends on the corresponding page
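A small Monte Carlo sketch of the random-surfer interpretation (the value of d and the number of steps are illustrative):

import random

def random_surfer(links, d=0.85, steps=100_000):
    """Estimate PageRank as the fraction of steps a random surfer spends on each page."""
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = random.choice(pages)                      # start on a random page
    for _ in range(steps):
        visits[page] += 1
        if random.random() < d and links[page]:
            page = random.choice(links[page])        # follow a random link on the page
        else:
            page = random.choice(pages)              # jump to a random page ("bored")
    return {p: v / steps for p, v in visits.items()}

print(random_surfer({'A': ['B'], 'B': ['A', 'C'], 'C': ['A']}))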
Improved PageRank
n Remove out-degree 0 nodes (or consider them to refer back to referrer)
n Add decay factor d to deal with sinks:

    PageRank(p) = (1-d) + d * Σ_{b ∈ B_p} PageRank(b) / N(b)

n Typical value: d = 0.85
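The same iteration as before, now with the decay factor; a minimal sketch (the fixed iteration count and names are illustrative):

def pagerank(links, d=0.85, iters=50):
    """Iterative PageRank with decay factor d, following the formula above."""
    pages = list(links)
    rank = {p: 1.0 for p in pages}
    for _ in range(iters):
        new_rank = {p: 1.0 - d for p in pages}            # the (1-d) term
        for b, outs in links.items():
            for p in outs:
                new_rank[p] += d * rank[b] / len(outs)     # d * PageRank(b) / N(b)
        rank = new_rank
    return rank

With this formulation, the total rank still sums to the number of pages when no page is a dead end, which matches the examples on the neighboring slides.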
Stopping the Sink
[ g' ]          [ 0     0   0.5 ]   [ g ]   [ 0.15 ]
[ y' ] = 0.85 * [ 0.5   1   0.5 ] * [ y ] + [ 0.15 ]
[ a' ]          [ 0.5   0   0   ]   [ a ]   [ 0.15 ]

Running for multiple iterations:

[ g ]   [ 0.57 ]   [ 0.39 ]   [ 0.32 ]         [ 0.26 ]
[ y ] = [ 1.85 ] , [ 2.21 ] , [ 2.36 ] , … ,   [ 2.48 ]
[ a ]   [ 0.57 ]   [ 0.39 ]   [ 0.32 ]         [ 0.26 ]
Reminder: Search-Engine Optimization
n White-hat techniques
n Google webmaster tools; add meta tags to documents, etc.
n Black-hat techniques
n Link farms
n Keyword stuffing, hidden text, meta-tag stuffing, …
n Comment spam
n Initial solution:
n Some people started to abuse this to improve their own rankings
n Doorway pages / cloaking
n Special pages just for search engines
n BMW Germany and Ricoh Germany banned in February 2006
n Link buying
n You need countermeasures for these
n Otherwise they can degrade the quality of your results
Recap: PageRank
n Estimates absolute ‘quality’ or ‘importance’ of a given page based on inbound links
n Query-independent
n Can be computed via fixpoint iteration
n Can be interpreted as the fraction of time a ‘random surfer’ would spend on the page
n Several refinements, e.g., to deal with sinks
n An important factor, but not the only one
n Overall ranking is based on many factors (Google: >200)
n Need to perform rank merging, e.g., with TF/IDF scores
n e.g., TF/IDF can ensure high precision, and PageRank high quality
What could be the other 200 factors?
On-page, positive: Keyword in title? URL? Keyword in domain name? Quality of HTML code, page freshness, rate of change, …
On-page, negative: Links to 'bad neighborhood', keyword stuffing, over-optimization, hidden content (text has same color as background), automatic redirect/refresh, …
Off-page, positive: High PageRank, anchor text of inbound links, links from authority sites, links from well-known sites, domain expiration date, …
Off-page, negative: Fast increase in number of inbound links (link buying?), link farming, different pages for user/spider, content duplication, …
n Note: This is entirely speculative!
Source: Web Information Systems, Prof. Beat Signer, VU Brussels
Beyond PageRank
n PageRank assumes a “random surfer” who starts at any node and estimates likelihood that the surfer will end up at a particular page
n A more general notion: label propagation
n Take a set of start nodes each with a different label
n Estimate, for every node, the distribution of arrivals from each label
n In essence, captures the relatedness or influence of nodes
n Used in YouTube video matching, schema matching, …
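One common way to make label propagation concrete is a random walk with restart from each label's seed nodes; a rough sketch (the matrix formulation, decay factor, and names are assumptions, not from the slides):

import numpy as np

def label_propagation(M, seeds, d=0.85, iters=50):
    """For each label, estimate how much of that label 'arrives' at every node.
    M is a column-stochastic link matrix; `seeds` maps label -> list of node indices."""
    m = M.shape[0]
    result = {}
    for label, nodes in seeds.items():
        restart = np.zeros(m)
        restart[nodes] = 1.0 / len(nodes)          # restart distribution for this label
        r = restart.copy()
        for _ in range(iters):
            r = d * (M @ r) + (1 - d) * restart    # walk one step, or jump back to the seeds
        result[label] = r                          # arrival distribution from this label
    return result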
Plan for today
n HITS
n PageRank
n Random-surfer model
n Refinements
n Google
n Cloud computing
n Utility computing model
n Brief overview of EC2
Google Architecture [Brin/Page 98]
Focus was on scalability to the size of the Web
First to really exploit Link Analysis
Started as an academic project @ Stanford; became a startup
Our discussion will be on early Google – today they keep things secret!
Repository and Index
n “BigFile” Repository
n Basically, a warehouse of every HTML page (this is the ‘cached page’ entry), compressed in zlib (faster than bzip)
n Useful for doing additional processing, any necessary rebuilds
n One Repository Index for URL to DocID
n Sorted by checksum of URL
n Compute checksum of URL, then perform binary search by checksum
n One index for DocID to document
n Index points to repository entries (or to URL entry if not crawled)
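A toy sketch of the URL-to-DocID lookup described above, using a sorted checksum index and binary search (the checksum function and the in-memory data are illustrative; Google's actual checksum and on-disk layout differ):

import bisect
import zlib

# Hypothetical stand-in for the repository index: (checksum, DocID) pairs sorted by checksum.
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
url_index = sorted((zlib.crc32(u.encode()), doc_id) for doc_id, u in enumerate(urls))

def url_to_docid(url):
    """Compute the URL's checksum, then binary-search the sorted index."""
    checksum = zlib.crc32(url.encode())
    i = bisect.bisect_left(url_index, (checksum, -1))
    if i < len(url_index) and url_index[i][0] == checksum:
        return url_index[i][1]
    return None                                     # URL not in the index

print(url_to_docid("http://example.com/b"))         # -> 1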
Lexicon
n The list of searchable words
n (Presumably, today it’s used to suggest alternative words as well)
n The “root” of the inverted index
n As of 1998, 14 million “words”
n Kept in memory (was 256MB)
n Two parts:
n Hash table of pointers to words and the “barrels” (partitions) they fall into
n List of words (null-separated)
Indices – Inverted and “Forward”
[Figure: lexicon (293 MB): a list of WordIDs, each with ndocs, pointing into the barrels]
n Inverted index divided into “barrels” (partitions by range)
n Indexed by the lexicon; for each DocID, consists of a Hit List of entries in the document
n Two barrels: short (anchor and title); full (all text)
n Forward index uses the same barrels
n Indexed by DocID, then a list of WordIDs in this barrel and this document, then Hit Lists corresponding to the WordIDs
[Figure: inverted barrels (41 GB): for each WordID, entries of (DocID, nhits, hit list); forward barrels (43 GB total): for each DocID, entries of (WordID, nhits, hit list). Original tables from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm]
Hit Lists (Not Mafia-Related)
n Used in inverted and forward indices
n Goal was to minimize the size – the bulk of the data is in hit entries
n For 1998 version, made it down to 2 bytes per hit (though that’s likely climbed since then):
n Plain:  cap: 1 | font: 3 | position: 12
n Fancy:  cap: 1 | font: 7 | type: 4 | position: 8
n Anchor (a special-cased fancy hit):  cap: 1 | font: 7 | type: 4 | hash: 4 | pos: 4
n (The numbers are field widths in bits, except “font: 7”, which is the font-field value that marks a fancy or anchor hit)
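A sketch of how a plain hit could be packed into 2 bytes with these field widths (the exact field order within the 16 bits is an assumption):

def pack_plain_hit(cap, font, position):
    """Pack a plain hit: 1 bit capitalization, 3 bits font size, 12 bits word position."""
    assert 0 <= font < 8 and 0 <= position < 4096
    return ((cap & 1) << 15) | ((font & 0x7) << 12) | (position & 0xFFF)

def unpack_plain_hit(hit):
    return (hit >> 15) & 1, (hit >> 12) & 0x7, hit & 0xFFF

h = pack_plain_hit(cap=1, font=3, position=100)
print(hex(h), unpack_plain_hit(h))                  # -> 0xb064 (1, 3, 100)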
Google’s Search Algorithm
1. Parse the query
2. Convert words into wordIDs
3. Seek to start of doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
n IR score: uses dot product of count weights and type weights
n Final rank: IR score combined with PageRank
6. If we’re at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
7. If not at the end of any doclist, goto step 4
8. Sort the documents by rank; return the top K
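A toy sketch of this query flow (names, data layout, and the IR score are illustrative, not Google's actual code; the short vs. full barrels and step 6-7 looping are not modeled here):

def search(query, lexicon, inverted_index, pagerank, k=10):
    """Sketch of the query flow above."""
    # Steps 1-2: parse the query and convert words into wordIDs
    word_ids = [lexicon[w] for w in query.lower().split() if w in lexicon]
    if not word_ids:
        return []
    # Steps 3-4: walk the doclists and keep documents that match all search terms;
    # inverted_index[wid] is assumed to map DocID -> number of hits for that word
    matches = set(inverted_index[word_ids[0]])
    for wid in word_ids[1:]:
        matches &= set(inverted_index[wid])
    # Step 5: rank with a simple hit-count IR score, combined with PageRank
    def rank(doc):
        ir = sum(inverted_index[wid][doc] for wid in word_ids)
        return ir * pagerank[doc]
    # Step 8: sort the documents by rank and return the top K
    return sorted(matches, key=rank, reverse=True)[:k]

# Tiny usage example with made-up data:
lexicon = {"web": 0, "search": 1}
index = {0: {10: 3, 11: 1}, 1: {10: 2, 12: 5}}   # wordID -> {DocID: hit count}
print(search("web search", lexicon, index, pagerank={10: 0.4, 11: 0.3, 12: 0.3}))   # -> [10]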
Google over the years
[Timeline of major ranking updates, 2011 to 2019:]
n Panda: reduces rank of “low-quality” sites
n Penguin: aimed at “spamming” sites, e.g., through link buying
n Pirate: reduces rank of sites with many copyright violations
n Payday: cleans up results for “spammy queries” such as “payday loan”
n Hummingbird: takes “meaning” into account more, rather than just individual words
n Pigeon: improved local search results (based on distance and location)
n HTTPS: boosts pages with HTTPS
n Mobile friendly: boosts mobile-friendly pages
n Penguin 4.0: doesn’t target the entire site for “spam” but rather just individual pages
n Fred: targets more of the link quality aspects
n Medic update: thought to affect primarily health, fitness, and medical websites
n Core update: around March 12; no details published
n BERT update: NN-based NLP technique
Plan for today
n PageRank
n Google
n Cloud computing
n Utility computing model
n Brief overview of EC2
n Transactions
n ACID properties
n Serializability
n Concurrency control: 2PL
n Distributed transactions: 2PC
Utility computing: Power plant analogy
Steam engine at Stott Park Bobbin Mill
Waterwheel at the Neuhausen ob Eck Open-Air Museum
n It used to be that everyone had their own power source
n Challenges are similar to the cluster: Needs large up-front investment, expertise to operate, difficult to scale up/down…
Scaling the power plant
n Then people started to build large, centralized power plants with very large capacity…
Metered usage model
[Figure: power source, network, metering, customer device]
n Power plants are connected to customers by a network
n Usage is metered, and everyone (basically) pays only for what they actually use
Why is this a good thing?
Electricity vs. computing:
n Economies of scale
n Electricity: cheaper to run one big power plant than many small ones
n Computing: cheaper to run one big data center than many small ones
n Statistical multiplexing
n High utilization (in both cases)!
n No up-front commitment
n Electricity: no investment in generator; pay-as-you-go model
n Computing: no investment in data center; pay-as-you-go model
n Scalability
n Electricity: thousands of kilowatts available on demand; add more within seconds
n Computing: thousands of computers available on demand; add more within seconds