Link Analysis Algorithms: PageRank
With slide contributions from
J. Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Wensheng Wu
The Problem
Web as a Graph
• Web as a directed graph:
• Nodes: Webpages
• Edges: Hyperlinks
[Figure: example pages ("I teach a class on Networks", "CS224W: Classes are in the Gates building", "Computer Science Department at Stanford", "Stanford University") connected by hyperlinks]
Web as a Directed Graph
Broad Question
• How to organize the Web?
• First try: Human curated Web directories
• Yahoo, DMOZ, LookSmart
Broad Question (Cont’d)
• How to organize the Web?
• Second try: Web Search
• Information Retrieval investigates: find relevant docs in a small and trusted set
• Newspaper articles, patents, etc.
But: Web is huge, full of untrusted documents, random things, web spam, etc.
Early Web Search
• Keywords extracted from web pages
• E.g., title, content
• Used to build inverted index
• Queries are matched with web pages
• Via lookup in the inverted index
• Pages ranked by occurrences of query keywords (see the sketch below)
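A minimal sketch of this early pipeline (the page data and helper names are hypothetical, not from the slides):

```python
from collections import Counter, defaultdict

def build_inverted_index(pages):
    """pages: dict of page_id -> text. Returns term -> Counter(page_id -> count)."""
    index = defaultdict(Counter)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term][page_id] += 1
    return index

def search(index, query):
    """Rank pages by total occurrences of the query keywords."""
    scores = Counter()
    for term in query.lower().split():
        for page_id, count in index.get(term, {}).items():
            scores[page_id] += count
    return [page_id for page_id, _ in scores.most_common()]

# Hypothetical toy pages
pages = {"p1": "movies movies movies showtimes", "p2": "buy shirts cheap shirts"}
index = build_inverted_index(pages)
print(search(index, "movies"))  # ['p1']
```

Note how ranking by raw occurrence counts invites abuse: repeating a keyword inflates a page's score, which is exactly the term-spam problem discussed next.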
Inverted Index
• Problem: susceptible to term spam
https://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
Term Spam
• Disguise a page as something it is not about
• E.g., adding thousands of keyword “movies”
• Actual content may be some advertisement
• Fool search engine to return it for query “movies”
• May even fade spam words into background
• Spam pages may be based on top-ranked pages
Web Search: Two Challenges
Two challenges of web search:
• (1) Web contains many sources of information. Who to "trust"?
• Trick: Trustworthy pages may point to each other!
• (2) What is the “best” answer to query “newspaper”?
• No single right answer
• Trick: Pages that actually know about newspapers might all be pointing to many newspapers
Ranking Nodes on the Graph
• All web pages are not equally "important": blog.bob.com vs. www.usc.edu
• There is large diversity in the web-graph node connectivity
• Let’s rank the pages by the link structure!
Link Analysis Algorithms
• We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
• PageRank
• Topic-Specific (Personalized) PageRank
• Web Spam Detection Algorithms
PageRank: The "Flow" Formulation
PageRank: Combating Term Spam
• Key idea: rank pages by linkage too
• How many pages point to a page
• How important these pages are
=> PageRank
• USC.edu can be important
• because many pages point to it
• Your home page can be important
• If it is pointed to by USC
[Figure: many pages point to USC; USC points to your home page]
Random Surfer Model
• Random surfer of web
• starts from any page
• follows its outgoing links randomly
• Page is important if it attracts a large # of surfers
[Figure: a web graph in which USC attracts many surfers while "fake"/spam pages attract few]
PageRank
• Probability that a random surfer lands on the page
Intuition
• If a page is important, then
• many other pages may directly/indirectly link to it
• random surfer can easily find it
• Spam pages are less connected
• So less chance to attract random surfer
• Random surfer model more robust than manual approach
• A collective voting scheme
Example: PageRank Scores
[Figure: example web graph with PageRank scores (B: 38.4, C: 34.3, E: 8.1, D: 3.9, F: 3.9, A: 3.3, five peripheral nodes: 1.6 each)]
Assumption: A Strongly Connected Web Graph
• Nodes = pages
• Edges = hyperlinks between pages
• every node is reachable from every other node
Model: Random Surfer on the Graph
• Can start at any node, say A
• Can next go to B, C, or D, each with 1/3 prob.
• If at B, can go to A and D, each with 1/2 prob.
• So on…
Random Surfer Property: Memoryless
• Where to go from node X is not affected by how the surfer got to X
Extreme Case: Dead End
• Dead end: a page with no edges out
• Absorbs PageRank
• PageRank → 0 for any page that can reach the dead end (including the dead end itself)
[Figure: a graph containing a dead-end page]
Extreme Case: Spider Trap
• Group of pages with no edges going out of the group
• Absorbs all PageRank (rank of C → 1, others → 0)
• Surfer can never leave, once trapped
• Can have more than one node
[Figure: a graph containing a spider trap]
PageRank: Formulation Details
PageRank: Links as Votes
• Idea: Links as votes
• Page is more important if it has more links
• In-coming links? Out-going links?
• Think of in-links as votes:
• www.stanford.edu has 23,400 in-links
• www.one-link.com has 1 in-link
• Are all in-links equal?
• Links from important pages count more
• Recursive question!
Simple Recursive Formulation
• Each link's vote is proportional to the importance of its source page
• If page j with importance r_j has n out-links, each link gets r_j / n votes
• Page j's own importance is the sum of the votes on its in-links
[Figure: pages i (3 out-links) and k (4 out-links) both point to page j; j has 3 out-links]

r_j = r_i/3 + r_k/4, and each of j's out-links carries r_j/3
PageRank: The “Flow” Model
• A "vote" from an important page is worth more
• A page is important if it is pointed to by other important pages
• Define a "rank" r_j for page j
r_j = Σ_{i→j} r_i / d_i,   where d_i = out-degree of node i

[Figure: 3-node graph where y links to itself and a; a links to y and m; m links to a]

"Flow" equations:
r_y = r_y/2 + r_a/2
r_a = r_y/2 + r_m
r_m = r_a/2
Solving the flow equations
• 3 equations, 3 unknowns, no constants
• No unique solution
• All solutions equivalent modulo scale factor
• Additional constraint forces uniqueness:
• r_y + r_m + r_a = 1
• r_y = 2/5, r_a = 2/5, r_m = 1/5
"Flow" equations:
r_y = r_y/2 + r_a/2
r_a = r_y/2 + r_m
r_m = r_a/2
• The Gaussian elimination method (shown in later slides) works for small examples, but we need a better method for large graphs
PageRank: Matrix formulation
• Stochastic Transition (or adjacency) Matrix M
• Suppose page j has n outlinks
• If outlink j → i, then M_ij = 1/n
• Else M_ij = 0
• M is a column stochastic matrix
• Columns sum to 1
Transition Matrix
• M[i,j] = prob. of going from node j to node i
• If j has k outgoing edges, prob. for each edge = 1/k
          A    B    C    D
   A [  0   1/2   1    0 ]
M = B [ 1/3   0    0   1/2]
   C [ 1/3   0    0   1/2]
   D [ 1/3  1/2   0    0 ]
PageRank: Matrix formulation (Cont’d)
• M: the column-stochastic transition matrix defined above
• Rank vector r is a vector with one entry per web page
• r_i is the importance score of page i
• The flow equations can be written as r = Mr
Example
• Flow equation in matrix form: Mr = r
• Suppose page j links to 3 pages, including i
• Then column j of M has the entry 1/3 in row i, and r_i receives r_j/3 from page j
[Figure: M · r = r, with the 1/3 entry of column j highlighted]
Stationary Distribution
• Limiting prob. distribution of random surfer
• PageRanks are based on limiting distribution
• the probability distribution will converge eventually
• Requirement for its existence
• Graph is strongly connected: a node can reach any other node in the graph
=> Cannot have dead ends, spider traps
Eigenvectors and Eigenvalues
• An eigenvector of a square matrix A is a non-zero vector v such that multiplying A by v yields a scalar multiple of v; the scalar multiplier is often denoted by λ
• That is:
Av = λv
• The number λ is called the eigenvalue of A corresponding to v
Eigenvalues and Eigenvectors Example
• The transformation matrix A preserves the direction of vectors parallel to v = (1,−1)^T (in purple) and w = (1,1)^T (in blue). The vectors in red are not parallel to either eigenvector, so their directions are changed by the transformation.
http://setosa.io/ev/eigenvectors-and-eigenvalues/
https://en.wikipedia.org/wiki/Eigenvalues_and_e igenvectors
Eigenvector Formulation
• The flow equations can be written r = M · r, where r is the limiting distribution
• So the rank vector r is an eigenvector of the stochastic web matrix M
• r is M’s first or principal eigenvector, with corresponding eigenvalue 1
• Largest eigenvalue of M is 1 since M is column stochastic (with non-negative entries)
• We know r is unit length and each column of M sums to one
NOTE: x is an eigenvector with the corresponding eigenvalue λ if:
𝑨𝒙 = 𝝀𝒙
• We can now efficiently solve for r!
• 1. Power Iteration: https://en.wikipedia.org/wiki/Power_iteration
• 2. Use the principal eigenvector
Example
[Figure: 3-node graph with Yahoo (y), Amazon (a), M'soft (m)]

r = Mr:

          y    a    m
   y [ 1/2  1/2   0 ]         r_y = r_y/2 + r_a/2
M = a [ 1/2   0    1 ]         r_a = r_y/2 + r_m
   m [  0   1/2   0 ]         r_m = r_a/2
Power Iteration method
• Simple iterative scheme (aka relaxation)
• Suppose there are N web pages
• Initialize: r^(0) = [1/N, …, 1/N]^T
• Iterate: r^(k+1) = M r^(k)
• Stop when |r^(k+1) − r^(k)|_1 < ε
• |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm
• Can use any other vector norm, e.g., Euclidean (see the sketch below)
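A minimal NumPy sketch of this scheme, assuming a column-stochastic M with no dead ends or spider traps:

```python
import numpy as np

def power_iteration(M, eps=1e-8):
    """Iterate r <- M r from the uniform vector until the L1 change is < eps.
    Assumes M is column stochastic (no dead ends, no spider traps)."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    while True:
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next

# The y/a/m example from the slides (columns y, a, m)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iteration(M))  # ~[0.4, 0.4, 0.2], i.e., (2/5, 2/5, 1/5)
```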
Power Iteration Example
[Figure: the Yahoo (y) / Amazon (a) / M'soft (m) graph as before]

          y    a    m
   y [ 1/2  1/2   0 ]
M = a [ 1/2   0    1 ]         r^(k+1) = M r^(k)
   m [  0   1/2   0 ]

  r_y      1/3    1/3    5/12    3/8            2/5
  r_a  =   1/3    1/2    1/3    11/24    …      2/5
  r_m      1/3    1/6    1/4     1/6            1/5

Recall: The flow model
Recall: Random Walk Interpretation
• Imagine a random web surfer:
• At any time t, the surfer is on some page i
• At time t + 1, the surfer follows an out-link from i uniformly at random
• Ends up on some page j linked from i
• Process repeats indefinitely
• Let p(t) be the vector whose i-th coordinate is the prob. that the surfer is at page i at time t
• So, p(t) is a probability distribution over pages → the rank vector

r_j = Σ_{i→j} r_i / d_out(i)
Recall: Transition Matrix
• M[i,j] = prob. of going from node j to node i
• If j has k outgoing edges, prob. for each edge = 1/k
          A    B    C    D
   A [  0   1/2   1    0 ]
M = B [ 1/3   0    0   1/2]
   C [ 1/3   0    0   1/2]
   D [ 1/3  1/2   0    0 ]
Prob. of Locations of Surfer
• Represented as a column vector, v
• Initially, surfer can be at any page with equal probability
v_0 = [1/4, 1/4, 1/4, 1/4]^T   (one entry each for A, B, C, D)
Limiting Distribution
• v_1 = M v_0
• v_2 = M v_1 (= M² v_0)
• …
• v_k = M v_{k−1} (= M^k v_0)
• …
• v = M v
(from some step k on, v does not change any more)
Compute Next Distribution
• v_1 = M v_0, i.e., (v_1)_i = Σ_j M_ij (v_0)_j

        A    B    C    D
   A [  0   1/2   1    0 ]     1/4
   B [ 1/3   0    0   1/2]     1/4
   C [ 1/3   0    0   1/2]  ·  1/4
   D [ 1/3  1/2   0    0 ]     1/4

• E.g., (v_1)_A = 0·(1/4) + (1/2)·(1/4) + 1·(1/4) + 0·(1/4) = 3/8
• i.e., the prob. of being at A is 3/8 (or 9/24) after step 1

v_0 = [1/4, 1/4, 1/4, 1/4]^T,   v_1 = [9/24, 5/24, 5/24, 5/24]^T
Solving Mv = v
Solving Mv = v by 1. Power Iteration
• Note: Σ_i (v_k)_i = 1 after every step k
• Usually stop when there is little change between iterations
• In practice, 50-75 iterations for the Web

  1/4      9/24      15/48      11/32            3/9
  1/4      5/24      11/48       7/32            2/9
  1/4  →   5/24  →   11/48  →    7/32  →  …  →   2/9
  1/4      5/24      11/48       7/32            2/9
Solving Mv = v by 2. Finding Eigenvector
• Mv = λv
• λ: eigenvalue, v: eigenvector
• M is a (left) stochastic matrix
• Each column adds up to 1
• Largest eigenvalue of a stochastic matrix is 1
• Corresponding eigenvector is the principal eigenvector
• Proof in the next few slides
Characteristic Polynomial: Calculating the Eigenvalues
• Mv = λv
• (M − λI)v = 0, where I is the identity matrix
• v is not a null vector, so det(M − λI) = 0:

        0   1/2   1    0              −λ   1/2    1    0
M =    1/3   0    0   1/2      det   1/3   −λ     0   1/2    =  0
       1/3   0    0   1/2            1/3    0    −λ   1/2
       1/3  1/2   0    0             1/3   1/2    0   −λ

• Compute the determinant using cofactors in the third column
Characteristic Polynomial
• Expanding the determinant (cofactors along the third column) gives:
det(M − λI) = λ⁴ − (3/4)λ² − (1/4)λ = λ(λ − 1)(λ² + λ + 1/4) = λ(λ − 1)(λ + 1/2)²
Characteristic Polynomial (Cont'd)
• det(M − λI) = 0:
λ(λ − 1)(λ + 1/2)² = 0  =>  λ1 = 0, λ2 = 1, λ3 = λ4 = −1/2
Solving Mv = v by Finding Eigenvector
• Mv = λv, (M − λI)v = 0, det(M − λI) = 0
• λ(λ − 1)(λ + 1/2)² = 0  =>  λ1 = 0, λ2 = 1, λ3 = λ4 = −1/2
• The principal eigenvalue is λ = 1, so the problem becomes solving (M − I)v = 0
• Use Gaussian elimination
Solving Mv = v by Gaussian Elimination
• (M − I)v = 0:

  −1   1/2    1     0      v1     0
  1/3   −1    0    1/2     v2  =  0
  1/3    0   −1    1/2     v3     0
  1/3   1/2   0    −1      v4     0

• Multiply row 1 by −1, then eliminate the first column from rows 2-4:

   1  −1/2   −1     0
   0  −5/6   1/3   1/2
   0   1/6  −2/3   1/2
   0   2/3   1/3   −1
Solving Mv = v by Gaussian Elimination
• (M − I)v = 0
• Scale row 2 by −6/5, then eliminate the second column from rows 3 and 4:

   1  −1/2   −1     0
   0    1   −2/5  −3/5
   0    0   −3/5   3/5
   0    0    3/5  −3/5
Solving Mv = v by Gaussian Elimination
• Scale row 3 by −5/3; row 4 then becomes all zeros. Eliminating backwards:

   1  −1/2   −1     0          1   0   0  −3/2
   0    1   −2/5  −3/5    →    0   1   0   −1
   0    0     1    −1          0   0   1   −1
   0    0     0     0          0   0   0    0
Solving Mv = v by Gaussian Elimination
• (M − I)v = 0 has been reduced to:

   1   0   0  −3/2     v1     0
   0   1   0   −1      v2  =  0
   0   0   1   −1      v3     0
   0   0   0    0      v4     0

• v1 − (3/2)v4 = 0,  v2 − v4 = 0,  v3 − v4 = 0
• Add the constraint v1 + v2 + v3 + v4 = 1
• => v1 = 1/3, v2 = v3 = v4 = 2/9
• This is the probability distribution (a NumPy cross-check follows)
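As a cross-check on the elimination above, a short NumPy sketch that recovers the same vector as the eigenvector for λ = 1:

```python
import numpy as np

# The 4-node example (columns A, B, C, D)
M = np.array([[0,   1/2, 1, 0],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 0, 0]])

vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmax(np.real(vals))])  # eigenvector for lambda = 1
v = v / v.sum()                                 # rescale so entries sum to 1
print(v)  # ~[1/3, 2/9, 2/9, 2/9]
```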
Comparison of Solutions
• 𝑀𝑣 = 𝑣
• Solution 1: power iteration
• Solution 2: Gaussian elimination
• Power iteration
• Complexity: O(kn²), where k = # of iterations
• Gaussian elimination
• Complexity: O(n³)
• n² + (n−1)² + … + 2²
Spider Traps and Dead Ends
The Stationary Distribution
• Where is the surfer at time t + 1?
• Follows a link uniformly at random: p(t + 1) = M · p(t)
• Suppose the random walk reaches a state where p(t + 1) = M · p(t) = p(t)
• Then p(t) is a stationary distribution of the random walk
• Our original rank vector r satisfies r = M · r
• So, r is a stationary distribution for the random walk
Existence and Uniqueness
• A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0
PageRank: Problems
Two problems:
• (1) Some pages are dead ends (have no out-links)
• Random walk has “nowhere” to go to
• Such pages cause importance to “leak out”
• (2) Spider traps (all out-links are within the group)
• Random walker gets "stuck" in a trap
• And eventually spider traps absorb all importance
[Figures: a small graph with a dead end, and one with a spider trap]
Problem: Spider Traps
• All out-links are within a group: here m links only to itself, so m is a spider trap

          y    a    m
   y [ 1/2  1/2   0 ]         r_y = r_y/2 + r_a/2
M = a [ 1/2   0    0 ]         r_a = r_y/2
   m [  0   1/2   1 ]         r_m = r_a/2 + r_m

• Power Iteration:
• Set r^(0) = [1/3, 1/3, 1/3]^T and iterate r^(k+1) = M r^(k)
• Example (iterations 0, 1, 2, …):

  r_y      1/3    2/6    3/12    5/24           0
  r_a  =   1/3    1/6    2/12    3/24    …      0
  r_m      1/3    3/6    7/12   16/24           1

All the PageRank score gets "trapped" in node m.
Solution: Teleports!
• The Google solution for spider traps: At each time step, the random surfer has two options
• With prob. 𝛽 follow a link at random
• With prob. 1-𝛽 jump to some random page
• Common values for 𝛽 are in the range 0.8 to 0.9
• Surfer will teleport out of spider trap within a few time steps
[Figure: the y/a/m graph with teleport links added out of the spider trap]
Random Teleports (β = 0.8)

A = 0.8 · M + 0.2 · [1/N]_{N×N}:

            y    a    m                1/3  1/3  1/3
A = 0.8 · [ 1/2  1/2   0 ]   + 0.2 · [ 1/3  1/3  1/3 ]
          [ 1/2   0    0 ]           [ 1/3  1/3  1/3 ]
          [  0   1/2   1 ]

      7/15   7/15   1/15
A =   7/15   1/15   1/15
      1/15   7/15  13/15

r^(k+1) = A r^(k):

  r_y      1/3    0.33   0.24   0.26            7/33
  r_a  =   1/3    0.20   0.20   0.18     …      5/33
  r_m      1/3    0.46   0.52   0.56           21/33
Matrix formulation
• Suppose there are N pages
• Consider a page j, with set of outlinks O(j)
• We have M_ij = 1/|O(j)| when j → i, and M_ij = 0 otherwise
• The random teleport is equivalent to:
• adding a teleport link from j to every other page with probability (1−β)/N
• reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
• Equivalent: tax each page a fraction (1−β) of its score and redistribute it evenly (see the sketch below)
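A sketch of the teleport formulation using the dense matrix A = βM + (1 − β)[1/N]_{N×N} for illustration (a real system would keep M sparse and apply the equivalent taxation update instead):

```python
import numpy as np

def pagerank(M, beta=0.8, eps=1e-8):
    """Power iteration on A = beta*M + (1-beta)*[1/N], the teleport-adjusted
    matrix. M must be column stochastic (dead ends handled separately)."""
    n = M.shape[0]
    A = beta * M + (1 - beta) / n   # dense for illustration only
    r = np.full(n, 1.0 / n)
    while True:
        r_next = A @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next

# Spider-trap example: m links only to itself (columns y, a, m)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
print(pagerank(M, beta=0.8))  # ~[7/33, 5/33, 21/33]
```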
Taxation
• Deal with spider traps and dead ends
• By allowing surfer to jump to a random page • This jumping is also called “teleporting”
• v′ = βMv + (1 − β)e/n
• β: damping factor, e.g., set to 80%
• e: an n-dimensional vector with all 1's
• n: # of nodes in the graph
• This in effect makes the graph "strongly connected"
• Tax: 20%, distributed evenly
Taxation on Spider Trap
• Suppose β = .8, and C's only out-link now points to itself (C is a spider trap):

          A    B    C    D
   A [  0   1/2   0    0 ]
M = B [ 1/3   0    0   1/2]
   C [ 1/3   0    1   1/2]
   D [ 1/3  1/2   0    0 ]

v_0 = [1/4, 1/4, 1/4, 1/4]^T

v_1 = .8·M·v_0 + .2·[1/4, 1/4, 1/4, 1/4]^T
    = [1/10, 1/6, 11/30, 1/6]^T + [1/20, 1/20, 1/20, 1/20]^T
    = [3/20, 13/60, 25/60, 13/60]^T

(into C: A and D contribute .8·(1/4·1/3 + 1/4·1/2) = 1/6, and C itself contributes .8·(1/4) = 1/5, for a total of 11/30)
Taxation on Spider Trap (Cont'd)
• Suppose β = .8; iterate v′ = .8·M·v + .2·[1/4, 1/4, 1/4, 1/4]^T
• How C's score evolves (B behaves like D):

   Iter |   A    |   C    |   D    | .8·(1/3)·A | .8·(1/2)·D |  .8·C  | .2·(1/4) | Next C
    0   |  1/4   |  1/4   |  1/4   |    1/15    |    1/10    |  1/5   |   1/20   | 25/60 (.4)
    1   |  9/60  | 25/60  | 13/60  |    3/75    |   13/150   |  1/3   |   1/20   | 153/300 (.5)
    …   |   …    |   …    |   …    |     …      |     …      |   …    |    …     |    …
  limit | 15/148 | 95/148 | 19/148 |    4/148   |   19/370   | 76/148 |   1/20   | 95/148 (.64)

• The contributions from A and D decrease, the contribution from C increases, and the teleport share is fixed; C's score increases toward 95/148 but no longer absorbs everything
Problem: Dead Ends
• Pages with no out-links are "dead ends" for the random surfer: nowhere to go on the next step

          y    a    m
   y [ 1/2  1/2   0 ]         r_y = r_y/2 + r_a/2
M = a [ 1/2   0    0 ]         r_a = r_y/2
   m [  0   1/2   0 ]         r_m = r_a/2

• Power Iteration:
• Set r^(0) = [1/3, 1/3, 1/3]^T and iterate r^(k+1) = M r^(k)
• Example (iterations 0, 1, 2, …):

  r_y      1/3    2/6    3/12    5/24           0
  r_a  =   1/3    1/6    2/12    3/24    …      0
  r_m      1/3    1/6    1/12    2/24           0

Here the PageRank "leaks" out since the matrix is not column stochastic (column m sums to 0, not 1).
Solution 1 for Dead Ends: Always Teleport!
• Teleports: follow random teleport links with probability 1.0 from dead ends
• Adjust the matrix accordingly (see the sketch below):

          y    a    m                y    a    m
    [ 1/2  1/2   0 ]           [ 1/2  1/2  1/3 ]
M = [ 1/2   0    0 ]    →      [ 1/2   0   1/3 ]
    [  0   1/2   0 ]           [  0   1/2  1/3 ]
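A small sketch of the matrix adjustment, replacing every dead-end column with a uniform column:

```python
import numpy as np

def fix_dead_ends(M):
    """Replace each all-zero column (a dead end) with a uniform column,
    i.e., always teleport from dead ends; M becomes column stochastic."""
    M = M.copy()
    n = M.shape[0]
    dead = (M.sum(axis=0) == 0)   # columns summing to 0 are dead ends
    M[:, dead] = 1.0 / n
    return M

# Dead-end example: m has no out-links (columns y, a, m)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
print(fix_dead_ends(M))  # column m becomes [1/3, 1/3, 1/3]
```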
Solution 2 for Dead Ends: Taxation
• Suppose β = .8, with C a dead end (no out-links, so column C is all zeros):

          A    B    C    D
   A [  0   1/2   0    0 ]
M = B [ 1/3   0    0   1/2]
   C [ 1/3   0    0   1/2]
   D [ 1/3  1/2   0    0 ]

v′ = .8·M·v + .2·[1/4, 1/4, 1/4, 1/4]^T
Taxation on Dead End
• Suppose β = .8

v_1 = .8·M·v_0 + .2·[1/4, 1/4, 1/4, 1/4]^T
    = [1/10, 1/6, 1/6, 1/6]^T + [1/20, 1/20, 1/20, 1/20]^T
    = [3/20, 13/60, 13/60, 13/60]^T        Total: 48/60

Lost: 8/10 · 1/4 (into the dead-end sink) + 2/10 · 1 (tax) − 2/10 · 1 (teleport gain) = 1/5 = 12/60
Taxation on Dead End
• Suppose β = .8

        1/4          3/20           41/300               .1
v_0 =   1/4 ,  v_1 = 13/60 ,  v_2 = 53/300 ,  …,  v_100 ≈ .13
        1/4          13/60          53/300               .13
        1/4          13/60          53/300               .13

• Net loss per step = (loss due to dead end) + (loss due to taxation) − (gain due to teleporting)
• Remaining total rank: 1 → 4/5 → 2/3 → … → ≈ .49 after 100 iterations
• Net loss will be zero when loss and gain balance out
Net Loss and Total Net Loss

Without taxation:

  At the end of iteration # | Lost prob.  | Total loss                 | Remaining
   1 | 1/4 (.25)   | 1/4 (.25)                  | 3/4
   2 | 5/24 (.21)  | 1/4 + 5/24 = 11/24 (.45)   | 1 − 11/24 = 13/24
   3 | 7/48 (.15)  | 11/24 + 7/48 = 29/48 (.6)  | 1 − 29/48 = 19/48
   … | …           | …                          | …

With taxation:

  At the end of iteration # | Loss due to dead end | Loss due to taxation (20% tax) | Gain due to teleporting | Net loss   | Total net loss  | Remaining ranks
   1 | 8/10 · 1/4   | 2/10 · 1   | 2/10 · 1 | 1/5 (.2)   | 1/5 (.2)        | 4/5
   2 | 8/10 · 13/60 | 2/10 · 4/5 | 2/10 · 1 | 2/15 (.13) | 1/5 + 2/15 (.3) | 2/3
   … | …            | …          | …        | …          | …               | …

Net loss will be zero when loss and gain balance out.
Why Do Teleports Solve the Problem?
Why are dead ends and spider traps a problem, and why do teleports solve the problem?
• Spider-traps
• PageRank scores are not what we want
• Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps
• Dead-ends
• The matrix is not column stochastic so our initial assumptions are not met
• Solution: Make matrix column stochastic by always teleporting when there is nowhere else to go
Alternative for Dealing with Dead Ends
• Prune and propagate
• Preprocess the graph to eliminate dead-ends
• This may create new dead-ends
• Might require multiple passes
• Compute PageRank on the reduced graph
• Approximate values for dead-ends by propagating values from reduced graph
Some Problems with PageRank
• Measures generic popularity of a page
• Biased against topic-specific authorities
• Solution: Topic-Specific PageRank
• Uses a single measure of importance
• Other models of importance
• Solution: Hubs-and-Authorities
• Susceptible to Link spam
• Artificial link topologies created in order to boost PageRank
• Solution: TrustRank
How is Page Rank Information Used in a Search Engine?
• Each search engine has a secret formula that decides the order in which to show pages to a user in response to a search query
• Google: thought to use over 250 different properties of pages
• Search engine decides on a linear order of search results
• To be considered for ranking, page has to have at least one of the query search terms
• Unless all search terms are present, little chance of being in top ten results
• Among qualified pages, score computed for each
• PageRank is an important component of the score
• Other components: presence/absence of search terms in headers, links to the page
PageRank Iteration Using MapReduce
• If n is small enough that each Map task can store the full vectors v and v′ in main memory, the iteration is an ordinary matrix-vector multiplication
• With additional steps to multiply each component of Mv by the constant β and to add (1 − β)/n
• Given the size of the Web, v is much too large to fit in main memory
• Break M into vertical stripes (Section 2.3.1) and break v into corresponding horizontal stripes to execute the MapReduce
Matrix-Vector Multiplication by MapReduce
• n×n matrix M, vector v of length n
• M × v = x
• x is the vector of length n whose i-th element is x_i = Σ_j m_ij v_j (a sketch follows)
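A single-machine sketch of the Map and Reduce functions for x = Mv (a simulation of the dataflow, not an actual Hadoop job):

```python
from collections import defaultdict

def map_task(entries, v):
    """entries: (i, j, m_ij) triples for one chunk of M.
    Emits (i, m_ij * v_j) for every nonzero matrix entry."""
    for i, j, m_ij in entries:
        yield i, m_ij * v[j]

def reduce_task(pairs):
    """Sums all values that share the key i, producing x_i."""
    x = defaultdict(float)
    for i, value in pairs:
        x[i] += value
    return dict(x)

# Tiny demo: M = [[1, 2], [0, 3]] as triples, v = [10, 1]
entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
v = [10.0, 1.0]
print(reduce_task(map_task(entries, v)))  # {0: 12.0, 1: 3.0}
```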
Use of Combiners to Consolidate the Result Vector
• Partition the vector into k stripes and the matrix into k² blocks
• E.g., k = 4
Use of Combiners to Consolidate the Result Vector (Cont’d)
• Use k² Map tasks (e.g., 16)
• Each task gets one square block of the matrix M (M_ij) and one stripe of the vector v (v_j)
• Each stripe of the vector is sent to k (e.g., 4) different Map tasks
• v_j is sent to the task handling M_ij for each of the k possible values of i
• Advantage: keep both the j-th stripe of v and the i-th stripe of v′ in main memory as we process M_ij
• Note that all terms generated from M_ij and v_j contribute to v′_i and to no other stripe of v′
PageRank Iteration Using MapReduce
• Assume the rank vector v does not fit in main memory
• Block-stripe update method: break M into vertical stripes and break r into corresponding horizontal stripes
• Allows us to execute MapReduce efficiently
• Process no more of v at any one Map task than can fit in main memory
• Trying to use a combiner gets more complicated (Section 5.2.3):
• Partition matrix M into k² blocks, vector r into k stripes
• Use k² Map tasks
• Each Map task gets one square of the matrix, M_ij, and one stripe of the vector, r_j
• Each stripe r_j is sent to k Map tasks, i.e., transmitted over the network k times
• But each M_ij is transmitted only once
(a small sketch of the block-stripe multiplication follows)
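An in-memory sketch of the block-stripe partitioning, where each inner-loop iteration plays the role of one of the k² Map tasks:

```python
import numpy as np

def block_stripe_multiply(M, r, k):
    """x = M r computed from k^2 square blocks of M and k stripes of r.
    Block (i, j) reads only stripe j of r and writes only stripe i of x."""
    n = len(r)
    s = n // k                     # stripe size (assumes k divides n)
    x = np.zeros(n)
    for i in range(k):             # output stripe
        for j in range(k):         # input stripe; one "Map task" per (i, j)
            x[i*s:(i+1)*s] += M[i*s:(i+1)*s, j*s:(j+1)*s] @ r[j*s:(j+1)*s]
    return x

M = np.arange(16, dtype=float).reshape(4, 4)
r = np.ones(4)
print(np.allclose(block_stripe_multiply(M, r, k=2), M @ r))  # True
```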
Topic-Specific PageRank
Topic-Specific PageRank
• Instead of generic popularity, can we measure popularity within a topic?
• Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history”
• Allows search queries to be answered based on interests of the user
• Example: Query "Trojan" wants different pages depending on whether you are interested in sports, history, or computer security
Topic-Specific PageRank
• Random walker has a small probability of teleporting at any step
• Where a Teleport can go:
• Standard PageRank: Any page with equal probability
• Topic Specific PageRank: A topic-specific set of “relevant” pages (teleport set)
• Idea: Bias the random walk
• When the walker teleports, she picks a page from a set S
• S contains only pages that are relevant to the topic
• E.g., Open Directory (DMOZ) pages for a given topic/query
• For each teleport set S, we get a different vector r_S
Discovering the Topic PageRank Vectors rs for Topic Set S
• If the search engine can deduce the user’s interests, then it can do a better job of returning relevant pages
• A private PageRank vector that gives importance of each page to that user (not feasible!)
• Topic-Sensitive PageRank: creates one vector for some small number of topics
• Bias PageRank to favor pages of that topic
• Create different PageRank vectors for different topics
• The 16 top-level categories of the Open Directory (DMOZ): arts, business, sports,…
• If we can determine that the user is interested in one of these topics (e.g., by content of pages they recently viewed), use the PageRank vector for that topic when deciding on ranking of pages for query results
Biased Random Walk
• Bias toward a set S of pages
• S is called the teleport set
• e_S: a vector with 1 for pages in S, 0 otherwise
• Update: r′ = βMr + (1 − β) e_S / |S| (a sketch follows)
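A sketch of the biased walk: the same power iteration as before, except that teleports land only on the teleport set S:

```python
import numpy as np

def topic_pagerank(M, S, beta=0.8, eps=1e-8):
    """Biased walk: teleports land only on the teleport set S.
    Iterates r <- beta*M*r + (1-beta)*e_S/|S| to convergence."""
    n = M.shape[0]
    e_S = np.zeros(n)
    e_S[list(S)] = 1.0 / len(S)   # uniform over the teleport set
    r = np.full(n, 1.0 / n)
    while True:
        r_next = beta * (M @ r) + (1 - beta) * e_S
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next

# 4-node example (columns A, B, C, D), teleport set S = {B, D}
M = np.array([[0,   1/2, 1, 0],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 0, 0]])
print(topic_pagerank(M, S=[1, 3]))  # B and D end up with the largest scores
```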
Example
• Teleport set S = {B, D}, β = .8

          A    B    C    D
   A [  0   1/2   1    0 ]
M = B [ 1/3   0    0   1/2]
   C [ 1/3   0    0   1/2]
   D [ 1/3  1/2   0    0 ]

• On a teleport (prob. 2/10), the surfer jumps to B or D, each with prob. 1/2, so each gains 2/10 · 1/2 of the taxed rank
Example
• Unbiased teleporting (to any page) converges to [3/9, 2/9, 2/9, 2/9]^T:

  1/4      9/24      15/48      11/32            3/9
  1/4      5/24      11/48       7/32            2/9
  1/4  →   5/24  →   11/48  →    7/32  →  …  →   2/9
  1/4      5/24      11/48       7/32            2/9

• Unbiased: A has the largest PageRank
• Biased toward S = {B, D}: B and D have the largest PageRanks
Integrating Topic-Sensitive PageRank Into a Search Engine
1. Decide on topics for which to create specialized PageRank vectors
2. Pick a teleport set for each of these topics
3. Use that set to compute the topic-sensitive PageRank for that topic
4. Determine the topic or set of topics most relevant to a particular search query
• More on this in the next slide
5. Use the PageRank vector for those topic(s) to order the responses to the search query
How to Select the Topic Set that is Most Relevant to the Search Query?
• Tricky
• Several methods proposed:
1. User can pick from a topic menu
2. Infer the topic(s) from the context of the query
• Words that appear in Web pages recently searched by the user, or recent user queries
• E.g., the query is launched from a web page talking about a known topic
• History of queries: e.g., "basketball" followed by "Kobe"
• Classification of documents by topic: studied for decades
3. Infer the topic(s) from information about the user
• E.g., the user's bookmarks, their interests on Facebook, etc.
TrustRank: Combatting Web Spam
What is Web Spam?
• Spamming: deliberate action to boost a webpage's position in search results
• Not commensurate with the page's real value
• This is a very broad definition
• SEO industry might disagree!
• SEO = search engine optimization
• Approximately 10-15% of web pages are spam
Web Search
• Early search engines:
• Crawl the Web
• Index pages by words they contained
• Respond to search queries (lists of words) with the pages containing those words
• Early page ranking:
• Attempt to order pages matching a search query by "importance"
• First search engines considered:
• (1) Number of times query words appeared
• (2) Prominence of word position, e.g. title, header
First Spammers: Term Spam
• How do you make your page appear to be about movies?
• (1) Add the word movie 1,000 times to your page
• Set text color to the background color, so only search engines would see it
• (2) Or, run the query "movie" on your target search engine
• See what page came first in the listings
• Copy it into your page, make it “invisible”
• These and similar techniques are called term spam
Google’s Solution to Term Spam
• Believe what people say about you, rather than what you say about yourself
• Use words in the anchor text (words that appear underlined to represent the link) and its surrounding text
• PageRank as a tool to measure the “importance” of Web pages
• Doesn’t depend on content of pages
Why It Works?
• Our hypothetical shirt-seller loses
• Saying he is about movies doesn't help, because others don't say he is about movies
• His page isn’t very important, so it won’t be ranked high for shirts or movies
• Example:
• Shirt-seller creates 1,000 pages, each links to his with "movie" in the anchor text
• These pages have no links in, so they get little PageRank
• So the shirt-seller can’t beat truly important movie pages, like IMDB
Google vs. Spammers: Round 2!
• Once Google became the dominant search engine, spammers began to work out ways to fool Google
• Spam farms were developed to concentrate PageRank on a single page
• Link spam:
• Creating link structures that boost PageRank of a particular page
Spam Farm
[Figure: anatomy of a spam farm]
• Target page: the page at which the spammer attempts to place as much PageRank as possible
• Inaccessible pages: the pages that the spammer cannot affect; most of the Web is in this part
• Accessible pages: those pages that, while they are not controlled by the spammer, can be affected by the spammer
• Supporting pages: the pages that the spammer owns and controls
Spam Farm Analysis
• x: PageRank contributed to the target page t by the accessible pages
• y: PageRank of the target page
• m: # of supporting pages
• β: damping factor (so 1 − β is the tax percentage)
• n: # of pages in the entire Web
• PageRank of each supporting page: βy/m + (1 − β)/n
Compute PageRank of Target Page
• Contributed by supporting pages: βm(βy/m + (1 − β)/n)
• Contributed by accessible pages: x
• Teleporting: (1 − β)/n, negligible

y = x + βm(βy/m + (1 − β)/n) = x + β²y + β(1 − β)m/n

Solving for y:

y = x/(1 − β²) + c·m/n,   where c = β/(1 + β)

If β = .85, then 1/(1 − β²) = 3.6 and c = .46:
• The farm amplifies the external contribution by 360%
• Plus additional PageRank equal to 46% of the fraction of all Web pages that are in the farm
(a worked example follows below)
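Plugging the closed form in (all of the numbers below are made up for illustration):

```python
# Hypothetical inputs for the spam-farm formula
beta = 0.85
x = 1e-4      # PageRank reaching the target from accessible pages (assumed)
m = 10_000    # supporting pages in the farm (assumed)
n = 10**9     # pages in the entire Web (assumed)

c = beta / (1 + beta)
y = x / (1 - beta**2) + c * (m / n)
print(round(1 / (1 - beta**2), 1), round(c, 2))  # 3.6 0.46
print(y)  # PageRank concentrated on the target page
```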
TrustRank
• A variant of topic-sensitive PageRank
• Topic = a set of trustworthy pages
• Identify trustworthy pages:
1. Manually examine pages with high PageRank
2. Pick pages within controlled domains: edu, gov, mil
3. Avoid borderline pages, e.g., blog sites, forums

HITS: Hubs and Authorities
HITS Algorithm
• Hyperlink-induced topic search
• Each page has two scores
• Authority: how important is the page about the topic?
• Hub: does it provide many pointers to the topic?
• Examples
• Hub: department pages listing all courses
• Authority: individual course pages
Authority and Hub
• A good authority page is one pointed to by many good hub pages
• A good hub page is one pointing to many good authority pages
[Figure: hub pages U, V, W point to authority page A; hub page B points to authority pages X, Y, Z]
Computing Authority and Hubbiness
• Authority of a page = sum of the hubbiness of the pages pointing to it
• a(A) = h(U) + h(V) + h(W)
• Hubbiness of a page = sum of the authority of the pages it points to
• h(B) = a(X) + a(Y) + a(Z)
Link Matrix
• L[i,j] = 1 if there is a link from node i to node j
• L^T is the un-normalized M in PageRank
Computing Authority and Hubbiness
• h = λLa,  a = μL^T h
• λ, μ: scaling factors to normalize h and a
• h = λμ LL^T h,  a = λμ L^T L a
• LL^T and L^T L are likely dense
• Expensive to use direct recursion on h and a
• e.g., iteratively computing h = λμ LL^T h
• Use mutual recursion instead
Compute via Mutual Recursion
1. Start with h = a vector of all 1's
2. Compute a = L^T h, then scale it so that the largest component is 1
3. Compute h = La and scale it similarly
4. Repeat steps 2 and 3 until h and a do not change much
(a sketch follows)
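A NumPy sketch of this mutual recursion, run on the 5-node example used in the next slides:

```python
import numpy as np

def hits(L, iters=100):
    """Mutual recursion from the slides: a = L^T h, h = L a, each scaled
    so that its largest component is 1."""
    h = np.ones(L.shape[0])
    for _ in range(iters):
        a = L.T @ h
        a = a / a.max()    # scale: largest authority = 1
        h = L @ a
        h = h / h.max()    # scale: largest hub = 1
    return h, a

# The 5-node example below: A->B,C,D; B->A,D; C->E; D->B,C; E-> (nothing)
L = np.array([[0, 1, 1, 1, 0],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 0, 0]], dtype=float)
h, a = hits(L)
print(h.round(2))  # ~[1.0, 0.36, 0.0, 0.72, 0.0]
print(a.round(2))  # ~[0.21, 1.0, 1.0, 0.79, 0.0]
```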
Bipartite View of Iterations
[Figure: bipartite view of nodes A-E, with hub scores on one side and authority scores on the other]
• Compute authority scores from hub scores, then hub scores from authority scores, alternating
Example
• Graph: A → B, C, D;  B → A, D;  C → E;  D → B, C;  E → (nothing)

h_0 = [1, 1, 1, 1, 1]^T

              0 1 0 0 0     1      1          1/2
              1 0 0 1 0     1      2           1
a = L^T h_0 = 1 0 0 1 0  ·  1  =   2   scale→  1
              1 1 0 0 0     1      2           1
              0 0 1 0 0     1      1          1/2

• Note: E's authority = C's hub score (C is E's only in-link)
• Can you predict the changes to hub scores next?
Example
Given a = [1/2, 1, 1, 1, 1/2]^T:

           0 1 1 1 0    1/2     3          1
           1 0 0 1 0     1     3/2        1/2
h = L a =  0 0 0 0 1  ·  1  =  1/2  scale→ 1/6
           0 1 1 0 0     1      2         2/3
           0 0 0 0 0    1/2     0          0

• C's hub = E's authority
• Will E's authority decrease? Will E have a hub score > 0?
Example
Continuing:

       1/2         3/10               29/10          1
       5/3          1                 12/10        12/29
a =    5/3  scale→   1   ;   h = La = 1/10   scale→ 1/29
       3/2         9/10                 2          20/29
       1/6         1/10                 0            0

• E's authority (and C's hub score) decrease due to normalization
Example
One more authority update, from h = [1, 12/29, 1/29, 20/29, 0]^T:

        12/29           12/49
        49/29             1
a =     49/29    scale→    1
        41/29           41/49
        1/29            1/49

• Again a decrease due to normalization
At the Limit
• A: greatest hub
• B, C: greatest authorities
h: [1, 1, 1, 1, 1] → [1, 1/2, 1/6, 2/3, 0] → [1, 12/29, 1/29, 20/29, 0] → … → [1, .36, 0, .72, 0]
a: [1/2, 1, 1, 1, 1/2] → [3/10, 1, 1, 9/10, 1/10] → [12/49, 1, 1, 41/49, 1/49] → … → [.21, 1, 1, .79, 0]
PageRank and HITS
• PageRank and HITS are two solutions to the same problem
• Rank the relative value of pages
• Use these values in search engine algorithms to determine the order of search results