Algorithms for Large Graphs
COMPSCI 753 Kaiqi Zhao
Slides adapted from “Mining of Massive Datasets” http://mmds.org
Problems with PageRank
§ Measures generic importance of a page
§ In different topics, the importance of a web page could be different!
§ The results are not personalized
§ All the users will get the same results
§ Web spam
§ Spammers can create certain topological structures (e.g., link farms) to boost PageRank
Biased PageRank
§ In standard PageRank, every node is in the teleport set.
§ In biased PageRank, teleport takes place only at nodes within a set 𝑆.
§ The matrix formulation of biased PageRank is
𝒓 = 𝛽𝑴𝒓 + (1 − 𝛽) (1/|𝑺|) 𝒆𝑺
where 𝒆𝑺 is a vector whose 𝑖-th entry is 1 if node 𝑖 is in 𝑺, and 0 otherwise
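To make the update rule concrete, here is a minimal power-iteration sketch of biased PageRank (not from the slides); the 4-node graph, 𝛽 = 0.8, and the teleport set 𝑆 = {0} are illustrative assumptions.

```python
import numpy as np

def biased_pagerank(M, S, beta=0.8, tol=1e-10, max_iter=100):
    """Power iteration for r = beta * M r + (1 - beta)/|S| * e_S.

    M: column-stochastic transition matrix (M[i, j] = 1/out-deg(j) if j links to i).
    S: teleport set, given as an iterable of node indices.
    """
    n = M.shape[0]
    S_idx = list(S)
    e_S = np.zeros(n)
    e_S[S_idx] = 1.0                         # indicator vector e_S of the teleport set
    teleport = (1.0 - beta) * e_S / len(S_idx)
    r = np.full(n, 1.0 / n)                  # start from the uniform distribution
    for _ in range(max_iter):
        r_new = beta * M @ r + teleport
        if np.abs(r_new - r).sum() < tol:    # small L1 change -> converged
            break
        r = r_new
    return r

# Hypothetical 4-node graph: column j spreads node j's rank evenly over its out-links
M = np.array([[0,   1/2, 1, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 0, 0  ]])
print(biased_pagerank(M, S={0}))   # ranking biased towards node 0
```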
Applications of Biased PageRank
§ Topic-specific ranking
§ Personalized PageRank
§ TrustRank
Topic-specific ranking
§ Intuition:
§ Under different contexts, the search engine should be able to return different rankings depending on the topic, e.g., “technology” or “life”.
§ E.g., the query “apple” may be related to “technology” or to “life”.
§ Question: how do we get a ranking score within a topic?
§ Teleport set 𝑺𝒕 = a set of pages relevant to topic 𝑡
§ For each teleport set 𝑺𝒕, we get a different PageRank vector 𝒓𝒕
𝒓𝒕 = 𝛽𝑴𝒓𝒕 + (1 − 𝛽) (1/|𝑺𝒕|) 𝒆𝑺𝒕
Topic-specific ranking
§ Topic-specific PageRank:
§ For each topic 𝑡, we get a PageRank vector 𝒓𝒕 based on the teleport set 𝑺𝒕
§ For each query 𝑞, classify it to get the probability 𝑝(𝑡|𝑞)
§ Could use topic modeling or clustering
§ The final importance score is obtained as a linear combination (see the sketch below):
𝒓 = ∑𝒕 𝑝(𝑡|𝑞) 𝒓𝒕
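As a sketch of the combination step above (illustrative values only, not from the slides): the per-topic vectors 𝒓𝒕 would come from biased PageRank runs, and 𝑝(𝑡|𝑞) from a query classifier.

```python
import numpy as np

# Hypothetical per-topic PageRank vectors r_t (e.g., produced by biased PageRank)
r_topic = {
    "technology": np.array([0.40, 0.10, 0.30, 0.20]),
    "life":       np.array([0.10, 0.40, 0.20, 0.30]),
}
# Hypothetical classifier output p(t|q) for the query "apple"
p_t_given_q = {"technology": 0.7, "life": 0.3}

# r = sum_t p(t|q) * r_t
r = sum(p * r_topic[t] for t, p in p_t_given_q.items())
print(r)   # query-specific importance scores
```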
Topic-specific ranking
Example: biased PageRank with teleport set 𝑺 = {1} and 𝛽 = 0.8

Node      1       2       3       4
Iter=0    0.25    0.25    0.25    0.25
Iter=1    0.4     0.1     0.3     0.2
Iter=2    0.28    0.16    0.32    0.24
…
Stop      0.294   0.118   0.327   0.261

[Figure: bar charts of the converged ranks of nodes 1–4 — “Rank vs. 𝛽” for 𝛽 = 0.7, 0.8, 0.9, and “Rank vs. 𝑺” for 𝑺 = {1}, {1,2}, {1,2,3}, {1,2,3,4}]
Personalized PageRank
§ Idea: Provide a personalized ranking for each user based on his/her preferred web pages 𝑺𝒖 (e.g., browsing history)
𝒓𝒖 = 𝛽𝑴𝒓𝒖 + (1 − 𝛽)𝒑𝒖
§ 𝒑𝒖 = (1/|𝑺𝒖|) 𝒆𝑺𝒖 is called the preference vector
§ Problems
§ Computation for each personalized ranking 𝒓𝒖 is expensive
§ Google has 2 billion users – it is not possible to compute a separate ranking for every user!
Linearity of PageRank
[Jeh and Widom ’03]
§ Linearity Theorem: For any preference vectors 𝒑𝟏 and 𝒑𝟐, if 𝒓𝟏 and 𝒓𝟐 are the two corresponding personalized PageRank vectors, then for any constants 𝑤1, 𝑤2 ≥ 0 with 𝑤1 + 𝑤2 = 1:
𝑤1𝒓𝟏 + 𝑤2𝒓𝟐 = 𝛽𝑴(𝑤1𝒓𝟏 + 𝑤2𝒓𝟐) + (1 − 𝛽)(𝑤1𝒑𝟏 + 𝑤2𝒑𝟐)
§ That is, 𝑤1𝒓𝟏 + 𝑤2𝒓𝟐 is exactly the personalized PageRank vector for the mixed preference vector 𝑤1𝒑𝟏 + 𝑤2𝒑𝟐
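A quick numeric check of the theorem (a sketch, not from the slides), reusing the hypothetical 4-node matrix and the assumed 𝛽 = 0.8 from the earlier snippet: solve personalized PageRank for two preference vectors and confirm that the convex combination of the solutions solves the combined problem.

```python
import numpy as np

def personalized_pagerank(M, p, beta=0.8, iters=200):
    """Iterate r = beta * M r + (1 - beta) * p for a given preference vector p."""
    r = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iters):
        r = beta * M @ r + (1 - beta) * p
    return r

# Same hypothetical column-stochastic matrix as in the earlier sketch
M = np.array([[0,   1/2, 1, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 0, 0  ]])

p1 = np.array([1.0, 0.0, 0.0, 0.0])     # preference vector concentrated on page 0
p2 = np.array([0.0, 0.5, 0.5, 0.0])     # preference vector spread over pages 1 and 2
w1, w2 = 0.3, 0.7                       # non-negative weights summing to 1

r1 = personalized_pagerank(M, p1)
r2 = personalized_pagerank(M, p2)
r_mix = personalized_pagerank(M, w1 * p1 + w2 * p2)

# Linearity: w1*r1 + w2*r2 equals the PageRank of the mixed preference vector
print(np.allclose(w1 * r1 + w2 * r2, r_mix))   # True
```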
Calculating Personalized PageRank
§ Basis vector 𝒆𝒊: a one-hot vector whose 𝑖-th entry is 1 and all other entries are 0
§ Each basis vector 𝒆𝒊 denotes a teleport set of exactly one web page 𝑖
§ Any user preference vector can be decomposed as:
𝒑𝒖 = (1/|𝑺𝒖|) 𝒆𝑺𝒖 = (1/|𝑺𝒖|) ∑𝒊∈𝑺𝒖 𝒆𝒊
§ According to the Linearity Theorem:
𝒓𝒖 = (1/|𝑺𝒖|) ∑𝒊∈𝑺𝒖 𝒓𝒊
The Linearity Theorem allows us to model each user preference vector as an ensemble of basis vectors!
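In other words, the per-page basis vectors 𝒓𝒊 can be precomputed once and a user's ranking assembled on the fly. A minimal sketch, reusing `personalized_pagerank` and the assumed matrix `M` from the previous snippet:

```python
import numpy as np

# Precompute the basis PageRank vector r_i for every page i (teleport set {i})
n = M.shape[0]
basis_r = [personalized_pagerank(M, np.eye(n)[i]) for i in range(n)]

def user_pagerank(S_u):
    """r_u = (1/|S_u|) * sum_{i in S_u} r_i, assembled from precomputed basis vectors."""
    return sum(basis_r[i] for i in S_u) / len(S_u)

print(user_pagerank({0, 1}))   # user whose teleport set is pages {0, 1}
```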
Example
§ Basis vectors:
𝒆𝟏 = (1, 0, 0)ᵀ, 𝒆𝟐 = (0, 1, 0)ᵀ, 𝒆𝟑 = (0, 0, 1)ᵀ
§ Biased PageRank:
𝒓𝟏 = (0.1, 0.7, 0.2)ᵀ, 𝒓𝟐 = (0.3, 0.3, 0.4)ᵀ, 𝒓𝟑 = (0.2, 0.1, 0.7)ᵀ
§ User 1 – Teleport set 𝑺𝒖𝟏 = {1, 2}:
𝒑𝒖𝟏 = (1/2)𝒆𝑺𝒖𝟏 = (1/2)𝒆𝟏 + (1/2)𝒆𝟐
𝒓𝒖𝟏 = (1/2)𝒓𝟏 + (1/2)𝒓𝟐 = (0.2, 0.5, 0.3)ᵀ
§ User 2 – Teleport set 𝑺𝒖𝟐 = {1, 2, 3}:
𝒑𝒖𝟐 = (1/3)𝒆𝑺𝒖𝟐 = (1/3)𝒆𝟏 + (1/3)𝒆𝟐 + (1/3)𝒆𝟑
𝒓𝒖𝟐 = (1/3)𝒓𝟏 + (1/3)𝒓𝟐 + (1/3)𝒓𝟑 = (0.2, 0.37, 0.43)ᵀ
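The arithmetic in this example can be verified directly (the three basis PageRank vectors below are copied from the slide):

```python
import numpy as np

r1 = np.array([0.1, 0.7, 0.2])
r2 = np.array([0.3, 0.3, 0.4])
r3 = np.array([0.2, 0.1, 0.7])

r_u1 = (r1 + r2) / 2            # user 1: teleport set {1, 2}
r_u2 = (r1 + r2 + r3) / 3       # user 2: teleport set {1, 2, 3}
print(r_u1)                     # [0.2 0.5 0.3]
print(np.round(r_u2, 2))        # [0.2  0.37 0.43]
```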
Overview
§ Web Search
§ PageRank
§ Web spam
§ Social network analysis
§ Community detection
§ Influence maximization
Link spam
§ The goal of a spammer is to maximize the PageRank of a target page
§ Three kinds of web pages for a spammer
§ Inaccessible pages
§ Accessible pages (e.g., spammers can put links in the comments)
§ Owned pages
§ Spammers have full control
§ Include the target page and support pages to boost the rank for the target.
§ Ideas to boost PageRank:
§ Post many links from the accessible pages to the target page
§ Construct a link farm to get a PageRank multiplier effect
Link farms
[Figure: link farm structure — inaccessible pages and accessible pages link to the spammer's target page, which links to and is linked back by the 𝑀 owned support pages]
Analysis on link farms
§ Assume that without the link farm, the target page has a rank of the form 𝑦 = 𝑥 + (1 − 𝛽)/𝑁 (𝑁: total number of pages)
§ 𝑥: contribution from the accessible pages
§ 𝑦: PageRank of the target page with the link farm
§ 𝑧: PageRank of each of the 𝑀 support pages: 𝑧 = 𝛽𝑦/𝑀 + (1 − 𝛽)/𝑁
§ The rank of the target with the link farm:
𝑦 ≈ 𝑥/(1 − 𝛽²) + 𝑐 · 𝑀/𝑁, where 𝑐 = 𝛽/(1 + 𝛽)
[Figure: the link farm again — target page, accessible pages, and 𝑀 owned support pages]
§ Two properties:
1. The multiplier 1/(1 − 𝛽²) > 1 boosts the contribution of 𝑥
2. The PageRank of the target grows linearly with the number of support pages 𝑀
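A small numeric check of this derivation (a sketch, not from the slides): iterate the two coupled equations 𝑦 = 𝑥 + 𝛽𝑀𝑧 + (1 − 𝛽)/𝑁 and 𝑧 = 𝛽𝑦/𝑀 + (1 − 𝛽)/𝑁 to their fixed point and compare against the closed-form approximation; the values of 𝑥, 𝑀, 𝑁 and 𝛽 are arbitrary.

```python
beta, N, M_farm, x = 0.85, 1_000_000, 10_000, 0.0005   # arbitrary illustrative values

# Fixed point of the coupled equations:
#   y = x + beta*M*z + (1-beta)/N   (target collects rank from its M support pages)
#   z = beta*y/M + (1-beta)/N       (each support page collects rank from the target)
y = z = 0.0
for _ in range(1000):
    z = beta * y / M_farm + (1 - beta) / N
    y = x + beta * M_farm * z + (1 - beta) / N

# Closed-form approximation from the slide: y ~ x/(1-beta^2) + c*M/N with c = beta/(1+beta)
c = beta / (1 + beta)
y_approx = x / (1 - beta**2) + c * M_farm / N
print(y, y_approx)   # the two values agree closely
```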
TrustRank
§ A topic-specific PageRank where the “topic” is the set of pages believed to be trustworthy.
§ Intuition: while it is easy for a spam page to link to a trustworthy page, it is unlikely that a trustworthy page would link to a spam page.
§ Pages are divided into:
§ Trustworthy pages
§ Fine sites with commentary functionalities
§ Spam pages
Compute TrustRank and consider pages below a threshold as spam? Problem?
Spam Mass
§ A complementary view:
What fraction of a page’s PageRank comes from spam pages?
§ Spam mass estimation
§ 𝑟𝒑 = PageRank of page 𝑝
§ 𝑟𝒑⁺ = TrustRank of page 𝑝 (PageRank with teleport restricted to trusted pages)
§ The spam mass of page 𝑝 is:
(𝑟𝒑 − 𝑟𝒑⁺) / 𝑟𝒑
§ Pages with a high spam mass are likely to be spam.
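A minimal sketch of the spam-mass check (illustrative numbers, not from the slides), assuming the ordinary PageRank vector and the TrustRank vector have already been computed, e.g., with the biased PageRank iteration sketched earlier using a trusted seed set as the teleport set:

```python
import numpy as np

r_pr = np.array([0.30, 0.25, 0.25, 0.20])   # ordinary PageRank (uniform teleport)
r_tr = np.array([0.28, 0.05, 0.24, 0.18])   # TrustRank (teleport only into trusted pages)

spam_mass = (r_pr - r_tr) / r_pr            # fraction of rank not backed by trusted pages
threshold = 0.5                             # illustrative cutoff
suspected_spam = np.where(spam_mass > threshold)[0]
print(np.round(spam_mass, 3))   # [0.067 0.8   0.04  0.1  ]
print(suspected_spam)           # [1] -> page 1 is flagged as likely spam
```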