Algorithms for Large Graphs
COMPSCI 753 Kaiqi Zhao
Slides adapted from “Mining of Massive Datasets” http://mmds.org
Graph data
§ Social networks: Monthly Active Users (MAU)

  Twitter     300 million
  Facebook    2.6 billion
  TikTok      800 million
Graph data
§ Transportation networks
Graph data
§ Knowledge graphs

  Yago         10M nodes, 120M facts
  DBPedia      4.58M nodes, 3B facts
  ConceptNet   300K nodes, 1.6M facts
Graph data
§ Citation network (Börner et al., 2012)
Graph data
§ Web as a graph (4.2 billion webpages)
§ Nodes: webpages; Edges: hyperlinks
Overview
§ Web Search
§ PageRank
§ Web spam
§ Social network analysis
§ Community detection
§ Influence maximization
Web search
§ A broad question: how do we efficiently organize the data on the web?
§ Web directories, e.g., Yahoo!
§ Web search, e.g., Google Search
§ Find a ranked list of relevant documents given keywords
§ Difficulties:
§ Efficiency issue: scanning over all the documents for each query is impossible!
§ Accuracy issue: there is spam everywhere on the web!
Web search
§ Search engines:
§ Crawl the Web and collect information from webpages
§ Index webpages according to words – inverted index
§ Answer queries (lists of words) with the webpages containing those words
(Figure: inverted index)
§ Search Result Ranking:
§ Return a list of webpages for a query, ordered by importance/relevance
Importance measure by TF-IDF
§ We may consider each web page as a plain text document and use Term Frequency – Inverse Document Frequency (TF-IDF)
§ It models the importance of each word in a document:

  TF-IDF(w, d) = tf(w, d) × idf(w)

  tf(w, d) = f_{w,d}    (term frequency of w in document d)

  idf(w) = log( N / |{d ∈ D : w ∈ d}| )    (N = total #docs; the denominator is the #docs containing the term)
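As a concrete sketch, TF-IDF can be computed as below (Python; raw counts for tf, natural log for idf, function and variable names are our own, not from the slides):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF(w, d) = tf(w, d) * idf(w), with idf(w) = log(N / df(w))."""
    n = len(docs)
    df = Counter()               # document frequency of each term
    for doc in docs:
        df.update(set(doc))      # count each term once per document
    scores = []
    for doc in docs:
        tf = Counter(doc)        # raw term frequency f_{w,d}
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores

docs = [["graph", "algorithm", "graph"], ["pizza", "recipe", "pizza"]]
print(tf_idf(docs))
```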
Problems with TF-IDF
§ The search engine can be easily fooled
§ Term spam:
§ Put invisible words in webpages
§ A webpage about pizza can pretend to be a page about AI, electronics, etc.:
§ Put 1,000 keywords like “graph” and “algorithm” in the webpage in an invisible div layer, or simply make the font color the same as the background color
§ Humans cannot see such keywords, but the search engine does
§ Idea – look at what others say about a webpage rather than what it says about itself
Web as a directed graph
§ In-links of a webpage p: hyperlinks pointing to p
§ Out-links of a webpage p: hyperlinks within p pointing to other pages
PageRank
[Larry Page, ’99]
§ Idea – the more in-links come from important webpages, the more likely the webpage itself is important
§ A page is more important if it has more in-links:
§ www.stanford.edu has 23,400 in-links
§ www.joe-schmoe.com has 1 in-link
§ Links from important pages count more:
§ Many trustworthy sites like BBC News link to www.stanford.edu
PageRank
Flow definition
§ Each link’s vote is proportional to the importance of its source page
§ If page i with importance r_i has d_i out-links, each out-link gets r_i / d_i votes
§ Page j’s importance is the sum of the votes on its in-links:

  r_j = Σ_{i→j} r_i / d_i
PageRank
Matrix formulation
§ Let 𝐺 be a graph with 𝑁 nodes and 𝑁×𝑁 adjacency matrix 𝑴
§ For each node i with d_i out-links: if we have a link i → j, then M_{ji} = 1/d_i, otherwise M_{ji} = 0
§ PageRank computes the rank vector 𝒓 as: 𝒓=𝑴⋅𝒓
§ Example of the matrix M (pages y, a, m; y links to y and a; a links to y and m; m links to m):

        y     a     m
  y  [ 1/2   1/2    0  ]
  a  [ 1/2    0     0  ]
  m  [  0    1/2    1  ]
§ Relation to the recursive definition: the j-th row of r = M ⋅ r expands to

  r_j = Σ_i M_{ji} ⋅ r_i = Σ_{i→j} r_i / d_i
§ Suppose M is a column stochastic matrix (true if there are no dead ends), e.g., the y/a/m matrix above:
§ Each entry is non-negative
§ Each column sums up to 1
§ Properties of a column stochastic matrix M:
§ It always has an eigenvalue equal to 1
§ All other eigenvalues λ′ satisfy |λ′| ≤ 1

NOTE: x is an eigenvector with corresponding eigenvalue λ if Ax = λx. For λ = 1, AAx = Ax = x.
More information: http://people.math.harvard.edu/~knill/teaching/math19b_2011/handouts/lecture33.pdf
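A quick NumPy check of both properties on the y/a/m matrix above (our own illustration, not from the slides):

```python
import numpy as np

# Column-stochastic M from the y/a/m example
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
eigvals = np.linalg.eigvals(M)
print(np.isclose(eigvals, 1.0).any())        # True: 1 is an eigenvalue
print((np.abs(eigvals) <= 1 + 1e-9).all())   # True: all |lambda| <= 1
```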
Solving PageRank – Iterative Method
r = M ⋅ r
§ Consider the rank vector r as the eigenvector of M with eigenvalue 1, i.e., the stationary (stable equilibrium) distribution of M
§ Power iteration:
§ Suppose there are N webpages
§ Initialize: r^(0) = [1/N, …, 1/N]^T
§ Iterate: r^(t+1) = M ⋅ r^(t)
§ Stop when the difference between two consecutive iterations is smaller than a threshold: Σ_{1≤i≤N} |r_i^(t+1) − r_i^(t)| < ε
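A minimal NumPy sketch of power iteration, assuming a column-stochastic M with no dead ends (function names are ours):

```python
import numpy as np

def pagerank_power(M, eps=1e-8, max_iter=100):
    """Power iteration: repeat r <- M @ r until the L1 change drops below eps."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)        # r^(0) = [1/N, ..., 1/N]^T
    for _ in range(max_iter):
        r_next = M @ r             # r^(t+1) = M . r^(t)
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r

# Column-stochastic M from the y/a/m example above
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
print(pagerank_power(M))  # tends to [0, 0, 1]: m is a spider trap
```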
Random Walk Interpretation
§ A random web surfer (random walker):
1. The surfer starts at a random page i
2. At the next time step, the surfer visits one of its out-links uniformly at random
3. Suppose this selects a link to page j
4. Repeat 2 and 3 from page j
§ Page probability:
§ Let p(x_t) denote the probability distribution over pages at time t
§ The PageRank vector r is the stationary distribution p(x_t) of the random walk
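A small simulation sketch of the random surfer (our own illustration): long-run visit frequencies approximate the stationary distribution:

```python
import random

def random_surfer(out_links, start, steps):
    """Simulate the random surfer; visit frequencies approximate PageRank."""
    counts = {p: 0 for p in out_links}
    page = start
    for _ in range(steps):
        page = random.choice(out_links[page])  # follow a random out-link
        counts[page] += 1
    return {p: c / steps for p, c in counts.items()}

# y/a/m example (no dead ends; m is a spider trap, so its frequency -> 1)
print(random_surfer({"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}, "y", 100_000))
```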
Markov Chain
§ Let x_t be the page of the random walker at time t:

  p(x_t = j) = Σ_i p(x_t = j | x_{t−1} = i) ⋅ p(x_{t−1} = i)

§ This probabilistic model is called a Markov Chain
Convergence of Random Walk
Theorem for Markov Chains:
§ If the matrix M is
§ Stochastic – entries are non-negative and each column sums to 1
§ Aperiodic – starting from a node, returns to that node do not occur with a fixed period
§ Irreducible – every node can reach every other node
§ Then
§ The stationary distribution exists and is unique, and it does not depend on the initial values!
Structure of Web
Problems of PageRank
§ Problem 1: Dead ends
§ Pages that do not have out-links
§ Importance that propagates to dead ends leaks out of the system
Example of a dead end (a → b; b has no out-links):

  Iteration:  0  1  2  3
  r_a:        1  0  0  0
  r_b:        0  1  0  0
§ Problem 2: Spider traps
§ The random walk gets “stuck” in a trap
§ Eventually, spider traps absorb all importance

Example of a spider trap (a → b, b → c, c → b; {b, c} is a trap), using the update r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i:

  Iteration:  0  1  2  3
  r_a:        1  0  0  0
  r_b:        0  1  0  1
  r_c:        0  0  1  0
Dead ends
§ Solution to dead ends:
§ Idea – Run PageRank on the subgraph G′ without dead ends, then propagate the ranks to the dead ends.

Example: in G, y links to y and a; a links to y and m; m is a dead end:

  M(G):    y     a     m
    y   [ 1/2   1/2    0  ]
    a   [ 1/2    0     0  ]      Not column stochastic!
    m   [  0    1/2    0  ]

Removing m gives G′ over {y, a}:

  M(G′):   y     a
    y   [ 1/2    1  ]
    a   [ 1/2    0  ]

Run PageRank on G′:

  Iteration:  0     1    ...   stop
  r_y:       1/2   3/4   ...   2/3
  r_a:       1/2   1/4   ...   1/3

Propagating the converged ranks one step through M(G) gives the final ranks: r_y = 1/2, r_a = 1/3, r_m = 1/6.
Spider trap
§ Solution to spider traps (assume no dead ends exist):
§ Idea – Get out of the trap by teleporting to a randomly selected node.
§ At each page, the random surfer has two choices:
§ Follow one of the out-links to propagate importance (with probability β)
§ Jump to a random node (with probability 1 − β)

Flow form:

  r_j = Σ_{i→j} β ⋅ (r_i / d_i) + Σ_i (1 − β) ⋅ (1/N) ⋅ r_i

Matrix form:

  r = A ⋅ r,  where A = β M + (1 − β) (1/N) 𝟏_{N×N}

and 𝟏_{N×N} is the N×N all-ones matrix.
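A minimal sketch of power iteration on A, assuming no dead ends (names are ours):

```python
import numpy as np

def pagerank_teleport(M, beta=0.8, eps=1e-8, max_iter=100):
    """Power iteration on A = beta*M + (1-beta)*(1/N)*ones (no dead ends)."""
    n = M.shape[0]
    A = beta * M + (1 - beta) / n * np.ones((n, n))
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = A @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r

# y/a/m example with spider trap m (see the matrices on the next slide)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
print(pagerank_teleport(M))  # ~ [7/33, 5/33, 21/33]: m no longer absorbs all rank
```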
Teleport example for spider trap
With β = 0.8 and pages {y, a, m}, where m is a spider trap:

  A = 0.8 × M + 0.2 × (1/3) 𝟏

  M:                  (1/3) 𝟏:             A:
  1/2  1/2   0        1/3  1/3  1/3        7/15  7/15   1/15
  1/2   0    0        1/3  1/3  1/3        7/15  1/15   1/15
   0   1/2   1        1/3  1/3  1/3        1/15  7/15  13/15

(rows and columns ordered y, a, m)
Solving both problems with teleport
§ Solving dead ends with teleport:
§ Idea – Always teleport at dead-end nodes.
§ Adjust the graph: link each dead end to all nodes with equal probability 1/N. Let the new transition matrix be M′ (each dead-end column becomes [1/N, …, 1/N]^T)
§ Calculate A = β M′ + (1 − β) (1/N) 𝟏_{N×N}

Example (m is a dead end):

  M′:     y     a     m
    y  [ 1/2   1/2   1/3 ]
    a  [ 1/2    0    1/3 ]
    m  [  0    1/2   1/3 ]

Note: M′ may have dense columns.
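A sketch of building A with this dead-end adjustment (our own illustration of the construction above):

```python
import numpy as np

def add_teleport(M, beta=0.8):
    """Build A = beta*M' + (1-beta)/n, where dead-end columns of M
    are first replaced by uniform 1/n columns."""
    n = M.shape[0]
    Mp = M.copy()
    dead = Mp.sum(axis=0) == 0          # columns of dead-end nodes
    Mp[:, dead] = 1.0 / n               # always teleport from dead ends
    return beta * Mp + (1 - beta) / n * np.ones((n, n))

# y/a/m example with m as a dead end
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
print(add_teleport(M))
```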
Efficiency of Power Iteration
§ Power iteration multiplies the matrix many times!
§ r = A ⋅ r
§ Time cost: O(N² ⋅ I) for I iterations
§ Space cost: O(N²)
§ Note that A = β M + (1 − β)(1/N) 𝟏 is dense even when M is sparse (see the teleport example above)
§ What if we have N = 1 billion webpages?
§ We would have N² = 10^18 entries in the matrix!
Sparse matrix formulation
§ We can rearrange the PageRank equation as follows:

  r = β M r + [(1 − β)/N]_N

  where [(1 − β)/N]_N is a vector whose N entries all equal (1 − β)/N – a constant!

§ M is sparse!
§ In each iteration (suppose no dead ends):
§ Calculate r_j^new = β Σ_{i→j} r_i^old / d_i
§ Update the rank with the constant: r_j^new = r_j^new + (1 − β)/N
§ Question: what if we have dead ends? (One fix is sketched below.)
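A minimal adjacency-list sketch of this update (names are ours; the leak-correction line is one standard answer to the dead-end question above, redistributing the rank lost through dead ends uniformly):

```python
import numpy as np

def pagerank_sparse(out_links, n, beta=0.8, eps=1e-8, max_iter=100):
    """Sparse update: r_new = beta * M @ r_old + (1-beta)/n, plus leak fix.
    out_links maps each source node to its list of destination nodes."""
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = np.full(n, (1 - beta) / n)   # the constant term
        for i, dests in out_links.items():
            share = beta * r[i] / len(dests) # each out-link gets beta * r_i / d_i
            for j in dests:
                r_new[j] += share
        # Re-insert rank leaked through dead ends (nodes with no out-links)
        r_new += (1 - r_new.sum()) / n
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new
    return r

# y/a/m example with m as a dead end: y->{y,a}, a->{y,m}, m->{}
print(pagerank_sparse({0: [0, 1], 1: [0, 2]}, n=3))
```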
§ M is sparse: with 10~20 links per node, it has about 10N = 10^10 nonzero entries
§ Suppose we use 4 bytes for each entry: we still need 40 GB to store M!
Store the graph on disk
§ We could store the matrix M on disk in the following form (r_old is the input rank vector and r_new the output, indexed 0–7 here):

  source | degree | destinations
  -------|--------|--------------
  0      | 3      | 1, 4, 7
  1      | 4      | 3, 6, 52, 711
  2      | 2      | 0, 13
  3      | 2      | 71, 22

What if we cannot even fit r_new in memory?
Block update
§ Basic idea: break r_new into blocks!
§ Example: entries 0–3 of r_new form Block 1 and entries 4–7 form Block 2; M is stored as before:

  source | degree | destinations
  -------|--------|--------------
  0      | 3      | 1, 4, 7
  1      | 4      | 3, 6, 52, 711
  2      | 2      | 0, 13
  3      | 2      | 71, 22

Ø Each time, load and update only one block of r_new, then write it back to disk
Ø Need to read M and r_old k times (once per block)!
Block-Stripe update
§ Basic idea: divide M into stripes! Each stripe only contains the destinations that fall inside the corresponding block of r_new (a code sketch follows below)

Stripe for Block 1 (entries 0–3 of r_new):

  source | degree | destinations
  -------|--------|--------------
  0      | 3      | 1
  1      | 4      | 3
  2      | 2      | 0

Stripe for Block 2 (entries 4–7 of r_new):

  source | degree | destinations
  -------|--------|--------------
  0      | 3      | 4, 7
  1      | 4      | 6

Note: the degree column still stores each source’s full out-degree, which is needed to compute r_i / d_i.
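A minimal in-memory sketch of one block-stripe pass over the example above (in a real system each stripe and block would be streamed from disk; all names are ours):

```python
import numpy as np

def block_stripe_pass(stripes, degrees, r_old, blocks, beta=0.8):
    """One PageRank pass over stripes: stripe b only holds the destinations
    that fall inside block b of r_new; degrees hold FULL out-degrees."""
    n = len(r_old)
    r_new = np.empty(n)
    for b, stripe in enumerate(stripes):               # load one stripe + block
        block = np.full(len(blocks[b]), (1 - beta) / n)
        lo = blocks[b].start
        for src, dests in stripe:
            for j in dests:                            # j lies inside block b
                block[j - lo] += beta * r_old[src] / degrees[src]
        r_new[blocks[b]] = block                       # write the block back
    return r_new

# The 8-node example above: stripes keep only the in-range destinations shown
stripes = [[(0, [1]), (1, [3]), (2, [0])],             # Block 1: entries 0..3
           [(0, [4, 7]), (1, [6])]]                    # Block 2: entries 4..7
degrees = {0: 3, 1: 4, 2: 2}
r_old = np.full(8, 1.0 / 8)
print(block_stripe_pass(stripes, degrees, r_old, [range(0, 4), range(4, 8)]))
```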