
Algorithms for Large Graphs
COMPSCI 753 Kaiqi Zhao
Slides adapted from “Mining of Massive Datasets” http://mmds.org

Graph data
§ Social networks – Monthly Active Users (MAU):
  § Twitter: 300 million
  § Facebook: 2.6 billion
  § TikTok: 800 million

Graph data
§ Transportation networks

Graph data
§ Knowledge graphs:
  § Yago: 10M nodes, 120M facts
  § DBPedia: 4.58M nodes, 3B facts
  § ConceptNet: 300K nodes, 1.6M facts

Graph data
§ Citation network (Börner et al., 2012)

Graph data
§ Web as a graph (4.2 billion webpages)
§ Nodes: Webpages, Edges: Hyperlinks

Overview
§ Web Search
  § PageRank
  § Web spam
§ Social network analysis
  § Community detection
  § Influence maximization

Web search
§ A broad question: How to efficiently organize the data on the web?
§ Web directories, e.g., Yahoo!
§ Web search, e.g., Google search
§ Find a ranked list of relevant documents given keywords
§ Difficulties:
  § Efficiency issue: scanning over all the documents for each query is impossible!
  § Accuracy issue: spam is everywhere on the web!

Web search
§ Search engines:
  § Crawl the Web and collect information from webpages
  § Index webpages according to words – inverted index
  § Answer queries (lists of words) with the webpages containing those words
Inverted index
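A minimal sketch of building and querying an inverted index as described above (illustrative Python, not from the slides; the toy corpus and query are made up):

    from collections import defaultdict

    # Toy corpus: doc id -> text (hypothetical example documents)
    docs = {
        0: "graph algorithms for large graphs",
        1: "web search with pagerank",
        2: "pagerank is a graph algorithm",
    }

    # Build the inverted index: word -> set of doc ids containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)

    # Answer a query (list of words) with the docs containing ALL query words
    def search(query_words):
        result = set(docs)
        for w in query_words:
            result &= index.get(w, set())
        return sorted(result)

    print(search(["pagerank", "graph"]))  # -> [2]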

Web search
§ Search Result Ranking:
§ Return a list of webpages to a query ordered by importance/relevance

Importance measure by TF-IDF
§ We may consider each webpage as a plain text document and use Term Frequency – Inverse Document Frequency (TF-IDF)
§ It models the importance of each word in a document:
  TF-IDF(w, d) = tf(w, d) × idf(w)
  tf(w, d) = f_{w,d}   (frequency of term w in document d)
  idf(w) = log( N / |{d ∈ D : w ∈ d}| )   (N = total #docs; the denominator counts the docs containing the term)
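A small sketch of these formulas in code (illustrative Python; the toy corpus is made up, and the natural log is used – any base works up to scaling):

    import math
    from collections import Counter

    # Toy corpus (hypothetical): list of tokenized documents
    docs = [
        "pizza pizza recipe".split(),
        "graph algorithm notes".split(),
        "pizza dough recipe tips".split(),
    ]
    N = len(docs)

    def tf(w, d):
        return Counter(d)[w]  # raw term frequency f_{w,d}

    def idf(w):
        n_w = sum(1 for d in docs if w in d)  # #docs containing w
        return math.log(N / n_w)

    def tf_idf(w, d):
        return tf(w, d) * idf(w)

    # "pizza" appears twice in doc 0 and in 2 of the 3 docs overall
    print(tf_idf("pizza", docs[0]))  # 2 * log(3/2) ≈ 0.81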

Problems with TF-IDF
§ The search engine can be easily fooled
§ Term spam:
  § Put invisible words in webpages
  § A webpage about pizza can masquerade as a page about AI, electronics, etc.:
    § Put 1,000 keywords like “graph” and “algorithm” in the webpage in an invisible div layer, or simply make the font color the same as the background color
  § Humans cannot see such keywords, but the search engine can
§ Idea – look at what others say about the webpage rather than what it says about itself.

Web as a directed graph
§ In-links of a webpage p: hyperlinks pointing to p
§ Out-links of a webpage p: hyperlinks contained in p

PageRank
[Larry Page, ’99]
§ Idea – the more in-links a page receives from important webpages, the more likely the page itself is important
§ A page is more important if it has more in-links
  § www.stanford.edu has 23,400 in-links
  § www.joe-schmoe.com has 1 in-link
§ Links from important pages count more
  § Many trustworthy sites like BBC News link to www.stanford.edu

PageRank
Flow definition
§ Each link’s vote is proportional to the importance of its source page
§ If page j with importance r_j has d_j out-links, each out-link carries r_j / d_j votes
§ Page j’s importance is the sum of the votes on its in-links:
  r_j = Σ_{i→j} r_i / d_i
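For instance (a made-up worked example, not from the slides): if page j has in-links from page i₁ with r_{i₁} = 0.3 and d_{i₁} = 2 out-links, and from page i₂ with r_{i₂} = 0.2 and d_{i₂} = 1 out-link, then r_j = 0.3/2 + 0.2/1 = 0.35.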

PageRank
Matrix formulation
§ Let 𝐺 be a graph with 𝑁 nodes and 𝑁×𝑁 stochastic adjacency matrix 𝑴
§ For each node i with d_i out-links: if there is a link i → j, then M_{ji} = 1/d_i; otherwise M_{ji} = 0
§ PageRank computes the rank vector 𝒓 as: 𝒓 = 𝑴 ⋅ 𝒓
§ Example of the adjacency matrix (pages y, a, m; y links to y and a, a links to y and m, m links only to itself):

        y    a    m
  y [ 1/2  1/2   0 ]
  a [ 1/2   0    0 ]
  m [  0   1/2   1 ]
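A sketch of how this matrix can be built from an edge list and checked column by column (illustrative Python with numpy; the edge list encodes the y/a/m example above):

    import numpy as np

    # Edges of the example graph: (source, destination)
    nodes = ["y", "a", "m"]
    edges = [("y", "y"), ("y", "a"), ("a", "y"), ("a", "m"), ("m", "m")]

    idx = {v: k for k, v in enumerate(nodes)}
    N = len(nodes)
    out_degree = {v: sum(1 for s, _ in edges if s == v) for v in nodes}

    # M[j][i] = 1/d_i if there is a link i -> j, else 0
    M = np.zeros((N, N))
    for i, j in edges:
        M[idx[j], idx[i]] = 1.0 / out_degree[i]

    print(M)
    print(M.sum(axis=0))  # each column sums to 1 (column stochastic, no dead ends)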

§ Relation to the recursive definition: the j-th row of 𝒓 = 𝑴 ⋅ 𝒓 is exactly r_j = Σ_{i→j} r_i / d_i

§ If the graph has no dead ends, 𝑴 is a column stochastic matrix:
  § Each entry is non-negative
  § Each column sums up to 1

§ Properties of a column stochastic matrix 𝑴:
  § It always has an eigenvalue equal to 1
  § Every other eigenvalue λ′ satisfies |λ′| ≤ 1
NOTE: 𝒙 is an eigenvector with the corresponding eigenvalue λ if 𝑨𝒙 = λ𝒙. For λ = 1, 𝑨𝑨𝒙 = 𝑨𝒙 = 𝒙.
More information: http://people.math.harvard.edu/~knill/teaching/math19b_2011/handouts/lecture33.pdf
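A quick numerical check of these properties on the example matrix (illustrative Python with numpy):

    import numpy as np

    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 0.0],
                  [0.0, 0.5, 1.0]])

    eigvals = np.linalg.eigvals(M)
    print(eigvals)                               # one eigenvalue is exactly 1
    print(np.all(np.abs(eigvals) <= 1 + 1e-12))  # all magnitudes <= 1 -> True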

Solving PageRank – Iterative Method
𝒓 = 𝑴 ⋅ 𝒓
§ The rank vector is the eigenvector of 𝑴 with eigenvalue 1, i.e., the stable equilibrium distribution of 𝑴
§ Power iteration:
  § Suppose there are N webpages
  § Initialize: 𝒓^(0) = [1/N, …, 1/N]^T
  § Iterate: 𝒓^(t+1) = 𝑴 ⋅ 𝒓^(t)
  § Stop when the difference between two consecutive iterations is smaller than a threshold: Σ_{1≤i≤N} |r_i^(t+1) − r_i^(t)| < ε (see the sketch below)
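A direct sketch of this procedure (illustrative Python with numpy; the matrix is the y/a/m example and the threshold is arbitrary):

    import numpy as np

    def power_iteration(M, eps=1e-8, max_iter=100):
        N = M.shape[0]
        r = np.full(N, 1.0 / N)        # r^(0) = [1/N, ..., 1/N]^T
        for _ in range(max_iter):
            r_next = M @ r             # r^(t+1) = M . r^(t)
            if np.abs(r_next - r).sum() < eps:  # L1 difference below threshold
                return r_next
            r = r_next
        return r

    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 0.0],
                  [0.0, 0.5, 1.0]])
    print(power_iteration(M))  # -> [0, 0, 1]: m absorbs all the rank, foreshadowing the spider trap problem below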
Random Walk Interpretation
§ A random web surfer (random walker):
  1. The surfer starts at a random page i.
  2. At the next step, the surfer follows one of i’s out-links uniformly at random.
  3. Suppose the selected link leads to page j.
  4. Repeat steps 2 and 3 from page j.
§ Page probability:
  § Let p(x_t) denote the probability distribution over pages at time t.
  § The PageRank vector 𝒓 is the stationary distribution p(x_t) of the random walk.

Markov Chain
§ Let x_t be the page of the random walker at time t:
  p(x_t = j) = Σ_i p(x_t = j | x_{t−1} = i) · p(x_{t−1} = i)
§ This probabilistic model is called a Markov Chain.

Convergence of Random Walk
Theorem for Markov Chains:
§ If the matrix 𝑴 is
  § Stochastic – entries are non-negative and each column sums to 1
  § Aperiodic – returns to a node are not restricted to multiples of some period greater than 1
  § Irreducible – every node can reach every other node
§ Then
  § the stationary distribution exists and is unique, and it does not rely on the initial values!

Structure of Web

Problems of PageRank
§ Problem 1: Dead ends
  § Pages that have no out-links
  § Importance that flows into a dead end leaks out of the system
  § Example (a → b, where b is a dead end):
      Iteration   0   1   2   3
      r_a         1   0   0   0
      r_b         0   1   0   0
§ Problem 2: Spider traps
  § The random walk gets “stuck” in the trap
  § Eventually, spider traps absorb all importance
  § Example (a → b, b → c, c → b):
      Iteration   0   1   2   3
      r_a         1   0   0   0
      r_b         0   1   0   1
      r_c         0   0   1   0
  (both tables follow from r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i)

Dead ends
§ Solution to dead ends:
  § Idea – Run PageRank on the subgraph G′ without dead ends, then propagate the ranks to the dead ends.
  § Example: in the graph G over y, a, m with m a dead end, 𝑴 is not column stochastic (column m sums to 0). On G′ = G without m, power iteration gives r_y = 1/2, 3/4, …, converging to 2/3, and r_a = 1/2, 1/4, …, converging to 1/3; propagating the ranks to m then yields r_y = 1/2, r_a = 1/3, r_m = 1/6.

Spider trap
§ Solution to spider traps (assume no dead ends exist):
  § Idea – Get out of the trap by teleporting to a randomly selected node.
  § On each page, the random surfer has two choices:
    § Follow one of the out-links to propagate importance (with probability β)
    § Randomly jump to any node (with probability 1 − β)
  § Flow form:   r_j = Σ_{i→j} β r_i / d_i + Σ_i (1 − β) (1/N) r_i
  § Matrix form: 𝒓 = 𝑨 ⋅ 𝒓, where 𝑨 = β𝑴 + (1 − β) [1/N]_{N×N}

Teleport example for spider trap
§ With β = 0.8:  𝑨 = 0.8 × 𝑴 + 0.2 × [1/3]_{3×3}

  𝑴:                      𝑨:
        y    a    m              y     a     m
  y [ 1/2  1/2   0 ]      y [  7/15  7/15  1/15 ]
  a [ 1/2   0    0 ]      a [  7/15  1/15  1/15 ]
  m [  0   1/2   1 ]      m [  1/15  7/15 13/15 ]

Solving both problems with teleport
§ Solving dead ends with teleport:
  § Idea – Always teleport at dead-end nodes.
  § Adjust the graph by linking each dead end to all nodes with equal probability 1/N, and let the new transition matrix be 𝑴′ (𝑴′ may have dense columns)
  § Calculate 𝑨 = β𝑴′ + (1 − β) [1/N]_{N×N}
  § Example (m is a dead end, so its column becomes all 1/3):

        y    a    m
  y [ 1/2  1/2  1/3 ]
  a [ 1/2   0   1/3 ]
  m [  0   1/2  1/3 ]

Efficiency of Power Iteration
§ Matrix multiplication is repeated many times: 𝒓 = 𝑨 ⋅ 𝒓
  § Time cost: O(N²I) for I iterations
  § Space cost: O(N²)
§ What if we have N = 1 billion webpages?
  § The matrix 𝑨 would have N² = 10¹⁸ entries!

Sparse matrix formulation
§ We can rearrange the PageRank equation as follows:
  𝒓 = β𝑴 ⋅ 𝒓 + [(1 − β)/N]_N      – the added term is a constant vector!
§ 𝑴 is sparse!
§ In each iteration (assuming no dead ends):
  § Calculate r_j^new = β Σ_{i→j} r_i^old / d_i
  § Update the rank with the constant: r_j^new = r_j^new + (1 − β)/N
§ Question: what if we have dead ends? (A sketch of this sparse iteration follows below.)
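A sketch of this sparse update with the teleport constant (illustrative Python with numpy; the graph is the y/a/m example stored as adjacency lists, β = 0.8 as in the slides, no dead ends):

    import numpy as np

    beta = 0.8
    nodes = ["y", "a", "m"]
    N = len(nodes)
    # Sparse graph as adjacency lists: source -> destinations (y/a/m example)
    out_links = {0: [0, 1], 1: [0, 2], 2: [2]}

    r = np.full(N, 1.0 / N)
    for _ in range(50):
        r_new = np.zeros(N)
        for i, dests in out_links.items():
            for j in dests:                 # each out-link carries beta * r_i / d_i
                r_new[j] += beta * r[i] / len(dests)
        r_new += (1 - beta) / N             # add the constant teleport term
        r = r_new

    print(r)  # -> approx [7/33, 5/33, 21/33] ≈ [0.212, 0.152, 0.636]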
§ However, even the sparse 𝑴 is large: with 10~20 links per node, 𝑴 has about 10N = 10¹⁰ non-zero entries
§ Suppose we use 4 bytes per entry: we still need about 40 GB to store 𝑴 – too large for main memory!

Store the graph on disk
§ We could store the matrix 𝑴 on disk in the following form:

  source | degree | destinations
     0   |   3    | 1, 4, 7
     1   |   4    | 3, 6, …
     2   |   2    | 0, 1
     3   |   2    | 7, 2
  (𝒓_old and 𝒓_new are vectors over nodes 0–7)

§ What if we cannot even fit 𝒓_new in memory?

Block update
§ Basic idea: break 𝒓_new into blocks!
§ Example: 𝒓_new over nodes 0–7 is split into Block 1 = {0, 1, 2, 3} and Block 2 = {4, 5, 6, 7}; the on-disk table of 𝑴 above is scanned to update whichever block is in memory
§ Each time, only load and update one block of 𝒓_new, then write it back to disk
§ Need to read 𝑴 and 𝒓_old k times (once per block)!

Block-Stripe update
§ Basic idea: divide 𝑴 into stripes! Each stripe only contains the destinations that fall in the corresponding block of 𝒓_new

  Stripe for Block 1 ({0, 1, 2, 3}):      Stripe for Block 2 ({4, 5, 6, 7}):
  source | degree | destinations          source | degree | destinations
     0   |   3    | 1                        0   |   3    | 4, 7
     1   |   4    | 3                        1   |   4    | 6
     2   |   2    | 0

§ Each stripe still stores the node’s total out-degree, so r_i / d_i can be computed; updating a block now only needs that block’s stripe (a short code sketch follows below)
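A minimal sketch of the block-stripe update loop (illustrative Python; the stripe contents follow the example above with the “…” entries omitted, and in-memory lists stand in for files on disk):

    # Each stripe holds (source, total out-degree, destinations within the block).
    # In practice the stripes and r_old live on disk; lists stand in for them here.
    beta, N = 0.8, 8
    blocks = [range(0, 4), range(4, 8)]           # Block 1 = {0..3}, Block 2 = {4..7}
    stripes = [                                   # from the example table above
        [(0, 3, [1]), (1, 4, [3]), (2, 2, [0])],  # stripe for Block 1
        [(0, 3, [4, 7]), (1, 4, [6])],            # stripe for Block 2
    ]

    r_old = [1.0 / N] * N                         # read once per block, i.e., k times
    r_new = [0.0] * N

    for b, stripe in enumerate(stripes):          # one block of r_new in memory at a time
        block = [(1 - beta) / N] * len(blocks[b])  # start from the teleport constant
        for src, degree, dests in stripe:
            for j in dests:
                block[j - blocks[b].start] += beta * r_old[src] / degree
        for j, val in zip(blocks[b], block):      # "write the block back to disk"
            r_new[j] = val

    print(r_new)  # one iteration of the block-stripe update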