程序代写代做代考 algorithm SocialNetworks2015-part1

SocialNetworks2015-part1

With slide contributions from
J. Leskovec, Anand Rajaraman, Jeffrey D. Ullman

Mining Social-Network Graphs

Analysis of Large Graphs:
Community Detection

Note to other teachers and users of these slides: We would be delighted if you found this our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org

http://www.mmds.org/

Networks & Communities

qWe often think of networks being organized into
modules, cluster, communities:

2

Goal: Find Densely Linked Clusters

3

Twitter & Facebook

qDiscovering social circles, circles of trust:

[McAuley, Leskovec: Discovering social circles in ego networks, 2012]
4

COMMUNITY DETECTION
(GRAPH BASICS)
How to find communities?

5

COMMUNITY DETECTION
(ALGORITHMS AND METHODS)
How to find communities?

6We will work with undirected (unweighted) networks

Today:

7

Recall: Methods of Clustering
qHierarchical:

Ø Agglomerative (bottom up):
• Initially, each point is a cluster
• Repeatedly combine the two

“nearest” clusters into one
• Used a distance metric

Ø Divisive (top down):
• Start with one cluster and recursively split it

qPoint assignment:
Ø Maintain a set of clusters
Ø Points belong to “nearest” cluster
Ø Used a distance metric

Betweenness Concept

qEdge betweenness: Number of
shortest paths passing over the edge

qIntuition:

8
Edge strengths (call volume)

in a real network
Edge betweenness
in a real network

b=16
b=7.5

Betweenness Concept (Cont’d)

qFind edges in a social network graph that are least
likely to be inside a community

qBetweenness of edge (a, b):
Ønumber of pairs of nodes x and y -> x, y ∈ ”
Øedge (a,b) lies on the shortest path between x and y

qIf there are several shortest paths between x and
y, edge (a,b) is credited with the fraction of those
shortest paths that include edge (a,b)

qA high score is bad: suggests that edge (a,b) runs
between two different communities
Øa and b are in different communities 9

The Russian Bridge

10

Betweenness Example

11

• Expect that edge (B,D) has highest betweenness
• (B,D) is on every shortest path from {A,B,C} to {D,E,F,G}
• Betweenness of (B,D) = 3×4 = 12
• (D,F) is on every shortest path from {A,B,C,D} to {F}
• Betweenness of (D,F) = 4×1 = 4
• Natural communities: {A,B,C} and {D,E,F,G}

WE NEED TO RESOLVE 2 QUESTIONS

1. How to compute betweenness?
2. How to select the number of clusters?

12

The Girvan-Newman Algorithm
q Want to discover communities using divisive hierarchical

clustering
Ø Start with one cluster (the social network) and recursively split it

q Will do this based on the notion of edge betweenness:
Number of shortest paths passing through the edge

q Girvan-Newman Algorithm:
Ø Visits each node X once
Ø Computes the number of shortest paths from X to each of the other

nodes that go through each of the edges

q Repeat:
Ø Calculate betweenness of edges

1. Thresholding to remove high betweeness edges, or
2. Remove edges with highest betweenness: between communities

q Connected components are communities
q Gives a hierarchical decomposition of the network

13

[Girvan-Newman ‘02]

Girvan-Newman Algorithm (1)
q Visit each node X once and compute the number of

shortest paths from X to each of the other nodes that go
through each of the edges

q 1) Perform a breadth-first search (BFS) of the graph,
starting at node X
Ø The level of each node in BFS is length of the shortest path from X

to that node
Ø So edges that go between nodes on the same level can never be

part of a shortest path from X
Ø Edges between levels are called DAG edges (DAG = Directed

Acyclic Graph)
Ø Each DAG edge is part of at least one shortest path from root X

14

15

Girvan-Newman Algorithm (2)
q2) Label each node by the number of shortest

paths that reach it from the root node
ØExample: BFS starting from node E, labels assigned

Girvan-Newman Algorithm (3)

q3) Calculate for each edge e, the sum over all nodes
Y (of the fraction) of the shortest paths from the root
X to Y that go through edge e
ØCompute this sum for nodes and edges, starting from the

bottom of the graph
ØEach node other than the root node is given a credit of 1
ØEach leaf node in the DAG gets a credit of 1
ØEach node that is not a leaf gets credit = 1 + sum of credits

of the DAG edges from that node to level below
ØA DAG edge e entering node Z (from the level above) is

given a share of the credit of Z proportional to the fraction
of shortest paths from the root to Z that go through e

16

Girvan-Newman Algorithm (4)

17

qAssign node and edge values starting from
bottom

Girvan-Newman Algorithm (5)

Assigning credits:
q A and C are leaves: get credit = 1
q Each of these nodes has only one parent, so their credit=1

is given to edges (B,A) and (B,C)
q At level 2, G is a leaf: gets credit = 1
q B gets credit 1 + credit of DAG edges entering from below

= 1 + 1 +1 = 3
q B has only one parent, so edge (D,B) gets entire credit of

node B = 3
q Node G has 2 parents (D and F): how do we divide credit

of G between the edges?
18

Girvan-Newman Algorithm (6)

q In this case, both D and F have just one shortest
path from E to each of those nodes
Ø So, give half credit of node G to each of those edges
Ø Credit = 1/(1 + 1) = 0.5

q In general, how we distribute credit of a node to its edges
depends on number of shortest paths
Ø Say there were 5 shortest paths to D and only 3 to F

Ø Then credit of edge (D,G) = 5/8 and credit of edge (F,G) = 3/8

q Node D gets credit = 1 + credits of edges below it =
1 + 3 + 0.5 = 4.5

q Node F gets credit = 1 + 0.5 = 1.5
q D has only one parent, so Edge (E,D) gets credit = 4.5 from D
q Likewise for F: Edge (E,F) gets credit = 1.5 from F 19

Girvan-Newman Algorithm (7): Completion
of Credit Calculation starting at node E

20

Girvan-Newman Algorithm (8):
Overall Betweenness Calculation

qTo complete betweenness calculation, must:
ØRepeat this for every node as root
ØSum the contributions on each edge
ØDivide by 2 to get true betweenness

• since every shortest path will be counted twice, once for
each of its endpoints

21

Using Betweenness to Find
Communities: Clustering

q Betweenness scores for edges of a graph behave
something like a distance metric
Ø Not a true distance metric

q Could cluster by taking edges in increasing order of
betweenness and adding to graph one at a time
Ø At each step, connected components of graph form clusters

q Girvan-Newman: Start with the graph and all its edges and
remove edges with highest betweenness
Ø Continue until graph has broken into suitable number of connected

components
Ø Divisive hierarchical clustering (top down)

• Start with one cluster (the social network) and recursively split it
22

Using Betweenness to Find
Communities (2)

q(B,D) has highest betweenness (12)
qRemoving edge would give natural communities

we identified earlier: {A,B,C} and {D,E,F,G}

23

Using Betweenness to Find
Communities (3): Thresholding

qCould continue to remove edges with highest
betweenness

24

Run Girvan-Newman Iteratively for
Community Detection

qRecall: Divisive hierarchical clustering based on the
notion of edge betweenness:

Number of shortest paths passing through the edge
qGirvan-Newman Algorithm:

» Undirected unweighted networks
ØRepeat until no edges are left:

• Calculate betweenness of edges
• This time: remove edges with highest betweenness

ØConnected components are communities
ØGives a hierarchical decomposition of the network

25

[Girvan-Newman ‘02]

Girvan-Newman: Example

26

Need to re-compute
betweenness at

every step

4933

121

Girvan-Newman: Example

27

Step 1: Step 2:

Step 3: Hierarchical network decomposition:

Recall: Twitter & Facebook

qDiscovering social circles, circles of trust:

[McAuley, Leskovec: Discovering social circles in ego networks, 2012]
28

Girvan-Newman: Results

29Communities in physics
collaborations

Girvan-Newman: Results

qZachary’s Karate club:
Hierarchical decomposition

30

WE NEED TO RESOLVE 2 QUESTIONS

1. How to compute betweenness?
2. How to select the number of clusters?

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,
http://www.mmds.org

31

Network Communities
q Communities: sets of

tightly connected nodes
q Define: Modularity !

Ø A measure of how well
a network is partitioned
into communities

Ø Given a partitioning of the
network into groups ÎΔ:
Q = ∑s∈ S [ (# edges within group s) –

(expected # edges within group s) ]

32

Need a null model!
The null model is a graph which matches one specific graph in some of its structural
features, but which is otherwise taken to be an instance of a random graph. The null model
is used as a term of comparison, to verify whether the graph in question displays some
feature, such as community structure, or not.

胡振东

胡振东

Null Model: Configuration Model
qGiven real ! on ” nodes and # edges,

construct rewired network !’
Ø Same degree distribution but

random connections
ØConsider !’ as a multigraph
ØThe expected number of edges between nodes
% and & of degrees ‘% and ‘& equals to: ‘% ⋅

‘&
)# =

‘%’&
)#

• The expected number of edges in (multigraph) G’:

– = ,)∑%∈/∑&∈/
‘%’&
)# =

,
) ⋅

,
)#∑%∈/ ‘% ∑&∈/ ‘& =

– = ,0#)# ⋅ )# = #

33

j

i

1
2∈3

42 = 26Note:

胡振东

胡振东

胡振东

Modularity

qModularity of partitioning S of graph G:
ØQ = ∑s∈ S [ (# edges within group s) –

(expected # edges within group s) ]

Ø” #, % = ‘()∑+∈%∑,∈+∑-∈+ .,- −
0,0-
()

qModularity values take range [−1,1]
Ø It is positive if the number of edges within

groups exceeds the expected number
Ø0.3-0.79. So, value <9= needs to be summed up >9 times.
But each edge (9, 🙂 has two endpoints so we need <9= +<:= for each edge (i, j) λ2 as optimization problem (cont’d) qWhat else do we know about x? Ø! is unit vector: ∑# !#$ = & Ø! is orthogonal to 1st eigenvector (&,… , &) thus: ∑# !# ⋅ & = ∑# !# = , qRemember: 54 å å -= Î 2 2 ),( 2 )( min ii jiEji x xx l All labelings of nodes - so that ∑./ = 0 We want to assign values !# to nodes i such that few edges cross 0. (we want xi and xj to subtract each other) i j ./ 0 x .1 Balance to minimize Ncut as an optimization problem 55 Recall: Example q Use standard math package to find all eigenvalues and eigenvectors Ø (Have not scaled eigenvectors to length 1, but could) q Second eigenvector has three positive and three negative components q Suggest obvious partitioning of {1,2,3} and {4,5,6} 56 So far… q How to define a “good” partition of a graph? Ø Minimize a given graph cut criterion q How to efficiently identify such a partition? Ø Approximate using information provided by the eigenvalues and eigenvectors of a graph q Spectral Clustering ØNaïve approache: • Split at 0 57 Spectral Clustering Algorithms qThree basic stages: Ø1) Pre-processing • Construct a matrix representation of the graph Ø2) Decomposition • Compute eigenvalues and eigenvectors of the matrix • Map each point to a lower-dimensional representation based on one or more eigenvectors Ø3) Grouping • Assign points to two or more clusters, based on the new representation 58 Spectral Partitioning Algorithm q1) Pre-processing: ØBuild Laplacian matrix L of the graph q2) Decomposition: Ø Find eigenvalues l and eigenvectors x of the matrix L ØMap vertices to corresponding components of l2 59 0.0-0.4-0.40.4-0.60.4 0.50.4-0.2-0.5-0.30.4 -0.50.40.60.1-0.30.4 0.5-0.40.60.10.30.4 0.00.4-0.40.40.60.4 -0.5-0.4-0.2-0.50.30.4 5.0 4.0 3.0 3.0 1.0 0.0 l= X = How do we now find the clusters? -0.66 -0.35 -0.34 0.33 0.62 0.31 1 2 3 4 5 6 1 3 -1 -1 0 -1 0 2 -1 2 -1 0 0 0 3 -1 -1 3 -1 0 0 4 0 0 -1 3 -1 -1 5 -1 0 0 -1 3 -1 6 0 0 0 -1 -1 2 Spectral Partitioning q3) Grouping: Ø Sort components of reduced 1-dimensional vector Ø Identify clusters by splitting the sorted vector in two qHow to choose a splitting point? Ø Naïve approaches: • Split at 0 or median value Ø More expensive approaches: • Attempt to minimize normalized cut in 1-dimension (sweep over ordering of nodes induced by the eigenvector) 60 -0.66 -0.35 -0.34 0.33 0.62 0.31 Split at 0: Cluster A: Positive points Cluster B: Negative points 0.33 0.62 0.31 -0.66 -0.35 -0.34 A B Example: Spectral Partitioning 61 Rank in x2 Va lu e of x 2 Example: Spectral Partitioning 62 Rank in x2 Va lu e of x 2 Components of x2 Example: Spectral partitioning 63 Components of x1 Components of x3 k-Way Spectral Clustering qHow do we partition a graph into k clusters? qTwo basic approaches: Ø Recursive bi-partitioning [Hagen et al., ’92] • Recursively apply bi-partitioning algorithm in a hierarchical divisive manner • Disadvantages: Inefficient, unstable Ø Cluster multiple eigenvectors [Shi-Malik, ’00] • Build a reduced space from multiple eigenvectors • Commonly used in recent papers • Multiple eigenvectors prevent instability due to information loss • A preferable approach… 64 DIRECT DISCOVERY OF COMMUNITIES: TRAWLING 65 With slide contributions from P. Desikan; http://www-users.cs.umn.edu/~desikan/ Web community qGroups of individuals who share common interests, together with the web pages most popular among them qWeb page collections with a shared topic 66 looking for Types of Communities qExplicitly- defined ØCommunities that manifest themselves as newsgroups or as resource collections on directories such as Yahoo! qImplicitly- defined ØCommunities that result from nature of content- creation of the web 67 Terms and Definitions (1) q Directed Bipartite Graph: A graph whose nodes set can be partitioned into two sets F and C, and every directed edge in the graph is from a node u in F to a node v in C 68 … … … F C Terms and Definitions (2) qCompleted Bipartite Graph: A bipartite graph that contains all possible edges between a vertex of F and a vertex of C qCore: A complete bipartite sub-graph with at least i nodes from F and at least j nodes from C Ø In the web world, the i pages the contains the links are referred to as ‘fans’ and the j pages that are referenced as ‘centers’ 69 Trawling qSearching for small communities in the Web graph qWhat is the signature of a community / discussion in a Web graph? [Kumar et al. ‘99] Dense 2-layer graphIntuition: Many people all talking about the same things … … … Use this to define “topics”: What the same people on the left talk about on the right Remember HITS! 71 Searching for Small Communities qA more well-defined problem: Enumerate complete bipartite subgraphs Ks,t ØWhere Ks,t : s nodes on the “left” where each links to the same t other nodes on the “right” K3,4 |X| = s = 3 |Y| = t = 4 X Y Fully connected 72 Frequent itemsets = complete bipartite graphs! qHow? ØView each node i as a set Si of nodes i points to ØKs,t = a set Y of size t (all items) that occurs in s (a basket) sets Si Ø Looking for Ks,tà set of frequency threshold to s and look at layer t – all frequent sets of size t From Itemsets to Bipartite Ks,t [Kumar et al. ‘99] i b c d a Si={a,b,c,d} j i k b c d a X Y s … minimum support (|X|=s) t … itemset size (|Y|=t) 74 From Itemsets to Bipartite Ks,t [Kumar et al. ‘99] i b c d a Si={a,b,c,d} x y z b c a X Y Find frequent itemsets: s … minimum support t … itemset size x b c a We found Ks,t! Ks,t = a set Y of size t that occurs in s sets Si View each node i as a set Si of nodes i points to Say we find a frequent itemset Y={a,b,c} of supp s So, there are s nodes that link to all of {a,b,c}: z a b c y b c a 75 Example qSupport threshold s=2 Ø {b,d}: support 3 Ø {e,f}: support 2 qAnd we just found 2 bipartite subgraphs: c a b d f Itemsets: a = {b,c,d} b = {d} c = {b,d,e,f} d = {e,f} e = {b,d} f = {} e c a b d e c d f e 76