Mining Social-Network Graphs
Analysis of Large Graphs: Community Detection
With slide contributions from
J. Leskovec, Anand Rajaraman, Jeffrey D. Ullman
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
Networks & Communities
qWe often think of networks being organized into modules, cluster, communities:
2
Goal: Find Densely Linked Clusters
3
Twitter & Facebook qDiscovering social circles, circles of trust:
[McAuley, Leskovec: Discovering social circles in ego networks, 2012]
4
COMMUNITY DETECTION (ALGORITHMS AND METHODS)
How to find communities?
We will work with undirected (unweighted) networks
5
Betweenness Concept
qEdge betweenness: Number of shortest paths passing over the edge
qIntuition: The Russian Bridge
b=16
b=7.5
6
Betweenness Concept (Cont’d)
qFind edges in a social network graph that are least likely to be inside a community
qBetweenness of edge (a, b):
Ønumber of pairs of nodes x and y -> x, y ∈ 𝐶
Øedge (a,b) lies on the shortest path between x and y
qIf there are several shortest paths between x and y, edge (a,b) is credited with the fraction of those shortest paths that include edge (a,b)
qA high score is bad: suggests that edge (a,b) runs between two different communities
Øa and b are in different communities
7
Betweenness Example
• Expect that edge (B,D) has highest betweenness
• (B,D) is on every shortest path from {A,B,C} to {D,E,F,G}
• Betweenness of (B,D) = 3×4 = 12
• (D,F) is on every shortest path from {A,B,C,D} to {F}
• Betweenness of (D,F) = 4×1 = 4
• Natural communities: {A,B,C} and {D,E,F,G}
8
WE NEED TO RESOLVE 2 QUESTIONS
1. How to compute betweenness?
2. How to select the number of clusters?
9
The Girvan-Newman Algorithm
q Want to discover communities using divisive hierarchical clustering
Ø Start with one cluster (the social network) and recursively split it q Will do this based on the notion of edge betweenness:
Number of shortest paths passing through the edge
q Girvan-Newman Algorithm:
Ø Visits each node X once
Ø Computes the number of shortest paths from X to each of the other nodes that go through each of the edges
q Repeat:
Ø Calculate betweenness of edges
1. Thresholding to remove high betweeness edges, or
2. Remove edges with highest betweenness: between communities
q Connected components are communities
q Gives a hierarchical decomposition of the network
10
[Girvan-Newman ‘02]
Girvan-Newman Algorithm (1)
qVisit each node X once and compute the number of shortest paths from X to each of the other nodes that go through each of the edges
q1) Perform a breadth-first search (BFS) of the graph, starting at node X
Ø The level of each node in BFS is length of the shortest path from X to that node
Ø So edges that go between nodes on the same level can never be part of a shortest path from X
Ø Edges between levels are called DAG edges (DAG = Directed Acyclic Graph)
Ø Each DAG edge is part of at least one shortest path from root X 11
Girvan-Newman Algorithm (2)
q2) Label each node by the number of shortest paths that reach it from the root node
ØExample: BFS starting from node E, labels assigned
12
Girvan-Newman Algorithm (3)
qAssign node and edge values starting from bottom
13
Girvan-Newman Algorithm (4)
Assigning credits:
qA and C are leaves: get credit = 1
qEach of these nodes has only one parent, so their credit=1
is given to edges (B,A) and (B,C)
qAt level 2, G is a leaf: gets credit = 1
qB gets credit 1 + credit of DAG edges entering from below = 1 + 1 +1 = 3
qB has only one parent, so edge (D,B) gets entire credit of node B = 3
qNode G has 2 parents (D and F): how do we divide credit of G between the edges?
14
Girvan-Newman Algorithm (5)
q3) Calculate for each edge e, the sum over all nodes Y (of the fraction) of the shortest paths from the root X to Y that go through edge e
ØCompute this sum for nodes and edges, starting from the bottom of the graph
ØEach node other than the root node is given a credit of 1
ØEach leaf node in the DAG gets a credit of 1
ØEach node that is not a leaf gets credit = 1 + sum of credits of the DAG edges from that node to level below
ØA DAG edge e entering node Z (from the level above) is given a share of the credit of Z proportional to the fraction of shortest paths from the root to Z that go through e15
Girvan-Newman Algorithm (6)
qIn this case, both D and F have just one shortest path from E to each of those nodes
Ø So, give half credit of node G to each of those edges ØCredit = 1/(1 + 1) = 0.5
qIn general, how we distribute credit of a node to its edges depends on number of shortest paths
Ø Say there were 5 shortest paths to D and only 3 to F
Ø Then credit of edge (D,G) = 5/8 and credit of edge (F,G) = 3/8
qNode D gets credit = 1 + credits of edges below it = 1 + 3 + 0.5 = 4.5
qNode F gets credit = 1 + 0.5 = 1.5
qD has only one parent, so Edge (E,D) gets credit = 4.5 from D
qLikewise for F: Edge (E,F) gets credit = 1.5 from F
16
Girvan-Newman Algorithm (7): Completion of Credit Calculation starting at node E
17
Girvan-Newman Algorithm (8): Overall Betweenness Calculation
qTo complete betweenness calculation, must: ØRepeat this for every node as root
ØSum the contributions on each edge
ØDivide by 2 to get true betweenness
• since every shortest path will be counted twice, once for each of its endpoints
18
Using Betweenness to Find Communities: Clustering
qBetweenness scores for edges of a graph behave something like a distance metric
Ø Not a true distance metric
qCould cluster by taking edges in increasing order of betweenness and adding to graph one at a time
Ø At each step, connected components of graph form clusters
qGirvan-Newman: Start with the graph and all its edges and remove edges with highest betweenness
Ø Continue until graph has broken into suitable number of connected components
Ø Divisive hierarchical clustering (top down)
• Start with one cluster (the social network) and recursively split it 19
Using Betweenness to Find Communities (2)
q(B,D) has highest betweenness (12)
qRemoving edge would give natural communities we identified earlier: {A,B,C} and {D,E,F,G}
20
Using Betweenness to Find Communities (3): Thresholding
qCould continue to remove edges with highest betweenness
21
[Girvan-Newman ‘02]
Run Girvan-Newman Iteratively for Community Detection
qRecall: Divisive hierarchical clustering based on the notion of edge betweenness:
Number of shortest paths passing through the edge
qGirvan-Newman Algorithm:
» Undirected unweighted networks
ØRepeat until no edges are left:
• Calculate betweenness of edges
• This time: remove edges with highest betweenness
ØConnected components are communities
ØGives a hierarchical decomposition of the network
22
Girvan-Newman: Example
1
12
33
49
Need to re-compute betweenness at every step
23
Girvan-Newman: Example
Step 1: Step 2:
Step 3:
Hierarchical network decomposition:
24
Girvan-Newman: Results
25
Communities in physics collaborations
Girvan-Newman: Results
qZachary’s Karate club: Hierarchical decomposition
26
WE NEED TO RESOLVE 2 QUESTIONS
1. How to compute betweenness?
2. How to select the number of clusters?
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 27 http://www.mmds.org
Network Communities
qCommunities: sets of tightly connected nodes
qDefine: Modularity 𝑸
Ø A measure of how well a network is partitioned into communities
Ø Given a partitioning of the network into groups ÎÎ𝑺:
Q = ∑s∈S [(#edgeswithingroups)– (expected # edges within group s) ]
Need a null model!
The null model is a graph which matches one specific graph in some of its structural features, but which is otherwise taken to be an instance of a random graph. The null model is used as a term of comparison, to verify whether the graph in question displays some feature, such as community structure, or not.
28
Null Model: Configuration Model
qGiven real 𝑮 on 𝒏 nodes and 𝒎 edges, construct rewired network 𝑮’
ØSame degree distribution but random connections
i
ØConsider 𝑮’ as a multigraph
ØThe expected number of edges between nodes
j
𝒊and𝒋ofdegrees𝒌𝒊 and𝒌𝒋 equalsto:𝒌𝒊 ⋅ 𝒌𝒋 =𝒌𝒊𝒌𝒋 𝟐𝒎 𝟐𝒎
• The expected number of edges in (multigraph) G’: –=𝟏∑ ∑ 𝒌𝒊𝒌𝒋=𝟏⋅𝟏∑ 𝒌∑ 𝒌=
𝟐 𝒊∈𝑵 𝒋∈𝑵𝟐𝒎
𝟐 𝟐𝒎 𝒊∈𝑵 𝒊
𝒋∈𝑵 𝒋
– = 𝟏 𝟐𝒎⋅𝟐𝒎=𝒎 𝟒𝒎
Note:
! 𝑘! = 2𝑚
!∈#
29
Modularity
qModularity of partitioning S of graph G: ØQ=∑s∈S [(#edgeswithingroups)–
(expected # edges within group s) ] Ø𝑸 𝑮,𝑺 = 𝟏 ∑𝒔∈𝑺∑𝒊∈𝒔∑𝒋∈𝒔 𝑨𝒊𝒋 −𝒌𝒊𝒌𝒋
𝟐𝒎 𝟐𝒎
Normalizing cost.: -1