Social network analysis: community detection
Donglei Du
(ddu@unb.edu)
Faculty of Business Administration, University of New Brunswick, NB Canada Fredericton
E3B 9Y2
October 20, 2016
Donglei Du (UNB) AlgoTrading October 20, 2016 1 / 35
Table of contents
1 An introduction
2 Community-detection algorithms
The Girvan-Newman betweeness method for graph partition
[Newman and Girvan, 2004]
The spectral modularity maximization community detection algorithm
Donglei Du (UNB) AlgoTrading October 20, 2016 2 / 35
Section 1
An introduction
Donglei Du (UNB) AlgoTrading October 20, 2016 3 / 35
Introduction
Community structure: a cohesive group of nodes that are connected
“more densely” to each other than to the nodes in other communities.
Community structures are quite common in real networks.
Being able to identify these sub-structures within a network can
provide insight into how network function and topology affect each
other.
Finding communities within an arbitrary network can be a
computationally difficult task (it is related to clustering in data
mining or machine learning with a key difference though).
Despite these difficulties, however, several methods for community
finding have been developed and employed with varying levels of
success.
The evaluation of algorithms, to detect which are better at detecting
community structure, is still an open question.
Donglei Du (UNB) AlgoTrading October 20, 2016 4 / 35
Quantify modularity of a community
[Newman and Girvan, 2004] I
Modularity is the fraction of the edges that fall within the given
groups minus the expected such fraction if edges were distributed at
random.
Formally, given a particular division of a network into ` communities
{c1, . . . , c`}, the modularity of this division is
Q =
1
2m ∑ij
[
Aij −
kik j
2m
]
δ(ci, cj) =
1
2m ∑ij
Bijδ(ci, cj) , (1)
where
m is the number of edges;
A is the adjacency matrix;
ki is the degree of node i;
ci is the type (community);
Donglei Du (UNB) AlgoTrading October 20, 2016 5 / 35
Quantify modularity of a community
[Newman and Girvan, 2004] II
δ is the Kronecker delta such that δ(x, y) = 1 if x = y and 0 otherwise;
the modularity matrix B has elements
Bij = Aij −
kik j
2m
.
Note that
∑
j
Bij = 0, ∀i
Q is strictly less than 1, takes positive values if there are more edges
between vertices of the same type than we would expect by chance,
and negative ones if there are less.
Donglei Du (UNB) AlgoTrading October 20, 2016 6 / 35
Why (1)? I
The modularity (1) is the difference between two terms:
The fraction of number of edges of the same type for the given division
is equal to
1
m ∑i,j
δ(ci, cj) =
1
2m ∑ij
Aijδ(ci, cj)
The total number of edges of the same type for a random network is
equal to
1
2m ∑ij
kik j
2m
δ(ci, cj)
Consider a particular edge attached to vertex i with degree ki.
There are by definition 2m ends of edges in the entire network, and the
chances that the other end of our particular edge is one of the kj ends
attached to vertex j is thus kj/2m if connections are made purely at
random.
Donglei Du (UNB) AlgoTrading October 20, 2016 7 / 35
Why (1)? II
Counting all ki edges attached to i, the total expected number of edges
between vertices i and j is then kikj/2m, leading to the desired
expected number of edges between all pairs of vertices of the same
type.
Donglei Du (UNB) AlgoTrading October 20, 2016 8 / 35
Modularity: an example
##
## Attaching package: ’igraph’
## The following objects are masked from ’package:stats’:
##
## decompose, spectrum
## The following object is masked from ’package:base’:
##
## union
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Donglei Du (UNB) AlgoTrading October 20, 2016 9 / 35
Modularity: an example
## [1] 0.5757576
## [1] 0.5757576
1
2
3 4
5
6
7
89
10
11
12
13
14
15
Donglei Du (UNB) AlgoTrading October 20, 2016 10 / 35
Section 2
Community-detection algorithms
Donglei Du (UNB) AlgoTrading October 20, 2016 11 / 35
Community-detection algorithms I
Edge-betweeness: The Girvan-Newman method
Divisive method: starts with the full graph and breaks it up to find
communities.
Too slow for many large networks (unless they are very sparse), and it
tends to give relatively poor results for dense networks.
Some other methods (See igraph manual)
fastgreedy.community: modularity optimization algorithm for finding
community structure
label.propagation.community: near linear time algorithm
leading.eigenvector.community: calculating the eigenvector of the
modularity matrix for the largest positive eigenvalue and then
separating vertices into two community based on the sign of the
corresponding element in the eigenvector.
multilevel.community: modularity measure and a hierarchial approach
optimal.community: Graphs with up to fifty vertices should be fine,
graphs with a couple of hundred vertices might be possible.
spinglass.community: spin-glass model and simulated annealing.
Donglei Du (UNB) AlgoTrading October 20, 2016 12 / 35
Community-detection algorithms II
walktrap.community: random walks
Donglei Du (UNB) AlgoTrading October 20, 2016 13 / 35
Subsection 1
The Girvan-Newman betweeness method for graph partition
[Newman and Girvan, 2004]
Donglei Du (UNB) AlgoTrading October 20, 2016 14 / 35
Edge Betweeness
1
2
3
4
5
6
The edge betweenness for edge e:
∑
s,t 6=v
σst(e)
σst
,
where
σst is total number of shortest paths from node
s to node t and σst(e) is the number of those
paths that pass through e.
The summation is n(n− 1) pairs for directed
graphs and n(n− 1)/2 for undirected graphs.
The edge betweenness for the graph on the left:
Edge Betwenness
12 2
15 3
23 3.5
25 2.5
34 3.5
45 5.5
46 5
Donglei Du (UNB) AlgoTrading October 20, 2016 15 / 35
How to find the edge betweeness in the example?
For example: for edge 23, the n(n− 1)/2 = 6(6− 1)/2 = 15 terms in the
summation in the order of 12, 13, 14, 15, 16, 23, 24, 25, 26, 34, 35, 36, 45,
46, 56 are
0
1
+
1
1
+
0
1
+
0
1
+
0
1
+
1
1
+
1
2
+
0
1
+
1
2
+
0
1
+
1
2
+
0
1
+
0
1
+
0
1
+
0
1
.
Here the denominators are the number of shortest paths between pair of edges
in the above order and the numerators are the number of shortest paths
passing through edge 23 between pair of edges in the above order.
1 2 3 4 5 6
1 0/1 1/1 0/1 0/1 0/1
2 1/1 1/2 0/1 1/2
3 0/1 1/2 0/1
4 0/1 0/1
5 0/1
6
Donglei Du (UNB) AlgoTrading October 20, 2016 16 / 35
Edge betweenness based community structure detection
The idea is that it is likely that edges connecting separate modules
have high edge betweenness as all the shortest paths from one module
to another must traverse through them.
So if we gradually remove the edge with the highest edge betweenness
score we will get a hierarchical map, a rooted tree, called a
dendrogram of the graph.
The leafs of the tree are the individual vertices
The root of the tree represents the whole graph
Donglei Du (UNB) AlgoTrading October 20, 2016 17 / 35
The Girvan-Newman method for graph partition
[Newman and Girvan, 2004]
Step 1. Find the edge of highest betweenness – or multiple edges of
highest betweenness, if there is a tie – and remove these
edges from the graph.
This may cause the graph to separate into multiple
components. If so, this is the first level of regions in the
partitioning of the graph.
Step 2. Recalculate all betweennesses, and again remove the edge or
edges of highest betweenness.
This may break some of the existing components into
smaller components; if so, these are regions nested
within the larger regions.
Step 3 Proceed in this way as long as edges remain in graph, in
each step recalculating all betweennesses and removing the
edge or edges of highest betweenness.
Donglei Du (UNB) AlgoTrading October 20, 2016 18 / 35
The Girvan-Newman method for graph partition
[Newman and Girvan, 2004]
The method gives us only a succession of splits of the network into
smaller and smaller communities, but it gives no indication of which
splits are best.
One way to find the best split is via the the modularity concept (1).
Donglei Du (UNB) AlgoTrading October 20, 2016 19 / 35
An example via package igraph: Illustration of the
Girvan-Newman method
1
2
3
4
5
6
## [1] 2.0 3.0 3.5 2.5 3.5 5.5 5.0
Donglei Du (UNB) AlgoTrading October 20, 2016 20 / 35
Delete edge (5,4)
## [1] 4 1 9 4 8 5
1
2
3
4
5
6
Donglei Du (UNB) AlgoTrading October 20, 2016 21 / 35
Delete edge (3,2)
## [1] 1 1 1 2 2
1
2
3
4
5
6
Donglei Du (UNB) AlgoTrading October 20, 2016 22 / 35
Delete edge (4,3) and (6,4)
## [1] 1 1 1
1
2
3
4
5
6
Donglei Du (UNB) AlgoTrading October 20, 2016 23 / 35
Delete edge (2,1), (5,1) and (5,2)
## numeric(0)
1
2
3
4
5
6
Donglei Du (UNB) AlgoTrading October 20, 2016 24 / 35
An example via package igraph: the
edge.betweenness.community() function
1
2
3
4
5
6
Donglei Du (UNB) AlgoTrading October 20, 2016 25 / 35
Find community via igraph function
edge.betweenness.community()
## IGRAPH clustering edge betweenness, groups: 2, mod: 0.2
## + groups:
## $`1`
## [1] 1 2 5
##
## $`2`
## [1] 3 4 6
##
## Community sizes
## 1 2
## 3 3
## [1] 0.2040816
1
2
3
4
5
6
Donglei Du (UNB) AlgoTrading October 20, 2016 26 / 35
Plot dendrogram for hierarchical methods via igraph
function dendPlot()
6 4 3 5 2 1
1
2
3
4
5
Donglei Du (UNB) AlgoTrading October 20, 2016 27 / 35
Subsection 2
The spectral modularity maximization community detection
algorithm
Donglei Du (UNB) AlgoTrading October 20, 2016 28 / 35
The spectral modularity maximization algorithm
[Newman, 2006a, Newman, 2006b]: division of a network
into just two parts
Recall (1)
Q =
1
2m ∑ij
[
Aij −
kik j
2m
]
δ(ci, cj) =
1
2m ∑ij
Bijδ(ci, cj)
Introduce a decision variable
si =
{
1, if node i belongs to group 1
−1, if node i belongs to group 2
Note that
δ(ci, cj) =
1
2
(1 + sisj)
We now have, noting that ∑j Bij = 0, ∀i for the modularity matrix:
Q =
1
4m ∑ij
Bijsisj =
1
4m
sTBs
Donglei Du (UNB) AlgoTrading October 20, 2016 29 / 35
The spectral modularity maximization problem
Now our problems becomes
max
[
1
4m
sTBs : s ∈ {−1, 1}n
]
.
This problem in general is NP-hard.
[Newman, 2006a, Newman, 2006b] propose a heuristic, called the
leading.eigenvector algorithm.
Donglei Du (UNB) AlgoTrading October 20, 2016 30 / 35
How are about unknown number of communities?
Simulated annealing: slow and only works for a few hundreds of nodes
Genetic algorithm: slow and only works for a few hundreds of nodes
Greedy algorithm: fast and can work for millions of nodes
. . .
Donglei Du (UNB) AlgoTrading October 20, 2016 31 / 35
Approximation algorithms for modularity maximization
[Agarwal and Kempe, 2008, Dinh and Thai, 2013]
More work to be done on approximation algorithms…
Donglei Du (UNB) AlgoTrading October 20, 2016 32 / 35
An example using package igraph function:
leading.eigenvector.community()
1
2
3
4
5
6
7
89
10
11
12
13
14
15
Donglei Du (UNB) AlgoTrading October 20, 2016 33 / 35
References I
Agarwal, G. and Kempe, D. (2008).
Modularity-maximizing graph communities via mathematical
programming.
The European Physical Journal B, 66(3):409–418.
Dinh, T. N. and Thai, M. T. (2013).
Community detection in scale-free networks: Approximation
algorithms for maximizing modularity.
Selected Areas in Communications, IEEE Journal on, 31(6):997–1006.
Newman, M. E. (2006a).
Finding community structure in networks using the eigenvectors of
matrices.
Physical review E, 74(3):036104.
Donglei Du (UNB) AlgoTrading October 20, 2016 34 / 35
References II
Newman, M. E. (2006b).
Modularity and community structure in networks.
Proceedings of the National Academy of Sciences, 103(23):8577–8582.
Newman, M. E. and Girvan, M. (2004).
Finding and evaluating community structure in networks.
Physical review E, 69(2):026113.
Donglei Du (UNB) AlgoTrading October 20, 2016 35 / 35
An introduction
Community-detection algorithms
The Girvan-Newman betweeness method for graph partition newman2004finding
The spectral modularity maximization community detection algorithm