SocialNetworks2015-part1
With slide contributions from
J. Leskovec, Anand Rajaraman, Jeffrey D. Ullman
Mining Social-Network Graphs
Analysis of Large Graphs:
Community Detection
Note to other teachers and users of these slides: We would be delighted if you found this our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
http://www.mmds.org/
Networks & Communities
qWe often think of networks being organized into
modules, cluster, communities:
2
Goal: Find Densely Linked Clusters
3
Twitter & Facebook
qDiscovering social circles, circles of trust:
[McAuley, Leskovec: Discovering social circles in ego networks, 2012]
4
COMMUNITY DETECTION
(GRAPH BASICS)
How to find communities?
5
COMMUNITY DETECTION
(ALGORITHMS AND METHODS)
How to find communities?
6We will work with undirected (unweighted) networks
Today:
7
Recall: Methods of Clustering
qHierarchical:
Ø Agglomerative (bottom up):
• Initially, each point is a cluster
• Repeatedly combine the two
“nearest” clusters into one
• Used a distance metric
Ø Divisive (top down):
• Start with one cluster and recursively split it
qPoint assignment:
Ø Maintain a set of clusters
Ø Points belong to “nearest” cluster
Ø Used a distance metric
Betweenness Concept
qEdge betweenness: Number of
shortest paths passing over the edge
qIntuition:
8
Edge strengths (call volume)
in a real network
Edge betweenness
in a real network
b=16
b=7.5
Betweenness Concept (Cont’d)
qFind edges in a social network graph that are least
likely to be inside a community
qBetweenness of edge (a, b):
Ønumber of pairs of nodes x and y -> x, y ∈ ”
Øedge (a,b) lies on the shortest path between x and y
qIf there are several shortest paths between x and
y, edge (a,b) is credited with the fraction of those
shortest paths that include edge (a,b)
qA high score is bad: suggests that edge (a,b) runs
between two different communities
Øa and b are in different communities 9
The Russian Bridge
10
Betweenness Example
11
• Expect that edge (B,D) has highest betweenness
• (B,D) is on every shortest path from {A,B,C} to {D,E,F,G}
• Betweenness of (B,D) = 3×4 = 12
• (D,F) is on every shortest path from {A,B,C,D} to {F}
• Betweenness of (D,F) = 4×1 = 4
• Natural communities: {A,B,C} and {D,E,F,G}
WE NEED TO RESOLVE 2 QUESTIONS
1. How to compute betweenness?
2. How to select the number of clusters?
12
The Girvan-Newman Algorithm
q Want to discover communities using divisive hierarchical
clustering
Ø Start with one cluster (the social network) and recursively split it
q Will do this based on the notion of edge betweenness:
Number of shortest paths passing through the edge
q Girvan-Newman Algorithm:
Ø Visits each node X once
Ø Computes the number of shortest paths from X to each of the other
nodes that go through each of the edges
q Repeat:
Ø Calculate betweenness of edges
1. Thresholding to remove high betweeness edges, or
2. Remove edges with highest betweenness: between communities
q Connected components are communities
q Gives a hierarchical decomposition of the network
13
[Girvan-Newman ‘02]
Girvan-Newman Algorithm (1)
q Visit each node X once and compute the number of
shortest paths from X to each of the other nodes that go
through each of the edges
q 1) Perform a breadth-first search (BFS) of the graph,
starting at node X
Ø The level of each node in BFS is length of the shortest path from X
to that node
Ø So edges that go between nodes on the same level can never be
part of a shortest path from X
Ø Edges between levels are called DAG edges (DAG = Directed
Acyclic Graph)
Ø Each DAG edge is part of at least one shortest path from root X
14
15
Girvan-Newman Algorithm (2)
q2) Label each node by the number of shortest
paths that reach it from the root node
ØExample: BFS starting from node E, labels assigned
Girvan-Newman Algorithm (3)
q3) Calculate for each edge e, the sum over all nodes
Y (of the fraction) of the shortest paths from the root
X to Y that go through edge e
ØCompute this sum for nodes and edges, starting from the
bottom of the graph
ØEach node other than the root node is given a credit of 1
ØEach leaf node in the DAG gets a credit of 1
ØEach node that is not a leaf gets credit = 1 + sum of credits
of the DAG edges from that node to level below
ØA DAG edge e entering node Z (from the level above) is
given a share of the credit of Z proportional to the fraction
of shortest paths from the root to Z that go through e
16
Girvan-Newman Algorithm (4)
17
qAssign node and edge values starting from
bottom
Girvan-Newman Algorithm (5)
Assigning credits:
q A and C are leaves: get credit = 1
q Each of these nodes has only one parent, so their credit=1
is given to edges (B,A) and (B,C)
q At level 2, G is a leaf: gets credit = 1
q B gets credit 1 + credit of DAG edges entering from below
= 1 + 1 +1 = 3
q B has only one parent, so edge (D,B) gets entire credit of
node B = 3
q Node G has 2 parents (D and F): how do we divide credit
of G between the edges?
18
Girvan-Newman Algorithm (6)
q In this case, both D and F have just one shortest
path from E to each of those nodes
Ø So, give half credit of node G to each of those edges
Ø Credit = 1/(1 + 1) = 0.5
q In general, how we distribute credit of a node to its edges
depends on number of shortest paths
Ø Say there were 5 shortest paths to D and only 3 to F
Ø Then credit of edge (D,G) = 5/8 and credit of edge (F,G) = 3/8
q Node D gets credit = 1 + credits of edges below it =
1 + 3 + 0.5 = 4.5
q Node F gets credit = 1 + 0.5 = 1.5
q D has only one parent, so Edge (E,D) gets credit = 4.5 from D
q Likewise for F: Edge (E,F) gets credit = 1.5 from F 19
Girvan-Newman Algorithm (7): Completion
of Credit Calculation starting at node E
20
Girvan-Newman Algorithm (8):
Overall Betweenness Calculation
qTo complete betweenness calculation, must:
ØRepeat this for every node as root
ØSum the contributions on each edge
ØDivide by 2 to get true betweenness
• since every shortest path will be counted twice, once for
each of its endpoints
21
Using Betweenness to Find
Communities: Clustering
q Betweenness scores for edges of a graph behave
something like a distance metric
Ø Not a true distance metric
q Could cluster by taking edges in increasing order of
betweenness and adding to graph one at a time
Ø At each step, connected components of graph form clusters
q Girvan-Newman: Start with the graph and all its edges and
remove edges with highest betweenness
Ø Continue until graph has broken into suitable number of connected
components
Ø Divisive hierarchical clustering (top down)
• Start with one cluster (the social network) and recursively split it
22
Using Betweenness to Find
Communities (2)
q(B,D) has highest betweenness (12)
qRemoving edge would give natural communities
we identified earlier: {A,B,C} and {D,E,F,G}
23
Using Betweenness to Find
Communities (3): Thresholding
qCould continue to remove edges with highest
betweenness
24
Run Girvan-Newman Iteratively for
Community Detection
qRecall: Divisive hierarchical clustering based on the
notion of edge betweenness:
Number of shortest paths passing through the edge
qGirvan-Newman Algorithm:
» Undirected unweighted networks
ØRepeat until no edges are left:
• Calculate betweenness of edges
• This time: remove edges with highest betweenness
ØConnected components are communities
ØGives a hierarchical decomposition of the network
25
[Girvan-Newman ‘02]
Girvan-Newman: Example
26
Need to re-compute
betweenness at
every step
4933
121
Girvan-Newman: Example
27
Step 1: Step 2:
Step 3: Hierarchical network decomposition:
Recall: Twitter & Facebook
qDiscovering social circles, circles of trust:
[McAuley, Leskovec: Discovering social circles in ego networks, 2012]
28
Girvan-Newman: Results
29Communities in physics
collaborations
Girvan-Newman: Results
qZachary’s Karate club:
Hierarchical decomposition
30
WE NEED TO RESOLVE 2 QUESTIONS
1. How to compute betweenness?
2. How to select the number of clusters?
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,
http://www.mmds.org
31
Network Communities
q Communities: sets of
tightly connected nodes
q Define: Modularity !
Ø A measure of how well
a network is partitioned
into communities
Ø Given a partitioning of the
network into groups ÎΔ:
Q = ∑s∈ S [ (# edges within group s) –
(expected # edges within group s) ]
32
Need a null model!
The null model is a graph which matches one specific graph in some of its structural
features, but which is otherwise taken to be an instance of a random graph. The null model
is used as a term of comparison, to verify whether the graph in question displays some
feature, such as community structure, or not.
胡振东
胡振东
Null Model: Configuration Model
qGiven real ! on ” nodes and # edges,
construct rewired network !’
Ø Same degree distribution but
random connections
ØConsider !’ as a multigraph
ØThe expected number of edges between nodes
% and & of degrees ‘% and ‘& equals to: ‘% ⋅
‘&
)# =
‘%’&
)#
• The expected number of edges in (multigraph) G’:
– = ,)∑%∈/∑&∈/
‘%’&
)# =
,
) ⋅
,
)#∑%∈/ ‘% ∑&∈/ ‘& =
– = ,0#)# ⋅ )# = #
33
j
i
1
2∈3
42 = 26Note:
胡振东
胡振东
胡振东
Modularity
qModularity of partitioning S of graph G:
ØQ = ∑s∈ S [ (# edges within group s) –
(expected # edges within group s) ]
Ø” #, % = ‘()∑+∈%∑,∈+∑-∈+ .,- −
0,0-
()
qModularity values take range [−1,1]
Ø It is positive if the number of edges within
groups exceeds the expected number
Ø0.3-0.79. So, value <9= needs to be summed up >9 times.
But each edge (9, 🙂 has two endpoints so we need <9= +<:=
for each edge (i, j)
λ2 as optimization problem (cont’d)
qWhat else do we know about x?
Ø! is unit vector: ∑# !#$ = &
Ø! is orthogonal to 1st eigenvector (&,… , &) thus:
∑# !# ⋅ & = ∑# !# = ,
qRemember:
54
å
å -= Î 2
2
),(
2
)(
min
ii
jiEji
x
xx
l
All labelings
of nodes - so
that ∑./ = 0
We want to assign values !# to nodes i such
that few edges cross 0.
(we want xi and xj to subtract each other)
i j
./ 0
x
.1
Balance to minimize
Ncut as an optimization problem
55
Recall: Example
q Use standard math package to find all eigenvalues and
eigenvectors
Ø (Have not scaled eigenvectors to length 1, but could)
q Second eigenvector has three positive and three negative
components
q Suggest obvious partitioning of {1,2,3} and {4,5,6} 56
So far…
q How to define a “good” partition of a graph?
Ø Minimize a given graph cut criterion
q How to efficiently identify such a partition?
Ø Approximate using information provided by the
eigenvalues and eigenvectors of a graph
q Spectral Clustering
ØNaïve approache:
• Split at 0
57
Spectral Clustering Algorithms
qThree basic stages:
Ø1) Pre-processing
• Construct a matrix representation of the graph
Ø2) Decomposition
• Compute eigenvalues and eigenvectors of the matrix
• Map each point to a lower-dimensional representation
based on one or more eigenvectors
Ø3) Grouping
• Assign points to two or more clusters, based on the new
representation
58
Spectral Partitioning Algorithm
q1) Pre-processing:
ØBuild Laplacian
matrix L of the
graph
q2) Decomposition:
Ø Find eigenvalues l
and eigenvectors x
of the matrix L
ØMap vertices to
corresponding
components of l2 59
0.0-0.4-0.40.4-0.60.4
0.50.4-0.2-0.5-0.30.4
-0.50.40.60.1-0.30.4
0.5-0.40.60.10.30.4
0.00.4-0.40.40.60.4
-0.5-0.4-0.2-0.50.30.4
5.0
4.0
3.0
3.0
1.0
0.0
l= X
=
How do we now
find the clusters?
-0.66
-0.35
-0.34
0.33
0.62
0.31
1 2 3 4 5 6
1 3 -1 -1 0 -1 0
2 -1 2 -1 0 0 0
3 -1 -1 3 -1 0 0
4 0 0 -1 3 -1 -1
5 -1 0 0 -1 3 -1
6 0 0 0 -1 -1 2
Spectral Partitioning
q3) Grouping:
Ø Sort components of reduced 1-dimensional vector
Ø Identify clusters by splitting the sorted vector in two
qHow to choose a splitting point?
Ø Naïve approaches:
• Split at 0 or median value
Ø More expensive approaches:
• Attempt to minimize normalized cut in 1-dimension
(sweep over ordering of nodes induced by the eigenvector)
60
-0.66
-0.35
-0.34
0.33
0.62
0.31
Split at 0:
Cluster A: Positive points
Cluster B: Negative points
0.33
0.62
0.31
-0.66
-0.35
-0.34
A B
Example: Spectral Partitioning
61
Rank in x2
Va
lu
e
of
x
2
Example: Spectral Partitioning
62
Rank in x2
Va
lu
e
of
x
2
Components of x2
Example: Spectral partitioning
63
Components of x1
Components of x3
k-Way Spectral Clustering
qHow do we partition a graph into k clusters?
qTwo basic approaches:
Ø Recursive bi-partitioning [Hagen et al., ’92]
• Recursively apply bi-partitioning algorithm in a hierarchical
divisive manner
• Disadvantages: Inefficient, unstable
Ø Cluster multiple eigenvectors [Shi-Malik, ’00]
• Build a reduced space from multiple eigenvectors
• Commonly used in recent papers
• Multiple eigenvectors prevent instability due to information
loss
• A preferable approach…
64
DIRECT DISCOVERY OF
COMMUNITIES: TRAWLING
65
With slide contributions from P. Desikan; http://www-users.cs.umn.edu/~desikan/
Web community
qGroups of individuals who share common
interests, together with the web pages most
popular among them
qWeb page collections with a shared topic
66
looking for
Types of Communities
qExplicitly- defined
ØCommunities that manifest themselves as newsgroups
or as resource collections on directories such as Yahoo!
qImplicitly- defined
ØCommunities that result from nature of content-
creation of the web
67
Terms and Definitions (1)
q Directed Bipartite Graph: A graph whose nodes
set can be partitioned into two sets F and C, and
every directed edge in the graph is from a node u
in F to a node v in C
68
… …
…
F C
Terms and Definitions (2)
qCompleted Bipartite Graph: A bipartite graph that
contains all possible edges between a vertex of F
and a vertex of C
qCore: A complete bipartite sub-graph with at least
i nodes from F and at least j nodes from C
Ø In the web world, the i pages the contains the links are
referred to as ‘fans’ and the j pages that are
referenced as ‘centers’
69
Trawling
qSearching for small communities in
the Web graph
qWhat is the signature of a community /
discussion in a Web graph?
[Kumar et al. ‘99]
Dense 2-layer
graphIntuition: Many people all talking about the same things
… …
…
Use this to define “topics”:
What the same people on
the left talk about on the right
Remember HITS!
71
Searching for Small Communities
qA more well-defined problem:
Enumerate complete bipartite subgraphs Ks,t
ØWhere Ks,t : s nodes on the “left” where each links to
the same t other nodes on the “right”
K3,4
|X| = s = 3
|Y| = t = 4
X Y
Fully connected 72
Frequent itemsets = complete bipartite graphs!
qHow?
ØView each node i as a
set Si of nodes i points to
ØKs,t = a set Y of size t (all items)
that occurs in s (a basket) sets Si
Ø Looking for Ks,tà set of
frequency threshold to s
and look at layer t – all
frequent sets of size t
From Itemsets to Bipartite Ks,t
[Kumar et al. ‘99]
i
b
c
d
a
Si={a,b,c,d}
j
i
k
b
c
d
a
X Y
s … minimum support (|X|=s)
t … itemset size (|Y|=t) 74
From Itemsets to Bipartite Ks,t
[Kumar et al. ‘99]
i
b
c
d
a
Si={a,b,c,d}
x
y
z
b
c
a
X Y
Find frequent itemsets:
s … minimum support
t … itemset size
x
b
c
a
We found Ks,t!
Ks,t = a set Y of size t
that occurs in s sets Si
View each node i as a
set Si of nodes i points to
Say we find a frequent
itemset Y={a,b,c} of supp s
So, there are s nodes that
link to all of {a,b,c}:
z
a
b
c
y
b
c
a
75
Example
qSupport threshold s=2
Ø {b,d}: support 3
Ø {e,f}: support 2
qAnd we just found 2 bipartite
subgraphs:
c
a b
d
f
Itemsets:
a = {b,c,d}
b = {d}
c = {b,d,e,f}
d = {e,f}
e = {b,d}
f = {}
e
c
a b
d
e
c
d
f
e
76