Recommender Systems
Social Network Analysis
Community Detection
Robin Burke
DePaul University
Chicago, IL
1
Finding groups in networks
Discover communities of practice
Measure isolation of groups
Understand opinion dynamics / adoption
2
Zachary Karate Club
source:Easley/Kleinberg
3
Community finding
Social and other networks have a natural community structure
We want to discover this structure rather than impose a certain size of community or fix the number of communities
4
What makes a community
similarity of members
shared connectivity
mutuality of ties
everybody in the group knows everybody else
frequency of ties among members
everybody in the group has links to at least k others in the group
closeness or reachability of a subgroup
individual are separated by at most n hops
relative frequency of ties among subgroup members compared to nonmembers
5
Communities summarize
Grouping nodes can give us a “high level” picture of a network
6
Open research topic
Even though we know (in some cases) good measures for cluster quality
we can’t calculate those clusterings
for reasonably large networks
Lots of research on-going on this question
Talk about BIGCLAM later
7
Modularity
Seen this measure before
for assortativity
Are there more edges than you would expect if the connections were random?
Measure this probability over all clusters
Random clustering has modularity = 0
because edges in and out of the clusters are equally likely
Good clustering has high modularity
edges within clusters are many
edges between clusters relatively few
8
Modularity
Use cluster ids in our previous equation
Remember also B
modularity matrix
9
Modularity maximization
We can attempt to find maximum modularity
look at all possible subsets of nodes
exponential task
built into R
cluster_optimal()
small graphs only!
10
Greedy approaches
Merge nodes into a community as long as the modularity continues to increase
Then move to a different community
Continue until finished
No guarantee of optimality
In R
cluster_louvain()
Also implemented in Gephi
Used this in lab
A faster version
cluster_fast_greedy()
11
Edge weights
igraph methods will use edge weights if present
and labeled “weight”
Same modularity calculation applies
now the “value” of an edge is weighted
a weak edge crossing group boundaries
not as significant as a strong edge
12
Simulated annealing
An optimization approach
Basically
add and remove nodes from communities
occasionally allow moves that make the modularity worse
but tolerance for “bad” moves goes down over time
Eventually the communities converge
In R
cluster_spinglass()
this has the nice property that you can find a community for a single vertex without finding all of them
13
Spectral methods
Remember equation with modularity matrix
Create a vector s where si=1 if in group 1 and si=-1 if in group 2
We can rewrite Q as
14
Solving
We want to find assignments for s that maximize Q
Looks like “all subsets” again
but…
If s is the eigenvector of B with the greatest eigenvalue
then we’ll maximize Q
15
Discretization
Sounds good, but
entries in the eigenvector won’t be +1 and -1
Creating clusters
if si > 0, put node i in class 1
if si < 0, put node i in class 2
General strategy
turn an exponential discrete problem
into a continuous one that can be optimized
move the solution back into the discrete domain
we will see this idea again
16
Multiple communities
Keep subdividing until modularity cannot be improved
In R
cluster_leading_eigen()
17
Modularity maximization
Nice theoretical foundation
corresponds with intuition
Problem
resolution limit
the larger the network, the larger the communities it finds
not possible to find small communities in a large network
this is a fundamental problem with the modularity measure
“resolution” parameter in Gephi
doesn’t really fix the problem
Non-overlapping communities only
18
Betweenness clustering
A hierarchical technique
Find the edge of highest betweenness
this is the edge on the most shortest paths
Remove it
When you have disconnected the network
those are your communities
Keep going until you are down to single vertices
19
Betweenness clustering:
successively remove edges of highest betweenness (the bridges, or local bridges), breaking up the network into separate components
20
betweenness clustering algorithm & the karate club data set
source: Girvan and Newman, PNAS June 11, 2002 99(12):7821-7826
21
Betweenness clustering
In igraph
cluster_edge_betweenness()
Completely different criteria from modularity
more global
may be good or bad
depends on the purpose of the network
Note counter-intuitive interpretation of weights
Usually will want to set weights=NULL to avoid
22
What if communities overlap?
Users are often in multiple communities
cannot use the same techniques reliably
More open research questions
how to measure the quality of non-exclusive clustering
how to compute such clusters
Particularly an issue in large social networks
modularity maximization doesn’t work
23
Random walk
Idea
short random walks will (on average) stay within the community
Properties
very efficient compared to some others
often matches real data quite well
Overlapping communities
possible to extend this idea to allow overlap
In igraph
cluster_infomap()
cluster_walktrap()
cluster_labelprop()
24
Return value
Clustering methods return a community object
Various methods
length = # of communities
sizes = vector of community sizes
(number of nodes)
membership = vector of community labels
modularity = the modularity of the clustering (even if it wasn’t computed that way)
25
BIGCLAM
For more see mmds.org
A generative model of overlapping communities in networks
Assume that the network is a projection of a bipartite network
Our job is to recover the latent (unknown) shared interests that generate the graph
Generative model
B(V, C, M, {pc}) for graphs:
Nodes V, Communities C, Memberships M
Each community c has a single probability pc
Model
Network
Communities, C
Nodes, V
Community Affiliations
pA
pB
Memberships, M
Maximum Likelihood Estimation
Our goal: Find such that:
How do we find that maximizes the likelihood?
AGM
arg max
P(
|
)
28
Approximation
Cannot be solved directly
Approximation
Assume each user has an affiliation FuA for each community A
Assume independence
Assume probability of shared community
29
With assumptions
Turns into an optimization problem that can be solved with gradient descent
Very efficient even on large networks
Comparing clusterings
visually
# of communities
distribution of size
if there are lots of small communities, that’s not good
modularity
But this is not always a great measure
Adjusted Rand index
computes the probability that two clusterings agree on a given pair of items
normalized by the expected number of agreements by chance
vi.dist is similar
Both in mcclust package
31
Conclusion
Community structure is a way of ‘x-raying’ the network
seeing components with strong interactions
Idea: discover the “natural” community boundaries
hard to define what this means
many techniques
an area of open research
Analytic utility
identifying areas of the network with particular properties
subnetworks can be isolated for further analysis
32
Many techniques
What to use?
Modularity is intuitively appealing
but controversial
many studies show it does not work with real networks
esp. large ones
Random walks
Pretty good
but analyst has to decide how to set walk size
Overlapping communities
BIGCLAM and other techniques not implemented in igraph
33
Homework 5
Hashtag-hashtag projection from twitter
Tags connected if used by the same user
Filter
Community detection and analysis
Extract communities surrounding a given node
Different for each community detection algorithm
Next week
Mathematical models of networks
Data due
35
Example
36
Q =
1
2m
Ai, j −
kik j
2m
⎡
⎣
⎢
⎤
⎦
⎥
i, j
∑ ⋅δ(ci,cj )
Q=
1
2m
A
i,j
-
k
i
k
j
2m
é
ë
ê
ù
û
ú
i,j
å
×d(c
i
,c
j
)
Bi, j = Ai, j −
kik j
2m
⎡
⎣
⎢
⎤
⎦
⎥
B
i,j
=A
i,j
-
k
i
k
j
2m
é
ë
ê
ù
û
ú
Q =
1
2m
Bi, jδ(ci,cj )
i, j
∑
Q=
1
2m
B
i,j
d(c
i
,c
j
)
i,j
å
Q =
1
2m
sTBi, js
i, j
∑
Q=
1
2m
s
T
B
i,j
s
i,j
å
Q
Q
/docProps/thumbnail.jpeg