
Complex Dynamical Networks: Lecture 6a: Community Structures
EE 6605
Instructor: G Ron Chen
Most pictures in this PPT were taken from un-copyrighted websites, with thanks

Community Structure in Complex Networks

Community
Each densely-linked group is a community
Each symmetrical group is a community
The definition of a community can be subjective

Community Detection
 Common Definition: A community is a set of nodes among which the connections are relatively dense and strong, or interactions are relatively frequent
 A network has a community structure if the network can naturally be divided into clusters of nodes with dense internal connections and sparse external connections
 Community Detection (Clustering, Grouping): to find cohesive subgroups from a given graph
 Applications
Understanding interactions between people (or systems)
Visualizing and navigating large-scale networks
Forming a basis for other tasks such as data mining


Example: USA school integration and friendship segregation
Race: left (white) to right (black); Grade: up (junior) to down (senior)
J. Moody, Amer. J. of Sociology, 2001

Community Detection
Many real networks have a natural community structure, and we want to discover this structure
For a large-scale network, how can we discover community structure in an automated way?

Criteria: Which partition is better?
• Quality Measures:
How to evaluate an algorithm’s performance when the community structure is unknown?
• Benchmarks:
Which algorithm best characterizes a given network with a known community structure?

Benchmark Examples
A real network:
Zachary’s Karate Network
An artificial network:
Planted L-partition model
W. W. Zachary, J. Anthropological Research 33: 452-473, 1977
A. Condon and R. M. Karp, Random Structures & Algorithms 18(2): 116-140, 2001

Zachary’s Karate Network
Story Background / Dataset

Zachary’s Karate Network

Other Benchmark Examples
A word network
Lusseau’s network of bottlenose dolphins

Example: Planted 2-Partition Model
L=2

Planted L-Partition Model
Partition the graph into L parts (for example, L = 3)
Each node has a probability p_in of being connected to nodes inside its group, and a probability p_out of being connected to nodes outside its group
If p_in > p_out, then the graph has a community structure
Otherwise, it is a homogeneous (e.g., random) graph
L = 3
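As a quick illustration, networkx ships a generator for this model; a minimal sketch follows, where the parameter values are illustrative choices, not from the lecture:

```python
# A minimal sketch using networkx's built-in planted-partition generator;
# the parameter values below are illustrative assumptions.
import networkx as nx

L, k = 3, 20                      # L groups of k nodes each
p_in, p_out = 0.5, 0.05           # p_in > p_out: community structure

G = nx.planted_partition_graph(L, k, p_in, p_out, seed=42)
print(G.graph["partition"])       # the generator stores the ground-truth groups
```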

Methods and Algorithms
1) Minimum-cut method
2) Hierarchical clustering
3) GN algorithm
4) Clique-based methods
5) Modularity maximization
6) Information-based algorithms ……

1) Minimum-Cut Method
One of the oldest methods for dividing networks into parts
The network is divided into a pre-determined number of groups of approximately the same size (“balanced”), chosen such that the number or cost (total weight) of edges between groups is minimized
Karger’s algorithm
Example:
Balanced, best
Balanced, but costly
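Karger’s algorithm, named above, repeatedly contracts random edges until two super-nodes remain; the surviving edges form a random cut, and the smallest cut over many trials approximates the minimum cut. A minimal single-trial sketch (the union-find helper is our own, not a library API):

```python
# A minimal sketch of one trial of Karger's randomized contraction; a good
# min-cut estimate repeats the trial many times and keeps the smallest cut.
import random

def karger_one_trial(edges, n_nodes):
    """edges: list of (u, v) pairs over nodes 0..n_nodes-1.
    Returns the size of one random cut."""
    parent = list(range(n_nodes))            # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    groups = n_nodes
    pool = edges[:]
    random.shuffle(pool)
    for u, v in pool:
        if groups == 2:
            break
        ru, rv = find(u), find(v)
        if ru != rv:                         # contract edge (u, v)
            parent[ru] = rv
            groups -= 1
    # edges whose endpoints lie in different super-nodes form the cut
    return sum(1 for u, v in edges if find(u) != find(v))

# Example: two triangles joined by one edge; the minimum cut is 1
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
print(min(karger_one_trial(edges, 6) for _ in range(100)))   # -> 1
```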

Ratio Cut and Normalized Cut
 If the cost (total weight) of edges between groups is not chosen appropriately, the corresponding minimum-cost cut may yield an unbalanced partition (e.g., with one set being a singleton)
 Other objective functions, for a partition into communities C_1, …, C_k:
Ratio Cut = Σ_i cut(C_i, C̄_i) / |C_i|
Normalized Cut = Σ_i cut(C_i, C̄_i) / vol(C_i)
where
C_i : community i
C̄_i : complementary community (the rest of the network)
|C_i| : number of nodes in C_i
vol(C_i) : sum of degrees in C_i
cut(C_i, C̄_i) : number of edges crossing between C_i and C̄_i

Example
[Figure: two candidate cuts of the same small network, one marked in red and one in green]
For the red partition and for the green partition, compute the objective values cut(C_i, C̄_i) / |C_i| and cut(C_i, C̄_i) / vol(C_i)
 We should choose the cut whose objective value is smaller
(Notation as before: |C_i| is the number of nodes in C_i, vol(C_i) the sum of degrees in C_i, and cut(C_i, C̄_i) the number of edges crossing the cut)
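A small sketch of how these quantities could be computed with networkx; the graph (Zachary’s karate club) and the 2-way partition by its “club” node attribute are illustrative assumptions, not the figure’s example:

```python
# A minimal sketch, assuming networkx is available.
import networkx as nx

G = nx.karate_club_graph()
C = {n for n in G if G.nodes[n]["club"] == "Mr. Hi"}   # community C_1
C_bar = set(G) - C                                     # its complement

cut = nx.cut_size(G, C, C_bar)            # number of edges crossing the cut
vol = lambda S: sum(d for _, d in G.degree(S))

ratio_cut = cut / len(C) + cut / len(C_bar)
normalized_cut = cut / vol(C) + cut / vol(C_bar)
print(ratio_cut, normalized_cut)
```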

2) Node-Similarity-Based Clustering
Put all nodes that are mutually similar (or close) into the same community
 For large-scale networks:
 Consider the connections as features
 Use Cosine similarity or Jaccard similarity to compute node similarity
 Apply the classical K-means Clustering Algorithm
 K-means Clustering Algorithm
 Each cluster is associated with a centroid (center point)
 Each node is assigned to the cluster with the closest centroid
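A minimal sketch of this pipeline, assuming scikit-learn is available; the graph and K = 2 are illustrative choices. Normalizing adjacency rows to unit length lets Euclidean K-means approximate cosine-similarity clustering:

```python
# A minimal sketch: adjacency rows as features, then K-means.
import networkx as nx
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)            # row u = node u's connection features

X = normalize(A)                    # unit-length rows (cosine-like geometry)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                       # cluster index of each node
```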

Node Similarity
[Figure: a 13-node example network and its 13 × 13 adjacency matrix; nodes 5 and 8 are structurally equivalent]
Cosine similarity, computed between adjacency-matrix rows: sim(u, v) = (A_u · A_v) / (‖A_u‖ ‖A_v‖), e.g., sim(5, 8); other pairs omitted
Jaccard similarity, computed from neighbor sets: J(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|, e.g., J(5, 8)
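The two measures could be computed from adjacency rows as follows (a sketch; the function names are our own):

```python
# A minimal sketch of the two similarity measures between nodes u and v.
import numpy as np

def cosine_sim(A, u, v):
    """Cosine of the angle between rows u and v of adjacency matrix A."""
    return A[u] @ A[v] / (np.linalg.norm(A[u]) * np.linalg.norm(A[v]))

def jaccard_sim(A, u, v):
    """|N(u) ∩ N(v)| / |N(u) ∪ N(v)| from the rows' nonzero positions."""
    Nu, Nv = set(np.flatnonzero(A[u])), set(np.flatnonzero(A[v]))
    return len(Nu & Nv) / len(Nu | Nv)
```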

Node-Similarity-Based Clustering
Cosine, Jaccard
K-means

Illustration of K-means Clustering
[Figure: six panels, Iteration 1 through Iteration 6, of K-means on 2-D points; at each iteration the centroids (×) move and the point assignments update, until convergence]

2) Hierarchical Clustering Method
A similarity measure is used to quantify how similar each pair of nodes is
Commonly used measures include cosine similarity, Jaccard similarity, and the Hamming distance between rows of the adjacency matrix
Then, according to any such measure, similar nodes are grouped into communities
Example: the Hamming distance between “toned” and “roses” is 3
Example: the Hamming distance between 100100 and 011011 is 6

2) Hierarchical Clustering Method
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different
Hamming distance:
100 → 011 has distance 3 (red path)
010 → 111 has distance 2 (blue path)
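A one-line sketch that reproduces the two string examples above:

```python
# Hamming distance between two equal-length sequences.
def hamming(s, t):
    """Number of positions where s and t differ."""
    return sum(a != b for a, b in zip(s, t))

print(hamming("toned", "roses"))      # -> 3
print(hamming("100100", "011011"))    # -> 6
```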

Hierarchical Clustering Method
 Calculate a “weight” (e.g., similarity value) for every pair of nodes, which represents how closely connected this pair of nodes is
 Starting with all disconnected nodes, add edges between pairs, one by one, in increasing order of their weights
 Result: Some nested components, where one can take a “slice” (threshold) at any level of the tree
Threshold
Smallest weights
 Secondsmallest weights
partition
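A sketch of the same bottom-up procedure using SciPy’s agglomerative clustering; the dissimilarity (Jaccard distance between adjacency rows), the linkage rule, and the threshold t are illustrative choices, not the lecture’s:

```python
# A minimal sketch of hierarchical clustering with SciPy.
import networkx as nx
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

G = nx.karate_club_graph()
A = nx.to_numpy_array(G).astype(bool)

D = pdist(A, metric="jaccard")        # pairwise distances between rows
Z = linkage(D, method="average")      # build the dendrogram bottom-up
labels = fcluster(Z, t=0.9, criterion="distance")   # "slice" the tree
print(labels)                         # community label of each node
```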

3) GN Algorithm
The Girvan–Newman (GN) algorithm identifies the “heavy” (high-betweenness) edges in a network that lie between communities and then removes them, leaving only the communities
This is a betweenness-based clustering method: identification is performed using edge-betweenness, yielding results of reasonable quality
Drawback: high computational complexity
M. Girvan and M. E. J. Newman, PNAS, 99(12): 7821-7826, 2002

GN Algorithm
 Calculate the betweenness of every edge in the network
 Remove the edge with the highest betweenness
 Re-calculate all the edge-betweennesses for the resulting (smaller) network
 Repeat the above, until no edge is left
[Figure: a 6-node example network with edge-betweenness values 3, 4, and 5 marked on the edges]
Removal order: betweenness 5 first (v2-v4, v3-v5), then 4 (v4-v6, v5-v6), then 3 (v1-v2, v2-v3, v3-v1)

Removing v2-v4 and v3-v5 splits the network into 2 communities: {v1, v2, v3} and {v4, v5, v6}
Then removing v4-v6, v5-v6 and v1-v2, v2-v3, v3-v1 leaves 6 singleton communities: {v1}, {v2}, {v3}, {v4}, {v5}, {v6}
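networkx ships an implementation of this algorithm; a minimal sketch on the karate-club benchmark (an illustrative choice):

```python
# A minimal sketch using networkx's built-in Girvan-Newman implementation.
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()
splits = girvan_newman(G)             # iterator over successive splits
first = next(splits)                  # first split: 2 communities
print([sorted(c) for c in first])
```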

Example:
Step 1: remove 7-8
Step 2: remove 3-7, 6-7; 8-9, 8-12
Result

GN Algorithm
Applied to the Karate Club Dataset
Partition

Zachary’s Karate Network

4) Clique-Based Methods
 A clique is a fully-connected subgraph
There are several clique-based community detection algorithms (finding cliques is computationally NP-hard)
 Since a node can be a member of more than one clique, these methods usually yield overlapping community structures
For example:
Nodes (5, 6, 7, 8) form a clique
Nodes (1, 2, 3), (1, 3, 4), and (4, 5, 6) also form cliques, respectively

Clique-Based Methods
 One approach is to find all the maximal cliques, i.e., cliques that are not a subgraph of any other clique (the Bron–Kerbosch algorithm does this)
 To find a maximum clique in a large network, recursively apply the following pruning procedure:
Sample a (large) subgraph from the given network, and find a clique of size k in it (say, by a greedy algorithm)
Since every member of a larger clique must have degree ≥ k, remove all nodes with degree ≤ k − 1 from the whole network
Repeat the above, until the network is small enough

Example
 First, suppose we sampled a sub-network with nodes {1, 2, 3, 4, 5} and found a clique {1, 2, 3} of size 3
[This is used as a reference to search for a possible larger clique]
 Next, to find a clique of size > 3, remove all nodes with degree ≤ 2 (i.e., degrees 1 and 2)
 Remove nodes 2, 9
 Remove nodes 1, 3, 4
 Result is a larger clique: {5, 6, 7, 8}
 If the network is huge, continue: size > 4, size > 5, …
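A sketch combining both ideas with networkx: find_cliques enumerates maximal cliques (via a Bron–Kerbosch variant), and the prune() helper (our own, not a library API) implements the degree-based pruning step; the graph is an illustrative choice:

```python
# A minimal sketch: maximal cliques plus degree pruning.
import networkx as nx

def prune(G, k):
    """Remove nodes that cannot belong to a clique of size > k
    (every member of a (k+1)-clique must have degree >= k)."""
    H = G.copy()
    while True:
        low = [n for n, d in H.degree() if d <= k - 1]
        if not low:
            return H
        H.remove_nodes_from(low)

G = nx.karate_club_graph()                  # illustrative graph
H = prune(G, 4)                             # keep only candidates for 5-cliques
print(max(nx.find_cliques(H), key=len))     # a maximum clique of the pruned graph
```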

Clique Percolation Method
 A node can be a member of more than one clique, so these methods usually yield overlapping community structures
 CPM is a method to find overlapping communities
Input: a network, and a parameter k
Procedure:
 Find all cliques of size k in the given network
 Construct a clique graph: two cliques are adjacent if they share k − 1 nodes
 Each connected cluster in the clique graph is a community

Example
Cliques of size 3: {1, 2, 3}, {1, 3, 4}, {4, 5, 6}, {5, 6, 7}, {5, 6, 8}, {5, 7, 8}, {6, 7, 8}
k = 3; adjacent cliques share k − 1 = 2 nodes
Communities:
{1, 2, 3, 4}, {4, 5, 6, 7, 8}, {9}
Overlapping node: 4
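This example can be checked with networkx’s CPM implementation; the edge list below reconstructs the graph implied by the listed 3-cliques (an assumption, since the original figure is not shown):

```python
# A minimal sketch of CPM via networkx's k-clique-community implementation.
import networkx as nx
from networkx.algorithms.community import k_clique_communities

G = nx.Graph([(1, 2), (1, 3), (2, 3), (1, 4), (3, 4), (4, 5), (4, 6),
              (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8)])
G.add_node(9)                               # the isolated node

print([sorted(c) for c in k_clique_communities(G, 3)])
# -> [[1, 2, 3, 4], [4, 5, 6, 7, 8]] (order may vary); node 4 overlaps,
#    and node 9 belongs to no 3-clique community
```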

k-Clique Communities
A k-clique community is the union of all k-cliques that can be reached from one another through a series of adjacent k-cliques
Example:
k = 3
[Figure: two 3-cliques sharing k − 1 = 2 common nodes belong to the same community; two 3-cliques sharing only 1 common node do not]

A. Lancichinetti, S. Fortunato, and F. Radicchi, Phys. Rev. E 78, 046110 (2008)

Detecting Community Structure: Challenges
 Hierarchical
 Overlapping

Detecting Community Structure: Challenges
Evolution, Emergence

Link Prediction
An emerging community?

Link Prediction
Predict:
Which computer is likely to connect to which computer

Link Prediction
Predict:
Which person is likely to connect with which other person (friend)

Link Prediction
For a large-scale network, this could be challenging

Link Prediction
Simple Cases:
Degree sequences:
Criteria:
Similarity: similar degrees, properties, importance, …
Commonality: common friends, features, …
Closeness: small distances, …

Link Prediction
Based on: degree similarity; closeness/distance
Criteria:
Similarity: similar degrees, properties, importance, …
Commonality: common friends, features, …
Closeness: small distances, …

Examples
Predict link(s) from node “B”
Answer: B–D and B–F (they have degree 2, like B)
Predict link(s) from node “F”
Answer: F–E and A–F (there is no other degree-1 node; they are at the shortest distance)

Link Prediction
Criterion:
Based on node-degree average

Link Prediction

Link Prediction
Based on Commonality
Predict the 1st new link: 1–3
Predict the 2nd new link: 2–4, 2–5, or 4–5, and so on


Other Link Prediction Methods
 Node-similarity-based methods
 Topology-similarity-based methods: CN, AA, RA, LP, Katz, LRW, SRW, RWR, … (see the sketch below)
 Maximum-likelihood analytic methods: layer-structure modeling, random partitioning, …
 Machine learning, ……
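A minimal sketch of a few of the neighborhood-based indices (CN = common neighbors, AA = Adamic–Adar, RA = resource allocation) with networkx; the graph and node pairs are illustrative choices:

```python
# A minimal sketch of neighborhood-based link-prediction indices.
import networkx as nx

G = nx.karate_club_graph()
pairs = [(0, 33), (5, 24)]                  # two non-adjacent node pairs

for u, v in pairs:
    cn = len(list(nx.common_neighbors(G, u, v)))
    print(f"CN({u},{v}) = {cn}")

print(list(nx.adamic_adar_index(G, pairs)))           # AA
print(list(nx.resource_allocation_index(G, pairs)))   # RA
print(list(nx.jaccard_coefficient(G, pairs)))         # Jaccard
```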

BREAK
10 minutes