SocialNetworks2015-part2
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
Note to other teachers and users of these slides: We would be delighted if you found this our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
http://www.mmds.org/
2
Nodes: Football Teams
Edges: Games played
Can we identify
node groups?
(communities,
modules, clusters)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
3
NCAA conferences
Nodes: Football Teams
Edges: Games played
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
4
Can we identify
functional modules?
Nodes: Proteins
Edges: Physical interactions
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
5
Functional modules
Nodes: Proteins
Edges: Physical interactions
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
6
Can we identify
social communities?
Nodes: Facebook Users
Edges: Friendships
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
7
High school Summer
internship
Stanford (Squash)
Stanford (Basketball)
Social communities
Nodes: Facebook Users
Edges: Friendships
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
¡ Non-overlapping vs. overlapping communities
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8
9
Network Adjacency matrix
Nodes
N
od
es
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
¡ What is the structure of community overlaps:
Edge density in the overlaps is higher!
10
Communities as “tiles”
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
11
This is what we want!Communities
in a network
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
¡ 1) Given a model, we generate the network:
¡ 2) Given a network, find the “best” model
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 12
C
A
B
D E
H
F
G
C
A
B
D E
H
F
G
Generative
model for
networks
Generative
model for
networks
model: 一些nodes 边等等
¡ Goal: Define a model that can generate
networks
§ The model will have a set of “parameters” that we
will later want to estimate (and detect communities)
¡ Q: Given a set of nodes, how do communities
“generate” edges of the network?
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13
C
A
B
D E
H
F
G
Generative
model for
networks
¡ Assume a generative process (the model )
§ for creating instances of some artifact, for
example, “friends graphs”
¡ The model has parameters
§ for determining the probability of generating any
particular instance of the artifact
§ the “likelihood” of the parameter values
¡ The value of the parameters that gives the
largest value of the likelihood is the correct
model for the observed artifact
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 15
¡ Model: Each edge is present with probability p with
the presence or absence of each edge chosen
independently
¡ Parameter: p
¡ Example: a graph with 15 nodes and 23 edges
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 16
q How many pairs of nodes do we
have? 105
q If each edge is connected by a
probability p, the probability of
having a graph below is
v p23(1-p)82
¡ Take the first derivative of the probability function
and set it to 0 for finding the maximum / minimum
¡ Maximum or minimum of p23(1-p)82 happens when
§ Either p=0 or 1 or 23(1-p)-82p = 0
¡ The likelihood of generating the graph in the
previous slide is maximized when 23(1-p)-82p = 0
¡ p = 23/105
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 17
(23/105)^23 * (1-23/105)^82 = 1.0656135e-24
¡ Generative model B(V, C, M, {pc}) for graphs:
§ Nodes V, Communities C, Memberships M
§ Each community c has a single probability pc
§ Later we fit the model to networks to detect
communities
19
Model
Network
Communities, C
Nodes, V
Model
pA pB
Memberships, M
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
¡ The Community-Affiliation Graph:
1. There is a given number of communities and a given
number of individuals (nodes)
2. Each community can have any set of individuals as
members (binary memberships)
§ The membership parameters
3. Each community C has a probability Pc associated with
it
§ The probability that two members of community C are
connected because they are both members of C
4. If a pair of nodes is in two or more communities, then
there is an edge between them if any of the
communities of which both are members justifies that
edge according to rule 3 (see the next slide for an
example)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 20
¡ Nodes: u, v; Edge: (u, v)
§ u and v are in communities C, D but not other communities
§ Given the probability that two nodes in C (D) are connected
is PC (PD)
§ The probability of no edge between u, c is (1 – PC)(1 – PD)
§ The probability of (u, v) exists is 1 – (1 – PC)(1 – PD)
¡ General case: If u and v are members of a
community set M, the probability of an edge
between u and v is:
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 21
¡
22
Model
Õ
ÇÎ
–=
vu MMc
cpvuP )1(1),(
Network
Communities, C
Nodes, V
Community Affiliations
pA pB
Memberships, M
Think of this as an “OR” function: If at least 1 community says “YES” we create an edge
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
23
Model
Network
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
¡ AGM can express a
variety of community
structures:
Non-overlapping,
Overlapping, Nested
24J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
¡ Detecting communities with AGM:
26
C
A
B
D E
H
F
G
1) Affiliation graph M
2) Number of communities C
3) Parameters pc
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
¡
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 27
¡
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 28
¡ Using AGM for modeling the graph below
¡ Assumptions (preset some of the
parameters):
§ There are two communities C and D with PC and PD
§ We have determined (or are using as a temporary
hypothesis) the community memberships:
§ C = {w, x, y} and D = {w, y, z}
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 29
¡ Given C = {w, x, y} and D = {w, y, z}
¡ For w an x, Mwx = {C}
§ Pwx = 1 – (1 – Pc) = Pc
¡ And
§ Pxy = PC and Pyz = Pwz = PD
§ Pwy= 1 – (1 – PC)(1 – PD)
§ Pxz = a tiny value
¡ The likelihood of this graph given the membership
assumption
¡ PwxPwy Pxy Pyz (1 – Pwz)(1 – Pxz)
§
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 30
¡ Find the values of PC and PD that maximize:
§ The last factor can be dropped
¡ PC should be as large as possible
¡ Given Pc = 1, the expression becomes:
§ PD (1 – PD) -> PD=0.5 when the expression has its
maximum
¡ The maximum likelihood for the graph in Fig.
10.20 occurs when members of C are certain to
have an edge between them and
¡ There is a 50% chance that joint membership in
D will cause an edge between the members
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 31
¡
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 32
0 0.10 0.10 0.04
0.10 0 0.02 0.06
0.10 0.02 0 0.06
0.04 0.06 0.06 0
0 1 0 0
1 0 1 1
0 1 0 1
0 1 1 0
Flip
biased
coins
)),(1(),()|(
),(),(
vuPvuPGP
EvuEvu
-PP=Q
ÏÎ
¡ Given graph G(V,E) and Θ, we calculate
likelihood that Θ generated G: P(G|Θ)
0 0.9 0.9 0
0.9 0 0.9 0
0.9 0.9 0 0.9
0 0 0.9 0
Θ=B(V, C, M, {pc})
0 1 1 0
1 0 1 0
1 1 0 1
0 0 1 0
G
P(G|Θ)
)),(1(),()|(
),(),(
vuPvuPGP
EvuEvu
-PP=Q
ÏÎ
G
33J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
A B
¡
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 34
QP( | ) AGMarg maxQ
¡
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 35
¡
36
u v
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
¡
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 37
u v
¡
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 38
j
Communities
N
od
es
¡
39
Node community
membership strengths
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
0 1.2 0 0.2
0.5 0 0 0.8
0 1.8 1 0
¡
40J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
¡
41J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
¡
42J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
¡ BigCLAM takes 5 minutes for 300k node nets
§ Other methods take 10 days
¡ Can process networks with 100M edges!
43J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
45J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
¡ Extension:
Make community membership edges directed!
§ Outgoing membership: Nodes “sends” edges
§ Incoming membership: Node “receives” edges
46J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 47
¡
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 48
49J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
¡ Overlapping Community Detection at Scale: A Nonnegative Matrix
Factorization Approach by J. Yang, J. Leskovec. ACM International
Conference on Web Search and Data Mining (WSDM), 2013.
¡ Detecting Cohesive and 2-mode Communities in Directed and
Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM
International Conference on Web Search and Data Mining (WSDM),
2014.
¡ Community Detection in Networks with Node Attributes by J. Yang,
J. McAuley, J. Leskovec. IEEE International Conference On Data
Mining (ICDM), 2013.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 50
http://cs.stanford.edu/people/jure/pubs/bigclam-wsdm13.pdf
http://cs.stanford.edu/people/jure/pubs/coda-wsdm14.pdf
http://cs.stanford.edu/people/jure/pubs/cesna-icdm13.pdf