程序代写代做代考 data mining SocialNetworks2015-part2

SocialNetworks2015-part2

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University

http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found this our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org

http://www.mmds.org/

2

Nodes: Football Teams
Edges: Games played

Can we identify
node groups?

(communities,
modules, clusters)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

3

NCAA conferences

Nodes: Football Teams
Edges: Games played

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

4

Can we identify
functional modules?

Nodes: Proteins
Edges: Physical interactions

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

5

Functional modules

Nodes: Proteins
Edges: Physical interactions

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

6

Can we identify
social communities?

Nodes: Facebook Users
Edges: Friendships

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

7

High school Summer
internship

Stanford (Squash)
Stanford (Basketball)

Social communities

Nodes: Facebook Users
Edges: Friendships

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡ Non-overlapping vs. overlapping communities

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8

9

Network Adjacency matrix

Nodes

N
od

es

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡ What is the structure of community overlaps:
Edge density in the overlaps is higher!

10

Communities as “tiles”
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

11

This is what we want!Communities
in a network

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡ 1) Given a model, we generate the network:

¡ 2) Given a network, find the “best” model

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 12

C

A

B

D E

H

F

G

C

A

B

D E

H

F

G

Generative
model for
networks

Generative
model for
networks

model: 一些nodes 边等等

¡ Goal: Define a model that can generate
networks
§ The model will have a set of “parameters” that we

will later want to estimate (and detect communities)

¡ Q: Given a set of nodes, how do communities
“generate” edges of the network?

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13

C

A

B

D E

H

F

G

Generative
model for
networks

¡ Assume a generative process (the model )
§ for creating instances of some artifact, for

example, “friends graphs”
¡ The model has parameters
§ for determining the probability of generating any

particular instance of the artifact
§ the “likelihood” of the parameter values

¡ The value of the parameters that gives the
largest value of the likelihood is the correct
model for the observed artifact

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 15

¡ Model: Each edge is present with probability p with
the presence or absence of each edge chosen
independently

¡ Parameter: p
¡ Example: a graph with 15 nodes and 23 edges

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 16

q How many pairs of nodes do we
have? 105

q If each edge is connected by a
probability p, the probability of
having a graph below is

v p23(1-p)82

¡ Take the first derivative of the probability function
and set it to 0 for finding the maximum / minimum

¡ Maximum or minimum of p23(1-p)82 happens when
§ Either p=0 or 1 or 23(1-p)-82p = 0
¡ The likelihood of generating the graph in the

previous slide is maximized when 23(1-p)-82p = 0
¡ p = 23/105

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 17

(23/105)^23 * (1-23/105)^82 = 1.0656135e-24

¡ Generative model B(V, C, M, {pc}) for graphs:
§ Nodes V, Communities C, Memberships M
§ Each community c has a single probability pc
§ Later we fit the model to networks to detect

communities

19

Model

Network

Communities, C

Nodes, V

Model

pA pB

Memberships, M

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡ The Community-Affiliation Graph:
1. There is a given number of communities and a given

number of individuals (nodes)
2. Each community can have any set of individuals as

members (binary memberships)
§ The membership parameters

3. Each community C has a probability Pc associated with
it
§ The probability that two members of community C are

connected because they are both members of C
4. If a pair of nodes is in two or more communities, then

there is an edge between them if any of the
communities of which both are members justifies that
edge according to rule 3 (see the next slide for an
example)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 20

¡ Nodes: u, v; Edge: (u, v)
§ u and v are in communities C, D but not other communities
§ Given the probability that two nodes in C (D) are connected

is PC (PD)
§ The probability of no edge between u, c is (1 – PC)(1 – PD)
§ The probability of (u, v) exists is 1 – (1 – PC)(1 – PD)

¡ General case: If u and v are members of a
community set M, the probability of an edge
between u and v is:

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 21

¡

22

Model

Õ
ÇÎ

–=
vu MMc

cpvuP )1(1),(

Network

Communities, C

Nodes, V
Community Affiliations

pA pB

Memberships, M

Think of this as an “OR” function: If at least 1 community says “YES” we create an edge
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

23

Model

Network
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡ AGM can express a
variety of community
structures:
Non-overlapping,
Overlapping, Nested

24J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡ Detecting communities with AGM:

26

C

A

B

D E

H

F

G

1) Affiliation graph M
2) Number of communities C
3) Parameters pc

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 27

¡

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 28

¡ Using AGM for modeling the graph below
¡ Assumptions (preset some of the

parameters):
§ There are two communities C and D with PC and PD
§ We have determined (or are using as a temporary

hypothesis) the community memberships:
§ C = {w, x, y} and D = {w, y, z}

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 29

¡ Given C = {w, x, y} and D = {w, y, z}
¡ For w an x, Mwx = {C}

§ Pwx = 1 – (1 – Pc) = Pc
¡ And

§ Pxy = PC and Pyz = Pwz = PD
§ Pwy= 1 – (1 – PC)(1 – PD)
§ Pxz = a tiny value

¡ The likelihood of this graph given the membership
assumption

¡ PwxPwy Pxy Pyz (1 – Pwz)(1 – Pxz)
§

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 30

¡ Find the values of PC and PD that maximize:

§ The last factor can be dropped
¡ PC should be as large as possible
¡ Given Pc = 1, the expression becomes:

§ PD (1 – PD) -> PD=0.5 when the expression has its
maximum

¡ The maximum likelihood for the graph in Fig.
10.20 occurs when members of C are certain to
have an edge between them and

¡ There is a 50% chance that joint membership in
D will cause an edge between the members

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 31

¡

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 32

0 0.10 0.10 0.04
0.10 0 0.02 0.06
0.10 0.02 0 0.06
0.04 0.06 0.06 0

0 1 0 0
1 0 1 1
0 1 0 1
0 1 1 0

Flip
biased
coins

)),(1(),()|(
),(),(

vuPvuPGP
EvuEvu

-PP=Q
ÏÎ

¡ Given graph G(V,E) and Θ, we calculate
likelihood that Θ generated G: P(G|Θ)

0 0.9 0.9 0
0.9 0 0.9 0
0.9 0.9 0 0.9
0 0 0.9 0

Θ=B(V, C, M, {pc})

0 1 1 0
1 0 1 0
1 1 0 1
0 0 1 0

G
P(G|Θ)

)),(1(),()|(
),(),(

vuPvuPGP
EvuEvu

-PP=Q
ÏÎ

G

33J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

A B

¡

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 34

QP( | ) AGMarg maxQ

¡

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 35

¡

36

u v

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 37

u v

¡

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 38

j

Communities

N
od
es

¡

39

Node community
membership strengths

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

0 1.2 0 0.2

0.5 0 0 0.8

0 1.8 1 0

¡

40J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡

41J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡

42J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡ BigCLAM takes 5 minutes for 300k node nets
§ Other methods take 10 days

¡ Can process networks with 100M edges!
43J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

45J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡ Extension:
Make community membership edges directed!
§ Outgoing membership: Nodes “sends” edges
§ Incoming membership: Node “receives” edges

46J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 47

¡

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 48

49J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡ Overlapping Community Detection at Scale: A Nonnegative Matrix
Factorization Approach by J. Yang, J. Leskovec. ACM International
Conference on Web Search and Data Mining (WSDM), 2013.

¡ Detecting Cohesive and 2-mode Communities in Directed and
Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM
International Conference on Web Search and Data Mining (WSDM),
2014.

¡ Community Detection in Networks with Node Attributes by J. Yang,
J. McAuley, J. Leskovec. IEEE International Conference On Data
Mining (ICDM), 2013.

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 50

http://cs.stanford.edu/people/jure/pubs/bigclam-wsdm13.pdf
http://cs.stanford.edu/people/jure/pubs/coda-wsdm14.pdf
http://cs.stanford.edu/people/jure/pubs/cesna-icdm13.pdf