Mining Social Networks
COMPSCI 753 Kaiqi Zhao
Overview
§ Social network analysis
§ Social network introduction § Community detection
§ Influence maximization
56
Influence
§ People are connected and perform actions
57
Application: viral marketing
§ Idea: exploit social influence for marketing
§ Basic assumption: word-of-mouth effect
§ More efficient marketing strategy – target the influencers
xxx laptop is good!
xxx laptop is good!
xxx laptop is good!
xxx laptop is good!
xxx laptop is good!
58
Problem definition
§ Given a social network 𝐺 = (𝑉, 𝐸), find a seed-set 𝑆 of 𝑘 influential people in 𝑉 such that by targeting them we maximize the inference spread.
§ Influence spread: Given a set 𝑆 ⊆ 𝐺, the influence spread 𝜎(𝑆) of 𝑆 is the expected number of influenced users after the propagation terminates if the users in 𝑆 were selected as seed users
§ Propagation models:
§ Independent Cascade Model § Linear Threshold Model
59
Independent Cascade (IC)
§ Each edge 𝑢, 𝑣 ∈ 𝐸 has influence probability 𝑝!”.
§ Seeds 𝑆 are activated at time 𝑡 = 0, e.g., 𝑆# = 𝑆
§ At each time step 𝑡 > 0, each active node 𝑢 independently activates its inactive neighbor 𝑣:
§ Succeed with probability 𝑝!” and fails with probability 1 − 𝑝!”.
§ Once a node is active, it cannot become inactive
60
Example – Independent Cascade
Y
0.6
0.2
Inactive Node Active Node
Newly active node
Successful attempt
Unsuccessful attempt
0.3 0.4
w
0.2 X0.1 U
0.5 0.3 0.5
v
0.2
Animation from https://personal.utdallas.edu/~dzdu/
61
Linear Threshold (LT) Model
§ Each edge 𝑢, 𝑣 ∈ 𝐸 has weight 𝑏yt.
§ Let 𝑆z be the set of active nodes at time 𝑡. Initially, only seed nodes in 𝑆
are activated at time 𝑡 = 0, i.e., 𝑆{ = 𝑆
§ Randomly pick an activation threshold 𝜃y~𝑈[0,1] for each node 𝑢
§ At each time step 𝑡 > 0, a node 𝑢 gets activated if:
§ The sum of weights from all its activated in-neighbor, i.e.,
Σ t│t→y∈b &t∈$%& 𝑏ty ≥ 𝜃y.
62
Example – Linear Threshold
0.6 0.3
X
0.4
0.5
Y
Inactive Node
Active Node
Threshold
Active neighbors
0.2
0.2
0.1 0.3
U
0.2 w 0.5 v
Animation from https://personal.utdallas.edu/~dzdu/
63
Mutually-exclusive Cascade (MC)
§ In IC, a node is activated independently by its in-neighbors
u1 u2
§ In mutually-exclusive cascade model (MC), only one in-neighbors of 𝑣 activates 𝑣
u1 u2
v
v
64
LT = MC
Linear Threshold
Prob[u is not activated in step 𝑘, but is activated at step 𝑘 + 1]
= ∑,∈-!\-!”# 𝑏,” 1 − ∑,∈/!”# 𝑏,”
R 𝑏0& 0∈4#$%
R𝑏0& 1 0∈4#\6#$%
0
Mutually-exclusive Cascade
Prob[u is not activated in step 𝑘, but is activated at step 𝑘 + 1] =
𝑃𝑟𝑜𝑏 𝑢 𝑖𝑠 𝑎𝑐𝑡𝑖𝑣𝑎𝑡𝑒𝑑 𝑏𝑦 𝑤 𝑖𝑛 𝑆%\𝑆%01 𝑃𝑟𝑜𝑏[𝑢 𝑖𝑠 𝑛𝑜𝑡 𝑎𝑐𝑡𝑖𝑣𝑎𝑡𝑒𝑑 𝑏𝑦 𝑤 𝑖𝑛 𝑆%01]
= ∑,∈-!\-!”# 𝑏,” 1 − ∑,∈/!”# 𝑏,”
w1
bw ,u
u wh
1
bwh ,u
65
Computing Influence Spread
§ Deterministic graph
§ Accessibility from the active nodes
66
Computing Influence Spread
§ How to compute the influence of IC and MC (or LT)? § Stochastic model is annoying!àdeterministic models?
§ Choose active all edges at the beginning – activated edges are live edges § Let 𝑋 be the set denoting the label of edges (live or blocked)
0.6
0.3
0.1
0.2
0.2
0.4
0.5 0.3
0.2 0.5
67
𝑋”! =1ifedge𝑢→𝑣islive Prob𝑋=Ö𝑝8&!1−𝑝 !98&!
&/ &/ &,/ ∈7
Computing Influence Spread
§ How to compute the influence of IC and MC (or LT)?
1. Calculate each possible deterministic graph with edge label 𝑋 and its
probability prob[𝑋]
2. Compute the number of nodes that can be accessed from the seed set 𝑆
in the deterministic graph 𝜎2(𝑆)
3. The influence spread of the seed set is the expectation over all the
possible edge labels 𝑋, 𝜎 𝑆 = ∑2 prob[X] ⋅ 𝜎2(𝑆)
68
Example – IC and MC
Graph
𝐴̅ (2 ↛ 1)
1𝑝 1−𝑝>B ?B
23 11
𝑝>B
23
𝐴 (2 → 1)
1 23
𝐵 (3 → 1)
p21
p31
1
𝑝>B
23
𝑝?B
𝐵à ( 3 ↛ 1 )
IC 𝑝𝐴𝐵 =𝑝(!𝑝$!
𝑝 𝐴 𝐵à = 𝑝 ( ! 1 − 𝑝 $ !
𝑝 𝐴̅𝐵 = 1−𝑝(! 𝑝$!
𝑝 𝐴̅𝐵à = (1 − 𝑝(!)(1 − 𝑝$!)
1−𝑝?B 1−𝑝>B 1−𝑝?B
𝜎 2,3 =𝐸𝜎8 2,3 =R𝑝𝑋⋅𝜎8 2,3 :% :% :%
8
=3⋅𝑝 𝐴𝐵 +3⋅𝑝 𝐴𝐵à +3⋅𝑝 𝐴̅𝐵 +2⋅𝑝 𝐴̅𝐵à
= 2 + 𝑝(! + 𝑝$! − 𝑝(!𝑝$!
23
69
Example – IC and MC
Graph
𝐴̅ (2 ↛ 1)
1𝑝 1−𝑝>B ?B
23 11
𝑝>B
23
𝐴 (2 → 1)
1 23
𝐵 (3 → 1)
p21
p31
1
𝑝>B
23
𝑝?B
𝐵à ( 3 ↛ 1 )
MC
𝑝𝐴𝐵 =0
𝑝 𝐴 𝐵à = 𝑝 𝐴 𝑝 𝐵à | 𝐴 = 𝑝 ( ! ⋅ 1 𝑝𝐴̅𝐵 =𝑝𝐵𝑝𝐴̅|𝐵 =𝑝$!⋅1 𝑝𝐴̅𝐵à =1−𝑝(!−𝑝$!
1−𝑝?B 1−𝑝>B 1−𝑝?B
𝜎 2,3 =𝐸𝜎8 2,3 =R𝑝𝑋⋅𝜎8 2,3 2% 2% 2%
8
=3⋅𝑝 𝐴𝐵 +3⋅𝑝 𝐴𝐵à +3⋅𝑝 𝐴̅𝐵 +2⋅𝑝 𝐴̅𝐵à
= 2 + 𝑝(! + 𝑝$!
23
70
Greedy Algorithm for Influence Maximization
[Kempe et al. KDD 2003]
§ NP-Hardness: Influence maximization, i.e., given k, picking a seed set 𝑆 of size 𝑘 that maximizes the influence spread 𝜎w(𝑆), is NP-hard under propagation model 𝑚 (LT or IC)
§ Greedy algorithm – each time add a node 𝑢 with largest marginal gain of influence spread: 𝜎w 𝑆 ∪ 𝑢 − 𝜎w(𝑆), under the propagation model 𝑚
71
Greedy Algorithm for Influence Maximization
[Kempe et al. KDD 2003]
§ Theorem: If the function 𝜎w(⋅) is non-negative, submodular and monotone, the greedy algorithm has approximation ratio 1 − 1/𝑒
§ Non-negative: 𝜎3 ⋅ ≥ 0
§Monotone:𝜎3 𝑆∪{𝑣} ≥𝜎3 𝑆
§Submodular:∀𝑆⊂𝑇,∀𝑥∉𝑆,𝑇: 𝑓(𝑆∪{𝑥})−𝑓(𝑆) ≥ 𝑓(𝑇∪{𝑥})−𝑓(𝑇)
72
Submodular functions
§ Definition: A set function 𝑓(𝑆): 2 → R is submodular if ∀𝑆 ⊂ 𝑇, ∀𝑥 ∉ 𝑆,𝑇: 𝑓(𝑆∪{𝑥})−𝑓(𝑆) ≥ 𝑓(𝑇∪{𝑥})−𝑓(𝑇).
§ Also known as “diminishing returns” property. Adding a node earlier results in a marginal gain that is no less than adding it later
Image from MIT CSAIL: http://people.csail.mit.edu/stefje/mlss/cadiz_part1.pdf
73
Submodular examples
§ Are the following submodular functions? §Constantfunctions– 𝑓 𝑆 =𝑓 ∅ =𝐶 §Additivefunctions–𝑓 𝑋∪𝑌 =𝑓 𝑋 +𝑓(𝑌) § Min functions – 𝑓 𝑋 = min𝑋
§ Max functions – 𝑓 𝑋 = max𝑋
74
Submodularity of influence spread
§ Theorem: Let 𝑚 be a propagation model of LT or IC, 𝜎w(⋅) is monotone and submodular.
§ Stochastic model is annoying!àdeterministic models?
§ Choose active all edges at the beginning – activated edges are live edges
0.6 0.3
0.2
0.2
0.4
0.1
0.3 0.5
0.5
0.2
75
Submodularity of influence spread
§ Theorem: Let 𝑚 be a propagation model of LT or IC, 𝜎w(⋅) is monotone and submodular.
§ Let 𝑋 be the set denoting the label of edges (live or blocked), 𝜎 2 (𝐴) denotes the #nodes reachable from node set 𝐴 through live edges 45
Given two node sets 𝑆 ⊂ 𝑇, we want to show 𝜎2 (⋅) is submodular for a node 𝑣 ∉ 𝑆, 𝑇:
v 45
𝜎8(𝑆) :%
S T
𝜎2(𝑆∪{𝑣})−𝜎2(𝑆) ≥𝜎2(𝑇∪{𝑣})−𝜎2(𝑇) 45 45 45 45
𝜎8(𝑇) :%
76
Submodularity of influence spread
§ Theorem: Let 𝑚 be a propagation model of LT or IC, 𝜎w(⋅) is monotone and submodular.
§ Let 𝑋 be the set denoting the label of edges (live or blocked), 𝜎 2 (𝐴) denotes the #nodes reachable from node set 𝐴 through live edges 45
§ 𝜎2(⋅)issubmodularand𝜎 𝐴 =∑ prob[X]⋅ 𝜎2(𝐴) 45 45245
§ Non-negative linear combinations of submodular functions is also submodular! (Tutorial)
§ Thus, 𝜎45(⋅) is submodular
77
Computing influence spread
§ The greedy algorithm is efficient only if we have an efficient way to calculate the influence spread.
§ But computing the influence spread in IC and LT models are #P-hard.
§ Monte Carlo(MC) simulations for estimation.
§ Starts from the seed set S
§ Simulates the activation process w.r.t. the corresponding diffusion model § Outputs the number of activated users
78
Summary
§ Influence Maximization
§ Finding a set of influential nodes in the network
§ Two propagation models § Linear Threshold
§ Independent Cascade
§ Greedy algorithm – submodular function
§ Computing influence spread § Monte Carlo simulation
79
Summary for the Graph Algorithms
§ Web search – Fighting with spams
§ PageRank – dead-ends and spider traps
§ Biased PageRank – topic-specific, personalized pagerank § Link spam and link farm
§ Community detection – Finding people with strong relationships. § Edge betweenness
§ Graph cut
§ Modularity
§ Influence Maximization – Finding influential users in the network § Two propagation models: LT and IC
§ Greedy algorithm – submodular function
§ Computing influence spread – Monte Carlo simulation
80