Towards Big Graph Processing:
Applications, Challenges, and
Advances
PPT模板下载:www.1ppt.com/moban/ 行业PPT模板:www.1ppt.com/hangye/
节日PPT模板:www.1ppt.com/jieri/ PPT素材下载:www.1ppt.com/sucai/
PPT背景图片:www.1ppt.com/beijing/ PPT图表下载:www.1ppt.com/tubiao/
优秀PPT下载:www.1ppt.com/xiazai/ PPT教程: www.1ppt.com/powerpoint/
Word教程: www.1ppt.com/word/ Excel教程:www.1ppt.com/excel/
资料下载:www.1ppt.com/ziliao/ PPT课件下载:www.1ppt.com/kejian/
范文下载:www.1ppt.com/fanwen/ 试卷下载:www.1ppt.com/shiti/
教案下载:www.1ppt.com/jiaoan/ PPT论坛:www.1ppt.cn
Xuemin Lin
UNSW
What is a graph ?
• Ver$ces: a collec$on of en$$es
• Edges: connec$ons between ver$ces
v8v4
v1
v2 v3
v5
v7
v6
v9
Graphs and Classic Problems
Leonhard Euler in 1735:
• find a roundtrip through the city to cross each
bridge once and only once
• earliest (published) work on networks/graphs
1. Seven Bridges of Königsberg
Abstraction:
– replaced each land mass with an abstract “vertex” or node,
and each bridge with an abstract connection, an “edge”
Showed that this problem has no solution.
Euler Tour:
– necessary and sufficient conditions for the walk of the
desired form: connected and all vertices with even degrees.
1. Seven Bridges of Königsberg(cont.)
• Defini&on:
– A Graph is a collec&on of nodes joined by edges.
Graphs
• G=(V,E)
– V is a set of vertices, u,
v in V are vertices
– E is a set of edges, (u,v)
in E is an edge.(u,v)
• Undirected edges
– u–v: (u,v)=(v,u)
– e.g., u and v like each other; u and v are co-authors
• Directed edges(arcs)
– u�v: (u, v) != (v, u)
– (u, v) is an in-link (in-edge) of v, and an out-link (out-edge) of u.
– e.g., u likes v; u is the father of v.
• Other attributes
– weight (e.g. frequency of communication) – w(u, v)
– rank (e.g., best friend, second best friend, …)
– type (e.g., friend, co-worker, activates, inhibits, …)
Edges
• Degree
– d(v): the number of edges connected to a vertex
– In degree: number of incoming edges of a vertex
– Out degree: number of outgoing edges of a vertex
• If edges are weighted, the degree, in degree, out degree are the
total weights of edges, incoming edges, outgoing edges of a vertex.
• Other attributes:
– Weight: (e.g., importance, authority of a person)
– Type: (e.g., advisor, student, old people)
– Label: (e.g., name)
Ver?ces
• Problem:
– Given a list of ci$es and the distances between each
pair of ci$es, find the shortest route that visits each
city exactly once and returns to the origin city.
– In the theory of computa$onal complexity, the TSP
belongs to the class of NP-hard problems.
2. Travelling salesman
Thus, it is unlikely that the worst-case
running time for any algorithm for the
TSP is in PTIME.
• HeurisGc SoluGons:
– ConstrucGve heurisGcs: Nearest-neighbors (greedy),
spanning tree, bitonic tour, etc.
– for extremely large problems (millions of ciGes), within a
reasonable Gme which are with a high probability just 2–
3% away from the opGmal soluGon
2. Travelling salesman (cont.)
Widely used in several applications:
planning, logistics
the manufacture of microchips.
DNA sequencing, etc.
• Definition: can be
embedded in the plane
– i.e., it can be drawn on
the plane in such a way
that its edges intersect
only at their endpoints.
• Determination:
– does not contain a
subgraph that is a
subdivision of K5 (the
complete graph on five
vertices) or
K3,3((complete bipartite
graph on six vertices)
3. Planar graph
• Euler’s Formula:
if a finite, connected, planar graph is drawn in the plane
without any edge intersec&ons
v − e + f = 2. (v is the number of ver&ces, e is the number of
edges and f is the number of faces)
4. Euler’s Formula
• a.k.a. Geodesic paths:
– finding a path between two vertices (or nodes) in a graph
such that the sum of the weights of its constituent edges is
minimized.
• Solution:
– Single Source: Dijkstra
– All Pair: Floyd-Warshall.
5. Shortest Path
• Applications:
– Road Network Navigation
– n-Degree Separation in Social Networks
5. Shortest Path(cont.)
Six Degree of Separation
http://entitycube.research.microsoft.com
/6dumapsvg.aspx
http://entitycube.research.microsoft.com/6dumapsvg.aspx
• How to fast get shortest path between two nodes in a large
graph?
– Pre-computed small structure to maintain distances.
– Distance Oracle:
• Hop cover for each nodes’ reachable node list
– Intersect of two lists.
• Landmark based global summary structure.
– Tree traversal.
5. Shortest Path(cont.)
• Problem:
– finding a feasible flow through a single-source,
single-sink flow network that is maximum.
• Applications:
– Airline/pipeline/network scheduling
– Circulation-demand/logistics planning
– Social network Sybil/Spammer detection
6. Max-Flow
• Common Solu&ons:
– Linear Programming.
– Ford-Fulkerson: O(mn)
– Recent result: close to O (m)
6. Max-Flow (cont.)
• Problem:
– A vertex-cover of an undirected graph G=(V,E) is a
subset V’ of V, for edge (u,v) in G, u or v is in V’.
– Minimum Vertex Cover:
• NP-hard opGmizaGon problem.
• SoluGon:
– ApproximaGon algorithm:
• Repeatedly taking both endpoints of an edge into
the vertex cover and remove.
• Maximum Matching
7. Vertex Cover
Leonard Euler Seven
Bridges of Konisberg
1736
Gustav
Kirchoff
Number of spanning
trees
1845 Francis Guthrie
Four Color Theory
1852 ……
1878 Dénes Kőnig
Graph Theory Textbook
1936 Ramsey and Turán
Ramsey Number
1941
De Bruijn
De Bruijin Sequence
1959
Erdos, Renyi and Gilbert
…Random Graph Model
1959 Heinrich Heesch
Compute-aided-proof of
4-color theorem.
……
1969
Graph DB and Big
Graph
2003… Current
Node mapping
(Embedding)
Degree/Neighbors
(structural)
Distance/Path
(reachability)
propagation
Community
Detection
Link Analysis
(relevance)
Iterative
Pagerank
Shortest Path
K-hop
Degree correlation
Label Propagation
Community Detection
PPR
Structural Discovery
Cluster Coefficient
Triangle Courting
Centrality
Connected Component
Dense Subgraph (e.g. K-core, K-
Truss)
Others
Link prediction
Transport Model(IC,LT model
)
Maximal Influence
Node2vec
Rou$ngs
Features
Relations
Groups
Applica’ons Services
Adver&sing
Social Network
• Fake ID
• Sensi,ve User
• Fraudster
• Team-Fraud
•Rela,on
Tracking
•…
•Personal Value
• Friend
Recommend
•Community and
Organisation
Analysis
•…
•Real-time
Recommendation
• Forum selling
•Community
Detection
•…
•Routing
• Spatial-
Temporal
Analysis
•…
Insurance
Transportion
Algorithms
Applications
vApplications
vGraph Matching
vCohesive Subgraphs
vGraph Similarity and Clustering
vDistributed Graph Computation
vOther Graph Problems
vIndustry Collaborations
21Outline
Some Examples
q Fraud Detection
qProduct Recommendation
qRetail Services
q Investment Analysis
q Product Lifecycle Management
q Anti Money Laundering
q Cyber Security
…….
Fraud Detection: First-Party Bank Fraud
• Typical Scenario
– A group of two or more people organize into a fraud ring;
– The ring shares a subset of real contact, combining them to create a
number of synthetic identities;
– Ring members open accounts using these synthetic identities;
– New accounts are added to the original ones (e.g. credit cards);
– The accounts are used normally, with regular purchases and timely
payments;
– Banks increase the revolving credit lines over time, due to the observed
responsible credit behaviour;
– One day the ring “busts out”, coordinating their activity, maxing out all of
their credit lines, and disappearing.
Fraud Detection: First-Party Bank Fraud
Fraud Detection: Insurance Fraud
• Background:
– Fraudsters claim the insurance money via faking or exaggerating
injury, damage etc.
– The impact of fraud on the insurance industry is estimated to be $80
billion annually in the US.
– In the UK, insurers estimate that bogus whiplash claims add $144 per
year to each driver’s policy
Fraud Detection: Insurance Fraud
• Typical Scenario
– rings of fraudsters work together to stage fake car accidents and claim
soft tissue injuries.
– “Paper collisions”, with fake drivers, fake passengers, fake pedestrians
and even fake witnesses.
– Roles
• Providers
– Doctors: diagnose false injuries
– Lawyers: file fraudulent claims
– Body shops: misrepresent damage to cars
• Participants
– Drivers, Passengers, Pedestrians and Witnesses
Fraud Detection:
Insurance Fraud
No Possible to Join
Fraud Detection:
Insurance Fraud
The entities involved in
one of the rings.
Fraud Detection: Some Insights
• The data is grow so big and so complex that it is impossible to
go with relation database.
• Graph database can naturally model the complex relations of
data.
• The fraud detection can often reduce to the mining of certain
patterns:
– Rings (in the previous two examples)
– Bi-cliques
– More others.
Product Recommendation
• Goal
─ Online shopping, financial product
recommendation
• Solution
– Put customers, preferences, purchases,
products and locations etc. into a graph model
– Uses connections to make product
recommendations
• Challenge
– Serve many millions of customers and
products: scalability & efficiency
Retail Services:
Ecommerce Delivery Service Rou8ng
• Goal
– Enable delivery within 60 minutes to compete
with Amazon Prime
• Solution
– With graph, we can identify the fastest delivery
route requires support for complex routing
queries at scale with fast and consistent
performance
• Challenge
– Calculate best route option in real-time
– Offer more predictable delivery time
Retail Services:
Supply Chain Visibility
• Goal
– Visualize the supply chain and easily to explore for the clients
• Solution
– With graph, we can easily explore the supply chain with graph traversal operations
• Challenge
– the volume and structure of information to be processed are huge and complex
Investment Analysis
• A complete knowledge of an enterprise facilitate investment
analysis, which requires:
–Own profile: financial status, innova4on strength of long credit
–Ownership and investment rela4onships
• The stronger the owner and relevant, the higher the inves$ng
poten$al
• Linking all companies via their investment rela$ons requires graph
database
Investment Analysis
• Reveal an enterprise’s real controllers: the person who owns
the biggest equity share, directly or indirectly
Investment Analysis
• Enterprise path discovery
– Whether there are paths to reach new customers
– the potential paths from the competitors to the new customers
…
…
Investing company
Compe2tors
Invested by Blue
New Customers
Invested
Ooops, competitors approached first
Investment Analysis
• Multidimensional relationships discovery:
– competitors
– pattern transfer
– acquisition
– investment
• Why this can help?
– Difficulties of investing a target company? Try negotiate with the
competitors’ targets.
Product Lifecycle Management
• Product lifecycle management is the process of managing the entire lifecycle of a
product from design stage to development to go-to-market to retirement to disposal
Product Lifecycle Management
• Existing PLM systems rely on RDBMS
– Siloed data, inability to model real-life complexities
and to adapt to changes
– Inability to scale data model without costs and risks as
business evolves
– Search for information is time-consuming
– Missing insights regarding dependencies leads to poor
decisions and increase risks
Product Lifecycle Management
• Represent cross-department data
about products and processes as a
graph and store it as one
– Aggregate product hierarchies and
connections into a single source of truth
– Easily edit, or expand your model as
your need evolve
– Saving time by searching and exploring
your data
– Visually track dependencies between
entities and get contextual insights
Anti Money Laundering
Samsung Group
Anti Money Laundering:
Graph Model
• More than a pretty picture
• Graph structure for deep
quantitative analysis
Anti Money Laundering:
Graph Model
• Classical measures of centrality, like degree and betweeness,
and eccentricity:
– to correlate such measures with a propensity of fraud
• Complex circular money flows:
– the native graph structure can also be used to input, track and
calculate transactions across these chains.
– profit distributions flow from one company to another via many
different paths
Anti Money Laundering:
Some Insights
• Global-scale organisation fosters complex transaction flows
– can be circular, making it impossible to investigate ML using
traditional method.
– Graph not only can visual, but can model the data for deep analysis
• Graph properties such as eccentricity and betweenness are useful factors to
correlate the potential ML.
• Path, cycle and pattern analysis can facilitate the AML process
Cybersecurity
• Goal
─ A unified web framework to manage cyber attack relationships in the
network
─ Understand how cyber attackers can leverage initial footholds to extend
their reach through the network
• Challenge
─ correlate data from numerous sources into a common model
─ flexible and easy to extend, and map naturally to network attack
relationships
─ Support various kinds of network attack relationships queries
Cybersecurity:
MITRE Case Study
• Solution
• Graph model
Many thanks to my team
A/Prof. Wenjie Zhang
100+ ERA A* (CCF A) ranked papers
Associate Head of CSE, UNSW
Graph Data, Spatial-Temporal Data,
Algorithms
A/Prof. Ying Zhang
80+ ERA A* (CCF A) ranked papers
University of Technology Sydney (UTS)
Social Network, Streaming Data,
Algorithm
Dr. Lu Qin
60+ ERA A* (CCF A) ranked papers
Senior Lecturer of UTS
Big Graph Computing, Algorithm
Dr. Xin Cao
20+ ERA A* (CCF A) ranked papers
Senior Lecturer of UNSW
Social Network, Data Mining, Algorithm
q Dr. Yixiang Fang: Postdoctoral Fellow of UNSW,Social
Network, Big Graph Computing.
❑ Dr. Xing Feng: Postdoctoral Fellow r of UTS, Big Graph
Computing.
❑ Dr. Longbin Lai: Postdoctoral Fellow of UNSW, Big Graph
Computing.
q Dr. Long Yuan: Postdoctoral Fellow of UNSW, Big Graph
Computing.
q Many past students and current students
GRAPH PATTERN MATCHING
Introduction
Ø Subgraph Matching
Given a query q and a large data graph G, the problem is
to extract all subgraph isomorphic embeddings of q in G.
Introduction
Ø Subgraph Matching
Given a query q and a large data graph G, the problem is
to extract all subgraph isomorphic embeddings of q in G.
Introduction
Ø Subgraph Matching
Given a query q and a large data graph G, the problem is
to extract all subgraph isomorphic embeddings of q in G.
Introduction
Ø Subgraph Matching
Given a query q and a large data graph G, the problem is
to extract all subgraph isomorphic embeddings of q in G.
Graph Pattern Matching: summary
• Related Applications
– Fraud Detection, Anti Money Laundering, Cyber Security etc.
• Algorithms and Recent Publications
– Exact Match
• Tree Sequence (QuickSI), VLDB’08
• Core-Forest-Leaf Decomposition (CFL Algorithm), Sigmod’16
• Optimal Distributed Join-based algorithms, VLDB’15, VLDB’17
– Approximate Match
• Spanning-tree-based matching algorithm (TreeSpan), SIGMOD’12
– Supergraph Match
• Tree-index-based supergraph search algorithm, DGTree, ICDE’16
– Tree Match
• Lawler-procedure-based top-k tree searching algorithm, VLDB’15
Detecting Designated Communities
Applications:
Group of Terrorists, Anomaly detection etc.
Challenges:
• Subgraph Isomorphism is NP complete.
• Graph data and results can be huge (MB-scale
graph can produce PB-scale results).
Query Graph
Data Graph
Alibaba Big Graph Computing- RT Cycle Detection
Applications:
• Fraud detection
• Money laundering detection
Challenges:
• Large-scale dynamic graph (100-million-scale vertices, billion-scale edges)
• High demand on real-time processing: 10-50ms response time, while previous system
takes 50s
Solutions:
• Real-time cycle detection on each edge
• Light-weighted index structure (easy to maintain and update) and efficient query
algorithm
Real-data simulation:
• Detect 6-edge cycles within 10ms
Single CPU
QuickSI (VLDB2008)
q
g1 g2 g3
Synchronized Depth-First Traversal
1.Determine the access order in q
2.Detect corresponding subgraphs in g1, g2 which can be
mapped to the currently traversed vertices.
1
1
2
2
3 4
5
67
3 4
5
67
Forwarding
Backtracking
TurboISO (SIGMOD2013)
• Overview of TurboISO
ØNeighborhood Equivalence Class (NEC)
§ Merging vertices with same neighbor to reduce query size
§ Combine / Permute strategy
ØCandidate Region Exploration
§ Constructed on-the-fly based on q
§ A path-based data structure containing all embeddings of q in G
Neighborhood Equivalence Class(NEC)
• Definition
• Let be an equivalence relation over all query vertices in q such that,
if for every embedding m that contains and , there
exists an embedding m’ such that
• Example
u3 and u4 are NEC nodes .
CFL-Match (SIGMOD2016)
Matching order of QuickSI and TurboISO : (u1,u2,u3,u4,u5,u6).
Reduce Candidates
Cartesian products:
Ø 100 mappings (v0,v2, v1000+i, v2100 +i) (3 ≤ i ≤ 102) of (u1,u2,u3,u4)
Ø 1000 mappings (v0, vj) (3 ≤ j ≤ 1002) of (u1,u5)
105 – 100 partial mappings are
redundant.
(u1,u2,u5,u3,u4,u6)
Compression
● Originally proposed by Qiao et al. [7]
● Intuition
○ Subgraph enumeration can generate enormous (intermediate) results
○ Some vertices can be compressed as they are not needed in future
computation
■ Heuristics by [7]: the vertices that do not belong to the minimum vertex
cover (MVC)
v0
v1 v2
v3
Vertices to compress
Vertices in MVC
Distributed Subgraph
Enumeration
(VLDB2015 & VLDB2017)
Single CPU
QuickSI (VLDB2008)
q
g1 g2 g3
Synchronized Depth-First Traversal
1.Determine the access order in q
2.Detect corresponding subgraphs in g1, g2 which can be
mapped to the currently traversed vertices.
1
1
2
2
3 4
5
67
3 4
5
67
Forwarding
Backtracking
TurboISO (SIGMOD2013)
• Overview of TurboISO
ØNeighborhood Equivalence Class (NEC)
§ Merging vertices with same neighbor to reduce query size
§ Combine / Permute strategy
ØCandidate Region Exploration
§ Constructed on-the-fly based on q
§ A path-based data structure containing all embeddings of q in G
Neighborhood Equivalence Class(NEC)
• Definition
• Let be an equivalence relation over all query vertices in q such that,
if for every embedding m that contains and , there
exists an embedding m’ such that
• Example
u3 and u4 are NEC nodes .
CFL-Match (SIGMOD2016)
Matching order of QuickSI and TurboISO : (u1,u2,u3,u4,u5,u6).
Reduce Candidates
Cartesian products:
Ø 100 mappings (v0,v2, v1000+i, v2100 +i) (3 ≤ i ≤ 102) of (u1,u2,u3,u4)
Ø 1000 mappings (v0, vj) (3 ≤ j ≤ 1002) of (u1,u5)
105 – 100 partial mappings are
redundant.
(u1,u2,u5,u3,u4,u6)
Compression
● Originally proposed by Qiao et al. [7]
● Intuition
○ Subgraph enumeration can generate enormous (intermediate)
results
○ Some vertices can be compressed as they are not needed in future
computation
■ Heuristics by [7]: the vertices that do not belong to the minimum vertex
cover (MVC)
v0
v1 v2
v3
Vertices to compress
Vertices in MVC
Distributed Techniques
A Thriving Literature
Categorizing by Strategies
Related Works
PSgL [Shao et al., Sigmod 2014]
§ BFS Search strategy, Vertex-centric computation
§ Analogous to decomposing into stars and then processing
Star-based join.
§ Several issues including memory issues
BinaryJoin Strategy
● Divide the pattern graph into a set of join units { p1, p2, …, pk }
● Process k-1 join following specific join order
● We prove that CliqueJoin is worst-case optimal by showing that
it can be expressed as GenericJoin proposed by Ngo et al. [8]
StarJoin Algorithm
StarJoin
Round 1
Round 2
Round 3
Round 4
Trade-off for stars as join units
Not unusual: a node contains many
neighbors in a large social networks or
web graphs
A node with 1,000,000 neighbors => 𝟏𝟎𝟏𝟖
matches of a 3-star
q # of matches to a star is massive if many edges in a star.
q If star with many edges, a join followed has more pruning power; i.e.,
less # intermediate results of a join.
Star-based Join
𝑣!
𝑣” 𝑣#
𝑣$
𝑣!
𝑣” 𝑣#
𝑣$
𝑣” 𝑣#
𝑣$
𝑣#
𝑣$
𝑃
𝑝! 𝑝” 𝑝#
Star decomposition Left-deep Join
Star-based Join
Tradeoff
TwinTwigJoin Algorithm
StarJoin TwinTwigJoin
Round 1
Round 2
Round 3
Round 4
1v
2v 3v
4v
1v
2v 3v
4v
0p
2v 3v
4v
1p
3v
4v
2p
P
Star Decomposition
TwinTwig Decomposition
1v
2v
4v
2v 3v
4v
3v
4v1v
TwinTwig: A star of at most
two edges
TwinTwig Join VLDB15’
Given any star-based join, star, there always exists a
TwinTwig join, tt, s.t. the size of intermediate results of
tt is no larger than that of star.
TwinTwig Join VLDB15’
CliqueJoin Algorithm
StarJoin TwinTwigJoin CliqueJoin
Round 1
Round 2
Round 3
Round 4
The drawbacks of TwinTwig Join
Simple graph storage only supports using star (TwinTwig) as
join units, which can result in too many intermediate results
(e.g. a node of degree 1,000,000, O(10^18) stars, O(10^12) TwinTwigs)
Left-deep join can result in sub-optimal solution
General Algorithmic Framework
Graph Storage
§ Stored as (u; Gu) for each data node u
§ Gu is called the local graph of u such that:
(1) it is connected;
(2) ;
(3)
General Algorithmic Framework
The structure into which we decompose the pattern graph are called the join unit
§ e.g. Star in Star-based join, TwinTwig in TwinTwig Join
§ A structure p can be the join unit iff.
where stands for the matches of p in
General Algorithmic Framework
Decomposing
Left-deep tree Bushy tree
SEED VLDB2017
SCP (Star-Clique-Preserved) graph storage: Use star and clique as the join units
§ We can avoid using star if clique is an alternative
§ Shorter execution. The 6-clique can now be processed in one single round, instead of 7
rounds in TwinTwig Join
Bushy join plan: Optimality Guarantee
§ The search space of an optimal bushy join covers that of a left-deep join
SCP Graph Storage
SCP Graph, each local graph is denoted as
We include the edges among the neighbors into the local graph
Gu, so that all cliques with u can be fully computed within
Optimal Bushy Join
Notations
: the join plan to solve P
: the cost of the join plan
: estimated # matches of P in G (We apply
powe-law random graph model for the estimation,
while estimation can vary with applications)
We aim at finding a join plan for P , s.t is minimized
Optimal Bushy Join
A dynamic programming transform function:
References
1. Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. PVLDB, 5(9),
2012.
2. Y. Shao, B. Cui, L. Chen, L. Ma, J. Yao, and N. Xu. Parallel subgraph listing in a large-scale graph. In SIGMOD’14,
pages 625-636.
3. L. Lai, L. Qin, X. Lin, and L. Chang. Scalable subgraph enumeration in mapreduce. PVLDB, 8(10), 2015.
4. L. Lai, L. Qin, X. Lin, Y. Zhang, L. Chang, and S. Yang. Scalable distributed subgraph enumeration. PVLDB, 10(3),
2016.
5. F. N. Afrati, D. Fotakis, and J. D. Ullman. Enumerating subgraph instances using map-reduce. In Proc. of ICDE,
2013.
6. K. Ammar, F. McSherry, S. Salihoglu, and M. Joglekar. Distributed evaluation of subgraph queries using worst-case
optimal low-memory dataflows. PVLDB, 11(6), 2018.
7. M. Qiao, H. Zhang, and H. Cheng. Subgraph matching: On compression and computation. PVLDB, 11(2), 2017.
8. H. Q. Ngo, E. Porat, C. Re, and A. Rudra. Worst-case optimal join algorithms. J. ACM, 65(3), 2018.
9. H. Kim, J. Lee, S. S. Bhowmick, W.-S. Han, J. Lee, S. Ko, and M. H. Jarrah. Dualsim: Parallel subgraph enumeration
in a massive graph on a single machine. SIGMOD ’16, pages 1231{1245, 2016.
Optimal Enumeration: Efficient Top-k Tree
Matching
(VLDB2015)
90
Introduction
• Data Graph: a node-labeled directed graph G
– Query: a node-labeled rooted tree T
– Find k mappings from T to G with the minimum lengths.
S
C C
E S v5v4v3
v2v1C
S Eu3u2
u1
Query Tree T: Data Graph G:
C
S Ev4v3
v1 C
S Ev4v5
v2Top-1 Match
Length=2
Top-2 Match
Length=2
91
C
S Ev4v5
v1Top-3 Match
Length=3
Applications of Top-k Tree Matching
n Top-k tree matching strengthen the relevance of twig-pattern matching
results
– Twig-pattern matching is a core operation in XQuery over XML/Graph
data
– Exponential number of matches
• It is also used as a key building block to retrieve top-k graph
pattern matches
92
State-of-the-art Approach[Gou’08]
• Time Complexity of the existing techniques
– O(nT(dT + log k)) for enumerating each top-i match
• nT: number of nodes in T
• dT: the maximum node degree in T
• Our contribution in this paper: we propose a new
enumeration paradigm to enumerate each match in O(nT + log
k) time
93
A New Enumeration Strategy
• The solution space: all valid matches in V1 × V2 × … × Vn
– {u1,u2,…,un}: the set of nodes T
– Vi: the set of nodes in G having the same label as ui.
• Adapt Lawler’s Procedure to compute top-k matches
– Suppose (v1,v2,…,vn) is the top-1 match, then the entire solution space is divided into n subspaces
• (V1 − {v1}) × V2 × … × Vn
• {v1} × (V2 − {v2}) × … × Vn
• …
• {v1} × {v2} × … × (Vn − {vn})
– Top-2 match is the match with lowest score among best matches in these obtained subspaces.
How to efficiently compute the best match in a subspace?
94
Two Types of Subspaces
• Categorize the n subspaces into two types
– Let {v1} × {v2} × … x (Vi – Ui) × … × Vn be divided by (v1,v2,…,vn)
– Type-1: {v1} × {v2} × … × (Vi – Ui – vi) × … × Vn
» Only one such type of subspace
– Type-2: {v1} × {v2} × … × {vi} × … × (Vj – vj) × … × Vn
» (n – i) such type of subspaces
• For efficient computation, order the nodes
in T in a top-down fashion (e.g., u1, u2, u3, u4).
u3u2
u1
u4
95
An Optimal Approach
• Compute the score of the best match in a subspace
– One Type-1 subspace: takes time O(log k)
– (
State-of-the-art
126
● BFC-BS & BFC-IBS
(Adam) (Mark) (Shego) (Taylor)
(Doll) (Carpet)(Balm) (Wine)(Hat)
u0 u1 u2 u3
v0 v1 v2 v3 v4
c(v0,v1)= 1
BFC-IBS improves BFC-BS in two
aspects: (1) pre choosing the layer of
start-vertices to achieve a lower time
complexity; (2) using a hash map to
speed up the implementation.
(3) avoid counting a butterfly twice
(process the wedge (u, v, w) only if w.id
> u.id)
State-of-the-art
127
● BFC-BS & BFC-IBS
(Adam) (Mark) (Shego) (Taylor)
(Doll) (Carpet)(Balm) (Wine)(Hat)
u0 u1 u2 u3
v0 v1 v2 v3 v4
c(v0,v1)= 2
BFC-IBS improves BFC-BS in two
aspects: (1) pre choosing the layer of
start-vertices to achieve a lower time
complexity; (2) using a hash map to
speed up the implementation.
(3) avoid counting a butterfly twice
(process the wedge (u, v, w) only if w.id
> u.id)
State-of-the-art
128
● BFC-BS & BFC-IBS
(Adam) (Mark) (Shego) (Taylor)
(Doll) (Carpet)(Balm) (Wine)(Hat)
u0 u1 u2 u3
v0 v1 v2 v3 v4
c(v0,v1)= 3
BFC-IBS improves BFC-BS in two
aspects: (1) pre choosing the layer of
start-vertices to achieve a lower time
complexity; (2) using a hash map to
speed up the implementation.
(3) avoid counting a butterfly twice
(process the wedge (u, v, w) only if w.id
> u.id)
State-of-the-art
129
● BFC-BS & BFC-IBS
(Adam) (Mark) (Shego) (Taylor)
(Doll) (Carpet)(Balm) (Wine)(Hat)
u0 u1 u2 u3
v0 v1 v2 v3 v4
c(v0,v1)= 3 c(v0,v2)= 1
BFC-IBS improves BFC-BS in two
aspects: (1) pre choosing the layer of
start-vertices to achieve a lower time
complexity; (2) using a hash map to
speed up the implementation.
(3) avoid counting a butterfly twice
(process the wedge (u, v, w) only if w.id
> u.id)
State-of-the-art
130
● BFC-BS & BFC-IBS
(Adam) (Mark) (Shego) (Taylor)
(Doll) (Carpet)(Balm) (Wine)(Hat)
u0 u1 u2 u3
v0 v1 v2 v3 v4
c(v0,v1)= 3
= 3 + 0 + 0 = 3
⋈
v0
c(v0,v2)= 1 c(v0,v3)= 1
BFC-IBS improves BFC-BS in two
aspects: (1) pre choosing the layer of
start-vertices to achieve a lower time
complexity; (2) using a hash map to
speed up the implementation.
(3) avoid counting a butterfly twice
(process the wedge (u, v, w) only if w.id
> u.id)
State-of-the-art
131
● BFC-BS & BFC-IBS
(Adam) (Mark) (Shego) (Taylor)
(Doll) (Carpet)(Balm) (Wine)(Hat)
u0 u1 u2 u3
v0 v1 v2 v3 v4
BFC-IBS improves BFC-BS in two
aspects: (1) pre choosing the layer of
start-vertices to achieve a lower time
complexity; (2) using a hash map to
speed up the implementation.
(3) avoid counting a butterfly twice
(process the wedge (u, v, w) only if w.id
> u.id) = 3
⋈
v0 = 0
⋈
v1
State-of-the-art
132
● BFC-BS & BFC-IBS
(Adam) (Mark) (Shego) (Taylor)
(Doll) (Carpet)(Balm) (Wine)(Hat)
u0 u1 u2 u3
v0 v1 v2 v3 v4
BFC-IBS improves BFC-BS in two
aspects: (1) pre choosing the layer of
start-vertices to achieve a lower time
complexity; (2) using a hash map to
speed up the implementation.
(3) avoid counting a butterfly twice
(process the wedge (u, v, w) only if w.id
> u.id) = 3
⋈
v0= 4
⋈
𝐺 = 0
⋈
v1
c(v2,v3)= 2
= 1
⋈
v2
c(v2,v4)= 1
= 0
⋈
v3 = 0
⋈
v3
Time Complexity
O(min{∑u∈U(G) degG(u)
2,∑v∈L(G) degG(v)
2})
State-of-the-art
133
v1001
u0
v0 v1 v999 v1000
u1 u2 u1000 u1001
L(G)
R(G)
Issues in handling hub vertices:
State-of-the-art
134
v1001
u0
v0 v1 v999 v1000
u1 u2 u1000 u1001
L(G)
R(G)
Issues in handling hub vertices:
The existing algorithms need to go
through u0 (or v1000) as the middle-
vertex if choosing L(G) (or U(G)) to start.
Regardless of whether the upper or the
lower layer is selected to start, we must
traverse totally C2, 1000 (= 499,500) plus
1,000 different wedges by the existing
algorithms.
State-of-the-art
135
v1001
u0
v0 v1 v999 v1000
u1 u2 u1000 u1001
L(G)
R(G)
Issues in handling hub vertices:
The existing algorithms need to go
through u0 (or v1000) as the middle-
vertex if choosing L(G) (or U(G)) to start.
Regardless of whether the upper or the
lower layer is selected to start, we must
traverse totally C2, 1000 (= 499,500) plus
1,000 different wedges by the existing
algorithms.
State-of-the-art
136
v1001
u0
v0 v1 v999 v1000
u1 u2 u1000 u1001
L(G)
R(G)
Issues in handling hub vertices:
The existing algorithms need to go
through u0 (or v1000) as the middle-
vertex if choosing L(G) (or U(G)) to start.
Regardless of whether the upper or the
lower layer is selected to start, we must
traverse totally C2, 1000 (= 499,500) plus
1,000 different wedges by the existing
algorithms.
State-of-the-art
137
v1001
u0
v0 v1 v999 v1000
u1 u2 u1000 u1001
L(G)
R(G)
Issues in handling hub vertices:
The existing algorithms need to go
through u0 (or v1000) as the middle-
vertex if choosing L(G) (or U(G)) to start.
Regardless of whether the upper or the
lower layer is selected to start, we must
traverse totally C2, 1000 (= 499,500) plus
1,000 different wedges by the existing
algorithms.
Challenges
138
● It is a challenge to effectively handle high-degree vertices.
● It is also a challenge to utilize CPU cache to speed up butterfly counting.
Solutions
●Solution 1: The vertex-priority-based algorithm
● BFC-VP, to conquer challenge 1
139
● Solution 2: BFC-VP + cache aware techniques
BFC-VP++, to conquer challenge 1 & 2
The vertex-priority-based algorithm
140
Motivation: a hub vertex may not always
necessary to become a middle-vertex in a wedge
for the construction of a butterfly.
[u0, v0, u1, v1] in the Figure can be constructed
in two ways: 1) by the wedges (u0, v0, u1) and
(u0, v1, u1), or 2) by the wedges (v0, u0, v1) and
(v0, u1, v1).
The vertex-priority-based algorithm
• Traversal necessary wedges.
• Only process the wedges (u, v, w) with
p(u) > p(v) & p(u) > p(w).
• Count butterflies.
141
Time Complexity
O(∑(u,v)∈E(G) min{degG(u), degG(v)})
• Assign a priority to each vertex.
142
min{∑u∈U(G)degG(u)
2,∑v∈L(G)degG(v)
2}
∑(u,v)∈E(G)min{degG(u), degG(v)} ≤
∑u∈U(G)degG(u)
2 = ∑(u,v)∈E(G),u∈U(G)degG(u)
≥ ∑(u,v)∈E(G)min{degG(u), degG(v)}
∑v∈L(G)degG(v)
2 = ∑(u,v)∈E(G),v∈L(G)degG(v)
≥ ∑(u,v)∈E(G)min{degG(u), degG(v)}
Theorem:
Proof:
Cache aware techniques
143
● Cache aware wedge processing
● Cache aware graph projection
Improve CPU cache performance,
keep time complexity and space
complexity unchanged.
Cache aware wedge processing
144
The degree distribution of the end-
vertex-accesses on Tracker by BFC-VP
79% of total accesses are accesses
of low-degree vertices (i.e., degree < 500)
Since the locality of accesses is a key aspect of improving the CPU cache
performance, we hope the algorithm can access more high-degree vertices as end-
vertices.
Cache aware wedge processing
145
The degree distribution of the end-
vertex-accesses on Tracker using new
wedge processing strategy
New wedge processing strategy: processing the
wedges where the priorities of end-vertices are
higher than the priorities of middle-vertices and
start-vertices.
We proved that the total access of end-vertices
remaining unchanged - the time complexity
unchanged.
The percentage of accesses of high-degree vertices (i.e.,
degree > 2000) increases from 9% to 81%.
Cache aware graph projection
146
Generally, vertices are sorted by their ids when storing in the buffer.
Although end-vertices are mostly high-priority vertices, the distance
between two end-vertices (e.g., v0 and v3) can be very long.
Extensions
147
● Counting the butterflies for each edge
Extensions
148
● Parallel butterfly counting – shared memory parallelization
global data space
local
data
space
for
thread
1
local
data
space
for
thread
2
local
data
space
for
thread
3
…
Easily extern our algorithms into a parallel version using O(n*t+m) space.
Extensions
149
● External memory butterfly counting
2. For each wedge (u, v, w), store the vertex-pair (u, w) on disk.
3. Sequentially scan these vertex-pairs and for the vertex-pair (u, w), we count
the occurrence of it and compute using the following lemma.
1. Process wedges based on our strategy.
⋈
𝐺
OLAK: An Efficient Algorithm to Prevent Unravelling in
Social Networks
VLDB 2017
The collapse of Friendster, Inc.
• Founded in 2002.
• Popular at early 21st century with over 115 million users.
• Suspended in 2015 for lack of user engagement.
D. Garcia, P. Mavrodiev, and F. Schweitzer. Social resilience in online communities: the autopsy of friendster.
In COSN, pages 39–50, 2013.
K. Seki and M. Nakamura. The collapse of the friendster network started from the center of the core. In
ASONAM, pages 477–484, 2016.
Network Unraveling
The engagement of a user is influenced by the number of her engaged friends.
K. Bhawalkar, J. Kleinberg, K. Lewi, T. Roughgarden, and A. Sharma. Preventing unraveling in social networks:
the anchored k-core problem. SIAM Journal on Discrete Mathematics, 29(3):1452–1475, 2015.
Network Unraveling
An equilibrium: a group has the minimum degree of k
K. Bhawalkar, J. Kleinberg, K. Lewi, T. Roughgarden, and A. Sharma. Preventing unraveling in social networks:
the anchored k-core problem. SIAM Journal on Discrete Mathematics, 29(3):1452–1475, 2015.
Network Unraveling
A Social group tends to be a k-core in the network
K. Bhawalkar, J. Kleinberg, K. Lewi, T. Roughgarden, and A. Sharma. Preventing unraveling in social networks:
the anchored k-core problem. SIAM Journal on Discrete Mathematics, 29(3):1452–1475, 2015.
Network Unraveling
D. Garcia, P. Mavrodiev, and F. Schweitzer. Social resilience in online
communities: the autopsy of friendster. In COSN, pages 39–50, 2013.
The core number steadily increased.
• Founded in 2002.
• Popular at early 21st century with over 115 million users.
• Suspended in 2015 for lack of user engagement.
Network Unraveling
• Founded in 2002.
• Popular at early 21st century with over 115 million users.
• Suspended in 2015 for lack of user engagement.
K. Seki and M. Nakamura. The collapse of the friendster network started
from the center of the core. In ASONAM, pages 477–484, 2016.
The collapse started from the center of the core.
Network Unraveling
• Founded in 2002.
• Popular at early 21st century with over 115 million users.
• Suspended in 2015 for lack of user engagement.
J. Ugander, L. Backstrom, C. Marlow, and J. Kleinberg. Structural diversity
in social contagion. PNAS, 109(16):5962–5966, 2012.
Social influence is tightly controlled by the number of friends in
current subgraph, like k-core.
Network Unraveling
• Founded in 2002.
• Popular at early 21st century with over 115 million users.
• Suspended in 2015 for lack of user engagement.
F. D. Malliaros and M. Vazirgiannis. To stay or not to stay: modeling
engagement dynamics in social graphs. In CIKM, pages 469–478, 2013.
The degeneration property of k-core can be used to quantify
engagement dynamics.
Prevent Network Unraveling
𝑢! is called an anchor.
Prevent Network Unraveling
Follower: a node v is a follower of an anchor u, if v is not in k-
core but belongs to anchored k-core by anchoring u.
Anchored k-Core Problem: Given two integers k and b, find b
anchors to maximize the number of followers (i.e., maximize
the number of nodes in anchored k-core ).
Performance Evaluation
• Datasets:
• Environments:
• Intel Xeon 2.3GHz CPU and Redhat Linux System.
• All algorithms are implemented in C++.
Case Studies
Yelp is a crowd-sourced local business
review and social networking site.
DBLP is a computer science
bibliography website.
Effectiveness: Number of Followers
Efficiency
Diversified Top-k Clique Search
ICDE 2015
Clique and Maximal Clique
• Given a graph G, a clique is a set of nodes such that for
any pair of them have an edge
• A clique is called maximal clique if there exist no other
bigger cliques that contain it
v6
v5
v8
v9
v10v7
v0
v1
v4
v3v2
Traditional models for Maximal Clique
• Maximal Clique Enumeration:
• Top-K Maximal Clique:
• Enumerate all the maximal cliques in the graph
• Return top-k maximal cliques with largest size
• Exponential number of maximal cliques
• Large redundant and useless information
• Still contain large redundancy and hold little information as a
whole.
Diversified Top-K Clique
v6 v8
v9
v10
v7
v5
v1
v4
v3v2
v0
Traditional top-k clique high overlap
Diversified Top-K Clique
v6 v8
v9
v10
v7
v5
v1
v4
v3v2
v0
Diversified top-k clique: cover more vertices
Diversified Top-K Clique Search
• Diversified Top-K Clique:
• NP-hard Problem
• Given: a graph G and an integer k,
• Maximize |𝑪𝟏 ∪ 𝑪𝟐…𝑪𝒌|
• s.t 𝑪𝒊 (𝟏 ≤ 𝒊 ≤ 𝒌) is a maximal clique
• Advantage:
• Consider both size and diversity, provide a better query
result for users.
Baseline Solutions
• EnumAll
Phase 1: Enumerate all the maximal cliques in the graph(D.
Eppstein et al., SEA’11)
Phase 2: Greedily select k cliques from all the maximal cliques
that cover most nodes in the graph
• Problem
• Clique enumeration is a costly operation
• Keeping all maximal cliques in memory is infeasible
• The number of cliques is exponential to the number of nodes
Baseline Solutions
• EnumSub(sample-based enumeration)
Phase 1: Sample a subset of the maximal cliques in the graph(J.
Wang et al., KDD’13)
Phase 2: Greedily select k cliques from the sampling that cover
most nodes in the graph
• Problem
• It still outputs exponential number of maximal cliques without
a bound
Challenge
• generating all maximal cliques, and
• keeping all generated maximal cliques in memory
• Retain the result quality, while avoid:
• Main Idea
Our Approach: extending online k coverage problem
Input Graph Top-K Candidate Set
1. online model storing k maximal cliques in memory
2. update top-k candidate set when enumerate cliques
• replace small existing cliques with big new cliques
• Main Idea
Our Approach: extending online k coverage problem
Input Graph Top-K Candidate Set
1. online model storing k maximal cliques in memory
2. update top-k candidate set when enumerate cliques
• replace small existing cliques with big new cliques
• Main Idea
Our Approach: extending online k coverage problem
Input Graph Top-K Candidate Set
1. online model storing k maximal cliques in memory
2. update top-k candidate set when enumerate cliques
• replace small existing cliques with big new cliques
• Main Idea
Our Approach: extending online k coverage problem
Input Graph Top-K Candidate Set
1. online model storing k maximal cliques in memory
2. update top-k candidate set when enumerate cliques
• replace small existing cliques with big new cliques
• Main Idea
Our Approach: extending online k coverage problem
Input Graph Top-K Candidate Set
1. online model storing k maximal cliques in memory
2. update top-k candidate set when enumerate cliques
• replace small existing cliques with big new cliques
• Main Idea
Our Approach: extending online k coverage problem
Input Graph Top-K Candidate Set
1. online model storing k maximal cliques in memory
2. update top-k candidate set when enumerate cliques
• replace small existing cliques with big new cliques
Replacement Strategy
• which one to be replaced
• private set
• Cmin: clique with smallest private set
A
B
A C
C1
C2
C3
• what is the replacement condition
• |𝐶| > |𝐵| + 𝛼 1 |𝐴| + |𝐵| /𝑘
• 𝛼 is a parameter (0 < 𝛼 ≤ 1)
Advantages of Our Approach
• Guaranteed result quality
• Achieve a guaranteed approximation ratio of 0.25, and much better in
practice.
• Low memory consumption
• Instead of all maximal cliques, just K most promising candidates are
kept in memory
• Efficiency and Scalability ?
Cost Analysis
𝑂( |𝒜| ∗ 𝑘 ∗ 𝐶𝑚𝑎𝑥 + 𝑇𝑒𝑛𝑢𝑚(𝐺) )
• A naïve implementation of our approach:
• for each generated maximal clique C
• update top-k candidate set by C with replacement condition
• The time complexity is :
number of maximal cliques
size of maximum clique
time for maintaining top-k
candidate set
time for enumerating
maximal cliques
Cost Analysis
𝑂( |𝒜| ∗ 𝑘 ∗ 𝐶𝑚𝑎𝑥 + 𝑇𝑒𝑛𝑢𝑚(𝐺) )
• A naïve implementation of our approach:
• for each generated maximal clique C
• update top-k candidate set by C with replacement condition
• The time complexity is :
number of maximal cliques
size of maximum clique
time for maintaining top-k
candidate set
time for enumerating
maximal cliques
PNP-Index
Pruning Rules
PNP-Index
v3
v4
v5
v12
v6
v8
v11
v10
v9
v7
v1
v2
C2
C1
C3
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 |priv(C)|
c1 !1 !1 1 1 1 2
c2 1 1 1 !1 1 1
∗
c3 1 1 !1 !1 !1 3
|rcov(v)| 1 1 2 2 3 1 2 1 1 1 |cov|:10
PNP-Index
• Step 1: Check the replacement condition
v3
v4
v5
v12
v6
v8
v11
v10
v9
v7
v1
v2
C2
C1
C3
C4
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 |priv(C)|
c1 !1 !1 1 1 1 2
c2 1 1 1 !1 1 1
∗
c3 1 1 !1 !1 !1 3
|rcov(v)| 1 1 2 2 3 1 2 1 1 1 |cov|:10
c4 1 1 1 1 1
PNP-Index
• Step 1: Check the replacement condition
v3
v4
v5
v12
v6
v8
v11
v10
v9
v7
v1
v2
C2
C1
C3
C4
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 |priv(C)|
c1 !1 !1 1 1 1 2
c2 1 1 1 !1 1 1
∗
c3 1 1 !1 !1 !1 3
|rcov(v)| 1 1 2 2 3 1 2 1 1 1 |cov|:10
c4 1 1 1 1 1
𝟑 > 𝟏 + 𝟎. 𝟓 ×𝟏𝟎 ÷ 𝟑 = 𝟐. 𝟔𝟕
𝜶
PNP-Index
• Step 2: Delete C2
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 |priv(C)|
c1 !1 !1 !1 !1 1 4
c3 1 !1 !1 !1 !1 4
|rcov(v)| 1 1 1 1 2 1 1 1 1 |cov|:9
v3
v4
v5
v12
v6
v8
v11
v10
v9
v7
v1
v2
C2
C1
C3
PNP-Index
• Step 3: Insert C4
v3
v4
v5
v12
v6
v8
v11
v10
v9
v7
v1
v2
C1
C3
C4
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 |priv(C)|
c1 !1 !1 !1 !1 1 4
c4 !1 1 1 !1 !1 3
c3 1 1 !1 1 !1 2
∗
|rcov(v)| 1 1 1 1 2 1 2 1 2 1 1 1 |cov|:12
PNP-Index
• An naïve implementation for candidate set maintenance needs
𝑂(|𝒜| ∗ 𝑘 ∗ |𝐶”#$|)
• With the help of PNP-Index, our algorithm can only take 𝑂(∑% ∈𝒜 |𝐶|)
time
Pruning Rules
• Global Pruning: based on k-core and graph coloring
• Local Pruning: check the current coverage of the neighbourhood
• Initial Candidate Computation: increase the power of the above
pruning processes
Performance Evaluation – Efficiency
Vary k
• Vary k (top-𝑘 value) and report total processing time
Performance Evaluation – Effectiveness
• Vary k(top-𝑘 value) and report covered nodes
Vary k
Index-based Optimal Algorithms for Computing
Steiner Components with Maximum Connectivity
SIGMOD 2015
k-edge connected components
• Computing k-edge connected components plays a vital
role in graph-based analysis
■ A graph is k-edge connected if it is still connected after
removing any set of (k-1) edges from it
11
9 10
2-edge connected components
2
1
6
8
7
3
4
5
Steiner Maximum-Connected Component (SMCC) and Steiner-Connectivity
• SMCC of q: maximal subgraph of G containing q⊆V(G)
with maximum connectivity.
■ Steiner-connectivity of q: the connectivity of the SMCC
of q. Denoted sc(q).
Steiner Maximum-Connected Component (SMCC) and Steiner-Connectivity
3 5
4
2
1
6
8
7
11
9 10
the SMCC of {3,4,5}
Challenges
■ Challenge-I: connectivities of subgraphs are not monotonic.
3 5
4
2
1
6
8
7
g1
g2
g3
Challenges
• Challenge-II: enumerating subgraphs requires
exponential time.
■ Enumerate all subgraphs containing q, and choose the maximal one with
maximum connectivity.
Naive Solution
• SMCC of q is the k-edge connected component of G
containing q with the maximum k
• Compute k-edge connected components by
enumerating k from |V| to 1.
■ Still expensive, because it needs to traverse the entire graph
G multiple times
Overview of Our Approach
• Recall sc(q): steiner-connectivity of q
■ sc(u,v): steiner-connectivity of {u,v}
• Key observations:
■ 1: for a fixed u∈q, sc(q) = minv∈q sc(u,v).
■ 2: for a fixed u∈q, SMCC of q is the set of vertices v satisfying
sc(u,v)≥sc(q).
Framework
• Phase-I: offline index construction
■ Step-1: compute sc(u,v) for all edges (u,v) in G
■ Step-2: compute index based on these sc(u,v)
• Phase-II: online query processing
■ Step-1: compute sc(q)
■ Step-2: start from any vertex u∈q to extend result set to include all vertices v
with sc(u,v)≥sc(q)
Index Construction
• Compute sc(u,v) for every edge (u,v) in G, instead of
for all vertex pairs
Time complexity: O(α(G)×kECC(G)), where α(G) is the arboricity of a
graph G, kECC(G) is the cost of computing kECC of G [SIGMOD 13].
■ Algorithms of computing k-edge connected components for a
fixed k
■ Deterministic algorithm: O(h×l×|E|)
■ By Chang. et al in SIGMOD’13
■ Randomized algorithm: O(t×|E|)
■ By Takuya Akiba, Yoichi Iwata, Yuichi Yoshida in CIKM’13
Index Construction
2
1
4
3 5
44
4
4
4
4
4 4
9
87
6
3 3
3
3
33
1110
3 1312
3
3
3
3 3
2
3
3
3 2
Compute sc(1,7)
• sc(u,v) equals the maximum among the minimum weights of
(edges in) all paths between u and v.
Index Construction
• Index structure: the
maximum spanning tree
(MST) T of the weighted
graph.
■ Time complexity: O(|E|).
■ sc(u,v) equals the minimum weight in the
path between u and v in the MST T.
2
1
43 5
4 4 44
9
8
7
6
3
3
3
3
11
10
13
12
3
2 3
3
Compute sc(1,7)
Query Processing: Steiner-Connectivity Query
Algorithm: find the
spanning subtree Tq of T
connecting vertices in q,
and return the minimum
weight in Tq as sc(q).
Time complexity: O(Tq)
2
1
43 5
4 4 44
9
8
7
6
3
3
3
3
11
10
13
12
3
2 3
3
Compute sc(12, 9, 4)
Query Processing: SMCC Query
Algorithm: after obtaining sc(q), extend the subgraph
from q by visiting edges with weights at least sc(q).
Time complexity: O(Vq), where Vq is the result set.
Performance Studies
• Statistics of Data
SMCC Query
SMCC-BLE: baseline algorithm
SMCC-OPT: optimal algorithm
Cohesive Subgraph Search: summsry
• Related Applications
– Product Recommendation, Investment Analysis etc.
• Algorithms and Recent Publications
– Clique
• In-memory approximate algorithm, ICDE 2015
• Graph partition based I/O efficient search algorithm, VLDBJ 2016
– K-ECC
• Graph decomposition based algorithm ,SIGMOD 2013
• I/O efficient algorithm to compute the k-edge connected component for all k values, VLDBJ
2015
– Steiner Maximum-Connected Component
• Index-based in-memory algorithm, SIGMOD 2015
• Index-based semi-external algorithm, VLDBJ 2017
Cohesive Subgraph Search
• Algorithms and Recent Publications
– K-Core
• Pruning based search algorithm, ICDE 2018
• upper and lower bound based pruning rules on uncertain graph, ICDE 2018
• I/O efficient core decomposition, ICDE 2016 (Best Paper Award).
– Critical Users
• Iteratively find a best user to anchor based on the heuristics, such that the
corresponding subgraph (e.g., k-core) is the largest, VLDB 2017, ICDE 2018.
• Iteratively find a best user and delete, such that the corresponding k-core is the
smallest, AAAI 2017.
– Multi-dimensional Subgraph Search
• Solid pruning rules based algorithm on search tree with fast search orders, VLDB
2017.
Algorithm Analysis
Bounded Memory.
– Follow the semi-external model and require only O(n) memory.
Read I/O Only.
– Only require read I/Os by scanning the node and edge tables sequentially on disk.
Simple In-memory Structure and Data Access.
– Use two vectors to save the core number and count value for each node.
K-Edge Connected Component Search
K-Edge Connected Component:
• K-edge Connected:a graph is connected after
the removal of any (k-1) edges
• Maximal subgraph which is k-edge connected
Problem Definition:
• Given a graph G and an integer k, computes
the k-edge connected component of G
v1
v2 v4
v3 v5
v6
v7
v9
v8
v10 v11
v12 v133-ECC
Applications:
• Social behaviour mining
• Community detection
• Graph visualization
Challenges:
• Input graph is very large
Publications:
• Graph decomposition based algorithm ,SIGMOD 2013
• I/O efficient algorithm to compute the k-edge
connected component for all k values, VLDB 2015
Contributions:
• Our in-memory algorithm outperforms the state-of-
the-art by several orders of magnitude
• I/O efficient algorithm can handle billion-edges scale
graphs on a single consumer grade computer
Steiner Maximum-Connected Component Search
Problem Definition:
• Given a set of nodes in G, compute the k-edge
connected component with the maximum k
value that contains the given nodes (Steiner
Maximum-Connected Component)
v1
v2 v4
v3 v5
v6
v7
v9
v8
v10 v11
v12 v13
Applications:
• Potential customer prediction
• Product promotion
• Research team assembling
Challenges:
• Input graph can be very large
Publications:
• Index-based in-memory algorithm,SIGMOD 2015
• Index-based semi-external algorithm,VLDBJ 2017
Contributions:
• Our in-memory algorithm outperforms the state-of-
the-art by several orders of magnitude
• I/O efficient algorithm can handle billion-edges scale
graphs on a single consumer grade computer
Efficient Computing of Radius-Bounded k-Cores
Applications:
• Personalized event recommendation
• Social marketing
Challenges:
• The location of the radius-bounded circle of a RB-k-core is
unknown.
• It is time-consuming to verify all the candidate subgraphs
individually.
Publications:
• Pruning based search algorithm, ICDE 2018
Contributions:
• Our techniques can be applied to solve the SAC search problem
published in VLDB 2017 and achieve a speed-up twice.
Problem definition:
• Given a graph G, the k-core of G is a maximal
subgraph where each node has at least k
neighbors in the subgraph.
• Given a geo-social network (as shown above), a
query vertex q (Leo), a radius r (2km) and k (3),
find k-cores containing the query vertex q
covered by circles with radius r.
K-Core Search on Uncertain Graphs
Applications:
• Cohesive subgraph detection
• Vital vertices detection
• Influential communities detection on social network
Hardness:
• NP-hardness problem
• Prohibitive to enumerate all possible result sets
Algorithm and Published Paper:
• Search algorithm based upper and lower bound pruning rules, ICDE
2018
Contributions:
• Advanced algorithm about one order faster than baseline on average
• Propose a new k-core model on uncertain graph, where result sets
outperform previous models on average influences test of social
networks
Problem Definition:
• Given an uncertain graph(edge with
probability), find a vertex set where the
probability of every node to be in a k-core is no
less than 𝜃(given by users)
• Used in undirected graph and easily to extend
to directed graph
Example:
• P(d is not a 2-core) = P(d not in 2-core(d,f,g)
and not in 2-core(c,d,e)) = 0.766
• (c,d,e) is the result set of k=2,𝜃=0.23
c
d e
Find Critical Users to Prevent Network Unraveling
Applications:
• Efficiently find the critical users to
reinforce the network.
Algorithms and Publications:
• Iteratively find a best user to anchor
based on the heuristics, such that the
corresponding subgraph (e.g., k-core) is
the largest, VLDB 2017, ICDE 2018.
Contributions:
• The proposed algorithms outperform the
state-of-the-art algorithms by 3 orders of
magnitude in runtime.
Example:
• A vertex represents a user.
• An edge represents there is a
relationship between two users.
• The number of vertices in 3-core
represents the stability of
network.
• Encourage the engagement of
𝑢!can greatly enlarge the 3-
core, as the anchored 3-core by
𝑢!shows.
• In Yelp, the
engagement of
Caley can enhance
the engagement of
other 31 users.
Problem:
• k-core: a maximal subgraph where each vertex is adjacent to at
least k vertices in the subgraph.
• Given a network G, find b critical users whose engagement can
significantly prevent network unraveling, i.e., find b vertices
whose persistent existence leads to the k-core with the largest
number of vertices.
• NP-hard, NP-hard to approximate.
Find Critical Users to Reinforce Communities
Example:
• A vertex represents a user.
• An edge represents there is a
relationship between two users.
• The number of vertices in 3-core
represents the stability of
network.
• The leave of 𝑢!!can greatly
collapse the 3-core, as the
collapsed 3-core by 𝑢!! shows.
• In DBLP, the co-author
network, the leave of
Ying Li will break the
engagement of other 74
users (who also leave the
k-core).
Problem:
• k-core: a maximal subgraph where each vertex is adjacent
to at least k vertices in the subgraph.
• Given a network G, find b critical users whose engagement
can significantly reinforce the communities, i.e., find b
vertices whose leave leads to the k-core with the smallest
number of vertices.
• NP-hard.
Applications:
• Efficiently find the critical users to
reinforce the communities.
Algorithms and Publications:
• Iteratively find a best user and delete,
such that the corresponding k-core is the
smallest, AAAI 2017.
Contributions:
• Our algorithm outperforms the state-of-
the-art algorithm by 4 times in runtime.
GRAPH SIMILARITY AND CLUSTERING
Graph Similarity and Clustering: Summary
• Related Applications
– Product Recommendation, Investment Analysis etc.
• Algorithms and Recent Publications
– Graph Edit Distance
• Path-match-based filtering algorithm, ICDE’12
• Subgraph-match-based filtering algorithm, VLDB’14
– SimRank
• Adjustable clustering strategy for SimRank (OIP-SR), ICDE’13
• HyperLink-based SimRank compute (MEMO-SR), VLDB’13
• Incrementally SimRank compute (Inc-SR), ICDE’14
– Graph Clustering
• Core clustering and non-core clustering (pSCAN), ICDE’16
• Dynamic maintenance of clustering structures (dSCAN), TKDE’17
• Similarity-sequence-index-based algorithm (GS-SCAN) VLDB’18
Graph Similarity – Edit distance
Applications:
• Chemistry: Find similar organic compound.
• Biology: Find similar protein-DNA interactions.
• Finger print recognition: Identify criminal suspect.
Challenges:
• Graph edit distance problem is NP hard.
Algorithms and publications
• Path-match-based filtering algorithm, ICDE’12
• Subgraph-match-based filtering algorithm, VLDB’14
Contributions:
• Subgraph-match-based algorithm is 40 times faster than previous
work.
Problem Definition:
• Edit:for Vertex: Add/Delete/Alter Label;
for Edge: Add/Delete/Alter Label;
• Given a graph pair (r, s), their edit distance is
defined as the minimum edit operations to turn
r into s.
• Given two graph sets (R, S), searching for all
pairs of (r, s) such that their edit distance is
within a given threshold.
Levodopa
Droxidopa
Graph Similarity – SimRank
Applications:
Collaborative filtering, Page rank, Graph clustering, Link prediction
Challenges:
• Hard to find proper ranking metrics in practice.
• The complexity is O(Kmn), where K is the iteration, m is the number of
edges, and n is the number of vertices.
• Hard to incrementally update while graph data changes
Algorithms and publications
• Adjustable clustering strategy for SimRank (OIP-SR), ICDE’13
• HyperLink-based SimRank compute (MEMO-SR), VLDB’13
• Incrementally SimRank compute (Inc-SR), ICDE’14
Contributions:
• OIP-SR improves the complexity, with 10 times of performance gain.
• MEMO-SR enriches the semantics of SimRank.
• Inc-SR can apply to dynamic graph.
Problem definition:
• In terms of structure, two vertices are similar
when their neighbours also similar.
• Given two vertices a, b, and their in
neighbours, I(a), I(b), we can compute their
SimRank as:
Graph Clustering
Applications:
Community detection, Graph partition, Graph visualisation, Hidden
structure detection, Gene array.
Challenges:
• Costly to compute pair-wise similarities.
• Parameters are hard to adjusted.
• Graph data is large, and evolving.
Algorithms and publications:
• Core clustering and non-core clustering (pSCAN), ICDE’16
• Dynamic maintenance of clustering structures (dSCAN), TKDE’17
• Similarity-sequence-index-based algorithm (GS-SCAN) VLDB’18
Contributions:
• pSCAN is over 10 times faster than the state-of-the-art (SCAN++)
• GS-SCAN proposes a novel data structure, resulting in quite clustering
regardless of the parameters, and is 10-1000 times faster than pScan.
Problem Definition
• Given a graph, divide the vertices to
different groups based on their similarities.
DISTRIBUTED GRAPH COMPUTING
Distributed Graph Computing – SGC Model
Applications:
• The SGC algorithms can easily scale out, that is the performance of
the algorithm increases nearly linearly with the number of machines
Challenges:
• Hard to design a SGC algorithm, especially targeting O(logn) iterations
Algorithms and publications:
• Scalable Graph Computing Model (SGC), SIGMOD’14
Contributions:
• Design and implement two SGC operators, N and E
• Implement many graph algorithms using N and E in MapReduce
• Empirically prove that it reaches better scalability than the state-of-
the-art
Scalable Graph Computing (SGC):
• m: #edges,n: #vertices,t: #machines
• A distributed graph algorithm belongs to SGC
if it satisfies:
Metrics per machine Total
Disk 𝑂(m/t) 𝑂(m)
Mem 𝑂(1) 𝑂(𝑡)
Comm. 𝑂(m/t) 𝑂(𝑚)
CPU 𝑂(m/t) 𝑂(𝑚)
Iteration lo𝑔(𝑛)
Distributed Graph Computing – Connected Component
Applications:
• CC is the initial steps of many graph computations.
• Aid the analysis of big graphs, help to reveal the anchor vertices.
Challenges:
• Big graph (billion-scale vertices)
Algorithms and publications:
• CC based on graph decomposition and DCC based on label
propagation,ICDE’16
Contributions:
• Novel algorithmic framework that reduces a lot of communication
• 50 times faster than the state-of-the-art
Problem definition:
• Connected component (CC): the maximal
subgraph that is connected
• Double connected component (DCC): the
maximal subgraph that is a connected
component after removing any vertex.
• Given a large graph, compute CC and DCC
Big Graph
• Algorithms and Recent Publications
– Distributed Graph Computing Model
• Scalable Graph Computation Model, SIGMOD’14
– Distributed Connected Component
• CC based on graph decomposition and DCC based on label propagation,
ICDE’16
– Distributed Scalable Subgraph Enumeration
• Optimal Distributed Join-based algorithms, VLDB’15, VLDB’17
Other Graph problems
Other Graph Problems – Coloring, Eccentrality,
Shortest Path, etc.
• Related Applications
– Product Recommendation, Investment Analysis, Anti Monty Laundering,
Retail Service etc.
• Algorithms and Recent Publications
– Graph Coloring
• Dynamic graph colouring based on local update, VLDB 2018
– Eccentricity computation
• Local search algorithm based on lower and upper bounds, ICDE 2018
– Shortest Path
• Two-hop-labelling shortest path query on road network, SIGMOD 2018
• Efficient Top-k shortest path join, EDBT 2015
Graph Colouring
Graph colouring:
• Colour the graph with minimum colours such that
any two neighbouring vertices bear different
colours.
Problem definition:
• Process graph colouring when the graph is
constantly changing (insertion and deletion)
Applications:
• Channel assign in wireless network
• Airline control and management
• Community detection in social network
• An initial step for the other graph, such as maximum
clique and minimum cut
Challenges:
• NP hard
• Large graph data
• Frequent update
Algorithms and publications:
• Dynamic graph colouring based on local update,VLDB
2018
Contributions:
• The algorithm uses ½ less colours than previous work,
and offers ms response to graph update.
Eccentricity computation
Problem definition:
• The eccentricity of a vertex in the graph is the longest
shorted distance this vertex can reach the other vertices.
• It requires to solve APSP problem for computing the
eccentricity of all vertices in the graph, while the error
rate is high in small-world graph model.
Applications:
• Vertex ordering (via their importance)
• Compute some other graph features (diameter, radius
etc.)
Algorithms and publications:
• Local search algorithm based on lower and upper
bounds, ICDE 2018
Contributions:
• Three orders of magnitude faster than previous work.
Example:
• V8‘s eccentricity is 4, as longest shortest
distance from V8 is 4, towards V11.
• The smaller the eccentricity is, the more
centric (darker) the vertex is.
Big Graph Characteristics
Big
Data
Volume
• Petabytes
• Records
• Transactions
Velocity
• Batch
• Real time
• Streaming
Variety
• Structured
• Unstructured
• Semi-
structured
Challenges and Opportunities
New Graph Analytics Models
New Processing Algorithms & Indexing Techniques
New Computing Platform/Architecture
Major Research Issues
Graph Processing System
§ Primitive operators
§ Query language
§ Distributed techniques
§ Storage
§ etc
Research Collaborations: Huawei Cohesive Subgraph
Exploration
❑ Cohesive subgraph:
❖ Subgraphs whose internal vertices are
densely connected, while there are few
edges in between.
❖ Core member detection, friend
recommendation, community detection.
❑ Main Challenges:
❖ Big Graph Data, and computing
complexities
❖ High-dimensional graph data:
Heterogeneous edges
❖ Dynamic Graph: Graph data is evolving
One of the applications on Huawei public cloud
Alibaba Big Graph Computing – FLASH Language
❑ FLASH Query Language:
❖ Only three operators: Filer, LocAl, PuSH
❖ Great expression power: Capable of expression over 100 graph computing cases.
❖ Convenient Programming: Programming most cases within 10 lines.
❑ Progress:
❖ Prototyping system are deployed for production use (in distributed context)
❖ System optimizations and tunings
Alibaba Big Graph Computing – Biclique Fraud detection
“Clients”
Challenges:
• Computationally intensive: NP Complete
• Big Data:100-million-scale users and products, billion-scale purchasing edge
• Dynamic Data: Frequent buying and selling on Taobao
Solutions: In the client-product bi-graph, fake ordering is modelled as a bi-clique,
which infers that there exists a potential fake ordering when a group of people
simultaneously purchase a set of products.
Alibaba’s “Double 11” shopping festival use case:
• Efficiency: Reduce the computation on billion-scale graph from days to hours
• Effectiveness: Recall over 60% issued deals, improved traditional approach by 40%
Products
Alibaba Big Graph Computing- RT Cycle Detection
Applications:
• Fraud detection
• Money laundering detection
Challenges:
• Large-scale dynamic graph (100-million-scale vertices, billion-scale
edges)
• High demand on real-time processing: 10-50ms response time, while
previous system takes 50s
Solutions:
• Real-time cycle detection on each edge
• Light-weighted index structure (easy to maintain and update) and
efficient query algorithm
Real-data simulation:
• Detect 6-edge cycles within 10ms
Graph System
• Graph system should support
– Traversal Queries
• Explore the neighbouring information of given nodes in real time.
– Iterative Computation:
• BFS, Page rank, Shortest path, Connected component, Label propagation etc
– Graph Analytics
• Graph pattern matching, Community detection, Similarity, Graph Clustering
etc.
Graph System
• Existing systems support each type of query very efficiently, but
none covers the whole picture
– Combine several systems
• Complexities of maintenance are rising
• Overheads of data transformation
– Simulate graph processing in general engine
• Spark’s GraphX, Flink’s Gelly, etc.
• The performance is not competitive to the graph-specific systems
Graph Processing System Development
Distributed Graph Storage
High-performance Data Engine
Basic Graph Operators
Traversal Queries Iterative Computation Graph Analytics
Query Languages and Visualisation
Our Publications
1. F. Bi, L. Chang, X. LIN, W. Zhang, An Optimal and Progressive Approach to Online Search of Importance-based Top-K Communities, to appear
in VLDB2018.
2. B. Lv, L. Qin, X. LIN, L. Chang, J. Yu, Supergraph Search in Graph Databases vis Hierarchical Feature-Tree, to appear in TKDE (accept in April,
2018. Subbmited before my EIC term.)
3. D. Wen, L. Qin, Y. Zhang, X. LIN, J. Yu, I/O Efficient Core Graph Decomposition: Application to Degeneracy Ordering, to appear in TKDE (Best
Paper Award in ICDE 2016).
4. D. Ouyang, L. Qin, L. Chang, X. LIN, Y. Zhang, Q. Zhu, When Hierarchy Meets 2-Hop-Labeling: Efficient Shortest Distance Queries on Road
Networks, to appear in SIGMOD 2018.
5. J. Yang, W. Zhang, S. Yang, Y. Zhang, X. LIN, L. Yuan, Efficient Set Containment Join, to appear in VLDB Journal (accepted in March, 2018).
6. K. Wang, X. Cao, X. LIN, W. Zhang, L. Qin, Efficient Computing of Radius-Bounded k-Cores, to appear in ICDE 2018.
7. Y. Peng, Y. Zhang, W. Zhang, X. LIN, L. Qin, Efficient Probabilistic K-Core Computation on Uncertain Graphs, to appear in ICDE 2018.
8. F. Zhang, Y. Zhang, L. Qin, W. Zhang, X. LIN, Efficient Reinforcing Social Networks over User Engagement and Tie Strength, to appear in
ICDE2018.
9. J. Qin, Y. Wang, C. Xiao, W. Wang, X. LIN, Y. Ishikawa, GPH: Similarity Search in Hamming Space, to appear in ICDE2018.
10. W. Li, M. Qiao, L. Qin, Y. Zhang, L. Chang, X. LIN, Exacting Eccentricity for Small-World Networks, to appear in ICDE2018.Y. Chen, X. Zhao,
X. LIN, Y. Wang, D. Guo, Efficient Frequent Sungraph Mining on Uncertain Graphs, to appear in TKDE (accepted in Jan, 2018, submitted before
my EIC term).
11. W. Yu, X. LIN, W. Zhang, and J. McCann. Dynamical SimRank Assessment on Time-Varying Networks. to appear in VLDB Journal. (33 pages,
Accepted in OCT 2017).
12. X. Zhao, C. Xiao, X. LIN, Wenjie Zhang, Yan Wang, Efficient Structure Similarity Searches: A Partition-Based Approach, to appear in VLDB
Journal.
13. W. Liu, Z. Liu, I. W. Tsang, W. Zhang, X. LIN, Doubly Approximate Nearest Neighbor Approximation, to appear in AAAI 2018
14. L. Yuan, L. Qin, X. LIN, L. Chang, W. Zhang, Effective and Efficient Dynamic Graph Coloring, to appear in VLDB 2018.
15. D. Wen, L. Qin, L. Chang, Y. Zhang, X. LIN, Efficient Structural Graph Clustering: An Index-Based Approach, to appear in VLDB2018.
References:
Our Publications
16. F. Zhang, Y. Zhang, L. Qin, W. Zhang, X. LIN, When Engagement Meets Similarity: Efficient (k, r)-Core Computation on Social Networks, to appear
in VLDB2017.
17. Fan Zhang, Wenjie Zhang, Ying Zhang, Lu Qin, XUEMIN LIN, “OLAK: An Efficient Algorithm to Prevent Unraveling in Social Networks”, to appear in
42nd International Conference on Very Large Data Bases (VLDB), 2017
18. F. Zhang, Y. Zhang, L. Qin, W. Zhang, X. LIN, Finding Critical Users for Social Network Engagement: The Collapsed k-Core Problem, to appear in
AAAI2017.
19. T. Gao, X. Cao, G. Cong, J. Lu, X. LIN, Distributed Algorithms on Exact Personalized PageRank, to appear in SIGMOD2017.
20. L. Yuan, L. Qin, X. LIN, L. Chang, W. Zhang, I/O Efficient ECC Graph Decomposition via Graph Reduction, to appear in VLDB Journal.
21. L. Chang, C. Zhang, X. LIN, L. Qin Scalable Top-K Structural Diversity Search, Short Paper, Proceedings of the 32st International Conference on
Data Engineering (ICDE2017), 2017
22. L. Lai, Q. Lu, X. LIN, Y. Zhang, L. Chang, Scalable Distributed Subgraph Enumeration, VLDB 2017.
23. F. Bi, L. Chang, X. LIN, L. Qin, W. Zhang, Efficient Subgraph Matching by Postponing Cartesian Products, SIGMOD 2016.
24. H. Wei, J.X. Yu, C. Lu, X. LIN, Speedup Graph Processing by Graph Ordering, to appear in SIGMOD 2016.
25. . Yuan, L. Qin, X. LIN, L. Chang, W. Zhang, I/O Efficient ECC Graph Decomposition via Graph Reduction, VLDB2016.
26. B. Lyu, L. Qin, X. LIN, L. Chang, J.X. Yu, Scalable Supergraph Search in Large Graph Databases, ICDE 2016.
27. X. Feng, L. Chang, X. LIN, L. Qin, W. Zhang, Computing Connected Components with Linear Communication Cost in Pregel-like Systems, ICDE
2016.
28. L. Chang, W. Li, X. LIN, L. Qin, W. Zhang, pScan: Fast and Exact Structural Graph Clustering, ICDE 2016.
29. D.-W. Choi, J. Pei, X. LIN, Finding the Minimum Spatial Keyword Cover, ICDE 2016.
30. D. Wen, L. Qin, Y. Zhang, X. LIN, J.X. Yu, I/O Efficient Core Graph Decomposition at Web Scale, ICDE 2016 (BEST PAPER AWARD).
31. W. Zhang, X. LIN, Y. Zhang, K. Zhu, G. Zhu, Efficient Probabilistic Supergraph Search, to appear in IEEE Transactions on Knowledge and
Engineering (TKDE, accepted in Nov 2015)
32. L. Yuan, L. Qin, X. LIN, L. Chang, W. Zhang, Diversified Top-K Clique Search, to appear in VLDB J, (accepted in Oct 2015)
References:
Our Publications
33. L. Chang, X.LIN, Q. Lu, J .X. Yu, W. Zhang, Index-based Optimal Algorithms for Computing Steiner Components with Maximum Connectivity,
SIGMOD 2015.
34. L. Chang, X. LIN, Q. Lu, J.X. Yu, J. Pei, Efficiently Computing Top-k Shortest Path Join, to apeear in EDBT2015.
35. L. Lai, L. Qin, X. LIN, L. Chang, Scalable Subgraph Enumeration in MapReduce, to appear in VLDB 2015.
36. L. Chang, X. LIN, W. Zhang, J.X. Yu, Y. Zhang, L. Qin, Optimal Enumeration: Efficient Top-k Tree Matching, to appear in VLDB 2015.
37. X. Wang, Y. Zhang, W. Zhang, X. LIN, W. Wang, Selectivity Estimation On Streami SpatioTextual Data Using Local Correlations, to appear in
VLDB2015.
38. J. Wang, S. Song, X. LIN, X. Zhu, J. Pei, Clean Structured Even Logs: A Graph Repair Approach, in ICDE2015.
39. L. Yuan, Lu. Qin, X. LIN, L. Chang, W. Zhang, Diversified Top-K Clique Search, to appear in ICDE2015.
40. X. Wang, Y. Zhang, W. Zhang, X. LING, W. Wang, AP-Tree: Efficiently Support Continuous Spatial-Keyword Queries Over Stream, to appear in
ICDE2015
41. Z. Zhang, J.X. Yu, L. Qin, L. Chang, X. LIN, I/O Efficient: Computing SCCs in Massive Graphs, to appear in VLDB Journal (accepted in Sept, 2014).
42. W. Yu, X. LIN, W. Zhang, J. A. McCann, Fast All-Pairs SimRank Assessment on Large Graphs and Bipartite Domains, to appear in TKDE.
References:
Our Publications
43. X. Wang, Y. Zhang, W. Zhang, X. LIN, Efficiently Identify Local Frequent Keyword Co-occurrence Patterns in Geo-tagged Twitter Stream, SIGIR
2014 (short paper): 1215-1218.
44. L. Qu, J.X. Yu, L. Chang, H. Cheng, C. Zhang, X. LIN, Scalable Big Graph Processing in MapReduce, SIGMOD 2014: 827-838
45. W. Yu, X. LIN, W. Zhang, Fast Incremental SimRank on Link-Evolving Graphs, ICDE 2014: 304-315.
46. C. Zhang, Y. Zhang, W. Zhang, X. LIN, M.A. Cheema, X. Wang, Diversified Spatial Keyword Search on Road Networks, EDBT 2014: 367-378.
47. X. Zhao, C. Xiao, X. LIN, Q. Liu, W. Zhang, A Partition-Based Approach to Structure Similarity Search, PVLDB 7(3): 169-180 (2013)
48. W. Yu, X. LIN, W. Zhang, L. Chang, J. Pei, More is Simpler: Effectively and Efficiently Assessing Node-Pair Similarity based on Hype-links. PVLDB
7(1): 13-24 (2013)
49. W. Yu, X. LIN, IRWR: Incremental Random Walk with Restart, SIGIR 2013: 1017-1020.
50. L. Chang, J. Yu, L. Qin, X. LIN, C. Liu, W. Liang, Efficiently Computing k-Edge Connected Components via Graph Decomposition, SIGMOD
Conference 2013: 205-216
51. Z. Zhang, J. Yu, L, Qin, L. Chang, X. LIN, I/O Efficient: Computing SCCs in Massive Graphs, SIGMOD Conference 2013: 181-192
52. X. Zhao, X. Chuan, X. LIN, W. Wang, Y. Ishikawa, Efficient Processing of Graph Similarity Queries with Edit Distance Constraints, to appear in VLDB
Journal, VLDB J. 22(6): 727-752 (2013)
53. Weiren Yu, XUEMIN LIN, Wenjie Zhang, Towards Efficient Computation on SimRank Computation on Large Graphs, ICDE 2013: 601-612. ( One of
the Best Papers )
References:
Our Publications
54. Chengyuan Zhang, Ying Zhang, Wenjie Zhang, XUEMIN LIN, Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search, ICDE 2013: 901-912
55. Weiiren Yu, XUEMIN LIN, Wenjie Zhang, Ying Zhang, Jiajin Le, SimFusion+: Extending SimiFusion Towards Efficient Estimation on Large Dynamic
Networks, SIGIR 2012: 365-374
56. Yuanyuan Zhu, Lu Qin, Jeffrey Yu, Yiping Ke, XUEMIN LIN, High Efficiency and Quality: Large Graphs Matching, VLDB J. 22(3): 345-368 (2013)
57. Gaoping Zhu, XUEMIN LIN, Ke Zhu, Wenjie Zhang, Jeffrey Xu Yun, TreeSpan: Efficiently Computing Similarity All-Matching, SIGMOD Conference
2012: 529-540
58. Xiang Zhao, Chuan Xiao, XUEMIN LIN, Wei Wang, Efficient Graph Similarity Joins with Edit Distance Constraints, ICDE 2012: 834-845
59. Y. Luo, W. Wang, X. LIN, X. Zhou, J. Wang, K. Li, SPARK2: Top-k Keyword Query in Relational Databases, IEEE Transactions on Knowledge and Data
Enigneering, 23(12), 1763-1780, 2011 (Spotlight Paper)
60. H. Shang, X. LIN, Y. Zhang, J.X. Yu, W. Wang, Connected Substructure Similarity Search , pages 903-914, SIGMOD 2010.
61. H. Shang, K. Zhu, X. LIN, Y. Zhang, R. Ichise, Similarity Search on Supergraph Containment , pages 637-648, ICDE 2010.
62. H. Shang, Y. Zhang, X. LIN, J. Yu, Taming Verification Hardness: an efficient algorithm for testing subgraph isomorphism, pages 364-375,
VLDB2008.
63. Y. Luo, W. Wang, X. LIN, SPARK: A Keyword Search Engine on Relational Databases (demo) ICDE 2008: pages 1552-1555.
64. J. Chen, J.X. Yu, X. LIN, H. Wang, P.S. Yu, Fast Computing Reachability for Large Graphs with Hogh Compresion Rate , in the proceedings of
EDBT08, pages 193-204, France.
65. B. Ding, J.X. Yu, S. Wang, L. Qing, X. Zhang, X. LIN, Finding Top-k Min-Cost Connected Trees in Databases, ICDE’07 (Best Student Paper Award),
pages 836-845, 2007.
References:
Thank you!
Questions?