Towards Big Graph Processing: Applications, Challenges, and Advances
Xuemin Lin UNSW
P节PT日模PP板T模下板载:www.1ppt.com/mjieorbi/an/ 行PP业T素PP材T模下板载:wwwwww.1.1ppptt.c.coom//hsauncgayi/e/
PPT背景图片:www.1ppt.com/beijing/ 优秀PPT下载:www.1ppt.com/xiazai/ W资o料rd下教载程:wwww.1.1pppt.tc.ocomm/z/iwliaoord/ / 范教文案下载:www.1ppt.com/fjaianowaenn/ /
PPT图表下载:www.1ppt.com/tubiao/ PPT教程: www.1ppt.com/powerpoint/ EPxPcTe课l教件程下:载w:www.1wpp.1t.pcpotm.c/oemxc/ekle/ jian/
PPT论试坛卷:下w载w:ww.1wppwt.c1nppt.com/shiti/
• Ver$ces: a collec$on of en$$es
• Edges: connec$ons between ver$ces
What is a graph ?
v1
v3
v5
v6
v2
v7
v4
v8
v9
Graphs and Classic Problems
1. Seven Bridges of Königsberg Leonhard Euler in 1735:
• find a roundtrip through the city to cross each bridge once and only once
• earliest (published) work on networks/graphs
1. Seven Bridges of Königsberg(cont.) Abstraction:
– replaced each land mass with an abstract “vertex” or node, and each bridge with an abstract connection, an “edge”
Showed that this problem has no solution.
Euler Tour:
– necessary and sufficient conditions for the walk of the desired form: connected and all vertices with even degrees.
Graphs
– A Graph is a collec&on of nodes joined by edges.
• Defini&on:
(u,v)
• G=(V,E)
– V is a set of vertices, u,
v in V are vertices
– E is a set of edges, (u,v) in E is an edge.
• •
•
Edges
Undirected edges
– u–v: (u,v)=(v,u)
– e.g., u and v like each other; u and v are co-authors
Directed edges(arcs)
– u v: (u, v) != (v, u)
– (u, v) is an in-link (in-edge) of v, and an out-link (out-edge) of u. – e.g., u likes v; u is the father of v.
Other attributes
– weight (e.g. frequency of communication) – w(u, v) – rank (e.g., best friend, second best friend, …)
– type (e.g., friend, co-worker, activates, inhibits, …)
Ver?ces
• Degree
– d(v): the number of edges connected to a vertex – In degree: number of incoming edges of a vertex – Out degree: number of outgoing edges of a vertex
• Ifedgesareweighted,thedegree,indegree,outdegreearethe total weights of edges, incoming edges, outgoing edges of a vertex.
• Other attributes:
– Weight: (e.g., importance, authority of a person)
– Type: (e.g., advisor, student, old people) – Label: (e.g., name)
2. Travelling salesman • Problem:
– Given a list of ci$es and the distances between each pair of ci$es, find the shortest route that visits each city exactly once and returns to the origin city.
– In the theory of computa$onal complexity, the TSP belongs to the class of NP-hard problems.
Thus, it is unlikely that the worst-case running time for any algorithm for the TSP is in PTIME.
2. Travelling salesman (cont.) • HeurisGc SoluGons:
– ConstrucGve heurisGcs: Nearest-neighbors (greedy), spanning tree, bitonic tour, etc.
– for extremely large problems (millions of ciGes), within a reasonable Gme which are with a high probability just 2– 3% away from the opGmal soluGon
Widely used in several applications: planning, logistics
the manufacture of microchips. DNA sequencing, etc.
3. Planar graph
• Definition: can be embedded in the plane
– i.e., it can be drawn on the plane in such a way that its edges intersect only at their endpoints.
• Determination:
– does not contain a subgraph that is a subdivision of K5 (the complete graph on five vertices) or K3,3((complete bipartite graph on six vertices)
4. Euler’s Formula
• Euler’s Formula:
if a finite, connected, planar graph is drawn in the plane
without any edge intersec&ons
v − e + f = 2. (v is the number of ver&ces, e is the number of edges and f is the number of faces)
5. Shortest Path • a.k.a. Geodesic paths:
– finding a path between two vertices (or nodes) in a graph such that the sum of the weights of its constituent edges is minimized.
• Solution:
– Single Source: Dijkstra – All Pair: Floyd-Warshall.
5. Shortest Path(cont.)
• Applications:
– Road Network Navigation
– n-Degree Separation in Social Networks
Six Degree of Separation
http://entitycube.research.microsoft.com /6dumapsvg.aspx
5. Shortest Path(cont.)
• How to fast get shortest path between two nodes in a large graph?
– Pre-computedsmallstructuretomaintaindistances. – DistanceOracle:
• Hop cover for each nodes’ reachable node list – Intersect of two lists.
• Landmark based global summary structure. – Tree traversal.
6. Max-Flow
• Problem:
– finding a feasible flow through a single-source,
single-sink flow network that is maximum. • Applications:
– Airline/pipeline/network scheduling
– Circulation-demand/logistics planning
– Social network Sybil/Spammer detection
6. Max-Flow (cont.)
• Common Solu&ons: – Linear Programming.
– Ford-Fulkerson: O(mn)
– Recent result: close to O (m)
7. Vertex Cover
• Problem:
– A vertex-cover of an undirected graph G=(V,E) is a
subset V’ of V, for edge (u,v) in G, u or v is in V’. – Minimum Vertex Cover:
• NP-hardopGmizaGonproblem. • SoluGon:
– ApproximaGon algorithm:
• Repeatedlytakingbothendpointsofanedgeinto
the vertex cover and remove. • MaximumMatching
1959
1959
1969
……
2003… Current
Graph DB and Big Graph
Heinrich Heesch
Erdos, Renyi and Gilbert Compute-aided-proof of
1941
De Bruijn
De Bruijin Sequenc…e Random Graph Model 4-color theorem.
1936
Ramsey and Turán
1852 1878
Ramsey Number Graph Theory Textbook
1845
Gustav Kirchoff
…… Francis Guthrie
Dénes Kőnig
1736
Leonard Euler Seven Bridges of Konisberg
Four Color Theory
Number of spanning trees
Applications
Iterative
Pagerank
Applica’ons
Insurance
• Fake ID
• Sensi,ve User
• Fraudster
• Team-Fraud
• Rela,on
Tracking Transportion •…
Services
Relations
Algorithms
Shortest Path
Link Analysis (relevance)
Degree/Neighbors (structural)
K-hop
Degree correlation
Label Propagation
Community Detection
PPR
Structural Discovery
Adver&sing
• Real-time Recommendation
• Forum selling • Community
Detection •…
Social Network
• Personal Value
• Friend
Recommend
• Community and
Organisation
Analysis •…
Rou$ngs
Distance/Path (reachability)
Cluster Coefficient
Triangle Courting
• Routing • Spatial-
Temporal
Analysis •…
Community Detection
propagation
Centrality
Connected Component
Dense Subgraph (e.g. K-core, K- Truss)
Groups
Features
Others
Link prediction
Node mapping (Embedding)
Transport Model(IC,LT model )
Maximal Influence
Node2vec
Outline
vApplications
vGraph Matching
vCohesive Subgraphs
vGraph Similarity and Clustering vDistributed Graph Computation vOther Graph Problems vIndustry Collaborations
21
Some Examples
qFraud Detection
qProduct Recommendation qRetail Services
qInvestment Analysis qProduct Lifecycle Management qAnti Money Laundering qCyber Security
…….
Fraud Detection: First-Party Bank Fraud
• Typical Scenario
– A group of two or more people organize into a fraud ring;
– The ring shares a subset of real contact, combining them to create a number of synthetic identities;
– Ring members open accounts using these synthetic identities;
– New accounts are added to the original ones (e.g. credit cards);
– The accounts are used normally, with regular purchases and timely payments;
– Banks increase the revolving credit lines over time, due to the observed responsible credit behaviour;
– One day the ring “busts out”, coordinating their activity, maxing out all of their credit lines, and disappearing.
Fraud Detection: First-Party Bank Fraud
Fraud Detection: Insurance Fraud
• Background:
– Fraudsters claim the insurance money via faking or exaggerating
injury, damage etc.
– The impact of fraud on the insurance industry is estimated to be $80 billion annually in the US.
– In the UK, insurers estimate that bogus whiplash claims add $144 per year to each driver’s policy
Fraud Detection: Insurance Fraud
• Typical Scenario
– rings of fraudsters work together to stage fake car accidents and claim
soft tissue injuries.
– “Paper collisions”, with fake drivers, fake passengers, fake pedestrians
and even fake witnesses. – Roles
• Providers
– Doctors: diagnose false injuries
– Lawyers: file fraudulent claims
– Body shops: misrepresent damage to cars
• Participants
– Drivers, Passengers, Pedestrians and Witnesses
Fraud Detection: Insurance Fraud
No Possible to Join
Fraud Detection: Insurance Fraud
The entities involved in one of the rings.
Fraud Detection: Some Insights
• The data is grow so big and so complex that it is impossible to go with relation database.
• Graph database can naturally model the complex relations of data.
• The fraud detection can often reduce to the mining of certain patterns:
– Rings (in the previous two examples) – Bi-cliques
– More others.
Product Recommendation
• Goal
─ Online shopping, financial product
recommendation • Solution
– Put customers, preferences, purchases, products and locations etc. into a graph model
– Uses connections to make product recommendations
• Challenge
– Serve many millions of customers and products: scalability & efficiency
Retail Services:
Ecommerce Delivery Service Rou8ng
• Goal
– Enable delivery within 60 minutes to compete
with Amazon Prime • Solution
– With graph, we can identify the fastest delivery route requires support for complex routing queries at scale with fast and consistent performance
• Challenge
– Calculate best route option in real-time – Offer more predictable delivery time
Retail Services: Supply Chain Visibility
• Goal
– Visualize the supply chain and easily to explore for the clients
• Solution
– With graph, we can easily explore the supply chain with graph traversal operations
• Challenge
– the volume and structure of information to be processed are huge and complex
Investment Analysis
• A complete knowledge of an enterprise facilitate investment analysis, which requires:
– Own profile: financial status, innova4on strength of long credit – Ownership and investment rela4onships
• The stronger the owner and relevant, the higher the inves$ng poten$al
• Linking all companies via their investment rela$ons requires graph database
Investment Analysis
• Reveal an enterprise’s real controllers: the person who owns the biggest equity share, directly or indirectly
Investment Analysis
• Enterprise path discovery
– Whether there are paths to reach new customers
– the potential paths from the competitors to the new customers
…
…
Ooops, competitors approached first
Invested
Investing company Invested by Blue Compe2tors
New Customers
Investment Analysis
• Multidimensional relationships discovery: – competitors
– pattern transfer – acquisition
– investment
• Why this can help?
– Difficulties of investing a target company? Try negotiate with the competitors’ targets.
Product Lifecycle Management
• Product lifecycle management is the process of managing the entire lifecycle of a product from design stage to development to go-to-market to retirement to disposal
Product Lifecycle Management
• Existing PLM systems rely on RDBMS
– Siloeddata,inabilitytomodelreal-lifecomplexities
and to adapt to changes
– Inabilitytoscaledatamodelwithoutcostsandrisksas business evolves
– Searchforinformationistime-consuming
– Missinginsightsregardingdependenciesleadstopoor decisions and increase risks
Product Lifecycle Management
• Represent cross-department data about products and processes as a graph and store it as one
– Aggregateproducthierarchiesand connections into a single source of truth
– Easilyedit,orexpandyourmodelas your need evolve
– Savingtimebysearchingandexploring your data
– Visuallytrackdependenciesbetween entities and get contextual insights
Anti Money Laundering
Samsung Group
Anti Money Laundering: Graph Model
• More than a pretty picture
• Graph structure for deep quantitative analysis
Anti Money Laundering: Graph Model
• Classical measures of centrality, like degree and betweeness, and eccentricity:
– to correlate such measures with a propensity of fraud • Complex circular money flows:
– the native graph structure can also be used to input, track and calculate transactions across these chains.
– profit distributions flow from one company to another via many different paths
Anti Money Laundering: Some Insights
• Global-scale organisation fosters complex transaction flows – can be circular, making it impossible to investigate ML using
traditional method.
– Graph not only can visual, but can model the data for deep analysis
• Graph properties such as eccentricity and betweenness are useful factors to correlate the potential ML.
• Path, cycle and pattern analysis can facilitate the AML process
Cybersecurity
• Goal
─ A unified web framework to manage cyber attack relationships in the
network
─ Understand how cyber attackers can leverage initial footholds to extend their reach through the network
• Challenge
─ correlate data from numerous sources into a common model
─ flexible and easy to extend, and map naturally to network attack relationships
─ Support various kinds of network attack relationships queries
Cybersecurity: MITRE Case Study
• Solution
• Graph model
Many thanks to my team
A/Prof. Wenjie Zhang
100+ ERA A* (CCF A) ranked papers Associate Head of CSE, UNSW Graph Data, Spatial-Temporal Data, Algorithms
A/Prof. Ying Zhang
80+ ERA A* (CCF A) ranked papers University of Technology Sydney (UTS) Social Network, Streaming Data, Algorithm
Dr. Lu Qin
60+ ERA A* (CCF A) ranked papers Senior Lecturer of UTS
Big Graph Computing, Algorithm
Dr. Xin Cao
20+ ERA A* (CCF A) ranked papers Senior Lecturer of UNSW
Social Network, Data Mining, Algorithm
q Dr. Yixiang Fang: Postdoctoral Fellow of UNSW,Social Network, Big Graph Computing.
❑ Dr. Xing Feng: Postdoctoral Fellow r of UTS, Big Graph Computing.
❑ Dr. Longbin Lai: Postdoctoral Fellow of UNSW, Big Graph Computing.
q Dr. Long Yuan: Postdoctoral Fellow of UNSW, Big Graph Computing.
q Many past students and current students
GRAPH PATTERN MATCHING
Introduction
Ø Subgraph Matching
Given a query q and a large data graph G, the problem is
to extract all subgraph isomorphic embeddings of q in G.
Introduction
Ø Subgraph Matching
Given a query q and a large data graph G, the problem is
to extract all subgraph isomorphic embeddings of q in G.
Introduction
Ø Subgraph Matching
Given a query q and a large data graph G, the problem is
to extract all subgraph isomorphic embeddings of q in G.
Introduction
Ø Subgraph Matching
Given a query q and a large data graph G, the problem is
to extract all subgraph isomorphic embeddings of q in G.
Graph Pattern Matching: summary
• Related Applications
– Fraud Detection, Anti Money Laundering, Cyber Security etc.
• Algorithms and Recent Publications
– Exact Match
• Tree Sequence (QuickSI), VLDB’08
• Core-Forest-Leaf Decomposition (CFL Algorithm), Sigmod’16 • Optimal Distributed Join-based algorithms, VLDB’15, VLDB’17
– Approximate Match
• Spanning-tree-based matching algorithm (TreeSpan), SIGMOD’12
– Supergraph Match
• Tree-index-based supergraph search algorithm, DGTree, ICDE’16
– Tree Match
• Lawler-procedure-based top-k tree searching algorithm, VLDB’15
Detecting Designated Communities
Query Graph
Data Graph
Applications:
Group of Terrorists, Anomaly detection etc.
Challenges:
• Subgraph Isomorphism is NP complete.
• Graph data and results can be huge (MB-scale
graph can produce PB-scale results).
Alibaba Big Graph Computing- RT Cycle Detection
Applications:
• Fraud detection
• Money laundering detection
Challenges:
• Large-scale dynamic graph (100-million-scale vertices, billion-scale edges)
• High demand on real-time processing: 10-50ms response time, while previous system
takes 50s
Solutions:
• Real-time cycle detection on each edge
• Light-weighted index structure (easy to maintain and update) and efficient query
algorithm
Real-data simulation:
• Detect 6-edge cycles within 10ms
Single CPU
QuickSI (VLDB2008)
Synchronized Depth-First Traversal
Forwarding Backtracking
4 25
6
q
2
4
5 6
g1 g2 g3
1.Determine the access order in q
2.Detect corresponding subgraphs in g1, g2 which can be
mapped to the currently traversed vertices.
3
1
7
1
3
7
TurboISO (SIGMOD2013) • Overview of TurboISO
ØNeighborhood Equivalence Class (NEC)
§ Merging vertices with same neighbor to reduce query size
§ Combine / Permute strategy ØCandidate Region Exploration
§ Constructed on-the-fly based on q
§ A path-based data structure containing all embeddings of q in G
• Definition
Neighborhood Equivalence Class(NEC)
• Let be an equivalence relation over all query vertices in q such that,
if for every embedding m that contains and , there exists an embedding m’ such that
• Example
u3 and u4 are NEC nodes .
Reduce Candidates
CFL-Match (SIGMOD2016)
Matching order of QuickSI and TurboISO : (u1,u2,u3,u4,u5,u6). (u1,u2,u5,u3,u4,u6)
105 – 100 partial mappings are redundant.
Cartesian products:
Ø 100 mappings (v0,v2, v1000+i, v2100 +i) (3 ≤ i ≤ 102) of (u1,u2,u3,u4) Ø 1000 mappings (v0, vj) (3 ≤ j ≤ 1002) of (u1,u5)
Compression
● Originally proposed by Qiao et al. [7]
● Intuition
○ ○
Subgraph enumeration can generate enormous (intermediate) results
Some vertices can be compressed as they are not needed in future
computation
■ Heuristics by [7]: the vertices that do not belong to the minimum vertex
cover (MVC)
v3 v2
Vertices in MVC Vertices to compress
v0 v1
Distributed Subgraph
Enumeration
(VLDB2015 & VLDB2017)
Single CPU
QuickSI (VLDB2008)
Synchronized Depth-First Traversal
Forwarding Backtracking
4 25
6
q
2
4
5 6
g1 g2 g3
1.Determine the access order in q
2.Detect corresponding subgraphs in g1, g2 which can be
mapped to the currently traversed vertices.
3
1
7
1
3
7
TurboISO (SIGMOD2013) • Overview of TurboISO
ØNeighborhood Equivalence Class (NEC)
§ Merging vertices with same neighbor to reduce query size
§ Combine / Permute strategy ØCandidate Region Exploration
§ Constructed on-the-fly based on q
§ A path-based data structure containing all embeddings of q in G
• Definition
Neighborhood Equivalence Class(NEC)
• Let be an equivalence relation over all query vertices in q such that,
if for every embedding m that contains and , there exists an embedding m’ such that
• Example
u3 and u4 are NEC nodes .
CFL-Match (SIGMOD2016)
Reduce Candidates
105 – 100 partial mappings are redundant.
Matching order of QuickSI and TurboISO : (u1,u2,u3,u4,u5,u6). (u1,u2,u5,u3,u4,u6)
Cartesian products:
Ø 100 mappings (v0,v2, v1000+i, v2100 +i) (3 ≤ i ≤ 102) of (u1,u2,u3,u4) Ø 1000 mappings (v0, vj) (3 ≤ j ≤ 1002) of (u1,u5)
Compression
● Originally proposed by Qiao et al. [7]
● Intuition
○
○
Subgraph enumeration can generate enormous (intermediate) results
Some vertices can be compressed as they are not needed in future
computation
■ Heuristics by [7]: the vertices that do not belong to the minimum vertex
v0 v1
cover (MVC)
v3 v2
Vertices in MVC Vertices to compress
Distributed Techniques
A Thriving Literature
Categorizing by Strategies
Related Works
PSgL [Shao et al., Sigmod 2014]
§ BFS Search strategy, Vertex-centric computation
§ Analogous to decomposing into stars and then processing Star-based join.
§ Several issues including memory issues
BinaryJoin Strategy
● ●
Divide the pattern graph into a set of join units { p1, p2, …, pk } Process k-1 join following specific join order
We prove that CliqueJoin is worst-case optimal by showing that it can be expressed as GenericJoin proposed by Ngo et al. [8]
●
StarJoin Algorithm
Round 4
Round 3
Round 2
Round 1
StarJoin
Trade-off for stars as join units
q # of matches to a star is massive if many edges in a star.
A node with 1,000,000 neighbors => 𝟏𝟎𝟏𝟖
matches of a 3-star
Not unusual: a node contains many neighbors in a large social networks or web graphs
q If star with many edges, a join followed has more pruning power; i.e., less # intermediate results of a join.
Star-based Join
Star decomposition
Left-deep Join
𝑣! 𝑣$
𝑣𝑃 #
𝑣! 𝑣$ 𝑣$ 𝑣$
𝑣” 𝑣# 𝑣” 𝑣# 𝑣# 𝑝! 𝑝” 𝑝#
𝑣”
Star-based Join
Tradeoff
TwinTwigJoin Algorithm
Round 4
Round 3
Round 2
Round 1
StarJoin TwinTwigJoin
TwinTwig Join VLDB15’
TwinTwig: A star of at most two edges
vvvv
1444 v v v2 v3 v2 v3 v3
14ppp
Star Decomposition
vvvvvvv
0
12
2314414
P
v2 v2 v3 TwinTwig Decomposition
v3
TwinTwig Join VLDB15’
Given any star-based join, star, there always exists a TwinTwig join, tt, s.t. the size of intermediate results of tt is no larger than that of star.
CliqueJoin Algorithm
Round 4
Round 3
Round 2
Round 1
StarJoin
TwinTwigJoin
CliqueJoin
The drawbacks of TwinTwig Join
Simple graph storage only supports using star (TwinTwig) as join units, which can result in too many intermediate results
(e.g. a node of degree 1,000,000, O(10^18) stars, O(10^12) TwinTwigs)
Left-deep join can result in sub-optimal solution
General Algorithmic Framework
Graph Storage
§ Stored as (u; Gu) for each data node u
§ Gu is called the local graph of u such that:
(1) it is connected; (2) ; (3)
General Algorithmic Framework
The structure into which we decompose the pattern graph are called the join unit § e.g. Star in Star-based join, TwinTwig in TwinTwig Join
§ A structure p can be the join unit iff.
where stands for the matches of p in
General Algorithmic Framework
Decomposing
Left-deep tree
Bushy tree
SEED VLDB2017
SCP (Star-Clique-Preserved) graph storage: Use star and clique as the join units
§ We can avoid using star if clique is an alternative
§ Shorter execution. The 6-clique can now be processed in one single round, instead of 7 rounds in TwinTwig Join
Bushy join plan: Optimality Guarantee
§ The search space of an optimal bushy join covers that of a left-deep join
SCP Graph Storage
SCP Graph, each local graph is denoted as
We include the edges among the neighbors into the local graph Gu, so that all cliques with u can be fully computed within
Optimal Bushy Join
Notations
: the join plan to solve P
: the cost of the join plan
: estimated # matches of P in G (We apply
powe-law random graph model for the estimation, while estimation can vary with applications)
We aim at finding a join plan for P , s.t
is minimized
Optimal Bushy Join
A dynamic programming transform function:
References
1. Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. PVLDB, 5(9), 2012.
2. Y. Shao, B. Cui, L. Chen, L. Ma, J. Yao, and N. Xu. Parallel subgraph listing in a large-scale graph. In SIGMOD’14, pages 625-636.
3. L. Lai, L. Qin, X. Lin, and L. Chang. Scalable subgraph enumeration in mapreduce. PVLDB, 8(10), 2015.
4. L. Lai, L. Qin, X. Lin, Y. Zhang, L. Chang, and S. Yang. Scalable distributed subgraph enumeration. PVLDB, 10(3),
2016.
5. F. N. Afrati, D. Fotakis, and J. D. Ullman. Enumerating subgraph instances using map-reduce. In Proc. of ICDE,
2013.
6. K. Ammar, F. McSherry, S. Salihoglu, and M. Joglekar. Distributed evaluation of subgraph queries using worst-case
optimal low-memory dataflows. PVLDB, 11(6), 2018.
7. M. Qiao, H. Zhang, and H. Cheng. Subgraph matching: On compression and computation. PVLDB, 11(2), 2017.
8. H. Q. Ngo, E. Porat, C. Re, and A. Rudra. Worst-case optimal join algorithms. J. ACM, 65(3), 2018.
9. H. Kim, J. Lee, S. S. Bhowmick, W.-S. Han, J. Lee, S. Ko, and M. H. Jarrah. Dualsim: Parallel subgraph enumeration
in a massive graph on a single machine. SIGMOD ’16, pages 1231{1245, 2016.
Optimal Enumeration: Efficient Top-k Tree Matching
(VLDB2015)
90
Introduction
• Data Graph: a node-labeled directed graph G – Query: a node-labeled rooted tree T
– Find k mappings from T to G with the minimum lengths.
Query Tree T:
u1 C
u2 S u3 E v1 C
v3 S v4 E
Data Graph G:
v1 C v2 C
v3 S v4 E S
Top-3 Match v1 Length=3
v5
Top-1 Match Length=2
Top-2 Match Length=2
v2 C
v5 S v4 E
C
v5 S v4 E 91
Applications of Top-k Tree Matching
n Top-k tree matching strengthen the relevance of twig-pattern matching results
– Twig-pattern matching is a core operation in XQuery over XML/Graph data
– Exponential number of matches
• It is also used as a key building block to retrieve top-k graph pattern matches
92
State-of-the-art Approach[Gou’08] • Time Complexity of the existing techniques
– O(nT(dT + log k)) for enumerating each top-i match • nT: number of nodes in T
• dT: the maximum node degree in T
• Our contribution in this paper: we propose a new enumeration paradigm to enumerate each match in O(nT + log k) time
93
A New Enumeration Strategy
• The solution space: all valid matches in V1 × V2 × … × Vn – {u1,u2,…,un}:thesetofnodesT
– Vi:thesetofnodesinGhavingthesamelabelasui.
• Adapt Lawler’s Procedure to compute top-k matches
– Suppose(v1,v2,…,vn)isthetop-1match,thentheentiresolutionspaceisdividedintonsubspaces
• (V1 −{v1})×V2 ×…×Vn
• {v1}×(V2 −{v2})×…×Vn •…
• {v1}×{v2}×…×(Vn −{vn})
– Top-2 match is the match with lowest score among best matches in these obtained subspaces.
How to efficiently compute the best match in a subspace?
94
Two Types of Subspaces
• Categorize the n subspaces into two types
– Let{v1}×{v2}×…x(Vi –Ui)×…×Vn bedividedby(v1,v2,…,vn) – Type-1:{v1}×{v2}×…×(Vi –Ui -vi)×…×Vn
» Only one such type of subspace
– Type-2:{v1}×{v2}×…×{vi}×…×(Vj –vj)×…×Vn
» (n – i) such type of subspaces
• For efficient computation, order the nodes
in T in a top-down fashion (e.g., u1, u2, u3, u4).
u1
u2
u3 u4
95
An Optimal Approach
• Compute the score of the best match in a subspace
– One Type-1 subspace: takes time O(log k)
– (
v0 v1 v2 v3 v4
(Balm) (Doll) (Hat)
(Wine) (Carpet)
125
● BFC-BS & BFC-IBS
State-of-the-art
u0
(Adam)
u1
(Mark)
u2
(Shego)
u3
(Taylor)
BFC-IBS improves BFC-BS in two aspects: (1) pre choosing the layer of start-vertices to achieve a lower time complexity; (2) using a hash map to speed up the implementation.
(3) avoid counting a butterfly twice (process the wedge (u, v, w) only if w.id > u.id)
v3 v4
(Wine) (Carpet)
v0 v1 v2
(Balm) (Doll) (Hat)
c(v0,v1)= 1
126
● BFC-BS & BFC-IBS
State-of-the-art
u0
(Adam)
u1
(Mark)
u2
(Shego)
u3
(Taylor)
BFC-IBS improves BFC-BS in two aspects: (1) pre choosing the layer of start-vertices to achieve a lower time complexity; (2) using a hash map to speed up the implementation.
(3) avoid counting a butterfly twice (process the wedge (u, v, w) only if w.id > u.id)
v3 v4
(Wine) (Carpet)
v0 v1 v2
(Balm) (Doll) (Hat)
c(v0,v1)= 2
127
● BFC-BS & BFC-IBS
State-of-the-art
u0
(Adam)
u1
(Mark)
u2
(Shego)
u3
(Taylor)
BFC-IBS improves BFC-BS in two aspects: (1) pre choosing the layer of start-vertices to achieve a lower time complexity; (2) using a hash map to speed up the implementation.
(3) avoid counting a butterfly twice (process the wedge (u, v, w) only if w.id > u.id)
v3 v4
(Wine) (Carpet)
v0 v1 v2
(Balm) (Doll) (Hat)
c(v0,v1)= 3
128
● BFC-BS & BFC-IBS
State-of-the-art
u0
(Adam)
u1
(Mark)
u2
(Shego)
u3
(Taylor)
BFC-IBS improves BFC-BS in two aspects: (1) pre choosing the layer of start-vertices to achieve a lower time complexity; (2) using a hash map to speed up the implementation.
(3) avoid counting a butterfly twice (process the wedge (u, v, w) only if w.id > u.id)
v0 v1
v2 v3
v4
(Balm) (Doll) (Hat) (Wine)
c(v0,v1)= 3 c(v0,v2)= 1
(Carpet)
129
● BFC-BS & BFC-IBS
State-of-the-art
u0
(Adam)
u1
(Mark)
u2
(Shego)
u3
(Taylor)
BFC-IBS improves BFC-BS in two
aspects: (1) pre choosing the layer of
start-vertices to achieve a lower time
complexity; (2) using a hash map to
speed up the implementation.
(3) avoid counting a butterfly twice
(process the wedge (u, v, w) only if w.id
> u.id) v0 = 3 + 0 + 0 = 3
v2
v3 v4
(Wine) (Carpet)
c(v0,v3)= 1
v0 v1
(Hat)
c(v0,v2)= 1
(Balm) (Doll)
c(v0,v1)= 3
130
⋈
● BFC-BS & BFC-IBS
State-of-the-art
u0
u1
u2
u3
(Adam) (Mark)
(Shego) (Taylor)
BFC-IBS improves BFC-BS in two aspects: (1) pre choosing the layer of start-vertices to achieve a lower time complexity; (2) using a hash map to speed up the implementation.
(3) avoid counting a butterfly twice (process the wedge (u, v, w) only if w.id > u.id)
v0 v1 v2 v3 v4
(Balm)
v0= 3
(Doll) (Hat)
v1= 0
(Wine) (Carpet)
131
⋈ ⋈
● BFC-BS & BFC-IBS
u0
u1
u2
u3
State-of-the-art
(Adam) (Mark)
(Shego) (Taylor)
BFC-IBS improves BFC-BS in two
aspects: (1) pre choosing the layer of
start-vertices to achieve a lower time
complexity; (2) using a hash map to
v0 v1 v2 v3 v4
speed up the implementation.
(Balm)
(Doll)
(Hat)
(Wine)
c(v2,v3)= 2
(Carpet)
c(v2,v4)= 1
Time Complexity
(3) avoid counting a butterfly twice
O(min{ ∑u∈U(G) degG(u)2, ∑v∈L(G) degG(v)2}) (process the wedge (u, v, w) only if w.id
> u.id) 𝐺= 4
=1
=0 v3 =0
v0
132
=3 v1
=0v2 v3
⋈ ⋈ ⋈
⋈ ⋈
⋈
Issues in handling hub vertices:
State-of-the-art
L(G) u0 u1 R(G) v0 v1
u2
u1000 u1001
v999 v1000 v1001
133
State-of-the-art
Issues in handling hub vertices:
The existing algorithms need to go through u0 (or v1000) as the middle- vertex if choosing L(G) (or U(G)) to start.
Regardless of whether the upper or the lower layer is selected to start, we must traverse totally C2, 1000 (= 499,500) plus 1,000 different wedges by the existing algorithms.
L(G) u0 R(G) v0
u1 u2
u1000 u1001
v1
v999
v1000 v1001
134
State-of-the-art
Issues in handling hub vertices:
The existing algorithms need to go through u0 (or v1000) as the middle- vertex if choosing L(G) (or U(G)) to start.
Regardless of whether the upper or the lower layer is selected to start, we must traverse totally C2, 1000 (= 499,500) plus 1,000 different wedges by the existing algorithms.
L(G) u0 R(G) v0
u1 u2
u1000 u1001
v1
v999
v1000 v1001
135
State-of-the-art
Issues in handling hub vertices:
The existing algorithms need to go through u0 (or v1000) as the middle- vertex if choosing L(G) (or U(G)) to start.
Regardless of whether the upper or the lower layer is selected to start, we must traverse totally C2, 1000 (= 499,500) plus 1,000 different wedges by the existing algorithms.
L(G) u0 R(G) v0
u1 u2
u1000 u1001
v1
v999
v1000 v1001
136
State-of-the-art
Issues in handling hub vertices:
The existing algorithms need to go through u0 (or v1000) as the middle- vertex if choosing L(G) (or U(G)) to start.
Regardless of whether the upper or the lower layer is selected to start, we must traverse totally C2, 1000 (= 499,500) plus 1,000 different wedges by the existing algorithms.
L(G) u0 R(G) v0
u1 u2
u1000 u1001
v1
v999
v1000 v1001
137
Challenges
● It is a challenge to effectively handle high-degree vertices.
● It is also a challenge to utilize CPU cache to speed up butterfly counting.
138
Solutions ●Solution 1: The vertex-priority-based algorithm
● BFC-VP, to conquer challenge 1
● Solution 2: BFC-VP + cache aware techniques BFC-VP++, to conquer challenge 1 & 2
139
The vertex-priority-based algorithm
Motivation: a hub vertex may not always necessary to become a middle-vertex in a wedge for the construction of a butterfly.
[u0, v0, u1, v1] in the Figure can be constructed in two ways: 1) by the wedges (u0, v0, u1) and (u0, v1, u1), or 2) by the wedges (v0, u0, v1) and (v0, u1, v1).
140
The vertex-priority-based algorithm • Assign a priority to each vertex.
• Traversal necessary wedges.
• Only process the wedges (u, v, w) with
p(u) > p(v) & p(u) > p(w). • Count butterflies.
Time Complexity
O(∑(u,v)∈E(G) min{degG(u), degG(v)})
141
Theorem:
∑(u,v)∈E(G) min{degG(u), degG(v)} ≤
min{ ∑u∈U(G) degG(u)2, ∑v∈L(G) degG(v)2}
∑u∈U(G)degG(u)2 =∑(u,v)∈E(G),u∈U(G)degG(u) ≥ ∑(u,v)∈E(G) min{degG(u), degG(v)}
∑v∈L(G)degG(v)2 =∑(u,v)∈E(G),v∈L(G)degG(v) ≥ ∑(u,v)∈E(G) min{degG(u), degG(v)}
Proof:
142
Cache aware techniques
● Cache aware wedge processing
● Cache aware graph projection
Improve CPU cache performance, keep time complexity and space complexity unchanged.
143
Cache aware wedge processing
The degree distribution of the end-
vertex-accesses on Tracker by BFC-VP 79% of total accesses are accesses
of low-degree vertices (i.e., degree < 500)
Since the locality of accesses is a key aspect of improving the CPU cache performance, we hope the algorithm can access more high-degree vertices as end- vertices.
144
Cache aware wedge processing
New wedge processing strategy: processing the wedges where the priorities of end-vertices are higher than the priorities of middle-vertices and start-vertices.
We proved that the total access of end-vertices remaining unchanged - the time complexity unchanged.
The degree distribution of the end-
vertex-accesses on Tracker using new
wedge processing strategy
The percentage of accesses of high-degree vertices (i.e.,
degree > 2000) increases from 9% to 81%.
145
Cache aware graph projection
Generally, vertices are sorted by their ids when storing in the buffer. Although end-vertices are mostly high-priority vertices, the distance between two end-vertices (e.g., v0 and v3) can be very long.
146
Extensions
● Counting the butterflies for each edge
147
Extensions
● Parallel butterfly counting – shared memory parallelization
Easily extern our algorithms into a parallel version using O(n*t+m) space.
global data space
local data space for thread 1
local data space for thread 2
…
local data space for thread 3
148
Extensions
● External memory butterfly counting
1. Process wedges based on our strategy.
2. For each wedge (u, v, w), store the vertex-pair (u, w) on disk.
3. Sequentially scan these vertex-pairs and for the vertex-pair (u, w), we count
the occurrence of it and compute using the following lemma.
𝐺
149
⋈
OLAK: An Efficient Algorithm to Prevent Unravelling in Social Networks
VLDB 2017
The collapse of Friendster, Inc.
• • •
Founded in 2002.
Popular at early 21st century with over 115 million users. Suspended in 2015 for lack of user engagement.
D. Garcia, P. Mavrodiev, and F. Schweitzer. Social resilience in online communities: the autopsy of friendster. In COSN, pages 39–50, 2013.
K. Seki and M. Nakamura. The collapse of the friendster network started from the center of the core. In ASONAM, pages 477–484, 2016.
Network Unraveling
The engagement of a user is influenced by the number of her engaged friends.
K. Bhawalkar, J. Kleinberg, K. Lewi, T. Roughgarden, and A. Sharma. Preventing unraveling in social networks: the anchored k-core problem. SIAM Journal on Discrete Mathematics, 29(3):1452–1475, 2015.
Network Unraveling
An equilibrium: a group has the minimum degree of k
K. Bhawalkar, J. Kleinberg, K. Lewi, T. Roughgarden, and A. Sharma. Preventing unraveling in social networks: the anchored k-core problem. SIAM Journal on Discrete Mathematics, 29(3):1452–1475, 2015.
Network Unraveling
A Social group tends to be a k-core in the network
K. Bhawalkar, J. Kleinberg, K. Lewi, T. Roughgarden, and A. Sharma. Preventing unraveling in social networks: the anchored k-core problem. SIAM Journal on Discrete Mathematics, 29(3):1452–1475, 2015.
• • •
Network Unraveling
Founded in 2002.
Popular at early 21st century with over 115 million users. Suspended in 2015 for lack of user engagement.
D. Garcia, P. Mavrodiev, and F. Schweitzer. Social resilience in online communities: the autopsy of friendster. In COSN, pages 39–50, 2013.
The core number steadily increased.
• • •
Network Unraveling
Founded in 2002.
Popular at early 21st century with over 115 million users. Suspended in 2015 for lack of user engagement.
K. Seki and M. Nakamura. The collapse of the friendster network started from the center of the core. In ASONAM, pages 477–484, 2016.
The collapse started from the center of the core.
• • •
Network Unraveling
Founded in 2002.
Popular at early 21st century with over 115 million users. Suspended in 2015 for lack of user engagement.
J. Ugander, L. Backstrom, C. Marlow, and J. Kleinberg. Structural diversity in social contagion. PNAS, 109(16):5962–5966, 2012.
Social influence is tightly controlled by the number of friends in current subgraph, like k-core.
• • •
Network Unraveling
Founded in 2002.
Popular at early 21st century with over 115 million users. Suspended in 2015 for lack of user engagement.
F. D. Malliaros and M. Vazirgiannis. To stay or not to stay: modeling engagement dynamics in social graphs. In CIKM, pages 469–478, 2013.
The degeneration property of k-core can be used to quantify engagement dynamics.
Prevent Network Unraveling
𝑢! is called an anchor.
Prevent Network Unraveling
Follower: a node v is a follower of an anchor u, if v is not in k- core but belongs to anchored k-core by anchoring u.
Anchored k-Core Problem: Given two integers k and b, find b anchors to maximize the number of followers (i.e., maximize the number of nodes in anchored k-core ).
Performance Evaluation
• Datasets:
• Environments:
• Intel Xeon 2.3GHz CPU and Redhat Linux System.
• All algorithms are implemented in C++.
Case Studies
Yelp is a crowd-sourced local business review and social networking site.
DBLP is a computer science bibliography website.
Effectiveness: Number of Followers
Efficiency
Diversified Top-k Clique Search
ICDE 2015
Clique and Maximal Clique
• Given a graph G, a clique is a set of nodes such that for any pair of them have an edge
• A clique is called maximal clique if there exist no other bigger cliques that contain it
v1 v6v8 v0 v4
v5 v9
v2 v3 v7 v10
Traditional models for Maximal Clique
• MaximalCliqueEnumeration:
• Enumerate all the maximal cliques in the graph
• Exponential number of maximal cliques
• Large redundant and useless information
• Top-KMaximalClique:
• Return top-k maximal cliques with largest size
• Still contain large redundancy and hold little information as a whole.
Diversified Top-K Clique
Traditional top-k clique high overlap
v1
v6
v5
v7
v8
v9
v10
v0
v2
v4
v3
Diversified Top-K Clique
Diversified top-k clique: cover more vertices
v1
v6
v5
v7
v8
v9
v10
v0
v2
v4
v3
Diversified Top-K Clique Search
• Diversified Top-K Clique:
• Given: a graph G and an integer k,
• Maximize |𝑪𝟏 ∪ 𝑪𝟐 … 𝑪𝒌|
• s.t 𝑪𝒊 (𝟏 ≤ 𝒊 ≤ 𝒌) is a maximal clique
• NP-hard Problem • Advantage:
• Consider both size and diversity, provide a better query result for users.
Baseline Solutions
• EnumAll
Phase 1: Enumerate all the maximal cliques in the graph(D.
Eppstein et al., SEA’11)
Phase 2: Greedily select k cliques from all the maximal cliques
that cover most nodes in the graph
• Problem
• Clique enumeration is a costly operation
• Keeping all maximal cliques in memory is infeasible
• The number of cliques is exponential to the number of nodes
Baseline Solutions
• EnumSub(sample-based enumeration)
Phase 1: Sample a subset of the maximal cliques in the graph(J.
Wang et al., KDD’13)
Phase 2: Greedily select k cliques from the sampling that cover
most nodes in the graph
• Problem
• It still outputs exponential number of maximal cliques without a bound
Challenge
• Retain the result quality, while avoid:
• generating all maximal cliques, and
• keeping all generated maximal cliques in memory
Our Approach: extending online k coverage problem
• Main Idea
1. online model storing k maximal cliques in memory 2. update top-k candidate set when enumerate cliques • replace small existing cliques with big new cliques
Input Graph
Top-K Candidate Set
Our Approach: extending online k coverage problem
• Main Idea
1. online model storing k maximal cliques in memory 2. update top-k candidate set when enumerate cliques • replace small existing cliques with big new cliques
Input Graph
Top-K Candidate Set
Our Approach: extending online k coverage problem
• Main Idea
1. online model storing k maximal cliques in memory 2. update top-k candidate set when enumerate cliques • replace small existing cliques with big new cliques
Input Graph
Top-K Candidate Set
Our Approach: extending online k coverage problem
• Main Idea
1. online model storing k maximal cliques in memory 2. update top-k candidate set when enumerate cliques • replace small existing cliques with big new cliques
Input Graph
Top-K Candidate Set
Our Approach: extending online k coverage problem
• Main Idea
1. online model storing k maximal cliques in memory 2. update top-k candidate set when enumerate cliques • replace small existing cliques with big new cliques
Input Graph
Top-K Candidate Set
Our Approach: extending online k coverage problem
• Main Idea
1. online model storing k maximal cliques in memory 2. update top-k candidate set when enumerate cliques • replace small existing cliques with big new cliques
Input Graph
Top-K Candidate Set
Replacement Strategy
C2
B
AA
C1
which one to be replaced
• private set
• Cmin: clique with smallest private set
what is the replacement condition
• |𝐶|>|𝐵|+ 𝛼 1 |𝐴|+|𝐵| /𝑘
• 𝛼isaparameter(0< 𝛼 ≤1)
C3
C
•
•
Advantages of Our Approach
• Guaranteedresultquality
• Achieve a guaranteed approximation ratio of 0.25, and much better in
practice.
• Lowmemoryconsumption
•
• Instead of all maximal cliques, just K most promising candidates are kept in memory
Efficiency and Scalability ?
•
•
Cost Analysis
A naïve implementation of our approach:
• for each generated maximal clique C
• update top-k candidate set by C with replacement condition
The time complexity is :
𝑂( ∗𝑘∗ + 𝑇𝑒𝑛𝑢𝑚(𝐺))
|𝒜|
𝐶𝑚𝑎𝑥
size of maximum clique number of maximal cliques
time for maintaining top-k candidate set
time for enumerating maximal cliques
•
•
Cost Analysis
A naïve implementation of our approach:
• for each generated maximal clique C
• update top-k candidate set by C with replacement condition
The time complexity is :
𝑂( ∗𝑘∗ + 𝑇𝑒𝑛𝑢𝑚(𝐺))
|𝒜|
size of maximum clique number of maximal cliques
time for maintaining top-k candidate set
time for enumerating maximal cliques
Pruning Rules
PNP-Index
𝐶𝑚𝑎𝑥
C2
v3
PNP-Index
v11 v6
v12
v7
v9
v10
C1
v4
v5
v1
v2
v8
C3
v1
v2
v3
v4
v5
v6
v7
v8
v9
v10
v11
v12
|priv(C)|
c1
1!
1!
1
1
1
2
c2
1
1
1
1!
1
1∗
c3
1
1
1!
1!
1!
3
|rcov(v)|
1
1
2
2
3
1
2
1
1
1
|cov|:10
PNP-Index
• Step1:Checkthereplacementcondition
C2
v3
v4
v5
v11 v6
C4
v12
C1
v7
v9
v1
v2
v10
v8
C3
v1
v2
v3
v4
v5
v6
v7
v8
v9
v10
v11
v12
|priv(C)|
c1
1!
1!
1
1
1
2
c2
1
1
1
1!
1
1∗
c3
1
1
1!
1!
1!
3
|rcov(v)|
1
1
2
2
3
1
2
1
1
1
|cov|:10
c4
1
1
1
1
1
PNP-Index
• Step1:Checkthereplacementcondition
v11 v6
C4
C2
v3
v4
v5
𝟑 > 𝟏 + 𝟎. 𝟓 ×𝟏𝟎 ÷ 𝟑 = 𝟐. 𝟔𝟕 v7 𝜶
C1
v12 v9
v1
v2
v10
v8 C3
v1
v2
v3
v4
v5
v6
v7
v8
v9
v10
v11
v12
|priv(C)|
c1
1!
1!
1
1
1
2
c2
1
1
1
1!
1
1∗
c3
1
1
1!
1!
1!
3
|rcov(v)|
1
1
2
2
3
1
2
1
1
1
|cov|:10
c4
1
1
1
1
1
• Step2:DeleteC2 C2
PNP-Index
v11 v6
v12
C1
v3
v4
v5
v7
v9
v1
v2
v10
v8
C3
v1
v2
v3
v4
v5
v6
v7
v8
v9
v10
v11
v12
|priv(C)|
c1
1!
1!
1!
1!
1
4
c3
1
1!
1!
1!
1!
4
|rcov(v)|
1
1
1
1
2
1
1
1
1
|cov|:9
• Step3:InsertC4
v6
v11
C4
v12
PNP-Index
C1
v3
v4
v5
v7
v9
v1
v2
v10
v8
C3
v1
v2
v3
v4
v5
v6
v7
v8
v9
v10
v11
v12
|priv(C)|
c1
1!
1!
1!
1!
1
4
c4
1!
1
1
1!
1!
3
c3
1
1
1!
1
1!
2∗
|rcov(v)|
1
1
1
1
2
1
2
1
2
1
1
1
|cov|:12
PNP-Index
• An naïve implementation for candidate set maintenance needs 𝑂(|𝒜| ∗ 𝑘 ∗ |𝐶”#$|)
• With the help of PNP-Index, our algorithm can only take 𝑂(∑% ∈𝒜 |𝐶|) time
• •
•
Pruning Rules
Global Pruning: based on k-core and graph coloring
Local Pruning: check the current coverage of the neighbourhood
Initial Candidate Computation: increase the power of the above pruning processes
Performance Evaluation – Efficiency
• Vary k (top-𝑘 value) and report total processing time
Vary k
Performance Evaluation – Effectiveness
• Vary k(top-𝑘 value) and report covered nodes
Vary k
Index-based Optimal Algorithms for Computing Steiner Components with Maximum Connectivity
SIGMOD 2015
k-edge connected components
• Computingk-edgeconnectedcomponentsplaysavital role in graph-based analysis
■
A graph is k-edge connected if it is still connected after removing any set of (k-1) edges from it
8
1
6
11
7 9 10
2-edge connected components
2
4
35
Steiner Maximum-Connected Component (SMCC) and Steiner-Connectivity
• SMCCofq:maximalsubgraphofGcontainingq⊆V(G) with maximum connectivity.
■ Steiner-connectivity of q: the connectivity of the SMCC of q. Denoted sc(q).
Steiner Maximum-Connected Component (SMCC) and Steiner-Connectivity
2
3
8
1
6
4
11
7 9 10
5
the SMCC of {3,4,5}
Challenges
■ Challenge-I: connectivities of subgraphs are not monotonic.
8
1
6
4
g1
5
g3
g2
2
7
3
Challenges
• Challenge-II:enumeratingsubgraphsrequires exponential time.
■ Enumerate all subgraphs containing q, and choose the maximal one with maximum connectivity.
Naive Solution
• SMCCofqisthek-edgeconnectedcomponentofG containing q with the maximum k
• Computek-edgeconnectedcomponentsby
enumerating k from |V| to 1.
■ Still expensive, because it needs to traverse the entire graph G multiple times
Overview of Our Approach
• Recallsc(q):steiner-connectivityofq ■ sc(u,v): steiner-connectivity of {u,v}
• Keyobservations:
■ 1: for a fixed u∈q, sc(q) = minv∈q sc(u,v).
■ 2: for a fixed u∈q, SMCC of q is the set of vertices v satisfying sc(u,v)≥sc(q).
Framework
• Phase-I:offlineindexconstruction
■ Step-1: compute sc(u,v) for all edges (u,v) in G
■ Step-2: compute index based on these sc(u,v)
• Phase-II: online query processing
■ Step-1: compute sc(q)
■ Step-2: start from any vertex u∈q to extend result set to include all vertices v with sc(u,v)≥sc(q)
Index Construction
• Computesc(u,v)foreveryedge(u,v)inG,insteadof for all vertex pairs
Time complexity: O(α(G)×kECC(G)), where α(G) is the arboricity of a graph G, kECC(G) is the cost of computing kECC of G [SIGMOD 13].
■ Algorithms of computing k-edge connected components for a fixed k
■ Deterministic algorithm: O(h×l×|E|) ■ By Chang. et al in SIGMOD’13
■ Randomized algorithm: O(t×|E|)
■ By Takuya Akiba, Yoichi Iwata, Yuichi Yoshida in CIKM’13
•
Index Construction
sc(u,v) equals the maximum among the minimum weights of (edges in) all paths between u and v.
Compute sc(1,7)
1 363
4 4 4 3 7 33 8
2433 44439
432
3 4 5 10 3 11 2 333
12 3 13
3
Index Construction
• Indexstructure:the maximum spanning tree (MST) T of the weighted graph.
■ Time complexity: O(|E|).
■ sc(u,v) equals the minimum weight in the path between u and v in the MST T.
1
4444
2354
233 12 9 7
33 10
3
3
6
3
13 11 8
Compute sc(1,7)
Query Processing: Steiner-Connectivity Query
1
4444
2354
2
12 9
33 10
3
Algorithm: find the spanning subtree Tq of T
connecting vertices in q, and return the minimum weight in Tq as sc(q).
33
7
Time complexity: O(Tq)
3
6
3
13 11 8
Compute sc(12, 9, 4)
Query Processing: SMCC Query
Algorithm: after obtaining sc(q), extend the subgraph from q by visiting edges with weights at least sc(q).
Time complexity: O(Vq), where Vq is the result set.
• StatisticsofData
Performance Studies
SMCC Query
SMCC-BLE: baseline algorithm SMCC-OPT: optimal algorithm
Cohesive Subgraph Search: summsry
• Related Applications
– Product Recommendation, Investment Analysis etc.
• Algorithms and Recent Publications
– Clique
• In-memory approximate algorithm, ICDE 2015
• Graph partition based I/O efficient search algorithm, VLDBJ 2016
– K-ECC
• Graphdecompositionbasedalgorithm,SIGMOD2013
• I/O efficient algorithm to compute the k-edge connected component for all k values, VLDBJ 2015
– SteinerMaximum-ConnectedComponent
• Index-based in-memory algorithm, SIGMOD 2015 • Index-based semi-external algorithm, VLDBJ 2017
Cohesive Subgraph Search
• Algorithms and Recent Publications
– K-Core
• Pruning based search algorithm, ICDE 2018
• upper and lower bound based pruning rules on uncertain graph, ICDE 2018 • I/O efficient core decomposition, ICDE 2016 (Best Paper Award).
– Critical Users
• Iteratively find a best user to anchor based on the heuristics, such that the
corresponding subgraph (e.g., k-core) is the largest, VLDB 2017, ICDE 2018.
• Iteratively find a best user and delete, such that the corresponding k-core is the smallest, AAAI 2017.
– Multi-dimensional Subgraph Search
• Solid pruning rules based algorithm on search tree with fast search orders, VLDB 2017.
Algorithm Analysis
Bounded Memory.
– Follow the semi-external model and require only O(n) memory.
Read I/O Only.
– Only require read I/Os by scanning the node and edge tables sequentially on disk.
Simple In-memory Structure and Data Access. – Use two vectors to save the core number and count value for each node.
K-Edge Connected Component Search
v1
v6 v7
v9
v8
v11
v13
Applications:
• Social behaviour mining
• Community detection
• Graph visualization
Challenges:
• Input graph is very large Publications:
• Graph decomposition based algorithm ,SIGMOD 2013
• I/O efficient algorithm to compute the k-edge
connected component for all k values, VLDB 2015
Contributions:
• Our in-memory algorithm outperforms the state-of- the-art by several orders of magnitude
• I/O efficient algorithm can handle billion-edges scale graphs on a single consumer grade computer
v2
v3
v4
v5
v10
3-ECC v12
K-Edge Connected Component:
• K-edge Connected:a graph is connected after the removal of any (k-1) edges
• Maximal subgraph which is k-edge connected
Problem Definition:
• Given a graph G and an integer k, computes the k-edge connected component of G
Steiner Maximum-Connected Component Search
v1
v6
v9
Applications:
• Potential customer prediction
• Product promotion
• Research team assembling
Challenges:
• Input graph can be very large Publications:
• Index-based in-memory algorithm,SIGMOD 2015
• Index-based semi-external algorithm,VLDBJ 2017
Contributions:
• Our in-memory algorithm outperforms the state-of- the-art by several orders of magnitude
• I/O efficient algorithm can handle billion-edges scale graphs on a single consumer grade computer
v7
v8
v11
v13
v2
v3
v4
v5
v10
v12
Problem Definition:
• Given a set of nodes in G, compute the k-edge connected component with the maximum k value that contains the given nodes (Steiner Maximum-Connected Component)
Efficient Computing of Radius-Bounded k-Cores
Applications:
• Personalized event recommendation
• Social marketing
Challenges:
• The location of the radius-bounded circle of a RB-k-core is unknown.
• It is time-consuming to verify all the candidate subgraphs individually.
Problem definition: Publications:
• Our techniques can be applied to solve the SAC search problem published in VLDB 2017 and achieve a speed-up twice.
• Given a graph G, the k-core of G is a maximal
subgraph where each node has at least k
neighbors in the subgraph. Contributions:
• Given a geo-social network (as shown above), a query vertex q (Leo), a radius r (2km) and k (3), find k-cores containing the query vertex q covered by circles with radius r.
• Pruning based search algorithm, ICDE 2018
K-Core Search on Uncertain Graphs
c d
Problem Definition:
• Given an uncertain graph(edge with
Applications:
• Cohesive subgraph detection
• Vital vertices detection
• Influential communities detection on social network
Hardness:
• NP-hardness problem
• Prohibitive to enumerate all possible result sets
Algorithm and Published Paper:
• Search algorithm based upper and lower bound pruning rules, ICDE 2018
Contributions:
• Advanced algorithm about one order faster than baseline on average
• Propose a new k-core model on uncertain graph, where result sets
outperform previous models on average influences test of social networks
probability), find a vertex set where the probability of every node to be in a k-core is no less than 𝜃(given by users)
• Used in undirected graph and easily to extend to directed graph
Example:
• P(d is not a 2-core) = P(d not in 2-core(d,f,g)
and not in 2-core(c,d,e)) = 0.766
• (c,d,e) is the result set of k=2,𝜃=0.23
e
Find Critical Users to Prevent Network Unraveling
Problem:
𝑢! shows.
• The proposed algorithms outperform the state-of-the-art algorithms by 3 orders of magnitude in runtime.
• In Yelp, the engagement of Caley can enhance the engagement of other 31 users.
• k-core: a maximal subgraph where each vertex is adjacent to at least k vertices in the subgraph.
• Given a network G, find b critical users whose engagement can significantly prevent network unraveling, i.e., find b vertices whose persistent existence leads to the k-core with the largest number of vertices.
• NP-hard, NP-hard to approximate.
Example:
• A vertex represents a user.
• An edge represents there is a
relationship between two users.
• The number of vertices in 3-core
represents the stability of
network.
• Encourage the engagement of
Applications:
• Efficiently find the critical users to
reinforce the network.
Algorithms and Publications:
• Iteratively find a best user to anchor
based on the heuristics, such that the corresponding subgraph (e.g., k-core) is the largest, VLDB 2017, ICDE 2018.
𝑢!can greatly enlarge the 3-
core, as the anchored 3-core by Contributions:
Find Critical Users to Reinforce Communities
Problem:
• k-core: a maximal subgraph where each vertex is adjacent to at least k vertices in the subgraph.
• Given a network G, find b critical users whose engagement can significantly reinforce the communities, i.e., find b vertices whose leave leads to the k-core with the smallest number of vertices.
• NP-hard.
Example:
• A vertex represents a user.
• An edge represents there is a
relationship between two users.
• The number of vertices in 3-core
represents the stability of
network.
• The leave of 𝑢!! can greatly
collapse the 3-core, as the collapsed 3-core by 𝑢!! shows.
Applications:
• Efficiently find the critical users to
reinforce the communities.
Algorithms and Publications:
• Iteratively find a best user and delete,
such that the corresponding k-core is the smallest, AAAI 2017.
Contributions:
• Our algorithm outperforms the state-of-
the-art algorithm by 4 times in runtime.
• In DBLP, the co-author network, the leave of Ying Li will break the engagement of other 74 users (who also leave the k-core).
GRAPH SIMILARITY AND CLUSTERING
Graph Similarity and Clustering: Summary
• Related Applications
– Product Recommendation, Investment Analysis etc.
• Algorithms and Recent Publications
– Graph Edit Distance
• Path-match-basedfilteringalgorithm,ICDE’12
• Subgraph-match-basedfilteringalgorithm,VLDB’14
– SimRank
• AdjustableclusteringstrategyforSimRank(OIP-SR),ICDE’13 • HyperLink-basedSimRankcompute(MEMO-SR),VLDB’13
• IncrementallySimRankcompute(Inc-SR),ICDE’14
– Graph Clustering
• Coreclusteringandnon-coreclustering(pSCAN),ICDE’16
• Dynamicmaintenanceofclusteringstructures(dSCAN),TKDE’17 • Similarity-sequence-index-basedalgorithm(GS-SCAN)VLDB’18
Graph Similarity – Edit distance
Levodopa
Droxidopa
Problem Definition:
• Edit:for Vertex: Add/Delete/Alter Label;
Applications:
• Chemistry: Find similar organic compound.
• Biology: Find similar protein-DNA interactions.
• Finger print recognition: Identify criminal suspect.
Challenges:
• Graph edit distance problem is NP hard.
Algorithms and publications
• Path-match-based filtering algorithm, ICDE’12
• Subgraph-match-based filtering algorithm, VLDB’14
for Edge: Add/Delete/Alter Label; Contributions:
• Given a graph pair (r, s), their edit distance is defined as the minimum edit operations to turn r into s.
• Given two graph sets (R, S), searching for all pairs of (r, s) such that their edit distance is within a given threshold.
• Subgraph-match-based algorithm is 40 times faster than previous work.
Graph Similarity – SimRank
Problem definition:
• In terms of structure, two vertices are similar when their neighbours also similar.
• Given two vertices a, b, and their in neighbours, I(a), I(b), we can compute their SimRank as:
Applications:
Collaborative filtering, Page rank, Graph clustering, Link prediction
Challenges:
• Hard to find proper ranking metrics in practice.
• The complexity is O(Kmn), where K is the iteration, m is the number of
edges, and n is the number of vertices.
• Hard to incrementally update while graph data changes
Algorithms and publications
• Adjustable clustering strategy for SimRank (OIP-SR), ICDE’13
• HyperLink-based SimRank compute (MEMO-SR), VLDB’13
• Incrementally SimRank compute (Inc-SR), ICDE’14
Contributions:
• OIP-SR improves the complexity, with 10 times of performance gain.
• MEMO-SR enriches the semantics of SimRank.
• Inc-SR can apply to dynamic graph.
Graph Clustering
Problem Definition
• Given a graph, divide the vertices to different groups based on their similarities.
Applications:
Community detection, Graph partition, Graph visualisation, Hidden structure detection, Gene array.
Challenges:
• Costly to compute pair-wise similarities.
• Parameters are hard to adjusted.
• Graph data is large, and evolving.
Algorithms and publications:
• Core clustering and non-core clustering (pSCAN), ICDE’16
• Dynamic maintenance of clustering structures (dSCAN), TKDE’17
• Similarity-sequence-index-based algorithm (GS-SCAN) VLDB’18
Contributions:
• pSCAN is over 10 times faster than the state-of-the-art (SCAN++)
• GS-SCAN proposes a novel data structure, resulting in quite clustering regardless of the parameters, and is 10-1000 times faster than pScan.
DISTRIBUTED GRAPH COMPUTING
Distributed Graph Computing – SGC Model
Scalable Graph Computing (SGC):
• m: #edges,n: #vertices,t: #machines
• A distributed graph algorithm belongs to SGC
if it satisfies:
Applications:
• The SGC algorithms can easily scale out, that is the performance of the algorithm increases nearly linearly with the number of machines
Challenges:
• Hard to design a SGC algorithm, especially targeting O(logn) iterations Algorithms and publications:
• Scalable Graph Computing Model (SGC), SIGMOD’14
Contributions:
• Design and implement two SGC operators, N and E
• Implement many graph algorithms using N and E in MapReduce
• Empirically prove that it reaches better scalability than the state-of-
the-art
Metrics
per machine
Total
Disk
𝑂(m/t)
𝑂(m)
Mem
𝑂(1)
𝑂(𝑡)
Comm.
𝑂(m/t)
𝑂(𝑚)
CPU
𝑂(m/t)
𝑂(𝑚)
Iteration
lo𝑔(𝑛)
Distributed Graph Computing – Connected Component
Problem definition:
• Connected component (CC): the maximal subgraph that is connected
• Double connected component (DCC): the maximal subgraph that is a connected component after removing any vertex.
• Given a large graph, compute CC and DCC
Applications:
• CC is the initial steps of many graph computations.
• Aid the analysis of big graphs, help to reveal the anchor vertices.
Challenges:
• Big graph (billion-scale vertices)
Algorithms and publications:
• CC based on graph decomposition and DCC based on label propagation,ICDE’16
Contributions:
• Novel algorithmic framework that reduces a lot of communication
• 50 times faster than the state-of-the-art
Big Graph
• Algorithms and Recent Publications – Distributed Graph Computing Model
• Scalable Graph Computation Model, SIGMOD’14 – Distributed Connected Component
• CC based on graph decomposition and DCC based on label propagation, ICDE’16
– Distributed Scalable Subgraph Enumeration
• Optimal Distributed Join-based algorithms, VLDB’15, VLDB’17
Other Graph problems
Other Graph Problems – Coloring, Eccentrality,
Shortest Path, etc.
• Related Applications
– Product Recommendation, Investment Analysis, Anti Monty Laundering,
Retail Service etc.
• Algorithms and Recent Publications
– Graph Coloring
• Dynamic graph colouring based on local update, VLDB 2018
– Eccentricity computation
• Local search algorithm based on lower and upper bounds, ICDE 2018
– Shortest Path
• Two-hop-labelling shortest path query on road network, SIGMOD 2018 • Efficient Top-k shortest path join, EDBT 2015
Graph Colouring
Graph colouring:
• Colour the graph with minimum colours such that any two neighbouring vertices bear different colours.
Problem definition:
• Process graph colouring when the graph is constantly changing (insertion and deletion)
Applications:
• Channel assign in wireless network
• Airline control and management
• Community detection in social network
• An initial step for the other graph, such as maximum
clique and minimum cut
Challenges:
• NP hard
• Large graph data
• Frequent update
Algorithms and publications:
• Dynamic graph colouring based on local update,VLDB 2018
Contributions:
• The algorithm uses 1⁄2 less colours than previous work, and offers ms response to graph update.
Eccentricity computation
Example:
• V8‘s eccentricity is 4, as longest shortest
distance from V8 is 4, towards V11.
• The smaller the eccentricity is, the more
centric (darker) the vertex is.
Problem definition:
• The eccentricity of a vertex in the graph is the longest shorted distance this vertex can reach the other vertices.
• It requires to solve APSP problem for computing the eccentricity of all vertices in the graph, while the error rate is high in small-world graph model.
Applications:
• Vertex ordering (via their importance)
• Compute some other graph features (diameter, radius
etc.)
Algorithms and publications:
• Local search algorithm based on lower and upper bounds, ICDE 2018
Contributions:
• Three orders of magnitude faster than previous work.
Big Graph Characteristics
Volume
• Petabytes
• Records
• Transactions
Big Data
Variety
• Structured
• Unstructured
• Semi- structured
Velocity
• Batch
• Real time • Streaming
Major Research Issues
Challenges and Opportunities
New Computing Platform/Architecture New Graph Analytics Models
New Processing Algorithms & Indexing Techniques
Graph Processing System
§ Primitive operators
§ Query language
§ Distributed techniques § Storage
§ etc
Research Collaborations: Huawei Cohesive Subgraph
Exploration
❑ Cohesive subgraph:
❖ Subgraphs whose internal vertices are
densely connected, while there are few
edges in between.
❖ Core member detection, friend
recommendation, community detection.
❑ Main Challenges:
❖ Big Graph Data, and computing
complexities
❖ High-dimensional graph data:
Heterogeneous edges
❖ Dynamic Graph: Graph data is evolving
One of the applications on Huawei public cloud
Alibaba Big Graph Computing – FLASH Language
❑ FLASH Query Language:
❖ Onlythreeoperators:Filer,LocAl,PuSH
❖ Greatexpressionpower:Capableofexpressionover100graphcomputingcases. ❖ ConvenientProgramming:Programmingmostcaseswithin10lines.
❑ Progress:
❖ Prototypingsystemaredeployedforproductionuse(indistributedcontext) ❖ Systemoptimizationsandtunings
Alibaba Big Graph Computing – Biclique Fraud detection
“Clients”
Products • Big Data:100-million-scale users and products, billion-scale purchasing edge
Challenges:
• Computationally intensive: NP Complete
• Dynamic Data: Frequent buying and selling on Taobao
Solutions: In the client-product bi-graph, fake ordering is modelled as a bi-clique, which infers that there exists a potential fake ordering when a group of people simultaneously purchase a set of products.
Alibaba’s “Double 11” shopping festival use case:
• Efficiency: Reduce the computation on billion-scale graph from days to hours
• Effectiveness: Recall over 60% issued deals, improved traditional approach by 40%
Alibaba Big Graph Computing- RT Cycle Detection
Applications:
• Fraud detection
• Money laundering detection
Challenges:
• Large-scale dynamic graph (100-million-scale vertices, billion-scale
edges)
• High demand on real-time processing: 10-50ms response time, while
previous system takes 50s
Solutions:
• Real-time cycle detection on each edge
• Light-weighted index structure (easy to maintain and update) and
efficient query algorithm
Real-data simulation:
• Detect 6-edge cycles within 10ms
Graph System
• Graph system should support – Traversal Queries
• Explore the neighbouring information of given nodes in real time. – Iterative Computation:
• BFS, Page rank, Shortest path, Connected component, Label propagation etc – Graph Analytics
• Graph pattern matching, Community detection, Similarity, Graph Clustering etc.
Graph System
• Existing systems support each type of query very efficiently, but none covers the whole picture
– Combine several systems
• Complexitiesofmaintenancearerising • Overheadsofdatatransformation
– Simulate graph processing in general engine
• Spark’s GraphX, Flink’s Gelly, etc.
• Theperformanceisnotcompetitivetothegraph-specificsystems
Graph Processing System Development
Query Languages and Visualisation
Traversal Queries
Iterative Computation
Graph Analytics
Basic Graph Operators
High-performance Data Engine
Distributed Graph Storage
References:
Our Publications
1. F. Bi, L. Chang, X. LIN, W. Zhang, An Optimal and Progressive Approach to Online Search of Importance-based Top-K Communities, to appear in VLDB2018.
2. B. Lv, L. Qin, X. LIN, L. Chang, J. Yu, Supergraph Search in Graph Databases vis Hierarchical Feature-Tree, to appear in TKDE (accept in April, 2018. Subbmited before my EIC term.)
3. D. Wen, L. Qin, Y. Zhang, X. LIN, J. Yu, I/O Efficient Core Graph Decomposition: Application to Degeneracy Ordering, to appear in TKDE (Best Paper Award in ICDE 2016).
4. D. Ouyang, L. Qin, L. Chang, X. LIN, Y. Zhang, Q. Zhu, When Hierarchy Meets 2-Hop-Labeling: Efficient Shortest Distance Queries on Road Networks, to appear in SIGMOD 2018.
5. J. Yang, W. Zhang, S. Yang, Y. Zhang, X. LIN, L. Yuan, Efficient Set Containment Join, to appear in VLDB Journal (accepted in March, 2018).
6. K. Wang, X. Cao, X. LIN, W. Zhang, L. Qin, Efficient Computing of Radius-Bounded k-Cores, to appear in ICDE 2018.
7. Y. Peng, Y. Zhang, W. Zhang, X. LIN, L. Qin, Efficient Probabilistic K-Core Computation on Uncertain Graphs, to appear in ICDE 2018.
8. F. Zhang, Y. Zhang, L. Qin, W. Zhang, X. LIN, Efficient Reinforcing Social Networks over User Engagement and Tie Strength, to appear in ICDE2018.
9. J. Qin, Y. Wang, C. Xiao, W. Wang, X. LIN, Y. Ishikawa, GPH: Similarity Search in Hamming Space, to appear in ICDE2018.
10. W. Li, M. Qiao, L. Qin, Y. Zhang, L. Chang, X. LIN, Exacting Eccentricity for Small-World Networks, to appear in ICDE2018.Y. Chen, X. Zhao, X. LIN, Y. Wang, D. Guo, Efficient Frequent Sungraph Mining on Uncertain Graphs, to appear in TKDE (accepted in Jan, 2018, submitted before my EIC term).
11. W. Yu, X. LIN, W. Zhang, and J. McCann. Dynamical SimRank Assessment on Time-Varying Networks. to appear in VLDB Journal. (33 pages, Accepted in OCT 2017).
12. X. Zhao, C. Xiao, X. LIN, Wenjie Zhang, Yan Wang, Efficient Structure Similarity Searches: A Partition-Based Approach, to appear in VLDB Journal.
13. W. Liu, Z. Liu, I. W. Tsang, W. Zhang, X. LIN, Doubly Approximate Nearest Neighbor Approximation, to appear in AAAI 2018
14. L. Yuan, L. Qin, X. LIN, L. Chang, W. Zhang, Effective and Efficient Dynamic Graph Coloring, to appear in VLDB 2018.
15. D. Wen, L. Qin, L. Chang, Y. Zhang, X. LIN, Efficient Structural Graph Clustering: An Index-Based Approach, to appear in VLDB2018.
References:
Our Publications
16. F. Zhang, Y. Zhang, L. Qin, W. Zhang, X. LIN, When Engagement Meets Similarity: Efficient (k, r)-Core Computation on Social Networks, to appear in VLDB2017.
17. Fan Zhang, Wenjie Zhang, Ying Zhang, Lu Qin, XUEMIN LIN, “OLAK: An Efficient Algorithm to Prevent Unraveling in Social Networks”, to appear in 42nd International Conference on Very Large Data Bases (VLDB), 2017
18. F. Zhang, Y. Zhang, L. Qin, W. Zhang, X. LIN, Finding Critical Users for Social Network Engagement: The Collapsed k-Core Problem, to appear in AAAI2017.
19. T. Gao, X. Cao, G. Cong, J. Lu, X. LIN, Distributed Algorithms on Exact Personalized PageRank, to appear in SIGMOD2017.
20. L. Yuan, L. Qin, X. LIN, L. Chang, W. Zhang, I/O Efficient ECC Graph Decomposition via Graph Reduction, to appear in VLDB Journal.
21. L. Chang, C. Zhang, X. LIN, L. Qin Scalable Top-K Structural Diversity Search, Short Paper, Proceedings of the 32st International Conference on Data Engineering (ICDE2017), 2017
22. L. Lai, Q. Lu, X. LIN, Y. Zhang, L. Chang, Scalable Distributed Subgraph Enumeration, VLDB 2017.
23. F. Bi, L. Chang, X. LIN, L. Qin, W. Zhang, Efficient Subgraph Matching by Postponing Cartesian Products, SIGMOD 2016.
24. H. Wei, J.X. Yu, C. Lu, X. LIN, Speedup Graph Processing by Graph Ordering, to appear in SIGMOD 2016.
25. . Yuan, L. Qin, X. LIN, L. Chang, W. Zhang, I/O Efficient ECC Graph Decomposition via Graph Reduction, VLDB2016.
26. B. Lyu, L. Qin, X. LIN, L. Chang, J.X. Yu, Scalable Supergraph Search in Large Graph Databases, ICDE 2016.
27. X. Feng, L. Chang, X. LIN, L. Qin, W. Zhang, Computing Connected Components with Linear Communication Cost in Pregel-like Systems, ICDE 2016.
28. L. Chang, W. Li, X. LIN, L. Qin, W. Zhang, pScan: Fast and Exact Structural Graph Clustering, ICDE 2016.
29. D.-W. Choi, J. Pei, X. LIN, Finding the Minimum Spatial Keyword Cover, ICDE 2016.
30. D. Wen, L. Qin, Y. Zhang, X. LIN, J.X. Yu, I/O Efficient Core Graph Decomposition at Web Scale, ICDE 2016 (BEST PAPER AWARD).
31. W. Zhang, X. LIN, Y. Zhang, K. Zhu, G. Zhu, Efficient Probabilistic Supergraph Search, to appear in IEEE Transactions on Knowledge and Engineering (TKDE, accepted in Nov 2015)
32. L. Yuan, L. Qin, X. LIN, L. Chang, W. Zhang, Diversified Top-K Clique Search, to appear in VLDB J, (accepted in Oct 2015)
References:
Our Publications
33. L. Chang, X.LIN, Q. Lu, J .X. Yu, W. Zhang, Index-based Optimal Algorithms for Computing Steiner Components with Maximum Connectivity, SIGMOD 2015.
34. L. Chang, X. LIN, Q. Lu, J.X. Yu, J. Pei, Efficiently Computing Top-k Shortest Path Join, to apeear in EDBT2015.
35. L. Lai, L. Qin, X. LIN, L. Chang, Scalable Subgraph Enumeration in MapReduce, to appear in VLDB 2015.
36. L. Chang, X. LIN, W. Zhang, J.X. Yu, Y. Zhang, L. Qin, Optimal Enumeration: Efficient Top-k Tree Matching, to appear in VLDB 2015.
37. X. Wang, Y. Zhang, W. Zhang, X. LIN, W. Wang, Selectivity Estimation On Streami SpatioTextual Data Using Local Correlations, to appear in VLDB2015.
38. J. Wang, S. Song, X. LIN, X. Zhu, J. Pei, Clean Structured Even Logs: A Graph Repair Approach, in ICDE2015.
39. L. Yuan, Lu. Qin, X. LIN, L. Chang, W. Zhang, Diversified Top-K Clique Search, to appear in ICDE2015.
40. X. Wang, Y. Zhang, W. Zhang, X. LING, W. Wang, AP-Tree: Efficiently Support Continuous Spatial-Keyword Queries Over Stream, to appear in ICDE2015
41. Z. Zhang, J.X. Yu, L. Qin, L. Chang, X. LIN, I/O Efficient: Computing SCCs in Massive Graphs, to appear in VLDB Journal (accepted in Sept, 2014).
42. W. Yu, X. LIN, W. Zhang, J. A. McCann, Fast All-Pairs SimRank Assessment on Large Graphs and Bipartite Domains, to appear in TKDE.
Our Publications
References:
43. X. Wang, Y. Zhang, W. Zhang, X. LIN, Efficiently Identify Local Frequent Keyword Co-occurrence Patterns in Geo-tagged Twitter Stream, SIGIR 2014 (short paper): 1215-1218.
44. L. Qu, J.X. Yu, L. Chang, H. Cheng, C. Zhang, X. LIN, Scalable Big Graph Processing in MapReduce, SIGMOD 2014: 827-838
45. W. Yu, X. LIN, W. Zhang, Fast Incremental SimRank on Link-Evolving Graphs, ICDE 2014: 304-315.
46. C. Zhang, Y. Zhang, W. Zhang, X. LIN, M.A. Cheema, X. Wang, Diversified Spatial Keyword Search on Road Networks, EDBT 2014: 367-378.
47. X. Zhao, C. Xiao, X. LIN, Q. Liu, W. Zhang, A Partition-Based Approach to Structure Similarity Search, PVLDB 7(3): 169-180 (2013)
48. W. Yu, X. LIN, W. Zhang, L. Chang, J. Pei, More is Simpler: Effectively and Efficiently Assessing Node-Pair Similarity based on Hype-links. PVLDB 7(1): 13-24 (2013)
49. W. Yu, X. LIN, IRWR: Incremental Random Walk with Restart, SIGIR 2013: 1017-1020.
50. L. Chang, J. Yu, L. Qin, X. LIN, C. Liu, W. Liang, Efficiently Computing k-Edge Connected Components via Graph Decomposition, SIGMOD
Conference 2013: 205-216
51. Z. Zhang, J. Yu, L, Qin, L. Chang, X. LIN, I/O Efficient: Computing SCCs in Massive Graphs, SIGMOD Conference 2013: 181-192
52. X. Zhao, X. Chuan, X. LIN, W. Wang, Y. Ishikawa, Efficient Processing of Graph Similarity Queries with Edit Distance Constraints, to appear in VLDB Journal, VLDB J. 22(6): 727-752 (2013)
53. Weiren Yu, XUEMIN LIN, Wenjie Zhang, Towards Efficient Computation on SimRank Computation on Large Graphs, ICDE 2013: 601-612. ( One of the Best Papers )
Our Publications
References:
54. Chengyuan Zhang, Ying Zhang, Wenjie Zhang, XUEMIN LIN, Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search, ICDE 2013: 901-912
55. Weiiren Yu, XUEMIN LIN, Wenjie Zhang, Ying Zhang, Jiajin Le, SimFusion+: Extending SimiFusion Towards Efficient Estimation on Large Dynamic
Networks, SIGIR 2012: 365-374
56. Yuanyuan Zhu, Lu Qin, Jeffrey Yu, Yiping Ke, XUEMIN LIN, High Efficiency and Quality: Large Graphs Matching, VLDB J. 22(3): 345-368 (2013)
57. Gaoping Zhu, XUEMIN LIN, Ke Zhu, Wenjie Zhang, Jeffrey Xu Yun, TreeSpan: Efficiently Computing Similarity All-Matching, SIGMOD Conference 2012: 529-540
58. Xiang Zhao, Chuan Xiao, XUEMIN LIN, Wei Wang, Efficient Graph Similarity Joins with Edit Distance Constraints, ICDE 2012: 834-845
59. Y. Luo, W. Wang, X. LIN, X. Zhou, J. Wang, K. Li, SPARK2: Top-k Keyword Query in Relational Databases, IEEE Transactions on Knowledge and Data
Enigneering, 23(12), 1763-1780, 2011 (Spotlight Paper)
60. H. Shang, X. LIN, Y. Zhang, J.X. Yu, W. Wang, Connected Substructure Similarity Search , pages 903-914, SIGMOD 2010.
61. H. Shang, K. Zhu, X. LIN, Y. Zhang, R. Ichise, Similarity Search on Supergraph Containment , pages 637-648, ICDE 2010.
62. H. Shang, Y. Zhang, X. LIN, J. Yu, Taming Verification Hardness: an efficient algorithm for testing subgraph isomorphism, pages 364-375, VLDB2008.
63. Y. Luo, W. Wang, X. LIN, SPARK: A Keyword Search Engine on Relational Databases (demo) ICDE 2008: pages 1552-1555.
64. J. Chen, J.X. Yu, X. LIN, H. Wang, P.S. Yu, Fast Computing Reachability for Large Graphs with Hogh Compresion Rate , in the proceedings of
EDBT08, pages 193-204, France.
65. B. Ding, J.X. Yu, S. Wang, L. Qing, X. Zhang, X. LIN, Finding Top-k Min-Cost Connected Trees in Databases, ICDE’07 (Best Student Paper Award), pages 836-845, 2007.
Thank you!
Questions?