PowerPoint Presentation
1
Spark GraphX/GraphFrames
Seamlessly work with both graphs and tables using DataFrame API.
We will be using GraphFrames to take advantage of the more expressive DataFrame API
Comparable performance to the fastest specialized graph processing systems.
A growing library of graph algorithms.
Users can write their own.
2
The property graph model
a
{name: Alice, age: 34}
b
{name: Bob, age: 36}
c
{name: Charlie, age: 37}
d
{name: David, age: 29}
e
{name: Esther, age: 32}
f
{name: Fanny, age: 38}
g
{name: Gabby, age: 60}
friend
friend
follow
follow
follow
friend
friend
follow
Creating GraphFrames
A GraphFrame is created from a vertex DataFrame and an edge DataFrame
Vertex DataFrame should contain a special column named “id” which specifies unique IDs for each vertex in the graph.
Edge DataFrame should contain two special columns: “src” (source vertex ID of edge) and “dst” (destination vertex ID of edge).
Both DataFrames can have arbitrary other columns. Those columns can represent vertex and edge attributes.
Motif finding
Basic unit is an edge. For example, “(a)-[e]->(b)” expresses an edge e from vertex a to vertex b.
vertices are denoted by parentheses (a)
edges are denoted by square brackets [e]
A pattern is expressed as a union of edges. Edge patterns can be joined with semicolons.
Vertex/edge names
vertex names identify common vertices among edges (the same edge name cannot appear more than once)
vertex/edge names are used as column names in the result DataFrame
can be omitted
Motif finding
An edge can be negated to indicate that the edge should not be present in the graph.
All names in the negated edge must have appeared in previous edges
Cannot contain named edges
More meaningful queries can be expressed by applying filters on the resulting DataFrame.
Question: where is this filter applied?
A more complex example: find chains of 4 vertices such that at least 2 of the 3 edges are “friend” relationships.
Subgraphs
Graph Algorithms in GraphFrames
Simple algorithms: Use DataFrame API on vertices, edges, and/or triplets
Example: BFS
The BFS algorithm provided in GraphFrames gives more information, such as the actual shortest path
List Ranking
Problem:
Given a singly linked list L with n objects, put them in order.
We will give an algorithm that, for each node, compute the distance to the end of the list.
Then use sorting to bring them in order.
Algorithm
Initialize u.d = 0 if u.next = null, else u.d = 1
Run the following for each node u until all next pointers are null:
if u.next is not null:
u.d += u.next.d
u.next = u.next.next
8
Example
9
Application: Computing Depth of a Tree
If the tree is shallow, use BFS
What if the tree is very deep?
For each edge, make two copies:
one going up (initial d = +1)
one going down (initial d = -1)
Treat each copy as a vertex, build links between them
Run list ranking and return the largest d
The Pregel Model for Graph Computation
Vertex-centric computation
The Pregel model (available in GraphX; not available in GraphFrames yet)
Each vertex has a local value and a binary state (active/inactive)
In each round (superstep), execute a user-defined program:
Each vertex aggregate messages from previous rounds (inactive vertices become active if messages are received)
Each active vertex updates its local value and sends messages to neighbors
Optionally set its state to inactive
Whole computation terminates when no active vertices
Example: BFS
new_val = min(all messages m received)
if new_val < val then
val = new_val
for each neighbor v
send_message(v, val+1)
set status to inactive
Initialization:
local value = 0, status = active at starting vertex
local value = , status = inactive at all other vertices
User-defined program
Example: PageRank
Initialization:
local value = 1, status = active for all vertices
User-defined program
val = sum(all messages m received) * 0.85 + 0.15/N
for each neighbor v
send_message(v, val / outdegree of self)
if number of rounds > threshold:
set status to inactive
Shortest Path: Dijkstra’s Algorithm
This is an inherently sequential algorithm!
14
Dijkstra():
for each do
,,
create a min priority queue on with as key
while
Extract-Min()
for each do
if and then
Decrease-Key()
Dijkstra’s Algorithm: Example
15
Note: All the shortest paths found by Dijkstra’s algorithm form a tree (shortest-path tree).
Bellman-Ford (implemented in GraphX / GraphFrames)
Initialization:
local value = 0, status = active at starting vertex
local value = , status = inactive at all other vertices
User-defined program
new_val = min(all messages m received)
if new_val < val then
val = new_val
for each neighbor v
send_message(v, val + dist(self, v))
set status to inactive
Can be much faster (less rounds) than Dijkstra’s algorithm on shallow graphs
But may do more total work.
It also supports negative-weight edges
Dijskra’s algorithm cannot handle negative-weight edges
0
a
b
d
c
e
6
7
8
9
7
5
-2
-3
-4
2
0
a
6
b
7
d
c
e
6
7
8
9
7
5
-2
-3
-4
2
0
6
b
7
d
4
c
2
e
6
7
8
9
7
5
-2
-3
-4
2
0
2
b
7
d
4
c
2
e
6
7
8
9
7
5
-2
-3
-4
2
0
2
b
7
d
4
c
-2
e
6
7
8
9
7
5
-2
-3
-4
2
/docProps/thumbnail.jpeg