Parallel Query Processing
R&G Chapters 22.1–22.4
A little history
Relational revolution
declarative set-oriented primitives
1970’s
Parallel relational database systems
on commodity hardware
1980’s
Big Data: MapReduce, Spark, etc.
scaling to thousands of machines and beyond
2005-2015
Why Parallelism?
Scan 100TB
At 0.5 GB/sec (see lec 4):
~200,000 sec = ~2.31 days
Why Parallelism? Cont.
Scan 100TB
At 0.5 GB/sec (see lec 4):
~200,000 sec = ~2.31 days
Run it 100-way parallel:
2,000 sec = 33 minutes
1 big problem = many small problems
Trick: make them independent
Speed-up
Increase HW
Fix workload
Scale-up
Increase HW
Increase workload
Two Metrics to Shoot For
(figure: speed-up — throughput vs. parallelism, with the ideal shown as linear)
(figure: scale-up — throughput vs. data size + parallelism, with the ideal shown as linear)
Roughly 2 Kinds of Parallelism
Partition
scales up to the amount of data
each partition runs any sequential program, e.g. a relational operator
We’ll get more refined soon.
(figure: independent copies of the program, each on its own slice of the data)
Pipeline
scales up to pipeline depth
(figure: a chain of stages with different items in flight — h(g(f(x1))), g(f(x2)), f(x3))
Easy for us to say!
Lots of Data:
Batch operations
Pre-existing divide-and-conquer algorithms
Natural pipelining
Declarative languages
Can adapt the parallelism strategy to the task and the hardware
All without changing the program!
Codd’s Physical Data Independence
DBs: The Parallel Boy that Lived
1980s CS challenge: “parallelize” software
E.g. via awesome new C compilers
E.g. via variants of C designed for parallelism
In broad terms, a failure
Dave Patterson: The “Dead Computer Society”
Exception: Parallel SQL Databases
Why? Data Independence!
SQL is independent of how many machines you have!
The same divide-and-conquer that worked for disks works across machines, as we’ll see
Big Data is the generalization of these lessons
Or in some cases a re-learning of them
Parallel Architectures
Shared Memory
Shared Disk
Shared Nothing
(cluster)
Some Early Systems
Research
XPRS (Berkeley, shared-memory)
Gamma (Wisconsin, shared-nothing)
Volcano (Colorado, shared-nothing)
Bubba (MCC, shared-nothing)
Grace (U. Tokyo, shared-nothing)
Industry
Teradata (shared-nothing)
Tandem Non-Stop SQL (shared-nothing)
This slide is FYI; will not be on exam
What about the cloud?
Upshot: not so different from what we’ll see here
Architectural choices and competing systems jockeying for position
Parallelism in many forms across many users and use cases
Things still shaking out
What about the cloud? Part 2
Many transactional cloud systems can be treated as shared nothing
This slide is FYI; will not be on exam
What about the cloud? Part 3
Many analytics systems over “shared disk” abstractions
e.g. AWS Simple Storage Service (S3)
These sometimes don’t allow update-in-place today, just append
This slide is FYI; will not be on exam
What about the cloud? Part 4
Shared memory also common
You can rent a shared memory box up to some size
This slide is FYI; will not be on exams
What about the cloud? Part 5
Shared memory also common
You can rent a shared memory box up to some size
Beyond that it’s NUMA:
Non-Uniform Memory Access
This is a pain to do well
A mix of shared-nothing and shared memory
A fate that awaits all SW?
This slide is FYI; will not be on exams
Shared Nothing
We will focus on Shared Nothing here
It’s the most common
DBMS, web search, big data, machine learning, …
Runs on commodity hardware
Scales up with data
Just keep putting machines on the network!
Does not rely on HW to solve problems
Good for helping us understand what’s going on
Control it in SW
Kinds of Query Parallelism
Inter-query (parallelism across queries)
Each query runs on a separate processor
Single thread (no parallelism) per query
Does require parallel-aware concurrency control
A topic for later in the semester
Note on latin prefixes
inter: “between”, “across”.
“Interplanetary travel takes a long time”
intra: “within”.
“The political party suffered from intraparty rivalries.”
(figure: many independent SQL queries submitted to one DBMS)
Intra Query – Inter-operator
Intra-query (within a single query)
Inter-operator (between operators)
Pipeline Parallelism
(figure: pipeline of operators — h(g(f(x1))), g(f(x2)), f(x3))
Intra Query – Inter-operator Part 2
Intra-query
Inter-operator
Pipeline Parallelism
(figure: pipeline of operators — h(g(f(x1))), g(f(x2)), f(x3))
“Logical” Plan
(figure: the join tree (Scan R ⨝ Scan S) ⨝ (Scan T ⨝ Scan U))
Intra Query – Inter-Operator Part 3
Intra-query
Inter-operator
(figure: physical plan — Scan R ⨝ Scan S and Scan T ⨝ Scan U each feed a materialization (mat) into a top-level join)
Bushy (Tree) Parallelism
Pipeline Parallelism
(figure: pipeline of operators — h(g(f(x1))), g(f(x2)), f(x3))
“Logical” Plan
(figure: the join tree (Scan R ⨝ Scan S) ⨝ (Scan T ⨝ Scan U))
Intra Query – Intra-Operator
Intra-query
Intra-operator (within a single operator)
“Logical” Plan
(figure: Scan R ⨝ Scan S)
Kinds of Query Parallelism, cont.
Intra-query
Intra-operator
(figure: multiple copies of Scan R ⨝ Scan S, each working on a partition of the data)
Partition Parallelism
“Logical” Plan
(figure: Scan R ⨝ Scan S)
Summary: Kinds of Parallelism
Inter-Query
Intra-Query
Inter-Operator
Intra-Operator (partitioned)
Intra-Operator Parallelism
Data Partitioning
How to partition a table across disks/machines
A bit like coarse-grained indexing!
Shared nothing particularly benefits from “good” partitioning
Range: partition by key ranges (A…E, F…J, K…N, O…S, T…Z)
Good for equijoins, range queries, group-by
Hash: partition by a hash of the key
Good for equijoins, group-by
Round-Robin: deal tuples out evenly, regardless of value
Good for spreading load
(figure: data before loading into the database vs. loaded and partitioned across machines)
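A minimal sketch of the three partitioning functions in Python (the names and the 5-machine example are illustrative, not from the slides):

```python
import hashlib
from itertools import count

def range_partition(key, splits):
    """Range: machine i gets keys below splits[i]; the last machine gets the rest."""
    for machine, boundary in enumerate(splits):
        if key < boundary:
            return machine
    return len(splits)

def hash_partition(key, num_machines):
    """Hash: the same key always lands on the same machine."""
    digest = hashlib.md5(repr(key).encode()).hexdigest()
    return int(digest, 16) % num_machines

def round_robin_partition(counter, num_machines):
    """Round-robin: ignore the value, just deal tuples out evenly."""
    return next(counter) % num_machines

# Example: route a few keys under each scheme (5 machines).
splits = ["F", "K", "O", "T"]            # A…E, F…J, K…N, O…S, T…Z
rr = count()
for key in ["Alice", "Greg", "Zed"]:
    print(key, range_partition(key, splits),
          hash_partition(key, 5), round_robin_partition(rr, 5))
```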
Parallel Scans
Scan in parallel, merge (concat) output
Selection σp: can skip entire sites that have no tuples satisfying p
(requires range or hash partitioning on the relevant attribute)
Indexes can be built at each partition
Q: How do indexes differ in the different data partitioning schemes?
Lookup by key
Data partitioned on function of key?
Great! Route lookup only to relevant node
Otherwise
Have to broadcast lookup (to all nodes)
(figure: a hash-partitioned table and a round-robin-partitioned table)
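A minimal sketch of the routing decision (the partition_fn helper and the (key, payload) tuple layout are illustrative assumptions):

```python
def lookup(key, partitions, partition_fn=None):
    """Route a key lookup: one node if partitioned on the key, else broadcast."""
    if partition_fn is not None:                 # e.g. hash or range on this key
        node = partition_fn(key)
        return [t for t in partitions[node] if t[0] == key]
    results = []                                 # round-robin: ask every node
    for part in partitions:
        results.extend(t for t in part if t[0] == key)
    return results
```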
What about Insert?
Data partitioned on function of key?
Route insert to relevant node
Otherwise
Route insert to any node
Insert to Unique Key?
Data partitioned on function of key?
Route to relevant node
And reject if already exists
Insert to Unique Key cont.
Otherwise
Broadcast lookup
Collect responses
If not exists, insert anywhere
Else reject
Remember Hashing?
(figure: single-node external hashing with partition hash hp and rehash hr)
Parallelize me! Hashing
Phase 1: shuffle data across machines (hn)
streaming out to network as it is scanned
which machine for this record?
use (yet another) independent hash function hn
(figure: scanned tuples routed to machines by hn)
Parallelize me! Hashing Part 2
Receivers proceed with phase 1 in a pipeline
as data streams in
from local disk and network
Nearly same as single-node hashing
Near-perfect speed-up, scale-up!
Streams through phase 1, during which time every component works at its top speed, no waiting.
Have to wait to start phase 2.
(figure: shuffle by hn, then local hashing at each node with hp and hr)
Data is shuffled across machines using hn; then each machine independently carries out hashing with hp and hr, as in the single-node case.
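A minimal sketch of the shuffle phase (the names h_n, shuffle_phase, and send are illustrative); the receivers would run ordinary single-node hashing with hp and hr on whatever streams in:

```python
import hashlib

def h_n(key, num_machines):
    """Independent shuffle hash: pick a destination machine for this key."""
    digest = hashlib.md5(repr(key).encode()).hexdigest()
    return int(digest, 16) % num_machines

def shuffle_phase(scan, num_machines, send):
    """Stream tuples out to the network as they are scanned."""
    for tup in scan:                      # tup = (key, payload)
        dest = h_n(tup[0], num_machines)  # which machine for this record?
        send(dest, tup)                   # receivers hash locally with hp/hr
```

Here send stands in for the networking layer; the receiving nodes then run the single-node algorithm unchanged.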
Hash Join?
Hmmm….
If you have enough machines…
Naïve parallel hash join
Phase 1: shuffle each table across machines (hn)
Parallel scan streaming out to network
Wait for building relation to finish
Then stream probing relation through it
Receivers proceed with naïve hashing in a pipeline as probe data streams in
from local disk and network
Writes are independent, hence parallel
Note: there is a variation that has no waiting: both tables stream
Wilschut and Apers’ “Symmetric” or “Pipeline” hash join
Requires more memory space
Parallel Grace Hash Join Pass 1
Pass 1 is like hashing above
(figure: R streamed through hn to the nodes, then partitioned locally by hp)
Parallel Grace Hash Join Pass 1 cont
Pass 1 is like hashing above
But do it 2x: once for each relation being joined
(figure: S shuffled with the same hn and partitioned locally by hp, just like R)
Parallel Grace Hash Join Pass 2
Pass 2 is local Grace Hash Join per node
Complete independence across nodes
(figure: each node holds matching partitions of R and S and locally produces its share of R ⨝ S, rehashing with hr)
Parallel Grace Hash Join
Pass 1: parallel streaming
Stream building and probing tables through shuffle/partition
Pass 2 is local Grace Hash Join per node
Complete independence across nodes in Pass 2
Near-perfect speed-up, scale-up!
Every component works at its top speed
Only waiting is for Pass 1 to end.
Note: there is a variant that has no waiting
Urhan’s Xjoin, a variant of symmetric hash
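A minimal sketch of the two-pass structure (function and parameter names are illustrative; a simple in-memory build/probe stands in for the local Grace hash join each node would really run):

```python
from collections import defaultdict

def parallel_hash_join(r_tuples, s_tuples, num_nodes, key_r, key_s):
    """Sketch: shuffle R and S by the same hash of the join key, then join locally."""
    # Pass 1: shuffle both relations with the same hash function (stands in for hn).
    r_parts = defaultdict(list)
    s_parts = defaultdict(list)
    for t in r_tuples:
        r_parts[hash(key_r(t)) % num_nodes].append(t)
    for t in s_tuples:
        s_parts[hash(key_s(t)) % num_nodes].append(t)

    # Pass 2: each node joins its own partitions, fully independently.
    # (A real system would run a local Grace hash join; an in-memory
    #  build/probe stands in for it here.)
    output = []
    for node in range(num_nodes):
        table = defaultdict(list)
        for r in r_parts[node]:               # build on R
            table[key_r(r)].append(r)
        for s in s_parts[node]:               # probe with S
            for r in table.get(key_s(s), []):
                output.append((r, s))
    return output

# Example: join on the first field of each tuple.
r = [(1, "r1"), (2, "r2")]
s = [(1, "s1"), (2, "s2"), (2, "s3")]
print(parallel_hash_join(r, s, num_nodes=3,
                         key_r=lambda t: t[0], key_s=lambda t: t[0]))
```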
Parallelize me! Sorting Pass 0
Pass 0: shuffle data across machines
streaming out to network as it is scanned
which machine for this record?
Split on value range (e.g. [-∞,10], [11,100], [101, ∞]).
Parallelize me! Sorting Pass 1-n
Receivers proceed with pass 0 as the data streams in
Passes 1–n done independently as in single-node sorting
A Wrinkle: How to ensure all ranges hold roughly the same # of pages?!
i.e. avoid data skew?
Range partitioning
Goal: equal frequency per machine
Note: equal-frequency ranges often don’t divide the key domain (x axis) evenly
How to choose?
Range partitioning cont.
Would be easy if data small
In general, can sample the input relation prior to shuffling, pick splits based on sample
Note: Random sampling can be tricky to implement in a query pipeline; simpler if you materialize first.
How to sample a database table?
Advanced topic, we will not discuss in this class.
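A minimal sketch of the sampling approach, assuming the input can be materialized first (as the note above suggests); the helper names are illustrative:

```python
import random

def choose_split_points(sample_keys, num_machines):
    """Pick range boundaries so each machine gets roughly equal frequency."""
    keys = sorted(sample_keys)
    # Evenly spaced quantiles of the sample become the split points.
    return [keys[(i * len(keys)) // num_machines] for i in range(1, num_machines)]

def route(key, splits):
    """Destination machine for a key, given the chosen split points."""
    for machine, boundary in enumerate(splits):
        if key < boundary:
            return machine
    return len(splits)  # last range

# Example: sample ~1% of the (materialized) input, then shuffle by range.
data = [random.randint(0, 10_000) for _ in range(100_000)]
sample = random.sample(data, 1_000)
splits = choose_split_points(sample, num_machines=4)
print(splits, route(data[0], splits))
```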
Some Sorting Records
Sorting has become a blood sport!
Parallel sorting is the name of the game …
Minute Sort: how many 100-byte records can you sort in a minute?
Current World record: 55 TB. Tencent.
512 nodes, 10,240 cores
512 GB RAM, 4×1.2TB SSDs (2016)
This slide is FYI; will not be on exams
Some Sorting Records cont.
CloudSort: min cost to sort 100TB
Current World record: $144. Alibaba+Databricks
394 Alibaba Cloud machines, 8GB RAM, 40GB Ultra Cloud Disks, 4×135GB SSD Cloud Disks (2016)
Also JouleSort and GraySort
See http://sortbenchmark.org
Students have held sorting trophies at various times
Always evolving
This slide is FYI; will not be on exams
Parallel Sort-Merge Join
Pass 0 .. n-1 are like parallel sorting above
Note: this picture is a 2-pass sort (n=1); this is pass 0
(figure: R range-partitioned across the nodes)
Parallel Sort-Merge Join Pass 0…n-1
Pass 0 .. n-1 are like parallel sorting above
But do it 2x: once for each relation, with same ranges
Note: this picture is a 2-pass sort (n=1); this is pass 0
(figure: R and S both range-partitioned with the same split points)
Pass n (with optimization)
Pass 0 .. n-1 are like parallel sorting above
But do it 2x: once for each relation, with same ranges
Pass n: merge join partitions locally on each node
(figure: each node merge-joins its sorted partitions of R and S to produce its share of R ⨝ S)
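A minimal sketch of the whole scheme, with illustrative names; local sorts are done in memory as a stand-in for multi-pass external sorting:

```python
from bisect import bisect_right

def parallel_sort_merge_join(r_tuples, s_tuples, splits, key_r, key_s):
    """Sketch: range-partition R and S with the same splits, then sort and merge locally."""
    n = len(splits) + 1
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in r_tuples:
        r_parts[bisect_right(splits, key_r(t))].append(t)   # pass 0: shuffle by range
    for t in s_tuples:
        s_parts[bisect_right(splits, key_s(t))].append(t)

    output = []
    for node in range(n):                                   # each node, independently
        R = sorted(r_parts[node], key=key_r)                # local sort (in-memory stand-in)
        S = sorted(s_parts[node], key=key_s)
        i = j = 0
        while i < len(R) and j < len(S):                    # local merge join
            rk, sk = key_r(R[i]), key_s(S[j])
            if rk < sk:
                i += 1
            elif rk > sk:
                j += 1
            else:                                           # emit the matching runs
                j_end = j
                while j_end < len(S) and key_s(S[j_end]) == rk:
                    j_end += 1
                while i < len(R) and key_r(R[i]) == rk:
                    for jj in range(j, j_end):
                        output.append((R[i], S[jj]))
                    i += 1
                j = j_end
    return output

# Example: 2 nodes, split at key 50.
R = [(10, "r"), (60, "r")]
S = [(10, "s"), (60, "s"), (60, "s")]
print(parallel_sort_merge_join(R, S, splits=[50],
                               key_r=lambda t: t[0], key_s=lambda t: t[0]))
```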
Parallel Aggregates
Hierarchical aggregation
For each aggregate function, need a global/local decomposition:
sum(S) = Σᵢ sum(Sᵢ)
count(S) = Σᵢ count(Sᵢ)
avg(S) = (Σᵢ sum(Sᵢ)) / (Σᵢ count(Sᵢ))
etc…
(figure: hierarchical sum — leaves {2,3}, {1,5}, {4,2} produce local sums Σ=5, 6, 6, which combine into the global sum Σ=17)
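A minimal sketch of the global/local decomposition, using the figure’s example values (function names are illustrative):

```python
def local_agg(partition):
    """Each node computes partial aggregates over its own partition."""
    return {"sum": sum(partition), "count": len(partition)}

def global_agg(partials):
    """Combine the partials: sum of sums, sum of counts, avg from both."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return {"sum": total, "count": count, "avg": total / count}

# Example matching the slide's figure: partitions {2,3}, {1,5}, {4,2}.
partitions = [[2, 3], [1, 5], [4, 2]]
print(global_agg([local_agg(p) for p in partitions]))  # sum = 17
```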
Parallel GroupBy
Naïve Hash Group By
Local aggregation: in a hash table keyed by group key ki, keep a local aggregate aggi
E.g. SELECT SUM(price) GROUP BY cart;
Parallel GroupBy, Cont.
Naïve Hash Group By
Local aggregation: in a hash table keyed by group key ki, keep a local aggregate aggi
For example, k is major and agg is (avg(gpa), count(*))
Shuffle local aggs by a hash function hp(ki)
Compute global aggs for each key ki
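A minimal sketch of the two phases (illustrative names; Python’s built-in hash stands in for hp, and the shuffle is simulated with in-memory lists):

```python
from collections import defaultdict

def local_groupby(partition):
    """Phase 1: each node pre-aggregates its own tuples by group key."""
    local = defaultdict(lambda: {"sum": 0, "count": 0})
    for key, value in partition:            # e.g. (cart, price)
        local[key]["sum"] += value
        local[key]["count"] += 1
    return local

def parallel_groupby(partitions, num_nodes):
    """Phase 2: shuffle the local aggs by hp(key), then merge per key."""
    merged = [defaultdict(lambda: {"sum": 0, "count": 0}) for _ in range(num_nodes)]
    for local in map(local_groupby, partitions):
        for key, agg in local.items():
            dest = hash(key) % num_nodes    # hash() stands in for hp(ki)
            merged[dest][key]["sum"] += agg["sum"]
            merged[dest][key]["count"] += agg["count"]
    return merged  # node i now holds the global aggregates for its share of keys

# Example: two nodes' local partitions of (cart, price) tuples.
parts = [[("a", 10), ("b", 5)], [("a", 7), ("c", 2)]]
print(parallel_groupby(parts, num_nodes=2))
```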
Parallel Aggregates/GroupBy Challenge!
Exercise:
Figure out a parallel 2-pass GraceHash-based scheme to handle a large # of groups
Figure out a parallel sort-based scheme
Joins: Bigger picture
Alternatives:
Symmetric shuffle
What we did so far
Asymmetric shuffle
Broadcast join
Join: One-sided shuffle
If R already suitably partitioned,
just partition S, then run local join at every node and union results.
(figure: S shuffled to match R’s existing partitioning; each node then computes its piece of R ⨝ S)
“Broadcast” Join
If R is small, send it to all nodes that have a partition of S.
Do a local join at each node (using any algorithm) and union results.
(figure: the small input R broadcast to every node holding a partition of the larger input S)
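A minimal sketch, assuming R fits in memory on every node and S is already partitioned (the list-of-lists below simulates one partition per node; names are illustrative):

```python
def broadcast_join(r_small, s_partitions, key_r, key_s):
    """Sketch: ship all of small R to every node; each joins it with its S partition."""
    # Build once; in a real system R is sent over the network to each node.
    build = {}
    for r in r_small:
        build.setdefault(key_r(r), []).append(r)

    output = []
    for s_part in s_partitions:          # "each node", simulated sequentially
        for s in s_part:
            for r in build.get(key_s(s), []):
                output.append((r, s))
    return output
```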
What are “pipeline breakers”?
Sort
Hence sort-merge join can’t start merging until sort is complete
Hash build
Hence Grace hash join can’t start probing until hashtable is built
Is there a join scheme that pipelines?
Symmetric (Pipeline) Hash Join
Single-phase, streaming
Each node allocates two hash tables, one for each side
Upon arrival of a tuple of R:
Build into R hashtable by join key
Probe into S hashtable for matches and output any that are found
Upon arrival of a tuple of S:
Symmetric to R!
(figure: each node keeps two hash tables, one on R and one on S)
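A minimal sketch of one node’s symmetric hash join, assuming an interleaved input stream tagged with the side each tuple came from (names are illustrative):

```python
from collections import defaultdict

def symmetric_hash_join(stream, key_r, key_s):
    """Sketch of a single-phase streaming join.
    `stream` yields ("R", tuple) or ("S", tuple) in arrival order."""
    r_table = defaultdict(list)   # hash table on R's join key
    s_table = defaultdict(list)   # hash table on S's join key
    for side, tup in stream:
        if side == "R":
            k = key_r(tup)
            r_table[k].append(tup)            # build into R's table
            for s in s_table.get(k, []):      # probe S's table
                yield (tup, s)
        else:
            k = key_s(tup)
            s_table[k].append(tup)            # build into S's table
            for r in r_table.get(k, []):      # probe R's table
                yield (r, tup)

# Example: tuples arrive interleaved from the two inputs.
stream = [("R", (1, "r1")), ("S", (1, "s1")), ("S", (2, "s2")), ("R", (2, "r2"))]
print(list(symmetric_hash_join(stream, key_r=lambda t: t[0], key_s=lambda t: t[0])))
```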
Symmetric (Pipeline) Hash Join cont
Why does it work?
Each output tuple is generated exactly once: when the second part arrives
Streaming!
Can always pull another tuple from R or S, build, and probe for outputs
Useful for Stream query engines!
Extensions
Parallel Symmetric Hash Join
Straightforward—part of the original proposal
Just add a streaming partitioning phase up front
As in naïve hash join
Out-of-core Symmetric Hash Join
Quite a bit trickier. See the X-Join paper.
Non-blocking sort-merge join
See the Progressive Merge Join paper
(figure: a streaming partitioning step by hn in front of the symmetric hash join at each node)
Encapsulating parallelism: Exchange
This section is FYI; will not be on exams
A Software Engineering Query
How can I take my single-threaded iterator system and add parallelism?
Idea: encapsulate parallel dataflow as an iterator itself: Exchange.
An innovation of the Volcano system
Exchange as Metaphor
To get from here:
To here:
(figure: the single-site “Logical” Plan, Scan R ⨝ Scan S, and its parallelized, multi-machine counterpart)
Interpose Exchange Operators
Rewrite like this
Everything stays the same, but exchange is hiding two main things:
Shuffling tuples across multiple machines, dealing with networking
The fact that sender and receiver need to be in separate threads
Otherwise receiver can stall waiting for local children
(figure: Scan R ⨝ Scan S with an Exchange (X) operator inserted between each scan and the join)
Exchange is Two Iterators
Rewrite like this
X-out runs as its own thread
Along with its subtree
Separate from X-in and its parents
X-out pulls from its children and pushes to X-in
X-in stalls until it receives tuples from X-out
Network queues at X-out and X-in can buffer a few tuples
All other iterators behave as they always did in a single-node setting
(figure: Scan R and Scan S each feed an X-out; across the thread boundary, the matching X-in operators feed the join)
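A minimal single-machine sketch of the two halves (a Python thread and a bounded queue stand in for the remote process and the network; the fan-out to X-ins on other machines is omitted):

```python
import queue
import threading

class XOut(threading.Thread):
    """Producer half of Exchange: runs its subtree in its own thread and pushes tuples."""
    def __init__(self, child_iter, out_queue):
        super().__init__()
        self.child_iter, self.out_queue = child_iter, out_queue
    def run(self):
        for tup in self.child_iter:       # pull from the children (e.g. a scan)
            self.out_queue.put(tup)       # push across the "network" (a bounded queue here)
        self.out_queue.put(None)          # end-of-stream marker

class XIn:
    """Consumer half of Exchange: an ordinary iterator that stalls until tuples arrive."""
    def __init__(self, in_queue):
        self.in_queue = in_queue
    def __iter__(self):
        while True:
            tup = self.in_queue.get()     # blocks (stalls) if nothing has arrived yet
            if tup is None:
                return
            yield tup

# Usage: wrap a child iterator so the parent can consume it across a thread boundary.
q = queue.Queue(maxsize=8)                # small buffer, like the network queues
XOut(iter(range(5)), q).start()
print(list(XIn(q)))                       # [0, 1, 2, 3, 4]
```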
Exchange is Two Iterators Part 2
Now run this at each machine and have X-out shuffle its output across X-ins on all machines
(figure: the per-machine plan with X-out/X-in pairs around the thread boundary)
Exchange is Two Iterators Part 3
Now run this at each machine and have X-out shuffle its output across X-ins on all machines
(figure: three machines each run Scan R ⨝ Scan S; every X-out shuffles its output across the X-ins on all of the machines)
Parallel DBMS Summary
Parallelism natural to query processing:
Both pipeline and partition
Shared-Nothing vs. Shared-Mem vs. Shared Disk
Shared-mem easiest SW, costliest HW.
Doesn’t scale indefinitely
Shared-nothing cheap, scales well, harder to implement.
Shared disk a middle ground
For updates, introduces icky stuff related to concurrency control
Intra-op, Inter-op, & Inter-query parallelism all possible.
Parallel DBMS Summary, Part 2
Data layout choices important!
Most DB operations can be done partition-parallel
Sort. Hash.
Sort-merge join, hash-join.
Complex plans.
Allow for pipeline-parallelism, but sorts, hashes block the pipeline.
Partition parallelism achieved via bushy trees.
Parallel DBMS Summary, Part 3
Transactions require introducing some new protocols
distributed deadlock detection
two-phase commit (2PC)
2PC not great for availability, latency
single failure stalls the whole system
transaction commit waits for the slowest worker
More on this in subsequent lectures