CS W186
Spring 2020 Parallel Query Processing
1 Introduction
Up until this point, we have assumed that our database runs on a single computer. For modern applications that deal with millions of requests over terabytes of data, it would be impossible for one computer to quickly respond to all of those requests. We need to figure out how to run our database on multiple computers. We will call this parallel query processing because a query will be run on multiple machines in parallel.
2 Parallel Architectures
How are these machines connected together? The most straightforward option would probably be to have every CPU share memory and disk. This is called shared memory.
Another option is for each CPU to have its own memory, but all of them share the same disk. This is called shared disk.
These architectures are easy to reason about, but sharing resources holds the system back: machines contend for the shared memory or disk and must wait for it to become available. A much higher level of parallelism is possible if every machine has its own disk and memory. This architecture is called shared nothing and will be the architecture we use throughout the rest of the note.
In shared nothing, the machines communicate with each other solely through the network by passing messages to each other.
3 Types of Parallelism
The focus of this note will be on intra-query parallelism. Intra-query parallelism attempts to make one query run as fast as possible by spreading the work over multiple computers. The other major type of parallelism is inter-query parallelism which gives each machine different queries to work on so that the system can achieve a high throughput. This may sound simple, but doing it correctly is actually quite difficult and we will address how to do it when we get to the module on concurrency.
3.1 Types of Intra-query Parallelism
We can further divide intra-query parallelism into two classes: intra-operator and inter-operator. Intra-operator parallelism makes a single operator run as quickly as possible. An example of intra-operator parallelism is dividing the data over several machines and having them sort it in parallel; this makes sorting (one operation) as fast as possible. Inter-operator parallelism makes a query run as fast as possible by running its operators in parallel. For example, consider a query plan that sorts two tables, R and S.
One machine can work on sorting R while another machine sorts S at the same time. In inter-operator parallelism, we parallelize the entire query, not individual operators.
3.2 Types of Inter-operator Parallelism
The last distinction we will make is between the different forms of inter-operator parallelism. The first type is pipeline parallelism. In pipeline parallelism, records are passed to the parent operator as soon as they are done. The parent operator can work on a record that its child has already processed while the child operator works on a different record. As an example, consider a query plan that filters a table and then projects some of its columns.
In pipeline parallelism, the project and filter can run at the same time: as soon as the filter finishes a record, the project can operate on it while the filter picks up a new record to work on. The other type of inter-operator parallelism is bushy tree parallelism, in which different branches of the tree are run in parallel. The example in section 3.1, where we sort the two tables independently, is an example of bushy tree parallelism.
4 Partitioning
Because all of our machines have their own disk we have to decide what data is stored on what machine. In this class each data page will be stored on only one machine. In industry we call this sharding and it is used to achieve better performance. If each data page appeared on multiple machines it would be called replication. This technique is used to achieve better availability (if one machine goes down another can handle requests), but it presents a host of other challenges that we will not study in this class.
To decide what machines get what data we will use a partitioning scheme. A partitioning scheme is a rule that determines what machine a certain record will end up on. The three we will study are range partitioning, hash partitioning, and round-robin.
4.1 Range Partitioning
In a range partitioning scheme, each machine gets a certain range of values that it will store (e.g., machine 1 stores values 1-5, machine 2 stores values 6-10, and so on).
This scheme is very good for queries that look up a specific key (and especially for range queries, compared to the other schemes we'll talk about) because you only need to request data from the machines whose ranges contain the values. It will also be the scheme that we use for parallel sorting and parallel sort merge join.
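Below is a minimal sketch of a range partitioner, assuming made-up cutoff values and 0-indexed machines:

```python
import bisect

# Hypothetical cutoffs: machine 0 stores keys below 6, machine 1
# stores 6-10, machine 2 stores 11-15, machine 3 stores 16 and up.
cutoffs = [6, 11, 16]

def range_partition(key, cutoffs):
    """Return the index of the machine whose range contains `key`."""
    return bisect.bisect_right(cutoffs, key)

assert range_partition(4, cutoffs) == 0    # 1-5   -> machine 0
assert range_partition(7, cutoffs) == 1    # 6-10  -> machine 1
assert range_partition(20, cutoffs) == 3   # 16+   -> machine 3
```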
4.2 Hash Partitioning
In a hash partitioning scheme, each record is hashed and sent to the machine assigned that hash value. This means that all like values will be assigned to the same machine (i.e., if value 4 goes to machine 1 then all of the 4s must go to that machine), but it makes no guarantees about where close values go.
It will still perform well for key lookup, but not for range queries (because every machine will likely need to be queried). Hash partitioning will also be our scheme of choice for parallel hashing and parallel hash join.
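A minimal sketch of a hash partitioner; Python's built-in hash is a stand-in for the stable hash function a real system would use:

```python
NUM_MACHINES = 5  # hypothetical cluster size

def hash_partition(key, num_machines=NUM_MACHINES):
    """Send `key` to the machine assigned its hash value."""
    # A real system needs a hash that is stable across machines;
    # Python's hash() is only consistent within one process.
    return hash(key) % num_machines

# Every record with key 4 lands on the same machine...
assert hash_partition(4) == hash_partition(4)
# ...but close keys like 4 and 5 may land on distant machines.
```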
4.3 Round Robin Partitioning
The last scheme we will talk about is called round robin partitioning. In this scheme, we go record by record and assign each record to the next machine: the first record goes to the first machine, the second record goes to the second machine, and so on. When we reach the final machine, we assign the next record to the first machine again.
This may sound simplistic, but it has a nice property: every machine is guaranteed to get almost exactly the same amount of data, so this scheme achieves maximum parallelization. The downside, of course, is that every machine will need to be activated for every query.
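A minimal sketch of round robin assignment, with hypothetical machine numbering:

```python
from itertools import cycle

def round_robin_partition(records, num_machines):
    """Yield (machine, record) pairs, cycling through the machines."""
    machines = cycle(range(num_machines))
    for record in records:
        yield next(machines), record

# 10 records over 3 machines: assignments cycle 0, 1, 2, 0, 1, 2, ...
# so no machine gets more than one extra record.
print(list(round_robin_partition(range(10), 3)))
```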
5 Network Cost
So far in this class we have only measured performance in terms of IOs. When we have multiple machines communicating over a network, however, we also need to consider the network cost: how much data we must send over the network to perform an operation. In this class it is usually measured in KB. One important thing to note is that there is no requirement that entire pages be sent over the network (unlike when moving pages between memory and disk), so the network cost can be as small as one record's worth of data.
6 Partitioning Practice Questions
Assume that we have 5 machines and a 1000 page students(sid, name, gpa) table. Initially, all of the pages start on one machine. Assume pages are 1KB.
1) How much network cost does it take to round-robin partition the table?
2) How many IOs will it take to execute the following query: SELECT * FROM students WHERE name = 'Josh Hug';
3) Suppose that instead of round robin partitioning the table, we hash partitioned it on the name column. How many IOs would the query from part 2 take?
4) Assume that an IO takes 1ms and the network cost is negligible. How long will the query in part 2 take if the data is round-robin partitioned and if the data is hash partitioned on the name column?
7 Partitioning Practice Solutions
1) In round robin partitioning, the data is distributed completely evenly, so the machine the data starts on will keep 1/5 of the pages. This means that 4/5 of the pages will need to move to different machines, so the answer is: 4/5 * 1000 pages * 1 KB/page = 800 KB.
2) When the data is round robin partitioned we have no idea what machine(s) the records we need will be on. This means we will have to do full scans on all of the machines so we will have to do a total of 1000 IOs.
3) When the data is hash partitioned on the name column, we know exactly which machine to go to for this query (we can calculate the hash value of 'Josh Hug' and find out which machine is assigned that hash value). This means we will only have to do a full scan on one machine, for a total of 200 IOs.
4) Under both partitioning schemes it will take 200ms. Each machine will take the same amount of time to do a full scan, and all machines can be run at the same time so for runtime it doesn’t matter how many machines we need to query.
8 Parallel Sorting
Let’s start speeding up some algorithms we already know by parallelizing them. There are two steps for parallel sorting:
1. Range partition the table
2. Perform local sort on each machine
We range partition the table because then once each machine sorts its data, the entire table is in sorted order (the data on each machine can be simply concatenated together if needed).
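A minimal sketch that simulates both steps in one process, assuming hypothetical cutoffs; each sorted partition stands in for the work one machine would do:

```python
import bisect

def parallel_sort(records, cutoffs):
    """Range partition, locally sort each partition, concatenate."""
    partitions = [[] for _ in range(len(cutoffs) + 1)]
    for r in records:
        partitions[bisect.bisect_right(cutoffs, r)].append(r)
    # Each machine sorts its own partition (in parallel in reality).
    locally_sorted = [sorted(p) for p in partitions]
    # Range partitioning makes the concatenation globally sorted.
    return [r for p in locally_sorted for r in p]

assert parallel_sort([9, 2, 14, 7, 1], cutoffs=[6, 11]) == [1, 2, 7, 9, 14]
```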
9 Parallel Hashing
Parallel hashing is very similar to parallel sorting. The two steps are:
1. Hash partition the table
2. Perform local hashing on each machine
Hash partitioning the table guarantees that like values will be assigned to the same machine. It would also be valid to range partition the data (because it gives the same guarantee). However, in practice, range partitioning is a little harder than hash partitioning (how do you come up with the ranges efficiently?), so use hash partitioning when possible.
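A minimal sketch along the same lines, again simulating the machines as lists in one process; the per-machine "hash table" here just counts occurrences:

```python
def parallel_hash(records, num_machines):
    """Hash partition, then build a local hash table on each machine."""
    partitions = [[] for _ in range(num_machines)]
    for r in records:
        partitions[hash(r) % num_machines].append(r)
    local_tables = []
    for p in partitions:
        # Hash partitioning guarantees equal values meet in one table.
        table = {}
        for r in p:
            table[r] = table.get(r, 0) + 1
        local_tables.append(table)
    return local_tables
```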
10 Parallel Sort Merge Join
The two steps for parallel sort merge join are:
1. Range partition each table using the same ranges on the join column
2. Perform local sort merge join on each machine
We need to use the same ranges to guarantee that all matches appear on the same machine. If we used different ranges for each table, then a record from table R could land on a different machine than a record from table S even if they have the same value for the join column, which would prevent those records from ever being joined.
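A minimal single-process sketch; the key functions and the `partition_by_range` helper are hypothetical, and each loop iteration stands in for one machine's local work:

```python
import bisect

def partition_by_range(records, key, cutoffs):
    parts = [[] for _ in range(len(cutoffs) + 1)]
    for r in records:
        parts[bisect.bisect_right(cutoffs, key(r))].append(r)
    return parts

def merge_join(R, S, key_r, key_s):
    """Merge two inputs that are already sorted on the join key."""
    out, j = [], 0
    for r in R:
        while j < len(S) and key_s(S[j]) < key_r(r):
            j += 1
        k = j
        while k < len(S) and key_s(S[k]) == key_r(r):
            out.append((r, S[k]))
            k += 1
    return out

def parallel_sort_merge_join(R, S, key_r, key_s, cutoffs):
    out = []
    # The SAME cutoffs for both tables co-locate matching join keys.
    for r_part, s_part in zip(partition_by_range(R, key_r, cutoffs),
                              partition_by_range(S, key_s, cutoffs)):
        out.extend(merge_join(sorted(r_part, key=key_r),
                              sorted(s_part, key=key_s),
                              key_r, key_s))
    return out
```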
11 Parallel Grace Hash Join
The two steps for parallel hash join are:
1. Hash partition each table using the same hash function on the join column
2. Perform local grace hash join on each machine
As with parallel SMJ, we need to use the same hash function for both tables to guarantee that matching records are partitioned to the same machine.
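A minimal sketch; the local join is shown as a simple build-and-probe, which is what Grace Hash Join reduces to when a partition fits in memory:

```python
def parallel_hash_join(R, S, key_r, key_s, num_machines):
    out = []
    for m in range(num_machines):
        # The SAME hash function routes both tables' matching keys to m.
        r_part = [r for r in R if hash(key_r(r)) % num_machines == m]
        s_part = [s for s in S if hash(key_s(s)) % num_machines == m]
        table = {}
        for r in r_part:                       # local build
            table.setdefault(key_r(r), []).append(r)
        for s in s_part:                       # local probe
            out.extend((r, s) for r in table.get(key_s(s), []))
    return out
```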
12 Broadcast Join
Let’s say we want to join a 1,000,000 page table that is currently round robin partitioned with a 100 page table that is stored entirely on one machine. We could use one of the parallel join algorithms discussed above, but that would require range partitioning or hash partitioning both tables, and repartitioning the big table would incur an enormous network cost.
Instead, we can use a broadcast join. A broadcast join is a join in which we send the entire small relation to every machine and then each machine does a local join. The concatenation of the results from each machine is the correct and final result. With this algorithm each machine has to operate on more data, but we make up for this by sending much less data over the network. When joining one really large table with one small table, a broadcast join will normally be the fastest option because of its network cost advantage.
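A minimal sketch, with `big_partitions` standing in for however the big table already happens to be spread across machines; in a real system each machine would build its own copy of the small table's hash table:

```python
def broadcast_join(big_partitions, small_table, key_big, key_small):
    # Broadcast: every machine receives the whole small table and
    # builds a hash table over it (simulated once here).
    table = {}
    for s in small_table:
        table.setdefault(key_small(s), []).append(s)
    out = []
    for machine_records in big_partitions:  # each machine joins locally
        out.extend((r, s) for r in machine_records
                   for s in table.get(key_big(r), []))
    return out  # concatenated local results = final result
```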
13 Symmetric Hash Join
The parallel joins we have discussed so far have been about making the join itself run as fast as possible by distributing the data across multiple machines. The problem is that these join algorithms are pipeline breakers: the join cannot produce any output until it has processed every record. Why is this a problem? It prevents us from using pipeline parallelism. The operators above the join cannot work in parallel with it, because the join takes a long time and then produces all of its output at once.
Sort Merge Join is a pipeline breaker because sorting is. You cannot produce the first row of output in sorting until you have seen all the input (how else would you know if your row is really the smallest?). Hash Join is a pipeline breaker because you cannot probe the hash table of each partition until you know that hash table has all the data in it. If you tried to probe one of the hash tables after processing only half the data, you might get only half of the matches that you should! Let’s try to build a hash based join that isn’t a pipeline breaker.
Symmetric Hash Join is a pipeline-friendly join algorithm: we can start producing output as soon as we see our first matches. Here are the steps to symmetric hash join:
1. Build two hash tables, one for each table in the join
2. When a record from R arrives, probe the hash table for S for all of the matches. When a record from S arrives, probe the hash table for R for all of the matches.
3. Whenever a record arrives add it to its corresponding hash table after probing the other hash table for matches.
This works because every output tuple gets generated exactly once: when the second record of the match arrives. And because we produce output as soon as we see a match, it is not a pipeline breaker.
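A minimal sketch, assuming a single interleaved stream of records tagged with the side ('R' or 'S') they came from:

```python
def symmetric_hash_join(stream, key_r, key_s):
    """Probe the other side's table on arrival, then insert; matches
    are emitted as soon as the second record of the pair shows up."""
    r_table, s_table = {}, {}
    for side, rec in stream:
        if side == 'R':
            for s in s_table.get(key_r(rec), []):
                yield (rec, s)          # output immediately
            r_table.setdefault(key_r(rec), []).append(rec)
        else:
            for r in r_table.get(key_s(rec), []):
                yield (r, rec)
            s_table.setdefault(key_s(rec), []).append(rec)
```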
14 Hierarchical Aggregation
The final parallel algorithm we'll discuss is called hierarchical aggregation. Hierarchical aggregation is how we parallelize aggregation operations (e.g., SUM, COUNT, AVG). Each aggregation is implemented a little differently, so we'll only discuss two in this section, but the ideas carry over to the other operations easily.
To parallelize COUNT, each machine individually counts its records. The machines all send their counts to the coordinator machine, which sums them together to produce the overall count.
AVG is a little harder to calculate because the average of a bunch of averages isn't necessarily the average of the data set. To parallelize AVG, each machine calculates the sum and the count of its values and sends both to the coordinator machine. The coordinator adds up the sums to get the overall sum, adds up the counts to get the overall count, and divides the overall sum by the overall count to get the final average.
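A minimal sketch of both aggregates; the tiny, deliberately uneven partitions show why averaging the per-machine averages would be wrong:

```python
def local_partial(values):
    """Each machine reports a partial (sum, count) pair."""
    return sum(values), len(values)

def coordinator_avg(partials):
    total = sum(s for s, _ in partials)   # overall SUM
    count = sum(c for _, c in partials)   # overall COUNT
    return total / count

partials = [local_partial(p) for p in ([1, 2, 3], [10])]
assert coordinator_avg(partials) == 4.0   # (1+2+3+10)/4, not (2+10)/2
```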
15 Parallel Algorithms Practice Questions
For 1 and 2, assume we have a 100 page table R that we will join with a 400 page table S. All the pages start on machine 1, and there are 4 machines in total with 30 buffer pages each.
1) How many passes are needed to do a parallel unoptimized SMJ on the data? For this ques- tion a pass is defined as a full pass over either table (so if we have to do 1 pass over R and 1 pass over S it is 2 total passes).
2) How many passes are needed to do a parallel Grace Hash Join on the data?
3) We want to calculate the max in parallel using hierarchical aggregation. What value should each machine calculate and what should the coordinator do with those values?
16 Parallel Algorithms Practice Solutions
1) 2 passes to partition the data (1 for each table). Each machine will then have 25 pages for R and 100 pages for S. R can be sorted in 1 pass but S requires 2 passes. Then it will take 2 passes to merge the tables together (1 for each table). This gives us a total of 2 (partition) + 1 (sort R) + 2 (sort S) + 2 (merge relations) = 7 passes.
2) Again, we will need 2 passes to partition the data across the four machines, and each machine will have 25 pages of R and 100 pages of S. We don't need any local Grace Hash Join partitioning passes, because R's partition fits in memory on every machine, so we only need to build and probe, which takes 2 passes (1 for each table). This gives us a total of 4 passes.
3) Each machine should calculate the max and the coordinator will take the max of those maxes.