School of Computer Science Dr. Ying Zhou
COMP5349: Cloud Computing Sem. 1/2020
Week 11: Cloud Storage and Database Services
Example Solution
Question 1: Bigtable storage model and read/write behaviour
Assume a Google Bigtable cluster consisting of 1 master and a few tablet servers. There is also a Chubby service storing important metadata of the cluster. Now assume the cluster manages a table with a structure similar to the sample table in the Bigtable paper. Table 1 shows a sample row in the table.
Each row of the web table represents a web page; the row key is the reversed DNS name of the web page. The web table has two column families: content and anchor. The content family stores the actual HTML content of a web page and consists of one column. A crawler that continuously crawls the web would visit the same page multiple times; each time, a different snapshot of the page content would be downloaded and saved in the web table. The web table keeps up to the three most recent snapshots, indexed by the timestamp at which they are inserted into the table. For instance, t1 → snapshot1 means snapshot1 of web page www.cnn.com was inserted at t1. The anchor family is used to store anchor links in other pages referencing this page. It contains a variable number of columns, each storing one anchor link. The anchor link data uses the URL of the referencing page as the column name, and the cell value is the anchor text appearing in the referencing page. The timestamp of the data is also kept in the table. For instance, (cnnsi.com, t6) → CNN means that page cnnsi.com has a link pointing to www.cnn.com with anchor text CNN; this information was inserted into the web table at timestamp t6.
Table 1: Sample Data of the web table
a) Bigtable stores data as a multidimensional map. The key has three dimensions: row key, column key, and timestamp. For instance, the data about snapshot1 is stored as the following key-value pair:
(com.cnn.www, content:, t1) → snapshot1
The data about the first anchor column is stored as another key-value pair:
(com.cnn.www, anchor:cnnsi.com, t6) → CNN
Here the complete column name is the combination of the column family and the individual column name inside the family: anchor:cnnsi.com.
Key         | content          | anchor
com.cnn.www | t1 → snapshot1   | (cnnsi.com, t6) → CNN
            | t80 → snapshot2  | (my.look.ca, t10) → CNN.com
How many key-value pairs are there for the sample row in Table 1? What are they?
Answer: There are four key-value pairs. They are:
• (com.cnn.www, content:, t1) → snapshot1
• (com.cnn.www, content:, t80) → snapshot2
• (com.cnn.www, anchor:cnnsi.com, t6) → CNN
• (com.cnn.www, anchor:my.look.ca, t10) → CNN.com
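The multidimensional map can be illustrated with a small sketch. The following is a minimal Python illustration of the model only, not Bigtable's actual implementation; it uses the row key, column names and timestamps from the sample row above.

```python
# Minimal sketch of Bigtable's data model as a sorted map:
# (row key, column key, timestamp) -> value. Illustration only.
webtable = {
    ("com.cnn.www", "content:", 1): "snapshot1",
    ("com.cnn.www", "content:", 80): "snapshot2",
    ("com.cnn.www", "anchor:cnnsi.com", 6): "CNN",
    ("com.cnn.www", "anchor:my.look.ca", 10): "CNN.com",
}

# Reading a whole row means collecting every key-value pair whose row key matches.
row = {k: v for k, v in sorted(webtable.items()) if k[0] == "com.cnn.www"}
for (row_key, column, ts), value in row.items():
    print(f"({row_key}, {column}, t{ts}) -> {value}")
```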
b) Are all key-value pairs belonging to the same row always saved contiguously in the same file?
Answer: No, the key-value pairs belonging to the same row may be scattered in memory and in different files. Only after a major compaction are they saved contiguously in the same file.
c) Assume the GFS layer of the Bigtable cluster has a replication factor of 3. This applies to both log files and data files. A minor compaction happens at t8, a merge compaction happens at t20, and a major compaction happens at t50. How many times is the data about snapshot1 written to GFS? Describe when and in what format the data is written.
Answer: The data is appended to the commit log at t1; the commit log is replicated 3 times. The data is written to an SSTable file at t8 during the minor compaction; this file is replicated 3 times. The data is written again to a new SSTable file during the merge compaction at t20, again with 3 replicas. During the major compaction at t50, the data about snapshot1 is written to another new SSTable file containing all the latest data of the web table; this file also has 3 replicas. In total, the data is written 12 times (4 logical writes × 3 replicas). The old files will be garbage collected after the merge and major compactions.
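The count can be tallied explicitly. The sketch below simply enumerates the write events described above and multiplies by the GFS replication factor; the event labels are illustrative only.

```python
# Tally the GFS writes of snapshot1 described above (illustrative only).
replication_factor = 3
write_events = [
    ("t1", "append to commit log"),
    ("t8", "minor compaction writes a new SSTable"),
    ("t20", "merge compaction writes a new SSTable"),
    ("t50", "major compaction writes a new SSTable"),
]
total = len(write_events) * replication_factor
print(f"{len(write_events)} logical writes x {replication_factor} replicas = {total} writes")
```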
d) Assume a minor compaction happens at t8. A read request is sent to the Bigtable cluster at t15 to read the row with key com.cnn.www. Where would the tablet server find the data to answer this read query?
Answer: All data inserted before t8 is stored in an SSTable file after the minor compaction at t8. The key-value pairs in that file are:
• (com.cnn.www, content:, t1) → snapshot1
• (com.cnn.www, anchor:cnnsi.com, t6) → CNN
The latest data, (com.cnn.www, anchor:my.look.ca, t10) → CNN.com, is in the memory (memtable) of the tablet server.
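A read therefore merges what is on disk with what is still in memory. Below is a minimal sketch of this merged view, assuming the SSTable produced at t8 and the memtable hold the key-value pairs listed above; it is not the actual tablet-server code.

```python
# Merged read over the SSTable written at t8 and the in-memory memtable (sketch only).
sstable_t8 = {
    ("com.cnn.www", "content:", 1): "snapshot1",
    ("com.cnn.www", "anchor:cnnsi.com", 6): "CNN",
}
memtable = {
    ("com.cnn.www", "anchor:my.look.ca", 10): "CNN.com",
}

def read_row(row_key):
    # The tablet server answers from the merged view of memtable and SSTable(s).
    merged = {**sstable_t8, **memtable}
    return {k: v for k, v in sorted(merged.items()) if k[0] == row_key}

print(read_row("com.cnn.www"))
```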
Question 2: Amazon Dynamo Consistent Hashing and Data Replication
Suppose we have a Dynamo ring consisting of five nodes, each assigned a position (token) in the ring space [0, 99]. Figure 1 shows the token of each node. The cluster has a replication factor of 3 and a preference list size of 4. Suppose some data stored in the system is about university faculties, using the faculty name as the key. Table 2 shows sample keys and their corresponding hash values in the ring space.
Figure 1: Dynamo Ring
Table 2: Sample keys and hashes
Key         | Hash
Arts        | 31
Business    | 93
Education   | 29
Engineering | 13
Law         | 71
Medicine    | 47
Science     | 43
a) Suppose node n3 receives a request to insert a record with key “Law”. Which node will it forward the request to?
Answer: n3 will forward the request to n4, which is the coordinator node for key “Law”. n4 is responsible for the hash range (60, 80], and “Law” hashes to 71, which falls in that range.
b) Which other nodes will eventually get a copy of the inserted data?
Answer: n0 and n1 will also receive a copy of the data. They are the two clockwise successors of n4 (the coordinator) in the ring. In other words, the preference list of key “Law” contains four nodes: n4, n0, n1, n2. The top 3 are the designated replicas for this key; the last one is used to form a sloppy quorum.
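The coordinator and preference list can be computed mechanically by walking the ring clockwise from the key's hash. The sketch below assumes the token assignment n0=0, n1=20, n2=40, n3=60, n4=80, which is consistent with the ranges used in the answers above but is otherwise hypothetical (Figure 1 is not reproduced here).

```python
import bisect

# Hypothetical token assignment consistent with the answers above:
# node -> position (token) in the ring space [0, 99].
tokens = {"n0": 0, "n1": 20, "n2": 40, "n3": 60, "n4": 80}
ring = sorted((pos, node) for node, pos in tokens.items())
positions = [pos for pos, _ in ring]

def preference_list(key_hash, size=4):
    """Coordinator = first node clockwise whose token is >= key_hash,
    followed by its clockwise successors."""
    start = bisect.bisect_left(positions, key_hash) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(size)]

# Key "Law" hashes to 71 (Table 2): coordinator n4, then n0, n1, n2.
print(preference_list(71))   # ['n4', 'n0', 'n1', 'n2']
```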
Question 3: Erasure Coding in Windows Azure Storage
In this question, we explore the fault-tolerance features of the erasure coding used in WAS. This is based on sections 2.1 and 2.2 of the paper Erasure Coding in Windows Azure Storage. We will skip the part on constructing the coding equations (section 2.2.1) and focus only on the practical rules and cases.
In general, reconstructing one or more lost fragments is like solving a system of linearly independent equations: you can solve for 3 unknowns using 3 linearly independent equations, or for 4 unknowns using 4 linearly independent equations.
A Reed-Solomon erasure code scheme with t parity fragments can always tolerate up to t arbitrary fragment losses, whether they are parity or data fragments. As stated in section 2.1: “A (6, 3) Reed-Solomon code contains 6 data fragments and 3 parity fragments, where each parity is computed from all the 6 data fragments. When any data fragment becomes unavailable, no matter which data and parity fragments are used for reconstruction, 6 fragments are always required.”
WAS uses a Local Reconstruction Code (LRC) strategy that contains both global parity fragments computed from all data fragments and local parity fragments computed from a subset of the data fragments. An LRC(6, 2, 2) scheme means the data is partitioned into 6 data fragments, and there are 2 local parities and 2 global parities. See an example in Figure 1 of the paper (reproduced here as Figure 2). It adds 4 parity fragments and can tolerate up to 4 fragment failures, but not every arbitrary combination of failures can be recovered. For instance, the scenario with the 4 failures x1, x2, x3 and px cannot be recovered, as the local parity py contains no information about the first group and cannot be used.
Figure 2: Figure 1 of the Erasure Coding in Windows Azure Storage paper
The paper gives a simple algorithm to decide if a failure pattern is theoretically recoverable: “For each local group, if the local parity is available, while at least one data fragment is erased, we swap the parity with one erased data fragment. The swap operation marks the data fragment as available and the parity as erased. Once we complete all the local groups, we examine the data fragments and the global parities. If the total number of erased fragments (data and parity) is no more than the number of global parities, the algorithm declares the failure pattern information-theoretically decodable.” Note that swapping
a local parity with a lost data fragment is not the process of reconstruction; it is only a way to quickly decide whether a failure pattern is recoverable.
Figure 2 of the paper (reproduced here as Figure 3) shows two examples of recoverable failure patterns:
• Lost fragments: x0, x1 and p1. Using the algorithm, we can swap px with x0 and end up with only x1 and p1 as lost fragments. The total number of lost fragments is 2, which equals the number of global parities, so we are able to recover all fragments.
• Lost fragments: x0, x1, y0 and y1. Using the algorithm, we can swap px with x0 and py with y0. This leaves two lost fragments, x1 and y1; the total equals the number of global parities, so this pattern is also recoverable.
Figure 3: Figure 2 of the Erasure Coding in Windows Azure Storage paper
We can also use this rule to check the failure pattern x1, x2, x3 and px. In this pattern, px itself is lost, so no swap is possible in the first group, and the available py cannot be swapped because no data fragment in its group is lost. The three lost data fragments x1, x2 and x3 already exceed the number of global parities (2), so this pattern is not recoverable.
Now assume the LRC(12, 2, 2) scheme as covered in the lecture. Analyze the recoverability of the following failure patterns:
• Lost fragments: x0, x1, x2, p0
• Lost fragments: x0, x1, x2, y0
• Lost fragments: x0, x1, x2, py
• Lost fragments: x0, x1, x2, x3
Answer:
• Lost fragments: x0, x1, x2, p0. Not recoverable: swapping px with x0 still leaves x1, x2 and the global parity p0 erased, and 3 erased fragments exceed the 2 global parities.
• Lost fragments: x0, x1, x2, y0. Recoverable: swapping px with x0 and py with y0 leaves only x1 and x2 erased, which does not exceed the 2 global parities.
• Lost fragments: x0, x1, x2, py. Recoverable: swapping px with x0 leaves only x1 and x2 erased among the data fragments and global parities (the erased local parity py is not counted), which does not exceed the 2 global parities.
• Lost fragments: x0, x1, x2, x3. Not recoverable: swapping px with x0 still leaves x1, x2 and x3 erased, and 3 erased fragments exceed the 2 global parities.
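These answers can be checked mechanically with the swap-based algorithm quoted from the paper. The sketch below implements that check for LRC(12, 2, 2), assuming the fragment names x0..x5, px (first local group), y0..y5, py (second local group) and global parities p0, p1; it only decides decodability and does not perform any reconstruction.

```python
# Swap-based decodability check for LRC(12, 2, 2), following the algorithm
# quoted from the paper. Fragment names are assumed: local groups x0..x5
# (local parity px) and y0..y5 (local parity py), global parities p0, p1.
LOCAL_GROUPS = {
    "px": [f"x{i}" for i in range(6)],
    "py": [f"y{i}" for i in range(6)],
}
GLOBAL_PARITIES = ["p0", "p1"]

def decodable(lost):
    erased = set(lost)
    # Step 1: in each local group, if the local parity survives and some
    # data fragment is erased, swap the parity with one erased fragment.
    for local_parity, group in LOCAL_GROUPS.items():
        if local_parity not in erased:
            for frag in group:
                if frag in erased:
                    erased.remove(frag)       # data fragment marked available
                    erased.add(local_parity)  # local parity marked erased
                    break
    # Step 2: count erased data fragments and global parities (local parities
    # are not counted); decodable if the count is at most the global parities.
    data_and_global = [f for f in erased if f in GLOBAL_PARITIES
                       or any(f in g for g in LOCAL_GROUPS.values())]
    return len(data_and_global) <= len(GLOBAL_PARITIES)

for pattern in (["x0", "x1", "x2", "p0"],
                ["x0", "x1", "x2", "y0"],
                ["x0", "x1", "x2", "py"],
                ["x0", "x1", "x2", "x3"]):
    print(pattern, "recoverable" if decodable(pattern) else "not recoverable")
```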
Question 4: Amazon Aurora Database Layer and Storage Layer
a) Aurora replicates each data item 6 ways across 3 AZs, with 2 copies of each item in each AZ. Which of the following statements correctly explains the replication mechanism?
1. An Aurora cluster should have one primary instance and 5 replicas distributed across 3 availability zones.
2. An Aurora cluster should have 6 storage nodes across 3 availability zones.
3. Each data item will be stored in 6 storage nodes across 3 availability zones
4. Each data item will be stored in the primary instance as well as 6 replicas distributed across 3 availability zones.
Answer: option 3 is correct. Aurora separates the database engine from the storage nodes. The storage nodes are responsible for the actual data replication; the database engine (primary and replicas) is only responsible for read/write queries. One DB cluster can have one primary instance for read/write access and up to 15 replica instances for read-only access.
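A minimal sketch of how option 3 can be read: each data item is copied to 6 storage nodes, 2 in each of 3 availability zones. The node and AZ names below are hypothetical, and the database instances are deliberately absent from the placement because they do not hold replicas.

```python
# Illustrative placement of one data item on 6 storage nodes across 3 AZs
# (2 copies per AZ). Node and AZ names are hypothetical.
storage_nodes = {
    "az-1": ["sn-1a", "sn-1b"],
    "az-2": ["sn-2a", "sn-2b"],
    "az-3": ["sn-3a", "sn-3b"],
}

# Every listed storage node holds a copy of the item: 3 AZs x 2 nodes = 6 copies.
copies = [(az, node) for az, nodes in storage_nodes.items() for node in nodes]
print(len(copies), "copies:", copies)
```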
b) Aurora adopts the principle that “the log is the database”. Which of the following statements correctly interprets this principle?
1. Aurora does not store actual database files, only redo log is stored
2. Aurora only sends redo logs across the network. The actual database file will be materialized by replaying the log asynchronously.
3. Aurora periodically compacts the database file and log file into a log-structured page file.
4. Aurora logs all read and write operations and can use that to reconstruct the database content.
Answer: option 2 is correct. See Figure 3 of the Amazon Aurora paper published in SIGMOD 2017.
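A highly simplified sketch of the idea behind option 2: the database tier only ships redo log records, and the storage tier later materializes the page by replaying the log it has accumulated. The record format and page representation below are invented for illustration and do not reflect Aurora's actual log format.

```python
# Sketch of "the log is the database": only redo records cross the network;
# pages are materialized later by replaying the log (illustrative only).
redo_log = []            # what the database tier ships to a storage node
materialized_pages = {}  # page images the storage node builds asynchronously

def write(page_id, key, value):
    # The database tier ships a redo record, never a full page.
    redo_log.append((page_id, key, value))

def materialize(page_id):
    # The storage node replays its redo records to build the current page image.
    page = materialized_pages.setdefault(page_id, {})
    for pid, key, value in redo_log:
        if pid == page_id:
            page[key] = value
    return page

write("page-7", "balance", 100)
write("page-7", "balance", 250)
print(materialize("page-7"))   # {'balance': 250}
```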