School of Computer Science
COMP5349: Cloud Computing Sem. 1/2022
Week 10: Cloud Storage and Database Services
Question 1: Bigtable storage model and read/write behaviour


Assume that there is a Google Bigtable cluster with one master and several tablet servers, together with a Chubby service that stores important metadata about the cluster. Now assume the cluster manages a table that is structurally similar to the sample table in the Bigtable paper. Table 1 shows a sample row of this table.
Each row of the web table represents a web page. The row key is the reversed DNS address of the page, and the table has two column families: content and anchor. The content family stores the actual HTML content of a web page and consists of a single column. A crawler continuously crawls the web, generally visiting the same page multiple times; on each visit, a new snapshot of the page content is downloaded and saved in the web table. The table keeps up to the three most recent snapshots, indexed by timestamps indicating when they were inserted. For instance, t1 → snapshot1 means that snapshot1 of www.cnn.com was inserted at t1. The anchor family stores anchor links from other pages that reference this page. It contains a variable number of columns, each storing one anchor link; the column name is the URL of the referencing page, and the cell value is the anchor text appearing in that page. The timestamp of the data is also kept in the table. For instance, (cnnsi.com, t6) → CNN means that page cnnsi.com has a link pointing to www.cnn.com with the anchor text CNN, and that this information was inserted into the web table at time t6.
Key           content             anchor
com.cnn.www   t1 → snapshot1      (cnnsi.com, t6) → CNN
              t80 → snapshot2     (my.look.ca, t10) → CNN.com

Table 1: Sample data of the web table

a) Bigtable stores data as multidimensional maps. Each key has three dimensions: row key, column key, and timestamp. For instance, the data about snapshot1 is stored as the following key-value pair:
(com.cnn.www, content:, t1) → snapshot1
The data about the first anchor column is stored as another key-value pair:
(com.cnn.www, anchor:cnnsi.com, t6) → CNN

Here, the complete column name is the combination of the column family name and the individual column name inside the family: anchor:cnnsi.com
How many key-value pairs are there for the sample row in Table 1? What are they?
Answer: There are four key-value pairs:
• (com.cnn.www, content:, t1) → snapshot1
• (com.cnn.www, content:, t80) → snapshot2
• (com.cnn.www, anchor:cnnsi.com, t6) → CNN
• (com.cnn.www, anchor:my.look.ca, t10) → CNN.com
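As an illustration (not part of the tutorial sheet), a minimal Python sketch of this multidimensional-map view of the sample row; the integer timestamps 1, 6, 10 and 80 stand in for t1, t6, t10 and t80:

# Minimal sketch: the sample row as a map keyed by
# (row key, column name, timestamp), mirroring Table 1.
web_table = {
    ("com.cnn.www", "content:", 1): "snapshot1",
    ("com.cnn.www", "content:", 80): "snapshot2",
    ("com.cnn.www", "anchor:cnnsi.com", 6): "CNN",
    ("com.cnn.www", "anchor:my.look.ca", 10): "CNN.com",
}

def row_pairs(table, row_key):
    """All key-value pairs of one row, sorted by column name and then
    by timestamp (newest first), the order Bigtable maintains."""
    cells = [(k, v) for k, v in table.items() if k[0] == row_key]
    return sorted(cells, key=lambda kv: (kv[0][1], -kv[0][2]))

for key, value in row_pairs(web_table, "com.cnn.www"):
    print(key, "->", value)  # prints the four pairs listed above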
b) Are all key-value pairs belonging to the same row always saved contiguously in the same file?
Answer: No. The key-value pairs belonging to the same row may be scattered across the memtable and several SSTable files. Only after a major compaction are they saved contiguously in the same SSTable file.
c) Assume the GFS layer of the Bigtable cluster has a replication factor of 3. This applies to both log files and data files. A minor compaction happens at t8, a merge compaction happens at t20, and a major compaction happens at t50. How many times is the data about snapshot1 written to GFS? Describe when and in what format the data is written.
Answer: The data is appended to the commit log at t1; the commit log is replicated 3 times. At the minor compaction at t8, the data is written to an SSTable file, which is replicated 3 times. During the merge compaction at t20, the data is written again to a new SSTable file with 3 replicas. When the major compaction occurs at t50, the data about snapshot1 is written to another new SSTable file containing all the latest data of the web table, again with 3 replicas. In total, the data is written 4 × 3 = 12 times. The old files are garbage collected after the merge and major compactions.
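As a worked tally of the arithmetic above (a sketch that simply restates the four write events from the answer):

# Four write events, each replicated 3 times by GFS: 4 x 3 = 12 writes.
REPLICATION_FACTOR = 3
write_events = [
    ("t1", "append to commit log"),
    ("t8", "minor compaction: new SSTable file"),
    ("t20", "merge compaction: new SSTable file"),
    ("t50", "major compaction: new SSTable file"),
]
for time, event in write_events:
    print(f"{time}: {event} ({REPLICATION_FACTOR} replicas)")
print("total writes:", len(write_events) * REPLICATION_FACTOR)  # 12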
d) Assume a minor compaction happens at t8. A read request is sent to the Bigtable cluster at t15 to read the row with the key com.cnn.www. Where would the tablet server find data to answer this read query?
Answer: After the minor compaction at t8, all data inserted before t8 is stored in an SSTable file:
• (com.cnn.www, content:, t1) → snapshot1
• (com.cnn.www, anchor:cnnsi.com, t6) → CNN
The latest data, (com.cnn.www, anchor:my.look.ca, t10) → CNN.com, is in the memtable of the tablet server. The tablet server answers the read from a merged view of the SSTable file and the memtable.
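A minimal sketch of this merged read path (illustrative names, not Bigtable's API), assuming the t8 SSTable and memtable contents listed above:

# The tablet server serves reads from a merged view of the SSTable(s)
# and the memtable; entries in the memtable take precedence.
sstable_t8 = {
    ("com.cnn.www", "content:", 1): "snapshot1",
    ("com.cnn.www", "anchor:cnnsi.com", 6): "CNN",
}
memtable = {
    ("com.cnn.www", "anchor:my.look.ca", 10): "CNN.com",
}

def read_row(row_key):
    merged = {**sstable_t8, **memtable}  # memtable overrides on ties
    return {k: v for k, v in merged.items() if k[0] == row_key}

print(read_row("com.cnn.www"))  # all three pairs are visible at t15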

Question 2: Erasure Coding in Windows Azure Storage
There is a simple algorithm to decide whether a failure pattern in a Local Reconstruction Code (LRC) is theoretically recoverable: "For each local group, if the local parity is available, while at least one data fragment is erased, we swap the parity with one erased data fragment. The swap operation marks the data fragment as available and the parity as erased. Once we complete all the local groups, we examine the data fragments and the global parities. If the total number of erased fragments (data and parity) is no more than the number of global parities, the algorithm declares the failure pattern information-theoretically decodable."
Assume that the LRC scheme (12, 2, 2) is used as covered in the lecture: the 12 data fragments form two local groups, x0 to x5 with local parity px and y0 to y5 with local parity py, and there are two global parities p0 and p1. Analyse the recoverability of the following failure patterns:
a) Lost fragments: x0, x1, x2, p0
b) Lost fragments: x0, x1, x2, y0
c) Lost fragments: x0, x1, x2, py
d) Lost fragments: x0, x1, x2, x3
Answer (applying the quoted swap algorithm; a sketch that mechanises the check follows the list):
• Lost fragments x0, x1, x2, p0: not recoverable. Swapping px for x0 leaves x1, x2 and the global parity p0 erased, i.e. 3 fragments against only 2 global parities.
• Lost fragments x0, x1, x2, y0: recoverable. Swapping px for x0 and py for y0 leaves only x1 and x2 erased, which the 2 global parities can decode.
• Lost fragments x0, x1, x2, py: recoverable. Swapping px for x0 leaves only x1 and x2 erased; py is a local parity, is not counted by the check, and can be recomputed afterwards.
• Lost fragments x0, x1, x2, x3: not recoverable. Only one swap (px for x0) is possible within the group, leaving x1, x2 and x3 erased, i.e. 3 fragments against only 2 global parities.
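The check can be mechanised. Below is a minimal Python sketch of the quoted swap algorithm for LRC(12, 2, 2); the fragment names follow the notation above, and the helper names are assumptions of this sketch:

# Decodability check for LRC(12, 2, 2): two local groups x0..x5 and
# y0..y5 with local parities px and py, plus global parities p0 and p1.
LOCAL_GROUPS = {
    "px": [f"x{i}" for i in range(6)],
    "py": [f"y{i}" for i in range(6)],
}
GLOBAL_PARITIES = {"p0", "p1"}

def is_decodable(lost):
    erased = set(lost)
    # Swap step: a surviving local parity stands in for one erased
    # data fragment of its group (fragment available, parity erased).
    for parity, group in LOCAL_GROUPS.items():
        if parity not in erased:
            for fragment in group:
                if fragment in erased:
                    erased.discard(fragment)
                    erased.add(parity)
                    break
    # Count step: only data fragments and global parities are examined.
    data = {f for group in LOCAL_GROUPS.values() for f in group}
    remaining = erased & (data | GLOBAL_PARITIES)
    return len(remaining) <= len(GLOBAL_PARITIES)

for pattern in (["x0", "x1", "x2", "p0"], ["x0", "x1", "x2", "y0"],
                ["x0", "x1", "x2", "py"], ["x0", "x1", "x2", "x3"]):
    verdict = "recoverable" if is_decodable(pattern) else "not recoverable"
    print(pattern, "->", verdict)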

Question 3: Amazon Aurora Database Layer and Storage Layer
a) Aurora replicates each data item 6 ways across 3 availability zones (AZs), with 2 copies of each item in each AZ. Which one of the following statements correctly explains the replication mechanism?
1. An Aurora cluster should have one primary instance and 5 replicas distributed across 3 availability zones.
2. An Aurora cluster should have 6 storage nodes across 3 availability zones.
3. Each data item will be stored in 6 storage nodes across 3 availability zones.
4. Each data item will be stored in the primary instance as well as 6 replicas distributed across 3 availability zones.
Answer: Option 3 is correct. Aurora separates the database engine from the storage nodes: the storage nodes are responsible for the actual data replication, while the database engine (primary and replicas) is only responsible for serving read/write queries. One database cluster can have one primary instance for read/write access and up to 15 replica instances for read-only access.
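For context, the Aurora paper pairs this 6-way replication with quorum I/O: out of V = 6 copies, a write needs Vw = 4 and a read needs Vr = 3. A minimal sketch of the quorum arithmetic (helper names are assumptions of this sketch):

# Quorum conditions from the Aurora paper: V = 6, Vw = 4, Vr = 3.
V, Vw, Vr = 6, 4, 3
assert Vr + Vw > V   # every read quorum overlaps every write quorum
assert Vw > V // 2   # two writes cannot succeed on disjoint quorums

def can_write(available):
    """A write still reaches quorum with this many surviving copies."""
    return available >= Vw

def can_read(available):
    return available >= Vr

# Losing a whole AZ (2 copies) keeps both reads and writes available;
# losing an AZ plus one more node (3 copies left) keeps reads only.
print(can_read(4), can_write(4))  # True True
print(can_read(3), can_write(3))  # True False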
b) Aurora adopts the principle that "the log is the database". Which one of the following statements correctly interprets this principle?
1. Aurora does not store actual database files; only the redo log is stored.
2. Aurora only sends redo logs across the network. The actual database file will be materialized by replaying the log asynchronously.
3. Aurora periodically compacts the database file and log file into a log structured page file.
4. Aurora logs all read and write operations and can use that to reconstruct the database content.
Answer: Option 2 is correct. Only redo log records cross the network; each storage node applies the records in the background to materialize the data pages. See Figure 3 of the Amazon Aurora paper published at SIGMOD 2017.
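As an illustration of the idea behind option 2 (hypothetical names, not Aurora's implementation), a minimal sketch in which only redo records are shipped and pages are materialized by replaying them in the background:

# The primary ships only redo log records; the storage node appends
# them durably and materializes data pages by replaying the log later.
from collections import defaultdict

class StorageNode:
    def __init__(self):
        self.redo_log = []              # the only data ever shipped
        self.pages = defaultdict(dict)  # materialized asynchronously

    def receive(self, lsn, page_id, key, value):
        """Durably append a redo record and acknowledge immediately."""
        self.redo_log.append((lsn, page_id, key, value))

    def materialize(self):
        """Background replay: apply redo records in LSN order."""
        for lsn, page_id, key, value in sorted(self.redo_log):
            self.pages[page_id][key] = value

node = StorageNode()
node.receive(lsn=1, page_id=7, key="a", value="v1")
node.receive(lsn=2, page_id=7, key="a", value="v2")
node.materialize()
print(node.pages[7])  # {'a': 'v2'}: the page exists only via log replay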
