CSE 3244: Data Management in the Cloud Score: __________ of 150
Exam #1: Autumn 2020 Clearly Print Name: __________________________ Exam rules:
• Handheld calculators are permitted only to compute algebraic formulas.
• Do not use hadwritten or typed notes or devices connected to the Internet.
Copyright By PowCoder代写 加微信 powcoder
• Write your answers below (on this page). Other pages will not be graded. No exceptions.
1.[10 points] If you set the MAX_UTI_PER_MESG parameter to 100, how many messages would you have to receive to collect at least 1 PB of data (ignore the size of the timestamp) ? Circle all that apply.
A) 10 message B) 500 messages C) 5,000 messages D) 500,000 messages E) None of these
2.[10 points] Here, Twitter provides unstructured data, not structured data. Which of the following properties apply to this data and can not describe well structured data (write down the letters):
________________________________________________________
3.[20 points]. Which of the following Map and Reduce are structured correctly to create a final output (Word Pair, [GIFs] ). (write down all letters that apply)
________________________________________________________
4. [20 points] Using algorithm A, if Twitter provides data every second, how many map tasks will run in order to process the last 1 hour of text? How many reduce tasks will run?
_________________________________________________________
5.[20 points] Provide an example of a possible input to the reduce phase of algorithm A?
_________________________________________________________
6.[10 points] Which algorithms use holistic aggregation in its reducer/combiner functions? ______________
7.[10 points] Holistic aggregation utilizes multiple nodes better than algebraic aggregation? ___True ___ False
8. [10 points] Which of the following changes would make the all-pairs algorithm perform like stripes?
_____________________________________________________________
9.[20 points] On the throughput/capacity curve, what do two points with the same x-value mean? Same y-value?
_____________________________________________________________________________________
10. [20 points] If MAX_UTI_PER_MESG were set to 10, 10,000 and 1,000,000. Use throughput/capacity bottleneck analysis to report the maximum throughput under each setting?
__________ = 10 ; __________ = 10,000 ; __________ = 1,000,000
Page: 1 ALL ANSWERS MUST BE WRITTEN ON PAGE 1 — NO EXCEPTIONS
Page: 2 ALL ANSWERS MUST BE WRITTEN ON PAGE 1 — NO EXCEPTIONS
Pretend you are a data engineer for the WNBA. The WNBA is tracking Twitter messages (tweets) tagged with animated GIFS (that is, short videos). Twitter forwards tweets that inculde “#WNBA” and an animated GIF to your cloud data center. The format for a single tweet is below:
{Unique Tweet Identifier (UTI), GIF, Text}
Each tweet is structured to use exactly 2 MB by padding blank space to the end of text field (GIF is not allowed to exceed 1MB). Twitter does not forward tweets individually. Instead, it forwards them in groups by the time they arrive. So the final structure that you receive looks like this:
{Unique Timestamp,
{UTI, GIF, Text} {UTI, GIF, Text} …
Each message from Twitter is the same size. You get to set a MAX_UTI_PER_MESG parameter to control the size. For example, if MAX_UTI_PER_MESG were 1, Twitter would send a 2 MB per timestamp. If MAX_UTI_PER_MESG were 10, Twitter would send 20 MB per timestamp (padding with blank messages).
Question 1: If you set the MAX_UTI_PER_MESG parameter to 100, how many messages would you have to receive to collect at least 1 PB of data (ignore the size of the timestamp) ?
A) 10 message B) 500 messages C) 5,000 messages D) 500,000 messages E) None of these
Question 2: Here, Twitter provides unstructured data, not structured data. Which of the following properties apply to this data and can not describe well structured data (write down the letter):
A) GIF and Text fields are not the same length across all records
B) Records are not normalized
C) GIF and Text fields are not atomic
D) Key attributes and relations between records are undefined E) Misspelled words can incorrectly label records
The WNBA would like to study which GIFs are associated with popular co-occurring word pairs. Recall, co- occurring word pair is two words that appear in the same Twitter message often. Examples could be “ ” or “WNBA Finals.” You will use Map-Reduce to generate this report. Unique Timestamps are the input keys. The content of the Twitter message are the input values.
About the pseudo code. (1) For each iterates over every element in an array, word in a string or entry in a hash table. I consistently use the letter db/DB to represent a hash table, i.e., a data structure that permits value lookup by string. Finally, append can both (1) add an element to the end of list and/or (2) piece wise add a GIF to the end of a list in an ordered pair (string, list). Assume that append does the right thing in each algorithm.
Page: 3 ALL ANSWERS MUST BE WRITTEN ON PAGE 1 — NO EXCEPTIONS
Algorithm A
Map (k, v) {
For each uti in V {
emit (“batch”, uti)
Reduce (k, v[] ) {
For each uti in v {
For each pair in uti.Text {
DB[pair].append(uti.GIF)
For each entry in DB {
emit (entry, DB[entry])
Algorithm B
Map (k, v) {
For each uti in V {
For each pair in uti.Text
emit (pair, uti)
Reduce (k, v[] ) {
emit (k, v) }
Algorithm C
Map (k, v[] ) {
For each uti in v {
For each pair in uti.Text {
emit (pair, uti)
Reduce (k, v[] ) {
For each uti in v {
For each pair in uti.Text {
DB[pair].append(uti.GIF)
For each entry in DB {
emit (entry, DB[entry])
Algorithm D
Map (k, v) {
For each uti in V {
For each word in uti.Text {
For each neigh in uti.Text{
pair = (word, neigh)
DB[word].append(pair,uti)
emit (word, DB)
Reduce (k, v[] ) {
emit (k, v) }
Algorithm E
Map (k, v) {
For each uti in V {
For each pair in uti.Text {
emit (pair, “1”)
Reduce (k, v[] ) {
emit (k, v.length())
Algorithm F
Map (k, v) {
For each uti in V {
For each word in uti.Text {
For each neigh in uti.Text{
pair = (word, neigh)
DB[word].append(pair,uti)
emit (word, DB)
Reduce (k, v[] ) {
for each db in v {
ReduceDB.append(db)
for each entry in ReduceDB {
emit (entry.pair, entry.uti)
Question 3: Which of the following Map and Reduce are structured correctly to create a final output (Word Pair, [GIFs] ). Note, am not concerned with efficiency in this problem. Also, the errors are logical and related Map-Reduce. You do not need to search for bugs in the pseudo code. (write down all letters that apply)
Question 4. Using algorithm A, if Twitter provides data every second, how many map tasks will run in order to process the last 1 hour of text? How many reduce tasks will run?
Question 5. Provide an example of a possible input to the reduce phase of algorithm A?
Question 6. Which algorithms use holistic aggregation in its reducer/combiner functions?
Question 7. Holistic aggregation utilizes multiple nodes better than algebraic aggregation?
Question 8. Which of the following changes would make the all-pairs algorithm perform like stripes? A. Each map task examines multiple sentences and emitted (pair,
Page: 4 ALL ANSWERS MUST BE WRITTEN ON PAGE 1 — NO EXCEPTIONS
B. If (pair, <1>) emitted data was combined after the map stage before reduce stage
C. If we constrained the co-occurence matrix to sentiment words, i.e., Mij = 1 means that word i occurs in the same sentence as sentiment word j in 1 sentence
D. All of the above
E. None of the above
F. I don’t want to guess. I will take 3 of 10 points partial credit.
Contextual information:
• In your company’s datacenter, blades have 500 GB disk and 8 GB RAM
• Blades in a rack are connected to a 20 port CISCO switch
• Each SSD provides 500 MB/s peak throughput, DDR3 DRAM provides 15 GB/s peak throughput, the
switch transmits at 1 GB/s. You have access to only 1 rack.
Question 9. On the throughput/capacity curve, what do two points with the same x-value mean? Same y- value?
Question 10. If MAX_UTI_PER_MESG were set to 10, 10,000 and 1,000,000. Use throughput/capacity bottleneck analysis to report the maximum throughput under each setting?
__________ = 10 ; __________ = 10,000 ; __________ = 1,000,000
Page: 5 ALL ANSWERS MUST BE WRITTEN ON PAGE 1 — NO EXCEPTIONS
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com