Q1. Write a MapReduce program that counts the number of integers in a given group of integers. For example, if the input is {1, 3, 2, 5, 3, 4}, the program should output 6. You can, but are not required to, use a Combiner in this program. Pseudocode is fine, but make sure you indicate the Map and Reduce functions, and the Combiner (if any), in the program, along with the input, output, and logic of each component. (1.5 pts)
(Answer)
map(key, value): // value: a group of integers
    for v in value:
        emit(1, v)

reduce(key, value): // key: 1, value: a list of integers
    count = 0
    for v in value:
        count += 1
    emit(1, count)
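The pseudocode above can be sketched as runnable Python. This is a single-process simulation of the MapReduce flow (the dictionary of lists stands in for the framework's shuffle), not an actual Hadoop job:

```python
from collections import defaultdict

def map_fn(key, value):
    # value: a group of integers; emit (1, v) for each element
    return [(1, v) for v in value]

def reduce_fn(key, values):
    # key: 1, values: the list of all emitted integers; count them
    count = 0
    for v in values:
        count += 1
    return (key, count)

# Simulate the framework's shuffle: group map output by key, then reduce.
groups = defaultdict(list)
for k, v in map_fn(None, [1, 3, 2, 5, 3, 4]):
    groups[k].append(v)
for k, vs in groups.items():
    print(reduce_fn(k, vs))   # (1, 6)
```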
Q2. Briefly explain how we could approximate (1+a)^b as e^(ab) (1 pt) (chapter 1.3.5)
(Answer)
(1+a)^b = ((1 + 1/(1/a))^(1/a))^(ab) ≈ e^(ab)
where a is small, since lim_{x→∞} (1 + 1/x)^x = e
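A quick numeric sanity check of the approximation (the particular values of a and b are arbitrary choices, not part of the graded answer):

```python
import math

# Small a; b chosen so that ab = 1 (both values are assumptions for illustration).
a, b = 0.001, 1000
lhs = (1 + a) ** b        # (1+a)^b
rhs = math.exp(a * b)     # e^(ab)
print(lhs, rhs)           # both close to e; they agree to within about 0.05%
```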
Q3. What is power law? Give an example (0.5 pts) (chapter 1.3.6)
(Answer)
(0.25) A power law is a linear relationship between the logarithms of two variables; it can be expressed as y = c * x^a (a, c: constants), i.e., log y = log c + a * log x
(0.25) Examples: the number of in-links (y) to the x-th most popular page; the number of sales (y) of the x-th most popular book
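A small Python check (with hypothetical constants c and a) that a power law is linear on a log-log scale:

```python
import math

c, a = 100.0, -1.5                      # hypothetical constants
xs = [1, 2, 4, 8, 16]
ys = [c * x ** a for x in xs]           # y = c * x^a

# log y = log c + a * log x, so on a log-log scale the points lie on a
# straight line whose slope is the exponent a.
slopes = [
    (math.log(ys[i + 1]) - math.log(ys[i]))
    / (math.log(xs[i + 1]) - math.log(xs[i]))
    for i in range(len(xs) - 1)
]
print(slopes)   # every slope equals a = -1.5 (up to floating point)
```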
Q4. Briefly explain the benefit of a distributed file system (1 pt)
(Answer)
1. high availability by storing data redundantly on multiple nodes,
2. minimize data movement across the network, since the servers that store the data
also serve as compute servers,
3. parallel processing
// 1 point if a student mentions one of them
Q5. Draw a simple diagram of connected nodes (machines) and their functionality (e.g., master nodes and slave nodes as in the slides) in a map reduce environment (1 pt)
(Answer) 0.5: diagram
Master node functionality (0.3)
– store metadata about where data files are stored
– coordinate map or reduce tasks to slave nodes
– detect failures
Slave node functionality (0.2)
– store data file and perform map or reduce tasks
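One possible sketch of such a diagram (the layout is illustrative; any drawing showing one master coordinating several slaves conveys the same structure):

```
            +--------------+
            |  Master node |   (metadata, task assignment,
            +--------------+    failure detection)
             /      |      \
            v       v       v
   +---------+ +---------+ +---------+
   | Slave 1 | | Slave 2 | | Slave 3 |   (store data chunks,
   +---------+ +---------+ +---------+    run map/reduce tasks)
```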
Q6. Where does the MapReduce program store 1. input data, 2. intermediate files, and 3. output data? (0.5 pt) Why do we care about where the data are stored? (0.5 pt)
(Answer) (0.5)
1. input and output data: DFS (distributed file system)
2. intermediate files: local file system of map workers
(0.5)
To deal with node failure.
If a node dies, input and output data are not lost because they are also stored on other nodes, but intermediate files are lost. Thus, the map tasks of the failed node restart on another node regardless of whether they were completed or in progress. In contrast, a reduce task of the failed node restarts only if it was in progress; a completed reduce task has already written its output to the DFS, so it does not restart.
Q7. Write the Map and Reduce tasks and their output for joining these two tables: (1pt)
(Answer)
Map (0.25)
for each tuple in table Order:
    emit (orderid, (Order, account, date))    // e.g. (1, (Order, aaa, d1))
for each tuple in table LineItem:
    emit (orderid, (LineItem, itemid, qty))   // e.g. (1, (LineItem, 10, 1))
Reduce (0.25)
// for each key (orderid), join the values tagged Order with the values tagged LineItem
emit (key, joined value) -> (orderid, account, date, itemid, qty)
E.g. Input:
key = 1, values = [(Order, aaa, d1), (LineItem, 10, 1), (LineItem, 20, 3)]
Output:
1, aaa, d1, 10, 1
1, aaa, d1, 20, 3
Final Output (0.5)
1, aaa, d1, 10, 1
1, aaa, d1, 20, 3
2, aaa, d2, 10, 5
2, aaa, d2, 50, 100
3, bbb, d3, 20, 1
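A runnable Python sketch of this reduce-side join, using hypothetical table rows consistent with the final output above (again a simulation of the shuffle, not an actual cluster job):

```python
from collections import defaultdict

# Hypothetical rows consistent with the expected final output.
orders = [(1, "aaa", "d1"), (2, "aaa", "d2"), (3, "bbb", "d3")]              # (orderid, account, date)
line_items = [(1, 10, 1), (1, 20, 3), (2, 10, 5), (2, 50, 100), (3, 20, 1)]  # (orderid, itemid, qty)

# Map: key on orderid and tag each tuple with its table of origin.
pairs = [(oid, ("Order", acct, date)) for oid, acct, date in orders]
pairs += [(oid, ("LineItem", item, qty)) for oid, item, qty in line_items]

# Shuffle: group by key.
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)

# Reduce: join each Order value with every LineItem value sharing the key.
joined = []
for oid, values in sorted(groups.items()):
    order_vals = [v for v in values if v[0] == "Order"]
    item_vals = [v for v in values if v[0] == "LineItem"]
    for _, acct, date in order_vals:
        for _, item, qty in item_vals:
            joined.append((oid, acct, date, item, qty))
for row in joined:
    print(*row)   # one line per joined tuple, matching the final output
```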
Q8. Write a MapReduce program that multiplies two matrices A and B in two stages. You can assume that the matrices are provided to you in a file in a sparse matrix format. Each line of the file represents an element in a matrix. For example, a line: [‘A’, 0, 0, 1] indicates that A[0, 0] = 1. You may assume that both matrices are 5 x 5. (1.5 pts)
Map1 (0.375)
for each element of A:
    emit (j, (A, i, Aij))
for each element of B:
    emit (j, (B, k, Bjk))
Reduce1 (0.375)
// key: j, values: [(A, i, Aij), ..., (B, k, Bjk)], i = 1, ..., 5, k = 1, ..., 5
for each pair (i, k):   // (1, 1), (1, 2), ..., (5, 5)
    emit ((i, k), Aij * Bjk)
Map2 (0.375)
for every input ((i, k), Aij * Bjk):
    emit ((i, k), Aij * Bjk)   // the function is the identity
Reduce2 (0.375)
// input key: (i, k), values: a list of Aij * Bjk
sum = 0
for v in values:
    sum += v
emit ((i, k), sum)
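The two stages can be sketched as a Python simulation. For brevity it uses hypothetical 2 x 2 matrices rather than the 5 x 5 matrices in the question; the structure is identical:

```python
from collections import defaultdict

def shuffle(pairs):
    # Group map output by key, as the framework's shuffle would.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

# Hypothetical small matrices in sparse format: (matrix, row, col, value).
A = [("A", 0, 0, 1), ("A", 0, 1, 2), ("A", 1, 0, 3), ("A", 1, 1, 4)]
B = [("B", 0, 0, 5), ("B", 0, 1, 6), ("B", 1, 0, 7), ("B", 1, 1, 8)]

# Stage 1: Map1 keys on the shared index j; Reduce1 emits one partial
# product Aij * Bjk per (i, k) pair.
map1 = [(j, ("A", i, v)) for (_, i, j, v) in A]
map1 += [(j, ("B", k, v)) for (_, j, k, v) in B]
stage1 = []
for j, values in shuffle(map1).items():
    a_vals = [(i, v) for tag, i, v in values if tag == "A"]
    b_vals = [(k, v) for tag, k, v in values if tag == "B"]
    for i, av in a_vals:
        for k, bv in b_vals:
            stage1.append(((i, k), av * bv))

# Stage 2: Map2 is the identity; Reduce2 sums the partial products per (i, k).
result = {key: sum(vals) for key, vals in shuffle(stage1).items()}
print(result)   # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
```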
Q9. What is reducer size and replication rate, and why do we care about the reducer size? (1.5 pts)
(Answer)
Reducer Size (0.5): the upper bound on the number of values that are allowed to
appear in the list associated with a single key.
Replication Rate (0.5): the number of key-value pairs produced by all the Map tasks on all the inputs, divided by the number of inputs. That is, the replication rate is the average communication from Map tasks to Reduce tasks per input.
(0.5) If the reducer size is small, there are many reducers, so a high degree of
parallelism and a low wall-clock time can be achieved. Also, if the reducer size is small enough that a reduce task's computation fits entirely in main memory, the running time shrinks further because data movement between main memory and disk is avoided.
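A hypothetical numeric illustration of replication rate, simulating only the Map emissions of a one-pass multiplication of two n x n matrices (note this is a different algorithm from the two-stage program in Q8, whose Map1 emits one pair per input and so has replication rate 1):

```python
# In the one-pass scheme, each element A[i][j] is sent to n reducers
# (one per column k of the result) and each element B[j][k] to n reducers
# (one per row i), so the replication rate works out to n.
n = 5
A = [(i, j, 1) for i in range(n) for j in range(n)]   # (row, col, value)
B = [(j, k, 1) for j in range(n) for k in range(n)]

pairs = []
for i, j, v in A:
    for k in range(n):                      # replicate A[i][j] n times
        pairs.append(((i, k), ("A", j, v)))
for j, k, v in B:
    for i in range(n):                      # replicate B[j][k] n times
        pairs.append(((i, k), ("B", j, v)))

num_inputs = len(A) + len(B)                # 50 matrix elements
replication_rate = len(pairs) / num_inputs
print(replication_rate)                     # 5.0, i.e. n
```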