4. MapReduce [20 points]
Now that you’re back from your date, you really need to get to work. Your boss gives
you a dataset from a symmetric social network (like Facebook) where a is a friend of b
implies that b is also a friend of a. The input is a file (sharded) containing such pairs (a,
Copyright By PowCoder代写 加微信 powcoder
– note that either a or b may be a lexicographically lower name. Each line of input is
one such entry (input to initial Map function), and lines are not duplicated (i.e., no (a,b)
and (b,a)). The intern wants to answer the following question:
“Among the set S of all
users with first name ‘Joe’, what fraction (or percentage) of such users (in S) are such
that they are friends with at least one user with first name
‘Donald’?” You can chain
Mapreduces if you want (but only if you must, and even then, only the least number).
You don’t need to write code – pseudocode is fine as long as it is understandable. Your
pseudocode may assume the presence of appropriate
library primitives (e.g.,
‘getfirstname(user _id), etc.). The Map function takes as input a tuple (a,b).
You should have at least some parallelism. You should try to minimize the total
amount of data that is shuffled (from Map to Reduce) and sent to/from HDFS (from
Reduce to the next Map). Ensure that your output does not contain duplicates. You can
set your key and value to arbitrary objects. You cannot retain data at any of the
machines from a task (Map or Reduce) for use in a later task. Chaining MapReduces is
allowed, as long as you don’t over-use chaining (where parallelization could have
helped instead). Each MapReduce in a chain can read the dataset, but there is no other
persistent memory across the chain (apart from the outputs of stages). Please write clear
pseudocode, and is preferable. Be clear and concise and don’t miss any steps.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com