DS/CMPSC 410 Programming Models for Big Data
Topic 1B MapReduce
Copyright @ 2021 , All Rights Reserved
John Yen, Spring 2021
Learning Objectives
• Be Able to Articulate What Map Is and What Reduce Is in MapReduce
• Be Able to Describe How Map Can Address Big Data Challenge #1 for Certain Problems
• Be Able to Describe the Relationship between MapReduce and Functional Programming Languages
• Be Able to Describe How MapReduce Uses Smart Look-ahead to Address Big Data Challenge #1
• Be Able to Determine Whether a Problem Fits the Map Operation
• Be Able to Determine Whether a Problem Fits the Reduce Operation
History of MapReduce: The First Big Data Programming Model
• Innovator: Google
• Motivation: streaming processing of webpages for Google Search indexes
• Time: 2004
• Apache Hadoop (released in 2006, only 2 years later) was adopted by industry quickly.
• Impacts:
• Applied to a wide range of big data analytics problems very quickly.
• Hadoop Distributed File Systems (HDFS) became the industry standard for storing Big Data (also adopted by Spark later).
• Motivated the design of Spark.
References:
• Dean, Jeffrey, and Ghemawat, Sanjay. “MapReduce: Simplified data processing on large clusters.” Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, (2004), pp. 137-150.
• Dean, Jeffrey, and Ghemawat, Sanjay. “MapReduce: simplified data processing on large clusters.” Communications of the ACM, 51.1 (2008): 107-113.
What is Map in MapReduce ?
• Map was first introduced in the functional programming language LISP, by Prof. John McCarthy (one of the pioneers of AI)
• Map takes two arguments:
• arg 1: a function (a function name or a lambda expression, a function without a name)
• arg 2: a list [L1, L2, …, Ln]
where each Li is a parameter list for the function (in arg 1)
• returns a list [O1, O2, …, On] where Oi is the value returned by applying the function (arg 1) to Li
• Suppose the function Is-number returns True if the input is a number, otherwise it returns False.
• Apply Map to the function Is-number and the list [80, “percent”, “students”, “are”, “vaccinated”]
• returned: [True, False, False, False, False]
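The Is-number example above can be sketched in Python, where the built-in map plays the role of LISP's Map (the function name is_number is an assumed stand-in for the slide's Is-number):

```python
def is_number(x):
    """Return True if the input is a number, otherwise False."""
    return isinstance(x, (int, float))

inputs = [80, "percent", "students", "are", "vaccinated"]

# map applies the function to each element independently.
result = list(map(is_number, inputs))
print(result)  # [True, False, False, False, False]
```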
How Does Map Address Big Data Challenge (BDC) #1?
• Map is inherently a completely parallelizable operation because
• Applying the function to different input elements L1, L2, …, Ln is totally independent across elements.
• Aggregating the results (i.e., O1, O2, …, On) returned from the multiple applications of the function (e.g., Is-number) into a list is also a simple concatenation.
• Map Can Solve BDC #1 by dividing the input Big Data into smaller chunks such that each chunk fits in the memory of a machine (we will call these machines “workers”).
Problem: We need to process tweets, and generate a list that contains all hashtags that occur in any tweet.
• Size of Twitter data: 500 GB
• Format of data: Each tweet contains [TwitterID, TweetID, Text]
• Cluster: 25 workers, each with 25 GB memory.
How should we divide/partition the data using Map?
• Write a function F that takes one tweet and returns a list of the hashtags contained in the tweet.
• Divide the input tweets into 25 chunks (about 20 GB per chunk)
• Assign each chunk to a Map worker (20 GB < 25 GB) that performs the map function
• Collect the results (output chunks) from all workers into a final result.
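The divide/map/collect steps above can be sketched as follows; the hashtag extractor F, the toy tweets, and the chunk count are illustrative assumptions, and the per-chunk calls run sequentially here to stand in for the 25 parallel workers:

```python
import re

def extract_hashtags(tweet):
    # tweet is [TwitterID, TweetID, Text]; F returns the hashtags in Text.
    return re.findall(r"#\w+", tweet[2])

def chunk(data, n):
    """Divide the input into at most n roughly equal chunks."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

# Toy stand-ins for the 500 GB of tweets and the 25 workers.
tweets = [
    ["u1", "t1", "Get vaccinated #Covid19 #vaccine"],
    ["u2", "t2", "Go #NittanyLions"],
]
chunks = chunk(tweets, 2)

# Each call is independent; in MapReduce each chunk runs on its own worker.
results = [[tag for t in c for tag in extract_hashtags(t)] for c in chunks]

# Collect: concatenate the output chunks into the final result.
all_hashtags = [tag for r in results for tag in r]
print(all_hashtags)  # ['#Covid19', '#vaccine', '#NittanyLions']
```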
Which of the following problems are suitable for Map?
• P2: Given a set of tweets, generate the set of tweets that mention #Covid-19 and #vaccine.
• P3: Given a set of course descriptions (courseID, text), generate a list of
• P4: Given a set of
What is Reduce in MapReduce?
• Reduce aggregates inputs in the form of (key, value) pairs.
• Reduce (by key) in MapReduce returns a list of (kj, val_listj) pairs, where val_listj contains all values associated with a common key kj.
An exemplar problem suitable for Reduce:
• Given a set of key-value pairs of the form (hashtag, 1),
• we want to compute, for each hashtag, the total number of tweets that mention the hashtag.
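The group-then-aggregate behavior of Reduce (by key) can be sketched as follows (the sample (hashtag, 1) pairs are assumptions for illustration):

```python
from collections import defaultdict

pairs = [("#vaccine", 1), ("#Covid19", 1), ("#vaccine", 1), ("#vaccine", 1)]

# Group: build (kj, val_listj) for each distinct key kj.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: aggregate each value list, here by summing the counts.
totals = {key: sum(vals) for key, vals in groups.items()}
print(totals)  # {'#vaccine': 3, '#Covid19': 1}
```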
How to Assign Tasks to Reduce Workers?
1. Problem: Need to assign keys to Reduce workers such that:
• All (key, value) pairs sharing the same key (possibly generated by different Map workers) are assigned to the same Reduce worker.
• There is a one-to-one mapping of all keys to Reduce workers.
2. Solution of MapReduce:
• Partition the keys into R subsets (using alphabetic order, hashing, or other algorithms), where R is the total number of Reduce workers.
• Assign each subset of keys to one Reduce worker.
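The key-partitioning step can be sketched with a hash-based assignment (R = 4 and the sample keys are assumptions; hashlib is used because Python's built-in hash() is salted per process and would not be stable across workers):

```python
import hashlib

R = 4  # total number of Reduce workers

def reducer_for(key):
    """Deterministically map a key to one of the R Reduce workers."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % R

keys = ["#vaccine", "#Covid19", "#NittanyLions"]
assignment = {k: reducer_for(k) for k in keys}
print(assignment)

# Every key lands on exactly one of the R workers, and the same key
# always lands on the same worker.
assert all(0 <= w < R for w in assignment.values())
```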
An Exemplar Problem for MapReduce
• Problem: Given a set of tweets, generate the total count for each hashtag that occurs in the entire tweet dataset.
• A MapReduce solution:
• Map phase: Take a set of tweets and return a (large) list of (hashtag, 1) pairs, one for each occurrence of a hashtag in a tweet.
• Reduce phase: Aggregate by hashtag so that all of the ‘1’s associated with the same hashtag are summed into a total count for the hashtag.
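The two phases above can be sketched end to end (the toy tweets and the hashtag pattern are assumptions for illustration):

```python
import re
from collections import defaultdict

def map_phase(tweets):
    """Emit one (hashtag, 1) pair per hashtag occurrence."""
    pairs = []
    for text in tweets:
        for tag in re.findall(r"#\w+", text):
            pairs.append((tag, 1))
    return pairs

def reduce_phase(pairs):
    """Sum the 1s that share the same hashtag key."""
    counts = defaultdict(int)
    for tag, one in pairs:
        counts[tag] += one
    return dict(counts)

tweets = ["#lion roar #lion", "#nittany #lion pride"]
counts = reduce_phase(map_phase(tweets))
print(counts)  # {'#lion': 3, '#nittany': 1}
```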
What is the Cost of Map + Reduce?
• Cost (Partition Data for Map) + Cost(Map) + Cost(Gather Inputs for Reduce) + Cost(Reduce) + Cost (Saving Output Reduce in Output File)
• Cost(Gather Inputs for Reduce) can be COSTLY
Dean, Jeffrey, and Ghemawat, Sanjay. “MapReduce: simplified data processing on large clusters.” Communications of the ACM 51.1 (2008): 107-113.
Why can gathering inputs for Reduce workers be costly?
• Each Reduce worker needs to gather all (key, value) pairs whose keys are assigned to it.
• Non-scalable Approach:
• Each Map worker generates one output file of all the (key, value) pairs it produced.
• Each Reduce worker copies the output of each Map worker to its local disk. (costly in time and space)
• Each Reduce worker filters the key-value pairs in each of the Map worker output files (local copies) for the keys assigned to it. (high computational cost)
• Each Reduce worker combines all of the filtered key-value files, one from each Map worker, into one big input file for itself.
Does this idea help?
What if each Map worker generates output files based on the needs of the Reduce workers (plans ahead)?
This way, each Reduce worker does not need to filter/search for the keys assigned to it.
The Reduce worker simply takes the output file prepared for it by each Map worker and combines them into one input file for itself.
A MapReduce Game
Problem: Given a set of randomly ordered cards, generate an output where all cards of the same number/letter are grouped together. (sorting is not needed)
MapReduce Solution:
• Each number/letter of a card (1, 2, …, 10, J, Q, K) is treated as a key.
Mapper/Reducer Configuration:
• Each team is formed by 5 map workers (mappers), 4 reduce workers (reducers), and (optionally) an observer.
• Each map worker is assigned a deck of cards (in random order) and given a paper with four areas (one marked for cards that should go to each reducer).
Each team plans how to partition the keys among the reducers (1 min). Talking is NOT allowed after this planning time.
Reducers should turn away from the table. Only when all the mappers finish can the reducers turn around and start working.
How to Reduce the Cost of Map + Reduce?
• Plan Ahead: Prepare the output files of Map workers based on the key sets assigned to Reduce workers.
Suppose the total number of Reduce workers is R.
1. Choose an algorithm to assign keys to the R Reduce workers (e.g., hash keys into a table with R entries, where each entry contains all keys assigned to one Reduce worker).
2. Create R output files for each Map worker, one output file for each Reduce worker.
3. Each Map worker uses the same algorithm (e.g., the same hash function in step 1) that assigns keys to the R Reduce workers to route each output (key, value) pair to the corresponding output file.
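The plan-ahead steps above can be sketched as follows; each Map worker buckets its output into R in-memory lists standing in for the R per-reducer output files (R = 3 and the sample pairs are assumptions):

```python
import hashlib

R = 3  # total number of Reduce workers

def reducer_for(key):
    # The same deterministic hash function runs on every Map worker.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % R

def partition_output(pairs):
    """Split one Map worker's output into R per-reducer 'files'."""
    files = [[] for _ in range(R)]
    for key, value in pairs:
        files[reducer_for(key)].append((key, value))
    return files

map_output = [("lion", 1), ("nittany", 1), ("lion", 1)]
files = partition_output(map_output)
# Reduce worker k simply takes files[k] from each Map worker; both
# ("lion", 1) pairs land in the same per-reducer file, so no filtering
# is needed on the Reduce side.
```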
Two Key Innovative Ideas of MapReduce
1. Introduce a “structure”, the (Key, Value) pair, to the intermediate results (generated by the Map workers) to be aggregated.
2. Map workers plan ahead, using information about how the aggregation is divided among Reduce workers, when generating their intermediate output files.
Aggregate (reduce) task is distributed with smart look-ahead planning
• Outputs of Map: (key, value) pairs
• Number of Map computing nodes (workers): M
• Number of Reduce computing nodes (workers): R
• Look-ahead planning:
• Partition all keys into R groups; assign each group to a Reduce worker.
• Map computing nodes generate intermediate results in R files, one for each group of keys.
• Efficient aggregation by Reduce workers:
• Each Map worker generates one intermediate output file for each group of keys assigned to each Reduce worker.
• Every Reduce worker aggregates M files (one from each Map worker), all for the group of keys assigned to it.
• Each Reduce worker is assigned a range of keys.
• The Map worker generates one intermediate output file for each Reduce worker.
• Each Reduce worker simply copies the file prepared for it by EACH Map worker.
[Figure: Shuffle between Map workers and Reduce workers. Each Map worker emits (key, value) pairs such as (lion, 2) and (nittany, 2) into the per-reducer file their keys hash to; the Reduce worker assigned “lion” gathers the value list (lion, <5, 2, 1, ...>) and outputs the total (lion, 10000), while the Reduce worker assigned “nittany” gathers (nittany, <2, 1, ...>) and outputs (nittany, 1000).]
Plan-ahead Using Hash Functions
• One Problem:
• Keys are generally not known during the planning stage.
• Assigning keys to Reducers (e.g., using alphanumeric order) may result in highly uneven load distributions among Reducers.
• Solution:
• A hash function enables efficient access to the keys assigned to each Reducer.
• The Master uses a hash function to map all possible keys to one of R values; each value corresponds to a Reduce worker and an intermediate file for each Reducer.
• The Mapper also generates another set of R hash functions, one for each Reducer, which are used for storing the (key, value) pairs in each intermediate file.
• Reducer #k copies output file #k from each Mapper.
• The Reducer uses the kth hash function to read the values of the keys assigned to it by the Master.
Rapid Growth of MapReduce Instances in Google
Dean, Jeffrey, and Ghemawat, Sanjay. “MapReduce: simplified data processing on large clusters.” Communications of the ACM 51.1 (2008): 107-113.
Computing Problems Suitable for MapReduce
• Embarrassingly parallel applications that can be decomposed into a map step and a reduce (aggregation) step.
• The intermediate results of the map step can be represented as key-value pairs.
• The aggregation of the reduce step groups intermediate results by keys.
Early Impacts of MapReduce at Google
Dean, Jeffrey, and Ghemawat, Sanjay. “MapReduce: simplified data processing on large clusters.” Communications of the ACM 51.1 (2008): 107-113.
Apache Hadoop
• An open-source platform that implements MapReduce.
• The Hadoop Distributed File System (HDFS) became the industry standard for storing and processing massive datasets.
A MapReduce Approach for TF-IDF
• Compute TF-IDF (WebPages)
  • Map phase:
    • Partition all WebPages into “chunks”
    • Do (in parallel) for each chunk:
      • For each WPi in the chunk:
        • Initialize TF(WPi, wj) to 0
        • For each word wj in WPi (by scanning) do
          • TF(WPi, wj) = TF(WPi, wj) + 1
        • End
  • Aggregation phase: aggregate the TF of the same word wj
    • Partition all words into “chunks”
    • Do (in parallel) for each chunk:
      • For each word wj in the chunk do
        • For each non-zero TF of wj, increment DF(wj)
      • End
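The TF and DF computations above can be sketched as follows (the toy pages are assumptions; each per-page TF computation is independent and could run on a separate Map worker):

```python
from collections import Counter

# Toy stand-ins for the partitioned WebPages.
pages = {
    "WP1": "big data needs big clusters",
    "WP2": "map and reduce data",
}

# Map phase: compute TF(WPi, wj) per page by scanning its words.
TF = {wp: Counter(text.split()) for wp, text in pages.items()}

# Aggregation phase: increment DF(wj) once per page with non-zero TF.
DF = Counter()
for wp, counts in TF.items():
    for word in counts:
        DF[word] += 1

print(TF["WP1"]["big"])  # 2
print(DF["data"])        # 2 (appears in both pages)
```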
Robustness Requirement of Big Data Computing
• Needs to be able to tolerate node failures.
• MapReduce Solution:
• If the Master detects that a Map worker has failed:
• The Master dynamically reassigns all of its tasks to other nodes.
• The Master also informs all Reduce workers about the changed location of the Map worker (and its intermediate files).
• If the Master detects that a Reduce worker has failed:
• The Master dynamically reassigns all of its tasks to another Reduce worker.
Design Choice for MapReduce Applications
• Divide the input such that each input chunk for a map task is much smaller than the memory of each node in the cluster (e.g., 64 MB chunks, circa 2010).
• The number of reduce tasks is typically a small multiple of the total number of worker machines in a cluster.
Ideas Worth Spreading in MapReduce
• Distributed computing by dividing a task into (1) a completely decomposable (Map) phase and (2) a key-based aggregation (Reduce) phase.
• Key-guided Structure: The key is the dependency between the two phases; use keys to structure the intermediate results.
• Coordinated Output/Input: Each Mapper generates R output files, one for each Reducer.
• Coordinated Storage/Retrieval: All Mappers use the same hash function for each Reducer’s files to facilitate the Reducer’s retrieval of the (key, value) pairs assigned to it.
MapReduce Opens a New Paradigm for Computing
• All These Ideas Lay the Foundation for Scalable Solutions for Big Data Analytics.
• All These Ideas are Leveraged by Spark and other Big Data Programming/Analytics Models.
• All These Ideas are Important for Designing Scalable Data Analytics Pipelines in Your Future Career.