Big Data Processing Semester 2, 2020
Lecture 3 – MapReduce Basics
Ke Deng Ke.deng@rmit.edu.au
RMIT Classification: Trusted
Last week
Cloud Computing
• What is cloud?
  – Rebranding of web 2.0
  – Utility computing
• Enabling technology – virtualization
• Everything as a service
  – Infrastructure as a Service (IaaS)
  – Platform as a Service (PaaS)
  – Software as a Service (SaaS)
Data Centre
• Server, rack, cluster
  – Bandwidth, latency, capacity (RAM and disk)
8/5/2020 Big Data Processing 2
Last week – Apache Hadoop Ecosystem
The base Apache Hadoop framework is composed of modules:
• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
• Hadoop YARN – a platform responsible for managing computing resources in clusters and using them to schedule users' applications.
Content
• Typical Big Data Problem
• MapReduce Execution Platform
• MapReduce Basics
• MapReduce – Local Aggregation
Apache Hadoop Ecosystem
• Hadoop MapReduce
  – A programming model for large-scale data processing.
Typical Big Data Problem
Map:
• Iterate over a large number of records
• Extract something of interest from each
Shuffle and sort intermediate results
Reduce:
• Aggregate intermediate results
• Generate final output
(Dean and Ghemawat, OSDI 2004)
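This pattern can be sketched as a single-machine simulation. The class and method names below are invented for illustration only; this is plain Java showing the shape of the computation, not Hadoop code:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReducePattern {

    public static Map<String, Integer> wordCount(List<String> records) {
        // Map phase: iterate over records, emit (word, 1) for each word.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String record : records)
            for (String word : record.split("\\s+"))
                emitted.add(new SimpleEntry<>(word, 1));

        // Shuffle and sort: group emitted values by key; TreeMap keeps keys sorted.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : emitted)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        // Reduce phase: aggregate each key's values into the final output.
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> kv : grouped)
            result.put(kv.getKey(), kv.getValue().stream().mapToInt(Integer::intValue).sum());
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("Hello wordcount", "Hello MapReduce")));
    }
}
```

In the real framework the three phases run on many machines; the point of the sketch is only that map, shuffle/sort, and reduce are separate stages with key-value pairs flowing between them.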
MapReduce
• Map: iterate over a large number of records and extract something of interest from each
• Shuffle and sort intermediate results
• Reduce: aggregate intermediate results and generate final output
Power of MapReduce
Simplicity – the MapReduce programmer only needs to:
• Prepare the input data
• Implement the mapper and reducer
• Optionally implement the combiner and partitioner
MapReduce
Programmers just specify four functions:
map (k1, v1) → [<k2, v2>]
combine (k2, [v2]) → [<k2, v2>]
partition (k2, number of partitions) → partition for k2
reduce (k2, [v2]) → [<k3, v3>]
• For each input split, map, combine, and partition run on the same server.
The MapReduce Execution Framework transparently handles everything else…
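The partition function decides which reducer receives a given intermediate key. The sketch below mirrors the logic of Hadoop's default HashPartitioner; the class itself is illustrative, not Hadoop code:

```java
public class HashPartitionSketch {
    // Mirrors Hadoop's default HashPartitioner:
    // (hash(key) & Integer.MAX_VALUE) % numPartitions.
    // The mask clears the sign bit so the result is never negative,
    // and every occurrence of a key lands in the same reduce partition.
    public static int partition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition.
        System.out.println(partition("MapReduce", 3) == partition("MapReduce", 3));
    }
}
```

This determinism is what guarantees that all values with the same key end up at the same reducer.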
MapReduce Execution Platform
MapReduce Execution Framework transparently handles everything else…
A key idea is to separate "what" from "how"
• The developer specifies the computation that needs to be performed
• The execution framework (also sometimes called the "runtime") handles the actual execution
What is included…
MapReduce Execution Framework transparently handles everything else…
Handles scheduling
• Assigns workers to map and reduce tasks
Handles "data distribution"
• Moves processes to data
Handles synchronization
• Gathers, sorts, and shuffles intermediate data
Handles errors and faults
• Detects worker failures and restarts
Handles scheduling
• Each MapReduce job is divided into smaller units called tasks (e.g., 4 smaller map tasks and 3 smaller reduce tasks).
• In large jobs, the total number of tasks may exceed the number of tasks that can run on the cluster concurrently.
• A task queue must be maintained, and the scheduler decides which task to run once a cluster node becomes available.
• Across different jobs, scheduling also involves coordinating their tasks.
Handles scheduling – YARN
• The ResourceManager (RM) in YARN tracks resources on a cluster and assigns them to applications that need them.
• The scheduler is the part of the RM that manages the sharing of computing resources in a cluster among tasks from the same or different jobs.
Fair Scheduler
• The Fair Scheduler is a popular choice (recommended by Cloudera) among the schedulers YARN supports.
• In its simplest form, it shares resources fairly among all jobs running on the cluster.
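As a concrete example, the Fair Scheduler is selected in yarn-site.xml; the property and class names below are as documented for Apache Hadoop YARN:

```xml
<!-- yarn-site.xml: tell the ResourceManager to use the Fair Scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```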
Handles scheduling – Stragglers
• The speed of a MapReduce job is bounded by the running time of the slowest reduce task.
• Straggler: a task that takes an unusually long time to complete.
Stragglers – two causes
1. A machine suffering from broken hardware
   – Solution: speculative execution – an identical copy of the task is executed on a different machine, and the result of whichever copy finishes first is used.
   – Animation: https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2017/04/Speculative-Execution-in-Spark.gif
2. Skew in the distribution of values associated with intermediate keys (intermediate data)
Handles scheduling – Stragglers
Skew in the distribution of values associated with intermediate keys
• Example: 3 intermediate keys – a, b, c
• The reduce task responsible for processing the most frequent key will run much longer.
• Note: the Zipfian distribution is common in text processing.
Handles scheduling – Stragglers
Zipfian distribution
• One of a family of related discrete power-law probability distributions.
• Formally, the normalized frequency of an element of rank k decreases as a power of k.
• [Figure: percentage of the most frequent word, the 2nd most frequent, the 3rd, and so on…]
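The formula the slide refers to can be written out from the standard definition of the Zipf distribution (the slide's own rendering of it was lost in extraction):

```latex
% Normalized frequency of the element of rank k, out of N distinct
% elements, with exponent s (s \approx 1 for natural-language text):
f(k; s, N) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}}
```

With s = 1 this says the 2nd most frequent word occurs about half as often as the most frequent, the 3rd about a third as often, and so on.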
Handles "data distribution"
Data and Code Co-location
• While the key idea behind MapReduce is to move the code, not the data, sometimes we need to move data to the code.
• To achieve data locality, the scheduler starts a task on the node that holds the relevant block of data and moves the code to the data.
• If this is not possible (for example, the node is already running too many tasks), new tasks will be started elsewhere (on another node), and the data will be streamed over the network.
  – Optimization: prefer a node on the same rack in the datacenter as the node holding the relevant data block, since inter-rack bandwidth is significantly less than intra-rack bandwidth.
Handles Synchronization
Synchronization
• Enforces the proper execution order when multiple concurrently running processes try to update common data.
Synchronization in MapReduce
• Intermediate key-value pairs must be grouped by key. This involves:
  – all nodes executing map tasks
  – copying intermediate data over the network – shuffle and sort
  – all nodes executing reduce tasks
• Reduce computation cannot start until all the mappers have finished emitting key-value pairs and all intermediate key-value pairs have been shuffled and sorted.
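This barrier can be illustrated with a thread-based sketch in plain Java (not Hadoop; the names are invented for this sketch). `invokeAll` stands in for the framework waiting until every map task has finished before the shuffle begins:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ShuffleBarrier {
    // Runs each "map task" on its own thread; invokeAll blocks until
    // every mapper has finished emitting - the barrier before reduce.
    public static Map<String, Long> run(List<String> splits) {
        ExecutorService pool = Executors.newFixedThreadPool(splits.size());
        try {
            List<Callable<List<String>>> mapTasks = splits.stream()
                .map(split -> (Callable<List<String>>) () -> List.of(split.split("\\s+")))
                .collect(Collectors.toList());
            List<Future<List<String>>> done = pool.invokeAll(mapTasks); // the barrier

            // Shuffle and sort: only now can intermediate pairs be grouped by key.
            return done.stream()
                .flatMap(f -> { try { return f.get().stream(); }
                                catch (Exception e) { throw new RuntimeException(e); } })
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c")));
    }
}
```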
Handles Errors and Faults
MapReduce is explicitly designed around low-end commodity servers
• Hardware failure
  – disk failures
  – memory failures
  – planned outages (e.g., system maintenance and hardware upgrades)
  – unexpected outages (e.g., power failure, connectivity loss)
• Software failure
  – No software is bug-free
  – Solving large problems may hit extreme cases where bugs trigger unusual errors
  – Large datasets may contain various types of corrupted data or records
The MapReduce execution framework must thrive in this hostile environment
MapReduce Basics
MapReduce
Programmers implement two functions:
map (k1, v1) → [<k2, v2>]
reduce (k2, [v2]) → [<k3, v3>]
All values with the same key are reduced together
• (k1, v1) is each individual input record on the same server: a key and a value.
• [<k2, v2>] is the list of intermediate key-value pairs emitted by the map function, stored on the server where the map function runs.
• (k2, [v2]) is a record produced by shuffling and sorting the outputs of all map functions: a key (k2) and the list of values associated with that key.
• [<k3, v3>] is the set of records output by each individual reduce function, stored on the server where the reduce function runs.
Example: "map" and "reduce" only
WordCount
Reads text files and counts the frequency of the words.
• Each mapper takes a line of the input file as input and breaks it into words.
• It then emits a key/value pair for each word, in the form (word, 1).
• Each reducer sums the counts for each word and emits a single key/value pair with the word and its sum.
For example, given a text file:
  Hello wordcount MapReduce Hadoop program. This is my first MapReduce program
The mappers emit:
  Hello 1, wordcount 1, MapReduce 1, Hadoop 1, program 1, This 1, is 1, my 1, first 1, MapReduce 1, program 1
After shuffle, sort, and reduce:
  first 1, Hadoop 1, Hello 1, is 1, MapReduce 2, my 1, program 2, This 1, wordcount 1
Example: "map" and "reduce" only
Word Count mapper and reducer (pseudocode):

Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);
Word Count mapper: Java

private static class MyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable ONE = new IntWritable(1);
  private final static Text WORD = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = ((Text) value).toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      WORD.set(itr.nextToken());
      context.write(WORD, ONE);
    }
  }
}
Word Count reducer: Java

private static class MyReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final static IntWritable SUM = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    Iterator<IntWritable> iter = values.iterator();
    int sum = 0;
    while (iter.hasNext()) {
      sum += iter.next().get();
    }
    SUM.set(sum);
    context.write(key, SUM);
  }
}
MapReduce
Programmers implement two functions:
map (k1, v1) → [<k2, v2>]
reduce (k2, [v2]) → [<k3, v3>]
All values with the same key are reduced together

Optional functions implemented by programmers:
• combine (k2, [v2]) → [<k2, v2>]
• partition (k2, # of partitions) → partition for k2

Why do we need combine?
MapReduce – Combiners

If no combiner is implemented:
• All emitted key-value pairs from the mappers need to be copied across the network.
• So the amount of intermediate data will be larger than the input data. This is inefficient.

For example, the input "Hello wordcount MapReduce Hadoop program. This is my first MapReduce program" produces the pairs:
  Hello 1, wordcount 1, MapReduce 1, Hadoop 1, program 1, This 1, is 1, my 1, first 1, MapReduce 1, program 1

Solution
Before the shuffle and sort, perform local aggregation on the output of each mapper – a combiner.
Combiner – computes a local count for each key over all the documents processed by the associated mapper.
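The effect of a combiner on one mapper's output can be simulated in plain Java (a sketch, not the Hadoop API; the class and method names are illustrative): instead of emitting one (word, 1) pair per token, the map task pre-aggregates counts locally, so only one pair per distinct word has to cross the network.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch (not the Hadoop API): simulate what a combiner
// does to one mapper's output for the word-count example above.
public class CombinerSketch {
    // Local aggregation: one (word, count) pair per distinct word,
    // instead of one (word, 1) pair per token.
    public static Map<String, Integer> combine(String line) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : line.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String doc = "Hello wordcount MapReduce Hadoop program "
                   + "This is my first MapReduce program";
        Map<String, Integer> out = combine(doc);
        // 11 tokens collapse to 9 distinct keys: less data to shuffle.
        System.out.println(out.size() + " pairs instead of 11");
        System.out.println(out);
    }
}
```

For word count the aggregation is the same as the reducer's, which is why Hadoop word-count jobs commonly register the reducer class as the combiner as well.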
Partition, Shuffle and Sorting

Partition & Shuffle
After the map tasks have completed, the framework begins exchanging the intermediate outputs from the map tasks to where they are required by the reducers.
This process of moving map outputs to the reducers is known as shuffling.
All values for the same key are always reduced together, regardless of which mapper is their origin.
For each (key, value) pair, the MapReduce platform decides which partition it should go to. If a programmer wants to determine which partition a pair will go to, they implement a partitioner.
Partition, Shuffle and Sorting

Sorting
Before a reducer starts, all intermediate key-value pairs generated by the mappers get sorted by key, and not by value. Values passed to each reducer are not sorted; they can be in any order.
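The shuffle-and-sort contract can be simulated in plain Java (illustrative names, no Hadoop dependency): intermediate pairs are grouped by key, and the keys reach the reducer in sorted order, while each key's list of values arrives in no particular order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of shuffle and sort (illustrative, not Hadoop):
// group intermediate pairs by key; a TreeMap keeps keys in sorted order,
// mirroring the sort-by-key guarantee. Value order stays arbitrary.
public class ShuffleSketch {
    public static Map<String, List<Integer>> shuffle(String[][] pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] p : pairs) {
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(p[1]));
        }
        return grouped;
    }

    public static void main(String[] args) {
        String[][] mapOutput = {
            {"program", "1"}, {"Hello", "1"}, {"program", "1"}, {"is", "1"}
        };
        // Keys come out sorted; all values for "program" meet in one list.
        System.out.println(shuffle(mapOutput));
    }
}
```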
[Figure series: word-count data flow – mappers emit pairs such as (Ayush, 1) and (Mona, 1), combiners aggregate them locally, and a hash partitioner routes each key to a reducer.]

Partition
The default partitioner computes a hash value for the key and assigns the partition based on this result.
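Hadoop's default HashPartitioner follows this rule. A plain-Java sketch of the computation (illustrative class name, no Hadoop classes): mask off the sign bit of the key's hash so the value is non-negative, then take the remainder modulo the number of partitions.

```java
// Sketch of the default partitioning rule (Hadoop's HashPartitioner
// computes the same expression); plain Java, no Hadoop dependency.
public class HashPartitionSketch {
    // Mask off the sign bit so the result is non-negative, then take
    // the remainder modulo the number of partitions (reduce tasks).
    public static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition, so all of
        // its values meet at one reducer regardless of the mapper.
        System.out.println(getPartition("Mona", 4));
        System.out.println(getPartition("Mona", 4));
    }
}
```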
A summary – as a programmer,

You have limited control over data and execution flow!
• All algorithms must be expressed in "map, reduce, combine, partition"

You don't know:
• Where mappers and reducers run
• When a mapper or reducer begins or finishes
• Which input a particular mapper is processing
• Which intermediate key a particular reducer is processing

These decisions are handled by the execution framework: Resource Manager, Application Master, Node Manager, Containers – one container runs one MapReduce task (a map task or a reduce task).
MapReduce – Local Aggregation

Local Aggregation

Transfer data over the network
Exchange of intermediate results from the processes (e.g., mappers) that produced them to the processes (e.g., reducers) that will ultimately consume them.

Data written to local disk in Hadoop
Intermediate results are written to local disk before being sent over the network.

Local aggregation of intermediate results is done on the keys
• it uses an in-mapper combiner
• it preserves state across multiple inputs (e.g., 3 files)
Combiner and In-mapper Combining

Combiner (introduced already)
• Combiners aggregate the term counts of intermediate key-value pairs produced by each map task.
• Intermediate results are written to local disk before being sent over the network.

Combiner – An Example
Input: "Hello wordcount MapReduce Hadoop program. This is my first MapReduce program"
Written to local disk by the mapper:
  Hello 1, wordcount 1, MapReduce 1, Hadoop 1, program 1, This 1, is 1, my 1, first 1, MapReduce 1, program 1
After the combiner:
  first 1, Hadoop 1, Hello 1, is 1, MapReduce 2, my 1, program 2, This 1, wordcount 1

Note: Hadoop makes no guarantees on how many times combiners are called; it could be 0, 1, or multiple times.
Combiner and In-mapper Combining

In-mapper Combining
• An associative array (i.e., a Map in Java) is introduced inside the mapper to tally up term counts within a single document.
• With this technique, the combiner functionality is applied directly inside the mapper; there is no need to run a separate combiner.

In-mapper Combining – An Example
Input: "Hello wordcount MapReduce Hadoop program. This is my first MapReduce program"
Written to local disk:
  first 1, Hadoop 1, Hello 1, is 1, MapReduce 2, my 1, program 2, This 1, wordcount 1
Combiner (baseline mapper):

class Mapper {
  def map(key: Long, value: String) = {
    for (word <- tokenize(value)) {
      emit(word, 1)
    }
  }
}

In-mapper Combining (per document):

class Mapper {
  def map(key: Long, value: String) = {
    val counts = new Map()
    for (word <- tokenize(value)) {
      counts(word) += 1
    }
    for ((k, v) <- counts) {
      emit(k, v)
    }
  }
}
Can we do even better?
Combiner and In-mapper Combining

In-mapper Combining (Preserve State across documents)
• The mapper's Initialize method is called prior to processing any input key-value pairs
  – Initialize an associative array for holding term counts
• It is possible to preserve state across multiple calls of the Map method:
  – We can continue to accumulate partial term counts in the associative array across multiple documents
  – Emit key-value pairs only when the mapper has processed all documents, i.e., emission of intermediate data is deferred until the Close method.
Mapper lifecycle hooks:

class Mapper {
  def setup() = {
    ...
  }
  def map(key: Long, value: String) = {
    ...
  }
  def cleanup() = {
    ...
  }
}

In-mapper combining with state preserved across documents:

class Mapper {
  val counts = new Map()
  def map(key: Long, value: String) = {
    for (word <- tokenize(value)) {
      counts(word) += 1
    }
  }
  def cleanup() = {
    for ((k, v) <- counts) {
      emit(k, v)
    }
  }
}
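In the Java MapReduce API the same pattern uses the Mapper's setup() and cleanup() hooks. A plain-Java simulation of the idea (illustrative names, no Hadoop dependency): the associative array is created once, map() only accumulates across documents, and nothing is emitted until cleanup().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of in-mapper combining with state preserved
// across documents (illustrative; not the Hadoop Mapper API).
public class InMapperCombiner {
    // Created once, as in setup(): shared across all map() calls.
    private final Map<String, Integer> counts = new HashMap<>();

    // Called once per document; only accumulates, emits nothing.
    public void map(String document) {
        for (String word : document.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
    }

    // Called once after all documents, as in cleanup():
    // emits exactly one pair per distinct word.
    public List<String> cleanup() {
        List<String> emitted = new ArrayList<>();
        counts.forEach((k, v) -> emitted.add(k + " " + v));
        return emitted;
    }

    public static void main(String[] args) {
        InMapperCombiner m = new InMapperCombiner();
        m.map("Hello wordcount MapReduce");  // document 1
        m.map("my first MapReduce program"); // document 2
        // "MapReduce" is emitted once, with its count across documents.
        System.out.println(m.cleanup());
    }
}
```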
In-mapper Combining – Advantage and Disadvantage

Advantage
• Provides programmers control over
  – when local aggregation occurs
  – how exactly it takes place
• More efficient than using actual combiners
  – Combiners reduce the intermediate data shuffled across the network, but don't reduce the number of key-value pairs emitted by the mappers in the first place.
  – With in-mapper combining, the mappers generate only those key-value pairs that need to be shuffled across the network.
Disadvantage – Scalability bottleneck
• Memory size
  – It requires sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in an input split.
• Solution to limited memory: periodically
  – "block" input key-value pairs
  – "flush" the in-memory data structure
Idea: instead of emitting intermediate data only after every key-value pair has been processed, emit partial results after processing every n key-value pairs.
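The blocking-and-flushing idea can be sketched in plain Java (illustrative names; the threshold n is an assumed, tunable parameter): the in-memory map is emitted and cleared after every n input pairs, which bounds memory use while still aggregating locally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of in-mapper combining with a periodic flush (illustrative,
// not the Hadoop API): emit partial counts every n input pairs so the
// in-memory map never grows without bound.
public class FlushingCombiner {
    private final Map<String, Integer> counts = new HashMap<>();
    private final List<String> emitted = new ArrayList<>();
    private final int n;   // flush threshold (an assumption: tune per job)
    private int seen = 0;

    public FlushingCombiner(int n) { this.n = n; }

    public void map(String word) {
        counts.merge(word, 1, Integer::sum);
        if (++seen % n == 0) flush();  // partial results, bounded memory
    }

    // Emit all partial counts accumulated so far, then reset the map.
    public void flush() {
        counts.forEach((k, v) -> emitted.add(k + " " + v));
        counts.clear();
    }

    public List<String> emitted() { return emitted; }

    public static void main(String[] args) {
        FlushingCombiner m = new FlushingCombiner(3);
        for (String w : "a b a b a c".split(" ")) m.map(w);
        m.flush();  // cleanup: emit whatever is left
        System.out.println(m.emitted());
    }
}
```

The reducer still sums all partial counts per key, so the final totals are unchanged; only the degree of local aggregation varies with n.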
Local Aggregation – Algorithm Correctness

Given a large user log from a popular website, each record consists of:
• a key -> a user id
• a value -> the number of clicks in a day
A user may have many records in the log.

Task: compute the mean value per user
e.g., John: (1+2+3+4+5)/5 = 3
First Solution (map and reduce only)

[Figure: all of one user's records, e.g. John's values 1, 2, 3, 4, 5, flow from the mappers to a single reducer.]

The reducer computes the mean over all of the user's values directly: Mean(1,2,3,4,5) = 15/5 = 3. This is correct.
Second Solution (First Solution + Combiner)

[Figure: each combiner computes a partial mean of its mapper's values; the reducer then averages the partial means: Mean(1.5, 4) = 5.5/2 = 2.75, not 3.]

Second Solution – incorrect
The mean is not associative and commutative: the mean of partial means is not, in general, the overall mean.
Third Solution (First Solution + Combiner emitting (sum, count) pairs)

[Figure: each combiner emits a partial (sum, count) pair, e.g. (3, 2) and (12, 3); the reducer adds the sums and the counts before dividing: (3+12)/(2+3) = 15/5 = 3, the correct mean.]
Third Solution – still a problem

Hadoop makes no guarantees on how many times combiners are called; it could be 0, 1, or multiple times.
What if the combiner is called 0 times? The reducer then receives the mapper's raw output, which has a different form from the combiner's output:
Output of Mapper ≠ Output of Combiner
Fourth Solution

Make the mapper emit the same form the combiner emits, so that Output of Mapper = Output of Combiner. Then, since Hadoop makes no guarantees on how many times combiners are called, the result stays correct whether the combiner runs 0, 1, or multiple times.
Fifth Solution (Fourth Solution + in-mapper combining with state preserved across documents)
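The contrast between the second and the later solutions can be checked in plain Java (a sketch with illustrative names, no Hadoop): averaging partial means gives the wrong answer, while combining (sum, count) pairs and dividing once at the end is insensitive to how the values are grouped across mappers.

```java
import java.util.Arrays;

// Sketch (illustrative, no Hadoop): why (sum, count) pairs make the
// per-user mean combiner-safe while partial means do not.
public class MeanSketch {
    // Wrong: mean of partial means (the second solution).
    public static double meanOfMeans(double[] a, double[] b) {
        double ma = Arrays.stream(a).average().orElse(0);
        double mb = Arrays.stream(b).average().orElse(0);
        return (ma + mb) / 2;
    }

    // Right: combine (sum, count) pairs and divide once at the end.
    // The grouping of values across mappers does not affect the result.
    public static double meanOfPairs(double[] a, double[] b) {
        double sum = Arrays.stream(a).sum() + Arrays.stream(b).sum();
        int count = a.length + b.length;
        return sum / count;
    }

    public static void main(String[] args) {
        double[] split1 = {1, 2};     // first mapper's values for John
        double[] split2 = {3, 4, 5};  // second mapper's values for John
        System.out.println(meanOfMeans(split1, split2)); // 2.75 (wrong)
        System.out.println(meanOfPairs(split1, split2)); // 3.0  (correct)
    }
}
```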
Next Lecture

MapReduce Algorithm Design Patterns