
Big Data Processing Semester 2, 2020
Lecture 3 – MapReduce Basics
Ke Deng Ke.deng@rmit.edu.au

Last week
Cloud Computing
• What is cloud?
– Rebranding of web 2.0
– Utility computing
• Enabling technology – virtualization
• Everything as a service
– Infrastructure as a Service (IaaS)
– Platform as a Service (PaaS)
– Software as a Service (SaaS)
Data Centre
• Server, rack, cluster
• Bandwidth, latency, capacity (RAM and disk)

Last week – Apache Hadoop Ecosystem
The base Apache Hadoop framework is composed of modules:
• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
• Hadoop YARN – a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications.

Content
 Typical Big Data Problem
 MapReduce Execution Platform
 MapReduce Basics
 MapReduce – Local Aggregation

Apache Hadoop Ecosystem
• Hadoop MapReduce – a programming model for large-scale data processing.

Typical Big Data Problem
 Iterate over a large number of records    (Map)
 Extract something of interest from each   (Map)
 Shuffle and sort intermediate results
 Aggregate intermediate results            (Reduce)
 Generate final output                     (Reduce)
(Dean and Ghemawat, OSDI 2004)

MapReduce
A large number of records
Map – iterate over a large number of records and extract something of interest from each
Shuffle and sort intermediate results
Reduce – aggregate intermediate results and generate final output

Power of MapReduce
Simplicity – the MapReduce programmer only needs to:
 Prepare the input data
 Implement the mapper and reducer
 Optionally implement a combiner and a partitioner

MapReduce
Programmers just specify four functions:
map (k1, v1) → [<k2, v2>]
combine (k2, [v2]) → [<k2, v2>]
partition (k’, number of partitions) → partition for k’
reduce (k2, [v2]) → [<k3, v3>]
For a given input split, map, combine and partition all run on the same server; the reduce functions run on (typically different) servers.
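To make these signatures concrete, here is a minimal Java sketch (illustrative class names, not the lecture’s code) of how they line up with Hadoop’s generic types, instantiated for word count:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map (k1, v1) → [<k2, v2>] corresponds to Mapper<K1, V1, K2, V2>;
// for word count: k1 = byte offset of the line, v1 = the line of text,
// k2 = a word, v2 = a count.
class SkeletonMapper extends Mapper<LongWritable, Text, Text, IntWritable> {}

// combine and reduce share the shape (k2, [v2]) → [<k3, v3>], which
// corresponds to Reducer<K2, V2, K3, V3>; a combiner is simply a Reducer
// whose output types equal its input types.
class SkeletonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {}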

MapReduce
Programmers just specify the four functions above; the MapReduce Execution Framework transparently handles everything else…

MapReduce Execution Platform

MapReduce Execution Framework transparently handles everything else…
 A key idea is to separate “what” from “how”
• The developer specifies the computation that needs to be performed
• The execution framework (also sometimes called the “runtime”) handles the actual execution
What is included …

MapReduce Execution Framework transparently handles everything else…
 Handles scheduling
• Assigns workers to map and reduce tasks
 Handles “data distribution”
• Moves processes to data
 Handles synchronization
• Gathers, sorts, and shuffles intermediate data
 Handles errors and faults
• Detects worker failures and restarts

Handles scheduling
Each MapReduce job is divided into smaller units called tasks (e.g., 4 smaller map tasks and 3 smaller reduce tasks).
 In large jobs, the total number of tasks may exceed the number of tasks that can be run on the cluster concurrently.
 A task queue needs to be maintained, and the scheduler decides which task to process once a cluster node becomes available.
 Among different jobs, scheduling involves coordination of their tasks.

Handles scheduling – YARN
The ResourceManager (RM) in YARN tracks resources on a cluster and assigns them to applications that need them.
• The scheduler is the part of the RM that manages how the computing resources of a cluster are shared among tasks from the same or different jobs.
Fair Scheduler
• The Fair Scheduler is a popular choice (recommended by Cloudera) among the schedulers YARN supports.
• In its simplest form, it shares resources fairly among all jobs running on the cluster.

Handles scheduling – Straggler
The speed of a MapReduce job is bounded by the running time of the slowest reduce task.
 Straggler: a task that takes an unusually long time to complete.
Straggler – Two Causes
 Machine suffering from broken hardware
• Solution: speculative execution (an identical copy of the same task is executed on a different machine, and the result of whichever copy completes first is used.)
Animation: https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2017/04/Speculative-Execution-in-Spark.gif

Handles scheduling – Straggler
 Skew in the distribution of values associated with intermediate keys (e.g., 3 keys – a, b, c – in the intermediate data)
• The reduce task responsible for processing the most frequent elements will run much longer.
• Note: the Zipfian distribution is common in text processing.

Handles scheduling – Straggler
 Zipfian distribution
One of a family of related discrete power-law probability distributions. Formally, the normalized frequency of an element of rank k is

f(k; s, N) = (1/k^s) / (Σ_{n=1}^{N} 1/n^s)

where N is the number of elements and s is the exponent characterising the distribution.
[Figure: the percentages of the most frequent word, the 2nd most frequent word, the 3rd most frequent word, and so on.]

MapReduce Execution Framework transparently handles everything else…
 Handles “data distribution”
• Moves processes to data

Handles “data distribution”
Data and Code Co-location
While the key idea behind MapReduce is to move the code, not the data, sometimes we need to move data to the code.
• To achieve data locality, the scheduler starts the task on the node that holds a particular block of data and moves the code to the data.
• If this is not possible (for example, the node is already running too many tasks), new tasks will be started elsewhere (on another node), and the data will be streamed over the network.
– Optimization: prefer a node on the same rack in the datacenter as the node holding the relevant data block, since inter-rack bandwidth is significantly less than intra-rack bandwidth.

MapReduce Execution Framework transparently handles everything else…
 Handles synchronization
• Gathers, sorts, and shuffles intermediate data

Handles Synchronization
 Synchronization
• Enforce the proper execution order when multiple concurrently running processes try to update common data.
Synchronization in MapReduce
• Intermediate key-value pairs must be grouped by key. This involves
– all nodes executing map tasks
– copying intermediate data over the network – shuffle and sort
– all nodes executing reduce tasks
• Reduce computation cannot start until all the mappers have finished emitting key-value pairs and all intermediate key-value pairs have been shuffled and sorted.

MapReduce Execution Framework transparently handles everything else…
 Handles errors and faults
• Detects worker failures and restarts

Handles Errors and Faults
MapReduce is explicitly designed around low-end commodity servers
• Hardware failure
– disk failures
– memory failures
– planned outages (e.g., system maintenance and hardware upgrades)
– unexpected outages (e.g., power failure, connectivity loss, etc.)
• Software failure
– No software is bug free
– Solving large problems may hit extreme cases where bugs trigger unusual errors
– Large datasets may contain various types of corrupted data or records
The MapReduce execution framework must thrive in this hostile environment.

MapReduce Basics

MapReduce
Programmers implement two functions:
map (k1, v1) → [<k2, v2>]
reduce (k2, [v2]) → [<k3, v3>]
All values with the same key are reduced together.
• (k1, v1) is each individual input record (a key and a value) on a server.
• [<k2, v2>] is the set of records output by each individual map function; it is stored on the server where that map function runs.
• (k2, [v2]) is a record produced by shuffling and sorting the outputs of all map functions: a key and the list of values associated with that same key, i.e., k2.
• [<k3, v3>] is the set of records output by each individual reduce function; it is stored on the server where that reduce function runs.

Example: “map” and “reduce” only
WordCount
Reads text files and counts the frequency of the words.
• Each mapper takes a line of the input file as input and breaks it into words.
• It then emits a key/value pair for each word (in the form of (word, 1)).
• Each reducer sums the counts for each word and emits a single key/value pair with the word and its sum.
For example, given a text file:
Hello wordcount MapReduce Hadoop program. This is my first MapReduce program
the mappers emit:
Hello 1, wordcount 1, MapReduce 1, Hadoop 1, program 1, This 1, is 1, my 1, first 1, MapReduce 1, program 1
and the reducers produce:
first 1, Hadoop 1, Hello 1, is 1, MapReduce 2, my 1, program 2, This 1, wordcount 1

Example: “map” and “reduce” only
Word Count mapper (pseudocode):

Map(String docid, String text):
    for each word w in text:
        Emit(w, 1);

Word Count reducer (pseudocode):

Reduce(String term, Iterator values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);

Word Count mapper: Java

private static class MyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable ONE = new IntWritable(1);
  private final static Text WORD = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = ((Text) value).toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      WORD.set(itr.nextToken());
      context.write(WORD, ONE);
    }
  }
}

Word Count reducer: Java

private static class MyReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final static IntWritable SUM = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    Iterator<IntWritable> iter = values.iterator();
    int sum = 0;
    while (iter.hasNext()) {
      sum += iter.next().get();
    }
    SUM.set(sum);
    context.write(key, SUM);
  }
}

MapReduce
Programmers implement two functions:
map (k1, v1) → [<k2, v2>]
reduce (k2, [v2]) → [<k3, v3>]
All values with the same key are reduced together.
Optional functions implemented by programmers:
• combine (k2, [v2]) → [<k2, v2>]
• partition (k2, # of partitions) → partition for k2
Why do we need combine?

MapReduce – Combiners
If we do not implement a combiner
 All emitted key-value pairs from the mappers need to be copied across the network.
 So the amount of intermediate data will be larger than the input data. This is inefficient.
For example, for the input
Hello wordcount MapReduce Hadoop program. This is my first MapReduce program
the mappers emit eleven intermediate pairs:
Hello 1, wordcount 1, MapReduce 1, Hadoop 1, program 1, This 1, is 1, my 1, first 1, MapReduce 1, program 1
Solution
 Before the shuffle and sort, perform local aggregation on the output of each mapper – a combiner.
 Combiner – computes a local count for each key over all the documents processed by the associated mapper.
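As an illustration of how a combiner is attached (a minimal driver sketch, assuming the MyMapper and MyReducer classes shown earlier in this lecture are visible), the word-count reducer itself can be registered as the combiner, since summing counts is associative and commutative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(MyMapper.class);      // emits (word, 1) pairs
    job.setCombinerClass(MyReducer.class);   // local aggregation per map task
    job.setReducerClass(MyReducer.class);    // final aggregation
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}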

Partition, Shuffle and Sorting
 Partition & Shuffle
After the map tasks have completed, the framework begins exchanging the intermediate outputs from the map tasks, moving them to where they are required by the reducers.
• This process of moving map outputs to the reducers is known as shuffling.
• All values for the same key are always reduced together, regardless of which mapper is their origin.
• For each (key, value) pair, the MapReduce platform decides which partition it should go to.
• If a programmer wants to determine which partition a (key, value) pair will go to, they implement a partitioner.
 Sorting
Before a reducer starts, all intermediate key-value pairs generated by the mappers get sorted by key, and not by value. Values passed to each reducer are not sorted; they can be in any order.

[Figure: word-count data flow – each mapper’s output (e.g., Ayush 1, Ayush 1, …, Mona 1, Mona 1, …) passes through a Combiner, and a hash-based Partition step then routes each key to a reducer.]
How many pairs come from the map?
Partition
• The default partitioner computes a hash value for the key and assigns the partition based on this result.
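For illustration, a custom partitioner equivalent in spirit to Hadoop’s default HashPartitioner might look like this (a sketch, not the lecture’s code):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mask off the sign bit of the key's hash and take it modulo the number
// of reduce tasks, so every occurrence of the same key lands in the same
// partition (and therefore reaches the same reducer).
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered on the job with job.setPartitionerClass(WordPartitioner.class).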

A summary – as a programmer,
 You have limited control over data and execution flow!
• All algorithms must be expressed in “map, reduce, combine, partition”
 You don’t know:
• Where mappers and reducers run
• When a mapper or reducer begins or finishes
• Which input a particular mapper is processing
• Which intermediate key a particular reducer is processing
These decisions are made by the Resource Manager, Application Master, Node Manager, and Containers: one container runs one MapReduce task (a map task or a reduce task).

MapReduce – Local Aggregation

Local Aggregation
 Transfer data over the network
• Exchange of intermediate results from the processes (e.g., mappers) that produced them to the processes (e.g., reducers) that will ultimately consume them.
 Data written to local disk in Hadoop
• Intermediate results are written to local disk before being sent over the network.
 Local aggregation of intermediate results is on the keys
• it uses an in-mapper combiner
• it preserves state across multiple inputs (e.g., 3 files)

Combiner and In-mapper Combining
 Combiner (introduced already)
• Combiners aggregate term counts of intermediate key-value pairs produced by each map task.
• Intermediate results are written to local disk before being sent over the network.
Combiner – An Example
Hello wordcount MapReduce Hadoop program. This is my first MapReduce program
The mapper’s output, written to local disk:
Hello 1, wordcount 1, MapReduce 1, Hadoop 1, program 1, This 1, is 1, my 1, first 1, MapReduce 1, program 1
After the Combiner:
first 1, Hadoop 1, Hello 1, is 1, MapReduce 2, my 1, program 2, This 1, wordcount 1
Note: Hadoop makes no guarantees on how many times combiners are called; it could be 0, 1, or multiple times.

Combiner and In-mapper Combining
 In-mapper Combining
• An associative array (i.e., a Map in Java) is introduced inside the mapper to tally up term counts within a single document.
• With this technique, the combiner functionality is applied directly inside the mapper; there is no need to run a separate combiner.
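A minimal Java sketch of this per-document version (class name assumed; not the lecture’s code): the associative array lives inside one map() call, so aggregation happens within a single document only.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PerDocumentCombiningMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Map<String, Integer> counts = new HashMap<>();  // local to this document
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      counts.merge(itr.nextToken(), 1, Integer::sum);
    }
    // Emit each distinct word once, with its count in this document.
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}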

In-mapper Combining – An Example
Hello wordcount MapReduce Hadoop program. This is my first MapReduce program
Written to local disk (already aggregated; no separate combiner run):
first 1, Hadoop 1, Hello 1, is 1, MapReduce 2, my 1, program 2, This 1, wordcount 1

Combiner vs. In-mapper Combining: pseudocode

With a separate combiner, the mapper emits one pair per word occurrence:

class Mapper {
  def map(key: Long, value: String) = {
    for (word <- tokenize(value)) {
      emit(word, 1)
    }
  }
}

With in-mapper combining, the mapper tallies counts within each document and emits one pair per distinct word:

class Mapper {
  def map(key: Long, value: String) = {
    val counts = new Map()
    for (word <- tokenize(value)) {
      counts(word) += 1
    }
    for ((k, v) <- counts) {
      emit(k, v)
    }
  }
}

Can we do even better?

Combiner and In-mapper Combining
 In-mapper Combining (preserve state across documents)
• The mapper’s Initialize (setup) method is called prior to processing any input key-value pairs
– Initialize an associative array for holding term counts
• It is possible to preserve state across multiple calls of the Map method
– We can continue to accumulate partial term counts in the associative array across multiple documents
– Emit key-value pairs only when the mapper has processed all documents, i.e., emission of intermediate data is deferred until the Close (cleanup) method

The mapper skeleton:

class Mapper {
  def setup() = { ... }
  def map(key: Long, value: String) = { ... }
  def cleanup() = { ... }
}

Word count with state preserved across all documents of an input split:

class Mapper {
  val counts = new Map()
  def map(key: Long, value: String) = {
    for (word <- tokenize(value)) {
      counts(word) += 1
    }
  }
  def cleanup() = {
    for ((k, v) <- counts) {
      emit(k, v)
    }
  }
}
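In Hadoop’s Java API, the Initialize and Close hooks above correspond to the Mapper’s setup() and cleanup() methods. A minimal sketch of the state-preserving word-count mapper (class name assumed; not the lecture’s code):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StatePreservingMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Map<String, Integer> counts;

  @Override
  protected void setup(Context context) {
    counts = new HashMap<>();  // called once, before any map() call
  }

  @Override
  public void map(LongWritable key, Text value, Context context) {
    // Tally locally across every document this task processes; emit nothing yet.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      counts.merge(itr.nextToken(), 1, Integer::sum);
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Called once, after the last map() call: emit one pair per distinct word.
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}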
In-mapper Combining – Advantages and Disadvantages
 Advantages
• Provides programmers control over
– when local aggregation occurs
– how exactly it takes place
• More efficient than using actual combiners
– Combiners reduce the intermediate data shuffled across the network, but they don’t reduce the number of key-value pairs emitted by the mappers in the first place.
– With in-mapper combining, the mappers generate only those key-value pairs that need to be shuffled across the network.
 Disadvantage – scalability bottleneck
• Memory size – it requires sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in an input split.
• Solution to limited memory: periodically
– “block” input key-value pairs
– “flush” the in-memory data structure
– Idea: instead of emitting intermediate data only after every key-value pair has been processed, emit partial results after processing every n key-value pairs.

Local Aggregation – Algorithm Correctness
Given a large user log from a popular website, each record consists of:
• a key -> a user id
• a value -> the number of clicks in a day
A user may have many records in the log
….
<John, 1>
<John, 2>
<John, 3>
<John, 4>
<John, 5>
….
Task: compute the mean value per user
e.g., John: (1+2+3+4+5)/5 = 3

First Solution (map and reduce only)
The mapper emits each input record unchanged:
…. <John, 1> <John, 2> <John, 3> <John, 4> <John, 5> ….
After the shuffle, the reducer receives all of a user’s values together and computes the mean directly:
<John, [1, 2, 3, 4, 5]>
Mean(1,2,3,4,5) = 15/5 = 3 ✓
This is correct, but every key-value pair must be shuffled across the network.
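A Java sketch of this first solution’s reducer (names assumed; not the lecture’s code): it is correct without a combiner because all of a user’s values reach a single reduce() call.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class NaiveMeanReducer
    extends Reducer<Text, IntWritable, Text, DoubleWritable> {
  @Override
  public void reduce(Text user, Iterable<IntWritable> clicks, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    long count = 0;
    for (IntWritable c : clicks) {  // every value for this user is here
      sum += c.get();
      count++;
    }
    context.write(user, new DoubleWritable((double) sum / count));
  }
}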

Second Solution (First Solution + Combiner)
Reuse the reduce logic as a combiner, so each combiner emits the local mean of the values its mapper produced:
Combiner 1: <John, Mean(1,2)> = <John, 1.5>
Combiner 2: <John, Mean(3,4,5)> = <John, 4>
The reducer then computes the mean of the partial means:
Mean(1.5, 4) = 5.5/2 = 2.75 ≠ 3

Second Solution – incorrect
The mean is not associative and commutative: the mean of partial means is not, in general, the overall mean.

Third Solution (First Solution + Combiner emitting <sum, count>)
Each combiner now emits a partial sum together with the number of records it covers:
Combiner 1: <John, (3, 2)>     (1+2 = 3 over 2 records)
Combiner 2: <John, (12, 3)>    (3+4+5 = 12 over 3 records)
The reducer adds the partial sums and counts, then divides:
(3+12)/(2+3) = 15/5 = 3 ✓

Third Solution – still incorrect
Hadoop makes no guarantees on how many times combiners are called; it could be 0, 1, or multiple times.
What if the combiner is called 0 times?
• The reducer then receives the mapper’s raw <user, click> pairs rather than <sum, count> pairs.
• Output of Mapper ≠ Output of Combiner, so the reducer cannot correctly handle both forms.

Fourth Solution
Make the mapper emit <sum, count> pairs itself, e.g., <John, (1, 1)>, <John, (2, 1)>, …; the combiner and reducer add the partial sums and counts as before.

Fourth Solution – correct
Output of Mapper = Output of Combiner
Hadoop makes no guarantees on how many times combiners are called (0, 1, or multiple times), and now every case yields the same, correct result.
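A hedged Java sketch of this fourth solution, encoding the (sum, count) pair as a comma-separated Text value (the lecture’s actual value type is not shown; a custom Writable would also work). Mapper and combiner outputs have the same form, so the combiner can safely run 0, 1, or many times:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanByUser {
  // Emits each click count as a partial (sum, count) pair: "clicks,1".
  // Assumes each log record is "userId<TAB>clicks".
  public static class MeanMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      context.write(new Text(parts[0]), new Text(parts[1] + ",1"));
    }
  }

  // Aggregates partial pairs; its output type equals its input type,
  // so this class can serve as the combiner.
  public static class SumCountCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0, count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(",");
        sum += Long.parseLong(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      context.write(key, new Text(sum + "," + count));
    }
  }

  // Final reducer divides the total sum by the total count to get the mean.
  public static class MeanReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0, count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(",");
        sum += Long.parseLong(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      context.write(key, new Text(Double.toString((double) sum / count)));
    }
  }
}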

Fifth Solution (Fourth Solution + in-mapper combining, with state preserved across documents)

Next Lecture
MapReduce Algorithm Design Patterns