7CCSMBDT – Big Data Technologies Week 3
Grigorios Loukides, PhD (grigorios.loukides@kcl.ac.uk)
Spring 2016/2017
1
Objectives 1/2
Analytics flow for big data
LECTURE 2
Data collection
The data is collected and ingested into a big data stack
Data Preparation
Issues that would prevent meaningful processing are resolved
TODAY
Analysis modes
The mode of analysis is determined
Analysis types
The type of analysis is determined
Visualizations
The analysis results are presented to the user
2
Objectives 2/2
Today:
Software for Big Data Analysis in Batch mode (results updated “infrequently”)
Specifically:
Motivation
Distributed file systems
Hadoop
MapReduce
Read: Chapter 2 from Leskovec & Chapters 6.1, 7.1 from Bahga
Slides based on http://www.mmds.org
3
Motivation (1/2)
Social network provider wants to count the number of friends of each user
Simple “solution”: For each node, count neighbors
Problem: social networks are very large
Social connections of a neighborhood of 3.5K residents in Seattle
Facebook has ~1.6B users
http://sudocity.com/columbia-city-social-network-analysis/ 4
Motivation (2/2)
Ranking of web pages by Google
Simple “solution”: PageRank (iterative matrix multiplication*)
Problem: the matrix has 20+ billion columns (and 20+ billion rows)
*http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
5
New architecture (1/3)
Solution: new architecture based on
Commodity hardware
Computer nodes (conventional processors)
Commodity network (Ethernet) to connect the nodes
A switch connects the nodes within each rack
In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO
6
New architecture (2/3)
Solution: new architecture. But:
How do you distribute computation?
How do you write distributed programs?
How do you deal with machine failures (loss of a single node, loss of a rack)?
If you have 1,000 servers, expect to lose 1 per day
1000 servers out of 1M of Google’s servers fail every day!
7
New architecture (3/3)
Write software to deal with these BUTs. Software stack:
Distributed file system
stores large data
provides replication
protects against failures
Programming model
performs large-scale, distributed computations efficiently
tolerates hardware failures
MapReduce is such a programming model
8
Let’s start with
Distributed file system stores large data
provides replication
protects against failures
9
Distributed file system (1/6)
Distributed file system
Provides global file namespace
Used when
files are huge (100s of GBs to TBs)
existing files are rarely updated
reads and appends are common
Examples: Google GFS, Hadoop HDFS, CloudStore
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
10
Distributed file system (2/6)
Distributed file system
Files are divided into chunks that are stored in chunk servers and replicated on different machines or racks (recovery from server or rack failures)
[Figure: chunks C0-C5 and D0-D1 replicated across Chunk servers 1, 2, 3, ..., N]
Master node stores the metadata used to find the chunks of a file. The master node is also replicated.
Directory for the file system is also replicated
Client library for accessing files (talks to master, connects to chunk servers)
11
Distributed file system (3/6)
Distributed file system: Hadoop HDFS
Chunks == blocks, Master node == Namenode, Chunk server == Datanode
[Figure: Namenode and Datanodes (chunk servers) storing blocks (chunks)]
12
Distributed file system (4/6)
Distributed file system: HDFS read path
(1) The client library asks the Namenode for the block locations of the file
(2) The Namenode checks that the file exists
(3) The Namenode checks that the client has read permission
(4) The Namenode returns the block locations, sorted by their distance to the client
    (metadata, e.g. {file.txt: Block A on Datanodes 1,3,5; Block B on Datanodes 2,4,5; ...})
(5) The Datanodes stream the data to the client
    (if a server becomes unavailable, the client can read another replica on a different server)
[Figure: client library, Namenode with metadata, Datanodes 1..N holding blocks]
13
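To make the read path concrete from the client's side, here is a minimal sketch assuming the third-party Python package hdfs (a WebHDFS client) is installed; the Namenode address, port, user name, and file path are illustrative assumptions, not part of the lecture.

from hdfs import InsecureClient

# Hypothetical Namenode address and user; adjust to your cluster.
client = InsecureClient('http://namenode.example.com:9870', user='hadoop')

# Behind the scenes the client library asks the Namenode for block
# locations and then streams the blocks from the Datanodes.
with client.read('/user/hadoop/file.txt', encoding='utf-8') as reader:
    content = reader.read()
print(content[:100])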
Distributed file system (5/6)
Distributed file system: HDFS write path (1/2)
(1) The client library sends a create-file request to the Namenode
(2) The Namenode checks whether the file already exists
(3) The Namenode checks that the client has write permission
(4) The client receives an output-stream object
(5) The data is split into packets, which are added to a queue
(6) The Namenode allocates new blocks (on Datanodes)
[Figure: client library, Namenode with metadata {file.txt: Chunk A on 1,3,5; Chunk B on 2,4,5; ...}, Datanodes 1..N]
14
Distributed file system (6/6)
Distributed file system: HDFS write path (2/2)
(7) The client connects to the Datanodes allocated for the blocks
(8) Datanode 1 consumes data from the queue and writes it to Datanode 2 (for replication); Datanode 2 sends an acknowledgement back to Datanode 1, ..., and Datanode 1 sends an acknowledgement to the client
(9) The client closes the output stream and sends a request to the Namenode to close the file
15
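For completeness, a matching write sketch with the same (assumed) hdfs WebHDFS client; the address, user, path, and data are illustrative. Behind the scenes the Namenode allocates blocks and the Datanodes replicate the packets among themselves in a pipeline, as in steps (5)-(8) above.

from hdfs import InsecureClient

client = InsecureClient('http://namenode.example.com:9870', user='hadoop')  # assumed address/user

# The client buffers the data into packets and streams them to the first
# Datanode of each allocated block; replication to the other Datanodes
# happens transparently to the client.
client.write('/user/hadoop/out.txt', data='hello HDFS\n', encoding='utf-8', overwrite=True)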
Let’s proceed to
Programming model
performs large-scale, distributed computations efficiently
tolerates hardware failures
MapReduce is such a programming model / system / paradigm
OVERVIEW
USE in a WordCount example
Environment in detail
16
MapReduce: Overview
17
MapReduce: Overview
Map task:
extract something you care about
• Gets chunks from the distributed file system
• Turns chunks into key-value pairs
• Code tells what chunks to get and what key-value pairs to create
18
MapReduce: Overview
Shuffle and sort task:
• Performed by the system automatically
• Data from all mappers are grouped by the key, split among reducers, and sorted by the key.
• Each reducer obtains all values associated with the same key.
19
MapReduce: Overview
Reduce task:
Aggregate, summarize, filter, or transform
20
MapReduce: Overview
Write the combined output
21
MapReduce: Overview
Overview (once again)
Map task: extract something you care about
Group by key: shuffle and sort
Reduce task
Write the combined output
22
MapReduce functions
To implement the Map and Reduce operations, we write code
Map and Reduce are functions specified by the programmer:

Map(k, v) → <k', v'>*
Input: a key-value pair
Output: a possibly empty list of key-value pairs
One Map call for every different (k, v) pair

Reduce(k', <v'>*) → <k', v''>*
Input: a key and a possibly empty list of its associated values
Output: a possibly empty list of key-value pairs
One Reduce call for every different key k'
23
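To make these signatures concrete, here is a minimal plain-Python sketch for word counting; the function names map_fn and reduce_fn are illustrative, not from the slides.

def map_fn(key, value):
    # Map(k, v) -> <k', v'>* : one call per input pair
    # key: e.g. a document id (ignored here), value: a line of text
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce(k', <v'>*) -> <k', v''>* : one call per distinct key
    yield (key, sum(values))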
MapReduce: WordCount
MAP (provided by the programmer): reads input and produces a set of key-value pairs
Group by key (done by the system): collects all pairs with the same key
Reduce (provided by the programmer): collects all values belonging to the key and outputs the result

Huge text file (input):
"The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now, the robotics we're doing, is what we're going to need ...'"

Map output (key, value):
(The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) ...

After group by key:
(crew, 1) (crew, 1); (space, 1); (the, 1) (the, 1) (the, 1); (shuttle, 1); (recently, 1); ...

Reduce output:
(crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) ...
24
Note: the data is read sequentially (only sequential reads are needed)
MapReduce: WordCount
mapper.py
reducer.py
25
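The slide's mapper.py and reducer.py listings are not reproduced in this text; below is a typical Hadoop Streaming style sketch of what such scripts usually look like (reading lines from stdin and emitting tab-separated key-value pairs on stdout). It is an assumption about their content, not a copy of the slide's code.

# mapper.py
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Emit one (word, 1) pair per word
        print(f"{word}\t1")

The reducer receives its input already sorted by key, so all counts for a word arrive as adjacent lines:

# reducer.py
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Locally, the pipeline can be mimicked with: cat input.txt | python3 mapper.py | sort | python3 reducer.py. With Hadoop Streaming, the same two scripts are passed as the -mapper and -reducer of the streaming jar.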
Map-Reduce environment
It takes care of:
Workflow
Splitting the input data
Scheduling the program’s execution across a set of machines
Managing required inter-machine communication
Performing the group by key (shuffle & sort) step
Handling machine failures
We will see how in detail
26
MapReduce workflow
(1) The input data is split into pieces and multiple instances of the program are started.
(2) One instance is elected as master, the rest are workers. The master assigns map or reduce task(s) to idle workers.
27
MapReduce workflow
(3) A worker assigned a Map task reads its input split, parses key-value pairs out of it, and passes each pair to the user's map function. The intermediate pairs produced by the map function are buffered in memory.
(4) The buffered pairs are written to local disk, partitioned into R regions. The locations of the buffered pairs on disk are passed back to the master.
28
MapReduce workflow
(5) When a Reduce worker is notified by the master about these locations, it reads the buffered pairs from the map workers' disks and groups them by key (key, list-of-values).
(6) The Reduce worker passes each key and its values to the user's reduce function, whose output is appended to the final output file.
29
MapReduce workflow
(7) When all Map and Reduce tasks are complete, the control returns to the user program. The program may use the output as input to another MapReduce task.
All phases are distributed with many tasks doing the work.
30
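The whole workflow can be mimicked on a single machine. The toy sketch below runs the map function over the input, partitions the intermediate pairs into R regions with a deterministic partitioner (as the real system does when writing map output to local disk), and then groups and reduces each region. It only illustrates the data flow, not Hadoop's actual implementation; function names and R are illustrative.

from collections import defaultdict
import zlib

def map_fn(_, line):
    # Emit (word, 1) for every word in the line
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_job(lines, R=3):
    # Map phase: intermediate pairs are partitioned into R regions
    # and grouped by key within each region
    regions = [defaultdict(list) for _ in range(R)]
    for line in lines:
        for k, v in map_fn(None, line):
            r = zlib.crc32(k.encode()) % R   # deterministic partitioner
            regions[r][k].append(v)
    # Reduce phase: each "reduce task" processes one region, key by key
    output = []
    for region in regions:
        for k in sorted(region):
            output.extend(reduce_fn(k, region[k]))
    return output

print(run_job(["the crew of the space shuttle", "the space shuttle"]))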
MapReduce dealing with failures
What if the master fails?
Master failure: the MapReduce job is aborted (needs to be restarted)
31
MapReduce dealing with failures
Master periodically pings the workers for failures
What if map worker fails?
Map worker failure
Map tasks completed or in-progress at the worker are reset to idle
Reduce workers are notified when task is rescheduled on another worker
32
MapReduce dealing with failures
Master periodically pings the workers for failures
What if a reduce worker fails?
Reduce worker failure
Only in-progress tasks are reset to idle
Reduce task is rescheduled to start later
33
How many Map and Reduce jobs?
M map tasks, R reduce tasks
Rule of thumb:
M >> number of nodes in the cluster
Improves dynamic load balancing
Speeds up recovery from worker failures
Can pipeline shuffling with map execution
34
How many Map and Reduce jobs?
M map tasks, R reduce tasks
Rule of thumb:
M >> number of nodes in the cluster
Usually R << M
R is set as a small multiple of the number of nodes, because the output is spread across R files
35
Refinement: Backup Tasks
Problem: there are slow workers ("stragglers")
Why?
Other jobs running on the machine
Bad disks (may reduce read performance up to 30x)
Bugs
Stragglers significantly lengthen the job completion time
Solution
When the MapReduce operation is close to completion, spawn backup copies of the in-progress tasks
Whichever copy finishes first "wins"
Effect
Dramatically shortens job completion time
36
Refinement: Combiners
Often a Map task will produce many pairs of the form (k,v1), (k,v2), ... for the same key k
E.g., popular words in the word count example
Can save network time by pre-aggregating values in the mapper
37
Refinement: Combiners
Combiner (i.e., a mini-reducer)
combine(k, list(v1)) → (k, v2)
The combiner is usually the same as the reduce function
But
It can only be applied when the reduce function is commutative and associative
e.g., max(5, 4, 1, 2) = max(max(5, 1), max(4, 2))
Its input and output must be of the same type as the mapper's output
38
Refinement: Combiners
Back to our word counting example:
The combiner combines the values of all keys of a single mapper (single machine):
Much less data needs to be copied and shuffled!
39
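A hedged sketch of how a combiner is expressed in mrjob (the Python framework used later in this lecture), following the word-frequency example in the mrjob documentation; the class name is illustrative.

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the line
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        # Runs on each mapper's machine: pre-aggregates local counts,
        # so much less data is copied and shuffled over the network
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Final aggregation across all mappers
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordFreqCount.run()

Summing is commutative and associative, so applying it partially in the combiner and again in the reducer gives the same result as reducing everything at once.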
Refinement: Partitioner
Want to control how keys get partitioned
Inputs to map tasks are created by contiguous splits of the input file
For reduce, we need to ensure that records with the same intermediate key end up at the same worker
Partitioner
divides data (key,value) pairs of Map tasks
ensures that the same key (that can be output by multiple mappers) goes to the same reducer
40
Refinement: Partitioner
Partitioner
By default:
hash(key) mod R
it aims to create “well-balanced” partitions
Sometimes useful to override the hash function:
E.g., hash(hostname(URL)) mod R to put all URLs from a host in the same output file
41
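A partitioner of this kind can be sketched as a plain Python function. R, the crc32 hash, and the URLs are illustrative choices; note that a partitioner must use a hash that is deterministic across machines (Python's built-in hash() is salted per process, so crc32 is used instead).

import zlib
from urllib.parse import urlparse

R = 4  # number of reduce tasks (illustrative)

def default_partition(key):
    # Default behaviour: hash(key) mod R
    return zlib.crc32(str(key).encode()) % R

def url_partition(url):
    # Override: hash(hostname(URL)) mod R, so all URLs from the same
    # host go to the same reducer and hence the same output file
    return zlib.crc32(urlparse(url).netloc.encode()) % R

print(url_partition("http://example.com/a"),
      url_partition("http://example.com/b"))  # same region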
MapReduce with python
mrjob
lets you write MapReduce jobs in Python 2.6+/3.3+ and run them on several platforms (locally, on a Hadoop cluster, on Amazon Elastic MapReduce etc.)
https://pythonhosted.org/mrjob/index.html
Reference manual in PDF:
https://media.readthedocs.org/pdf/mrjob/latest/mrjob.pdf
Lab with mrjob
42
MapReduce with python
mr_word_count.py
Mapper: gets an empty key and a line as the value; returns three key-value pairs
Reducer: for each key, returns the sum of its values
43
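The slide's code is not reproduced here; the sketch below follows the word-count example from the mrjob documentation and matches the annotations above (the mapper receives an empty key and a line, emits three key-value pairs, and the reducer sums the values per key).

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # key is empty (None), value is one line of the input file
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        # for each key, return the sum of its values
        yield key, sum(values)

if __name__ == '__main__':
    MRWordCount.run()

Run locally with: python mr_word_count.py input.txt (add -r hadoop or -r emr to run on a Hadoop cluster or on Amazon Elastic MapReduce instead).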
Please complete the intermediate feedback form
44