Big Data Computing
Overview
MapReduce and Hadoop
1
Self Introduction
Research interests
Algorithms on big data;
databases;
data streams;
sampling;
data structures, database indexing;
parallel and distributed algorithms;
external memory algorithms;
data mining;
computational geometry.
2
First slide of my PhD defense in 2006
What is Big Data?
6
Big Data Definition
No single standard definition…
“Big Data” is data whose scale, complexity, and speed require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
7
Characteristics of Big Data:
1-Scale (Volume)
Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 zettabytes
Data volume is increasing exponentially
8
Exponential increase in collected/generated data
The Model Has Changed…
The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
9
Who’s Generating Big Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and networks
(measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data
But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
10
Characteristics of Big Data:
2-Complexity (Variety)
Various formats, types, and structures
Numerical, text, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
A single application can be generating/collecting many types of data
11
To extract knowledge, all these types of data need to be linked together
The Structure Spectrum
12
Structured Data
The relational data model is the most used data model
Relation, a table with rows and columns
Every relation has a schema defining each column’s type
The programmer must statically specify the schema
13
SELECT AuthorName FROM Authors, Books WHERE Authors.AuthorID=Books.AuthorID AND Books.Date>1980
14
Relational Database and SQL
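The SELECT query above assumes a schema along these lines (a minimal sketch; the column types are hypothetical and not part of the original slides):

CREATE TABLE Authors (
  AuthorID   INT PRIMARY KEY,
  AuthorName VARCHAR(100)
);

CREATE TABLE Books (
  BookID   INT PRIMARY KEY,
  AuthorID INT REFERENCES Authors(AuthorID),
  Date     INT  -- publication year
);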
How much structured data?
Conventional Wisdom:
» Only 20% of data is structured
Decreasing due to:
» Consumer applications
» Enterprise search
» Media applications
15
Semi-structured Data: JSON
16
A JSON object consists of a collection of name: value pairs, separated by commas. Each value can be:
a string
a number
a boolean
null
an array: an ordered list of zero or more values, each of which may be of any type. Arrays use square bracket notation with elements being comma-separated.
a JSON object
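For example, a small JSON object illustrating all of these value types (a made-up record, not from the original slide):

{
  "course": "Big Data Computing",
  "year": 2015,
  "hasLab": true,
  "instructor": null,
  "topics": ["MapReduce", "Hadoop", "GFS"],
  "grading": { "exam": 60, "project": 40 }
}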
Characteristics of Big Data:
3-Speed (Velocity)
Data is being generated fast and needs to be processed fast
Static data → streaming data
Online Data Analytics
Late decisions → missing opportunities
Examples
E-Promotions: based on your current location, purchase history, and preferences, send promotions right now for the store next to you
Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires an immediate reaction
17
Some Make it 4V’s
18
My Take
None of the V’s is really new
Each had been studied before the term “big data” was coined
“Big Data” is more of a marketing term
Draws attention
Draws funding
Draws students
Motivates new technologies that provide integrated support for all the V’s.
19
20
21
Going Parallel/Distributed is the Only Way to Scale
22
The Frustration of Parallel Programming
Race conditions
Intended result: add 2 to V
But, what if 1A is executed between 1B and 3B?
23
Thread A Thread B
1A: Read variable V 1B: Read variable V
2A: Add 1 to variable V 2B: Add 1 to variable V
3A: Write back to variable V 3B: Write back to variable V
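A minimal, runnable Java sketch of this lost-update race (class and variable names are illustrative):

// Two threads each add 1 to a shared variable V many times without synchronization.
// The read / add / write-back sequence is not atomic, so updates can be lost.
public class RaceDemo {
    static int v = 0;  // shared variable V

    public static void main(String[] args) throws InterruptedException {
        Runnable addOne = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                v = v + 1;  // read V, add 1, write back: three separate steps
            }
        };
        Thread a = new Thread(addOne);
        Thread b = new Thread(addOne);
        a.start(); b.start();
        a.join(); b.join();
        // Expected 2,000,000, but interleavings like the one above usually lose updates
        System.out.println("V = " + v);
    }
}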
The Frustration of Parallel Programming
Use locks
However, there can be deadlocks…
24
Thread A Thread B
1A: Lock variable V 1B: Lock variable V
2A: Read variable V 2B: Read variable V
3A: Add 1 to variable V 3B: Add 1 to variable V
4A: Write back to variable V 4B: Write back to variable V
5A: Unlock variable V 5B: Unlock variable V
The Frustration of Parallel Programming
Hard to debug: Race conditions and deadlocks are nondeterministic
Most programming languages are low-level
The programmer needs to manage shared memory and/or communication
OpenMP is a good step forward, but still difficult for most programmers
Programs written for multi-cores do not easily carry over to clusters
25
26
How do you program this thing?
Cloud Computing
Computing as a utility: deliver computing resources over the Internet, as a metered service
Dynamic provisioning: pay-as-you-go
Scalability: “infinite” capacity
Elasticity: scale up or down
27
The datacenter is a computer
28
How it all got started:
Google MapReduce (2004)
19,485 citations and counting …
29
MapReduce
30
(key, value) pairs are used as the format for both data and intermediate results
Key-Value Pairs
Mappers and Reducers are user code (provided as functions)
They just need to obey the key-value pair interface
Mappers:
Consume (k, v) pairs
Produce (k', v') pairs
Reducers:
Consume (k', [v']) pairs
Produce (k'', v'') pairs
Shuffling and Sorting:
Hidden phase between mappers and reducers
Groups all intermediate pairs with the same key and passes them to a reducer as (k', [v'])
31
Processing Granularity
Mappers
Run on a record-by-record basis
Your code processes that record and may produce
Zero, one, or many outputs
Reducers
Run on a group of records (having the same key)
Your code processes that group and may produce
Zero, one, or many outputs
32
Example 1: Word Count
33
[Diagram: input split among Map tasks, whose output is shuffled to Reduce tasks]
Job: Count the occurrences of each word in a data set
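A tiny worked example of the data flow, using a made-up two-line input:

Input: "big data", "big ideas"
Map output: (big, 1), (data, 1), (big, 1), (ideas, 1)
After shuffle & sort: (big, [1, 1]), (data, [1]), (ideas, [1])
Reduce output: (big, 2), (data, 1), (ideas, 1)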
What it looks like in Java
Map function
Reduce function
Implemented by extending Hadoop’s Mapper class
Implemented by extending Hadoop’s Reducer class
Job configuration
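Since the code screenshots do not survive in this text version, here is a sketch along the lines of the standard Hadoop WordCount example (org.apache.hadoop.mapreduce API; the exact code on the slides may differ):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map function: emit (word, 1) for every word in the input record
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce function: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Job configuration: wire up the mapper, reducer, and input/output paths
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}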
Example 2: Inverted Index
Search engines use an inverted index to quickly find webpages containing a given keyword
MapReduce program for creating an inverted index:
Map
For each (url, doc) pair
Emit (keyword, url) for each keyword in doc
Reduce
For each keyword, output (keyword, list of urls)
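A tiny worked example with made-up URLs and documents:

Input: (url1, "big data"), (url2, "data streams")
Map output: (big, url1), (data, url1), (data, url2), (streams, url2)
Reduce output: (big, [url1]), (data, [url1, url2]), (streams, [url2])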
35
SELECT AuthorName FROM Authors, Books WHERE Authors.AuthorID=Books.AuthorID AND Books.Date>1980
36
Exercise: How to process this SQL query in MapReduce?
Answer
For each record in the ‘Authors’ table:
Map: Emit (AuthorID, AuthorName)
For each record in the ‘Books’ table:
Map: Emit (AuthorID, Date)
Reduce:
For each AuthorID, if Date>1980, output AuthorName
37
Answer (Optimized)
For each record in the ‘Authors’ table:
Map: Emit (AuthorID, AuthorName)
For each record in the ‘Books’ table:
Map: If Date>1980, emit (AuthorID, Date)
Reduce:
For each AuthorID with at least one Date value, output AuthorName
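A sketch of the optimized data flow with made-up records (in practice the map outputs would be tagged so the reducer can tell author names apart from dates):

Authors (1, Ann), (2, Bob) → map → (1, name:Ann), (2, name:Bob)
Books (1, 1975), (2, 1995) → map → (2, date:1995)   [the 1975 book is filtered out]
Reduce for key 1: sees only name:Ann, no date → no output
Reduce for key 2: sees name:Bob and date:1995 → output Bob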
38
Where do we store data?
Input/output data
Intermediate data
39
Google File System (GFS)
Why a distributed file system?
Why not just one big disk?
40
I/O is a bottleneck
Suppose we have crawled 100 Tb (terabits) worth of data
A state-of-the-art 7,200 RPM SATA drive has a 3 Gbps I/O interface
Scanning the entire data takes 100 Tb / 3 Gbps ≈ 33,000 seconds ≈ 9.3 hours!
41
I/O throughput lags behind
42
Source: R1Soft, http://wiki.r1soft.com/pages/viewpage.action?pageId=3016608
I/O throughput lags behind
43
Source: R1Soft, http://wiki.r1soft.com/pages/viewpage.action?pageId=3016608
Other issues
Building a high-end supercomputer is much, much more costly
Storing all data in one place adds the risk of hardware failures
Never put all your eggs in one basket!
44
Google datacenter
Lots of cheap, commodity PCs, each with disk and CPU
100s to 1000s of PCs per cluster in the early days
High aggregate storage capacity
No costly “big disk”
Spread processing across many machines
High I/O bandwidth, proportional to the # of machines
Parallel data processing
45
A cool idea! But wait…
Stuff breaks
If you have one server, it may stay up 3 years (1,000 days)
If you have 10,000 servers, expect to lose about 10 a day (10,000 servers / 1,000 days each)
46
GFS: The Google File System
47
A highly reliable storage system built atop highly unreliable hardware
Target environment
Thousands of computers
Distributed
Computers have their own disks, and the file system spans those disks
Failures are the norm
Disks, networks, processors, power supplies, application software, OS software, human errors
48
Target environment
Files are huge, but not many
>100 MB each, usually multi-gigabyte
Read/write characteristics (write-once, read-many)
Files are mutated by appending
Once written, files are typically only read
Large streaming reads and small random reads are typical
49
Target environment
I/O bandwidth is more important than latency
Suitable for batch processing and log analytics
50
GFS design decisions
Files stored as/divided into chunks
Chunk size: 64 MB
Reliability through replication
Each chunk replicated across 3+ chunkservers
Single master to coordinate access, keep metadata
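For example, with a made-up file size: a 200 MB file is split into four chunks (64 + 64 + 64 + 8 MB); with 3-way replication, 12 chunk replicas are stored on chunkservers, while the single master keeps only the file-to-chunk mapping and the chunk locations.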
51
Hadoop
Hadoop is an open-source implementation of Google’s MapReduce and GFS
Clean and simple programming abstraction
Users only provide two functions “map” and “reduce”
Automatic parallelization & distribution
Hidden from the end-user
Fault tolerance and automatic recovery
Nodes/tasks will fail and will recover automatically
52
Brief history
Initially developed by Doug Cutting as a file system for Apache Nutch, a web search engine
Early name: Nutch Distributed File System (NDFS)
Moved out of Nutch in 2006 (when Doug Cutting joined Yahoo!) to become an independent project called Hadoop
53
The origin of the name
“Hadoop” is a made-up name, as explained by Doug Cutting:
54
“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such.”
Hadoop: How it Works
55
Hadoop Architecture
Hadoop framework consists of two main layers
Distributed file system (HDFS)
Execution engine (MapReduce)
56
Master node (single node)
Many slave nodes
Hadoop Distributed File System (HDFS)
57
One namenode
Maintains metadata info about files:
Maps a filename to a set of blocks
Maps a block to the DataNodes where it resides
Replication engine for blocks
Many datanodes (1000s)
– Store the actual data
– Files are divided into blocks
– Each block is replicated r times
(Default = 3)
– Communicates with NameNode through periodic “heartbeat” (once per 3 secs)
[Diagram: File F divided into five blocks of 64 MB each]
58
[Diagram: Data flow overview — (1) the Client sends a filename to the NameNode (Master); (2) the NameNode returns block IDs and the DataNodes holding them; (3) the Client reads/writes data directly from/to those DataNodes. A Secondary NameNode (same ClusterId) is also shown.]
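From a user's point of view, HDFS behaves like an ordinary file system accessed through shell commands, for example (hypothetical paths):

hdfs dfs -mkdir -p /user/student
hdfs dfs -put crawl.txt /user/student/    # the file is split into blocks and replicated across DataNodes
hdfs dfs -ls /user/student
hdfs dfs -cat /user/student/crawl.txt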
Heartbeats
DataNodes send heartbeats to the NameNode
Once every 3 secs
NameNode uses heartbeats to detect DataNode failure
No response in 10 mins is considered a failure
59
Replication engine
Upon detecting a DataNode failure
Choose new DataNodes for replicas
Balance disk usage
Balance communication traffic to DataNodes
60
Data corrections
Checksums to validate data (CRC32)
File creation
Client computes a checksum per 512 bytes
DataNode stores the checksum
File access
Client retrieves data and checksum from DataNodes
If validation fails, try other replicas
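A minimal Java sketch of the per-chunk checksum idea using the built-in CRC32 (buffer handling is simplified; HDFS’s actual checksum code differs):

import java.util.zip.CRC32;

public class ChunkChecksums {
    // Compute a CRC32 checksum for every 512-byte chunk of a buffer
    static long[] checksums(byte[] data) {
        final int CHUNK = 512;
        int n = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
            int off = i * CHUNK;
            int len = Math.min(CHUNK, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);   // checksum this chunk only
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300];     // example buffer: 512 + 512 + 276 bytes → 3 checksums
        for (long s : checksums(data)) System.out.println(Long.toHexString(s));
    }
}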
61
NameNode failure
A single point of failure
Transaction log stored in multiple directories
Directory on local file system
A directory on a remote file system (NFS)
Add a secondary NameNode
62
Hadoop Map-Reduce
(Example: Color Count)
63
[Diagram: Color Count data flow — input blocks on HDFS → Map produces (k, v), e.g. (color, 1) → shuffle & sorting based on k → Reduce consumes (k, [v]), e.g. (color, [1,1,1,1,1,1,…]), and produces (k', v'), e.g. (color, 100)]
Users only provide the “Map” and “Reduce” functions
Hadoop MapReduce
Job Tracker is the master node (runs with the namenode)
Receives the user’s job
Decides on how many tasks will run (number of mappers)
Decides on where to run each mapper (locality matters)
64
This file has 5 blocks → run 5 map tasks
Where to run the task reading block “1”
Try to run it on Node 1 or Node 3
[Diagram: Node 1, Node 2, and Node 3; block “1” has replicas on Node 1 and Node 3]
Choosing the “closest” block
65
Hadoop MapReduce
Task Tracker is the slave node (runs on each datanode)
Receives the task from Job Tracker
Runs the task until completion (either map or reduce task)
Always in communication with the Job Tracker reporting progress
66
In this example, 1 map-reduce job consists of 4 map tasks and 3 reduce tasks
On worker failure
Detect failure via periodic heartbeats
Workers send heartbeat messages (ping) periodically to the master node
Re-execute completed and in-progress map tasks (their output lives on the failed worker’s local disk and is lost)
Re-execute in-progress reduce tasks (output of completed reduce tasks is already stored in the distributed file system)
Task completion committed through master
67
Master failure can be handled by coordination services like ZooKeeper
68
The MapReduce Stack
Google → Everyone else (open source)
Google File System (GFS) → Hadoop Distributed File System (HDFS)
MapReduce → Hadoop MapReduce
BigTable → HBase
Dremel → Drill
Pregel → Giraph
69
MapReduce:
A major step backwards
MapReduce may be a good idea for writing certain types of computations
But
A giant step backward in the programming paradigm for large-scale data intensive applications
A sub-optimal implementation, in that it uses brute force instead of indexing
Missing most of the features that are routinely included in current DBMS
Not novel at all
70
Valiant’s BSP Model (1990)
Bulk Synchronous Parallel
71
72
73
About the Course
Aims at the most up-to-date technologies
Combines hands-on experience and theory
The exam will have a paper-based component and a programming component.
You are encouraged to bring your laptop to the lectures and follow along
Python (prerequisite)
SQL (will do a quick review)
74
[Diagram residue: MapReduce data-flow figures showing Map tasks feeding Parse-hash (partitioning) steps and then Reduce tasks]