Big Data Computing
Overview
MapReduce and Hadoop
1
Self Introduction
Research interests
Algorithms on big data;
databases;
data streams;
sampling;
data structures, database indexing;
parallel and distributed algorithms;
external memory algorithms;
data mining;
computational geometry.
2
First slide of my PhD defense in 2006
What is Big Data?
6
Big Data Definition
No single standard definition…
“Big Data” is data whose scale, complexity, and speed require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
7
Characteristics of Big Data:
1-Scale (Volume)
Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 zettabytes
Data volume is increasing exponentially
8
Exponential increase in collected/generated data
The Model Has Changed…
The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
9
Who’s Generating Big Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and networks
(measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data
But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
10
Characteristics of Big Data:
2-Complexity (Variety)
Various formats, types, and structures
Numerical, text, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
A single application can be generating/collecting many types of data
11
To extract knowledge, all these types of data need to be linked together
The Structure Spectrum
12
Structured Data
The relational data model is the most used data model
Relation, a table with rows and columns
Every relation has a schema defining each column’s type
The programmer must statically specify the schema
13
SELECT AuthorName FROM Authors, Books WHERE Authors.AuthorID=Books.AuthorID AND Books.Date>1980
14
Relational Database and SQL
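The SELECT query above assumes a schema along these lines (a minimal sketch; the column types are hypothetical and not part of the original slides):

CREATE TABLE Authors (
  AuthorID   INT PRIMARY KEY,
  AuthorName VARCHAR(100)
);

CREATE TABLE Books (
  BookID   INT PRIMARY KEY,
  AuthorID INT REFERENCES Authors(AuthorID),
  Date     INT  -- publication year
);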
How much structured data?
Conventional Wisdom:
» Only 20% of data is structured
Decreasing due to:
» Consumer applications
» Enterprise search
» Media applications
15
Semi-structured Data: JSON
16
A JSON object consists of a collection of name: value pairs, separated by commas. Each value can be:
a string
a number
a boolean
null
an array: an ordered list of zero or more values, each of which may be of any type. Arrays use square bracket notation with elements being comma-separated.
a JSON object
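For example, a small JSON object illustrating all of these value types (a made-up record, not from the original slide):

{
  "course": "Big Data Computing",
  "year": 2015,
  "hasLab": true,
  "instructor": null,
  "topics": ["MapReduce", "Hadoop", "GFS"],
  "grading": { "exam": 60, "project": 40 }
}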
Characteristics of Big Data:
3-Speed (Velocity)
Data is being generated fast and needs to be processed fast
Static data → streaming data
Online Data Analytics
Late decisions → missing opportunities
Examples
E-Promotions: based on your current location, purchase history, and preferences, send promotions right now for the store next to you
Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires an immediate reaction
17
Some Make it 4V’s
18
My Take
None of the V’s is really new
Each had been studied before the term “big data” was coined
“Big Data” is more of a marketing term
Draws attention
Draws funding
Draws students
Motivates new technologies that provide integrated support for all the V’s.
19
20
21
Going Parallel/Distributed is the Only Way to Scale
22
The Frustration of Parallel Programming
Race conditions
Intended result: add 2 to V
But, what if 1A is executed between 1B and 3B?
23
Thread A Thread B
1A: Read variable V 1B: Read variable V
2A: Add 1 to variable V 2B: Add 1 to variable V
3A: Write back to variable V 3B: Write back to variable V
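A minimal, runnable Java sketch of this lost-update race (class and variable names are illustrative):

// Two threads each add 1 to a shared variable V many times without synchronization.
// The read / add / write-back sequence is not atomic, so updates can be lost.
public class RaceDemo {
    static int v = 0;  // shared variable V

    public static void main(String[] args) throws InterruptedException {
        Runnable addOne = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                v = v + 1;  // read V, add 1, write back: three separate steps
            }
        };
        Thread a = new Thread(addOne);
        Thread b = new Thread(addOne);
        a.start(); b.start();
        a.join(); b.join();
        // Expected 2,000,000, but interleavings like the one above usually lose updates
        System.out.println("V = " + v);
    }
}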
The Frustration of Parallel Programming
Use locks
However, there can be deadlocks…
24
Thread A Thread B
1A: Lock variable V 1B: Lock variable V
2A: Read variable V 2B: Read variable V
3A: Add 1 to variable V 3B: Add 1 to variable V
4A: Write back to variable V 4B: Write back to variable V
5A: Unlock variable V 5B: Unlock variable V
The Frustration of Parallel Programming
Hard to debug: Race conditions and deadlocks are nondeterministic
Most programming languages are low-level
The programmer needs to manage shared memory and/or communication
OpenMP is a good step forward, but still difficult for most programmers
Programs written for multi-cores do not easily carry over to clusters
25
26
How do you program this thing?
Cloud Computing
Computing as a utility: deliver computing resources over the Internet, as a metered service
Dynamic provisioning: pay-as-you-go
Scalability: “infinite” capacity
Elasticity: scale up or down
27
The datacenter is a computer
28
How it all got started:
Google MapReduce (2004)
19,485 citations and counting …
29
MapReduce
30
(key, value) pairs are used as the format for both data and intermediate results
Key-Value Pairs
Mappers and Reducers are user code (provided as functions)
They just need to obey the key-value pair interface
Mappers:
Consume (k, v) pairs
Produce (k', v') pairs
Reducers:
Consume (k', [v']) pairs
Produce (k'', v'') pairs
Shuffling and Sorting:
Hidden phase between mappers and reducers
Groups all intermediate pairs with the same key and passes them to a reducer as (k', [v'])
31
Processing Granularity
Mappers
Run on a record-by-record basis
Your code processes that record and may produce
Zero, one, or many outputs
Reducers
Run on a group of records (having the same key)
Your code processes that group and may produce
Zero, one, or many outputs
32
Example 1: Word Count
33
[Diagram: input split among Map tasks, whose output is shuffled to Reduce tasks]
Job: Count the occurrences of each word in a data set
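A tiny worked example of the data flow, using a made-up two-line input:

Input: "big data", "big ideas"
Map output: (big, 1), (data, 1), (big, 1), (ideas, 1)
After shuffle & sort: (big, [1, 1]), (data, [1]), (ideas, [1])
Reduce output: (big, 2), (data, 1), (ideas, 1)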
What it looks like in Java
Map function
Reduce function
Implemented by extending Hadoop’s Mapper class
Implemented by extending Hadoop’s Reducer class
Job configuration
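Since the code screenshots do not survive in this text version, here is a sketch along the lines of the standard Hadoop WordCount example (org.apache.hadoop.mapreduce API; the exact code on the slides may differ):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map function: emit (word, 1) for every word in the input record
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce function: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Job configuration: wire up the mapper, reducer, and input/output paths
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}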
Example 2: Inverted Index
Search engines use an inverted index to quickly find webpages containing a given keyword
MapReduce program for creating an inverted index:
Map
For each (url, doc) pair
Emit (keyword, url) for each keyword in doc
Reduce
For each keyword, output (keyword, list of urls)
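A tiny worked example with made-up URLs and documents:

Input: (url1, "big data"), (url2, "data streams")
Map output: (big, url1), (data, url1), (data, url2), (streams, url2)
Reduce output: (big, [url1]), (data, [url1, url2]), (streams, [url2])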
35
SELECT AuthorName FROM Authors, Books WHERE Authors.AuthorID=Books.AuthorID AND Books.Date>1980
36
Exercise: How to process this SQL query in MapReduce?
Answer
For each record in the ‘Authors’ table:
Map: Emit (AuthorID, AuthorName)
For each record in the ‘Books’ table:
Map: Emit (AuthorID, Date)
Reduce:
For each AuthorID, if Date>1980, output AuthorName
37
Answer (Optimized)
For each record in the ‘Authors’ table:
Map: Emit (AuthorID, AuthorName)
For each record in the ‘Books’ table:
Map: If Date>1980, emit (AuthorID, Date)
Reduce:
For each AuthorID with at least one Date value, output AuthorName
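A sketch of the optimized data flow with made-up records (in practice the map outputs would be tagged so the reducer can tell author names apart from dates):

Authors (1, Ann), (2, Bob) → map → (1, name:Ann), (2, name:Bob)
Books (1, 1975), (2, 1995) → map → (2, date:1995)   [the 1975 book is filtered out]
Reduce for key 1: sees only name:Ann, no date → no output
Reduce for key 2: sees name:Bob and date:1995 → output Bob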
38
Where do we store data?
Input/output data
Intermediate data
39
Google File System (GFS)
Why a distributed file system?
Why not just one big disk?
40
I/O is a bottleneck
Suppose we have crawled 100 Tb (terabits) worth of data
A state-of-the-art 7,200 RPM SATA drive has a 3 Gbps I/O interface
Scanning the entire data takes 100 Tb / 3 Gbps ≈ 33,000 seconds ≈ 9.3 hours!
41
I/O throughput lags behind
42
Source: R1Soft, http://wiki.r1soft.com/pages/viewpage.action?pageId=3016608
I/O throughput lags behind
43
Source: R1Soft, http://wiki.r1soft.com/pages/viewpage.action?pageId=3016608
Other issues
Building a high-end supercomputer is much, much more costly
Storing all data in one place adds the risk of hardware failures
Never put all your eggs in one basket!
44
Google datacenter
Lots of cheap, commodity PCs, each with disk and CPU
100s to 1000s of PCs per cluster in the early days
High aggregate storage capacity
No costly “big disk”
Spread processing across many machines
High I/O bandwidth, proportional to the # of machines
Parallel data processing
45
A cool idea! But wait…
Stuff breaks
If you have one server, it may stay up 3 years (1,000 days)
If you have 10,000 servers, expect to lose about 10 a day (10,000 servers / 1,000 days each)
46
GFS: The Google File System
47
A highly reliable storage system built atop highly unreliable hardware
Target environment
Thousands of computers
Distributed
Computers have their own disks, and the file system spans those disks
Failures are the norm
Disks, networks, processors, power supplies, application software, OS software, human errors
48
Target environment
Files are huge, but not many
>100 MB each, usually multi-gigabyte
Read/write characteristics (write-once, read-many)
Files are mutated by appending
Once written, files are typically only read
Large streaming reads and small random reads are typical
49
Target environment
I/O bandwidth is more important than latency
Suitable for batch processing and log analytics
50
GFS design decisions
Files stored as/divided into chunks
Chunk size: 64 MB
Reliability through replication
Each chunk replicated across 3+ chunkservers
Single master to coordinate access, keep metadata
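For example, with a made-up file size: a 200 MB file is split into four chunks (64 + 64 + 64 + 8 MB); with 3-way replication, 12 chunk replicas are stored on chunkservers, while the single master keeps only the file-to-chunk mapping and the chunk locations.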
51
Hadoop
Hadoop is an open-source implementation of Google’s MapReduce and GFS
Clean and simple programming abstraction
Users only provide two functions “map” and “reduce”
Automatic parallelization & distribution
Hidden from the end-user
Fault tolerance and automatic recovery
Nodes/tasks will fail and will recover automatically
52
Brief history
Initially developed by Doug Cutting as a file system for Apache Nutch, a web search engine
Early name: Nutch Distributed File System (NDFS)
Moved out of Nutch in 2006 (when Doug Cutting joined Yahoo!) to become an independent project called Hadoop
53
The origin of the name
“Hadoop” is a made-up name, as explained by Doug Cutting:
54
“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such.”
Hadoop: How it Works
55
Hadoop Architecture
Hadoop framework consists of two main layers
Distributed file system (HDFS)
Execution engine (MapReduce)
56
Master node (single node)
Many slave nodes
Hadoop Distributed File System (HDFS)
57
One namenode
Maintains metadata info about files:
Maps a filename to a set of blocks
Maps a block to the DataNodes where it resides
Replication engine for blocks
Many datanodes (1000s)
– Store the actual data
– Files are divided into blocks
– Each block is replicated r times
(Default = 3)
– Communicates with NameNode through periodic “heartbeat” (once per 3 secs)
[Diagram: File F divided into five blocks of 64 MB each]
58
[Diagram: Data flow overview — (1) the Client sends a filename to the NameNode (Master); (2) the NameNode returns block IDs and the DataNodes holding them; (3) the Client reads/writes data directly from/to those DataNodes. A Secondary NameNode (same ClusterId) is also shown.]
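From a user's point of view, HDFS behaves like an ordinary file system accessed through shell commands, for example (hypothetical paths):

hdfs dfs -mkdir -p /user/student
hdfs dfs -put crawl.txt /user/student/    # the file is split into blocks and replicated across DataNodes
hdfs dfs -ls /user/student
hdfs dfs -cat /user/student/crawl.txt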
Heartbeats
DataNodes send heartbeats to the NameNode
Once every 3 secs
NameNode uses heartbeats to detect DataNode failure
No response in 10 mins is considered a failure
59
Replication engine
Upon detecting a DataNode failure
Choose new DataNodes for replicas
Balance disk usage
Balance communication traffic to DataNodes
60
Data corrections
Checksums to validate data (CRC32)
File creation
Client computes a checksum per 512 bytes
DataNode stores the checksum
File access
Client retrieves data and checksum from DataNodes
If validation fails, try other replicas
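A minimal Java sketch of the per-chunk checksum idea using the built-in CRC32 (buffer handling is simplified; HDFS’s actual checksum code differs):

import java.util.zip.CRC32;

public class ChunkChecksums {
    // Compute a CRC32 checksum for every 512-byte chunk of a buffer
    static long[] checksums(byte[] data) {
        final int CHUNK = 512;
        int n = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
            int off = i * CHUNK;
            int len = Math.min(CHUNK, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);   // checksum this chunk only
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300];     // example buffer: 512 + 512 + 276 bytes → 3 checksums
        for (long s : checksums(data)) System.out.println(Long.toHexString(s));
    }
}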
61
NameNode failure
A single point of failure
Transaction log stored in multiple directories
Directory on local file system
A directory on a remote file system (NFS)
Add a secondary NameNode
62
Hadoop Map-Reduce
(Example: Color Count)
63
[Diagram: Color Count data flow — input blocks on HDFS → Map produces (k, v), e.g. (color, 1) → shuffle & sorting based on k → Reduce consumes (k, [v]), e.g. (color, [1,1,1,1,1,1,…]), and produces (k', v'), e.g. (color, 100)]
Users only provide the “Map” and “Reduce” functions
Hadoop MapReduce
Job Tracker is the master node (runs with the namenode)
Receives the user’s job
Decides on how many tasks will run (number of mappers)
Decides on where to run each mapper (locality matters)
64
This file has 5 blocks → run 5 map tasks
Where to run the task reading block “1”
Try to run it on Node 1 or Node 3
[Diagram: Node 1, Node 2, and Node 3; block “1” has replicas on Node 1 and Node 3]
Choosing the “closest” block
65
Hadoop MapReduce
Task Tracker is the slave node (runs on each datanode)
Receives the task from Job Tracker
Runs the task until completion (either map or reduce task)
Always in communication with the Job Tracker reporting progress
66
In this example, 1 map-reduce job consists of 4 map tasks and 3 reduce tasks
On worker failure
Detect failure via periodic heartbeats
Workers send heartbeat messages (ping) periodically to the master node
Re-execute completed and in-progress map tasks (their output lives on the failed worker’s local disk and is lost)
Re-execute in-progress reduce tasks (output of completed reduce tasks is already stored in the distributed file system)
Task completion committed through master
67
Master failure can be handled by coordination services like ZooKeeper
68
The MapReduce Stack
Google → Everyone else (open source)
Google File System (GFS) → Hadoop Distributed File System (HDFS)
MapReduce → Hadoop MapReduce
BigTable → HBase
Dremel → Drill
Pregel → Giraph
69
MapReduce:
A major step backwards
MapReduce may be a good idea for writing certain types of computations
But
A giant step backward in the programming paradigm for large-scale data intensive applications
A sub-optimal implementation, in that it uses brute force instead of indexing
Missing most of the features that are routinely included in current DBMS
Not novel at all
70
Valiant’s BSP Model (1990)
Bulk Synchronous Parallel
71
72
73
About the Course
Aims at the most up-to-date technologies
Combines hands-on experience and theory
The exam will have a paper-based component and a programming component.
You are encouraged to bring your laptop to the lectures and follow along
Python (prerequisite)
SQL (will do a quick review)
74
[Diagram residue: MapReduce data-flow figures showing Map tasks feeding Parse-hash (partitioning) steps and then Reduce tasks]