CSE 3244: Data Management in the Cloud

Data Processing in the Wild: Hadoop and Spark
This slide set was first created by Prof. Xiaoyi Lu for The Ohio State course CSE3244
Thank you Professor Lu (UC Merced)


Recall: WordCount Execution in MapReduce
• The overall execution process of WordCount in MapReduce
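As a rough illustration of that execution process, the sketch below simulates the three phases of WordCount (map, shuffle, reduce) in plain Java on a single machine. It uses no Hadoop classes; the class name `WordCountPhases`, its method names, and the sample input are assumptions made for this example only.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-machine simulation of WordCount; no Hadoop classes are used.
class WordCountPhases {

    // Map phase: emit a (word, 1) pair for every token of every input line.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                pairs.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values that share a key, sorted by key.
    static Map<String, List<Integer>> shufflePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values for each key.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("deer bear river", "car car river", "deer car bear");
        // prints {bear=2, car=3, deer=2, river=2}
        System.out.println(reducePhase(shufflePhase(mapPhase(lines))));
    }
}
```

In a real cluster the map and reduce phases run as parallel tasks on different nodes and the shuffle moves data over the network; here all three are ordinary method calls.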
The Ohio State University CSE 3244 2

A Hadoop MapReduce Example – WordCount
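import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCount {
  // Mapper: emits (word, 1) for every token in an input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }
  // Reducer: sums the counts for each word. The values arrive as an Iterable;
  // a reduce(Text, Iterator, Context) method would not override Reducer.reduce.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}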

MapReduce Objectives
Writing software that processes “big” unstructured data with (1) many records, (2) large records, and (3) complicated structure is hard.
Goal: build a platform that simplifies the process for the programmer.
– First attempt: Map & Reduce functions in Java
– Desired properties: Productive, Scalable, Fault-Tolerant

Scalability Problems in MapReduce
• Scale: add blades with processor, memory, and disk
• Problem: in practice, disk is often the bottleneck (the slowest link)
• Don’t forget to make multiple copies in case of failure
[Figure labels: HDFS, HDFS, HDFS]
• In-memory? Memory is faster than network and disk

Scalability in Spark: RDD Programming
• Key idea: Resilient Distributed Datasets (RDDs)
– Immutable distributed collections of objects that can be cached in memory across cluster nodes
– Created by transforming data in stable storage using data flow operators (map, filter, groupBy, …)
– Manipulated through various parallel operators
– Automatically rebuilt on failure: fault tolerance by design
• Rebuilt if a partition is lost
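To make the "rebuilt if a partition is lost" idea concrete, here is a minimal sketch of lineage-based recovery in plain Java. It is a toy model, not the Spark API: the class `MiniRdd` and its methods `fromStorage`, `map`, `getPartition`, and `losePartition` are hypothetical names invented for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical toy model of an RDD; this is NOT the Spark API.
class MiniRdd<T> {
    private final int numPartitions;
    // Lineage: how to compute partition i (from stable storage or from a parent).
    private final Function<Integer, List<T>> compute;
    // In-memory cache of partitions that have already been computed.
    private final Map<Integer, List<T>> cache = new HashMap<>();
    int computeCalls = 0; // how often a partition had to be built or rebuilt

    private MiniRdd(int numPartitions, Function<Integer, List<T>> compute) {
        this.numPartitions = numPartitions;
        this.compute = compute;
    }

    // Create a dataset from "stable storage", modeled here as a list of partitions.
    static <T> MiniRdd<T> fromStorage(List<List<T>> storage) {
        return new MiniRdd<T>(storage.size(), i -> storage.get(i));
    }

    // Transformation: returns a new dataset that remembers its parent and function.
    <R> MiniRdd<R> map(Function<T, R> f) {
        return new MiniRdd<R>(numPartitions,
            i -> getPartition(i).stream().map(f).collect(Collectors.toList()));
    }

    // Return partition i, recomputing it from lineage if it is not cached.
    List<T> getPartition(int i) {
        List<T> partition = cache.get(i);
        if (partition == null) {
            computeCalls++;
            partition = compute.apply(i);
            cache.put(i, partition);
        }
        return partition;
    }

    // Simulate losing a cached partition, for example because a worker failed.
    void losePartition(int i) {
        cache.remove(i);
    }

    public static void main(String[] args) {
        MiniRdd<Integer> base = MiniRdd.fromStorage(
            Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4)));
        MiniRdd<Integer> squared = base.map(x -> x * x);
        System.out.println(squared.getPartition(1)); // computed, then cached
        squared.losePartition(1);                    // simulated failure
        System.out.println(squared.getPartition(1)); // rebuilt from lineage
    }
}
```

The point of the sketch is that only the lost partition is recomputed, using the recorded transformation, rather than restoring a replicated copy of the data.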

Productivity: RDD Operations
• Clean language-integrated API in Scala (also Python & Java)
• Can be used interactively from the Scala console
• Transformations (define a new RDD): sample, union, groupByKey, reduceByKey, sortByKey, join
• Actions (return a result to the driver): countByKey, saveAsTextFile, saveAsSequenceFile, …
More Information:
• https://spark.apache.org/docs/latest/programming-guide.html#transformations
• https://spark.apache.org/docs/latest/programming-guide.html#actions
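Transformations only describe a new dataset, while actions trigger the computation and hand a result back to the driver. Java's standard streams follow a similar lazy pattern (intermediate versus terminal operations), so the sketch below uses them as an analogy. It does not use Spark, and the class name `LazyPipeline` and its `run` method are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: java.util.stream stands in for RDD transformations and actions.
class LazyPipeline {

    // Returns {elements processed before the action, processed after it, result size}.
    static int[] run(List<String> words) {
        AtomicInteger processed = new AtomicInteger();

        // "Transformations": the pipeline is only described; nothing runs yet.
        Stream<String> upper = words.stream()
            .filter(w -> w.startsWith("s"))
            .map(w -> { processed.incrementAndGet(); return w.toUpperCase(); });
        int before = processed.get();

        // "Action": forces evaluation and returns a concrete result to the caller.
        int collected = upper.collect(Collectors.toList()).size();
        int after = processed.get();

        return new int[] {before, after, collected};
    }

    public static void main(String[] args) {
        int[] r = run(Arrays.asList("spark", "hadoop", "spark", "yarn"));
        System.out.println("processed before action: " + r[0]); // 0
        System.out.println("processed after action:  " + r[1]); // 2
    }
}
```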

RDD Example: Word Count in Spark!
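// Assumes `spark` is a SparkContext (textFile is a SparkContext method).
val file = spark.textFile("hdfs://…")
val counts = file.flatMap(line => line.split(" "))  // split each line into words
                 .map(word => (word, 1))            // emit (word, 1) pairs
                 .reduceByKey(_ + _)                // sum the counts per word
counts.saveAsTextFile("hdfs://…")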

Overview of Apache Hadoop Architecture
• Open-source implementation of Google MapReduce, GFS, and BigTable for Big Data Analytics
• Hadoop Common Utilities (RPC, etc.), HDFS, MapReduce, YARN
• http://hadoop.apache.org
[Figure: Hadoop 1.x and Hadoop 2.x software stacks, top to bottom]
Hadoop 1.x: MapReduce (Cluster Resource Management & Data Processing); Hadoop Distributed File System (HDFS); Hadoop Common/Core (RPC, ..)
Hadoop 2.x: MapReduce (Data Processing) and Other Models (Data Processing); YARN (Cluster Resource Management & Job Scheduling); Hadoop Distributed File System (HDFS); Hadoop Common/Core (RPC, ..)

MapReduce on Hadoop 2.x — YARN Architecture
• Resource Manager: coordinates the allocation of compute resources
• Node Manager: in charge of resource containers, monitoring resource usage, and reporting to the Resource Manager
• Application Master: in charge of the life cycle of an application, such as a MapReduce job. It negotiates with the Resource Manager for cluster resources and keeps track of task progress and status
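The division of labor among the three roles can be sketched with a toy model. None of the classes below are real YARN APIs: `ToyYarn`, `ResourceManager.register`, `ResourceManager.allocate`, and `ApplicationMaster.launch` are hypothetical names, and the first-fit allocation policy is an assumption chosen for brevity (it ignores data locality).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of the YARN roles; none of these classes are real YARN APIs.
class ToyYarn {

    // Resource Manager: knows the free container slots reported by each Node Manager.
    static class ResourceManager {
        private final Map<String, Integer> freeSlots = new LinkedHashMap<>();

        // Called on behalf of a Node Manager to report its capacity.
        void register(String node, int slots) {
            freeSlots.put(node, slots);
        }

        // Grant up to `wanted` containers (first fit); returns the node of each one.
        List<String> allocate(int wanted) {
            List<String> granted = new ArrayList<>();
            for (Map.Entry<String, Integer> e : freeSlots.entrySet()) {
                while (e.getValue() > 0 && granted.size() < wanted) {
                    e.setValue(e.getValue() - 1);
                    granted.add(e.getKey());
                }
            }
            return granted;
        }
    }

    // Application Master: negotiates containers for its tasks and tracks placement.
    static class ApplicationMaster {
        final Map<String, String> taskToNode = new LinkedHashMap<>();

        void launch(List<String> tasks, ResourceManager rm) {
            List<String> containers = rm.allocate(tasks.size());
            // Tasks beyond the granted containers stay unscheduled in this sketch.
            for (int i = 0; i < containers.size(); i++) {
                taskToNode.put(tasks.get(i), containers.get(i));
            }
        }
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager();
        rm.register("node1", 2);
        rm.register("node2", 1);
        ApplicationMaster am = new ApplicationMaster();
        am.launch(Arrays.asList("map-0", "map-1", "reduce-0"), rm);
        // prints {map-0=node1, map-1=node1, reduce-0=node2}
        System.out.println(am.taskToNode);
    }
}
```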
[Figure: YARN architecture; labels: Data Nodes, Locality settings]
Courtesy: http://www.cyanny.com/2013/12/05/hadoop-mapreduce-2-yarn/

Overview of Apache Hadoop 3.x Architecture
[Figure: Hadoop 3.x stack; labels: Hadoop Apps, MR, HIVE, YARN, docker, TensorFlow]
• Erasure Coding
• MapReduce: task-level native optimization (up to 30% faster for shuffle-intensive jobs)
• Support for more than 2 NameNodes
• Intra-datanode balancer
• Built-in support for Long Running Services
• Better resource isolation (isolation support for disk and network) and Docker
• Scheduling enhancement (container scheduling throughput improved by 6x)
• Re-architecture for YARN Timeline Service – ATS v2
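As a rough illustration of why erasure coding appears in this list, the arithmetic below compares the storage overhead of 3-way replication with a Reed-Solomon layout of 6 data cells and 3 parity cells. The slide does not say which policy is meant, so both parameter sets are assumptions for this example, and `StorageOverhead` is a hypothetical helper class.

```java
// Illustrative arithmetic only; the scheme parameters are assumptions for this example.
class StorageOverhead {

    // Extra storage as a percentage of the original data size.
    static int overheadPercent(int dataUnits, int redundantUnits) {
        return 100 * redundantUnits / dataUnits;
    }

    public static void main(String[] args) {
        // 3-way replication: 1 data block plus 2 extra copies.
        System.out.println("replication x3: " + overheadPercent(1, 2) + "%"); // 200%
        // Reed-Solomon RS(6,3): 6 data cells plus 3 parity cells per stripe.
        System.out.println("RS(6,3): " + overheadPercent(6, 3) + "%"); // 50%
    }
}
```

Under these assumptions the erasure-coded layout survives the loss of any 3 cells in a stripe while storing a quarter of the redundant data that replication needs.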

• An in-memory data-processing framework
– Iterative machine learning jobs
– Interactive data analytics
– Scala-based implementation
– Runs on Standalone, YARN, Mesos
• A unified engine to support Batch, Streaming, SQL, Graph, ML/DL workloads
• Scalable and communication intensive
– Wide dependencies between Resilient Distributed Datasets (RDDs)
– MapReduce-like shuffle operations to repartition RDDs
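A wide dependency means that each output partition may need records from every input partition, which is why a shuffle is communication intensive. The sketch below shows a minimal hash-partitioned repartitioning step in plain Java. It is not Spark's implementation; `HashShuffle`, `partitionFor`, and `shuffle` are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a hash-partitioned shuffle; this is not Spark's implementation.
class HashShuffle {

    // Choose the output partition for a key (non-negative modulo of its hash code).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Repartition: every input partition may send records to every output
    // partition, which is what makes this a wide dependency.
    static List<List<String>> shuffle(List<List<String>> input, int numPartitions) {
        List<List<String>> output = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            output.add(new ArrayList<>());
        }
        for (List<String> partition : input) {
            for (String key : partition) {
                output.get(partitionFor(key, numPartitions)).add(key);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<List<String>> input = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("b", "c", "a"));
        // prints [[b, b], [a, c, c, a]]
        System.out.println(shuffle(input, 2));
    }
}
```

After the shuffle, all occurrences of a key sit in the same partition, so a per-key operation such as reduceByKey can finish without further communication.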
[Figure: Spark stack and cluster layout; labels: SparkContext, Worker, MLlib (Machine Learning), Caffe, TensorFlow, BigDL, etc., (real-time), MapReduce, Standalone, Apache Mesos, YARN]
http://spark.apache.org
