

• The term “Big Data” has been in use since the 1990s.


• Data sets with sizes beyond the ability of commonly used software tools to capture, curate,

manage, and process data within a tolerable elapsed time.

• Drivers of Big Data:

– Hardware development: Storage (much cheaper), CPUs (more cores)

– Internet bandwidth: 56 kbps vs. 100 Mbps

– Data generation:

 Transactional data (stock, shopping records in Woolworths/Coles)

 User-centric data (videos/images)

 Sensor-based data (cameras)

Background – Big Data Era

Cloud Computing

https://en.wikipedia.org/wiki/Big_data
https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm


• Characteristics of Big Data: the 5 Vs:

– Volume (quantity) – from 4.4 trillion gigabytes (2013) to an estimated 44 trillion by 2020

(more than doubles every two years).

– Variety (type) – structured vs. unstructured

– Velocity (speed)

– Veracity (quality)

– Value (usefulness)

• Data must be processed with advanced tools to reveal

meaningful information.

– Data mining and machine learning

– Cloud computing

Background – Big Data Era

Cloud Computing

https://en.wikipedia.org/wiki/Big_data
https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm


Distributed storage

– Huge volumes of data are stored on clusters of storage nodes

– Distributed file systems

– GFS, BigTable (Google) and HDFS/HBase (open-source in

Apache Hadoop)

Distributed computing

– Clusters of computing nodes process data in a parallel manner

– Distributed computing models/frameworks

– MapReduce in Hadoop and Apache Spark

Background – Big Data Technologies

Cloud Computing

https://projects.apache.org
https://en.wikipedia.org/wiki/The_Apache_Software_Foundation


• Introduction to Hadoop

• What is Hadoop & History

• Hadoop Ecosystem

• Hadoop Computation Model: MapReduce

• MapReduce Components & Paradigm

• MapReduce Workflow

• MapReduce Examples:

 Word Count

 Find the Highest/Average Temperature

 Find Word Length Distribution

 Find Common Friends

 Inverted Indexing

• Besides the Core: Hive and Pig


• Apache Hadoop’s MapReduce and HDFS components were inspired by Google papers

on MapReduce and Google File System (GFS).

• “an open-source software platform for distributed storage and distributed

processing of very large data sets on computer clusters built from commodity

hardware” – Hortonworks

• Features

• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets

– Structured and non-structured data

– Simple programming models

• High scalability and availability

• Fault-tolerance (failures are common)

• Move computation rather than data (data locality)

What is Hadoop?

Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: Simplified data processing on large clusters.” (2004).

Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google file system.” (2003).


• The core of Apache Hadoop consists of:

– Storage: Hadoop Distributed File System (HDFS),

– Computation model: MapReduce programming model.

• The base Apache Hadoop framework is composed of the following modules:

– Hadoop Common – contains libraries and utilities needed by other Hadoop modules;

– Hadoop Distributed File System (HDFS) – a distributed file-system;

– Hadoop YARN – a cluster manager;

– Hadoop HBase – a NoSQL database;

– Hadoop Hive – a data warehouse that supports a high-level SQL-like query language;

– Hadoop MapReduce – an implementation of the MapReduce programming model for big

data processing.

Hadoop Core and Sub-modules

Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: Simplified data processing on large clusters.” (2004).

Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google file system.” (2003).


• The term Hadoop is often used to refer to the base modules and sub-modules,
and also to the ecosystem.

• Some additional software packages that can be installed on top of or
alongside Hadoop are also included:

– Apache Pig: is a high-level platform for creating programs that run
on Apache Hadoop

– Apache Hive: is a data warehouse software project built on top
of Apache Hadoop for providing data query and analysis

– Apache HBase: is an open-source, distributed, versioned, non-
relational database inspired by Google BigTable

– Apache Phoenix: is an open source, massively parallel, relational
database engine supporting OLTP for Hadoop using Apache HBase
as its backing store.

– Apache Spark: is an in-memory computing platform

Additional Packages in Hadoop


• Need to process multi-petabyte (large-scale) datasets

• Data may not have a strict schema

• Expensive to build reliability in each application

• Nodes fail every day

• Need common infrastructure

• Very Large Distributed File System

• Assumes Commodity Hardware on heterogeneous OS

• Optimized for Batch Processing

Why use Hadoop?


Designed to answer the question: “How to process big data with reasonable cost and

time?”
Hadoop was created by Doug Cutting and has its origins in Apache Nutch, an open source

web search engine.

Inspired by Google’s GFS and MapReduce papers, development started and then moved to

the new Hadoop subproject in Jan 2006.

Hadoop was named after a toy elephant belonging to Doug Cutting’s son.

The initial code comprised just 5,000 lines of code for HDFS and about 6,000 lines for MapReduce.

Milestones:

– 2008 – Becomes a top-level Apache project; Hadoop wins the Terabyte Sort Benchmark

– 2013 – Hadoop 1.1.2 and Hadoop 2.0.3 alpha released

– 2014 – Hadoop 2.3 released

– 2019 – Apache Hadoop 3.2 available

Brief History of Hadoop

Doug Cutting


• 2003 – Google File System (GFS)

• 2004 – MapReduce computation model (Google)

• 2006 – Hadoop was born

• 2006/April – HDFS + MapReduce in Hadoop 0.1.0

• 2006/April – Hadoop won sorting competition (1.8T on
188 nodes in 47.9 hours)

• 2006 – 2007 – Yahoo contributed to Hadoop

• 2008/March – HBase released

• 2008/Sept – Pig released

• 2009/June – Sqoop released

• 2010/Oct – Hive released

• 2011/Feb – Zookeeper released

• 2012 – Hadoop YARN released

• 2014/May – Spark released

Hadoop Ecosystem

[Figure: Hadoop ecosystem stack – HDFS at the base; Hadoop MapReduce and Hadoop YARN above it; Pig, Hive, Sqoop, and Spark (ML/GraphX/…) on top]


• Hadoop is in use at most organizations that handle big data:

– Facebook

– and more…

• Main applications using Hadoop:

– Advertisement (Mining user behavior to generate

recommendations)

– Searches (group related documents)

– Security (search for uncommon patterns)

Hadoop in the Wild

https://cwiki.apache.org/confluence/display/HADOOP2/PoweredBy
https://www.statista.com/statistics/593479/worldwide-hadoop-bigdata-market/


• Introduction to Hadoop

• What is Hadoop & History

• Hadoop Ecosystem

• Hadoop Computation Model: MapReduce

• MapReduce Components & Paradigm

• MapReduce Workflow

• MapReduce Examples:

 Word Count

 Find the Highest/Average Temperature

 Find Word Length Distribution

 Find Common Friends

 Inverted Indexing

• Besides the Core: Hive and Pig


Coins Deposit

MapReduce: A Real World Analogy

Mapper: Categorize coins by their face values

Reducer: Count the coins of each face value in parallel


• Client:

– A MapReduce program written by users is submitted to the JobTracker via the Client

– Users can view the job’s running status through interfaces in the Client

MapReduce Architecture: Components


• JobTracker:

– Monitors resources and coordinates jobs

– Monitors the health of all TaskTrackers (transfers tasks to other nodes once a failure is found)

MapReduce Architecture: Components


• TaskTracker:

– Periodically sends heartbeats with resource information and job execution status to the JobTracker

– Receives and executes commands from the JobTracker (start new tasks or kill existing tasks)

MapReduce Architecture: Components


• Task:

– Map Task and Reduce Task

– Initiated by the TaskTracker

MapReduce Architecture: Components


MapReduce Architecture ver1: Workflow

[Figure: JobTracker (with Task Scheduler) coordinating multiple TaskTrackers, each running Map Tasks and Reduce Tasks]


• MapReduce is the data processing component of Hadoop.

• MapReduce programs transform lists of input data elements into lists of output data elements.

• A MapReduce program performs its map and reduce tasks asynchronously.

• A MapReduce Job (an execution of a Mapper and Reducer across a data set) consists of:

– the input data (submitted by users),

– the MapReduce program (program logic written by users),

– and configuration info.

MapReduce Paradigm


• The MapReduce framework operates exclusively on <key, value> pairs;

• The framework views the input to the job as a set of <key, value> pairs and produces a set

of <key, value> pairs as the output of the job.

• Input and Output types of a MapReduce job (e.g., word count):

Inputs and Outputs

Function: Map
  Input:  <k1, v1> – e.g., <line number, line contents>
  Output: List(<k2, v2>) – e.g., <“a”, 1>, <“b”, 1>, <“c”, 1>
  Note:   1. Converts splits of data into lists of <key, value> pairs.
          2. Each input <k1, v1> outputs a list of key/value pairs as
          intermediate results.

Function: Reduce
  Input:  <k2, List(v2)> – e.g., <“a”, <1, 1, 1>>
  Output: <k3, v3> – e.g., <“a”, 3>, <“b”, 2>, <“c”, 4>
  Note:   List(v2) in the intermediate result <k2, List(v2)> collects the
          values of the same key k2.

Map takes one pair of data with a type in one data domain, and returns a list of pairs in a
different domain: Map(k1, v1) → list(k2, v2)

• k1 stands for the line number while v1 stands for the line contents

• E.g. (1, “Hello, world. Welcome to MapReduce world!”) → (“Hello”, 1), (“world”, 1),
(“Welcome”, 1), (“to”, 1), (“MapReduce”, 1), (“world”, 1)

• The Map function is applied in parallel to every pair (keyed by k1) in the input dataset

– Each line has the Map function applied to it

– (2, “Hello, MapReduce. Welcome to world!”) → (“Hello”, 1), (“world”, 1) …

– (3, “Hello, Spark. Spark is better than MapReduce.”) → (“Hello”, 1), (“Spark”, 1) …

• A list of pairs (keyed by k2) is generated; k2 here is a word, not a line number.

• After that, the MapReduce framework collects all pairs with the same key (k2) from all lists
and groups them together, creating one group for each key.

– (“Hello”, <1, 1, 1>), (“world”, <1, 1, 1>), etc.

Inputs and Outputs


The Reduce function is then applied in parallel to each group, which in turn produces a

collection of values in the same domain: Reduce(k2, list(v2)) → (k3, v3)

• E.g. (“Hello”, <1, 1, 1>) → (“Hello”, 3), (“world”, <1, 1, 1>) → (“world”, 3)

• Each Reduce call typically produces either one value v3 or an empty return.

Inputs and Outputs


• No communication between Map tasks

• No communication between Reduce tasks

• A shuffle process is needed to transfer intermediate data to the reducers

Divide and Conquer


• Given a file consisting of the words “Dear Bear River Car Car River Deer Car Bear”,

the overall MapReduce process will look like:

Example I – Word Count

[Figure: word-count data flow – Input → Splitting (“Dear Bear River” / “Car Car River” / “Deer Car Bear”) → Mapping (List(K2, V2)) → Shuffling (K2, List(V2): Bear, (1,1); Car, (1,1,1); Deer, (1,1); River, (1,1)) → Reducing (List(K3, V3)) → Final Result]


• The input is split into three different data-sets that are distributed among the map nodes.

• The words are tokenized and each word is assigned the hardcoded value 1, i.e., one

key-value pair is emitted per occurrence (the mapper does not count within a split).

• After this, a list of key-value pairs is created where the key is the word and the value is the

number one. E.g., the first data-set has three key-value pairs: Bear, 1; River, 1; Dear, 1.

• After the mapping process is complete, the shuffling and sorting tasks are executed so that all

the tuples that have the same key are combined together and sent to the reducer.

• On completion of the sorting and shuffling tasks, each reducer will have a unique key and the

corresponding list of values. For example, Car, [1,1,1]; Deer, [1,1]… etc.

• The reducer will now count the values present in the list of each unique key and generate the

key, value pair for that key, where the value will now correspond to the count.

• The output is written in the key, value format to the output file.

Example I – word count


Example I – word count

• Compile WordCount.java and create a jar

• Run the application
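
For reference, a sketch of these steps adapted from the official tutorial linked below (the environment variable setting and the HDFS paths are illustrative, not prescribed):

  # Compile against the Hadoop classpath and bundle the classes into a jar
  export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
  bin/hadoop com.sun.tools.javac.Main WordCount.java
  jar cf wc.jar WordCount*.class

  # Run the job on input/output directories in HDFS (illustrative paths)
  bin/hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output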

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code


Example I – word count

Main Function

• Create a configuration and a job instance

• Indicate Mapper, Combiner, Reducer

• Specify input and output directories with FileInputFormat and FileOutputFormat
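
A condensed sketch of such a main function, following the official tutorial linked below (TokenizerMapper and IntSumReducer are shown on later slides):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();          // configuration
      Job job = Job.getInstance(conf, "word count");     // job instance
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);         // mapper
      job.setCombinerClass(IntSumReducer.class);         // combiner
      job.setReducerClass(IntSumReducer.class);          // reducer
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir
      FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }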

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code


• Data are stored in HDFS across two nodes as

input files.

• InputFormat defines how these input files are

split and read. It selects the files or other objects

that are used for input. InputFormat creates

InputSplit.

• InputSplit logically represents the data which will

be processed by an individual Mapper.

• One map task is created for each split; thus the

number of map tasks will be equal to the number

of InputSplits.

• The split is divided into records and each record

will be processed by the mapper.

[Figure: input files on Node 1 and Node 2 pass through InputFormat, which creates the splits]
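
The input format can also be selected explicitly in the driver; a minimal sketch using Hadoop's standard TextInputFormat (the default for text files):

  // Inside the driver, after creating 'job' (see the main-function slide):
  // TextInputFormat reads each split line by line, producing
  // <byte offset, line contents> records for the mapper.
  job.setInputFormatClass(
      org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);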


• RecordReader (RR) communicates with the InputSplit and

converts the data into key-value pairs suitable for reading by

the mapper.

• RecordReader (RR) will actually read data in HDFS according

to the split information and pass key-value pairs to mappers.

• Mapper processes each input record (from RecordReader)

and generates new key-value pair, which is completely

different from the input pair.

• The output of Mapper is also known as intermediate output

which is written to the local disk.

• The output of the Mapper is not stored on HDFS: it is temporary data, and writing it to

HDFS would create unnecessary copies.

• The Mapper’s output is passed to the Combiner for further processing.

[Figure: on each node, InputSplits are read by RecordReaders (RR), and each Map produces an intermediate List(<key, value>)]


• Combiner (aka mini-reducer) performs local aggregation

on the mappers’ output, which helps to minimize the data

transfer between mapper and reducer (we will see

reducer below).

[Figure: on each node, Map output flows through a Combiner before shuffling]

Example: two mappers process the lines “cloud is in cloud” and “machine learning is learning machine”.

No combiner: 9 key/value pairs of intermediate data are shuffled –
(cloud, 1), (is, 1), (in, 1), (cloud, 1) from mapper 1;
(machine, 1), (learning, 1), (is, 1), (learning, 1), (machine, 1) from mapper 2.

With combiners: only 6 key/value pairs of intermediate data –
Combiner 1 emits (cloud, 2), (is, 1), (in, 1);
Combiner 2 emits (machine, 2), (learning, 2), (is, 1).

Less data to transfer → improved overall performance.
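
Because word count’s reduce function (summing) is commutative and associative, the reducer class itself can double as the combiner; a sketch of the driver line that enables this:

  // Run IntSumReducer as a mini-reducer on each mapper's local output.
  job.setCombinerClass(IntSumReducer.class);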


• The Partitioner takes the output from the combiners and performs partitioning.

• Keys of the intermediate data are partitioned according to a hash function.

• All keys within the same partition will go to the same reducer.

• The total number of partitions depends on the number of reducers.

[Figure: Combiner output flows through a Partitioner on each node; a hash or partition function assigns keys such as <“apple”, 1>, <“able”, 1>, <“accept”, 1>, <“bear”, 1>, <“country”, 2> to partitions]
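
A minimal sketch equivalent to Hadoop’s default hash partitioning (the class name HashLikePartitioner is ours, for illustration only):

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  // Every occurrence of a key produces the same hash, so all pairs with
  // that key land in the same partition and reach the same reducer.
  public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
      // Mask the sign bit to stay non-negative, then pick one of the
      // numReduceTasks partitions.
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }

A custom partitioner would be registered in the driver with job.setPartitionerClass(HashLikePartitioner.class).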


• Shuffling is the process by which the intermediate output from mappers is transferred to

the reducers.

• The Reducer has three primary phases: shuffle, sort, and reduce.

• OutputFormat determines the way these output key-value pairs are written in output files.

• The final results will be written to HDFS.

• Note that the intermediate data from mappers are not immediately written into local storage,

but into memory.

• There is a mechanism, called spill, that periodically writes intermediate data from memory to disk.

[Figure: complete data flow – input files → InputFormat → InputSplits → RecordReader → Map → Combiner → Partitioner → Shuffle & Sort → Reduce → OutputFormat → HDFS]
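
For completeness, a sketch of the driver settings that control this reduce side (standard Hadoop APIs; the reducer count of 2 is illustrative):

  // Inside the driver: the number of reduce tasks (= number of partitions)
  // and the OutputFormat that writes the final <key, value> pairs to HDFS.
  job.setNumReduceTasks(2);  // illustrative value
  job.setOutputFormatClass(
      org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);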


Example I – word count

• Customize mapper based on the Mapper class

provided by Hadoop

• Define a map method that contains the

mapping logic.

• Output: e.g., (line1, Bear) -> (Bear, 1)
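
A sketch of such a mapper, following the official tutorial linked below (there it is nested inside the WordCount class):

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // TokenizerMapper: the input key is the position in the file and the
  // input value is one line of text.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once per record: tokenize the line and emit (word, 1) for
    // every token, e.g. (line1, "Bear ...") -> (Bear, 1).
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }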

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code


Example I – word count

• Customize reducer based on the Reducer class

provided by Hadoop

• Define a reduce method that contains the reducing logic.

• Output: e.g., (Bear, <1,1,1>)->(Bear, 3)
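
A sketch of such a reducer, again following the official tutorial linked below (nested inside the WordCount class there):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // IntSumReducer: called once per key with all of its values.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    // Sum the 1s for a key, e.g. (Bear, <1,1,1>) -> (Bear, 3).
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }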

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code


Ripple Quiz


• Assume there are four text files on different nodes; you want to count the word

frequency using the MapReduce model.

• Text 1 (node 1): the weather is good

• Text 2 (node 2): today is good

• Text 3 (node 3): good weather is good

• Text 4 (node 4): today has good weather

Revisit Word Count Example
