• The term “Big Data” has been in use since the 1990s.
• Data sets with sizes beyond the ability of commonly used software tools to capture, curate,
manage, and process data within a tolerable elapsed time.
• Reasons for Big Data:
– Hardware development: storage (much cheaper), CPUs (more cores)
– Internet bandwidth: 56 Kbps vs. 100 Mbps
– Data generation:
Transactional data (stock, shopping records in Woolworths/Coles)
User-centric data (videos/images)
Sensor-based data (cameras)
Background – Big Data Era
https://en.wikipedia.org/wiki/Big_data
https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
• Characteristics of Big Data: the 5Vs:
– Volume (quantity) – from 4.4 trillion gigabytes (2013) to a projected 44 trillion (2020);
more than doubling every two years
– Variety (type) – structured vs. unstructured
– Velocity (speed)
– Veracity (quality)
– Value (usefulness)
• Data must be processed with advanced tools to reveal
meaningful information.
– Data mining and machine learning
– Cloud computing
Background – Big Data Era
• Distributed storage
– Huge volumes of data are stored on clusters of storage nodes
– Distributed file systems:
GFS and BigTable (Google), HDFS and HBase (open-source, in Apache Hadoop)
• Distributed computing
– Clusters of computing nodes process data in a parallel manner
– Distributed computing models/frameworks:
MapReduce in Hadoop, and Apache Spark
Background – Big Data Technologies
https://projects.apache.org
https://en.wikipedia.org/wiki/The_Apache_Software_Foundation
• Introduction to Hadoop
• What is Hadoop & History
• Hadoop Ecosystem
• Hadoop Computation Model: MapReduce
• MapReduce Components & Paradigm
• MapReduce Workflow
• MapReduce Examples:
Word Count
Find the Highest/Average Temperature
Find Word Length Distribution
Find Common Friends
Inverted Indexing
• Besides the Core: Hive and Pig
• Apache Hadoop’s MapReduce and HDFS components were inspired by Google papers
on MapReduce and Google File System (GFS).
• “an open-source software platform for distributed storage and distributed
processing of very large data sets on computer clusters built from commodity
hardware” – Hortonworks
• Features
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
– Structured and non-structured data
– Simple programming models
• High scalability and availability
• Fault-tolerance (failures are common)
• Move computation rather than data (data locality)
What is Hadoop?
Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: Simplified data processing on large clusters.” (2004).
Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google file system.” (2003).
• The core of Apache Hadoop consists of:
– Storage: Hadoop Distributed File System (HDFS),
– Computation model: MapReduce programming model.
• The base Apache Hadoop framework is composed of the following modules:
– Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
– Hadoop Distributed File System (HDFS) – a distributed file-system;
– Hadoop YARN – a cluster manager;
– Hadoop HBase – a NoSQL database;
– Hadoop Hive – a data warehouse that supports a high-level SQL-like query language;
– Hadoop MapReduce – an implementation of the MapReduce programming model for big
data processing.
Hadoop Core and Sub-modules
Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: Simplified data processing on large clusters.” (2004).
Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google file system.” (2003).
• The term Hadoop is often used to refer not only to the base modules and sub-modules but also to the ecosystem.
• Some additional software packages that can be installed on top of or
alongside Hadoop are also included:
– Apache Pig: a high-level platform for creating programs that run on Apache Hadoop
– Apache Hive: a data warehouse software project built on top of Apache Hadoop, providing data query and analysis
– Apache HBase: an open-source, distributed, versioned, non-relational database inspired by Google BigTable
– Apache Phoenix: an open-source, massively parallel, relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store
– Apache Spark: an in-memory computing platform
Additional Packages in Hadoop
• Need to process multi-petabyte (large-scale) datasets
• Data may not have a strict schema
• Expensive to build reliability into each application
• Nodes fail every day
• Need common infrastructure
• Very large distributed file system
• Assumes commodity hardware running heterogeneous OSes
• Optimized for batch processing
Why use Hadoop?
Designed to answer the question: “How to process big data with reasonable cost and time?”
Hadoop was created by Doug Cutting and has its origins in Apache Nutch, an open source
web search engine.
Inspired by Google’s GFS and MapReduce papers, development started and then moved to
the new Hadoop subproject in Jan 2006.
Hadoop was named after a toy elephant that belonged to Doug Cutting’s son.
The initial code comprised just about 5,000 lines for HDFS and about 6,000 lines for MapReduce.
Milestones:
– 2008 – Hadoop became a top-level Apache project and won the Terabyte Sort Benchmark
– 2013 – Hadoop 1.1.2 and Hadoop 2.0.3 alpha released
– 2014 – Hadoop 2.3 released
– 2019 – Apache Hadoop 3.2 available
Brief History of Hadoop
[Photo: Doug Cutting]
• 2003 – Google File System (GFS)
• 2004 – MapReduce computation model (Google)
• 2006 – Hadoop was born
• 2006/April – HDFS + MapReduce in Hadoop 0.1.0
• 2006/April – Hadoop won a sorting competition (sorted 1.8 TB on 188 nodes in 47.9 hours)
• 2006 – 2007 – Yahoo contributed to Hadoop
• 2008/March – HBase released
• 2008/Sept – Pig released
• 2009/June – Sqoop released
• 2010/Oct – Hive released
• 2011/Feb – Zookeeper released
• 2012 – Hadoop YARN released
• 2014/May – Spark released
Hadoop Ecosystem
[Diagram: the Hadoop ecosystem stack – the Hadoop Distributed File System (HDFS) as the storage layer, Hadoop YARN and Hadoop MapReduce above it, and tools such as Pig, Hive, Sqoop, and Spark (ML/GraphX/…) on top]
• Hadoop is in use at most organizations that handle big data – and more… (see the PoweredBy page linked below)
• Main applications using Hadoop:
– Advertisement (Mining user behavior to generate
recommendations)
– Searches (group related documents)
– Security (search for uncommon patterns)
Hadoop in the Wild
https://www.statista.com/statistics/593479/worldwide-hadoop-bigdata-market/
https://cwiki.apache.org/confluence/display/HADOOP2/PoweredBy
• Introduction to Hadoop
• What is Hadoop & History
• Hadoop Ecosystem
• Hadoop Computation Model: MapReduce
• MapReduce Components & Paradigm
• MapReduce Workflow
• MapReduce Examples:
Word Count
Find the Highest/Average Temperature
Find Word Length Distribution
Find Common Friends
Inverted Indexing
• Besides the Core: Hive and Pig
Coin Deposit
MapReduce: A Real World Analogy
Mapper: Categorize coins by their face values
Reducer: Count the coins of each face value in parallel
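
To make the analogy concrete, here is a minimal sketch in plain Java (not Hadoop code); the coin face values below are made-up sample data. The grouping step plays the mapper’s role (categorize by face value) and the counting step plays the reducer’s role:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CoinDeposit {
    public static void main(String[] args) {
        // Made-up sample deposit: coin face values in cents
        List<Integer> coins = List.of(5, 10, 5, 50, 10, 5);

        // "Map" step: categorize each coin by its face value;
        // "Reduce" step: count the coins in each category
        Map<Integer, Long> counts = coins.stream()
                .collect(Collectors.groupingBy(c -> c, Collectors.counting()));

        // Prints, e.g., 5c x 3, 10c x 2, 50c x 1 (order may vary)
        counts.forEach((face, n) -> System.out.println(face + "c x " + n));
    }
}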
• Client:
– MapReduce programs written by users are submitted to the JobTracker via the Client
– Users can check the running status of jobs through interfaces provided by the Client
MapReduce Architecture: Components
• JobTracker:
– Monitor resources and coordinate jobs
– Monitor the health of all the TaskTrackers (transfer jobs to other nodes once a failure is found)
MapReduce Architecture: Components
• TaskTracker:
– Periodically send heartbeats with resource information and job execution status to the JobTracker
– Receive and execute commands from JobTracker (start new tasks or kill existing tasks)
MapReduce Architecture: Components
• Task:
– Map Task and Reduce Task
– Initiated by the TaskTracker
MapReduce Architecture: Components
MapReduce Architecture ver1: Workflow
[Diagram: the JobTracker’s Task Scheduler assigns tasks to TaskTrackers; each TaskTracker node runs Map Tasks and Reduce Tasks]
• MapReduce is the data processing component of Hadoop.
• MapReduce programs transform lists of input data elements into lists of output data elements.
• A MapReduce program will do map and reduce tasks asynchronously.
• A MapReduce Job (an execution of a Mapper and Reducer across a data set) consists of:
– the input data (submitted by users),
– the MapReduce program (program logic written by users),
– and configuration info.
MapReduce Paradigm
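
To make these three ingredients concrete, the sketch below is essentially the driver from the Apache Hadoop WordCount tutorial (the same tutorial linked with the word count example later in this section); TokenizerMapper and IntSumReducer are the user-written mapper and reducer shown later:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();               // configuration info
    Job job = Job.getInstance(conf, "word count");          // the job instance
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);              // program logic: map
    job.setCombinerClass(IntSumReducer.class);              // optional local aggregation
    job.setReducerClass(IntSumReducer.class);               // program logic: reduce
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // the input data
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // the output location
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
  }
}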
Inputs and Outputs

• The MapReduce framework operates exclusively on <key, value> pairs.
• The framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job.
• Input and Output types of a MapReduce job (e.g., word count):

Function: Map
  Input:  <k1, v1>, e.g., <line number, line contents>
  Output: List(<k2, v2>), e.g., <“a”, 1>, <“b”, 1>, <“c”, 1>
  Note:   1. Converts splits of data into lists of <key, value> pairs
          2. Each input <k1, v1> pair produces a list of key/value pairs as intermediate results

Function: Reduce
  Input:  <k2, List(v2)>, e.g., <“a”, <1, 1, 1>>
  Output: <k3, v3>, e.g., <“a”, 3>, <“b”, 2>, <“c”, 4>
  Note:   The value of an intermediate result, List(v2), represents the values of the same key (k2)

Inputs and Outputs

• Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain: Map(k1, v1) → List(k2, v2)
• k1 stands for the line number while v1 stands for the line contents
  – E.g., (1, “Hello, world. Welcome to MapReduce world!”) → (“Hello”, 1), (“world”, 1), (“Welcome”, 1), (“to”, 1), (“MapReduce”, 1), (“world”, 1)
• The Map function is applied in parallel to every pair (keyed by k1) in the input dataset
  – Each line will be applied with the Map function
  – (2, “Hello, MapReduce. Welcome to world!”) → (“Hello”, 1), (“world”, 1) …
  – (3, “Hello, Spark. Spark is better than MapReduce.”) → (“Hello”, 1), (“Spark”, 1) …
• A list of pairs (keyed by k2) will be generated; k2 here is a word, not a line number.
• After that, the MapReduce framework collects all pairs with the same key (k2) from all lists and groups them together, creating one group for each key:
  – (“Hello”, <1, 1, 1>), (“world”, <1, 1, 1, 1>), etc.

Inputs and Outputs

• The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain: Reduce(k2, List(v2)) → (k3, v3)
  – E.g., (“Hello”, <1, 1, 1>) → (“Hello”, 3), (“world”, <1, 1, 1, 1>) → (“world”, 4)
• Each Reduce call typically produces either one value v3 or an empty return.

Divide and Conquer

• No communication between Map tasks
• No communication between Reduce tasks
• A shuffle process is needed to transfer data from the mappers to the reducers

Example I – Word Count

• Given a file that consists of the words “Deer, Bear, River, Car, Car, River, Deer, Car and Bear”, the overall MapReduce process will look like:

[Diagram: Input → Splitting → Mapping → Shuffling → Reducing → Final Result. The file is split into “Deer Bear River”, “Car Car River”, “Deer Car Bear”; mapping emits List(K2, V2) pairs such as (Deer, 1); shuffling groups them into K2, List(V2) pairs – Bear, (1,1); Car, (1,1,1); Deer, (1,1); River, (1,1) – and reducing produces List(K3, V3): Bear, 2; Car, 3; Deer, 2; River, 2]

Example I – Word Count

• The input is split into three data-sets (splits) that are distributed among the map nodes.
• The words are tokenized and a hard-coded value of 1 is assigned to each word, assuming that each word appears once.
• Lists of key-value pairs are then created where the key is a word and the value is one; e.g., the first split has three key-value pairs: Deer, 1; Bear, 1; River, 1.
• After the mapping process is complete, the shuffling and sorting tasks are executed so that all tuples with the same key are combined together and sent to the reducer.
• On completion of sorting and shuffling, each reducer will have a unique key and the corresponding list of values, e.g., Car, [1,1,1]; Deer, [1,1]; etc.
• The reducer then counts the values in the list of each unique key and generates the (key, value) pair for that key, where the value now corresponds to the count.
• The output is written in (key, value) format to the output file.
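
For reference, here are the mapper and reducer for this example as given in the Apache Hadoop MapReduce tutorial (linked below); both classes are nested inside the WordCount class from the driver sketch earlier:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each input line, emit (word, 1) for every token in the line
public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);          // e.g., ("Bear", 1)
    }
  }
}

// Reduce: sum the 1s collected for each word
public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context
                     ) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);          // e.g., ("Bear", 2)
  }
}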
Example I – Word Count

• Compile WordCount.java and create a jar
• Run the application

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code

Example I – Word Count: Main Function

• Create a configuration and a job instance
• Indicate the Mapper, Combiner, and Reducer
• Specify the input and output directories with FileInputFormat and FileOutputFormat

MapReduce Workflow: InputFormat and InputSplit

• Data are stored in HDFS across two nodes as input files.
• InputFormat defines how these input files are split and read; it selects the files or other objects that are used for input, and creates the InputSplits.
• An InputSplit logically represents the data that will be processed by an individual Mapper.
• One map task is created for each split; thus the number of map tasks equals the number of InputSplits.
• Each split is divided into records, and each record will be processed by the mapper.

[Diagram: input files on Node 1 and Node 2 pass through InputFormat and are divided into splits]

MapReduce Workflow: RecordReader and Mapper

• The RecordReader (RR) communicates with the InputSplit and converts the data into key-value pairs suitable for reading by the mapper.
• The RecordReader actually reads data in HDFS according to the split information and passes key-value pairs to the mappers.
• The Mapper processes each input record (from the RecordReader) and generates a new key-value pair, which can be completely different from the input pair.
• The output of the Mapper, also known as the intermediate output, is written to the local disk.
• The output of the Mapper is not stored on HDFS, as this is temporary data and writing it to HDFS would create unnecessary copies.
• The Mapper’s output is passed to the Combiner for further processing.

[Diagram: splits on each node pass through RecordReaders to Map tasks, which emit List(<key, value>) intermediate outputs]

MapReduce Workflow: Combiner

• The Combiner (aka mini-reducer) performs local aggregation on the mappers’ output, which helps to minimize the data transfer between mapper and reducer (the Reducer is described below).

[Diagram: Node 1 holds “cloud is in cloud” and Node 2 holds “machine learning is learning machine”. Without combiners, the mappers emit 9 intermediate key/value pairs, e.g., (cloud, 1), (cloud, 1), (machine, 1), (learning, 1), (learning, 1), (machine, 1), …; with a combiner per node, local aggregation such as (cloud, 2), (machine, 2), (learning, 2) cuts this to 6 intermediate pairs – improved overall performance]

MapReduce Workflow: Partitioner

• The Partitioner takes the output from the combiners and performs partitioning.
• Keys of the intermediate data are partitioned according to a hash (partition) function, e.g., <“apple”, 1>, <“able”, 1>, <“accept”, 1> into one partition and <“bear”, 1>, <“country”, 2> into another.
• All keys within the same partition will go to the same reducer.
• The total number of partitions equals the number of reducers.
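
As an illustration of the hash partitioning described above, here is a sketch that mirrors Hadoop’s built-in HashPartitioner, assuming the standard org.apache.hadoop.mapreduce.Partitioner API; a job would opt in with job.setPartitionerClass(WordPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// The partition (and hence the reducer) for a key is chosen by hashing
// the key modulo the number of reduce tasks.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then take the
    // remainder: all occurrences of the same key land in the same partition.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}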
MapReduce Workflow: Shuffle, Sort, and OutputFormat

• Shuffling is the process by which the intermediate output from the mappers is transferred to the reducers.
• The Reducer has three primary phases: shuffle & sort, and reduce.
• OutputFormat determines the way the output key-value pairs are written into output files.
• The final results are written into HDFS.
• Note that the intermediate data from the mappers are not immediately written into local storage, but into memory.
• A mechanism called spill periodically writes the intermediate data in memory out to disk.

[Diagram: the full pipeline on two nodes – InputFormat → splits → RecordReaders → Map → Combiner → Partitioner → Shuffle & Sort → Reduce → OutputFormat]

Example I – Word Count: Mapper

• Customize the mapper based on the Mapper class provided by Hadoop
• Define a map method that contains the mapping logic
• Output: e.g., (line1, Bear) -> (Bear, 1)

Example I – Word Count: Reducer

• Customize the reducer based on the Reducer class provided by Hadoop
• Define a reduce method that contains the reducing logic
• Output: e.g., (Bear, <1,1,1>) -> (Bear, 3)

Ripple Quiz: Revisit the Word Count Example

• Assuming that there are four text files on different nodes, count the word frequency using the MapReduce model.
• Text 1 (node 1): the weather is good
• Text 2 (node 2): today is good
• Text 3 (node 3): good weather is good
• Text 4 (node 4): today has good weather
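
For a quick self-check (a worked answer, not from the original slides): each node’s mapper emits (word, 1) pairs, e.g., node 1 emits (the, 1), (weather, 1), (is, 1), (good, 1). After shuffle and sort, the grouped pairs and reduced outputs are:
– (good, <1, 1, 1, 1, 1>) → (good, 5)
– (is, <1, 1, 1>) → (is, 3)
– (weather, <1, 1, 1>) → (weather, 3)
– (today, <1, 1>) → (today, 2)
– (the, <1>) → (the, 1)
– (has, <1>) → (has, 1)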