Cloud Computing INFS3208

Background – Big Data Era
• The term “Big Data” has been in use since the 1990s.
• Data sets with sizes beyond the ability of commonly used software tools to capture, curate,
manage, and process data within a tolerable elapsed time.
• Reasons for Big Data:
– Hardware development: storage (much cheaper), CPUs (more cores)
– Internet bandwidth: 56 kbps vs. 100 Mbps
– Data generation:
 Transactional data (stock, shopping records in Woolworths/Coles)
 User-centric data (videos/images)
 Sensor-based data (cameras)
https://en.wikipedia.org/wiki/Big_data https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm

Background – Big Data Era
• Characteristics of Big Data – the 5 Vs:
– Volume (quantity) – from 4.4 trillion gigabytes to 44 trillion (more than doubling every two years)
– Variety (type) – structured vs. unstructured
– Velocity (speed)
– Veracity (quality)
– Value
• Data must be processed with advanced tools to reveal meaningful information.
– Data mining and machine learning
– Cloud computing
https://en.wikipedia.org/wiki/Big_data https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm

Background – Big Data Technologies
Distributed storage
– Huge volumes of data are stored on clusters of storage nodes
– Distributed file systems
– GFS, BigTable (Google) and HDFS/HBase (open-source in Apache Hadoop)
Distributed computing
– Clusters of computing nodes process data in a parallel manner
– Distributed computing models/frameworks
– MapReduce in Hadoop and Apache Spark
https://projects.apache.org https://en.wikipedia.org/wiki/The_Apache_Software_Foundation

Outline


• Introduction to Hadoop
– What is Hadoop & History
– Hadoop Ecosystem
• Hadoop Computation Model: MapReduce
– MapReduce Components & Paradigm
– MapReduce Workflow
– MapReduce Examples:
 Word Count
 Find the Highest/Average Temperature
 Find Word Length Distribution
 Find Common Friends
 Inverted Indexing
• Besides the Core: Hive and Pig


What is Hadoop?
• Apache Hadoop’s MapReduce and HDFS components were inspired by Google papers on MapReduce and Google File System (GFS).
• “an open-source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware” – Hortonworks
• Features:
– Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
– Structured and unstructured data
– Simple programming models
– High scalability and availability
– Fault tolerance (failures are common)
– Move computation rather than data (data locality)
Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: Simplified data processing on large clusters.” (2004).
Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google file system.” (2003).

Hadoop Core and Sub-modules


The core of Apache Hadoop consists of:
– Storage: Hadoop Distributed File System (HDFS),
– Computation model: MapReduce programming model.
The base Apache Hadoop framework is composed of the following modules:
– Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
– Hadoop Distributed File System (HDFS) – a distributed file-system;
– Hadoop YARN – a cluster resource manager;
– Hadoop HBase – a NoSQL database;
– Hadoop Hive – a data warehouse that supports a high-level SQL-like query language;
– Hadoop MapReduce – an implementation of the MapReduce programming model for big
data processing.
Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: Simplified data processing on large clusters.” (2004).
Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google file system.” (2003).

Additional Packages in Hadoop
• The term Hadoop is often used to refer not only to the base modules and sub-modules but also to the ecosystem.
• Some additional software packages that can be installed on top of or alongside Hadoop are also included:
– Apache Pig: is a high-level platform for creating programs that run on Apache Hadoop
– Apache Hive: is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis
– Apache HBase: is an open-source, distributed, versioned, non- relational database inspired by Google BigTable
– Apache Phoenix: is an open source, massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store.
– Apache Spark: is an in-memory computing platform
– etc.

Why use Hadoop?
• Need to process multi-petabyte (large-scale) datasets
• Data may not have a strict schema
• Expensive to build reliability into each application
• Nodes fail every day
• Need common infrastructure
• Very large distributed file system
• Assumes commodity hardware and heterogeneous OSes
• Optimized for batch processing

Brief History of Hadoop
Designed to answer the question: “How to process big data with reasonable cost and time?”
Hadoop was created by Doug Cutting and has its origins in Apache Nutch, an open source web search engine.
Inspired by Google’s GFS and MapReduce papers, development started and then moved to the new Hadoop subproject in Jan 2006.
Hadoop was named after a toy elephant belonging to Doug Cutting’s son.
The initial code comprised just about 5,000 lines for HDFS and about 6,000 lines for MapReduce.
Milestones:
– 2008 – Hadoop wins the Terabyte Sort Benchmark
– 2013 – Hadoop 1.1.2 and Hadoop 2.0.3 alpha
– 2014 – Hadoop 2.3; becomes a top-level Apache project
– 2019 – Apache Hadoop 3.2 available

Hadoop Ecosystem
• 2003 – Google File System (GFS)
• 2004 – MapReduce computation model (Google)
• 2006 – Hadoop was born
• 2006/April – HDFS + MapReduce in Hadoop 0.1.0
• 2006/April – Hadoop won sorting competition (1.8T on 188 nodes in 47.9 hours)
• 2006 – 2007 – Yahoo contributed to Hadoop
• 2008/March – HBase released
• 2008/Sept – Pig released
• 2009/June – Sqoop released
• 2010/Oct – Hive released
• 2011/Feb – Zookeeper released
• 2012 – Hadoop YARN released
• 2014/May – Spark released
[Diagram: Hadoop ecosystem stack – HDFS at the base; YARN and MapReduce above it; HBase, Hive, Pig, Sqoop, and Zookeeper alongside; Spark (ML/GraphX/Streaming/SQL) on top.]

Hadoop in the Wild

Hadoop is in use at most organizations that handle big data:
– Yahoo!
– Facebook
– Amazon
– Netflix
– and more…
Main applications using Hadoop:
– Advertisement (Mining user behavior to generate recommendations)
– Searches (group related documents)
– Security (search for uncommon patterns)

https://www.statista.com/statistics/593479/worldwide-hadoop-bigdata-market/

Outline


• Introduction to Hadoop
– What is Hadoop & History
– Hadoop Ecosystem
• Hadoop Computation Model: MapReduce
– MapReduce Components & Paradigm
– MapReduce Workflow
– MapReduce Examples:
 Word Count
 Find the Highest/Average Temperature
 Find Word Length Distribution
 Find Common Friends
 Inverted Indexing
• Besides the Core: Hive and Pig


MapReduce: A Real World Analogy
Coin deposit analogy:
– Mapper: categorize coins by their face values
– Reducer: count the coins of each face value in parallel

MapReduce Architecture: Components
• Client:
– Users’ MapReduce programs are submitted to the JobTracker via the Client
– Users can check the running status of a job through interfaces provided by the Client

MapReduce Architecture: Components
• JobTracker:
– Monitors resources and coordinates jobs
– Monitors the health of all TaskTrackers via heartbeats (transfers tasks to other nodes once a failure is found)

MapReduce Architecture: Components
• TaskTracker:
– Periodically sends heartbeats carrying resource information and task execution status to the JobTracker
– Receives and executes commands from the JobTracker (start new tasks or kill existing tasks)

MapReduce Architecture: Components
• Task:
– Map Tasks and Reduce Tasks
– Initiated by the TaskTracker

MapReduce Architecture ver1: Workflow
[Diagram: (1) Clients submit jobs to the JobTracker; (2) the JobTracker’s Task Scheduler, together with the Name Node, schedules the work; (3) tasks are dispatched to TaskTrackers; (4) each TaskTracker runs its assigned Map Tasks and Reduce Tasks.]

MapReduce Paradigm
• Map-Reduce is the data processing component of Hadoop.
• Map-Reduce programs transform lists of input data elements into lists of output data
elements.
• A Map-Reduce program runs its map and reduce tasks asynchronously
• MapReduce Job (an execution of a Mapper and Reducer across a data set) consists of
– the input data (submitted by users),
– the MapReduce Program (program logic written by users),
– and configuration info.

Inputs and Outputs
• The MapReduce framework operates exclusively on <key, value> pairs;
• The framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job.
• Input and Output types of a MapReduce job (e.g., word count):

Function: Map
– Input: <k1, v1>
– Output: List(<k2, v2>), e.g., <“a”, 1>, <“b”, 1>, <“c”, 1>
– Note: converts splits of data into a list of <key, value> pairs; each input <k1, v1> outputs a list of key/value pairs as intermediate results.

Function: Reduce
– Input: <k2, List(v2)>, e.g., <“a”, <1, 1, 1>>
– Output: List(<k3, v3>), e.g., <“a”, 3>, <“b”, 2>, <“c”, 4>
– Note: List(v2) in the intermediate result holds all the values of the same key k2.

Inputs and Outputs
• Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain: Map(k1, v1) → list(k2, v2)
– k1 stands for the line number, while v1 stands for the line’s contents
– E.g., (1, “Hello, world. Welcome to MapReduce world!”) → (“Hello”, 1), (“world”, 1), (“Welcome”, 1), (“to”, 1), (“MapReduce”, 1), (“world”, 1)
• The Map function is applied in parallel to every pair (keyed by k1) in the input dataset, i.e., each line is processed by the Map function:
– (2, “Hello, MapReduce. Welcome to world!”) → (“Hello”, 1), (“world”, 1) …
– (3, “Hello, Spark. Spark is better than MapReduce.”) → (“Hello”, 1), (“Spark”, 1) …
• A list of pairs (keyed by k2) is generated; k2 here is a word, not a line number.
• After that, the MapReduce framework collects all pairs with the same key (k2) from all lists and groups them together, creating one group for each key:
– (“Hello”, <1, 1, 1>), (“world”, <1, 1, 1, 1>), etc.

Inputs and Outputs
The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain: Reduce(k2, list (v2)) → (k3, v3)
• E.g. (“Hello”, <1, 1, 1>) → (”Hello”, 3), (“world”, <1, 1, 1, 1>) → (”world”, 4)
• Each Reduce call typically produces either one value v3 or an empty return.

Divide and Conquer
• No communications between Map tasks
• No communications between Reduce tasks
• Need shuffle process to transfer data to reducer
[Diagram: Splits 0–4 are each processed by a Map() task; the shuffle delivers the intermediate data to three Reduce() tasks, which produce Outputs 0–2.]

Example I – Word Count

Given a file consisting of the words “Deer, Bear, River, Car, Car, River, Deer, Car and Bear”, the overall MapReduce process looks like:

Input (K1, V1): Deer Bear River Car Car River Deer Car Bear

Splitting:
– Split 1: Deer Bear River
– Split 2: Car Car River
– Split 3: Deer Car Bear

Mapping – List(K2, V2):
– (Deer, 1), (Bear, 1), (River, 1)
– (Car, 1), (Car, 1), (River, 1)
– (Deer, 1), (Car, 1), (Bear, 1)

Shuffling – K2, List(V2):
– Bear, (1, 1)
– Car, (1, 1, 1)
– Deer, (1, 1)
– River, (1, 1)

Reducing:
– Bear, 2
– Car, 3
– Deer, 2
– River, 2

Final Result – List(K3, V3): Bear, 2; Car, 3; Deer, 2; River, 2

Example I – word count
• The input is split into three data-sets that are distributed among the map nodes.
• The words are tokenized and a hardcoded value of 1 is assigned to each word, i.e., each occurrence counts once.
• This creates a list of key-value pairs where the key is the word and the value is the number one. E.g., the first data-set yields three key-value pairs: Deer, 1; Bear, 1; River, 1.
• After the mapping process is complete, the shuffling and sorting tasks are executed so that all the tuples with the same key are combined together and sent to the reducer.
• On completion of the sorting and shuffling tasks, each reducer receives a unique key and the corresponding list of values, e.g., Car, [1,1,1]; Deer, [1,1]; etc.
• The reducer then counts the values in the list of each unique key and generates the (key, value) pair for that key, where the value now corresponds to the count.
• The output is written in (key, value) format to the output file.

Example I – word count
• Compile WordCount.java and create a jar
• Run the application
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code

Example I – word count
Main Function
• Create a configuration and a job instance
• Indicate Mapper, Combiner, Reducer
• Specify the input and output directories with FileInputFormat and FileOutputFormat
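A driver with exactly these steps appears in the WordCount example at the URL below; the sketch here follows that tutorial (TokenizerMapper and IntSumReducer are the classes shown on the Mapper and Reducer slides that follow):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();            // create a configuration
    Job job = Job.getInstance(conf, "word count");       // and a job instance
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);           // Mapper
    job.setCombinerClass(IntSumReducer.class);           // Combiner (local aggregation)
    job.setReducerClass(IntSumReducer.class);            // Reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}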
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code

DataFlow
• Data are stored in HDFS across two nodes as input files.
• InputFormat defines how these input files are split and read. It selects the files or other objects that are used for input. InputFormat creates InputSplit.
• InputSplit logically represents the data which will be processed by an individual Mapper.
• One map task is created for each split; thus the number of map tasks will be equal to the number of InputSplits.
• The split is divided into records and each record will be processed by the mapper.
[Diagram: on Node 1 and Node 2, InputFormat divides the stored input files into splits.]

DataFlow
• RecordReader (RR) communicates with the InputSplit and converts the data into key-value pairs suitable for reading by the mapper.
• RecordReader (RR) will actually read data in HDFS according to the split information and pass key-value pairs to mappers.
• Mapper processes each input record (from RecordReader) and generates new key-value pair, which is completely different from the input pair.
• The output of Mapper is also known as intermediate output which is written to the local disk.
• The output of the Mapper is not stored on HDFS as this is temporary data and writing on HDFS will create unnecessary copies.
• The mappers’ output is passed to the combiner for further processing.
[Diagram: on each node, a RecordReader (RR) reads each split and feeds <key, value> pairs to the Map tasks, which emit List(<k2, v2>) as intermediate output.]

DataFlow
• Combiner (aka mini-reducer) performs local aggregation on the mappers’ output, which helps to minimize the data transfer between mapper and reducer (we will see reducer below).
Example: Mapper 1 processes “cloud is in cloud”, Mapper 2 processes “machine learning is learning machine”.

Without a combiner – 9 intermediate key/value pairs are shuffled:
– Mapper 1: (cloud, 1), (is, 1), (in, 1), (cloud, 1)
– Mapper 2: (machine, 1), (learning, 1), (is, 1), (learning, 1), (machine, 1)

With combiners – only 6 intermediate key/value pairs, improving overall performance:
– Combiner 1: (cloud, 2), (is, 1), (in, 1)
– Combiner 2: (machine, 2), (learning, 2), (is, 1)
– Reducer: (cloud, 2), (is, 2), (machine, 2), (learning, 2), (in, 1)

DataFlow
• Partitioner takes the output from the combiners and performs partitioning.
• Keys of the intermediate data are partitioned according to a hash function.
• All keys within the same partition go to the same reducer.
• The number of partitions equals the number of reducers (see the sketch after the diagram below).
[Diagram: intermediate keys such as <“apple”, 1>, <“able”, 1>, <“accept”, 1>, <“boy”, 2>, <“bear”, 1>, <“car”, 1>, <“country”, 2> are mapped by the hash/partition function to partitions P1–P3, one per reducer.]
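For reference, Hadoop’s built-in HashPartitioner boils down to the logic below (a minimal sketch; the class name WordPartitioner is hypothetical, and a custom partitioner would be registered with job.setPartitionerClass(WordPartitioner.class)):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask the sign bit so the result is non-negative, then map the
    // key's hash onto one of the numReduceTasks partitions.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}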

DataFlow
• Shuffling is the process by which the intermediate output from the mappers is transferred to the reducers.
• The reduce side has three primary phases: shuffle, sort, and reduce.
• OutputFormat determines how the output key-value pairs are written to the output files.
• The final results are written to HDFS.
• Note that the intermediate data from mappers are not immediately written to local storage, but buffered in memory.
• A mechanism called spill periodically writes the buffered intermediate data from memory to disk.
[Diagram: on each node, Map output passes through the Combiner and Partitioner; the shuffle & sort phase delivers each partition to a Reduce task, whose results are written via OutputFormat to HDFS.]

Example I – word count
Mapper
• Customize mapper based on the Mapper class provided by Hadoop
• Define a map method that contains the mapping logic.
• Output: e.g., (line1, Bear) -> (Bear, 1)
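The mapper from the linked tutorial looks like the following (a sketch; this static class nests inside the WordCount class from the driver slide):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Called once per record: key is the offset of the line, value is the line itself.
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);   // emit (word, 1) for every token
    }
  }
}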
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code

Example I – word count
Reducer
• Customize reducer based on the Reducer class provided by Hadoop
• Define a reduce method that contains the reducing logic.
• Output: e.g., (Bear, <1,1,1>)->(Bear, 3)
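The matching reducer from the linked tutorial (it also serves as the combiner in the driver, since summing partial sums is still a sum):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  // Called once per unique key with all values emitted for that key.
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();           // add the 1s (or partial sums from the combiner)
    }
    result.set(sum);
    context.write(key, result);   // emit (word, total count)
  }
}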
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code


Revisit Word Count Example:
• Assume there are four text files on different nodes and you want to count the word frequencies using the MapReduce model.
• Text 1 (node 1): the weather is good
• Text 2 (node 2): today is good
• Text 3 (node 3): good weather is good
• Text 4 (node 4): today has good weather

Revisit Word Count Example: Map phase
Each node runs Map, sorts its output, then combines, producing the intermediate data:
– Text 1: “the weather is good” → Map: (the,1) (weather,1) (is,1) (good,1) → Sort: (good,1) (is,1) (the,1) (weather,1) → Combine: (good,1) (is,1) (the,1) (weather,1)
– Text 2: “today is good” → Map: (today,1) (is,1) (good,1) → Sort/Combine: (good,1) (is,1) (today,1)
– Text 3: “good weather is good” → Map: (good,1) (weather,1) (is,1) (good,1) → Sort: (good,1) (good,1) (is,1) (weather,1) → Combine: (good,2) (is,1) (weather,1)
– Text 4: “today has good weather” → Map: (today,1) (has,1) (good,1) (weather,1) → Sort/Combine: (good,1) (has,1) (today,1) (weather,1)

Revisit Word Count Example: Reduce phase
Shuffle & Sort groups the intermediate data from all nodes by key:
(good, (1,1,2,1)) (has, (1)) (is, (1,1,1)) (the, (1)) (today, (1,1)) (weather, (1,1,1))
Reduce sums each group:
(good, 5) (is, 3) (has, 1) (the, 1) (today, 2) (weather, 3)
Note: the hash function needs to distribute records evenly over the partitions.

Example II: Find the Highest Temperature
Problem: find the maximum monthly temperature for each year from weather reports.
Input: a set of (YYYYMM, temperature) records:
– (201707, 37.8), (201706, 32.2)
– (201508, 32.2), (201607, 37.8)
– (201708, 26.7), (201606, 26.7)
Question: write down the Map and Reduce functions to solve this problem.
• Assume we split the input by line:
Split 1: (201707, 37.8), (201706, 32.2)
Split 2: (201508, 32.2), (201607, 37.8)
Split 3: (201708, 26.7), (201606, 26.7)

Example II: Find the Highest Temperature
Input → Map (extract the year as the key):
– Split 1: (201707, 37.8), (201706, 32.2) → (2017, 37.8), (2017, 32.2)
– Split 2: (201508, 32.2), (201607, 37.8) → (2015, 32.2), (2016, 37.8)
– Split 3: (201708, 26.7), (201606, 26.7) → (2017, 26.7), (2016, 26.7)
Combiner (local maximum per year):
– Split 1: (2017, 37.8)
– Split 2: (2015, 32.2), (2016, 37.8)
– Split 3: (2017, 26.7), (2016, 26.7)
Shuffle/Sort: (2015, [32.2]), (2016, [26.7, 37.8]), (2017, [26.7, 37.8])
Reduce (global maximum per year): (2015, 32.2), (2016, 37.8), (2017, 37.8)

Example III: Find the Average Temperature (1)
Here the combiner (incorrectly) averages its local values:
Input → Map:
– Split 1: (201707, 37.8), (201706, 32.2) → (2017, 37.8), (2017, 32.2) → Combiner (local average): (2017, 35)
– Split 2: (201508, 32.2), (201607, 37.8) → (2015, 32.2), (2016, 37.8)
– Split 3: (201708, 26.7), (201606, 26.7) → (2017, 26.7), (2016, 26.7)
Shuffle/Sort: (2015, [32.2]), (2016, [26.7, 37.8]), (2017, [26.7, 35])
Reduce (average): (2015, 32.2), (2016, 32.25), (2017, 30.85)
But the correct 2017 average is (37.8 + 32.2 + 26.7)/3 = 32.23, not 30.85 – an average of averages is not the overall average:
Avg = (a + b + c)/3 ≠ ((a + b)/2 + c)/2 = (a + b + 2c)/4

Example III: Find the Average Temperature (2)
Without a combiner, all monthly values are shuffled to the reducers:
Input → Map:
– Split 1: (201707, 37.8), (201706, 32.2) → (2017, 37.8), (2017, 32.2)
– Split 2: (201508, 32.2), (201607, 37.8) → (2015, 32.2), (2016, 37.8)
– Split 3: (201708, 26.7), (201606, 26.7) → (2017, 26.7), (2016, 26.7)
Shuffle/Sort: (2015, [32.2]), (2016, [26.7, 37.8]), (2017, [37.8, 32.2, 26.7])
Reduce (average): (2015, 32.2), (2016, 32.25), (2017, 32.23)
This is correct, but more intermediate data must be transferred over the network.

Example III: Find the Average Temperature (3)
A correct combiner emits a <sum, count> pair instead of an average:
Input → Map (emit (year, <temperature, 1>)):
– Split 1: (201707, 37.8), (201706, 32.2) → (2017, <37.8, 1>), (2017, <32.2, 1>) → Combiner: (2017, <70, 2>)
– Split 2: (201508, 32.2), (201607, 37.8) → (2015, <32.2, 1>), (2016, <37.8, 1>)
– Split 3: (201708, 26.7), (201606, 26.7) → (2017, <26.7, 1>), (2016, <26.7, 1>)
Shuffle/Sort: (2015, [<32.2, 1>]), (2016, [<26.7, 1>, <37.8, 1>]), (2017, [<70, 2>, <26.7, 1>])
Reduce (total sum / total count): (2015, 32.2), (2016, 32.25), (2017, 32.23)


Example III: Find the Average Temperature (Pseudo code)

Map(key, value){
  // key: line number
  // value: tuples in a line
  for each tuple t in value:
    Emit(t->year, t->temperature);}

Combiner(key, list of values){
  // key: year
  // list of values: a list of monthly temperatures
  int total_temp = 0;
  int count = 0;
  for each v in values:
    total_temp = total_temp + v;
    count = count + 1;
  Emit(key, <total_temp, count>);}

Reduce(key, list of values){
  // key: year
  // list of values: a list of <sum, count> tuples
  int total_temp = 0;
  int total_count = 0;
  for each v in values:
    total_temp = total_temp + v->sum;
    total_count = total_count + v->count;
  Emit(key, total_temp/total_count);}
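A minimal Java sketch of the same idea (class names are hypothetical; for brevity the <sum, count> pair is packed into a comma-separated Text value, where a custom Writable would be cleaner). The mapper would emit (year, "temperature,1") for every monthly record:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner: aggregates local <sum, count> pairs but does NOT divide yet.
public static class AvgCombiner extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    double sum = 0; long count = 0;
    for (Text v : values) {                         // each value is "sum,count"
      String[] p = v.toString().split(",");
      sum += Double.parseDouble(p[0]);
      count += Long.parseLong(p[1]);
    }
    context.write(key, new Text(sum + "," + count)); // emit partial <sum, count>
  }
}

// Reducer: divides the global sum by the global count.
public static class AvgReducer extends Reducer<Text, Text, Text, DoubleWritable> {
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    double sum = 0; long count = 0;
    for (Text v : values) {
      String[] p = v.toString().split(",");
      sum += Double.parseDouble(p[0]);
      count += Long.parseLong(p[1]);
    }
    context.write(key, new DoubleWritable(sum / count)); // true average per year
  }
}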

Example IV: Find Word Length Distribution
• Given a set of documents, use Map-Reduce model to find the length distribution of all words contained in the documents.
• Example:
• Input Doc: {This is a test data for the word length distribution problem}
• Output: {(12,1), (7,1), (6,1), (4,4), (3,2), (2,1), (1,1)}
Word          Length    Word            Length
this          4         the             3
is            2         word            4
a             1         length          6
test          4         distribution    12
data          4         problem         7
for           3

Example IV: Find Word Length Distribution
Map(key, value){
  // key: document name
  // value: words in a document
  for each word w in value:
    Emit(length(w), w);}

Reduce(key, list of values){
  // key: length of a word
  // list of values: a list of words with the same length
  Emit(key, size_of(values));}

Flow: {This is a test data for the word length distribution problem}
→ Map → (4, this), (2, is), (1, a), …, (12, distribution), (7, problem)
→ Combiner → (4, <this, test, data, word>), (2, <is>), …, (7, <problem>)
→ Reduce → {(12,1), (7,1), (6,1), (4,4), (3,2), (2,1), (1,1)}

Example V: Find Common Friends
Given a group of people on online social media (e.g., Facebook), each has a list of friends, use Map-Reduce to find common friends of any two persons who are friends

Example V: Find Common Friends
Simple example – a friendship graph over A, B, C, D:
Input:
A -> B,C,D
B -> A,C,D
C -> A,B
D -> A,B
Output (via MapReduce):
(A,B) -> C,D
(A,C) -> B
(A,D) -> B
…

Example V: Find Common Friends
Map(key, value){
  // key: person_id
  // value: the list of friends of the person
  for each friend f_id in value:
    Emit(sorted(<key, f_id>), value);}

E.g., (A, <B,C,D>) => (<A,B>, <B,C,D>), (<A,C>, <B,C,D>), (<A,D>, <B,C,D>)
      (B, <A,C,D>) => (<A,B>, <A,C,D>), (<B,C>, <A,C,D>), (<B,D>, <A,C,D>)

Reduce(key, list of values){
  // key: a friend pair <p1, p2>
  // list of values: the two friend lists associated with the pair <p1, p2>
  for v1, v2 in values:
    common_friends = v1 intersects v2;
    Emit(key, common_friends);}

Example V: Find Common Friends
Input:
A -> B,C,D
B -> A,C,D
C -> A,B
D -> A,B

Map:
(A,B) -> B,C,D; (A,C) -> B,C,D; (A,D) -> B,C,D   (from A’s list)
(A,B) -> A,C,D; (B,C) -> A,C,D; (B,D) -> A,C,D   (from B’s list)
(A,C) -> A,B; (B,C) -> A,B                       (from C’s list)
(A,D) -> A,B; (B,D) -> A,B                       (from D’s list)

Reduce (intersect the two lists of each pair):
(A,B) -> C,D
(A,C) -> B
(A,D) -> B
(B,C) -> A
(B,D) -> A

Suggest Friends ☺

Example VI: Inverted Index for Large Collection of Documents
• An inverted index is a database index storing a mapping from content (e.g., words) to its locations (e.g., pages) in a document or a set of documents.
• The purpose of an inverted index is to allow fast full-text searches.
• Building an inverted index becomes very challenging for documents on the Internet:
– Documents come in all different formats
– Millions of words and short phrases in documents
– Billions of short or long documents

Example VI: Inverted Index for Large Collection of Documents
Algorithm:
Step 1. For each page, get a list of all words on that page.
Step 2. For each word from Step 1, get a list of pages on which it appears.
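A minimal mapper/reducer sketch of these two steps (class names are hypothetical; here the page id is taken from the name of the file the input split comes from):

import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Step 1: emit (word, page) for every word on the page.
    String page = ((FileSplit) context.getInputSplit()).getPath().getName();
    for (String word : value.toString().split("\\s+")) {
      if (!word.isEmpty()) context.write(new Text(word), new Text(page));
    }
  }
}

public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // Step 2: collect the distinct pages on which this word appears.
    Set<String> pages = new TreeSet<>();
    for (Text v : values) pages.add(v.toString());
    context.write(key, new Text(String.join(", ", pages)));
  }
}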

Outline


• Introduction to Hadoop
– What is Hadoop & History
– Hadoop Ecosystem
• Hadoop Computation Model: MapReduce
– MapReduce Components & Paradigm
– MapReduce Workflow
– MapReduce Examples:
 Word Count
 Find the Highest/Average Temperature
 Find Word Length Distribution
 Find Common Friends
 Inverted Indexing
• Besides the Core: Hive and Pig


Hive and Pig
Hadoop is great for large-data processing!
• But writing Java programs for everything is verbose and slow
• Not everyone wants to (or can) write Java code
Solution: develop higher-level data processing languages that are compiled down to Hadoop jobs
• Hive:
– Query language is HQL (HiveQL), a variant of SQL
– Tables stored on HDFS as flat files
– Developed by Facebook, now open source
• Pig:
– Scripts are written in Pig Latin, a dataflow language (a bit like Perl)
– Developed by Yahoo!, now open source
– Roughly 1/3 of all Yahoo! internal jobs
[Diagram: Pig and Hive sit on top of MapReduce.]

Hive
• Apache Hive supports analysis of large datasets stored in Hadoop’s HDFS and compatible file systems such as Amazon S3 filesystem.
• It provides a SQL-like query language called HiveQL with schema-on-read and transparently converts queries into MapReduce jobs.
• HiveQL does not strictly follow the full SQL-92 standard but offers some extensions not found in SQL, such as multitable inserts.
• Later versions of Hive add ACID transaction support.
• Hive is a data warehouse that is good for OLAP rather than OLTP
http://demo.gethue.com/

Online Transactional Processing (OLTP) vs Online Analytical Processing (OLAP)
• Online transaction processing, or OLTP, is a class of information systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing.
– Processes operational data
– Quick responses to user requests
– Concurrently used by many users
– Frequent updates (UPDATE, INSERT, DELETE) e.g., booking airline tickets.
– HBase is a NoSQL database for OLTP
• Online Analytical Processing, or OLAP, is a class of systems designed to respond to multi-dimensional analytical (MDA) queries.
– Responses need not be prompt
– Hive is a data warehouse suitable for OLAP

Hive Data Model
Data in Hive is organized into:
• Tables
– Analogous to relational tables
– Each table has a corresponding directory in HDFS
– Data serialized and stored as files within that directory
• Partitions
– Each table can be broken into partitions
– Partitions determine distribution of data within subdirectories
• Buckets
– Data in each partition divided into buckets
– Based on a hash function of the column
– Each bucket is stored as a file in partition directory
https://data-flair.training/blogs/hive-data-model/
CREATE TABLE mytable (
  name string,
  city string,
  employee_id int
)
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (employee_id) INTO 256 BUCKETS;
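A query against this table might look as follows (an illustrative sketch, not from the slides; filtering on the partition columns lets Hive scan only the matching subdirectories):

SELECT name, city
FROM mytable
WHERE year = '2021' AND month = '09'  -- partition pruning: only year=2021/month=09 subdirectories are read
LIMIT 10;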

Hive Data Model – Example
Data in Hive is organized into partitions and buckets:
[Diagram: the Student_record table is stored under /hive/warehouse/Student_record; partitions ITEE, CE, and MME map to subdirectories such as /hive/warehouse/Student_record/ITEE; within each partition, rows (ID, Name, Year) are hashed into Bucket 1 and Bucket 2 files.]
https://data-flair.training/blogs/hive-data-model/

Apache Pig
• Apache Pig is a high-level platform for creating programs that run on Apache Hadoop.
• The language for this platform is called Pig Latin.
• Pig can execute its Hadoop jobs in MapReduce or Apache Spark.
• Pig Latin can be regarded as
– High-level Map/Reduce commands pipeline
– Very intuitive and friendly to users who dislike Java
• Pig Latin can be extended using user-defined functions (UDFs)
– the user can write in Java, Python, JavaScript, etc
– then call directly from the language.
• Limitations:
– No loops and no conditionals (IF .. THEN)
Pig Slides adapted from Olston et al.

Pig Latin Script
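The original slide shows the script as an image; below is a representative Pig Latin script in the style of Olston et al., finding the top 5 most-visited pages by users aged 18–25 (names and file paths are illustrative):

-- Load users and keep the 18-25 age group
users  = LOAD 'users' AS (name, age);
fltrd  = FILTER users BY age >= 18 AND age <= 25;
-- Load page visits and join on user name
pages  = LOAD 'pages' AS (user, url);
joined = JOIN fltrd BY name, pages BY user;
-- Count clicks per URL
grpd   = GROUP joined BY url;
smmd   = FOREACH grpd GENERATE group, COUNT(joined) AS clicks;
-- Keep the 5 most-clicked URLs
srtd   = ORDER smmd BY clicks DESC;
top5   = LIMIT srtd 5;
STORE top5 INTO 'top5sites';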
Pig Slides adapted from Olston et al.

Java vs. Pig Latin
• 1/20 the lines of code
• 1/16 the development time
• Performance on par with raw Hadoop!
[Bar charts: lines of code and development time in minutes, raw Hadoop vs. Pig.]
Pig Slides adapted from Olston et al.

Hive vs. Pig
• Similarities:
– Both are high-level languages that work on top of the MapReduce framework
– They can coexist, since both use the underlying HDFS and MapReduce
• Differences:
– Language
 Pig is procedural (A = LOAD ‘mydata’; DUMP A;)
 Hive is declarative (SELECT * FROM A;)
– Work type
 Pig is more suited to ad-hoc analysis (e.g., on-demand analysis of clickstream search logs)
 Hive is a reporting tool (e.g., weekly BI reporting)
– Users
 Pig – researchers, programmers (build complex data pipelines, machine learning)
 Hive – business analysts

Head-to-Head (the Bee, the Pig, the Elephant)
Version: Hadoop – 0.18x, Pig:786346, Hive:786346

References
1. https://data-flair.training/blogs/hadoop-mapreduce-tutorial/
2. CSE 40822/60822, Cloud Computing, https://www3.nd.edu/~dthain/courses/cse40822/fall2018/
3. MapReduce: Simplified Data Processing on Large Clusters
4. Hadoop Tutorial (https://www.tutorialspoint.com/hadoop/index.htm)

Tutorial and Practical Sessions in Week 9
Tutorial:
1. What are the Hadoop core components?
2. Why use Hadoop?
3. What is MapReduce?
4. Explain what shuffling is in MapReduce.
5. What are Hive and Pig? Compare their differences.
Practical:
I. A1 review & discussion
II. Individual project consultation

Updates
• Semester Break next week
• Final Exam 17/11/2021 8:00am
• Book your exam with ProctorU from Friday 1 October, but no later than 72 hours prior to your scheduled exam
• Refer to my.UQ -> Online supervised exams -> Register with ProctorU (check the latest announcement)

Next (Week 10) Topic:
Introduction to