
INFS5710 IT Infra. for BA

Chapter 14
Hadoop, MapReduce and NoSQL 14-2 to 14-3


pp. 664 – 679
Last week we looked at Big Data; this week we will look at Hadoop, MapReduce and NoSQL.

[Course overview diagram – Database Systems Infrastructure: Entity Relationship Model (ERM) and Normalisation lead to the Relational Database and Flat Files; External Data (e.g. Excel) goes through ETL (Data Cleansing) into the Star Schema (De-Normalised) Data Warehouse; Data flows from DW to Big Data, or vice versa, or both; Unstructured Data (Social Media) and Structured Data (Internet of Things (IoT)) arrive via Data Streaming; Machine Learning. Prepared by , Feb. 2021]
Today, we will cover the last bit you will learn in this course on Big Data. Now, you see how everything in the diagram is connected.
In summary, you learned normalised database design in Week 4 using the ERD, i.e. the Entity-Relationship Diagram, which you explored in Weeks 2 and 3. You learned to use Oracle to create an ERD in the lab.
Moreover, you have been learning SQL (pronounced "sequel") in the lab workshops. You learned about the data warehouse in Week 5.
Last week, you learned the characteristics of Big Data and how Big Data has influenced today's society through the video, "The Human Face of Big Data".
Again, there are extra materials for this lecture – all the text should be on the slides.
This week, we will talk about Hadoop, MapReduce and NoSQL, the technology behind Big Data, and about how data can be retrieved from a relational database or saved back to it. A data warehouse could be a relational database.
[Big Data stack diagram: data ("Not Normalised"); Spark and NoSQL (and other tools); Hadoop Distributed File System (HDFS) and MapReduce; Reporting (Business Intelligence and Visualisation) and Business Analysis (End Users).
Note: an In-Memory Database (e.g. SAP HANA) is an alternative data model not shown here.]

Hadoop

A software framework provides a standard way to build and deploy applications.
• De facto standard for most Big Data storage and processing
• Java-based framework for distributing and processing very large data sets across clusters of computers
• Most important components:
• Hadoop Distributed File System (HDFS): low-level distributed file processing system that can be used directly for data storage
• MapReduce: programming model that supports processing large data sets
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.

Hadoop Distributed File System (HDFS)
• Approach based on several key assumptions:
• High volume – The default block size is 64 MB and can be configured to even larger values. Big data is stored in blocks across multiple devices, so scalability becomes important, and large blocks avoid fragmentation
• Write-once, read-many – This model simplifies concurrency issues and improves data throughput
• Streaming access – Hadoop is optimized for batch processing of entire files as a continuous stream of data (from the start to the end of each file, without random seeks)
• Fault tolerance – HDFS is designed to replicate data across many different devices, so that when one fails, the data is still available from another device

Hadoop Distributed File System (HDFS)
• Uses several types of nodes (computers):
• Data node – stores the actual file data
• Name node – contains the file system metadata
• Client node – makes requests to the file system as needed to support user applications
• Data node communicates with name node by regularly sending block reports (which blocks are on the data node) and heartbeats (to inform the name node of the data node's status)

Replication factor = 3

MapReduce

• Framework used to process large data sets across clusters (divide and conquer)
• Breaks down complex tasks into smaller subtasks distributed over nodes, performing the subtasks and producing a final result
• Map function takes a collection of data and sorts and filters it into a set of key-value pairs – (key, value) or (attribute, value)
• Mapper program performs the map function
• Reduce function summarises the results of the map function to produce a single result; it needs an "objective", e.g., sum, mean, max, or min
• Reducer program performs the reduce function

• Implementation complements the HDFS structure
• Uses a job tracker, a central control program, to accept, distribute, monitor and report on jobs in a Hadoop environment (divide and conquer)
• Task tracker is a program in MapReduce responsible for running map and reduce tasks on a node
• The system uses batch processing, which runs a job (a sequence of tasks) from beginning to end with no user interaction

Determine the total number of units of each product that has been sold.
This task would be straightforward if the invoice data were stored in a relational DB; in a big data store, expect data redundancy.
The task is to look for p_code and line_unit as key-value pairs.
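The map-then-reduce idea behind this example can be sketched in plain Python (a toy sketch, not Hadoop code; the invoice rows are made up, and the field names p_code and line_unit follow the slide):

```python
from collections import defaultdict

# Toy invoice lines: each holds a product code and the units sold on that line.
invoice_lines = [
    {"p_code": "P101", "line_unit": 3},
    {"p_code": "P102", "line_unit": 1},
    {"p_code": "P101", "line_unit": 2},
]

# Map: emit a (key, value) pair = (p_code, line_unit) for every invoice line.
pairs = [(line["p_code"], line["line_unit"]) for line in invoice_lines]

# Reduce: the "objective" here is sum - total units per product.
totals = defaultdict(int)
for p_code, units in pairs:
    totals[p_code] += units

print(dict(totals))  # {'P101': 5, 'P102': 1}
```

In real Hadoop the map and reduce steps run as separate programs on different nodes; here they are collapsed into one script to show the key-value flow.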

Another Example: Analysis of Weather Dataset [AE1]
• Data format
• Line-oriented ASCII format
• Each record has many elements
• We focus on the temperature element
• What is the highest recorded global temperature for each year in the dataset?
Year Temperature
Contents of data files
Acknowledgement: This slide has permission from Dr Xin Cao to use in the teaching of INFS5710.
List of data files
0067011990999991950051507004…9999999N9+00001+99999999999… 0043011990999991950051512004…9999999N9+00221+99999999999… 0043011990999991950051518004…9999999N9-00111+99999999999… 0043012650999991949032412004…0500001N9+01111+99999999999… 0043012650999991949032418004…0500001N9+00781+99999999999…
This example is by my colleague Xin; we co-teach another course on Big Data.
If you want to gain more big data technical skills, you can enrol in COMP9313 (PG) in CSE.
After listening to his presentation, I think his example is better – it explains in more detail how things work underneath.
This is a bit more detailed and "heavier" than the previous example. It uses somewhat newer material and is more up-to-date than the textbook.
In this example, we are looking for the maximum temperature for each year. In the input file, the year and temperature appear at different locations on each row.

Solve this problem on one node [AE2]
• Keep a hash table
• Read the data line by line
• For each line: get the year and temperature, check the current maximum temperature for the year, and update it accordingly
We first look at one node, i.e. one data node, and how the data is processed.
The process starts with a hash table. A hash table uses a hash function to compute an index into the table, so you can search very quickly.
Then, you read the data line by line. For each line, you retrieve the year and temperature.
Lastly, you compare against the current maximum temperature for the year, and if the new reading is higher than the current maximum, you update the table accordingly.
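The one-node procedure above can be sketched in Python, using a dict as the hash table (the (year, temperature) records are illustrative, taken from the sample data file):

```python
# One-node sketch: a hash table (Python dict) keyed by year.
max_temp = {}  # year -> highest temperature seen so far

# Illustrative (year, temperature) records, read "line by line".
records = [("1950", 0), ("1950", 22), ("1950", -11), ("1949", 111), ("1949", 78)]

for year, temp in records:
    # Update the current maximum for the year if this reading is higher.
    if year not in max_temp or temp > max_temp[year]:
        max_temp[year] = temp

print(max_temp)  # {'1950': 22, '1949': 111}
```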

Solve this problem on multiple nodes [AE3]
• You need to first divide the data into several parts and distribute them to the nodes
• On each node, you need to maintain a hash table
• The nodes do the following task in parallel: for each line, get the year and temperature, check the current maximum temperature for the year, and update it accordingly
• After all the nodes find the "local" maximum temperatures stored in their hash tables, aggregate the results on one node to compute the maximum temperature of each year
Same as before, except you do it on multiple nodes.
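The multi-node version can be sketched the same way: each "node" builds its own local hash table, then one node aggregates the local maxima (the two blocks of records below are illustrative):

```python
# Multi-node sketch: each node computes a local maximum per year in parallel,
# then one node aggregates the local maxima into the global answer.
def local_max(records):
    table = {}  # the node's local hash table: year -> local maximum
    for year, temp in records:
        if year not in table or temp > table[year]:
            table[year] = temp
    return table

# Illustrative blocks of (year, temperature) records held by two nodes.
node1 = [("1950", 0), ("1950", 22), ("1949", 111)]
node2 = [("1950", -11), ("1949", 78)]

# Aggregate the "local" maxima on one node.
result = {}
for table in (local_max(node1), local_max(node2)):
    for year, temp in table.items():
        if year not in result or temp > result[year]:
            result[year] = temp

print(result)  # {'1950': 22, '1949': 111}
```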

Maximum Temperature [AE4]
Weather dataset
[Diagram: each data node reads temperatures from its block in parallel and keeps a local maximum temperature; the local maxima are then aggregated into the maximum temperature of each year.]
This is an overall picture. All the nodes read the lines and then update the maximum temperature of each year.

MapReduce Algorithm Design [AE5]
• What does a mapper do?
• Pull out the year and the temperature
• Indeed, in this example the map phase is simply a data preparation phase
• Drop bad records (filtering)
(Input File → Input of Map Function (key, value))
Output of Map Function
(key, value)
Now, let's look in more detail at how this is done.
After reading the lines from the input file, the map function maps the year and temperature into output (key, value) pairs, i.e. (year, temperature).
In the example, the map function retrieves the data as (1950, 0), (1950, 22), (1950, -11), etc.
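A mapper for these weather records can be sketched in Python. The fixed-width column positions below (year at columns 15–18, signed temperature at 87–91, quality code at 92) are an assumption based on the standard NCDC weather-record layout used in this classic example; treat them as illustrative:

```python
def mapper(line):
    # Assumed NCDC fixed-width positions (illustrative, not verified against
    # the course data files): year at 15-18, signed temperature at 87-91,
    # quality code at column 92.
    year = line[15:19]
    temp = int(line[87:92])  # e.g. "+0022" -> 22, "-0011" -> -11
    quality = line[92]
    if temp != 9999 and quality in "01459":  # drop bad records (filtering)
        yield (year, temp)

# A synthetic record padded to the assumed layout.
record = "0" * 15 + "1950" + "0" * 68 + "+0022" + "1"
print(list(mapper(record)))  # [('1950', 22)]
```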

MapReduce Algorithm Design [AE6]
• The output from the map function is processed by the MapReduce framework, which sorts and groups the key-value pairs by key (Sort and Group By)
• What does a reducer do?
• Reducer input: (year, [temperature1, temperature2, temperature3, …])
• The reduce function iterates through the list and picks the maximum value (Reduce)
The data is then sorted and grouped into key-value pairs by key; so now you have (1950, [0, 22, -11]) and (1949, [111, 78]). Note that the framework keeps every mapped value for a key, duplicates included.
The next stage is the reduce function. The reducer iterates through each list and keeps the maximum temperature for the year, reducing (1950, [0, 22, -11]) to (1950, 22).
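The sort-and-group step plus the reducer can be sketched as follows (the mapped pairs are the slide's example values; in real Hadoop the framework does the grouping for you):

```python
from collections import defaultdict

# Shuffle sketch: the framework sorts and groups mapper output by key.
mapped = [("1950", 0), ("1950", 22), ("1950", -11), ("1949", 111), ("1949", 78)]
grouped = defaultdict(list)
for year, temp in mapped:
    grouped[year].append(temp)
# grouped is now {'1950': [0, 22, -11], '1949': [111, 78]}

def reducer(year, temperatures):
    # Reduce: iterate through the grouped value list and keep the maximum.
    return (year, max(temperatures))

print([reducer(y, ts) for y, ts in grouped.items()])
# [('1950', 22), ('1949', 111)]
```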

Combiner [AE7]
• Combiner aims to reduce the mapper output, thus reducing the cost of data transfer
• We can see that the mapper outputs three temperatures for year 1950. However, this is unnecessary: the mapper can output a "local" maximum temperature for each year, rather than store all temperatures of the year
• With a combiner, the mapper output is (1950, 22) and (1949, 111)
• Like the reducer, the combiner needs an "objective", e.g., sum, mean, max, or min
This is not in the text book.
Let's look in more detail at how this is done. The Combiner step comes after the Map phase but before the Reduce phase.
The combiner sorts and groups the years together, i.e. all the keys for 1950 are grouped together.
In this example, the combined mapper output will be (1950, 22) and (1949, 111): one local maximum temperature for each year in the original input file.
Note that here we are looking at one node only. The other nodes do exactly the same, but may produce different output because they hold different blocks: another data node might, say, have (1950, 10) and (1959, 5).
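A combiner with a "max" objective can be sketched as a local pre-reduction on one node's mapper output (the pairs are the slide's example values):

```python
def combiner(mapped_pairs):
    # Combiner: on one node, keep only the "local" maximum per key,
    # shrinking the mapper output before it is transferred to the reducers.
    local = {}
    for year, temp in mapped_pairs:
        if year not in local or temp > local[year]:
            local[year] = temp
    return list(local.items())

# Mapper output on Data Node 1, before the combiner runs.
node1_output = [("1950", 0), ("1950", 22), ("1950", -11),
                ("1949", 111), ("1949", 78)]
print(combiner(node1_output))  # [('1950', 22), ('1949', 111)]
```

For a "max" objective the combiner can safely be the same function as the reducer, because the maximum of local maxima is the global maximum; that is not true of every objective (e.g. mean).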

Partitioner [AE8]
• Partitioner controls the partitioning of the keys of the intermediate map outputs.
• The key (or a subset of the key) is used to derive the partition, typically by a hash function.
• The total number of partitions is the same as the number of reduce tasks for the job.
• This controls which reduce task an intermediate key (and hence the record) is sent to for reduction.
• System uses HashPartitioner by default:
• hash(key) mod R, where R is the number of partitions
The remainder of hash(key) divided by R, which can be 0, 1, 2, …, R-1.
Again, this is not in the textbook. The partitioner controls the partitioning of the keys of the intermediate map outputs. The partition phase, again, takes place after the Map phase but before the Reduce phase.
In this step, you look across all the nodes: all the records for the same year, i.e. year 1950 from Data Node 1, Data Node 2 and so on, are routed to the same partition, where the highest temperature can then be found.
The number of partitions is equal to the number of reducers; in other words, the partitioner divides the data according to the number of reducers.
Therefore, the data in a single partition is processed by a single reducer.
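The default hash(key) mod R rule can be sketched like this (md5 is used here only to get a hash that is stable across runs; Hadoop's actual HashPartitioner uses Java's hashCode):

```python
import hashlib

def partition(key, R):
    # HashPartitioner sketch: hash(key) mod R picks the reduce task,
    # giving a partition number in 0, 1, ..., R-1.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % R

R = 3  # number of partitions = number of reduce tasks
for key in ["1949", "1950", "1951"]:
    print(key, "-> reduce task", partition(key, R))
```

The important property is that the same key always lands in the same partition, so every value for a given year reaches the same reducer.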

MapReduce DataFlow [AE9]
R=3, i.e., 3 reduce tasks. Why? Mostly because there are three key values.
This diagram shows the data nodes running in parallel and generating different results for a, b, and c.
After the mapping process, you have a=1 and b=2; c=3 and c=6; a=5 and c=2; b=7 and c=6.
After the combiner process, c=3 and c=6 become c=9.
After the partitioner process, a has 1 and 5; b has 2 and 7; and c has 2, 9 and 8.
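The whole dataflow (map → combine → partition → reduce) can be sketched end-to-end with the a/b/c values above, assuming the objective in this diagram is sum (how the values are split across nodes here is illustrative):

```python
from collections import defaultdict

# Mapper output per node, as in the dataflow diagram.
node_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 6)],
]

def combine(pairs):
    # Combiner: pre-sum duplicate keys on each node,
    # e.g. [('c', 3), ('c', 6)] -> [('c', 9)].
    local = defaultdict(int)
    for k, v in pairs:
        local[k] += v
    return list(local.items())

R = 3                           # three reduce tasks, one per key here
partitions = defaultdict(list)  # partitioner: route each key to a reduce task
for pairs in node_outputs:
    for k, v in combine(pairs):
        partitions[hash(k) % R].append((k, v))  # illustrative mod-R routing

result = defaultdict(int)       # reduce: final sum per key
for pairs in partitions.values():
    for k, v in pairs:
        result[k] += v

print(dict(sorted(result.items())))  # {'a': 6, 'b': 9, 'c': 17}
```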

Partitioner determines 5 reduce tasks

Hadoop Ecosystem (HE1)
• Hadoop is a low-level tool that requires technical skills and considerable effort to create, manage, and use, so it presents quite a few obstacles.
• Most organisations use Hadoop with a set of other related products that interact with and complement each other to produce an entire ecosystem of applications and tools on top of Hadoop.
• These applications and tools help less technical users who do not have the skills to work with the low-level tool directly.
• Like any ecosystem, the interconnected pieces are constantly evolving and their relationships are changing, so it is a rather fluid situation.
As stated on the slide…

Hadoop Ecosystem (HE2)
• MapReduce Simplification Applications:
• Hive is a data warehousing system that sits on top of HDFS and supports its own SQL-like language
• Pig compiles a high-level scripting language (Pig Latin) into MapReduce jobs
