
COSC2633/2637 – Big Data Processing Semester 2, 2020
Week 1
Introduction to Big Data Processing
Dr. Ke Deng ke.deng@rmit.edu.au

Acknowledgement of country
RMIT University acknowledges the Wurundjeri people of the Kulin Nations as the Traditional Owners of the land on which the University stands. RMIT University respectfully recognises Elders past, present and future.

Content
● Course Administration
■ Prerequisite knowledge
■ Teaching Team
■ Canvas
■ Lectures and laboratory
● What is Big Data?
● Course overview, Lab Platform, Assessment

Course administration
Prerequisite Knowledge
You should be familiar with databases (i.e., SQL) and have developed strong programming skills (i.e., Java)
● This prerequisite knowledge can be attained by completing
■ ISYS1055 Database Concepts
■ COSC1295 Advanced Programming (Java)

Course administration
Lectures and Laboratory Classes
● Lectures
2 hours per week via Collaborate Ultra – recorded
■ Ke Deng (Wed 16:30-18:30)
◆ ke.deng@rmit.edu.au
◆ Any questions – by email, by discussion forum, or by online meeting via Teams (pre-booking is needed)
● Laboratory sessions (starting in week 2)
1 hour per week via Collaborate Ultra – recorded
■ Lab Teachers
◆ Andrian Radic (Fri 3:30-4:30, Fri 4:30-5:30, Thu 18:30-19:30)
◆ Ke Deng (Tue 18:30-19:30)
● No Tutorial

Course administration
Canvas
This semester we are using the Canvas system, accessible through myRMIT, which provides access to online resources including:
■ Syllabus
■ Lecture notes
■ Lab material
■ Assignments
■ Discussion forums
When communicating with staff, please make sure you use your RMIT student email

Course administration
Learning Materials
No compulsory textbook
Learning materials will be available online
■ Weekly modules
• Essential and other readings
• Lecture notes
■ Collaborate Ultra
• Lecture recordings
• Lab recordings
■ Additional Readings
• Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, University of Maryland, College Park, manuscript prepared April 11, 2010
• Other specified readings

Content
● Course administration
● What is Big Data?
■ What is big data?
■ 3Vs and more Vs
● Course overview, Lab Platform, Assessment

Examples of Big Data
We generate millions of bytes of data every day – 90% of the world’s data has been created in the last two years.
• Walmart handles more than 1 million customer transactions every hour.
• Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
• 230+ million tweets are created every day.
• More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
• YouTube users upload 48 hours of new video every minute of the day.
• Amazon handles 15 million customer click-stream records per day to recommend products.
• 294 billion emails are sent every day. Services analyse this data to find spam.
• Modern cars have close to 100 sensors which monitor fuel level, tire pressure, etc.; each vehicle generates a lot of sensor data.

What is big data? Is big data a new wave?
No & Yes
● No, most of these problems have been the focus of data management research for years
● Yes, it is a different type of data wave:
One needs to put together many sources of information coming through many different channels, throwing away what is not important, working under time constraints, serving analysts and end users

What is big data?
Most commonly accepted definition, by Gartner (3Vs):
“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
● Volume – the amount of data generated is vast compared to traditional data sources
• exabytes, zettabytes, yottabytes, etc.
• more data sources, higher-resolution sensors
● Velocity – data is being generated extremely fast, the process never stops, and velocity also refers to the speed at which data must be transformed into insight
● Variety – data comes from different sources and in many forms
• numerical, textual, spatial and temporal, videos, images, transactions, graphs, etc.
• temperatures, traffic speed, population distribution, electricity consumption, social media

More than 3Vs – some make it 4Vs

Big Data Big Impact
(Prediction in 2011) (Reviewed in 2016)
McKinsey Global Institute, December 2016, The Age of Analytics: Competing in a Data-Driven World

Big Data Big Impact
From 2011 to 2016

Big Data Landscape in 2014
Big Data Landscape in 2018

Big Data – Big Ideas
Tackling large-data problems requires a distinct approach:
 Scale “out”, not “up”
 Failures are common
 Move processing to the data


Big Data – Big Ideas
Scale “out”, not “up”
The data-intensive workloads of today are beyond the capability of any single machine, no matter how powerful.
• Two approaches:
– Using a large number of commodity low-end servers is the scaling “out” approach
– Using a small number of high-end servers is the scaling “up” approach
• The scaling “out” approach is preferred over the scaling “up” approach.
• Most existing implementations of the MapReduce programming model are designed around clusters of low-end commodity servers.


Big Data – Big Ideas
Failures are common
Failures are common in a cluster of a large number of low-end servers.
• For a 10,000-server cluster, if the mean time between failures (MTBF) of each server is 1000 days, there are 10 failures a day on average. What if the MTBF is 2000 days?
• A well-designed, fault-tolerant service must ensure the expected quality of service:
– As servers go down, other cluster nodes should seamlessly take over the load.
– A broken server that has been repaired should be able to seamlessly rejoin the service without manual reconfiguration by the administrator.
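As a back-of-the-envelope check (my own calculation, assuming failures are independent and spread roughly evenly over time), the expected number of failures per day is simply the cluster size divided by the per-server MTBF:

\[
\frac{10\,000~\text{servers}}{1000~\text{days}} = 10~\text{failures/day},
\qquad
\frac{10\,000~\text{servers}}{2000~\text{days}} = 5~\text{failures/day}.
\]

Even with the MTBF doubled, a cluster of this size still sees several failures every day, which is why fault tolerance has to be designed in rather than handled as an exception.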


Big Data – Big Ideas
Move processing to the data
In traditional high-performance computing (HPC) applications (e.g., climate or nuclear simulations), processing nodes and storage nodes are linked together by a high-capacity interconnect.
• Many data-intensive workloads are not very processor-demanding (e.g., word count, whose time complexity is linear).
– Any general-purpose processor, such as the one in a storage node, can be used.
– Moving large volumes of data becomes the performance bottleneck – processors wait for data.
So, move the processing code to the nodes where the data are stored and run it there.

Content
● Course administration
● What is Big Data?
● Course overview, Lab Platform, Assessment

Course overview
Week 1 – Introduction to big data processing
Week 2 – Hadoop, HDFS, YARN
Week 3 – MapReduce Basics
Week 4 – MapReduce Algorithm Design Patterns
Week 5 – MapReduce Clustering and Classification
Week 6 – MapReduce Large Graph Processing
Week 7 – Spark Basics
Week 8 – Spark streaming, Kafka – 1
Week 9 – Spark streaming, Kafka – 2
Week 10 – PIG, HIVE, Mahout, Solr, Storm (1)
Week 11 – PIG, HIVE, Mahout, Solr, Storm (2)
Week 12 – Revision and exam preparation


Course overview
Week 2 – Hadoop, HDFS, YARN
• Data Center and Cloud computing
• Utility Computing – Everything as a Service


Course overview
Week 2 – Hadoop, HDFS, YARN
• Apache Hadoop Ecosystem

Course overview
Week 2 – Hadoop, HDFS, YARN
• Apache Hadoop Ecosystem – Hadoop Distributed File System (HDFS)
A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
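As a small preview of what HDFS looks like from application code, here is a minimal sketch using Hadoop's standard Java FileSystem API (my own illustrative example, not course material; the class name and path are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // handle to the configured file system
        Path p = new Path("/user/demo/hello.txt");            // hypothetical HDFS path

        try (FSDataOutputStream out = fs.create(p, true)) {   // write a small file (overwrite = true)
            out.writeUTF("hello HDFS");
        }
        try (FSDataInputStream in = fs.open(p)) {             // read it back
            System.out.println(in.readUTF());
        }
    }
}

The same client code works whether the configured file system is a single-node setup or a large cluster; block placement and replication are handled by HDFS itself.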

Course overview
Week 2 – Hadoop, HDFS, YARN
• Apache Hadoop Ecosystem – Yet Another Resource Negotiator (YARN)
A platform responsible for managing computing resources in clusters and using them to schedule users’ applications.

Course overview
Week 3 – MapReduce Basics
MapReduce is a programming model for large-scale data processing.

Course overview
Week 3 – MapReduce Basics
• Programmers just specify four functions:
map (k1, v1) → [(k2, v2)]
combine (k2, [v2]) → [(k2, v2)]
partition (k’, number of partitions) → partition for k’
reduce (k2, [v2]) → [(k3, v3)]
• All other aspects of execution are handled transparently by the execution framework on clusters.
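To make the map and reduce signatures concrete, here is a minimal word-count sketch against Hadoop's Java MapReduce API (my own illustrative example, not the course's reference code; class names are hypothetical):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map (k1, v1) -> [(k2, v2)]: emit (word, 1) for every word in the input line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// reduce (k2, [v2]) -> [(k3, v3)]: sum the partial counts for each word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

Because addition is associative and commutative, the same reducer class can usually be reused as the combiner; the partitioner defaults to hashing the key, and a job driver class (omitted here) wires the pieces together.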

Course overview
Week 4 – MapReduce Algorithm Design Patterns via building a term co-occurrence matrix
• MapReduce algorithms for building a term co-occurrence matrix for a text collection
• Pairs approach
• Stripes approach
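As a taste of the pairs approach (my own sketch, assuming a simple whitespace tokenizer and a line-level co-occurrence window; the class name is hypothetical), the mapper emits one ((w, u), 1) pair for every pair of co-occurring terms, and a sum reducer then produces the counts:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Pairs approach: the co-occurring term pair itself is the intermediate key.
// The pair is encoded as tab-separated Text for simplicity; a custom
// WritableComparable pair class is the more common choice.
public class CooccurrencePairsMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] terms = line.toString().trim().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            for (int j = 0; j < terms.length; j++) {
                if (i == j || terms[i].isEmpty() || terms[j].isEmpty()) {
                    continue;                      // skip self-pairs and empty tokens
                }
                pair.set(terms[i] + "\t" + terms[j]);
                context.write(pair, ONE);
            }
        }
    }
}

The stripes approach instead emits, for each term, an associative array of its neighbours and their counts; the trade-offs between the two are covered in Week 4.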

Course overview
Week 5 – MapReduce Clustering and Classification

Classification                        | Clustering
--------------------------------------|-----------------------------------
Supervised                            | Unsupervised
Known number of classes               | Unknown number of classes
Based on a training set               | No prior knowledge
Used to classify future observations  | Used to understand (explore) data

Course overview
Week 5 – MapReduce Clustering and Classification – An Example
(Scatter plots of marks in pre-requisite courses against study hours)

Course overview
Week 6 – MapReduce Large Graph Processing
• road networks, online social media, utility networks, WWW, ...
• Shortest Path Problem
– find the shortest path from a source node to one or more target nodes (e.g., from A to F)

Course overview
Week 7 – Spark Basics
An open-source and fast engine for large-scale data processing.
• Apache Spark vs. Apache Hadoop MapReduce
– Spark is generally a lot faster than MapReduce because of the way it processes data.
• Spark supports SQL, data streaming, machine learning and graph processing.
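For a first feel of the Spark programming model, here is a word count using Spark's Java API (my own sketch, assuming a Spark 2.x+ setup; the input and output paths come from the command line):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Word count in Spark: the whole job is a short chain of transformations.
public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(args[0]);                // e.g. an HDFS path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile(args[1]);                              // output directory
        }
    }
}

The logic that needed separate mapper and reducer classes in Hadoop MapReduce collapses into a few transformations, and intermediate results can stay in memory, which is one reason Spark is generally faster.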

Course overview
Week 8, 9 – Spark streaming, Kafka
• Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
– Spark Streaming receives live input data streams,
– divides the data into batches,
– then each batch is processed by the Spark engine to generate the final stream of results in batches.
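A minimal Spark Streaming sketch in Java (my own illustrative example, assuming a Spark 2.x+ setup; it reads text from a local socket rather than Kafka just to keep it self-contained):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

// Streaming word count: cut the live stream into 5-second batches and count words per batch.
public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingWordCount");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);
        JavaDStream<String> words =
                lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());
        words.mapToPair(w -> new Tuple2<>(w, 1))
             .reduceByKey(Integer::sum)
             .print();                      // print a few results of every batch

        ssc.start();                        // start receiving and processing
        ssc.awaitTermination();             // keep running until stopped
    }
}

Each 5-second batch is processed with the same operations used in batch Spark; swapping the socket source for a Kafka source changes mainly the input DStream.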

Course overview
Week 8, 9 – Spark streaming, Kafka
Input sources can be:
• Flume
• HDFS
• Kafka
• etc.
The processed data can be pushed out to:
• file systems
• databases
• live dashboards


Course overview
Week 10, 11 – PIG, HIVE, Mahout, Solr, Storm
• Pig: large-scale data processing system
– Scripts are written in Pig Latin, a dataflow language
– Programmer focuses on data transformations
– Developed by Yahoo!, now open source

Course overview
Week 10 – PIG, HIVE, Mahout, Solr, Storm (1)
• Hive: data warehousing application in Hadoop
– Query language is HQL, a variant of SQL
– Tables stored on HDFS with different encodings
– Developed by Facebook, now open source

Course overview
Week 11 – PIG, HIVE, Mahout, Solr, Storm (2)
• Mahout offers a ready-to-use framework for doing data mining tasks on large volumes of data
– For example, k-means, fuzzy k-means, Canopy, Dirichlet and Mean-Shift clustering, and Naive Bayes and Complementary Naive Bayes classification

Course overview
Week 11 – PIG, HIVE, Mahout, Solr, Storm (2)
• Solr runs as a standalone full-text search server, providing distributed indexing, replication and load-balanced querying
• Solr is highly reliable, scalable and fault tolerant, with automated failover and recovery, centralized configuration and more

Course overview
Week 11 – PIG, HIVE, Mahout, Solr, Storm (2)
• Apache Storm is a free and open source distributed realtime computation system
• Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing

AWS Environment in Lab Classes
The labs use AWS EMR (Amazon Elastic MapReduce)
• We have created a JumpHost account for each student
• The access key will be sent to you this week
• The first lab instructs in detail how to access the JumpHost and then access EMR
(Diagram: academics/students outside RMIT, or working at RMIT, connect via TCP:22 (SSH) and TCP:8888 (HTTP) to the Amazon Virtual Private Cloud (Amazon VPC))


Assessment
• 40% Assignment 1
• 40% Major Assignment 2
• 20% Week 14-16 Online Test
Deadlines for assessment tasks are in Canvas – Syllabus:
https://rmit.instructure.com/courses/52636/assignments/syllabus

Late submissions and special consideration
Late submission of assignment:
● -10% per day late
● After 5 days, zero will be granted for the assignment.
Special consideration (for weekly tests/tasks, assignment or exam):
● For extensions of 5 days or less, the lecturer can approve based on evidence. Granted only in accordance with University special consideration policy, e.g., illness, family tragedy, serious life disruption, etc.
● For extensions of more than 5 days, you must submit a formal application through the University.

Academic Integrity
University policy – students caught passing off somebody else’s work as their own are subject to the discipline policy
● There is a difference between copying somebody else’s text and summarizing their content in your own words after crediting the source
● We will check assignments for plagiarism!
http://www.rmit.edu.au/students/academic-integrity

Next Week