
COMP9313: Big Data Management
Course web site: http://www.cse.unsw.edu.au/~cs9313/

Chapter 1: Course Information and Introduction to Big Data Management

Part 1: Course Information

Course Info
❖ Lectures: 10:00 – 12:00 (Tuesday) and 14:00 – 16:00 (Thursday)
➢ Purely online (access through Moodle)
❖ Labs: Weeks 2-10
❖ Consultation (Weeks 1-10): Questions regarding lectures, course materials, assignments, exam, etc.
➢ Time: 16:00 – 17:00 (Thursday)
➢ Place: Online
❖ Course Admin:
➢ Sijin Wang,
❖ Tutors: Shunyang Li, Yizhang He
❖ Discussion and QA: WebCMS3

Lecturer in Charge
❖ Lecturer:
➢ Office: 201D K17 (outside the lift, turn left)
➢ Email:
❖ Research interests
➢ Database
➢ Data Mining
➢ Big Data Technologies
➢ Machine Learning Applications
➢ My homepage: https://xincao-unsw.github.io/
➢ My publications list at google scholar: https://scholar.google.com.au/citations?user=kJIkUagAAAAJ&hl=en

Course Aims
❖ This course aims to introduce you to the concepts behind Big Data, the core technologies used in managing large-scale data sets, and a range of technologies for developing solutions to large-scale data analytics problems.
❖ This course is intended for students who want to understand modern large-scale data analytics systems. It covers a wide range of topics and technologies and will prepare students to be able to build such systems as well as use them efficiently and effectively to address challenges in big data management.
❖ Not possible to cover every aspect of big data management.

❖ Lectures focus on the frontier technologies in big data management and their typical applications
❖ Lectures will run in a more interactive mode, with more examples
❖ A few lectures may run in a more practical manner (e.g., like a lab/demo) to cover the applied aspects
❖ Lecture length varies slightly depending on the progress of that lecture
❖ Note: attendance at every lecture is assumed

❖ Textbooks
➢ Hadoop: The Definitive Guide. Tom White. 4th Edition – O’Reilly
➢ Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman. 2nd edition – Cambridge University Press
➢ Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. University of Maryland, College Park.
➢ Learning Spark. Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. O’Reilly
❖ Reference Books and other readings
➢ Apache MapReduce Tutorial
➢ Apache Spark Quick Start
➢ Many other online tutorials …
❖ Big Data is a relatively new topic (so no fixed syllabus)

Prerequisite
❖ The official prerequisites of this course are COMP9024 (Data Structures and Algorithms) and COMP9311 (Database Systems).
❖ Before commencing this course, you should:
➢ have experience and good knowledge of algorithm design (equivalent to COMP9024)
➢ have a solid background in database systems (equivalent to COMP9311)
➢ have solid programming skills in Java/Python
➢ be familiar with working on Unix-style operating systems
➢ have basic knowledge of linear algebra (e.g., vector spaces, matrix multiplication), probability theory and statistics, and graph theory
❖ No previous experience necessary in
➢ MapReduce/Spark
➢ Parallel and distributed programming

Please do not enrol if you
❖ Don’t have COMP9024/9311 knowledge
❖ Cannot produce correct Java/Python programs on your own
❖ Have never worked on Unix-style operating systems
❖ Have poor time management
❖ Are too busy to attend lectures/labs
❖ If you enrol anyway, you are likely to perform badly in this subject

Learning outcomes
❖ After completing this course, you are expected to:
➢ describe the important characteristics of Big Data
➢ develop an appropriate storage structure for a Big Data repository
➢ utilise the map/reduce paradigm and the Spark platform to manipulate Big Data
➢ use a high-level query language to manipulate Big Data
➢ develop efficient solutions for analytical problems involving Big Data

Assessment

❖ Projects:
➢ 1 project on MapReduce
➢ 1 project on Spark
➢ 1 project on a real cloud platform (AWS/Azure/Dataproc)
❖ Assignment: previous years’ exam questions
❖ Both results and source code will be checked.
➢ If your code fails to run due to bugs, you will not lose all marks.

CSE Computing Environment
❖ Use Linux/command line (a virtual machine image will be provided)
➢ Projects marked on Linux servers
➢ You need to be able to upload, run, and test your program under Linux
❖ Assignment submission
➢ Use Give to submit (either command line or web page)
➢ Use Classrun to check your submission, marks, etc. Read https://wiki.cse.unsw.edu.au/give/Classrun

Final exam
❖ Final written exam (50 pts)
❖ If you are ill on the day of the exam, do not attend the exam – I will not accept any medical special consideration claims from people who already attempted the exam
❖ You need to achieve at least 20 marks in the final exam
❖ No supplementary exam will be given

You May Fail Because …
❖ *Plagiarism*
❖ Code failed to compile due to some mistakes
❖ Late submission
➢ 1 sec late = 1 day late
➢ submit wrong files
❖ Program did not follow the spec
❖ I am unlikely to accept the following excuses:
➢ “Too busy”
➢ “It took longer than I thought it would take”
➢ “It was harder than I initially thought”

Tentative Course Schedule
Assignment
Course info and introduction to big data
Hadoop, HDFS, and YARN
Hadoop MapReduce 1
Hadoop MapReduce 2
Finding Similar Items
Data stream mining
Recommender Systems
NoSQL and High Level MapReduce Tools
Assignment

❖ 1 lab on Hadoop setup and HDFS practice
❖ 3 labs on MapReduce
❖ 3 labs on Spark
❖ 1 lab on high level MapReduce tools
❖ 1 lab on a real cloud platform (AWS/Azure/Dataproc)

Virtual Machine
❖ Software: VirtualBox
❖ VM image:
➢ Xubuntu 20.04 with pre-installed Hadoop and Spark
 Download the VM image at: https://drive.google.com/file/d/1SaAaQa8f17SsOixq8mNI6R6Dn2mMDYkp/view?usp=sharing
 Open VirtualBox, File -> Import Appliance
 Browse the image folder and select the “*.ova” file
 The image will be imported to your computer, which may take 10 minutes
 comp9313 is used as both username and password. The Hadoop installation path is the same as in the virtual machine on lab computers.
❖ You can also install Hadoop and Spark on your own computer

Your Feedbacks Are Important
❖ Big data is a new topic, and thus the course content is tentative
❖ The technologies keep evolving, and the course materials need to be updated correspondingly
❖ Please advise where I can improve after each lecture, at the discussion and QA website
❖ myExperience system

Why Attend the Lectures?

Part 2: Introduction to Big Data

What is Big Data?
❖ No standard definition! here is from Wikipedia:
➢ Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software
➢ Challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sources.
➢ The term “big data” often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from big data, and seldom to a particular size of data set.

Instead of Talking about “Big Data”…
❖ Let’s talk about a crowded application ecosystem:
➢ Hadoop MapReduce
➢ High-level query languages (e.g., Hive)
➢ NoSQL (e.g., HBase, MongoDB, Neo4j)
➢ Pregel
❖ Let’s talk about data science and data management:
➢ Finding similar items
➢ Graph data processing
➢ Streaming data processing
➢ Machine learning technologies
➢ ……

Who is generating Big Data?
User Tracking & Engagement
Homeland Security
Financial Services
Real Time Search

Big Data Characteristics: 3V
❖ Big data was originally characterized by the “three Vs”:
➢ Volume: In a big data environment, the amounts of data collected and processed are much larger than those stored in typical relational databases.
➢ Variety: Big data consists of a rich variety of data types.
➢ Velocity: Big data arrives at the organization at high speed and from multiple sources simultaneously.

Volume (Scale)
❖ In the big data era, huge amounts of data are generated every day

Twitter by the Numbers: Stats, Demographics & Fun Facts


Recent Twitter statistics

Volume (Scale)
❖ Data volume is increasing exponentially (40% increase per year)
Data amount in Zettabytes from 2010 to 2025
A forecast by IDC & Seagate. Image by Sven Balnojan.

Variety (Complexity)
❖ Different Types:
➢ Relational Data (Tables/Transaction/Legacy Data)
➢ Text Data (Web)
➢ Semi-structured Data (XML)
➢ Spatial Data
➢ Temporal Data
➢ Graph Data
 Social Network, Semantic Web (RDF), …
➢ One application can generate/collect many types of data
❖ Different Sources:
➢ Movie reviews from IMDB and Rotten Tomatoes
➢ Product reviews from different provider websites
❖ To extract knowledge, all these types of data need to be linked together

A Single View to the Customer
Social Media
Banking Finance
Our Known History

A Global View of Linked Big Data
Diversified social network

Velocity (Speed)
❖ Velocity essentially measures how fast the data is coming in.
❖ Data is being generated fast and needs to be processed fast
➢ Late decisions -> missing opportunities
❖ Velocity is usually an issue in online data analytics, for example:
➢ E-Promotions: based on your current location, your purchase history, what you like -> send promotions right now for store next to you
➢ Healthcare monitoring: sensors monitoring your activities and body -> any abnormal measurements require immediate reaction
➢ Pandemic management and response: contact tracing for newly infected COVID-19 cases and prediction of future hotspots to slow down the spread of infectious diseases

Velocity in Real Life
❖ The statistics for 1 second in many applications. http://www.internetlivestats.com/one-second/

Big Data Characteristics: 6V
❖ Volume: In a big data environment, the amounts of data collected and processed are much larger than those stored in typical relational databases.
❖ Variety: Big data consists of a rich variety of data types.
❖ Velocity: Big data arrives at the organization at high speed and from multiple sources simultaneously.
❖ Veracity: Data quality issues are particularly challenging in a big data context.
❖ Value: Ultimately, big data is meaningless if it does not provide value toward some meaningful goal.
❖ Visibility/Visualization/Variability/Validity/….

Veracity (Quality & Trust)
❖ Data = quantity + quality
❖ When we talk about big data, we typically mean its quantity:
➢ What capacity must a system provide to cope with the sheer size of the data?
➢ Is a query feasible on big data within our available resources?
➢ How can we make our queries tractable on big data?
❖ Can we trust the answers to our queries?
➢ Dirty data routinely leads to misleading financial reports and strategic business planning decisions -> loss of revenue, credibility, and customers; disastrous consequences
❖ The study of data quality is as important as data quantity

Data in real life is often dirty
❖ 81 million National Insurance numbers but only 60 million eligible citizens [1]
❖ 500,000 dead people retain active Medicare cards [2]
❖ Medical errors may account for as many as 251,000 deaths annually in the US [3]
[1] https://publications.parliament.uk/pa/cm200001/cmhansrd/vo010327/debtext/10327-21.htm
[2] https://www.privacy.org.au/Campaigns/ID_cards/MedicareSmartcard.html
[3] Your Health Care May Kill You: Medical Errors. https://pubmed.ncbi.nlm.nih.gov/28186008/

❖ Big data is meaningless if it does not provide value toward some meaningful goal

❖ Visibility: the state of being able to see or be seen is implied.
➢ Big Data – visibility = ?
❖ Visualization: Making all that vast amount of data comprehensible in a manner that is easy to understand and read.
A visualization of Divvy bike rides across Chicago
❖ Big data visualization tools:

❖ Variability
➢ Variability refers to data whose meaning is constantly changing. This is particularly the case when gathering data relies on language processing.
➢ It defines the need to get meaningful data considering all possible circumstances.
❖ Viscosity
➢ This term is sometimes used to describe the latency or lag time in the data relative to the event being described. We found that this is just as easily understood as an element of Velocity.
❖ Volatility
➢ Big data volatility refers to how long data is valid and how long it should be stored. You need to determine at what point data is no longer relevant to the current analysis.
❖ More V’s in the future …
➢ How many v’s are there in big data? http://www.clc-ent.com/TBDE/Docs/vs.pdf

Big Data: 6V in Summary
Transforming Energy and Utilities through Big Data & Analytics. By Anders

Tag Clouds of Big Data

Why Study Big Data Technologies?
❖ The hottest topic in both research and industry
❖ Highly demanded in the real world
❖ A promising future career
➢ Research and development of big data systems: distributed systems (e.g., Hadoop), visualization tools, data warehouses, OLAP, data integration, data quality control, …
➢ Big data applications: social marketing, healthcare, …
➢ Data analysis: getting value out of big data by discovering and applying patterns, predictive analysis, business intelligence, privacy and security, …
❖ Get enough credits

Big Data Open Source Tools
https://datafloq.com/big-data-open-source-tools/os-home/

What Will the Course Cover
❖ Topic 1. Big data management tools
➢ Apache Hadoop
 YARN/HDFS/Hive (briefly introduced)
 MapReduce
 NoSQL (HBase)
 Amazon AWS/Microsoft Azure platform/Google Dataproc
❖ Topic 2. Big data typical applications
➢ Finding similar items
➢ Graph data processing
➢ Data stream mining
➢ Recommender Systems

Distributed Processing is Non-Trivial
❖ How to assign tasks to different workers in an efficient way?
❖ What happens if tasks fail?
❖ How do workers exchange results?
❖ How to synchronize distributed tasks allocated to different workers?
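Frameworks such as MapReduce exist precisely to answer these questions for the programmer. As a rough illustration (a toy, single-process sketch, not the Hadoop API), the map -> shuffle -> reduce flow for counting words looks like this:

```python
from collections import defaultdict
from functools import reduce

# Toy single-process simulation of the map -> shuffle -> reduce flow
# that frameworks like Hadoop run in parallel across many workers.

def map_phase(document):
    # Emit (word, 1) pairs, as a word-count mapper would
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # Group all values by key (the framework's "shuffle & sort" step)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

documents = ["big data big systems", "big data management"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts["big"])  # 3
```

In a real cluster, each worker runs `map_phase` on its own chunk of input, and the framework handles the shuffle and the failure/retry logic between the phases.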

Big Data Storage is Challenging
❖ Data Volumes are massive
❖ Reliability of storing PBs of data is challenging
❖ All kinds of failures: disk/hardware/network failures
❖ The probability of failure simply increases with the number of machines …
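A quick back-of-envelope calculation shows why: if each machine fails independently on a given day with some small probability p (the numbers below are illustrative assumptions, not measurements), the chance that at least one of n machines fails grows rapidly with n:

```python
# P(at least one of n machines fails) = 1 - (1 - p)^n,
# where p is the per-machine failure probability (assumed, illustrative).

def prob_any_failure(p, n):
    return 1 - (1 - p) ** n

p = 0.001  # assumed daily failure probability of a single machine
for n in [1, 100, 1000, 10000]:
    print(n, round(prob_any_failure(p, n), 4))
```

With p = 0.001, a single machine almost never fails, but across 10,000 machines at least one failure on any given day is near-certain, which is why systems like HDFS must treat failure as the normal case.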

What is Hadoop
❖ Open-source data storage and processing platform
❖ Before the advent of Hadoop, storage and processing of big data was a big challenge
❖ Massively scalable, automatically parallelizable
➢ Based on work from Google
 Google: GFS + MapReduce + BigTable (Not open)
 Hadoop: HDFS + Hadoop MapReduce + HBase(opensource)
❖ Named by Doug Cutting in 2006 (worked at Yahoo! at that time), after his son’s toy elephant.

Hadoop Offers
❖ Redundant, fault-tolerant data storage
❖ Parallel computation framework
❖ Job coordination
Programmers no longer need to worry about:
Q: Where is the file located?
Q: How to handle failures & data loss?
Q: How to divide computation?
Q: How to program for scaling?

Why Use Hadoop?
➢ Scales to Petabytes or more easily
➢ Parallel data processing
➢ Suited for particular types of big data problems

Companies Using Hadoop

❖ Data storage (HDFS)
➢ Runs on commodity hardware (usually Linux)
➢ Horizontally scalable; can scale to 1000s of nodes
➢ Redundant (3 copies)
➢ For large files – large blocks: 64 or 128 MB per block
❖ Data processing (MapReduce)
➢ Parallelized (scalable) processing; auto-parallelizable for huge amounts of data
➢ Fault tolerant (auto retries)
➢ Batch (job) processing
➢ Distributed and localized to clusters (Map)
❖ Other tools/frameworks (Pig, Hive, HBase, other libraries)
➢ Built on the MapReduce API and HDFS storage
➢ Add high availability and more
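To make the block numbers concrete, here is a minimal sketch of estimating a file's HDFS footprint, assuming the common defaults of 128 MB blocks and a replication factor of 3 (both are configurable, and the helper below is illustrative, not an HDFS API):

```python
import math

# A file is split into fixed-size blocks; each block is replicated.
# 128 MB blocks and 3 replicas are common defaults (configurable).

BLOCK_MB = 128
REPLICATION = 3

def hdfs_footprint(file_mb):
    blocks = math.ceil(file_mb / BLOCK_MB)
    # The last block only occupies its actual size, so raw usage is
    # roughly the file size times the replication factor.
    raw_mb = file_mb * REPLICATION
    return blocks, raw_mb

blocks, raw = hdfs_footprint(1000)  # a 1000 MB file
print(blocks, raw)  # 8 blocks, 3000 MB of raw storage
```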

Hadoop 1.x vs Hadoop 2.x
❖ Hadoop 1.x: single-use system
➢ Batch apps
➢ Stack: Hive (SQL), Pig (data flow), MapReduce (cluster resource management & data processing), HDFS (redundant, reliable storage)
❖ Hadoop 2.x: multi-purpose platform
➢ Batch, interactive, online, streaming
➢ Stack: MapReduce (batch processing) and others (in-memory, streaming), YARN (cluster resource management), HDFS (redundant, highly available & reliable storage)
❖ Hadoop YARN (Yet Another Resource Negotiator): a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications

Hadoop 3.x
❖ Hadoop 2.x vs Hadoop 3.x:
➢ Minimum supported Java version: Java 7 (2.x) vs Java 8 (3.x)
➢ Storage scheme: 3x replication (2.x) vs erasure coding in HDFS (3.x)
➢ Fault tolerance: in 2.x, replication is the only way to handle fault tolerance, which is not space optimized; in 3.x, erasure coding is used for handling fault tolerance
➢ Storage overhead: 200% of HDFS in 2.x (6 blocks of data occupy the space of 18 blocks due to the replication factor) vs 50% in 3.x (6 blocks of data occupy 9 blocks, i.e., 6 blocks for actual data and 3 blocks for parity)
➢ Scalability: limited in 2.x (up to 10,000 nodes in a cluster) vs improved in 3.x (more than 10,000 nodes in a cluster)
➢ NameNodes: 2.x has a single active NameNode and a single standby NameNode; 3.x allows users to run multiple standby NameNodes to tolerate the failure of more nodes
https://hadoop.apache.org/docs/stable/index.html
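The overhead figures in the comparison above can be checked with a little arithmetic. The sketch below assumes the RS-6-3 scheme (6 data blocks plus 3 parity blocks), one of the erasure coding policies Hadoop 3 provides:

```python
# Storage overhead: extra space used, as a fraction of the actual data.
# 3x replication vs erasure coding with 6 data + 3 parity blocks.

def replication_overhead(data_blocks, replicas=3):
    total = data_blocks * replicas
    return (total - data_blocks) / data_blocks  # extra blocks / data blocks

def erasure_overhead(data_blocks=6, parity_blocks=3):
    return parity_blocks / data_blocks

print(replication_overhead(6))  # 2.0 -> 200% extra (6 blocks stored as 18)
print(erasure_overhead(6, 3))   # 0.5 -> 50% extra (6 blocks stored as 9)
```

The same fault tolerance (surviving the loss of any 3 blocks) thus costs 200% extra space with replication but only 50% with RS-6-3 erasure coding.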

Hadoop Ecosystem
A combination of technologies that together provide advantages in solving business problems.

Tutorial 2: Introduction to Hadoop Architecture and Components

Common Hadoop Distributions
❖ Open Source
➢ Apache
❖ Commercial
➢ Cloudera
➢ Hortonworks
➢ AWS MapReduce
➢ Microsoft Azure

Setting up Hadoop Development
❖ Hadoop binaries
➢ Local install (Linux, Windows)
➢ Cloudera’s demo VM (needs virtualization software, i.e., VMware, etc.)
➢ Cloud (Microsoft, others)
❖ Data storage
➢ Local file system
➢ HDFS pseudo-distributed (single-node)
➢ Cloud (Azure, others)
❖ Other libraries & tools
➢ Vendor tools

From Wikipedia 2006
From Wikipedia 2017
AWS (Amazon Web Services)

AWS (Amazon Web Services)
❖ AWS is a subsidiary of Amazon.com that offers a suite of cloud computing services which make up an on-demand computing platform.
❖ Amazon Web Services (AWS) provides a number of different services, including:
➢ Amazon Elastic Compute Cloud (EC2): virtual machines for running custom software
➢ Amazon Simple Storage Service (S3): simple key-value store, accessible as a web service
➢ Amazon Elastic MapReduce (EMR): scalable MapReduce computation
➢ Amazon DynamoDB: distributed NoSQL database, one of several in AWS
➢ Amazon SimpleDB: simple NoSQL database

Cloud Computing Services in AWS
❖ IaaS
➢ EC2, S3, …
➢ Highlight: EC2 and S3 are two of the earliest products in AWS
❖ PaaS
➢ Aurora, Redshift, …
➢ Highlight: Aurora and Redshift are two of the fastest growing products in AWS
❖ SaaS
➢ WorkDocs, WorkMail
➢ Highlight: may not be the main focus of AWS

Microsoft Azure
❖ Azure is a cloud computin