CS代写 INFS3208/INFS7208

CRICOS code 00025BCRICOS code 00025B

• Database Background

• Relational Data Bases

– Revisit Relational DBs

– ACID Properties *

– Clustered RDBMs

• Non-relational Data Bases

– NoSQL concepts

– CAP Theorem ***

– Cassandra

Re-cap – Lecture 7

Cloud Computing
INFS3208/INFS7208

CRICOS code 00025BCRICOS code 00025B

• Background of Distributed File Systems

– Big Data, Data Centre Technology, and Storage Hardware

• Distributed File System

– File System

– Server/Client System

– Sun’s Network File System (NFS)

• Clustered File System (CFS)

– Google File System (GFS)

– Hadoop Distributed File System (HDFS)

– HDFS Shell Commands

CRICOS code 00025BCRICOS code 00025B

Structured Unstructured

Multimedia

Structured
Repeatable

Unstructured
Exploratory

Social Data

Text Data:

Sensor data:

Internal App

Transaction

OLTP System

Traditional

SourcesERP

Issues of huge volume of data generated daily!!

CRICOS code 00025BCRICOS code 00025B

We are facing
PBs of data

https://www.simplilearn.com/data-science-vs-big-data-vs-data-analytics-article

Issues of huge volume of data generated daily!!

CRICOS code 00025BCRICOS code 00025B

Measuring the Size of Data

1,125,899,906,842,624 bytes or 250

approx. 1,000,000,000,000,000 or 10 15

2 Petabytes: All U. S. academic research libraries

1,152,921,504,606,846,976 bytes or 260

approx. 1,000,000,000,000,000,000 or 10 18

5 Exabytes: All words ever spoken by human beings.

1,180,591,620,717,411,303,424 bytes or 270

approx. 1,000,000,000,000,000,000,000 or 10 21

1,208,925,819,614,629,174,706,176 bytes or 280

approx. 1,000,000,000,000,000,000,000,000 or 10 24

Bytes (8 bits)

1,024 bytes; 210; approx. 1,000 or 10 3

2 Kilobytes: Typewritten page

1,048,576 bytes; 220;

approx 1,000,000 or 10 6

5 Megabytes: Complete works of Shakespeare

1,073,741,824 bytes; 230;

approx 1,000,000,000 or 10 9

20 Gigabytes: Audio collection of the works of Beethoven

1,099,511,627,776 or 240;

approx. 1,000,000,000,000 or 10 12

10 Terabytes: Printed collection of the U. S. Library of Congress

with 130 million items on about 530 miles of bookshelves,

including 29 million book, 2.7 million recordings, 12 million

photographs, 4.8 million maps, and 58 million manuscripts

UQ INFS1200/INFS7900 Week 1 Lecture Notes

≈ 43,980,465 km

384,402 km between earth

and moon (114.413 times)
https://en.wikipedia.org/wiki/Floppy_disk

549,755,813,888 X

https://en.wikipedia.org/wiki/Floppy_disk

CRICOS code 00025BCRICOS code 00025B

How much data are we generating?

• By 2020, it’s estimated that 1.7MB of data will be

created every second for every person on earth and

there will be around 40 trillion gigabytes of data (40

zettabytes).

• 90% of all data has been created in the last two years.

• In 2018, internet users spent 2.8 million years online.

• Social media accounts for 33% of the total time spent

• Today it would take a person approximately 181 million

years to download all the data from the internet.

• 97.2% of organizations are investing in big data and AI

• Using big data, Netflix saves $1 billion per year on

customer retention.

Big Data Statistics

https://www.socialmediatoday.com/news/how-much-data-is-generated-every-minute-infographic-1/525692/

https://techjury.net/stats-about/big-data-statistics/

https://www.socialmediatoday.com/news/how-much-data-is-generated-every-minute-infographic-1/525692/
https://techjury.net/stats-about/big-data-statistics/

CRICOS code 00025BCRICOS code 00025B

A data centre is a specialised IT infrastructure that houses centralised IT resources

• Servers (rack in cabinet);

• Databases and software systems;

• Networking and telecommunication devices.

Typical technologies and components

• Virtualisation

• Standardisation and Modularity

• Remote Operation and Management

• High Availability

• Security-Aware Design, Operation and Management

• Facilities

Data Centre Technology

CRICOS code 00025BCRICOS code 00025B

Hardware: Array of Hard Disks

https://www.wikihow.com/Recover-a-Dead-Hard-Disk http://www.comnetdrc.com/

CRICOS code 00025BCRICOS code 00025B

Hardware: Network Switch

P. 10http://exd-int.com/my-product/space-mind/

Network Switch Can Route 15 Terabits Per Second

CRICOS code 00025BCRICOS code 00025B

Google Data Centre

CRICOS code 00025BCRICOS code 00025B

• Background of Distributed File Systems

– Big Data, Data Centre Technology, and Storage Hardware

• Distributed File System

– File System

– Server/Client System

– Sun’s Network File System (NFS)

• Clustered File System (CFS)

– Google File System (GFS)

– Hadoop Distributed File System (HDFS)

– HDFS Shell Commands

CRICOS code 00025BCRICOS code 00025B

• A file system is an abstraction: enable users to manipulate

and organize data.

• Typically, FS is in a hierarchical tree: files and directories.

• FS enables a uniform view, independent of the underlying

storage devices: floppy/optical drives, hard drives and USB

stick, etc.

• The connection between the logical file system and the

storage device was typically a one-to-one mapping.

• Examples:

– Windows: NTFS, FAT32, FAT

– MacOS: System (AFS),

– Linux: Ext4, etc.

What is a File System?

CRICOS code 00025BCRICOS code 00025B

• DFS is a distributed implementation of the classical time-sharing model
of a file system, where multiple users share files and storage
resources.

• A distributed file system spreads over multiple, autonomous
computers.

• A distributed file system should have following characteristics:

– Access transparency

– Location transparency

– Concurrency transparency

– Failure transparency

– Replication transparency

– Migration transparency

– Heterogeneity

– Scalability

What is a Distributed File System (DFS)?

https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems

CRICOS code 00025BCRICOS code 00025B

• In general, files in a DFS can be located in “any” system.

– Servers: data holders (the “source(s)” of files)

– Clients: data users who are accessing the servers.

• Potentially, a server for a file can become a client for another file.

• However, most distributed systems distinguish between clients and servers

in more strict way:

– Clients simply access files and do not have/share local files.

– Even if clients have disks, they (disks) are used for swapping, caching,

loading the OS, etc.

– Servers are the actual sources of files.

– In most cases, servers are more powerful machines (in terms of CPU,

physical memory, disk bandwidth, ..)

What is a Distributed File System (DFS)?

https://goo.gl/images/Nwq6tV https://www.slideshare.net/AnamikaSingh211/distributed-file-system-72294718

CRICOS code 00025BCRICOS code 00025B

• Network File System (NFS) is a distributed file system protocol originally developed by Sun
Microsystems in 1984.

• NFS allows a user on a client computer to access files over a computer network much like
local storage is accessed.

• NFS is a client-server application, where a user can view, store and update the files on a
remote computer.

• NFS builds on the Remote Procedure Call (RPC) system to route requests between clients
and servers.

• NFS protocol is designed to be independent of computer, operating system, network
architecture, and transport protocol.

• NFS allows the user or system administrator to mount complete or a partial file system on a

• The portion of the file system that is mounted can be accessed by clients with different
privileges (e.g. read-only or read-write)

Sun Network File System (NFS)

CRICOS code 00025BCRICOS code 00025B

Advantages:

• easy sharing of data across clients

• centralized administration (backup done on multiple servers instead of many clients)

• security (put server behind firewall)

Sun Network File System (NFS)

Network Server

http://pages.cs.wisc.edu/~remzi/OSTEP/dist-nfs.pdf

CRICOS code 00025BCRICOS code 00025B

• Each file server presents a standard view of its local file

• Transparent access to remote files

• Compatibility with multiple operating systems and
platforms.

• Easy crash recovery and backup at server

• A software component namely, Virtual File System (VFS)
is available in most OS (Operating Systems) as the
interface to different local and distributed file systems.

• Virtual File System (VFS) in an OS (Operating System)
acts as an interface between the system-call layer and all
files in network nodes.

• The user interface to NFS is the same as the interface to
local file systems. The calls go to the VFS layer, which
passes them either to a local file system or to the NFS

Sun Network File System (NFS)

P. 18http://pages.cs.wisc.edu/~remzi/OSTEP/dist-nfs.pdf

Making remote files as if local to the client

For Windows, the NFS is available via SMB – Server Message

Block, or a version namely, CIFS – Common Internet File System.

http://pages.cs.wisc.edu/~remzi/OSTEP/dist-nfs.pdf

CRICOS code 00025BCRICOS code 00025B

• A big hard drive (30 TB) on a windows server:

• The driver can be shared by users in Linux

A Real Example of NFS

CRICOS code 00025BCRICOS code 00025B

• File Sharing

A Real Example of NFS

Linux Server 1 Linux Server 2 Linux Server 7

Experimental data

Experimental programs

CRICOS code 00025BCRICOS code 00025B

• Background of Distributed File Systems

– Big Data, Data Centre Technology, and Storage Hardware

• Distributed File System

– File System

– Server/Client System

– Sun’s Network File System (NFS)

• Clustered File System (CFS)

– Google File System (GFS)

– Hadoop Distributed File System (HDFS)

– HDFS Shell Commands

CRICOS code 00025BCRICOS code 00025B

• For bigger scale of data storage, a cluster of thousands of data servers is required.

• Scalability and availability should be provided for big data storage.

• Resiliency and load balancing are also very essential.

Why using Clustered File System (CFS)?

CRICOS code 00025BCRICOS code 00025B

• Clustered File System (CFS) is not a single server with a set of clients, but instead a cluster of servers

that all work together to provide high performance service to their clients.

• To the clients of CFS, the cluster is transparent.

• A CFS can organize the storage and access data across all clusters.

What is a Clustered File System (CFS)?

P. 23https://gitlab.com/arm-hpc/packages/wikis/packages/BeeGFS

CRICOS code 00025BCRICOS code 00025B

The difference between

(a) distributing whole files across several servers and

(b) striping files for parallel access.

CFS is Managed at a Block Level

File A can be divided into blocks

CRICOS code 00025BCRICOS code 00025B

Distributed File System (DFS): A file system that has its

components spread across multiple systems. On the other hand, an

NFS is inherently a distributed file system as well. The client

component is on a different system than the underlying physical

storage or its management.

https://www.quora.com/What-is-the-difference-between-a-distributed-file-system-clustered-file-system-and-a-network-file-system

Client-Server Architecture Cluster Server Architecture

What are the differences among DFS, NFS, CFS?

Network File system (NFS): Files

are not local. They are served over a

network, with the physical storage

units and their management hosted

by a different entity.

Clustered File System (CFS): It is

built by pooling several different

discrete components, typically multiple

servers, multiple disks, working

together to provide a unified

namespace. A client is not aware of

the physical boundaries that make up

the file system.

CRICOS code 00025BCRICOS code 00025B

• Background of Distributed File Systems

– Big Data, Data Centre Technology, and Storage Hardware

• Distributed File System

– File System

– Server/Client System

– Sun’s Network File System (NFS)

• Clustered File System (CFS)

– Google File System (GFS)

– Hadoop Distributed File System (HDFS)

– HDFS Shell Commands

CRICOS code 00025BCRICOS code 00025B

• Google File System (GFS or GoogleFS) is a

scalable distributed file system developed by Google to

provide efficient, reliable access to data using large clusters

of commodity hardware.

• Shares many of the same goals as previous distributed file

systems such as performance, scalability, reliability, and

availability.

• GFS was internally used and became the creation basis of

Hadoop Distributed File System (HDFS)

Google File System

https://en.wikipedia.org/wiki/Google_File_System#/media/File:GoogleFileSystemGFS.svg

CRICOS code 00025BCRICOS code 00025B

• Thousands of commodity computers

– cheap hardware but distributed

• GFS has high component failure rates

– System is built from many inexpensive commodity components

• Modest number of huge files

– A few million files, each typically 100MB or larger (Multi-GB files are common)

– No need to optimize for small files

• Workloads : two kinds of reads and writes

– Large streaming reads (1MB or more) and small random reads (a few KBs)

– Sequential appends to files by hundreds of data producers

• High sustained throughput is more important than latency

– Response time for individual read and write is not critical

Design Assumptions

CRICOS code 00025BCRICOS code 00025B

• Files stored as chunks

– With a fixed size of 64MB each.

– With a 64-bit ID each.

• Reliability through replication

– Each chunk is replicated across 3 (by default) or more
chunk servers

– Replication number can be manually set

• Single Master

– Centralized management

– Only store meta-data

GFS Design Overview

A file with four chunks

Replicas=2

CRICOS code 00025BCRICOS code 00025B

GFS Architecture – Read

CRICOS code 00025BCRICOS code 00025B

1.Client asks master which chunk server holds current lease of chunk and
locations of other replicas.

2.Master replies with identity of primary and locations of secondary replicas.

3.Client pushes data to all replicas

4.Once all replicas have acknowledged receiving the data, client sends write
request to primary. The primary assigns consecutive serial numbers to all the
mutations it receives, providing serialization. It applies mutations in serial
number order.

5.Primary forwards write request to all secondary replicas. They apply
mutations in the same serial number order.

6.Secondary recplicas reply to primary indicating they have completed

7.Primary replies to the client with success or error message

GFS Architecture – Write

CRICOS code 00025BCRICOS code 00025B

• Mater maintains all system metadata

– Name space, access control info, file to chunk mappings, chunk locations, etc.

• Periodically communicates with chunk servers

– Through HeartBeat messages

• Advantages:

– Simplifies the design

• Disadvantages:

– Single point of failure

• Solution

– Replication of Master state on multiple machines

– Operational log and check points are replicated on multiple machines

GFS Components – Master

CRICOS code 00025BCRICOS code 00025B

• Fixed size of 64MB (vs 4kb of cluster size for NTFS)

• Advantages

– Size of meta data is reduced

– Involvement of Master is reduced

– Network overhead is reduced

– Lazy space allocation avoids internal fragmentation

• Disadvantages

– Hot spots

 A small file consists of a small number of chunks, perhaps just one. The chunkservers
storing those chunks may become hot spots if many clients are accessing the same file.

 Solutions: increase the replication factor and stagger application start times; allow clients to
read data from other clients

GFS Components – Chunks

https://support.microsoft.com/en-us/help/140365/default-cluster-size-for-ntfs-fat-and-exfat

Chunk numbers

Chunk size

https://support.microsoft.com/en-us/help/140365/default-cluster-size-for-ntfs-fat-and-exfat

CRICOS code 00025BCRICOS code 00025B

• Three major types of metadata

– The file and chunk namespaces

– The mapping from files to chunks

– Locations of each chunk’s replicas

• All the metadata is kept in the Master’s memory

• 64MB chunk has 64 bytes of metadata

• Chunk locations

– Chunk servers keep track of their chunks and relay data to Master through HeartBeat

• Master “operation log”

– Consists of namespaces and file to chunk mappings

– Replicated on remote machines

GFS Components – Metadata and Operational Log

CRICOS code 00025BCRICOS code 00025B

• Background of Distributed File Systems

– Big Data, Data Centre Technology, and Storage Hardware

• Distributed File System

– File System

– Server/Client System

– Sun’s Network File System (NFS)

• Clustered File System (CFS)

– Google File System (GFS)

– Hadoop Distributed File System (HDFS)

– HDFS Shell Commands

Cloud Computing
INFS3208/INFS7208

CRICOS code 00025BCRICOS code 00025B

• Apache Hadoop was proposed in 2010 as a

collection of open-source software utilities to

deal with big data problem.

• The core of Apache Hadoop consists of a

storage part, known as Hadoop Distributed File

System (HDFS), and a processing part which is

a MapReduce programming model.

– Hadoop splits files into large blocks and

distributes them across nodes in a cluster.

– It then transfers packaged code into nodes

– It takes advantage of data locality.

Hadoop Distributed File System

Shvachko, Konstantin, , , and . “The hadoop distributed file system.” In MSST, vol. 10, pp. 1-10. 2010

CRICOS code 00025BCRICOS code 00025B

• Many inexpensive commodity hardware and failures are very common

• Many big files: millions of files, ranging from MBs to GBs

• Two types of reads

– Large streaming reads

– Small random reads

• Once written, files are seldom modified

– Random writes are supported but do not have to be efficient

• High sustained bandwidth is more important than low latency

Design Motivations

CRICOS code 00025BCRICOS code 00025B

• Equivilent to Master Node in GFS

• represents files and directories on the NameNode

• records attributes like permissions, modification and

access times, namespace and disk space quotas.

• maintains the namespace tree and the mapping of

file blocks to DataNodes

• When writing data, NameNode nominates a suite of

three DataNodes to host the block replicas.

• The client then writes data to the DataNodes in a

pipeline fashion.

• keeps meta-data in RAM

HDFS Component – NameNode

CRICOS code 00025BCRICOS code 00025B

• fsimage:

• contains the entire filesystem namespace at
the latest checkpoint

– Blocks information of the file (location,
timestamp, etc.)

– Folder information (ownership, access, etc.)

• stored as an image file in the NameNode’s
local file system.

• editlog:

• contains all the recent modifications made to

the file system on the most recent fsImage.

• create/update/delete requests from the client

HDFS Component – Meta-Data

CRICOS code 00025BCRICOS code 00025B

Checkpoint Node (Secondary NameNode)

• “Secondary” does not mean “2nd” NameNode tha

程序代写 CS代考加微信: powcoder QQ: 1823890830 Email: powcoder@163.com

Related Posts