CRICOS code 00025B
• Database Background
• Relational Databases
– Revisit Relational DBs
– ACID Properties *
– Clustered RDBMSs
• Non-relational Databases
– NoSQL concepts
– CAP Theorem ***
– Cassandra
Re-cap – Lecture 7
Cloud Computing
INFS3208/INFS7208
• Background of Distributed File Systems
– Big Data, Data Centre Technology, and Storage Hardware
• Distributed File System
– File System
– Server/Client System
– Sun’s Network File System (NFS)
• Clustered File System (CFS)
– Google File System (GFS)
– Hadoop Distributed File System (HDFS)
– HDFS Shell Commands
[Figure: big data sources, contrasting structured, repeatable data (traditional sources such as ERP, OLTP systems, transactions, internal apps) with unstructured, exploratory data (multimedia, social data, text data, sensor data).]
Issues of huge volume of data generated daily!!
We are facing PBs (petabytes) of data
https://www.simplilearn.com/data-science-vs-big-data-vs-data-analytics-article
Measuring the Size of Data
• Byte: 8 bits
• Kilobyte (KB): 1,024 bytes = 2^10; approx. 1,000 or 10^3
– 2 Kilobytes: typewritten page
• Megabyte (MB): 1,048,576 bytes = 2^20; approx. 1,000,000 or 10^6
– 5 Megabytes: complete works of Shakespeare
• Gigabyte (GB): 1,073,741,824 bytes = 2^30; approx. 1,000,000,000 or 10^9
– 20 Gigabytes: audio collection of the works of Beethoven
• Terabyte (TB): 1,099,511,627,776 bytes = 2^40; approx. 1,000,000,000,000 or 10^12
– 10 Terabytes: printed collection of the U.S. Library of Congress, with 130 million items on about 530 miles of bookshelves, including 29 million books, 2.7 million recordings, 12 million photographs, 4.8 million maps, and 58 million manuscripts
• Petabyte (PB): 1,125,899,906,842,624 bytes = 2^50; approx. 1,000,000,000,000,000 or 10^15
– 2 Petabytes: all U.S. academic research libraries
• Exabyte (EB): 1,152,921,504,606,846,976 bytes = 2^60; approx. 1,000,000,000,000,000,000 or 10^18
– 5 Exabytes: all words ever spoken by human beings
• Zettabyte (ZB): 1,180,591,620,717,411,303,424 bytes = 2^70; approx. 1,000,000,000,000,000,000,000 or 10^21
• Yottabyte (YB): 1,208,925,819,614,629,174,706,176 bytes = 2^80; approx. 1,000,000,000,000,000,000,000,000 or 10^24
UQ INFS1200/INFS7900 Week 1 Lecture Notes
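The relationship between the binary sizes (powers of 2^10) and their decimal approximations (powers of 10^3) can be checked with a few lines of Python. This is only an illustrative sketch: the names `UNITS`, `unit_sizes`, and `relative_gap` are made up for this example.

```python
# Binary size (2^(10k)) versus decimal approximation (10^(3k)) of each unit.
UNITS = ["Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]


def unit_sizes():
    """Return (name, binary_bytes, decimal_bytes) for each unit."""
    return [(name, 2 ** (10 * k), 10 ** (3 * k))
            for k, name in enumerate(UNITS, start=1)]


def relative_gap(binary_bytes, decimal_bytes):
    """Fraction by which the binary value exceeds the decimal approximation."""
    return (binary_bytes - decimal_bytes) / decimal_bytes


if __name__ == "__main__":
    for name, binary, decimal in unit_sizes():
        gap = relative_gap(binary, decimal)
        print(f"{name:<10} {binary:>33,} bytes  (+{gap:.1%} over the decimal value)")
```

The gap between the two conventions grows with each unit, which is why "approx." becomes less accurate for the larger sizes.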
[Figure: 549,755,813,888 × floppy disk ≈ 43,980,465 km, about 114.413 times the 384,402 km between the earth and the moon. Image source: https://en.wikipedia.org/wiki/Floppy_disk]
How much data are we generating?
• By 2020, it’s estimated that 1.7MB of data will be
created every second for every person on earth and
there will be around 40 trillion gigabytes of data (40
zettabytes).
• 90% of all data has been created in the last two years.
• In 2018, internet users spent 2.8 million years online.
• Social media accounts for 33% of the total time spent online.
• Today it would take a person approximately 181 million
years to download all the data from the internet.
• 97.2% of organizations are investing in big data and AI
• Using big data, Netflix saves $1 billion per year on
customer retention.
Big Data Statistics
https://www.socialmediatoday.com/news/how-much-data-is-generated-every-minute-infographic-1/525692/
https://techjury.net/stats-about/big-data-statistics/
A data centre is a specialised IT infrastructure that houses centralised IT resources
• Servers (rack in cabinet);
• Databases and software systems;
• Networking and telecommunication devices.
Typical technologies and components
• Virtualisation
• Standardisation and Modularity
• Remote Operation and Management
• High Availability
• Security-Aware Design, Operation and Management
• Facilities
Data Centre Technology
Hardware: Array of Hard Disks
https://www.wikihow.com/Recover-a-Dead-Hard-Disk http://www.comnetdrc.com/
Hardware: Network Switch
http://exd-int.com/my-product/space-mind/
Network Switch Can Route 15 Terabits Per Second
Google Data Centre
• A file system is an abstraction that enables users to manipulate
and organize data.
• Typically, a file system (FS) is organized as a hierarchical tree of files and directories.
• An FS provides a uniform view, independent of the underlying
storage devices: floppy/optical drives, hard drives, USB
sticks, etc.
• The connection between the logical file system and the
storage device was typically a one-to-one mapping.
• Examples:
– Windows: NTFS, FAT32, FAT
– macOS: Apple File System (APFS)
– Linux: Ext4, etc.
What is a File System?
• DFS is a distributed implementation of the classical time-sharing model
of a file system, where multiple users share files and storage
resources.
• A distributed file system spreads over multiple, autonomous
computers.
• A distributed file system should have the following characteristics:
– Access transparency
– Location transparency
– Concurrency transparency
– Failure transparency
– Replication transparency
– Migration transparency
– Heterogeneity
– Scalability
What is a Distributed File System (DFS)?
https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems
• In general, files in a DFS can be located in “any” system.
– Servers: data holders (the “source(s)” of files)
– Clients: data users who are accessing the servers.
• Potentially, a server for a file can become a client for another file.
• However, most distributed systems distinguish between clients and servers
more strictly:
– Clients simply access files and do not have/share local files.
– Even if clients have disks, they (disks) are used for swapping, caching,
loading the OS, etc.
– Servers are the actual sources of files.
– In most cases, servers are more powerful machines (in terms of CPU,
physical memory, disk bandwidth, etc.)
What is a Distributed File System (DFS)?
https://goo.gl/images/Nwq6tV https://www.slideshare.net/AnamikaSingh211/distributed-file-system-72294718
• Network File System (NFS) is a distributed file system protocol originally developed by Sun
Microsystems in 1984.
• NFS allows a user on a client computer to access files over a computer network much like
local storage is accessed.
• NFS is a client-server application, where a user can view, store and update the files on a
remote computer.
• NFS builds on the Remote Procedure Call (RPC) system to route requests between clients
and servers.
• NFS protocol is designed to be independent of computer, operating system, network
architecture, and transport protocol.
• NFS allows the user or system administrator to mount a complete or partial file system on a
• The portion of the file system that is mounted can be accessed by clients with different
privileges (e.g. read-only or read-write)
Sun Network File System (NFS)
Advantages:
• easy sharing of data across clients
• centralized administration (backup done on a few servers instead of many clients)
• security (put server behind firewall)
Sun Network File System (NFS)
Network Server
http://pages.cs.wisc.edu/~remzi/OSTEP/dist-nfs.pdf
• Each file server presents a standard view of its local file system
• Transparent access to remote files
• Compatibility with multiple operating systems and
platforms.
• Easy crash recovery and backup at server
• A software component called the Virtual File System (VFS)
is available in most operating systems (OSs) as the
interface to different local and distributed file systems.
• The VFS acts as an interface between the system-call layer
and all files on network nodes.
• The user interface to NFS is the same as the interface to
local file systems. The calls go to the VFS layer, which
passes them either to a local file system or to NFS.
Sun Network File System (NFS)
http://pages.cs.wisc.edu/~remzi/OSTEP/dist-nfs.pdf
Making remote files appear local to the client
On Windows, remote file sharing is typically provided by SMB (Server Message
Block), one dialect of which is CIFS (Common Internet File System).
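The routing role of the VFS layer can be illustrated with a small in-memory sketch. Everything here is hypothetical: `LocalFS`, `NFSClient`, and `VFS` are invented stand-ins, and the "remote server" is just another in-memory object where a real NFS client would issue RPCs.

```python
class LocalFS:
    """Stand-in for a local file system, backed by a dict."""

    def __init__(self):
        self.files = {}

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data


class NFSClient:
    """Stand-in for an NFS client; a real one would send RPCs to the server."""

    def __init__(self, server):
        self.server = server  # assumption: another LocalFS plays the remote server

    def read(self, path):
        return self.server.read(path)

    def write(self, path, data):
        self.server.write(path, data)


class VFS:
    """Routes each call to the file system mounted at the longest matching prefix."""

    def __init__(self):
        self.mounts = {}  # mount point -> file system object

    def mount(self, mount_point, fs):
        self.mounts[mount_point.rstrip("/") or "/"] = fs

    def _resolve(self, path):
        matches = [m for m in self.mounts
                   if path == m or path.startswith(m.rstrip("/") + "/")]
        if not matches:
            raise FileNotFoundError(path)
        best = max(matches, key=len)
        relative = "/" + path[len(best):].lstrip("/")
        return self.mounts[best], relative

    def read(self, path):
        fs, relative = self._resolve(path)
        return fs.read(relative)

    def write(self, path, data):
        fs, relative = self._resolve(path)
        fs.write(relative, data)
```

The caller uses the same `read` and `write` calls for `/home/...` and `/mnt/nfs/...`; only the mount table decides whether a request stays local or goes to the remote side.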
• A big hard drive (30 TB) on a Windows server:
• The drive can be shared by users on Linux
A Real Example of NFS
• File Sharing
A Real Example of NFS
Linux Server 1 Linux Server 2 Linux Server 7
Experimental data
Experimental programs
• For larger-scale data storage, a cluster of thousands of data servers is required.
• Scalability and availability should be provided for big data storage.
• Resiliency and load balancing are also essential.
Why use a Clustered File System (CFS)?
• Clustered File System (CFS) is not a single server with a set of clients, but instead a cluster of servers
that all work together to provide high performance service to their clients.
• To the clients of CFS, the cluster is transparent.
• A CFS can organize storage and access data across all servers in the cluster.
What is a Clustered File System (CFS)?
https://gitlab.com/arm-hpc/packages/wikis/packages/BeeGFS
The difference between
(a) distributing whole files across several servers and
(b) striping files for parallel access.
CFS is Managed at a Block Level
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved. 0-13-239227-5
File A can be divided into blocks
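The two layouts, (a) whole files per server and (b) striped blocks, can be contrasted in a short sketch. The round-robin assignment and the function names are assumptions made for illustration; real clustered file systems use more elaborate placement policies.

```python
def distribute_whole_files(file_names, servers):
    """(a) Each file lives entirely on one server, chosen round-robin."""
    return {name: servers[i % len(servers)]
            for i, name in enumerate(sorted(file_names))}


def stripe_file(data, block_size, servers):
    """(b) Cut one file into blocks and spread them round-robin over the servers."""
    layout = {server: [] for server in servers}
    for index, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        layout[servers[index % len(servers)]].append((index, block))
    return layout


def reassemble(layout):
    """Collect the blocks from every server and join them in index order."""
    blocks = sorted(pair for held in layout.values() for pair in held)
    return b"".join(block for _, block in blocks)
```

With striping, a client can fetch the blocks of one large file from several servers at the same time, which is the parallel-access benefit the figure refers to.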
Distributed File System (DFS): A file system that has its
components spread across multiple systems. By this definition,
NFS is inherently a distributed file system as well: the client
component is on a different system than the underlying physical
storage and its management.
https://www.quora.com/What-is-the-difference-between-a-distributed-file-system-clustered-file-system-and-a-network-file-system
Client-Server Architecture Cluster Server Architecture
What are the differences among DFS, NFS, CFS?
Network File System (NFS): Files
are not local. They are served over a
network, with the physical storage
units and their management hosted
by a different entity.
Clustered File System (CFS): It is
built by pooling several different
discrete components, typically multiple
servers, multiple disks, working
together to provide a unified
namespace. A client is not aware of
the physical boundaries that make up
the file system.
• Google File System (GFS or GoogleFS) is a
scalable distributed file system developed by Google to
provide efficient, reliable access to data using large clusters
of commodity hardware.
• GFS shares many of the same goals as previous distributed file
systems, such as performance, scalability, reliability, and
availability.
• GFS was used internally at Google and became the basis for the
Hadoop Distributed File System (HDFS).
Google File System
https://en.wikipedia.org/wiki/Google_File_System#/media/File:GoogleFileSystemGFS.svg
• Thousands of commodity computers
– cheap hardware but distributed
• Component failures are the norm rather than the exception
– System is built from many inexpensive commodity components
• Modest number of huge files
– A few million files, each typically 100MB or larger (Multi-GB files are common)
– No need to optimize for small files
• Workloads: two kinds of reads, plus large sequential appends
– Large streaming reads (1MB or more) and small random reads (a few KBs)
– Sequential appends to files by hundreds of data producers
• High sustained throughput is more important than latency
– Response time for individual read and write is not critical
Design Assumptions
• Files stored as chunks
– With a fixed size of 64MB each.
– With a 64-bit ID each.
• Reliability through replication
– Each chunk is replicated across 3 (by default) or more
chunk servers
– Replication number can be manually set
• Single Master
– Centralized management
– Only store meta-data
GFS Design Overview
A file with four chunks
Replicas=2
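A minimal sketch of the chunking arithmetic follows. The 64 MB chunk size and the default of 3 replicas come from the overview above; the placement rule (consecutive chunk servers) and all function names are assumptions for illustration and are not GFS's actual placement policy.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed chunk size of 64 MB


def chunk_count(file_size):
    """Number of chunks needed for a file of file_size bytes (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)


def locate(offset):
    """Translate a byte offset into (chunk index, offset inside that chunk)."""
    return divmod(offset, CHUNK_SIZE)


def place_replicas(num_chunks, servers, replicas=3):
    """Hypothetical placement: replicas of chunk i go to consecutive chunk servers."""
    if replicas > len(servers):
        raise ValueError("not enough chunk servers for the requested replication")
    return {i: [servers[(i + r) % len(servers)] for r in range(replicas)]
            for i in range(num_chunks)}
```

For example, a 200 MB file needs four chunks, and byte offset 130 MB falls 2 MB into chunk 2. This offset-to-chunk translation is what lets a client ask the master about a single chunk rather than the whole file.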
GFS Architecture – Read
1. The client asks the master which chunk server holds the current lease for the chunk and
the locations of the other replicas.
2. The master replies with the identity of the primary and the locations of the secondary replicas.
3. The client pushes the data to all replicas.
4. Once all replicas have acknowledged receiving the data, the client sends a write
request to the primary. The primary assigns consecutive serial numbers to all the
mutations it receives, providing serialization. It applies mutations in serial
number order.
5. The primary forwards the write request to all secondary replicas. They apply
mutations in the same serial number order.
6. The secondary replicas reply to the primary, indicating that they have completed the operation.
7. The primary replies to the client with a success or error message.
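Steps 3 to 7 of the write path can be sketched as a small in-memory simulation. The master lookup (steps 1 and 2) is left out: the caller is assumed to already know the primary. The class and method names are hypothetical, and failure handling is reduced to a success/error string.

```python
class Replica:
    """A chunk replica that buffers pushed data and applies mutations in order."""

    def __init__(self, name):
        self.name = name
        self.buffer = {}   # data pushed by clients, keyed by a data id (step 3)
        self.applied = []  # (serial number, data) in the order applied

    def push(self, data_id, data):
        self.buffer[data_id] = data
        return True  # acknowledgement back to the client

    def apply(self, serial, data_id):
        self.applied.append((serial, self.buffer.pop(data_id)))
        return True


class Primary(Replica):
    """The replica holding the lease; it decides the mutation order."""

    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        serial = self.next_serial  # step 4: assign the next serial number
        self.next_serial += 1
        self.apply(serial, data_id)
        # Steps 5 and 6: forward to the secondaries and collect their replies.
        ok = all(s.apply(serial, data_id) for s in self.secondaries)
        return "success" if ok else "error"  # step 7


def client_write(primary, data_id, data):
    """Push data to every replica (step 3), then ask the primary to commit it."""
    replicas = [primary] + primary.secondaries
    if not all(r.push(data_id, data) for r in replicas):
        return "error"
    return primary.write(data_id)
```

Because only the primary hands out serial numbers, every replica ends up with the same mutation order even when several clients write concurrently.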
GFS Architecture – Write
• Master maintains all system metadata
– Name space, access control info, file to chunk mappings, chunk locations, etc.
• Periodically communicates with chunk servers
– Through HeartBeat messages
• Advantages:
– Simplifies the design
• Disadvantages:
– Single point of failure
• Solution
– Replication of Master state on multiple machines
– Operation log and checkpoints are replicated on multiple machines
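How a master might use HeartBeat messages to keep track of live chunk servers can be sketched as follows. The 30-second timeout, the `Master` class, and its methods are assumptions for illustration only; timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class Master:
    """Tracks which chunk servers are alive and which chunks they report."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout     # assumed: seconds without a heartbeat before a server counts as dead
        self.last_seen = {}        # chunk server -> time of its last heartbeat
        self.chunk_locations = {}  # chunk id -> set of chunk servers holding it

    def heartbeat(self, server, chunk_ids, now):
        """Record a heartbeat and replace the server's previous chunk report."""
        self.last_seen[server] = now
        for holders in self.chunk_locations.values():
            holders.discard(server)
        for chunk_id in chunk_ids:
            self.chunk_locations.setdefault(chunk_id, set()).add(server)

    def live_servers(self, now):
        return {s for s, seen in self.last_seen.items()
                if now - seen <= self.timeout}

    def locations(self, chunk_id, now):
        """Replica locations for a chunk, restricted to servers still alive."""
        return self.chunk_locations.get(chunk_id, set()) & self.live_servers(now)
```

A chunk whose live replica count drops below the replication target would then be a candidate for re-replication, which this sketch does not implement.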
GFS Components – Master
• Fixed size of 64 MB (vs. the 4 KB default cluster size for NTFS)
• Advantages
– Size of meta data is reduced
– Involvement of Master is reduced
– Network overhead is reduced
– Lazy space allocation avoids internal fragmentation
• Disadvantages
– Hot spots
A small file consists of a small number of chunks, perhaps just one. The chunkservers
storing those chunks may become hot spots if many clients are accessing the same file.
Solutions: increase the replication factor and stagger application start times; allow clients to
read data from other clients
GFS Components – Chunks
https://support.microsoft.com/en-us/help/140365/default-cluster-size-for-ntfs-fat-and-exfat
Chunk numbers
Chunk size
• Three major types of metadata
– The file and chunk namespaces
– The mapping from files to chunks
– Locations of each chunk’s replicas
• All the metadata is kept in the Master’s memory
• Each 64 MB chunk requires less than 64 bytes of metadata
• Chunk locations
– Chunk servers keep track of their chunks and relay data to Master through HeartBeat
• Master “operation log”
– Consists of namespaces and file to chunk mappings
– Replicated on remote machines
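A quick calculation shows why keeping all chunk metadata in the Master's memory is feasible. This sketch uses the 64-bytes-per-chunk figure quoted above as an upper bound and ignores namespace metadata; the function name is made up for the example.

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB per chunk
METADATA_PER_CHUNK = 64   # bytes of metadata per chunk (figure quoted above)


def master_metadata_bytes(stored_bytes, unit_size=CHUNK_SIZE):
    """Metadata needed to track stored_bytes of data split into unit_size pieces."""
    units = -(-stored_bytes // unit_size)  # ceiling division
    return units * METADATA_PER_CHUNK


if __name__ == "__main__":
    petabyte = 2**50
    with_chunks = master_metadata_bytes(petabyte)
    with_4kb = master_metadata_bytes(petabyte, unit_size=4 * 2**10)
    print(f"64 MB chunks: {with_chunks / 2**30:.0f} GB of metadata per PB")
    print(f"4 KB blocks:  {with_4kb / 2**40:.0f} TB of metadata per PB")
```

Under these assumptions, 1 PB of data needs about 1 GB of chunk metadata with 64 MB chunks, versus 16 TB if the same data were tracked in 4 KB units. This is the "size of meta data is reduced" advantage of large chunks.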
GFS Components – Metadata and Operational Log
• Apache Hadoop, first released in 2006, is a
collection of open-source software utilities for
dealing with big data problems.
• The core of Apache Hadoop consists of a
storage part, known as Hadoop Distributed File
System (HDFS), and a processing part which is
a MapReduce programming model.
– Hadoop splits files into large blocks and
distributes them across nodes in a cluster.
– It then transfers packaged code to the nodes to process the data in parallel.
– It takes advantage of data locality.
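Data locality means running a task on a node that already stores the block it needs, so the block does not have to cross the network. The greedy scheduler below is a hypothetical sketch of that idea, not Hadoop's actual scheduling algorithm; `schedule`, `block_locations`, and `free_slots` are invented names.

```python
def schedule(block_locations, free_slots):
    """Assign one task per block, preferring a node that holds the block.

    block_locations: block id -> list of nodes storing a replica
    free_slots:      node -> number of tasks it can still accept
    Returns block id -> (chosen node, True if the task is data-local).
    """
    slots = dict(free_slots)  # work on a copy so the input is not modified
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        candidates = local or [n for n, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError("no free task slots left in the cluster")
        node = candidates[0]
        slots[node] -= 1
        assignment[block] = (node, node in holders)
    return assignment
```

When every node holding a block is busy, the task falls back to another node and the block has to be read remotely, which is the case data-local scheduling tries to minimise.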
Hadoop Distributed File System
Shvachko, Konstantin, Hairong Kuang, Sanjay Radia, and Robert Chansler. “The Hadoop Distributed File System.” In MSST, vol. 10, pp. 1-10. 2010
• Many inexpensive commodity machines, so failures are very common
• Many big files: millions of files, ranging from MBs to GBs
• Two types of reads
– Large streaming reads
– Small random reads
• Once written, files are seldom modified
– Random writes are supported but do not have to be efficient
• High sustained bandwidth is more important than low latency
Design Motivations
• Equivalent to the Master node in GFS
• Represents files and directories, recording
attributes such as permissions, modification and
access times, and namespace and disk space quotas
• Maintains the namespace tree and the mapping of
file blocks to DataNodes
• When writing data, the NameNode nominates a suite of
three DataNodes to host the block replicas
• The client then writes data to the DataNodes in a
pipeline fashion
• Keeps meta-data in RAM
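The pipelined write can be sketched with two small classes. The round-robin choice of DataNodes is an assumption made to keep the example short (HDFS uses its own placement policy), and `DataNode`, `NameNode`, and `client_write` here are illustrative stand-ins rather than the real HDFS API.

```python
class DataNode:
    """Stores blocks and forwards each one to the next node in the pipeline."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, data, downstream):
        self.blocks[block_id] = data
        if downstream:
            rest = downstream[0].receive(block_id, data, downstream[1:])
            return [self.name] + rest
        return [self.name]


class NameNode:
    """Keeps the block-to-DataNode mapping in memory and nominates replica hosts."""

    def __init__(self, datanodes, replication=3):
        if replication > len(datanodes):
            raise ValueError("not enough DataNodes for the requested replication")
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}  # block id -> names of DataNodes hosting replicas
        self._next = 0

    def allocate(self, block_id):
        """Nominate DataNodes for a new block (round-robin; an assumption)."""
        n = len(self.datanodes)
        chosen = [self.datanodes[(self._next + i) % n]
                  for i in range(self.replication)]
        self._next = (self._next + 1) % n
        self.block_map[block_id] = [d.name for d in chosen]
        return chosen


def client_write(namenode, block_id, data):
    """Ask the NameNode for targets, then send the block down the pipeline."""
    pipeline = namenode.allocate(block_id)
    return pipeline[0].receive(block_id, data, pipeline[1:])
```

The client sends the block only to the first DataNode; each node stores it and passes it on, so the client's outgoing bandwidth is used once per block rather than once per replica.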
HDFS Component – NameNode
• fsimage:
– contains the entire file system namespace at
the latest checkpoint: block information for each file (location,
timestamp, etc.) and folder information (ownership, access, etc.)
– stored as an image file in the NameNode’s
local file system
• editlog:
– contains all modifications made to
the file system since the most recent fsimage
– records create/update/delete requests from the client
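The interplay of the two structures can be sketched as "checkpoint plus replay". The tuple format of the log entries and the function names are assumptions for illustration; the real fsimage and editlog are binary files with their own formats.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged request, e.g. ("create", "/a", {...}) or ("delete", "/a")."""
    op, path, *rest = edit
    if op in ("create", "update"):
        namespace[path] = rest[0]
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def load_namespace(fsimage, editlog):
    """Rebuild the current namespace: start from the checkpoint, replay the edits."""
    namespace = copy.deepcopy(fsimage)
    for edit in editlog:
        apply_edit(namespace, edit)
    return namespace


def checkpoint(fsimage, editlog):
    """Merge the edit log into a new fsimage and return it with an empty log."""
    return load_namespace(fsimage, editlog), []
```

Replaying a long log on every restart is slow, which is the motivation for periodically merging the log into a fresh fsimage.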
HDFS Component – Meta-Data
Checkpoint Node (Secondary NameNode)
• “Secondary” does not mean “2nd” NameNode tha