Recap – Course Structure
This course includes 13 lectures and 10 tutorial/practical sessions
Lecture 1: Introduction
Lecture 2: Adv. topics & appl.
Lecture 3: Networks & Load Balancing
Concepts / Orchestration / Storage / Computation / Others
Programming and Linux experience required!
Lecture 4: VT: Docker I
Lecture 5: VT: Docker II
Lecture 6: VT: Docker III
Lecture 7: DBs in Cloud Computing
Lecture 8: DFS
Lecture 9: Hadoop & MapReduce
Lecture 10
Lecture 11
I
Lecture 12: Security & Privacy
Lecture 13: Revision
CRICOS code 00025B
Outline
• Database Background
• Relational Databases
– Revisit Relational DBs
– ACID Properties
– Clustered RDBMSs
• Non-relational Databases
– NoSQL concepts & CAP Theorem
– MongoDB
– Cassandra
– HBase
What is a database?
“A set of information held in a computer” (Oxford English Dictionary)
“One or more large structured sets of persistent data, usually associated with software to update and query the data” (Free On-Line Dictionary of Computing)
“A collection of data arranged for ease and speed of search and retrieval” (Dictionary.com)
A database is a shared, integrated computer structure that stores a collection of the following:
• End-user data, that is, raw facts of interest to the end user
• Metadata, or data about data, through which the end-user data is integrated and managed
1. Coronel, Carlos, and . Database systems: design, implementation, & management. Cengage Learning, 2016.
Why Database systems?
Program–data independence
• The structure of the database can be changed without modifying the application programs.
Sharing of data, controlled redundancy
• Data can be shared by authorized users.
• Each data item is ideally recorded in only one place in the database, so it is not needlessly duplicated.
Concurrent, multi-user access to large volumes of data
• Many authorized users can simultaneously access the same piece of information.
• Remote users can also access the same data.
• Uncontrolled concurrent access to data in a file system can lead to incorrect data.
Back-up and recovery for reliability
• Most DBMSs provide back-up and recovery subsystems that automatically create back-ups of the data and restore it if required.
• In a file-based system, back-up and recovery are often inefficient.
Why Database systems?
Uniform, logical data model for representing data
• The logical data model represents the conceptual data model (including attributes, names, relationships, and other metadata). A logical data model is often expressed in UML.
Rich, standard language for querying data
• SQL is the standard language; vendors add procedural extensions such as PL/SQL (Oracle), PL/pgSQL (PostgreSQL), and Transact-SQL or T-SQL (Microsoft SQL Server).
Effective optimizations for efficient querying
• The DBMS can optimize queries so that they execute efficiently.
Ensure data integrity within single applications
• Integrity constraints or consistency rules can be applied to the database so that only valid data can be entered.
Relational database
A relational database is a collection of data items organized as a set of formally-described tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables.
In [1], relational algebra for databases was proposed; it forms the foundation of modern relational databases.
The five primitive operators in [1] are selection, projection, Cartesian product, set union, and set difference.
The relational model was invented by E. F. Codd at IBM in 1970. Codd received the Turing Award in 1981.
[1] Codd, E. F. (1972). “Relational Completeness of Data Base Sublanguages”
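The five primitive operators can be illustrated with a small sketch. It assumes relations are modelled as Python sets of tuples, and the student and enrolment data are hypothetical; it shows the operators themselves, not how a DBMS implements them.

```python
from itertools import product

def selection(rows, predicate):
    """Keep only the tuples that satisfy the predicate."""
    return {r for r in rows if predicate(r)}

def projection(rows, indexes):
    """Keep only the listed attribute positions (duplicates collapse)."""
    return {tuple(r[i] for i in indexes) for r in rows}

def cartesian_product(left, right):
    """Pair every tuple of left with every tuple of right."""
    return {l + r for l, r in product(left, right)}

# Hypothetical sample relations: (student_id, name) and (student_id, course)
students = {(1, "Ann"), (2, "Bob"), (3, "Cy")}
enrolments = {(1, "DB"), (2, "OS"), (2, "DB")}

all_ids = projection(students, [0])
db_ids = projection(selection(enrolments, lambda r: r[1] == "DB"), [0])
os_ids = projection(selection(enrolments, lambda r: r[1] == "OS"), [0])

db_or_os = db_ids | os_ids      # set union
not_in_db = all_ids - db_ids    # set difference

# A join is a selection over a Cartesian product, followed by a projection.
pairs = cartesian_product(students, enrolments)
names_courses = projection(selection(pairs, lambda r: r[0] == r[2]), [1, 3])
```

Because relations are sets, duplicate tuples disappear automatically after a projection.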
Key concepts in Relational database
DBMS
• The database management system (DBMS) is the software that interacts with end users, applications, and the database itself to capture and analyze the data.
• The DBMS software additionally includes the core facilities provided to administer the database.
• Often the term “database” is also used to loosely refer to any of the DBMS, the database system or an application associated with the database.
Key concepts in Relational database
Some important database terms:
• Entities / Tables – entities represent items that we want to store data about (e.g., a student).
• Attributes / Fields / Column headings – attributes are the pieces of data that we want to store (e.g., the student's name).
• Relationships – relationships show how entities within the database are related (e.g., a student may be enrolled in a course, so the student ID is kept in the Enrolment table).
• Records / Rows – one or more logically connected fields, e.g., a student record consisting of name, student number and phone.
• Data / Data item – a raw fact, e.g., a student grade or phone number.
• Metadata – data about data; it describes the data and so enables program–data independence, e.g., the type of the data (number or text).
School of Information and Communication Technology
What is a database schema?
• The database schema of a database system is its structure described in a formal language supported by the database management system (DBMS).
• The term “schema” refers to the organization of data as a blueprint of how the database is constructed (divided into database tables in the case of relational databases).
https://en.wikipedia.org/wiki/Database_schema
https://stackoverflow.com/questions/45135485/creating-a-database-schema-from-the-movie-database
Types of SQL statements:
• Data Definition Language (DDL) – defines and modifies a schema, e.g., CREATE / DROP / ALTER TABLE; does not manipulate data.
• Data Manipulation Language (DML) – retrieves (SELECT), adds (INSERT), modifies (UPDATE) and removes (DELETE) data.
• Data Control Language (DCL) – grants (GRANT) and withdraws (REVOKE) access privileges.
• Transaction Control Language (TCL) – manages the changes made by DML statements by grouping statements into logical transactions, e.g., COMMIT and ROLLBACK.
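The statement types above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the student table is hypothetical, and SQLite is chosen only because it needs no server. SQLite has no GRANT/REVOKE, so DCL is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the schema (no data is manipulated here)
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# DML: add rows
cur.execute("INSERT INTO student (id, name) VALUES (?, ?)", (1, "Ann"))
cur.execute("INSERT INTO student (id, name) VALUES (?, ?)", (2, "Bob"))
conn.commit()      # TCL: COMMIT makes the inserts permanent

# DML: modify and remove rows inside a new transaction
cur.execute("UPDATE student SET name = ? WHERE id = ?", ("Robert", 2))
cur.execute("DELETE FROM student WHERE id = ?", (1,))
conn.rollback()    # TCL: ROLLBACK undoes the update and the delete

# DML: retrieve rows; the table is back in its committed state
rows = cur.execute("SELECT id, name FROM student ORDER BY id").fetchall()
```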
Key concepts in Relational database
No duplicate tuples (rows)
• By definition, sets do not contain duplicate elements, hence tuples (rows) are unique
• Ensured by entity integrity (primary key)
Tuples are unordered within a relation (table)
• By definition, sets are not ordered
• Hence tuples can only be accessed by content
No ordering of attributes within a tuple
• By definition, sets are not ordered
Key concepts in Relational database
Entity-Relationship Diagram (ERD):
• Entity: corresponds to a table in which we store data about a particular thing, e.g., Student
• Attribute: describes a characteristic of an entity, e.g., the attributes of the Employee entity are employee number, first name, last name, job title, etc.
• Relationship: illustrates an association (business rule) between two entities, usually named with a verb, e.g., “works in”
Example: Employee (Employee number, First name, Last name, Job title, Dept number) works in Department (Dept number, Dept name, Dept location)
Key concepts in Relational database
What is an actual business rule between Employee and Department entities?
• An employee works in a department, and
• A department can employ many employees.
Cardinality/connectivity specifies the maximum number of times an instance of an entity can be related to instances of a related entity (1:1, 1:M, and M:N).
Modality/participation specifies the minimum number of times an instance of an entity can be related to instances of a related entity (0 or 1).
Key concepts in Relational database
NORMALISATION – SIMPLY ‘COMMON SENSE’
• Converts a relation into progressively smaller relations (fewer attributes and tuples each) until an optimum level of decomposition is reached, where little or no data redundancy exists
• Normalisation is specific to the relational database implementation model (it makes extensive use of foreign keys (FKs) to connect relations)
Goals:
• Each table represents a single subject
• No data item is unnecessarily stored in more than one table, i.e., no data redundancy
• All non-key attributes in a table are dependent on the primary key
• Each table is free of insertion, update and deletion anomalies
• The objective of normalisation is to ensure that all tables are in at least 3NF
Key concepts in Relational database
Transaction (ACID) Properties
• Atomicity – all parts of a transaction must complete successfully; otherwise the transaction is aborted (never partially executed: done or not done)
• Consistency – a transaction takes the database from one consistent state to another; if it would leave the database inconsistent, it is rolled back to the previous consistent state
• Isolation – data used during one transaction cannot be used by a second transaction until the first is completed (relevant for multi-user systems)
• Durability – the result or effect of a committed transaction persists even in case of a system failure
All transaction properties work together to make sure that a database maintains data integrity and consistency for a single-user or a multi-user DBMS.
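Atomicity and consistency can be observed with a short sqlite3 sketch. The account table, the balances and the transfer helper are all hypothetical; the CHECK constraint stands in for a consistency rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts as one transaction (all or nothing)."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit above was rolled back too.
        return False

ok = transfer(conn, "alice", "bob", 30)        # succeeds
failed = transfer(conn, "alice", "bob", 500)   # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

The second transfer credits bob before the debit fails, yet bob's balance is unchanged afterwards: the transaction was never partially applied.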
Clustered RDBMSs
• RDBMSs are known for ACID guarantees and suit transactional workloads, but scalability is an issue.
• To scale horizontally while still supporting ACID transactions, traditional RDBMSs have been extended to provide high availability and high redundancy in a distributed computing environment.
• Products:
– MySQL NDB Cluster 8.0
– CockroachDB
– Amazon Aurora
• Pros: high availability (99.99%+), high throughput and low latency, etc.
• Cons: complex management/deployment/configuration, limited or no foreign-key support in some products, large memory and disk requirements.
What is NoSQL?
NoSQL – “Not only SQL”
• A class of database management systems (DBMS)
• Does not use SQL (or another relational query language) as its query language
• Distributed, fault-tolerant architecture
• No fixed schema (formally described structure)
• No joins (typical in databases operated with SQL)
• It is not a replacement for an RDBMS but complements it
Why NoSQL?
Modern Web applications:
• Support large numbers of concurrent users (tens of thousands, perhaps millions)
– Singles’ Day (11.11) / Boxing Day
• Deliver highly responsive experiences to a globally distributed base of users
– Need scalability (horizontally scalable)
• Be always available – no downtime
– Rare to see downtime/maintenance for YouTube or Amazon
• Handle semi-structured and unstructured data
– Twitter tweets / YouTube videos / Instagram photos
• Rapidly adapt to changing requirements with frequent updates and new features
– Cloud environment & cluster deployment
• The world has gone mobile, but mobile applications still need data storage.
Types of NoSQL Databases
Key-value stores
• Key is a string, e.g. “s123456”, “user109029”, or “19228872987494923”
• Value can be of different types: Double, Int, Array, List, etc.
• Example: Key: sn1 → Value: “Mike, 24, Brisbane, …”; Key: sn2 → Value: “Mary, 22, Sydney, …”
• Use cases: frequent I/O operations on a simple data model
– mobile apps, shopping carts, etc.
• Pros and Cons:
– Pros: scalable, flexible, high write performance
– Cons: not suitable for structured data; low performance for queries with conditions on values
• Representative Products:
– Redis, Riak, Memcached, etc.
https://www.kdnuggets.com/2016/06/top-nosql-database-engines.html
Types of NoSQL Databases
Document stores
• Similar to key-value stores, but allow nested values associated with each key
• Value is a document (nested values)
• Use cases:
– Document-oriented data, e.g. Tweets, Facebook posts, Amazon comments/feedback
– Semi-structured data, e.g. JSON files and XML files
• Pros and Cons:
– Pros: flexible data structure, simple, and support for multiple indexes
– Cons: no standard query language shared across products (in MongoDB, for example, a query looks like a function call)
• Representative products:
– MongoDB, CouchDB, etc.
Types of NoSQL Databases
Column stores
• Data are stored by columns rather than rows.
Types of NoSQL Databases
Column stores (e.g., HBase, Cassandra)
• Practical use of a column store versus a row store differs little in the relational DBMS world.
• Row-based systems are designed to efficiently return data for an entire row (a record).
– Example: return a user record, including name, age, gender, address, etc.
• Row-based systems are not efficient for column-wide operations on the whole table
– Example: find all customers aged between 20 and 30 in a table of 10M records.
– The age column is scanned row by row, and the other fields (name, gender, etc.) are read as well because they cannot be skipped.
– A big table needs multiple data blocks (or machines) to hold all the data, which means more disk operations.
• A columnar store can be accessed more efficiently for some particular operations (e.g. data aggregation)
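The difference between the two layouts can be sketched in memory. The three customer records below are hypothetical, and plain lists and dicts stand in for disk blocks; the point is only which values each layout has to touch.

```python
# Row layout: each record is stored together.
rows = [
    {"name": "Ann", "age": 24, "gender": "F"},
    {"name": "Bob", "age": 35, "gender": "M"},
    {"name": "Cy", "age": 29, "gender": "M"},
]

# Column layout: one list per column, with positions aligned across columns.
columns = {key: [r[key] for r in rows] for key in rows[0]}

def count_age_between_rows(rows, lo, hi):
    """Row store: every whole record is visited just to test one field."""
    return sum(1 for r in rows if lo <= r["age"] <= hi)

def count_age_between_columns(columns, lo, hi):
    """Column store: only the age column is read."""
    return sum(1 for age in columns["age"] if lo <= age <= hi)

def fetch_record(columns, i):
    """Column store: rebuilding one record touches every column."""
    return {key: values[i] for key, values in columns.items()}

row_count = count_age_between_rows(rows, 20, 30)
column_count = count_age_between_columns(columns, 20, 30)
second_record = fetch_record(columns, 1)
```

Both layouts give the same answers; the column layout reads less data for the age scan and more for a single-record fetch.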
Types of NoSQL Databases
Column stores
• With column storage, different columns or column families can be distributed over multiple data nodes.
• This offers very high performance and a highly scalable architecture.
• Use cases:
– Big Data processing, e.g. Google search (BigTable), Spotify (user recommendation), Facebook (messages)
• Pros and Cons:
– Pros: high performance in terms of query, scalability and distribution
– Cons: less efficient for incremental data loading and for queries against only a few rows
• Representative products:
– BigTable, HBase, Cassandra, etc.
Types of NoSQL Databases
Graph databases
• Nodes and relationships are the basis of graph databases.
– A node represents an entity, like a user, category, or a piece of data.
– A relationship represents how two nodes are associated, like friendship, works for, etc.
• Because connections are stored directly in the graph, the (time-consuming) search-match operation found in relational databases is avoided.
• Use cases:
– Network data, e.g. friendships on Twitter and Facebook
• Pros and Cons:
– Pros: model complex relationships and support graph-based algorithms
– Cons: complexity; typically used at smaller data scale
• Representative products:
– Neo4j, OrientDB, InfoGrid, etc.
Database Shards
• A database shard is a horizontal partition of data in a database.
• Each individual partition is referred to as a shard or database shard.
• Each shard is held on a separate database server instance, to spread the load of a large volume of data.
• Some data within a database remains present in all shards, but some appears only in a single shard.
• Each shard (or server) acts as the single source for this subset of data.
Example records:
R0: {id: 95, name: 'aa', tag: 'older'}
R1: {id: 302, name: 'bb'}
R2: {id: 759, name: 'aa'}
R3: {id: 607, name: 'dd', age: 18}
R4: {id: 904, name: 'ff'}
R5: {id: 246, name: 'gg'}
R6: {id: 148, name: 'ff'}
R7: {id: 533, name: 'kk'}
https://en.wikipedia.org/wiki/Shard_(database_architecture)
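A minimal sketch of how records R0 to R7 could be spread over shards. The routing rule (id modulo the number of shards) and the choice of three shards are assumptions made for illustration; real systems also use range-based or hashed shard keys.

```python
NUM_SHARDS = 3  # assumed number of database server instances

def shard_for(record_id):
    """Route a record to a shard by its id (assumed rule: id modulo shard count)."""
    return record_id % NUM_SHARDS

records = [
    {"id": 95, "name": "aa", "tag": "older"},
    {"id": 302, "name": "bb"},
    {"id": 759, "name": "aa"},
    {"id": 607, "name": "dd", "age": 18},
    {"id": 904, "name": "ff"},
    {"id": 246, "name": "gg"},
    {"id": 148, "name": "ff"},
    {"id": 533, "name": "kk"},
]

# Each shard is a separate store that holds only its own subset of the data.
shards = {i: {} for i in range(NUM_SHARDS)}
for rec in records:
    shards[shard_for(rec["id"])][rec["id"]] = rec

def lookup(record_id):
    """Only the one shard that owns the id is consulted."""
    return shards[shard_for(record_id)].get(record_id)

found = lookup(607)
missing = lookup(1)
```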
CAP Theorem
• The CAP theorem states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
– Consistency: Every read receives the most recent write or an error
– Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
– Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
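The trade-off can be sketched with two replicas and a simulated network partition. The ReplicatedStore class and its "CP" and "AP" modes are hypothetical and deliberately simplified; the sketch only shows that, once the replicas cannot communicate, a write must either be refused or leave the other replica stale.

```python
class ReplicatedStore:
    """Two replicas, N1 and N2, that both start with the value V0."""

    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP" (assumed labels)
        self.n1 = "V0"
        self.n2 = "V0"
        self.partitioned = False  # True when N1 cannot reach N2

    def write_to_n1(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent by giving up availability.
                raise RuntimeError("write rejected: N2 is unreachable")
            # Stay available by giving up consistency.
            self.n1 = value
            return
        self.n1 = value
        self.n2 = value  # replication succeeds when there is no partition

    def read_from_n2(self):
        return self.n2

def run(mode):
    store = ReplicatedStore(mode)
    store.partitioned = True
    try:
        store.write_to_n1("V1")
        accepted = True
    except RuntimeError:
        accepted = False
    return accepted, store.n1, store.read_from_n2()

cp_result = run("CP")
ap_result = run("AP")
```

In CP mode the write is refused and both replicas still agree; in AP mode the write is accepted and a read from N2 returns the old value.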
NoSQL Implementation Options
CA (Consistency & Availability)
➢ All clients always have the same view of the data
➢ Each client can always read and write
➢ The system may not tolerate failure and reconfiguration
AP (Availability & Partition Tolerance)
➢ Each client can always read and write
➢ The system works well despite physical partitions
➢ Clients may have inconsistent views of the data
CP (Consistency & Partition Tolerance)
➢ All clients always have the same view of the data
➢ The system works well despite physical partitions
➢ Clients sometimes may not be able to access data
Example of CAP
Proof in diagrams
[Figure: two nodes N1 and N2 both hold value V0. Client A writes V1 to N1 and client B reads from N2, shown in numbered steps 1 to 3, with and without a network partition between the nodes.]
http://www.julianbrowne.com/article/brewers-cap-theorem
CAP for DBs
You must always give something up: consistency, availability, or tolerance to failure and reconfiguration.
BASE: ACID Alternative
• An alternative to ACID (used for transactions in relational databases) is BASE: Basically Available, Soft state, Eventually consistent
• Basically Available:
– Parts of a distributed system may fail, but the system as a whole keeps working.
– E.g. online shopping on
• Soft state: the counterpart of hard state, which guarantees consistency and durability in an RDBMS; soft state tolerates short delays or outages.
• Eventually consistent:
– Strong consistency: the data must be consistent after every transaction.
– Eventual consistency: rather than requiring consistency after every transaction, it is enough for the distributed database to eventually reach a consistent state.
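Eventual consistency can be sketched with replicas that accept writes independently and reconcile later. The Replica class, the last-write-wins rule based on a timestamp, and the anti_entropy function are assumptions for illustration; real systems use a variety of reconciliation schemes.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        """Accept a write; the newest timestamp wins (assumed rule)."""
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def anti_entropy(replicas):
    """Background sync: every replica pushes its entries to every other one."""
    for src in replicas:
        for dst in replicas:
            if src is not dst:
                for key, (ts, value) in list(src.data.items()):
                    dst.write(key, value, ts)

a, b, c = Replica(), Replica(), Replica()
a.write("cart", ["book"], ts=1)          # accepted by replica a only
b.write("cart", ["book", "pen"], ts=2)   # later write accepted by replica b only

before = [r.read("cart") for r in (a, b, c)]   # replicas disagree (soft state)
anti_entropy([a, b, c])
after = [r.read("cart") for r in (a, b, c)]    # replicas have converged
```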
Examples of NoSQL databases
What is MongoDB?
• A document-oriented NoSQL database:
– written in C++ and cross-platform
– open-source, dfs-based, and horizontally scalable
– Schema-less: no Data Definition Language (DDL)
– So you can store hashes with any keys and values that you choose
– Keys are a basic data type but in reality are stored as strings
– A document identifier (_id) is created for each document; this field name is reserved by the system
– Uses the BSON format: based on JSON (JSON-like; B stands for Binary)
• Supports APIs (drivers) in many programming languages (10+)
– JavaScript, Python, Ruby, Perl, Java, Scala, C#, C++, Haskell, Erlang, etc.
https://en.wikipedia.org/wiki/Hash_function
MongoDB – Document Database
• A record in MongoDB is a document: a data structure composed of field and value pairs.
• MongoDB documents are similar to JSON objects.
• The values of fields may include other documents, arrays, and arrays of documents.
The advantages of using documents are:
• Documents (i.e. objects) correspond to native data types in many programming languages.
• Embedded documents and arrays reduce the need for expensive joins.
JSON Format Revisit
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate.
• Data is in name/value pairs
• A name/value pair consists of a field name (in double quotes), followed by a colon, followed by a value
Example: "name": "R2-D2"
• Data is separated by commas
Example: "name": "R2-D2", "race": "Droid"
• Curly braces hold objects
Example: {"name": "R2-D2", "race": "Droid", "affiliation": "rebels"}
• An array is stored in brackets []
Example: [ {"name": "R2-D2", "race": "Droid", "affiliation": "rebels"}, {"name": "Yoda", "affiliation": "rebels"} ]
https://www.json.org/
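The rules above can be checked with Python's standard json module. This is a small sketch using the array from the slide; it also shows that strict JSON rejects a field name that is not wrapped in double quotes.

```python
import json

text = (
    '[{"name": "R2-D2", "race": "Droid", "affiliation": "rebels"}, '
    '{"name": "Yoda", "affiliation": "rebels"}]'
)

characters = json.loads(text)                   # JSON array -> list of dicts
names = [c["name"] for c in characters]
races = [c.get("race") for c in characters]     # objects need not share fields

# Serialising and parsing again gives back the same structure.
round_trip = json.loads(json.dumps(characters))

# Strict JSON requires double-quoted names, so this text is rejected.
try:
    json.loads('{name: "R2-D2"}')
    unquoted_accepted = True
except json.JSONDecodeError:
    unquoted_accepted = False
```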
BSON format
• BSON is simply a binary-encoded serialization of JSON-like documents
• Zero or more key-value pairs are stored as a single entity
• Each entry consists of a field name, a data type, and a value
• Supports faster encoding and decoding techniques
• Suitable for data storage
• Lightweight, fast and traversable
Example document: {"hello": "world"}
https://en.wikipedia.org/wiki/BSON
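The layout of a BSON document can be seen by encoding {"hello": "world"} by hand. The helper encode_string_document is hypothetical and handles only string values (BSON type 0x02); a real application would use a driver's BSON library instead.

```python
import struct

def encode_string_document(doc):
    """Encode a flat dict of str -> str as BSON (string values only)."""
    body = b""
    for key, value in doc.items():
        encoded = value.encode("utf-8") + b"\x00"     # value is NUL-terminated
        body += b"\x02"                               # data type: UTF-8 string
        body += key.encode("utf-8") + b"\x00"         # field name (C string)
        body += struct.pack("<i", len(encoded))       # value length, little-endian int32
        body += encoded
    total = 4 + len(body) + 1                         # length prefix + entries + terminator
    return struct.pack("<i", total) + body + b"\x00"

encoded = encode_string_document({"hello": "world"})
```

Each entry carries a data type, a field name and a value, and the document starts with its total size in bytes (22 here).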
Schema Free
• MongoDB does not need any pre-defined data schema
• Every document in a collection can have different data
{name: "will", eyes: "blue", birthplace: "NY", aliases: ["bill", "la ciacco"], loc: [32.7, 63.4], boss: "ben"}
{name: "jeff", eyes: "blue", loc: [40.7, 73.4], boss: "ben"}
{name: "ben", hat: "yes"}
{name: "brendan", aliases: ["el diablo"]}
{name: "matt", pizza: "DiGiorno", height: 72, loc: [44.6, 71.3]}
MongoDB: Hierarchical Objects
• A MongoDB instance may have zero or more ‘databases’.
• A database may have zero or more ‘collections’.
• A collection may have zero or more ‘documents’.
• A document may have one or more ‘fields’.
• MongoDB ‘indexes’ function much like their RDBMS counterparts.
Note:
• A collection is not strict about what it stores
• A document can be embedded
RDBMS | MongoDB
Database | Database
Table, View | Collection
Row | Document (BSON)
Column | Field
Index | Index
Join | Embedded Document
Foreign Key | Reference
Partition | Shard
Key Features of MongoDB
• High Performance
– Support for embedded data models reduces I/O activity on the database system.
– Indexes support faster queries and can include keys from embedded documents and arrays.
• Rich Query Language
– Supports CRUD, data aggregation, text search and geospatial queries.
• High Availability
– Supports automatic failover and data redundancy.
• Horizontal Scalability
• Multiple Storage Engines
– WiredTiger Storage Engine (default)
– In-Memory Storage Engine
CRUD operations
• Create
– db.collection.insertOne()
– db.collection.insertMany()
• Read
– db.collection.find()
– db.collection.findOne()
• Update
– db.collection.update()
• Delete
– db.collection.remove()
Here, collection is the name of the collection (the ‘table’) in which the documents are stored.
Create Operations
Create or insert operations add new documents to a collection.
• MongoDB provides the following methods to insert documents into a collection: db.collection.insertOne() and db.collection.insertMany()
• In MongoDB, insert operations target a single collection.
• A document can be inserted without an “_id” field, in which case MongoDB generates one, or with an explicit “_id”.
• All write operations in MongoDB are atomic on the level of a single document.
Read Operations
• Read operations retrieve documents from a collection, i.e. they query a collection for documents.
• MongoDB provides the following method to read documents from a collection: db.collection.find()
– Find all documents in a collection: db.bios.find() => select * from bios; (SQL-like)
– Query for equality: db.bios.find({_id: 5}) => select * from bios where _id=5;
– Query for embedded item
– Query using operators ($)
– Query for ranges
– Query for multiple conditions
https://docs.mongodb.org/manual/reference/operator/query/
Read Operations – Query Operators
Name – Description
$eq – Matches values that are equal to a specified value
$gt, $gte – Matches values that are greater than (or equal to) a specified value
$lt, $lte – Matches values that are less than (or equal to) a specified value
$ne – Matches values that are not equal to a specified value
$in – Matches any of the values specified in an array
$nin – Matches none of the values specified in an array
$or – Joins query clauses with a logical OR
$and – Joins query clauses with a logical AND
$not – Inverts the effect of a query expression
$nor – Joins query clauses with a logical NOR
$exists – Matches documents that have a specified field
https://docs.mongodb.org/manual/reference/operator/query/
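How these operators select documents can be sketched in plain Python. The matches and find functions below are hypothetical stand-ins, not MongoDB code: they cover only the operators in the table, treat any dict-valued condition as an operator expression, and ignore embedded documents and arrays.

```python
import operator

_COMPARE = {
    "$eq": operator.eq, "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}

def _test(doc, field, op, arg):
    """Apply one operator to one field of a document."""
    present = field in doc
    value = doc.get(field)
    if op == "$exists":
        return present == bool(arg)
    if op == "$ne":
        return not present or value != arg
    if op == "$in":
        return present and value in arg
    if op == "$nin":
        return not present or value not in arg
    if op == "$not":
        return not all(_test(doc, field, o, a) for o, a in arg.items())
    return present and _COMPARE[op](value, arg)

def matches(doc, query):
    """Return True if the document satisfies every clause of the query."""
    for key, cond in query.items():
        if key == "$and":
            ok = all(matches(doc, q) for q in cond)
        elif key == "$or":
            ok = any(matches(doc, q) for q in cond)
        elif key == "$nor":
            ok = not any(matches(doc, q) for q in cond)
        elif isinstance(cond, dict):
            ok = all(_test(doc, key, op, arg) for op, arg in cond.items())
        else:
            ok = doc.get(key) == cond   # plain equality
        if not ok:
            return False
    return True

def find(collection, query):
    return [doc for doc in collection if matches(doc, query)]

# Hypothetical documents with different fields (no fixed schema).
people = [
    {"name": "will", "age": 31, "eyes": "blue"},
    {"name": "jeff", "age": 24},
    {"name": "ben", "age": 45, "hat": "yes"},
]

in_range = find(people, {"age": {"$gte": 24, "$lt": 40}})
either = find(people, {"$or": [{"hat": {"$exists": True}}, {"eyes": "blue"}]})
combined = find(people, {"name": {"$in": ["jeff", "ben"]}, "age": {"$not": {"$gt": 30}}})
```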
Read Operations
• The find() method returns a cursor to the results; the cursor must be iterated to get the documents
– .next() method
– .forEach() method
• The limit(n) method returns at most n documents instead of all results
• The skip(n) method controls the starting point of the result set
Update Operations
• Update operations modify existing documents in a collection.
• MongoDB provides the following methods to update documents of a collection:
– db.collection.updateOne(), db.collection.updateMany(), and db.collection.replaceOne()
• Conditions with operators can be used to filter the data.
• replaceOne() vs updateOne()
– replaceOne() replaces the entire document with the new document.
– updateOne() updates only the specified fields.
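The difference between the two methods can be sketched on plain Python dicts. The helper functions are hypothetical stand-ins: update_one_set mimics updateOne() with a $set update, and replace_one mimics replaceOne(), which keeps only the _id of the old document.

```python
def update_one_set(doc, changes):
    """Like updateOne with $set: change the listed fields, keep all others."""
    updated = dict(doc)
    updated.update(changes)
    return updated

def replace_one(doc, replacement):
    """Like replaceOne: the new document replaces every field except _id."""
    new_doc = {"_id": doc["_id"]}
    new_doc.update(replacement)
    return new_doc

original = {"_id": 1, "name": "jeff", "eyes": "blue", "boss": "ben"}

updated = update_one_set(original, {"boss": "will"})
replaced = replace_one(original, {"name": "jeff", "boss": "will"})
```

After the update the eyes field is still present; after the replacement it is gone because the replacement document did not include it.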
Delete Operations
• Delete operations remove documents from a collection.
• MongoDB provides the following methods to delete documents of a collection:
– db.collection.deleteOne()
– db.collection.deleteMany()
• Similarly, criteria, or filters, that identify the documents to remove can be specified with the same syntax as read operations.
MongoDB: CAP Position
Focus on Consistency and Partition tolerance (CP)
Consistency
➢ all replicas contain the same version of the data
Availability
➢ system remains operational on failing nodes
Partition tolerance
➢ multiple entry points
➢ system remains operational on system split
CAP Theorem: satisfying all three at the same time is impossible
Example of NoSQL database
What is Cassandra?
Apache Cassandra is a column-oriented distributed database:
• Free, open-source, decentralized (P2P), distributed
• Aims to manage very large amounts of data across the world.
• Provides a highly available service with no single point of failure.
• It is scalable, fault-tolerant, and eventually consistent.
• Created at Facebook, it differs sharply from relational database management systems.
• Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.
https://en.wikipedia.org/wiki/Apache_
Cassandra is the cursed Oracle
Cassandra Architecture
• The design goal: handle big data workloads across multiple nodes without any single point of failure.
• Cassandra is a peer-to-peer distributed system, and data is distributed among all the nodes in a cluster.
– All the nodes in a cluster play the same role and each node is independent.
– Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
– When a node goes down, read/write requests can be served from other nodes in the network.
• One or more of the nodes in a cluster act as replicas for a given piece of data.
• Cassandra uses the Gossip Protocol in the background to allow the nodes to communicate with each other and detect any faulty nodes in the cluster.
Read & Write in Cassandra
• Write Operations
– Step 1: every write is first recorded in the commit log of the node.
– Step 2: the data is then stored in the mem-table.
– Step 3: whenever the mem-table is full, its data is written to an SSTable data file.
– All writes are automatically partitioned and replicated throughout the cluster.
– Cassandra periodically consolidates the SSTables, discarding unnecessary data.
• Read Operations
– Cassandra first tries to get the values from the mem-table,
– or else finds the appropriate SSTable that holds the required data.
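The write and read paths above can be sketched for a single node. The Node class, the tiny mem-table limit and the compact method are hypothetical simplifications; partitioning, replication and on-disk formats are left out.

```python
class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only log, used for recovery
        self.memtable = {}        # in-memory table of recent writes
        self.sstables = []        # immutable sorted tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))            # step 1: commit log
        self.memtable[key] = value                      # step 2: mem-table
        if len(self.memtable) >= self.memtable_limit:   # step 3: flush when full
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        if key in self.memtable:                 # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):    # then SSTables, newest first
            if key in table:
                return table[key]
        return None

    def compact(self):
        """Merge all SSTables into one, keeping only the newest value per key."""
        merged = {}
        for table in self.sstables:              # later tables overwrite earlier ones
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))]

node = Node(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)      # mem-table is full, so it is flushed to an SSTable
node.write("a", 3)      # newer value for "a" lives in the mem-table
value_a = node.read("a")
value_b = node.read("b")
node.write("c", 4)      # second flush
tables_before = len(node.sstables)
node.compact()
```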
Cassandra: a CAP approach
➢ Cassandra is typically classified as an AP system.
➢ Availability and Partition Tolerance are generally considered to be more important than consistency in Cassandra.
➢ Cassandra can be tuned with replication factor and consistency level to also meet C.
CAP Theorem: satisfying all three at the same time is impossible
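A common rule of thumb for this tuning is that reads see the latest write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N). The sketch below assumes that rule and uses the consistency level names ONE, QUORUM and ALL; the helper functions are hypothetical.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")

def reads_see_latest_write(write_level, read_level, replication_factor):
    """R + W > N means every read set overlaps every write set."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor

rf = 3
quorum_both = reads_see_latest_write("QUORUM", "QUORUM", rf)   # 2 + 2 > 3
one_both = reads_see_latest_write("ONE", "ONE", rf)            # 1 + 1 <= 3
write_one_read_all = reads_see_latest_write("ONE", "ALL", rf)  # 1 + 3 > 3
```

Lower levels favour availability and latency; higher levels favour consistency, at the cost of failing requests when too few replicas are reachable.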
What is HBase?
• Inspired by BigTable, Google’s proprietary distributed storage system
• Open-source Apache project
• Runs on top of HDFS
• Written in Java
• NoSQL (Not Only SQL) distributed database
• Consistent and Partition tolerant
• Runs on commodity hardware
• Handles large databases (terabytes to petabytes)
• Many companies are using HBase
– Facebook, Twitter, Adobe, Mozilla, Yahoo!, Trend Micro, and StumbleUpon
HBase – Architecture
HBase is built on top of HDFS
HBase files are internally stored in HDFS
 | BigTable | HBase
DFS | GFS | HDFS
Computation Model | MapReduce | Hadoop MapReduce
HBase vs HDFS
• Both are distributed systems that scale to hundreds or thousands of nodes
• HDFS is a DFS while HBase is a column-oriented NoSQL distributed database.
• HDFS is good for batch processing (scans over big files)
– Not good for record lookup (random access)
– Not good for incremental addition of small batches
– Not good for updates
• HBase is designed to efficiently query data records
– Fast record lookup (online process)
– Support for record-level insertion
– Support for updates (not in place)
• If your application needs neither random reads nor random writes, stick to HDFS
HBase – Data Model
HBase is a sparse, multi-dimensional, sorted map.
There are four components in an index to locate a value in the map: row key, column family, column qualifier, and time stamp.
Values in the cells have no data type.
A cell can hold several versions: for example, a cell with two time stamps, ts1 and ts2, holds one value per time stamp.
Column families: Personal (column qualifiers name, dob, gender) and Uni (column qualifiers email, program)
Row key | Personal:name | Personal:dob | Personal:gender | Uni:email | Uni:program
s82723123 | John | 1/1/10 | M | | BInfoTech
s82384783 | Tom | 5/8/09 | M | | BInfoTech
s87473482 | Mary | 6/9/11 | F | | MInfoTech
s17365618 | Peter | 7/4/10 | M | | MInfoTech (Management)
HBase – Data Model
• In HBase, each row has a row key, by which rows are sorted, and an arbitrary number of columns.
• In the horizontal direction, a table is composed of column families, each of which can have an arbitrary number of columns.
• A table can be extended dynamically by adding a new column family or a new column.
• HBase keeps the historical records when a cell is updated with a new value:
– the new value becomes current and the previous value is kept.
 | Personal | | | Uni | |
Row key | name | dob | gender | email | program | GPA
s82723123 | John | 1/1/10 | M | | BInfoTech | 4.75
s82384783 | Tom | 5/8/09 | M | tom.z | BInfoTech | 5.25
s87473482 | Mary | 6/9/11 | F | | MInfoTech | 6
s17365618 | Peter | 7/4/10 | M | | MInfoTech (Management) | 6.5
HBase – Data Model
Key (row key, column family, column qualifier, timestamp) | Value
["s82384783", "Uni", "email", "ts1"] |
["s82384783", "Uni", "email", "ts2"] |
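The four-part key can be sketched as a nested map in Python. The Table class is a hypothetical in-memory stand-in, and the e-mail values are made up for illustration; values are stored as bytes because HBase cells have no data type.

```python
class Table:
    def __init__(self):
        # (row key, column family, column qualifier) -> {timestamp: value}
        self.cells = {}

    def put(self, row, family, qualifier, value, ts):
        """Store a new version; older versions of the cell are kept."""
        self.cells.setdefault((row, family, qualifier), {})[ts] = value

    def get(self, row, family, qualifier, ts=None):
        """Return the version at ts, or the newest version if ts is omitted."""
        versions = self.cells.get((row, family, qualifier), {})
        if not versions:
            return None
        if ts is None:
            ts = max(versions)
        return versions.get(ts)

    def scan(self, start_row, stop_row):
        """Yield cells in sorted key order for start_row <= row key < stop_row."""
        for key in sorted(self.cells):
            if start_row <= key[0] < stop_row:
                yield key, self.get(*key)

table = Table()
table.put("s82384783", "Uni", "email", b"tom@old.example", ts=1)   # hypothetical value
table.put("s82384783", "Uni", "email", b"tom@new.example", ts=2)   # hypothetical value
table.put("s87473482", "Personal", "name", b"Mary", ts=1)
table.put("s17365618", "Personal", "name", b"Peter", ts=1)

latest = table.get("s82384783", "Uni", "email")
previous = table.get("s82384783", "Uni", "email", ts=1)
scanned_rows = [key[0] for key, _ in table.scan("s8", "s9")]
```

The map is sparse: a row stores only the cells that were actually written, and a range scan over sorted row keys returns neighbouring rows together.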
HBase – Conceptual View
The table stores the crawled data from a website:
Row Key | TS | Column Family contents | Column Family anchor
"au.net.abc.www" | t5 | | anchor:abcsi.com="ABC"
"au.net.abc.www" | t4 | | anchor:my.look.ca="ABC.net.au"
"au.net.abc.www" | t3 | contents:html="…" |
"au.net.abc.www" | t2 | contents:html="…" |
"au.net.abc.www" | t1 | contents:html="…" |
HBase – Physical View
Column family contents:
Row Key | TS | Column Family contents
"au.net.abc.www" | t3 | contents:html="…"
"au.net.abc.www" | t2 | contents:html="…"
"au.net.abc.www" | t1 | contents:html="…"
Column family anchor:
Row Key | TS | Column Family anchor
"au.net.abc.www" | t5 | anchor:abcsi.com="ABC"
"au.net.abc.www" | t4 | anchor:my.look.ca="ABC.net.au"
HBase – Shell (commands)
Command – Description
list – Shows the list of tables
create 'users', 'info' – Creates the users table with a single column family named info
put 'users', 'row1', 'info:fn', 'John' – Inserts data into the users table, column family info
get 'users', 'row1' – Retrieves a row for a given row key
scan 'users' – Iterates through the users table
disable 'users' then drop 'users' – Deletes a table (the table must be disabled first)
CRUD explained: CREATE = PUT, READ = GET, UPDATE = PUT, DELETE = DELETE
HBase – Java API (examples)
Command – Description
Get
    // Read the row whose key is the user id
    Get get = new Get(Bytes.toBytes(String.valueOf(uid)));
    Result result = table.get(get);
Put
    // addColumn is the HBase 1.0+ name for the older Put.add
    Put p = new Put(Bytes.toBytes("" + user.getUid()));
    p.addColumn(Bytes.toBytes("info"), Bytes.toBytes("fn"), Bytes.toBytes(user.getFirstName()));
    p.addColumn(Bytes.toBytes("info"), Bytes.toBytes("ln"), Bytes.toBytes(user.getLastName()));
    table.put(p);
Delete (column, column family)
    // Delete one version (timestamp1) of column info:fn
    Delete d = new Delete(Bytes.toBytes("" + user.getUid()));
    d.addColumn(Bytes.toBytes("info"), Bytes.toBytes("fn"), timestamp1);
    table.delete(d);
Batch Operations
    List of Get, Put or Delete operations
Scan
    Iterate over a table. Prefer range / filtered scans; a full scan is an expensive operation.
Comparison of NoSQL Databases: Cassandra, HBase, MongoDB
 | Cassandra | HBase | MongoDB
CAP | AP | CP | CP
Network | Masterless (P2P) | Master/Slave | Master/Slave
Data Sharding | Row store with Column oriented | Column Store | Document Store
Architecture | Peer-to-peer | Hierarchical | Nested (Hierarchical)
Text Data | Good | Good | Good
Database Schema | Yes (can replace RDB) | No | No
Data Lake: Too Complex / OK
IoT or Web: Lack of Record-Level Indexing
Reading List
1. Database Wiki, https://en.wikipedia.org/wiki/Database
2. NoSQL: A Database for Cloud Computing, https://pdfs.semanticscholar.org/f5ba/838ce31f14d83705ba9cc24ba1297ed1808d.pdf
3. SQL Databases v. NoSQL Databases, http://delivery.acm.org/10.1145/1730000/1721659/p10-stonebraker.pdf?ip=130.102.82.27&id=1721659&acc=ACTIVE%20SERVICE&key=65D80644F295BC0D.5A1472836B5B8FB5.4D4702B0C3E38B35.4D4702B0C3E38B35&__acm__=1527074155_3df851f19f656db4a19a2d998ed62938
4. A performance comparison of SQL and NoSQL databases, https://ieeexplore.ieee.org/abstract/document/6625441/
5. MongoDB Tutorial for Beginners, https://www.guru99.com/mongodb-tutorials.html
6. for Beginners, https://teddyma.gitbooks.io/learncassandra/content/index.html
References
1. Coronel, Carlos, and . Database systems: design, implementation, & management. Cengage Learning, 2016.
2. Why NoSQL? https://www.couchbase.com/resources/why-nosql
3. Introduction to MongoDB. https://docs.mongodb.com/manual/introduction/
4. https://www.tutorialspoint.com/cassandra/cassandra_architecture.htm
5. Stonebraker, Michael. “SQL databases v. NoSQL databases.” Communications of the ACM 53, no. 4 (2010): 10-11.
6. Khan, Heena. “NoSQL: A database for cloud computing.” International Journal of Computer Science and Network 3, no. 6 (2014): 498-501.
7. Li, Yishan, and . “A performance comparison of SQL and NoSQL databases.” In Communications, computers and signal processing (PACRIM), 2013 IEEE pacific rim conference on, pp. 15-19. IEEE, 2013.
8. https://www.w3.org/TR/rdf-sparql-query/
Next (Week 8) Topic:
Distributed File Systems – GFS and Hadoop File System (HDFS)