PowerPoint Presentation
Database Systems Infrastructure
Copyright © 2012, SAS Institute Inc. All rights reserved.
Remember this diagram you have been looking at in the last couple of weeks.
The same diagram but today we are looking at different side of what you have been learning up to this point.
Today, we are looking at creating an infrastructure of database system. This is how the databases are connected and stored.
The red lines highlighted the connections between the databases.
The coloured lines are shown to be the storage and how you manage how the data is to be stored.
INFS5710 Week 1
Centralised Database Management System
Centralized database management system
Required that corporate data be stored in a single central site
Data access provided through dumb terminals or PCs/laptops
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
The basic is centralised database management system. This is where the database is stored in one place.
As an example, you can think of your Excel Spreadsheet saved on your laptop. You can open and access the spreadsheet on your laptop.
Evolution of Database Management Systems
Distributed database management system (DDBMS) governs storage and processing of logically related data over interconnected computer systems
Data and processing functions are distributed among several sites
“to disperse through an area”
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
Instead of one person, i.e. you, can access your laptop, you decide to put your spreadsheet on the university OneDrive to be shared.
However, other people can change the values and delete rows on your spreadsheet. So, you might want to control who can have access to the data and how you want to control who can access to the data, and what they can do to the data.
In a real world, the small and medium sized companies, they still keep their data in spreadsheets, because building a database server to be shared in the office is not cheap and you require knowledge and skills to create and maintain the data.
However, today, these companies can move their data to cloud to be shared but you still someone who has the knowledge and skills of creating and maintaining the data in cloud. You can outsource to a third party to maintain the data for you.
I think all of you know what cloud is. Cloud is where you can store the data on “public” servers such as OneDrive, DropBox, iCloud, AWS and so on.
Now, come back to corporate environment, we will not focus on saving data from the relational database in cloud. Today, companies will only move data to the cloud if disk space and managing data are becoming issues, or there is a project and moving data to the cloud to ensure future growth. Thus, there are still data stored on company premises.
Distributed database management system (DDBMS) governs storage and processing of logically related data over interconnected computer systems.
When we say database is distributed, this means data is saved, i.e. distributed across different machines, or servers as we call it. That is, the data is not saved in one database, but the data is saved on databases that is saved across on different servers, across different geographic locations, i.e. you have different servers on different countries or states or regions. For example, Employee table could be saved in a database server n Sydney, but the product table could be saved database server in Melbourne, and the sales table could be broken up, or fragmentated as we called it, into sales tables in NSW, Victoria, Queensland, and one table for rest of the countries.
What this means is we want to manage how the data is stored and where and how they are stored. Moreover, we have where the data is retrieved and processed.
We also want to manage the movement of data between these database systems or servers. More importantly, how do we control the consistency of the data between the systems or servers, and between the systems to the users.
Changes that Affected the Nature of Systems
Globalization of business operation
Market needs for an on-demand transaction style, based on web-based services
Rapid social and technological changes fueled by low-cost smart mobile devices
Data realms converging in the digital world
Advent of social media to reach new customers and markets
(multiple types of data, such as voice, video, music, and image)
(e.g., Tik Tok, Douyin, live-streaming)
(e.g., internet celebrity, “Wanghong”)
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
Database Requirements in a Dynamic Business Environment
Rapid ad hoc data access
Crucial in the quick-response decision-making environment
Distributed data access
Needed to support geographically dispersed business units
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
One of the critical issues is the data might be allocated in different geographically, i.e. they are in different countries or in different states.
Factors that influenced DDBMS
Acceptance of Internet as a platform for business
Mobile wireless revolution
Growth of use of “application as a service”
Focus on mobile business intelligence
Emphasis on Big Data analytics
Factors that influenced DDBMS
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
Potential Centralized DBMS Problems
greater distance
single point failure
Physical space, temperature
conditioning, power consumption
may not be flexible and agile to respond to changes
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
These are potential issues with a company but they might not exist depending on how the data governance policy of the company.
For example for the scalability problems, in bank, the costs to maintain Cobol system remain high because of lack of skills and knowledge of young engineers.
When a company is expanding, the amount of data will grow, the question here is can you expand your specifications on your machine. For example, you have a Mac Air, and if I say to you, you need 32GB to run the SAS Studio version 2020, you might not able to because of the limitation of the hardware.
Organisational rigidity could be a problem as the engineers might lose their jobs if the company decides to move the data distributed into different counties, say.
As stated on the slide…
Performance degradation
High costs
Reliability problems
Scalability problems
Organizational rigidity
Table 12.1 – Distributed DBM Advantages and Disadvantages
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
Distributed Processing and Distributed Databases
Distributed processing: Database’s logical processing is shared among two or more physically independent sites via network
Distributed database: Stores logically related database over two or more physically independent sites via computer network
Database fragments: Database composed of many parts in distributed database system
e.g., data I/O, selection, and validation performed on one computer, and a report based on that data created on another computer
a database fragment is a subset of a database that is stored at a given location
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
There are basically two common distributed processing and distributed databases, which we will go into more details for the next two slides
Figure 12.2 – Distributed Processing Environment
Single database in Miami
e.g., SETI@home for distributed computing
Distributed processing does not require a distributed DB, but a distributed DB needs distributed processing
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
In distributed processing, a database’s logical processing is shared among two or more physically independent sites that are connected through a network.
For example, data selection and data validation might be performed on one machine and a report on that data might be created on a different machine.
In the above example, the data is retrieved from Miami machine but the data are processed on different machines in different locations, one say in New York and the other one in Atlanta.
Figure 12.3 – Distributed Database Environment
data are processed locally
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
Distributed database, on the other hand, stores on logically related database over two or more physically independent sites.
These sites are connected via a computer network. In contrast, the distributed processing system only access one single cited database but performed the processing across different sites.
In a distributed database system a database is composed of several parts of the database fragments.
It is called database fragment because the data is not stored in one place only but across different places.
Thus, the database fragments are located at different sites and can be replicated across different sites. The database fragment is managed locally by a local database server and process. When we say the database is replicated, this means some tables of the database are copied exactly or replicated across different database server. The reason is when you run a report, you do not need to get data from a table resided on a different place. These tables are usually master table, for example, product table, the product id, product description and bar code do not change often.
In the example above, some data are retrieved from Miami, and Betty in New York got his report from the processing of data locally with data from Miami and local (i.e. replicated) data in New York data server.
Functions of Fully Distributed DBMS
Receive the request of an application or end user
Validate, analyze, and decompose the request
Map request’s logical-to-physical data components
Decompose request into several I/O operations
Search, locate, read and validate data
Ensure database consistency, security, and integrity
Validate data for conditions specified by request
Present data in required format
Handle all necessary functions transparently to user
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
Figure 12.4 – A Fully Distributed Database Management System
as opposed to physical database
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
Distributed Database Design
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
Turn distributed database have 3 spell issues which you have to address in your design.
Data fragmentation is when you want to partition a table into smaller segments, or we call it fragment. The data is fragmented but you are in full control.
As for data replication, you have to decide which piece of data or which tables you wanted to replicate.
These tables are usually the master tables where data do not change often, so it is easy to replicate as the changes are probably down to minimum.
Having said that, you might have transaction data such as orders which you might need to replicate across different warehouses.
The last design factor to consider is where you want to allocate or store the data.
Data fragmentation
How to partition database into fragments
Data replication
Which fragments to replicate
Data allocation
Where to locate those fragments and replicas
Data Fragmentation
Breaks a single object into two or more segments
Information is stored in distributed data catalog (DDC)
Strategies
Horizontal fragmentation: Division of a relation into subsets (fragments) of tuples (rows)
Vertical fragmentation: Division of a relation into attribute (column) subsets
Mixed fragmentation: Combination of horizontal and vertical strategies
about data fragmentation
about metadata
e.g., location awareness
using a “partition key”
all records but with only the attributes of interest
first horizontal, then vertical
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
You can break the data in a table into two or more segments.
There are three types of fragmentations:
In the example, you can cut the table horizontal – select rows and split into a fragment data.
As for vertical fragmentation, you could cut into columns into fragment data.
As for mixed fragmentation, you could do horizontal and vertical fragmentations, i.e. first on horizontal and then vertical.
Data Replication
Storage of data copies at multiple sites served by a computer network
Mutual consistency rule requires all copies of data fragments be identical
Styles of replication
Push replication focuses on maintaining data consistency
Pull replication focuses on maintaining data availability and allows for temporary data inconistencies
to enhance data availability and response time
at different Data Processor (DP) nodes
after a data update, the originating DP node sends the changes to replica nodes to ensure immediate update
the replica nodes decide when to apply the updates to their local fragment
(replicating data takes time and bandwidth)
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide
Data Replication Scenarios
too expensive to be practical
depending on database size, usage frequency and costs
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide – you can classify data replications into three types.
Fully replicated database
Stores multiple copies of each database fragment at multiple sites
Partially replicated database
Stores multiple copies of some database fragments at multiple sites
Unreplicated database
Stores each database fragment at a single site
Data Allocation Strategies
site 1
site 1
site 2
Site 1 Site 2 Site 3
Frag 1 x x
Frag 2 x x
Frag 3 x x
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide.
Centralized data allocation
Entire database stored at one site
Partitioned data allocation
Database is divided into two or more disjoined fragments and stored at two or more sites
Replicated data allocation
Copies of one or more database fragments are stored at several sites
The CAP Theorem
CAP stands for:
Consistency
Availability
Partition tolerance
Trade-off between consistency and availability generated in a new system in which data is basically available, soft state, eventually consistent (BASE)
Data changes are not immediate but propagate slowly through the system until all replicas are consistent
All nodes should see the same data at the same time
A request is always fulfilled by the system. No received request is ever lost
The system continues to operate even in the event of a node failure, unless all nodes fail
state of the system may change over time
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide.
Oracle
Oracle
Database
Flat Files
Machine Learning
Prepared by Vincent Pang, Feb. 2021 Note: In-Memory Database (e.g. SAP Hana) is an alternative data model not shown here
ETL (Data
Cleansing)
Data (DW to BD, or
vice versa or both)
RelationalDatabaseBig DataDataWarehouseNormalisationReporting (Business Intelligence and Visualisation) and Business Analysis (End Users)(De-Normalised)External Data (e.g. Excel)Hadoop Distributed File System (HDFS) and MapReduceEntity Relationship Model (ERM)Unstructured Data(Social Media)Structured Data (Internet of Things (IOT))HadoopSQLSQLSpark and NoSQL (and other tools)Data StreamingData Streaming(“Not Normalised”)External Data (e.g. Excel)Star Schema
/docProps/thumbnail.jpeg