Information Management
Università degli Studi di Databases (slides partially provided by Prof. Samarati)
Copyright By PowCoder代写 加微信 powcoder
Paradigms for data distribution
• Client-server architectures
• Separation between the database server and the client
• Distributed databases
• Several database servers used by the same application
• Parallel databases
• Several storage devices and processors that operate in parallel to improve
performance
• Replicated databases
• Same data physically stored at different servers
• Data warehouses
• servers specialized for the management of data dedicated to decision support
Kinds of architectures
Two kinds of systems
• OLTP (On-Line Transaction Processing)
• for the optimized management and reliable transactions on database server
• specialized on the management of hundreds or even thousands of transactions per second
• OLAP (On-Line Analytical Processing)
• for data analysis
• operate on data warehouse servers, specialized on the management of data for decision support systems
Server typically include both OLAP and OLTP functionalities
Properties of distributed systems
Portability
• possibility of transporting programs from an environment to another • established at compile time
• facilitated by language standards (e.g., SQL-2, SQL-3)
Interoperability
• ability of interacting among heterogeneous systems
• established at compile time
• facilitated by standard protocols for data access (e.g., Database Connectivity (ODBC) e X-Open Distributed Transaction Processing (DTP))
Client-server architectures (1)
Interacting software processes are divided between client (request services) and server (provide services)
• Requires a precise definition of service interface, listing precisely the services offered by the server
• The client has an active role (makes the request)
• The server has a reactive role (replies)
• Usually, a client process requests a few services in a sequence to one or more server processes
• Usually, a server process replies to multiple requests by different client processes
Client-server architectures (2)
• The machine acting as client should
• be adequate for interacting with the final user
• support productivity tools (email, word processing, Internet access, and workflow management)
• The machine acting as server should
• have considerable memory capacity (to support buffer management) • have considerable disk capacity (to store the whole database)
Client-server architectures (3)
Widely used for databases:
• Client and server functions are well defined
• Provides a good separation of the management activities • client suitable for interaction with the user
• server suitable for data management
• SQL offers an ideal programming paradigm for the identification of service interfaces
Client-server architectures (4)
SQL offers an ideal programming paradigm for the identification of services interfaces
• SQL queries formulated by the client and sent to the server
• query results are computed by the server and sent to the client
• SQL standardization, portability, and interoperability enable the development of client applications that involve different servers
Client-server architectures (5)
Often the server is multi-threaded:
• Behaves as a single process that dynamically operates for different
transactions
• Each execution unit of the server process for a given transaction is a thread
Client-server architectures (6)
• Servers are processes that are permanently active and check: • an input queue for client’s requests
• an output queue for query results
• Often, a dispatcher process distributes requests among servers and returns replies to clients
• When dispatchers can dynamically determine the number of active server processes depending on the number of requests, we say that we have a class of servers
Client-server architectures (7)
Input queue
Database Server
Output queue
Server Process
Two-tiered architectures (1)
• There are two machines: a client and a server
• There are different solutions for deciding to which machine each layer
is assigned
• Simple approach:
• the client has the user-interface level
• the server has the processing and data levels
• This is not the only solution
Two-tiered architectures (2)
Fat vs thin client
• Fat Clients
+ More efficient for users
+ Provide higher system scalability
– More complex and hard to manage
– Client software is more error prone and depends on the client machine and its
operating system • Thin Clients
+ Easier to manage
Trend to move computation to the client
Multi-tiered architectures
• A single server is replaced by multiple servers running on different machines
• A server may act as a client
• Different programs in the processing level reside on different servers
Distributed databases
System of databases where at least one client interacts with more servers for the execution of an application
Distributed databases: Pros
• Respond to applications requirements
• organizations have a distributed structure
• data distribution permits the management of data where they are generated and used
• Flexibility and modularity
• can be configured with continuous additions and modifications of
components
• Reliability
• can react to failures reducing performance instead than stopping operability
Distributed databases: classification (1)
• Kind of involved DBMSs
• Homogeneous DDB: all the servers use the same DBMS • Heterogeneous DDB: the servers use different DBMSs
• Kind of network
• Local Area Network (LAN) • Wide Area Network (WAN)
Distributed databases: classification (2)
Kind of DBMS
Kind of network
Homogeneous
Management and financial applications
Booking systems and financial applications
Heterogeneous
Management and inter-functional applications
Integrated booking systems and inter- banking systems
Local independency and cooperation
The distributed database can be considered, from an abstract point of view, as a single database
• It should be designed in such a way to have applications that are executed on a single server, minimizing
• the need of interaction
• the need of data exchange
Data fragmentation (1)
Adopts algebraic operations on a relation R to divide it into fragments R1, …, Rn
• Horizontal fragmentation
• each Ri has a subset of the tuples in R
• each Ri can be interpreted as the result of a selection over R
• Vertical fragmentation
• each Ri has, in its schema, a subset of the attributes in R
• each Ri can be interpreted as the result of a projection over R
Data fragmentation (2)
Correctness property
• completeness: each data in R must be represented in one of its fragments Ri
• reconstructability: R should be fully reconstructable starting from its fragments
• Horizontal fragments are disjoint
• do not have common tuples
• Vertical fragments include the primary key of R • guarantees reconstructablility
Horizontal fragmentation: example (1)
EMPLOYEE(Empnum,Name,Deptnum,Salary,Taxes)
Fragments:
• EMPLOYEE1 = σEmpnum≤3 EMPLOYEE • EMPLOYEE2 = σEmpnum>3 EMPLOYEE
To reconstruct the relation:
• EMPLOYEE = EMPLOYEE1 È EMPLOYEE2
Horizontal fragmentation: example (2)
Production Administration Production Marketing
3.7M 3.5M 5.3M 3.5M
1.2M 1.1M 2.1M 1.1M
EMPLOYEE1 (σEmpnum≤3 EMPLOYEE)
EMPLOYEE2 (σEmpnum>3 EMPLOYEE)
Administration Production
3.7M 3.5M 5.3M
1.2M 1.1M 2.1M
Vertical fragmentation: example (1)
EMPLOYEE(Empnum,Name,Deptnum,Salary,Taxes) Fragments:
• EMPLOYEE1 = ÕEmpnum,Name(EMPLOYEE)
• EMPLOYEE2 = ÕEmpnum,Deptnum,Salary,Taxes(EMPLOYEE)
To reconstruct the relation:
• EMPLOYEE = EMPLOYEE1 EMPLOYEE2
Vertical fragmentation: example (2)
Production Administration Production Marketing
3.7M 3.5M 5.3M 3.5M
1.2M 1.1M 2.1M 1.1M
EMPLOYEE1 = ÕEmpnum,Name(EMPLOYEE) EMPLOYEE2 = ÕEmpnum,Deptnum,Salary,Taxes(EMPLOYEE)
EMPLOYEE1 EMPLOYEE2
Production Administration Production Marketing
3.7M 3.5M 5.3M 3.5M
1.2M 1.1M 2.1M 1.1M
Allocation schema
Describe the mapping of relations or fragments to servers where they are stored
• Each fragment corresponds, at the physical level, to a file and is allocated to a specific server
• fragments are stored
• the original relation is a view over fragments (virtual)
• Allocation can be
• non-redundant: each fragment or relation is allocated to one server only
• redundant: at least one fragment or relation is allocated to multiple servers
Transparency levels (1)
Distinguishing between fragmentation and allocation permits to write applications operating at different levels
• From the most abstract and independent from data fragmentation • To the most concrete and dependent on their physical allocation
Transparency levels (2)
• fragmentation: the programmer
• does not need to know the fragmentation • does not need to know the allocation
• allocation: the programmer
• needs to know the structure of fragments • does not need to know the allocation
• language: the programmer
• needs to know the structure of fragments • needs to know the allocation
• no transparency
• each DBMS accepts its own SQL dialect: the system is heterogeneous and the DBMSs do not support a common interoperability standard
Transparency levels: example
SUPPLIER(Snum,Name,City)
• horizontal fragments:
• SUPPLIER1 = σcity=’Milan’ (SUPPLIER) • SUPPLIER2 = σcity=’Rome’ (SUPPLIER)
• allocation of horizontal fragments (with replication): •
write a procedure that, given a supplier number, returns its name…
Fragmentation transparency: example
• The programmer
• does not need to know the fragmentation • does not need to know the allocation
procedure Query1(:snum,:name); select Name into :name from Supplier
where Snum = :snum;
end procedure;
Allocation transparency: example
• The programmer
• needs to know the structure of fragments
• does not need to know the allocation
• in case of redundancy does not need to indicate which copy to use for access (replication transparency)
procedure Query2(:snum,:name); select Name into :name from Supplier1
where Snum = :snum;
if :empty then
select Name into :name
from Supplier2
where Snum = :snum;
end procedure;
Language transparency: example
• the programmer
• needs to know the structure of fragments
• needs to know the allocation
• in case of replication, needs to indicate which copy to use for access
procedure Query3(:snum,:name); select Name into :name
from where Snum = :snum;
if :empty then
select Name into :name
from where Snum = :snum;
end procedure;
Query optimization
The application can be optimized through:
• parallelism
• submit requests in parallel in contrast to in sequence • reduce the overall response time
• knowledge of the logical properties of fragments • query the fragment where data reside
• increase efficiency but reduce flexibility
Query optimization: example
procedure Query4(:snum,:name,:city); case :city of
select Name into :name from Supplier1
where Snum = :snum;
select Name into :name from Supplier2
where Snum = :snum;
end procedure;
Fragmentation: exercise
SHOP (ID, Name, Kind, Tel, City, IDWarehouse) WAREHOUSE (ID, Street, City, Tel)
• WAREHOUSE is split in 4 fragments based on the city where warehouses are located (Milan, Florence, Rome, Venice), allocated at the three corresponding shipment centres in the same cities
• SHOP is split in 3 fragments based on the kind of shop (clothes, shoes, bath) allocated at three coordination centres, one for each kind of shop, in Cagliari, Siena, and Bari, respectively
Fragmentation: exercise (continuation)
• Write a query that retrieves the name, kind, and telephone number of the shop with ID=123 according to fragmentation, allocation, and language transparency.
Transaction classification
Progressively increasing complexity levels
• Remote requests
• Remote transactions
• Distributed transactions • Distributed requests
Remote requests
• Read-only transactions including an arbitrary number of SQL queries • All the queries are addressed to a single remote DBMS
• The remote DBMS can only be queried
Remote transactions
• Composed of an arbitrary number of SQL commands (select, insert, delete, update)
• All the commands are addressed to a single remote DBMS • Each transaction writes one DBMS only
Distributed transactions
• Composed of an arbitrary number of SQL commands (select, insert, delete, update) addressed to an arbitrary number of remote DBMSs
• Each SQL command refers to a single DBMS
• Each transaction can update different DBMSs • requires two-phase-commit protocol
Distributed requests
• Arbitrary transactions composed of an arbitrary number of SQL commands (select, insert, delete, update) addressed to an arbitrary number of remote DBMSs
• Each command can refer to any DBMS
• Requires a distributed query optimizer
Transaction: example (1)
CC(Num,Name,Balance)
• CC1: σNum≤1000(CC)
• CC2: σNum>1000(CC)
• assume allocation transparency
Distributed transaction
• Transfer 100 Euros from account 354 to account 1487
it is necessary to guarantee atomicity: either both updates are executed or none of them
Transaction: example (2)
begin transaction
update CC1
set Balance = Balance – 100 where CCNum = ʻ354ʼ;
update CC2
set Balance = Balance + 100 where CCNum = ʻ1487ʼ;
end transaction
Technology of distributed databases (1)
Data distribution does not affect
• consistency: integrity constraints describe only local properties
• it is a limit of current DBMS technology
• persistency: each system guarantees persistency to locally stored data
• local recovery (log, checkpoint, dump) mechanisms Data distribution affects
• isolation
• atomicity
Technology of distributed databases (2)
Data distribution requires to modify
• Query optimization • Concurrency control
• isolation
• Reliability control • atomicity
Distributed query optimizer (1)
Under the responsibility of the DBMS that receives the query
• Decides how to divide the query in sub-queries, each addressed to a
specific DBMS
• Defines a strategy (plan) for distributed execution:
• Coordinated execution of different programs at different DBMSs • Data exchange among DBMSs
• Guarantees global optimization
Distributed query optimizer (2)
In the computation of distributed queries cost, the amount of data transmitted over the network is particularly important
Ctot=CI/O ́nI/O +CCPU ́nCPU +Ctr ́ntr
• ntr : amount of data transmitted over the network • Ctr : transmission cost
Concurrency control
In a distributed system, a transaction ti can execute multiple sub- transactions at different nodes:
• tij execution of ti at node j
• t1 : r11(x) w11(x) r12(y) w12(y) • t2 : r22(y) w22(y) r21(x) w21(x)
Concurrency control: example
S1 : r11(x) w11(x) r21(x) w21(x) S2 : r22(y) w22(y) r12(y) w12(y)
• Locally serializable (serial) • Globally non serializable
• The conflict graph includes a cycle:
• on node 1, t1 precedes t2 and is in conflict with t2 • on node 2, t2 precedes t1 and is in conflict with t1
of distributed transactions can be compromised by failures/malfunctions
• node failure (software/hardware)
• message lost: the execution of a protocol is left in a bad state
• each message of the protocol (msg) is followed by an acknowledgement of receipt message (ack)
• the loss of a message leaves the sender not sure about its reception
• failure of connection links: may cause network partitioning
• a transaction can be active simultaneously in different sub-networks
Global serializability
Local serializability does not guarantee global serializability
S1 : r11(x) w11(x) r21(x) w21(x) S2 : r22(y) w22(y) r12(y) w12(y)
• Locally serializable (serial) • Globally non serializable
Global serializability requires the presence of a serial schedule S equivalent to all the local schedules Si resulting at each node
• The projection of S on node i must be equal to Si
Global serializability: properties
Conflict-serializable global schedule
• guaranteed if each scheduler uses strict 2PL and executes atomic commit when all the sub-transactions at the different nodes have all the resources
Serial global schedule
• guaranteed if each distributed transaction acquires a single timestamp and uses it in all its requests to all the schedulers that perform concurrency control based on timestamp
– requires the assignment of a global timestamp
Logical clocks
Need to assign timestamps that reflect precedence among events in a distributed system
• If two processes do not interact it is not necessary that they are synchronized
• No matter that all processes agree on what time it is, but on the order in which events occur
• Lamport clocks and vector clocks are based on these observations
Happened-before
• aàb means that all transactions agree that first a occurs and then b occurs
• a and b are in the same transaction and a occurs before b
• a is the event of sending a message and b is the event of receiving the
• The relationship is transitive
• Events in different transactions that do not exchange messages are concurrent
• it is a partial order relationship
Lamport clocks
• How do we maintain a global view on the system’s behavior that is consistent with the happened-before relationship?
• Attach a timestamp C(e) to each event e s.t.:
• If a and b are two events in the same transaction, and aàb, then C(a)