程序代写代做代考 data mining html algorithm database concurrency information retrieval CS W4111.001 Introduction to Databases Fall 2020

CS W4111.001 Introduction to Databases Fall 2020
Computer Science Department Columbia University

Transaction Processing Overview
Transaction processing studied in depth in CS W4112-Database System Implementation

Transactions
A transaction is a series of actions (Reads and Writes) on a database that form a “logical unit”
Example: all database actions required to transfer money from one bank account to another

A transaction is a series of actions (Reads and Writes) on a database that form a “logical unit”
Example: all database actions required to transfer money from one bank account to another
ACID properties of transactions:
• Atomicity(enforcedvialog+recoveryprotocol)
• Consistency(enforcedbyDBMS)
• Isolation(enforcedbyconcurrencycontrolprotocol) • Durability(enforcedvialog+recoveryprotocol)
Transactions either commit (when they complete successfully) or abort (when they don’t)
Transactions

• Concurrent execution of transactions is essential for good DBMS performance
• Concurrency is achieved by the DBMS, which interleaves actions (reads and writes of database objects) of various transactions
• Users submit transactions, and can think of each transaction as executing by itself, logically speaking
Isolation of Transactions

T1: A=A+100, B=B-100 T2: A=1.02*A,
B=1.02*B
simply as reads R and writes W, where the number next to R or W identifies the corresponding transaction (e.g., R1(A) is a read of object A by transaction T1)
T1: R1(A) W1(A) R1(B) W1(B) T2: R2(A) W2(A)
R2(B) W2(B)
or sometimes, equivalently, in one line as:
R1(A) W1(A) R2(A) W2(A) R1(B) W1(B) R2(B) W2(B)
Transaction Schedules as Series of Reads and Writes
From now on, we will view transactions and schedules as sequences of reads and writes, without worrying about the actual values that are read or written
So we will write a schedule such as:

• Reading Uncommitted Data (“dirty reads”):
T1: R1(A) W1(A) R1(B) W1(B) Abort1 T2: R2(A) W2(A) Commit2
T1: R1(A) R1(A) W1(A) Commit1 T2: R2(A) W2(A) Commit2
• Unrepeatable Reads:
Anomalies with Interleaved Executions of Transactions

Strict 2-Phase Locking (Strict 2PL) Protocol
Strict 2PL allows only serializable schedules
• BeforereadinganobjectA,atransactionmusthave requested and hold a shared lock on A (i.e., S(A))
• Multipletransactionscanholdasharedlockonanobject simultaneously

Strict 2-Phase Locking (Strict 2PL) Protocol
Strict 2PL allows only serializable schedules
• BeforereadinganobjectA,atransactionmusthave requested and hold a shared lock on A (i.e., S(A))
• Multipletransactionscanholdasharedlockonanobject simultaneously
• BeforewritinganobjectA,atransactionmusthave requested and hold an exclusive lock on A (i.e., X(A))
• Atmostonetransactioncanholdanexclusivelockonan object and no transactions can hold a shared lock on the object

Strict 2-Phase Locking (Strict 2PL) Protocol
Strict 2PL allows only serializable schedules
• BeforereadinganobjectA,atransactionmusthave requested and hold a shared lock on A (i.e., S(A))
• Multipletransactionscanholdasharedlockonanobject simultaneously
• BeforewritinganobjectA,atransactionmusthave requested and hold an exclusive lock on A (i.e., X(A))
• Atmostonetransactioncanholdanexclusivelockonan object and no transactions can hold a shared lock on the object
• Transactionsonlyreleasetheirlocks(U)whenthey complete (i.e., as part of COMMIT or ABORT)

• Phase 1: execution of the transaction, where locks are acquired but never released
• Phase 2: COMMIT or ABORT of the transaction, where all locks are finally released
Examples of schedules possible and not possible under Strict 2PL?
What are the 2 phases in Strict 2PL?

Aborting a Transaction
Consider a transaction T1:
T1: R1(A) W1(A) W1(B) R1(C) Abort1
What should we do to remove all effects of T1 from the database as we terminate the transaction?

Aborting a Transaction
Consider a transaction T1:
T1: R1(A) W1(A) W1(B) R1(C) Abort1
What should we do to remove all effects of T1 from the database as we terminate the transaction?
T1: R1(A) W1(A) W1(B) R1(C) W1(B) W1(A) Abort1

Aborting a Transaction
Consider a transaction T1:
T1: R1(A) W1(A) W1(B) R1(C) Abort1
What should we do to remove all effects of T1 from the database as we terminate the transaction?
T1: R1(A) W1(A) W1(B) R1(C) W1(B) W1(A) Abort1
• As part of the abort operation, we write back/restore the old values of B and A that T1 received, which T1 can do because it is still holding exclusive locks on B and A (strict 2PL…)

are recorded
• Finally T1 releases all locks that T1 acquired
Aborting a Transaction
Consider a transaction T1:
T1: R1(A) W1(A) W1(B) R1(C) Abort1
What should we do to remove all effects of T1 from the database as we terminate the transaction?
T1: R1(A) W1(A) W1(B) R1(C) W1(B) W1(A) Abort1
• As part of the abort operation, we write back/restore the old values of B and A that T1 received, which T1 can do because it is still holding exclusive locks on B and A (strict 2PL…)
• The old values are in the database log, in which the old and new values of every write to the database

Cascading Aborts
• If a transaction Ti is aborted, all its actions have to be undone
• Furthermore, if Tj read an object last written by Ti, Tj must be aborted as well

Cascading Aborts
• If a transaction Ti is aborted, all its actions have to be undone
• Furthermore, if Tj read an object last written by Ti, Tj must be aborted as well
• Strict 2PL avoids such cascading aborts because transactions only release their locks at the very end (i.e., at commit or abort time)
So if Ti writes an object, it holds on to its exclusive lock on the object until the end; Tj cannot read the object until after Ti commits or aborts

Consider this schedule for transactions T1 and T2, with the “reading uncommitted data” problem we saw:
T1: R1(A) W1(A) Abort1 T2: R2(A) R2(B) W2(B) Commit2
Is the schedule serializable?
Unrecoverable Schedules

Consider this schedule for transactions T1 and T2, with the “reading uncommitted data” problem we saw:
T1: R1(A) W1(A) Abort1 T2: R2(A) R2(B) W2(B) Commit2
Is the schedule serializable? Yes Is there a problem?
Unrecoverable Schedules

Consider this schedule for transactions T1 and T2, with the “reading uncommitted data” problem we saw:
T1: R1(A) W1(A) Abort1 T2: R2(A) R2(B) W2(B) Commit2
Is the schedule serializable? Yes
Is there a problem? Yes! T2 reads a value of A written by T1, T2 commits (so because of durability, T2 persists forever), and then T1 aborts (so because of atomicity, all of T1’s actions should be erased)
Unrecoverable Schedules

A schedule is recoverable if its transactions commit only after—and if—all transactions whose changes they read have committed
Strict 2PL does not allow unrecoverable schedules, so the schedule in the previous slide is not possible under Strict 2PL
Only serializable, recoverable schedules that avoid cascading aborts are possible under Strict 2PL
Recoverable Schedules

The Phantom Problem
Consider this schedule for T1 and T2, following Strict 2PL:
T1: locks all pages with sailors with rating=1 and finds oldest such sailor is 71 years old
T2:
inserts a new sailor with rating=1 and age=96
locks all pages with sailors with
rating=2 and
finds oldest such sailor is 63,
commits
deletes oldest
sailor with rating=2,
who happens to be
80 years old, commits
Problem?

The Phantom Problem
Consider this schedule for T1 and T2, following Strict 2PL:
T1: locks all pages with sailors with rating=1 and finds oldest such sailor is 71 years old
T2:
inserts a new sailor with rating=1 and age=96
locks all pages with sailors with
rating=2 and
finds oldest such sailor is 63,
commits
deletes oldest
sailor with rating=2,
who happens to be
80 years old, commits
Problem: Yes! Schedule is not equivalent to either serial schedule:
• Not T1; T2: the oldest sailors for rating=1 and rating=2 that T1 observes would have been 71 and 80 (not 63) years old, respectively
• Not T2; T1: the oldest sailors for rating=1 and rating=2 that T1 observes would have been 96 (not 71) and 63 years old, respectively

Dynamic Databases and Phantoms
• This schedule is possible under Strict 2PL yet it’s not serializable!

Dynamic Databases and Phantoms
• This schedule is possible under Strict 2PL yet it’s not serializable!
• Problem: implicit assumption that transaction T1 has locked all sailors, current or future, with rating=1, not just existing sailors

Dynamic Databases and Phantoms
• This schedule is possible under Strict 2PL yet it’s not serializable!
• Problem: implicit assumption that transaction T1 has locked all sailors, current or future, with rating=1, not just existing sailors
• One way to handle such dynamic databases with Strict 2PL to guarantee true serializability is to use predicate locking: lock all current or future tuples satisfying a predicate (e.g., rating=1 OR rating=2 for T1)

Isolation Level
Dirty Read
Unrepeatable Read
Phantom Problem
Read Uncommitted
Possible
Possible
Possible
Read Committed
Not possible
Possible
Possible
Repeatable Read
Not possible
Not possible
Possible
Serializable
Not possible
Not possible
Not possible
Transaction Isolation Levels in SQL
• Transactions have an access mode: READ WRITE vs. READ ONLY
• “Read Uncommitted” transactions must be READ ONLY and don’t use
locks (if concurrency control is based on Strict 2PL)
• For all other isolation levels, exclusive locks are held until the end of
the transactions
• For “Read Committed” transactions, shared locks are released
immediately after the corresponding reads
• “Repeatable Reads” and “Serializable” transactions follow Strict 2PL
• “Serializable” transactions use predicate locking
See https://www.postgresql.org/docs/12/transaction-iso.html for PostgreSQL support

• Atomicity
• Consistency ✅
• Isolation ✅
• Durability
• Atomicity and durability are handled by the recovery manager
• We have already discussed how to terminate a transaction during normal database operation
• Now: what should we do when the system crashes, to guarantee atomicity and durability?
ACID Properties of Transactions

Recovery After a Crash
Assumptions:
• Disk is safe, RAM is not (and goes away when system crashes)
• DBMS uses Strict 2-Phase locking

The Database Log
• Thelogisasingle,sequential,append-onlyfilethatkeeps information on actions by all transactions
• Logrecordforawriteoperation:
TID=transaction ID; pageID, offset, and length refer to the block on disk, the position within the block, and the length that is written
TID
pageID
offset
length
old value
new value

The Database Log
• Thelogisasingle,sequential,append-onlyfilethatkeeps information on actions by all transactions
• Logrecordforawriteoperation:
TID=transaction ID; pageID, offset, and length refer to the block on disk, the position within the block, and the length that is written
• Whenatransactionwritesanobject,thelogrecordincludes both the old and new values for the object
• Whenatransactioncommitsoraborts,alogrecord documents this action
TID
pageID
offset
length
old value
new value

The Database Log
• Thelogisasingle,sequential,append-onlyfilethatkeeps information on actions by all transactions
• Logrecordforawriteoperation:
TID=transaction ID; pageID, offset, and length refer to the block on disk, the position within the block, and the length that is written
• Whenatransactionwritesanobject,thelogrecordincludes both the old and new values for the object
• Whenatransactioncommitsoraborts,alogrecord documents this action
• Themostrecentrecordsofthelogareinmainmemory, while the rest of the log is stored stably on disk (and often replicated as well)
TID
pageID
offset
length
old value
new value

Write-Ahead Logging, or WAL
For efficiency, we allow writes of data and log to happen in memory without an immediate disk-write counterpart: dangerous if not done properly! Why?

Write-Ahead Logging, or WAL
For efficiency, we allow writes of data and log to happen in memory without an immediate disk-write counterpart: dangerous if not done properly!
Write-Ahead Logging rules:
• We must force the log record for an update to disk before the corresponding data page for the update makes it to disk

to disk
Write-Ahead Logging, or WAL
For efficiency, we allow writes of data and log to happen in memory without an immediate disk-write counterpart: dangerous if not done properly!
Write-Ahead Logging rules:
• We must force the log record for an update to disk before the corresponding data page for the update makes it to disk
• We must write all log records for a transaction—including the commit record—to disk before declaring the transaction committed
Note: The log is a sequential file, so if we push a log record to disk, all earlier log records are also pushed

Recovering from a Crash
CRASH!
commit
T1
T2
T3
T4
T5 Recovery?
commit
commit

Recovering from a Crash
CRASH!
commit
T1
T2
T3
T4
T5
During recovery:
commit
commit
• RedoupdatesbycommittedtransactionsT1,T2,andT3 (their updates might not have made it to disk by the time of the crash); use log for this
• Undoupdatesbyin-progresstransactionsT4andT5 (their updates might have made it to disk); use log for this; these transactions get terminated and restarted from scratch

• Analysis: Scan the log to identify all transactions that had committed or aborted, or were in progress at crash time
ARIES Recovery Algorithm: 3 Phases

• Analysis: Scan the log to identify all transactions that had committed or aborted, or were in progress at crash time
• Redo: Redo all updates—by using the new values in the corresponding log records—to ensure that all logged updates are indeed carried out, in chronological order
ARIES Recovery Algorithm: 3 Phases

• Analysis: Scan the log to identify all transactions that had committed or aborted, or were in progress at crash time
• Redo: Redo all updates—by using the new values in the corresponding log records—to ensure that all logged updates are indeed carried out, in chronological order
• Undo: Undo all updates—by using the old values in the corresponding log records—of in-progress transactions, in reverse chronological order
ARIES Recovery Algorithm: 3 Phases

• ACID properties of transactions are critically important for sensitive applications
• Transaction processing—with the enforcement of the ACID properties—is done largely transparently by DBMS
• This is a big advantage of using DBMSs over ad-hoc software
• CS W4112-Database System Implementation covers transaction processing in depth
Summary of Transaction Processing

COMS E6111
Advanced Database Systems
Spring 2021
Prerequisites: COMS W4111 (not W4112); fluency in Python

• Broader families of “data,” beyond relational model
• text, spatial data, time series, …
• Broader families of “queries,” beyond relatively simple SQL
• keyword search, “data mining”/OLAP, …
Advanced Database Systems: 2 Themes

• Information Retrieval
• Web Search
• Data Mining
• Data Warehousing, OLAP, Decision Support • Spatial Data Management
• Time Series Analysis • Information Extraction •…
Sample of Potential Topics Covered

• Regular lectures, but with more discussion • No homework
• Many readings through research papers
• 3 projects (in Python)
• Midterm and final
Both undergraduate and graduate students can register
General Structure of Course (subject to change)