PowerPoint Presentation
File Organizations
R & G – Chapter 9
1
Architecture of a DBMS
Completed
And We’ll Visit
You are Here
Database Management
System
Database
Query Parsing
& Optimization
Relational Operators
Files and Index Management
Buffer Management
Disk Space Management
SQL Client
2
Recall: Heap Files
Unordered collection of records
Recall API for higher layers of the DBMS.
Today we’ll ask: “How? At what cost?”
Insert/delete/modify record
Fetch a particular record by record id …
Record id is a pointer encoding pair of (pageID, location on page)
Scan all records
Possibly with some conditions on the records to be retrieved
Recall: Multiple File Organizations
Many alternatives exist, each good in some situations and less so in others.
This is a theme in DB systems work!
Heap Files: Suitable when typical access is a full scan of all records
Sorted Files: Best for retrieval in order, or when a range of records is needed
Clustered Files & Indexes: Group data into blocks to enable fast lookup and efficient modifications.
More on this soon …
Bigger Questions
What is the “best” file organization?
Depends on access patterns …
How? What are common access patterns anyway?
Can we be quantitative about tradeoffs?
If one is better … by how much?
Goals
Big picture overheads for data access
We’ll (overly) simplify performance models to provide insight, not to get perfect performance
Still, a bit of discipline:
Clearly identify assumptions up front
Then estimate cost in a principled way
Foundation for query optimization
Can’t choose the fastest scheme without an estimate of speed!
Cost Model and Analysis
Cost Model for Analysis
B: The number of data blocks in the file
R: Number of records per block
D: (Average) time to read/write disk block
Focus: Average case analysis for uniform random workloads
For now, we will ignore
Sequential vs Random I/O
Pre-fetching
Any in-memory costs
Good enough to show the overall trends
Block 1
Block 2
…
Block B
Record 1
Record 2
…
Record R
More Assumptions
Single record insert and delete
Equality selection – exactly one match
For Heap Files:
Insert always appends to end of file.
For Sorted Files:
Packed: Files compacted after deletions.
Sorted according to search key
Extra Challenge
After understanding these slides …
You should question all these assumptions and rework
Good exercise to study for tests, and generate ideas
Heap Files & Sorted Files
B: The number of data blocks = 5
R: Number of records per block = 2
D: (Average) time to read/write disk block = 5ms
2, 5
1, 6
4, 7
3, 10
8, 9
Heap File
1, 2
3, 4
5, 6
7, 8
9, 10
For illustration, records are just integers
Sorted File
Cost of Operations: Scan?
B: The number of data blocks = 5
R: Number of records per block = 2
D: (Average) time to read/write disk block = 5ms
Heap File Sorted File
Scan all records
Equality Search
Range Search
Insert
Delete
Scan All Records
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
Pages touched: ?
Time to read the record: ?
2, 5
1, 6
4, 7
3, 10
8, 9
Heap File
1, 2
3, 4
5, 6
7, 8
9, 10
Sorted File
Cost of Operations: Scan Cost
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk bloc
Heap File Sorted File
Scan all records B*D B*D
Equality Search
Range Search
Insert
Delete
Cost of Operations: Equality Search?
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
Heap File Sorted File
Scan all records B*D B*D
Equality Search
Range Search
Insert
Delete
Find Key 8: Heap File
P(i): Probability that key is on page i is 1/B
T(i): Number of pages touched if key on page i is i
Therefore the expected number of pages touched
Pages touched on average?
Heap File
2, 5
1, 6
4, 7
3, 10
8, 9
Find Key 8: Sorted File
Worst-case: Pages touched in binary search
log2B
Average-case: Pages touched in binary search
log2B?
Sorted File
1, 2
3, 4
5, 6
7, 8
9, 10
Average Case Binary Search
Expected Number of Reads: 1 (1 / B) + 2 ( 2 / B) + 3 (4 / B) + 4 (8 / B)
18
Average Case Binary Search cont
Expected Number of Reads: 1 (1 / B) + 2 ( 2 / B) + 3 (4 / B) + 4 (8 / B)
1 IO
2 IOs
3 IOs
4 IOs
19
Cost of Operations: Equation Search Cost
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
Heap File Sorted File
Scan all records B*D B*D
Equality Search 0.5*B*D (log2B)*D
Range Search
Insert
Delete
Cost of Operations: Range Search?
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
Heap File Sorted File
Scan all records B*D B*D
Equality Search 0.5*B*D (log2B)*D
Range Search
Insert
Delete
Find Keys Between 7 and 9: Heap File
Always touch all blocks. Why?
Heap File
2, 5
1, 6
4, 7
3, 10
8, 9
Find Keys Between 7 and 9: Comparison
Find beginning of range
Search for start of range
Scan right
Heap File
2, 5
1, 6
4, 7
3, 10
8, 9
Sorted File
1, 2
3, 4
5, 6
7, 8
9, 10
Cost of Operations: Range Search Cost
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
Heap File Sorted File
Scan all records B*D B*D
Equality Search 0.5*B*D (log2B)*D
Range Search B*D ((log2B)+pages)*D
Insert
Delete
Cost of Operations: Insert?
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
Heap File Sorted File
Scan all records B*D B*D
Equality Search 0.5*B*D (log2B)*D
Range Search B*D ((log2B)+pages)*D
Insert
Delete
Insert 4.5: Heap File
Stick at end of file
Cost = 2*D
Why 2?
Heap File
2, 5
1, 6
4, 7
3, 10
8, 9
4.5,_
Insert 4.5: Heap VS Sorted File
Read last page, append, write. Cost = 2*D
Find location for record. Cost = log2BD
Heap File
2, 5
1, 6
4, 7
3, 10
8, 9
4.5,
Sorted File
1, 2
3, 4
5, 6
7, 8
9, _
27
Insert 4.5: Heap Vs Sorted Pt 2
Read last page, append, write. Cost = 2*D
Find location for record. Cost = log2BD
Insert and shift rest of file
1, 2
3, 4
5, 6
7, 8
9, 10
2, 5
1, 6
4, 7
3, 10
8, 9
4.5,
10, _
Heap File
Sorted File
4.5, 5
6, 7
8, 9
28
Cost of Operations: Insert Cost
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
Heap File Sorted File
Scan all records B*D B*D
Equality Search 0.5*B*D (log2B)*D
Range Search B*D ((log2B)+pages)*D
Insert 2*D ((log2B)+B)*D
Delete
Cost of Operations: Delete?
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
Heap File Sorted File
Scan all records B*D B*D
Equality Search 0.5*B*D (log2B)*D
Range Search B*D ((log2B)+pages)*D
Insert 2*D ((log2B)+B)*D
Delete
Delete 4.5: Heap File
Average case to find the record: B/2 reads
Delete record from page
Cost = (B/2 + 1) * D
Why + 1?
Heap File
2, 5
1, 6
4, 7
3, 10
8, 9
4.5,
Delete 4.5: Heap File Vs Sorted File
Average case runtime: (B/2+1) * D
Find location for record: log2B
Delete record in page Gap
Heap File
2, 5
1, 6
4, 7
3, 10
8, 9
Sorted File
1, 2
3, 4
4.5, 5
6, 7
8, 9
10, _
_
Delete 4.5: Heap File Vs Sorted File Pt 2
Average case runtime: (B/2+1) * D
Find location for record: log2B
Shift the rest by 1 record 2 * (B/2)
Heap File
2, 5
1, 6
4, 7
3, 10
8, 9
Sorted File
1, 2
3, 4
__, 5
6, 7
8, 9
10, _
7, 8
9, 10
Cost of Operations Complete
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
Heap File Sorted File
Scan all records B*D B*D
Equality Search 0.5*B*D (log2B)*D
Range Search B*D ((log2B)+pages)*D
Insert 2*D ((log2B)+B)*D
Delete (0.5*B+1)*D ((log2B)+B)*D
Cost of Operations Complete Pt 2
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
Can we do better?
Indexes!
Heap File Sorted File
Scan all records B*D B*D
Equality Search 0.5*B*D (log2B)*D
Range Search B*D ((log2B)+pages)*D
Insert 2*D ((log2B)+B)*D
Delete (0.5*B+1)*D ((log2B)+B)*D
BX
i=1
T(i)P(i) =
BX
i=1
i
1
B
=
B(B + 1)
2B
⇡
B
2
log2 BX
i=1
i
2
i�1
B
=
1
B
log2 BX
i=1
i2i�1 = log
2
B �
B � 1
B
/docProps/thumbnail.jpeg