PowerPoint Presentation
Disk Representations: Files, Pages, Records
Prof. Joseph Hellerstein
Storing Data: Files
2
FILE REPRESENTATIONS
Overview: Representations
SSN Last Name First Name Age Salary
123 Adams Elmo 31 $400
443 Grouch Oscar 32 $300
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
Record
Bob
M
32
400
Harmon
Varchar
Varchar
Char
Int
Int
Slotted Page
Page
Header
File
Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
Header
M
94703
Bob
Harmon
32
Byte Representation of Record
Overview: Files of Pages of Records
Tables stored as logical files
Consist of pages
Pages contain a collection of records
Pages are managed
On disk by the disk space manager:
pages read/written to physical disk/files
In memory by the buffer manager:
higher levels of DBMS only operate in memory
Page
Header
Page
Header
Page
Header
Page
Header
Page
Header
Page
Header
Database Files
Files of Pages of Records
DB FILE: A collection of pages, each containing a collection of records.
API for higher layers of the DBMS:
Insert/delete/modify record
Fetch a particular record by record id …
Record id is a pointer encoding pair of (pageID, location on page)
Scan all records
Possibly with some conditions on the records to be retrieved
Could span multiple OS files and even machines
Or “raw” disk devices
Many DB File Structures
Unordered Heap Files
Records placed arbitrarily across pages
Clustered Heap Files
Records and pages are grouped
Sorted Files
Pages and records are in sorted order
Index Files
B+ Trees, Linear Hashing, …
May contain records or point to records in other files
Unordered Heap Files
Collection of records in no particular order
Not to be confused with “heap” data-structure
As file shrinks/grows, pages (de)allocated
To support record level operations, we must
Keep track of the pages in a file
Keep track of free space on pages
Keep track of the records on a page
Heap File Implemented as List
Header page ID and Heap file name stored elsewhere
Database catalog
Each page contains 2 “pointers” plus free space and data
What is wrong with this?
How do I find a page with enough space for a 20 byte records
Data
Page
Data
Page
Data
Page
Data
Page
Data
Page
Data
Page
Header
Page
Pages with
Free Space
Full Pages
Better: Use a Page Directory
Directory entries include:
#free bytes on the referenced page
Header pages accessed often likely in cache
Finding a page to fit a record required far fewer page loads than linked list
Why?
One header page load reveals free space of many pages
You can imagine optimizing the page directory further
But diminishing returns?
Data
Page 1
Data
Page 2
Data
Page N
Header Page
DIRECTORY
Summary
Table encoded as files which are collections of pages
SSNz Last Name First Name Age Salary
123 Adams Elmo 31 $400
443 Grouch Oscar 32 $300
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
File
Page Layout
Page Basics: The Header
Header may contain:
Number of records
Free space
Maybe a next/last pointer
Bitmaps, Slot Table
Page Header
Things to Address
Record length? Fixed or Variable
Find records by record id?
Record id = (Page, Location in Page)
How do we add and delete records?
Page Header
Options for Page Layouts
Depends on
Record length (fixed or variable)
Page packing (packed or unpacked)
Indexes: Sneak Preview
A Heap file allows us to retrieve records:
By specifying the record id (page id + offset)
By scanning all records sequentially
Would like to fetch records by value, e.g.,
Find all students in the “CS” department
Find all students with a “GPA” > 3 AND “blue hair”
Indexes: file structures for efficient value-based queries
Content Break
A Note On Imagery
Data is stored in linear order
1 byte per position
Memory addresses are ordered
Disk addresses are ordered
This doesn’t fit nicely on screen
So we will “wrap around” the linear order into a rectangle
Fixed Length Records, Packed
Page Header
Record
Record
Record
Record
Record
Record
Pack records densely
Record id = (pageId, “location in page”)?
(pageId, record number in page)!
We know the offset from start of page!
Easy to add: just append
Delete?
Fixed Length Records, Packed, Pt 2.
Page Header
Record
Record
Record
Record
Record
Record
Pack records densely
Record id = (pageId, “location in page”)?
(pageId, record number in page)!
We know the offset from start of page!
Easy to add: just append
Delete?
Record id:
(Page 2, Record 4)
Pack records densely
Record id = (pageId, “location in page”)?
(pageId, record number in page)!
We know the offset from start of page!
Easy to add: just append
Delete?
Fixed Length Records: Packed, Pt 3.
Record id:
(Page 2, Record 4)
Page Header
Record
Record
Record
Record
Record
22
Page Header
Record
Record
Record
Record
Record
Pack records densely
Record id = (pageId, “location in page”)?
(pageId, record number in page)!
We know the offset from start of page!
Easy to add: just append
Delete?
Packed implies re-arrange!
Record Id pointers need to be updated!
Could be expensive if they’re in other files.
Fixed Length Records: Packed, Pt. 5
Record id:
(Page 2, Record 3)
Page Header
Record
Record
Record
Record
Record
Record
Record
Fixed Length Records: Unpacked
Bitmap denotes “slots” with records
Record id: record number in page
Insert: find first empty slot
Delete: Clear bit
Bitmap
Record id:
(Page 2, Record 4)
24
Fixed Length Records: Unpacked, Pt. 2
Bitmap denotes “slots” with records
Record id: record number in page
Insert: find first empty slot
Delete: Clear bit
Record id:
(Page 2, Record 4)
Page Header
Record
Record
Record
Record
Record
Record
Record
Bitmap
25
Content Break 2
Variable Length Records
How do we know where each record begins?
What happens when we add and delete records?
Record id:
(Page 2, Record 4)
Page Header
Record
Record
Record
Record
Record
Record
27
First: Relocate metadata to footer
We’ll see why this is handy shortly…
Footer
Record
Record
Record
Record
Record
28
Slotted Page
Introduce slot directory in footer
Pointer to free space
Length + Pointer to beginning of record
reverse order
Record ID = location in slot table
from right
Delete?
e.g., 4th record on the page
Footer
Record
Record
Record
Record
Record
16
24
32
12
18
Record Id:
(Page 2, Record 4)
Slot Directory
29
Slotted Page: Delete Record
Delete record (Page 2, Record 4): Set 4th slot directory pointer to null
Doesn’t affect pointers to other records
Footer
Record
Record
Record
Record
Record
16
24
32
18
Slot Directory
Record Id:
(Page 2, Record 4)
Slotted Page: Insert Record
Insert:
Footer
Record
Record
Record
Record
Record
16
24
32
18
Slot Directory
Record Id:
(Page 2, Record 4)
31
Slotted Page: Insert Record, Pt 2.
Insert:
Place record in free space on page
Footer
Record
Record
Record
Record
Record
16
24
32
18
Slot Directory
Record
Record Id:
(Page 2, Record 4)
Slotted Page: Insert Record, Pt. 3
Insert:
Place record in free space on page
Create pointer/length pair in next open
slot in slot directory
Footer
Record
Record
Record
Record
Record
16
24
32
42
18
Slot Directory
Record
Record Id:
(Page 2, Record 4)
Slotted Page: Insert Record, Pt. 4
Insert:
Place record in free space on page
Create pointer/length pair in next open
slot in slot directory
Update the free space pointer
Footer
Record
Record
Record
Record
Record
16
24
32
42
18
Slot Directory
Record
Record Id:
(Page 2, Record 4)
34
Slotted Page: Insert Record, Pt. 5
Insert:
Place record in free space on page
Create pointer/length pair in next open
slot in slot directory
Update the free space pointer
Fragmentation?
Footer
Record
Record
Record
Record
Record
16
24
32
42
18
Slot Directory
Record
Record Id:
(Page 2, Record 4)
35
Slotted Page: Insert Record, Pt. 6
Insert:
Place record in free space on page
Create pointer/length pair in next open
slot in slot directory
Update the free space pointer
Fragmentation?
Reorganize data on page!
Footer
Record
Record
Record
Record
Record
16
24
32
42
18
Slot Directory
Record Id:
(Page 2, Record 4)
36
Slotted Page: Leading Questions
Reorganize data on page
Is this safe?
Yes this is safe because records ids
don’t change.
When should I reorganize?
We could re-organize on delete
Or wait until fragmentation blocks
record addition and then reorganize.
Often pays to be a little sloppy if page
never gets more records.
What if we need more slots?
Let’s see…
Footer
Record
Record
Record
Record
Record
16
24
32
42
18
Slot Directory
Record Id:
(Page 2, Record 4)
37
Slotted Page: Growing Slots
Tracking number of slots in slot directory
Empty or full
Footer
Record
Record
Record
Record
Record
16
24
32
42
18
Slot Directory
5
Record Id:
(Page 2, Record 4)
Slotted Page: Growing Slots, Pt. 2
Tracking number of slots in slot directory
Empty or full
Extend slot directory
Slots grow from end of page inward
Records grow from beginning of page inward.
Easy!
Footer
Record
Record
Record
Record
Record
16
24
32
42
18
Slot Directory
5
Record
16
Record Id:
(Page 2, Record 4)
Slotted Page: Growing Slots, Pt. 3
Tracking number of slots in slot directory
Empty or full
Extend slot directory
Slots grow from end of page inward
Records grow from beginning of page inward.
Easy!
And update count
Footer
Record
Record
Record
Record
Record
16
24
32
42
18
Slot Directory
6
Record
16
Record Id:
(Page 2, Record 4)
Slotted Page: Summary
Typically use Slotted Page
Good for variable and fixed length records
Not bad for fixed length records too.
Why?
Re-arrange (e.g., sort) and squash null fields
But for a whole table of fixed-length non-null records, can be worth the optimization of fixed-length format
Footer
Record
Record
Record
Record
Record
16
24
32
42
18
Slot Directory
6
Record
16
Record Formats
Record LAYOUT
Record Formats
Relational Model
Each record in table has some fixed type
Assume System Catalog stores the Schema
No need to store type information with records (save space!)
Catalog is just another table …
Goals:
Records should be compact in memory & disk format
Fast access to fields (why?)
Easy Case: Fixed Length Fields
Interesting Case: Variable Length Fields
Record Formats: Fixed Length
Field types same for all records in a file.
Type info stored separately in system catalog
On disk byte representation same as in memory
Finding i’th field?
done via arithmetic (fast)
Compact? (Nulls?)
3
T
HELLO_W
INTEGER
DOUBLE
BOOLEAN
INTEGER
CHARACTER(7)
3.142
3
4
8
1
4
7
24
Record Formats: Variable Length
What happens if fields are variable length?
Could store with padding? (Fixed Length)
Record
Bob
M
32
94703
Big, St.
VARCHAR
VARCHAR
CHAR
INT
INT
Bob
M
32
94703
Big, St.
CHAR(20)
CHAR(18)
CHAR
INT
INT
Wasted Space
Alice
M
32
94703
CHAR(20)
CHAR(18)
CHAR
INT
INT
Boulevard of the Allies
Field Not Big Enough
Record Formats: Variable Length, Pt 2.
What happens if fields are variable length?
Could use delimiters (i.e., CSV):
Issues?
Comma Separated Values (CSV)
Bob
VARCHAR
Big, St.
VARCHAR
M
CHAR
32
INT
94703
INT
,
,
,
,
Record
Bob
M
32
94703
Big, St.
VARCHAR
VARCHAR
CHAR
INT
INT
Record Formats: Variable Length, Pt. 3
What happens if fields are variable length?
Could use delimiters (i.e., CSV):
Requires scan to access field
What if text contains commas?
Comma Separated Values (CSV)
Bob
VARCHAR
Big, St.
VARCHAR
M
CHAR
32
INT
94703
INT
,
,
,
,
Record
Bob
M
32
94703
Big, St.
VARCHAR
VARCHAR
CHAR
INT
INT
47
Record Formats: Variable Length, Pt 5.
What happens if fields are variable length?
Store length information before fields:
Requires scan to access field
Idea: Move all variable length fields to end enable fast access
Comma Separated Values (CSV)
Bob
VARCHAR
Big, St.
VARCHAR
M
CHAR
32
INT
94703
INT
,
,
,
,
Record
Bob
M
32
94703
Big, St.
VARCHAR
VARCHAR
CHAR
INT
INT
48
Record Formats: Variable Length, Pt. 7
What happens if fields are variable length?
Introduce a record header
Direct access & no “escaping”, other advantages?
Handle null fields easily
useful for fixed length records too!
Bob
VARCHAR
Big, St.
VARCHAR
M
CHAR
32
INT
94703
INT
Header
Record
Bob
M
32
94703
Big, St.
VARCHAR
VARCHAR
INT
INT
Summary 2
SSN Last Name First Name Age Salary
123 Adams Elmo 31 $400
443 Grouch Oscar 32 $300
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
Record
Bob
M
32
400
Harmon
Varchar
Varchar
Char
Int
Int
Slotted Page
Page
Header
File
Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
Header
M
94703
Bob
Harmon
32
Byte Representation of Record
Content Break 3
System Catalogs
For each relation:
name, file location, file structure (e.g., Heap file)
attribute name and type, for each attribute
index name, for each index
integrity constraints
For each index:
structure (e.g., B+ tree) and search key fields
52
System Catalogs Pt. 2
For each view:
view name and definition
Plus statistics, authorization, buffer pool size, etc
Catalogs are themselves stored as relation!
PostgreSQL Information Schema
sqlite_master
Files: Summary
DBMS “File” contains pages, and records within pages
Heap files: unordered records organized with directories
Page layouts
Fixed-length packed and unpacked
Variable length records in slotted pages, with intra-page reorg
Variable length record format
Direct access to i’th field and null values
Catalog relations store information about relations,
indexes and views.
/docProps/thumbnail.jpeg