CS代考 CLUSTER 29

Cost Model for Execution
❑ How should we estimate the costs for executing a statement? ✩Number of I/Os
✩CPU Execution Cost

Copyright By PowCoder代写 加微信 powcoder

✩ Network Cost in distributed system (ignore for now)
✩ Assumption in this course
✩I/O cost >>> CPU cost
✩Real systems also consider CPU
❑ Simplifications
✩ only consider disk reads (ignore writes — assume read-only
✩ only consider number of I/Os and not the individual time for each read (ignores page pre-fetch)
✩ Average-case analysis; based on several simplistic assumptions.
☛ Good enough to show the overall trends!

Typical Operations
• Scan over all records
– SELECT * FROM Students
• Point Query
– SELECT * FROM Students
• Equality Query
– SELECT * FROM Students
• Range Search
– SELECT * FROM Students
WHERE sid = 100
WHERE starty = 2015
WHERE starty > 2012 and starty <= 2014 Typical Operations – INSERT INTO Students VALUES (23,‘Bertino”,2016,... ) – DELETE FROM Students WHERE sid = 100 – DELETE FROM Students WHERE endyear < 1950 – Delete+insert File Organization assume each relation is a file: ❑ Heap files: ✩ Linked, unordered list of all pages of the file ✩ How does it do? ● scan retrieving all records (SELECT *)? ▲ ok, you have to retrieve all pages anyways ● Point query on unique attributes ▲ not great: have to read on avg. half the pages to return one record ● range search or equality search on non-primary key ▲ not great: have to read all pages to return subset of records. ▲ yes: Cost for insert low (insert anywhere) ● delete/update ▲ same as for equality/range search -- depends on WHERE clause File Organizations II ❑ Sorted Files: ✩ Recordsareorderedaccordingtooneormoreattributesoftherelation ✩ Isitgoodfor: ● scan retrieving all records (SELECT *)? ▲ yes,youhavetoretrieveallpagesanyways ● equality search on sort attribute ▲ good:findfirstqualifyingpagewithbinarysearchinlog2(number-of-pages) ● range search on sort attribute ▲ good:findfirstqualifyingpagewithbinarysearchinlog2(number-of-pages);adjacent pages might have additional matching records ▲ notgood:havetofindproperpage;overflowpossible ● delete/update ▲ findingtuplesameasequality/rangesearchdependingonWHEREclause ▲ update itself might lead to restructuring of pages ● Sorted output: (ORDER BY) ▲ good if on sorted attribute ❑ Even a sorted file only supports queries on sorted attributes. ❑ Solution: Build an index for any attribute (collection of attributes) that is frequently used in queries ✩ Additional information that helps finding specific tuples faster ✩ We call the collection of attributes over which the index is built the search key attributes for the index. ✩ Any subset of the attributes of a relation can be the search key for an index on the relation. ✩ Search key is not the same as primary key / key candidate Creating an index in DB2 ✩CREATE INDEX ind1 ON Students(sid); ✩DROP INDEX ind1; B+ Tree: The Most Widely Used Index ❑ Each node/leaf represents one page ✩ Since the page is the transfer unit to disk ❑ Leafs contain data entries (denoted as k*) ✩ For now, assume each data entry represents one tuple.The data entry consists of two parts ● Value of the search key (k) ● Record identifier (rid = (page-id, slot)) ✩ That is: data entry is NOT a tuple but a pointer to a tuple ❑ Root and inner nodes have auxiliary index entries Index for skaters On attribute sid Inner Index entry: value + page page pointer Data entry: value + rid Disclaimer – Example relation has only 16 tuples • You don’t build an index for such a relation – Think of thousands, and million and more tuples • Better Example: – McGill’s student relations – Sid = student id = 15-digit number • Index pages: – Inner pages contain hundreds of index entries – Leave pages contain hundreds of data entries B+ Tree (contd.) ❑ height-balanced. ✩Each path from root to tree has the same height ❑ F = fanout = number of children for each node (~ number of index entries stored in node) ❑ N=#leafpages ❑ Insert/delete at log F N cost; ❑ Minimum 50% occupancy (except for root). Index entries in index inner nodes Data entries in index leaves Data records in data pages Example B+ Tree Example tree has height of 1 “Select * from Skaters where sid = 5” – Search begins at root, and key comparisons direct it to a leaf – Number of pages accessed: • three:root,leaf,datapagewiththecorrespondingrecord – Number of I/O: • depends of how much of tree in main memory • rough assumption: root and intermediate nodes are always in main memory; index leaves and data pages not in main memory upon first Example B+ Tree “Select * from Skaters where sid = 5” – I/O costs: • one for leaf page with data entry, one for data page with data record “Select * from Skaters where sid >= 33”
– I/O costs:
• one for leaf page
• four for data pages with records
Good for equality search AND range queries (depending on the range)

Inserting a Data Entry
• Find correct leaf L.
• Put data entry into L.
– If L has enough space, done!
– Else, must split L (into L and a new node L2)
• Redistribute entries evenly, copy up middle key.
• Insert index entry pointing to L2 into parent of L. • This can happen recursively
– To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
• Splits “grow” tree; root split increases height.
– Tree growth: gets wider or one level taller at top. 17

Inserting 8* into Example B+ Tree
Insert into Leaf with leaf split
Insert into internal node with node split
Assume that inner pages
can only contain 4 index entries

Example: After Inserting 8*
2* 3* 5* 7* 8* 14*16* 19*20*22* 24*27*29* 33*34*38*39*
❖ Notice that root was split, leading to increase in height.
❖ In this example, we can avoid split by redistributing entries;
however, this is usually not done in practice.

Data Entry Alternatives: indirect Indexing
• Indirect Indexing I
– so far: k* = (indirect
– on non-primary key search key: (2015, rid1), (2015, rid2), (2015, rid3), …
• several entries with the same search key side by side
• Indirect indexing II
(indirect indexing)
– on non-primary key search key: (2015, (rid1, rid2, rid3,…)), (2016, (rid…
• Comparison:
– first requires more space (search key repeated)
– second has variable length data entries
– second can have large data entries that span a page

Direct Indexing
❑Instead of data-entries in index leaves containing rids, they could contain the entire tuple
✩data-entry = tuple ✩no extra data pages
❑This is kind of a sorted file with an index on top

Index Classification
❑ Primary vs. secondary: If search key contains primary key, then called primary index.
✩ Unique index: Search key is primary key or unique attribute. ❑ Clustered vs. unclustered:
✩ clustered:
● Relation in file sorted by the search key attributes of the index
✩ unclustered:
● Relation in heap file or sorted by an attribute different to the search key
attribute of the index.
✩ Afilecanbeclusteredonatmostonesearchkey.
✩ Costofretrievingdatarecordsthroughindexvariesgreatlybasedon whether index is clustered or not!

Clustered vs. Unclustered Index
❑Example for Students: ✩clustered on name ✩unclustered on sid
CLUSTERED on name
UNCLUSTERED on sid
Index entries
Data Records
Data entries
(Data file)
Data Records

B+-tree cost example: Given
✩ About the relation
✩ Relation R(A,B,C,D,E,F)
✩ A and B are int (each 4 Bytes), C-F is char[40] (160 Bytes) ✩ Values of B are within [0;19999] uniform distribution
✩ 200,000 tuples
✩ About the heap file
✩ Each data page has 4 K and is around 80% full
✩ I know, I know: with fixed sized records we don’t need to leave space, but in real life records don’t have fixed length and we should leave space….
✩ The size of an rid = 10 Bytes ✩ About the index to be built (on B )
✩ An index page has 4K
✩ Index pages are filled between 50% – 100% (this always holds!!) ✩ The size of a pointer in intermediate pages: 8 Bytes

Size of B+tree: To calculate
✩ Number of data pages in the heap file ✩ size of the tupple = 168
✩ 4000 * 0.80 / 168 ≈ 19.0476 tuples in a page on an average. ✩ 200000 / 19.0476 ≈ 10501 pages in total.
✩ The number of leaf pages
✩ distinct values of B = 20000
✩ 200000 / 20000 = 10 records per val of B
✩ sizeof(B) + 10 rids * sizeof(rids)
✩4+10*10=104 bytes/dataentry
✩ 4000 * 0.75 / 104 ≈ 28.846 data entries/page on an average. ✩ 20000/28.846 ≈ 694 leaf pages.
✩ The height of the tree
✩ 4 + 8 = 12 bytes intermediate node entry
✩ Can fit 4000 * 0.75 / 12 = 250 entries per intermediate node.
✩ This requires 3 intermediate nodes right above the leaf page (to manage 694 leaves). ✩ 1 node above them to hold pointers to 3 intermediate pages (i.e, the root).
✩ height = 2 (the number of edges from root to a leaf node)

B+ Trees in Practice
❑ Typical order d of inner nodes: 100 (I.e., an inner node has between 100 and 200 index entries)
✩ Typical fill-factor: 67%. ✩ averagefanout=133
❑ Leaf nodes have often less entries since data entries larger (rids)
❑ Typical capacities (roughly)
✩ Height3:1334=312,900,721records
✩ Height 2: 1333 = 2,352,637 records
✩ Height1:1332= 17,689records
❑ Can often hold top levels in buffer pool:
✩ Level 1 (root) = 1 page =
✩ Level 2 = 133 pages =
✩ Level 3 = 17,689 pages = 26
4 Kbytes 0.5 Mbyte

Multi-attribute index
• Index on Skaters(age,rating);
• Order is important:
– Here data entries are first ordered by age
– Skaters with the same age are then ordered by rating
– assume youngest skater is 3, oldest is 25 (list of rids: *): • the leaf pages then would look like:

What does it support
• What does it support?
– SELECT * FROM Skaters WHERE age = 20;
– SELECT * FROM Skaters WHERE age = 20 AND rating < 5; – SELECT * FROM Skaters WHERE rating < 5; Index in DB2 ✩ CREATE INDEX ind1 ON Students(startyear); ✩ DROP INDEX ind1; ❑ Index also good for referential integrity (uniqueness) ✩ CREATE UNIQUE INDEX indemail ON Students(email) ❑ Additional attributes ✩ CREATE UNIQUE INDEX ind1 ON Students(sid) INCLUDE(name) ✩ Index only on sid ✩ Data entry contains key value (sid) + name + rid ✩ SELECT name FROM Students WHERE sid = 100 ● Can be answered without accessing the real data pages of Students relation! ❑ Clustered index ✩ CREATE INDEX ind1 on Students(name) CLUSTER 29 Static Hashing ❖ Similar to standard hashing but with pages as the unit of storage (compare with array-based main memory implementation) ❖ Decide on a number M of buckets at index creation ❖ Allocate one primary page per bucket ❖ Overflow pages for individual buckets are created later as needed ❖ Buckets contain data entries (same as leaf pages in tree) ❖ let k be the search key of the index, h a hash function ❖ h(k) mod M = bucket to which data entry with key k belongs. h(key) mod N Buckets contain data entries Data entry: value + rid Primary bucket pages Overflow pages ❖Buckets contain data entries. ❖Hash fn works on search key field of record r. Must distribute values over range 0 ... M-1. – h(key) = (a * key + b) usually works well. – a and b are constants; lots known about how to tune h. ❖ Long overflow chains can develop and degrade performance. ❖Several optimizations developed such to handle scale dynamically (e.g., extensible and linear hashing) Summary for B+-trees ❑Tree-structured indexes are ideal for range- searches, also good for equality searches. ✩ High fanout (F) means depth rarely more than 3 or 4. ✩ Almost always better than maintaining a sorted file. ❑Can have several indices on same tables (over different attributes) ❑Most widely used index in database management systems because. One of the most optimized components of a DBMS. File Organizations ❑ Hashed Files:. ✩File is a collection of buckets. Bucket = primary page plus zero or more overflow pages. ✩Hashing function h: h(r) = bucket in which record r belongs. h looks at only some of the fields of r, called the search fields. ✩Best for equality search (only one page access and maybe access to overflow page) ✩No advantage for range queries ✩Fast insert ✩Cost on delete depends on cost for WHERE clause Data Entry Alternatives: Direct Index • Structure –
– Leaf pages contain entire records
• Dataentries=records(notonlypointerstothem) • e.g.,indexonsidofSkaters
• (1,lilly,10,16),(2,debby,8,10)…
– Leafs represent sorted file for data records.
– Inner nodes above leafs provide faster search
– Typically at most one direct index per relation
• (Otherwise, data records duplicated, leading to redundant storage and
potential inconsistency.)
Index entries in index inner nodes
Data entries in index leaves = Data records in data pages

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com