Algorithms and Data 16: Search Tree Menagerie
Professor Kevin Gold
Balanced Search Trees Suppose we want to guarantee that searches
•
for particular values in a tree take O(log n) time
This doesn’t happen by itself — imagine entering values in a binary search tree such
that a1 < a2 < a3 ... then the tree is just a list branching to the right, O(n) to get to the bottom
1
2
Unbalanced binary search tree
•
If we’re filling earlier depths before later depths,
•
the time will be more like O(log n)
There are a wide variety of strategies for
•
achieving O(log n) search time in a tree, and we will see two
3
Basic Binary Search Trees
Rough description: Keys stored in a binary tree such that
Basic purpose: A template for other tree types to riff off of.
How it roughly works: Insertion recursively goes down the tree going left or right until there’s room for a leaf, but deletion requires a little “surgery.”
2
14
36
•
left.key <= key <= right.key.
•
•
Simple Binary Search Tree Operations
Find an element: Look at current node, recursively
•
look in left child if element < key, else look in right
• •
Minimum in tree: Follow left links until leaf Maximum in tree: Follow right links until leaf
2
14
36
•
Finding Successors
To find the next node in key order:
If right child isn’t nil
return minimum of right subtree // Case 1
Else go up tree until at node that is left child
of its parent, return that
If we made it to root, return root
Case 3 2 Case 1
// Case 2
// Case 3
Case 2
4
36
1
0
•
Insertion
Recur down tree — if toInsert < key, recur on left, If the place to recur on is empty, park it there
2
4
36
•
else recur on right
1
0
5
•
and want to remove the node. Three cases:
5
Deletion
to delete 2
Assume we have a pointer to a node z
1
•
•
•
•
successor
6 4
z has no children: Set parent’s pointer to null.
z has one child: Splice out this node, setting parent’s pointer to child.
z has two children: Splice out successor(z), and copy its data into z’s location.
3
0
3
1
5
Why can’t the successor have 2 children?
4
6
0
Red-Black Trees Rough description: A binary tree where each node has a
•
“color,” red or black.
Basic purpose: Ensure the tree is roughly balanced after all
•
insertions and deletions, maintaining O(log n) search time.
How it roughly works: Maintain an invariant that the number of
•
black nodes on a path to a leaf is the same no matter which leaf
you choose.
2
14
nnnn
• • • •
Red-Black Properties
After every insertion or deletion and subsequent fixing,
•
every red-black tree will satisfy:
Every node is either red or black.
The root is black. (“black root property”)
Every leaf is the same value NIL and is black.
If a node is red, its children are both black.
(“red parent property”)
2
14
nnnn
For all nodes, all paths to descendant leaves have the
•
same number of black nodes. (“black-height property”)
The Red-Black Height Limit
A red-black tree with N nodes has height at most 2 lg(n + 1).
•
•
•
Let the “black-height” bh(x) of a node be the number of black nodes A subtree contains at least 2bh(x) - 1 internal nodes.
True at leaf: 20 - 1 = 0.
•
from x to the leaf, not counting x.
Assume true for smaller subtrees: left + right + 1
At least half the nodes from the root to a leaf must be black, so
•
>= 2bh(x)-1 + 2bh(x)-1 +1 = 2bh(x) – 1
•
bh(root) >= (tree_height)/2 and (by above) N >= 2tree_height/2-1.
•
So: lg(n+1) >= lg(2tree_height/2) = tree_height/2; tree_height<= 2 lg(n+1).
•
Since a tree that obeys the red-black properties will be roughly balanced, we just have to maintain those properties as we insert and delete.
Maintaining Red-Black
Properties
Insertion consists of recursively deciding whether we are < or >
•
internal node keys until we reach a [nil] leaf, then replacing that
with a RED node, then fixing the tree so that it still obeys the properties.
2
14
insert 3
2 fix 14
2 13
nnnnnn
3nnnn4
nn
nn
Fixing the Tree After
Insertion Set needsFixing (nF) to the newly inserted node
While nF is not root and parent(nF) is RED:
Let uncle(needsFixing) be the sibling of parent(needsFixing) If uncle(needsFixing) is RED // Case 1
4
28
n nF 3 n n
nn
Case 1
4
28
n3nn
• •
• •
•
uncle
Make both the uncle and the parent BLACK
•
Else // Code below assumes parent is a left child; flip if not
If needsFixing is the right child, rotate it left to make it a
•
parent of its former parent (which is now nF) // Case 2
Rotate needsFixing, parent, and grandparent all to the Set the root to BLACK.
•
right, make them red-black-red // Case 3
•
nn
Fixing the Tree After
Insertion Set needsFixing (nF) to the newly inserted node
While nF is not root and parent(nF) is RED:
Let uncle(needsFixing) be the sibling of parent(needsFixing) If uncle(needsFixing) is RED // Case 1
• •
• •
•
•
Make both the uncle and the parent BLACK
4
2n
n nF 3
nn
Case 2
4
n n
•
Else // Code below assumes parent is a left child; flip if not
If needsFixing is the right child, rotate it left to make it a Rotate needsFixing, parent, and grandparent all to the
•
parent of its former parent (which is now nF) // Case 2
right, make them red-black-red Set the root to BLACK
// Case 3
3
2
•
nF
nn
Fixing the Tree After
4
n n
Case 3
3
4
nn
Insertion Set needsFixing (nF) to the newly inserted node
While nF is not root and parent(nF) is RED:
• •
• •
•
•
3
2
nn
Let uncle(needsFixing) be the sibling of parent(needsFixing) If uncle(needsFixing) is RED // Case 1
nF
Make both the uncle and the parent BLACK
•
Else // Code below assumes parent is a left child; flip if not
If needsFixing is the right child, rotate it left to make it a Rotate needsFixing, parent, and grandparent all to the
•
parent of its former parent (which is now nF) // Case 2
nF
n
2
right, make them red-black-red Set the root to BLACK
// Case 3
•
n
•
•
•
•
Correctness Analysis:
Loop Invariants
Within the “while” loop of the preceding algorithm: needsFixing is red
If the parent is the root, the parent is black.
•
If there is a red-black property violation, there is only one, and it is either of the
•
“black root” property (because needsFixing is the root) or the “red parent” property (because needsFixing and its parent are red)
These start true:
The inserted node is red.
If needsFixing isn’t the root, the root is already black.
•
If needsFixing is the root, it’s red and the other properties can’t be violated.
•
If needsFixing isn’t the root, adding it doesn’t change any black-heights and other properties besides red-red clearly aren’t violated.
•
•
•
•
If these hold, all red-black properties hold in the end:
If the parent is not RED (our termination condition), we don’t violate the We explicitly set the root’s color to BLACK in the end
Correctness Analysis:
Loop Invariants Within the “while” loop of the preceding algorithm:
needsFixing is red
If the parent is the root, the parent is black.
If there is a red-black property violation, there is only one, and it is
•
either of the “black root” property (because needsFixing is the root) or the “red parent” property (because needsFixing and its parent are red)
•
“red parent property”
•
Correctness Analysis: Loop Invariants
•
•
Within the “while” loop of the preceding algorithm: needsFixing is red
If the parent is the root, the parent is black.
4
28
n nF 3 n n
nn
Case 1
28
n3nn
•
If there is a red-black property violation, there is only
•
one, and it is either of the “black root” property
(because needsFixing is the root) or the “red parent” property (because needsFixing and its parent are red)
4
•
Our moves maintain these properties:
Case 1: needsFixing remains red, we set our parent to
•
black, and either we’re the root and didn’t change
anything or we had a “red parent” violation and fixed it.
nn
Correctness Analysis: Loop Invariants
4
n n
•
•
3
2
nn
Within the “while” loop of the preceding algorithm: needsFixing is red
If the parent is the root, the parent is black.
nF
•
•
Our moves maintain these properties:
• Case 3: needsFixing is red, its current parent’s color was unchanged, and either we’re now the root but everything else is okay or we’ve rotated into a position with a red parent, but the children’s colors are okay.
3
n
If there is a red-black property violation, there is only
•
one, and it is either of the “black root” property
(because needsFixing is the root) or the “red parent” property (because needsFixing and its parent are red)
Case 3
nF 2
4
n
n
n
Correctness Analysis: Loop Invariants
Within the “while” loop of the preceding algorithm: needsFixing is red
If the parent is the root, the parent is black.
4
2n
n nF 3
n n
Case 2
3
2 nF n
•
•
•
If there is a red-black property violation, there is only
•
one, and it is either of the “black root” property
(because needsFixing is the root) or the “red parent” property (because needsFixing and its parent are red)
• Our moves maintain these properties:
4
Case 2: We maintain the invariants with the initial
•
rotation — needsFixing is red, the parent wasn’t the
n
root or we wouldn’t be executing this (the root is black), and if just one “red parent” violation existed before, this doesn’t change that.
nn
Red-Black Tree Summary
Red-Black Trees: maintain red/black values on
•
each node as a way of guaranteeing balance.
•
“All paths to leaves have same number of black nodes” plus “red parents have black children” guarantees height 2(lg n + 1)
The tricky nature of fixing the trees after insertions and (even more complicated) deletions makes this a little difficult to code
•
B-Trees
Rough description: Extremely “wide” trees where internal
•
nodes store the values between each of their children.
Basic purpose: Used in filesystems and databases to avoid
•
costly disk seeks.
•
How it roughly works: Maintain a number of children between
t and 2t at each node, splitting or merging nodes as
necessary to obey these limits.
0 100 200 300
-100 -50 -2 -1 2 10 19 51 110 151 191 199 210
Avoiding Slow Disk Seeks
While accessing data in RAM can be pretty fast, accessing a
•
hard drive can be slow
A physical arm needs to move to the correct place on the
•
disk
100,000 times slower than accessing RAM is typical (but
•
such stats always change)
When accessing data on the hard drive, typically a page of data
•
is pulled – 211 to 214 bytes
In trying to find the page with your data, the disk must read
•
some kind of index of addresses – requiring more lookups!
•
B-Tree Structure
All nodes besides the root contain between t-1 and 2t-1 Internal nodes have [keys+1] pointers to child nodes.
•
keys. (The root just needs > 0 if tree is nonempty.)
For any key k in ith child ci, keyi-1 ≤ k ≤ keyi (if those parent
•
keys exist)
The parent’s key values come between different
•
children’s values.
0 100 200 300
-100 -50 -2 -1 2 10 19 51 110 151 191 199 210 151
1
Search in a B-Tree Search
•
•
•
At each node, iterate through keys until a key is found that is larger than the
•
key sought.
This is usually work done in RAM — trivial compared to seeks.
•
Retrieve the child that must contain the desired key.
This typically requires retrieving one page from the hard disk — a node Continue until the leaf with the desired key and attached data is retrieved.
Overall, # disk seeks = height of B-Tree.
0 100 200 300
•
contains exactly what fits in a page.
-100 -50 -2 -1 2 10 19 51 110 151 191 199 210 151
1
•
B-Tree Height
The height of a B-Tree with N keys and minimum t children at
•
each node is at most logt (N+1)/2.
Root contains at least 1 key, so at least 2 nodes at depth 1, 2t Minimum t-1 keys per node
•
at depth 2, 2t2 at depth 3 … 2th-1 at depth h = height
• •
h i-1
≥ 1 + (t-1) Σ 2t
i=1
= 1 + 2(t-1)((th-1)/(t-1))
= 2th – 1
th ≤ (N+1)/2, so h ≤ logt (N+1)/2
N
B-Trees are
Typically Very Branchy
Branching factors t are typically between 50 and 2000
•
(depending on what fits in a page)
With 1000 keys at each node, a depth 2 tree can store
•
1000 + 1001 * 1000 + 1001 * 1001 * 1000 = over 1 billion keys, needing only at most 2 seeks to retrieve any key
[1000 keys]
[1000 keys] [999 children] [1000 keys]
[lots of children]
[1000 keys] [999 children] [1000 keys] [1000 keys] [999 children] [1000
•
•
BC D
G
Insertion With Splitting
To maintain the 2t children limit: on insertion, while recurring, if the child we will recur on is full, split it before entering.
To split, take the median value and make it a key in the parent; then remove the values after that and put them in their own child, separate from the values before.
If we’re recurring down the tree doing this, the parent is guaranteed to have room for a new key.
2t = 4 (max 3 keys), insert A:
FG H
…
FH
•
B C D
…
HI
If the root is full, make a new root to
•
put the median in before splitting.
AB
D
…
CF
G
•
We Are Skipping Deletion in These Trees
Deletion tends to be a little more complicated to repair than insertion, but is possible to do in both Red-Black and B-Trees
There are case-by-case things to do for both Red-
•
Black Trees and B-Trees to repair their properites
The insertion operations give the flavor of how
•
these trees maintain their properties.
Radix Trees/Tries
Rough description: A search tree for strings (or other sequences)
•
where nodes correspond to different possible prefixes
Basic purpose: Find the string’s place in the tree in time
•
proportional to the string’s length
How it roughly works: Create decisions in the tree only where go the
pher 1 (the)
1
(gopher)
•
data strings disagree
We can also store references to data at the leaves
ld
1
(gold)
•
String comparison requires time proportional to the length of a string
Why Not a
Binary Search Tree?
If our internal nodes contain strings, this can result in O(s log n) time
•
in even a balanced tree, where s is the length of the string
A binary search tree can also use more memory in repeating parts
•
of strings
gopher
Looking for “goldfish” will probably take about
7 character comparisons here
gold
the
•
Why not a Hash Table?
We could simply hash strings to store them. Some properties this
•
doesn’t have:
•
•
Strings stored in sorted order — could be useful
We can make a fast sort out of building a trie, then traversing it
If queries are received online, partial data gets us close to the
•
answer. Need to wait for full string before hashing.
(Yes, hashing is O(s) if you factor in the length of the string s.)
Tries can be fast for “auto-complete”/“spell-check” applications Tries can also save some space over hashing all strings.
• •
Insertion Into a Trie Follow the labeled links until your
•
string finds no relevant edge to follow.
go the pher (the)
Find the longest match among the
•
edges leading from the current node.
ld
Replace that edge with an edge
•
having the shared prefix, and make a
new node with descendants being the old descendant and the new node.
(gopher) th
pher at e (gopher)(that)(the)
If no match found because our word
(gold)
go
•
is longer, just make a new edge to it.
If no match found because we’re out
ld (gold)
•
of nodes, just use the internal node.
•
In any case, keep links in lexicographic (alphabetical) order
•
Once all strings are in, do a pre-order traversal
(root, left, right) of the tree to print in lexicographic order. (alphabetical with short strings before long)
Radix Sort
We can sort strings in O(N) time (where N is the sum of
•
all string lengths) using this strategy:
For each string, insert it into the trie. (Requires time
•
linear in the number of characters in the string.)
This strategy is called “radix sort” and also works for
•
sequences of bits.
•
•
•
•
Good Idea, Bad Idea Are the following design choices good or bad?
Storing a bunch of key-value pairs, want to minimize CPU time (assume no disk seeks). We choose to use a very wide B-Tree to store all the data.
We want to store a hierarchy of data, where it matters which nodes are the parents of which. But let’s use a red-black tree and be balanced, too.
I’m a search engine and I want to proactively retrieve search results as the user types. I’ll store search strings as a trie with pointers to search results at each node.
•
Summary Four search trees presented today —
Basic balanced binary search tree: a starting point
•
for others
Red-black trees: guaranteed balanced by fiddling
•
with node colors
B-Trees: Extremely wide to avoid costly steps down Radix trees: Handy way of storing strings
•
tree
•