CS61B, 2019
Lecture 21: Tries
● Tries
● Trie Implementation and Performance
● Alternate Child Tracking Strategies
● Trie String Operations
● Autocomplete
datastructur.es
Tries
datastructur.es
Abstract Data Types vs. Specific Implementations
There are many ways to implement an abstract data type. ● Today we’ll talk about a new way to build a set/map.
Heap
Separate Chaining Hash Table
PQ
Balanced Tree
Ordered Linked List
Set
LinkedList
LinkedList
Resizing Array
Resizing Array
List
Map
BST (Vanilla)
LLRB
B-Trees (2-3 / 2-3-4)
Quick Find Quick Union
DisjointSets
Weighted QU
WQUPC
Heap
datastructur.es
BST and Hash Table Set Runtimes
Runtimes for our Balanced Search Tree and Hash Table implementations were very fast.
If we know that our keys all have some common special property, we can sometimes get even better implementations.
Example: Suppose we know our keys are always single ASCII characters.
● e.g. ‘a’, ‘g’, ‘!’
*: Indicates “on average”.
†: Assuming items are evenly spread.
contains(x)
add(x)
Balanced Search Tree
Θ(log N)
Θ(log N)
Resizing Separate Chaining Hash Table
Θ(1)†
Θ(1)*†
datastructur.es
Special Case 1: Character Keyed Map
Suppose we know that our keys are always ASCII characters.
● Can just use an array!
● Simple and fast.
key type
get(x)
add(x)
Balanced BST
comparable
Θ(log N)
Θ(log N)
RSC Hash Table
hashable
Θ(1)†
Θ(1)*†
data indexed array
chars
Θ(1)
Θ(1)
public class DataIndexedCharMap
public DataIndexedCharMap(int R) {
items = (V[]) new Object[R]; }
public void put(char c, V val) { items[c] = val;
}
public V get(char c) {
return items[c]; }
}
*: Indicates “on average”.
†: Assuming items are evenly spread.
R is the number of possible characters, e.g. 128 for ASCII.
datastructur.es
Special Case 2: String Keyed Map
Suppose we know that our keys are always strings.
● Can use a special data structure known as a Trie.
● Basic idea: Store each letter of the string as a node in a tree.
Tries will have great performance on:
● get
● add
● special string operations
*: Indicates “on average”.
†: Assuming items are evenly spread.
key type
get(x)
add(x)
Balanced BST
comparable
Θ(log N)
Θ(log N)
RSC Hash Table
hashable
Θ(1)†
Θ(1)*†
data indexed array
chars
Θ(1)
Θ(1)
Tries
Strings
?
?
datastructur.es
Sets of Strings
Suppose we have a set containing “sam”, “sad”, “sap”, “same”, “a”, and “awls”.
●
Below, we see the BST and Hash Table representation.
sad
0 sad sam
1 awls
2 a sap 3 same
Hash Table
awls
same
a
sam
BST
sap
datastructur.es
Tries: Each Node Stores One Character
For String keys, we can use a “Trie”. Key ideas:
● Every node stores only one letter.
● Nodes can be shared by multiple keys.
s
a
dm
Above, we show the results of adding “sam” and sad”. Use your intuition to try to insert the remaining items “sap”, “same”, “a”, and “awls”.
datastructur.es
Tries: Each Node Stores One Character
For String keys, we can use a “Trie”. Key ideas:
● Every node stores only one letter.
● Nodes can be shared by multiple keys.
a s
a
dmp
e
Above, we show the results of adding “sam” and sad”. Use your intuition to try to insert the remaining items “sap”, “same”, “a”, and “awls”.
datastructur.es
Tries: Each Node Stores One Character
For String keys, we can use a “Trie”. Key ideas:
● Every node stores only one letter.
● Nodes can be shared by multiple keys.
w
l
s
as
a
dmp
e
Try to figure out a way to make it clear that our set contains “sam”, “sad”, “sap”, “same”, “a”, and “awls”, but not “aw”, “awl”, “sa”, etc.
datastructur.es
Tries: Each Node Stores One Character
For String keys, we can use a “Trie”. Key ideas:
● Every node stores only one letter.
● Nodes can be shared by multiple keys.
w
l
s
as
a
dmp
e
Try to figure out a way to make it clear that our set contains “sam”, “sad”, “sap”, “same”, “a”, and “awls”, but not “aw”, “awl”, “sa”, etc.
datastructur.es
Tries: Each Node Stores One Character
For String keys, we can use a “Trie”. Key ideas:
● Every node stores only one letter.
● Nodes can be shared by multiple keys.
w
l
s
as
a
dmp
e
Try to figure out a way to make it clear that our set contains “sam”, “sad”, “sap”, “same”, “a”, and “awls”, but not “aw”, “awl”, “sa”, etc.
datastructur.es
Tries: Search Hits and Misses
Suppose we insert “sam”, “sad”, “sap”, “same”, “a”, and “awls”.
● contains(“sam”):
● contains(“sa”):
● contains(“a”):
true, blue node false, white node
true, blue node
● contains(“saq”): false, fell off tree
“hit” “miss”
as
wa
ldmp
se
Two ways to have a search “miss”:
● If the final node is white.
● If we fall off the tree, e.g. contains(“sax”).
datastructur.es
Trie Maps
Tries can also be maps, of course.
bst
y4ehh
a6le0oe5
e.g. maps “by” to 4.
llr
s1le7
s3
For an animated demo of the creation of this map, see this demo from our optional Algorithms textbook.
datastructur.es
Tries: A Digit-by-Digit Set Representation
sad
0 sad sam
1 awls
2 a sap 3 same
HashSet
as
wa
ldmp
se
Trie
awls
same
a
sam
BST
sap
datastructur.es
Tries
Trie:
● Short for Retrieval Tree.
● Inventor Edward Fredkin suggested it should be pronounced “tree”, but
almost everyone pronounces it like “try”.
datastructur.es
Trie Implementation and Performance
datastructur.es
Very Basic Trie Implementation
The first approach might look something like the code below.
● Each node stores a letter, a map from c to all child nodes, and a color.
public class TrieSet {
private static final int R = 128; // ASCII private Node root; // root of trie
private static class Node {
private char ch;
private boolean isKey;
private DataIndexedCharMap
ch = c; isKey = b;
next = new DataIndexedCharMap<>(R);
} }
}
Since we know our keys are characters, can use a DataIndexedCharMap.
as
wa
ldmp
se
datastructur.es
Zooming in On a Node
Each DataIndexedCharMap is an array of 128 possible links, mostly null.
private static class Node {
private char ch;
private boolean isKey;
private DataIndexedCharMap
ch = c; isKey = b;
next = new DataIndexedCharMap<>(R); }
…
a
w
…
datastructur.es
Zooming in On a Node
Better drawing of a DataIndexedCharMap based trie is shown to the right.
private static class Node {
private char ch;
private boolean isKey;
private DataIndexedCharMap
ch = c; isKey = b;
next = new DataIndexedCharMap<>(R); }
a
w
a
w …
128 links, with one used, and 127 equal to null.
w
datastructur.es
Very Basic Trie Implementation
If we use a DataIndexedCharMap to track children, every node has R links.
as
a … s w
private static class Node {
private char ch;
private boolean isKey;
private DataIndexedCharMap
ch = c; isKey = b;
next = new DataIndexedCharMap<>(R); }
public class DataIndexedCharMap
public DataIndexedCharMap(int R) {
items = (V[]) new Object[R]; }
…
}
…
…
w
l
…
a
a
…
d
…
l
…
d
…
datastructur.es
Very Basic Trie Implementation
Observation: The letter stored inside each node is actually redundant. ● Can remove from the representation and things will work fine.
as
public class TrieSet {
private static final int R = 128; // ASCII private Node root; // root of trie
private static class Node {
private char ch;
private boolean isKey;
private DataIndexedCharMap
ch = c; isKey = b;
next = new DataIndexedCharMap<>(R); }
} }
… w
…
a
…
…
l
…
…
d
…
…
datastructur.es
Trie Performance in Terms of N
Given a Trie with N keys. What is the:
● Add runtime?
● Contains runtime?
[N = 6]
as
… w
a
…
l … … … dmp
…
s
…
e
…
…
…
datastructur.es
Trie Performance in Terms of N
Given a Trie with N keys. What is the:
● Add runtime? Θ(1)
● Contains runtime? Θ(1)
[N = 6]
as
… w
a
…
l … … … dmp
Runtimes independent of number of keys! Or in terms of L, the length of the key:
● Add: Θ(L)
● Contains: O(L)
…
s
…
e
…
…
…
datastructur.es
Trie Performance in Terms of N
When our keys are strings, Tries give us slightly better performance on contains and add.
Runtimes treating length of keys as a constant
key type
get(x)
add(x)
Balanced BST
comparable
Θ(log N)
Θ(log N)
RSC Hash Table
hashable
Θ(1)†
Θ(1)*†
data indexed array
chars
Θ(1)
Θ(1)
Tries (Data Indexed Char Map)
Strings
Θ(1)
Θ(1)
*: Indicates “on average”.
†: Assuming items are evenly spread.
One downside of the DataIndexedCharMap-based Trie is the huge memory cost of storing R links per node.
● Wasteful because most links are unused in real world usage.
datastructur.es
Alternate Child Tracking Strategies
datastructur.es
Trie Performance in Terms of N
Using a DataIndexedCharMap is very memory hungry.
● Every node has to store R links, most of which are null.
private static class Node {
private boolean isKey;
private DataIndexedCharMap
private Node(char c, boolean b, int R) { ch = c; isKey= b;
next = new DataIndexedCharMap<>(R);
}
as
w
a
l
d
datastructur.es
The DataIndexedCharMap Trie
isKey: links:
97 98 99 100 …
…
F
ac d
Can use ANY kind of map from character to node, e.g.
● BST
● Hash Table
Fundamental problem, our arrays are ‘sparse’.
isKey: links:
97 98 99 100 …
…
F
isKey: links:
…
97 98 99 100 …
T
datastructur.es
isKey: links:
97 98 99 100 …
…
T
Alternate Idea #1: The Hash-Table Based Trie
isKey: links:
0 1 2 3
a
c
F
ac d
isKey: links:
T
0 1 2 3
isKey: links:
F
0 1 2 3
d
datastructur.es
Alternate Idea #2: The BST-Based Trie
isKey: links:
‘c’
‘a’
F
ac d
isKey: links:
F
‘d’
isKey: links:
T
isKey: links:
T
datastructur.es
The Three Trie Implementations
When we implement a Trie, we have to pick a map to our children
● DataIndexedCharMap: Very fast, but memory hungry.
● Hash Table: Almost as fast, uses less memory.
● Balanced BST: A little slower than Hash Table, uses similar amount of
memory?
dictionary from letter to Node
DataIndexedCharMap
this.next
Hash Table
Balanced BST
datastructur.es
Performance of the DataIndexedCharMap, BST, and Hash Table Trie
Using a BST or a Hash Table to store links to children will usually use less memory.
● DataIndexedCharMap: 128 links per node.
● BST: C links per node, where C is the number of children.
● Hash Table: C links per node.
● Note: Cost per link is higher in BST and Hash Table.
Using a BST or a Hash Table will take slightly more time.
● DataIndexedCharMap is Θ(1).
● BST is O(log R), where R is size of alphabet.
● Hash Table is O(R), where R is size of alphabet.
Since R is fixed (e.g. 128), can think of all 3 as Θ(1).
as
wa
l d m p
se
datastructur.es
Trie Performance in Terms of N
When our keys are strings, Tries give us slightly better performance on contains and add.
● Using BST or Hash Table will be slightly slower, but more memory efficient.
● Would have to do computational experiments to see which is best for your application.
Runtimes treating length of keys as a constant
key type
get(x)
add(x)
Balanced BST
comparable
Θ(log N)
Θ(log N)
RSC Hash Table
hashable
Θ(1)†
Θ(1)*†
data indexed array
chars
Θ(1)
Θ(1)
Tries (BST, Hash Table, Data Indexed Char Map)
Strings
Θ(1)
Θ(1)
*: Indicates “on average”.
†: Assuming items are evenly spread.
… but where Tries really shine is their efficiency with special string operations!
datastructur.es
Trie String Operations
datastructur.es
String Specific Operations
Theoretical asymptotic speed improvement is nice. But the main appeal of tries is their ability to efficiently support string specific operations like prefix matching.
● Finding all keys that match a given prefix: keysWithPrefix(“sa”)
● Finding longest prefix of: longestPrefixOf(“sample”)
sad
0 sad sam 1 awls
2 a sap 3 same
awls
same
a s
wa
ldmp
a
sam
sap
se
datastructur.es
Prefix Matching Operations
Theoretical asymptotic speed improvement is nice. But the main appeal of tries is their ability to efficiently support string specific operations like prefix matching.
as
wa
ldmp
Examples:
● Finding the longest prefix of a string: longestPrefixOf(“sample”)
○ Result: sam
● Finding all keys that match a given prefix:
keysWithPrefix(“sa”)
○ Result: [sad, sam, same, sap]
se
datastructur.es
Challenging Warmup Exercise: Collecting Trie Keys
Challenging Exercise: Give an algorithm for collecting all the keys in a Trie. collect() returns [“a”, “awls”, “sad”, “sam”, “same”, “sap”]
as
wa
ldmp
se
datastructur.es
Challenging Warmup Exercise: Collecting Trie Keys
Challenging Exercise: Give an algorithm for collecting all the keys in a Trie. collect() returns [“a”, “awls”, “sad”, “sam”, “same”, “sap”]
collect():
● Create an empty list of results x.
● For character c in root.next.keys():
○ Call colHelp(“c”, x, root.next.get(c)). ● Return x.
Create colHelp.
● colHelp(String s, List
a s
wa
ldmp
se
datastructur.es
Challenging Warmup Exercise: Collecting Trie Keys
Challenging Exercise: Give an algorithm for collecting all the keys in a Trie.
collect():
● Create an empty list of results x.
● For character c in root.next.keys():
○ Call colHelp(“c”, x, root.next.get(c)). ● Return x.
colHelp(String s, List
● If n.isKey, then x.add(s).
● For character c in n.next.keys():
○ Call colHelp(s + c, x, n.next.get(c))
as
wa
ldmp
se
datastructur.es
Challenging Warmup Exercise: Collecting Trie Keys
Challenging Exercise: Give an algorithm for collecting all the keys in a Trie.
collect():
● Create an empty list of results x.
● For character c in root.next.keys():
x = [] ○ Call colHelp(“c”, x, root.next.get(c)).
● Return x.
colHelp(String s, List
● If n.isKey, then x.add(s).
● For character c in n.next.keys():
○ Call colHelp(s + c, x, n.next.get(c))
colHelp(“a”, x, a ) s
wa
ldmp
se
datastructur.es
Challenging Warmup Exercise: Collecting Trie Keys
Challenging Exercise: Give an algorithm for collecting all the keys in a Trie.
collect():
● Create an empty list of results x.
● For character c in root.next.keys():
x = [“a”] ○ Call colHelp(“c”, x, root.next.get(c)).
● Return x.
colHelp(String s, List
colHelp(“a”, x, a ) s
● If n.isKey, then x.add(s).
● For character c in n.next.keys():
colHelp(“aw”, x, w ) a
○ Call colHelp(s + c, x, n.next.get(c))
ldmp
se
datastructur.es
Challenging Warmup Exercise: Collecting Trie Keys
Challenging Exercise: Give an algorithm for collecting all the keys in a Trie.
collect():
● Create an empty list of results x.
● For character c in root.next.keys():
x = [“a”] ○ Call colHelp(“c”, x, root.next.get(c)).
● Return x.
colHelp(String s, List
○ Call colHelp(s + c, x, n.next.get(c))
colHelp(“a”, x, a )
s
a
d m p
● If n.isKey, then x.add(s).
● For character c in n.next.keys(): colHelp(“awl”, x, l )
colHelp(“aw”, x, w )
se
datastructur.es
Challenging Warmup Exercise: Collecting Trie Keys
Challenging Exercise: Give an algorithm for collecting all the keys in a Trie.
collect():
● Create an empty list of results x.
● For character c in root.next.keys():
x = [“a”] ○ Call colHelp(“c”, x, root.next.get(c)).
● Return x.
colHelp(String s, List
colHelp(“a”, x, a ) s
● If n.isKey, then x.add(s).
● For character c in n.next.keys(): colHelp(“awl”, x, l ) d
○ Call colHelp(s + c, x, n.next.get(c)) colHelp(“awls”, x, s )
m p
e
colHelp(“aw”, x, w ) a
datastructur.es
Challenging Warmup Exercise: Collecting Trie Keys
Challenging Exercise: Give an algorithm for collecting all the keys in a Trie.
collect():
● Create an empty list of results x.
● For character c in root.next.keys():
x = [“a”, “awls”] ○ Call colHelp(“c”, x, root.next.get(c)).
● Return x.
colHelp(String s, List
colHelp(“a”, x, a ) s
● If n.isKey, then x.add(s).
● For character c in n.next.keys(): colHelp(“awl”, x, l ) d
○ Call colHelp(s + c, x, n.next.get(c)) colHelp(“awls”, x, s )
m p
e
colHelp(“aw”, x, w ) a
datastructur.es
Challenging Warmup Exercise: Collecting Trie Keys
Challenging Exercise: Give an algorithm for collecting all the keys in a Trie.
collect():
● Create an empty list of results x.
● For character c in root.next.keys():
x = [“a”, “awls”] ○ Call colHelp(“c”, x, root.next.get(c)).
● Return x.
colHelp(String s, List
● If n.isKey, then x.add(s).
● For character c in n.next.keys():
○ Call colHelp(s + c, x, n.next.get(c))
a colHelp(“s”, x, s )
wa
l d m p
se
datastructur.es
Challenging Warmup Exercise: Collecting Trie Keys
Challenging Exercise: Give an algorithm for collecting all the keys in a Trie.
collect():
● Create an empty list of results x.
● For character c in root.next.keys():
x = [“a”, “awls”] ○ Call colHelp(“c”, x, root.next.get(c)).
● Return x.
colHelp(String s, List
● If n.isKey, then x.add(s).
● For character c in n.next.keys():
○ Call colHelp(s + c, x, n.next.get(c))
a colHelp(“s”, x, s )
colHelp(“sa”, x, wa
l d m
)
p
se
datastructur.es
Challenging Warmup Exercise: Collecting Trie Keys
Challenging Exercise: Give an algorithm for collecting all the keys in a Trie.
collect():
● Create an empty list of results x.
x = [“a”, “awls”, “sad”,
“sam”, “same”, “sap”]
● For character c in root.next.keys():
○ Call colHelp(“c”, x, root.next.get(c)).
● Return x.
colHelp(String s, List
a colHelp(“s”, x, s )
colHelp(“sa”, x,
● If n.isKey, then x.add(s). w … a
)
p
● For character c in n.next.keys(): l ○ Call colHelp(s + c, x, n.next.get(c))
d m
se
datastructur.es
Usages of Tries
Challenge: Give an algorithm for keysWithPrefix.
● Example: keysWithPrefix(“sa”) is [“sad”, “sam”, “same”, “sap”].
as
wa
ldmp
se
datastructur.es
Usages of Tries
Challenge: Give an algorithm for keysWithPrefix.
● Example: keysWithPrefix(“sa”) is [“sad”, “sam”, “same”, “sap”].
Algorithm:
● Find the node α corresponding to the string (in pink).
● Create an empty list x.
● For character c in α.next.keys():
○ Call colHelp(“sa” + c, x, α.next.get(c)). Another common operation: LongestPrefixOf. See lab.
a s
wa
ldmp
se
datastructur.es
Autocomplete
datastructur.es
The Autocomplete Problem
Example, when I type “how are” into Google, I get 10 results, shown to the right.
One way to do this is to create a Trie based map from strings to values
● Value represents how important Google thinks that string is.
● Can store billions of strings efficiently since they share nodes.
● When a user types in a string “hello”, we:
○ Call keysWithPrefix(“hello”).
○ Return the 10 strings with the highest
value.
datastructur.es
Autocomplete Example, for Top Three Matches
Suppose we have six strings with values shown below:
● buck: 10
● sad: 12
● smog: 5
● spit: 15
● spite: 20
● spy: 7
If the user types “s”, we:
● Call keysWithPrefix(“s”).
bs
u amp
c 12
k
doiy
7
○ sad, smog, spit, spite, spy 10 ● Return the three keys with highest value.
gt
5 15
e
○ spit, spite, sad
20
datastructur.es
Autocomplete Example, for Top Three Matches
Suppose we have six strings with values shown below:
● buck: 10
● sad: 12
● smog: 5
● spit: 15
● spite: 20
● spy: 7
If the user types “s”, we:
● Call keysWithPrefix(“s”).
bs
u amp
○ sad, smog, spit, spite, spy
● Return the three keys with highest value.
c
k
doiy
gt
e
○ spit, spite, sad
datastructur.es
The Autocomplete Problem
One way to do this is to create a Trie based Dictionary that maps strings to values.
● When a user types in a string hello, we:
○ Call keysWithPrefix(“hello”).
○ Return the ten strings with the highest value.
The approach above has one major flaw. If we enter a short string, the number of keys with the appropriate prefix will be too big.
● We are collecting billions of results only to keep 10!
● This is extremely inefficient.
datastructur.es
A More Efficient Autocomplete
One way to address this issue:
● Each node stores its own value, as well as the value of its best substring.
value = None best = 20
None
b 20s
None u None a None m None p
10 12 5 20 20
None c 12 d None o None i y 7 10 12 5 20 7
None 10
10 10
k
5 g 15 t 5 20
20 20
e
datastructur.es
A More Efficient Autocomplete
One way to address this issue:
● Each node stores its own value, as well as the value of its best substring.
Search will consider nodes in order of “best”.
● Consider ‘sp’ before ‘sm’.
● Can stop when top 3
matches are all better than best remaining.
value = None best = 20
None
b 20s
None u None a None m None p
10 12 5 20 20
None 10
None 10
10 10
c
k
12 12
d
None o 5
5 g 5
None i 20
15 t 20
y 7 7
20 20
e
Details left as an exercise. Hint: Use a PQ! See Bear Maps gold points for more.
datastructur.es
Even More Efficient Autocomplete
Can also merge nodes that are redundant!
● This version of trie is known as a “radix tree” or “radix trie”.
● Won’t discuss.
value = None best = 20
10 10
buck
None
20 s
12
12 ad 5 mog 20 p 20
15 it y 7 20 7
20 20
5
None
e
datastructur.es
Trie Summary
datastructur.es
Tries
When your key is a string, you can use a Trie:
● Theoretically better performance than hash table or search tree.
● Have to decide on a mapping from letter to node. Three natural choices:
○ DataIndexedCharMap, i.e. an array of all possible child links.
○ Bushy BST.
○ Hash Table.
● All three choices are fine, though hash table is probably the most natural.
● Supports special string operations like longestPrefixOf and keysWithPrefix.
○ keysWithPrefix is the heart of important technology like autocomplete.
○ Optimal implementation of Autocomplete involves use of a priority
queue!
Bottom line: Data structures interact in beautiful and important ways!
datastructur.es
Domain Specific Sets and Maps
More generally, we can sometimes take special advantage of our key type to improve our sets and maps.
● Example: Tries handle String keys. Allow for fast string specific operations.
● Note: There are many other types of string sets/maps out there.
○ Suffix Trees (Link).
○ DAWG (Link).
○ Won’t discuss in our course.
datastructur.es