COMP90038 – Algorithms and Complexity
Lecture 17: Hashing
(with thanks to Harald Søndergaard & Michael Kirley)
Casey Myers – Casey.Myers@unimelb.edu.au – David Caro Building (Physics) 274
Review from Lecture 16: Sorting by Counting
• We can now create a sorted array S[1..𝑛] of the items by simply slotting items into pre-determined slots in S (a third linear scan).

A = 6 3 3 8 1 0 8 7 9 2 5 3 5 3 1 8 7 6 5 1 2 1 5 3

key:   0  1  2  3  4  5  6  7  8  9
Cumu:  1  5  7 12 12 16 18 20 23 24

• Place the first record (with key 6) in S[18] and decrement Cumu[6] (so that the next ‘6’ will go into slot 17), and so on.

index: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
A:     6 3 3 8 1 0 8 7 9  2  5  3  5  3  1  8  7  6  5  1  2  1  5  3
S:     0 1 1 1 1 2 2 3 3  3  3  3  5  5  5  5  6  6  7  7  8  8  8  9
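The three linear scans described above can be sketched in Python. This is a minimal illustrative helper (not the lecture's own pseudocode), following the slide's convention of placing each item at its cumulative-count slot and then decrementing:

```python
def counting_sort(a, k=10):
    """Sort a list of keys drawn from range(k) by distribution counting."""
    count = [0] * k
    for x in a:                    # scan 1: count each key
        count[x] += 1
    cumu = [0] * k                 # scan 2: cumulative counts
    running = 0
    for v in range(k):
        running += count[v]
        cumu[v] = running          # Cumu[v] = number of items with key <= v
    s = [None] * len(a)
    for x in a:                    # scan 3: slot each item into place
        s[cumu[x] - 1] = x         # e.g. the first 6 goes into slot 18 (1-indexed)
        cumu[x] -= 1               # so the next 6 will go into slot 17
    return s

a = [6,3,3,8,1,0,8,7,9,2,5,3,5,3,1,8,7,6,5,1,2,1,5,3]
print(counting_sort(a) == sorted(a))  # True
```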
Review from Lecture 16: Horspool’s String Search Algorithm
• Compare from right to left in the pattern.
• Very good for random text strings.
• We can do better than just observing a mismatch here.
• Because the pattern has no occurrence of I, we might as well slide it 4 positions along.
• This is based only on knowing the pattern.
Improving Searching
• Array or linked list
  – Access by position
  – Ο(𝑛)
• Can we do better?
• Sorted list/array – binary search
  – Ο(log 𝑛)
• Binary search tree
  – Ο(log 𝑛) — if balanced
• Can we do better?
  – Hashing (at the cost of spending a bit of space)
  – Ο(1)
Hashing
• Hashing is a standard way of implementing the abstract data type “dictionary”.
• Implemented well, it makes data retrieval very fast.
• A key can be anything, as long as we can map it efficiently to a positive integer. In particular, the set 𝐾 of keys need not be bounded.
• Assume we have a table of size 𝑚 (the hash table).
• The idea is to have a function h: 𝐾 → {1, …, 𝑚} (the hash function) determine where records are stored: a record with key 𝑘 should be stored in location h(𝑘).
• The address h(𝑘) is the hash address.
The Hash Table
• We can think of the hash table as an abstract data structure supporting operations:
  – find
  – insert
  – lookup (search, and insert if not there)
  – initialise
  – delete
  – rehash
• The challenges:
  – Design of hash functions.
  – Collision handling.
The Hash Function
• If we have a hash table of size 𝑚 and keys are integers, we may define h(𝑛) = 𝑛 mod 𝑚.
• But keys may be other things, such as strings of characters, and the hash function should apply to these and still be easy (cheap) to compute.
• We need to choose 𝑚 so that it is large enough to allow efficient operations, without taking up excessive memory.
• The hash function should distribute keys evenly across the cells of the hash table.
The Hash Function
• If we have a hash table of size 𝑚 and keys are integers, we may define h(𝑛) = 𝑛 mod 𝑚.
• Examples of the modulo operation (remainder after division):
  – 5 mod 11 = 5
  – 76999 mod 11 = 10
  – 120 mod 11 = 10
• Example hash function: h(𝑛) = 𝑛 mod 23

𝑛:    19  392  179  359  262  321  97  468
h(𝑛): 19    1   18   14    9   22   5    8
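The table above can be reproduced with a one-line hash function; this is just an illustrative check (the function name is mine):

```python
def h(n, m=23):
    # Hash an integer key by taking the remainder after division by the table size.
    return n % m

keys = [19, 392, 179, 359, 262, 321, 97, 468]
print([h(n) for n in keys])  # [19, 1, 18, 14, 9, 22, 5, 8]
```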
Hashing of Strings
• For simplicity we assume A ⟼ 0, B ⟼ 1, C ⟼ 2, etc.

char: A B C D E F G H I J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
𝑠ᵢ:   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

• We use this encoding to define the hash function of a string 𝑠 = 𝑠₀𝑠₁ ⋯ 𝑠ₙ₋₁:
    h(𝑠) = (𝑠₀ + 𝑠₁ + ⋯ + 𝑠ₙ₋₁) mod 𝑚
• Example: consider 𝑚 = 13, and the list of strings:
    A, FOOL, AND, HIS, MONEY, ARE, SOON, PARTED
word:  A  FOOL  AND  HIS  MONEY  ARE  SOON  PARTED
SUM:   0    44   16   33     67   21    59      58
h(𝑠):  0     5    3    7      2    8     7       6

• For example, FOOL: F + O + O + L = 5 + 14 + 14 + 11 = 44, and 44 mod 13 = 5. Note that HIS and SOON collide at address 7.
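The letter-sum hash used in this worked example can be written as a short sketch (`hash_word` is an illustrative name, not from the slides):

```python
def hash_word(s, m=13):
    # Map A->0, B->1, ..., Z->25, sum the letter values, reduce modulo m.
    return sum(ord(c) - ord('A') for c in s) % m

words = ["A", "FOOL", "AND", "HIS", "MONEY", "ARE", "SOON", "PARTED"]
print([hash_word(w) for w in words])  # [0, 5, 3, 7, 2, 8, 7, 6]
```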
Hashing of Strings
• We assume a binary representation of the 26 characters, with 5 bits per character (0–31).
• Instead of adding, we concatenate the binary strings.
• Consider the example key: M Y K E Y.
• Assume a hash table of size 𝑚 = 101.

char  𝑠ᵢ  bin(𝑠ᵢ)      char  𝑠ᵢ  bin(𝑠ᵢ)
A      0  00000        N     13  01101
B      1  00001        O     14  01110
C      2  00010        P     15  01111
D      3  00011        Q     16  10000
E      4  00100        R     17  10001
F      5  00101        S     18  10010
G      6  00110        T     19  10011
H      7  00111        U     20  10100
I      8  01000        V     21  10101
J      9  01001        W     22  10110
K     10  01010        X     23  10111
L     11  01011        Y     24  11000
M     12  01100        Z     25  11001
Hashing of Strings

char:     M      Y      K      E      Y
𝑠ᵢ:       12     24     10     4      24
bin(𝑠ᵢ):  01100  11000  01010  00100  11000
𝑖:        0      1      2      3      4

• Now concatenate the binary strings:
    M Y K E Y ⟼ 0110011000010100010011000 (= 13379736)
    13379736 mod 101 = 64
• So 64 is the position of the string M Y K E Y in the hash table.
• We deliberately chose 𝑚 to be prime. Note that
    13379736 = 12×32^4 + 24×32^3 + 10×32^2 + 4×32^1 + 24×32^0
• With 𝑚 = 32, the hash value of any key would be the last character’s value!
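Concatenating 5-bit codes is the same as reading the string as a base-32 number, which a short sketch can confirm (helper name is mine):

```python
def base32_value(s):
    # Build the integer whose base-32 digits are the letter codes A=0, ..., Z=25.
    n = 0
    for c in s:
        n = n * 32 + (ord(c) - ord('A'))
    return n

n = base32_value("MYKEY")
print(n)        # 13379736
print(n % 101)  # 64
```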
Handling Long Strings as Keys
• More precisely, let 𝑐h𝑟 be the function that gives a character’s number (between 0 and 25 under our simple assumptions), so for example 𝑐h𝑟(C) = 2.
• Then we have
    hash(𝑠) = (𝑐h𝑟(𝑠₀)×32^(𝑛−1) + 𝑐h𝑟(𝑠₁)×32^(𝑛−2) + ⋯ + 𝑐h𝑟(𝑠ₙ₋₁)×32^0) mod 𝑚
• For example,
    hash(V E R Y L O N G K E Y) = (21×32^10 + 4×32^9 + ⋯) mod 101
• The expression between the parentheses quickly becomes an impossibly large number!
  – DEC: 23804165628760600
  – BIN: 1010100100100011100001100110100011110100001001000011000
Horner’s Rule
• Fortunately there is a trick that allows us to avoid large numbers in the hash calculations. Instead of
    21×32^10 + 4×32^9 + 17×32^8 + 24×32^7 + ⋯
  factor out repeatedly:
    (⋯((21×32 + 4)×32 + 17)×32 + ⋯) + 24
• Example:
    p(x) = 2x^4 − x^3 + 3x^2 + x − 5
         = x(2x^3 − x^2 + 3x + 1) − 5
         = x(x(2x^2 − x + 3) + 1) − 5
         = x(x(x(2x − 1) + 3) + 1) − 5
• Now utilise these properties of modular arithmetic:
    (x + y) mod m = ((x mod m) + (y mod m)) mod m
    (x × y) mod m = ((x mod m) × (y mod m)) mod m
• So for each sub-expression it suffices to take values modulo m.
• Step 1: h(0) = (21×32 + 4) mod 101
• Step 2: h(1) = (h(0)×32 + 17) mod 101
• Step 3: h(2) = (h(1)×32 + 24) mod 101
  ⋯
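Horner's rule with the modulus applied at every step can be sketched as below; because of the modular-arithmetic properties above, no intermediate value ever exceeds 𝑚 × 32 (the function name is mine):

```python
def hash_horner(s, m=101, base=32):
    # Reduce modulo m after every multiply-add, so values stay small.
    h = 0
    for c in s:
        h = (h * base + (ord(c) - ord('A'))) % m
    return h

print(hash_horner("MYKEY"))  # 64, the same as 13379736 mod 101
```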
The Hash Function and Collisions
• The hash function should be as random as possible.
• Here we assume 𝑚 = 23 and h(𝑘) = 𝑘 mod 𝑚.
• In some cases different keys will be mapped to the same hash table address.
• When this happens we have a collision.
• Different hashing methods resolve collisions differently.
Separate Chaining
• Element 𝑘 of the hash table is a list of keys with hash value 𝑘.
• This gives easy collision handling.
• The load factor is α = 𝑛/𝑚, where 𝑛 is the number of items stored.
• Number of probes in a successful search ≈ 1 + α/2.
• Number of probes in an unsuccessful search ≈ α.
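A minimal separate-chaining table might look like this (a sketch; class and method names are mine):

```python
class ChainedHashTable:
    """Minimal separate-chaining table: cell k holds a list of colliding keys."""
    def __init__(self, m=23):
        self.m = m
        self.cells = [[] for _ in range(m)]

    def insert(self, key):
        self.cells[key % self.m].append(key)

    def find(self, key):
        # Only the chain at the key's hash address is scanned.
        return key in self.cells[key % self.m]

t = ChainedHashTable()
for k in [19, 392, 179, 663]:   # 19 and 663 both hash to cell 19
    t.insert(k)
print(t.find(663), t.find(20))  # True False
```

Deletion is equally easy: remove the key from its chain, with no effect on any other cell.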
Separate Chaining Pros and Cons
• Compared with sequential search, reduces the number of comparisons by a factor of m.
• Good in a dynamic environment, when (number of) keys are hard to predict.
• The chains can be ordered, or records may be “pulled up front” when accessed.
• Deletion is easy.
• However, separate chaining uses extra storage for links.
Open-Addressing Methods
• With open-addressing methods (also called closed hashing) all records are stored in the hash table itself (not linked lists hanging off the table).
• There are many methods of this type. We only discuss two:
– linear probing
– double hashing
• For these methods, the load factor 𝛼 ≤ 1.
Linear Probing
• In case of collision, try the next cell, then the next, and so on.
• After the arrival of 19 (19), 392 (1), 179 (18), 663 (19), 639 (18), 321 (22), the occupied cells are:
    cell 1: 392, cell 18: 179, cell 19: 19, cell 20: 663, cell 21: 639, cell 22: 321
• Search proceeds in a similar fashion.
• If we get to the end of the hash table, we wrap around.
• For example, if key 20 now arrives it will be placed in cell 0 (cells 20, 21 and 22 are taken).
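The arrival sequence above can be replayed with a short linear-probing sketch (the function name is mine):

```python
def insert_linear(table, key):
    m = len(table)
    i = key % m
    while table[i] is not None:  # collision: try the next cell, wrapping around
        i = (i + 1) % m
    table[i] = key
    return i

table = [None] * 23
for k in [19, 392, 179, 663, 639, 321]:
    insert_linear(table, k)
print(insert_linear(table, 20))  # 0: cells 20, 21 and 22 are taken, so we wrap
```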
Linear Probing
• Again let 𝑚 be the table size, and 𝑛 be the number of records stored.
• As before, α = 𝑛/𝑚 is the load factor.
• Average number of probes:
  – Successful search: 1/2 + 1/(2(1 − α))
  – Unsuccessful search: 1/2 + 1/(2(1 − α)^2)
Linear Probing Pros and Cons
• Space-efficient.
• Worst-case performance is miserable; we must be careful not to let the load factor grow beyond 0.9.
• Comparative behaviour for 𝑚 = 11113, 𝑛 = 10000, α = 0.9:
  – Linear probing: 5.5 probes on average (success)
  – Binary search: 12.3 probes on average (success)
  – Linear search: 5000 probes on average (success)
• Clustering is a major problem: the collision handling strategy leads to clusters of contiguous cells being occupied.
• Deletion is almost impossible.
Double Hashing
• To alleviate the clustering problem in linear probing, there are better ways of resolving collisions.
• One is double hashing, which uses a second hash function 𝑠 to determine an offset to be used in probing for a free cell.
• For example, we may choose 𝑠(𝑘) = 1 + 𝑘 mod 97.
• By this we mean: if h(𝑘) is occupied, next try h(𝑘) + 𝑠(𝑘), then h(𝑘) + 2𝑠(𝑘), and so on.
• This is another reason why it is good for 𝑚 to be a prime number. That way, using 𝑠(𝑘) as the offset, we will eventually find a free cell if there is one.
Collision Handling Example
• Consider h(𝑘) = 𝑘 mod 7. Draw the resulting hash tables after inserting 19, 26, 13, 48, 17 (in this order).
• Separate chaining:
    0: –    1: –    2: –    3: 17   4: –    5: 19 → 26   6: 13 → 48
• Linear probing:
    0: 13   1: 48   2: –    3: 17   4: –    5: 19        6: 26
• Double hashing, using offset 𝑠(𝑘) = 5 − (𝑘 mod 5):
    0: –    1: 48   2: 26   3: 17   4: –    5: 19        6: 13
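The double-hashing outcome can be checked by a small simulation (a sketch; `insert_double` is an illustrative name):

```python
def insert_double(table, key):
    m = len(table)
    i = key % m
    step = 5 - key % 5           # the offset function s(k) from the example
    while table[i] is not None:
        i = (i + step) % m       # probe h(k), h(k)+s(k), h(k)+2s(k), ...
    table[i] = key

table = [None] * 7
for k in [19, 26, 13, 48, 17]:
    insert_double(table, k)
print(table)  # [None, 48, 26, 17, None, 19, 13]
```

For instance, 26 first tries cell 5 (taken by 19), then jumps by 𝑠(26) = 4 to cell (5 + 4) mod 7 = 2.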
Rehashing
• The standard approach to avoiding performance deterioration in hashing is to keep track of the load factor and to rehash when it reaches, say, 0.9.
• Rehashing means allocating a larger hash table (typically twice the current size), revisiting each item, calculating its hash address in the new table, and inserting it.
• This “stop-the-world” operation will introduce long delays at unpredictable times, but it will happen relatively infrequently.
Rabin-Karp String Search
• The Rabin-Karp string search algorithm is based on string hashing.
• To search for a string 𝑝 (of length 𝑚) in a larger string 𝑠, we can calculate hash(𝑝) and then check every length-𝑚 substring 𝑠[𝑖 .. 𝑖+𝑚−1] to see if it has the same hash value. Of course, if it has, the strings may still be different; so we need to compare them in the usual way.
• If 𝑝 = 𝑠[𝑖 .. 𝑖+𝑚−1] then the hash values are the same; otherwise the values are almost certainly going to be different.
• Since false positives will be so rare, the Ο(𝑚) time it takes to actually compare the strings can be ignored.
Rabin-Karp String Search
• Example: consider the text 31415926535 and find the pattern 26, using the hash function h(𝑘) = 𝑘 mod 11. The hash for the pattern is h(26) = 26 mod 11 = 4.

    window: 31 14 41 15 59 92 26
    h(𝑘):    9  3  8  4  4  4  4

• The windows 15, 59 and 92 also hash to 4; each triggers a full comparison against the pattern (a false positive), until the true match 26 is found.
Rabin-Karp String Search
• Repeatedly hashing strings of length 𝑚 seems like a bad idea. However, the hash values can be calculated incrementally. The hash value of the length-𝑚 substring of 𝑠 that starts at position 𝑗 is
    hash(𝑠, 𝑗) = 𝑐h𝑟(𝑠[𝑗])×𝑎^(𝑚−1) + 𝑐h𝑟(𝑠[𝑗+1])×𝑎^(𝑚−2) + ⋯ + 𝑐h𝑟(𝑠[𝑗+𝑚−1])×𝑎^0
  where 𝑎 is the alphabet size. From that we can get the next hash value, for the substring that starts at position 𝑗 + 1, quite cheaply:
    hash(𝑠, 𝑗+1) = (hash(𝑠, 𝑗) − 𝑎^(𝑚−1)×𝑐h𝑟(𝑠[𝑗])) × 𝑎 + 𝑐h𝑟(𝑠[𝑗+𝑚])
  modulo 𝑚. Effectively we just subtract the contribution of 𝑠[𝑗] and add the contribution of 𝑠[𝑗+𝑚], for the cost of two multiplications, one addition and one subtraction.
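Putting the incremental recurrence to work on the digit-string example gives a sketch like the following (all names are mine; `q` is the modulus, playing the role the slides give to 𝑚):

```python
def rabin_karp(text, pattern, q=11):
    n, m = len(text), len(pattern)
    a = 10                         # alphabet size: decimal digits
    high = pow(a, m - 1, q)        # a^(m-1) mod q, the weight of the dropped digit
    hp = ht = 0
    for i in range(m):             # initial hashes of pattern and first window
        hp = (hp * a + int(pattern[i])) % q
        ht = (ht * a + int(text[i])) % q
    for j in range(n - m + 1):
        # Equal hashes may be a false positive, so confirm with a real comparison.
        if ht == hp and text[j:j+m] == pattern:
            return j
        if j < n - m:              # roll the window: drop text[j], add text[j+m]
            ht = ((ht - int(text[j]) * high) * a + int(text[j+m])) % q
    return -1

print(rabin_karp("31415926535", "26"))  # 6
```

Windows 15, 59 and 92 share the pattern's hash value 4, so each is compared character by character before the match at position 6 is confirmed.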
Rabin-Karp String Search
• Example: hash all 3-substrings of “there”.
• The first substring: “the” = 𝑡×26^2 + h×26 + 𝑒
• If we have “the”, can we compute “her”?
    “her” = h×26^2 + 𝑒×26 + 𝑟
          = 26×(h×26 + 𝑒) + 𝑟
          = 26×(𝑡×26^2 + h×26 + 𝑒 − 𝑡×26^2) + 𝑟
          = 26×(“the” − 𝑡×26^2) + 𝑟
• I.e., subtract the first letter’s contribution to the number, shift, and add the last letter.
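The “the” → “her” step can be checked numerically, reading strings as base-26 numbers (lowercase, a = 0 encoding; helper name is mine):

```python
def code(s):
    # Read a lowercase string as a base-26 number with a=0, ..., z=25.
    n = 0
    for c in s:
        n = n * 26 + (ord(c) - ord('a'))
    return n

t, r = ord('t') - ord('a'), ord('r') - ord('a')
rolled = 26 * (code("the") - t * 26**2) + r   # drop 't', shift, append 'r'
print(rolled == code("her"))  # True
```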
Why Not Always Use Hashing?
• Some drawbacks:
  – If an application calls for traversal of all items in sorted order, a hash table is no good.
  – Also, unless we use separate chaining, deletion is virtually impossible.
  – It may be hard to predict the volume of data, and rehashing is an expensive “stop-the-world” operation.
When to Use Hashing?
• All sorts of information retrieval applications involving thousands to millions of keys.
• Typical example: symbol tables used by compilers. The compiler hashes all (variable, function, etc.) names and stores information related to each – no deletion in this case.
• When hashing is applicable, it is usually superior; a well-tuned hash table will outperform its competitors.
• Unless you let the load factor get too high, or you botch up the hash function. It is a good idea to print statistics to check that the function really does spread keys uniformly across the hash table.
Coming Up Next
• Dynamic programming and optimisation.