
Analysis of Algorithms, I
CSOR W4231.002

Computer Science Department


Columbia University

Hashing, Bloom filters

2 Analyzing hash tables using balls and bins
3 Saving space: hashing-based fingerprints
4 Bloom filters


The problem
A data structure maintaining a dynamic subset S of a huge universe U.
• Typically, |S| ≪ |U|
The data structure should support
• efficient insertion
• efficient deletion
• efficient search
We will call such a data structure a dictionary.

Dictionary data structure
A dictionary maintains a subset S of a universe U so that inserting, deleting, and searching are efficient.
Operations supported by a dictionary
1. Create(): initialize a dictionary with S = ∅
2. Insert(x): add x to S, if x ∉ S
   • additional information about x might be stored in the dictionary as part of a record for x
3. Delete(x): delete x from S, if x ∈ S
4. Lookup(x): determine if x ∈ S

A concrete example
We want to maintain a dynamic list of 250 IP addresses
• e.g., these correspond to addresses of currently active customers of a Web service
• each IP address consists of 32 bits, e.g., 128.32.168.80

The challenge: U is enormous, that is, |U| ≫ |S|
1. Maintain array S of size |U| such that S[i] = 1 if and only if i ∈ S
• Insert, Delete, Lookup require O(1) time
Can’t store an array of size anywhere close to |U|!
• S would need |U| = 2^32 ≈ 4 billion entries
• S would be mostly empty (huge waste of space)
2. Store S in a linked list
• Space: proportional to |S| = 250
• Time for Lookup: proportional to |S|; too slow
Can we support fast Insert, Delete, and Lookup (as in the array implementation) while using only space proportional to |S| (as in the linked-list implementation)?

Work with array of size |S| rather than one of size |U|
Idea: assign a short nickname to each element in U
• Each of the 2^32 IP addresses is assigned a number between 1 and |S| = 250
  • the range will be slightly adjusted
• Total amount of storage: approximately |S|, independent of |U|
• If not too many IP addresses share a nickname, then Lookup is efficient (details coming up)

How can we assign a short name?
By hashing: use a hash function h : U → {0, …, n − 1}
• Typically, n ≪ |U| and n is close to |S|
For example,
• h : {0, …, 2^32 − 1} → {0, …, 249}
• IP address x gets name h(x)
• Hash table H of size 250: store address x at entry h(x)
So Insert(x) takes constant time. What if we try to insert y ̸= x, with h(x) = h(y)?

Collisions
Collision: elements x ≠ y such that h(x) = h(y). The easiest way to deal with collisions is chain hashing:
• Entry i in the hash table is a linked list of the elements x such that h(x) = i
• Alternatively, think of every entry in the hash table as a bin containing the elements that hash to the same location; see the sketch below
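
The following is a minimal sketch of a chain-hashing dictionary in Python (not from the slides; the class name, the use of Python lists in place of linked lists, and the sample hash function are illustrative assumptions):

class ChainHashTable:
    # A chain-hashing dictionary: entry i of the table holds the
    # elements x with h(x) = i (a Python list stands in for the
    # linked list).
    def __init__(self, n, h):
        self.n = n            # number of slots (bins)
        self.h = h            # hash function h : U -> {0, ..., n-1}
        self.table = [[] for _ in range(n)]

    def insert(self, x):      # Insert(x): add x to S, if x not in S
        bucket = self.table[self.h(x)]
        if x not in bucket:
            bucket.append(x)

    def delete(self, x):      # Delete(x): delete x from S, if x in S
        bucket = self.table[self.h(x)]
        if x in bucket:
            bucket.remove(x)

    def lookup(self, x):      # Lookup(x): scan the chain at h(x)
        return x in self.table[self.h(x)]

# Example use with an illustrative hash function:
# T = ChainHashTable(250, lambda ip: hash(ip) % 250)
# T.insert("128.32.168.80"); T.lookup("128.32.168.80")  # True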

Chain hashing
Maintain a linked list at H[i] for all x such that h(x) = i.
[Figure: a hash table where each entry H[i] holds a linked list of the addresses hashing to i; the example shows 128.20.110.80, 128.5.110.60, 194.66.82.1, 168.212.26.204, 192.168.1.2, and 22.231.113.64 distributed among the slots.]

Chain hashing: running time for Lookup(x)
Time for Lookup(x):
1. time to compute h(x); typically constant
2. time to scan the linked list at position h(x) in the hash table
   • proportional to the length of the linked list at h(x), i.e., to the number of elements that collide with x
Goal: find a hash function that “spreads out” the elements well

Simple hash functions might not work
Consider the following two simple hash functions that hash an IP address x from {0, …, 2^32 − 1} to {0, …, 255}:
• assign the last 8 bits of x as its name
• assign the first 8 bits of x as its name
Nothing is inherently wrong with these hash functions: the problem is that our 250 IP addresses might not be drawn uniformly at random from among all 232 possibilities.

No single hash function can work well on all data sets
• Fix the hash function h.
• h distributes the |U| elements of the universe among n names.
⇒ by pigeonhole, there exists a data set of at least |U|/n elements that all map to the same name
⇒ if our customers come from this data set, lots of collisions
Fact: for any fixed (deterministic) h : U → {0, 1, …, n − 1} where |U| ≥ n^2, there exists some set S of n elements that all map to the same position.

Randomization can help
• Extreme example: for every 0 ≤ j ≤ n − 1, assign name j to element x with probability 1/n.
• Fix distinct x, y ∈ U. Then Pr[h(x) = h(y)] = 1/n.
• This doesn’t quite work. (Think Lookup(x): where is x?)
• However, intuitively, hash functions that spread things around in a random way can effectively reduce collisions.
⇒ Trade-off in hash function design: h must be "random" enough to scatter things around for all inputs, but must still be a function
Goal: design h that allows for efficient dictionary operations with high probability

A careful use of randomization
• Randomize over the choice of the hash function from a suitable class of functions into {0, …, n − 1} (details coming up)
• h must have a compact representation

Universal hash function
Idea: choose h at random from a carefully selected class of functions H with the following properties:
1. h behaves almost like a completely random hash function:
   • for distinct x, y ∈ U, the probability that a randomly chosen h ∈ H satisfies h(x) = h(y) is at most 1/n.
2. A random h ∈ H can be selected efficiently.
3. Given h, h(x) can be computed efficiently.
Such hash functions are called universal; their design relies on number theoretic facts.

Example of universal hash function
• Pick a prime p close to |S| = 250; set n = p
  • E.g., pick p = 257; set the size n of the hash table to 257
• View IP address x as (x1, x2, x3, x4), where x1, x2, x3, x4 are integers mod n.
• Define h : U → {0, 1, …, n − 1} as follows:
  • Choose a1, a2, a3, a4 uniformly at random from {0, 1, …, n − 1}
    • E.g., a1 = 80, a2 = 35, a3 = 168, a4 = 220
  • Map IP address x to h(x) = (a1·x1 + a2·x2 + a3·x3 + a4·x4) mod n
    • E.g., for x = 128.32.168.80,
      h(x) = (80·128 + 35·32 + 168·168 + 220·80) mod 257 = 57184 mod 257 = 130
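
A short Python sketch of this universal hash family (the function names are illustrative; the slides only specify the construction):

import random

P = 257  # prime close to |S| = 250; table size n = P

def random_universal_hash(p=P):
    # Draw h_a from the family: pick a1, ..., a4 uniformly at random
    # from {0, ..., p-1}; h_a(x) = (a1*x1 + a2*x2 + a3*x3 + a4*x4) mod p.
    a = [random.randrange(p) for _ in range(4)]
    def h(ip):
        xs = [int(part) for part in ip.split(".")]  # dotted quad -> (x1, ..., x4)
        return sum(ai * xi for ai, xi in zip(a, xs)) % p
    return h

# With a = (80, 35, 168, 220) as on the slide:
# h("128.32.168.80") = 57184 mod 257 = 130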

h is a universal hash function
Consider any pair of distinct tuples x = (x1, x2, x3, x4), y = (y1, y2, y3, y4). If a1, …, a4 are chosen uniformly at random from {0, …, n − 1}, then
Pr[h_a(x1, …, x4) = h_a(y1, …, y4)] = 1/n
The proof relies on elementary number theory.
Corollary 1.
Fix x ∈ U. The expected number of elements colliding with x is less than 1. Hence the expected Lookup time is constant.

Ideal hash functions
From now on, assume a completely random hash function exists.
△ Such a function does not exist! But it provides a good rough idea of how hashing schemes perform in practice.
• Let h : U → {0, 1, …, n − 1} be a completely random (ideal) hash function. For all x ∈ U and all 0 ≤ j ≤ n − 1,
Pr[h(x) = j] = 1/n
Once h is chosen, h(x) is fixed for every x: it simply takes one of the n possible values with equal probability.

2 Analyzing hash tables using balls and bins
3 Saving space: hashing-based fingerprints
4 Bloom filters

Hashing modeled as a balls and bins problem
Q1: How many elements can we insert in the hash table before it is more likely than not that there is a collision?

This is just an occupancy problem!

Occupancy problems, revisited: find the distribution of balls into bins when m balls are thrown independently and uniformly at random into n bins.

Hashing as an occupancy problem:
• balls correspond to elements from U
• bins are slots in the hash table
• each ball falls into one of the n bins independently and with probability 1/n

Q1 (rephrased): How many balls can we throw before it is more likely than not that some bin contains at least two balls?
Answer: Ω(√n) (see the birthday paradox)
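
A quick empirical illustration of the Ω(√n) threshold, offered as a sketch (the choice of n and the number of trials are arbitrary, not from the slides):

import random

def throws_until_collision(n):
    # Throw balls into n bins uniformly at random; return the number
    # of throws when some bin first receives a second ball.
    seen = set()
    throws = 0
    while True:
        throws += 1
        b = random.randrange(n)
        if b in seen:
            return throws
        seen.add(b)

n = 365  # the classic birthday-paradox setting
avg = sum(throws_until_collision(n) for _ in range(10000)) / 10000
print(avg)  # about 24-25 on average, i.e., on the order of sqrt(365) = 19.1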

Towards analyzing time/space efficiency of hash table
• What is the expected time for Lookup(x)?
  Corresponds to the expected load of a bin.
• What is the expected wasted space in the hash table?
  Corresponds to the expected number of empty bins: when m = n, at least a third of the slots are empty in expectation.
• What is the worst-case time for Lookup(x)?
  Corresponds to the load of the fullest bin: Θ(ln n/ ln ln n), with high probability.

Max load in any bin, with high probability (case m = n)
Proposition 1.
When throwing n balls into n bins uniformly and independently at random, the maximum load in any bin is Θ(ln n/ ln ln n) with probability close to 1 as n grows large.
Two-sentence sketch of the proof.
1. Upper bound the probability that some bin contains at least k balls by a union bound over the n bins:
   Σ_{i=1}^{n} Σ_{l=k}^{n} (n choose l) (1/n)^l (1 − 1/n)^{n−l}.
2. Compute the smallest possible k* such that the probability above is less than 1/n (which becomes negligible as n grows large).
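
A small simulation consistent with Proposition 1, offered as a sketch (the parameters are arbitrary, and the Θ bound hides a constant factor):

import math, random

def max_load(n):
    # Throw n balls into n bins; return the load of the fullest bin.
    loads = [0] * n
    for _ in range(n):
        loads[random.randrange(n)] += 1
    return max(loads)

n = 100000
print(max_load(n))                          # typically around 7 or 8
print(math.log(n) / math.log(math.log(n)))  # ln n / ln ln n is about 4.7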

2 Analyzing hash tables using balls and bins
3 Saving space: hashing-based fingerprints
4 Bloom filters

A password checker
• We want to maintain a dictionary for a set S of 2^16 bad passwords so that, when a user tries to set up a password, we can check as quickly as possible whether it belongs to S and, if so, reject it.
• We assume that each password consists of 8 ASCII characters
  • hence each password requires 8 bytes (64 bits) to represent

A dictionary data structure that uses less space
Let S be the set of bad passwords.
Input: a 64-bit password x, and a query of the form
“Is x a bad password?”
Output: a dictionary data structure for S that answers queries as above and
• is small: uses less space than explicitly storing all bad passwords
• is allowed to give an erroneous yes answer occasionally
  • that is, we occasionally answer “x ∈ S” even though x ∉ S

Approximate set membership
The password checker belongs to a broad class of problems, called approximate set membership problems.
Input: a large set S = {s1,…,sm}, and queries of the form “Is x ∈ S?”
We want a dictionary for S that is small (smaller than the explicit representation provided by a hash table).
To achieve this, we allow for some probability of error
• False positives: answer yes when x ∉ S
• False negatives: answer no when x ∈ S
Output: small probability of false positives, no false negatives

Fingerprints: hashing for saving space
• Use a hash function h : {0, …, 2^64 − 1} → {0, …, 2^32 − 1} to map each password to a 32-bit string.
• This string will serve as a short fingerprint of the password.
• Keep the fingerprints in a sorted list.
• To check if a proposed password is bad:
  1. calculate its fingerprint
  2. binary search for the fingerprint in the sorted list of fingerprints; if found, declare the password bad and ask the user to enter a new one.
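
A runnable sketch of the password checker (the use of SHA-256 truncated to b bits is an illustrative stand-in; the slides do not fix a concrete hash function):

import hashlib
from bisect import bisect_left

def fingerprint(password, b=32):
    # Map a password to a b-bit fingerprint.
    digest = hashlib.sha256(password.encode()).digest()
    return int.from_bytes(digest[:8], "big") % (2 ** b)

def build_checker(bad_passwords, b=32):
    # Store only the sorted b-bit fingerprints of the bad passwords.
    return sorted(fingerprint(p, b) for p in bad_passwords)

def is_bad(checker, password, b=32):
    # Binary search for the fingerprint: may report a false positive,
    # never a false negative.
    f = fingerprint(password, b)
    i = bisect_left(checker, f)
    return i < len(checker) and checker[i] == f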

Setting the length b of the fingerprint
Why did we map passwords to 32-bit fingerprints? Motivation: make fingerprints long enough that the false positive probability is acceptable.
Let b be the number of bits used by our hash function to map the m bad passwords into fingerprints, thus
h : {0, 1, …, 2^64 − 1} → {0, …, 2^b − 1}
We will choose b so that the probability of a false positive is acceptable, e.g., at most 1/m.

Determining the false positive probability
There are 2^b possible strings of length b.
Let x be a good password.
Fix a y ∈ S (recall that all m passwords in S are bad).
• Pr[x has the same fingerprint as y] = 1/2^b
• Pr[x does not have the same fingerprint as y] = 1 − 1/2^b
• let p = 1 − 1/2^b
• Pr[x does not have the same fingerprint as any w ∈ S] = p^m
• Pr[x has the same fingerprint as some w ∈ S] = 1 − p^m
Hence the false positive probability is
1 − p^m = 1 − (1 − 1/2^b)^m ≈ 1 − e^{−m/2^b}

Constant false positive probability and bound for b
To make the probability of a false positive at most, say, a constant c, we require
1 − e^{−m/2^b} ≤ c  ⇒  b ≥ log2(m / ln(1/(1 − c))).
So b = Ω(log2(m / ln(1/(1 − c)))) bits.

Improved false positive probability and bound for b
Now suppose we use b = 2 log2 m.
Plugging back into the exact formula for the false positive probability, 1 − (1 − 1/2^b)^m, we get
1 − (1 − 1/m^2)^m ≤ 1 − (1 − 1/m) = 1/m,
where the inequality follows from Bernoulli: (1 − 1/m^2)^m ≥ 1 − m·(1/m^2) = 1 − 1/m.
Thus if our dictionary has |S| = m = 2^16 bad passwords, a hash function that maps each of the m passwords to 2 log2 m = 32 bits yields a false positive probability of about 1/2^16.
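
A quick numeric check of this bound, evaluating the exact formula rather than the approximation (a sketch):

m = 2 ** 16                        # number of bad passwords
b = 32                             # fingerprint length: b = 2 log2(m)
fp = 1 - (1 - 1 / 2 ** b) ** m     # exact false positive probability
print(fp, 1 / m)                   # both are about 1.5e-5, roughly 2^-16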

2 Analyzing hash tables using balls and bins
3 Saving space: hashing-based fingerprints
4 Bloom filters

Fast approximate set membership
Input: a large set S, and queries of the form “Is x ∈ S?”
We want a data structure that
• answers the queries fast (faster than searching in S)
• is small (smaller than the explicit representation provided by a hash table)
To achieve the above, we allow for some probability of error:
• False positives: answer yes when x ∉ S
• False negatives: answer no when x ∈ S
Output: small probability of false positives, no false negatives

Bloom filter
A Bloom filter consists of:
1. an array B of n bits, initially all set to 0
2. k independent random hash functions h1, …, hk with range {0, 1, …, n − 1}
A basic Bloom filter supports
• Insert(x)
• Lookup(x)

Representing a set S = {x1, . . . , xm} using a Bloom filter
SetupBloomFilter(S, h1, …, hk)
  Initialize array B of size n to all zeros
  for i = 1 to m do
    Insert(xi)
  end for

Insert(x)
  for i = 1 to k do
    compute hi(x)
    set B[hi(x)] = 1
  end for
Remark: an entry of B may be set multiple times; only the first change has an effect.

Setting up the Bloom filter
[Figure: the set S = {x1, x2, x3}, with m = k = 3 and n = 16; each xi sets the bits of B at positions h1(xi), h2(xi), h3(xi).]

Bloom filter: Lookup
To check membership of an element x in S:
Lookup(x)
  for i = 1 to k do
    compute hi(x)
    if B[hi(x)] == 0 then
      return no
    end if
  end for
  return yes

• If B[hi(x)] ≠ 1 for some i, then clearly x ∉ S.
• Otherwise, answer “x ∈ S”; this might be a false positive!
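
Putting Insert and Lookup together, a compact Python sketch of a Bloom filter (deriving the k hash values via double hashing is an implementation convenience assumed here; the slides posit k independent random functions):

import hashlib

class BloomFilter:
    def __init__(self, n, k):
        self.n = n           # number of bits in B
        self.k = k           # number of hash functions
        self.bits = [0] * n  # array B, initially all 0

    def _hashes(self, x):
        # Derive k hash values from one digest (double hashing);
        # stands in for k independent random functions h1, ..., hk.
        d = hashlib.sha256(str(x).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.n for i in range(self.k)]

    def insert(self, x):
        for j in self._hashes(x):
            self.bits[j] = 1   # only the first set of a bit has an effect

    def lookup(self, x):
        # "no" is always correct; "yes" may be a false positive
        return all(self.bits[j] for j in self._hashes(x))

# bf = BloomFilter(n=16, k=3)
# for x in ("x1", "x2", "x3"): bf.insert(x)
# bf.lookup("x1")  # True;  bf.lookup("x4") may be a false positive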

False positive example
[Figure: query “Is x4 ∈ S?”. Lookup(x4) finds B[h1(x4)] = B[h2(x4)] = B[h3(x4)] = 1, because bits set by x1, x2, x3 happen to cover all three positions, so the answer is “yes” even though x4 ∉ S.]

Probability of false positive
• After all m elements of S have been hashed into the Bloom filter, the probability that a specific bit is still 0 is
(1 − 1/n)^{km} ≈ e^{−km/n} = p.
• To simplify the analysis, assume that the probability that a specific bit is still 0 is exactly p.
• The probability of a false positive is the probability that all k bits examined by Lookup equal 1:
f = (1 − p)^k

Optimal number of hash functions
f = (1 − p)^k = (1 − e^{−km/n})^k
• Trade-off between k and p: using more hash functions
  • gives us more chances to find a 0 when x ∉ S;
  • but reduces the number of 0s in the array!
• Compute the optimal number k* of hash functions by minimizing f as a function of k:
k* = (n/m) · ln 2
• Then the false positive probability is given by
f = (1/2)^{k*} ≈ (0.6185)^{n/m}

Big savings in space
• Space required by the Bloom filter per element of S: n/m bits.
• For example, set n = 8m. Then k* = 8 ln 2 ≈ 6, and f ≈ 0.02.
⇒ Small constant false positive probability using only 8 bits (1 byte) per element of S, independently of the size of S!
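
A one-line check of these numbers, evaluating f = (1 − e^{−km/n})^k for n = 8m and k = 6 (a sketch):

import math

k, bits_per_element = 6, 8               # n = 8m, so km/n = k/8
p = math.exp(-k / bits_per_element)      # fraction of bits still 0
print((1 - p) ** k)                      # about 0.0216, i.e., roughly 2%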

Summary on Bloom filters
Bloom filter can answer approximate set membership in
• “constant” time (the time to hash)
• constant space to represent an element of S
• constant false positive probability f

Application 1 (historical): spell checker
• Spelling list of 210 KB, 25K words.
• Use 1 byte per word.
• Maintain a 25 KB Bloom filter.
• False positive = accepting a misspelled word.

Application 2: implementing joins in database
• Join: combine two tables with a common domain into a single table.
• Semi-join: a join in distributed databases in which only the joining attribute from one site is transmitted to the other site and used for selection; the selected records are sent back.
• Bloom-join: a semi-join where we send only a Bloom filter of the joining attribute.

Cost Of Living

Create a table of all employees that make < 50K and live in a city where the Cost Of Living (COL) is > 50K.
• Join: send (City, COL) for all cities with COL > 50K.
• Semi-join: send just (City) for all cities with COL > 50K.
• Bloom-join: send a Bloom filter of all cities with COL > 50K.
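
A hedged sketch of the Bloom-join idea, reusing the BloomFilter class sketched earlier (the table contents and filter parameters are made up for illustration):

# Site 1 holds (City, COL); Site 2 holds (Employee, Salary, City).
cities = {"NYC": 95, "Austin": 48, "SF": 110}            # COL in K
employees = [("Ann", 45, "NYC"), ("Bo", 40, "Austin")]   # salary in K

# Site 1 sends only a Bloom filter of the qualifying cities.
bf = BloomFilter(n=64, k=6)
for city, col in cities.items():
    if col > 50:
        bf.insert(city)

# Site 2 selects candidates locally; false positives (if any) are
# weeded out when the selected records are sent back to Site 1.
candidates = [e for e in employees if e[1] < 50 and bf.lookup(e[2])]
print(candidates)   # [('Ann', 45, 'NYC')], barring false positives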
