Count Sketch: Finding Heavy Hitters
COMPCSI 753: Algorithms for Massive Data
Instructor: Ninh Pham
University of Auckland
Auckland, Aug 24, 2020
1
Basic definitions
Let U be a universe of size n, i.e. U = {1, 2, 3, …, n}.
Cash register model stream:
Sequence of m elements a1,…, am where ai Î U.
Elements of U may or may not occur once or several times in the stream.
Finding heavy hitters in data stream (today’s lecture): Given a stream, finding frequent items.
2
Frequent items
Each element of data stream is a tuple.
Given a stream of m elements a1,…, am where ai Î U,
finding the most/top-k frequent elements. Example:
{1, 2, 1, 3, 4, 5} f = {2, 1, 1, 1, 1}
{1, 2, 1, 3, 1, 2, 4, 5, 2, 3} f = {3, 3, 2, 1, 1}
We need an approximation solution with much smaller memory with theoretical guarantees.
3
Randomized: CountMin sketch
Setup:
d independent universal hash functions h over range [0, w)
d different counters, C1, … , Cd. Each of size w initialized with 0s Process an element aj:
For each hash function, compute hi(aj) and increment Ci[hi(aj)] by 1
4
aj
h1(aj) h2(aj) h3(aj)
Randomized: CountMin sketch
Query: How many times aj occurred?
For each hash function, compute hi(aj) and get Ci[hi(aj)]
Return min(C1[h1(aj)], … , Cd[hd(aj)])
hd(aj)
5
Randomized: Count sketch
Setup:
d independent 2-wise hash functions h over range [0, w).
d independent 2-wise hash functions s over range {+1, -1}.
d different counters, C1, … , Cd. Each of size w initialized with 0s.
Process an element aj:
For each hash function, compute hi(aj) and increment Ci[hi(aj)] by s(aj).
1 1
1
1
6
aj
h1(aj) h2(aj) h3(aj)
Randomized: Count sketch
Query: How many times aj occurred?
For each hash function, compute hi(aj) and get Ci[hi(aj)].
Return median(C1[h1(aj)], … , Cd[hd(aj)]).
hd(aj)
median
7
2-wise hash function family
2-wise hash function definition:
A family of hash function H = {h : U {0, 1, … , w-1}} is 2-wise independent if for any 2 distinct keys xi ≠ xj Î U and 2 hash values (not necessary distinct) yi, yj Î {0, 1, … , w-1}, we have
Prh[h(xi) = yi h(xj) = yj] = 1/w2 On our CountSketch:
Given two different items ai ≠ aj, what is the prob. ai and aj collide? Prh[h(ai) = h(aj)] = 1/w2 * w = 1/w
8
Analysis on 1 array of counters
Notation:
Stream of m items {a1, … , am} from the universe U of size n. Frequency vector f ={f1, … , fn}, 𝒇 𝟐𝟐 ∑𝒏𝒊𝟏 𝒇𝟐𝒊
Question:
Given a particular item a1, how many times ai ≠ a1 collide by h?
a1 a2 +1
a3 a1 a2 -1
a1
9
Analysis on 1 array of counters
Question:
Given a particular item a1, how many times ai ≠ a1 collide by h.
a1 a2 +1
a3 a1 a2 -1
Answer:
Let Xi be contribution of ai in the bucket h(a1).
a1
𝑓𝑖, which happens with prob. 1/2𝑤 𝑋𝑖 0, which happens with prob. 1 1/𝑤
𝑓𝑖, which happens with prob. 1/2𝑤
Error source caused by the item ai: E[Xi] = 0
10
Analysis on 1 array of counters
Observation: Our estimate is unbiased. Question: How large error do we have?
Analysis of variance of contribution of a particular item ai: Let Xi be contribution of ai in the bucket h(a1).
𝑓𝑖, which happens with prob. 1/2𝑤 𝑋𝑖 0, which happens with prob. 1 1/𝑤
𝑓𝑖, which happens with prob. 1/2𝑤
Expectation: E[Xi] = 0.
Variance: Var [Xi] = E[Xi2] – (E[Xi])2 = fi2/w.
11
Analysis on 1 array of counters
Observation: Our estimate is unbiased. Question: How large error do we have?
Analysis of variance of error:
Let X2, … , Xn be contributions of a2, … , an in the bucket h(a1).
Let Y = X2+ … + Xn be the total contributions by other item ai ≠ a1. Variance of error:
Var[Y] =Var[X2] + … +Var[Xn] = (f22 + … + fn2)/w ≤ 𝒇 𝟐𝟐/w.
12
Basic tools: Tail inequality
Probability distribution
Tail probability
ee
Chebyshev’s inequality for E[Y] = m and Var[Y] = s2:
Pr[|Y–m|≥e]≤s2 foranye>0. e2
13
Analysis on 1 array of counters Question: How large error do we have?
Analysis of variance of error:
Let X2, … , Xn be contributions of a2, … , an in the bucket h(a1).
Let Y = X2+ … + Xn be the total contributions by other item ai ≠ a1. Expectation: E[Y] = 0.
Variance: Var[Y] = Var[X2] + … + Var[Xn] = (f22 + … + fn2)/w ≤ Chebyshev’s inequality:
𝒇 𝟐𝟐 𝒘
Pr[|Y|≥e 𝒇 ]≤ 1 =1/2ifwechoosew=2/e2. 2 𝒘e2
The error is at most e 𝒇 2 with the probability 1/2.
14
Analysis on d arrays of counters Boosting the accuracy:
Using d independent hash functions corresponding to d independent arrays of counters.
F1 = median(C1[h1(a1)], … , Cd[hd(a1)]) = median(f’1, f’2, …, f’d ). Analysis:
E [ f’1] = E [ f’2] = … = E [ f’d] = f1.
Choose d = log(1/d) and apply Chernoff bound, we have
Pr[|F1 –f1|≤e 𝒇 2]≥1-d
With probability at least 1 – d, we have f1-e 𝒇 2≤F1≤f1+e 𝒇 2
15
Homework
Implement the CountSketch algorithm on the dataset from assignment 1:
Description: Each line (doc ID, word ID, freq.) as a stream tuple.
Query: What are the most and top-10 frequent word ID have been used?
What are the average errors of CountMin Sketch and CountSketch?
16