CountMin Sketch: Finding Heavy Hitters
COMPCSI 753: Algorithms for Massive Data
Instructor: Ninh Pham
University of Auckland
Auckland, Aug 24, 2020
1
Basic definitions
Let U be a universe of size n, i.e. U = {1, 2, 3, … , n}.
Cash register model stream:
Sequence of m elements a1, … , am where ai Î U.
Elements of U may or may not occur once or several times in the stream.
Finding heavy hitters in data stream (today’s lecture): Given a stream, finding frequent items.
2
Frequent items
Each element of data stream is a tuple.
Given a stream of m elements a1, … , am where ai Î U,
finding the most/top-k frequent elements. Example:
{1, 2, 1, 3, 4, 5} f = {2, 1, 1, 1, 1}
{1, 2, 1, 3, 1, 2, 4, 5, 2, 3} f = {3, 3, 2, 1, 1}
We need an approximation solution with much smaller memory with theoretical guarantees.
3
Deterministic: Misra Gries
Process an element a:
If we already have a counter for a, increment it.
Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
Else, decrease all counters by 1. Remove 0 counters (key step). Example: {1, 2 ,3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11
{1, 2, 3, 1, 4, 2, 1, 4, 5, 2, 6} 1232426
4
Randomized: CountMin sketch
Setup:
d independent universal hash functions h over range [0, w)
d different counters, C1, … , Cd. Each of size w initialized with 0s. Process an element aj:
For each hash function, compute hi(aj) and increment Ci[hi(aj)] by 1
5
aj
h1(aj) h2(aj) h3(aj)
Randomized: CountMin sketch
Query: How many times aj occurred?
For each hash function, compute hi(aj) and get Ci[hi(aj)]
Return min(C1[h1(aj)], … , Cd[hd(aj)])
hd(aj)
6
Universal hash function family
Universal hash function definition:
A family of hash function H = {h : U {0, 1, … , w-1}} is universal
ifforany2distinctkeysxi ≠xj ÎU,wehave Prh[h(xi) = h(xj)] ≤ 1/w
In our CountMin sketch:
Given two different items ai ≠ aj, what is the prob. ai and aj collide? Prh[h(ai) = h(aj)] ≤ 1/w
7
Analysis on 1 array of counters
Notation:
Stream of m items {a1, … , am} from the universe U of size n. Frequencyvectorf={f1,…,fn}and 𝒇 1 =m.
Question:
Given a particular item a1, how many times ai ≠ a1 collide by h?
a1 a2
a3 a1 a2
a1
8
Analysis on 1 array of counters
Question:
Given a particular item a1, how many times ai ≠ a1 collide by h.
a1 a2
a3 a1 a2
a1
Answer:
Let X2, … , Xn be contributions of a2, … , an in the bucket h(a1).
Pr [h(ai) = h(a1)] ≤ 1/w E [Xi] ≤ fi/w
Let Y=X2+…+Xn bethetotalincrementsbyotheritemai ≠a1.
E[Y]=E[X2]+…+E[Xn]=(f2 +…+fn)/w≤m/w
9
Basic tools: Tail inequality
Probability distribution
Tail probability
Markov’s inequality for E[Y] = m: Pr[Y≥1+e]≤ m foranye>0.
1+e 1
Pr[Y ≥ (1+e) m] ≤ 1+e for any e > 0.
10
Analysis on 1 array of counters Observation: We always over-estimate.
Question: How large we over estimate? Analysis for a particular item a1:
Let Y=X2+…+Xn bethetotalincrementsbyotheritemai ≠a1. The value of counter h(a1): f’1 = f1 + Y.
Using Markov’s inequality:
Pr[f’1 ≥f1 +em] =Pr[Y≥em]≤E[Y]/em≤m/(w*em)
Choose w = 2/e, we have this prob. is at most 1/2.
Choose w = 2/e, the probability that our error is larger than
em is smaller than 1/2.
11
Analysis on d arrays of counters Boosting the accuracy:
Using d independent hash functions corresponding to d independent arrays of counters.
F1 = min(C1[h1(a1)], … , Cd[hd(a1)]) = min(f’1, f’2, … , f’d ).
Analysis:
Pr[F1 ≥f1 +em] =Pr[f’1 ≥f1 +em…f’d ≥f1 +em]≤1/2d Choosed=log(1/d),wehavePr[F1 ≥f1 +em] ≤d.
With probability at least 1 – d, we have F1 < f1 + em = f1 + e 1
12
Homework
Implement the CountMin Sketch algorithm on the dataset from assignment 1:
Description: Each line (doc ID, word ID, freq.) as a stream tuple.
Query: What are the most and top-10 frequent word ID have been used?
13