程序代写代做代考 algorithm AI CountMin Sketch: Finding Heavy Hitters

CountMin Sketch: Finding Heavy Hitters
COMPCSI 753: Algorithms for Massive Data
Instructor: Ninh Pham
University of Auckland
Auckland, Aug 24, 2020
1

Basic definitions
 Let U be a universe of size n, i.e. U = {1, 2, 3, … , n}.
 Cash register model stream:
 Sequence of m elements a1, … , am where ai Î U.
 Elements of U may or may not occur once or several times in the stream.
 Finding heavy hitters in data stream (today’s lecture):  Given a stream, finding frequent items.
2

Frequent items
 Each element of data stream is a tuple.
Given a stream of m elements a1, … , am where ai Î U,
finding the most/top-k frequent elements.  Example:
 {1, 2, 1, 3, 4, 5}  f = {2, 1, 1, 1, 1}
 {1, 2, 1, 3, 1, 2, 4, 5, 2, 3}  f = {3, 3, 2, 1, 1}
 We need an approximation solution with much smaller memory with theoretical guarantees.
3

Deterministic: Misra Gries
 Process an element a:
 If we already have a counter for a, increment it.
 Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
 Else, decrease all counters by 1. Remove 0 counters (key step). Example: {1, 2 ,3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11
{1, 2, 3, 1, 4, 2, 1, 4, 5, 2, 6} 1232426
4

Randomized: CountMin sketch
 Setup:
 d independent universal hash functions h over range [0, w)
 d different counters, C1, … , Cd. Each of size w initialized with 0s.  Process an element aj:
 For each hash function, compute hi(aj) and increment Ci[hi(aj)] by 1
5

aj
h1(aj) h2(aj) h3(aj)
Randomized: CountMin sketch
 Query: How many times aj occurred?
 For each hash function, compute hi(aj) and get Ci[hi(aj)]
 Return min(C1[h1(aj)], … , Cd[hd(aj)])
hd(aj)
6

Universal hash function family
 Universal hash function definition:
 A family of hash function H = {h : U  {0, 1, … , w-1}} is universal
ifforany2distinctkeysxi ≠xj ÎU,wehave Prh[h(xi) = h(xj)] ≤ 1/w
 In our CountMin sketch:
 Given two different items ai ≠ aj, what is the prob. ai and aj collide? Prh[h(ai) = h(aj)] ≤ 1/w
7

Analysis on 1 array of counters
 Notation:
 Stream of m items {a1, … , am} from the universe U of size n. Frequencyvectorf={f1,…,fn}and 𝒇 1 =m.
 Question:
 Given a particular item a1, how many times ai ≠ a1 collide by h?
a1 a2
a3 a1 a2
a1
8

Analysis on 1 array of counters
 Question:
 Given a particular item a1, how many times ai ≠ a1 collide by h.
a1 a2
a3 a1 a2
a1
 Answer:
 Let X2, … , Xn be contributions of a2, … , an in the bucket h(a1).
Pr [h(ai) = h(a1)] ≤ 1/w  E [Xi] ≤ fi/w
Let Y=X2+…+Xn bethetotalincrementsbyotheritemai ≠a1.
E[Y]=E[X2]+…+E[Xn]=(f2 +…+fn)/w≤m/w
9

Basic tools: Tail inequality
Probability distribution
Tail probability



 Markov’s inequality for E[Y] = m: Pr[Y≥1+e]≤ m foranye>0.
1+e 1
Pr[Y ≥ (1+e) m] ≤ 1+e for any e > 0.
10

Analysis on 1 array of counters  Observation: We always over-estimate.
 Question: How large we over estimate?  Analysis for a particular item a1:
Let Y=X2+…+Xn bethetotalincrementsbyotheritemai ≠a1.  The value of counter h(a1): f’1 = f1 + Y.
 Using Markov’s inequality:
Pr[f’1 ≥f1 +em] =Pr[Y≥em]≤E[Y]/em≤m/(w*em)
 Choose w = 2/e, we have this prob. is at most 1/2.
 Choose w = 2/e, the probability that our error is larger than
em is smaller than 1/2.
11

Analysis on d arrays of counters  Boosting the accuracy:
 Using d independent hash functions corresponding to d independent arrays of counters.
 F1 = min(C1[h1(a1)], … , Cd[hd(a1)]) = min(f’1, f’2, … , f’d ).
 Analysis:
Pr[F1 ≥f1 +em] =Pr[f’1 ≥f1 +em…f’d ≥f1 +em]≤1/2d Choosed=log(1/d),wehavePr[F1 ≥f1 +em] ≤d.
 With probability at least 1 – d, we have F1 < f1 + em = f1 + e 1 12 Homework  Implement the CountMin Sketch algorithm on the dataset from assignment 1:  Description: Each line (doc ID, word ID, freq.) as a stream tuple.  Query: What are the most and top-10 frequent word ID have been used? 13