程序代写代做代考 algorithm AI Count Sketch: Finding Heavy Hitters

Count Sketch: Finding Heavy Hitters
COMPCSI 753: Algorithms for Massive Data
Instructor: Ninh Pham
University of Auckland
Auckland, Aug 24, 2020
1

Basic definitions
 Let U be a universe of size n, i.e. U = {1, 2, 3, …, n}.
 Cash register model stream:
 Sequence of m elements a1,…, am where ai Î U.
 Elements of U may or may not occur once or several times in the stream.
 Finding heavy hitters in data stream (today’s lecture):  Given a stream, finding frequent items.
2

Frequent items
 Each element of data stream is a tuple.
 Given a stream of m elements a1,…, am where ai Î U,
finding the most/top-k frequent elements.  Example:
 {1, 2, 1, 3, 4, 5}  f = {2, 1, 1, 1, 1}
 {1, 2, 1, 3, 1, 2, 4, 5, 2, 3}  f = {3, 3, 2, 1, 1}
 We need an approximation solution with much smaller memory with theoretical guarantees.
3

Randomized: CountMin sketch
 Setup:
 d independent universal hash functions h over range [0, w)
 d different counters, C1, … , Cd. Each of size w initialized with 0s  Process an element aj:
 For each hash function, compute hi(aj) and increment Ci[hi(aj)] by 1
4

aj
h1(aj) h2(aj) h3(aj)
Randomized: CountMin sketch
 Query: How many times aj occurred?
 For each hash function, compute hi(aj) and get Ci[hi(aj)]
 Return min(C1[h1(aj)], … , Cd[hd(aj)])
hd(aj)
5

Randomized: Count sketch
 Setup:
 d independent 2-wise hash functions h over range [0, w).
 d independent 2-wise hash functions s over range {+1, -1}.
 d different counters, C1, … , Cd. Each of size w initialized with 0s.
 Process an element aj:
 For each hash function, compute hi(aj) and increment Ci[hi(aj)] by s(aj).
􏰌1 􏰌1
􏰌1
􏰌1
6

aj
h1(aj) h2(aj) h3(aj)
Randomized: Count sketch
 Query: How many times aj occurred?
 For each hash function, compute hi(aj) and get Ci[hi(aj)].
 Return median(C1[h1(aj)], … , Cd[hd(aj)]).
hd(aj)
median
7

2-wise hash function family
 2-wise hash function definition:
 A family of hash function H = {h : U  {0, 1, … , w-1}} is 2-wise independent if for any 2 distinct keys xi ≠ xj Î U and 2 hash values (not necessary distinct) yi, yj Î {0, 1, … , w-1}, we have
Prh[h(xi) = yi  h(xj) = yj] = 1/w2  On our CountSketch:
 Given two different items ai ≠ aj, what is the prob. ai and aj collide? Prh[h(ai) = h(aj)] = 1/w2 * w = 1/w
8

Analysis on 1 array of counters
 Notation:
 Stream of m items {a1, … , am} from the universe U of size n.  Frequency vector f ={f1, … , fn}, 𝒇 𝟐𝟐 􏰀 ∑𝒏𝒊􏰍𝟏 𝒇𝟐𝒊
 Question:
 Given a particular item a1, how many times ai ≠ a1 collide by h?
a1 a2 +1
a3 a1 a2 -1
a1
9

Analysis on 1 array of counters
 Question:
 Given a particular item a1, how many times ai ≠ a1 collide by h.
a1 a2 +1
a3 a1 a2 -1
 Answer:
 Let Xi be contribution of ai in the bucket h(a1).
a1
𝑓𝑖, which happens with prob. 1/2𝑤  𝑋𝑖 􏰀 􏰎0, which happens with prob. 1 􏰉 1/𝑤
􏰉𝑓𝑖, which happens with prob. 1/2𝑤
 Error source caused by the item ai: E[Xi] = 0
10

Analysis on 1 array of counters
 Observation: Our estimate is unbiased.  Question: How large error do we have?
 Analysis of variance of contribution of a particular item ai:  Let Xi be contribution of ai in the bucket h(a1).
𝑓𝑖, which happens with prob. 1/2𝑤  𝑋𝑖 􏰀 􏰎0, which happens with prob. 1 􏰉 1/𝑤
􏰉𝑓𝑖, which happens with prob. 1/2𝑤
 Expectation: E[Xi] = 0.
 Variance: Var [Xi] = E[Xi2] – (E[Xi])2 = fi2/w.
11

Analysis on 1 array of counters
 Observation: Our estimate is unbiased.  Question: How large error do we have?
 Analysis of variance of error:
 Let X2, … , Xn be contributions of a2, … , an in the bucket h(a1).
 Let Y = X2+ … + Xn be the total contributions by other item ai ≠ a1.  Variance of error:
Var[Y] =Var[X2] + … +Var[Xn] = (f22 + … + fn2)/w ≤ 𝒇 𝟐𝟐/w.
12

Basic tools: Tail inequality
Probability distribution
Tail probability



ee
 Chebyshev’s inequality for E[Y] = m and Var[Y] = s2:
Pr[|Y–m|≥e]≤s2 foranye>0. e2
13

Analysis on 1 array of counters  Question: How large error do we have?
 Analysis of variance of error:
 Let X2, … , Xn be contributions of a2, … , an in the bucket h(a1).
 Let Y = X2+ … + Xn be the total contributions by other item ai ≠ a1.  Expectation: E[Y] = 0.
 Variance: Var[Y] = Var[X2] + … + Var[Xn] = (f22 + … + fn2)/w ≤  Chebyshev’s inequality:
𝒇 𝟐𝟐 𝒘
Pr[|Y|≥e 𝒇 ]≤ 1 =1/2ifwechoosew=2/e2. 2 𝒘e2
 The error is at most e 𝒇 2 with the probability 1/2.
14

Analysis on d arrays of counters  Boosting the accuracy:
 Using d independent hash functions corresponding to d independent arrays of counters.
 F1 = median(C1[h1(a1)], … , Cd[hd(a1)]) = median(f’1, f’2, …, f’d ).  Analysis:
 E [ f’1] = E [ f’2] = … = E [ f’d] = f1.
 Choose d = log(1/d) and apply Chernoff bound, we have
Pr[|F1 –f1|≤e 𝒇 2]≥1-d
 With probability at least 1 – d, we have f1-e 𝒇 2≤F1≤f1+e 𝒇 2
15

Homework
 Implement the CountSketch algorithm on the dataset from assignment 1:
 Description: Each line (doc ID, word ID, freq.) as a stream tuple.
 Query: What are the most and top-10 frequent word ID have been used?
 What are the average errors of CountMin Sketch and CountSketch?
16