Misra-Gries Summary: Finding Heavy Hitters in Data Stream
COMPCSI 753: Algorithms for Massive Data
Instructor: Ninh Pham
University of Auckland
Auckland, Aug 18, 2020
1
Basic definitions
Let U be a universe of size n, i.e. U = {1, 2, 3, … , n}
Cash register model stream:
Sequence of m elements a1, … , am where ai Î U
Elements of U may or may not occur once or several times in the stream
Finding heavy hitters in data stream (today’s lecture): Given a stream, finding frequent items.
2
Frequent items
Each element of data stream is a tuple.
Given a stream of m elements a1, … , am where ai Î U,
finding the most/top-k frequent items. Example:
{1, 2, 1, 3, 4, 5} f = {2, 1, 1, 1, 1}
{1, 2, 1, 3, 1, 2, 4, 5, 2, 3} f = {3, 3, 2, 1, 1}
3
Applications
Networking:
Tracking the most popular source, destinations, or source-destination
Facts:
pairs (those with the highest amount of traffic).
Web analytics:
Tracking the most popular queries to a search engine, or the most
popular pieces of content in a large content host.
Typical frequency distribution are highly skewed.
Top 10% elements have 90% of total occurrences (active rating users, most rated movies in Netflix).
4
Skewed distribution
Rating distribution for Netflix (solid line) and Movielens (dashed line) datasets.
Items are ordered according to popularity (most popular at the bottom).
Cremonesi et al. RecSys’10
5
Exact solution
Create a counter for each distinct element on its first occurrence.
When processing an element, increase its counter.
Example:
Stream: {1, 2, 3, 1, 4, 2, 1, 4, 2}
1 2 3 4
Problem:
Maintain n counters.
We can only maintain k << n counters.
6
Sampling solution
Reservoir sampling:
Reservoir sampling of size k to maintain k elements so far and the size of
stream m.
Estimate frequency based on the reservoir summary.
Note that frequency distribution are highly skewed. What occurs if some
elements have frequency >> m/k? Example:
1234
Stream: {2,2,2,4,3,4,1,1,1,1,1,1} Reservoir sampling with k = 3
7
Misra Gries’82
Process an element a:
If we already have a counter for a, increment it.
Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
Else, decrease all counters by 1. Remove 0 counters (key step).
Example: {1, 2, 3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1} 1
8
Misra Gries’82
Process an element a:
If we already have a counter for a, increment it.
Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
Else, decrease all counters by 1. Remove 0 counters (key step).
Example: {1, 2, 3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1, 2} 12
9
Misra Gries’82
Process an element a:
If we already have a counter for a, increment it.
Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
Else, decrease all counters by 1. Remove 0 counters (key step).
Example: {1, 2, 3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1, 2, 3} 123
10
Misra Gries’82
Process an element a:
If we already have a counter for a, increment it.
Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
Else, decrease all counters by 1. Remove 0 counters (key step).
Example: {1, 2, 3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1, 2, 3, 1} 123
11
Misra Gries’82
Process an element a:
If we already have a counter for a, increment it.
Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
Else, decrease all counters by 1. Remove 0 counters (key step).
Example: {1, 2, 3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1, 2, 3, 1, 4} 123
12
Misra Gries’82
Process an element a:
If we already have a counter for a, increment it.
Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
Else, decrease all counters by 1. Remove 0 counters (key step).
Example: {1, 2, 3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1, 2, 3, 1, 4, 2} 1232
13
Misra Gries’82
Process an element a:
If we already have a counter for a, increment it.
Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
Else, decrease all counters by 1. Remove 0 counters (key step).
Example: {1, 2, 3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1, 2, 3, 1, 4, 2, 1} 1232
14
Misra Gries’82
Process an element a:
If we already have a counter for a, increment it.
Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
Else, decrease all counters by 1. Remove 0 counters (key step).
Example: {1, 2, 3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1, 2, 3, 1, 4, 2, 1, 4} 12324
15
Misra Gries’82
Process an element a:
If we already have a counter for a, increment it.
Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
Else, decrease all counters by 1. Remove 0 counters (key step).
Example: {1, 2, 3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1, 2, 3, 1, 4, 2, 1, 4, 5} 12324
16
Misra Gries’82
Process an element a:
If we already have a counter for a, increment it.
Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
Else, decrease all counters by 1. Remove 0 counters (key step).
Example: {1, 2 ,3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1, 2, 3, 1, 4, 2, 1, 4, 5, 2} 123242
17
Misra Gries’82
Process an element a:
If we already have a counter for a, increment it.
Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
Else, decrease all counters by 1. Remove 0 counters (key step).
Example: {1, 2 ,3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1, 2, 3, 1, 4, 2, 1, 4, 5, 2 , 6} 1232426
18
Misra Gries’82
Process an element a:
If we already have a counter for a, increment it.
Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
Else, decrease all counters by 1. Remove 0 counters (key step). Query: How many times the element a occurred?
If we have a counter for a, return its value. Else, return 0.
Observation: We always under-estimate the frequency!
19
Why it works?
Question:
How many decrements to a particular a can we have?
How many decrement step can we have? Answer:
The number of elements in stream: m.
The number of elements in the summary: m’.
Given a does not occur in the summary, one decrement step takes out k items and not count a. That is k + 1 “uncounted” occurrences.
There is at most (m – m’) / (k + 1) decrement steps.
Estimate is smaller than exact count by at most 𝒎 – 𝒎’.
𝒌𝟏
20
Why it works?
Estimate is smaller than exact count by at most 𝒎 – 𝒎’. 𝒌𝟏
We can find the most frequent items if their frequencies > 𝒎 – 𝒎’. 𝒌𝟏
We get good estimate for a if its frequency >> 𝒎 – 𝒎’. 𝒌𝟏
Error bound:
Inversely proportional to k. Hence larger k gives better accuracy. Can be computed by knowing k and m’, and tracking m.
21
Carl Colglazier GitHub
Why it works?
Misra Gries works because typical frequency distributions have few very popular elements “Zipf law”.
22
Homework
Implement the Misra-Gries algorithm on the dataset from Assignment 1:
Description: Each line (doc ID, word ID, freq.) as a stream tuple.
Query: What are the most and top-10 frequent word ID have been used?
23