程序代写代做代考 algorithm AI Misra-Gries Summary: Finding Heavy Hitters in Data Stream

Misra-Gries Summary: Finding Heavy Hitters in Data Stream
COMPCSI 753: Algorithms for Massive Data
Instructor: Ninh Pham
University of Auckland
Auckland, Aug 18, 2020
1

Basic definitions
 Let U be a universe of size n, i.e. U = {1, 2, 3, … , n}
 Cash register model stream:
 Sequence of m elements a1, … , am where ai Î U
 Elements of U may or may not occur once or several times in the stream
 Finding heavy hitters in data stream (today’s lecture):  Given a stream, finding frequent items.
2

Frequent items
 Each element of data stream is a tuple.
Given a stream of m elements a1, … , am where ai Î U,
finding the most/top-k frequent items.  Example:
 {1, 2, 1, 3, 4, 5}  f = {2, 1, 1, 1, 1}
 {1, 2, 1, 3, 1, 2, 4, 5, 2, 3}  f = {3, 3, 2, 1, 1}
3

Applications
 Networking:
 Tracking the most popular source, destinations, or source-destination
 Facts:
pairs (those with the highest amount of traffic).
 Web analytics:
 Tracking the most popular queries to a search engine, or the most
popular pieces of content in a large content host.
 Typical frequency distribution are highly skewed.
 Top 10% elements have 90% of total occurrences (active rating users, most rated movies in Netflix).
4

Skewed distribution
 Rating distribution for Netflix (solid line) and Movielens (dashed line) datasets.
 Items are ordered according to popularity (most popular at the bottom).
Cremonesi et al. RecSys’10
5

Exact solution
 Create a counter for each distinct element on its first occurrence.
 When processing an element, increase its counter.
Example:
 Stream: {1, 2, 3, 1, 4, 2, 1, 4, 2}
1 2 3 4
 Problem:
 Maintain n counters.
 We can only maintain k << n counters. 6 Sampling solution  Reservoir sampling:  Reservoir sampling of size k to maintain k elements so far and the size of stream m.  Estimate frequency based on the reservoir summary.  Note that frequency distribution are highly skewed. What occurs if some elements have frequency >> m/k?  Example:
1234
Stream: {2,2,2,4,3,4,1,1,1,1,1,1}  Reservoir sampling with k = 3
7

Misra Gries’82
 Process an element a:
 If we already have a counter for a, increment it.
 Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
 Else, decrease all counters by 1. Remove 0 counters (key step).
 Example: {1, 2, 3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1} 1
8

Misra Gries’82
 Process an element a:
 If we already have a counter for a, increment it.
 Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
 Else, decrease all counters by 1. Remove 0 counters (key step).
 Example: {1, 2 ,3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1, 2, 3, 1, 4, 2, 1, 4, 5, 2} 123242
17

Misra Gries’82
 Process an element a:
 If we already have a counter for a, increment it.
 Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
 Else, decrease all counters by 1. Remove 0 counters (key step).
 Example: {1, 2 ,3, 1, 4, 2, 1, 4, 5, 2, 6}, n=6, k=3, m=11.
{1, 2, 3, 1, 4, 2, 1, 4, 5, 2 , 6} 1232426
18

Misra Gries’82
 Process an element a:
 If we already have a counter for a, increment it.
 Else, if there is no counter for a, but fewer k counters, create a counter for a initialized to 1.
 Else, decrease all counters by 1. Remove 0 counters (key step).  Query: How many times the element a occurred?
 If we have a counter for a, return its value.  Else, return 0.
 Observation: We always under-estimate the frequency!
19

Why it works?
 Question:
 How many decrements to a particular a can we have?
 How many decrement step can we have?  Answer:
 The number of elements in stream: m.
 The number of elements in the summary: m’.
 Given a does not occur in the summary, one decrement step takes out k items and not count a. That is k + 1 “uncounted” occurrences.
 There is at most (m – m’) / (k + 1) decrement steps.
 Estimate is smaller than exact count by at most 𝒎 – 𝒎’.
𝒌􏰓𝟏
20

Why it works?
 Estimate is smaller than exact count by at most 𝒎 – 𝒎’. 𝒌􏰓𝟏
 We can find the most frequent items if their frequencies > 𝒎 – 𝒎’. 𝒌􏰓𝟏
 We get good estimate for a if its frequency >> 𝒎 – 𝒎’. 𝒌􏰓𝟏
 Error bound:
 Inversely proportional to k. Hence larger k gives better accuracy.  Can be computed by knowing k and m’, and tracking m.
21

Carl Colglazier GitHub
Why it works?
 Misra Gries works because typical frequency distributions have few very popular elements “Zipf law”.
22

Homework
 Implement the Misra-Gries algorithm on the dataset from Assignment 1:
 Description: Each line (doc ID, word ID, freq.) as a stream tuple.
 Query: What are the most and top-10 frequent word ID have been used?
23

Related Posts