Mining Data Streams
INF 553
Slides from:
Wensheng Wu
and
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University
http://www.mmds.org
1
Roadmap • Motivation
• Sampling
– Fixed-portion & fixed-size (reservoir sampling)
• Filtering
– Bloom filter
• Counting
– Estimating # of distinct values, moments
• Sliding window
– Counting # of 1’s in the window
2
Data streams & applications
• Query streams
– How many unique users at Google last month?
• URL streams while crawling
– Which URLs have been crawled before?
• Sensor data
– What is the maximum temperature so far?
3
Stream data processing
• Stream of tuples arriving at a rapid rate
– In contrast to traditional DBMS where all tuples
are stored in secondary storage
• Infeasible to use all tuples to answer queries
– Cannot store them all in main memory
– Too much computation
– Query response time critical
4
Query types
• Standing queries
– Executed whenever a new tuple arrives
– keep only one value
– e.g., report each new maximum value ever seen in the stream
• Ad-hoc queries
– Normal queries asked one time
– Need entire stream to have an exact answer
– e.g., what is the number of unique visitors in the last two months?
5
Stream Processing Model
[Figure: Stream Processing Model. Several streams enter a Processor, e.g.,
. . . 1, 5, 2, 7, 0, 9, 3
. . . a, r, v, t, y, h, b
. . . 0, 0, 1, 0, 1, 1, 0        (time ->)
Each stream is composed of elements/tuples. The Processor answers Standing Queries and Ad-Hoc Queries, produces Output, and has only Limited Working Storage plus slower Archival Storage.]
6
Example: Running averages
• Given a window of size N
– report the average of values in the window
whenever a value arrives
– N is so large that we cannot store all tuples in the window
• How to do this?
7
Example: running averages
• First N inputs: accumulate sum and count
– Avg = sum/count
• A new element i arrives:
– Change the average by adding (i - j)/N
• j is the oldest element in the window
• the window size is fixed, so we discard j
(a small sketch follows this slide)
8
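A minimal Python sketch of this update rule (illustrative, not from the slides; it still keeps the window in a deque, since the update needs the oldest element j):

```python
from collections import deque

class RunningAverage:
    """Average of the last n stream values, updated as on the previous slide."""

    def __init__(self, n):
        self.n = n
        self.window = deque()   # last n values (kept only to know the oldest, j)
        self.avg = 0.0

    def add(self, i):
        if len(self.window) < self.n:
            # First n inputs: accumulate sum and count
            self.window.append(i)
            self.avg = sum(self.window) / len(self.window)
        else:
            # New element i: discard the oldest j and adjust the average by (i - j)/n
            j = self.window.popleft()
            self.window.append(i)
            self.avg += (i - j) / self.n
        return self.avg

ra = RunningAverage(n=3)
print([ra.add(v) for v in [2, 4, 6, 8, 10]])   # [2.0, 3.0, 4.0, 6.0, 8.0]
```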
Roadmap • Motivation
• Sampling
– Fixed-portion & fixed-size (reservoir sampling)
• Filtering
– Bloom filter
• Counting
– Estimating # of distinct values, moments
• Sliding window
– Counting # of 1’s in the window
9
Sampling from a data stream
• Scenario: Search engine query stream
– Stream of tuples: (user, query, time)
– Answer questions such as: How often did a user run the same query in a single day
– Have space to store 1/10th of query stream
• Method 1: sample a fixed portion of elements
– e.g., 1/10
• Method 2: maintain a fixed-size sample
10
Sampling a fixed proportion
• Search engine query stream
– Stream of tuples: (user, query, time)
• Example query
– what fraction of queries by a user are duplicates?
• Assumption
– Have space to store 1/10 of stream tuples
11
Naive solution
• For each tuple,
– generate a random number in [0..9]
– store it in the sample if the number is 0
• Sample rate?
• Example stream tuples:
– (john, data mining, 2015/12/01 9:45)
– (mary, inf 553, 2015/12/01 10:08)
– (john, data mining, 2015/12/01 11:30)
– …
• Problems?
12
True fraction of queries with duplicates
• A user issued s queries once & d queries twice
– E.g., data mining, inf 553, movie, movie, tom, tom (s = d = 2)
• Total number of query instances = s + 2d
• True fraction of queries with duplicates = d/(s+d)
• Sampling rate = 1/10
13
Queries with duplicates in sample
• The sample contains:
– s/10 “s” queries, e.g., data mining
– 2d/10 “d” tuples, e.g., movie, tom, tom
• If sample = data mining, movie, tom, tom, question:
– What is the expected number of d queries with duplicates in the sample?
• movie – a d query without duplicates in the sample
• tom, tom – a d query with duplicates in the sample
• E.g., both tom’s appear in the sample
14
Queries with duplicates in sample
• s1, s2, …, s800, d1, d1’, d2, d2’, …, d100, d100’
– s = 800, d = 100
• Each dj has probability 1/10 of being selected
– so prob. of both dj and dj’ being selected = 1/10 * 1/10 = 1/100
• There are d such pairs
⇒ so the expected number of duplicate pairs in the sample = d/100
15
“d” Queries without duplicates in sample
• The sample contains:
– s/10 “s” queries, e.g., data mining
– 2d/10 “d” tuples, e.g., movie, tom, tom
• Question:
– What is the expected number of d queries without
duplicates in the sample?
– E.g., only one movie appears in the sample
16
“d” Queries without duplicates in sample
• s1, s2, …, s800, d1, d1’, d2, d2’, …, d100, d100’
– s = 800, d = 100
• Expected # of singleton d queries in sample
– dj selected, dj’ not selected: 1/10 * 9/10
– dj not selected, dj’ selected: 9/10 * 1/10
⇒ 9d/100 + 9d/100 = 18d/100
17
Fraction of queries in sample w/ duplicates
• s1, s2, …, s800, d1, d1’, d2, d2’, …, d100, d100’
– s = 800, d = 100
• Fraction = (d/100) / (s/10 + d/100 + 18d/100) = d / (10s + 19d) ≠ d/(s+d)
• Note: total # of d queries with and without duplicates =
– d/100 * 2 + 18d/100 = 20d/100 = 2d/10
18
What has been the problem?
• Mistake: sample by position
– When a search tuple arrives, we flip a 10-sided coin
– Retain it if the coin shows 0
• Solution: sample by user (by key)
– When a search tuple (user, query, time) arrives
– We extract its user (key) component
– Hash user into 10 buckets: 0, 1, …, 9
– Retain all tuples for the users hashed to bucket 0
19
Sample by user
• Sample = all queries for users hashed into bucket 0
• All or none of queries of a user will be selected
• Thus, the fraction of unique queries in sample – will be the same as that in the stream as a whole
20
General Sampling Problem
• Stream of tuples with n components
– Key = a subset of components
• Search stream: (user, query, time)
– Sample size: a/b of the stream
• Sampling strategy:
– Hash the key (e.g., user) into b buckets: 0, 1, …, b-1
– Accept the tuple if the hash value of its key is < a
21
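A small Python sketch of this key-based sampling (the helper and variable names are my own; md5 is just one convenient stable hash):

```python
import hashlib

def keep_tuple(key, a, b):
    """Keep a tuple iff its key hashes to one of buckets 0..a-1 out of b,
    giving an a/b sample that keeps all or none of each key's tuples."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % b < a

# Example: sample ~1/10 of the query stream by user (key = user)
stream = [("john", "data mining", "2015/12/01 9:45"),
          ("mary", "inf 553", "2015/12/01 10:08"),
          ("john", "data mining", "2015/12/01 11:30")]
sample = [t for t in stream if keep_tuple(t[0], a=1, b=10)]
print(sample)
```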
Example
• Tuples: (empID, dept, salary)
• Query: avg. range of salary within a dept.?
• Randomly selecting tuples
– Might miss the min/max salary of a department
• Key = dept.
• Sample: some departments and all tuples in these departments
22
Roadmap • Motivation
• Sampling
– Fixed-portion & fixed-size (reservoir sampling)
• Filtering
– Bloom filter
• Counting
– Estimating # of distinct values, moments
• Sliding window
– Counting # of 1’s in the window
23
Problem with fixed portion sample
• Sample size may grow too big as data streams in
– Even 10% could be too big, e.g., the tuples hashed to the chosen bucket may exceed the memory size
• Idea: throw away some queries
• Key: do this consistently
– remove all or none of occurrences of a query
24
Controlling the sample size
• Put an upper bound on the sample size
– Start out with 10%
• Solution:
– Hash queries to a large # of buckets, say 100
– Put them into the sample if they hash to buckets 0 to 9
– When the sample grows too big, throw away bucket 9
– And so on (bucket 8 next, …)
25
Maintaining a fixed-size sample
• Suppose we need to maintain a random sample, S, of size exactly s tuples (instead of a fixed fraction)
– E.g., main memory size constraint
• Why? Don’t know length of stream in advance
• Suppose at time n we have seen n items
– Each item is in the sample S with equal prob. s/n
How to think about the problem: say s = 2
Stream: a x c y z k c d e g...
At n = 5, each of the first 5 tuples is included in the sample S with equal prob.
At n = 7, each of the first 7 tuples is included in the sample S with equal prob.
An impractical solution would be to store all n tuples seen so far and pick s of them at random
26
Solution: Fixed Size Sample
• Algorithm (a.k.a. Reservoir Sampling)
– Store all of the first s elements of the stream in S
– Suppose we have seen n-1 elements, and now the nth element arrives (n > s)
• With probability s/n, keep the nth element; otherwise discard it
• If we picked the nth element, it replaces one of the s elements in the sample S, picked uniformly at random
• Claim: This algorithm maintains a sample S with the desired property:
– After n elements, the sample contains each element seen so far with probability s/n
27
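A minimal Python sketch of reservoir sampling as described above (illustrative; `random` supplies the coin flips):

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of exactly s elements from a stream."""
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            sample.append(x)                 # keep the first s elements
        elif random.random() < s / n:        # keep the nth element with prob. s/n
            sample[random.randrange(s)] = x  # it replaces a uniformly chosen element
    return sample

print(reservoir_sample(range(1000), s=10))
```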
Proof: By Induction
• We prove this by induction:
– Assume that after n elements, the sample contains
each element seen so far with probability s/n
– We need to show that after seeing element n+1 the
sample maintains the property
• Sample contains each element seen so far with probability s/(n+1)
• Base case:
– After we see n=s elements the sample S has the desired property
• Each out of n=s elements is in the sample with probability s/s = 1
28
Proof: By Induction
• Inductive hypothesis: After n elements, the sample
S contains each element seen so far with prob. s/n
• Now element n+1 arrives
• Inductive step: For elements already in S, the probability that the algorithm keeps them in S is:
(1 - s/(n+1)) + (s/(n+1)) · ((s-1)/s) = n/(n+1)
(first term: element n+1 is discarded; second term: element n+1 is not discarded, but the element in the sample is not the one picked for replacement)
• So, at time n, tuples in S were there with prob. s/n
• Time n -> n+1, a tuple stayed in S with prob. n/(n+1)
• So prob. a tuple is in S at time n+1 = (s/n) · (n/(n+1)) = s/(n+1)
29
Roadmap • Motivation
• Sampling
– Fixed-portion & fixed-size (reservoir sampling)
• Filtering
– Bloom filter
• Counting
– Estimating # of distinct values, moments
• Sliding window
– Counting # of 1’s in the window
30
Stream filtering
• Application: spam filtering
– Check incoming emails against a large set of known email addresses (e.g., 1 billion)
• Application: URL filtering in Web crawling
– Check if discovered URLs have already been crawled
31
Bloom filter
• Check if an object o is in a set S
– w/o comparing o with all objects in S (explicitly)
• No false negatives
– If Bloom says no, then o is definitely not in S
• But may have false positives
– If Bloom says yes, it is possible that o is not in S
32
Components in a Bloom filter
• An array A of n bits, initially all 0’s: A[0..n-1]
• A set of hash functions, each of which
– takes an object (stream element) as the input
– returns a position in the array: 0..n-1
• A set S of objects
33
Construct and apply the filter
• Construction: for each object o in S,
– Apply each hash function hj to o
– If hj(o) = i, set A[i] = 1 (if it was 0)
• Application: check if new object o’ is in S
– Hash o’ using each hash function
– If for some hash function hj(o’) = i and A[i] = 0, stop and report o’ not in S
34
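A sketch of these construction/application steps in Python (my own helper names; salted SHA-256 hashes stand in for the k hash functions, which the slides leave unspecified):

```python
import hashlib

class BloomFilter:
    """n-bit Bloom filter with k hash functions (sketch of slides 33-34)."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _positions(self, obj):
        # Derive k array positions from salted hashes of the object
        for j in range(self.k):
            h = hashlib.sha256(f"{j}:{obj}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, obj):                 # construction: set A[h_j(o)] = 1 for each h_j
        for i in self._positions(obj):
            self.bits[i] = 1

    def might_contain(self, obj):       # application: report "no" if any bit is 0
        return all(self.bits[i] for i in self._positions(obj))

bf = BloomFilter(n=11, k=2)
for x in (25, 159, 585):
    bf.add(x)
print(bf.might_contain(25), bf.might_contain(118))   # True, and very likely False
```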
Example
• 11-bit array
• Stream elements = integers
• Two hash functions
– h1(x) = (odd-position bits from the right) % 11
– h2(x) = (even-position bits from the right) % 11
35
Example: Building the filter
Stream element                   h1   h2
25  = 1 1 0 0 1                   5    2
159 = 1 0 0 1 1 1 1 1             7    0
585 = 1 0 0 1 0 0 1 0 0 1         9    7
Filter contents (11 bits, positions 0..10):
Initially:   0 0 0 0 0 0 0 0 0 0 0
After 25:    0 0 1 0 0 1 0 0 0 0 0
After 159:   1 0 1 0 0 1 0 1 0 0 0
After 585:   1 0 1 0 0 1 0 1 0 1 0
36
Example: Using the filter
• Is 118 in the set?
– No false negative in Bloom
Stream element            h1   h2
118 = 1 1 1 0 1 1 0        3    5
Filter: 1 0 1 0 0 1 0 1 0 1 0
– A[3] = 0, so the filter says 118 is definitely not in the set
37
Example: False positive
• Is 38 in the set (25, 159, 585)?
– It turns on the same bits as 25, but in diff. ways
Stream element            h1   h2
38 = 1 0 0 1 1 0           2    5
Recall:
25 = 1 1 0 0 1             5    2
Filter: 1 0 1 0 0 1 0 1 0 1 0
(filter after 25 alone: 0 0 1 0 0 1 0 0 0 0 0)
– A[2] = A[5] = 1, so the filter says yes even though 38 is not in the set: a false positive
38
False positive
• x not in S, but identified as in S
• Reason:
– For all hash functions, x hashes into a position that is 1
– That is, hi(x) = hj(e), for some e in S
– Note: j may be different from i
39
False positive rate
• n = # of bits in array
• k = # of hash functions
• m = # of elements inserted
• f = fraction of 1’s in bit array
• False positive rate = f^k
– The probability of saying YES to the question “Is X in the set?”
• f ≤ m*k/n (this is an upper bound, why?)
40
Example
• n = 8 billion (bits in array)
• m = 1 billion (objects in the set)
• k = 1 (# of hash functions)
• f is estimated to be km/n = 1/8
– 1/8 of the bits in the array are 1
• False positive rate ≤ 1/8 = .125
41
Accurate Estimation of fraction of 1’s
• n = # of bits in array
• k = # of hash functions
• m = # of elements inserted
• Fraction of 1’s = the probability that a bit in the array is set to 1 by at least one hashing
– Total # of hashings: k * m
42
Estimation model
• Consider throwing d darts at t targets
• What is the probability, denoted p, that a given target is hit by at least one dart?
– Prob. that a given dart misses the target = 1 - 1/t
– Prob. that all d darts miss the target = (1 - 1/t)^d
– Since (1 - 1/t)^t = ((1 + 1/(-t))^(-t))^(-1) ≈ e^(-1) = 1/e for large t,
we have (1 - 1/t)^d = ((1 - 1/t)^t)^(d/t) ≈ e^(-d/t)
– p = 1 - e^(-d/t)
43
Estimating the fraction of 1’s
• n = # of bits in array
• k = # of hash functions
• m = # of elements inserted
• Fraction of 1’s = the probability that a bit in the array is set to 1 by at least one hashing
– 1 - e^(-km/n)
44
Accurate false positive rate
• f = fraction of 1’s in bit array
• k = # of hash functions
• m = # of elements inserted
• False positive rate = f^k
• Instead of f ≤ m*k/n, we have f = 1 - e^(-km/n)
• So false positive rate = (1 - e^(-km/n))^k
– (f < mk/n because multiple hashings may hit the same bit)
45
Example
• n = 8 billion (bits in array)
• m = 1 billion (objects in the set)
• k = 1 (# of hash functions)
• Recall that false positive rate ≤ 1/8 = .125
• Actual rate = (1 - e^(-km/n))^k = 1 - e^(-1/8) = .1175
• What if k = 2?
46
Example
• n = 8 billion (bits in array)
• m = 1 billion (objects in the set)
• k = 2 (# of hash functions)
• Actual rate = (1 - e^(-km/n))^k = (1 - e^(-2/8))^2 = .2212^2 = .0489
47
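The numbers on the last few slides (and the optimal-k value on the next one) can be reproduced with a few lines of Python; this is just a sketch with n and m as in the example:

```python
import math

n = 8e9   # bits in the array
m = 1e9   # elements inserted

def false_positive_rate(k):
    # (1 - e^(-km/n))^k, the rate derived on the previous slides
    return (1 - math.exp(-k * m / n)) ** k

print(false_positive_rate(1))          # ~0.1175
print(false_positive_rate(2))          # ~0.0489
k_opt = (n / m) * math.log(2)          # optimal k = (n/m) ln 2 ~ 5.54
print(k_opt, false_positive_rate(6))   # error rate at k = 6: ~0.0216
```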
Optimal k
• n = 8 billion
• m = 1 billion
• k = # of hash functions
• Rate = (1 - e^(-km/n))^k = (1 - e^(-k/8))^k
[Figure: false positive rate (y-axis, 0 to 0.2) vs. number of hash functions k (x-axis, 0 to 20); the curve reaches its minimum near k = 6]
• Optimal k = n/m * ln(2)
– E.g., k = 8 ln(2) = 5.54 ≈ 6
• Error rate at k = 6: (1 - e^(-6/8))^6 = .0216
48
Roadmap • Motivation
• Sampling
– Fixed-portion & fixed-size (reservoir sampling)
• Filtering
– Bloom filter
• Counting
– Estimating # of distinct values, moments
• Sliding window
– Counting # of 1’s in the window
49
Compute # of distinct elements
• Applications
– Compute # of distinct users to Facebook
– Compute # of distinct queries submitted to Google
• Obvious solution:
– Hash table of distinct elements
– Check if new element is there; if not, add it
• Problems?
50
Flajolet-Martin algorithm
• Estimating the count of distinct elements
1. Hash every element a to a sufficiently long bit-string (e.g., h(a) = 1100 – 4 bits)
2. Maintain R = the maximum number of trailing zeros among all bit-strings seen (e.g., R = 2)
3. Estimate count = 2^R, e.g., = 4
51
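A minimal Flajolet-Martin sketch in Python (illustrative only: md5 stands in for the hash to a long bit-string, and a hash value of 0 is treated as having no trailing zeros to keep the sketch short):

```python
import hashlib

def trailing_zeros(x):
    """Number of trailing zero bits in x (0 is treated as having none)."""
    if x == 0:
        return 0
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def fm_estimate(stream):
    """Estimate # of distinct elements as 2^R, where R is the longest run of
    trailing zeros seen in any hash value."""
    r = 0
    for elem in stream:
        h = int(hashlib.md5(str(elem).encode()).hexdigest(), 16)
        r = max(r, trailing_zeros(h))
    return 2 ** r

print(fm_estimate(["a", "b", "c", "d", "a", "b"]))   # a rough estimate of 4
```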
Example
• Consider 4 distinct elements: a, b, c, d
• Hash value into bit string of length 4
• How likely do we see at least one hash value with a 0 in the last bit? Next slide
– hash(a) = 0010
– hash(b) = 0111
– hash(c) = 1010
– hash(d) = 1111
52
Example: at least one ends with 0
• E.g.,
– hash(a) = 0010
– hash(b) = 0111
– hash(c) = 1010
– hash(d) = 1111
• Prob. of none of the hash values ending with 0:
– (1 - 1/2)^4 (every string ends with 1; there are four strings)
• Prob. of at least one ending with 0:
– 1 - (1 - 1/2)^4 = .9375
53
Example: at least one ends with 00
• E.g.,
– hash(a) = 0100
– hash(b) = 0111
– hash(c) = 1010
– hash(d) = 1111
• Prob. of a given one ending with 00:
– (1/2)^2
• Prob. of none ending with 00:
– (1 - (1/2)^2)^4 = .32
• Prob. of at least one ending with 00:
– 1 - .32 = .68
54
Example: at least one ends with 000
• E.g.,
– hash(a) = 0000
– hash(b) = 0111
– hash(c) = 1010
– hash(d) = 1111
• Prob. of a given one ending with 000:
– (1/2)^3
• Prob. of none ending with 000:
– (1 - (1/2)^3)^4 = .59
• Prob. of at least one ending with 000:
– 1 - .59 = .41
55
Why it works?
• Prob. of h(a) having at least r trailing 0’s
– 2^(-r)
• Suppose there are m distinct elements
• Prob. that h(a) for some element a has at least r trailing 0’s (as in the examples on the previous slides)
– 1 - (1 - 2^(-r))^m = 1 - ((1 - 2^(-r))^(2^r))^(m/2^r) ≈ 1 - e^(-m/2^r)
– Recall: (1 - 1/t)^t ≈ 1/e for large t
56
Why it works?
• Prob. that h(a) for some element a has at least r trailing 0’s
– p = 1 - (1 - 2^(-r))^m ≈ 1 - e^(-m/2^r)
• Taylor expansion of e^x
– e^x = 1 + x/1! + x^2/2! + x^3/3! + ⋯ ≈ 1 + x, if |x| ≪ 1
• So p = 1 - e^(-m/2^r) ≈ 1 - (1 - m/2^r) = m/2^r, if 2^r ≫ m
57
Why it works?
• Prob. that h(a) for some element a has at least r trailing 0’s
– p = 1 - (1 - 2^(-r))^m ≈ 1 - e^(-m/2^r)
• Prob. that h(a) for NO element a has at least r trailing 0’s
– p’ = (1 - 2^(-r))^m ≈ e^(-m/2^r)
(1) If 2^r ≫ m: p ≈ m/2^r -> 0; p’ -> 1
– the probability of finding a tail of length at least r approaches 0
– R is unlikely to be too large, since such a tail is unlikely to show up at all
(2) If 2^r ≪ m: p = 1 - e^(-m/2^r) -> 1; p’ -> 0
– the probability that we shall find a tail of length at least r approaches 1
– Since R is the length of the longest tail of 0’s, it is unlikely to be too small
⇒ So, R is unlikely to be either much too high or much too low
⇒ 2^R is around m
58
Problems in combining estimates
• We can hash multiple times and take the avg. of the 2^R values
• Problem: ExpectedValue(2^R) -> ∞
– When 2^R ≥ m, increasing R by 1 halves the probability but doubles the value 2^R
• p = 1 - (1 - 2^(-r))^m ≈ 1 - e^(-m/2^r)
– So the contribution from each large R to E(2^R) grows as R grows
– We can get a very large 2^R
• What about taking the median instead?
59
Combining the estimates
• Avg only: what if one very large value?
• Median: all values are powers of 2
– 1, 2, 4, 8, …, 1024, 2048, …
• Solution:
– Partition hash functions into small groups
– Take average for each group
– Take the median of the averages
60
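A small sketch of this average-within-groups, median-of-averages combination (a hypothetical helper, assuming the 2^R estimates from many hash functions have already been collected in a list):

```python
from statistics import median

def combine_estimates(estimates, group_size):
    """Average the estimates within small groups, then take the median
    of the group averages."""
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    averages = [sum(g) / len(g) for g in groups]
    return median(averages)

print(combine_estimates([4, 8, 4, 16, 2, 8, 4, 1024, 4, 8, 2, 4], group_size=4))
```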
Roadmap • Motivation
• Sampling
– Fixed-portion & fixed-size (reservoir sampling)
• Filtering
– Bloom filter
• Counting
– Estimating # of distinct values, moments
• Sliding window
– Counting # of 1’s in the window
61
Moments
• A stream S of elements drawn from a universal set: v1, v2, …, vn
– m_i is the number of occurrences of v_i in S
– E.g., for S = (1, 4, 1, 3, 4, 1): m_1 = 3, m_4 = 2
• k-th moment of S:
– Σ_{i=1..n} (m_i)^k
62
K-th moments
• 0-th moment of the stream S
– # of distinct elements in S
– e.g., 3 for (1, 4, 1, 3, 4, 1)
• 1st moment of S
– Length of S
• 2nd moment of S : sum of squared frequencies
– Surprise number, measuring the evenness of distribution of elements in S
63
Surprise number
• Stream of 100 elements, 11 distinct values appear
• Unsurprising: m_i = 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9
– Surprise number = 10^2 + 10 * 9^2 = 910
• Surprising: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
– Surprise number = 90^2 + 10 * 1^2 = 8,110
64
AMS (Alon-Matias-Szegedy) Algorithm
• Estimating the 2nd moment of a stream of length n
• Pick k random positions between 1 and n
• For each such position t, construct a variable X
– Record the count of the element at position t, from position t on, as X.value
• Estimate = (1/k) Σ_{i=1..k} n(2·X_i.value - 1) – explained next
65
Example
• Stream: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
– Stream length n = 15
– a occurs 5 times
– b occurs 4 times
– c and d each occur 3 times
• 2nd moment = 5^2 + 4^2 + 3^2 + 3^2 = 59
66
Random variables
• Each random variable X records:
– X.element: element in X
– X.value: # of occurrences of X.element from time t to n
• X.value = 1 at time t
• X.value++, when another X.element is seen
67
Random variables
• Stream: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b (positions 3, 8, 13 marked)
• Suppose we keep three variables: X1, X2, X3
– Introduced at positions: 3, 8, and 13
• Position 3: X1.element = c, X1.value = 1
• Position 7: X1.value = 2
68
Random variables
• Stream: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b (positions 3, 8, 13 marked)
• Position 8: X2.element = d, X2.value = 1
• Position 11: X2.value = 2
• Position 12: X1.value = 3
• Position 13: X3.element = a, X3.value = 1
• Position 14: X3.value = 2
69
Random variables: final values
• X1.value = 3, X2.value = 2, X3.value = 2
• Estimate of 2nd moment = n(2·X.value - 1)
• Estimate using X1: 15(6-1) = 75
• Estimate using X2 or X3: 15(4-1) = 45
• Avg. = (75+45+45)/3 = 55 (recall actual is 59)
70
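An offline Python sketch of the AMS estimator for checking numbers like the ones above (illustrative: it scans a stored stream to get each X.value; a true streaming version would instead increment X.value as matching elements arrive):

```python
import random

def ams_second_moment(stream, k):
    """AMS estimate of the 2nd moment using k random variables."""
    n = len(stream)
    positions = sorted(random.sample(range(n), k))   # k random start times
    estimates = []
    for t in positions:
        elem = stream[t]
        count = stream[t:].count(elem)               # X.value = occurrences from t on
        estimates.append(n * (2 * count - 1))        # n(2 * X.value - 1)
    return sum(estimates) / k

stream = list("abcbdacdabdcaab")
print(ams_second_moment(stream, k=3))   # the true 2nd moment is 59
```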
Why AMS works?
• Stream: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b (positions 3, 8, 13 marked)
• e(i): element that appears in position i
– A random variable introduced at this position will have X.element = e(i)
• c(i): # of times e(i) appears in positions i..n
– X.value = c(i)
• e(6) = a, c(6) = 4
71
Why AMS works?
• Stream: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b (positions 3, 8, 13 marked)
• e(1) = a, c(1) = 5 = m_a
• e(2) = b, c(2) = 4 = m_b
• e(3) = c, c(3) = 3 = m_c
• e(4) = b, c(4) = m_b - 1
• …
• e(13) = a, c(13) = 2
• e(14) = a, c(14) = 1
• e(15) = b, c(15) = 1
If we can have a lot of random variables (e.g., 15)…
72
Why AMS works?
• Estimate = (1/k) Σ_{i=1..k} n(2·X_i.value - 1)
• E[n(2·X.value - 1)]
– averaging over all positions i between 1 and n
= (1/n) Σ_{i=1..n} n(2·c(i) - 1)
= Σ_{i=1..n} (2·c(i) - 1)
= Σ_x (1 + 3 + ⋯ + (2·m_x - 1))   (grouping positions by the element x seen there)
= Σ_x (m_x)^2
• Note: 1 + 3 + ⋯ + (2m - 1) = m^2
– Σ_{i=1..m_x} (2i - 1) = 2·(m_x(m_x+1)/2) - m_x = (m_x)^2
73
Expectation Analysis
[Figure: stream a a b b b a b a …, with counts 1, 2, 3, …, m_a marked above the positions where a appears]
• 2nd moment is S = Σ_x (m_x)^2
• c_t … number of times the item arriving at time t appears from time t onwards (c_1 = m_a, c_2 = m_a - 1, c_3 = m_b)
• m_i … total count of item i in the stream (we assume the stream has length n)
• E[f(X)] = (1/n) Σ_{t=1..n} n(2·c_t - 1)
= (1/n) Σ_x n·(1 + 3 + 5 + ⋯ + (2·m_x - 1))
– Group the times t by the value x seen there: the time when the last x is seen has c_t = 1, the penultimate x has c_t = 2, …, and the time when the first x is seen has c_t = m_x
74
Expectation Analysis
[Figure: stream a a b b b a b a …, with counts 1, 2, 3, …, m_a]
• E[f(X)] = (1/n) Σ_x n·(1 + 3 + 5 + ⋯ + (2·m_x - 1))
– Little side calculation: 1 + 3 + 5 + ⋯ + (2·m_x - 1) = Σ_{i=1..m_x} (2i - 1) = 2·(m_x(m_x+1)/2) - m_x = (m_x)^2
• Then E[f(X)] = (1/n) Σ_x n·(m_x)^2
• So, E[f(X)] = Σ_x (m_x)^2 = S
• We have the second moment (in expectation)!
75
Higher-Order Moments
• For estimating the kth moment we essentially use the same algorithm but change the estimate:
– For k=2 we used n(2·c - 1)
– For k=3 we use: n(3·c^2 - 3c + 1)   (where c = X.value)
• Why?
– For k=2: Remember we had 1 + 3 + 5 + ⋯ + (2m_x - 1) and we showed the terms 2c - 1 (for c = 1, …, m) sum to m^2
• Σ_{c=1..m} (2c - 1) = Σ_{c=1..m} (c^2 - (c-1)^2) = m^2
• So: 2c - 1 = c^2 - (c-1)^2
– For k=3: c^3 - (c-1)^3 = 3c^2 - 3c + 1
• Generally: Estimate = n(c^k - (c-1)^k)
76
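As a sketch, the generalized single-variable estimate is just the following (hypothetical helper name):

```python
def moment_estimate(n, c, k):
    """AMS-style estimate of the kth moment from one variable: n * (c^k - (c-1)^k)."""
    return n * (c ** k - (c - 1) ** k)

print(moment_estimate(n=15, c=3, k=2))   # 15 * (9 - 4) = 75, same as n(2c - 1)
print(moment_estimate(n=15, c=3, k=3))   # 15 * (27 - 8) = 285
```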
Combining Samples
• In practice:
– Compute f(X) = n(2c – 1) for as many variables X as you can fit in memory
– Take median of averages
• Problem: Streams never end
– We assumed there was a number n,
the number of positions in the stream
– But real streams go on forever, so n is
a variable – the number of inputs seen so far
77
Streams Never End: Fixups
• (1) The variables X have n as a factor –
keep n separately; just hold the count in X
• (2) Suppose we can only store k counts.
We must throw some Xs out as time goes on:
– Objective (while estimating the 2nd moment of a stream of length n): each starting time t is selected with probability k/n
– Solution: (fixed-size sampling, i.e., reservoir sampling!)
• Choose the first k times for k variables
• When the nth element arrives (n > k), choose its time with probability k/n
• If you choose it, throw one of the previously stored variables X out, with equal probability
78
Roadmap • Motivation
• Sampling
– Fixed-portion & fixed-size (reservoir sampling)
• Filtering
– Bloom filter
• Counting
– Estimating # of distinct values, moments
• Sliding window
– Counting # of 1’s in the window
79
Sliding Windows
• A useful model of stream processing is that queries are about a window of length N – the N most recent elements received
• Interesting case: N is so large that the data cannot be stored in memory, or even on disk
– Or, there are so many streams that windows for all cannot be stored
• Amazon example:
– For every product X we keep 0/1 stream of whether
that product was sold in the n-th transaction
– We want to answer queries: how many times have we sold X in the last k sales (e.g., k = 10, 20, or 200; N = 100,000)?
80
Sliding Window: 1 Stream
• Sliding window on a single stream (window size N = 6):
[Figure: the stream q w e r t y u i o p a s d f g h j k l z x c v b n m with a length-6 window shown at four successive positions; Past to the left, Future to the right]
81
Counting Bits (1)
• Problem:
– Given a stream of 0s and 1s
– Be prepared to answer queries of the form How many 1s are in the last N bits?
• Obvious solution:
Store the most recent N bits
– When new bit comes in, discard the N+1st bit
0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 1 0 Suppose N=6
Past Future
82
Counting Bits (2)
• You cannot get an exact answer without storing the entire window
• Real Problem:
What if we cannot afford to store N bits?
– E.g., we’re processing 1 billion streams and N = 1 billion
010011011101010110110110
Past                    Future
• But we are happy with an approximate answer
83
An attempt: Simple solution
• Q: How many 1s are in the last N bits?
• A simple solution that does not really solve our problem: uniformity assumption
[Figure: stream 010011100010100100010110110111001010110011010 with the last N bits marked; Past | Future]
• Maintain 2 counters:
– S: number of 1s from the beginning of the stream
– Z: number of 0s from the beginning of the stream
• How many 1s are in the last N bits? N · S/(S+Z)
• But, what if the stream is non-uniform?
– What if the distribution changes over time?
– Cannot handle various k (window sizes smaller than N)
84
[Datar, Gionis, Indyk, Motwani]
DGIM Method
• DGIM solution that does not assume uniformity
• We store O(log N × log N) bits per stream
– i.e., O((log N)^2)
• Solution gives an approximate answer, never off by more than 50%
– Error factor can be reduced to any fraction > 0, with a more complicated algorithm and proportionally more stored bits
85
DGIM Algorithm
• Storage: O((log N)^2) bits
• Error rate for the queries ≤ 50%
• Key idea:
– Partition N into a (small) set of buckets
– Remember count for each bucket
– Use the counts to approximate answers to queries
88
DGIM: Timestamps
• Each bit in the stream has a timestamp,
starting 1, 2, …
• Record timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log2 N) bits
90
Bucket
• Each bucket represents a sequence of bits in the window
– It does not store the actual bits
– Rather, a timestamp and the # of 1’s in the sequence
• Timestamp of bucket
– Timestamp of its end time
• Bucket size = # of 1’s in the bucket
– Always some power of 2: 1, 2, 4, …
91
Representing a Stream by Buckets • Either one or two buckets with the same
power-of-2 number of 1s
• Buckets do not overlap in timestamps
• Buckets are sorted by size
– Earlier buckets are not smaller than later buckets
• Buckets disappear when their end-time is > N time units in the past
93
Example buckets
[Figure: a bit stream within a window of size N = 25, partitioned into buckets: one size-8 bucket, two size-4 buckets, one size-2 bucket, and two size-1 buckets]
94
Example: Bucketized Stream
[Figure: bit stream 1001010110001011010101010101011010101010101110... within a window of size N; buckets marked: at least 1 of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1]
Three properties of buckets that are maintained:
– Either one or two buckets with the same power-of-2 number of 1s
– Buckets do not overlap in timestamps
– Buckets are sorted by size
95
Updating Buckets (1)
• When a new bit comes in, drop the last (oldest) bucket if its end-time is prior to N time units before the current time
• 2 cases: Current bit is 0 or 1 • If the current bit is 0:
no other changes are needed
96
Updating Buckets (2)
• If the current bit is 1:
– (1) Create a new bucket of size 1, for just this bit • End timestamp = current time
– (2) If there are now three buckets of size 1, combine the oldest two into a bucket of size 2
– (3) If there are now three buckets of size 2, combine the oldest two into a bucket of size 4
– (4) And so on …
97
Example: Updating Buckets
[Figure: successive bucketized states of the stream]
• Current state of the stream
• A bit of value 1 arrives: a new size-1 bucket is created; two existing size-1 buckets get merged into a size-2 bucket
• Next a bit 1 arrives and a new size-1 bucket is created; then a 0 comes, then a 1: buckets get merged again, cascading up the sizes
• State of the buckets after merging
98
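A Python sketch of the bucket maintenance just described (illustrative, not the course's reference code; buckets are stored newest-first as [end-timestamp, size] pairs):

```python
class DGIM:
    """Sketch of DGIM bucket maintenance for counting 1s in the last n bits."""

    def __init__(self, n):
        self.n = n              # window size
        self.time = 0           # current timestamp
        self.buckets = []       # [end_timestamp, size] pairs, newest first

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket if its end-time has fallen out of the window
        if self.buckets and self.buckets[-1][0] <= self.time - self.n:
            self.buckets.pop()
        if bit == 1:
            # (1) Create a new size-1 bucket ending at the current time
            self.buckets.insert(0, [self.time, 1])
            # (2)-(4) While three buckets share a size, merge the oldest two
            i = 0
            while i + 2 < len(self.buckets):
                if self.buckets[i][1] == self.buckets[i + 2][1]:
                    oldest = self.buckets.pop(i + 2)
                    self.buckets[i + 1][1] += oldest[1]   # doubled size, keeps the newer end-time
                else:
                    i += 1

d = DGIM(n=10)
for b in [0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1]:
    d.add(b)
print(d.buckets)   # e.g., [[12, 1], [11, 1], [9, 2], [7, 4]]
```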
DGIM: Buckets
• A bucket in the DGIM method is a record consisting of:
– (A) The timestamp of its end [O(log2 N) bits]
– (B) The number of 1s between its beginning and end [O(log2 log2 N) bits]
• To store X, the # of 1s in a bucket, we would need log2 X bits; X ≤ N, so at most log2 N bits
• Constraint on buckets: the number of 1s must be a power of 2
– That explains the O(log2 log2 N) in (B) above: we only store the exponent, which is at most log2 N
[Figure: bucketized bit stream 1001010110001011010101010101011010101010101110101010111010100010110010 within a window of size N]
99
Storage for each bucket
• Timestamp: 0..N-1 (N is the window size)
– So log2 N bits
• Number of 1’s: 2^j ≤ N, so j ≤ log2 N
– So log2(log2 N) bits for representing j
– Can ignore; too small compared to the timestamp requirement
• Each bucket requires ≈ O(log2 N) bits
100
Storage requirements of DGIM
• At most 2·j buckets
– of sizes: 2^j, 2^(j-1), …, 1 (at most two of each size)
• Size of the largest bucket: 2^j ≤ N
– So j ≤ log2 N, and there are at most 2·log2 N buckets
• Total storage: O(log2 N · log2 N) = O((log N)^2)
– Recall that each bucket requires O(log2 N) bits
101
How to Query?
• To estimate the number of 1s in the most recent N bits:
1. Sum the sizes of all buckets but the last (note “size” means the number of 1s in the bucket)
2. Add half the size of the last bucket
Remember: We do not know how many 1s of the last bucket are still within the wanted window
102
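A sketch of this query rule, given the sizes of the buckets currently within the window, newest first (hypothetical helper name):

```python
def dgim_estimate(bucket_sizes):
    """Estimate of # of 1s in the window: sum the sizes of all buckets but the
    oldest, then add half the size of the oldest bucket."""
    if not bucket_sizes:
        return 0
    *recent, oldest = bucket_sizes          # newest first, so the oldest is last
    return sum(recent) + oldest // 2

# e.g., buckets of sizes 1, 1, 2, 4, 4, 8, 16 (newest to oldest)
print(dgim_estimate([1, 1, 2, 4, 4, 8, 16]))   # 1+1+2+4+4+8 + 16//2 = 28
```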
Example: Bucketized Stream
[Figure: bit stream 1001010110001011010101010101011010101010101110101010111010100010110010 within a window of size N; buckets: at least 1 of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1]
103
How close is the estimate?
• Suppose # of 1’s in the last bucket b = 2^j
• Case 1: estimate < actual value c
– Worst case: all 1’s in bucket b are within range
– The estimate counts only half of b, so it missed at most half of 2^j, i.e., 2^(j-1)
– What’s the minimum c? c has at least one 1 from b, plus at least one bucket of each lower power: 2^(j-1) + 2^(j-2) + … + 1 = 2^j - 1
• (recall 1 + 2 + 4 + … + 2^(r-1) = 2^r - 1)
– So c ≥ 1 + (2^j - 1) = 2^j, and we missed at most 2^(j-1) ≤ 0.5·c
– So the estimate missed at most 50% of c
– That is, the estimate is at least 50% of c
104
How close is the estimate?
• Suppose # of 1’s in the last bucket b = 2^j
• Case 2: estimate > actual value c
– Worst case: only the rightmost bit of b is within range
– And there is only one bucket of each smaller power
– Minimum c = 1 + 2^(j-1) + 2^(j-2) + … + 1 = 1 + 2^j - 1 = 2^j
– Estimate = 2^(j-1) (half of last bucket) + 2^(j-1) + 2^(j-2) + … + 1 = 2^(j-1) + 2^j - 1 = c + 2^(j-1) - 1
– Note: 2^(j-1) + 2^(j-2) + … + 1 = 2^j - 1
– So the estimate is no more than 50% greater than c
105
Extensions
• Can we use the same trick to answer queries “How many 1’s in the last k bits?” where k < N?
– A: Find the earliest bucket B that overlaps with the last k bits. The number of 1s is the sum of the sizes of the more recent buckets + 1/2 the size of B
1001010110001011010101010101011010101010101110101010111010100010110010 k
106