Lecture 4: Mining Data Streams
1
CMSC5741 Big Data Tech. & Apps.
Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
Motivation
In many data mining situations, we know the entire data set in advance
Stream Management is important when the input rate is controlled externally:
Google queries
Twitter and Facebook status updates
We can think of the data as infinite and non-stationary (the distribution changes over time)
2
Explanation for the first sentence:
In many data mining situations we know the entire data set in advance. With a data stream, by contrast, the data keep arriving and we can never hold the entire data set.
Google Trends
When we search for “big data”
3
March 2018
Explanation for the motivation of this figure:
From a technical perspective, Google Trends finds the frequent items in a huge stream of queries over a specific interval.
Election 2016: Trump vs Clinton
4
5
The Stream Model
Input tuples (e.g., [user, query, time]) enter at a rapid rate, at one or more input ports
The system cannot store the entire stream accessibly
How do you make critical calculations about the stream using a limited amount of (secondary) memory?
6
7
[Figure: a generic stream-management system. Streams entering over time (e.g., . . . 1, 5, 2, 7, 0, 9, 3; . . . a, r, v, t, y, h, b; . . . 0, 0, 1, 0, 1, 1, 0) feed a Processor with Limited Working Storage; Standing Queries and Ad-Hoc Queries run against it and produce Output; the raw stream may also go to Archival Storage.]
Explanation for the input:
Items of different data types arrive in the streams, such as integers (1, 5, 2, …), characters (a, r, v, …), or bits (0, 0, 1, …).
Problems on Data Streams
Types of queries one wants to answer on a stream:
Sampling data from a stream
Construct a random sample
Queries over sliding windows
Number of items of type x in the last k elements of the stream
Filtering a data stream
Select elements with property x from the stream
8
Problems on Data Streams
Types of queries one wants to answer on a stream:
Counting distinct elements
Number of distinct elements in the last k elements of the stream
Estimating moments
Estimate avg./std. dev. of last k elements
Finding frequent elements
9
Applications (1)
Mining query streams
Google wants to know what queries are more frequent today than yesterday
Mining click streams
Yahoo! wants to know which of its pages are getting an unusual number of hits in the past hour
Mining social network news feeds
E.g., look for trending topics on Twitter, Facebook
10
Applications (2)
Sensors Networks
Many sensors feeding into a central controller
Telephone call records
Data feeds into customer bills as well as settlements between telephone companies
IP packets can be monitored at a switch
Gather information for optimal routing
Detect denial-of-service attacks
11
Outline
Sampling from a Data Stream
Queries over a (long) Sliding Window
Filtering Data Streams
Counting Distinct Elements
Computing Moments
Counting Itemsets
12
Outline
Sampling from a Data Stream
Queries over a (long) Sliding Window
Filtering Data Streams
Counting Distinct Elements
Computing Moments
Counting Itemsets
13
Sampling from a Data Stream
Since we cannot store the entire stream, one obvious approach is to store a sample
Two different problems:
Sample a fixed proportion of elements in the stream (say 1 in 10)
Maintain a random sample of fixed size over a potentially infinite stream (see the reservoir-sampling sketch below)
At any “time” t we would like a random sample of n elements; for all t, each element seen so far should have an equal probability of being sampled
14
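A minimal Python sketch of the fixed-size case, known as reservoir sampling (illustrative code, not from the slides; the function name is made up):

```python
import random

def reservoir_sample(stream, n):
    """Keep a uniform random sample of n elements from a stream.

    After seeing t elements, each of them is in the sample with
    equal probability n/t."""
    sample = []
    for t, element in enumerate(stream, start=1):
        if t <= n:
            sample.append(element)                 # fill the reservoir first
        elif random.random() < n / t:              # keep the t-th element w.p. n/t
            sample[random.randrange(n)] = element  # evict a uniform victim
    return sample

print(reservoir_sample(range(10_000), 5))  # five uniformly chosen integers
```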
Sampling a Fixed Proportion
Problem 1: Sampling fixed proportion
Scenario: Search engine query stream
Stream of tuples: (user, query, time)
Answer questions such as: How often did a user run the same query on two different days?
Have space to store 1/10th of query stream
Naive solution:
Generate a random integer in [0..9] for each query
Store the query if the integer is 0, otherwise discard
15
Problem with Naive Approach
Simple question: What fraction of queries by an average user are duplicates?
Suppose each user issues s queries once and d queries twice (total of s+2d queries), sample rate is p
Correct answer: d/(s+d)
Sample will contain s·p of the singleton queries and 2·d·p of the duplicate queries at least once
But only d·p^2 pairs of duplicates
d·p^2 = p · p · d
Of the d “duplicates”, 2p(1−p)d appear exactly once
2p(1−p)d = (p(1−p) + (1−p)p) · d
So the sample-based answer is: d·p^2 / (s·p + d·p^2 + 2p(1−p)d)
16
Explanation for how to get d/(s+d) and d/(10s+19d):
Each user issued the s queries once and the d queries twice, so in total each user issued s + d + d = s + 2d queries
Fraction of an average user’s queries that are duplicates: duplicated queries / total distinct queries = d/(s+d)
For the sampled stream (p = 1/10): duplicated queries / total distinct queries = (d/100) / (s/10 + d/100 + 18d/100) = d/(10s + 19d)
Problem with Naive Approach
A concrete example:
Query stream: 1, 2, 3, 4, 5, 6, 7, 7, 8, 8
Sample 50% of the queries in this case
Correct answer: 2/(6+2) = 25% are duplicates
If our sample is 1, 2, 3, 4, 5, then 0% are duplicates
If our sample is 6, 7, 7, 8, 8, then 67% are duplicates
What is the expectation of fraction of duplicates if we use sample-based method?
17
Answer: 1/9
Solution:
sp = 3 singleton queries appear in the sample
dp^2 = 1/2 pairs of duplicates appear twice
2p(1−p)d = 1 duplicate appears exactly once
So the answer is (2 · 0.5 · 0.5) / (6 · 0.5 + 2 · 0.5 · 0.5 + 2 · 0.5 · 0.5 · 2) = (1/2) / (3 + 1/2 + 1) = 1/9 ≈ 11%
Solution: Sample Users
Pick 1/10th of users and take all their searches in the sample
Use a hash function that hashes the user name or user id uniformly into 10 buckets
Generalized: to pick 1/d-th of the users, use d buckets (see the sketch below)
18
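A small Python sketch of this key-based sampling (illustrative helper names; it also covers the generalized a/b case on the next slide):

```python
import hashlib

def in_sample(key, a=1, b=10):
    """Keep a tuple iff its key hashes into the first a of b buckets,
    i.e., sample an a/b fraction of the keys (1/10 by default)."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % b < a

# Keep all queries of roughly 1/10th of the users (key = user)
stream = [("alice", "big data", 1), ("bob", "weather", 2)]
sampled = [t for t in stream if in_sample(t[0])]
```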
Generalized Solution
Stream of tuples with keys:
Key is some subset of each tuple’s components
E.g., tuple is (user, search, time); key is user
Choice of key depends on application
To get a sample of size a/b (a < b): hash each tuple’s key uniformly into b buckets; accept the tuple into the sample if the hash value is less than a

Counting Bits in a Sliding Window
Queries concern a window of the N most recent elements: given a stream of 0s and 1s, how many 1s are among the last N bits, when N is too large for the window to be stored?
The DGIM method below gives an answer that is never off by more than 50%; the error can be reduced to any fraction ε > 0, with a more complicated algorithm and proportionally more stored bits
30
[Datar, Gionis, Indyk, Motwani]
DGIM Method
Idea: Exponential Windows
Solution that doesn’t (quite) work:
Summarize exponentially increasing regions of the stream, looking backward
Drop small regions if they begin at the same point as a larger region
31
32
An Exponential Window Example
[Figure: the bit stream 0 1 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 0 1 0, summarized by exponentially growing regions looking back over the last N bits, with counts of 1s 0, 1, 1, 2, 2, 3, 4, 10, …; a window of width 16 has 6 1s]
We can construct the count of the last N bits, except we’re not sure how many of the last 6 are included.
What’s Good?
Stores only O(log^2 N) bits
O(log N) counts of O(log N) bits each
Easy update as more bits enter
Error in count no greater than the number of 1s in the “unknown” area
33
A bucket in the DGIM method is a record consisting of:
The timestamp of its end [O(log N ) bits]
The number of 1’s between its beginning and end [O(log log N ) bits]
So for a bucket, store O(log N ) bits
Store O(log N) buckets, and then we get O(log N) · O(log N) = O(log^2 N)
What’s Not So Good?
As long as the 1s are fairly evenly distributed, the error ratio due to the unknown region is small – no more than 50%
But it could be that all the 1s are in the unknown area at the end
In that case, the error is unbounded
34
Fixup: DGIM Method
Instead of summarizing fixed-length blocks, summarize blocks with specific numbers of 1s
Let the block sizes (number of 1s) increase exponentially
When there are few 1s in the window, block sizes stay small, so errors are small
35
[Datar, Gionis, Indyk, Motwani]
DGIM: Timestamps
Each bit in the stream has a timestamp, starting 1, 2, …
Record timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log₂ N) bits
36
DGIM: Buckets
A bucket in the DGIM method is a record consisting of:
The timestamp of its end [O(log N ) bits]
The number of 1’s between its beginning and end: [O(log log N ) bits]
Constraint on buckets: Number of 1s must be a power of 2
That explains the O(log log N) in (2)
37
Representing a Stream by Buckets
Either one or two buckets with the same power-of-2 number of 1s
Buckets do not overlap in timestamps
Buckets are sorted by size
Earlier buckets are not smaller than later buckets
Buckets disappear when their end-time is > N time units in the past
38
Example: Bucketized Stream
39
Properties we maintain:
Either one or two buckets with the same power-of-2 number of 1s
Buckets do not overlap in timestamp
Buckets are sorted by size
Updating Buckets – (1)
When a new bit comes in, drop the last (oldest) bucket if its end-time is prior to N time units before the current time
2 cases: Current bit is 0 or 1
If the current bit is 0, no other changes are needed
40
Updating Buckets – (2)
If the current bit is 1:
Create a new bucket of size 1, for just this bit
End timestamp = current time
If there are now three buckets of size 1, combine the oldest two into a bucket of size 2
If there are now three buckets of size 2, combine the oldest two into a bucket of size 4
And so on … (a sketch of this update rule follows below)
41
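A compact Python sketch of this bucket maintenance (illustrative; for readability it stores raw timestamps rather than timestamps mod N):

```python
class DGIM:
    """A sketch of the DGIM bucket structure over a window of N bits.

    Buckets are (end_timestamp, size) pairs kept newest-first; every size
    is a power of 2 and each size occurs at most twice."""

    def __init__(self, N):
        self.N = N
        self.t = 0          # timestamp of the most recent bit
        self.buckets = []   # newest first

    def update(self, bit):
        self.t += 1
        # Drop the oldest bucket once its end time falls out of the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return                        # a 0 needs no further changes
        # A 1 starts a new size-1 bucket ending at the current time
        self.buckets.insert(0, (self.t, 1))
        # Cascade: while three buckets share a size, merge the oldest two
        size = 1
        while [s for _, s in self.buckets].count(size) == 3:
            idx = [j for j, (_, s) in enumerate(self.buckets) if s == size]
            newer, older = idx[1], idx[2]           # the two oldest
            self.buckets[newer] = (self.buckets[newer][0], 2 * size)
            del self.buckets[older]
            size *= 2

    def count_ones(self):
        """Estimate the 1s in the last N bits: sum the sizes of all buckets
        but the oldest, plus half the size of the oldest bucket."""
        if not self.buckets:
            return 0
        sizes = [s for _, s in self.buckets]
        return sum(sizes[:-1]) + sizes[-1] // 2
```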
Example
42
Explanation for the example:
This example illustrates how the DGIM algorithm updates buckets when a new element arrives.
First we have the buckets shown in the first figure: two buckets of size 1, one bucket of size 2, two buckets of size 4, and two buckets of size 8. The last bucket should have size 16, but within a finite window of length N we cannot see all of its elements.
In the second figure, a new 1 arrives and forms a bucket of size 1. Now there are three buckets of size 1, so we have to combine the oldest two of them. After updating, we get the third figure.
The third figure simply shows the result of that update: one bucket of size 1, two buckets of size 2, two buckets of size 4, two buckets of size 8, and the last (partial) bucket.
In the fourth figure, three more bits arrive: 101. The first new 1 forms a bucket of size 1, giving two buckets of size 1, two of size 2, two of size 4, two of size 8, and the last bucket; no extra updates are needed. The 0 causes no change. The second new 1 forms another bucket of size 1, so there are three buckets of size 1 and the oldest two must be combined, which yields three buckets of size 2. That state is shown in figure 5.
The fifth figure holds the same elements as figure 4, after combining the oldest two size-1 buckets.
In the sixth figure, because there are now three buckets of size 2, the oldest two are combined into a new bucket of size 4. That creates three buckets of size 4, so combining continues until no size has three buckets.
How to Query?
To estimate the number of 1s in the most recent N bits:
Sum the sizes of all buckets but the last
Add half the size of the last bucket
Remember: we don’t know how many 1s of the last bucket are still within the window
43
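A usage example of the DGIM sketch above (its count_ones method implements exactly this query rule):

```python
d = DGIM(N=16)
for bit in [0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1]:
    d.update(bit)
print(d.count_ones())   # approximate; the true count of 1s here is 7
```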
Example: Bucketized Stream
44
In-Class Practice 1
Go to practice
45
Error Bound: Proof
Suppose the last bucket has size 2^r
Then by assuming 2^(r−1) (i.e., half) of its 1s are still within the window, we make an error of at most 2^(r−1)
Since there is at least one bucket of each of the sizes less than 2^r, the true sum is no less than 1 + 2 + 4 + … + 2^(r−1) = 2^r − 1
Thus, the error ratio is at most 2^(r−1) / (2^r − 1) ≈ 50%
46
Explanation for the proof:
Computation method: estimate = sum of the sizes of all the buckets before the last one + half the size of the last bucket
The situation with the largest error: only one 1 of the last bucket is still inside the window. We add half of the last bucket’s size to the estimate, so we make an error of at most 2^(r−1)
The sum of the sizes of the buckets before the last is exactly what we need. Every bucket size must appear once or twice, so this value is at least 1 + 2 + 4 + … + 2^(r−1) = 2^r − 1, attained when every size appears exactly once
Estimate = 2^(r−1) + 2^r − 1, actual count = 1 + 2^r − 1 = 2^r (there is only one 1 in the last bucket, while we estimated 2^(r−1))
So the error ratio is at most 2^(r−1) / (2^r − 1) ≈ 50%
Extensions (For Thinking)
Can we use the same trick to answer queries “How many 1s in the last k?” where k < N?
Can we handle the case where the stream is not bits, but integers, and we want the sum of the last k?
47
Answer for the 1st question: find the earliest bucket B that overlaps with the last k positions; the number of 1s is the sum of the sizes of the more recent buckets + ½ the size of B
Answer for the 2nd question:
Solution 1: suppose that all integers have at most m bits. Treat the m bits of each integer as m separate streams; use DGIM to count the 1s in each bit-stream; the sum is Σ_i c_i · 2^i, where c_i is the estimated count for the i-th bit (i = 0, …, m−1)
Solution 2: use buckets to keep partial sums
Idea: the sum of the elements in a size-b bucket is at most 2^b (unless the bucket holds only 1 integer)
Reducing the Error
Instead of maintaining 1 or 2 buckets of each size, we allow either r−1 or r buckets of each size, for a chosen r > 2
Except for the largest size buckets; we can have any number between 1 and r of those
Error is at most 1/r
By picking r appropriately, we can tradeoff between number of bits and the error
48
Outline
Sampling from a Data Stream
Queries over a (long) Sliding Window
Filtering Data Streams
Counting Distinct Elements
Computing Moments
Counting Itemsets
49
Filtering Data Streams
Each element of the data stream is a tuple (a finite list of elements)
Given a list of keys S
How to determine which elements of stream have keys in S?
Obvious solution: Hash table
But suppose we do not have enough memory to store all of S in a hash table
E.g., we might be processing millions of filters on the same stream
50
Applications
Example: Email spam filtering
We know 1 billion “good” email addresses
If an email comes from one of these, it is NOT spam
Publish-subscribe systems
People express interest in certain sets of keywords
Determine whether each message matches user’s interest
51
First Cut Solution – (1)
Given a set of keys S that we want to filter on
Create a bit array B of n bits, initially all 0s
Choose a hash function h with range [0,n)
Hash each member s of S to one of the n buckets, and set that bit to 1, i.e., B[h(s)] = 1
Hash each element a of the stream and output only those that hash to a bit that was set to 1
Output a if B[h(a)] == 1
52
First Cut Solution – (2)
Creates false positives but no false negatives
If the item is in S we surely output it, if not we may still output it
53
First Cut Solution – (3)
|S| = 1 billion email addresses
|B| = 1GB = 8 billion bits
If the email address is in S, then it surely hashes to a bucket that has the bit set to 1, so it always gets through (no false negatives)
Approximately 1/8 of the bits are set to 1, so about 1/8th of the addresses not in S get through to the output (false positives)
Actually, less than 1/8th, because more than one address might hash to the same bit
54
Analysis: Throwing Darts
More accurate analysis for the number of false positives
Consider: If we throw m darts into n equally likely targets, what is the probability that a target gets at least one dart?
In our case:
Targets = bits/buckets
Darts = hash values of items
55
Analysis: Throwing Darts – (2)
We have m darts, n targets
What is the probability that a target gets at least one dart?
1 − (1 − 1/n)^m = 1 − ((1 − 1/n)^n)^(m/n) ≈ 1 − e^(−m/n)
56
Analysis: Throwing Darts – (3)
Fraction of 1s in the array B == probability of false positive == 1 − e^(−m/n)
Example: 10^9 darts, 8·10^9 targets
Fraction of 1s in B = 1 − e^(−1/8) = 0.1175
Compare with our earlier estimate: 1/8 = 0.125
Can we improve this error?
57
Bloom Filter
Consider: |S| = m, |B| = n
Use k independent hash functions h_1, …, h_k
Initialization:
Set B to all 0s
Hash each element s ∈ S using each hash function h_i, and set B[h_i(s)] = 1 (for each i = 1, …, k)
Run-time:
When a stream element with key x arrives
If B[h_i(x)] = 1 for all i = 1, …, k, then declare that x is in S
Otherwise discard the element x (a sketch follows below)
58
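A minimal Bloom-filter sketch in Python (illustrative; k salted MD5 digests stand in for k independent hash functions, and one byte per bit keeps the code simple):

```python
import hashlib

class BloomFilter:
    """n-bit filter with k hash functions derived from salted MD5 digests."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = bytearray(n)        # one byte per bit, for simplicity

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False => definitely not in S; True => in S, or a false positive
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter(n=10_000, k=6)
bf.add("good@example.com")
print(bf.might_contain("good@example.com"))   # True (no false negatives)
print(bf.might_contain("spam@example.com"))   # almost surely False
```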
Bloom Filter – Analysis
What fraction of the bit vector B are 1s?
Throwing k·m darts at n targets
So the fraction of 1s is (1 − e^(−km/n))
But we have k independent hash functions, and we report x ∈ S only if all k bits are 1
So, false positive probability = (1 − e^(−km/n))^k
59
Bloom Filter – Analysis (2)
m = 1 billion, n = 8 billion
k = 1: (1 − e^(−1/8)) = 0.1175
k = 2: (1 − e^(−1/4))^2 = 0.0493
What happens as we keep increasing k?
“Optimal” value of k: k = (n/m) · ln 2
E.g.: k = 8 · ln 2 = 5.54 ≈ 6 (see the numeric check below)
60
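A quick numeric check of these figures (assuming the false-positive formula above):

```python
from math import exp, log

m, n = 1_000_000_000, 8_000_000_000      # 1B keys, 8B bits (1 GB)
for k in range(1, 9):
    fp = (1 - exp(-k * m / n)) ** k      # (1 - e^(-km/n))^k
    print(k, round(fp, 4))               # minimized near k = 6
print("optimal k:", n / m * log(2))      # 8 ln 2 = 5.54
```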
Bloom Filter: Wrap-up
Bloom filters guarantee no false negatives, and use limited memory
Great for pre-processing before more expensive checks
E.g., Google’s BigTable, Squid web proxy
Suitable for hardware implementation
Hash function computations can be parallelized
61
Outline
Sampling from a Data Stream
Queries over a (long) Sliding Window
Filtering Data Streams
Counting Distinct Elements
Computing Moments
Counting Itemsets
62
Counting Distinct Elements
Problem:
Data stream consists of a universe of elements chosen from a set of size N
Maintain a count of the number of distinct elements seen so far
Obvious approach:
Maintain the set of elements seen so far
63
Applications
How many different words are found among the Web pages being crawled at a site?
Unusually low or high numbers could indicate artificial pages (spam?)
How many different Web pages does each customer request in a week?
64
Using Small Storage
Real Problem: What if we do not have space to store the complete set?
Estimate the count in an unbiased way
Accept that the count may be in error, but limit the probability that the error is large
65
Flajolet-Martin Approach
Pick a hash function h that maps each of the N elements to at least log₂ N bits
For each stream element a, let r(a) be the number of trailing 0s in h(a)
r(a) = position of first 1 counting from the right
Record R = the maximum r(a) seen
R = max_a r(a), over all the items a seen so far
Estimated number of distinct elements = 2^R (see the sketch below)
Based on a variant due to AMS (Alon, Matias, and Szegedy)
66
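A small Python sketch of the Flajolet-Martin estimate (illustrative; an MD5 digest plays the role of h, and the salt parameter lets us draw further hash functions later):

```python
import hashlib

def tail_length(x):
    """r(a): number of trailing 0s in the binary form of x (0 is treated
    as tail length 0 here; it essentially never occurs for real digests)."""
    return (x & -x).bit_length() - 1 if x else 0

def fm_estimate(stream, salt=""):
    """Flajolet-Martin: return 2^R, where R is the max tail length seen."""
    R = 0
    for a in stream:
        h = int(hashlib.md5(f"{salt}{a}".encode()).hexdigest(), 16)
        R = max(R, tail_length(h))
    return 2 ** R

print(fm_estimate(["a", "b", "c", "a", "b", "a"]))  # ~3 distinct, very noisy
```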
Why It Works
The probability that a given h(a) ends in at least r 0s is 2^(−r)
h(a) hashes elements uniformly at random
Probability that a random number ends in at least r 0s is 2^(−r)
If there are m different elements, the probability that R ≥ r is 1 − (1 − 2^(−r))^m
(1 − 2^(−r)) is the prob. that a given h(a) ends in fewer than r 0s; (1 − 2^(−r))^m is the prob. that all h(a)’s end in fewer than r 0s
67
Why It Works – (2)
Note: (1 − 2^(−r))^m ≈ e^(−m·2^(−r))
Prob. of NOT finding a tail of length r is (1 − 2^(−r))^m:
If m << 2^r, then the prob. tends to 1
(1 − 2^(−r))^m ≈ e^(−m·2^(−r)) → 1 as m/2^r → 0
So, the probability of finding a tail of length r tends to 0
If m >> 2^r, then the prob. tends to 0
(1 − 2^(−r))^m ≈ e^(−m·2^(−r)) → 0 as m/2^r → ∞
So, the probability of finding a tail of length r tends to 1
Thus, 2^R will almost always be around m
68
Why It Doesn’t Work
E[2^R] is actually infinite
Probability halves when R → R + 1, but the value doubles
Workaround involves using many hash functions and getting many samples
How are samples combined?
Average? What if one very large value?
Median? All values are a power of 2
Solution:
Partition your samples into small groups
Take the average of groups
Then take the median of the averages (see the sketch below)
69
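Continuing the sketch above, combining many estimates by the median of group averages (for brevity this re-reads the stream once per hash function; a real one-pass implementation would update all sketches per arriving element):

```python
from statistics import median

def fm_combined(stream, groups=5, per_group=4):
    """Median of group averages over groups*per_group independent
    FM estimates, each drawn with a different salted hash function."""
    stream = list(stream)
    averages = []
    for g in range(groups):
        estimates = [fm_estimate(stream, salt=f"{g}-{j}:")
                     for j in range(per_group)]
        averages.append(sum(estimates) / per_group)
    return median(averages)

print(fm_combined(["a", "b", "c", "a", "b", "a"]))
```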
In-Class Practice 2
Go to practice
70
One-Slide Takeaway
Sampling from a data stream
How to get a fixed-proportion or a fixed-size sample
Queries over a (long) sliding window
Understand the DGIM algorithm
Filtering data streams
Understand the first-cut solution and the Bloom filter
Counting distinct elements
Understand the Flajolet-Martin approach
Appendix: computing moments and counting itemsets
71
References
Book:
Mining of Massive Datasets
Massive Online Analysis (MOA) Software:
72
Appendix
Sampling from a Data Stream
Queries over a (long) Sliding Window
Filtering Data Streams
Counting Distinct Elements
Computing Moments
Counting Itemsets
73
Generalization: Moments
Suppose a stream has elements chosen from a set of N values
Let m_a be the number of times value a occurs
The kth moment is Σ_a (m_a)^k, the sum over all values a
74
Special Cases
0th moment = number of different elements
The problem just considered
1st moment = count of the number of elements = the length of the stream
Easy to compute
2nd moment = surprise number = a measure of how uneven the distribution is
75
Example: Surprise Number
Stream of length 100; 11 values appear
Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9 Surprise # = 910
Item counts: 90, 1, 1, 1, 1, 1, 1, 1 ,1, 1, 1 Surprise # = 8,110
76
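A two-line check of these surprise numbers (illustrative):

```python
def surprise(counts):
    """2nd moment (surprise number): sum of squared item counts."""
    return sum(m * m for m in counts)

print(surprise([10] + [9] * 10))   # 910
print(surprise([90] + [1] * 10))   # 8110
```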
AMS Method
Works for all moments
Gives an unbiased estimate
We’ll just concentrate on 2nd moment
Based on calculation of many random variables X:
For each random variable X, we store X.el and X.val
Each random variable represents one sampled item
Note each X requires a count in main memory, so the number of Xs is limited
77
Due to AMS (Alon, Matias, and Szegedy)
One Random Variable
Assume stream has length n
Pick a random time to start, so that any time is equally likely
Let the chosen time have element a in the stream
X = n · (twice the number of a’s in the stream starting at the chosen time − 1)
Note: store n once, plus a count of a’s for each X
78
Expected Value of X
2nd moment is Σ_a (m_a)^2
E[X] = (1/n) Σ_{all times t} n · (twice the number of times the stream element at time t appears from that time on − 1)
= Σ_a (1/n) · n · (1 + 3 + 5 + … + (2m_a − 1))
(group times by the value seen: the time when the last a is seen contributes 1, the time when the penultimate a is seen contributes 3, …, the time when the first a is seen contributes 2m_a − 1)
= Σ_a (m_a)^2
79
Combining Samples
One random variable represents only one sampled item; we should run many concurrent samples (see the sketch below)
Compute as many variables X as can fit in available memory
Average them in groups
Take median of averages
Proper balance of group sizes and number of groups assures not only correct expected value, but expected error goes to 0 as number of samples gets large
80
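A Python sketch of the whole AMS procedure for a stream of known length n (illustrative; variable and parameter names are made up):

```python
import random
from statistics import median

def ams_second_moment(stream, num_vars=60, groups=6):
    """AMS estimate of the 2nd moment for a stream of known length n."""
    n = len(stream)
    xs = []
    for _ in range(num_vars):
        start = random.randrange(n)            # uniform random start time
        el = stream[start]                     # X.el
        count = sum(1 for a in stream[start:] if a == el)
        xs.append(n * (2 * count - 1))         # X.val = n * (2c - 1)
    # Average within groups, then take the median of the averages
    size = num_vars // groups
    averages = [sum(xs[g * size:(g + 1) * size]) / size
                for g in range(groups)]
    return median(averages)

stream = [1] * 10 + [2] * 9 + [3] * 9          # true 2nd moment: 262
print(ams_second_moment(stream))
```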
Problem: Streams Never End
We assumed there was a number n, the number of positions in the stream
But real streams go on forever, so n is a variable – the number of inputs seen so far
81
Stream Never End: Fixups
The variables X have n as a factor – keep n separately; just hold the count in X
Suppose we can only store k counts. We must throw some X ’s out as time goes on
Objective: each starting time t is selected with probability k/n
Solution: (fixed-size sampling!)
Choose the first k times for k variables
When the nth element arrives (n > k), choose it with probability k/n
If you choose it, throw one of the previously stored variables out, with equal probability
82
Appendix
Sampling from a Data Stream
Queries over a (long) Sliding Window
Filtering Data Streams
Counting Distinct Elements
Computing Moments
Counting Itemsets
83
Counting Itemsets
New Problem: Given a stream, which items appear more than s times in the window?
Possible solution: Think of the stream of baskets as one binary stream per item
1 = item present; 0 = not present
Use DGIM to estimate counts of 1s for all items
84
Extensions
In principle, you could count frequent pairs or even larger sets the same way
One stream per itemset
Drawbacks:
Only approximate
Number of itemsets is way too big
85
Exponentially Decaying Windows
Exponentially decaying windows: A heuristic for selecting likely frequent itemsets
What are “currently” most popular movies?
Instead of computing the raw count in last N elements
Compute a smooth aggregation over the whole stream
If the stream is a_1, a_2, … and we are taking the sum of the stream, take the answer at time t to be: Σ_{i=1}^{t} a_i · (1−c)^(t−i)
c is a constant, presumably tiny, like 10^(−6) or 10^(−9)
When a new element a_{t+1} arrives: multiply the current sum by (1−c) and add a_{t+1}
86
Example: Counting Items
If each a_i is an “item”, we can compute the characteristic function of each possible item x as an exponentially decaying window (E.D.W.)
That is: Σ_{i=1}^{t} δ_i · (1−c)^(t−i), where δ_i = 1 if a_i = x, and 0 otherwise
Imagine that for each item x we have a binary stream (1 … x appears, 0 … x does not appear)
New item x arrives:
Multiply all counts by (1-c)
Add +1 to count for x
Call this sum the “weight” of item x (see the sketch below)
87
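A sketch of this decaying count in Python (illustrative; for clarity it rescales every weight on each arrival, where a real implementation would decay lazily):

```python
def decayed_counts(stream, c=1e-3, threshold=0.5):
    """Exponentially decaying weight per item; drop weights below 1/2."""
    weights = {}
    for x in stream:
        # Decay every current weight by (1 - c), dropping tiny ones ...
        weights = {k: w * (1 - c) for k, w in weights.items()
                   if w * (1 - c) >= threshold}
        # ... then add 1 for the arriving item
        weights[x] = weights.get(x, 0.0) + 1.0
    return weights

stream = ["titanic"] * 5 + ["avatar"] * 3 + ["titanic"] * 2
print(decayed_counts(stream))   # 'titanic' carries the larger weight
```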
Sliding Versus Decaying Windows
88
Important property: the sum over all weights is 1/c
Counting Items
Suppose we want to find those items of weight > ½
Important property: the sum over all weights is 1/c
Thus:
There cannot be more than 2/c items with weight of ½ or more
So, 2/c is a limit on the number of movies being counted at any time
89
Extension to Larger Itemsets
Count (some) itemsets in an E.D.W.
Problem: Too many itemsets to keep counts of all of them in memory
When a basket B comes in:
Multiply all counts by (1-c)
For uncounted items in B, create new count
Add 1 to count of any item in B and to any itemset contained in B that is already being counted
Drop counts < ½
Initiate new counts (next slide)
90
Initiation of New Counts
Start a count for an itemset S if every proper subset of S had a count prior to the arrival of basket B
Intuitively: If all subsets of S are being counted this means they are “frequent/hot” and thus S has a potential to be “hot”
Example
Start counting {i, j} iff both i and j were counted prior to seeing B
Start counting {i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B
91
How Many Counts?
Counts for single items < (2/c) * (average number of items in a basket)
Counts for larger itemsets = ??
But we are conservative about starting counts of large sets
If we counted every set we saw, one basket of 20 items would initiate 1M counts
92
In-Class Practice 1
There are several ways that the bit-stream 1001011011101 could be partitioned into buckets. Find all of them.
93
In-Class Practice 2
Suppose our stream consists of the integers 3, 1, 4, 1, 5, 9, 2, 6, 5. Our hash functions will all be of the form h(x) = (ax + b) mod 32 for some a and b. You should treat the result as a 5-bit binary integer. Determine the tail length for each stream element and the resulting estimate of the number of distinct elements if the hash function is:
94