7CCSMBDT – Big Data Technologies Week 9
Grigorios Loukides, PhD (grigorios.loukides@kcl.ac.uk)
Spring 2016/2017
1
Today
More algorithms for streams:
(1) Filtering a data stream: Bloom filters
Select elements with property x from stream
(2) Counting distinct elements: Flajolet-Martin
Number of distinct elements in the last k elements of the stream
(3) Estimating moments: AMS method Estimate std. dev. of last k elements
2
Filtering data streams
Each element of data stream is a tuple e.g., (user1, item1),…
Problem
Given a list of keys S (e.g., [user100, user150]) Determine which tuples of stream are in S
Obvious solution: Hash table
But suppose we do not have enough memory to
store all of S in a hash table
E.g., stream may contain millions of tuples WHY?
3
Filtering data streams: application1
Mobile push notifications to app users
Routed to users based on tokens generated by mobile platforms
Token size up to 4kb
A platform generates hundreds of
millions of tokens
https://clevertap.com/push-notifications/
4
Filtering data streams: application 1
The problem
An app user expresses interest for some apps:
[token1,token2] corresponds to [app1, app2]
Stream of tokens is being consumed [token1, token3, token 5, token100, ….
If token1 or token2 in the stream, send notification to the user
5
Filtering data streams: application 2
The problem
Business receives many emails, same are spam
Users a white-list approach to identify spam a list of 1 billion email addresses that are good [abc@gmail.com , abd@gmail.com, …
Stream of email addresses is consumed
[cnn@gmail.com, xyz@gmail.com, abc@gmail.com, ….
If an address in the stream is in the white-list, forward it to the user
6
First idea
Given a set of keys S that we want to filter
E.g., white-list of emails
1. Create a bit array B of n bits, initially all 0s 2. Choose a hash function h with range [0,n) 3. Hash each member of s S to one of
n buckets, and set that bit to 1, i.e., B[h(s)]=1 4. Hash each element a of the stream with h
and output a if it hashes to a bit of B that is 1 i.e., Output a if B[h(a)] == 1
7
First idea
If an element is in S, it will hash to a bucket that has its bit set to 1, so it will always get through
(no false negatives)
good emails will always be forwarded
If an element is NOT in S, what is the probability that it will get through?
(false positives)
bad emails may also be forwarded Let’s compute this probability.
8
First idea
If an element is NOT in S, what is the probability that it will get through?
Probability that bucket hashes an element: 1 𝑛1
Probability that bucket does not hash an element: 1 − 𝑛
Probability that bucket does not hash any element after hashing
all |S| stream elements: 1 − 1 |𝑆| 𝑛
Probability that bucket hashes at least one element after hashing all |S| stream elements: 1 − 1 − 1 |𝑆| .
Using
𝑛
𝑆 𝑛⋅𝑆 𝑆
11𝑛1𝑛−𝑆 1− 1−𝑛 =1− 1−𝑛 ≈1− 𝑒 =1−𝑒 𝑛
9
Bloom filter
Probabilistic (approximate) set membership test
Given a set S = {x1,x2,…,xn}, is y in S ?
Construct a data structure to answer that is:
Fast (Faster than searching through S).
Small (Smaller than explicit representation).
To obtain speed and size improvements, allow some probability of error.
False positives: y S but we report y S False negatives: y S but we report y S
Bloom filter
Data structure for probabilistic set membership testing Zero false negatives
Nonzero false positives
10
Bloom filter
Given a set of keys S that we want to filter
1. Create a bit array B of n bits, initially all 0
2. Choose k hash functions h1 ,…, hk with range [0,n)
3. Hash each member of s S
use k hash functions, h𝑖 𝑠 , 𝑖 ∈ 1, 𝑘 , which map s into random numbers
uniform in [1, n] (need modulo n if the hash function outputs larger numbers)
set the elements 𝐵 h1 𝑠 , … , 𝐵 h𝑘 𝑠 to 1
4. When a stream element with key y arrives use k hash functions
if 𝐵 h1 𝑦 , … , 𝐵 h𝑘 𝑦 are all 1, output y else discard y
11
Bloom filter
Bloom filter example Keys in S: a, b, y, l
Stream keys q, z
Tarkoma et al. Theory and practice of bloom filters for distributed systems. DOI: 10.1109/SURV.2011.031611.00024
12
False positive
Bloom filter
Bloom filter key parameters
Size of stream |S|
Size of bloom filter (array B): n
Number of hash functions: k
Impact of increasing key parameters
|S|: higher false positive rate
n: more space, lower false positive rate
k: more computation, lower false positive rate (as 𝑘 → 𝑘𝑜𝑝𝑡)
Tarkoma et al. Theory and practice of bloom filters for distributed systems. DOI: 10.1109/SURV.2011.031611.00024
13
Bloom filter
(Lower bound)* of false positive ratio (accurate for n>>k) Prh𝑖 𝑥 =1 =1⇒Prh𝑖 𝑥 =0 =1−1
𝑘
After adding all |S| elements of S, the probability that a given bit𝐵𝑗 =0,
1 𝑘|𝑆| 1 𝑘|𝑆| −𝑘|𝑆| isPr𝐵[𝑗]=0 = 1−𝑛 ⇒Pr𝐵𝑗 =1 =1− 1−𝑛 ≈1−𝑒 𝑛 .
The probability that the Bloom filter claims an element y is in S, while it is not:
1 𝑘|𝑆| 𝑘 −𝑘|𝑆| 𝑘 p𝑓𝑎𝑙𝑠𝑒 = 1− 1−𝑛 ≈ 1−𝑒 𝑛
Bose et al. On the false-positive rate of Bloom filters, Information Processing Letters (2008) 210–213.
Christensen et al. A new analysis of the false positive rate of a Bloom Filter. Information Processing Letters (2010) 944-949
Prh𝑖𝑥=0,𝑖∈1,𝑘 =1−1 𝑛
𝑛𝑛
14
Bloom filter
What happens as we keep increasing k?
0.2 0.18 0.16
p𝑓𝑎𝑙𝑠𝑒 = 1− 1−1 𝑛
𝑘|𝑆| 𝑘
𝑘|𝑆| 𝑘 0.14 ≈ 1−𝑒− 𝑛 0.12
0.1 0.08 0.06 0.04
0.020 2 4 6 8 10 12 14 16 18 20
k
Find the optimal k by minimizing 𝑝𝑓𝑎𝑙𝑠𝑒 w.r.t. k
𝑘𝑜𝑝𝑡 = 𝑛 ln 2 ⇒𝑝𝑓𝑎𝑙𝑠𝑒,𝑘𝑜𝑝𝑡 = 1−𝑒−ln 2 |S| |𝑆|
n ln(2)
𝑛 ln(2)
= 1 |𝑆| ≈0.618𝑛/|𝑆|
2
Given S, we want to maintain a desired false positive ratio p 𝒏 𝒍𝒏(𝟐)
𝟏 |𝑺| |𝑺|𝒍𝒏 𝒑
𝟐 =𝒑⇒𝒏=− 𝒍𝒏 𝟐 𝟐 , linearin|S|
15
𝒑𝒇𝒂𝒍𝒔𝒆
Bloom filters
Other uses of Bloom filter
Google BigTable*: Google’s column-oriented NoSQL Big Data database service. It’s the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail.
Apache Hbase: open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable
Apache Cassandra : Key-value NoSQL database
Postgresql : Object-Relational DBMS
use Bloom filters to reduce the disk lookups for non-existent rows or columns.
https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
16
Bloom filters
Bloom filter in python (a fast library)
https://axiak.github.io/pybloomfiltermmap/#
Install in cloudera terminal: sudo pip install pybloomfiltermmap
Note this works for python2 Example:
capacity, false positive prob, filename
Add the item to the bloom filter.
17
Today
More algorithms for streams:
(1) Filtering a data stream: Bloom filters
Select elements with property x from stream
(2) Counting distinct elements: Flajolet-Martin
Number of distinct elements in the last k elements of the stream
(3) Estimating moments: AMS method Estimate std. dev. of last k elements
18
Counting distinct elements
Problem:
Data stream consists of many elements chosen from a set of size N
events in activity monitoring, products purchased, …
Maintain a count of the number of distinct elements seen so far
Obvious approach:
Keep a hash table of elements seen so far
Number of distinct events for a day
Number of distinct products last week
But we don’t have so much space
19
Counting distinct elements: Applications
Big data applications
approxCountDistinct in Spark
https://databricks.com/blog/2016/05/19/approximate-algorithms-in- apache-spark-hyperloglog-and-quantiles.html
Security monitoring- if more than X attempts, report
Propagation rate of viruses
Distributed computing: multiple parties can combine their sketches to find number of total distinct elements, and number of common elements (inclusion/exclusion)
e.g., document overlap for plagiarism
20
Counting distinct elements
Estimate number of distinct elements
Select a hash function h that maps each of N elements to at least log2 N bits
N is the max. number of distinct elements e.g.,a1100 (12inbinary)
Define r(h(a)) as the number of consecutive 0s from the right (before we hit an 1)
e.g., ah(a)=1100r(h(a))=2
21
Counting distinct elements
Estimate number of distinct elements Flajolet-Martin (FM) sketch
For each element x in stream S Compute r(h(x))
LetR=max𝑟 h 𝑥 𝑥∈𝑆
Return 2𝑹 as the estimated number of distinct elements in S
22
Counting distinct elements
Why 2𝑹 is an estimate of number of distinct elements? (intuition)
h maps x uniformly to each bit string 1 of elements will have last bit 0
2
if we had seen 2 elements, we would expect one trailing zero
1 = 1 of elements will have two last bits 0 22 4
if we had seen 4 elements, we would expect 2 trailing zeros
…
1 of elements will have r last bits 0
if we had seen r elements, we would expect log(r) trailing zeros
2𝑟
So, with r trailing zeros, we have seen 2𝑟 elements
23
Counting distinct elements
We hash each user_id, compute # of trailing zeros (i.e., r(h(user_id))) and
then estimate the number of distinct elements as 2^{max(trailing zeros)) (i.e., 2^R)
24
Counting distinct elements
Exercise
Given the stream {3,1,3,2} and h(x)=4xmod32, compute the estimate output by an FM sketch that uses h(x) and stores each h(x) using 4 bits.
25
Counting distinct elements
Given the stream {3,1,3,2} and h(x)=4xmod32, compute the estimate output by an FM sketch that uses h(x) and stores each h(x) using 4 bits.
Element x
Hash h(x)
r(h(x))
3
1100
2
1
0100
2
3
1100
2
2
1000
3
R=max(2,2,2,3)=3 Estimate is 2^R=8
h(2)=4*2mod32=8 (in binary 1000)
26
Counting distinct elements
How good is the estimate 2𝑹?
Let the true number of distinct elements be F
Pr 1≤2𝑅 ≤𝑐 >1−3,for𝑐>3 𝑐𝐹𝑐
E.g., F=100
Pr(25<=2𝑹<=400)>0.25
Pr(20<=2𝑹<=500)>0.4
Pr(10<=2𝑹<=1000)>0.7 The standard deviation 𝜎 is too high
http://www.cse.cuhk.edu.hk/~taoyf/course/wst501/notes/lec4.pdf
27
Counting distinct elements
How good is the estimate 2𝑹? Can we reduce 𝜎?
Construct m different sketches, each with an independent hash function
𝜎 𝑚
Estimateis (2𝑹𝟏+⋯+𝟐𝑹𝒎)/𝒎st.dev.
28
Counting distinct elements
How good is the estimate?
Let 𝑧 be the estimate. Given 𝛿 > 0, can we guarantee
Pr 1 ≤ 𝑧 ≤ 𝑐 ≥ 1 − 𝛿, for any c >6? 𝑐𝐹
𝟗𝒍𝒏 𝟐
Construct m= 𝜹 different sketches, each with an
𝒄 𝟎.𝟓−𝟑 𝟐 𝒄
independent hash function
Estimate z = median(2𝑹𝟏, … , 𝟐𝑹𝒎)
E.g., F=100, 𝛿=0.01
with m=120, Pr 10 ≤ 𝑧 ≤ 1000 ≥ 1 − 0.01 = 0.99
http://www.cse.cuhk.edu.hk/~taoyf/course/wst501/notes/lec4.pdf 29
Counting distinct elements
Hashfunctions
SHA1 https://docs.python.org/2/library/hashlib.html Murmurhash3https://pypi.python.org/pypi/mmh3 faster
Accuracy (mean trick)
Much better than probabilistic bounds
30
Counting distinct elements
Correction factor 𝜑=0.77351. Estimate becomes 2𝑅/𝜑 Composability
If we have two FM sketches F1, F2 built with the same hash function for sets S1, S2, then bit-wisely ORing them produces a new FM sketch that counts the distinct values in S1 U S2
31
Today
More algorithms for streams:
(1) Filtering a data stream: Bloom filters
Select elements with property x from stream
(2) Counting distinct elements: Flajolet-Martin
Number of distinct elements in the last k elements of the stream
(3) Estimating moments: AMS method Estimate std. dev. of last k elements
32
Estimating frequency moments
What are frequency moments
The 𝑘-th frequency moment of a stream comprised of N
different types of elements 𝑎1, … , 𝑎𝑁 each appearing 𝑚1, … , 𝑚𝑁 times is defined as 𝑓 = 𝑁 𝑚𝑘
𝑘 𝑖=1 𝑖
𝑓 is the number of distinct elements 0
𝑓 is the total frequency (length of stream) 1
𝑓 computation 2
http://www.tau.ac.il/~nogaa/PDFS/amsz4.pdf
33
Estimating frequency moments
Why they are important
Indicate the degree of data skew, which is useful in many parallel database applications (e.g., determines the selection of algorithms for data
partitioning)
𝑓 is the surprise number S (how uneven is the distribution)
Stream of length 100 11 distinct values
Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9 Surprise S = 10^2+10*(9^2)=910
Item counts: 90, 1, 1, 1, 1, 1, 1, 1 ,1, 1, 1 Surprise S = 90^2+10*(1^2)=8,110
2
http://www.tau.ac.il/~nogaa/PDFS/amsz4.pdf
34
AMS method
AMS method works for all moments
For now, we focus on the 2nd moment
We pick and keep track of many variables X:
For each variable X we store X.el and X.val X.el corresponds to the element i
X.val corresponds to the count of element i
Note this requires a count in main memory, so number of Xs is limited
Our goal is to estimate 𝐟 = 𝒎𝟐 𝟐𝒊𝒊
http://www.tau.ac.il/~nogaa/PDFS/amsz4.pdf
35
AMS method
How to set X.val and X.el?
Assume stream has length n (we relax this later)
Pick some random time t (t
If you choose it, throw one of the previously stored variables X out, selected uniformly at random
42
Sketch library for Python
Streamlib https://github.com/jiecchen/StreamLib
API: http://xmerge.me/StreamLib/api.html
Implementations of many sketches, including AMS for F2
sketch from streamlib import F2
myf2 = F2(20,1)
mylist=[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 3, 3, 4]
myf2.processBatch(mylist) print(myf2.estimate())
#F2 is the AMS sketch we examined #AMS sketch with 20 buckets and m=1
#list representing our stream
#add stream to AMS
#print estimate of F2 moment
43
Tell us what you think of the module
Please be as honest and constructive as possible
Your feedback is extremely valuable
(The survey is completely anonymous)
The results…
go to the module leader, any other staff who taught on the module and the Head of Department
On KEATS
we will provide you with…
a breakdown of the results of all module
evaluations (the free text comments will NOT be
made available)
all module leaders’ responses to the results of
their module evaluations. These will include the key points of the action plans
Module Evaluation
in Informatics What happens next…
The module leader and teaching staff will…
1. fully consider the results of the survey, and
2. create an action plan detailing how your responses will inform
how the module is taught next year. This may include…
How all the things you like about the module can be enhanced
How any issues identified will be resolved
The Head of Department…
receives the results of all module evaluations and will contribute to action plans where necessary
highlights examples of good practice that could be shared across more modules
highlights any common issues raised across modules in the department
44