程序代写代做代考 algorithm go Hive html distributed system hbase database C data structure 7CCSMBDT – Big Data Technologies Week 9

7CCSMBDT – Big Data Technologies Week 9
Grigorios Loukides, PhD (grigorios.loukides@kcl.ac.uk)
Spring 2016/2017
1

Today
 More algorithms for streams:
 (1) Filtering a data stream: Bloom filters
 Select elements with property x from stream
 (2) Counting distinct elements: Flajolet-Martin
 Number of distinct elements in the last k elements of the stream
 (3) Estimating moments: AMS method  Estimate std. dev. of last k elements
2

Filtering data streams
 Each element of data stream is a tuple  e.g., (user1, item1),…
 Problem
 Given a list of keys S (e.g., [user100, user150])  Determine which tuples of stream are in S
 Obvious solution: Hash table
 But suppose we do not have enough memory to
store all of S in a hash table
 E.g., stream may contain millions of tuples WHY?
3

Filtering data streams: application1
 Mobile push notifications to app users
 Routed to users based on tokens generated by mobile platforms
 Token size up to 4kb
 A platform generates hundreds of
millions of tokens
https://clevertap.com/push-notifications/
4

Filtering data streams: application 1
 The problem
 An app user expresses interest for some apps:
[token1,token2] corresponds to [app1, app2]
 Stream of tokens is being consumed [token1, token3, token 5, token100, ….
 If token1 or token2 in the stream, send notification to the user
5

Filtering data streams: application 2
 The problem
 Business receives many emails, same are spam
 Users a white-list approach to identify spam  a list of 1 billion email addresses that are good [abc@gmail.com , abd@gmail.com, …
 Stream of email addresses is consumed
 [cnn@gmail.com, xyz@gmail.com, abc@gmail.com, ….
 If an address in the stream is in the white-list, forward it to the user
6

First idea
 Given a set of keys S that we want to filter
 E.g., white-list of emails
1. Create a bit array B of n bits, initially all 0s 2. Choose a hash function h with range [0,n) 3. Hash each member of s S to one of
n buckets, and set that bit to 1, i.e., B[h(s)]=1 4. Hash each element a of the stream with h
and output a if it hashes to a bit of B that is 1  i.e., Output a if B[h(a)] == 1
7

First idea
 If an element is in S, it will hash to a bucket that has its bit set to 1, so it will always get through
(no false negatives)
 good emails will always be forwarded
 If an element is NOT in S, what is the probability that it will get through?
(false positives)
 bad emails may also be forwarded  Let’s compute this probability.
8

First idea
 If an element is NOT in S, what is the probability that it will get through?
 Probability that bucket hashes an element: 1 𝑛1
 Probability that bucket does not hash an element: 1 − 𝑛
 Probability that bucket does not hash any element after hashing
all |S| stream elements: 1 − 1 |𝑆| 𝑛
 Probability that bucket hashes at least one element after hashing all |S| stream elements: 1 − 1 − 1 |𝑆| .
 Using
𝑛
𝑆 𝑛⋅𝑆 𝑆
11𝑛1𝑛−𝑆 1− 1−𝑛 =1− 1−𝑛 ≈1− 𝑒 =1−𝑒 𝑛
9

Bloom filter
 Probabilistic (approximate) set membership test
 Given a set S = {x1,x2,…,xn}, is y in S ?
 Construct a data structure to answer that is:
 Fast (Faster than searching through S).
 Small (Smaller than explicit representation).
 To obtain speed and size improvements, allow some probability of error.
 False positives: y  S but we report y  S  False negatives: y  S but we report y  S
 Bloom filter
 Data structure for probabilistic set membership testing  Zero false negatives
 Nonzero false positives
10

Bloom filter
 Given a set of keys S that we want to filter
1. Create a bit array B of n bits, initially all 0
2. Choose k hash functions h1 ,…, hk with range [0,n)
3. Hash each member of s S
 use k hash functions, h𝑖 𝑠 , 𝑖 ∈ 1, 𝑘 , which map s into random numbers
uniform in [1, n] (need modulo n if the hash function outputs larger numbers)
 set the elements 𝐵 h1 𝑠 , … , 𝐵 h𝑘 𝑠 to 1
4. When a stream element with key y arrives  use k hash functions
 if 𝐵 h1 𝑦 , … , 𝐵 h𝑘 𝑦 are all 1, output y  else discard y
11

Bloom filter
 Bloom filter example Keys in S: a, b, y, l
Stream keys q, z
Tarkoma et al. Theory and practice of bloom filters for distributed systems. DOI: 10.1109/SURV.2011.031611.00024
12
False positive

Bloom filter
 Bloom filter key parameters
 Size of stream |S|
 Size of bloom filter (array B): n
 Number of hash functions: k
 Impact of increasing key parameters
 |S|: higher false positive rate
 n: more space, lower false positive rate
 k: more computation, lower false positive rate (as 𝑘 → 𝑘𝑜𝑝𝑡)
Tarkoma et al. Theory and practice of bloom filters for distributed systems. DOI: 10.1109/SURV.2011.031611.00024
13

Bloom filter
 (Lower bound)* of false positive ratio (accurate for n>>k) Prh𝑖 𝑥 =1 =1⇒Prh𝑖 𝑥 =0 =1−1
𝑘
 After adding all |S| elements of S, the probability that a given bit𝐵𝑗 =0,
1 𝑘|𝑆| 1 𝑘|𝑆| −𝑘|𝑆| isPr𝐵[𝑗]=0 = 1−𝑛 ⇒Pr𝐵𝑗 =1 =1− 1−𝑛 ≈1−𝑒 𝑛 .
 The probability that the Bloom filter claims an element y is in S, while it is not:
1 𝑘|𝑆| 𝑘 −𝑘|𝑆| 𝑘 p𝑓𝑎𝑙𝑠𝑒 = 1− 1−𝑛 ≈ 1−𝑒 𝑛
Bose et al. On the false-positive rate of Bloom filters, Information Processing Letters (2008) 210–213.
Christensen et al. A new analysis of the false positive rate of a Bloom Filter. Information Processing Letters (2010) 944-949
Prh𝑖𝑥=0,𝑖∈1,𝑘 =1−1 𝑛
𝑛𝑛
14

Bloom filter
What happens as we keep increasing k?
0.2 0.18 0.16
p𝑓𝑎𝑙𝑠𝑒 = 1− 1−1 𝑛
𝑘|𝑆| 𝑘
𝑘|𝑆| 𝑘 0.14 ≈ 1−𝑒− 𝑛 0.12
0.1 0.08 0.06 0.04
0.020 2 4 6 8 10 12 14 16 18 20
k
 Find the optimal k by minimizing 𝑝𝑓𝑎𝑙𝑠𝑒 w.r.t. k
 𝑘𝑜𝑝𝑡 = 𝑛 ln 2 ⇒𝑝𝑓𝑎𝑙𝑠𝑒,𝑘𝑜𝑝𝑡 = 1−𝑒−ln 2 |S| |𝑆|
n ln(2)
𝑛 ln(2)
= 1 |𝑆| ≈0.618𝑛/|𝑆|
2
 Given S, we want to maintain a desired false positive ratio p 𝒏 𝒍𝒏(𝟐)
𝟏 |𝑺| |𝑺|𝒍𝒏 𝒑
 𝟐 =𝒑⇒𝒏=− 𝒍𝒏 𝟐 𝟐 , linearin|S|
15
𝒑𝒇𝒂𝒍𝒔𝒆

Bloom filters
 Other uses of Bloom filter
 Google BigTable*: Google’s column-oriented NoSQL Big Data database service. It’s the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail.
 Apache Hbase: open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable
 Apache Cassandra : Key-value NoSQL database
 Postgresql : Object-Relational DBMS
use Bloom filters to reduce the disk lookups for non-existent rows or columns.
https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
16

Bloom filters
 Bloom filter in python (a fast library)
 https://axiak.github.io/pybloomfiltermmap/#
 Install in cloudera terminal: sudo pip install pybloomfiltermmap
 Note this works for python2  Example:
capacity, false positive prob, filename
Add the item to the bloom filter.
17

Today
 More algorithms for streams:
 (1) Filtering a data stream: Bloom filters
 Select elements with property x from stream
 (2) Counting distinct elements: Flajolet-Martin
 Number of distinct elements in the last k elements of the stream
 (3) Estimating moments: AMS method  Estimate std. dev. of last k elements
18

Counting distinct elements
 Problem:
 Data stream consists of many elements chosen from a set of size N
 events in activity monitoring, products purchased, …
Maintain a count of the number of distinct elements seen so far
 Obvious approach:
Keep a hash table of elements seen so far
 Number of distinct events for a day
 Number of distinct products last week
 But we don’t have so much space
19

Counting distinct elements: Applications
 Big data applications
 approxCountDistinct in Spark
 https://databricks.com/blog/2016/05/19/approximate-algorithms-in- apache-spark-hyperloglog-and-quantiles.html
 Security monitoring- if more than X attempts, report
 Propagation rate of viruses
 Distributed computing: multiple parties can combine their sketches to find number of total distinct elements, and number of common elements (inclusion/exclusion)
 e.g., document overlap for plagiarism 
20

Counting distinct elements
 Estimate number of distinct elements
 Select a hash function h that maps each of N elements to at least log2 N bits
 N is the max. number of distinct elements e.g.,a1100 (12inbinary)
 Define r(h(a)) as the number of consecutive 0s from the right (before we hit an 1)
e.g., ah(a)=1100r(h(a))=2
21

Counting distinct elements
 Estimate number of distinct elements  Flajolet-Martin (FM) sketch
For each element x in stream S Compute r(h(x))
LetR=max𝑟 h 𝑥 𝑥∈𝑆
Return 2𝑹 as the estimated number of distinct elements in S
22

Counting distinct elements
 Why 2𝑹 is an estimate of number of distinct elements? (intuition)
 h maps x uniformly to each bit string  1 of elements will have last bit 0 
2
if we had seen 2 elements, we would expect one trailing zero
 1 = 1 of elements will have two last bits 0 22 4
if we had seen 4 elements, we would expect 2 trailing zeros
…
 1 of elements will have r last bits 0
if we had seen r elements, we would expect log(r) trailing zeros

2𝑟
So, with r trailing zeros, we have seen 2𝑟 elements
23

Counting distinct elements
We hash each user_id, compute # of trailing zeros (i.e., r(h(user_id))) and
then estimate the number of distinct elements as 2^{max(trailing zeros)) (i.e., 2^R)
24

Counting distinct elements
Exercise
 Given the stream {3,1,3,2} and h(x)=4xmod32, compute the estimate output by an FM sketch that uses h(x) and stores each h(x) using 4 bits.
25

Counting distinct elements
 Given the stream {3,1,3,2} and h(x)=4xmod32, compute the estimate output by an FM sketch that uses h(x) and stores each h(x) using 4 bits.
Element x
Hash h(x)
r(h(x))
3
1100
2
1
0100
2
3
1100
2
2
1000
3
R=max(2,2,2,3)=3 Estimate is 2^R=8
h(2)=4*2mod32=8 (in binary 1000)
26

Counting distinct elements
 How good is the estimate 2𝑹?
 Let the true number of distinct elements be F
Pr 1≤2𝑅 ≤𝑐 >1−3,for𝑐>3 𝑐𝐹𝑐
 E.g., F=100
 Pr(25<=2𝑹<=400)>0.25
 Pr(20<=2𝑹<=500)>0.4
 Pr(10<=2𝑹<=1000)>0.7 The standard deviation 𝜎 is too high
http://www.cse.cuhk.edu.hk/~taoyf/course/wst501/notes/lec4.pdf
27

Counting distinct elements
 How good is the estimate 2𝑹?  Can we reduce 𝜎?
 Construct m different sketches, each with an independent hash function
𝜎 𝑚
Estimateis (2𝑹𝟏+⋯+𝟐𝑹𝒎)/𝒎st.dev.
28

Counting distinct elements
 How good is the estimate?
 Let 𝑧 be the estimate. Given 𝛿 > 0, can we guarantee
Pr 1 ≤ 𝑧 ≤ 𝑐 ≥ 1 − 𝛿, for any c >6? 𝑐𝐹
𝟗𝒍𝒏 𝟐
 Construct m= 𝜹 different sketches, each with an
𝒄 𝟎.𝟓−𝟑 𝟐 𝒄
independent hash function
 Estimate z = median(2𝑹𝟏, … , 𝟐𝑹𝒎)
 E.g., F=100, 𝛿=0.01
 with m=120, Pr 10 ≤ 𝑧 ≤ 1000 ≥ 1 − 0.01 = 0.99
http://www.cse.cuhk.edu.hk/~taoyf/course/wst501/notes/lec4.pdf 29

Counting distinct elements
 Hashfunctions
 SHA1 https://docs.python.org/2/library/hashlib.html Murmurhash3https://pypi.python.org/pypi/mmh3 faster
 Accuracy (mean trick)
 Much better than probabilistic bounds
30

Counting distinct elements
 Correction factor 𝜑=0.77351. Estimate becomes 2𝑅/𝜑  Composability
 If we have two FM sketches F1, F2 built with the same hash function for sets S1, S2, then bit-wisely ORing them produces a new FM sketch that counts the distinct values in S1 U S2
31

Today
 More algorithms for streams:
 (1) Filtering a data stream: Bloom filters
 Select elements with property x from stream
 (2) Counting distinct elements: Flajolet-Martin
 Number of distinct elements in the last k elements of the stream
 (3) Estimating moments: AMS method  Estimate std. dev. of last k elements
32

Estimating frequency moments
 What are frequency moments
 The 𝑘-th frequency moment of a stream comprised of N
different types of elements 𝑎1, … , 𝑎𝑁 each appearing 𝑚1, … , 𝑚𝑁 times is defined as 𝑓 = 𝑁 𝑚𝑘
𝑘 𝑖=1 𝑖
 𝑓 is the number of distinct elements 0
 𝑓 is the total frequency (length of stream) 1
𝑓 computation 2
http://www.tau.ac.il/~nogaa/PDFS/amsz4.pdf
33

Estimating frequency moments

Why they are important
 Indicate the degree of data skew, which is useful in many parallel database applications (e.g., determines the selection of algorithms for data
partitioning)
 𝑓 is the surprise number S (how uneven is the distribution)
Stream of length 100 11 distinct values
Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9 Surprise S = 10^2+10*(9^2)=910
Item counts: 90, 1, 1, 1, 1, 1, 1, 1 ,1, 1, 1 Surprise S = 90^2+10*(1^2)=8,110
2
http://www.tau.ac.il/~nogaa/PDFS/amsz4.pdf
34

AMS method
 AMS method works for all moments
 For now, we focus on the 2nd moment
 We pick and keep track of many variables X:
 For each variable X we store X.el and X.val  X.el corresponds to the element i
 X.val corresponds to the count of element i
 Note this requires a count in main memory, so number of Xs is limited
 Our goal is to estimate 𝐟 = 𝒎𝟐 𝟐𝒊𝒊
http://www.tau.ac.il/~nogaa/PDFS/amsz4.pdf
35

AMS method
 How to set X.val and X.el?
 Assume stream has length n (we relax this later)
 Pick some random time t (t k), choose it with probability k/n
If you choose it, throw one of the previously stored variables X out, selected uniformly at random
42

 Sketch library for Python
 Streamlib https://github.com/jiecchen/StreamLib
 API: http://xmerge.me/StreamLib/api.html
 Implementations of many sketches, including AMS for F2
sketch from streamlib import F2
myf2 = F2(20,1)
mylist=[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 3, 3, 4]
myf2.processBatch(mylist) print(myf2.estimate())
#F2 is the AMS sketch we examined #AMS sketch with 20 buckets and m=1
#list representing our stream
#add stream to AMS
#print estimate of F2 moment
43

Tell us what you think of the module
Please be as honest and constructive as possible
Your feedback is extremely valuable
(The survey is completely anonymous)
The results…
go to the module leader, any other staff who taught on the module and the Head of Department
On KEATS
we will provide you with…
 a breakdown of the results of all module
evaluations (the free text comments will NOT be
made available)
 all module leaders’ responses to the results of
their module evaluations. These will include the key points of the action plans
Module Evaluation
in Informatics What happens next…
The module leader and teaching staff will…
1. fully consider the results of the survey, and
2. create an action plan detailing how your responses will inform
how the module is taught next year. This may include…
 How all the things you like about the module can be enhanced
 How any issues identified will be resolved
The Head of Department…
 receives the results of all module evaluations and will contribute to action plans where necessary
 highlights examples of good practice that could be shared across more modules
 highlights any common issues raised across modules in the department
44