7CCSMBDT – Big Data Technologies Practical
STREAMS2
This practical is to be done in the lab’s pcs; not in cloudera.
1. Use of bloom filters with Python Bloomfilter
Download the file spam_websites.txt from KEATS. This is a list of spam websites.
(i) The following program adds the words [“Angel”,”Bruce”,”Charlie”,”Dan”,”Erin”,”Fred”,”Greg”,”Helen”] into a Bloom Filter bf of size 100 with error rate 0.5. Notice that the library requires the size of the filter to be larger than its input. Also, notice that we need to encode the word w to be added into the bloom filter.
from pybloomfilter import BloomFilter words=[“Angel”,”Bruce”,”Charlie”,”Dan”,”Erin”,”Fred”,”Greg”,”Helen”] bf=BloomFilter(100, 0.5)
for w in words:
bf.add(str.encode(w)) Complete the program so that
it creates a list words3 with all the contents of the list words and also of the list [”Irene”,”Ken”,”Lucy”,”Mary”]
it checks whether each word in words3 is in the Bloom Filter bf .
Is every word in words in the bloom filter bf? Are there any words in words3 that are not
in bf? Why? Now reduce the error rate from 0.5 to 0.01 and answer the questions again.
SOL:
words2=[“Erin”,”Fred”,”Greg”,”Helen”,”Irene”,”Ken”,”Lucy”,”Mary”,”Nick”,”Olive”,”Peter”,”Q uentin”,”Rola”,”Steven”,”Tom”,”Ursula”]
words3=words+words2 for w in words3:
if str.encode(w) in bf: print(w,” possibly in bf”)
All elements of words will be contained in bf, since bloom filters have zero false negatives. However, the if may fire for some words in words3 because the bloom filter has non-zero false positives. Different runs yield different results (e.g., “Lucky” and “Nick”
7CCSMBDT – Big Data Technologies Practical
are not in words but we get them as false positives (e.g., the if str.encode(w) in bf fires for them). This is prevented (with high probability) if the error rate is reduced to 0.01.
(ii) Write a program that
Stores the first 514 websites (i.e., lines) from spam_websites.txt into a list, and the remaining 3000 ones in a bloomfilter of size 6000 with error rate 0.01. Note that the library does not check if the size is sufficient, and to get appropriate error values we need a size larger than 3000.
Checks if each of the first 514 websites is in the bloomfilter
Outputs the words that are false positives and their total number.
SOL:
from pybloomfilter import BloomFilter bf=BloomFilter(6000, 0.01) first_words=[]
with open(“/home/Desktop/stream2_lab/spam_websites.txt”) as f: for word in f:
if len(first_words)<514: first_words.append(word.strip())
else: bf.add(str.encode(word.strip()))
for w in first_words:
if str.encode(w) in bf:
print("Word: ",w," is a false positive")
7CCSMBDT – Big Data Technologies Practical
(iii) Modify your program in (ii) so that:
it is executed 100 times
outputs just the maximum false positive probability, over all executions.
How does the maximum false probability compare to the error?
SOL:
from pybloomfilter import BloomFilter max_fp=0; number_of_executions=100;
for j in range(0,number_of_executions):
size_of_bf=3514-514 #3514 is the number of lines of spam_websites.txt bf=BloomFilter(2*(size_of_bf), 0.01) #make it larger otherwise error is not acceptable first_words=[]
with open("/home/cloudera/Desktop/stream2_lab/spam_websites.txt") as f:
for word in f:
if len(first_words)<514:
first_words.append(word.strip()) else:
bf.add(str.encode(word.strip())) fp=0
for w in first_words:
if str.encode(w) in bf:
fp+=1
if fp>max_fp: max_fp=fp
print(“Max number of false positives: “,max_fp)
print(“Max false positive probability: “,max_fp/len(first_words))
Max false positive probability was 0.0019 (yours may vary but it should be <0.01)
7CCSMBDT – Big Data Technologies Practical
2. Approximate count distinct with a simple implementation of FM sketch
The FM implementation below follows our slides, with the difference that it uses a fixed- length bit representation of the hash values. That is, each element of the set that we want to add into the FM sketch is mapped to a hash value of the same size. Specifically, the size is 160 bits.
import random,math,hashlib def trailing_zeros(num):
"""Counts the number of trailing 0 bits in num.""" if num == 0:
return 64
p =0
# Refresh your knowledge about bitwise operators
# https://wiki.python.org/moin/BitwiseOperators to see why it counts trailing zeros
while (num >> p) & 1 == 0: p += 1
return p
def estimate_cardinality(values):
“””Estimates the number of unique elements in the input set values. Uses fixed length bit representations (equal to the key sha1 (160 bits) Arguments:
values: An iterator of hashable elements to estimate the cardinality of. “””
my_max=0;
for value in values:
#creates a hexadecimal hash see https://docs.python.org/3/library/hashlib.html digest=hashlib.sha1(str(value).encode(‘utf-8’)).hexdigest()
#maps the hash to integer, needed to count trailing zeros
h=int(digest,16)
#prints the integer, its hash in binary, and the number of trailing zeros print(value,”\t”,bin(h)[2:].zfill(160),”\t T: “,trailing_zeros(h))
7CCSMBDT – Big Data Technologies Practical
(i) Study the code and the references in the comments to make sure you understand the way we count trailing zeros (zeros from the right to left, in the bit representation), and the way we generate hash values.
(ii) Complete the function estimate_cardinality(values) so that it produces the estimate of distinct values.
SOL: We need to get the maximum number of trailing zeros over each value in values (my_max) and when the for ends to return 2^(my_max)
if my_max