7CCSMBDT – Big Data Technologies Practical
STREAMS2
This practical is to be done in the lab’s pcs; not in cloudera.
1. Use of bloom filters with Python Bloomfilter
Download the file spam_websites.txt from KEATS. This is a list of spam websites.
(i) The following program adds the words [“Angel”,”Bruce”,”Charlie”,”Dan”,”Erin”,”Fred”,”Greg”,”Helen”] into a Bloom Filter bf of size 100 with error rate 0.5. Notice that the library requires the size of the filter to be larger than its input. Also, notice that we need to encode the word w to be added into the bloom filter.
from pybloomfilter import BloomFilter words=[“Angel”,”Bruce”,”Charlie”,”Dan”,”Erin”,”Fred”,”Greg”,”Helen”] bf=BloomFilter(100, 0.5)
for w in words:
bf.add(str.encode(w)) Complete the program so that
it creates a list words3 with all the contents of the list words and also of the list [”Irene”,”Ken”,”Lucy”,”Mary”]
it checks whether each word in words3 is in the Bloom Filter bf .
Is every word in words in the bloom filter bf? Are there any words in words3 that are not
in bf? Why? Now reduce the error rate from 0.5 to 0.01 and answer the questions again.
(ii) Write a program that
Stores the first 514 websites (i.e., lines) from spam_websites.txt into a list, and the remaining 3000 ones in a bloomfilter of size 6000 with error rate 0.01. Note that the library does not check if the size is sufficient, and to get appropriate error values we need a size larger than 3000.
Checks if each of the first 514 websites is in the bloomfilter
Outputs the words that are false positives and their total number.
7CCSMBDT – Big Data Technologies Practical
(iii) Modify your program in (ii) so that:
it is executed 100 times
outputs just the maximum false positive probability, over all executions. How does the maximum false probability compare to the error?
2. Approximate count distinct with a simple implementation of FM sketch
The FM implementation below follows our slides, with the difference that it uses a fixed- length bit representation of the hash values. That is, each element of the set that we want to add into the FM sketch is mapped to a hash value of the same size. Specifically, the size is 160 bits.
import random,math,hashlib def trailing_zeros(num):
“””Counts the number of trailing 0 bits in num.””” if num == 0:
return 64
p =0
# Refresh your knowledge about bitwise operators
# https://wiki.python.org/moin/BitwiseOperators to see why it counts trailing zeros
while (num >> p) & 1 == 0: p += 1
return p
def estimate_cardinality(values):
“””Estimates the number of unique elements in the input set values. Uses fixed length bit representations (equal to the key sha1 (160 bits) Arguments:
values: An iterator of hashable elements to estimate the cardinality of. “””
my_max=0;
for value in values:
#creates a hexadecimal hash see https://docs.python.org/3/library/hashlib.html
7CCSMBDT – Big Data Technologies Practical
digest=hashlib.sha1(str(value).encode(‘utf-8′)).hexdigest()
#maps the hash to integer, needed to count trailing zeros h=int(digest,16)
#prints the integer, its hash in binary, and the number of trailing zeros print(value,”\t”,bin(h)[2:].zfill(160),”\t T: “,trailing_zeros(h))
(i) Study the code and the references in the comments to make sure you understand the way we count trailing zeros (zeros from the right to left, in the bit representation), and the way we generate hash values.
(ii) Complete the function estimate_cardinality(values) so that it produces the estimate of distinct values.
(iii) The code below should be appended after the code you wrote in (ii).
Modify it to output the True and Estimated number of distinct values in my_list, as well as the average relative error over 100 repetitions. The average relative error is defined as abs(true-estimated)/true*100% .
#generates a random list of k integers from a gaussian distribution with mean 100 and standard deviation 5. k this is the size of stream that we want to put into the FM sketch.
k=10 repetitions=100
for j in range(0,repetitions): my_list=[int(random.gauss(100,5)) for i in range(k)]
(iv) Now increase k to 1000 and compare the average relative error with the one you got from question (iii). What do you observe?
7CCSMBDT – Big Data Technologies Practical
3. Approximating the F2 moment using AMS sketches.
(i) Read the documentation for AMS sketch –
class streamlib.summary.F2(w=20, mu=5, typecode=’i’) in http://xmerge.me/StreamLib/api.html
Type the following into a file ams1.py , read the comments and execute ams1.py to see the result. It will estimate the second frequency moment.
#this imports the class we need for using the AMS sketch from streamlib import F2
#this creates an AMS sketch with 20 buckets and m=1 myf2 = F2(20,1)
#this is a list representing our stream
mylist=[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 3, 3, 4]
#this adds our stream into the AMS sketch myf2.processBatch(mylist)
#this prints the estimate of the second frequency moment print(myf2.estimate())
7CCSMBDT – Big Data Technologies Practical
(ii) Compute the second frequency moment by hand. What is its value?
(iii) Modify the program so that it computes and outputs the second frequency moment and the relative error between the true and estimated second frequency moment. The relative error is defined as abs(true-estimate)/true*100.
Extra assignment – if you have time
You are given the following lists
mylist1=[int(random.gauss(100,3)) for i in range(1000)] mylist2=[int(random.gauss(100,5)) for i in range(1000)] mylist3=[int(random.gauss(100,15)) for i in range(1000)]
Modify the program in (ii) to compute the surprise number of each list. See the lecture slides for what is the surprise number. What do you observe?