
7CCSMBDT – Big Data Technologies Practical
STREAMS1
1. Install the required libraries in Cloudera with
sudo yum install python-pip
cd /usr/bin
sudo ./pip install csample
sudo ./pip install dgim
2. Get a sample of a/b fraction from a stream.
(i) You are given the following log data structure:
logs = (
    ('brad', 'event a', 2),
    ('charlie', 'event a', 3),
    ('charlie', 'event a', 4),
    ('brad', 'event b', 5),
    ('brad', 'event c', 6),
    ('maria', 'event a', 7),
    ('maria', 'event a', 7),
    ('maria', 'event a', 7),
    ('roger', 'event a', 7),
    ('roger', 'event b', 8),
    ('alan', 'event c', 0),
    ('alan', 'event b', 1),
    ('boris', 'event b', 1),
    ('cornelius', 'event b', 1),
    ('dragan', 'event b', 1),
    ('elaine', 'event b', 1),
    ('elaine', 'event b', 1),
)

Construct and print out a sample containing 2/3 of the keys in logs, where the key is the event type.
HINT: check the hashlib library https://docs.python.org/3/library/hashlib.html for creating a hash key k out of a string and especially the hexdigest() method.
Note that two keys k1 and k2 collide when k1=k2 mod m.
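The hint can be illustrated with a minimal sketch. The helper name bucket_of and the choice of SHA-1 are assumptions made for illustration; any hashlib digest would serve. The helper maps a string key to one of m buckets, so every record with the same key lands in the same bucket.

```python
import hashlib

def bucket_of(key, m):
    """Map a string key to a bucket number in 0..m-1 (hypothetical helper)."""
    # hexdigest() returns the hash as a hexadecimal string;
    # int(..., 16) turns it into an integer that can be reduced mod m.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % m

m = 3
for event in ("event a", "event b", "event c"):
    print(event + " -> bucket " + str(bucket_of(event, m)))
```

Two keys that collide fall into the same bucket and are therefore kept or dropped together, so with only a few distinct keys the sampled fraction of keys may differ from a/b.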
SOL:
import hashlib
logs = (('brad', 'event a', 2),
        # … I am skipping the remaining entries of logs
       )
sample = []
# We want an alpha/buckets sample
buckets = 3
alpha = 2
key_id = 1  # this is the index of the event type in logs (e.g., 'event a')
for i in range(0, len(logs)):
    # Get an integer mod buckets out of logs[i][key_id].
    # hexdigest() produces a hexadecimal string, which we convert with int(..., 16).
    # Note there may be hash collisions.
    hash_id = int(hashlib.sha1(logs[i][key_id].encode('utf-8')).hexdigest(), 16) % buckets
    # Uncomment the following to see if there were collisions
    # print("id: ", logs[i][key_id], " hash(id): ", hash_id)
    if hash_id < alpha:
        sample.append(logs[i])
print(sample)
[...]
# collections.Counter requires Python >= 2.7. For older versions, you
# need to copy a package into Python as described in
# http://code.activestate.com/recipes/576611-counter-class/
from collections import Counter
import math
import fileinput
import csample
my_stream = []
for line in fileinput.input(['entree_stream.txt']):
    my_stream.append(int(line))
# Reservoir sample holding 10% of the stream
samples = csample.reservoir(my_stream, int(math.floor(0.1 * len(my_stream))), keep_order=True)
k = 10
print("Relative frequencies of top-k in stream")
d = Counter(my_stream)

common_stream = d.most_common(k)
top_k = []
for i in range(0, len(common_stream)):
    # float() avoids integer division in Python 2
    print("Event: ", common_stream[i][0], " Rel. freq.: ", common_stream[i][1] / float(len(my_stream)))
    top_k.append(common_stream[i])
print("Relative frequencies of top-k in sample")
c = Counter(samples)
dict_c = dict(c)
for i in range(0, len(top_k)):
    # get(..., 0) covers a top-k event that did not make it into the sample
    print("Event: ", top_k[i][0], " Rel. freq.: ", dict_c.get(top_k[i][0], 0) / float(len(samples)))
The relative frequencies of these events in the stream and in the sample are very similar.
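For intuition about what a reservoir sample does, here is a minimal sketch of the classic single-pass scheme (Algorithm R). The function name reservoir_sample and its seed parameter are hypothetical; this is not the csample implementation, and it makes no attempt to preserve stream order.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            # The first k items fill the reservoir.
            reservoir.append(item)
        else:
            # Item n (0-based) replaces a random slot with probability k/(n+1).
            j = rng.randint(0, n)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1000), 100, seed=42)
print(len(sample))
```

Because every item ends up in the reservoir with the same probability, frequent events are expected to keep roughly the same relative frequency in the sample as in the stream.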
4. DGIM for a short stream
Write a Python program that computes the estimated count of 1s in the stream shown on the slide “Example: Bucketized stream” of the lecture (slide number 39).
Note that the size of the leftmost bucket is 16, but in the slide only the last 8 1s are shown. Thus, you need to add 8 more 1s in the bucket, before the leftmost 1.
The API of dgim is in: https://dgim.readthedocs.io/en/latest/genindex.html and the “Usage” section in https://dgim.readthedocs.io/en/latest/readme.html is self-explanatory.
SOL:
from dgim import Dgim
dgim = Dgim(N=65, error_rate=0.5)
# The stream, oldest bit first. The leading run of eight 1s completes
# the leftmost bucket of size 16.
stream = (
    "1" * 8
    + "10"
    + "010101100010110"
    + "1" * 6
    + "011"
    + "0"
    + "1" * 5
    + "011"
    + "10"
    + "1" * 3
    + "0"
    + "111010100010110010"
)
for bit in stream:
    dgim.update(bit == "1")
dgim_result = dgim.get_count()
print(dgim_result)
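To make the bucket mechanics behind the estimate concrete, here is a minimal sketch of the textbook DGIM scheme: at most two buckets of each size, and the oldest bucket counted for half. The function name dgim_estimate is hypothetical, and this is not the implementation of the dgim package, whose error_rate parameter may lead to different bucket layouts and results.

```python
def dgim_estimate(bits, window):
    """Estimate the number of 1s among the last `window` bits (textbook DGIM)."""
    # Each bucket is [timestamp of its most recent 1, number of 1s it holds].
    # The list is ordered from the newest bucket to the oldest.
    buckets = []
    for t, bit in enumerate(bits):
        # Drop the oldest bucket once its most recent 1 leaves the window.
        if buckets and buckets[-1][0] <= t - window:
            buckets.pop()
        if not bit:
            continue
        buckets.insert(0, [t, 1])
        # Whenever three buckets share a size, merge the two oldest of them
        # into one bucket of twice the size; this may cascade to larger sizes.
        i = 0
        while i + 2 < len(buckets) and buckets[i][1] == buckets[i + 2][1]:
            merged = [buckets[i + 1][0], 2 * buckets[i + 1][1]]
            buckets[i + 1:i + 3] = [merged]
            i += 1
    if not buckets:
        return 0.0
    # Count every bucket in full except the oldest, which counts for half.
    return sum(size for _, size in buckets[:-1]) + buckets[-1][1] / 2.0

bits = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
print(dgim_estimate(bits, 10))
```

Only the oldest bucket can straddle the window boundary, which is why counting half of it bounds the error of this variant at 50% of the true count.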

5. DGIM and relative error
(i) Write a Python program that asks the user to enter a stream of 0s and 1s, the parameter N of the DGIM algorithm, and the maximum error. The program outputs the approximate answer (i.e., the count of 1s computed by DGIM with the given maximum error), the exact answer, and the relative error. The relative error is defined as the absolute difference between the approximate and the exact answer, divided by the exact answer.
SOL:
from dgim import Dgim
# Python 2: raw_input() returns the typed text as a string
user_stream = raw_input('Enter stream of 0s and 1s: ')
print(user_stream)
user_N = raw_input('Enter N for your stream: ')
print(user_N)
user_max_error = raw_input('Enter max_error for your stream: ')
dgim2 = Dgim(int(user_N), float(user_max_error))
for i in range(len(user_stream)):
    if int(user_stream[i]) == 0:
        dgim2.update(False)
    elif int(user_stream[i]) == 1:
        dgim2.update(True)
# Exact count of 1s among the last N elements of the stream;
# max(0, ...) handles streams shorter than N
count_ones = 0
for i in range(max(0, len(user_stream) - int(user_N)), len(user_stream)):
    if int(user_stream[i]) == 1:
        count_ones += 1
dgim2_result = dgim2.get_count()
print("Approx: ", dgim2_result)
print("Exact: ", count_ones)
# float() avoids integer division in Python 2
print("Rel. Error: ", abs(dgim2_result - count_ones) / float(count_ones))
(ii) How does the relative error compare to the maximum error for the stream 11011111111, with N=10 and maximum error equal to 0.5?
SOL:
The relative error is (9-8)/9 ≈ 0.11, which is well below the maximum error of 0.5.
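The arithmetic can be checked with a short sketch. The approximate count of 8 is taken from the solution above rather than recomputed, and the variable names are chosen for illustration.

```python
stream = "11011111111"
N = 10
approx = 8  # approximate count taken from the solution, not recomputed here

# Exact count of 1s among the last N bits of the stream.
exact = stream[-N:].count("1")
rel_error = abs(approx - exact) / float(exact)

print("Exact: " + str(exact))
print("Relative error: " + str(round(rel_error, 3)))
```

The first bit of the stream falls outside the window of N=10, so only nine of the ten 1s are counted in the exact answer.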