Final Exam 2018S Solution
Written Part
1. (2 points for each)
(1) A
(2) B
(3) A
(4) C
(5) D
2. (2.5 points for each)
(1) A
(2) C
(3) C
(4) B and C
3. Not good, (2 points) because it collects all data to the driver program, who then does the addition sequentially by itself. (3 points)
r1 = RDD1.zipWithIndex().map(lambda (x,y): (y,x))
r2 = RDD2.zipWithIndex().map(lambda (x,y): (y,x))
RDD3 = r1.union(r2).reduceByKey(lambda x,y: x+y)
(5 points)
4. (1) join will cause a shuffle (3 points), reduceByKey will not (2 points)
(2) Now both join and reduceByKey will cause shuffles. (2 points) Change map to mapValues. (3 points)
5.
(1) (3 points)
[(1, 2), (3, 3), (2, 4)]
[(1, 2), (2, 4)]
(2) (3 points)
[(1, 2), (3, 3), (2, 4)]
[(1, 2), (3, 3), (2, 4)]
(3) (3 points)
[(1, 2), (2, 4), (3, 3)]
[(1, 2), (2, 4), (3, 3)]
1 extra point if all correct.
Programming Part
Grading Schema:
Indentation: For all questions, if your program does not contain indentation, there will be 10% penalty on the final grade, at least one point.
Performance: For all questions, if your program contains any unnecessary operations, like unnecessary for-loop, there will be at most 10% penalty on the final grade.
Example:
for i in range (0,n):
a[i] = a[i] + 1
will be considered as unnecessary operations, the correct way is
a.map(lambda a: a+1)
Output: For all questions, if your program has different output format from the requirements, there will be 10% penalty on each correct test point.
Question 1:
• Four test points, each one contains 2 points.
• P = (5771,4907), K = 1
OUTPUT: [(5771, 4907)]
• P = (2000,5000), K = 1
OUTPUT: [(2095, 4488)]
• P = (3476,9529), K = 5
OUTPUT:[(3765, 9373), (3577, 9962), (3137, 8984), (3338, 8776), (2698, 9476)]
• P = (7895,9229), K = 20
OUTPUT: [(7163, 9536), (8849, 10082), (9170, 8944), (7791, 7746), (7065, 10516), (7677, 10749), (9486, 9204), (9441, 8181), (8017, 7277), (7814, 7189), (5884, 10132), (8674, 11475), (9239, 7187), (5840, 10737), (10474, 9888), (9595, 11489), (9078, 6641), (8560, 6307), (8663, 12156), (9201, 11962)]
• The implementation of “Divide” function contains 7 points.
• Calculating the nearest point in partition: 2 points
• yield no more than K points in each partition: 5 points
• if 2) is satisfied and the nearest point is in the K points, 1) is satisfied.
• The implementation of main function contains 3 points.
• If your program does not implement a divide-and-conquer algorithm, there will be 80% penalty on the final grade.
• If your program does sorting in “Divide” part, there will be 60% penalty on the final grade, at most 9 points.
• If your program only solve K=1 case, there will be 1 point penalty (Totally 15 points if your program can solve nearest neighbor problem).
• If your program does not achieve any of above goals, you can get 2 points if you calculate the distance correctly.
Question 2:
• Three test points, each one contains 2 points. For grading, we only consider the first five windows. if you can output the correct answer for the first three window, which means your program cannot handle window stream, you can get 1 point for each test point. In this case, you can get total 10 points.
• Random seed = 1024
• Output:
45709
• 46126
• 48452
• 49226
50538
• Random seed = 2018
• Output:
43386
• 44120
• 45155
• 48534
48469
• Random seed = 5003
• Output:
47214
• 48174
• 50523
• 53765
53900
• Using any Spark Streaming function correctly: 2 points
• Using any Spark Streaming function with window correctly: 2 points
• Calculating the counts in a window: 1 points
• Calculating the sum in a window: 2 points
• Calculating the average: 2 point
• Submitting unrelated code to the question will get 0 point.
Question 3:
• Three test points, each one contains 2 points
• userID = “h”
• Output:
+——-+
• | name|
• +——-+
• | Issac|
• | Jason|
• |Charlie|
+——-+
• userID = “c”
• Output:
+—–+
• | name|
• +—–+
• |Gabby|
• | Bob|
• |Alice|
• |David|
+—–+
• userID=”e”
• Output:
+—–+
• | name|
• +—–+
• |Alice|
• |David|
+—–+
• Calculating the situation 1: 2 points
• Calculating the situation 2: 2 points
• Calculating the situation 3: 2 points
• Calculating the situation 4: 2 points
• Calculating the final result: 1 point
• Not using GraphFrame/DataFrame API will lose all points from (2)-(6)
• If the program can only output the result for the case “userID = ‘h'”, there will be 1 points penalty (Solving “userID = ‘h'” can get totally 10 points) .
Sample Solution:
Question 1:
from math import sqrt
def distance(a,b): # define the function for distance between points
return sqrt((a[0]-b[0])*(a[0]-b[0])+(a[1]-b[1])*(a[1]-b[1]))
T = pairs.map(lambda pair: (pair, distance(pair,P))) # map (point->(point, distance))
def f(iterator):
result = [] # set the result list as empty
max = ((0,0),0) # set the farthest point in the result list as 0.
for i in iterator:
if (len(result) < K): # if the list is not full (Maximum = K)
result.append(i) # insert new point into the list
if (i[1] > max[1]) : # update the farthest point if possible
max = i
else : # if the list is full
if (i[1] < max[1]) : # If the new point is closed to P than the farthest point in the list
# Replace the farthest point with the new point
result.remove(max)
result.append(i)
# Renew the farthest point
max = i
for j in result:
if (j[1] > max[1]):
max = j
# Report every point in the list
for i in result:
yield i
result = T.mapPartitions(f)
result = result.sortBy(lambda a: a[1]).map(lambda a:a[0]).take(K) # Sorting the K*numPartition elements and get the first K.
Question 2:
Stat = numbers.map(lambda u:(u,1)).reduceByWindow(lambda x ,y: (x[0]+y[0],x[1]+y[1]), lambda x,y: (x[0]-y[0],x[1]-y[1]), 30,10 )
def printResult(rdd):
result = rdd.take(1)
print result[0][0]/result[0][1]
Stat.foreachRDD(printResult)
Question 3:
one_hop = g.find(“(a)-[e]->(b)”)
one_hop = one_hop.filter((one_hop[‘a.id’] == userID)).select(‘b.name’) # Calculate condition 1) and 2)
two_hop = g.find(“(a)-[e]->(b); (b)-[f]->(c)”)
two_hop = two_hop.filter(two_hop[‘a.id’] == userID).filter(two_hop[‘e.relationship’] == “friend”) # Calculate condition 3) and 4)
two_hop = two_hop.filter(two_hop[‘c.id’] != userID).select(‘c.name’) # Remove userID from the result
result = one_hop.union(two_hop).distinct() # Union two results