Question 1 2 pts
The following piece of code computes the frequencies of the words in a text file:
from operator import add
lines = sc.textFile('README.md')
counts = lines.flatMap(lambda x: x.split()) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
Add one line to find the most frequent word. Output this word and its frequency.
Hint: Use sortBy(), reduce(), or max()
Answer:
Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
>>> from operator import add
>>> lines = sc.textFile('README.md')
>>> counts = lines.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(add)
>>> print(counts.map(lambda (k, v): (v, k)).sortByKey(False).take(1))
[(24, u'the')]
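The hint also allows max(), which avoids the key/value swap and the sort entirely. A one-line alternative, reusing the same counts RDD from above:

most_frequent = counts.max(key=lambda kv: kv[1])  # (word, count) pair with the largest count
print(most_frequent)  # per the run above, this yields (u'the', 24)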
Question 2 2 pts
Modify the word count example above so that we count the frequencies of only those words consisting of 5 or more characters.
Answer:
Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
>>> from operator import add
>>> lines = sc.textFile('README.md')
>>> counts = lines.flatMap(lambda x: x.split())
>>> y = counts.filter(lambda word: len(word) >= 5)
>>> t = y.map(lambda x: (x, 1))
>>> p = t.reduceByKey(add)
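Equivalently, the filter can be slotted into the original chained pipeline; a sketch using the same lines RDD (output omitted, since it depends on the contents of README.md):

counts = lines.flatMap(lambda x: x.split()) \
              .filter(lambda word: len(word) >= 5) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
print(counts.take(5))  # sample of (word, count) pairs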
Question 3 1 pts
Consider the following piece of code:
A = sc.parallelize(xrange(1, 100))
t = 50
B = A.filter(lambda x: x < t)
print B.count()
t = 10
C = B.filter(lambda x: x > t)
print C.count()
What’s its output? (Yes, you can just run it.)
Answer:
Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
>>> A = sc.parallelize(xrange(1, 100))
>>> t = 50
>>> B = A.filter(lambda x: x < t)
>>> print B.count()
49   # output
>>> t = 10
>>> C = B.filter(lambda x: x > t)
>>> print C.count()
0   # output
Because transformations are lazy and the lambdas capture the variable t rather than its value at definition time, B is re-evaluated when C.count() runs: with t = 10, B's filter keeps only x < 10, and C's filter (x > 10) then matches nothing.
Question 4 2 pts
The intent of the code above is to get all numbers below 50 from A and put them into B, and then get all numbers above 10 from B and put them into C. Fix the code so that it produces the desired behavior, by adding one line of code. You are not allowed to change the existing code.
Answer:
Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
>>> A = sc.parallelize(xrange(1, 100))
>>> t = 50
>>> B = A.filter(lambda x: x < t)
>>> B.cache()
PythonRDD[1] at RDD at PythonRDD.scala:48
>>> print B.count()
49
>>> t = 10
>>> C = B.filter(lambda x: x > t)
>>> print C.count()
39
Marking B with cache() and then running count() materializes the x < 50 result in memory, so when C.count() runs, Spark reads the cached partitions instead of re-evaluating B's filter with the new value t = 10.
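An equivalent fix is to persist B explicitly; cache() is shorthand for persisting with the default memory-only storage level:

from pyspark import StorageLevel
B.persist(StorageLevel.MEMORY_ONLY)  # same effect as B.cache()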
Question 5 3 pts
Modify the PMI example by sending a_dict and n_dict inside the closure. Do not use broadcast variables.
Answer:
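The original PMI example is not reproduced here, so the following is only a sketch. It assumes a pair_counts RDD of ((x, y), n_xy) co-occurrence counts and an n_dict holding the total pair and word counts; those names and layouts are hypothetical, while a_dict (word -> individual count) and n_dict come from the question. The key point is that plain Python objects referenced inside the function are captured in the task closure and shipped with every task, so no sc.broadcast() call is needed:

from math import log

def pmi(kv):
    (x, y), n_xy = kv
    # a_dict and n_dict are referenced directly here, so Spark pickles
    # them into this function's closure and sends a copy with every
    # task -- no broadcast variables involved.
    p_xy = float(n_xy) / n_dict['pairs']
    p_x = float(a_dict[x]) / n_dict['words']
    p_y = float(a_dict[y]) / n_dict['words']
    return ((x, y), log(p_xy / (p_x * p_y)))

result = pair_counts.map(pmi)

The trade-off versus broadcast variables: the closure is re-serialized and re-shipped with every task, whereas a broadcast variable is sent to each executor once and reused across tasks.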