TDDE31 Big Data Analytics Exam
Part 2
June 2, 2020 9:50 – 12:00
Instructions: See https://www.ida.liu.se/732A54/exam/distanceexam.en.shtml
Grades: You can get up to 15 points for this second part of the exam. Together with the max 14 points for the first part, you can thus get a maximum of 29 points overall. To pass the exam (grade 3 or E), you have to meet both of the following conditions: First, you need to achieve at least 7 of the 14 points that can be achieved in the first part of the exam. Second, for both parts together, you need to achieve at least 14.5 of the 29 points that can be achieved overall. If you do not meet the first condition, your second part will not be considered for grading.
Assuming you fulfill the aforementioned requirements to pass the exam, you need at least 20 points (for both parts together) for grade 4, and at least 26 points for grade 5.
Questions: If you have clarification questions regarding some of the exercises in the exam, please do the following depending on the exercise.
If you need clarifications on Questions 14–17, then email christoph.kessler@liu.se
If you need clarifications on Question 18, then email jose.m.pena@liu.se
If you need clarifications on Questions 11–13, or about something more general related to the exam, the examiner will be available in the following Zoom meeting room throughout the whole time of the exam.
https://liu-se.zoom.us/j/63456272023?pwd=M3NWUXkyY1lFdlBaekc2dndxVVBwZz09
Meeting ID: 634 5627 2023 Password: 150426
Notice that this Zoom meeting room has been set up using the waiting room feature of Zoom. Hence, when you enter, you will be put into the waiting room and, from there, you will then be admitted to the meeting room to ask your question.
Question 11 (1p)
Describe a concrete application / use case for which data scalability is important. Note that “scalability” is a key word in this question; it is not enough if your application / use case simply has to do with a huge amount of data.
To answer this question write a maximum of 200 words.
Question 12 (1 + 1 = 2p)
(a) Consider the following key-value database which contains four key-value pairs where the keys are user IDs and the values consist of a user name, the user’s year of birth, and an array of IDs of users that the current user likes (for instance, Bob likes Charlie).
"alice in se" → "Alice, 1987, [bob95 charlie]"
"bob95" → "Bob, 1995, [charlie]"
"charlie" → "Charlie, 1996, []"
"selaya" → "Alice, 1974, [charlie]"
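For clarity, the same four pairs could be sketched, for instance, as a plain Python dictionary. This is purely an illustration of the stored data, not part of the question: a real key-value store is accessed through its query operations rather than as an in-memory dictionary, and the variable name kv_store is made up here.

# Illustrative in-memory view of the four key-value pairs listed above;
# keys are user IDs and values are opaque strings, as in the exam text.
kv_store = {
    "alice in se": "Alice, 1987, [bob95 charlie]",
    "bob95":       "Bob, 1995, [charlie]",
    "charlie":     "Charlie, 1996, []",
    "selaya":      "Alice, 1974, [charlie]",
}

# A key-value store typically supports lookups by key, e.g.:
value = kv_store["bob95"]   # "Bob, 1995, [charlie]"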
Describe how the types of queries typically implemented in a key-value store can be used to retrieve the birth years of all users named Alice.
(b) Describe how the given key-value database can be changed/extended such that retrieving the birth years of users named Alice becomes more efficient.
For each of these two sub-questions, write a maximum of 200 words.
Question 13 (1p)
Assume we use consistent hashing to map keys to compute nodes in a distributed key-value store. Explain in about three to five sentences what happens in such a system if a compute node is removed.
Question 14 (0.5p)
Does a combiner function need to be commutative? Explain why or why not.
To answer this question write a maximum of 100 words.
Question 15 (0.5 + 1 = 1.5p)
(a) In the lecture, the MapReduce mechanism was referred to as the “Swiss Army Knife” of distributed parallel big-data computing over distributed files. Why?
To answer this question write a maximum of 50 words.
(b) Explain how the Spark programming interface differs, and what the benefit could be.
To answer this question write a maximum of 150 words.
Question 16 (1.5p)
Why is fault tolerance an important aspect in big-data computations? Compare the fault tolerance mechanisms in MapReduce and Spark to each other.
(Note that “compare” means compare, not just present each one for itself.)
To answer this question write a maximum of 300 words.
Question 17 (1.5p)
Consider the MapReduce wordcount program as in our lecture but without a Combiner. The user has configured the MapReduce framework to use M > 1 mapper tasks but only one reducer task. Assume for simplicity that mapper tasks and reducer tasks each take time proportional to their input size, i.e.,
$\mathrm{time}_{\mathrm{mapper}}(N) = \mathrm{time}_{\mathrm{reducer}}(N) = C \cdot N$ seconds
for some constant C > 0 and the amount N of data the task processes. Assume also for simplicity that a mapper reading N bytes of input will also produce N bytes of output, and that the amount of input data in HDFS is large enough to provide work for at least M mappers.
Derive a tight upper bound for the speedup that the user can achieve with M mapper tasks and 1 reducer task (even when using an arbitrarily large number of execution resources on the cluster), compared to a sequential scenario (i.e., 1 mapper and 1 reducer task). Justify your answer. A formal calculation is expected. (Hint: Which fundamental theorem of parallel computing can you apply here?)
To answer this question write a maximum of 200 words.
Question 18 (6p)
You are asked to implement the perceptron algorithm in Spark (PySpark). This algorithm for binary classification is a predecessor of modern neural networks.
Specifically, consider a binary classification problem with class labels t ∈ {−1, +1}. Then, the class label assigned to a point x is given by
$y(x) = \begin{cases} +1 & \text{if } w^T x \geq 0 \\ -1 & \text{if } w^T x < 0. \end{cases}$
To determine the parameter values $w$, some learning data $\{(x_1, t_1), \ldots, (x_N, t_N)\}$ is available. We seek the parameter values that minimize the following error function:
$E(w) = -\sum_{n \in M} w^T x_n t_n$
where M denotes the set of all misclassified instances in the learning data, i.e., correctly classified instances give zero error. The error function above is minimized by using batch gradient descent, i.e., the weights are iteratively updated as follows:
$w^{(\mathrm{new})} = w^{(\mathrm{old})} - \alpha \nabla E(w^{(\mathrm{old})}) = w^{(\mathrm{old})} + \alpha \sum_{n \in M} x_n t_n$
where α is the learning rate. You can assume that α is given. You can also assume that x and w are of dimension two. Finally, you can assume that the learning data is linearly separable and, thus, the learning process will converge.
It is not required that your code actually compiles; nevertheless, it must be code rather than pseudocode, i.e., you have to use the transformations and actions properly.
To get full points you need to comment your code.
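For reference, the batch update described above can be sketched in plain Python/NumPy, ignoring the Spark aspect entirely. This is only an illustration of the mathematical update rule under assumed details (zero initial weights, an iteration cap, and made-up toy data); it is not the required PySpark solution and uses no Spark transformations or actions.

import numpy as np

def perceptron_batch(X, t, alpha, max_iter=1000):
    # X: (N, 2) array of input points x_n; t: (N,) labels in {-1, +1};
    # alpha: given learning rate; max_iter: safety cap (an assumption).
    w = np.zeros(X.shape[1])             # initial weights (an assumption)
    for _ in range(max_iter):
        y = np.where(X @ w >= 0, 1, -1)  # y(x_n): +1 if w^T x_n >= 0, else -1
        mis = y != t                     # the set M of misclassified points
        if not mis.any():                # no errors left: converged
            break
        # w(new) = w(old) + alpha * sum over n in M of x_n * t_n
        w = w + alpha * (X[mis] * t[mis][:, None]).sum(axis=0)
    return w

# Tiny, made-up, linearly separable toy data for a quick check:
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])
w = perceptron_batch(X, t, alpha=0.1)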