Time Allowed: 2 hours
Rubric: ANSWER ALL FOUR QUESTIONS.
Calculators: Calculators are not permitted
Notes: Books, notes or other written material may not be brought into this examination
7CCSMBDT BIG DATA TECHNOLOGIES
King's College London
This paper is part of an examination of the College counting towards the award of a degree. Examinations are governed by the College Regulations under the authority of the Academic Board.
Degree Programmes: MSc
Module Code: 7CCSMBDT
Module Title: Big Data Technologies
Examination Period: May 2018 (Period 2)
PLEASE DO NOT REMOVE THIS PAPER FROM THE EXAMINATION ROOM
© 2018 King's College London
Question One
(a) What are the categories of data based on who creates them? For each category, provide an example of data and justify why the data in your example belongs to a certain category.
[4 marks]
(b) What are the categories of data based on their format? For each category, provide an example of data and justify why the data in your example belongs to a certain category.
[6 marks]
(c) How do the different types of Big Data analytics differ in terms of their goal?
[4 marks]
(d) For each of the following analytic tasks, identify its type and provide justification for the type you have identified.
(i) Computing the median of a given stream of data comprised of temperature readings.
(ii) Applying Principal Component Analysis (PCA) to a matrix of 50 rows and 50 columns, whose elements are integers from 0 to 10.
(iii) Clustering the records of a dataset that is stored over 50 machines (nodes of a computing cluster), based on their pairwise similarity.
(iv) Computing the shortest path between two nodes of a graph, representing a social network. The graph is comprised of 10 nodes and 100 edges.
[8 marks]
(e) Identify the most appropriate analytic setting for performing each of the tasks (ii), (iii), and (iv) in part (d) of Question One. Provide justification for your answers.
[3 marks]
Question Two
(a) What is a publish-subscribe messaging data access connector? Briefly describe its main components. How does a publish-subscribe messaging data access connector differ from a Source-sink data access connector?
[6 marks]
(b) Consider the following Apache Sqoop command:
sqoop import --connect jdbc:mysql://localhost/hadoop --username U --password P --table demographics -m 5 --columns "age, gender"
Describe the process that is performed when this command is executed.
[6 marks]
(c) Consider the task of using Apache Flume for transferring Twitter data from Twitter into memory and then writing the data into HDFS and into local files.
Describe the main architectural components of the Apache Flume system that are used to perform the task, providing their names and types (if any), as well as their main functionality for the given task.
Draw a diagram to illustrate how these components are connected.
[7 marks]
(d) Consider the following part of a MapReduce program (using mrjob) where the mapper and reducer are as follows:
def mapper(self, _, line):
    yield "chars", len(line)
def reducer(self, key, values):
    yield key, f(values)
and f() is a given function.
Provide an example of a function f(), for which a combiner can be used to improve the efficiency of the program. Justify your choice of f() and explain the input and output of the combiner that uses the function f() you chose.
[6 marks]
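As an illustration only, the sketch below simulates the job from part (d) in plain Python (it does not use mrjob itself) with the assumed choice f = sum. The names group, run and splits are hypothetical helpers introduced for this sketch; in mrjob the combiner would instead be a method combiner(self, key, values) on the job class. Each inner list in splits stands in for the input of one map task.

```python
from itertools import groupby


def mapper(line):
    # Same logic as the mapper in the question: one ("chars", length) pair per line.
    yield "chars", len(line)


def combiner(key, values):
    # Assumed combiner with f = sum: pre-aggregates the lengths inside one map task.
    yield key, sum(values)


def reducer(key, values):
    # Reducer with the assumed f = sum.
    yield key, sum(values)


def group(pairs):
    # Hypothetical helper: groups (key, value) pairs by key, like the shuffle phase.
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for key, grp in groupby(ordered, key=lambda kv: kv[0]):
        yield key, [value for _, value in grp]


def run(splits, use_combiner):
    # Hypothetical driver: returns the final result and the number of pairs shuffled.
    shuffled = []
    for split in splits:  # each split stands in for one map task
        mapped = [kv for line in split for kv in mapper(line)]
        if use_combiner:
            mapped = [kv for k, vs in group(mapped) for kv in combiner(k, vs)]
        shuffled.extend(mapped)
    result = dict(kv for k, vs in group(shuffled) for kv in reducer(k, vs))
    return result, len(shuffled)


splits = [["big data", "spark"], ["hdfs", "flume", "sqoop"]]
without, pairs_without = run(splits, use_combiner=False)
with_comb, pairs_with = run(splits, use_combiner=True)
print(without, pairs_without)
print(with_comb, pairs_with)
```

Under these assumptions both runs produce the same total, while the run with the combiner sends one pair per map task to the reducer instead of one pair per line. This works because sum is associative and commutative, so partial sums can be summed again.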
Question Three
(a) Briefly describe three differences between Relational Database Management Systems and NoSQL databases.
[6 marks]
(b) For each of the following example applications, select an appropriate type of NoSQL database. Justify your choice.
(i) An application that needs to efficiently read key/value pairs representing customers, where the key is a customer's id and the value is a string containing customer information.
Key    Value
111    "store 1, 13:00, 10.5GBP, tip"
324    "12:00, 14:00, waiting time=10"
742    "part 3, 120GBP, 30 mins"
…      …
(ii) An application that needs to perform filtering operations (e.g., regular expressions) on customer information.
(iii) A social application that predicts the behaviour of customers based on their connections and interactions on a social network.
(iv) An application that needs to perform queries on data represented as a high-dimensional and sparse matrix (i.e., a matrix with many columns and many zero values).
(v) An application involving queries based on relationships between customers (e.g., to find groups of interconnected customers).
[10 marks]
(c) What is replication in MongoDB and what benefits does it offer?
[4 marks]
(d) What is a replica set in MongoDB? Briefly describe the components of a replica set in MongoDB and their main role.
[5 marks]
Question Four
(a) What are transformations and actions in Apache Spark? How do they differ?
[5 marks]
(b) Consider the following lines of pySpark code:
MyRDD=sc.parallelize([1,2,3,4,5,6,7,8,9,10])
MyRDD2=MyRDD.filter(lambda x: x*x+1>4)
MyRDD2.collect()
(i) Which of the above commands (if any) are NOT computed instantly?
[2 marks]
(ii) What benefits does this "lazy evaluation" offer?
[2 marks]
(iii) What is the result of executing these lines of code and why?
[2 marks]
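As a rough analogy for the lazy evaluation discussed in part (b) of Question Four, the sketch below uses only plain Python rather than Spark: the built-in filter() returns a lazy iterator, and list() plays the role that collect() plays in the pySpark code. The names predicate, evaluated and calls_before are introduced here for illustration and do not come from the paper.

```python
# Plain-Python analogy (not Spark): filter() is lazy, list() forces evaluation.
evaluated = []  # records the elements the predicate has actually been applied to


def predicate(x):
    # Same condition as the lambda in the pySpark code.
    evaluated.append(x)
    return x * x + 1 > 4


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
lazy = filter(predicate, data)   # builds the iterator; the predicate has not run yet
calls_before = len(evaluated)    # number of predicate calls made so far
result = list(lazy)              # forces evaluation, analogous to collect()
print(calls_before, result)
```

In this analogy no element is tested until the result is requested, and only the element 1 fails the condition, since 1*1+1 = 2 is not greater than 4.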
(c) Consider a stream of integers D=[4,2,2,1,3,5] and a hash function h(x)=(3x) mod 16.
(i) Assuming the most basic type of estimate (no improvements to the estimate), what is the estimate that is output by a single Flajolet-Martin (FM) sketch when it is applied to D, using h(x), and the hash values (i.e., outputs of h(x)) are stored using 4 bits? Justify your answer.
[10 marks]
(ii) Propose two ways to improve the estimate. Your answers can assume multiple Flajolet-Martin sketches.
[4 marks]
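The computation in part (c) of Question Four can be sketched as follows, under stated assumptions: the basic estimate is taken to be 2^R, where R is the maximum number of trailing zero bits over all hash values, and a hash value of 0 is treated as having 4 trailing zeros. Other presentations of the FM sketch use a bitmap and a correction factor, so this is one possible reading rather than the definitive one. The function names trailing_zeros and fm_estimate are introduced here for illustration.

```python
def h(x):
    # Hash function given in the question.
    return (3 * x) % 16


def trailing_zeros(value, bits):
    # Number of trailing zero bits in the `bits`-wide binary form of value.
    # Assumed convention: a value of 0 counts as `bits` trailing zeros.
    if value == 0:
        return bits
    count = 0
    while (value & 1) == 0:
        value >>= 1
        count += 1
    return count


def fm_estimate(stream, bits=4):
    # Assumed basic estimate: 2 ** R, with R the longest run of trailing zeros seen.
    longest = max(trailing_zeros(h(x), bits) for x in stream)
    return 2 ** longest


D = [4, 2, 2, 1, 3, 5]
for x in sorted(set(D)):
    # Show each distinct element, its 4-bit hash, and its trailing-zero count.
    print(x, format(h(x), "04b"), trailing_zeros(h(x), 4))
estimate = fm_estimate(D)
print("estimate:", estimate)
```

Under these assumptions the longest run of trailing zeros is 2, from h(4) = 12 = 1100, so the sketch outputs 4, while D contains 5 distinct elements.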
FINAL PAGE