7CCSMBDT: Coursework 1 example solutions
Task 1
(a) Transportation data can be characterized as Big Data, based on the following properties:
Volume: Systems that record passenger data generate a lot of data (e.g., systems that airlines use to record flyers’ data).
Velocity: The location of a passenger's car changes fast, and the collection of locations can be represented by a fast-changing data stream.
Variety: A passenger is associated with various types of data such as demographics (formatted as strings or integers), check-in data (formatted as strings, or GPS co-ordinates), pictures (e.g., scanned passport for a flyer), etc.
Veracity: A passenger's data may be noisy, due to limitations of collecting devices (e.g., low-quality scan of a passport), or it may contain erroneous values, due to typos (e.g., misspelled name).
Value: A passenger’s data is of great importance in analytics. For example, it can be used in counter-terrorist or anti-fraud applications, and in applications for pollution management and traffic control.
(b) Volume: Transportation data impose different storage and processing requirements depending on their size. For example, check-in data from an airline may be large and require storage on servers in the Cloud and processing by distributed computing frameworks.
Velocity: Data speed may be high, e.g., as the location of a car changes over time. We should be able to accommodate these high-speed data by having infrastructure such as servers and suitable software that allow us to process the data efficiently and also to account for changes in data speed.
Variety: The various types of data (see part a) require integration to be meaningfully analysed together and also possible data transformation operations. This requires special software for processing the data, as well as suitable software, such as NoSQL databases, for storing the data before and after integration in appropriate formats.
Veracity: The noisy and potentially erroneous data values should be removed, or replaced by suitable values. These require appropriate techniques and software.
Value: The value of data in transportation is typically large (see part a), especially when the issues regarding volume, velocity, variety, and veracity have been dealt with appropriately. It is challenging though to extract value from transportation data, because this requires storing and processing the data effectively (so that the data are not out of date) and also appropriate methods that can extract useful knowledge, such as descriptive statistics, from the data.
Task 2
(a) See the slides on Apache Sqoop for the export command in Lecture 2 (and in particular slide 20). The only difference is the --input-fields-terminated-by '\001' argument, which specifies '\001' as the separator between input fields. A sketch of such a command is given after the list below.
* mentioning the connection to the MySQL database hadoop, on server localhost, using JDBC
* mentioning the specific table "mytable"
* mentioning username
* mentioning password
* mentioning the output directory
* mentioning that 1 mapper is used
* mentioning the role of '\001'
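For reference, one possible form of such an export command is sketched below. The username, password, and HDFS directory (--export-dir, which names the HDFS directory holding the data to be exported) are placeholders, while the database, server, table, field separator, and number of mappers follow the points above.

sqoop export \
  --connect jdbc:mysql://localhost/hadoop \
  --username myuser \
  --password mypass \
  --table mytable \
  --export-dir /user/hadoop/mytable_data \
  --input-fields-terminated-by '\001' \
  -m 1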
(b) The benefits of Apache Sqoop are similar to those of Apache Flume discussed in Lecture 2 – only their justification differs slightly. For example, the benefits include:
(i) Reliability: It can store data into HDFS, which offers fault tolerance.
(ii) Scalable: It can store the data in parallel, using multiple mappers.
(iii) Enables high performance data transfer: It uses direct connectors that are specific for the RDBMS that stores the data.
(iv) Manageable: It automates the data export to an RDBMS, by writing the INSERT statements.
(v) Customizable: It works with different RDBMSs, it allows specifying which data to export, supports updates, etc.
(vi) Low cost installation and maintenance: Apache License – free software
(vii) Low cost operation: It is easy to use, it does not require special software or hardware, or programmers (e.g., to write custom code for transfers).
Task 3
A function f that computes the average of its input numbers cannot be used in a combiner, because a combiner may be applied to arbitrary groups of a key's values, so the function must not depend on the grouping, i.e., it must be associative; averaging is not. For example, for the input numbers {1,2,3}, f({f({1,2}),3}) = f({1.5,3}) = 2.25, while f({1,f({2,3})}) = f({1,2.5}) = 1.75, and since 2.25 is not equal to 1.75, f is not associative (neither grouping gives the true average f({1,2,3}) = 2). Note that violation of either associativity or commutativity suffices.
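The grouping argument above can be checked quickly in plain Python, independently of MapReduce:

def f(values):
    # Average of a list of numbers
    return sum(values) / len(values)

print(f([f([1, 2]), 3]))   # 2.25
print(f([1, f([2, 3])]))   # 1.75
print(f([1, 2, 3]))        # 2.0, the true average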
The following script takes as input a file where each line contains an integer, and it outputs the correct average of the input integers if the combiner is commented out. Otherwise, it produces wrong output.
from mrjob.job import MRJob

class task3c(MRJob):

    def mapper(self, _, line):
        # Send every input integer to a single reducer, under the key None
        yield None, line

    def combiner(self, _, list_of_values):
        # Averaging each partial group here is what breaks the final result
        numbers = [int(number) for number in list_of_values]
        yield None, float(sum(numbers)) / float(len(numbers))

    def reducer(self, _, list_of_values):
        numbers = [int(number) for number in list_of_values]
        yield "avg: ", float(sum(numbers)) / float(len(numbers))

if __name__ == "__main__":
    task3c.run()
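Assuming the script above is saved as task3.py and the input integers are stored one per line in a file such as numbers.txt (both file names are placeholders), it can be run locally with python task3.py numbers.txt; commenting out the combiner method yields the correct average, while leaving it in yields the wrong one.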
Task 4
The mapper distinguishes between the input lines of the two files, based on the if and else statements. If the first element in split is an integer, the line comes from id_age_occ.csv; otherwise it comes from id_educ_marital.csv. For lines from id_age_occ.csv, we add a symbol 'A' (any other symbol except 'B' would do) to the value of the yield inside the if. For lines from id_educ_marital.csv, we add a symbol 'B' (any other symbol except 'A' would do) to the value of the yield inside the else. The reducer then creates the joined tuples, record by record: each joined record contains the id, then the attributes from the first file (whose values are tagged with 'A'), and then the attributes from the second file (whose values are tagged with 'B').
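A minimal sketch of such a reduce-side join in mrjob is given below. The class name and the column layout of the two CSV files are assumptions: both files are taken to start with the id, so the test that tells them apart is done on the second field (an integer age in id_age_occ.csv versus a non-integer education level in id_educ_marital.csv), and it may need adapting to the actual data.

from mrjob.job import MRJob

class Task4Join(MRJob):   # hypothetical class name

    def mapper(self, _, line):
        fields = line.strip().split(',')
        # Tag each record with the file it came from, keyed by the id.
        # Assumption: the second field of id_age_occ.csv (age) is an
        # integer, while that of id_educ_marital.csv (education) is not.
        try:
            int(fields[1])
            yield fields[0], ['A'] + fields[1:]
        except ValueError:
            yield fields[0], ['B'] + fields[1:]

    def reducer(self, key, tagged_values):
        # Assuming each id appears once per file, collect the attributes
        # from each side and emit the joined record:
        # id, attributes of the first file, then those of the second file.
        a_attrs, b_attrs = [], []
        for value in tagged_values:
            if value[0] == 'A':
                a_attrs = value[1:]
            else:
                b_attrs = value[1:]
        yield key, a_attrs + b_attrs

if __name__ == '__main__':
    Task4Join.run()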