Assignment 2 – Hadoop and HBase

Assessment Summary

Weighting: 15%

Due Date: 11pm (23:00) Sun 3 Jun

Group Assignment

Submission

A Word document containing all your answers

Assignment Overview

In this assignment, you write a report to show how Hadoop and HBase can be used to store the dblp data, and how queries retrieving data can be answered in these systems at the conceptual level. This type of report supplies critical information to an organization when a decision on whether a new type of system should be employed is needed.

The Week 8 slides are critical to this assignment. Make sure you understand them. The contents involve a lot of technical detail, and acquiring a good understanding requires serious effort. You may also use the recommended web videos and web tutorials (Week 8 folder), and any other resources or books that you can find.

You complete the assignment in a group of 3 students whom you choose yourself. If you decide to do it individually, you must complete all parts. External students can use Skype etc. for collaboration.

If you do the assignment in a group, only one of you submits. Make sure that your answer filename contains the email IDs of all group members, the names, and the tasks to which they make major contributions.

Application and Requirements

You have used dblp data in Assignment 1. This assignment continues to use dblp data. The assignment aims to explore the possibilities of storing some dblp data in Hadoop and HBase systems respectively. The assignment has three parts, presented below in order of difficulty.

Part 1 [4 marks]

Choose 6 publications of 3 types (2 publications of each type) from the given dblp data in the file dblp.xml to show how they are stored. These publications are to be stored first in a Hadoop system with 2 racks (R1 and R2), each having 2 nodes (nodes N1 and N2 belong to R1, and nodes N3 and N4 belong to R2).

Partition the publications you choose into blocks and store the blocks on the nodes. (Reference slide is Slide 7 of L7-Hadoop Slides). You decide the number of blocks, the publications in each block, the locations of the blocks on the nodes and their replications.

Describe the reasons/logic that you follow to get the partition, the allocations and the replications.
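The placement logic can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the required answer: it assumes a replication factor of 3 and the default HDFS placement policy (first replica on one node, second replica on a node in the other rack, third replica on a different node in that same other rack). The block names B1 to B3 and the round-robin choice of the first node are hypothetical; your own design may use a different number of blocks and a different rationale.

```python
# Cluster layout from the assignment: 2 racks, 2 nodes each.
RACKS = {"R1": ["N1", "N2"], "R2": ["N3", "N4"]}


def rack_of(node):
    """Return the rack that a node belongs to."""
    for rack, nodes in RACKS.items():
        if node in nodes:
            return rack
    raise ValueError(f"unknown node: {node}")


def place_replicas(first_node):
    """Assumed policy: replica 1 on first_node, replicas 2 and 3 on two
    different nodes of the other rack."""
    local_rack = rack_of(first_node)
    remote_rack = next(r for r in RACKS if r != local_rack)
    second, third = RACKS[remote_rack][:2]
    return [first_node, second, third]


def place_blocks(block_ids):
    """Spread first replicas round-robin over all nodes (an assumption)."""
    nodes = [n for rack_nodes in RACKS.values() for n in rack_nodes]
    return {
        block: place_replicas(nodes[i % len(nodes)])
        for i, block in enumerate(block_ids)
    }


placement = place_blocks(["B1", "B2", "B3"])  # hypothetical block names
for block, replicas in placement.items():
    print(block, replicas)
```

With this policy every block survives the loss of any single node and of any single rack, which is the kind of justification the description should make explicit.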

Part 2 [6 marks]

Choose a query from the following list. Different queries have different complexity factors, leading to different marks. For example, if you do the first one well, you get 6 × 0.7 = 4.2 marks for this part.

The total number of distinct authors of all publications (complexity factor 0.7)

The total number of distinct authors of each type of publication (0.8)

The maximum number of authors per publication in all publications (0.9)

The maximum number of words per publication title in all publications (0.9)

The maximum number of attributes per publication in all publications (0.9)

The average number of authors in all publications (1.0)

The average number of words in all publication titles (1.0)

For the selected query, write an algorithm for the mapper and an algorithm for the reducer. The reference slides are Hadoop Slides 22-25. To write correct algorithms, you must understand how MapReduce works (Slides 17-25). Note that in Hadoop, data is stored as a file, like a file in Windows or Mac. Assume that a library is available for you to get the DOM tree of an XML document. You can use ‘/’ and ‘//’ to traverse the tree as you did in XQuery.
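To make the mapper/reducer split concrete, here is a minimal in-process sketch in Python. It deliberately answers a query that is not in the list above (the number of publications of each type), so it shows the shape of a solution rather than a solution. The two XML blocks, their keys, titles and author names are hypothetical placeholders, `xml.etree.ElementTree` stands in for the assumed DOM library, and the `shuffle` function only imitates the grouping step that Hadoop performs between map and reduce. The sketch also assumes each block is a well-formed XML fragment.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical blocks; in the assignment these come from your Part 1 design.
BLOCK_1 = """<dblp>
  <article key="a1"><author>Author X</author><title>Title A</title></article>
  <inproceedings key="c1"><author>Author Y</author><title>Title B</title></inproceedings>
</dblp>"""
BLOCK_2 = """<dblp>
  <article key="a2"><author>Author Z</author><title>Title C</title></article>
</dblp>"""


def mapper(block_xml):
    """Emit one (publication type, 1) pair per publication in the block."""
    root = ET.fromstring(block_xml)
    for publication in root:  # direct children, like /dblp/*
        yield publication.tag, 1


def shuffle(pairs):
    """Group mapper output by key, imitating Hadoop's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Receive one key with its list of values and emit one result."""
    return key, sum(values)


mapped = [pair for block in (BLOCK_1, BLOCK_2) for pair in mapper(block)]
result = dict(reducer(key, values) for key, values in shuffle(mapped).items())
print(result)
```

The intermediate values `mapped` and `shuffle(mapped)` correspond to the "results of the mapper" and the "input key-list pairs" of the reducer.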

Use a table to show how your algorithms run in the MapReduce architecture against your data. You show the nodes on which the mapper runs, the input blocks and the results of the mapper, the nodes on which the reducer runs, the input key-list pairs and results of the reducer.

Part 3 [6 marks]

In this part, you design an alternative way to store the chosen publications. This time, you store them in HBase. In HBase, data is logically modeled in HTables. In other words, data is transformed and then stored. This is different from Hadoop where data is stored in its raw format in a file.

Design HTable(s) to store the publications you chose in Part 1. The reference slide is L8-HBase Slide 17. What you need to decide includes the row key, the number of column families, and the information to be stored in each column family. Draw a table like Slide 17 and put your data in. The aim of your design is for the stored data to be used to answer the queries in Part 2 more efficiently. There are many possibilities for this design. A solution is good as long as it can be justified.

Describe the logic that you follow to get the design. This is your justification of the design.

Following Slide 18, you draw the physical model for your HTable(s). You can ignore the timestamp. This is a reasonable practice because ignoring the timestamp simply means that we do not allow value versions, which is often the case in many applications.
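As a rough illustration of the step from the logical table to a physical layout, the Python sketch below flattens a nested logical table into per-column-family lists of cells sorted by row key and column name, with the timestamp ignored. It relies on the general HBase behaviour that each column family is stored separately and that cells are kept in sorted order; the exact drawing expected is the one on Slide 18. The row keys, the column families `info` and `people`, and all values are hypothetical and are not a recommended design.

```python
# Hypothetical logical HTable: row key -> column family -> column -> value.
LOGICAL = {
    "row2": {
        "info": {"title": "Title B", "year": "2001"},
        "people": {"author1": "Author Y"},
    },
    "row1": {
        "info": {"title": "Title A", "year": "2000"},
        "people": {"author1": "Author X", "author2": "Author Z"},
    },
}


def to_physical(table):
    """Return {column family: [(row key, 'cf:column', value), ...]} with
    cells ordered by row key and then by column name (no timestamps)."""
    stores = {}
    for row_key in sorted(table):
        for family, columns in table[row_key].items():
            for column in sorted(columns):
                cell = (row_key, f"{family}:{column}", columns[column])
                stores.setdefault(family, []).append(cell)
    return stores


physical = to_physical(LOGICAL)
for family, cells in physical.items():
    print(family)
    for cell in cells:
        print("  ", cell)
```

Note that a row with two authors simply contributes two cells to the `people` store; no empty cells are stored for rows that lack a column.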

The HBase physical model is often implemented as a type of multi-level tree map. This type of implementation makes aggregate queries difficult to answer, so we often need to write programs to answer aggregate queries in HBase. Choose a query from Part 2 (complexity factors still apply) and write an algorithm to answer the query. If m represents a map (an implementation of an HTable), the possible interface functions of a map used in answering the queries are:

keys() returns all (row) keys of the HTable.
get(k) returns the values for all the column families of the key k.
get(k.cf) returns the values for the specific column family cf of the key k.
get(k.cf.a) returns the value(s) for the attribute a in the specific column family cf for the key k.

You can use these interfaces in your algorithm if necessary.
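A minimal Python sketch of such a map is given below, under several assumptions. The class name `HTableMap` is hypothetical, the dotted forms get(k.cf) and get(k.cf.a) are modelled as optional arguments `get(k, cf)` and `get(k, cf, a)`, and the rows, the column family `info` and the attribute `year` are placeholder data. The example query (number of publications per year) is intentionally not one of the Part 2 queries; it only shows how an aggregate is computed by iterating over keys().

```python
class HTableMap:
    """Hypothetical in-memory stand-in for the map m described above."""

    def __init__(self, rows):
        # rows: row key -> column family -> attribute -> value
        self._rows = rows

    def keys(self):
        """Return all row keys in sorted order."""
        return sorted(self._rows)

    def get(self, k, cf=None, a=None):
        """get(k), get(k, cf) and get(k, cf, a) model get(k), get(k.cf)
        and get(k.cf.a) respectively."""
        row = self._rows[k]
        if cf is None:
            return row
        family = row[cf]
        return family if a is None else family[a]


m = HTableMap({
    "row1": {"info": {"title": "Title A", "year": "2000"}},
    "row2": {"info": {"title": "Title B", "year": "2001"}},
    "row3": {"info": {"title": "Title C", "year": "2000"}},
})


def publications_per_year(table):
    """Aggregate by scanning every row key and reading one attribute."""
    counts = {}
    for k in table.keys():
        year = table.get(k, "info", "year")
        counts[year] = counts.get(year, 0) + 1
    return counts


print(publications_per_year(m))
```

The full scan over keys() is what makes aggregation costly in this model, which is why the HTable design should keep the attributes needed by the query easy to reach.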

[Optional, 3 bonus marks]

If you install Hadoop and implement the Part 2 algorithms, you get a maximum of 3 bonus marks depending on correctness. To get the bonus marks, paste your code in the report and come to Jixue (Jerry) Liu’s office to demonstrate your work. External students can organize a demonstration in the virtual room by appointment.

Marking Criteria

Marks are awarded based on

the correctness of the code and whether it works for its aim,
the quality of the description and whether nodes, racks, blocks, variables etc. are referenced specifically when necessary,
whether the logic and concepts are clear and correct,
whether the writing as a whole is correct.

Plagiarism

If your solution, or part of it, is not written by you (your idea and your own typing), you commit plagiarism. The investigation will involve an oral or written test. If you commit plagiarism, you will be penalised and a record will be kept in your file.

Extensions

Extensions for assignments are available under the following conditions:

permanent or temporary disability, or
compassionate grounds

In all cases, documentary evidence (e.g. valid medical certificate, road accident report, obituary) must be presented to the Course Coordinator. A medical certificate produced on or after the due date will not be accepted unless you are hospitalized.

If you apply for an extension within 24 hours before the deadline, you must see the Course Coordinator in person unless you are in an emergency situation such as being admitted to hospital.

Late Penalties

Unless you have an extension, late submission will incur a penalty of a 30% deduction for each day (or part of a day) that the submission is late.