Assignment 1 – XML Data

Assessment Summary

Weighting: 15%

Due Date: 11 PM, Sun 15 Apr

Group Assignment

Submission

Right-click on the folder DBE-A1 to zip the folder and submit the zip file to the Assignment 1 submission area on the course website.

Assignment Overview

This assignment enables you to apply your knowledge of XML, DTD, XQuery, JSON, and relational databases to an application. The objectives include:

Data modelling in the JSON model
Data modelling in XML
Data modelling in the relational database
Conversion of data formats
Understanding the modelling power of different models

Before starting, you should complete all the tutorials and practicals on XML Documents, DTD, XML Schema and XQuery.

You complete the assignment in a group of three students, whom you choose yourself. If your group has fewer members or you work on your own, you must still complete all tasks. External students may choose to collaborate over Skype, which has whiteboard and desktop-sharing functions. Submission by one group member is sufficient. The people.txt file should include the email IDs of all group members.

To start:

Unzip the DBE-A1.zip file to a convenient location; this will give you the DBE-A1 folder. A good option for the location is cloud storage such as Dropbox or Google Drive, which makes sharing easy. At the same time, you should back up the folder periodically on your local disk, so that if one member deletes all files and the deletion propagates to everyone, the work is not lost. You may also periodically zip the folder and send it to your own email address as a backup. If a laptop fails or is stolen, or if there is a dispute, the email backup can cover you.

In the DBE-A1 folder, you see some empty files as shown below. You will gradually put your answers in these files. The files with a green tick are input files.

Modify the people.txt file to add your group members’ details. Note that this file is important, especially when there is a dispute among group members. Generally, the marker will give all group members the same marks. However, when the quality of the work is significantly inconsistent, or in the case of a dispute, you may be given marks based on how well your own tasks are done.

Application and requirements

DBLP data describes publications of academic results by researchers in computer science and information technology. The data can be searched and displayed at http://dblp.uni-trier.de/ so that people can look up references, investigate progress in particular research directions, and calculate statistics of many kinds. The whole DBLP data set is huge (http://dblp.uni-trier.de/xml). In this assignment, you have a small portion of the data from http://www.cs.washington.edu/research/xmldatasets/www/repository.html. Based on this data, you will complete the following tasks. The data file is dblp.xml, which is in the gz file dblp.xml.gz. This dblp.xml.gz file is in turn inside the zip file DBE-A1.zip. You can open the zip and gz files using 7zip or similar tools.

Task 1: Choose 10 publication entries of the given document dblp.xml and convert the elements to the JSON format.

Task 2: Derive a DTD from your chosen.xml and compare the derived DTD with the given DTD.

Task 3: Transform the XML document chosen.xml to relational tables.

Task 4: Discuss the differences in modelling power between XML and relational tables.

Task 5: Automate Task 3 using XQuery.

The detailed requirements of the tasks are given below.

Task 1: Choose 10 publication entries in the given document dblp.xml and convert the elements to the JSON format.

Publication entries in dblp.xml are the child elements of the root element <dblp>. Browse the child elements of the root, copy the data for 10 publications of various types (books, articles, theses, inproceedings, etc.), and store the chosen data in the file chosen.xml as a well-formed XML document. (1 mark)

Convert the data in your chosen.xml into the JSON format and store the result in the file chosen.json. (2 marks)
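The conversion described above can be sketched in Python as follows. This is only an illustration under assumptions: the sample record, the function names `element_to_dict` and `xml_to_json`, and the `@`-prefixed field names are all hypothetical choices, not a required format. Real entries from dblp.xml may also contain nested markup or entity references that this sketch does not handle.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample record; the values are invented for illustration.
SAMPLE_XML = """<dblp>
  <article key="journals/x/Smith99">
    <author>A. Smith</author>
    <author>B. Jones</author>
    <title>An Example Title</title>
    <year>1999</year>
  </article>
</dblp>"""


def element_to_dict(elem):
    """Convert one publication element into a JSON-friendly dict."""
    # The element name is kept under "@element" (an assumed convention).
    record = {"@element": elem.tag}
    # XML attributes become fields prefixed with "@" to avoid name clashes.
    for name, value in elem.attrib.items():
        record["@" + name] = value
    # Repeated child elements (e.g. several authors) are collected into a list.
    for child in elem:
        text = (child.text or "").strip()
        if child.tag in record:
            if not isinstance(record[child.tag], list):
                record[child.tag] = [record[child.tag]]
            record[child.tag].append(text)
        else:
            record[child.tag] = text
    return record


def xml_to_json(xml_text):
    """Convert every child of the root element and return a JSON string."""
    root = ET.fromstring(xml_text)
    return json.dumps([element_to_dict(pub) for pub in root], indent=2)


print(xml_to_json(SAMPLE_XML))
```

One design question this sketch leaves open is whether a child that occurs only once (a single author, for example) should still be stored as a one-element array so that every record has the same shape.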

Task 2: Derive a DTD from your chosen.xml and compare the derived DTD with the given DTD.

Based on the data of the 10 selected publications in your chosen.xml file, derive a DTD and save it in the file chosen.dtd. Your DTD must be derived from your data. If your data does not justify your DTD, the deduction will be heavy. (1.5 marks)
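One way to check that a DTD is justified by the data is to count how often each child element occurs under each publication element and read off the matching occurrence indicator (none, `?`, `*` or `+`). The Python sketch below does this for a hypothetical two-record sample; the function name `occurrence_indicators` is an assumption for illustration, and the sketch ignores element order and attributes, which a full DTD must also describe.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical sample data; the values are invented for illustration.
SAMPLE_XML = """<dblp>
  <article key="a1">
    <author>A. Smith</author>
    <author>B. Jones</author>
    <title>First Title</title>
    <year>1999</year>
  </article>
  <article key="a2">
    <author>C. Lee</author>
    <title>Second Title</title>
  </article>
</dblp>"""


def occurrence_indicators(xml_text):
    """Map each (parent tag, child tag) pair to a DTD occurrence indicator."""
    root = ET.fromstring(xml_text)
    # For every parent tag, keep one Counter of child tags per instance.
    per_parent = defaultdict(list)
    for parent in root:
        per_parent[parent.tag].append(Counter(child.tag for child in parent))

    indicators = {}
    for parent_tag, counters in per_parent.items():
        child_tags = sorted(set().union(*counters))
        for child_tag in child_tags:
            occurrences = [counter[child_tag] for counter in counters]
            lowest, highest = min(occurrences), max(occurrences)
            if lowest == 0:
                # Missing in at least one instance: optional or zero-or-more.
                indicator = "?" if highest == 1 else "*"
            else:
                # Present in every instance: exactly one or one-or-more.
                indicator = "" if highest == 1 else "+"
            indicators[(parent_tag, child_tag)] = indicator
    return indicators


for (parent_tag, child_tag), indicator in occurrence_indicators(SAMPLE_XML).items():
    print(parent_tag, child_tag + indicator)
```

With only 10 publications, the indicators reflect that small sample, which is one reason a derived DTD can differ from the given dblp.dtd.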

Then fill in the table for Task 2 in the file “Report.docx” to show how your DTD in chosen.dtd differs from the given DTD in dblp.dtd, and to explain why yours is different. When you describe a difference, you must reference specific element names and constructs of your DTD. For example, you may say: “in element Person, the ‘*’ following Address is extra”. If your DTD has many differences, describe only 5 of them, each of a different type. (1.5 marks)

Task 3: Transform data in chosen.xml to relational tables.

Design relational tables by showing table schemas and constraints so that the XML data can be stored without redundancy and without null values. Redundancy means the repetition of non-key values. The format of a table schema is shown below. You put all your table schemas in the Report.docx file: (2 marks)

TableName1 (attribute1, attribute2, attribute3, …)

PK = (attribute1, attribute3)

FK1 = (attribute1) references TargetTable (TargetAttribute)

FK2 = (attribute2) references TargetTable (TargetAttribute)

TableName2 (attribute1, attribute2, attribute3, …)

PK = …

FK1 = …

Keep the table names short: no more than 12 characters in length. The allowed characters are the underscore ‘_’, letters and digits.
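The naming rule above can be checked mechanically. The following Python sketch assumes that "letters" means the ASCII letters a-z and A-Z; the function name `is_valid_table_name` and the sample names are hypothetical.

```python
import re

# 1 to 12 characters, each an ASCII letter, a digit or an underscore (assumed).
TABLE_NAME = re.compile(r"[A-Za-z0-9_]{1,12}")


def is_valid_table_name(name):
    """Return True if the whole name satisfies the naming rule."""
    return TABLE_NAME.fullmatch(name) is not None


for candidate in ["pub_author", "publication_author", "pub-author"]:
    print(candidate, is_valid_table_name(candidate))
```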

Transform the data in chosen.xml into the table format conforming to the above design and store the transformed data in csv files, one csv file per table. Each csv file’s base name must be the same as its table name. (2 marks)
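As an illustration of the transformation, the Python sketch below splits a hypothetical sample into two tables, `pub(pubkey, title)` and `pub_author(pubkey, author)`, so that a publication with several authors does not repeat its title. The table names, column names and sample data are assumptions for illustration only, not a recommended design for your own chosen.xml.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; the values are invented for illustration.
SAMPLE_XML = """<dblp>
  <article key="a1">
    <author>A. Smith</author>
    <author>B. Jones</author>
    <title>First Title</title>
  </article>
  <book key="b1">
    <author>C. Lee</author>
    <title>Second Title</title>
  </book>
</dblp>"""


def build_tables(xml_text):
    """Return a dict mapping each table name to its rows (header row first)."""
    root = ET.fromstring(xml_text)
    tables = {
        "pub": [("pubkey", "title")],
        "pub_author": [("pubkey", "author")],
    }
    for pub in root:
        key = pub.get("key")
        tables["pub"].append((key, pub.findtext("title", default="")))
        # One row per author keeps the title from being repeated.
        for author in pub.findall("author"):
            tables["pub_author"].append((key, author.text or ""))
    return tables


def to_csv(rows):
    """Serialize rows to CSV text; the csv module handles quoting of commas."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


for name, rows in build_tables(SAMPLE_XML).items():
    # In the assignment layout each table would be saved as <name>.csv;
    # this sketch only prints the content.
    print(name + ".csv")
    print(to_csv(rows))
```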

Task 4: Discuss the differences in modelling power between XML and relational tables.

Modelling power means the ability to represent concepts and relationships. Write dot points to explain the differences in data modelling between XML and the relational model. You must use an example for each point, and the example must reference the data in your chosen.xml and the data in your csv files. (3 marks; each well-written point is worth 1 mark. Points without a good example get no marks.)

Task 5: Write XQuery files to automate Task 3.

Write XQuery transformations that transform the data in dblp.xml into data for the tables designed in Task 3. Each XQuery file produces the csv data for one table, so if you designed m tables in Task 3, you will have m XQuery files. Each query file will be executed in the following way in a command-line environment:

java -cp BaseX841.jar org.basex.BaseX tablename1.xql > tt.csv

where tablename1.xql is your query file, and tt.csv is an arbitrary csv file name.

When the marker opens the tt.csv file, they should see a table.

Fully working code is worth 4 marks.

The query filename must be the same as the table name.

Caution:

Please use only the filenames specified above for writing answers. Otherwise, the testing scripts will not work on your answers and you will lose marks.
To create the zip file, right-click on the DBE-A1 folder itself. Otherwise, the zip file may contain extra layers of folders, which again hinders the work of the testing scripts.
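Before submitting, the folder layout of a zip file can be inspected with a short script. The Python sketch below is only an illustration: the function name `top_level_entries` is hypothetical, and the archive is built in memory for the demonstration rather than read from a real submission.

```python
import io
import zipfile


def top_level_entries(source):
    """Return the set of top-level names in a zip (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        return {name.split("/")[0] for name in archive.namelist()}


# Build a small in-memory archive that mimics the expected layout.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("DBE-A1/people.txt", "member details go here")
    archive.writestr("DBE-A1/chosen.xml", "<dblp/>")
demo.seek(0)

# A correctly zipped submission should have DBE-A1 as its only top-level entry.
print(top_level_entries(demo) == {"DBE-A1"})
```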

Marking Criteria

Marking will consider filenames, correctness of results, their formats, and quality of writing.

If you attempt only some tasks and your results are correct, you get the marks for those parts.

Plagiarism

If your solution, or part of it, is not written by you (your idea and your own typing), you commit plagiarism. The investigation will be handled by an academic integrity officer and may involve an oral or written test. If you commit plagiarism, you will be penalized and a record will be kept in your file.

Extensions

Extensions for assignments are available under the following conditions:

illness or emergency
permanent or temporary disability, or
compassionate grounds

In all cases, documentary evidence (e.g. medical certificate, road accident report, obituary) must be presented to the Course Coordinator. A medical certificate produced on or after the due date will not be accepted unless you are hospitalized.

If you apply for an extension within 24 hours before the deadline, you must see the course coordinator in person unless you are in an emergency situation, such as being admitted to a hospital.

Late Penalties

Unless you have an extension, a late submission will incur a penalty of 30% for each day (or part of a day) that it is late.