School of Information Technologies
COMP5338: Advanced Data Models 2.Sem./2018
Project: NoSQL Schema Design and Query Workload Implementation
Group Work: 20%
Introduction
11.09.2018
In this assignment, you will demonstrate that you are able to work with both MongoDB and Neo4j in terms of designing suitable schema and writing practical queries. You will also demonstrate that you understand the strength and weakness of each system with respect to certain query workload features. You will be given a real world data set in Question and Answer area and a set of target queries. The target queries include very basic OLTP type queries and analytic queries.
You are asked to design two storage options: one with MongoDB as the storage system and the other with Neo4j as the storage system. For each option, you need to store a full copy of the data in the system and implement a subset of the target queries.
Data set
The data that you will use is the latest dump (publication date: 2018-06-05) of the Artificial Intelligence Stack Exchange question and answer site (https://ai.stackexchange. com/). The dump is released and maintained by stackexchange: https://archive.org/ details/stackexchange. The original dump contains many files in XML format. The as- signment uses a subset of the data stored in four tsv files. The data files and the description (readme.txt) can be downloaded from Canvas.
The assignment data set contains the following files:
- Posts.tsv stores information about post, each row represents a post, which could be a question or an answer
- Users.tsv stores user’s profile, each row represents a user
- Votes.tsv stores detailed vote information about post, each row represents a vote,
including the vote type, the date this vote is made and a few other information
- Tags.tsv contains summary of tag usage in this site.
1
Target Queries
Simple queries
- [SQ1] Find all users involved in a given question (identified by id) and their respective profiles including the creationDate, DisplayName, upVote and DownVote. Note, we are only interested in existing users who either posted or answered the question. You may ignore users that do not have an id.
- [SQ2] Assuming each tag represents a topic, find the most viewed question in a given topic.1
Analytic queries
- [AQ1] Given a list of topics (tags), find the question easiest to answer in each topic.
We measure easiness using the time it took to receive an accepted answer. The shorter the time, the easier the question. For instance, the question easiest to an- swer in topic ‘neural-networks’ is question with id 1: What is ‘‘backprop’’?. The question was posted on 2016-08-02T15:39:14 and received an accepted answer on 2016-08-02T15:40:24. It took only a little over 1 minute to receive an accepted answer.
- [AQ2] Given a time period as indicated by starting and ending date, find the top 5 topics in that period.
We rank a topic by the number of users that has either posted a question in that topic or has answered a question in that topic. This would help us to understand the trending topics in different periods of time. For instance, the trending topics and respective user numbers in August, 2018 as indicated by start date 2018-08-01T00:00:00 and end date 2018-08-31T00:00:00 are:
User Number neural-networks 65 machine-learning 44 deep-learning 39 reinforcement-learning 28 convolutional-neural-networks 24
- [AQ3] Given a topic, find the champion user and all questions the user has answers accepted in that topic.
We define a champion user as the one who has the most answers accepted in a given topic. For instance, the champion user of topic ‘deep-learning’ is 4398 and 1847. Both users have 9 answers being accepted in this topic. Your result may show either of the two users. Below is an example of the questions user 4398 has answers accepted:
1For any query, if there are more results than the specified limit, e.g. two questions have the highest view count but we only want one, you can return any of the valid results within the limit
2
Topic
Id Title
3402 Is there ever a need to combine deep learning … 3453 What are the pros and cons of using a spatial … 4080 Is it necessary to clear the replay memory reg… 4085 Policy gradients for multiple continuous actions 4167 Reinforcement learning for robotic motion plan… 4346 Is the new Alpha Go implementation using Gener… 4425 Deep Learning – Classification or Regression A… 4740 I don’t understand the ”Target Network” in Dee… 5185 What is the purpose of the GAN
• [AQ4] Recommend unanswered questions to a given user.
Some question may have been posted for a period of time but may not have an accepted answers yet. We refer to such question as unanswered question. We would like to recommend unanswered questions to potential answerers. For any user with n answers accepted in a certain topic with n greater than or equal to a threshold value α, we consider the user a potential answerer of unanswered questions in that topic.
For instance, given a user with id 4398, an α value 4 and a cutoff date of question creation as 2018-08-30T00:00:00. We will find user 4398 is a potential answerer in the following topics: reinforcement-learning, deep-learning, machine-learning and ai-design. The user has 10 answers accepted in reinforcement-learning area, 9 in deep- learning area, 5 in machine-learning area and 4 in ai-design area. The most 5 recent unanswered question in those topics that are posted before 2018-08-30T00:00:00 are:
Id Title
7755 How to implement a constrained action space in… 7737 creating application to transform human comput… 7736 In imitation learning, do you simply inject op… 7734 AI composing music
7727 How is it possible to teach a neural network t…
They should be the questions recommended to user 4398.
• [AQ5] Discover questions with arguable accepted answer.
CreationDate 2018-08-29 16:04:16.113 2018-08-28 00:17:55.907 2018-08-27 18:41:56.223 2018-08-27 16:13:16.433 2018-08-27 10:18:17.893
Users can give upVote to both question and answer. Usually the accepted answer of a question receives the highest number of upVote among all answers of this question. In rare case, another answer(s) may receive higher upVote count than the upVote count of the accepted answer. In this query, you are asked to discover such questions whose accepted answer has less upVote count than the upVote count of one of its other answers. Note We are only interested in questions with upVote count greater than a given threshold value α. With high α value, you are likely to get an empty set as the result. A reasonable α value would be between 5-15. Your result should show the
3
question id, the upVote count of its accepted answer, and the highest upVote count received by other answers.
• [AQ6] Discover the top five coauthors of a given user.
Consider all users involved in a question as co-authors, for a given user, we rank the coauthors by the number of questions this user and the coauthor appear together either as originator or answerer. For instance, user 4398 has the following top co-authors: 1671(5), 11571(4), 9161(4), 4302(3) and 6019(3). User 4398 and user 1671 appeared together in five questions; user 4398 and user 1571 appeared together in 4 questions. Your result should include both the user id and the number of questions the pair appeared together (co-authored).
Tasks
Your tasks include:
- Decide on the query workload to implement in each option
Among all eight target queries, you are required to implement five for each storage system option. These include the two simple queries: [SQ1] and [SQ2] and three analytic queries. Note that for any analytic query, you need to implement it either in MongoDB or Neo4j. Below is an example of valid query workload combination:
– MongoDB query workload: {SQ1, SQ2, AQ1, AQ2, AQ3} – Neo4j query workload: {SQ1, SQ2, AQ4, AQ5, AQ6}
Below is an example of not valid combination, because AQ5 is not implemented in any system:
– MongoDB query set: {SQ1, SQ2, AQ1, AQ2, AQ3} – Neo4j query set: {SQ1, SQ2, AQ3, AQ4, AQ6}
- Schema Design for MongoDB and Neo4j
For each storage option design a proper schema that would best support the workloadand data set feature.
For each schema version, make sure you utilize features of the storage system such as indexing, aggregation, ordering, filtering and so on. Please note that your schema may deviate a lot from the relational structure of the original data set. You will not get point if you present a schema that is an exact copy of the relational structure in the original data set.
You may discard data that are not useful or not involved in the query. You may du- plicate data if you find that helps with certain queries. You may run preprocessing
4
to reorganize the data to allow easy importing. You should avoid running preprocess- ing to generate partial or final results for any target query. You need to justify any preprocessing you have employed. The justification should include benefits, potential performance cost as well as how often the preprocessing should be carried out. Keep in mind that the storage system is also used to support online live transaction. Any activity such as posting or voting will send one or more write queries to the system. You should not use preprocessing that would cause significant delay to regular queries. The data set is stored with utf-8 encoding. Please ensure that your preprocessing script would save data back in the same encoding. This is important for the ‘posts.tsv’ and ‘users.tsv’ files, which contain non-ascii characters.
• Query Design and Implementation
Load the full data set (after some necessary preporocessing) into both systems and set up proper indexes that will be used by the target queries. Design and implement the chosen query workload for each system. You may implement a query using the shell command (e.g. MongoDB shell or Cypher query) alone, or a combination of JavaScript and shell commands in the case of MongoDB or as Python/Java program. In case that a programming language is used, make sure that you do majority of the processing on the database side. The client side processing should be restricted to activities like collecting output from previous database query and send the output as is to the subsequent one. In particular, you should avoid sorting, filtering and grouping query output on the client side.
Deliverable and Submission Guideline
This is a group project, each group can have up to 2 students. Each group needs to produce the following:
• A Written Report .
The report should contain five sections. The first section is a brief introduction of the project. Section two and three should cover a storage option each. Section four should provide a comparison and summary. Section five should be an appendix for sample results.
There is no point allocated on section one. It is included to make the report complete. So please keep it short.
Section two and three should contain the following three sub sections
- – Query Workload
In this section, briefly describe the query workload you have chosen to work on the system. You may add short explanation on your selection criteria. - – Schema Design
In this section, describe the schema with respect to the particular system. Your5
description should include information at “table” and “column” level as well as possible primary keys/row keys and secondary indexes. You should show sample data based on schema. For instance, you may show sample documents of each MongoDB collection, a sample property graph involving all node types and rela- tionship types for Neo4j. Highlight the data that are different to the given one and justify your decision. These would include data you decide not to put in the database, data that are duplicated, or data that are results of preprocessing.
– Query Design and Execution
In this section, describe implementation of each target query in the workload. You may include the entire shell query command, or show pseudo code. You should also run sample queries and provide a brief performance analysis of each query by citing actual query execution statistics if such statistics are available or by indicating that certain indexing would be used.
In section four, compare the two systems with respect to ease of use, query performance and schema differences. You can also describe problems encountered in schema design or query design.
In section five, document the sample query results as well as the respective argument(s) you use for target queries.
Submit a hard copy of your report together with a signed group assignment sheet in Week 10 lab.
• System Demo
Each group will demo in week 10 lab. You can run demo on your own machine, on lab machine or on some cloud servers. Please make sure you prepare the data before the demo. The marker does not need to see your data loading steps. The marker will ask you to run a few randomly selected queries to get an overview of the data model and query design. All members of the group are required attend the demo. The marker will ask each member a few questions to establish their respective contribution to the project. Members in the same group may get different marks depending on their individual contributions.
• Source Code/Script and soft copy of report
Submit a zip file through the eLearning site, before 5pm on Tuesday 9th of October, 2018 (week 10). The zip file should contain the following:
- – a soft copy of the report
- – query script or program code for each option
- – any data preprocessing and loading script.
- – a Readme document for how to run the preprocessing script and the target queries. The instruction should be detailed enough for the markers to quickly prepare the data and to run the queries. For instance, you should indicate where and how
6
run-time argument are supplied. If you use special features only available in a particular version or environment, indicate that as well.
7