COMP9315 18s2
Assignment 2
SIMC Signature Index Files
DBMS Implementation

Last updated: Tuesday 16th October 9:52pm
Most recent changes are shown in red; older changes are shown in brown.

Aims

This assignment aims to give you an understanding of
- how database files are structured and accessed
- how superimposed codeword (SIMC) signatures are implemented
- how partial-match retrieval searching is implemented using SIMC signatures

The goal is to build a simple implementation of a SIMC signature file, including applications to create SIMC files, insert tuples into them, and search for tuples based on partial-match retrieval queries.

Summary

Deadline: 23:59:59 on Sunday 21 October
Late Penalty: 0.09 marks off the ceiling mark for each hour late
Marks: Contributes 15 marks toward your total mark for this course.
Groups: do this assignment in pairs or individually (you can use the same groups as for Assignment 1)
Submission: Log in to Course Web Site > Assignments > Assignment 2 > Submission > upload ass2.tar

The ass2.tar file must contain the Makefile plus all of the *.c and *.h files that are needed to compile the create, insert and select executables. However, you should not change or submit create.c, insert.c and select.c. Details on how to build the ass2.tar file are given below.

Make sure that you read this assignment specification carefully and completely before starting work on the assignment. Questions which indicate that you haven't done this will simply get the response "Please read the spec".

Note: this assignment does not require you to do anything with PostgreSQL.

Introduction

Signatures are a style of indexing where (in its simplest form) each tuple is associated with a compact representation of its values. They are used in the context of partial-match retrieval queries, and are particularly effective for large tuples.
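As a rough illustration of what such a compact representation can look like, the C sketch below builds a signature by OR-ing together one codeword per attribute value, and tests a query signature against a tuple signature. It is not the code supplied with the assignment: the 64-bit signature width, the 4 bits per codeword, the FNV-1a hash, the use of NULL for an unspecified query attribute, and all function names are assumptions made only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIG_BITS 64   /* assumed signature width in bits */
#define CW_BITS  4    /* assumed number of bits set in each codeword */

/* Hypothetical string hash (FNV-1a); any well-mixed hash would do. */
static uint32_t hash_str(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Codeword for one attribute value: exactly CW_BITS distinct bits,
 * chosen by a small generator seeded from the value's hash. */
static uint64_t codeword(const char *value)
{
    assert(CW_BITS <= SIG_BITS);
    uint64_t cw = 0;
    uint32_t state = hash_str(value);
    int nset = 0;
    while (nset < CW_BITS) {
        state = state * 1664525u + 1013904223u;      /* LCG step */
        uint64_t bit = UINT64_C(1) << ((state >> 16) % SIG_BITS);
        if ((cw & bit) == 0) {                       /* skip repeated bits */
            cw |= bit;
            nset++;
        }
    }
    return cw;
}

/* Superimpose (bitwise OR) the codewords of all known attribute values.
 * A NULL entry stands for an attribute left unspecified in a query. */
static uint64_t make_signature(const char *const vals[], int nattrs)
{
    uint64_t sig = 0;
    for (int i = 0; i < nattrs; i++) {
        if (vals[i] != NULL)
            sig |= codeword(vals[i]);
    }
    return sig;
}

/* A tuple is a potential match when every bit set in the query
 * signature is also set in the tuple signature. */
static bool sig_matches(uint64_t query_sig, uint64_t tuple_sig)
{
    return (tuple_sig & query_sig) == query_sig;
}
```

A query signature built from a subset of a tuple's values always passes sig_matches against that tuple's signature. The converse does not hold: unrelated values can happen to set the same bits, so a match only marks the tuple as a candidate that still has to be checked against the query.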
Selection is performed by scanning signatures, matching them against a query signature, and then examining tuples that are flagged as potential matches. Efficient signature matching (small signatures, simple bit-comparison) allows for "false matches", where the query and tuple signatures match, but the tuple is not a valid result for the query.

The kind of signature matching described above uses one signature for each tuple (as in the diagram below). Other kinds of signatures exist, and one goal is to implement them and compare their performance to that of tuple signatures.

Signatures can be formed in several ways, but we will consider only signatures that are formed by superimposing codewords (SIMC). Each codeword is formed using the value in one attribute.

In our context, a SIMC-indexed relation is a collection of files that represents one relational table, and can be manipulated by a number of supplied commands:

gendata #tuples #attributes [startID] [seed]

Generates a specified number of n-attribute tuples in the appropriate format to insert into a created relation. All tuples are the same format and look like

    UniqID,RandomString,a3-Num,a4-Num,…,an-Num

For example, the following 4-attribute tuples could be generated by a call like gendata 1000 4

    7654321,aTwentyCharLongStrng,a3-013,a4-001
    3456789,aTwentyChrLongString,a3-042,a4-128

A tuple is a sequence of comma-separated fields. The first field is a unique 7-digit number; the second field is a random 20-char string; the remaining fields have a field identifier followed by a non-unique 3-digit number. The size of each tuple is