CS 3110 Fall 2020
A3: Search
In this assignment you will develop a search engine for text documents. Your engine will crawl through a directory on a local disk looking for documents and answer queries posed by users.
This assignment is more difficult than A2. I've never given exactly this version of the assignment before. The pandemic and the resulting academic calendar have caused me to create a mash-up of two previous assignments. The best estimate I can give you is based on a different mash-up from Spring 2020, in which students reported working a mean of 12.0 hours, with a standard deviation of 6.3 hours. Please get started right away and make steady progress each day. Please track the time that you spend on the assignment. I will ask you to report it.
Collaboration policy: This assignment may be completed either as an individual or with a group. Groups must be formed in CMS within the first couple days of the assignment; see below for details. Beyond your optional group, you may have only limited collaboration with other students as described in the syllabus.
Imperative features: Imperative features are now allowed. There is no good use that I know of for them on this assignment, though.
What you'll do: Implement some functors, and use the Bisect glass-box testing tool. Objectives:
Learn more about the OCaml module system.
Implement data structures for sets and dictionaries with association lists and red-black trees. Document and implement abstraction functions and representation invariants.
Gain experience with glass-box testing.
Optionally, teach yourself the red-black tree deletion algorithm and implement it.
Table of contents:
Step 0: Form a Group in CMS
Step 1: Get Started
Step 2: List Dictionaries
Step 3: Bisect
Step 4: Sets
Step 5: Red-Black Trees
Step 6: Query
Bonus Step 7: Removal from Red-Black Trees
Rubric
Submission
Step 0: Form a Group in CMS
Originally this assignment was going to allow partners. Because of the unusual circumstances this semester, I am extending that to allow groups of size 1 through 3, but not 4.
Groups are optional. The assignment is definitely do-able by one person. Nonetheless, working with other people is useful because it gives you a chance to bounce ideas off one another, and to get their help with fixing faults in your code. See the discussion of "pair programming", below, for more details.
As a reminder, here is what the syllabus says about partners, and it applies to groups of 3 as well:
Collaboration with a partner is subject to this rule: you must do ALL the work together. It is against that rule to split up the work, or to have one person do it and another person "check" it, or to have one person write the code and another person write the tests, etc. When you submit as a partnership, by default you are asserting that both partners are equally responsible for the entire code base. If an AI violation is detected, you would both therefore be prosecuted. So if you do violate the rule, make sure to disclose that in the designated comment in the release code.
The deadline to form a CMS group is Tuesday, October 27, at 11:59 pm ET. There will be a short grace period. Starting on Wednesday, October 28, you will no longer be able to form new groups. Instead, you will have to email one of your section TAs, cc'ing your intended group members. The TA can then manually put you into a CMS group. However, there will be a penalty for late group formation: -5 points on Wednesday and -10 points on Thursday. The penalty applies to all group members, no exceptions. Starting on Friday, October 30, at 12:01 am ET, no new groups may be formed.
Why do I have this early group formation deadline, and the penalties? It's to make sure you are working with your group on the entire assignment, as required by the collaboration policy, rather than joining forces part-way through.
If you want to split up with your group before the final assignment deadline, you must acknowledge their influence on your work in the Authors interface. You may form a new group, subject to the deadlines and penalties stated above.
Pair Programming. If you have a group, I encourage you to try out pair programming with them. Pair programming is a specific way for people to write code together, and for them to collectively own the result. Please watch this video, which explains the driver and navigator model of pair programming:
How to Pair Program (Pair Programming Tips from Steven …
If you'd like to read more about the benefits of pair programming, Strengthening the Case for Pair Programming by Williams et al. (2000) is a good place to start. It includes this quote:
For years, programmers in industry have claimed that by working collaboratively, they have produced higher-quality software products in shorter amounts of time. But their evidence was anecdotal and subjective: "It works" or "It feels right." To validate these claims, we have gathered quantitative evidence showing that pair programming (two programmers working side by side at one computer on the same design, algorithm, code, or test) does indeed improve software quality and reduce time to market. Additionally, student and professional programmers consistently find pair programming more enjoyable than working alone. Yet most who have not tried and tested pair programming reject the idea as a redundant, wasteful use of programming resources: "Why would I put two people on a job that just one can do? I can't afford to do that!" But we have found, as Larry Constantine wrote, that "Two programmers in tandem is not redundancy; it's a direct route to greater efficiency and better quality."
Even though we are socially distanced because of the pandemic, pair programming in VS Code using Live Share, or even over Zoom, works remarkably well. Please note that Live Share really requires a local install rather than Ugclinux.
Step 1: Get Started
Download the release code. There are many files, and you will need to read most of them before the assignment is over, but you don't have to do so yet.
Create a new git repository for this assignment. Make sure the repo is private. Add the release code to your repo, referring back to the instructions in A1 if you need help with that. Make sure you unzip and copy the files correctly, including hidden dot files, as described in the A1 handout.
If you are working with a group, grant them access to the repo: Go to the repo's webpage on the COECIS GitHub. Click "Settings"; on the settings page, click "Collaborators", and type your group member's NetID into the search box. Click "Add collaborator". Repeat as needed.
Your group members should now click on the GitHub icon on the top left in their own browser. They should see the repo you just created listed on the left as a repo to which they have access.
In the release code, there is a makefile provided with the usual targets: build, test, check, finalcheck, zip, docs, and clean.
This assignment will be autograded as usual, so your submission must pass make check. Now that you're used to working with .mli files, I have omitted the comments about what you are allowed to change. Any names and types provided in an interface must not be changed. You may add new declarations and definitions, but there is no need to do so: recall that you should not be exposing or testing helper functions. If you're ever in doubt about whether a change is permitted, just run make check. It will tell you whether you have changed the interface in a way that would make it impossible to autograde your submission.
Take a moment now to fill out your name(s) and NetID(s) in authors.mli.
Note that starting with this assignment, I will no longer specifically direct you to delete TODO comments after you complete them. I trust you know to do that by now. Unlike some other courses you might have taken, the 3110 autograder does not look for such comments. So this is just about code cleanliness, not about grading.
Step 2: List Dictionaries
The most important ADT that your search engine will need is a dictionary, that is, a map from keys to values. You're going to implement two data structures for dictionaries: association lists and red-black trees. We'll start with association lists.
First, read the interfaces in dictionary.mli. There are a few advanced OCaml features used in them that you might not recognize:
The Formattable.format function is a custom printer intended for use by the toplevel. Custom printers are discussed in the textbook in the section on abstract types, under the heading "Custom Printers".
Some signatures have code like include Sig with type t := t. This is a destructive substitution. You don't need to know how to use this yourself. Essentially, it means that the type t inside Sig should be replaced with the type t from the surrounding signature. For more information, see the OCaml manual.
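As a toy illustration (not from the release code), here is how destructive substitution avoids a conflict between two declarations of t:

module type Stringable = sig
  type t
  val to_string : t -> string
end

module type Counter = sig
  type t = int
  (* [with type t := t] deletes [Stringable]'s own [t] and replaces it
     with the [t] = int declared above, so the two do not conflict. *)
  include Stringable with type t := t
end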
DictionaryMaker uses module sharing constraints, which are similar to the type sharing constraints discussed in the textbook. There's nothing fundamentally different between those two kinds of sharing constraints: both refine a signature by specifying equations that must hold.
Second, read the interface in listDictionary.mli. All it does is include the Dictionary interface and declare a functor.
Third, read the code in listDictionary.ml, followed by the example in exampleDictionary.ml. That latter file is intended to be loaded in utop with #use, not to be compiled with ocamlbuild. Do so now, making sure that you get the expected output from evaluating d in it.
Fourth, in ListDictionary.Make, implement type t, empty, rep_ok, and format:
You are free to use any code I have provided during this semester. You do not need to cite it.
Take time to design t, its AF, and its RI (see the sketch after this list). Consider the specifications of the operations. Because you are using association lists, the time efficiency of many operations will have to be linear. The grading test suite will test list dictionaries of size at most about 10,000. That means you generally should not have to worry about tail recursion.
How to implement rep_ok is discussed in the textbook in the section on implementing representation invariants. You are not required to insert calls to rep_ok throughout your ListDictionary implementation, but you might find it useful for debugging purposes. If you do insert them, it's fine to comment out the "real" rep_ok implementation and change it to the identity function. But, you must leave the commented-out version in your source code so that the grader can assess it.
exampleDictionary.ml
listDictionary.ml
ocamlbuild
You should not be writing OUnit tests for format or rep_ok. Indeed, it would be hard or perhaps even impossible to write such tests.
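Here is a minimal sketch of one reasonable design. It assumes the functor parameters are named K and V; check listDictionary.ml for the actual names.

(* AF: the list [(k1, v1); ...; (kn, vn)] represents the dictionary
   that binds each ki to vi; [] represents the empty dictionary.
   RI: no key appears more than once in the list. *)
type t = (K.t * V.t) list

let empty = []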
Fifth, implement the remaining functions in ListDictionary.Make based on their specifications in Dictionary.Dictionary. Use TDD as you go. Implement unit tests for ListDictionary in test.ml. See the next step for more information about how many tests you will eventually need to write.
Sixth, double-check your implementation to make sure you were extra careful to use the comparison operator provided by the KeySig when comparing keys. Do not use Stdlib.compare, nor the built-in polymorphic comparison operators: <, <=, =, etc. The grading test suite will use some keys with non-standard comparison orders. If you use the built-in comparisons, your dictionary will not process keys in the right order, and your solution will lose a couple points. Create some unit tests for that yourself. For example, you could test integer keys that are ordered decreasingly, or string keys that are case insensitive.
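For instance, a test key module with a case-insensitive order might look like the sketch below. It assumes KeySig wants a compare function in the style of Stdlib.compare; check dictionary.mli for the actual signature (it may use a variant ordering type instead) and adapt.

module CaselessString = struct
  type t = string

  (* Compares without regard to ASCII case, so "FAR", "Far", and "far"
     are all equal as keys. *)
  let compare a b =
    Stdlib.compare (String.lowercase_ascii a) (String.lowercase_ascii b)

  let format fmt s = Format.fprintf fmt "%S" s
end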
Step 3: Bisect
Bisect is a glass-box testing tool that we covered earlier in the course. Run make bisect to use Bisect to measure the code coverage of your test suite. Open _coverage/index.html. Examine the coverage you have achieved in listDictionary.ml. It's unlikely you are at 100%. Indeed, it's unlikely you can achieve that. For example, unit tests are unlikely to ever exercise your
format and rep_ok functions. But outside of those, look for any lines of code that haven't been covered. Invent new unit tests that will cause those lines to be executed, and add those tests to test.ml.
How high do you need to get your coverage? In an ideal world, you would cover every line that you know is possible to cover, or at least that is feasible to cover with unit tests. With modest effort, I was able to achieve 90-100% code coverage, excluding my format and rep_ok functions (and their helpers), and excluding expressions in which a precondition violation was detected. See the testing rubric at the end of this handout for how the graders will assess your code coverage.
To exclude a function from measurement by Bisect, put [@@coverage off] immediately after the function body. You can find an example of that in ListDictionary.format_assoc_list. This syntax is an attribute, which is used to decorate some OCaml code in a way that the type checker mostly ignores, but that can be used by external tools. You can ignore individual expressions with the [@coverage off] attribute, or an entire module-level block of declarations by surrounding them with [@@@coverage off] and [@@@coverage on].
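For example, excluding a format helper from coverage might look like this sketch (the function body here is made up):

let format fmt d =
  Format.fprintf fmt "<dictionary of size %d>" (List.length d)
  [@@coverage off]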
Needless to say, you need to be ethical about turning off coverage. If it's for any reason other than format, rep_ok, or a precondition violation, you must write a comment explaining why it is fair to turn off coverage, i.e., why that code could never be executed. FWIW, I did not encounter any such circumstances in my own solution.
Step 4: Sets
Your search engine will also need a data structure for sets. In this assignment, we'll use a dictionary to represent a set. The core idea of this representation is that the elements of the set are the keys in the dictionary. The values in the dictionary are thus irrelevant, so they might as well simply be (), the unit value. Although there is a bit of extra space overhead using this idea, it nonetheless provides an easy way to turn any dictionary data structure into a set data structure.
First, read the provided code in the DictionarySet compilation unit.
Second, implement DictionarySet.Make. Follow the same general order as in implementing ListDictionary.Make: first t, empty, format, and rep_ok, then the rest of the functions. As always, carefully follow the specifications, and use TDD. Implement unit tests for DictionarySet in test.ml. Here are two hints:
In designing t, you need to figure out the appropriate way to use the functor input DM. The output of DM is a dictionary module; let's call it D. You can just let DictionarySet.Make.t be D.t (see the sketch after these hints).
Do not use remove to implement any of the other functions. Later in the assignment, that would have bad consequences for your red-black tree implementation.
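Several set operations then become one-liners over the underlying dictionary, as in this sketch. It assumes D.find returns an option and guesses at argument orders; consult dictionary.mli for the real signatures.

(* Elements are the keys; every value is the unit value (). *)
let insert x s = D.insert x () s

let member x s = D.find x s <> None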
Third, check your Bisect coverage for dictionarySet.ml. Improve it as necessary.
This is the stopping point for a satisfactory solution. Let me reassure you: you are passing this assignment at this point. I believe that every one of you is capable of it. So make it a goal at least to get here. Then, if you have the desire to go on, go for it!
Step 5: Red-Black Trees
Association lists are slow. In particular,
the find operation on association lists runs in linear time; and
the insert operation can be as fast as constant time (if you kept the list in unsorted order), but might be slower if you enforced stronger representation invariants (such as sorted order).
Let's improve on that by implementing dictionaries with red-black trees. They offer logarithmic-time find and insert operations. Each node in your red-black tree will store a single binding from a key to a value, and the tree will be ordered according to the keys. That is, the binary search tree (BST) invariant, which red-black trees must satisfy, will apply to the keys: all nodes in a left subtree must have a key less than the key at the parent, and all nodes in a right subtree must have a key greater than the key at the parent.
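A typical representation type for such a tree is sketched below, again assuming functor parameters K and V; your definition may differ, and the bonus removal step would extend it further.

type color = Red | Black

type t =
  | Leaf
  | Node of color * t * (K.t * V.t) * t
(* RI: in addition to the BST invariant on keys, no Red node has a Red
   child (the local invariant), and every path from the root to a Leaf
   passes through the same number of Black nodes (the global invariant). *)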
First, read the TreeDictionary compilation unit. It should look quite familiar from its list counterpart.
Second, in TreeDictionary.Make, implement t, empty, rep_ok, and format. Your implementation of rep_ok needs to check the BST invariant, the local invariant, and the global invariant. As with ListDictionary:
You are free to use any code I have provided during this semester. You do not need to cite it.
In documenting the RI, it's fine to assume that the reader knows what the BST and red-black invariants are. You don't have to restate them.
You are not required to insert calls to rep_ok throughout your TreeDictionary implementation, but in this case I strongly encourage it. Make sure all your tree operations check that rep_ok holds of inputs and outputs. This strategy will make you a more productive programmer: you will quickly discover your faults and fix them, rather than having to live with mysterious or unknown faults. After all your tests are passing, comment out your expensive implementation of rep_ok and replace it with something like:
let debug = false
let rep_ok x =
  if debug then (* your actual implementation *)
  else x
Then later you could quickly re-enable rep_ok by changing debug to true , if you suspected there was a new fault you needed to diagnose.
You must leave the "real" version of rep_ok present in your source code, even if just in a comment, so that the grader can assess it.
You should not be writing OUnit tests for format or rep_ok .
Third, finish the remaining functions in TreeDictionary.Make. But do not implement remove: it is considerably more challenging, and it is left as a bonus step at the end of the assignment.
Use TDD, and write unit tests for TreeDictionary in test.ml. You should be able to re-use your ListDictionary tests. Don't copy-and-paste them, which would duplicate code. Instead, use a functor to create the test suite. The input to that functor would be the structure that you want to test, and the output would be a structure that contains a list of OUnit tests. You can then put those tests into the suite passed to the OUnit test runner. The textbook has an example of how to do that with assert rather than OUnit.
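A minimal sketch of that idea follows. The names here (IntKey, StringValue, the argument order of insert, find returning an option) are assumptions; match them to the release code's actual signatures.

open OUnit2

module DictTester (DM : Dictionary.DictionaryMaker) = struct
  module D = DM (IntKey) (StringValue)

  let tests = [
    "insert then find" >:: (fun _ ->
      let d = D.insert 1 "one" D.empty in
      assert_equal (Some "one") (D.find 1 d));
  ]
end

module ListTests = DictTester (ListDictionary.Make)
module TreeTests = DictTester (TreeDictionary.Make)

let suite = "dictionary suite" >::: ListTests.tests @ TreeTests.tests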
Since most or all of your tests will initially fail for trees, you might want to begin by commenting out all your former list tests. Then uncomment the tests for whatever tree operation you want to implement next. Implement that operation. Make sure your trees and lists pass those tests. Repeat until all tests are uncommented.
Fourth, again test your implementation to make sure you were extra careful to use the comparison operator provided by the KeySig when comparing keys. Do not use Stdlib.compare, nor the built-in polymorphic comparison operators: <, <=, =, etc.
Fifth, check and improve your Bisect code coverage as necessary for treeDictionary.ml and any of the earlier files you might have changed.
Sixth, test your tree dictionary to ensure that it scales up to about 100,000 bindings. A few points in the grading test suite will be devoted to this kind of scalability testing. We'll give such unit tests about 10 seconds to run, which is more than enough if you have implemented efficient algorithms.
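One way to write such a test is sketched here, assuming a tree dictionary D instantiated with int keys and values; the names are illustrative.

let scale_test =
  "insert and find 100_000 bindings" >:: (fun _ ->
    let d =
      List.fold_left
        (fun acc i -> D.insert i i acc)
        D.empty
        (List.init 100_000 Fun.id)
    in
    assert_equal (Some 99_999) (D.find 99_999 d))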
Seventh, make sure to test sets built from tree dictionaries. A couple of points in the grading test suite will be devoted to making sure your sets work with trees, as they did with lists.
This is the stopping point for a good solution. You now have two well-tested dictionary implementations, and you've had some funky fun with functors.
Step 6: Query
Now that we have the data structures we need, it's time to implement the search engine. The search engine will index a directory of plain-text files and create a dictionary. The keys in that dictionary will be words. The values will be sets of filenames. So, the dictionary tells us in which files any given word appears.
First, read the provided code in the Engine compilation unit. I have provided the implementation of Engine.Make.index_of_dir. It handles finding all the files in a directory, and it decides what counts as a word. You don't actually need to understand the implementation of that function. It's crucial that we all use the same code for parsing words, though, so you must not change the implementation of index_of_dir or any of its helper functions.
Second, read the provided code in the ListEngine and TreeEngine compilation units. That code is complete: there is nothing you need to write. It uses Engine.Make and the appropriate dictionary functor to create a search engine based on association lists or red-black trees.
Third, in Engine.Make, implement format, to_list, and words. To test the latter two, write unit tests for ListEngine.to_list and ListEngine.words in test.ml. When used to index the provided directory engine_test/alice, you should find 3278 distinct words. You could write a property-based test for that fact, rather than hardcoding all the words as a test case. We also provide a directory preamble that could be the basis for other tests. You are welcome to construct your own test directories, too.
Fourth, add support for queries. The queries that users pose will have one of two forms. Abstractly, those two forms are as follows, in which the NOT clause is always optional:
"and-not" queries: AND (w1, w2, ..., wn), NOT (u1, u2, ..., um)
"or-not" queries: OR (w1, w2, ..., wn), NOT (u1, u2, ..., um)
For example, "AND (far, away)" would return all files that contain both the words "far" and "away", whereas "AND (far, away), NOT (day, night)" would return all files that do contain both "far" and "away" but contain neither "day" nor "night". Likewise, "OR (all, nothing)" would return any file that contains "all" or "nothing" (or both), and "OR (all, nothing), NOT (between)" would return any file that contains "all" or "nothing" (or both) but does not contain "between".
Queries must be case insensitive, which is the behavior you expect from Google. That means queries for the words "far", "Far", and "FAR" should all return the same results. Likewise, a file containing "far", "Far", or "FAR" would be returned by any of those queries.
Implement and_not and or_not in Engine.Make. Leverage the operations from the Set interface. Do not use the remove operation, because it has not been implemented for red-black trees.
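In set terms, an and-not query is the intersection of the file sets for the positive words, minus the union of the file sets for the negative words. The sketch below assumes a hypothetical helper files_of (returning the set of files containing a word) and set operations named intersect, union, and difference; use whatever your Set interface actually provides.

(* Remember to lowercase words consistently with your index. *)
let and_not idx ws us =
  match List.map (files_of idx) ws with
  | [] -> S.empty
  | first :: rest ->
    let pos = List.fold_left S.intersect first rest in
    let neg =
      List.fold_left (fun acc u -> S.union acc (files_of idx u)) S.empty us
    in
    S.difference pos neg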
Implement unit tests for ListEngine and TreeEngine in test.ml . To avoid duplicating code, again use a functor to produce the tests for whichever engine is passed in as an argument to the functor.
Fifth, check and improve your Bisect code coverage as necessary for engine.ml and any of the earlier files you might have changed. But note that your code coverage for engine.ml will not be graded.
Sixth, use your search engine to index some large directories. Observe the difference in performance between association lists and red-black trees. You don't need to turn in anything for this piece, nor will the grading test suite attempt to evaluate it. Just have fun!
Here are two cautions about test cases:
The grading test suite will test both list- and tree-based engines. So, an error in either dictionary implementation could result in deductions in your engine. Likewise, an error in your engine could result in deductions for both the list- and tree-based engines. So on the one hand, it's important to test thoroughly; but on the other hand, since this is excellent scope, you can't lose very many points anyway.
The grader has to be able to run make bisect as part of the testing rubric. That means any directories your OUnit test suite actively uses (as opposed to being commented out) need to be submitted as part of your ZIP file. You can add your own subdirectories in engine_test, and make zip will include them. But be careful: the CMS file size limit will (nearly silently) reject uploads of ZIPs that are too big. So if you add some large directories as index tests, make sure to put them outside of engine_test, and to leave those OUnit test cases commented out. As always, double check that you have submitted the intended version of your solution before the deadline.
Bonus Step 7: Removal from Red-Black Trees
If you would like an additional challenge, try implementing the remove operation for red-black trees. This is worth 2 bonus points. Before attempting it, submit a completed version of the assignment without remove. Since it is bonus, you are mostly on your own. To better serve students with questions about the rest of the assignment, I regret that consultants and TAs will not be able to spend much or any time answering questions about remove.
Use the algorithm defined in Section 2 of this short paper by Germane and Might. Although the code in that paper is written in Racket and Haskell, two other functional languages, the algorithm is fully described in English and pictures. It took me around 100 lines of OCaml code, including modifying the representation type to account for double black nodes and leaves, and therefore modifying pattern matches in other operations.
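For instance, the extended representation might look like this sketch:

type color = Red | Black | DoubleBlack

type t =
  | Leaf
  | DoubleBlackLeaf
  | Node of color * t * (K.t * V.t) * t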
There are just two places I found any ambiguity in their description of the algorithm:
1. For their Figures 6, 7, and 9: the double black nodes shown in those could in fact be double black leaves.
2. When they say the redden operation changes the root node to be red "if possible," they mean: if the root is black with two black children, which themselves are black nodes or (black) leaves. And when they say the blacken operation changes the root node to be black "if necessary," they mean: if the root is red with at least one red child.
Also, note carefully when the paper mentions the "reflections" of trees at the bottom of p. 426: your code will need to account for those reflections.
After you have implemented remove, check and improve your Bisect code coverage as necessary. Even though this is bonus scope, your testing score will go down if you don't get the coverage as high as usually required by the testing rubric, below.
Rubric
25 points: submitted and compiles
25 points: satisfactory scope
25 points: good scope
10 points: testing
10 points: code quality
5 points: excellent scope
2 bonus points: red-black remove
Testing Rubric. The graders will continue to assess these aspects of your test suite:
Whether your tests have descriptive names.
Whether your tests are constructed with helper functions to minimize duplicated code.
And new this time will be:
Whether your list vs. tree dictionary tests are constructed with a functor (or functors) to minimize duplicated tests. If you get to the point of engine tests, you should obviously use a functor there too, but it won¡¯t be assessed.
Your code coverage as measured by Bisect. The graders will run make bisect and look at your Bisect report. They will examine the percent coverage you achieve in dictionarySet.ml (2 points), listDictionary.ml (3 points), and treeDictionary.ml (5 points). The graders will not examine engine.ml from the Excellent Scope, regardless of whether you attempted it. If make bisect doesn't succeed, you won't get any of those points.
Any percent that is at least 90% will be rounded up to a full 100%. Then you'll get a number of points that is your percent code coverage on a file times the number of points the file is worth, rounded to one decimal place. For example, 84% coverage on treeDictionary.ml would earn 0.84 × 5 = 4.2 points, whereas 92% would be rounded up to 100% and earn all 5 points.
There is what might at first seem like a weird tradeoff here: someone who doesn't implement much code at all in a file could easily get high coverage points. But that will be offset by a much higher reduction in autograder correctness points for the scope. Honestly, I expect that most of you will get high testing scores. It's not that hard to find the test cases you're missing, and it even becomes fun: finally, you have a principled answer as to whether you have enough test cases!
The graders will scan through your source code for any places you used one of the [coverage off] attributes. The Bisect step above explains a few common situations where justifications are not required; otherwise, there will be deductions if you did not justify the usage.
As usual:
Make sure your test suite can be run with make test .
Your test suite will be graded independently from the rest of your implementation. So you can gain or lose points for it regardless of which functions you complete.
You may violate the 80-column rule for long string literals in test cases.
Code Quality Rubric. This will be assessed the same as in A2, except as follows:
The graders will no longer be checking for use of List.hd, List.tl, or List.nth. I trust you get the point on avoiding those by now.
You must document an AF and RI in each place the release code says to do so.
You must provide an implementation of format and rep_ok in each place the release code says to do so. As discussed earlier in this handout, for the sake of performance it might be wise to comment out the implementation of rep_ok and instead use the identity function. But you must leave the implementation in your source code so that the grader can see you did it.
Submission
Make sure your NetIDs are in authors.mli, and set the hours_worked variable at the end of authors.ml. Note that the variable is now a list. Order does not matter in it.
Make sure make bisect works. Otherwise you won't get any testing points.
Run make zip to construct the ZIP file you need to submit on CMS. Our autograder needs to be able to find the files you submit inside that ZIP without any human assistance, so:
→ DO NOT use your operating system's graphical file browser to construct the ZIP file. ←
Use only the make zip command we provide. Mal-constructed ZIP files will receive a zero from the autograder. If CMS says your ZIP file is too large, it's probably because you did not use make zip to construct it; the file size limit in CMS is plenty large for properly constructed ZIP files. Ensure that your solution passes make finalcheck. Submit your ZIP file on CMS. Double-check before the deadline that you have submitted the intended version of your file.
Congratulations! Your search has come to an end.
Acknowledgement: Adapted from Prof. Greg Morrisett, Dean and Vice Provost of Cornell Tech.
© 2020 Cornell University