程序代写 MCIT 553 Project 2

MCIT 553 Project 2
A DHT-based Search Engine

MCIT Online 553

Copyright By PowCoder代写 加微信 powcoder

Directions

You must work in groups of 4 for this project. Please regularly check Piazza through out this course for
project specification clarifications. You are required to version control your code, but please only use the
GitHub repository created for you by the 553 staff. Do not work in public GitHub repositories! Please
avoid publishing this project at any time, even post-submission, to observe course integrity policies. If
you are caught using code from other groups or any sources from public code repositories, your entire
group will receive ZERO for this assignment, and will be sent to the Office of Student Conduct where
there will be additional sanctions imposed by the university.

1 Overview

In this project, you will implement a peer-to-peer search engine (PennSearch) that runs over your imple-
mentation of the Chord Distributed Hash Table (PennChord). You are expected to read the entire Chord
paper carefully, and clarify any doubts with your TAs, instructor, or post questions on Piazza.

You will start from your routing protocol implementation in project 1. You can choose either dis-
tance vector or link state, or if your project did not work properly, configure to use OLSR as your
routing protocol. The use of OLSR can be turned on using a command line flag –routing=NS3
to simulator-main.cc . You are then responsible for first developing Chord as an overlay network
layered on top of your routing protocol, followed by building the search engine application that uses

To help you get started, files related to project 2 include:

• simulator-main.cc: In addition to what you learnt in project 1, it has SEARCH LOG() and
CHORD LOG() functions to log all messages relevant to the search engine and Chord overlay,

respectively. It also includes modules for CHORD and PENNSEARCH.

• scenarios/pennsearch.sce: An example scenario file that contains a simple 3 node Chord net-
work, the publishing of document keywords by two nodes, and three example queries.

• keys/metadata0.keys, keys/metadata1.keys: These keys are located inside the contrib/upenn-
cis553/ directory and not in the scenario directory. Each file contains the meta-data for a set of
documents, where each row “DOC T1 T2 T3. . . ” is a set of keywords (in this case, T1, T2, and
T3) that can be used to search for a document with identifier DOC. Each document identifier can
be a web URL or a library catalog number, and for the purpose of this project, they are simply a
string of bytes that uniquely represent each document. In practice, these keywords are generated
by reading/parsing web content or other documents to extract keywords. However, since parsing
web pages is not the focus of this project, we have skipped this step, and supply these document
keywords to you.

• penn-chord-*.[h/cc]: Skeleton code for your Chord implementation.

• penn-search-*.[h/cc]: Skeleton code for your PennSearch implementation.

The command to compile and run project 2 is the same as Section 2.3 and 2.4 in project 1 code doc-
umentation. Please do not use -result-check flag in the command and please comment out calls to
checkNeigbhorTableEntry() and checkRoutingTableEntry() if you use your LS and DV implementation.
You can also opt to use ns-3’s OLSR implementation instead of your own LS and DV.

Note that our skeleton code is a starting point, and you are allowed to be creative and structure
your code based on your own design. For regular credits, you are however not allowed to make any
code changes outside of upenn-cis553 directory, nor are you allowed to change simulator-main.cc. In
addition to our sample test script, we expect you to design additional test cases that demonstrate the
correctness of your PennSearch implementation. We encourage you to get started early. As a result, we
have included a milestone 1 where you should have a basic Chord implementation.

2 PennChord

Chord nodes will execute as an overlay network on top of your existing routing protocol, i.e., all mes-
sages between any two Chord nodes 1 and 2 will traverse the shortest path computed by the underlying
routing protocol. The PennChord overlay network should be started only after the routing protocol has
converged (i.e. finish computing all the routing tables). You can assume that the links used in the under-
lying routing protocol computation does not changed while PennChord is executed. Also, not all nodes
need to participate in the Chord overlay, i.e. the number of nodes in your underlying network topology
may be larger than the number of nodes in the Chord overlay. Your PennChord implementation should
include finger tables and the use of 160-bit cryptographic hashes.

Correct Chord Behavior

We will also be using our own test cases to ensure correct Chord behavior. For simplicity, assume
each node only maintains one successor (i.e. the closest node in the clockwise direction). You have
to implement the actual stabilization functionality described in the Chord paper. In particular, we will
check for the following features:

• Correct lookup. All lookups routed via Chord has to be proven correct (i.e. reach the correct
node via the right intermediate nodes. You need to support the basic Chord API, which is IPAddr
← lookup(k), as described in lecture. This API is not exposed to the scenario file explicitly, but
used by PennSearch to locate nodes responsible for storing a given keyword.

• Consistent routing. All nodes agree on lookup(k).

• Well-formed ring. For each node n, it’s successor’s predecessor is itself. One easy way is to
make use of your RINGSTATE command described below to ensure that the Chord ring is formed
correctly.

• Correct storage. Every item K is stored at the correct node (i.e. lookup(k))

• Performance. For a given network size, you need to compute and output the average hop count
required by Chord lookups that occur during the duration of your simulation. In other words,
you should implement the code that will capture the number of hops required by each lookup and
output using CHORD LOG the average hop count across all lookups when the simulation ends.
The average hop count must exclude lookups that are generated during periodic finger fixing.
After finger fixing is correctly implemented, the average hop count should be log(N) instead of
N, where N is the number of nodes in the Chord overlay network.

• Stabilization protocol. Enough debugging messages (not too many!) to show us that periodic
stabilization is happening correctly. Since stabilization generates large numbers of messages, you
should provide mechanisms to turn on/off such debugging in your code.

Summary of Commands

• Start landmark node: Designate a node as your landmark (i.e. node 0). E.g. “0 PENNSEARCH
CHORD JOIN 0” will designate node 0 as the landmark node, since the source and landmark
nodes are the same.

• Nodes join: A PennChord node joins the overlay via the initial landmark node. E.g. “1 PENNSEARCH
CHORD JOIN 0” will allow node 1 to join via the landmark node 0. Once a node has joined the
network, items stored at the successor must be redistributed to the new node according to the
Chord protocol. For simplicity, you can assume all joins and leaves are sequential, i.e. space all
your join events far apart such that the successors and predecessors are updated before the next
join occurs.

• Voluntary node departure: A PennChord node leave the Chord network by informing its suc-
cessor and predecessor of its departure. E.g. “1 PENNSEARCH CHORD LEAVE” that will
result in node 1 leaving the PennChord network. All data items stored should be redistributed to
neighboring nodes accordingly. For simplicity, you can assume all joins and leaves are sequential.

• Ring state debug: At any node X, a “X PENNSEARCH CHORD RINGSTATE” command will
initiate a ring output message that initiates from node X, and traverse the entire Chord ring in
a clockwise direction to output all successors and predecessors. This message will terminate
at the initiating node X. Each node curretNodeAddr that receives the message will generate the
following output in a single line using PRINT LOG :

Ring State
Curr
Pred Succ

Debug Logs (for Grading)

For grading purposes, in addition to ring state output above, we require the following information
to be printed using CHORD LOG . These debugging messages can be turned off using command ”*
PENNSEARCH VERBOSE CHORD OFF”. All Chord identifiers should be printed in hexadecimal.

• Lookup issue debug: Every time a node issues a lookup request, the following message is printed:

LookupIssue

• Lookup forwarding debug: Every time a node forwards a lookup request, the following message
is printed:

LookupRequest: NextHop

• Lookup results debug: Every time a node returns a result in response to a lookup request back
to the node that originated the initial lookup request, the following message is printed:

LookupResult

Note that there are no specific commands that will generate lookups. They are generated either by
Chord’s periodic finger fixing, or PennSearch’s invocation of PUBLISH and SEARCH (see below). For
our testing purposes, you must leave out lookups generated by finger fixing in the CHORD LOG output,
but only print out lookups that are initiated by PennSearch.

Cryptographic Hashes

Note that the 160-bit Chord cryptographic hashes can be generated using SHA1, which is available using
the OpenSSL Crypto development library. In order to do that, you need to include the libcypto library
during compilation. Specifically, take the following steps:

1. Install OpenSSL in your VM:

sudo apt-get install libssl-dev

2. Run ”./waf configure” to update your configure if this is your first time to run this program. It can
also check if you have installed OpenSSL correctly; otherwise the configuration would fail.

3. To use libcrypto, you need to:

#include

4. To generate digest for a message, use:

SHA1(message, length, digest);

For the purpose of this project, you can use the first 32 bits (m = 32) of the SHA1 value as the Chord
ID and will receive full credit if your functionality is working correctly. If you can use the entire 160
bits as Chord ID, you will receive extra credit.

3 PennSearch

To test your Chord implementation, we will require you to write PennSearch, which is a simple keyword
based search engine.

Basics of Information Retrieval

We first provide some basic knowledge that you would need to understand keyword-based information
retrieval. We consider the following three sets of document keywords, one for each of Doc1, Doc2, and

Doc1 T1 T2
Doc2 T1 T2 T3 T4
Doc3 T2 T3 T4 T5

Doc1 is searchable by keywords T1 or T2. Doc2 is searchable by T1, T2, T3, and T4, and Doc3
is searchable by T3, T4, and T5. Typically, these searchable keywords are extracted from the actual
documents with identifiers Doc1, Doc2, and Doc3.

Based on these keywords, the inverted lists are {Doc1, Doc2} for T1, {Doc1, Doc2, Doc3} for T2,
{Doc2, Doc3} for T3, {Doc2, Doc3} for T4, and {Doc3} for T5. Each inverted list for a given keyword
essentially stores the set of documents that can be searched using the keyword. In a DHT-based search
engine, for each inverted list Tn, we store each list at the node whose Chord node is responsible for the
key range that includes hash(Tn).

A query for keywords “T1 AND T2” will return the document identifiers “Doc1” and “Doc 2”, and
the results are obtained by intersecting the sets {Doc1, Doc2}, and {Doc1, Doc2, Doc3}, which are the
inverted lists of T1 and T2 respectively. We only deal with AND queries in PennSearch, so you can
ignore queries such as “T1 OR T2”.

Note that the query result is not the actual content of the documents, but rather the document identi-
fiers that represent documents that include both T1 and T2. In typical search engine, an extra document
retrieval phase occurs at this point to fetch the actual documents. We consider the actual document
content retrieval step out of scope of this project.

Summary of Commands

• Inverted list publishing: “2 PENNSEARCH PUBLISH metadata0.keys” means that node 2 reads
the document metadata file named metadata0.keys. Node 2 then reads each line, which is of the
form “Doc0 T1 T2 T3 . . . ”, which means that the Doc0 is searchable by T1, T2, or T3. After
reading the metadata0.keys file, node 2 constructs an inverted list for each keyword it encounters,
and then publishes the respective inverted indices for each keyword into the PennChord overlay.
For instance, if the inverted list for “T2” is “Doc1, Doc 2”, the command publishes the inverted
list “Doc1, Doc2” to the node that T2 is hashed to. This node can be determined via a Chord
lookup on hash(T2). As a simplification, the inverted lists are append-only, i.e. new DocIDs are
added into a set of existing document identifiers for a given keyword, but never deleted from an
inverted list.

• Search query: “1 PENNSEARCH SEARCH 4 T1 T2” will initiate the search query from node
1, and take the following steps via node 4:

(a) Node 1 contacts node 4 with query ”T1 AND T2”;

(b) Node 4 issues a Chord lookup to find the node that stores the inverted list of T1, i.e., the node
that T1 is hashed to (e.g., Node T1), and sends the query ”T1 AND T2” to Node T1;

(c) Node T1 retrieves inverted list for T1 from its local store, issues a Chord lookup to find the
node that T2 is hashed to (e.g., Node T2), and sends the retrieved inverted list for T1 together with
query ”T2” to Node T2;

(d) Node T2 send the intersection of the inverted lists of T1 and T2 as the final results back either
directly back to node 1, or to node 4 (which forwards the results to node 1). If there are no
matching documents, a “no result” is returned to node 1.

For simplicity, you can assume there is no search optimization, so inverted lists are intersected
based on left-to-right ordering of search terms. Note that the search may contain arbitrary number
of search terms, e.g. “1 PENNSEARCH SEARCH 4 T1 T2 T3”.

• You can assume that each document identifier appears in only one metadataX.keys file. For in-
stance, if node 0 publishes metadata0.keys, and node 1 publishes metadata1.keys, you can assume
that both files do not contain overlapping document identifiers. This emulates the fact that in prac-
tice, each node will publish inverted indexes for documents that it owns, and one can make the

assumption that each node owns a unique set of documents. On the other hand, each searchable
keyword may return multiple document identifiers. For instance, there are two documents Doc2
and Doc3 that can be searched using the keyword T3.

• You can assume that only nodes that participate in the PennSearch overlay can have permission
to read document keywords and publish inverted indexes. Nodes, which are outside the Chord
network, may initiate SEARCH by contacting any node, which is part of the PennSearch overlay.
For instance, in our example command above, “1 PENNSEARCH SEARCH 4 T1 T2” means that
node 1 (which may be a node outside the PennSearch overlay) can issue a search query for “T1
and T2” via node 4, which is already part of the PennSearch overlay.

Debug Logs (for Grading)

For grading purposes, we require the following information to be printed using SEARCH LOG :

• Inverted list publish debug: Whenever a node publishes a new inverted list entry, the following
debug message is generated:

Publish

• Inverted list storage debug: Whenever a node (that the keyword is hashed to) receives a new
inverted list entry to be stored, the following debug message is generated:

Store

Note that if CHORD LOG is turned on, this means that between each Publish and Store output
message, we should see a series of lookup output messages.

• Search debug: Whenever a node issues a search query with terms T1, T2,. . . ,Tn, you should

Search

• Inverted list shipping debug: For each inverted list being shipped in the process
of the search, you should output:

InvertedListShip

• Search results debug: At the end of intersecting all keywords (T1, T2, . . . , Tn), output the final
document list that is being sent back to the initial query node:

SearchResults

• Empty search results debug: In the process of the search, if any intermediate node results in an
empty inverted list to be sent back to the initial query node, you should output:

SearchResults

4 Milestones

• Milestone 1: (15%)

We expect a basic Chord implementation where the ring stabilization protocol works correctly.
At this stage, finger table implementation is optional for this milestone. We require only the
ringstate PRINT LOG output to make sure your ring is formed correctly and maintained as nodes
enter/leave the Chord overlay. For this milestone, upload a video recording of your functioning
basic Chord to Gradescope.

• Final (85%)

Complete working implementation of PennChord and PennSearch.

In the final submission, submit your tar files cis553project2.tar.gz on Gradescope. Note that you
have to submit the entire upenn-cis553 directory. You will schedule and give a 45 minutes demo with
the TA (or at least an hour long if you did extra credits below). All test scripts have to be ready and
submitted together with the rest of the project. We will provide a scenario file that you will use in the
final demo.

Extra credits

We have suggested several avenues for extra credit, to enable students to experiment with the challenges
faced in designing the Chord protocol to work 24×7 over the wide-area. Note that doing extra credit is
entirely optional. We offer extra-credit problems as a way of providing challenges for those students
with both the time and interest to pursue certain issues in more depth.

Similar to project 1, extra credit will be submitted separately from regular credit on Gradescope.
You can demo your extra credit to your assigned TA after the deadline.

• Handling big numbers (5%): If you use the entire 160 bits from SHA1 as the Chord ID, you
will get 5% extra credit. This will require you to design a mechanism to store, add, mod, compare,
and display big integers that cannot fit in primitive data types.

• Demonstrate using your LS/DV from project 1 (5%): If you use your own LS or DV instead of
the default ns-3 routing for your demo, you will get 5% extra credit. This may require some mod-
ifications to your project 1 code to get this to work properly, in particularly implement routeInput
and routeOutput functionality which is not tested by our autograder.

• Bandwidth efficient search engine (10%): The basic PennSearch can be enhanced to save band-
width as follows: perform the intersection of keywords starting from the one with smallest inverted
list, and use of bloom filters. Implement these features and show that your implementation results
in lower bandwidth utilization compared to the regular implementation.

• Enhanced search features (5% for each bullet point): Enhanced your search engine with at
least the following features:

– Generating the document keywords (e.g. metadata0.keys, metadata1.keys) with actual web
pages (>20) that you have downloaded, and replacement docID with actual URLs. After re-
turning the search results, fetch the actual documents themselves. Do the last step sparingly
so as not to overload existing web servers with your http requests.

– Rank search results based on a reasonable search metric, such as TF-IDF.

– Support inverted lists that are larger than MTU size of 1500 bytes. For instance, you can

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com