
Group Formation and Management
You are required to work in a group of 3 members. Members of the
same group will in general receive the same score as long as the workload is shared more or less evenly. However, you are required to state the contribution of each member, and the final project score of a member may be adjusted depending on his/her contribution to the project. You need to plan, before the implementation of the project, how to distribute the workload, and run regular project meetings during the project.
Description of Basic Requirements
You are required to develop a web-based search engine with the following functions:
1. A spider (or called a crawler) function to fetch pages recursively from a given web site:
o Given a starting URL and the number of pages to be indexed, recursively fetch the required number of pages from the given site into the local system using a breadth-first strategy.
o For each page fetched into the local system, extract all the hyperlinks so that the spider can recursively process more pages, and proceed to the indexing functions in Step 2.
o Build a file structure containing the parent/child link relation. As noted elsewhere, every URL is represented internally as a page-ID, so the file structure should be able to return the page-IDs of the children pages given the page-ID of the parent page and vice versa.
o Before a page is fetched into the system, the spider must perform several checks:
! If the URL doesn’t exist in the inverted index, go ahead to retrieve the URL
! If the URL already exists in the index but the last modification date of the URL is later than that recorded in the index, go ahead to retrieve the URL; otherwise, ignore the URL
! Cyclic links must be handled gracefully (e.g., do not re-fetch a URL already visited in the current crawl)
o Resource: The HTML parser library from http://htmlparser.sourceforge.net/ provides the basic functions to fetch the web pages and to extract the keywords and links from them; notice that the HTML parser is a very large library, but we need no more than a few basic functions from it. A minimal sketch of the crawl loop is given below.
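The following is a minimal sketch of the breadth-first crawl described above, under assumptions of our own: the Fetcher interface is a placeholder for your own downloading, parsing and index-lookup code (e.g., built on the htmlparser library) and is not part of any provided API.

import java.util.*;

// Sketch of the breadth-first crawl. Fetcher is a hypothetical placeholder interface.
public class Spider {

    public interface Fetcher {
        long lastModified(String url) throws Exception;           // e.g., from an HTTP HEAD request
        boolean isIndexed(String url);                             // is the URL already in the index?
        long indexedDate(String url);                              // last-modified date stored in the index
        List<String> fetchAndIndex(String url) throws Exception;   // download, index (Step 2), return child links
        void saveLinks(String parentUrl, List<String> children);   // parent/child link structure
    }

    public void crawl(String startUrl, int maxPages, Fetcher f) throws Exception {
        Queue<String> frontier = new LinkedList<>();
        Set<String> visited = new HashSet<>();        // handles cyclic links gracefully
        frontier.add(startUrl);
        int fetched = 0;

        while (!frontier.isEmpty() && fetched < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;           // already seen in this crawl

            long lastModified = f.lastModified(url);
            // Skip pages that are already indexed and have not changed since.
            if (f.isIndexed(url) && f.indexedDate(url) >= lastModified) continue;

            List<String> children = f.fetchAndIndex(url);
            f.saveLinks(url, children);
            fetched++;

            for (String child : children) {
                if (!visited.contains(child)) frontier.add(child);
            }
        }
    }
}

The queue gives the breadth-first order; the visited set and the last-modification check implement the three conditions listed above.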
2. An indexer which extracts keywords from a page and inserts them into an inverted file

o The indexer first removes all stop words from the file; a dictionary of stop words will be provided
o It then transforms words into stems using the Porter's algorithm
o It inserts the stems into the two inverted files:
! all stems extracted from the page body, together with all statistical information needed to support the vector space model (i.e., no need to support Boolean operations), are inserted into one inverted file
! all stems extracted from the page title are inserted into another inverted file
o The indexes must be able to support phrase search such as “hong kong” in page titles and page bodies.
o Resource: A Java implementation of Porter's algorithm is available here. The RocksDB library from https://github.com/facebook/rocksdb is the suggested library for creating and manipulating the file structures that store the inverted file and the other file structures needed. A minimal indexing sketch is given below.
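For illustration only, a sketch of the indexing pipeline (stop word removal, stemming, insertion into the body inverted file). The stemmer parameter stands for the Porter implementation linked above, and the key layout "stem:pageId -> comma-separated word positions" is our own assumption, not a required design.

import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

// Sketch of indexing a page body into a RocksDB-backed inverted file.
public class BodyIndexer {
    private final RocksDB bodyIndex;            // inverted file for page bodies
    private final Set<String> stopWords;        // the provided stop word dictionary
    private final UnaryOperator<String> stemmer; // Porter stemmer

    public BodyIndexer(RocksDB bodyIndex, Set<String> stopWords, UnaryOperator<String> stemmer) {
        this.bodyIndex = bodyIndex;
        this.stopWords = stopWords;
        this.stemmer = stemmer;
    }

    public void indexBody(int pageId, List<String> words) throws RocksDBException {
        for (int pos = 0; pos < words.size(); pos++) {
            String w = words.get(pos).toLowerCase();
            if (stopWords.contains(w)) continue;           // 1. remove stop words
            String stem = stemmer.apply(w);                 // 2. Porter stemming
            // 3. append the word position to the posting; keeping positions makes
            //    phrase queries such as "hong kong" possible at retrieval time.
            byte[] key = (stem + ":" + pageId).getBytes();
            byte[] old = bodyIndex.get(key);
            String posting = (old == null ? "" : new String(old) + ",") + pos;
            bodyIndex.put(key, posting.getBytes());
        }
    }
}

A second, analogous structure would hold the title inverted file.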
3. A retrieval function (or called the search engine) that compares a list of query terms against the inverted file and returns the top documents, up to a maximum of 50, to the user in a ranked order according to the vector space model. As noted above, phrases must be supported, e.g., “hong kong” universities
o The term weighting formula is based on tf x idf / max(tf), and document similarity is based on the cosine similarity measure.
o Derive and implement a mechanism to favor matches in titles. For example, a match in the title would significantly boost the rank of a page (one possible approach is sketched below).
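A sketch of the scoring formulas only: term weight tf x idf / max(tf), cosine similarity, and one simple (assumed, not prescribed) way to favor title matches by boosting the title similarity with a constant factor before combining it with the body similarity.

import java.util.Map;

// Sketch of vector space scoring; the boost factor is a hypothetical choice.
public class Ranker {
    static final double TITLE_BOOST = 3.0;   // hypothetical weight for title matches

    // weight of a term in a document: tf * idf / maxTf, with idf = log2(N / df)
    static double termWeight(int tf, int maxTf, int df, int numDocs) {
        double idf = Math.log((double) numDocs / df) / Math.log(2);
        return tf * idf / maxTf;
    }

    // cosine similarity between a query vector and a document vector
    static double cosine(Map<String, Double> query, Map<String, Double> doc) {
        double dot = 0, qLen = 0, dLen = 0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            qLen += e.getValue() * e.getValue();
            Double d = doc.get(e.getKey());
            if (d != null) dot += e.getValue() * d;
        }
        for (double d : doc.values()) dLen += d * d;
        return (dot == 0) ? 0 : dot / (Math.sqrt(qLen) * Math.sqrt(dLen));
    }

    // combined score of a page: body similarity plus boosted title similarity
    static double score(Map<String, Double> query, Map<String, Double> bodyVec, Map<String, Double> titleVec) {
        return cosine(query, bodyVec) + TITLE_BOOST * cosine(query, titleVec);
    }
}

You are expected to derive and justify your own title-favoring mechanism; the additive boost above is only one possibility.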
4. A web interface that accepts a user query in a text box, submits the query to the search engine, and displays the returned results to the user
o Since only the vector space model is required, you don’t need to support logical operators; the query is just a list of keywords that allows phrases to be specified in double quotes
o The results are ranked in descending order of the document scores
o Each item returned is displayed in the following format:
score
page title
url
last modification date, size of page
keyword 1 freq 1; keyword 2 freq 2; ...
Parent link 1
Parent link 2
......
Child link 1
Child link 2
......
score
page title
......
o The title and URL are hyperlinked to the actual page on the remote server
! The list of keywords displays up to 5 most frequent stemmed keywords (excluding stop words) in the page together with their occurrence frequencies
! Some pages do not contain the last modified date field and the size field in their headers. In these cases, you can consider the date field to be the last modified date and consider the length of the content (i.e. directly count the number of characters) to be the size of the page.
o While the interface doesn’t have to be fancy, each result item should be formatted with clarity
o All of the results (up to 50 web pages) can be displayed on one page
o In order to interface your search engine with the web interface, you need to write a JSP page to pass the query string to the search engine. While you are allowed to use PHP or other standard web-to-application interface languages, you are encouraged to use JSP, which you will learn in Lab 4. A minimal JSP sketch follows.
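The following is a minimal JSP sketch of how the query string could be passed to the search engine. The search.SearchEngine class and its search(query, maxResults) method are placeholders for your own code under WEB-INF/classes, not part of any provided library.

<%-- search.jsp: minimal sketch only. search.SearchEngine is assumed to return
     already-formatted HTML result items, ranked in descending score order. --%>
<%@ page import="java.util.List" %>
<%@ page import="search.SearchEngine" %>
<form method="get" action="search.jsp">
    <input type="text" name="query">
    <input type="submit" value="Search">
</form>
<%
    String query = request.getParameter("query");
    if (query != null && !query.trim().isEmpty()) {
        List<String> results = SearchEngine.search(query, 50);   // top 50 documents
        for (String item : results) {
            out.println("<div>" + item + "</div>");
        }
    }
%>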
Submissions
PHASE 1:
• It is worth 5% of the course marks. However, if you fail to submit Phase 1, you will get zero marks for your entire project.
• You need to:
o implement a spider (integrated with an indexer) for fetching and indexing webpages
o index 30 pages from http://www.cse.ust.hk
o implement a test program which reads data from the RocksDB database and outputs a plain-text file named spider_result.txt (a minimal sketch is given after the format below). The format of the spider_result.txt file should be as follows:
Page title
URL
Last modification date, size of page
Keyword1 freq1; Keyword2 freq2; Keyword3 freq3; ......
Child Link1
Child Link2 .....
-------------------------------------------------------------
Page title
URL
Last modification date, size of page
Keyword1 freq1; Keyword2 freq2; Keyword3 freq3; ......
Child Link1
Child Link2 .....
.....
.....
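A sketch of such a test program, under the assumption (ours, not a requirement) that a forward-index database already stores one formatted page record (title, URL, date/size, keyword frequencies and child links) per key, and that the database lives in a folder named db/forward_index.

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;
import java.io.PrintWriter;

// Sketch of the Phase 1 test program: iterate over the forward index and dump
// one block per page in the required format.
public class TestProgram {
    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(false);
             RocksDB db = RocksDB.open(options, "db/forward_index");
             PrintWriter out = new PrintWriter("spider_result.txt");
             RocksIterator it = db.newIterator()) {
            for (it.seekToFirst(); it.isValid(); it.next()) {
                out.println(new String(it.value()));   // one page record
                out.println("-------------------------------------------------------------");
            }
        }
    }
}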
• You should submit:
o a document containing the design of the RocksDB database schema of the indexer. All supporting databases should be defined, for example, forward and inverted indexes, and mapping tables for URL <=> page ID and word <=> word ID conversion (an example layout is sketched after this list). The RocksDB database schema depends on the functions implemented. You should include an explanation of your design.
o the source codes of the spider and the test program
o a readme.txt file containing the instructions to build the spider and the test program, and how to execute them.
o the db file which contains the 30 indexed pages, starting from http://www.cse.ust.hk/
o spider_result.txt, which is the output of the test program
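For illustration only, one possible key/value layout for the supporting databases; all names and encodings below are assumptions, and your own schema may differ as long as it is documented:

URL -> page ID            key: url             value: pageId
page ID -> URL            key: pageId          value: url
word -> word ID           key: word            value: wordId
forward index             key: pageId          value: list of (wordId, frequency)
inverted index (body)     key: wordId          value: list of (pageId, positions)
inverted index (title)    key: wordId          value: list of (pageId, positions)
parent -> children        key: parentPageId    value: list of child pageIds
child -> parents          key: childPageId     value: list of parent pageIds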
• Zip the files and submit them via Canvas; the assignment name is Phase1. Only one submission per group is needed.
FINAL SUBMISSION:
It is worth 20% of the course marks. You need to:


o implement the search engine and the web interface. The user inputs queries to the search engine through the web interface. The search engine compares the query terms against the indexed terms in RocksDB and returns the top documents to the user through the web interface.
o deploy your search engine in the VM provided by the department and send the URL to the TAs. They will access your search engine directly. If you have any specific needs, please work them out with the TAs.
o index as many pages in www.cse.ust.hk as possible (preferably the whole website); try out some queries to verify that your search engine works properly.

• You need to submit:
o the source codes of the spider, indexer, search engine and the web interface (DO NOT submit the db files)
o a document (around 8-10 pages long) containing
! Overall design of the system
! The file structures used in the index database
! Algorithms used (including the mechanism for favoring title matches)
! Installation procedure (it could be as simple as “Type make in the project directory”)
! Highlight of features beyond the required specification
! Testing of the functions implemented; include screenshots in the report if applicable
! Conclusion: What are the strengths and weaknesses of your system; what you would have done differently if you could re-implement the whole system; what would be interesting features to add to your system, etc.
• Zip the files and submit them via Canvas; the assignment name is FinalPhase. Only one submission per group is needed.
MARKING SCHEME
                                        Phase 1 (5)    Final Submission (15)
Correctness and completeness               70%               60%
Program design and programming style       10%               10%
Documentation                              20%               10%
Basic user interface design                 0%                5%
Demonstration (to TA)                       0%               10%
Demonstration (in class)                    0%                5%
TOTAL                                     100%              100%

Bonus
• You can enhance your system beyond what is required in the project description to get a bonus [see the Grading Scheme on the course homepage for how much the bonus is worth]. You should communicate with the TAs about the effort and value of your additional work; the final decision, however, is up to the TAs. The following are some examples that may be considered for a bonus. Whether you will get any bonus, and if so, how much, depends on how much effort is required to do the extra work and how well the implementation is done.
1. Implementation of the relevance feedback feature “get similar pages” (cf. Google); clicking the “get similar pages” button will extract the top 5 most frequent keywords (excluding stop words) from the page; rewrite the original query and submit the revised query for a new search.
2. Allow the user to see a list of all (stemmed) keywords indexed in your database, browse through them, and select the keywords he/she is interested in, and then submit them as a vector-space query to your search engine.
3. A user-friendly interface for all of the functions (e.g., making use of AJAX and 3D to make interaction instantaneous and adventurous).
4. Keep track of query history and allow users to view the result of a previous query (or operate on the results, e.g., merging the results of two previous queries or search within the result of a previous query).
5. Many other features that you can observe from existing commercial search engines.
6. Consider links in result ranking (e.g., PageRank; a minimal sketch follows this list).
7. Exceedingly good speed by using special implementation techniques (e.g., better index and cache design).
8. Any interesting and innovative ideas you can dream of.
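For bonus item 6, a minimal sketch of the basic PageRank iteration over the parent/child link structure. The damping factor 0.85 and 20 iterations are conventional choices, not requirements, and the sketch assumes every linked page also appears as a key in the map.

import java.util.*;

// Minimal PageRank sketch: 'links' maps each page ID to the page IDs it links to.
public class PageRank {
    public static Map<Integer, Double> compute(Map<Integer, List<Integer>> links) {
        final double d = 0.85;                 // damping factor (conventional value)
        int n = links.size();
        Map<Integer, Double> rank = new HashMap<>();
        for (Integer p : links.keySet()) rank.put(p, 1.0 / n);

        for (int iter = 0; iter < 20; iter++) {
            Map<Integer, Double> next = new HashMap<>();
            for (Integer p : links.keySet()) next.put(p, (1 - d) / n);
            for (Map.Entry<Integer, List<Integer>> e : links.entrySet()) {
                List<Integer> out = e.getValue();
                if (out.isEmpty()) continue;   // dangling pages ignored in this sketch
                double share = rank.get(e.getKey()) / out.size();
                for (Integer q : out) next.merge(q, d * share, Double::sum);
            }
            rank = next;
        }
        return rank;
    }
}

The resulting scores could then be combined with the vector space score when ranking results.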
Implementation Hints

• Using RocksDB in JSP
o To use RocksDB in JSP, the name of the database folder should be set as an absolute path. That is, dbPath = "/home/your_account/public_html/database";
o In case your JSP pages need to write to the database, make sure that the tomcat user has permission to write to the folder. To do so, execute the command to set the permission: chmod g+rw public_html
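A minimal sketch of opening the database with such an absolute path (the working directory of tomcat is not your project directory, hence the absolute path). "your_account" is a placeholder, and the class name is our own.

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Sketch: open RocksDB at an absolute path so it also works when called from JSP.
public class DbHolder {
    public static RocksDB open() throws RocksDBException {
        RocksDB.loadLibrary();
        String dbPath = "/home/your_account/public_html/database";   // absolute path
        Options options = new Options().setCreateIfMissing(true);
        return RocksDB.open(options, dbPath);
    }
}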
• Using Java classes and libraries in JSP
o To use Java classes, create a folder named WEB-INF in the public_html folder. In the WEB-INF folder, create another folder named classes, and place all the compiled class files in it. Make sure that all the folders and classes are accessible to the tomcat user, i.e. chmod g+r WEB-INF -R
o To use java libraries, the jar files should be placed into WEB-INF/lib/. Similar to the java classes case, the permissions should be set properly.
Some Suggestions
1. You are free to choose any programming language you like for your project. However, you do so at your own risk if you choose another language, and you should not expect us to help you with debugging.
2. You are also free to choose other packages to implement the project. However, you have to make sure you implement a “real” search engine. For example, storing all the index information in arrays or other main-memory data structures is not acceptable.
3. You are suggested to use the provided stopword list.
4. Speed is not essential but must be reasonable, in that your program (spider or search) should run at an average speed (i.e., if your program takes 5 times longer than your friend’s, then there must be something wrong with your design, or with your friend’s). If you have implemented special techniques that speed up indexing and search (e.g., a special index design and/or intelligent caching), you can highlight them in your report and to the TA to qualify for bonus points. If your system is judged to be slower than normal, you should be able to give an explanation to avoid a penalty.
5. If you are so unfortunate that you cannot finish the phases before the deadlines, you should still submit the parts that you have done so that the TA can give you partial marks. I won’t be harsh in grading as I realize that you have lots of things to do. However, don’t use this as an excuse for copying from others or from old projects. If you are so unlucky that I discover you copied from others, both you and the one who gave you the source will get ZERO MARKS!