Project
Design a search engine over the Yelp dataset, which is in JSON format and can be downloaded from
https://www.yelp.com/dataset/
You can ignore the photo dataset. The schema of the data can be found in the other file.
Your task includes the following:
1) Create a Lucene index for the collection, and write a program that takes a query from the user and returns a list of the top 20 documents (for a ranking query). The index should include fields from the data, such as the name of the point of interest (POI).
a) It should support both Boolean queries and ranking queries.
b) Boolean queries are expected to be able to include field information.
2) Create 20 queries, and retrieve the top 10 results for each. You can use two retrieval models and evaluate their performance. You need to design the experiments.
3) Discuss how to improve the accuracy of the retrieval models.
4) Cluster the documents using a clustering algorithm. Display the most frequent words in each cluster.
Advanced topic:
Find out how to index the coordinate information in Lucene. Design several queries that combine location information and keyword information (such as finding a restaurant in an area or finding the nearest restaurant), similar to the queries supported by Google Maps, and implement your queries in Lucene.
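To make the intended behaviour of a location-plus-keyword query concrete, here is a minimal reference sketch in plain Python. It is not a Lucene implementation; it is only a brute-force baseline you could use to sanity-check the results of your Lucene queries. The field names (categories, latitude, longitude), the sample records, and the helper names haversine_km and nearest_matching are assumptions made for illustration; check them against the schema file.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_matching(businesses, keyword, lat, lon, radius_km=None, k=5):
    """Return up to k (distance_km, business) pairs, nearest first.

    A business matches if its categories string contains the keyword
    (case-insensitive) and, when radius_km is given, it lies within
    that distance of (lat, lon).
    """
    keyword = keyword.lower()
    hits = []
    for b in businesses:
        # Assumed field: "categories" is a comma-separated string (may be None).
        if keyword not in (b.get("categories") or "").lower():
            continue
        d = haversine_km(lat, lon, b["latitude"], b["longitude"])
        if radius_km is None or d <= radius_km:
            hits.append((d, b))
    hits.sort(key=lambda pair: pair[0])
    return hits[:k]


# Hypothetical sample records, not taken from the real dataset.
SAMPLE = [
    {"name": "Cafe A", "categories": "Coffee & Tea, Restaurants",
     "latitude": 36.10, "longitude": -115.17},
    {"name": "Diner B", "categories": "Restaurants, Diners",
     "latitude": 36.20, "longitude": -115.17},
    {"name": "Gym C", "categories": "Fitness",
     "latitude": 36.10, "longitude": -115.17},
]

if __name__ == "__main__":
    for dist, biz in nearest_matching(SAMPLE, "restaurants", 36.10, -115.17):
        print(f"{biz['name']}: {dist:.2f} km")
```

Passing radius_km corresponds to "find a restaurant in an area", while leaving it as None and taking the first result corresponds to "find the nearest restaurant". A linear scan like this does not scale to the full collection, which is why the task asks you to index the coordinates in Lucene instead.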
What to deliver:
Hardcopy: The final report, up to 8 A4 pages (it is not necessary to fill all 8 pages).
Softcopy: Your report and your source code.
Due: 13 April