程序代写代做代考 scheme Java hadoop COMP9313 2018s2 Project 1 (25 marks)

COMP9313 2018s2 Project 1 (25 marks)

Problem statement:

Guild the inverted index for a given set of documents (compute the term

weights by TF-IDF as shown in slide 5 of Chapter 4, using base 10

logarithm). Ignore the letter case, i.e., consider all words as lower case.

Input files:

Each line is in format of “DocID DOC”, where DocID is the ID of the

document, and DOC is the document content (a list of terms). For example:

0 Construct Inverted Index

1 using MapReduce

2 inverted index code

3 index index file

This sample file “tiny-doc.txt” can be downloaded at:

https://webcms3.cse.unsw.edu.au/COMP9313/18s2/resources/20465

Output:

If a term W is contained in N documents, the corresponding output for this

term contains N lines, in format of: “W\tDocID1, weight1”, “W\tDocID2,

weight2”, …, and “W\tDocIDn, weightn”. The DocIDs are sorted in

ascending order, and the term weights are of double precision (stored in

double/DoubleWritable). Given the example graph, the output file is like

(assumed in one file):

code\t2, 0.6020599913279624

construct\t0, 0.6020599913279624

file\t3, 0.6020599913279624

index\t0, 0.12493873660829993

index\t2, 0.12493873660829993

index\t3, 0.24987747321659987

inverted\t0, 0.3010299956639812

inverted\t2, 0.3010299956639812

mapreduce\t1, 0.6020599913279624

using\t1, 0.6020599913279624

Meiyan Pan

Meiyan Pan

Code format:

Name your package as “comp9313.proj1” and name your driver class as

“Project1.java”. Your program should receive 3 parameters: the input folder,

the output folder and the number of reducers. Finally, package all your java

files as a zip file with name “InvertedIndex.zip”.

Compile:

Your java code will be compiled and packaged as a jar file, and we will use

the following commands to check the correctness of your solution:

$ $HADOOP_HOME/bin/hadoop jar YOURJAR.jar YOURCLASS input output 1

$ $HADOOP_HOME/bin/hdfs dfs –cat output/*

Please ensure that the code you submit can be compiled and packaged. Any

solution that has compilation errors will receive no more than 5 points for

the entire assignment.

Documentation and code readability

Your source code will be inspected and marked based on readability and

ease of understanding. The documentation (comments of the codes) in your

source code is also important. Below is an indicative marking scheme:

Result correctness: 18

Efficiency: 5

Code structure, Readability, and

Documentation: 2

Submission:

Deadline: Sunday 9th Sep 09:59:59 PM

Log in any CSE server (e.g., williams or wagner), and use the give

command below to submit your solutions:

$ give cs9313 project1 InvertedIndex.zip

Or you can submit through:

https://cgi.cse.unsw.edu.au/~give/Student/give.php

If you submit your assignment more than once, the last submission will

replace the previous one. To prove successful submission, please take a

screenshot as assignment submission instructions show and keep it by

https://cgi.cse.unsw.edu.au/~give/Student/give.php

yourself. If you have any problems in submissions, please email to

xuefeng.chen@student.unsw.edu.au.

Late submission penalty

10% reduction of your marks for the 1st day, 30% reduction/day for the

following days.

Plagiarism:

The work you submit must be your own work. Submission of work partially

or completely derived from any other person or jointly written with any

other person is not permitted. The penalties for such an offence may include

negative marks, automatic failure of the course and possibly other academic

discipline. Assignment submissions will be examined manually.

Relevant scholarship authorities will be informed if students holding

scholarships are involved in an incident of plagiarism or other misconduct.

Do not provide or show your assignment work to any other person – apart

from the teaching staff of this subject. If you knowingly provide or show

your assignment work to another person for any reason, and work derived

from it is submitted you may be penalized, even if the work was submitted

without your knowledge or consent.

mailto:xuefeng.chen@student.unsw.edu.au