CS 422: Data Mining
Department of Computer Science, Illinois Institute of Technology
Vijay K. Gurbani, Ph.D.
Spring 2019: Homework 4 (10 points)
Due date: Sunday, May 5, 2019, 11:59:59 PM Chicago Time
1 Exercises (4 points, divided evenly among the questions)
Please submit a PDF file containing answers to these questions. Any other file format will lead to a loss of 0.5 point. Non-PDF files that cannot be opened by the TAs will lead to a loss of 2 points.
1.1 Leskovec, Ch. 3, Finding Similar Items (Page numbers correspond to the online version of the book; see syllabus for the URL.)
Page 77, Exercise 3.1.1; Page 80, Exercise 3.2.1; Page 86, Exercise 3.3.3; Page 91, Exercise 3.4.1 (evaluate the S curve only for the first two bullets); Page 97, Exercise 3.5.4; Page 98, Exercise 3.5.5.
1.2 More to come…
2 Practicum problems (6 points, divided evenly among the problems)
Please label your answers clearly; see the Homework 0 R notebook for an example (the Homework 0 R notebook is available in "Blackboard → Assignment and Projects → Homework 0"). Each answer must be preceded by the R markdown header shown in the Homework 0 R notebook (### Part 2.1-A-ii, for example). Failure to clearly label the answers in the submitted R notebook will lead to a loss of 2 points per problem below.
2.1 Topic: Locality sensitive hashing (2 Points)
In this problem, you will run LSH (Locality Sensitive Hashing).
You are provided a corpus of 100 documents (see corpus.zip). This corpus was donated by Paul Clough and Mark Stevenson (University of Sheffield). They designed it to represent varying degrees of plagiarism and hope that it will be a useful addition to the set of resources available for the evaluation of plagiarism detection systems.
Read this corpus in by unzipping the ZIP file in a directory. Let's assume you unzip it in /home/vkg/corpus. To read all the files in, use the following R command:
files <- list.files("/home/vkg/corpus", full.names=TRUE)
The above command will cause all of the files to be stored in the list files, using the complete path name of each file. You can subsequently pass the files list to the textreuse::TextReuseCorpus() function.

Please read all of the parts of the homework carefully before attempting any question. If you detect any ambiguities in the instructions, please let me know right away instead of waiting until after the homework has been graded. More parts will be added to this homework description.

Make sure you set the seed to 100 when you invoke textreuse::minhash_generator(). If you do not do this, your output will not match expectation and points will be taken off!
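For concreteness, here is a minimal sketch of the corpus-loading step. It assumes the textreuse package is installed and that the ZIP file was unzipped into /home/vkg/corpus as above; the shingle size (n=5) is an illustrative choice only, not a requirement of the assignment.

library(textreuse)

# Read the full path of every file in the unzipped corpus directory.
files <- list.files("/home/vkg/corpus", full.names=TRUE)

# Seed the minhash generator with 100, as required; 240 hashes matches part (d) below.
minhash <- minhash_generator(n=240, seed=100)

# Build the corpus; tokenize_ngrams with n=5 is an illustrative shingle size.
corpus <- TextReuseCorpus(files, tokenizer=tokenize_ngrams, n=5,
                          minhash_func=minhash, keep_tokens=TRUE)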
Based on this corpus, please answer the following questions. (A hedged sketch of how the relevant textreuse calls fit together appears after the question list.)
(a) How many shingles (or tokens) are there in all of the 100 documents? (Hint: look at textreuse::tokens().)
(b) What are the dimensions of the characteristic matrix?
(c) Print the first 5 shingles (or tokens) of the file orig_taske.txt.
(d) We will fix our signatures (or hashes, or the rows in the signature matrix) at 240. What percentage reduction in the size of the problem does this represent?
(e) At 240 signatures (or hashes), how many bands would we need to detect a minimum Jaccard similarity of 0.23? (Hint: Use lsh_threshold().)
(f) Using the number of bands you determined in (e), what is our probability of catching similar documents at a minimum Jaccard similarity of 0.23?
(g) We will first use the brute-force method to determine how many documents are similar. To do this, run pairwise_candidates() on the characteristic matrix (i.e., original data in the corpus).
(i) How many comparisons were made when we used the characteristic matrix?
(ii) Examine the object returned by pairwise_candidates(). The object returned is called a tibble. Tibbles are data frames except that they are invariant, i.e., once created, they do not change. From this object, how many documents have a Jaccard similarity score of at least 0.23?
(iii) List all the rows in the tibble that contain a Jaccard similarity of at least 0.23, sorted in decreasing order by the score.
(h) We will now use LSH to determine how many documents are similar. To do this, use lsh() and lsh_compare() to find the candidate pairs using the number of bands determined in (e).
(i) While running LSH on the corpus, how many comparisons were made?
(ii) Compared to the number of comparisons made in (g)(i), what was the percentage decrease in the computation needed to find the candidate pairs?
(iii) List all the rows in the tibble that was obtained in (h), sorted in decreasing order by the score. How many rows are there in the tibble?
(iv) How many rows contain a similarity index of at least 0.23?
(v) List all the rows in the tibble that contain a similarity of at least 0.23, sorted in decreasing order by the score.
(i) Examine the output of g(iii) and h(v). Comment on the output in terms of how well LSH did in catching duplicate documents.
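The sketch below outlines, under stated assumptions, how the textreuse functions named in the questions above fit together. It assumes the corpus object built earlier; b = 60 is only a placeholder for the number of bands you must determine in (e), and lsh_candidates() is the intermediate step the package requires between lsh() and lsh_compare(). Treat it as a starting point, not a model answer.

library(textreuse)
library(dplyr)   # used only to sort and filter the tibbles below

# (a)-(c) Shingles: tokens() returns the shingles of a document. Document IDs
# are typically the file names without the .txt extension; check names(corpus).
head(tokens(corpus[["orig_taske"]]), 5)

# (e)-(f) Band selection: lsh_threshold() gives the approximate Jaccard threshold
# for h hashes split into b bands; lsh_probability() gives the chance of catching
# a pair with true similarity s. b = 60 is a placeholder; vary b to answer (e).
b <- 60
lsh_threshold(h=240, b=b)
lsh_probability(h=240, b=b, s=0.23)

# (g) Brute force: compare every pair of documents, then extract candidate pairs.
comparisons <- pairwise_compare(corpus, jaccard_similarity)
candidates  <- pairwise_candidates(comparisons)
candidates %>% filter(score >= 0.23) %>% arrange(desc(score))

# (h) LSH: bucket the minhash signatures into bands, then score only the candidates.
buckets         <- lsh(corpus, bands=b)
candidate_pairs <- lsh_candidates(buckets)
lsh_scores      <- lsh_compare(candidate_pairs, corpus, jaccard_similarity)
lsh_scores %>% filter(score >= 0.23) %>% arrange(desc(score))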
2.2 Topic: Content-based recommendation system (2 Points)
To fill in.
2.3 Topic: Collaborative Filtering (2 points)
To fill in.