CS计算机代考程序代写 js algorithm Name: ___________________________ USC ID: ______________________ Consider the following characteristic matrix of two sets: S1 and S2.

Name: ___________________________ USC ID: ______________________ Consider the following characteristic matrix of two sets: S1 and S2.
Row #
S1
S2
0
1
0
1
1
1
2
1
1
3
0
1
4
1
1
5
1
0
6
0
1
1. (2 pts) What are the minhash values of S1 and S2 based on the permutation using h1(x) = (x + 1) mod 7 (1 pt) and h2(x) = (x + 4) mod 7 (1 pt).
(Answer)
Row #
S1
S2
(x+1) mod 7
(x+4) mod 7
0
1
0
1
4
1
1
1
2
5
2
1
1
3
6
3
0
1
4
0
4
1
1
5
1
5
1
0
6
2
6
0
1
0
3
minhash value: the number of the first row, in the permuted order, in which the column has a 1. h1(S1) = 1, h1(S2) = 0
h2(S1) = 1, h2(S2) = 0
or calculate minhash values by following algorithm:
0123
→→→→
456
→→→
S1
S2
h1
∞
∞
h2
∞
∞
S1
S2
h1
1
∞
h2
4
∞
S1
S2
h1
1
2
h2
4
5
S1
S2
h1
1
2
h2
4
5
S1
S2
h1
1
2
h2
4
0
S1
S2
h1
1
2
h2
1
0
S1
S2
h1
1
2
h2
1
0
S1
S2
h1
1
0
h2
1
0
1

Name: ___________________________ USC ID: ______________________
2. (2pts) Construct a signature for S1 and S2 based on the minhash values obtained from h1(x) and h2(x) above. Estimate the Jaccard similarity of S1 and S2 using the signature. What is the actual Jaccard similarity of S1 and S2 (1pt)? Is the estimate close to the actual Jaccard similarity? If not, suggest a way to improve the estimate (1pt).
(Answer)
Estimated the Jaccard similarity (JS) of S1 and S2 = 0/2 = 0 The actual JS of S1 and S2 = 3/7
No, the estimate is not close to the actual JS. Increasing the number of permutations or hash functions will improve the estimate.
3. (2pts) Given four documents A, B, C, and D and their top two TF-IDF words, A: nba, basketball; B: cancer, health; C: vote, democratic; D: basketball, baseball, write the Boolean feature vectors for each document (1pt) and calculate the cosine similarity between A, D (1pt)
(Answer)
Boolean feature vectors
cosine similarity between A and D = 1 = 1/2 √2√2
S1
S2
h1
1
0
h2
1
0
nba
basketball
cancer
health
vote
democratic
baseball
A
1
1
0
0
0
0
0
B
0
0
1
1
0
0
0
C
0
0
0
0
1
1
0
D
0
1
0
0
0
0
1
2

Name: ___________________________ USC ID: ______________________
4. (4pts) Given a set of document, briefly explain how to calculate TF and IDF in TF-IDF score. You need to describe any preprocessing you need to apply to the words in a document (e.g., stemming) (1 pt) and how to calculate both the TF and IDF components (1 pt). Then briefly explain how you would use Min-Hash and LSH to find similar documents (2 pts).
(Answer)
(1) Preprocessing (1 pt)
For each document, obtain words (terms) by tokenization, lower case, removing stop words, stemming. (2) Calculating a TF-IDF score for each word in each document (1 pt)
TF𝑖𝑗 = the number of occurrences of word i in document j maximum occurrences of word in document j
IDF =log totalnumberofdocuments
𝑖 2 the number of documents having word i
Then, calculate TF-IDF score (wij = TFij X IDFi) for each word i in each document j
(3) Min-Hash and LSH (2 pts)
1) construct a feature vector for each document based on words having the highest TF-IDF
scores
Example of feature matrix for all documents
2) minhash: construct signatures by permutating rows of the feature matrix or hash functions to reduce size of the matrix. That is, the length of the signature n is less than the size of the feature vector m. (0.5pts)
3) LSH: divide the signatures into b bands. If all r rows of at least one band are the same between two documents, the documents are a candidate pair for the similar documents. With LSH, the number of candidate pairs for the similar documents can be considerably reduced.
4) check the similarity of candidate pairs by calculating cosine or Jaccard similarity of their signatures or the feature vectors. If the similarity is larger than the defined threshold, then the pair can be considered as similar.
D0
D2
D3
…
Dx-1
W0
0
1
0
0
W2
1
1
0
1
:
Wm-1
0
1
1
0
3

Related Posts