Quiz 3. Similar Items
Suppose that two sets are considered to be similar if their Jaccard similarity is greater than or equal to .6.
Consider two sets S1 and S2. Suppose that their actual Jaccard similarity is .8. Consider their minhash signatures S1’ and S2’, each having 100 minhash values.
Suppose the signatures are divided into 20 bands with 5 rows in each band. That is, b = 20, r = 5. Locality-sensitive hashing (LSH) is then applied to the signatures to obtain candidate pairs of sets.
1. [3 points] What is the probability that S1 and S2 are not identified as a candidate pair (i.e., false negative rate)?
(Answer)
False negative rate: the probability that S1 and S2 are not candidate pair in any band
(1 − 𝑠𝑟)𝑏 = (1 − 0.85)20 = 0.000356 = 3.56 × 10−4
0.0356%
2. [2 points] Give the formula for computing the predicted threshold. Compute the predicted threshold when b = 20 and r = 5.
(Answer)
111
Predicted threshold: 𝑠 = (1 − (1 − 𝑝)𝑏)𝑟 when p = 0.5, or approximately (1)𝑟 𝑏
111
𝑠=(1−(1−0.5)𝑏)𝑟 =0.509or(1)𝑟 =0.549 𝑏
Now let’s set b = 10, r = 10, and perform LSH again.
3. [3 points] Compute the new predicted threshold [1 point]. Can you predict if the false negative rate will go up or down, by comparing the new predicted threshold with one in question (2) [2 points]?
(Answer)
𝑠=(1−(1−0.5)𝑏)𝑟 =0.763 or(1)𝑟 =0.794 𝑏
The new predicted threshold (b in the graph) is higher than the predicted threshold of question 2 (a in the graph), and the false negative rate will go up (red area in the graph).
The pairs with similarity above the predicted threshold are very likely to become candidates, while those below the threshold are unlikely to become candidates. Thus, the higher predicted threshold, the less candidates. That is, the higher predicted threshold means reducing false positive and increasing false negative.
111
4. [2 points] Now compute the false negative rate, when b = 10, and r = 10. Does it go up or down, compared to ones when b = 20 and r = 5? Do you reach the same conclusion as question (3)?
(Answer)
(1 − 𝑠𝑟)𝑏 = (1 − 0.810)10 = 0.32114
32.11%
False negative rate is increased, compared to ones when b=20 and r=5. Yes, it is the same conclusion as question 3.