COMP9313: Big Data Management
Course web site: http://www.cse.unsw.edu.au/~cs9313/
Chapter 7.1:
Finding Similar Items
A Common Metaphor
❖ Many problems can be expressed as finding “similar” sets:
➢ Find near-neighbors in high-dimensional space
❖ Examples:
➢ Pages with similar words
    For duplicate detection, classification by topic
➢ Customers who purchased similar products
    Products with similar customer sets
➢ Images with similar features
    Google image search
Images with Similar Features
Similarity Search in One Dimensional Space
❖ Just numbers: use binary search, binary search tree, B+-tree, …
❖ The essential idea behind all of these: objects can be sorted
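As a minimal sketch of the sorted-order idea (the data and query below are made-up examples, not from the slides), a 1D nearest-neighbour lookup via binary search:

    import bisect

    def nearest_1d(sorted_vals, q):
        # Nearest neighbour of q in a sorted list via binary search, O(log n)
        i = bisect.bisect_left(sorted_vals, q)        # first index with value >= q
        lo = max(i - 1, 0)                            # the two values bracketing q
        return min(sorted_vals[lo:i + 1], key=lambda v: abs(v - q))

    vals = sorted([7, 1, 4, 12, 9])
    print(nearest_1d(vals, 8))                        # -> 7 (ties broken toward the left)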
Similarity Search in 2D Space
❖ k nearest neighbour (kNN) query: find the top-k nearest spatial objects to the query location
❖ E.g., find the top-5 closest restaurants to UNSW
❖ In Euclidean space: distance is the straight-line distance between points
❖ In road networks: distance is computed based on the network distance (such as the length of the shortest path)
[Figure: a road network with objects p1–p5; p5 is the closest in the spatial network setting, while p1 is the closest in the Euclidean space]
The Problem in 2D Space
❖ Euclidean space ➢ Grid index
➢ Quad-tree
➢ k-d tree
➢ R-tree (R+-tree, R*-tree, etc.)
➢ m-tree, x-tree, … …
➢ Space filing curves: Z-order, Hilbert order, … …
❖ Road Networks ➢ G-tree
➢ Contraction Hierarchy ➢ 2-hop labeling ➢……
Curse of Dimensionality
❖ Refers to various phenomena that arise in high dimensional spaces that do not occur in low dimensional settings.
❖ Specifically, refers to the decrease in performance of similarity search query processing when the dimensionality increases.
❖ In high dimensional space, almost all points are far away from each other.
➢ To find the top-10 nearest neighbors, what is the side length of the average neighborhood cube? (For n points uniform in the unit d-dimensional cube, a cube expected to contain k points has side length (k/n)^(1/d), which approaches 1 as d grows.)
❖ Given: high dimensional data points x1, x2, …
➢ For example, an image is a long vector of pixel colors:
    1 2 1
    0 2 1   →  [1 2 1 0 2 1 0 1 0]
    0 1 0
❖ And some distance function d(x1, x2)
➢ which quantifies the “distance” between x1 and x2
❖ Goal: find all pairs of data points (xi, xj) that are within some distance threshold: d(xi, xj) ≤ s
❖ Note: the naïve solution would take O(N²), where N is the number of data points
❖ MAGIC: this can be done in O(N)!! How?
Distance Measures
❖ Goal: Find near-neighbors in high-dim. space
➢ We formally define “near neighbors” as points that are a “small distance” apart
❖ For each application, we first need to define what “distance” means
❖ Today: Jaccard distance/similarity
➢ The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
    sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
➢ Jaccard distance: d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|
❖ Example: 3 elements in the intersection, 8 in the union:
    Jaccard similarity = 3/8
    Jaccard distance = 5/8
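A minimal Python sketch of these two definitions (the sets A and B below are chosen to mirror the 3-in-intersection, 8-in-union example):

    def jaccard_similarity(a, b):
        # |A ∩ B| / |A ∪ B|; defined here as 0 for two empty sets
        if not a and not b:
            return 0.0
        return len(a & b) / len(a | b)

    def jaccard_distance(a, b):
        return 1.0 - jaccard_similarity(a, b)

    A = {1, 2, 3, 4, 5, 6}
    B = {3, 4, 5, 7, 8}
    print(jaccard_similarity(A, B))   # 3/8 = 0.375
    print(jaccard_distance(A, B))     # 5/8 = 0.625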
Task: Finding Similar Documents
❖ Goal: Given a large number (N in the millions or billions) of documents, find “near duplicate” pairs
❖ Applications:
➢ Mirror websites, or approximate mirrors
    Don’t want to show both in search results
➢ Similar news articles at many news sites
    Cluster articles by “same story”
❖ Problems:
➢ Many small pieces of one document can appear out of order in another
➢ Too many documents to compare all pairs
➢ Documents are so large or so many that they cannot fit in main memory
3 Essential Steps for Similar Docs
1. Shingling: Convert documents to sets
2. Min-Hashing: Convert large sets to short signatures, while preserving similarity
3. Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
➢ Candidate pairs: those pairs of signatures that we need to test for similarity
The Big Picture
[Pipeline: Document → Shingling → the set of strings of length k that appear in the document → Min-Hashing → signatures (short integer vectors that represent the sets, and reflect their similarity) → Locality-Sensitive Hashing → candidate pairs!]
Step 1: Shingling: Convert documents to sets
Documents as High-Dimensional Data
❖ Simple approaches:
➢ Document = set of words appearing in document
➢ Document = set of “important” words
➢ These don’t work well for this application. Why?
❖ Need to account for the ordering of words!
❖ A different way: Shingles!
Define: Shingles
❖ A k-shingle (or k-gram) for a document is a sequence of k tokens that appears in the doc
➢ Tokens can be characters, words or something else, depending on the application
➢ Assume tokens = characters for the examples
❖ Example: k = 2; document D1 = abcab
    Set of 2-shingles: S(D1) = {ab, bc, ca}
➢ Option: Shingles as a bag (multiset), count ab twice: S’(D1) = {ab, bc, ca, ab}
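A small sketch of character-level shingling, covering both the set and the bag variant of the abcab example:

    from collections import Counter

    def shingles(doc, k):
        # Set of k-shingles (character k-grams) of a document
        return {doc[i:i + k] for i in range(len(doc) - k + 1)}

    def shingle_bag(doc, k):
        # Bag (multiset) variant: counts ab twice for abcab
        return Counter(doc[i:i + k] for i in range(len(doc) - k + 1))

    print(shingles("abcab", 2))      # {'ab', 'bc', 'ca'}
    print(shingle_bag("abcab", 2))   # Counter({'ab': 2, 'bc': 1, 'ca': 1})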
Shingles and Similarity
❖ Documents that are intuitively similar will have many shingles in common.
❖ Changing a word only affects k-shingles within distance k-1 from the word.
❖ Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries.
❖ Example: k=3, “The dog which chased the cat” versus “The dog that chased the cat”.
➢ Only 3-shingles replaced are g_w, _wh, whi, hic, ich, ch_, and h_c.
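A quick check of this example with the shingling sketch above (whitespace is kept as a character; the slide writes it as _):

    def shingles(doc, k):
        return {doc[i:i + k] for i in range(len(doc) - k + 1)}

    d1 = "The dog which chased the cat"
    d2 = "The dog that chased the cat"
    # Shingles that occur in exactly one of the two documents:
    print(sorted(shingles(d1, 3) ^ shingles(d2, 3)))
    # Only shingles overlapping the changed word appear; the rest are shared.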
Compressing Shingles
❖ To compress long shingles, we can hash them to (say) 4 bytes
❖ Represent a document by the set of hash values of its k-shingles
➢ Caveat: Two documents could (rarely) appear to have shingles in common, when in fact only the hash-values were shared
❖ Example: k = 2; document D1 = abcab
    Set of 2-shingles: S(D1) = {ab, bc, ca}
    Hash the shingles: h(D1) = {1, 5, 7}
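A sketch of this compression step, using CRC32 as one concrete choice of 4-byte hash (any well-mixed 32-bit hash would do; the function name is mine):

    import zlib

    def hashed_shingles(doc, k):
        # Represent a document by 32-bit (4-byte) hashes of its k-shingles
        return {zlib.crc32(doc[i:i + k].encode())
                for i in range(len(doc) - k + 1)}

    print(hashed_shingles("abcab", 2))  # three 32-bit ints, one per distinct shingle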
Similarity Metric for Shingles
❖ Document D1 is a set of its k-shingles C1 = S(D1)
❖ Equivalently, each document is a 0/1 vector in the space of k-shingles
➢ Each unique shingle is a dimension
➢ Vectors are very sparse
❖ A natural similarity measure is the Jaccard similarity:
    sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
Working Assumption
❖ Documents that have lots of shingles in common have similar text, even if the text appears in different order
❖ If we pick k too small, then we would expect most sequences of k characters to appear in most documents
➢ We could have documents whose shingle-sets had high Jaccard similarity, yet the documents had none of the same sentences or even phrases
➢ Extreme case: when we use k = 1, almost all Web pages will have high similarity.
❖ Caveat: You must pick k large enough, or most documents will have most shingles
➢ k = 5 is OK for short documents
➢ k = 10 is better for long documents
Motivation for Min-Hash/LSH
❖ Suppose we need to find near-duplicate documents among N = 1 million documents
❖ Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs
➢ N(N − 1)/2 ≈ 5×10^11 comparisons
➢ At 10^5 secs/day and 10^6 comparisons/sec, it would take 5 days
❖ For N = 10 million, it takes more than a year…
Step 2: Min-Hashing: Convert large sets to short signatures, while preserving similarity
[Pipeline so far: Document → Shingling → the set of strings of length k that appear in the document → Min-Hashing → signatures: short integer vectors that represent the sets, and reflect their similarity]
Encoding Sets as Bit Vectors
❖ Many similarity problems can be formalized as finding subsets that have significant intersection
❖ Encode sets using 0/1 (bit, boolean) vectors
➢ One dimension per element in the universal set
❖ Interpret set intersection as bitwise AND, and set union as bitwise OR
❖ Example: C1 = 10111; C2 = 10011
➢ Size of intersection = 3; size of union = 4,
➢ Jaccard similarity (not distance) = 3/4
➢ Distance: d(C1,C2) = 1 – (Jaccard similarity) = 1/4
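A tiny sketch of this encoding, with each set stored as a Python integer bit vector (mirrors the C1 = 10111, C2 = 10011 example):

    def jaccard_from_bits(c1, c2):
        # Intersection = bitwise AND, union = bitwise OR
        inter = bin(c1 & c2).count("1")
        union = bin(c1 | c2).count("1")
        return inter / union

    c1 = 0b10111
    c2 = 0b10011
    print(jaccard_from_bits(c1, c2))  # 3/4 = 0.75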
From Sets to Boolean Matrices
❖ Rows = elements (shingles)
❖ Columns = sets (documents)
➢ 1 in row e and column s if and only if e is a member of s
➢ Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
➢ Typical matrix is sparse!
❖ Each document is a column
From Sets to Boolean Matrices
❖ Example: S1 = {a, d}, S2 = {c}, S3 = {b, d, e}, and S4 = {a, c, d}
➢ sim(S1, S3) = ?
    Size of intersection = 1; size of union = 4
    Jaccard similarity (not distance) = 1/4
    d(S1, S3) = 1 − (Jaccard similarity) = 3/4
Outline: Finding Similar Columns
❖ So far:
➢ Documents → Sets of shingles
➢ Represent sets as boolean vectors in a matrix
❖ Next goal: Find similar columns while computing small signatures
➢ Similarity of columns == similarity of signatures
Outline: Finding Similar Columns
❖ Next goal: Find similar columns, small signatures
❖ Naïve approach:
➢ 1) Signatures of columns: small summaries of columns
➢ 2) Examine pairs of signatures to find similar columns
    Essential: similarities of signatures and columns are related
➢ 3) Optional: check that columns with similar signatures are really similar
❖ Warnings:
➢ Comparing all pairs may take too much time: job for LSH
➢ These methods can produce false negatives, and even false positives (if the optional check is not made)
Hashing Columns (Signatures)
❖ Key idea: “hash” each column C to a small signature h(C), such that:
➢ (1) h(C) is small enough that the signature fits in RAM
➢ (2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)
❖ Goal: Find a hash function h(·) such that:
➢ If sim(C1, C2) is high, then with high prob. h(C1) = h(C2)
➢ If sim(C1, C2) is low, then with high prob. h(C1) ≠ h(C2)
❖ Hash docs into buckets. Expect that “most” pairs of near duplicate docs hash into the same bucket!
❖ Clearly, the hash function depends on the similarity metric:
➢ Not all similarity metrics have a suitable hash function
❖ There is a suitable hash function for the Jaccard similarity: Min-Hashing
Min-Hashing
❖ Imagine the rows of the boolean matrix permuted under a random permutation π
❖ Define a “hash” function hπ(C) = the index of the first (in the permuted order π) row in which column C has value 1:
    hπ(C) = min π(C)
❖ Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature of a column
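A direct, permutation-based sketch of this definition (all names here are mine; columns are given as sets of row indices):

    import random

    def minhash_signatures(columns, n_rows, num_perm=100, seed=0):
        # columns: list of non-empty sets of row indices (the 1-positions)
        rng = random.Random(seed)
        sigs = [[] for _ in columns]
        for _ in range(num_perm):
            perm = list(range(n_rows))
            rng.shuffle(perm)                 # pi: row index -> permuted position
            for sig, col in zip(sigs, columns):
                sig.append(min(perm[r] for r in col))   # h_pi(C) = min pi(C)
        return sigs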
Min-Hashing Example
[Figure: an input matrix (shingles × documents), random permutations of its rows, and the resulting signature matrix M. E.g., under one permutation the 2nd element in the permuted order is the first to map to a 1; under another, the 4th element is the first to map to a 1.]
❖ Note: Another (equivalent) way is to store row indexes of the original matrix rather than positions in the permuted order.
The Min-Hash Property
❖ Choose a random permutation π
❖ Claim: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
❖ Why?
➢ Let X be a doc (set of shingles), and y ∈ X a shingle
➢ Then: Pr[π(y) = min(π(X))] = 1/|X|
    It is equally likely that any y ∈ X is mapped to the min element
➢ Let y be s.t. π(y) = min(π(C1 ∪ C2))
➢ Then either: π(y) = min(π(C1)) if y ∈ C1, or
               π(y) = min(π(C2)) if y ∈ C2
    (one of the two cols had to have 1 at position y)
➢ So the prob. that both are true is the prob. that y ∈ C1 ∩ C2
➢ Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)
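A quick Monte Carlo check of this claim (a sketch; the two sets below are arbitrary, with true similarity 3/6 = 0.5):

    import random

    C1, C2 = {0, 2, 3, 5}, {2, 3, 4, 5, 7}
    true_sim = len(C1 & C2) / len(C1 | C2)    # = 0.5

    rng, trials, hits = random.Random(42), 100_000, 0
    for _ in range(trials):
        perm = list(range(8))                 # a random permutation of the 8 rows
        rng.shuffle(perm)
        if min(perm[r] for r in C1) == min(perm[r] for r in C2):
            hits += 1
    print(true_sim, hits / trials)            # the two numbers should be close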
Four Types of Rows
❖ Given cols C1 and C2, rows may be classified as:
    Type   C1   C2
    A      1    1
    B      1    0
    C      0    1
    D      0    0
➢ a = # rows of type A, etc.
❖ Note: sim(C1, C2) = a/(a + b + c)
❖ Then: Pr[h(C1) = h(C2)] = sim(C1, C2)
➢ Look down the cols C1 and C2 until we see a 1
➢ If it’s a type-A row, then h(C1) = h(C2); if a type-B or type-C row, then not
Similarity for Signatures
[Figure: permutations, the input matrix (shingles × documents), and the signature matrix M]
❖ Similarities of column pairs vs. their signatures:
    Pair      1-3    2-4    1-2    3-4
    Col/Col   0.75   0.75   0      0
    Sig/Sig   0.67   1.00   0      0
Similarity for Signatures
❖ We know: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
❖ Now generalize to multiple hash functions
❖ The similarity of two signatures is the fraction of the hash functions in which they agree
❖ Note: Because of the Min-Hash property, the similarity of columns is the same as the expected similarity of their signatures
Min-Hash Signatures
❖ Pick K = 100 random permutations of the rows
❖ Think of sig(C) as a column vector
❖ sig(C)[i] = according to the i-th permutation πi, the index of the first row that has a 1 in column C
    sig(C)[i] = min(πi(C))
❖ Note: The sketch (signature) of document C is small: ~100 bytes!
❖ We achieved our goal! We “compressed” long bit vectors into short signatures
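Estimating the Jaccard similarity from two signatures then amounts to (a minimal sketch):

    def signature_similarity(sig1, sig2):
        # Fraction of hash functions on which the two columns agree;
        # an unbiased estimate of the Jaccard similarity of the columns.
        return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

Since the agreement count over K independent permutations is binomial, the standard error of this estimate shrinks like 1/√K, which is why ~100 hash functions already give a usable sketch.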
Implementation Trick
❖ Permuting rows even once is prohibitive
❖ Row hashing!
➢ Pick K = 100 hash functions ki
➢ Ordering under ki gives a random row permutation!
❖ One-pass implementation
➢ For each column C and hash-func. ki keep a “slot” for the min-hash value
➢ Initialize all sig(C)[i] = ∞
➢ Scan rows looking for 1s
    Suppose row j has 1 in column C
    Then for each ki:
    – If ki(j) < sig(C)[i], then sig(C)[i] ← ki(j)
❖ How to pick a random hash function h(x)? Universal hashing:
    ha,b(x) = ((a·x + b) mod p) mod N
    where:
    a, b … random integers
    p … prime number (p > N)
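A runnable sketch of this one-pass scheme (function and parameter names are mine; the prime 4294967311 > 2^32 is one common choice):

    import random

    INF = float("inf")

    def one_pass_minhash(rows, num_cols, n_rows, num_hash=100,
                         seed=0, p=4_294_967_311):
        # rows: iterable of (row index j, list of columns that have a 1 in row j)
        rng = random.Random(seed)
        # K universal hash functions k_i(x) = ((a*x + b) mod p) mod n_rows
        coeffs = [(rng.randrange(1, p), rng.randrange(0, p))
                  for _ in range(num_hash)]
        sig = [[INF] * num_cols for _ in range(num_hash)]
        for j, cols in rows:                          # a single scan over the rows
            for i, (a, b) in enumerate(coeffs):
                hj = ((a * j + b) % p) % n_rows       # k_i(j)
                for c in cols:
                    if hj < sig[i][c]:                # keep the minimum per slot
                        sig[i][c] = hj
        return sig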
Implementation Example
❖ 0. Initialize all sig(C)[i] = ∞
➢ (From the values below, rows 0–4 correspond to the shingles a–e of the earlier example, so S1 = {0, 3}, S2 = {2}, S3 = {1, 3, 4}, S4 = {0, 2, 3}, with h1(x) = (x + 1) mod 5 and h2(x) = (3x + 1) mod 5.)
❖ Row 0: the values of h1(0) and h2(0) are both 1, thus sig(S1)[0] = 1, sig(S1)[1] = 1, sig(S4)[0] = 1, sig(S4)[1] = 1
❖ Row 1: h1(1) = 2 and h2(1) = 4, thus sig(S3)[0] = 2, sig(S3)[1] = 4
❖ Row 2: h1(2) = 3 and h2(2) = 2, thus sig(S2)[0] = 3, sig(S2)[1] = 2; no update for S4
❖ Row 3: h1(3) = 4 and h2(3) = 0, update sig(S1)[1] = 0, sig(S3)[1] = 0, sig(S4)[1] = 0
❖ Row 4: h1(4) = 0 and h2(4) = 3, update sig(S3)[0] = 0
❖ We can estimate the Jaccard similarities of the underlying sets from this signature matrix:
➢ Signature matrix: SIM(S1, S4) = 1.0
➢ Jaccard similarity: SIM(S1, S4) = 2/3
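The same walk-through can be reproduced in a few lines (a sketch, with the sets and hash functions encoded as in the note above):

    INF = float("inf")

    cols = [{0, 3}, {2}, {1, 3, 4}, {0, 2, 3}]        # S1..S4 over rows 0..4
    hashes = [lambda x: (x + 1) % 5,                  # h1
              lambda x: (3 * x + 1) % 5]              # h2

    sig = [[INF] * len(cols) for _ in hashes]
    for j in range(5):                                # scan the rows once
        for i, h in enumerate(hashes):
            for c, s in enumerate(cols):
                if j in s and h(j) < sig[i][c]:
                    sig[i][c] = h(j)

    print(sig)  # [[1, 3, 0, 1], [0, 2, 0, 0]] -> sig(S1) == sig(S4), estimate 1.0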
Implementation Practice
❖ Input matrix (rows 1–5, columns C1 and C2):
    Row   C1   C2
    1     1    0
    2     0    1
    3     1    1
    4     1    0
    5     0    1
❖ Hash functions: h(x) = x mod 5, g(x) = (2x + 1) mod 5
❖ Scanning the rows, the signature slots evolve as:
    Row    h      g      sig-h(C1, C2)    sig-g(C1, C2)
    init                 ∞, ∞             ∞, ∞
    1      1      3      1, ∞             3, ∞
    2      2      0      1, 2             3, 0
    3      3      2      1, 2             2, 0
    4      4      4      1, 2             2, 0
    5      0      1      1, 0             2, 0
References
❖ Chapter 3 of Mining of Massive Datasets.
End of Chapter