www.cardiff.ac.uk/medic/irg-clinicalepidemiology
Similarity and distance
Copyright By PowCoder代写 加微信 powcoder
Information modelling
& database systems
a fundamental data mining problem is to find “similar” items in the data
initially, we will focus on a particular notion of similarity based on the relative size of intersection between the sets compared
this notion of similarity is called Jaccard similarity
we will then examine some of the uses of finding similar sets, e.g.
finding textually similar documents
collaborative filtering
Similarity
Jaccard similarity
Jaccard similarity of sets A and B is defined as the ratio of the size of the intersection of A and B to the size of their union, i.e.
sim(A, B) = |A ∩ B |/|A ∪ B |
e.g. sim(A, B) = 3 / (2 + 3 + 3)
note that sim(A, B) [0, 1]
sim(A, B) = 0 if A ∩ B =
sim(A, B) = 1 if A = B
Document similarity
an important class of problems that Jaccard similarity addresses well is that of finding textually similar documents in a large corpus such as the Web
documents are represented as “bags” of words and we compare documents by measuring the overlap of their bag representations
applications:
plagiarism
mirror pages
news aggregation
Plagiarism
finding plagiarised documents tests our ability to find textual similarity
the plagiariser may …
copy some parts of a document
alter a few words
alter the order in which sentences appear
yet the resulting document may still contain >50% of the original material
comparing documents character by character will not detect sophisticated plagiarism, but Jaccard similarity can
Mirror pages
it is common for an important or popular Web site to be duplicated at a number of hosts to improve its availability
these mirror sites are quite similar, but are rarely identical
e.g. they might contain information associated with their particular host
it is important to be able to detect mirror pages, because search engines should avoid showing nearly identical pages within the first page of results
News aggregation
the same news gets reported by different publishers
each articles is somewhat different
news aggregators, such as Google News, try to find all versions in order to group them together
this requires finding Web pages that are textually similar, although not identical
Collaborative filtering
the method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating)
the underlying assumption of collaborative filtering is that those who agreed in the past tend to agree again in the future, e.g.
online purchases
movie ratings
Online purchases
Amazon has millions of customers and sells millions of products
its database records which products have been bought by which customers
we can say that two customers are similar if their sets of purchased products have a high Jaccard similarity
likewise, two products that have sets of purchasers with high Jaccard similarity could be considered similar
Movie ratings
NetFlix records which movies each of its customers watched together with their ratings
movies are similar if they were watched or rated highly by many of the same customers
customers are similar if they watched or rated highly many of the same movies
examples …
Distance vs. similarity
similarity is measure of how close to each other two instances are
the closer the instances are to each other,
the larger is the similarity value
distance is also a measure of how close to each other two instances are
the closer the instances are to each other,
the smaller is the distance value
Distance vs. similarity
typically, given a similarity measure, one can “revert”
it to serve as the distance measure and vice versa
conversions may differ, e.g. if d is a distance measure, then one can use:
if sim is the similarity measure that ranges between 0 and 1, then the corresponding distance measure can be defined as:
Distance axioms
formally, distance is a measure that satisfies the following conditions:
d(x, y) ≥ 0 non–negativity
d(x, y) = 0 iff x = y coincidence
d(x, y) = d(y, x) symmetry
d(x, z) ≤ d(x, y) + d(y, z) triangle inequality
these conditions express
intuitive notions about
the concept of distance
Euclidian distances
the most familiar distance measure is the one we normally think of as “distance”
an n–dimensional Euclidean space is one where
points are vectors of n real numbers
e.g. a point in a 3D space is represented as (x, y, z)
Pythagorean theorem
Exercise: d(A, B) = ???
using Pythagorean theorem:
d(A,B)2 = (x1 – x2)2 + (y1 – y2)2
d(A,B) = (x1 – x2)2 + (y1 – y2)2
Euclidian distance
Euclidian distance in an n–dimensional space between X = (x1, x2, … , xn) and Y = (y1, y2, … , yn):
square the distance in each dimension
sum the squares
take the square root
Cosine similarity
Cosine similarity
the cosine similarity between two vectors in a Euclidean space is a measure that calculates the
cosine of the angle between them
this metric is a measurement of orientation and not magnitude
Cosine similarity
we can calculate the cosine similarity as follows:
cosine similarity ranges from –1 to 1
Value Meaning
−1 exactly opposite
1 exactly the same
0 orthogonal
in between intermediate similarity
Cosine distance
cosine distance is a term often used for the measure defined by the following formula:
d(A,B) = 1 – sim(A,B)
it is important to note that this is not a proper distance measure!
it does not satisfy the triangle inequality property
it violates the coincidence property
Edit distance
Edit distance
edit distance has been widely applied in natural language processing for approximate string matching, where:
the distance between identical strings is equal to zero
the distance increases as the string s get more dissimilar with respect to:
the symbols they contain, and
the order in which they appear
Edit distance
informally, edit distance is defined as the minimal number (or cost) of changes needed to transform one string into the other
these changes may include the following edit operations:
insertion of a single character
deletion of a single character
replacement (substitution) of two corresponding characters in the two strings being compared
transposition (reversal or swap) of two adjacent characters in one of the strings
Edit operations
insertion … ac …
… abc …
deletion … abc …
… ac …
replacement … abc …
… adc …
transposition … abc …
… acb …
Applications
successfully utilised in NLP applications to deal with:
alternate spellings
misspellings
the use of white spaces as means of formatting
UPPER– and lower–case letters
other orthographic variations
e.g. 80% of the spelling mistakes can be identified and corrected automatically by considering a single omission, insertion, substitution or reversal
Applications
apart from NLP, the most popular application area of edit distance is molecular biology
edit distance is used to compare DNA sequences in order to infer information about:
common ancestry
functional equivalence
possible mutations
Notation and terminology
let x = x1 … xm and y = y1 … yn be two strings of lengths m and n respectively, where 0 < m n
a sequence of edit operations transforming x into y is referred to as an alignment (also edit sequence, edit script or edit path)
the cost of an alignment is calculated by summing the costs of individual edit operations it is comprised of
when all edit operations have the same costs, then the cost of an alignment is equivalent to the total number of operations in the alignment
Edit distance
formally, the value of edit distance between x and y, ed(x, y), corresponds to the minimal alignment cost over all possible alignments for x and y
when all edit operations incur the same cost, edit distance is referred to as simple edit distance
simple edit distance is a distance measure, i.e. ed(x, y) ≥ 0, ed(x, y) = 0 iff x = y, ed(x, y) = ed(y, x), ed(x, z) ≤ ed(x, y) + ed(y, z)
general edit distance permits different costs for different operations or even symbols
the choice of the operation costs influences the "meaning" of the corresponding alignments, and thus they depend on a specific application
depending on the types of edit operations allowed, a number of specialised variants of edit distance have been identified
Hamming distance allows only replacement
longest common subsequence problem allows only insertion and deletion, both at the same cost
Levenshtein distance allows insertion, deletion and replacement of individual characters, where individual edit operations may have different costs
Damerau distance extends Levenshtein distance by permitting the transposition of two adjacent characters
Levenshtein distance
well suited for a number of practical applications
most of the existing algorithms have been developed for the simple Levenshtein distance
many of them can easily be adapted for:
general Levenshtein distance, where different costs are used for different operations
Damerau distance, where transposition is an allowed edit operation
transposition is important in some applications such as text searching, where transpositions are typical typing errors
note that the transposition could be simulated by using an insertion followed by a deletion, but the total cost would be different in that case!
Dynamic programming
a class of algorithms based on the idea of:
breaking a problem down into sub–problems so that optimal solutions can be obtained for sub–problems
combining sub–solutions to produce an optimal solution to the overall problem
the same idea is applied incrementally to sub–problems
by saving and re–using the results obtained for the
sub–problems, unnecessary re–computation is avoided for recurring sub–problems, thus facilitating the computation of the overall solution
Wagner–Fischer algorithm
a dynamic programming approach to the computation of Levenshtein distance, which relies on the following reasoning:
at each step of an alignment, i.e. after aligning two leading substrings of the two strings, there are only three possibilities:
delete the next symbol from the first string (delete)
delete the next symbol from the second string (insert)
match the next symbol in the first string to the first symbol in the second string (exact match or replace otherwise)
Wagner–Fischer algorithm
for the cost C(i, j) of aligning the leading substrings
x1 ... xi and y1 ... yj, the cost of their alignment is calculated as follows:
1 i m, j = 0: C(i, 0) = C(i – 1, 0) + IC(xi)
i = 0, 1 j n: C(0, j) = C(0, j – 1) + DC(yj)
C(i – 1, j) + IC(xi)
1 i m, 1 j n: C(i, j) = min C(i, j – 1) + DC(yj)
C(i – 1, j – 1) + RC(xi, yj)
where C(0, 0) = 0 and IC, DC and RC are the costs of insert, delete and replace operations
Wagner–Fischer algorithm
if the cost values are represented by a cost matrix, then the matrix needs to be filled so that the values needed for the calculation of C(i, j) are available:
C(i – 1, j – 1)
C(i – 1, j)
C(i, j – 1)
it suffices to fill the cost matrix row–wise left–to–right, column–wise top–to–bottom, or diagonally upper–left to lower–right
edit distance between x and y is then obtained as C(m, n)
upper–left
Wagner–Fischer algorithm
let x[1..m], y[1..n] be two arrays of char
let ed[0..m, 0..n] be a 2D array of int
// distance to an empty string
for i in [0..m] ed[i, 0] = i;
for j in [0..n] ed[0, j] = j;
for j in [1..n]
for i in [1..m]
if x[i] = y[j] // match, so no operation required
then ed[i, j] = ed[i–1, j–1];
else ed[i, j] = minimum of (
ed[i–1, j] + 1, // delete
ed[i, j–1] + 1, // insert
ed[i–1, j–1] + 1 // replace
return ed[m,n];
a cost matrix calculated for the strings weather and sweeter, providing the value 3 for their edit distance
extracting an optimal alignment from the cost matrix:
insert from y into x
replace or match
delete from x
alignment:
– w e a t h e r
s w e e t – e r
String similarity
two strings are regarded similar if their edit distance is lower than a certain threshold: ed(x, y) t x y
an absolute threshold t does not take into account the lengths m and n (m n) of the strings x and y
the same threshold should not be used for very long strings and very short ones, e.g.
d o p p e l g ä – n g e r h o t
d o p p e l g a e n g e r d o g
a relative threshold r = t × m proportional to the length of the shorter string is suggested instead
otherwise, short strings would be erroneously regarded similar to non–similar long strings
/docProps/thumbnail.jpeg
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com