Data Mining and Machine Learning
Vector Representation of Documents Peter Jančovič
Slide 1
Objectives
To explain vector representation of documents
To understand cosine similarity between vector representations of documents
Slide 2
Vector Notation for Documents
Suppose that we have a set of documents $D = \{d_1, d_2, \dots, d_N\}$
– think of this as the corpus for IR
Suppose that the number of different words in the whole corpus is $V$ (the vocabulary size)
Now suppose a document $d$ in $D$ contains $M$ different terms: $\{t_{i(1)}, t_{i(2)}, \dots, t_{i(M)}\}$
Finally, suppose term $t_{i(m)}$ occurs $f_{i(m),d}$ times in $d$
Slide 3
Vector Notation
The vector representation vec(d) of d is the V-dimensional vector:
$$\mathrm{vec}(d) = (0, \dots, 0, w_{i(1),d}, 0, \dots, 0, w_{i(2),d}, 0, \dots, 0, w_{i(M),d}, 0, \dots, 0)$$
where $w_{i(m),d}$ is in the $i(m)$th place, for $m = 1, \dots, M$
Notice that $w_{i(1),d} = f_{i(1),d} \times \mathrm{IDF}(i(1))$ is the weighting from text IR, i.e. the term frequency times the inverse document frequency
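The construction above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `vec`, the toy vocabulary, and the weights are all invented for the example.

```python
def vec(weights, vocabulary):
    """Return the dense V-dimensional vector for one document.

    weights    -- dict mapping each term in the document to its weight w_{t,d}
    vocabulary -- list of the V distinct corpus terms, in a fixed order
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    v = [0.0] * len(vocabulary)      # terms absent from the document stay at 0
    for term, w in weights.items():
        v[index[term]] = w           # w_{i(m),d} goes in the i(m)th place
    return v


# Toy data (hypothetical): V = 4, and the document contains M = 2 terms.
vocabulary = ["apple", "banana", "cherry", "date"]
example = vec({"banana": 0.5, "date": 2.0}, vocabulary)
print(example)
```

Most entries are zero whenever M is much smaller than V, which is why practical systems usually store only the non-zero weights.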
Slide 4
Uniqueness
Is the mapping between documents and vectors one-to-one?
In other words:
– if d1, d2 are documents, is it true that
vec(d1) = vec(d2) if and only if d1 = d2?
If λ is a scalar and vec(d1) = λvec(d2) what does this
tell you about d1 and d2?
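One way to explore these questions is to experiment with small documents. The sketch below makes a simplifying assumption: it uses raw term counts rather than TF-IDF weights, and the helper `count_vector` and the three-word vocabulary are invented for illustration.

```python
from collections import Counter


def count_vector(tokens, vocabulary):
    """Vector of raw term counts, one component per vocabulary term."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]


vocabulary = ["cat", "chase", "dog"]

# Two different documents that use the same words in a different order.
a = count_vector("dog chase cat".split(), vocabulary)
b = count_vector("cat chase dog".split(), vocabulary)
print(a == b)

# A document that repeats another document twice.
doubled = count_vector("dog chase cat dog chase cat".split(), vocabulary)
print(doubled == [2 * x for x in a])
```

Under this simplification the vector records how often each term occurs but not where it occurs, which is worth keeping in mind when answering the questions above.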
Slide 5
Example
d1 = the cat sat on the cat’s mat → cat sat cat mat
d2 = the dog chased the cat → dog chase cat
d3 = the mouse stayed at home → mouse stay home
Vocabulary:
– cat, chase, dog, home, mat, mouse, sat, stay
To calculate the vector representations of these
documents first calculate the TF-IDF weights
Slide 6
Example (continued)
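(d1, d2, d3: number of times the term occurs in each document; Nd: number of documents containing the term)
term    d1  d2  d3  Nd  IDF   w(t,d1)  w(t,d2)  w(t,d3)
cat     2   1   0   2   0.41  0.81     0.41     0
chase   0   1   0   1   1.1   0        1.1      0
dog     0   1   0   1   1.1   0        1.1      0
home    0   0   1   1   1.1   0        0        1.1
mat     1   0   0   1   1.1   1.1      0        0
mouse   0   0   1   1   1.1   0        0        1.1
sat     1   0   0   1   1.1   1.1      0        0
stay    0   0   1   1   1.1   0        0        1.1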
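The weights in this example can be reproduced with a short Python sketch. It assumes IDF(t) = ln(N / Nd) with the natural logarithm; the slides do not state the formula here, but this choice matches the rounded values 0.41 and 1.1. All variable names are illustrative.

```python
import math
from collections import Counter

# The three example documents after stop-word removal and stemming.
docs = {
    "d1": ["cat", "sat", "cat", "mat"],
    "d2": ["dog", "chase", "cat"],
    "d3": ["mouse", "stay", "home"],
}
vocabulary = sorted({t for tokens in docs.values() for t in tokens})
N = len(docs)

counts = {name: Counter(tokens) for name, tokens in docs.items()}
# Nd: number of documents that contain each term.
doc_freq = {t: sum(1 for c in counts.values() if c[t] > 0) for t in vocabulary}
# Assumption: natural-log IDF, chosen because it matches the example values.
idf = {t: math.log(N / doc_freq[t]) for t in vocabulary}
# TF-IDF weight w(t,d) = term frequency times inverse document frequency.
weights = {
    name: {t: c[t] * idf[t] for t in vocabulary} for name, c in counts.items()
}

for t in vocabulary:
    row = [counts[d][t] for d in docs] + [doc_freq[t], round(idf[t], 2)]
    row += [round(weights[d][t], 2) for d in docs]
    print(t, *row)
```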
Slide 7
Example (continued)
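With components ordered as in the vocabulary (cat, chase, dog, home, mat, mouse, sat, stay):
$$\mathrm{vec}(d_1) = (0.81, 0, 0, 0, 1.1, 0, 1.1, 0)$$
$$\mathrm{vec}(d_2) = (0.41, 1.1, 1.1, 0, 0, 0, 0, 0)$$
$$\mathrm{vec}(d_3) = (0, 0, 0, 1.1, 0, 1.1, 0, 1.1)$$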
Slide 8
Document length revisited
Recall that the length of a vector $x = (x_1, \dots, x_N)$ is given by:
$$\|x\| = \sqrt{x_1^2 + x_2^2 + \dots + x_N^2}$$
Slide 9
Document length
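In the case of a ‘document vector’
$$\mathrm{vec}(d) = (0, \dots, 0, w_{i(1),d}, 0, \dots, 0, w_{i(2),d}, 0, \dots, 0, w_{i(M),d}, 0, \dots, 0)$$
$$\|\mathrm{vec}(d)\| = \sqrt{w_{i(1),d}^2 + w_{i(2),d}^2 + \dots + w_{i(M),d}^2} = \|d\|$$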
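As a quick check of the formula, the sketch below computes the length of vec(d1) from the earlier example, using the rounded weights 0.81 and 1.1. The function name `length` is an illustrative choice.

```python
import math


def length(v):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))


# vec(d1) from the example, in vocabulary order
# (cat, chase, dog, home, mat, mouse, sat, stay), with rounded weights.
vec_d1 = [0.81, 0, 0, 0, 1.1, 0, 1.1, 0]
len_d1 = length(vec_d1)   # sqrt(0.81^2 + 1.1^2 + 1.1^2)
print(round(len_d1, 2))
```

The zero components contribute nothing, so only the M terms that occur in the document affect its length.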
Slide 10
Document Similarity
Suppose d is a document and q is a query
– If d and q contain the same words in the same proportions, then vec(d) and vec(q) will point in the same direction
– If d and q contain different words, then vec(d) and vec(q) will point in different directions
– Intuitively, the greater the angle between vec(d) and vec(q), the less similar the document d is to the query q
Slide 11
Cosine similarity
Define the Cosine Similarity between document d and query q by: $\mathrm{CSim}(q,d) = \cos\theta$
where $\theta$ is the angle between vec(q) and vec(d)
Similarly, define the Cosine Similarity between documents d1 and d2 by: $\mathrm{CSim}(d_1,d_2) = \cos\theta$
where $\theta$ is the angle between vec(d1) and vec(d2)
Slide 12
Cosine Similarity & Similarity
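Let $u = (x_1, y_1)$ and $v = (x_2, y_2)$ be vectors in 2 dimensions, then
$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\|u\|\,\|v\|} = \frac{u \cdot v}{\|u\|\,\|v\|}$$
where $\theta$ is the angle between $u$ and $v$
[Figure: vectors u and v]
In fact, this result holds for vectors in any N-dimensional space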
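A minimal Python sketch of this formula for vectors of any dimension follows. The function name `cosine` and the error handling are illustrative choices, not part of the slides.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))       # u . v
    norm_u = math.sqrt(sum(a * a for a in u))    # ||u||
    norm_v = math.sqrt(sum(b * b for b in v))    # ||v||
    if norm_u == 0 or norm_v == 0:
        raise ValueError("the angle is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine([1, 0], [0, 1]))         # perpendicular vectors
print(cosine([1, 2, 3], [2, 4, 6]))   # same direction, different length
```

Because the dot product is divided by both lengths, scaling either vector by a positive constant leaves the result unchanged.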
Slide 13
Cosine Similarity & Similarity
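Hence, if q is a query, d is a document, and $\theta$ is the angle between vec(q) and vec(d), then:
$$\mathrm{CSim}(q,d) = \cos\theta = \frac{\mathrm{vec}(q) \cdot \mathrm{vec}(d)}{\|q\|\,\|d\|} = \frac{\sum_{t \in q \cap d} w_{t,q}\, w_{t,d}}{\|q\|\,\|d\|} = \frac{\mathrm{Sim}(q,d)}{\|q\|\,\|d\|}$$
Here $\|q\| = \|\mathrm{vec}(q)\|$ and $\|d\| = \|\mathrm{vec}(d)\|$. CSim(q,d) is the cosine similarity, and the numerator $\mathrm{Sim}(q,d) = \sum_{t \in q \cap d} w_{t,q}\, w_{t,d}$ is the similarity
The sum runs only over terms that occur in both q and d, because every other product $w_{t,q}\, w_{t,d}$ is zero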
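The same calculation can be sketched for the three example documents, storing only the non-zero weights. The function name `csim` is illustrative, and the weights assume the natural-log IDF that matches the example values.

```python
import math


def csim(wq, wd):
    """Cosine similarity of two non-empty sparse weight dicts (term -> weight)."""
    shared = wq.keys() & wd.keys()                 # terms in both q and d
    sim = sum(wq[t] * wd[t] for t in shared)       # Sim(q, d)
    norm_q = math.sqrt(sum(w * w for w in wq.values()))
    norm_d = math.sqrt(sum(w * w for w in wd.values()))
    return sim / (norm_q * norm_d)


# Assumed IDF values for the three-document example: ln(3/2) and ln(3/1).
idf_cat = math.log(3 / 2)
idf_one = math.log(3)
d1 = {"cat": 2 * idf_cat, "mat": idf_one, "sat": idf_one}
d2 = {"cat": idf_cat, "chase": idf_one, "dog": idf_one}
d3 = {"home": idf_one, "mouse": idf_one, "stay": idf_one}

print(round(csim(d1, d2), 3))   # d1 and d2 share only "cat"
print(csim(d1, d3))             # no shared terms
```

Documents with no terms in common have a similarity of zero, whatever their individual weights.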
Slide 14
Summary
Vector space representation of documents
Cosine similarity between vector representations of documents
Slide 15