
Data Mining and Machine Learning
Vector Representation of Documents
Peter Jančovič
Slide 1

Objectives
• To explain vector representation of documents
• To understand cosine distance between vector representations of documents
Slide 2

Vector Notation for Documents
• Suppose that we have a set of documents D = {d1, d2, …, dN}
– think of this as the corpus for IR
• Suppose that the number of different words in the whole corpus is V (the vocabulary size)
• Now suppose a document d in D contains M different terms: {ti(1), ti(2), …, ti(M)}
• Finally, suppose term ti(m) occurs fi(m) times in d
Slide 3

Vector Notation
• The vector representation vec(d) of d is the V-dimensional vector:
vec(d) = ( 0, …, 0, wi(1),d, 0, …, 0, wi(2),d, 0, ……, wi(M),d, 0, …, 0 )
where wi(m),d sits in the i(m)th place of the vector
• Notice that each wi(m),d is the weighting from text IR – i.e. the term frequency times the inverse document frequency (see the sketch below):
wi(1),d = fi(1),d × IDF(i(1))
Slide 4
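A minimal sketch of this construction in Python, assuming the term frequencies and IDF values are already available; the names vocabulary, term_freqs and idf are illustrative, not from the slides.

```python
def document_vector(vocabulary, term_freqs, idf):
    """Build the V-dimensional TF-IDF vector for one document.

    vocabulary : list of the V distinct terms in the corpus
    term_freqs : dict mapping each term of the document to its count f(t, d)
    idf        : dict mapping each term to its inverse document frequency
    """
    vec = [0.0] * len(vocabulary)                  # zeros everywhere ...
    for i, term in enumerate(vocabulary):
        if term in term_freqs:                     # ... except the i(m)th places,
            vec[i] = term_freqs[term] * idf[term]  # which hold w = f * IDF
    return vec
```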

Uniqueness
• Is the mapping between documents and vectors one-to-one?
• In other words:
– if d1, d2 are documents, is it true that vec(d1) = vec(d2) if and only if d1 = d2?
• If λ is a scalar and vec(d1) = λ vec(d2), what does this tell you about d1 and d2?
Slide 5

Example
• d1 = the cat sat on the cat’s mat → cat sat cat mat
• d2 = the dog chased the cat → dog chase cat
• d3 = the mouse stayed at home → mouse stay home
• Vocabulary:
– cat, chase, dog, home, mat, mouse, sat, stay
• To calculate the vector representations of these documents, first calculate the TF-IDF weights (the vocabulary-building step is sketched below)
Slide 6
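A minimal sketch of this step in Python, assuming stop-word removal and stemming have already produced the term sequences shown on the slide; the names docs and vocabulary are illustrative.

```python
# Documents after stop-word removal and stemming, as on the slide
docs = {
    "d1": ["cat", "sat", "cat", "mat"],
    "d2": ["dog", "chase", "cat"],
    "d3": ["mouse", "stay", "home"],
}

# Vocabulary = sorted set of all distinct terms in the corpus
vocabulary = sorted({term for terms in docs.values() for term in terms})
print(vocabulary)   # ['cat', 'chase', 'dog', 'home', 'mat', 'mouse', 'sat', 'stay']
```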

Example (continued)
Term frequencies f(t,d), document frequencies Nd, IDF values and TF-IDF weights w(t,d):

        f(t,d1)  f(t,d2)  f(t,d3)   Nd    IDF    w(t,d1)  w(t,d2)  w(t,d3)
cat        2        1        0       2    0.41     0.81     0.41     0
chase      0        1        0       1    1.1      0        1.1      0
dog        0        1        0       1    1.1      0        1.1      0
home       0        0        1       1    1.1      0        0        1.1
mat        1        0        0       1    1.1      1.1      0        0
mouse      0        0        1       1    1.1      0        0        1.1
sat        1        0        0       1    1.1      1.1      0        0
stay       0        0        1       1    1.1      0        0        1.1
Slide 7
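The table's values are consistent with natural logarithms in the IDF, i.e. IDF(t) = ln(N / Nd(t)) with N = 3 documents, so IDF(cat) = ln(3/2) ≈ 0.41 and IDF(t) = ln 3 ≈ 1.1 for every other term. A minimal sketch in Python that reproduces the weights (the log base is an assumption inferred from the numbers, not stated on the slide):

```python
import math
from collections import Counter

docs = {"d1": ["cat", "sat", "cat", "mat"],
        "d2": ["dog", "chase", "cat"],
        "d3": ["mouse", "stay", "home"]}
vocabulary = sorted({t for terms in docs.values() for t in terms})

N = len(docs)                                              # 3 documents
# Document frequency Nd(t): how many documents contain term t
doc_freq = {t: sum(t in terms for terms in docs.values()) for t in vocabulary}
# IDF(t) = ln(N / Nd(t)) -- natural log assumed, inferred from the table's numbers
idf = {t: math.log(N / doc_freq[t]) for t in vocabulary}

# TF-IDF weight w(t, d) = f(t, d) * IDF(t)
weights = {d: {t: Counter(terms)[t] * idf[t] for t in vocabulary}
           for d, terms in docs.items()}

print(round(idf["cat"], 2))              # 0.41
print(round(weights["d1"]["cat"], 2))    # 0.81, as in the table
print(round(weights["d3"]["home"], 2))   # 1.1
```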

Example (continued)
vec(d1) = (0.81, 0, 0, 0, 1.1, 0, 1.1, 0)
vec(d2) = (0.41, 1.1, 1.1, 0, 0, 0, 0, 0)
vec(d3) = (0, 0, 0, 1.1, 0, 1.1, 0, 1.1)
(components ordered as the vocabulary: cat, chase, dog, home, mat, mouse, sat, stay)
Slide 8

Document length revisited
• Recall that the length of a vector x = (x1, x2, …, xN) is given by:
||x|| = √( x1² + x2² + … + xN² )
Slide 9

Document length
• In the case of a ‘document vector’
vec(d) = ( 0, …, 0, wi(1),d, 0, …, 0, wi(2),d, 0, ……, wi(M),d, 0, …, 0 )
its length is
||vec(d)|| = √( wi(1),d² + wi(2),d² + … + wi(M),d² ) = |d|
(a numeric check on the example vectors follows below)
Slide 10
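A minimal sketch of this calculation applied to vec(d1) from Slide 8; the helper name vector_length is illustrative.

```python
import math

def vector_length(vec):
    """Euclidean length ||vec|| = square root of the sum of squared components."""
    return math.sqrt(sum(w * w for w in vec))

# vec(d1) from Slide 8 (order: cat, chase, dog, home, mat, mouse, sat, stay)
vec_d1 = [0.81, 0, 0, 0, 1.1, 0, 1.1, 0]
print(round(vector_length(vec_d1), 2))   # approximately 1.75
```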

Document Similarity
• Suppose d is a document and q is a query
– If d and q contain the same words in the same proportions, then vec(d) and vec(q) will point in the same direction
– If d and q contain different words, then vec(d) and vec(q) will point in different directions
– Intuitively, the greater the angle between vec(d) and vec(q), the less similar the document d is to the query q
Slide 11

Cosine similarity
• Define the Cosine Similarity between document d and query q by:
CSim(q,d) = cos θ
where θ is the angle between vec(q) and vec(d)
• Similarly, define the Cosine Similarity between documents d1 and d2 by:
CSim(d1,d2) = cos θ
where θ is the angle between vec(d1) and vec(d2)
Slide 12

Cosine Similarity & Similarity
• Let u = (x1, y1) and v = (x2, y2) be vectors in 2 dimensions; then
cos θ = ( x1x2 + y1y2 ) / ( ||u|| ||v|| ) = u·v / ( ||u|| ||v|| )
where θ is the angle between u and v
[Figure: the two vectors u and v separated by the angle θ]
• In fact, this result holds for vectors in any N-dimensional space
Slide 13

Cosine Similarity & Similarity
• Hence, if q is a query, d is a document, and θ is the angle between vec(q) and vec(d), then:
CSim(q,d) = cos θ = vec(q)·vec(d) / ( |q| |d| ) = ( Σ t∈q∩d  wt,q × wt,d ) / ( |q| |d| ) = Sim(q,d)
i.e. the cosine similarity coincides with the similarity measure Sim(q,d) from text IR (a numeric check on the example vectors follows below)
Slide 14
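A minimal sketch of this calculation in Python, applied to the example document vectors from Slide 8; the function name cosine_similarity is illustrative.

```python
import math

def cosine_similarity(u, v):
    """CSim(u, v) = (sum over t of w_t,u * w_t,v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Example vectors from Slide 8 (order: cat, chase, dog, home, mat, mouse, sat, stay)
vec_d1 = [0.81, 0, 0, 0, 1.1, 0, 1.1, 0]
vec_d2 = [0.41, 1.1, 1.1, 0, 0, 0, 0, 0]
vec_d3 = [0, 0, 0, 1.1, 0, 1.1, 0, 1.1]

print(round(cosine_similarity(vec_d1, vec_d2), 2))  # share only "cat": about 0.12
print(cosine_similarity(vec_d1, vec_d3))            # no shared terms: 0.0
```

Only terms that occur in both vectors contribute to the numerator, which is why d1 and d3, having no terms in common, get a cosine similarity of zero.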

Summary
• Vector space representation of documents
• Cosine distance between vector representations of documents
Slide 15