
Data Mining and Machine Learning
Latent Semantic Analysis (LSA)
Peter Jančovič

Objectives
 To understand, intuitively, how Latent Semantic Analysis (LSA) can discover latent topics in a corpus

Vector Notation
 The vector representation vec(d) of d is the V-dimensional vector

   vec(d) = (0, …, 0, w_{i(1),d}, 0, …, 0, w_{i(2),d}, 0, …, 0, w_{i(M),d}, 0, …, 0)

  where the weight w_{i(m),d} appears in the i(m)th place
 Notice that this is the weighting from text IR – i.e. the term frequency times the inverse document frequency:

   w_{i(1),d} = f_{i(1),d} × IDF(i(1))
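As a concrete illustration, here is a minimal MATLAB sketch of this weighting for a single document; the vocabulary size, term counts and document frequencies are invented purely for the example, and the common IDF variant log(N/df) is assumed (other smoothed variants are also in use):

   % Hypothetical counts for a toy vocabulary of 5 terms
   f   = [3 0 1 0 2];     % f(i) = frequency of term i in document d
   df  = [10 4 2 7 5];    % df(i) = number of documents containing term i
   N   = 20;              % total number of documents in the corpus
   IDF = log(N ./ df);    % inverse document frequency of each term
   w   = f .* IDF;        % vec(d): zero wherever a term does not occur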

Latent Semantic Analysis (LSA)
 Suppose we have a real corpus with a large number of documents
 For each document d the dimension of the vector vec(d) will typically be several (tens of) thousands
 Let’s focus on just 2 of these dimensions, corresponding, say, to the words ‘sea’ and ‘beach’
 Intuitively, often, when a document d includes ‘sea’ it will also include ‘beach’

LSA continued
 Equivalently, if vec(d) has a non-zero entry in the ‘sea’ component, it will often have a non-zero entry in the ‘beach’ component
[Figure: document vectors plotted against their 'sea' and 'beach' components – the documents "about the seaside" form a cluster in which both components are non-zero]

Latent Semantic Classes
 If we can detect this type of structure, then we can discover relationships between words automatically, from data
 In the example we have found an equivalence set of terms, including 'beach' and 'sea', corresponding to the topic 'about the seaside'
[Figure: the same 'sea'–'beach' plot, with the cluster of documents labelled "about the seaside"]

Finding Latent Semantic Classes
 LSA involves some advanced linear algebra – the description here is just an outline
 First construct the ‘word-document’ matrix A
 Then decompose A using Singular Value Decomposition (SVD)
– SVD is a standard technique from matrix algebra
– Packages such as MATLAB have SVD functions:
>> [U,S,V] = svd(A)

Singular Value Decomposition
 Remember eigenvector decomposition?
 An eigenvector of a square matrix A is a vector e such that Ae = λe, where λ is a scalar
 For certain matrices A we can write A = U D U^T, where U is an orthogonal matrix (rotation) and D is diagonal
– The elements of D are the eigenvalues
– The columns of U are the eigenvectors
 You can think of SVD as a more general version of eigenvector decomposition, which works for general matrices
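To see the connection concretely, here is a minimal MATLAB sketch; the matrices are arbitrary examples chosen for illustration:

   B = [2 1; 1 3];                 % a symmetric square matrix
   [E, D] = eig(B);                % eigendecomposition: B = E*D*E'
   [U, S, V] = svd(B);             % SVD: B = U*S*V'
   sort(abs(diag(D)), 'descend')   % for symmetric B these magnitudes...
   diag(S)                         % ...equal the singular values

   C = rand(3, 5);                 % SVD also handles rectangular matrices,
   [U2, S2, V2] = svd(C);          % where eigendecomposition is undefined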

Word-Document Matrix
 The word-document matrix A is an N × V matrix whose nth row is vec(dn); the rows are indexed by documents and the columns by terms:

   A = ( w_{t1,d1}  w_{t2,d1}  ⋯  w_{tm,d1}  ⋯  w_{tV,d1}
         w_{t1,d2}  w_{t2,d2}  ⋯  w_{tm,d2}  ⋯  w_{tV,d2}
            ⋮          ⋮             ⋮              ⋮
         w_{t1,dn}  w_{t2,dn}  ⋯  w_{tm,dn}  ⋯  w_{tV,dn}
            ⋮          ⋮             ⋮              ⋮
         w_{t1,dN}  w_{t2,dN}  ⋯  w_{tm,dN}  ⋯  w_{tV,dN} )

  where w_{tm,dn} is the weighting for term tm in document dn
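The sketch below builds a small matrix of exactly this form in MATLAB; the raw term counts are invented for illustration:

   % Hypothetical raw counts F: N = 4 documents, V = 6 terms
   F = [2 0 1 0 0 3;
        0 1 0 2 0 0;
        1 0 0 0 4 1;
        0 2 1 0 0 0];
   [N, V] = size(F);
   df  = sum(F > 0, 1);           % document frequency of each term
   IDF = log(N ./ df);            % inverse document frequency
   A   = F .* repmat(IDF, N, 1);  % TF-IDF weights: row n is vec(dn)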

Singular Value Decomposition (SVD)
AUSVT 𝑁
𝑢11 𝑢12 .𝑢1𝑁
N=number of docs, V=vocabulary size
𝑉
. 𝑣 𝑣 .𝑣 𝑣
0 … 0 0 12 22 𝑖2 . 𝑉2
Direction of most significant correlation
𝑣11 𝑣21 . 𝑣𝑖1 . 𝑣𝑉1
𝑁
𝑢21 𝑢22 . 𝑢2𝑁
𝐴=. . .. 0𝑠2…0.0⋮ ⋮ ⋮⋮.⋮⋮. 𝑉
𝑠1
….⋮⋮…⋮⋮⋮⋮⋮⋮⋮⋮
𝑉
. . .. 00…𝑠𝑁.0⋮ ⋮ ⋮ .
𝑢𝑁1 𝑢𝑁2 . 𝑢𝑁𝑁 𝑣1𝑉 𝑣2𝑉 . 𝑣𝑖𝑉 𝑣𝑉𝑉
‘Strength’ of most significant correlation
Slide 10
Data Mining and Machine Learning
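Continuing the toy example above, the decomposition and the orthogonality and ordering properties described below can be checked directly in MATLAB:

   [U, S, V] = svd(A);            % A is the N x V matrix built above
   norm(A - U*S*V')               % ~0: the factorisation reconstructs A
   norm(U'*U - eye(size(U,1)))    % ~0: U is orthogonal
   norm(V'*V - eye(size(V,1)))    % ~0: V is orthogonal
   diag(S)'                       % singular values, s1 >= s2 >= ...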

Interpretation of LSA
 The matrices U and V are orthogonal matrices
 Their entries are real numbers
 U is N × N (N is the number of documents) and V is V × V (V is the vocabulary size)
 They satisfy U U^T = I = U^T U and V V^T = I = V^T V
 The singular values s1, …, sN are non-negative and satisfy s1 ≥ s2 ≥ … ≥ sN
 The off-diagonal entries of S are all zero

Interpretation of LSA (continued)
 Focussing on V:
 The columns of V, {v1,…,vV} are unit vectors and orthogonal to each other
 They form a new orthonormal basis (coordinate system) for the document vector space
 Each column of V is a document vector corresponding to a semantic class (topic) in the corpus
 The importance of the topic corresponding to vn is indicated by the size of the singular value sn

Interpretation of LSA (continued)
 Since vn is a document vector, its jth value corresponds to the TF-IDF weight of the jth term in the vocabulary for the corresponding document/topic
 This can be used to interpret the topic corresponding to vn – a large value of vnj indicates that the jth term in the vocabulary is significant for the topic
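For example, the dominant terms of a topic can be listed in MATLAB; this assumes the svd output V from the toy example above, and the vocabulary labels here are invented to match its 6 terms:

   vocab = {'sea','beach','bank','loan','tide','rate'};  % invented labels
   n = 1;                                  % inspect the leading topic
   [~, idx] = sort(abs(V(:, n)), 'descend');
   for k = 1:3                             % three most significant terms
       fprintf('%-8s %.3f\n', vocab{idx(k)}, V(idx(k), n));
   end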

Interpretation of LSA (continued)
 Now consider U
 It is easy to show that

   A vn = U S V^T vn = sn un
 While vn describes the nth topic as a combination of terms/words, un describes it as a combination of documents
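The identity is easy to confirm numerically on the toy example above:

   n = 1;                       % any n from 1 to N
   lhs = A * V(:, n);           % A vn
   rhs = S(n, n) * U(:, n);     % sn un
   norm(lhs - rhs)              % ~0: A vn = sn un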

Topic-based representation
 Columns of V, v1,…,vV are an orthonormal basis (coordinate system) for the document vector space
 If d is a document, vec(d)·vn is the magnitude of the component of vec(d) in the direction of vn
 …i.e. the component of vec(d) corresponding to topic n
 Hence the vector

   top(d) = (vec(d)·v1, vec(d)·v2, …, vec(d)·vV)^T = V^T vec(d)

  is a topic-based representation of d in terms of v1, …, vV
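In MATLAB the topic-based representation is a single matrix product; continuing the toy example above:

   top_d1 = V' * A(1, :)';      % top(d1) = V^T vec(d1)
   TOP    = A * V;              % row n of TOP is top(dn)'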

More information about LSA
 See:
Landauer, T.K. and Dumais, S.T., "A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction and representation of knowledge", Psychological Review, 104(2), 211-240 (1997)

Thoughts on document vectors
 Once d is replaced by vec(d) it becomes a point in a vector space
 How does the structure of the vector space reflect the properties of the documents in it?
 Do clusters of vectors correspond to semantically related documents?
 Can we partition the vector space into semantically different regions?
 These ideas are a link between IR and Data Mining

For an alternative perspective…
 Chapter 14: "The cunning fox"
 Application of LSA to ‘dating agency’ personal adverts
 LSA suggests that the meaning of a personal advert can be expressed as a weighted combination of a few basic ‘concepts’
Dr Graham Tattersall, “Geekspeak: How life + mathematics = happiness”, 2007

Summary
 Latent Semantic Analysis
 Interpretation of LSA