Text Pre-Processing — 2
Faculty of Information Technology, Monash University, Australia
FIT5196 week 5
Outline
1 Inverted Index
2 Vector Space Model
3 TF-IDF
4 Collocations
Inverted Index
Figure: This figure is adapted from the book "Introduction to Information Retrieval", available at http://nlp.stanford.edu/IR-book/
A dictionary of terms (referred to as a vocabulary or lexicon)
Each term is associated with a list that records which documents the term
occurs in.
Inverted Index
To build an inverted index
1 Collect the documents to be indexed.
2 Tokenize the text, turning each document into a list of tokens
3 Do linguistic preprocessing, producing a list of normalized tokens, which are
the indexing terms
4 Index the documents that each term occurs in by creating an inverted index,
consisting of a dictionary and postings.
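A minimal Python sketch of these four steps on three toy documents (the normalise helper and document IDs are illustrative; real linguistic preprocessing would do more than lower-casing and punctuation stripping):

```python
# A minimal sketch of the four indexing steps above.
from collections import defaultdict

docs = {
    1: "Data analysis is important.",
    2: "Data wrangling is as important as data analysis.",
    3: "Data science contains data analysis and data wrangling.",
}

def normalise(text):
    # Steps 2-3: tokenise, then normalise each token (lower-case, strip punctuation).
    return [tok.strip(".,").lower() for tok in text.split()]

# Step 4: dictionary of terms -> sorted posting list of document IDs.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in normalise(text):
        index[term].add(doc_id)

postings = {term: sorted(ids) for term, ids in index.items()}
print(postings["data"])       # [1, 2, 3]
print(postings["wrangling"])  # [2, 3]
```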
Inverted Index
Figure: This figure is adapted from the book "Introduction to Information Retrieval", available at http://nlp.stanford.edu/IR-book/
Vector Space Model
VSM: each text document is represented as a vector where the elements of the vector indicate the occurrence of words within the text:

  V(d) = [w_{1,d}, w_{2,d}, w_{3,d}, ..., w_{N,d}]

where
  ▸ N: the size of the vocabulary
  ▸ w_{n,d}: the weight of the n-th term (in the vocabulary) in document d
Examples:
  ▸ document_1: "Data analysis is important."
  ▸ document_2: "Data wrangling is as important as data analysis."
  ▸ document_3: "Data science contains data analysis and data wrangling."
Vector representation (terms in the order: 'analysis', 'and', 'as', 'contains', 'data', 'important', 'is', 'science', 'wrangling'):

Binary weights:
  document_1: 1 0 0 0 1 1 1 0 0
  document_2: 1 0 1 0 1 1 1 0 1
  document_3: 1 1 0 1 1 0 0 1 1

Raw term counts:
  document_1: 1 0 0 0 1 1 1 0 0
  document_2: 1 0 2 0 2 1 1 0 1
  document_3: 1 1 0 1 3 0 0 1 1

TF-IDF weights:
  document_1: 0 0 0 0 0 0.176 0.176 0 0
  document_2: 0 0 0.954 0 0 0.176 0.176 0 0.176
  document_3: 0 0.477 0 0.477 0 0 0 0.477 0.176
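The binary and count rows can be reproduced with a few lines of Python; a minimal sketch (the regex tokeniser and variable names are illustrative):

```python
# Sketch: rebuild the binary and raw-count vectors for the three example documents.
import re

docs = [
    "Data analysis is important.",
    "Data wrangling is as important as data analysis.",
    "Data science contains data analysis and data wrangling.",
]

tokenised = [re.findall(r"[a-z]+", d.lower()) for d in docs]
vocab = sorted({tok for toks in tokenised for tok in toks})
# ['analysis', 'and', 'as', 'contains', 'data', 'important', 'is', 'science', 'wrangling']

count_vectors = [[toks.count(term) for term in vocab] for toks in tokenised]
binary_vectors = [[1 if c > 0 else 0 for c in row] for row in count_vectors]
print(count_vectors[1])  # [1, 0, 2, 0, 2, 1, 1, 0, 1]
```

The TF-IDF rows are consistent with raw count × log10(D/n_t), the weighting defined later in the TF-IDF section: 'as' occurs twice in document_2 and appears in one of the three documents, so its weight is 2 × log10(3/1) ≈ 0.954, while 'data' appears in every document and therefore gets weight 0.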
Vector Space Model — Euclidean Distance
Figure: This figure is adapted from the lecture slides at https://www.cl.cam.ac.uk/teaching/1314/InfoRtrv/lecture4.pdf
Vector Space Model — Cosine Similarity
Figure: This figure is adapted from the book "Introduction to Information Retrieval", available at http://nlp.stanford.edu/IR-book/
  sim(d_1, d_2) = \frac{V(d_1) \cdot V(d_2)}{|V(d_1)| \, |V(d_2)|}
                = \frac{\sum_{n=1}^{N} V_n(d_1) V_n(d_2)}{\sqrt{\sum_{n=1}^{N} V_n(d_1)^2} \, \sqrt{\sum_{n=1}^{N} V_n(d_2)^2}}
Notation:
  ▸ V_n(d): the weight of the n-th vocabulary term in document d
  ▸ · : the inner (dot) product
  ▸ |V(d)|: the Euclidean length of V(d)
The cosine similarity of d_1 and d_2 is equivalent to the cosine of the angle between d_1 and d_2.
  ▸ Cosine is a monotonically decreasing function of the angle over the interval [0°, 180°].
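A minimal sketch of this computation in Python, applied to the raw-count vectors of document_1 and document_2 from the running example (the helper name is illustrative):

```python
# Sketch: cosine similarity between two term-count vectors.
import math

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

d1 = [1, 0, 0, 0, 1, 1, 1, 0, 0]  # document_1, raw counts
d2 = [1, 0, 2, 0, 2, 1, 1, 0, 1]  # document_2, raw counts
print(round(cosine_similarity(d1, d2), 3))  # 0.722
```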
Vector Space Model — Bag of Words
The Bag-of-Words assumption:
  ▸ The order of the words in a document does not matter.
  ▸ Not a problem for many text analysis tasks, such as document classification and clustering.
    − A collection of words is usually sufficient to differentiate between the semantic concepts of documents.
  ▸ Not a universal solution for tasks where the order of words does matter, such as information extraction, POS tagging and many other NLP tasks.
Considerations when using the VSM:
  ▸ The dimensionality of the vectors in the VSM.
  ▸ How to compute the weights of words in each document (a short sketch follows this list):
    − binary values
    − counts
    − TF-IDF weights
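For reference, if scikit-learn is available, all three weighting options can be produced directly; note that its TfidfVectorizer uses a smoothed, natural-log IDF with L2 normalisation, so its values differ from the log10 scheme used on the following slides:

```python
# Sketch: the three weighting options via scikit-learn (assumes scikit-learn
# is installed; its tf-idf variant differs slightly from the formulas in these slides).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "Data analysis is important.",
    "Data wrangling is as important as data analysis.",
    "Data science contains data analysis and data wrangling.",
]

binary = CountVectorizer(binary=True).fit_transform(docs)  # binary values
counts = CountVectorizer().fit_transform(docs)             # raw counts
tfidf = TfidfVectorizer().fit_transform(docs)              # tf-idf weights
print(counts.toarray())
```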
TF-IDF
Term Frequency
TF: the number of occurrences of a word in a document.
How well does a word represent its document?
  ▸ Frequent (non-stop) words are thematic.
  ▸ If a word t appears often in a document, then another document containing t is likely to be similar to it.
Different ways of computing TF:

  weighting scheme          TF weight
  binary                    0, 1
  raw frequency             f_{t,d}
  log normalisation         1 + log(f_{t,d})
  double normalisation K    K + (1 − K) × f_{t,d} / max_{t'∈d} f_{t',d}
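A small sketch of these TF variants in Python (the function and scheme names are illustrative):

```python
# Sketch of the TF variants above for a term t in a tokenised document.
import math

def tf(term, tokens, scheme="raw", k=0.5):
    f = tokens.count(term)
    if scheme == "binary":
        return 1 if f > 0 else 0
    if scheme == "raw":
        return f
    if scheme == "log":
        return 1 + math.log(f) if f > 0 else 0
    if scheme == "double_norm":
        max_f = max(tokens.count(t) for t in set(tokens))
        return k + (1 - k) * f / max_f
    raise ValueError(f"unknown scheme: {scheme}")

doc2 = "data wrangling is as important as data analysis".split()
print(tf("data", doc2, "raw"))          # 2
print(tf("data", doc2, "double_norm"))  # 0.5 + 0.5 * 2/2 = 1.0
```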
Problems:
  ▸ All words are considered equally important.
  ▸ Certain terms have little or no discriminating power in, for example, text classification.
    · All medical documents contain words such as "patient" and "disease".
    · All patents contain words such as "claim", "embodiment", etc.
  ▸ Can we scale down the weights of such terms?
Inverse Document Frequency
IDF: inverse document frequency
  ▸ The idf of a rare term is high, whereas the idf of a frequent term is likely to be low.

    idf_t = log(D / n_t)

  where D is the total number of documents in the corpus, and n_t = |{d : t ∈ d}| is the number of documents containing t.
    − If a word/term appears in every document, then n_t = D and idf_t = 0.
    − If a word/term appears in only one document, then n_t = 1 and idf_t = log(D).
  ▸ Different ways of computing IDF:

    weighting scheme     IDF weight
    IDF                  log(D / n_t)
    IDF smoothed         log(1 + D / n_t)
    IDF max              log(1 + max_{t'∈d} n_{t'} / n_t)
    IDF probabilistic    log((D − n_t) / n_t)
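For the running three-document example (D = 3, assuming log base 10, which is the base consistent with the TF-IDF weights shown in the earlier vector-representation table): idf('data') = log10(3/3) = 0, idf('important') = log10(3/2) ≈ 0.176, and idf('science') = log10(3/1) ≈ 0.477.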
TF-IDF
The TF-IDF weighting scheme:

  tf-idf_{t,d} = tf_{t,d} × idf_t = f_{t,d} × log(D / n_t)

where f_{t,d} is the number of occurrences of word t in document d, and n_t is the document frequency of term t (the number of documents containing t).
  ▸ Words that are frequent across most documents receive low weights.
  ▸ Less frequent (more discriminative) words receive higher weights.
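Putting the two together, a short sketch (assuming log base 10, consistent with the earlier table) that reproduces the document_2 row of the TF-IDF weights:

```python
# Sketch: tf-idf weights for document_2, reproducing its row in the earlier
# vector-representation table.
import math

D = 3  # number of documents in the toy corpus
n_t = {"analysis": 3, "and": 1, "as": 1, "contains": 1, "data": 3,
       "important": 2, "is": 2, "science": 1, "wrangling": 2}
doc2 = "data wrangling is as important as data analysis".split()

weights = {t: doc2.count(t) * math.log10(D / n) for t, n in n_t.items()}
print(round(weights["as"], 3))    # 0.954
print(round(weights["data"], 3))  # 0.0
```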
Collocations
Collocations: multi-word expressions consisting of two or more words that occur together more frequently than by chance and correspond to some conventional way of saying things.
  ▸ Noun phrases: "strong tea", "weapons of mass destruction"
  ▸ Verb phrases: "make up", "wind up"
Collocations are characterised by limited compositionality.
  ▸ Non-compositional phrases:
    − It is difficult to predict the meaning of the collocation from the meanings of its parts.
    − The collocation adds an element of meaning beyond the combination of its parts.
    − Examples:
      "He is known for his fair and square dealings and everybody trusts his work."¹
      "I had a shooting pain near the heart."

¹ Example from http://language.worldofcomputing.net/nlp-overview/collocations.html
  ▸ Compositional phrases:
    − The meaning of the expression can be predicted from the meaning of its parts.
    − Example:
      "I had a stomach pain yesterday."
How to extract collocations:
  ▸ The NLTK collocations package (nltk.collocations)
  ▸ A short tutorial: http://www.nltk.org/howto/collocations.html
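A minimal sketch following that tutorial (it uses NLTK's genesis corpus, as the tutorial does; the corpus download is only needed once):

```python
# Sketch: bigram collocation extraction with NLTK, following the howto above.
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("genesis")  # corpus used in the NLTK howto
words = nltk.corpus.genesis.words("english-web.txt")

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)                   # drop very rare bigrams
print(finder.nbest(bigram_measures.pmi, 10))  # top 10 by pointwise mutual information
```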
Summary: what do you need to do this week?
What we have covered:
  ▸ Vector Space Model
  ▸ Term weighting
  ▸ Collocations
  ▸ Count vocabulary
What you need to do:
  ▸ Download and read the chapters provided on Moodle.
  ▸ Attend tutorial 5 next week.
  ▸ Remember: assessment 1 is due next week.