Text Pre-Processing — 2
Faculty of Information Technology, Monash University, Australia
FIT5196 week 5
Outline
1 Inverted Index
2 Vector Space Model
3 TF-IDF
4 Collocations
Inverted Index
Figure: This figure is adapted from the book "Introduction to Information Retrieval", available at http://nlp.stanford.edu/IR-book/
A dictionary of terms (referred to as a vocabulary or lexicon)
Each term is associated with a list that records which documents the term
occurs in.
Inverted Index
To build an inverted index
1 Collect the documents to be indexed.
2 Tokenize the text, turning each document into a list of tokens
3 Do linguistic preprocessing, producing a list of normalized tokens, which are
the indexing terms
4 Index the documents that each term occurs in by creating an inverted index,
consisting of a dictionary and postings.
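A minimal Python sketch of these four steps on three toy documents (the normalise helper and document IDs are illustrative; real linguistic preprocessing would do more than lower-casing and punctuation stripping):

```python
# A minimal sketch of the four indexing steps above.
from collections import defaultdict

docs = {
    1: "Data analysis is important.",
    2: "Data wrangling is as important as data analysis.",
    3: "Data science contains data analysis and data wrangling.",
}

def normalise(text):
    # Steps 2-3: tokenise, then normalise each token (lower-case, strip punctuation).
    return [tok.strip(".,").lower() for tok in text.split()]

# Step 4: dictionary of terms -> sorted posting list of document IDs.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in normalise(text):
        index[term].add(doc_id)

postings = {term: sorted(ids) for term, ids in index.items()}
print(postings["data"])       # [1, 2, 3]
print(postings["wrangling"])  # [2, 3]
```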
Inverted Index
Figure: This figure is adapted from the book "Introduction to Information Retrieval", available at http://nlp.stanford.edu/IR-book/
Vector Space Model
VSM: each text document is represented as a vector where the elements of the vector indicate the occurrence of words within the text:

  V(d) = [w_{1,d}, w_{2,d}, w_{3,d}, ..., w_{N,d}]

where
  ▸ N: the size of the vocabulary
  ▸ w_{n,d}: the weight of the n-th term (in the vocabulary) in document d
Examples:
  ▸ document_1: "Data analysis is important."
  ▸ document_2: "Data wrangling is as important as data analysis."
  ▸ document_3: "Data science contains data analysis and data wrangling."
Vector representation (terms in the order: 'analysis', 'and', 'as', 'contains', 'data', 'important', 'is', 'science', 'wrangling'):

Binary weights:
  document_1: 1 0 0 0 1 1 1 0 0
  document_2: 1 0 1 0 1 1 1 0 1
  document_3: 1 1 0 1 1 0 0 1 1

Raw term counts:
  document_1: 1 0 0 0 1 1 1 0 0
  document_2: 1 0 2 0 2 1 1 0 1
  document_3: 1 1 0 1 3 0 0 1 1

TF-IDF weights:
  document_1: 0 0 0 0 0 0.176 0.176 0 0
  document_2: 0 0 0.954 0 0 0.176 0.176 0 0.176
  document_3: 0 0.477 0 0.477 0 0 0 0.477 0.176
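The binary and count rows can be reproduced with a few lines of Python; a minimal sketch (the regex tokeniser and variable names are illustrative):

```python
# Sketch: rebuild the binary and raw-count vectors for the three example documents.
import re

docs = [
    "Data analysis is important.",
    "Data wrangling is as important as data analysis.",
    "Data science contains data analysis and data wrangling.",
]

tokenised = [re.findall(r"[a-z]+", d.lower()) for d in docs]
vocab = sorted({tok for toks in tokenised for tok in toks})
# ['analysis', 'and', 'as', 'contains', 'data', 'important', 'is', 'science', 'wrangling']

count_vectors = [[toks.count(term) for term in vocab] for toks in tokenised]
binary_vectors = [[1 if c > 0 else 0 for c in row] for row in count_vectors]
print(count_vectors[1])  # [1, 0, 2, 0, 2, 1, 1, 0, 1]
```

The TF-IDF rows are consistent with raw count × log10(D/n_t), the weighting defined later in the TF-IDF section: 'as' occurs twice in document_2 and appears in one of the three documents, so its weight is 2 × log10(3/1) ≈ 0.954, while 'data' appears in every document and therefore gets weight 0.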
Vector Space Model — Euclidean Distance
Figure: This figure is adapted from the lecture slides at https://www.cl.cam.ac.uk/teaching/1314/InfoRtrv/lecture4.pdf
Vector Space Model — Cosine Similarity
Figure: This figure is adapted from the book "Introduction to Information Retrieval", available at http://nlp.stanford.edu/IR-book/
  sim(d_1, d_2) = \frac{V(d_1) \cdot V(d_2)}{|V(d_1)| \, |V(d_2)|}
                = \frac{\sum_{n=1}^{N} V_n(d_1) V_n(d_2)}{\sqrt{\sum_{n=1}^{N} V_n(d_1)^2} \, \sqrt{\sum_{n=1}^{N} V_n(d_2)^2}}
Notation:
  ▸ V_n(d): the weight of the n-th vocabulary term in document d
  ▸ · : the inner (dot) product
  ▸ |V(d)|: the Euclidean length of V(d)
The cosine similarity of d_1 and d_2 is equivalent to the cosine of the angle between d_1 and d_2.
  ▸ Cosine is a monotonically decreasing function of the angle over the interval [0°, 180°].
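A minimal sketch of this computation in Python, applied to the raw-count vectors of document_1 and document_2 from the running example (the helper name is illustrative):

```python
# Sketch: cosine similarity between two term-count vectors.
import math

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

d1 = [1, 0, 0, 0, 1, 1, 1, 0, 0]  # document_1, raw counts
d2 = [1, 0, 2, 0, 2, 1, 1, 0, 1]  # document_2, raw counts
print(round(cosine_similarity(d1, d2), 3))  # 0.722
```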
Vector Space Model — Bag of Words
The Bag-of-Words assumption:
  ▸ The order of the words in a document does not matter.
  ▸ Not a problem for many text analysis tasks, such as document classification and clustering.
    − A collection of words is usually sufficient to differentiate between the semantic concepts of documents.
  ▸ Not a universal solution for tasks where the order of words does matter, such as information extraction, POS tagging and many other NLP tasks.
Considerations when using the VSM:
  ▸ The dimensionality of the vectors in the VSM.
  ▸ How to compute the weights of words in each document (a short sketch follows this list):
    − binary values
    − counts
    − TF-IDF weights
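For reference, if scikit-learn is available, all three weighting options can be produced directly; note that its TfidfVectorizer uses a smoothed, natural-log IDF with L2 normalisation, so its values differ from the log10 scheme used on the following slides:

```python
# Sketch: the three weighting options via scikit-learn (assumes scikit-learn
# is installed; its tf-idf variant differs slightly from the formulas in these slides).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "Data analysis is important.",
    "Data wrangling is as important as data analysis.",
    "Data science contains data analysis and data wrangling.",
]

binary = CountVectorizer(binary=True).fit_transform(docs)  # binary values
counts = CountVectorizer().fit_transform(docs)             # raw counts
tfidf = TfidfVectorizer().fit_transform(docs)              # tf-idf weights
print(counts.toarray())
```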
TF-IDF
Term Frequency
TF: the number of occurrences of a word in a document.
How well does a word represent its document?
  ▸ Frequent (non-stop) words are thematic.
  ▸ If a word t appears often in a document, then another document containing t is likely to be similar to it.
Different ways of computing TF:

  weighting scheme          TF weight
  binary                    0, 1
  raw frequency             f_{t,d}
  log normalisation         1 + log(f_{t,d})
  double normalisation K    K + (1 − K) × f_{t,d} / max_{t'∈d} f_{t',d}
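A small sketch of these TF variants in Python (the function and scheme names are illustrative):

```python
# Sketch of the TF variants above for a term t in a tokenised document.
import math

def tf(term, tokens, scheme="raw", k=0.5):
    f = tokens.count(term)
    if scheme == "binary":
        return 1 if f > 0 else 0
    if scheme == "raw":
        return f
    if scheme == "log":
        return 1 + math.log(f) if f > 0 else 0
    if scheme == "double_norm":
        max_f = max(tokens.count(t) for t in set(tokens))
        return k + (1 - k) * f / max_f
    raise ValueError(f"unknown scheme: {scheme}")

doc2 = "data wrangling is as important as data analysis".split()
print(tf("data", doc2, "raw"))          # 2
print(tf("data", doc2, "double_norm"))  # 0.5 + 0.5 * 2/2 = 1.0
```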
Problems:
  ▸ All words are considered equally important.
  ▸ Certain terms have little or no discriminating power in, for example, text classification.
    · All medical documents contain words such as "patient" and "disease".
    · All patents contain words such as "claim", "embodiment", etc.
  ▸ Can we scale down the weights of such terms?
Inverse Document Frequency
IDF: inverse document frequency
  ▸ The idf of a rare term is high, whereas the idf of a frequent term is likely to be low.

    idf_t = log(D / n_t)

  where D is the total number of documents in the corpus, and n_t = |{d : t ∈ d}| is the number of documents containing t.
    − If a word/term appears in every document, then n_t = D and idf_t = 0.
    − If a word/term appears in only one document, then n_t = 1 and idf_t = log(D).
  ▸ Different ways of computing IDF:

    weighting scheme     IDF weight
    IDF                  log(D / n_t)
    IDF smoothed         log(1 + D / n_t)
    IDF max              log(1 + max_{t'∈d} n_{t'} / n_t)
    IDF probabilistic    log((D − n_t) / n_t)
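For the running three-document example (D = 3, assuming log base 10, which is the base consistent with the TF-IDF weights shown in the earlier vector-representation table): idf('data') = log10(3/3) = 0, idf('important') = log10(3/2) ≈ 0.176, and idf('science') = log10(3/1) ≈ 0.477.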
TF-IDF
The TF-IDF weighting scheme:

  tf-idf_{t,d} = tf_{t,d} × idf_t = f_{t,d} × log(D / n_t)

where f_{t,d} is the number of occurrences of word t in document d, and n_t is the document frequency of term t (the number of documents containing t).
  ▸ Words that are frequent across most documents receive low weights.
  ▸ Less frequent (more discriminative) words receive higher weights.
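Putting the two together, a short sketch (assuming log base 10, consistent with the earlier table) that reproduces the document_2 row of the TF-IDF weights:

```python
# Sketch: tf-idf weights for document_2, reproducing its row in the earlier
# vector-representation table.
import math

D = 3  # number of documents in the toy corpus
n_t = {"analysis": 3, "and": 1, "as": 1, "contains": 1, "data": 3,
       "important": 2, "is": 2, "science": 1, "wrangling": 2}
doc2 = "data wrangling is as important as data analysis".split()

weights = {t: doc2.count(t) * math.log10(D / n) for t, n in n_t.items()}
print(round(weights["as"], 3))    # 0.954
print(round(weights["data"], 3))  # 0.0
```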
Collocations
Collocations: multi-word expressions consisting of two or more words that occur together more frequently than by chance and correspond to some conventional way of saying things.
  ▸ Noun phrases: "strong tea", "weapons of mass destruction"
  ▸ Verb phrases: "make up", "wind up"
Collocations are characterised by limited compositionality.
  ▸ Non-compositional phrases:
    − It is difficult to predict the meaning of the collocation from the meanings of its parts.
    − The collocation adds an element of meaning beyond the combination of its parts.
    − Examples:
      "He is known for his fair and square dealings and everybody trusts his work."¹
      "I had a shooting pain near the heart."

¹ Example from http://language.worldofcomputing.net/nlp-overview/collocations.html
  ▸ Compositional phrases:
    − The meaning of the expression can be predicted from the meaning of its parts.
    − Example:
      "I had a stomach pain yesterday."
How to extract collocations:
  ▸ The NLTK collocations package (nltk.collocations)
  ▸ A short tutorial: http://www.nltk.org/howto/collocations.html
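A minimal sketch following that tutorial (it uses NLTK's genesis corpus, as the tutorial does; the corpus download is only needed once):

```python
# Sketch: bigram collocation extraction with NLTK, following the howto above.
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("genesis")  # corpus used in the NLTK howto
words = nltk.corpus.genesis.words("english-web.txt")

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)                   # drop very rare bigrams
print(finder.nbest(bigram_measures.pmi, 10))  # top 10 by pointwise mutual information
```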
Summary: what do you need to do this week?
What we have covered:
  ▸ Vector Space Model
  ▸ Term weighting
  ▸ Collocations
  ▸ Count vocabulary
What you need to do:
  ▸ Download and read the chapters provided on Moodle.
  ▸ Attend tutorial 5 next week.
  ▸ Remember: assessment 1 is due next week.