COMP20008 Elements of Data Processing
Raw Frequencies
• What are the problems?
• What are the alternatives?
SPORTS: play, grace, crowd
ARTS: play, grace, audience
TF-IDF
Discourse on Floating Bodies
– Galileo Galilei
Treatise on Light
– Christiaan Huygens
Experiments with Alternate Currents of High Potential and High Frequency
– Nikola Tesla
Relativity: The Special and General Theory
– Albert Einstein
TF-IDF
• TF-IDF stands for Term Frequency-Inverse Document Frequency
• Represents each text document as a numeric vector
• each dimension is a specific word from the corpus
• A combination of two metrics to weight a term (word)
• term frequency (tf): how often a given word appears within a document
• inverse document frequency (idf): down-weights words that appear in many
documents.
• Main idea: reduce the weight of frequent terms and increase the
weight of rare and indicative ones.
TF-IDF
Term frequency (TF):
• $\mathrm{tf}(t, d)$ = the raw count of term $t$ in document $d$.
Inverse Document Frequency (IDF):
• $\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}_t} + 1$ or $\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}_t} + 1$
• $N$ is the number of documents in the collection,
• $\mathrm{df}_t$ is the document frequency, the number of documents containing the term $t$.
TF-IDF (L2 normalised):
• $\mathrm{tf\_idf}(t, d) = \dfrac{v_t}{\sqrt{\sum_{t' \in d} v_{t'}^2}}$
where $v_t = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$
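The three definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `idf` and `tf_idf`, the `smooth` flag (which switches between the two IDF variants), and the representation of a document as a list of tokens are all assumptions made for this sketch.

```python
import math
from collections import Counter


def idf(n_docs, df, smooth=True):
    """Inverse document frequency, in either of the two forms given above."""
    if smooth:
        return math.log((1 + n_docs) / (1 + df)) + 1
    return math.log(n_docs / df) + 1


def tf_idf(doc_tokens, corpus_tokens, smooth=True):
    """L2-normalised TF-IDF weights for one document (hypothetical helper).

    doc_tokens    -- list of terms in the document
    corpus_tokens -- list of token lists, one per document in the collection
    """
    n_docs = len(corpus_tokens)
    tf = Counter(doc_tokens)  # raw count of each term in the document
    v = {}
    for term, count in tf.items():
        # document frequency: number of documents containing the term
        df = sum(1 for d in corpus_tokens if term in d)
        v[term] = count * idf(n_docs, df, smooth)
    norm = math.sqrt(sum(w * w for w in v.values()))
    if norm == 0:
        return {}  # empty document: nothing to normalise
    return {term: w / norm for term, w in v.items()}


corpus = [["car", "driven", "road"], ["truck", "driven", "highway"]]
weights_a = tf_idf(corpus[0], corpus)
print({t: round(w, 3) for t, w in weights_a.items()})
```

With the default `smooth=True`, a term that appears in every document still gets an IDF of 1 rather than 0, so it is down-weighted but not discarded.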
Example TF-IDF
word      tf(t,A)  tf(t,B)  idf(t)                 v_t (A)  v_t (B)  tf_idf(t,A)  tf_idf(t,B)
car       1        0        ln(3/2) + 1 = 1.405    1.405    0        0.632        0
driven    1        1        ln(3/3) + 1 = 1        1        1        0.449        0.449
road      1        0        ln(3/2) + 1 = 1.405    1.405    0        0.632        0
truck     0        1        ln(3/2) + 1 = 1.405    0        1.405    0            0.632
highway   0        1        ln(3/2) + 1 = 1.405    0        1.405    0            0.632
Here $\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}_t} + 1$ and $v_t = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$.
Normaliser for both A and B: $\sqrt{\sum_{t' \in d} v_{t'}^2} = 2.225$
Two documents: A – ‘the car is driven on the road’
B – ‘the truck is driven on the highway’
* stop words removed
Example TF-IDF – cont.
• Two documents, A and B.
A. ‘the car is driven on the road’
B. ‘the truck is driven on the highway’
• Text features for machine learning
     car    driven  road   truck  highway
A    0.632  0.449   0.632  0      0
B    0      0.449   0      0.632  0.632
* stop words removed
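The feature rows above can be reproduced end to end with a short sketch. It assumes a tiny illustrative stop-word list (`STOP_WORDS`), whitespace tokenisation, and the column order shown on the slide (`VOCAB`); the helper names `tokenize` and `tfidf_matrix` are hypothetical, and a real pipeline would normally use a fuller stop-word list and tokeniser.

```python
import math

STOP_WORDS = {"the", "is", "on"}  # assumption: just enough for this example
VOCAB = ["car", "driven", "road", "truck", "highway"]  # column order from the slide


def tokenize(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_matrix(texts, vocab):
    """One L2-normalised TF-IDF row per document, one column per vocab term."""
    docs = [tokenize(t) for t in texts]
    n = len(docs)
    # smoothed IDF: ln((1 + N) / (1 + df_t)) + 1
    idf = {t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1 for t in vocab}
    rows = []
    for d in docs:
        v = [d.count(t) * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against empty documents
        rows.append([round(x / norm, 3) for x in v])
    return rows


features = tfidf_matrix(
    ["the car is driven on the road", "the truck is driven on the highway"],
    VOCAB,
)
for row in features:
    print(row)
```

Each row is a fixed-length numeric vector, so it can be passed to a machine learning model like any other set of numeric features.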
TRY THIS!
• 3 documents:
A: ‘the car is driven on roads’
B: ‘the truck is driven on a highway’
C: ‘a bike can not be ridden on a highway’
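word      tf(t,d)      idf(t)            v_t          tf_idf(t,d)
          A  B  C                        A  B  C      A  B  C
car       1  0  0      ln(4/2) + 1 =
driven    1  1  0
road      1  0  0
truck     0  1  0
highway   0  1  1
bike      0  0  1
ridden    0  0  1
Here $\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}_t} + 1$ and $v_t = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$.
For each document A, B and C, divide $v_t$ by $\sqrt{\sum_{t' \in d} v_{t'}^2}$ to get $\mathrm{tf\_idf}(t, d)$.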
* stop words removed
Features from unstructured text
Features for structured data
Features for unstructured text
car driven road truck highway
0.632 0.449 0.632 0 0
0 0.449 0 0.632 0.632
Rank document similarity to a query
• Query q = ‘I saw a car and a truck on the highway’
• Query terms = [‘car’, ‘truck’, ‘highway’]
• Query vector $[1, 0, 0, 1, 1]$, unit vector $v_q = [0.577, 0, 0, 0.577, 0.577]$
• Cosine similarity to rank documents
$\cos(v_q, d_1) = 0.36$, $\cos(v_q, d_2) = 0.73$, so $d_2$ ranks above $d_1$
      car    driven  road   truck  highway
d1    0.632  0.449   0.632  0      0
d2    0      0.449   0      0.632  0.632
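The ranking step can be sketched as follows, using the rounded document vectors from the table and the binary query vector over the same five terms. The `cosine` helper and the dictionary of scores are illustrative names, not part of the original slides.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Columns: car, driven, road, truck, highway
query = [1, 0, 0, 1, 1]  # terms 'car', 'truck', 'highway'
d1 = [0.632, 0.449, 0.632, 0, 0]
d2 = [0, 0.449, 0, 0.632, 0.632]

scores = {"d1": cosine(query, d1), "d2": cosine(query, d2)}
ranking = sorted(scores, key=scores.get, reverse=True)  # most similar first

for name in ranking:
    print(name, round(scores[name], 2))
```

Because cosine similarity divides by both vector lengths, passing the raw query vector gives the same result as first converting it to the unit vector $v_q$.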
Break