COMP20008 Elements of Data Processing

Raw Frequencies

• What are the problems?

• What are the alternatives?

[Figure: example terms for two topics. SPORTS: play, grace, crowd. ARTS: play, grace, audience.]

TF-IDF

Discourse on Floating Bodies
– Galileo Galilei

Treatise on Light
– Christiaan Huygens

Experiments with Alternate Currents of High Potential and High Frequency
– Nikola Tesla

Relativity: The Special and General Theory
– Albert Einstein

TF-IDF

• TF-IDF stands for Term Frequency-Inverse Document Frequency

• Each text document is represented as a numeric vector
    • each dimension corresponds to a specific word from the corpus

• A combination of two metrics is used to weight a term (word)
    • term frequency (tf): how often a given word appears within a document
    • inverse document frequency (idf): down-weights words that appear in many documents

• Main idea: reduce the weight of terms that are frequent across the corpus and increase the weight of rare, indicative ones.

TF-IDF

Term frequency (TF):
• $tf(t, d)$ = the raw count of term $t$ in document $d$.

Inverse Document Frequency (IDF):

• $idf(t) = \ln\frac{1+N}{1+df_t} + 1$ or $idf(t) = \ln\frac{N}{df_t} + 1$

• $N$ is the number of documents in the collection,

• $df_t$ is the document frequency: the number of documents containing the term $t$.
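
The two IDF variants above can be compared with a short Python sketch. The function names `idf_smooth` and `idf_plain` are illustrative only and do not come from the slides; the inputs (N = 2, document frequencies of 1 and 2) are borrowed from the two-document example used in this lecture.

```python
import math

def idf_smooth(n_docs, df):
    """Smoothed variant: ln((1 + N) / (1 + df_t)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

def idf_plain(n_docs, df):
    """Unsmoothed variant: ln(N / df_t) + 1. Needs df_t >= 1."""
    if df < 1:
        raise ValueError("the term must appear in at least one document")
    return math.log(n_docs / df) + 1

# A term found in 1 of 2 documents versus a term found in both.
rare = idf_smooth(2, 1)    # ln(3/2) + 1, about 1.405
common = idf_smooth(2, 2)  # ln(3/3) + 1 = 1.0
```

The smoothed form never divides by zero, even for a term that appears in no document, which is one plausible reason to prefer it.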

TF-IDF (L2 normalised):
• $tf\_idf(t, d) = \dfrac{v_t}{\sqrt{\sum_{t' \in d} v_{t'}^2}}$, where $v_t = tf(t, d) \times idf(t)$
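
A minimal Python sketch of the L2-normalised weighting, assuming the term counts and IDF values for one document are already available as dictionaries. The function name `l2_normalised_tfidf` is hypothetical; the sample values are those of document A in the worked example of this lecture, rounded to three decimals.

```python
import math

def l2_normalised_tfidf(tf, idf):
    """Return v_t / sqrt(sum of v_t'^2) for every term t of one document."""
    # Unnormalised weights v_t = tf(t, d) * idf(t).
    v = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(x * x for x in v.values()))
    if norm == 0:
        # An empty or all-zero document keeps zero weights.
        return {term: 0.0 for term in v}
    return {term: x / norm for term, x in v.items()}

# Document A after stop-word removal, with rounded IDF values.
tf = {"car": 1, "driven": 1, "road": 1}
idf = {"car": 1.405, "driven": 1.0, "road": 1.405}
weights = l2_normalised_tfidf(tf, idf)
```

After normalisation the squared weights of a document sum to 1, so long documents do not get larger vectors merely because they contain more words.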

Example TF-IDF

Two documents:
A – ‘the car is driven on the road’
B – ‘the truck is driven on the highway’

$idf(t) = \ln\frac{1+N}{1+df_t} + 1$ and $v_t = tf(t, d) \times idf(t)$. The norm $\sqrt{\sum_{t' \in d} v_{t'}^2}$ is 2.225 for A and 2.225 for B.

word      tf(t,A)  tf(t,B)  idf(t)                 v_t (A)  v_t (B)  tf_idf(t,A)  tf_idf(t,B)
car       1        0        ln(3/2) + 1 = 1.405    1.405    0        0.632        0
driven    1        1        ln(3/3) + 1 = 1        1        1        0.449        0.449
road      1        0        ln(3/2) + 1 = 1.405    1.405    0        0.632        0
truck     0        1        ln(3/2) + 1 = 1.405    0        1.405    0            0.632
highway   0        1        ln(3/2) + 1 = 1.405    0        1.405    0            0.632

* stop words removed
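
As a check on the arithmetic for document A (intermediate values are rounded, so the last digit may differ slightly if recomputed by hand):

\[
\sqrt{\sum_{t' \in A} v_{t'}^2} = \sqrt{2\left(\ln\tfrac{3}{2} + 1\right)^2 + 1^2} \approx \sqrt{4.951} \approx 2.225
\]

\[
tf\_idf(\text{car}, A) \approx \frac{1.4055}{2.2250} \approx 0.632, \qquad tf\_idf(\text{driven}, A) \approx \frac{1}{2.2250} \approx 0.449
\]

Document B has the same multiset of weights (two terms at 1.405, one at 1), so its norm is also about 2.225.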

Example TF-IDF – cont.

• Two documents, A and B.
A. ‘the car is driven on the road’

B. ‘the truck is driven on the highway’

• Text features for machine learning

     car    driven  road   truck  highway
A    0.632  0.449   0.632  0      0
B    0      0.449   0      0.632  0.632

* stop words removed
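
The feature rows above can be reproduced end to end with the sketch below, which uses only the Python standard library. The stop-word list and the helper names `tokenise` and `tfidf_matrix` are assumptions made for this illustration; a real pipeline would normally use a fuller stop-word list and a proper tokeniser.

```python
import math
from collections import Counter

# Assumed stop-word list, just large enough for the two example sentences.
STOP_WORDS = {"the", "is", "on"}

def tokenise(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_matrix(docs):
    """Return (vocabulary, rows): one L2-normalised TF-IDF row per document."""
    tokenised = [tokenise(d) for d in docs]
    counts = [Counter(tokens) for tokens in tokenised]
    # Vocabulary in order of first appearance, to match the slide's columns.
    vocab = list(dict.fromkeys(t for tokens in tokenised for t in tokens))
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for c in counts:
        v = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in v])
    return vocab, rows

docs = ["the car is driven on the road",
        "the truck is driven on the highway"]
vocab, rows = tfidf_matrix(docs)
for name, row in zip("AB", rows):
    print(name, {t: round(x, 3) for t, x in zip(vocab, row)})
```

Each row of `rows` is a feature vector that can be passed to a machine learning model in the same way as a row of structured data.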

TRY THIS!
• 3 documents:

A: ‘the car is driven on roads’
B: ‘the truck is driven on a highway’
C: ‘a bike can not be ridden on a highway’

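$idf(t) = \ln\frac{1+N}{1+df_t} + 1$ and $v_t = tf(t, d) \times idf(t)$. Each $tf\_idf(t, d)$ column divides $v_t$ by that document's norm $\sqrt{\sum_{t' \in d} v_{t'}^2}$.

word      tf(t,A)  tf(t,B)  tf(t,C)  idf(t)            v_t (A)  v_t (B)  v_t (C)  tf_idf(t,A)  tf_idf(t,B)  tf_idf(t,C)
car       1        0        0        ln(4/2) + 1 =
driven    1        1        0
road      1        0        0
truck     0        1        0
highway   0        1        1
bike      0        0        1
ridden    0        0        1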

* stop words removed

Features from unstructured text

Features for structured data

Features for unstructured text

     car    driven  road   truck  highway
A    0.632  0.449   0.632  0      0
B    0      0.449   0      0.632  0.632

Rank document similarity to a query

• Query q = ‘I saw a car and a truck on the highway’

• Query terms = [‘car’, ‘truck’, ‘highway’]

• Query vector $[1, 0, 0, 1, 1]$ over (car, driven, road, truck, highway), unit vector $v_q = [0.577, 0, 0, 0.577, 0.577]$

• Cosine similarity to rank documents

$\cos(v_q, d_1) = 0.36$, $\cos(v_q, d_2) = 0.73$, so d2 ranks above d1

      car    driven  road   truck  highway
d1    0.632  0.449   0.632  0      0
d2    0      0.449   0      0.632  0.632
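
The ranking step can be sketched in plain Python as follows. The helper name `cosine` and the dictionary layout are illustrative choices; the document vectors are the rounded TF-IDF rows from the slide, and the query is given as the unnormalised vector because cosine similarity divides by both lengths anyway.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Column order: car, driven, road, truck, highway.
docs = {
    "d1": [0.632, 0.449, 0.632, 0.0, 0.0],
    "d2": [0.0, 0.449, 0.0, 0.632, 0.632],
}
query = [1, 0, 0, 1, 1]  # terms: car, truck, highway

# Most similar document first.
ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
scores = {name: round(cosine(query, vec), 2) for name, vec in docs.items()}
```

Here d2 shares two query terms (truck, highway) with the query while d1 shares only one (car), which is why d2 scores higher.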

Break