IFN647 Tutorial (Week 6): IR models Solutions for Tasks 1, 2 & 4
********************************************************
Task 1. Solution (discussion points):
There are many variant formulations and combinations! Whatever formulation is used, the unit-length-normalized TF*IDF scores are precomputed and stored, so that similarity comparison is just a dot product.
Term Frequency (tf):
The term frequency tf in tf*idf can be the raw term frequency f(t,d) (the number of times term t appears in document d). However, a term that occurs 10 times is not generally 10 times as important as a term that occurs once. Therefore, an alternative (log-scaled) formulation of tf in a document d can be:

tf(t, d) = 1 + log(f(t,d))  if f(t,d) > 0, and 0 otherwise.
Inverse Document Frequency (idf):
If N is the number of documents in a given document collection C (or a dataset), and df_t is the number of documents that contain term t, then the idf of term t in collection C is defined as:

idf_t = log(N / df_t)
For example, suppose C includes 10 documents, and a word “tutorial” appears in three documents. Then, mathematically, its inverse document frequency is idf_t = log(10/3).
Smoothing and Document-Length-Normalized version:
tfidf(t, d) = [(1 + log f(t,d)) · log(N / df_t)] / sqrt( Σ_{t'∈T} [(1 + log f(t',d)) · log(N / df_{t'})]² )

where N = |C| and T is the set of all terms in collection C (terms with f(t',d) = 0 contribute 0 to the sum).
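As a rough illustration of how these precomputed, unit-length-normalized vectors can be built and compared with a dot product, here is a minimal Python sketch; the function names (tfidf_vector, dot_product) and the toy data are assumptions for illustration, not part of the tutorial.

```python
import math
from collections import Counter

def tfidf_vector(doc_terms, df, N):
    """Unit-length-normalized TF*IDF weights for one document.

    doc_terms : list of the document's terms (already tokenized)
    df        : dict mapping term -> document frequency in the collection
    N         : number of documents in the collection
    """
    freqs = Counter(doc_terms)
    # Smoothed tf (1 + log f) times idf (log N/df) for every term in the document.
    weights = {t: (1 + math.log(f)) * math.log(N / df[t]) for t, f in freqs.items()}
    # Unit-length (Euclidean) normalization, so query-document similarity
    # later reduces to a plain dot product.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm > 0 else weights

def dot_product(v1, v2):
    """Similarity of two pre-normalized tf*idf vectors."""
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())

# Toy data (made up): N = 10 documents, "tutorial" appears in 3 of them.
df = {"tutorial": 3, "week": 6, "model": 2}
doc = ["tutorial", "tutorial", "model", "week"]
query = ["tutorial", "model"]
print(dot_product(tfidf_vector(query, df, N=10), tfidf_vector(doc, df, N=10)))
```

Because both vectors are already unit length, the dot product here is the cosine similarity.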
Task 2. Solution
      Term1  Term2  Term3  Term4  Term5
D1      3      0      0      5      7
D2      5      3      4      6      0
D3      0      0      5      4      6
D4      9      0      0      1      2
D5      0      1      0      3      2
D6      3      0      2      4      4
df      4      2      3      6      5
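If it helps to check the df row, here is a small Python sketch (the variable names are mine, not from the task) that recomputes each term's document frequency directly from the counts above:

```python
# df of a term = number of documents with a non-zero count for that term.
matrix = {
    "D1": [3, 0, 0, 5, 7],
    "D2": [5, 3, 4, 6, 0],
    "D3": [0, 0, 5, 4, 6],
    "D4": [9, 0, 0, 1, 2],
    "D5": [0, 1, 0, 3, 2],
    "D6": [3, 0, 2, 4, 4],
}
num_terms = 5
df = [sum(1 for counts in matrix.values() if counts[j] > 0) for j in range(num_terms)]
print(df)  # [4, 2, 3, 6, 5]
```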
Task 4. Solution
Contingency tables for query terms d1, d2 and d3, with N = 7 documents in total and R = 3 relevant documents.

Term d1 (ni = 4, ri = 3):

           Relevant          Non-relevant               Total
d1 = 1     ri = 3            ni - ri = 1                ni = 4
d1 = 0     R - ri = 0        (N-R) - (ni - ri) = 3      N - ni = 3
Total      R = 3             N - R = 4                  N = 7

Term d2 (ni = 4, ri = 0):

           Relevant          Non-relevant               Total
d2 = 1     ri = 0            ni - ri = 4                ni = 4
d2 = 0     R - ri = 3        (N-R) - (ni - ri) = 0      N - ni = 3
Total      R = 3             N - R = 4                  N = 7

Term d3 (ni = 3, ri = 2):

           Relevant          Non-relevant               Total
d3 = 1     ri = 2            ni - ri = 1                ni = 3
d3 = 0     R - ri = 1        (N-R) - (ni - ri) = 3      N - ni = 4
Total      R = 3             N - R = 4                  N = 7
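The three tables follow mechanically from N, R, ni and ri. Below is a small Python sketch (the function name and print layout are mine, not from the tutorial) that reproduces them:

```python
def contingency_table(N, R, ni, ri, term="d"):
    """Print the 2x2 relevance contingency table for one query term.

    N  : total number of documents
    R  : number of relevant documents
    ni : number of documents containing the term
    ri : number of relevant documents containing the term
    """
    rows = [
        (f"{term} = 1", ri,     ni - ri,             ni),
        (f"{term} = 0", R - ri, (N - R) - (ni - ri), N - ni),
        ("Total",       R,      N - R,               N),
    ]
    print(f"{'':10}{'Relevant':>10}{'Non-relevant':>14}{'Total':>8}")
    for label, rel, nonrel, total in rows:
        print(f"{label:<10}{rel:>10}{nonrel:>14}{total:>8}")

# Reproduce the three tables above (N = 7 documents, R = 3 relevant).
contingency_table(N=7, R=3, ni=4, ri=3, term="d1")
contingency_table(N=7, R=3, ni=4, ri=0, term="d2")
contingency_table(N=7, R=3, ni=3, ri=2, term="d3")
```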