Microsoft Word – Week4 lec question SOLUTIONS.docx
Question Solutions for Week 4
Professor Yuefeng Li
Copyright By PowCoder代写 加微信 powcoder
School of Computer Science, Queensland University of Technology (QUT)
1. Abstract Model of Ranking 1. 排名抽象模型
Text search or Information Retrieval (IR) is very different from
traditional search tasks since it often uses an inverted index (a special data structure) that depends on a ranking function.
In the lecture notes, formally we define a ranking function R:
( , ) = ∑ ! (! () ∗ ! ()) (1)
for a given query Q and a document D.
Documents are written in natural human languages, which are difficult for computers to analyze directly. The popular method to represent a document is to select a set of document features (or index terms), where a document feature is some attribute of the document we can express numerically. For example, we can group document features into two categories: Topical features (e.g., important keywords) and Quality features (e.g., meta information, such as days since last update of the document). 局部特征
In Eq (1), fi is a kind of feature function that extracts a number from the document D (e.g., the keywork’s frequency), and fi is a similar feature function that extracts a value from the query Q. Each pair of functions gi and fi is multiplied together, and the results from all pairs are added to create a final document score
Please note in practice, the query features gi (Q) are mostly zero
as Q is normally very short. This means that the sum for each
document is only over the non-zero gi (Q) values.
2. Inverted Index
倒索引在计算上相当于教科书后面的索引 。
An inverted index is the computational equivalent of the index
found in the back of a textbook.
通常,图书索引是按照索引词的字母顺序排列的,以帮助读者搜索。每个索引术语后面都有一个关于该术语或单词的页面列表。
Normally, the book index is arranged in alphabetical order by index terms to help readers’ search. Each index term is followed by a list of pages about that term or word.
Remember that our abstract model of ranking considers each document
to be composed of features. With an inverted index, each word in
the index corresponds to a document feature.
记住,我们的抽象排名模型认为每个文档 都是由特性组成的。使用倒序索引,索引 中的每个单词对应于一个文档特性。
It is not hard to understand the data structures for designing
inverted index in the lecture notes. The workshop for this part is
to construct inverted index in python, and then perform the query
process. 在课堂讲稿中,理解设计 倒排索引所需的数据结构并不难。这部分的研讨会是 在python中构造倒索引,然后执行查询 过程。
3. Query processing
Once an index is built, we need to process the data in it to
produce query results. Even with simple algorithms, processing
queries using an index is much faster than it is without one.
However, clever algorithms can boost query processing speed by ten
to a hundred times over the simplest versions. There are two query
processing techniques: document-at-a-time and term-at-a-time.
然而,聪明的算法可以将查询处理速度提高10倍 ,比最简单的版本提高100倍。有两种查询 处理技术:一次文档和一次术语。
Question 1. Which of the following is wrong? and justify your answer.
(1) A document feature is some attribute of the document we can express numerically.、 文档特性是可以用数字表示的文档的某些属性。
(2) A ranking function takes data from document features combined with the query and produces a score. 排名函数从文档特性中结合查询获取数据并生成分数。
(3) Topical features are the only features we can find in documents.
主题特征是我们能在文档中找到的唯一特征。
(4) If a document gets a high score, this means that the system thinks that document is a good match for the query, whereas lower numbers mean that the system thinks the document is a poor match for the query. 如果一个文档得到高分,这意味着系统认为该文档与查询很匹配,而较低的分数则意味着系统认为该文档与查询不匹配。
Solution: (3)
A document also has Quality features (e.g., meta information, such as days since last update of the document).
Question 2. Which of the following is wrong? and justify your answer.
(1) The index is inverted because usually we think of words being a part of documents, but if we invert this idea, documents are associated with words. 索引是反向的,因为我们通常认为单词是文档的一部分,但如果我们将这种想法反过来,则文档与单词相关联。
(2) In an inverted index that contains only document information, i.e., the features are binary, meaning they are 1 if the document contains a term, 0 otherwise. This inverted index contains enough information to tell if the document contains the exact phrase “tropical fish”.
在只包含文档信息的倒索引中,即特性是二进制的,这意味着如果文档包含一个项,则它们为1,否则为0。这个反向索引包含足够的信息来判断文档是否包含确切的短语“热带鱼”。
每个索引项都有自己的倒排列表,其中保存了该索引项的相关数据。每个列表条目称为发布,发布中指向特定文档或位置的部分通常称为指针
(3) Each index term has its own inverted list that holds the relevant data for that term. Each list entry is called a posting, and the part of the posting that refers to a specific document or location is often called a pointer.
(4) An extent is a contiguous region of a document. We can represent these extents using word positions. 区段是一个文档的连续区域。我们可以用单词位置表示这些范围。
Solution: (2)
Although a document that contains the words “tropical” and “fish” is likely to be relevant; however, we really want to know if the document contains the exact phrase “tropical fish”. To do this, we need to add position information to the index.
虽然载有“热带”和“鱼类”等词的文件可能是有关的;然而,我们真正想知道的是文件中是否包含了确切的短语“热带鱼”。为此,我们需要向索引添加位置信息。
Question 3. (Abstract Model of Ranking)
Assume the model of ranking contains a ranking function R(Q, D), which compares each document with the query and computes a score. Those scores are then used to determine the final ranked list.
An alternate ranking model might contain a different kind of ranking function, f(A, B, Q), where A and B are two different documents in the collection and Q is the query. When A should be ranked higher than B,f(A, B, Q) evaluates to 1. When A should be ranked below B,f(A, B, Q) evaluates to – 1. If you have a ranking function R(Q, D), show how you can use it in a system that requires one of the formf(A, B, Q).
For two documents A and B, we can definef(A, B, Q) = 1 if R(Q, A) > R(Q, B); otherwise, if R(Q, A) < R(Q, B),f(A, B, Q) = - 1.
( , , ) =
( , ) > ( , )
( , ) > ( , )
For the special case of R(Q, A) = R(Q, B), we may extend the definition to letf(A, B, Q) = 0, or assign 1 or – 1 tof(A, B, Q).
Question 4. (Query processing)
Inverted indexing is an efficient data structure to represent documents for information retrieval, where each index term is associated with an inverted list that contains a list of pairs of document number and count of the term occurrences in that document. The following table is an inverted index for 4 documents and their index terms.
Assume the abstract model of ranking is
wherefi is the document topical feature function (the value offi(D) is term ti ’s counts in document D) and gi is a query topical feature function (gi(Q) = 1 if term ti is in query Q; otherwise, gi(Q) = 0). Let query Q = {freshwater, fish, aquarium, tropical}. Calculate each document’s ranking score by using Term-at-a-time Algorithm. You are required to write the calculating process (or steps).
L1 – freshwater
L2 – aquarium
L4 – tropical
2:3 3:2 4:2
Partial scores
———————————————
Old Partial scores 1:1 4:1
L2 – 3:1
new Partial scores 1:1 3:1 4:1
———————————————
Old Partial scores 1:1 3:1 4:1
L3 – 1:2 2:3 3:2 4:2
new Partial scores 1:3 2:3 3:3 4:3
———————————————
Old Partial scores
final scores
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com