Week 8 Lecture Review Questions
Professor Yuefeng Li
School of Computer Science, Queensland University of Technology (QUT)
Information filtering introduction
Information Filtering (IF) is a name used to describe a variety of processes involving the delivery of information to people. In its early stages, IF focused on user profiles and was often called document filtering. Filtering is an alternative to the standard ad hoc search paradigm, in which users typically enter many different queries over time while the document collection stays relatively static. In filtering, the user's information need stays the same, but the document collection itself is dynamic, with new documents arriving periodically. The goal of filtering, then, is to identify (filter) the relevant new documents and send them to the user.
Filtering is also an example of a supervised learning task, where the profile can be identified from a set of training samples labelled by human experts (e.g., relevance judgements or relevance feedback) or by machines (pseudo feedback). Incoming documents are tested and then classified as "relevant" or "non-relevant", or ranked in descending order of relevance.
Static filtering profiles (used in the batch filtering and routing tasks of the TREC conference) are assumed not to change over time. Adaptive filtering is an alternative filtering technique that allows for dynamic profiles. We will discuss it later.
The TREC 2002 conference provided a filtering track for adaptive filtering. The following are the experimental settings used to evaluate different types of IF systems.
(A) Adaptive Filtering
The system starts with a set of topics, a document stream, and a small number of examples of relevant documents. For each topic, it must create an initial filtering profile, then analyse the incoming documents and make a binary decision to retrieve or not retrieve each document on a profile-by-profile basis. If a document is retrieved for a given profile and a relevance judgement exists for that document/topic pair, this judgement is provided to the system, and the information can be used to update the profile. This step is designed to simulate interactive user feedback.
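The loop below is a minimal, self-contained Python sketch of this adaptive process; the bag-of-words profile, the fixed threshold, the ±0.5 update rule and the toy data are illustrative assumptions, not part of the TREC specification.

def score(profile, doc):
    # Sum the profile weights of the terms that occur in the document.
    return sum(profile.get(t, 0.0) for t in doc)

def adaptive_filter(seed_docs, doc_stream, judgements, threshold=1.0):
    profile = {}                                  # initial profile from the seed examples
    for d in seed_docs:
        for t in d:
            profile[t] = profile.get(t, 0.0) + 1.0
    retrieved = []
    for i, doc in enumerate(doc_stream):          # documents arrive over time
        if score(profile, doc) >= threshold:      # binary retrieve decision
            retrieved.append(doc)
            if i in judgements:                   # a judgement may not exist
                delta = 0.5 if judgements[i] else -0.5
                for t in doc:                     # use the feedback to update the profile
                    profile[t] = profile.get(t, 0.0) + delta
    return retrieved

seeds = [["US", "ECONOM", "SPY"]]
stream = [["US", "BILL"], ["GERMAN", "VW"], ["US", "ECONOM", "ESPIONAG"]]
print(adaptive_filter(seeds, stream, judgements={0: True, 2: True}))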
(B) Batch Filtering
The system starts with a set of topics and a set of training documents for each topic which have been labelled as relevant or non-relevant. For each topic, it must use this information to create a filtering profile and a binary classification rule which will be applied to an incoming stream of new documents.
(C) Routing
The system starts with a set of topics and a set of training documents for each topic which have been labelled as relevant or non-relevant. For each topic, it must use this information to create a routing profile which assigns retrieval scores to the incoming documents. Systems are specifically prohibited from using any relevance judgements (including training data) for any other topics. The final output is the top 1000 ranked documents for each topic.
Rocchio-based information filtering model
The well-known Rocchio algorithm was based on the concept of an optimal query, which maximizes the difference between the average vector representing the relevant documents and the average vector representing the non-relevant documents. Given that only limited relevance information is typically available, the most common (and effective) form of the Rocchio algorithm modifies the initial weights in query vector Q to produce a new query Q’.
The Rocchio algorithm can be adapted to find useful topical features for information filtering. It provides a term weighting method to select a set of terms based on a training set D. Note that here we only have a training set; there is no initial query.
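One way to realise this is sketched below, assuming a centroid-difference variant of Rocchio: each term is weighted by the difference between its average length-normalised frequency in the relevant (D+) and non-relevant (D-) training documents. The beta/gamma coefficients and the normalisation are assumptions, not fixed by the lecture.

def rocchio_weights(pos_docs, neg_docs, beta=1.0, gamma=1.0):
    def centroid(docs):
        c = {}
        for d in docs:
            for t in d:
                c[t] = c.get(t, 0.0) + 1.0 / len(d)   # length-normalised term frequency
        return {t: v / len(docs) for t, v in c.items()}
    pos, neg = centroid(pos_docs), centroid(neg_docs)
    # Weight = beta * (average in relevant) - gamma * (average in non-relevant).
    return {t: beta * pos.get(t, 0.0) - gamma * neg.get(t, 0.0)
            for t in set(pos) | set(neg)}

pos = [["US", "US", "ECONOM", "SPY"], ["US", "BILL", "ECONOM", "ESPIONAG"]]
neg = [["GERMAN", "VW"], ["GERMAN", "MAN", "VW"]]
print(rocchio_weights(pos, neg))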
The ranking function

    relevance(d) = Σ_{t∈T} w(t) · τ(t, d)

used this week is different from the abstract model of ranking, since we do not have a query. It is also used in probabilistic information filtering models.
The ranking function can be interpreted using the total probability equation

    P(A) = Σ_n P(A | B_n) · P(B_n)

if we view T as {B_n} and d as A, where {B_n} is a finite partition of the sample space.
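As a small illustration, the ranking function might be coded as follows; taking τ(t, d) to be a binary presence indicator is an assumption (a term-frequency version would also fit the formula).

def relevance(d, w):
    # tau(t, d) is taken here as 1 if t occurs in d, else 0.
    return sum(w[t] for t in w if t in d)

weights = {"US": 1.2, "ECONOM": 0.8, "VW": -0.5}           # toy weights
print(relevance(["US", "US", "ECONOM", "SPY"], weights))   # 1.2 + 0.8 = 2.0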
Question 1. (Term frequency measure)
Assume Table 1 is a training set provided by user feedback. A term’s frequency measure is defined as follows:
Document   Terms                          Label (Relevant)
d1         GERMAN, VW
d2         US, US, ECONOM, SPY
d3         US, BILL, ECONOM, ESPIONAG
d4         US, ECONOM, ESPIONAG, BILL
d5         GERMAN, MAN, VW, ESPIONAG
d6         GERMAN, GERMAN, MAN, VW, SPY
d7         US, MAN, VW
Table 1. A training set from user feedback
Calculate term ESPIONAG’s frequency measure, and show your calculation steps:
Question 2. (Probabilistic information filtering model)
For a given training set D (a labelled dataset, e.g., a list of lists of terms, where each list of terms denotes a document), which consists of a set of relevant documents Rel and a set of non-relevant documents Nonrel, the BM25-based IF model uses the following equation to calculate a term's weight:
    w5(t) = log( ((r(t) + 0.5) / (R - r(t) + 0.5)) / ((n(t) - r(t) + 0.5) / ((N - n(t)) - (R - r(t)) + 0.5)) )
Provide equations to calculate N, R, r(t) and n(t), where an equation is a math formula or a python
definition.
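One possible answer, sketched in Python under the assumption that D is represented as a list of (terms, is_relevant) pairs, e.g. D = [(["US", "ECONOM"], True), (["GERMAN", "VW"], False)]:

from math import log

def N(D):      # total number of training documents
    return len(D)

def R(D):      # number of relevant training documents
    return sum(1 for _, rel in D if rel)

def n(t, D):   # number of training documents containing term t
    return sum(1 for terms, _ in D if t in terms)

def r(t, D):   # number of relevant training documents containing term t
    return sum(1 for terms, rel in D if rel and t in terms)

def w5(t, D):  # the BM25-based term weight defined above
    N_, R_, n_, r_ = N(D), R(D), n(t, D), r(t, D)
    return log(((r_ + 0.5) / (R_ - r_ + 0.5)) /
               ((n_ - r_ + 0.5) / ((N_ - n_) - (R_ - r_) + 0.5)))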
Question 3. (Pseudo relevance feedback)
Assume your user provides a query Q = {q1, q2, …, qm} and a document collection C = {d1, d2, …, dN} (an unlabelled dataset), and asks you to define a probabilistic IF model using pseudo relevance feedback techniques. After exploring C, you find that the distribution of terms in relevant documents is independent, and their distribution in non-relevant documents is independent. You also want to define probable relevance based only on the presence of search terms in documents. Describe your thoughts on how to design this IF model.
Tip: Review the Pseudo-Feedback Algorithm in a previous lecture on “IR models”.
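One way to approach this, sketched under the stated assumptions (the simple term-overlap initial ranking and the top_k cut-off are illustrative choices, not fixed by the question): rank C against Q, treat the top-k documents as a pseudo-relevant set, and then, since term distributions are independent and only term presence matters, weight each query term with the BM25-style presence weight from Question 2 and score documents by summing the weights of the query terms they contain.

from math import log

def pseudo_feedback_weights(Q, C, top_k=5):
    # Initial ranking: order C by the number of query terms each document contains.
    ranked = sorted(C, key=lambda d: sum(1 for q in Q if q in d), reverse=True)
    pseudo_rel = ranked[:top_k]                      # treat the top-k as "relevant"
    N, R = len(C), len(pseudo_rel)
    w = {}
    for t in set(Q):
        r_t = sum(1 for d in pseudo_rel if t in d)   # pseudo-relevant docs with t
        n_t = sum(1 for d in C if t in d)            # all docs with t
        # BM25-style presence weight with 0.5 smoothing, as in Question 2.
        w[t] = log(((r_t + 0.5) / (R - r_t + 0.5)) /
                   ((n_t - r_t + 0.5) / ((N - n_t) - (R - r_t) + 0.5)))
    return w

def relevance(d, w):
    # Probable relevance based only on the presence of search terms in d.
    return sum(w[t] for t in w if t in d)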
Mining patterns for information filtering
Artificial intelligence (AI) seeks to represent information about the world in a form that computer systems can use to solve complex tasks. A key component of an AI-based system is that it uses knowledge to formulate solutions, and IF-THEN rules are often used to represent that knowledge. However, in the early days of applying AI to real problems, people tried to build rule-based AI systems and found knowledge acquisition to be a problem: it is not easy for domain experts to describe their knowledge, nor do they want to spend a lot of time on such a task. This problem is therefore known as the knowledge acquisition bottleneck.
Data mining techniques provide alternative approaches to overcoming the knowledge acquisition bottleneck; this field is also known as KDD, knowledge discovery in databases (or datasets).
Association rule learning (or mining) is a widely used data mining method. It is a rule-based machine learning method for discovering interesting relationships (e.g., associations) between variables (or features) in large databases. It includes pattern mining and rule generation. In this unit, we only discuss how to use data
mining concepts in text analysis.
Question 4. Let T = {t1, t2, …, tk} be a set of terms and D be a training set of documents, which consists of a set of positive documents D+ and a set of negative documents D-. Assume X is a termset. Which of the following is false? Justify your answer (a Python sketch of these measures follows the list).
(1) X is a set of terms, that is, X ⊆ T.
(2) The covering set of X includes all documents d such that X ⊆ d, denoted as [X] = {d | d ∈ D, X ⊆ d}.
(3) X is called a frequent pattern if its support (|[X]|/|D+|) ≥ min_sup, a minimum support.
(4) X is called an interesting pattern if it is a frequent pattern and its confidence (|[X]|/NX) ≥ min_conf, a minimum confidence, where NX = |{d | d ∈ D, X ⊆ d}|.
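A minimal sketch of these measures, assuming documents are represented as Python sets of terms and, following statements (3) and (4), taking the covering set [X] over D+ when computing support:

def covering_set(X, docs):
    # All documents in `docs` that contain every term of X.
    return [d for d in docs if X <= d]

def is_interesting(X, D_pos, D_all, min_sup=0.5, min_conf=0.9):
    sup = len(covering_set(X, D_pos)) / len(D_pos)      # support, |[X]| / |D+|
    NX = len(covering_set(X, D_all))                    # NX = |{d in D : X ⊆ d}|
    conf = len(covering_set(X, D_pos)) / NX if NX else 0.0
    return sup >= min_sup and conf >= min_conf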
Question 5. (Pattern Mining)
Given the following complete transaction database for a training set of documents, where “yes” means relevant and “no” means non-relevant:
Transaction   Terms          Label
1             t1 t2 t3 t7    yes
2             t3 t4 t6       yes
3             t3 t4 t5 t6    yes
4             t3 t4 t5 t6    yes
5             t1 t2 t6 t7    yes
6             t1 t2 t7       yes
7             t1 t2          no
8             t1 t2 t5       no
9             t1 t3          no
10
(1) Write all interesting patterns if min_sup = 50% and min_conf = 90%.
(2) Write the covering sets of the patterns found in (1).
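As a starting point, the brute-force sketch below enumerates all termsets over the nine recoverable transactions (transaction 10's terms are missing above, so it is omitted), using the thresholds from part (1):

from itertools import combinations

pos = [{"t1","t2","t3","t7"}, {"t3","t4","t6"}, {"t3","t4","t5","t6"},
       {"t3","t4","t5","t6"}, {"t1","t2","t6","t7"}, {"t1","t2","t7"}]
neg = [{"t1","t2"}, {"t1","t2","t5"}, {"t1","t3"}]
all_docs = pos + neg
terms = sorted(set().union(*all_docs))

for k in range(1, len(terms) + 1):
    for X in map(set, combinations(terms, k)):
        cov = sum(1 for d in pos if X <= d)        # |[X]| over D+
        NX = sum(1 for d in all_docs if X <= d)    # NX over all of D
        if cov / len(pos) >= 0.5 and NX and cov / NX >= 0.9:
            print(sorted(X), "support:", cov / len(pos), "confidence:", cov / NX)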