Text, Web And Media Analytics
Information Filtering
1. Information filtering introduction
2. Rocchio-based information filtering model
tf*idf Term Weighting
Training algorithm
The Ranking Function
Testing Algorithm
3. Probabilistic information filtering models
Weighting Formulas
BM25 Training Algorithm
4. Mining patterns for information filtering
Pattern Mining
Relationship between Terms and Documents
Closed Patterns
Application of Patterns for Information Filtering
1. Information filtering introduction
Information Filtering (IF) is a name used to describe a variety of processes involving the delivery of information to people.
IF systems have a similar function to Information Retrieval (IR) systems.
Most IR systems support the short-term needs of a diverse set of users. IF systems, however, are commonly personalized to support long-term, relatively stable, or periodic goals or desires of a particular user or a group of users.
For example, in document filtering the user’s information need stays the same, but the document collection itself is dynamic, with new documents arriving periodically.
The main distinction between IR and IF is that IR systems use a “query”, whereas IF systems use a “user profile”:
Typically, user profiles represent long-term information interests or needs (e.g., keeping up-to-date on a topic) that may change slowly over time as conditions, goals, and knowledge change.
Information Filtering vs. Information Retrieval
It is very difficult to capture long-term information needs and represent them in a machine-readable format.
The easy way is to use a query, but a query doesn’t clearly describe what the user wants.
For example, for a given query {Java, C++, Oracle, Unix, program}, an IR model or a search engine often uses the query as a pattern to match relevant documents.
However, the user profile may be some knowledge about a topic (represented as a set of terms and/or a term-weight distribution).
For example, we may interpret “program” as referring to the topics (subjects) “programming language” or “programmer”.
With this in mind, we cannot simply treat “program” as a pattern to match; instead, we want to describe what we know about the topics “programming language” and “programmer” in order to determine the relevant documents.
Manually Acquiring Knowledge for Topics of Interest
(Example of long-term user information needs)
Domain experts provide descriptions and narratives for the topics of interest;
Build a concept model according to these descriptions and narratives.
Example of long-term user information needs cont.
Topic: economic espionage
Relevant subtopics: commercial espionage, corporate espionage, industrial espionage, technical espionage
Non-relevant subtopics: military espionage, political espionage
Manually Acquiring Knowledge for Topics of Interest cont.
It is difficult to write adequate descriptions and narratives.
Although linguists can provide appropriate descriptions and narratives, the corresponding conceptual model is still incomplete.
First, the linguists may ignore some important terms, for example, “spy” in this example.
Dictionaries are usually not very useful for expanding the set of terms, since we do not know the authors’ writing styles.
It is also quite common for the linguists and the dictionaries to ignore some relations between subtopics; for instance, we are not sure whether there is any overlap between technical espionage and industrial espionage.
Automatically Acquiring Knowledge from Relevance Feedback
The filtering task can be considered a supervised learning problem.
This method does not ask users to provide descriptions and narratives (there is no query).
It assumes that users can provide feedback (or that documents are labelled by domain experts) in the form of a training set D (a labelled dataset), which consists of a set of relevant documents D+ and a set of irrelevant documents D-; or
At least, they can provide a small set of relevant documents for their topics of interest in the beginning.
The objective here is to discover a set of features (e.g., term-weight pairs) to represent the knowledge about “Topics of Interest” (or user information needs).
Example 8.2: A Training Set D
where D = {d1, d2, …, d8}, D+ = {d1, d2, …, d6}, and D- = {d7, d8}.
Doc Keywords Rel (label)
d1 GERMANY VW chief wants boost return sales tenfold Yes
d2 USA US Congress moves curb economic spying Yes
d3 US Bill ECONOM ECONOM ESPIONAG ECONOM Yes
d4 US Economic espionage bill clears Senate Yes
d5 GERMANY FOCUS Two men held VW industrial espionage Yes
d6 GERMAN GERMAN MAN VW ESPIONAG Yes
d7 GERMAN VW GERMAN VM No
d8 US ECONOM No
2. Rocchio-based information filtering model
Review of the Rocchio algorithm as an IR (relevance feedback) model.
Optimal query: maximizes the difference between the average vector representing the relevant documents and the average vector representing the non-relevant documents.
Modifies the query according to
q' = α·q + β·(1/|D+|)·Σ_{d ∈ D+} d − γ·(1/|D−|)·Σ_{d ∈ D−} d
where α, β, and γ are parameters (typical values: 8, 16, 4).
tf*idf Term Weighting
It uses a tf*idf weight to describe the importance of a term t_k for the “Topics of Interest”.
Term frequency measures the importance of term t_k ∈ T in the relevant documents D+:
Inverse document frequency measures its importance in the whole collection D:
where N = |D|, and n_k = |{d_i ∈ D | t_k ∈ d_i}| is the number of documents that include t_k.
Term weight function:
where tf' is the term frequency of t_k in D−.
Training algorithm
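A minimal Python sketch of a Rocchio-style training step consistent with the tf*idf weighting above; the function name, data layout, and exact tf/idf variants are assumptions rather than the lecture's original algorithm.

```python
import math
from collections import Counter

def train(D_pos, D_neg):
    """Learn a profile {term: weight} from relevant (D+) and irrelevant (D-) documents.

    D_pos, D_neg: lists of documents, each document a list of terms.
    """
    N = len(D_pos) + len(D_neg)
    tf_pos = Counter(t for d in D_pos for t in d)            # tf: term frequency in D+
    tf_neg = Counter(t for d in D_neg for t in d)            # tf': term frequency in D-
    df = Counter(t for d in D_pos + D_neg for t in set(d))   # document frequency in D
    profile = {}
    for t in tf_pos:
        idf = math.log(N / df[t])                            # inverse document frequency
        # reward occurrences in D+, penalise occurrences in D-
        profile[t] = (tf_pos[t] - tf_neg.get(t, 0)) * idf
    return profile
```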
The Ranking Function
To obtain a ranking function, we should select a weighting function, e.g., w, and a set of terms T.
A ranking function for a document d can then be defined in terms of the selected weights, as sketched below.
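One plausible form, assuming the rank of d is the sum of the learned weights of the selected terms that occur in d:

rank(d) = \sum_{t \in T,\; t \in d} w(t)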
Testing Algorithm
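A minimal sketch of the testing (filtering) phase that scores incoming documents against the profile learned by the training sketch above; the threshold-based delivery rule is an assumption.

```python
def rank(profile, d):
    """Score an incoming document d (a list of terms) by summing its terms' profile weights."""
    return sum(profile.get(t, 0.0) for t in set(d))

def filter_stream(profile, documents, threshold=0.0):
    """Rank the incoming documents and deliver (return) those scoring above the threshold."""
    scored = [(rank(profile, d), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, d) for score, d in scored if score > threshold]
```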
3. Probabilistic information filtering models
Probabilistic models use relevance information to describe the importance of a term t_k for the “Topics of Interest”.
The individual term weight can be estimated based on two mutually exclusive independence assumptions:
I1: The distribution of terms in relevant documents is independent and their distribution in all documents is independent.
I2: The distribution of terms in relevant documents is independent and their distribution in non-relevant documents is independent.
There are also two methods, referred to as ordering principles, for presenting the result set:
O1: Probable relevance is based only on the presence of search terms in the documents.
O2: Probable relevance is based on both the presence of search terms in documents and their absence from documents.
Interpretations of Variables
Let N=|D| be the total number of documents in the training set D;
R be the number of relevant documents for a topic;
n(t) be the number of documents that contain term t; and
r(t) be the number of relevant documents that contain term t.
For training set D in Example 8.2, we have
N=8, R=6, n(US)= 4, and r(US)= 3.
Weighting Formulas
Choosing I1 and O1 yields the following weight.
Choosing I2 and O1 yields the following weight.
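A hedged reconstruction using the classical Robertson–Sparck Jones forms conventionally associated with these two combinations (the exact expressions are an assumption):

w_1(t) = \log \frac{r(t)/R}{n(t)/N} \qquad\qquad w_2(t) = \log \frac{r(t)/R}{\bigl(n(t) - r(t)\bigr)/(N - R)}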
P(R | t) ∝ P(t | D+) / P(t)   (by Bayes’ rule, dropping the constant P(R))
P(t | D+) = r(t)/R, and P(t) = n(t)/N
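As an illustration with the Example 8.2 values from the previous slide (t = US):

P(US \mid D^{+}) = \frac{r(US)}{R} = \frac{3}{6} = 0.5, \qquad P(US) = \frac{n(US)}{N} = \frac{4}{8} = 0.5, \qquad P(R \mid US) \propto \frac{0.5}{0.5} = 1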
Weighting Formulas cont.
Choosing I1 and O2 yields the following weight.
Choosing I2 and O2 yields the following weight.
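A hedged reconstruction using the classical forms conventionally associated with these two combinations (the exact expressions are an assumption):

w_3(t) = \log \frac{r(t)/\bigl(R - r(t)\bigr)}{n(t)/\bigl(N - n(t)\bigr)} \qquad\qquad w_4(t) = \log \frac{r(t)/\bigl(R - r(t)\bigr)}{\bigl(n(t) - r(t)\bigr)/\bigl(N - n(t) - R + r(t)\bigr)}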
The Best Probabilistic Formula
w4 is widely believed to be the most likely to yield the best results; it is the basis of the BM25 weighting.
The denominators in these fractions may be zero, and the training set provides only incomplete information; to account for both problems, 0.5 is added to each count.
The modified weighting function appears as:
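Assuming the standard 0.5-smoothed form of w4 (the usual BM25 relevance weight):

w(t) = \log \frac{\bigl(r(t) + 0.5\bigr)/\bigl(R - r(t) + 0.5\bigr)}{\bigl(n(t) - r(t) + 0.5\bigr)/\bigl(N - n(t) - R + r(t) + 0.5\bigr)}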
BM25 Training Algorithm
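A minimal Python sketch that estimates the smoothed w4 weights from a labelled training set; the function and variable names are illustrative assumptions, not the lecture's original algorithm.

```python
import math

def bm25_train(D_pos, D_neg):
    """Return {term: w4 weight with 0.5 smoothing} learned from D+ and D-.

    D_pos, D_neg: lists of documents, each document a set (or list) of terms.
    """
    N = len(D_pos) + len(D_neg)          # total number of training documents
    R = len(D_pos)                       # number of relevant documents
    terms = {t for d in D_pos for t in d}
    weights = {}
    for t in terms:
        r = sum(1 for d in D_pos if t in d)         # relevant documents containing t
        n = r + sum(1 for d in D_neg if t in d)     # all documents containing t
        weights[t] = math.log(((r + 0.5) / (R - r + 0.5)) /
                              ((n - r + 0.5) / (N - n - R + r + 0.5)))
    return weights
```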
4. Mining patterns for information filtering
A pattern is a set of terms (or keywords), or a set of term-frequency pairs; its support describes the extent to which the pattern is discussed in the training set.
Different from term-based approaches, which rely on individual terms or keywords, pattern-based approaches use patterns for text analysis.
People usually believe that patterns carry more semantic meaning than terms.
8.4.1 Pattern Mining
Let T = {t1, t2, …, tk} be a set of keywords (or terms), and
let D be a training set of documents, which consists of a set of positive documents D+ and a set of negative documents D-.
A set of terms X is referred to as a termset if X ⊆ T.
In Example 8.2, we have
T = {t1,t2,…,t7}
D = {d1,d2,…,d8}, where
D+ = {d1,d2,…,d6}
D- = {d7,d8}
Example 8.3
A Training Set: Binary Information Table vs. Transaction Database
Doc t1 t2 t3 t4 t5 t6 t7 Rel
d1 1 1 0 0 0 0 0 yes
d2 0 0 1 1 0 1 0 yes
d3 0 0 1 1 1 1 0 yes
d4 0 0 1 1 1 1 0 yes
d5 1 1 0 0 0 1 1 yes
d6 1 1 0 0 0 1 1 yes
d7 1 1 0 0 0 0 0 no
d8 0 0 1 1 0 0 0 no
Transaction Terms
1 t1 t2
2 t3 t4 t6
3 t3 t4 t5 t6
4 t3 t4 t5 t6
5 t1 t2 t6 t7
6 t1 t2 t6 t7
Please note the transaction database only includes relevant documents.
Frequent Patterns in Positive Documents
Let X be a termset; we use [X] to denote the covering set of X, which includes all relevant documents d such that X ⊆ d, that is,
[X] = {d | d ∈ D+, X ⊆ d}.
Given a termset X, its occurrence frequency is the number of positive documents that contain the termset, that is, |[X]|; and its support is |[X]| / |D+|.
A termset X is called a frequent pattern if its support ≥ min_sup, a minimum support threshold.
Please note patterns and frequent patterns are defined over the set of relevant documents.
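A minimal Python sketch of these definitions, applied to the positive documents of Example 8.3; the brute-force enumeration of termsets is an illustrative choice, not the lecture's mining algorithm.

```python
from itertools import combinations

def covering_set(X, D_pos):
    """[X]: the (1-based) indices of the positive documents that contain every term of X."""
    return {i for i, d in enumerate(D_pos, start=1) if X <= d}

def frequent_patterns(D_pos, min_sup=0.5):
    """Return {termset: support} for every termset whose support >= min_sup."""
    terms = sorted({t for d in D_pos for t in d})
    result = {}
    for k in range(1, len(terms) + 1):
        for X in map(frozenset, combinations(terms, k)):
            support = len(covering_set(X, D_pos)) / len(D_pos)
            if support >= min_sup:
                result[X] = support
    return result

# Positive documents d1..d6 of Example 8.3
D_pos = [{"t1", "t2"}, {"t3", "t4", "t6"}, {"t3", "t4", "t5", "t6"},
         {"t3", "t4", "t5", "t6"}, {"t1", "t2", "t6", "t7"}, {"t1", "t2", "t6", "t7"}]
print(frequent_patterns(D_pos))   # {t6} at 83%; nine termsets at 50%
```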
All Frequent Patterns in Example 8.3 Transaction Database
min_sup = 50%
Support (no. of docs) Termsets
83% (5) {t6}
50% (3) {t1}, {t2}, {t3}, {t4}, {t1, t2}, {t3, t4}, {t3, t6}, {t4, t6}, {t3, t4, t6}
Example 8.4
Patterns and Their Covering Sets (indexing)
Covering set [2,3,4,5,6]: {t6}
Covering set [2,3,4]: {t3}, {t4}, {t3, t4}, {t3, t6}, {t4, t6}, {t3, t4, t6}
Covering set [1,5,6]: {t1}, {t2}, {t1, t2}
Confidence of Frequent Patterns
The confidence of a frequent pattern X is the fraction of the documents in D containing X that are relevant: confidence(X) = |[X]| / |{d ∈ D | X ⊆ d}|. A frequent pattern X is called interesting if confidence(X) ≥ min_conf.
Please note we define interesting patterns based on their appearance in both the relevant documents D+ and all documents D.
Example 8.5
Complete Transaction Database
Doc t1 t2 t3 t4 t5 t6 t7 Rel
d1 1 1 0 0 0 0 0 yes
d2 0 0 1 1 0 1 0 yes
d3 0 0 1 1 1 1 0 yes
d4 0 0 1 1 1 1 0 yes
d5 1 1 0 0 0 1 1 yes
d6 1 1 0 0 0 1 1 yes
d7 1 1 0 0 0 0 0 no
d8 0 0 1 1 0 0 0 no
Transaction Terms Rel
1 t1 t2 yes
2 t3 t4 t6 yes
3 t3 t4 t5 t6 yes
4 t3 t4 t5 t6 yes
5 t1 t2 t6 t7 yes
6 t1 t2 t6 t7 yes
7 t1 t2 no
8 t3 t4 no
Interesting Patterns in Example 8.5
min_conf = 75%
Note that a pattern’s confidence is the same concept as the confidence of an association rule; for example, the pattern {t1, t2} corresponds to the rule
{t1, t2} -> yes.
Confidence Frequent Patterns
100% {t6}, {t3, t6}, {t4, t6}, {t3, t4, t6}
75% {t1}, {t2}, {t1, t2}, {t3}, {t4}, {t3, t4}
8.4.2 Relationship between Terms and Documents
Not all interesting patterns are meaningful. For example, the pattern {t3} has the same covering set as the larger pattern {t3, t4, t6}, so it carries no additional information.
Therefore, we want to keep only the larger patterns and prune the meaningless, shorter ones.
We now describe the condition for pruning meaningless patterns.
Given a termset X, its covering set [X] is a subset of the relevant documents D+.
Similarly, given a set of relevant documents Y, we can define its termset:
termset(Y) = {t | t ∈ T, ∀d ∈ Y: t ∈ d}.
Relationship between Terms and Documents cont.
[Figure: the covering-set operation [·] maps a termset in the term space to a set of documents, and termset(·) maps a set of documents back to the term space.]
Examples of Documents’ termsets for Example 8.5
Set of positive documents termset
[2,3,4] {t3 , t4 , t6 }
[1,5,6] {t1 , t2}
[2,3,4,5,6] {t6}
8.4.3 Closed Patterns
Given an interesting pattern X, its closure
C(X)= termset([X]).
Properties of the closure operation
X1 ⊆ X2 ⇒ C(X1) ⊆ C(X2).
C(X) ⊇ X for all patterns X.
An interesting pattern X is closed iff C(X)= X.
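A minimal Python sketch of the closure operation, reusing covering_set and D_pos from the pattern-mining sketch earlier; the helper names are assumptions.

```python
def termset_of(Y, D_pos):
    """termset(Y): the terms that occur in every positive document whose index is in Y."""
    docs = [frozenset(d) for i, d in enumerate(D_pos, start=1) if i in Y]
    return frozenset.intersection(*docs) if docs else frozenset()

def closure(X, D_pos):
    """C(X) = termset([X])."""
    return termset_of(covering_set(X, D_pos), D_pos)

def is_closed(X, D_pos):
    """An interesting pattern X is closed iff C(X) = X."""
    return closure(X, D_pos) == frozenset(X)

print(is_closed({"t3"}, D_pos))              # False: C({t3}) = {t3, t4, t6}
print(is_closed({"t3", "t4", "t6"}, D_pos))  # True
```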
Closure Operation and Closed Patterns cont.
[Figure: the closure operation maps a pattern X in the term space to its closure C(X) = termset([X]), also in the term space.]
Example 8.6
Pruning Non-closed Patterns to Obtain Discovered Patterns
Closed patterns (kept): {t6} with covering set [2,3,4,5,6]; {t3, t4, t6} with covering set [2,3,4]; {t1, t2} with covering set [1,5,6].
Non-closed patterns (pruned): {t1}, {t2}, {t3}, {t4}, {t3, t4}, {t3, t6}, {t4, t6}.
8.4.4 Application of Patterns for Information Filtering
There are two ways to use the discovered patterns:
Weighting patterns
This approach evaluates a weight for each pattern and views patterns as atoms, for example, w(P) = confidence(P), or w(P) = support(P) * confidence(P).
For each incoming document, the filtering algorithm first determines which patterns appear in the document; it then sums the weights of the patterns that appear (see the sketch after this list).
Mapping patterns onto the term space
This approach views terms as atoms and deploys the discovered patterns onto the term space.
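A minimal sketch of the first option (weighting patterns), assuming w(P) = support(P) * confidence(P) and that the support and confidence dictionaries come from the mining stage:

```python
def score_document(doc_terms, patterns, support, confidence):
    """Sum the weights of the discovered patterns that appear in an incoming document.

    patterns: iterable of frozensets; support, confidence: dicts keyed by those frozensets.
    """
    doc = set(doc_terms)
    return sum(support[P] * confidence[P] for P in patterns if P <= doc)
```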
Pattern Deploying – Mapping Patterns onto the Term Space
The discovered (closed) patterns of Example 8.6 are deployed onto the term space {t1, t2, t3, t4, t5, t6}, where
P1 = {t6}
P2 = {t1, t2}
P3 = {t3, t4, t6}
and {P1, P2, P3} is the set of discovered patterns.
Mapping Patterns onto the Term Space cont.
w(t1) = w(t2) = (support(P2) * confidence(P2)) / 2
w(t3) = w(t4) = (support(P3) * confidence(P3)) / 3
w(t6) = (support(P3) * confidence(P3)) / 3 + (support(P1) * confidence(P1)) / 1
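A minimal sketch of pattern deploying consistent with the worked weights above, assuming each pattern's weight support(P) * confidence(P) is shared equally among its terms:

```python
from collections import defaultdict

def deploy(patterns, support, confidence):
    """Map the discovered patterns onto the term space, returning {term: weight}."""
    w = defaultdict(float)
    for P in patterns:
        share = support[P] * confidence[P] / len(P)   # equal share for each term of P
        for t in P:
            w[t] += share
    return dict(w)
```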
Patterns with Term Weights
One important factor is missing from the closed patterns:
term frequency.
This factor is very important for information retrieval and search engines.
Let tf(t, d) be the number of occurrences of term t in document d.
P = {(t, f) | t ∈ T, f = tf(t, d) > 0} is referred to as a pattern (of document d).
support(P) describes the extent to which the pattern is discussed in the training set: the greater the support is, the more important the pattern is.
Let termset(P) = {t | (t, f) ∈ P} be the termset of P.
Composition Operation
We assume that a pattern P1 equals a pattern P2 if and only if termset(P1) = termset(P2).
Therefore, we should at least compose patterns that have the same termset.
Let P1 and P2 be two patterns with the same termset. We call P1 ⊕ P2 the composition of P1 and P2, which satisfies
P1 ⊕ P2 = {(t, f1 + f2) | (t, f1) ∈ P1, (t, f2) ∈ P2} and support(P1 ⊕ P2) = support(P1) + support(P2).
Example 8.7
Weights in Patterns
* where P7 is generated by composing P3 and P4, and P8 is generated by composing P5 and P6 if we view “SPY is_a ESPIONAGE”.
Name Support Pattern
P1 1 {(GERMAN, 1), (VW, 1)}
P2 1 {(US, 2), (ECONOMY, 1), (SPY, 1)}
P3 1 {(US, 1), (BILL, 1), (ECONOMY, 1), (ESPIONAGE, 1)}
P4 1 {(US, 1), (ECONOMY, 1), (ESPIONAGE, 1), (BILL, 1)}
P5 1 {(GERMAN, 1), (MAN, 1), (VW, 1), (ESPIONAGE, 1)}
P6 1 {(GERMAN, 2), (MAN, 1), (VW, 1), (SPY, 1)}
*P7 2 {(US, 2), (BILL, 2), (ECONOMY, 2), (ESPIONAGE, 2)}
*P8 2 {(GERMAN, 3), (MAN, 2), (VW, 2), (ESPIONAGE, 2)}
Examples: Termsets of Patterns
termset(P1) = {GERMAN, VW}
termset(P2) = {US, ECONOMY, SPY}  (with SPY treated as ESPIONAGE)
termset(P7) = {US, BILL, ECONOMY, ESPIONAGE}
termset(P8) = {GERMAN, MAN, VW, ESPIONAGE}
Examples: Relationship between Patterns
The set of patterns = {P1, P2, P7, P8}.
Normal Forms of Patterns
Let P = {(t1, f1), (t2, f2), …, (tr, fr)}.
Its normal form is
{(t1, w1), (t2, w2), …, (tr, wr)}, where
w_i = f_i / (f_1 + f_2 + … + f_r) for all 1 ≤ i ≤ r.
We call (w1, w2, …, wr) the weight distribution of P, and we have w_1 + w_2 + … + w_r = 1.
Normalize support
The support can be normalized so that it becomes a function support: {P1, …, Pm} -> [0, 1], with
support(P_i) := support(P_i) / (support(P_1) + … + support(P_m)), so that the normalized supports sum to 1.
Example 8.8
Normalization
Note: “SPY” has been replaced by “ESPIONAGE”.
Name Support Normal Form
P1 1/6 {(GERMAN, 1/2), (VW, 1/2)}
P2 1/6 {(US, 1/2), (ECONOMY, 1/4), (ESPIONAGE, 1/4)}
P7 1/3 {(US, 1/4), (BILL, 1/4), (ECONOMY, 1/4),
(ESPIONAGE, 1/4)}
P8 1/3 {(GERMAN, 1/3), (MAN, 2/9), (VW, 2/9),
(ESPIONAGE, 2/9)}
Deduced Probability
To obtain a relevance function, we first calculate a deduced probability function pr by using the normalized support function and the normal forms of the patterns; it assigns a probability pr(t) to every term t ∈ T.
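A plausible form, assuming pr(t) sums, over the set DP of discovered patterns, each pattern's normalized support multiplied by the weight of t in that pattern's normal form:

pr(t) = \sum_{P \in DP,\ (t, w) \in P} support(P) \cdot w \qquad \text{for all } t \in T

Under this assumption, with the Example 8.8 values, pr(ESPIONAGE) = (1/6)(1/4) + (1/3)(1/4) + (1/3)(2/9) = 43/216 ≈ 0.199.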
References
S. Robertson and J. Callan, Guidelines for the TREC 2002 Filtering Track, https://trec.nist.gov/data/filtering/T11filter_guide.html
N. Zhong, Y. Li, and S.-T. Wu, “Effective Pattern Discovery for Text Mining”. IEEE Transactions on Knowledge and Data Engineering, 2012, Vol. 24 No. 1, pages 30-44.
Y. Li, A. Algarni, M. Albathan, Y. Shen and M. Bijaksana, Relevance Feature Discovery for Text Mining, IEEE Transactions on Knowledge and Data Engineering, 2015, vol. 27, no. 6, pages 1656-1669.
Next week: text classification –
Chapter 9 (Section 9.1) of the textbook: W. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice, Pearson, 2010.