A total of FORTY-FIVE (45) MARKS are available on the examination paper. Please attempt all thirteen (13) questions.
QUESTION 1 (Short Answer)
[Total marks for Question 1 = 2]
Which of the following statements is false? Explain why it is false.
(a) Stemming is a component of text processing that captures the relationships between
different variations of a word.
(b) An algorithmic stemmer has no logic of its own, but instead relies on pre-created
dictionaries of related terms to store term relationships.
(c) Stemming reduces the different forms of a word that occur because of inflection (e.g.,
plurals, tenses) or derivation (e.g., making a verb into a noun by adding a suffix)
to a common stem.
(d) In general, using a stemmer for search applications with English text produces a small
but noticeable improvement in the quality of results.
QUESTION 2 (Short Answer)
[Total marks for Question 2 = 2]
Which of the following descriptions is wrong? Justify your answer.
(a) In the query likelihood retrieval model, documents are ranked by the probability that the query text could be generated by the document language model.
(b) To use the unigram language model to estimate P(Q|D), the document D’s score for the given query Q = {q1, q2, …, qn}, the language model probabilities P(qi|D) (i = 1, 2, …, n) need to be estimated.
(c) The major problem with the estimate of P(qi|D) is that if any of the query words are missing from the document, the score given by the query likelihood model for P(Q|D) will be zero.
(d) The smoothing technique is used to estimate the difference between two probability distributions.
QUESTION 3 (Short Answer)
[Total marks for Question 3 = 2]
Which of the following is false? Justify your answer.
(a) A typical query log contains user identifier or user session identifier, query terms, list of
URLs of results, their ranks on the result list, and whether they were clicked on, and
timestamp(s).
(b) Many user actions (e.g., Query log data) can be considered as explicit relevance
judgments.
(c) For a very large data collection, a pooling technique is used to select the top-k results
from the rankings obtained by different retrieval algorithms. The results are merged into a pool, duplicates are removed, and the documents are presented in some random order to the people doing the relevance judgments.
(d) Pooling produces a large number of relevance judgments for each query. However, this list is incomplete, and for a new retrieval algorithm that had not contributed documents to the original pool, this could potentially be a problem.
IFN647T1.221 cont/…
QUESTION 4 (Short Answer)
[Total marks for Question 4 = 2]
Let T = {t1, t2, …, tk} be a set of terms, and D be a training set of documents, which consists of a set of positive documents D+ and a set of negative documents D−. Assume X is a termset. Which of the following definitions is false? Justify your answer.
(a) X is a set of terms, that is, X ⊆ T.
(b) The covering set of X includes all relevant documents d such that X ⊆ d, denoted as [X] = {d | d ∈ D+, X ⊆ d}.
(c) X is called a frequent pattern if its support (|[X]|/|D+|) ≥ min_sup, a minimum support.
(d) X is called an interesting pattern if it is a frequent pattern and its confidence (|[X]|/NX) ≥ min_conf, a minimum confidence, where NX = |{d | d ∈ D+, X ⊆ d}|.
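These definitions can be illustrated with a minimal Python sketch (the documents and termset below are invented for illustration; NX is computed over whichever document set is supplied):

```python
# Sketch: covering set, support and confidence of a termset X.
# All data here is illustrative, not from the paper.

def covering_set(X, D_pos):
    """[X] = {d | d in D+, X is a subset of d}."""
    return [d for d in D_pos if X <= d]

def support(X, D_pos):
    """|[X]| / |D+|."""
    return len(covering_set(X, D_pos)) / len(D_pos)

def confidence(X, D_pos, docs):
    """|[X]| / NX, where NX counts documents in docs that contain X."""
    n_x = len([d for d in docs if X <= d])
    return len(covering_set(X, D_pos)) / n_x

# Toy training set: documents represented as sets of terms.
D_pos = [{'t1', 't2'}, {'t1', 't2', 't3'}, {'t2', 't3'}]
D_neg = [{'t3'}, {'t1', 't3'}]
X = {'t1', 't2'}

print(support(X, D_pos))                    # covers 2 of 3 positive docs
print(confidence(X, D_pos, D_pos + D_neg))
```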
QUESTION 5 (Short Answer)
[Total marks for Question 5 = 2]
Which of the following is false for sentiment analysis? Justify your answer.
(a) Sentiment analysis (opinion mining) discovers users’ opinions about products or services in online reviews or feedback, or observes trends in public mood through the analysis of clinical records.
(b) The orientation is the opinion provided about the entity and/or the aspect that was provided by the opinion holder at a specific time.
(c) The goal of subjectivity classification is to classify a piece of text according to a predefined set of basic emotions.
(d) Aspects are features, components or functions of the entity. They can be nouns and/or noun phrases.
(e) Polarity classification is to group the expressed opinion in a document, a sentence, or an entity feature/aspect into positive, negative, or neutral regions.
QUESTION 6 (n-grams)
[Total marks for Question 6 = 4]
Typically, n-grams are formed from overlapping sequences of words, i.e., by moving an n-word “window” one word at a time through a document. For example, bigrams are 2-word sequences, and trigrams are 3-word sequences.
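The sliding-window construction can be sketched in Python (a minimal illustration; the function name ngrams is not part of the paper):

```python
# Sketch: form n-grams by sliding an n-word window one word at a time.
def ngrams(words, n):
    return [words[i:i + n] for i in range(len(words) - n + 1)]

words = ['Tropical', 'fish', 'are', 'generally']
print(ngrams(words, 2))  # bigrams: three overlapping 2-word windows
print(ngrams(words, 3))  # trigrams: two overlapping 3-word windows
```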
Tropical fish are described in the following document:
Tropical fish are generally those fish found in aquatic tropical environments around the world, including both freshwater and saltwater species. Fishkeepers often keep tropical fish in freshwater and saltwater aquariums.
Assume this document is represented as a list of words W = ['Tropical', 'fish', 'are', 'generally', 'those', 'fish', 'found', 'in', 'aquatic', 'tropical', 'environments', 'around', 'the', 'world', 'including', 'both', 'freshwater', 'and', 'saltwater', 'species', 'Fishkeepers', 'often', 'keep', 'tropical', 'fish', 'in', 'freshwater', 'and', 'saltwater', 'aquariums']
(a) The following Python function takes two parameters gram and keywords, where gram is an n-gram and keywords is a list of words.

def is_highlight(gram, keywords):
    for y in keywords:
        if y in gram:
            return True
    return False
Explain what task this function performs and what its output is.
(b) Design Python list comprehensions to print all trigrams of the above document that contain at least one of the key words in keywords = ['fish', 'tropical', 'freshwater', 'saltwater', 'aquariums']. You can use the above Python definitions.
QUESTION 7 (Inverted Indexing)
[Total marks for Question 7 = 4]
Inverted indexing is an efficient data structure to represent documents for information retrieval, where each index term is associated with an inverted list that contains a list of pairs of document number and count of the term occurrences in that document. The following table is an inverted index for 4 documents and their index terms.
Assume the abstract model of ranking is

R(Q, D) = Σi gi(Q) · fi(D)

where fi is the document topical feature function (the value of fi(D) is the number of occurrences of term ti in document D) and gi is a query topical feature function (gi(Q) = 1 if term ti is in query Q; otherwise, gi(Q) = 0).
Let query Q = {fish, aquarium, tropical, saltwater}. Calculate each document’s ranking score using the document-at-a-time algorithm. You need to write down the calculation process.
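Since the inverted index table itself did not survive in this copy, the document-at-a-time computation can only be sketched with a hypothetical index (all postings below are invented; the scoring follows the abstract model, summing gi(Q) · fi(D) over terms):

```python
# Sketch of document-at-a-time scoring under the abstract ranking model:
# score(D) = sum over query terms t of f_t(D), since g_t(Q) = 1 for them.
# The inverted index below is HYPOTHETICAL, not the table from the paper.
index = {
    'fish':      {1: 2, 2: 3, 3: 1},   # term -> {doc_id: term count}
    'aquarium':  {2: 1},
    'tropical':  {1: 2, 3: 2},
    'saltwater': {1: 1, 4: 1},
}
query = {'fish', 'aquarium', 'tropical', 'saltwater'}

def daat_scores(index, query, doc_ids):
    scores = {}
    for d in doc_ids:  # process one document at a time, over all its lists
        scores[d] = sum(index[t].get(d, 0) for t in query if t in index)
    return scores

print(daat_scores(index, query, [1, 2, 3, 4]))
```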
QUESTION 8 (Information Retrieval)
Let Q = {ECONOM, ESPIONAG, VW, GERMAN} be a query, and C = {D1, D2, D3, D4, D5, D6}
be a collection of documents, where
D1 = {GERMAN, VW}
D2 = {US, US, ECONOM, SPY}
D3 = {US, BILL, ECONOM, ESPIONAG}
D4 = {US, ECONOM, ESPIONAG, BILL}
D5 = {GERMAN, MAN, VW, ESPIONAG}
D6 = {GERMAN, GERMAN, MAN, VW, SPY}
The Jelinek-Mercer smoothing method uses the following equation to calculate a score for a document:

log P(Q|D) = Σ(i=1..n) log P(qi|D)

where n is the number of query terms in Q, and

P(qi|D) = (1 − λ) fqi,D/|D| + λ cqi/|C|

for all query terms qi in Q, where fqi,D is the number of times query word qi occurs in document D, |D| is the number of word occurrences in D, cqi is the number of times query word qi occurs in collection C, and |C| is the total number of word occurrences in collection C. Assume parameter λ = 0.4, calculate P(Q|D6). You need to write down the calculation process.
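A Python sketch of this calculation (using the collection C and query Q defined above; the score is computed as the product of the smoothed term probabilities):

```python
# Sketch: Jelinek-Mercer smoothed query likelihood for D6.
# P(qi|D) = (1 - lam) * f_{qi,D}/|D| + lam * c_{qi}/|C|
C = [['GERMAN', 'VW'],
     ['US', 'US', 'ECONOM', 'SPY'],
     ['US', 'BILL', 'ECONOM', 'ESPIONAG'],
     ['US', 'ECONOM', 'ESPIONAG', 'BILL'],
     ['GERMAN', 'MAN', 'VW', 'ESPIONAG'],
     ['GERMAN', 'GERMAN', 'MAN', 'VW', 'SPY']]
Q = ['ECONOM', 'ESPIONAG', 'VW', 'GERMAN']
lam = 0.4

def p_q_given_d(Q, D, C, lam):
    size_C = sum(len(d) for d in C)        # |C| = 23 for this collection
    score = 1.0
    for q in Q:
        f_qD = D.count(q)                  # f_{qi,D}
        c_q = sum(d.count(q) for d in C)   # c_{qi}
        score *= (1 - lam) * f_qD / len(D) + lam * c_q / size_C
    return score

print(p_q_given_d(Q, C[5], C, lam))  # P(Q|D6)
```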
[Total marks for Question 8 = 4]
QUESTION 9 (Evaluation)
[Total marks for Question 9 = 5]
Assume Table 1 illustrates the relevant judgments for documents 110 to 124 about topic 101, where 1 means the corresponding document is relevant and 0 means non-relevant; and Table 2 shows an IR model’s outputs, where documents are ranked according to their weights.
Assume A is the relevant set of documents for topic 101, Ac is the non-relevant set. We also assume the IR model selects top-8 documents as the relevant documents B (i.e., the set of retrieved documents), and Bc is the set of documents that are not retrieved.
(a) List all four sets’ elements (documents’ DocNo) and enclose them in curly braces.
(b) For the set of retrieved documents B, calculate recall and precision of the IR model.
(c) For ranking#1, calculate the IR model’s average precision.
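A Python sketch of these calculations (the relevance judgments and the ranked output are transcribed from Tables 1 and 2; B is the top-8 retrieved set):

```python
# Sketch: recall, precision and average precision for ranking#1.
# rel: relevance judgments (Table 1); ranking: ranked DocNos (Table 2).
rel = {110: 0, 111: 1, 112: 0, 113: 1, 114: 1, 115: 0, 116: 1, 117: 0,
       118: 0, 119: 1, 120: 1, 121: 0, 122: 1, 123: 1, 124: 0}
ranking = [111, 112, 113, 110, 114, 119, 115, 122,
           118, 116, 123, 121, 120, 117, 124]

A = {d for d, r in rel.items() if r == 1}  # relevant set
B = set(ranking[:8])                       # top-8 retrieved set

recall = len(A & B) / len(A)
precision = len(A & B) / len(B)

# Average precision: mean of precision@k at each relevant position.
hits, prec_sum = 0, 0.0
for k, d in enumerate(ranking, start=1):
    if rel[d] == 1:
        hits += 1
        prec_sum += hits / k
avg_prec = prec_sum / len(A)

print(recall, precision, round(avg_prec, 3))
```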
Table 1. Relevant Judgements

TOPIC  DocNo  Rel
101    110    0
101    111    1
101    112    0
101    113    1
101    114    1
101    115    0
101    116    1
101    117    0
101    118    0
101    119    1
101    120    1
101    121    0
101    122    1
101    123    1
101    124    0

Table 2. Ranked output (ranking#1) of the IR model

TOPIC  DocNo
101    111
101    112
101    113
101    110
101    114
101    119
101    115
101    122
101    118
101    116
101    123
101    121
101    120
101    117
101    124

QUESTION 10 (Mutual Information)
[Total marks for Question 10 = 4]
For two words (or terms) a and b, their mutual information measure (MIM) in a collection of documents C is defined as

MIM(a, b) = P(a, b) / (P(a) × P(b))

where P(a) is the probability that word a occurs in a document in C, P(b) is the probability that word b occurs in a document in C, and P(a, b) is the probability that a and b occur in the same document in C.
Assume a function f(docs, a) is defined to calculate the number of documents that contain word a, and a function f2(docs, a, b) returns the number of documents that contain both words a and b, where docs is a set of documents.
(a) Write mathematical formulas (or equivalent Python definitions) to compute P(a), P(b) and P(a, b) in terms of C, f and f2.
2 marks
(b) Write a mathematical formula (or an equivalent Python definition) in terms of C, f and f2 to calculate their MIM for any two words a and b.
QUESTION 11 (Information Filtering (IF))
[Total marks for Question 11 = 4]
For a given training set C (or a labelled dataset, for example, a list of lists of terms, where each list of terms denotes a document), assume C consists of a set of relevant documents Cp and a set of non-relevant documents Cn. A BM25-based IF model uses the following equation to calculate a term’s weight:

w5(t) = log( ((r(t) + 0.5) / (R − r(t) + 0.5)) / ((n(t) − r(t) + 0.5) / ((N − n(t)) − (R − r(t)) + 0.5)) )

(a) Explain the meaning of N, R, r(t) and n(t) in the above equation.
(b) Write mathematical formulas to compute N, R, r(t) and n(t). If you are not confident with your mathematical formula, you can use an equivalent Python definition.
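Both formulas can be illustrated with a Python sketch (the toy document sets below are invented; f and f2 are implemented directly from the descriptions in Question 10):

```python
import math

# Sketch: MIM and the BM25-based weight w5(t), using toy data.
def f(docs, a):
    """Number of documents in docs containing word a."""
    return sum(1 for d in docs if a in d)

def f2(docs, a, b):
    """Number of documents in docs containing both a and b."""
    return sum(1 for d in docs if a in d and b in d)

def mim(C, a, b):
    """MIM(a, b) = P(a, b) / (P(a) * P(b))."""
    p_a, p_b = f(C, a) / len(C), f(C, b) / len(C)
    p_ab = f2(C, a, b) / len(C)
    return p_ab / (p_a * p_b)

def w5(t, Cp, Cn):
    """BM25-based term weight from the equation above."""
    C = Cp + Cn
    N, R = len(C), len(Cp)            # N documents, R relevant ones
    n_t, r_t = f(C, t), f(Cp, t)      # containing t: overall / relevant
    return math.log(((r_t + 0.5) / (R - r_t + 0.5)) /
                    ((n_t - r_t + 0.5) / ((N - n_t) - (R - r_t) + 0.5)))

# Toy data (invented for illustration):
Cp = [{'a', 'b'}, {'a'}]
Cn = [{'b'}, {'c'}]
print(mim(Cp + Cn, 'a', 'b'))
print(w5('a', Cp, Cn))
```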
QUESTION 12 (Distance Function)
[Total marks for Question 12 = 4]
Table 3 shows an example of a binary information table which includes 6 documents.
Table 3. A binary information table
d1  1100000
d2  0011010
d3  0011110
d4  0011110
d5  1100011
d6  1100011
(a) Draw the binary contingency table of Table 3 for comparing documents d4 and d5.
(b) Evaluate the Jaccard distance between d4 and d5.
1 mark
(c) Evaluate the Simple Matching Coefficient distance between d4 and d5.
1 mark
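A Python sketch of these distance computations (d4 and d5 are taken from Table 3; a, b, c, d are the cells of the binary contingency table):

```python
# Sketch: contingency counts, Jaccard distance and simple matching
# coefficient (SMC) distance for two binary vectors.
d4 = [0, 0, 1, 1, 1, 1, 0]
d5 = [1, 1, 0, 0, 0, 1, 1]

def contingency(x, y):
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)  # both 1
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)  # x only
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)  # y only
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)  # both 0
    return a, b, c, d

def jaccard_distance(x, y):
    a, b, c, _ = contingency(x, y)
    return 1 - a / (a + b + c)

def smc_distance(x, y):
    a, b, c, d = contingency(x, y)
    return 1 - (a + d) / (a + b + c + d)

print(contingency(d4, d5))
print(jaccard_distance(d4, d5))
print(smc_distance(d4, d5))
```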
QUESTION 13 (Classification)
[Total marks for Question 13 = 6]
Let C be a set of classes and D be a training set, where each element in D is a pair (dj, cj), dj is a document and cj is its class. The training procedure of Rocchio classification for the input training set D uses centroids to represent classes. The centroid of a class is computed as the average vector of class members.
For the given topic “R101”, we have obtained the following training set which includes seven documents, ten selected features (terms) (VM, US, Spy, Sale, Man, GM, Espionag, Econom, Chief, and Bill) and the corresponding term weights, where Class = 1 means relevant, and Class = -1 means irrelevant.
Doc      VM    US    Spy   Sale  Man   GM    Espionag  Econom  Chief  Bill  Class
'39496'  0.17  0.01  0.20  0.00  0.02  0.20  0.12      0.11    0.00   0.00  1
'46547'  0.10  0.21  0.10  0.00  0.00  0.00  0.22      0.20    0.00   0.10  1
'46974'  0.00  0.23  0.10  0.20  0.05  0.10  0.10      0.10    0.01   0.00  1
'62325'  0.17  0.01  0.20  0.00  0.02  0.20  0.12      0.11    0.00   0.00  1
'6146'   0.10  0.00  0.00  0.30  0.10  0.20  0.00      0.12    0.10   0.00  -1
'18586'  0.00  0.30  0.00  0.30  0.20  0.00  0.00      0.20    0.15   0.20  -1
'22170'  0.20  0.00  0.00  0.15  0.20  0.25  0.00      0.00    0.00   0.10  -1
Assume the above training set is represented as two Python lists as follows, where X_train is the list of document vectors of documents in D, and y_train is the list of the corresponding classes.
X_train = [
    [0.17, 0.01, 0.20, 0.00, 0.02, 0.20, 0.12, 0.11, 0.00, 0.00],
    [0.10, 0.21, 0.10, 0.00, 0.00, 0.00, 0.22, 0.20, 0.00, 0.10],
    [0.00, 0.23, 0.10, 0.20, 0.05, 0.10, 0.10, 0.10, 0.01, 0.00],
    [0.17, 0.01, 0.20, 0.00, 0.02, 0.20, 0.12, 0.11, 0.00, 0.00],
    [0.10, 0.00, 0.00, 0.30, 0.10, 0.20, 0.00, 0.12, 0.10, 0.00],
    [0.00, 0.30, 0.00, 0.30, 0.20, 0.00, 0.00, 0.20, 0.15, 0.20],
    [0.20, 0.00, 0.00, 0.15, 0.20, 0.25, 0.00, 0.00, 0.00, 0.10]
]
y_train = [1, 1, 1, 1, -1, -1, -1]
Design a Python function train_Rocchio(X_train, y_train) to calculate the centroid of the relevant class (named 'R101') and the centroid of the irrelevant class (named 'iR101'), respectively. For example, using the above table, we have 'iR101' = [0.100, 0.100, 0.000, 0.250, 0.167, 0.150, 0.000, 0.107, 0.083, 0.100].
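One possible sketch of train_Rocchio (centroids computed as per-dimension averages of the class members, matching the example output; the training data is repeated here so the sketch is self-contained):

```python
# Sketch: Rocchio training - compute per-class centroids as the
# average vector of the class members.
X_train = [
    [0.17, 0.01, 0.20, 0.00, 0.02, 0.20, 0.12, 0.11, 0.00, 0.00],
    [0.10, 0.21, 0.10, 0.00, 0.00, 0.00, 0.22, 0.20, 0.00, 0.10],
    [0.00, 0.23, 0.10, 0.20, 0.05, 0.10, 0.10, 0.10, 0.01, 0.00],
    [0.17, 0.01, 0.20, 0.00, 0.02, 0.20, 0.12, 0.11, 0.00, 0.00],
    [0.10, 0.00, 0.00, 0.30, 0.10, 0.20, 0.00, 0.12, 0.10, 0.00],
    [0.00, 0.30, 0.00, 0.30, 0.20, 0.00, 0.00, 0.20, 0.15, 0.20],
    [0.20, 0.00, 0.00, 0.15, 0.20, 0.25, 0.00, 0.00, 0.00, 0.10]
]
y_train = [1, 1, 1, 1, -1, -1, -1]

def train_Rocchio(X_train, y_train):
    pos = [x for x, y in zip(X_train, y_train) if y == 1]
    neg = [x for x, y in zip(X_train, y_train) if y == -1]
    n_dims = len(X_train[0])
    return {
        'R101':  [sum(x[i] for x in pos) / len(pos) for i in range(n_dims)],
        'iR101': [sum(x[i] for x in neg) / len(neg) for i in range(n_dims)],
    }

centroids = train_Rocchio(X_train, y_train)
print([round(v, 3) for v in centroids['iR101']])
# [0.1, 0.1, 0.0, 0.25, 0.167, 0.15, 0.0, 0.107, 0.083, 0.1]
```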
END OF PAPER