
Search Engines

Text, Web And Media Analytics
Final Exam Information


Review, and Sample Questions

Yuefeng Li  |  Professor
School of Computer Science
Queensland University of Technology
S Block, Level 10, Room S-1024, Gardens Point Campus
ph 3138 5212 | email 

Outline of Week 13 Lecture
Final exam information
Review questions
Last week overview
Sample questions
References

1. Final Exam Information
The IFN647 final exam is an on-campus invigilated exam held in the end-of-semester exam period and the deferred exam period.
If you have been approved to undertake the Alternative Assessment for IFN647, the unit coordinator will contact you by Monday 13 June to advise the format of the alternative assessment.
This semester’s final exam will have 13 questions (a total of 45 marks)
Short answer questions (5 questions – 10 marks): Question 1 to Question 5
Answer each question clearly in a few lines in the booklets provided
Points are indicated
Problem solving questions (8 questions – 35 marks): Question 6 to Question 13
Write your answer clearly in the booklets provided.
A few questions need short Python code.
Difficulty levels and question distribution follow a normal distribution (some questions are challenging)


Exam Information cont.
You need to know the concepts, ideas and algorithms for text pre-processing, inverted indexing, IR models, evaluation, user information needs, information filtering, classification, and social search.
Revise the Week 3 to Week 11 review questions, lecture notes and textbook.
Please note that the exam paper tries to cover each week and avoid checking the same knowledge or concepts repeatedly.
A few questions require writing Python code. You usually do not have to write much code; basic Python commands are enough.

2. Review Questions
Use the lecture review questions (Week 3 to Week 11, about 37 questions, excluding optional questions) to help you prepare for the final exam. The lecture notes and textbook are important references.
Know how to answer all review questions (they are available in each week's lecture (LEC) folder).
Try them yourself first and then compare your answers with the solutions.
Contact the unit coordinator if you have any problems preparing for the final exam.

3. Last Week Overview
If you would like to discuss further study (e.g., MPhil or PhD) or career planning, you can contact Professor Yuefeng Li via email.

4. Sample Questions
In this session, we discuss some sample exam questions.
Exam questions are designed according to lecture review questions.
The exam questions are different from these sample questions, but have a similar format and content.
Please use the lecture review questions to understand all concepts, ideas, algorithms, math equations and Python solutions.
The lecture notes and the textbook are important references to help you work out solutions.


Week 3 – Textual Data Pre-processing
Sample Question 1 (Short Answer)
Which of the following statements is FALSE? Explain why it is FALSE. # wk3Q2

(a) Stemming is a component of text processing that captures the relationships between different variations of a word.
(b) Stemming reduces the different forms of a word that occur because of inflection (e.g., plurals, tenses) or derivation (e.g., making a verb into a noun by adding a suffix) to a common stem.
(c) In general, using a stemmer for search applications with English text produces a small but noticeable improvement in the quality of results.
(d) A dictionary-based stemmer uses a small program to decide whether two words are related, usually based on knowledge of word suffixes for a particular language.

Answer: (d)
A dictionary-based stemmer has no logic of its own, but instead relies on pre-created dictionaries of related terms to store term relationships. In contrast, an algorithmic stemmer uses a small program to decide whether two words are related, usually based on knowledge of word suffixes for a particular language.

Understand the difference between two concepts.

Review questions: document parsing, stemming, n-grams, Markov chain and html parsing

1. Processing Text
Text Statistics
Tokenizing
Stopping and stemming
Phrases and N-grams
2. Information Extraction
Named Entity
HMM – Hidden Markov Model
3. Document Structure and Markup
Hyperlinks


Week 4 – Indexing text and ranking documents
Sample QUESTION 2 (short answer) #wk4, Q1
Which of the following statements is wrong? Justify your answer.
(a) The index is inverted because usually we think of words being a part of documents, but if we invert this idea, documents are associated with words.
(b) In an inverted index that contains only document information, the features are binary: they are 1 if the document contains a term, and 0 otherwise. This inverted index contains enough information to tell whether a document contains the exact phrase "tropical fish".
(c) Each index term has its own inverted list that holds the relevant data for that term. Each list entry is called a posting, and the part of the posting that refers to a specific document or location is often called a pointer.
(d) An extent is a contiguous region of a document. We can represent these extents using word positions.

Solution: (b) 
Although a document that contains the words "tropical" and "fish" is likely to be relevant, we really want to know whether the document contains the exact phrase "tropical fish". To do this, we need to add position information to the index.
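To make the phrase-query point concrete, here is a minimal Python sketch (not from the lecture notes; the two-document collection is invented) of a positional inverted index that checks for the exact phrase "tropical fish":

docs = {1: "tropical fish live in tropical water",
        2: "fish markets sell tropical fruit"}

# Positional inverted index: term -> {doc_id: [word positions]}
index = {}
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index.setdefault(term, {}).setdefault(doc_id, []).append(pos)

def contains_phrase(doc_id, w1, w2):
    # the phrase occurs if w2 appears at the position immediately after w1
    p1 = index.get(w1, {}).get(doc_id, [])
    p2 = index.get(w2, {}).get(doc_id, [])
    return any(p + 1 in p2 for p in p1)

print(contains_phrase(1, "tropical", "fish"))   # True
print(contains_phrase(2, "tropical", "fish"))   # False (doc 2 has "tropical fruit")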

Understand topical features’ representation.

Review questions: feature and representation, query processing, Abstract Model of Ranking.

1. Abstract Model of Ranking
2. Indexes
Inverted Index
Index construction
3. Query processing
Document-at-a-time
Term-at-a-time
Optimization techniques
Structured queries


Week 5 – Information Retrieval
Sample QUESTION 3 (Language Models) # wk5, Q4
[Total marks for Question 3 = 4]
Let Q = {US, ECONOM, ESPIONAG} be a query, and C = {D1, D2, D3, D4, D5, D6} be a collection of documents, where
D1 = {GERMAN, VW}
D2 = {US, US, ECONOM, SPY}
D3 = {US, BILL, ECONOM, ESPIONAG}
D4 = {US, ECONOM, ESPIONAG, BILL}
D5 = {GERMAN, MAN, VW, ESPIONAG}
D6 = {GERMAN, GERMAN, MAN, VW, SPY}
The Jelinek-Mercer smoothing method uses the following equation to calculate a score for a document D:

P(Q|D) = P(q1|D) * P(q2|D) * ... * P(qn|D)

where n is the number of query terms in Q, and

P(qi|D) = (1 - λ) * fqi,D/|D| + λ * cqi/|C|

for all query terms qi in Q, where fqi,D is the number of times query word qi occurs in document D, |D| is the number of word occurrences in D, cqi is the number of times query word qi occurs in collection C, and |C| is the total number of word occurrences in collection C. Assume the parameter λ = 0.4 and calculate P(Q|D2). You are required to show your working.

Review questions: IR model concepts, probabilistic model, language models.

1. Overview of IR Models
2. Older IR Models
Boolean Retrieval
Vector Space Model
3. Probabilistic Models
Language models
4. Relevance models
Pseudo-Relevance Feedback
KL-Divergence

Week 5 cont.
Understand the meaning of variables in equations and use equations in practical examples
Question 3 Solution:

D2 = {US, US, ECONOM, SPY}, Q = {US, ECONOM, ESPIONAG} 
 
P(Q| D2) = P(US | D2) * P(ECONOM | D2) * P(ESPIONAG | D2) = 0.003865     
 
P(US | D2)=(1-0.4)*(2/4)+0.4*(4/23)=0.6*0.5 + 0.4*0.17  = 0.368 
 
P(ECONOM | D2) =0.6 *(1/4)+0.4*(3/23) = 0.6*0.25 + 0.4*0.13 = 0.202  
 
P(ESPIONAG | D2) = 0.6*(0/4) + 0.4*(3/23) = 0.4*0.13 = 0.052 
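A minimal Python sketch of the same calculation (the variable and function names are mine, not from the lecture notes):

collection = {'D1': ['GERMAN', 'VW'],
              'D2': ['US', 'US', 'ECONOM', 'SPY'],
              'D3': ['US', 'BILL', 'ECONOM', 'ESPIONAG'],
              'D4': ['US', 'ECONOM', 'ESPIONAG', 'BILL'],
              'D5': ['GERMAN', 'MAN', 'VW', 'ESPIONAG'],
              'D6': ['GERMAN', 'GERMAN', 'MAN', 'VW', 'SPY']}
Q = ['US', 'ECONOM', 'ESPIONAG']
lam = 0.4
all_words = [w for doc in collection.values() for w in doc]   # |C| = 23 word occurrences

def p_jm(q, doc):
    # (1 - lambda) * f_{q,D}/|D| + lambda * c_q/|C|
    return (1 - lam) * doc.count(q) / len(doc) + lam * all_words.count(q) / len(all_words)

score = 1.0
for q in Q:
    score = score * p_jm(q, collection['D2'])
print(score)   # roughly 0.0039 (the manual working above uses rounded values)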


Week 5 sample questions cont.
Sample QUESTION 4 (Similarity measure) # lecture examples
[Total marks for Question 4 = 4]
The cosine similarity measure is defined as follows:

Cosine(D, Q) = (Σi di * qi) / (sqrt(Σi di^2) * sqrt(Σi qi^2))
Given two documents D1, D2 and a query Q :
D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
Calculate Cosine(D1, Q) and Cosine(D2, Q).
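A minimal Python sketch of the calculation (names chosen for illustration):

import math

def cosine(x, y):
    # dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

D1, D2, Q = (0.5, 0.8, 0.3), (0.9, 0.4, 0.2), (1.5, 1.0, 0)
print(cosine(D1, Q))   # roughly 0.87
print(cosine(D2, Q))   # roughly 0.97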


Week 6 – Evaluation
Sample QUESTION 5 (Evaluation) # Similar to wk6, Q2

[Total marks for Question 5 = 4]

Let A be the set of relevant documents, and B be the set of retrieved documents of an IR model.
Use A and B to define the precision and recall of the IR model.
If A = {d1, d2, d4, d6, d7, d9} and B = {d1, d3, d4, d5, d6, d8, d9, d10 }, calculate the precision and recall values.

Answer: (understand concept definitions)

Precision = |A ∩ B| / |B|, Recall = |A ∩ B| / |A|
Precision = |{d1, d4, d6, d9}|/|B| = 4/8 = 50%
Recall = |{d1, d4, d6, d9}|/|A| = 4/6 = 66.67%
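The same calculation as a small Python sketch (the sets are taken from the question):

A = {'d1', 'd2', 'd4', 'd6', 'd7', 'd9'}                  # relevant documents
B = {'d1', 'd3', 'd4', 'd5', 'd6', 'd8', 'd9', 'd10'}     # retrieved documents
precision = len(A & B) / len(B)   # 4/8 = 0.5
recall = len(A & B) / len(A)      # 4/6 = 0.6667
print(precision, recall)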

Review questions: relevance judgements, Effectiveness Metrics and evaluation.

1. Evaluation overview
2. Relevance Judgments
Query Logs
Filtering Clicks
3. Effectiveness Measures
Precision, recall, F measure
4. Ranking Effectiveness
Average Precision
Mean Average Precision (MAP)
Average recall-precision graph
Discounted Cumulative Gain (DCG) & NDCG
5. Efficiency Metrics
6. Significance Tests


Week 6 Sample Questions
Sample QUESTION 6 (Evaluation) # Similar to wk6, Q4

[Total marks for Question 6 = 3]

MAP (Mean Average Precision) summarizes rankings from multiple queries by averaging their average precision values. The following are two ranking results for two queries. Calculate the MAP for these rankings. You are required to show your working.

Week 6 cont.
Answer for QUESTION 6:
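The two ranking tables are not reproduced in this handout, so the following is only a generic sketch of the MAP calculation; the two example rankings (1 = relevant, 0 = not relevant) and relevant-document counts are invented, not the ones from the exam figure.

def average_precision(ranking, num_relevant):
    # average of the precision values at the ranks where relevant documents are retrieved
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranking, start=1):
        if rel == 1:
            hits += 1
            total += hits / rank
    return total / num_relevant

ranking1, ranking2 = [1, 0, 1, 0, 1], [0, 1, 0, 1, 0]   # hypothetical rankings for two queries
ap1 = average_precision(ranking1, 3)                    # (1/1 + 2/3 + 3/5) / 3
ap2 = average_precision(ranking2, 2)                    # (1/2 + 2/4) / 2
print((ap1 + ap2) / 2)                                  # MAP = mean of the average precision values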

Week 7: Text mining and user information needs
Sample QUESTION 7 (Mutual Information) #wk7 Q3
[Total marks for Question 7 = 4]

For two words (or terms) a and b, their mutual information measure (MIM) is defined as

MIM(a, b) = log [ P(a, b) / (P(a) * P(b)) ]
where P(a) is the probability that word a occurs in a text window of a given size, P(b) is the probability that word b occurs in a text window, and P(a, b) is the probability that a and b occur in the same text window.

Assume a function df(docs, a) is defined to calculate the number of documents containing word a, and function df2(docs, a, b) returns the number of documents containing both words a and b, where docs is a set of documents. 

Let D be a set of documents. For any two words a and b, use the MIM equation and the provided two functions to calculate their MIM. You need to write math formulas (or equivalent python definitions) to compute P(a), P(b) and P(a, b) in terms of D, df and df2.  

Review questions: concepts and techniques for information needs, Noisy channel model, Mutual Information

1. Concepts of Text Mining and Information Needs
2. Query based approaches for acquiring information needs
Stem classes
Spelling correction
Noisy channel model
3. Query Expansion
Association Measures
Context vectors
4. Relevance Feedback
5. Query Context and Personalization
6. Applications
Snippet Generation – information summary
Searching Advertisements

Week 7 cont.
Compute based on equations and function definitions
Question 7 solution
 

Let N = |D|, then we have
P(a) = df(D, a)/N, P(b) = df(D, b)/N,
P(a, b) = df2(D, a, b)/N

So, MIM = log[(df2(D, a, b)/N) / ((df(D, a)/N) * (df(D, b)/N))]
        = log[N * df2(D, a, b) / (df(D, a) * df(D, b))]
        = log[|D| * df2(D, a, b) / (df(D, a) * df(D, b))]
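A minimal Python sketch of the same computation, assuming documents are represented as sets of words (the implementations of df and df2 are mine, written to match the definitions in the question):

import math

def df(docs, a):
    # number of documents containing word a
    return sum(1 for d in docs if a in d)

def df2(docs, a, b):
    # number of documents containing both words a and b
    return sum(1 for d in docs if a in d and b in d)

def mim(docs, a, b):
    N = len(docs)
    # MIM = log(P(a, b) / (P(a) * P(b))) = log(N * df2 / (df(a) * df(b)))
    return math.log(N * df2(docs, a, b) / (df(docs, a) * df(docs, b)))

D = [{'tropical', 'fish'}, {'fish', 'market'}, {'tropical', 'fruit'}]   # tiny invented example
print(mim(D, 'tropical', 'fish'))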


Week 8: Information Filtering

Sample QUESTION 8 (Pattern mining) # wk8, Q5

[Total marks for Question 8 = 5]

Given an interesting pattern X, its closure is defined as
C(X) = termset([X]).

Explain the meaning of [X] and the function termset, respectively.
Assume X1 = and X2 = . Calculate C(X1) and C(X2) by using the table on the right.

Doc_ID   Terms            Positive?
1        t1 t2            yes
2        t3 t4 t6         yes
3        t1 t2 t5 t6      yes
4        t3 t4 t5 t6      yes
5        t1 t2 t6 t7      yes
6        t1 t2 t6 t5      yes
7        t1 t2            no
8        t3 t4            no

Review questions: Term frequency measure, Probabilistic information filtering model, Pseudo relevance feedback, patterns and pattern mining

1. Information filtering introduction
2. Rocchio-based information filtering model
tf*idf Term Weighting
Training algorithm
The Ranking Function
Testing Algorithm
3. Probabilistic information filtering models
Weighting Formulas
BM25 Training Algorithm
4. Mining patterns for information filtering
Pattern Mining
Relationship between Terms and Documents
Closed Patterns
Application of Patterns for Information Filtering

Week 8 cont.
Answer for Question 8:
(understand pattern definitions in textual data)

[X] denotes the covering set of X, which includes all positive documents d such that X ⊆ d.

termset is a function that maps a set of documents to a set of terms; it satisfies
termset(Y) = {t | t ∈ T, and t ∈ d for all d ∈ Y}.
[X1] = [3, 5, 6]
[X2] = [1, 3, 5, 6]
C(X1) = termset([X1]) = termset([3, 5, 6]) = {t1, t2, t6}
C(X2) = termset([X2]) = termset([1, 3, 5, 6]) = {t1, t2}
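A small Python sketch of the covering set, termset and closure computations on the positive documents from the table. Since X1 and X2 are not reproduced above, the two example patterns below are assumptions chosen so that their covering sets match the solution's [3, 5, 6] and [1, 3, 5, 6]:

pos_docs = {1: {'t1', 't2'},
            2: {'t3', 't4', 't6'},
            3: {'t1', 't2', 't5', 't6'},
            4: {'t3', 't4', 't5', 't6'},
            5: {'t1', 't2', 't6', 't7'},
            6: {'t1', 't2', 't5', 't6'}}

def covering_set(X, docs):
    # [X]: ids of the positive documents that contain every term of pattern X
    return {d for d, terms in docs.items() if set(X).issubset(terms)}

def termset(Y, docs):
    # terms that appear in every document of Y
    common = None
    for d in Y:
        common = docs[d] if common is None else common & docs[d]
    return common if common is not None else set()

def closure(X, docs):
    # C(X) = termset([X])
    return termset(covering_set(X, docs), docs)

print(closure({'t1', 't6'}, pos_docs))   # covering set {3, 5, 6} -> {'t1', 't2', 't6'}
print(closure({'t1', 't2'}, pos_docs))   # covering set {1, 3, 5, 6} -> {'t1', 't2'}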


Week 9: Text Classification
QUESTION 9 (Classification)
[Total marks for Question 9 = 6]
Let {c0, c1} be a set of classes and D be a training set, where each element in D is a pair (dj, cj), dj is a document and cj is its class. For example, Table 1 shows a training set D which contains 10 documents and their multinomial document representations. Assume this training set is represented as two Python lists as follows:

D = [[0,0,0,0,2], [3,0,1,0,1], [0,0,0,0,1], [2,0,3,0,2], [5,2,0,0,1], [0,0,1,0,1], [0,1,1,0,1], [0,0,0,0,1], [0,0,0,0,1], [1,1,0,1,2]]
C = [0,1,0,1,1,0,0,0,0,0]

where a document is represented as a list of term frequency counts in the document (note that we assume the vocabulary set V = ['cheep', 'buy', 'banking', 'dinner', 'the'] is numbered from 0 to n-1), D is represented as a list of documents, C is a list of the class labels of the corresponding documents in D, and c1 (spam) and c0 (not spam) are labelled as 1 and 0 in C.

The training procedure of Naive Bayes classification calculates the prior probability P(c) and the conditional probabilities P(w|c) for all terms w in V and all classes c in {c0, c1}, where P(c) = Nc / |D|, P(w|c) = (tfw,c + 1)/(|c| + |V|), Nc is the number of training documents with class label c, tfw,c is the number of times that term w occurs in class c in the training set, and |c| is the total number of term occurrences in training documents with class label c.

Define a Python function TRAIN_MULTINOMIAL_NB(C, D, V) to calculate the prior probability P(c) and the conditional probabilities P(wj|ci). Please note the function parameters are different from those of the Naive Bayes algorithm in the lecture notes.

Review questions: Multinomial Document Representation, Naive Bayes classification, and Rocchio Classification

Classification introduction
Text classification application
Classifiers
Naïve Bayes Classifier
Vector Space Classification
Rocchio classification
Feature Selection
Evaluating Classifiers

Week 9 cont.

Question 9 solution

Understand an algorithm and write Python code to implement it.

def n_words(c, C, D):
    # total number of term occurrences in the training documents labelled c, i.e. |c|
    n_w = 0
    for i in range(len(C)):
        if C[i] == c:
            for j in range(len(D[i])):
                n_w = n_w + D[i][j]
    return n_w

def TRAIN_MULTINOMIAL_NB(C, D, V):
    N = len(D)
    prior = []
    tf = [[0 for w in range(len(V))] for c in range(2)]        # term frequencies per class
    condprob = [[0 for w in range(len(V))] for c in range(2)]  # P(w|c)
    for c in range(2):
        Nc = 0                                   # number of training documents with label c
        for i in range(len(C)):
            if C[i] == c:
                Nc = Nc + 1
        prior.append(Nc / N)                     # P(c) = Nc / |D|
        for w in range(len(V)):                  # tf[c][w]: number of times term w occurs in class c
            for d in range(len(D)):
                if C[d] == c:
                    tf[c][w] = tf[c][w] + D[d][w]
        for w in range(len(V)):                  # Laplace smoothing: P(w|c) = (tf + 1) / (|c| + |V|)
            condprob[c][w] = (tf[c][w] + 1) / (n_words(c, C, D) + len(V))
    return (prior, condprob)
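As a quick usage check with the training data given in the question (a sketch; the expected prior follows from 7 of the 10 documents being labelled c0):

D = [[0,0,0,0,2], [3,0,1,0,1], [0,0,0,0,1], [2,0,3,0,2], [5,2,0,0,1],
     [0,0,1,0,1], [0,1,1,0,1], [0,0,0,0,1], [0,0,0,0,1], [1,1,0,1,2]]
C = [0,1,0,1,1,0,0,0,0,0]
V = ['cheep', 'buy', 'banking', 'dinner', 'the']
prior, condprob = TRAIN_MULTINOMIAL_NB(C, D, V)
print(prior)      # [0.7, 0.3]
print(condprob)   # 2 x 5 table of smoothed P(w|c) values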


Week 10: Social Media Analytics
Sample QUESTION 10 (likelihood method) # wk10, Q1

[Total marks for Question 10 = 3]

Assume C is a collection of tweets. We can use the following query likelihood method to calculate the probability of a tweet d occurring for a given query Q.

Let |d| and |Q| be the respective lengths (sizes) of the tweet d and the query Q, and let μ be the smoothing parameter.

Interpret the meaning of c(w, d), c(w, Q), and P(w|C).

Review questions: likelihood method, sentiment analysis, KL divergence for profiles, and recommender systems

1. Social media analysis
Microblog Retrieval
Sentiment analysis
2. Social search
Searching Tags
Inferring Missing Tags
Browsing and Tag Clouds
Collaborative Search
3. Filtering and recommender systems
Static filtering
Adaptive filtering
Recommender systems
Collaborative Filtering
Rating using User Clusters
Rating using Nearest Neighbors

Week 10 cont.
Understand definitions in a math equation

Question 10 solution

c(w, d) is the number of times word w occurs in the given tweet d,

c(w, Q) is the number of times word w occurs in the given query Q,

P(w|C) is the probability of word w in the collection C, which is used to normalise (smooth) the model.
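The equation itself is not reproduced in this handout, so purely as a rough illustration, the sketch below scores a tweet with a Dirichlet-smoothed query likelihood of the general form score(Q, d) = Σw c(w, Q) * log[(c(w, d) + μ * P(w|C)) / (|d| + μ)]; the exact formula used in the lecture may differ, and the toy data are invented.

import math
from collections import Counter

def query_likelihood(query_terms, tweet_terms, collection_terms, mu=100):
    c_d = Counter(tweet_terms)          # c(w, d)
    c_q = Counter(query_terms)          # c(w, Q)
    coll = Counter(collection_terms)    # used for P(w|C)
    score = 0.0
    for w, cwq in c_q.items():
        p_wC = coll[w] / len(collection_terms)   # P(w|C)
        if p_wC == 0:
            continue                             # skip words unseen in the collection
        score += cwq * math.log((c_d[w] + mu * p_wC) / (len(tweet_terms) + mu))
    return score

tweet = ['great', 'tropical', 'fish', 'photos']
coll = tweet + ['fish', 'market', 'tropical', 'storm', 'news']
print(query_likelihood(['tropical', 'fish'], tweet, coll, mu=10))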

Week 11: Web Analytics
Sample QUESTION 11 (Distance Function) # Similar to Wk11, Q3

[Total marks for Question 11 = 4]

Given the following binary information table:

Draw the binary contingency table for comparing documents d1 and d2.
Evaluate the Jaccard distance and the Simple matching coefficient distance between d1 and d2.
Document t1 t2 t3 t4 t5 t6 t7
d1 1 0 0 0 1 1 1
d2 0 0 1 1 0 1 1
d3 0 0 0 1 1 1 0
d4 0 0 1 0 0 1 0
d5 1 1 1 1 0 1 1

Review questions: concepts for web analytics, pageRank, distance function, cluster precision

Web Analytics Definition 
Logfile analysis
Link analysis
Link Quality
Online communities
Finding Communities
Cluster Analysis
Distance Measures
The k-medoids Technique
Hierarchical Methods
Evaluation Clustering Algorithms
Community Based Question Answering

Week 11 cont.
Answer for Question 11
(problem solving based on concepts defined in lecture notes):

r = 2, s = 2, q = 2, t = 1
disJac(d1, d2) = (r + s) / (q + r + s) = 4/6 = 0.67
disS(d1, d2) = (r + s) / (q + r + s + t) = 4/7 = 0.57

              d2 = 1    d2 = 0    Sum
d1 = 1           2         2       4
d1 = 0           2         1       3
Sum              4         3       7
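A small Python sketch of the same distances, using the binary vectors for d1 and d2 from the table:

d1 = [1, 0, 0, 0, 1, 1, 1]
d2 = [0, 0, 1, 1, 0, 1, 1]
q = sum(1 for a, b in zip(d1, d2) if a == 1 and b == 1)   # both 1
r = sum(1 for a, b in zip(d1, d2) if a == 1 and b == 0)   # 1 in d1 only
s = sum(1 for a, b in zip(d1, d2) if a == 0 and b == 1)   # 1 in d2 only
t = sum(1 for a, b in zip(d1, d2) if a == 0 and b == 0)   # both 0
print((r + s) / (q + r + s))       # Jaccard distance = 4/6
print((r + s) / (q + r + s + t))   # simple matching coefficient distance = 4/7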


Week 11 Sample Questions
Sample Question 12. (PageRank) #wk11, Q2
 [Total marks for Question 12 = 4]

A Web graph G = (P, L) consists of Web pages (vertices) and links (edges). The PageRank procedure takes a Web graph G as input and then outputs the better PageRank estimate PR using the following equation:

PR(u) = λ/N + (1 - λ) * Σ over v in Bu of PR(v)/Lv

where N is the total number of pages, Bu is the set of pages that point to u, and Lv is the number of outgoing links from page v.
The following figure G = (P, L) is a Web graph, where P = {A, B, C, D} and L = {(A, B), (A, D), (C, A), (D, B), (C, B)}.

Assume the initial estimate of each Web page is equal, i.e., PR(A) = PR(B) = PR(C) = PR(D) = 0.25, and λ = 0.15. Calculate the PageRank estimate (PR value) of each Web page after running the first iteration of the procedure PageRank(G).


Week 11 Sample Questions cont.
Learn how to use equations in iterative calculations
Question 12 solution:

The first iteration: the outgoing links are A → {B, D}, C → {A, B}, D → {B}, B → {}, so LA = 2, LC = 2, LD = 1, LB = 0.
For page A: BA = {C}
PR(A) = 0.15/4 + (1-0.15)*(PR(C)/2) = 0.0375 + 0.85*(0.25/2) = 0.14375
For page B: BB = {A, C, D}
PR(B) = 0.15/4 + (1-0.15)*(PR(C)/2 + PR(A)/2 + PR(D)/1) = 0.0375 + 0.85*(0.25/2 + 0.25/2 + 0.25) = 0.4625
For page C: BC = {}
PR(C) = 0.15/4 + (1-0.15)*0 = 0.0375
For page D: BD = {A}
PR(D) = 0.15/4 + (1-0.15)*(PR(A)/2) = 0.0375 + 0.85*(0.25/2) = 0.14375
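A minimal Python sketch of one PageRank iteration on the graph above (the variable names are mine):

links = [('A', 'B'), ('A', 'D'), ('C', 'A'), ('D', 'B'), ('C', 'B')]
pages = ['A', 'B', 'C', 'D']
lam = 0.15
pr = {p: 0.25 for p in pages}   # initial estimates

out_degree = {p: sum(1 for v, u in links if v == p) for p in pages}   # Lv
in_links = {p: [v for v, u in links if u == p] for p in pages}        # Bu

new_pr = {}
for u in pages:
    new_pr[u] = lam / len(pages) + (1 - lam) * sum(pr[v] / out_degree[v] for v in in_links[u])
print(new_pr)   # roughly {'A': 0.14375, 'B': 0.4625, 'C': 0.0375, 'D': 0.14375}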

5. References
[1] IFN647 week 3 to week 11 lecture review questions
[2] IFN647 week 3 to week 11 question solutions
[3] IFN647 week 2 to week 11 lecture notes
[4] W. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice, Pearson, 2010. This textbook is available at http://ciir.cs.umass.edu/downloads/SEIRiP.pdf

