
Week 7 Question Solutions
Professor Yuefeng Li
School of Computer Science, Queensland University of Technology (QUT)
Concepts of Text Mining and Information Needs


There are several definitions of text mining. Text mining is the process of deriving high-quality information from text. It is also referred to as text analytics: an AI-based technology that uses NLP (natural language processing) to transform unstructured textual data into structured data suitable for machine learning algorithms to perform text analysis.
Normally, a text document includes a lot of noisy information. For example, many web pages contain text, links, and pictures that are not directly related to the main content of the page. The main content block of a story page (the story) may take up less than 20% of the display area of the page, and the rest is made up of banners, advertisements, images, general navigation links, services (such as search and alerts), and miscellaneous information such as copyright notices.
Also, high-quality information usually refers to some combination of relevance, novelty, and interest. It is hard to extract high-quality information if we do not know what users want, i.e., their information needs.
For information retrieval research, an information need is the underlying cause of the query that a person submits to a search engine. In contrast to a request to a database system, such as for the balance of a bank account, text queries are often poor descriptions of what users want. A short query such as “tropical fish” is very common in Google search. However, a short query can only partially describe an information need, because an information need is a gap in a person’s knowledge. It is difficult for people to define exactly what their information need is.
User information needs are also required for information extraction, text classification, and information filtering, since we need to know the definitions of named entities, aspects, and relevant knowledge about classes.

Techniques such as query suggestion, query expansion, and relevance feedback can use interaction and context to refine the initial query in order to acquire user information needs.
Question 1. Which of the following is False? Justify your answer.
(1) A stem class is the group of words that will be transformed into the same stem by the stemming algorithm.
(2) The most common measure for comparing words (or more generally, strings) is the edit distance, which is the number of operations required to transform one of the words into the other.
(3) The Damerau-Levenshtein distance metric counts the maximum number of insertions, deletions, substitutions, or transpositions of single characters required to do the transformation.
(4) A variety of techniques and data structures have been used to speed up the calculation of edit distances between the misspelled word and the words in the dictionary. These include restricting the comparison to words that start with the same letter (since spelling errors rarely change the first letter), words that are of the same or similar length (since spelling errors rarely change the length of the word), and words that sound the same.
Solution: (3)
The Damerau-Levenshtein distance metric counts the minimum number of insertions, deletions, substitutions, or transpositions of single characters required to do the transformation.
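
For illustration, below is a minimal Python sketch of the Damerau-Levenshtein distance in its common simplified form (allowing only transpositions of adjacent characters); the function name and structure are my own, not taken from the lecture notes:

def damerau_levenshtein(s, t):
    """Minimum number of insertions, deletions, substitutions, or
    adjacent transpositions needed to transform s into t."""
    m, n = len(s), len(t)
    # d[i][j] = distance between the prefixes s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # transposition of two adjacent characters
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(damerau_levenshtein("tink", "tank"))    # 1 (one substitution)
print(damerau_levenshtein("tink", "think"))   # 1 (one insertion)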
Question 2. Noisy channel model
Consider the spelling error “tink” in the query “fish tink” given by a user. We want to select the best correction of “tink” from the words {“tank”, “think”}, both at edit distance 1. Assume all words with the same edit distance have the same probability, and that the words “tank” and “think” are quite common with very similar prior probabilities, i.e., P(tank) = P(think).
Based on the noisy channel model and these assumptions, we are likely to select “tank” (rather than “think”) as the correction of “tink”. Please justify this statement.
Solution: Based on the noisy channel model, we need to calculate P(w|e), the probability that the correct word is w given that the person wrote e.
Based on Bayes’ rule and a language model, we have
P(w|e) ∝ P(e|w)P(w)
The language model probability for a word is then calculated as a mixture of the probability that the word occurs in text and the probability that it occurs following the previous word, or

λP(w) + (1 − λ)P(w|wp)
where 0 ≤ λ ≤ 1 is a parameter that specifies the relative importance of the two probabilities,
and P(w|wp) is the probability of a word w following the previous word wp. By using the above equations, we have
P(tank|tink) ∝ P(tink|tank) · P(tank) = P(tink|tank) · (λP(tank) + (1 − λ)P(tank|fish))
P(think|tink) ∝ P(tink|think) · P(think) = P(tink|think) · (λP(think) + (1 − λ)P(think|fish))
According to the assumptions, all words at edit distance 1 have the same probability, i.e., P(tink|tank) = P(tink|think). Also, the words “tank” and “think” are quite common and have very similar prior probabilities, i.e., P(tank) = P(think).
But based on common knowledge, we usually find P(tank|fish) > P(think|fish) in a collection of documents, since “fish tank” is a much more common bigram than “fish think”. So P(tank|tink) > P(think|tink) is very likely, and “tank” is selected as the correction.
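
To make the comparison concrete, here is a small Python sketch that plugs numbers into the two equations above; all probability values (λ, the error model, and the unigram and bigram probabilities) are made-up assumptions for illustration only:

# Illustrative noisy channel scores for correcting "tink" in "fish tink".
lam = 0.5                     # mixture weight lambda (assumed)
p_err = 0.01                  # P(tink|tank) = P(tink|think): same edit distance
p_tank = p_think = 0.001      # P(tank) = P(think): similar prior probabilities
p_tank_fish = 0.04            # P(tank|fish): "fish tank" is a common bigram (assumed)
p_think_fish = 0.0001         # P(think|fish): "fish think" is rare (assumed)

score_tank = p_err * (lam * p_tank + (1 - lam) * p_tank_fish)
score_think = p_err * (lam * p_think + (1 - lam) * p_think_fish)
print(score_tank > score_think)   # True, so "tank" is chosen as the correction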
Query expansion is a technique to expand an initial query using synonyms and related words. It is usually based on an analysis of word or term co-occurrence. Query-based stemming can also be regarded as a query expansion technique, with the expansion terms limited to word variants.
The key to effective expansion is to choose words that are appropriate for the context, or topic, of the query. For example, “aquarium” may be a good expansion term for “tank” in the query “tropical fish tanks”, but not appropriate for the query “armor for tanks”.
Term association measures, e.g., Dice’s coefficient, mutual information, and Chi-squared, are an important part of many approaches to query expansion.
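
As a quick illustration of one of these measures, the sketch below computes Dice’s coefficient, 2·n_ab / (n_a + n_b), from document frequencies; the counts in the usage example are made up:

def dice(n_a, n_b, n_ab):
    """Dice's coefficient for terms a and b, where n_a and n_b are the
    numbers of documents containing each term and n_ab the number
    containing both."""
    return 2 * n_ab / (n_a + n_b)

# e.g. "tropical" in 120 documents, "fish" in 300, both in 90 (assumed counts)
print(dice(120, 300, 90))   # 0.4285...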
Question 3. (Mutual Information)
For two words (or terms) a and b, their mutual information measure (MIM) is defined as

MIM(a, b) = log[ P(a, b) / (P(a) · P(b)) ]
where P(a) is the probability that word a occurs in a text window of a given size, P(b) is the probability that word b occurs in a text window, and P(a, b) is the probability that a and b occur in the same text window.

Assume a function df(docs, a) is defined to calculate the number of documents containing word a, and function df2(docs, a, b) returns the number of documents containing both words a and b, where docs is a set of documents.
(1) Let D be a set of documents. For any two words a and b, use the MIM equation and the provided two functions to calculate their MIM.
(2) Assume the collection of documents is represented as a dictionary {docID: {term: counts, …}, …}, for example, docs = {'D1': {'term1': 3, 'term4': 5, 'term5': 7}, 'D2': {'term1': 5, 'term2': 3, 'term3': 4, 'term4': 6}, 'D3': {'term3': 5, 'term4': 4, 'term5': 6}, 'D4': {'term1': 9, 'term4': 1, 'term5': 2}, 'D5': {'term2': 1, 'term4': 3, 'term5': 2}, 'D6': {'term1': 3, 'term3': 2, 'term4': 4, 'term5': 4}}.
Define a Python function df2(docs, a, b) to calculate n_ab, the number of documents that include both terms a and b.
(3) Assume a Python function c_df(docs) calculates the df value for all terms in docs and returns a {term: df, …} dictionary. Use both functions df2() and c_df() to define a Python function MIM(docs, a, b) to calculate the MIM value for terms a and b.
Solution: (1)
Let N = |D|. Then we have
P(a) = df(D, a)/N, P(b) = df(D, b)/N, P(a, b) = df2(D, a, b)/N
So, MIM = log[(df2(D, a, b)/N) / ((df(D, a)/N) · (df(D, b)/N))] = log[N · df2(D, a, b) / (df(D, a) · df(D, b))]
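
(2) and (3) One possible Python sketch of the requested functions is given below; c_df() is also defined here so the example is self-contained, and MIM() assumes a and b co-occur in at least one document so that the logarithm is defined:

import math

def df2(docs, a, b):
    """Number of documents in docs that contain both terms a and b."""
    return sum(1 for terms in docs.values() if a in terms and b in terms)

def c_df(docs):
    """Document frequency of every term in docs, as a {term: df} dictionary."""
    df = {}
    for terms in docs.values():
        for term in terms:
            df[term] = df.get(term, 0) + 1
    return df

def MIM(docs, a, b):
    """MIM(a, b) = log(N * n_ab / (n_a * n_b)), using the result of part (1)."""
    N = len(docs)
    df = c_df(docs)
    return math.log(N * df2(docs, a, b) / (df[a] * df[b]))

docs = {'D1': {'term1': 3, 'term4': 5, 'term5': 7},
        'D2': {'term1': 5, 'term2': 3, 'term3': 4, 'term4': 6},
        'D3': {'term3': 5, 'term4': 4, 'term5': 6},
        'D4': {'term1': 9, 'term4': 1, 'term5': 2},
        'D5': {'term2': 1, 'term4': 3, 'term5': 2},
        'D6': {'term1': 3, 'term3': 2, 'term4': 4, 'term5': 4}}
print(MIM(docs, 'term1', 'term4'))   # log(6*4/(4*6)) = log(1) = 0.0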
