程序代写代做代考 algorithm information retrieval CMPT 456 Assignment 3

CMPT 456 Assignment 3
Due: 11:59 pm, July 17, 2020 100 points in total
Please submit your assignment in Coursys.
Every student has to complete the assignment independently. While you are encouraged to learn through discussion with the instructor, the TAs and the peer students, any plagiarisms are serious violation of the university’s academic integrity policy. We have absolutely zero tolerance of such behavior.
This assignment covers the materials in the sections of queries and interfaces, and information retrieval models. The questions use the data sets D1 and D2 in Q1 of Assignment 2. Please also merge D1 and D2 to form data set D. Do NOT stem the data sets.
Question 1 (50 points)
This question asks you to build a simple spelling correction function.
Your program should take a word w as input. It searches all words x in D that have edit distance edit-dist(w, x) <= 2, where each of the editing operations, insertion, deletion, and replacement, takes weight 1. Then, your program should output the top-5 words x that have the highest term- frequency in D as the suggested corrections as well as their term-frequencies in D. 1. Describe your implementation of the above spelling correction function and use 3 examples to show the correction results. Particularly, explain how you can find those candidate corrections and the top-5 candidates efficiently. Please submit your codes. (20 points) 2. The above design has a problem. If you input a correctly spelt word, like “heart”, it still may suggest some similar words as possible corrections, such as “heard”. Can you propose a fix to the problem so that your program can avoid suggesting corrections of words frequently used in D? Describe your idea use examples, implement it, and test the results. Does your program need any parameters? If so, discuss how those parameters may contribute to or affect the performance of the program, that is, how those parameters are tuned. Please submit your codes. (10 points) 3. A data set of tweets may often contain misspelt words. Your spelling correction program trained from D may be used on D to improve the data quality. Apply your program to D itself to catch misspelt words and correct them. Show the corrections you program makes. Does your program make some mistakes in correction? Use the first 100 corrections your program makes (or as many as possible if your program makes less than 100 corrections) to calculate the precision. Explain (1) how your program picks the correction of a word among all possible candidates; and (2) how your program avoids changing every word in D, particularly those words correctly spelt but infrequent in D. Submit your codes. (20 points) Question 2 (25 points) This question asks you to use language models to understand the differences between the usages of words in D1 and D2. Please remove the stop words in D1 and D2. !" !"!" a document 𝑑𝑑" (𝑑𝑑" ). Compute the language model 𝑀𝑀" (𝑀𝑀" ) of 𝑑𝑑" (𝑑𝑑" ). Please use For a word w that appears in document D1 (D2), we collect all the tweets containing w and form !! !!!! Dirichlet smoothing, use 𝐷𝐷# ∪ 𝐷𝐷$ as the collection and set 𝜇𝜇 = 25 . For the top-100 most frequent words in D, compute 𝐾𝐾𝐾𝐾(𝑀𝑀!"" || 𝑀𝑀!"! ) and sort the words in KL-divergence descending order. Please submit your list of the words, the KL-divergence values and your codes. Question 3 (25 points) This question asks you to conduct clustering on the tweets. D2 has 1000 tweets. Combine the first 100 tweets into one document 𝐷𝐷$#, combine the next 100 tweets into one document 𝐷𝐷$, ..., combine the last 100 tweets into one document 𝐷𝐷$#%. Let the language models on 𝐷𝐷$#,𝐷𝐷$,...,𝐷𝐷$#% be 𝑀𝑀""!,𝑀𝑀"",...,𝑀𝑀""!#, respectively. For each tweet t in D1, we can form a vector 𝑡𝑡⃗ = ⟨𝑥𝑥#, 𝑥𝑥$, ... , 𝑥𝑥#%⟩, where 𝑥𝑥& = 𝐾𝐾𝐾𝐾 6𝑀𝑀'||𝑀𝑀"$ 7, and 𝑀𝑀 is the language model of t. Again, please use Dirichlet smoothing, use 𝐷𝐷 ∪ 𝐷𝐷 as the '#$ collection and set 𝜇𝜇 = 25. The vectors should be normalized. For any two tweets 𝑡𝑡# and 𝑡𝑡$ in D1, the similarity between them is the cosine similarity between 8⃗ 8⃗ the two normalized associated vectors 𝑡𝑡# and 𝑡𝑡$ . Run the k-means algorithm to cluster all tweets in D1. You can learn the k-means algorithm on pages 382-384 in the textbook. You can also use any implementations or libraries, such as scikit in Python , to implement the k-means algorithm. Report your results when k is set to 3, 5, and 10, respectively. For each cluster, report the number of tweets and the top-5 most frequent words in the tweets in the cluster. Submit your codes as well.
Hint: many standard k-means clustering packages, such as scikit, uses Euclidean distance as standard. There is a linear connection between cosine similarity on normalized vectors and Euclidean distance. You can see an explanation at https://stats.stackexchange.com/questions/299013/cosine-distance-as-similarity-measure-in- kmeans