
COMP9727 Recommender Systems Assignment 1 – 22T2
Due: 1st July, 17:00 AEST
Total Mark: 50
Introduction

In this assignment, you will be required to manually implement a few recommendation algorithms in Python, as well as answer some corresponding questions individually. Other than this spec, all the required files can be found in ‘a1.zip’, which you can download directly from the WebCMS3 page. The following are some general requirements for the assignment:


• File link (login required): https://cs9727.web.cse.unsw.edu.au/22T2/assn1/a1.zip. If you cannot download the file, try opening it in an incognito window.
• There are 5 questions that count for 25% of your final mark. The marks are not equally distributed across the questions.
• Each question consists of a conceptual part and an implementation part.
• For the conceptual parts, you must submit your answers electronically. Report your answers to all the questions in a single file ass1_YourZId.pdf. You are strongly encouraged to use LaTeX to produce your answers (Word is fine if you prefer). Please note that we do not accept scanned copies.
• For the implementation parts, note that all the algorithms are required to be implemented in Python3.
• As far as third-party packages are concerned, only the following are permitted: NumPy, Pandas and Matplotlib. For version requirements, please check the submission section.
• ‘a1.zip’ contains the following datasets: q1.txt, q2.txt, q3.csv, shows.txt, user-shows.txt and q5.csv. It also contains some starter code: q3.py and q5.py; please follow the instructions inside to finish the tasks. For q1.py, q2.py and q4.py, you need to implement them from scratch, including creating the Python files yourself.
• Please ensure your code works well on CSE machines. If your code is unable to run on CSE machines, you will receive 0 for the corresponding question.
• Late Penalty: 5% of the assessment mark is deducted per day late. Late submissions that are more than 5 days late will not be accepted.
• We do not set up standard outputs for the questions. As such, you will not lose marks due to the output format, as long as you include all the required content and the outputs are well-structured and meaningful.

Question 1 – Association Rules and their Application in Recommendation (10 Marks)
To measure significance and interest in rules for recommendations, there are three commonly used metrics (a small computational sketch follows the list):
• Confidence (denoted as conf(X → Y)): Confidence is defined as the probability of occurrence of Y in a transaction given that the transaction already contains X:
conf(X → Y) = Pr(Y | X), where Pr(Y | X) is the conditional probability.
• Lift (denoted as lift(X → Y)): Lift is the ratio of the confidence of the rule to the expected confidence of the rule if X and Y were statistically independent:
lift(X → Y) = conf(X → Y) / support(Y);
• Conviction (denoted as conv(X → Y)): Conviction can be interpreted as the expected frequency that X occurs without Y (i.e., the frequency with which the rule makes an incorrect prediction):
conv(X → Y) = (1 − support(Y)) / (1 − conf(X → Y))
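For intuition, here is a minimal Python sketch (not part of the required deliverables) computing the three metrics from raw counts. The names count_xy, count_x, count_y and num_transactions are illustrative only; supports are expressed as fractions of the transaction count.

def confidence(count_xy, count_x):
    # conf(X -> Y) = Pr(Y | X) = support(X and Y) / support(X)
    return count_xy / count_x

def lift(count_xy, count_x, count_y, num_transactions):
    # lift(X -> Y) = conf(X -> Y) / support(Y)
    return confidence(count_xy, count_x) / (count_y / num_transactions)

def conviction(count_xy, count_x, count_y, num_transactions):
    # conv(X -> Y) = (1 - support(Y)) / (1 - conf(X -> Y))
    support_y = count_y / num_transactions
    conf = confidence(count_xy, count_x)
    return float('inf') if conf == 1 else (1 - support_y) / (1 - conf)

Note that conviction diverges to infinity when the confidence equals 1; this behaviour is relevant to part (c) below.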
(a) (2 Marks)
There is a problem with using confidence alone to evaluate rules: confidence ignores Pr(Y). Explain why this matters. Moreover, explain why lift and conviction are not affected by this issue.
(b) (2 Marks)
Consider the following definition of symmetry:
Definition 1 (Symmetrical) A measure is symmetrical if measure(X → Y) = measure(Y → X).
For the measures mentioned above (Confidence, Lift and Conviction), prove or disprove that they are symmetrical.
(c) (2 Marks)
A rule is said to be a “perfect” rule if it has a conditional probability of 1 (i.e., Pr(Y | X) = 1 for X → Y). A measure is considered optimal if it reaches its maximum achievable value for a “perfect” rule. Which of these measures are optimal? You should prove or disprove the statement for each. You can ignore 0/0 cases but not other cases.
Now let’s move to the implementation part. You are required to create a program named q1.py, and use it to answer the following questions: (d) and (e).

We will use the provided dataset q1.txt to apply the association rule mining technique to make recommendations. In this dataset, each line represents a session of a user, while the user id is not known. Within each session, there are several item IDs (strings of length 8) that the user viewed during the session, separated by spaces.

In your q1.py, use the Apriori algorithm to find the items that are frequently viewed together. Assume the support threshold is 100 and fixed. Find the frequent itemsets of sizes 2 and 3.
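As a starting point, the counting passes could look like the sketch below. It assumes q1.txt sits in the working directory and that the support threshold of 100 is an absolute count; both assumptions should be checked against the spec above.

from itertools import combinations
from collections import Counter

SUPPORT = 100

with open('q1.txt') as f:
    sessions = [line.split() for line in f]

# Pass 1: support counts of single items.
c1 = Counter(item for s in sessions for item in set(s))
freq1 = {i for i, n in c1.items() if n >= SUPPORT}

# Pass 2: count only pairs whose members are both frequent (Apriori pruning).
c2 = Counter()
for s in sessions:
    items = sorted(set(s) & freq1)
    c2.update(combinations(items, 2))
freq2 = {p: n for p, n in c2.items() if n >= SUPPORT}

# Pass 3: a candidate triple is counted only if all three of its
# contained pairs are frequent.
c3 = Counter()
for s in sessions:
    items = sorted(set(s) & freq1)
    for t in combinations(items, 3):
        if all(p in freq2 for p in combinations(t, 2)):
            c3[t] += 1
freq3 = {t: n for t, n in c3.items() if n >= SUPPORT}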
(d) (2 Marks)
Given a pair of items (X, Y), assume that the support count of {X, Y} is at least 100. Compute the confidence scores of the following two association rules for all the pairs meeting that condition: X → Y and Y → X.
Sort the rules in decreasing order of confidence score. Additionally, if more than one rule has the same confidence score, sort them in lexicographically increasing order of the left-hand side of the rule. Please report the top 10 rules in your report. Note that your program should output the top 20 rules for a sanity check.
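Continuing the sketch above, the rule generation and the required ordering might be implemented as follows; freq2 and c1 are the pair and singleton counters from the Apriori pass.

rules = []
for (a, b), n_ab in freq2.items():
    rules.append((a, b, n_ab / c1[a]))      # conf(a -> b)
    rules.append((b, a, n_ab / c1[b]))      # conf(b -> a)
rules.sort(key=lambda r: (-r[2], r[0]))     # confidence desc, then lhs lexicographic asc
for lhs, rhs, conf in rules[:20]:
    print(f'{lhs} -> {rhs}: {conf:.4f}')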
(e) (2 Marks)
Now consider a triple of items (X, Y, Z); follow the same instructions as in part (d) and report your results. The association rules you need to consider are:
(X, Y) → Z, (X, Z) → Y, and (Y, Z) → X

Question 2 – Latent Factor Method (5 Marks)
The goal of this question is to implement an algorithm, Stochastic Gradient Descent (SGD), to build a recommender system. You are given a matrix R of ratings containing m items and n users (i.e., the size of R is m × n); R_{iu} represents the rating given by user u to item i. The goal of this problem is to find a user matrix P and an item matrix Q, where the dimension of Q is m × k and the dimension of P is n × k. We define the error as follows:
E = Σ_{(i,u) ∈ ratings} (R_{iu} − R̂_{iu})² + λ ( Σ_u ∥p_u∥² + Σ_i ∥q_i∥² ),  where R̂_{iu} = q_i · p_u⊤,  (1)
where Σ_{(i,u) ∈ ratings} means that we sum only over the pairs (i, u) for which user u has rated item i, i.e., the (i, u) entry of the matrix R is known. q_i denotes the i-th row of the matrix Q, and p_u is the u-th row of the matrix P; q_i and p_u are both row vectors of size k. λ is the regularization parameter, and ∥·∥ denotes the L2 norm.
In the lecture, we’ve discussed how SGD is used for updating the gradient. Now, implement the algorithm in a file named q2.py using the provided dataset q2.txt, subject to the following requirements (a minimal sketch follows the list). Note that q2.txt comes from the same dataset as the one used in Tutorial 2; you may consider using the code from Tutorial 2 Q3 as a starting point.
• You are not allowed to store the matrix R in memory. You have to read each R_{iu} from disk one at a time and apply your update rule within each iteration.
• Your program should read the whole file within each iteration.
• There is no code template provided in this question, you need to implement it from scratch.
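Here is a minimal sketch satisfying the constraints above. Several things are assumptions to verify against the actual file: each line of q2.txt holds an item index, a user index and a rating separated by whitespace; indices are 1-based; and the matrix sizes match MovieLens-100k (1682 items, 943 users). The factor of 2 in the gradient is absorbed into η.

import numpy as np
import matplotlib.pyplot as plt

K, LAM, ITERS = 20, 0.1, 40
ETA = 0.01                                    # the learning rate you must tune
NUM_ITEMS, NUM_USERS = 1682, 943              # assumed sizes; set from the data

rng = np.random.default_rng(0)
# A common initialisation so that initial predictions land near typical ratings.
Q = rng.uniform(0, np.sqrt(5 / K), (NUM_ITEMS, K))
P = rng.uniform(0, np.sqrt(5 / K), (NUM_USERS, K))

errors = []
for it in range(ITERS):
    # One full pass over the file per iteration; R is never held in memory.
    with open('q2.txt') as f:
        for line in f:
            i, u, r = line.split()            # assumed format: "item user rating"
            i, u, r = int(i) - 1, int(u) - 1, float(r)
            eps = r - Q[i] @ P[u]             # prediction error for this entry
            qi_old = Q[i].copy()              # p_u must be updated with the old q_i
            Q[i] += ETA * (eps * P[u] - LAM * Q[i])
            P[u] += ETA * (eps * qi_old - LAM * P[u])
    # Second pass to evaluate E (Eq. 1) with the current P and Q.
    E = LAM * ((P ** 2).sum() + (Q ** 2).sum())
    with open('q2.txt') as f:
        for line in f:
            i, u, r = line.split()
            i, u, r = int(i) - 1, int(u) - 1, float(r)
            E += (r - Q[i] @ P[u]) ** 2
    errors.append(E)

plt.plot(range(1, ITERS + 1), errors)
plt.xlabel('iteration')
plt.ylabel('E')
plt.savefig('q2_error.png')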
Now, let’s fix k = 20, λ = 0.1 and the number of iterations to 40. You are tasked with finding an optimal value for the learning rate η. With an ideal η, you should observe the following: E < 65,000 after 40 iterations, with both q_i and p_u having stopped changing.

Find the value of η that you believe is ideal. Based on the η you find, please plot the corresponding value of the objective function E (defined in Eq. 1) over the number of iterations (i.e., the value of E on the y-axis and the iteration on the x-axis).

Question 3 - Collaborative Filtering with Naive Bayes Classifier (5 Marks)

Collaborative filtering (CF) can be viewed as a generalization of classification tasks when there are a few distinct unordered discrete ratings. In this question, you will be required to work with collaborative filtering based on the Naive Bayes classifier (NBC). Built upon Bayes’ theorem, NBC is a simple yet effective supervised classification algorithm with a conditional independence assumption. Concretely, consider a K-class classification problem where an instance x = {x_i}_{i=1:d} is represented by a d-dimensional feature vector. The probability that it belongs to each of the K possible outcomes is given by

    Pr(C_k | x) = Pr(x | C_k) Pr(C_k) / Pr(x),    k = 1...K    (2)

where Pr(x | C_k) is the likelihood, Pr(C_k) is the prior, and Pr(x) is the evidence. Since the evidence Pr(x) does not depend on C_k, in practice we only care about the numerator for the different C_k. Under a “naive” assumption that features are mutually independent conditioned on the class C_k, i.e. Pr(x_i | x_{i+1}, ..., x_d, C_k) = Pr(x_i | C_k), Eq. 2 can be re-written as

    Pr(C_k | x) ∝ Pr(x | C_k) Pr(C_k)
                ∝ Pr(x, C_k)
                ∝ Pr(x_1 | x_2, ..., x_d, C_k) Pr(x_2 | x_3, ..., x_d, C_k) ··· Pr(x_d | C_k) Pr(C_k)
                ∝ Pr(x_1 | C_k) Pr(x_2 | C_k) ··· Pr(x_d | C_k) Pr(C_k)
                ∝ Π_{i=1}^{d} Pr(x_i | C_k) Pr(C_k)    (3)

Now let’s take a look at a concrete recommendation scenario. Assume there is a rating matrix R ∈ ℝ^{m×n} associated with a user set U = {u_l}_{l=1:m} and an item set I = {i_l}_{l=1:n}, consisting of m users and n items, respectively. The (u, i)-th entry of R represents the rating given by the u-th user to the i-th item, denoted by r_{u,i} ∈ {0, 1}, i.e., a rating is a binary random variable. In this case, estimating absent ratings is cast as a classification task. As usual, one can perform either user-based CF or item-based CF, with the difference being how the prior and likelihood are computed. You may follow the steps below to perform CF with NBC.

1. Prior and Likelihood Computation. In item-based CF, we compute both the prior and the likelihood from the item perspective. Specifically, let Pr(r_{·,i} = k) be the prior probability of item i being rated as k, defined as

    Pr(r_{·,i} = k) = (card{u ∈ U | r_{u,i} = k} + α) / (card{u ∈ U | r_{u,i} ≠ null} + α · card{R})    (4)

where card{·} counts the number of elements in a set, card{R} represents the number of plausible rating values (i.e., non-empty and valid values) in the rating matrix, null denotes the absence of a rating, and α is the Laplacian smoothing parameter used to handle over-fitting. Let Pr(r_{·,j} = k′ | r_{·,i} = k) be the probability of item j being rated as k′ conditioned on r_{·,i} = k; then

    Pr(r_{·,j} = k′ | r_{·,i} = k) = (card{u ∈ U | r_{u,j} = k′ ∧ r_{u,i} = k} + α) / (card{u ∈ U | r_{u,j} ≠ null ∧ r_{u,i} = k} + α · card{R})    (5)
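As a concrete reading of Eq. 4 and Eq. 5, here is a hedged NumPy sketch on a dense matrix R that uses np.nan for missing ratings; q3.py may represent R differently, so the masking would need adapting.

import numpy as np

ALPHA = 0.1
RATINGS = (0.0, 1.0)       # plausible rating values; card{R} = len(RATINGS)

def prior_item(R, i, k):
    # Eq. 4: Pr(r_{.,i} = k) with Laplacian smoothing.
    col = R[:, i]
    rated = ~np.isnan(col)                   # users with a non-null rating of i
    return ((col == k).sum() + ALPHA) / (rated.sum() + ALPHA * len(RATINGS))

def likelihood_item(R, j, k_prime, i, k):
    # Eq. 5: Pr(r_{.,j} = k' | r_{.,i} = k) with Laplacian smoothing.
    num = ((R[:, j] == k_prime) & (R[:, i] == k)).sum() + ALPHA
    den = (~np.isnan(R[:, j]) & (R[:, i] == k)).sum() + ALPHA * len(RATINGS)
    return num / den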
As for user-based CF, the prior and likelihood should be considered from the user perspective. Analogously, define Pr(r_{u,·} = k) as the prior probability that user u rates an arbitrary item with value k:

    Pr(r_{u,·} = k) = (card{i ∈ I | r_{u,i} = k} + α) / (card{i ∈ I | r_{u,i} ≠ null} + α · card{R})    (6)

Meanwhile, let Pr(r_{v,·} = k′ | r_{u,·} = k) be the probability that user v rates an item as k′ conditioned on r_{u,·} = k:

    Pr(r_{v,·} = k′ | r_{u,·} = k) = (card{i ∈ I | r_{v,i} = k′ ∧ r_{u,i} = k} + α) / (card{i ∈ I | r_{v,i} ≠ null ∧ r_{u,i} = k} + α · card{R})    (7)

2. Rating Estimation. Using the prior and the factors of the likelihood, we can derive the classification score Pr(r_{u,i} = k) of user u rating item i with a specific value k:

    Pr(r_{u,i} = k) ∝ Pr(r_{·,i} = k) Π_{j ∈ I_u} Pr(r_{·,j} = r_{u,j} | r_{·,i} = k),    if item-based
    Pr(r_{u,i} = k) ∝ Pr(r_{u,·} = k) Π_{v ∈ U_i} Pr(r_{v,·} = r_{v,i} | r_{u,·} = k),    if user-based    (8)

where I_u = {i ∈ I | r_{u,i} ≠ null} collects the items that user u has rated, and U_i = {u ∈ U | r_{u,i} ≠ null} is the set of users who have rated item i. Finally, the predicted rating is the one that maximizes the classification score:

    r̂_{u,i} = argmax_k Pr(r_{u,i} = k | R)    (9)

(a) (2 Marks)
Assume an interaction matrix R describing 5 users and 5 items, where each entry is a binary value denoting implicit feedback. Please derive a user-based CF prediction for R̂_{2,1} and an item-based CF prediction for R̂_{4,5}, following the NBC strategies above.

        0 ? 1 0 1
        ? 1 0 ? 1
    R = 0 0 1 1 1
        1 0 1 ? ?

For each entry, you are supposed to show the classification score of each possible rating, as well as the prediction (w.r.t. Eq. 8 and Eq. 9). You can directly use the multiplication result as the score (i.e., replace ∝ with = in Eq. 8). The Laplacian smoothing coefficient here is α = 0.1. Your answer should show how you derive the results.

(b) (3 Marks)
Upon knowing how user-based CF and item-based CF work, you can also combine them into a hybrid one. It goes without saying that considering them both increases the use of evidence from R. As such, the classification score of r_{u,i} = k becomes the weighted product of both sides:

    Pr(r_{u,i} = k) ∝ [ Pr(r_{·,i} = k) Π_{j ∈ I_u} Pr(r_{·,j} = r_{u,j} | r_{·,i} = k) ]^{1/(1+card{U_i})}
                    · [ Pr(r_{u,·} = k) Π_{v ∈ U_i} Pr(r_{v,·} = r_{v,i} | r_{u,·} = k) ]^{1/(1+card{I_u})}    (10)

where U_i and I_u represent the set of users who have rated item i, and the set of items that have been rated by user u, respectively.

Now, given another rating matrix R to be read from the file q3.csv, you are tasked with implementing a hybrid CF that predicts the rating of a specified missing entry. Given that the prediction performed by the hybrid CF (w.r.t. Eq. 10) is simply a weighted product of user-based CF and item-based CF, you should implement those as well in your program. For each specified query, your program should be able to estimate the classification score for each possible rating value, and predict the rating value accordingly. A sketch of the hybrid score follows the list below.

• You are provided with an example dataset q3.csv and the starter code q3.py, where the helper functions are placed (DO NOT ALTER THEM).
• Complete the program ONLY within the indicated area and focus on the core functionalities. You should follow the instructions inside q3.py.
• Do NOT hardcode the results, since the testing data may differ from the provided example dataset.
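Building on the two helper functions in the earlier sketch, one possible implementation of the hybrid score (Eq. 10) and the final prediction (Eq. 9) follows. It exploits the fact that the user-based branch of Eq. 8 equals the item-based branch evaluated on the transposed matrix with u and i swapped; again this is a sketch, not the required q3.py structure.

import numpy as np

def item_score(R, u, i, k):
    # Item-based branch of Eq. 8: prior times the likelihood factors over
    # the items that user u has rated.
    score = prior_item(R, i, k)
    for j in np.flatnonzero(~np.isnan(R[u, :])):
        if j != i:
            score *= likelihood_item(R, j, R[u, j], i, k)
    return score

def hybrid_score(R, u, i, k):
    # Eq. 10: weighted product of the two branches. The user-based branch
    # is the item-based branch computed on R transposed, with u and i swapped.
    U_i = int((~np.isnan(R[:, i])).sum())    # users who rated item i
    I_u = int((~np.isnan(R[u, :])).sum())    # items rated by user u
    return (item_score(R, u, i, k) ** (1 / (1 + U_i))
            * item_score(R.T, i, u, k) ** (1 / (1 + I_u)))

def predict(R, u, i, ratings=(0.0, 1.0)):
    # Eq. 9: return the rating value with the highest classification score.
    return max(ratings, key=lambda k: hybrid_score(R, u, i, k))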
Question 4 - User-item bipartite graph in Recommendation (15 Marks)

Let’s consider a common data structure: the graph. We define a user-item bipartite graph as proposed by Li and Chen (2013). The user-item bipartite graph in this question is split into two separate and disjoint sets U and I, representing users and items, respectively. In a bipartite user-item graph, an edge indicates that user u likes item i or that user u provides positive feedback to item i. We also have the rating matrix R, where each row represents a user and each column corresponds to an item. If user u likes item i then R_{u,i} = 1, otherwise R_{u,i} = 0. We assume that there are m users and n items. Define an m-by-m diagonal matrix P, where each diagonal element is the degree of the corresponding user node. Similarly, each diagonal element of an n-by-n diagonal matrix Q is the degree of the corresponding item node. See the following figure for an example.

Figure 1: Example of a bipartite graph

(a) (2 Marks)
Now we define a user-related matrix T = RR⊤. Describe what T_ii and T_ij (i ≠ j) represent, respectively, in the above-mentioned bipartite graph.

(b) (3 Marks)
Now we define an item similarity matrix S_I, with a size of n × n. Each element represents the pairwise cosine similarity between items. For example, S_ij, the element in row i and column j, is the cosine similarity between item i and item j. Show that S_I = Q^{−1/2} R⊤ R Q^{−1/2}. Hint: row i and column j of S_I correspond to column i and column j of the matrix R, respectively.

Now consider the user similarity matrix S_U, with a size of m × m. Please find the representation of S_U in terms of R, P and Q. Your answer should show how you derive the expressions.

(c) (4 Marks)
Let’s define the final recommendation matrix M, with a size of m × n, such that M(i,j) = r_{i,j}. Find M for both the item-based and the user-based collaborative filtering approaches (assume cosine similarity is used), using R, P and Q. Your final answer should describe operations at the matrix level, not specific terms of matrices. Differently from the lecture, we will use a simplified version of the rating prediction for user u and item s, r_{u,s}:

    r_{u,s} = Σ_{x ∈ users} R_{x,s} · sim(x, u)    and    r_{u,s} = Σ_{x ∈ items} R_{u,x} · sim(x, s)

Your answer should show how you derive the expressions.
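The identity in part (b) is easy to sanity-check numerically. The following sketch (not required by the spec) compares Q^{-1/2} R⊤ R Q^{-1/2} against directly computed pairwise cosine similarities on a small random binary matrix; the first row is set to ones only so that every column is non-empty and Q is invertible.

import numpy as np

rng = np.random.default_rng(1)
R = rng.integers(0, 2, (6, 4)).astype(float)
R[0] = 1.0                                   # keep every column non-empty

d = R.sum(axis=0)                            # item degrees: the diagonal of Q
S_I = np.diag(d ** -0.5) @ R.T @ R @ np.diag(d ** -0.5)

norms = np.linalg.norm(R, axis=0)            # for 0/1 columns, sqrt(degree)
cosine = (R.T @ R) / np.outer(norms, norms)
assert np.allclose(S_I, cosine)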
(d) (6 Marks)
Now, let’s apply these methods to a real application. You are given a dataset that contains information about TV shows. More precisely, for 9,985 users and 563 popular TV shows, we know whether a given user watched a given show over a 3-month period. You are provided with two files, user-shows.txt and shows.txt:

– user-shows.txt: This is the ratings matrix R, where each row corresponds to a user and each column corresponds to a TV show. R_ij = 1 if user i watched show j over the three-month period. The columns are separated by a space.
– shows.txt: This file contains the titles of the TV shows, in the same order as the columns of R.

We will compare the user-based and item-based collaborative filtering recommendations for the 500-th user of the dataset; let’s call him Bob. In order to do so, we have erased the first 100 entries of Bob’s row in the matrix and replaced them with 0s. This means that we don’t know which of the first 100 shows Bob has watched. Based on Bob’s behaviour on the other shows, we will give Bob recommendations on the first 100 shows. We will then see if our recommendations match what Bob had in fact watched. Here are the tasks (a sketch of one possible pipeline follows the note below):

(1) Compute P and Q. (2 Marks)
(2) Based on your results about M in part (c), compute M for user-based collaborative filtering. Let S denote the set of the first 100 shows (the first 100 columns of the matrix). What are the five TV shows in S that have the highest similarity scores for Bob? Choose the show with the smaller index if similarity scores are tied between two shows. Do not write the indices of the TV shows; write their names using the file shows.txt. (2 Marks)
(3) Compute the matrix M for item-based collaborative filtering. From all the TV shows in S, which are the five that have the highest similarity scores for Bob? In case of ties between two shows, choose the one with the smaller index. Again, report the names of the shows and their similarity scores. (2 Marks)

Note: Write a program stored in a file named q4.py. Your program q4.py should produce the following four outputs, in order:
– Matrix P
– Matrix Q
– The top 5 items for Bob with the highest similarity scores using user-based collaborative filtering
– The top 5 items for Bob with the highest similarity scores using item-based collaborative filtering.

Matrices can be printed as a numpy.ndarray. If your code cannot produce the correct output (i.e., the same as in your report), you will get 0 for this part.
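A possible end-to-end pipeline for part (d) is sketched below. It assumes 0-based indexing (so the 500-th user is row 499) and the matrix forms M = S_U R (user-based) and M = R S_I (item-based) that part (c) asks you to derive; verify both assumptions against your own derivation.

import numpy as np

R = np.loadtxt('user-shows.txt')
with open('shows.txt') as f:
    shows = [line.strip() for line in f]
bob = 499                                    # assumed 0-based row of the 500-th user

du, di = R.sum(axis=1), R.sum(axis=0)        # diagonals of P and Q
# np.diag(du) would materialise the full 9985 x 9985 P (~800 MB);
# printing the diagonal conveys the same information.
print(du)                                    # matrix P (diagonal)
print(np.diag(di))                           # matrix Q (563 x 563)

pu = np.where(du > 0, du ** -0.5, 0.0)       # diagonal of P^{-1/2}
qi = np.where(di > 0, di ** -0.5, 0.0)       # diagonal of Q^{-1/2}

# User-based: M = S_U R with S_U = P^{-1/2} R R^T P^{-1/2}. Only Bob's row
# of M is needed, which avoids materialising the 9985 x 9985 S_U.
Rn = pu[:, None] * R
scores_user = (Rn[bob] @ Rn.T) @ R

# Item-based: M = R S_I with S_I = Q^{-1/2} R^T R Q^{-1/2} (part (b)).
S_I = (R * qi).T @ (R * qi)
scores_item = R[bob] @ S_I

for scores in (scores_user, scores_item):
    # Stable sort on negated scores keeps the smaller index first on ties.
    top5 = np.argsort(-scores[:100], kind='stable')[:5]
    print([(shows[j], float(scores[j])) for j in top5])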

Question 5 - Locality Sensitive Hashing for ANN Search (15 Marks)

In Lecture 4, we’ve discussed LSH and ANN. Now, we would like to further study how to use LSH to conduct the ANN search. Assume we have a dataset A consisting of n points. Let’s define a distance metric d(·, ·) and a constant c > 1. Now define the (c, λ)-ANN problem as follows: given an arbitrary query data point z, assuming there exists a point x such that d(x, z) ≤ λ, return a point y from the dataset such that d(y, z) ≤ cλ. The point y is known as a (c, λ)-ANN. Hence, c here represents the maximum allowable approximation factor.
Now let us consider an LSH family H that is (λ, cλ, p1, p2)-sensitive with respect to the distance measure d(·,·). Let G = H^k = {g = (h_1, h_2, ..., h_k) | h_i ∈ H, ∀ 1 ≤ i ≤ k}, where k = log_{1/p2}(n). Now we have the following procedure (a code sketch follows the list):
• Assume we have N = n^p random members of G: g_1, g_2, ..., g_N, with p = log(1/p1) / log(1/p2).
• Hash all the data points as well as the query point using g_i (i ∈ [1, N]).
• Retrieve (at most) 3N data points from the N buckets to which the query point hashes. (Note: in some cases we may not be able to retrieve 3N data points; in that case, fetch all the candidate data points.)
• Select the data point closest to the query point among your (at most) 3N retrieved data points. Return it as the (c, λ)-ANN.
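The following sketch illustrates the four steps. The concrete family H is not fixed by the text above, so random-hyperplane (cosine) hashes and Euclidean distance are used purely as stand-ins for H and d(·,·); A is an (n, d) array of data points and z a query vector.

import numpy as np

def build_tables(A, k, N, rng):
    # One hash table per g_j; each g_j concatenates k random-hyperplane hashes.
    tables = []
    for _ in range(N):
        W = rng.standard_normal((k, A.shape[1]))
        keys = (A @ W.T > 0).astype(np.int8)
        table = {}
        for idx, key in enumerate(map(bytes, keys)):
            table.setdefault(key, []).append(idx)
        tables.append((W, table))
    return tables

def ann_query(A, z, tables):
    # Steps 2-4: hash z, pool at most 3N candidates, return the closest.
    N = len(tables)
    candidates = []
    for W, table in tables:
        key = bytes((W @ z > 0).astype(np.int8))
        candidates.extend(table.get(key, []))
        if len(candidates) >= 3 * N:
            candidates = candidates[:3 * N]
            break
    if not candidates:
        return None
    dists = np.linalg.norm(A[candidates] - z, axis=1)   # Euclidean stand-in for d
    return candidates[int(np.argmin(dists))]            # the (c, lambda)-ANN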
Before we proceed with the implementation, we would like to prove that the above procedure can produce a correct answer with a certain probability.
(a) (2 Marks)
Let’s define two sets: B_j = {x ∈ A | g_j(x) = g_j(z)} where j ∈ [1, N], and C = {x ∈ A | d(x, z) > cλ}. Prove that:

    Pr[ Σ_{j=1}^{N} |B_j ∩ C| ≥ 3N ] ≤ 1/3

Hint: Markov’s inequality.

(b) (3 Marks)
Let y ∈ A be a point such that d(y, z) ≤ λ. Prove that:

    Pr[ g_j(y) ≠ g_j(z), ∀ j ∈ [1, N] ] < 1/e