ECE 219: Models and Algorithms
Project 3: Recommender Systems
1 Introduction
The increasing importance of the web as a medium for electronic and business transactions and advertisement, and social media has served as a driving force behind the development of recommender systems technology. Among the benefits, recommender systems provide a means to prioritize data for each user from the infinite information available on the internet. Such systems are critical to ensuring (among others): (a) the detection of hate speech, (b) user retention on a web service, and (c) fast and high-quality access to relevant information. An important catalyst is the ease with which the web enables users to provide feedback about a small portion of the web that they traverse.
Copyright By PowCoder代写 加微信 powcoder
Such user-driven sparse feedback poses the following challenge in the desing of recommender systems: Can we utilize these sparse user datapoints to infer generalized user interests?
We define some terms:
• The entity to which the recommendation is provided is referred to as the user; • The product being recommended is an item.
The basic models for recommender systems works with two kinds of data:
A User-Item interactions such as ratings (a user (you) provides ratings about a movie (item));
B Attribute information about the users and items such as textual profiles or relevant key- words (deep representations about a user or item).
Models that use type A data are referred to as collaborative filtering methods, whereas models that use type B data are referred to as content-based methods. In this project, we will build a recommendation system using collaborative filtering methods.
2 Collaborative filtering models
Collaborative filtering models use the collaborative power of established user-item interactions to make recommendations about new user-item interactions. In this project we use a ratings database where the user is an audience member who viewed a movie, and the item is the movie being rated.
The main challenge in designing collaborative filtering methods is that the underlying ratings matrices are sparse. Consider this example of a movie application in which users specify ratings indicating their like or dislike of specific movies. Most users would have viewed only a small fraction of the large universe of available movies and as a result most of the ratings are unspecified.
The basic idea of collaborative filtering methods is that these unspecified ratings can be imputed because the observed ratings are often highly correlated across various users and items.
For example, consider two users named John and Molly, who have very similar tastes. If their respective ratings exist within our database and are very similar, then the media recommended to them should likely be similar as well.
For those few scenarios in which only John has rated a movie M, the similarity across other movies to Molly’s preferences should make clear that Molly might also prefer movie M. Thus, most collaborative filtering methods leverage either inter-item correlations or inter-user correlations for the prediction process.
In this project, we will implement and analyze the performance of two types of collaborative filtering methods:
1. Neighborhood-based collaborative filtering: Directly leverages the choices of other users to determine potential items to recommend to the current user.
2. Model-based collaborative filtering: Estimates a joint model from all the user data such that in order to generate a new recommendation, we do not need to use the entire user base and can query (a smaller) model.
In this project, we will build a recommendation system to predict the ratings of movies in the provided dataset. The dataset can be downloaded using the following link: https://drive. google.com/drive/folders/1_JF9plSjE3PAFBuSvUFRkDdftJWo1TFz?usp=sharing.
For the subsequent discussion, we assume that the ratings matrix is denoted by R (you will have to construct this), and it is an m × n matrix containing m users (rows) and n movies (columns). The (i,j) entry of the matrix is the rating by user i for movie j and is denoted by rij. Before moving on to the collaborative filter implementation, we will analyze and visualize some properties of this dataset.
QUESTION 1: Explore the Dataset: In this question, we explore the structure of the data.
A Compute the sparsity of the movie rating dataset:
Sparsity = Total number of available ratings (1) Total number of possible ratings
B Plot a histogram showing the frequency of the rating values: Bin the raw rating values into intervals of width 0.5 and use the binned rating values as the horizontal axis. Count the number of entries in the ratings matrix R that fall within each bin and use this count as the height of the vertical axis for that particular bin. Comment on the shape of the histogram.
C Plot the distribution of the number of ratings received among movies: The X-axis should be the movie index ordered by decreasing frequency and the Y -axis should be the number of ratings the movie has received; ties can broken in any way. A monotonically decreasing trend is expected.
D Plot the distribution of ratings among users: The X-axis should be the user index ordered by decreasing frequency and the Y -axis should be the number of movies the user has rated. The requirement of the plot is similar to that in Question C.
E Discuss the salient features of the distributions from Questions C,D and their implications for the recommendation process.
F Compute the variance of the rating values received by each movie: Bin the variance values into intervals of width 0.5 and use the binned variance values as the horizontal axis. Count the number of movies with variance values in the binned intervals and use this count as the vertical axis. Briefly comment on the shape of the resulting histogram.
4 Neighborhood-based collaborative filtering
The basic idea in neighborhood-based methods is to use either user-user similarity or item-item similarity to make predictions from a ratings matrix. There are two basic principles used in neighborhood-based models:
1. User-based models: Similar users have similar ratings on the same item. Therefore, if John and Molly have rated movies in a similar way in the past, then one can use John’s observed ratings on the movie Terminator to predict Molly’s rating on this movie. Item is kept constant.
2. Item-based models: Similar items are rated in a similar way by the same user. Therefore, John’s ratings on similar science fiction movies like Alien and Predator can be used to predict his rating on Terminator. User is kept constant.
In this project, we will only implement user-based collaborative filtering (implementation of item-based collaborative filtering is very similar).
4.1 User-based neighborhood models
In this approach, we are trying to find a set of users similar in their rating strategy to a target user. This results in a user-based neighborhood and we will use the majority vote within the neighborhood to provide recommendations.
In order to determine the neighborhood of the target user u, their similarity to all the other users is computed. Therefore, a similarity function needs to be created between each pair of the historical rating patterns – one by each user across the movies. In this project, we will use the Pearson-correlation coefficient to compute this similarity as a correlation.
4.2 Pearson-correlation coefficient
The Pearson-correlation coefficient between users u and v denoted by Pearson(u,v) captures the similarity between the rating vectors of users u and v. First some notation:
• Iu : Set of item indices for which ratings have been specified by user u; • Iv : Set of item indices for which ratings have been specified by user v; • μu: Mean rating for user u computed using her specified ratings;
• ruk: Rating of user u for item k.
QUESTION 2: Understanding the Coefficient:
A Write down the formula for μu in terms of Iu and ruk;
B In plain words, explain the meaning of Iu ∩ Iv. Can Iu ∩ Iv = ∅? (Hint: Rating matrix R is sparse)
Then, with the above notation, the Pearson-correlation coefficient between a pair of users u and v is defined by equation 2:
k∈Iu∩Iv (ruk − μu)(rvk − μv)
Pearson(u,v) = (2)
k∈Iu∩Iv (ruk − μu)2 k∈Iu∩Iv (rvk − μv)2
4.3 k-Nearest neighborhood (k-NN)
Having introduced a similarity metric between users (as a correlation coefficient between their ratings across movies), we are now ready to define a neighborhood of users. The k-Nearest neighbors of user u, denoted by Pu, is the set of k users with the highest Pearson-correlation coefficient with user u (pairwise).
4.4 Prediction function
The predicted rating that user u might award for item j, denoted by rˆuj, can simply be modeled by equation 3:
v∈ (u,v)(rvj −μv)
rˆuj =μu + v∈Pu |Pearson(u,v)| (3)
QUESTION 3: Understanding the Prediction function: Can you explain the reason behind mean-centering the raw ratings (rvj − μv) in the prediction function? (Hint: Consider users who either rate all items highly or rate all items poorly and the impact of these users on the prediction function.)
4.5 k-NN collaborative filter
The previous sections have equipped you with the basics needed to implement a k-NN col- laborative filter for predicting ratings of the movies. Although we have provided you with the equations needed to write a function for predicting the ratings, we don’t require you to write it. Instead, you can use the built-in python functions for prediction.
4.5.1 Design and test via cross-validation
In this part of the project, you will design a k-NN collaborative filter and test its performance via 10-fold cross validation. In a 10-fold cross-validation, the dataset is partitioned into 10 equal sized subsets. Of the 10 subsets, a single subset is retained as the validation data for testing the filter, and the remaining 9 subsets are used to train the filter. The cross-validation process is then repeated 10 times, with each of the 10-subsets used exactly once as the validation data.
QUESTION 4: Design a k-NN collaborative filter to predict the ratings of the movies in the original dataset and evaluate its performance using 10-fold cross validation. Sweep k (number of neighbors) from 2 to 100 in step sizes of 2, and for each k compute the average RMSE and average MAE obtained by averaging the RMSE and MAE across all 10 folds. Plot average RMSE (Y-axis) against k (X-axis) and average MAE (Y-axis) against k (X-axis).
The functions that might be useful for solving this question are described in these documen- tations 1.
Use Pearson-correlation function as the similarity metric. You can read about how to specify the similarity metric in the documentation: http://surprise.readthedocs.io/en/stable/ similarities.html
QUESTION 5: Use the plot from question 4, to find a ’minimum k’. Note: The term ’minimum k’ in this context means that increasing k above the minimum value would not result in a significant decrease in average RMSE or average MAE. If you get the plot correct, then ’minimum k’ would correspond to the k value for which average RMSE and average MAE converges to a steady-state value. Please report the steady state values of average RMSE and average MAE.
1 http://surprise.readthedocs.io/en/stable/knn_inspired.html, http://surprise.readthedocs.io/en/stable/model_ selection.html#surprise.model_selection.validation.cross_validate
4.6 Filter performance on trimmed test set
In this part of the project, we will analyze the performance of the k-NN collaborative filter in predicting the ratings of the movies in the trimmed test set. The test set can be trimmed in many ways, but we will consider the following trimming options:
• Popular movie trimming: In this trimming, we trim the test set to contain movies that have received more than 2 ratings. If a movie in the test set has received less than or equal to 2 ratings in the entire dataset then we delete that movie from the test set and do not predict the rating of that movie using the trained filter.
• Unpopular movie trimming: In this trimming, we trim the test set to contain movies that have received less than or equal to 2 ratings. If a movie in the test set has received more than 2 ratings in the entire dataset then we delete that movie from the test set and do not predict the rating of that movie using the trained filter.
• High variance movie trimming: In this trimming, we trim the test set to contain movies that have variance (of the rating values received) of at least 2 and has received at least 5 ratings in the entire dataset. If a movie has variance less than 2 or has received less than 5 ratings in the entire dataset then we delete that movie from the test set and do not predict the rating of that movie using the trained filter.
Having defined the types of trimming operations above, now we can evaluate the performance of the k-NN filter in predicting the ratings of the movies in the trimmed test set.
4.6.1 Performance evaluation using ROC curve
Receiver operating characteristic (ROC) curve is a commonly used graphical tool for visualizing the performance of a binary classifier. It plots the true positive rate (TPR) against the false positive rate (FPR).
In the context of recommendation systems, it is a measure of the relevance of the items recommended to the user. Since the observed ratings are in a continuous scale (0-5), so we first need to convert the observed ratings to a binary scale. This can be done by thresholding the observed ratings. If the observed rating is greater than the threshold value, then we set it to 1 (implies that the user liked the item). If the observed rating is less than the threshold value, then we set it to 0 (implies that the user disliked the item). After having performed this conversion, we can plot the ROC curve for the recommendation system in a manner analogous to that of a binary classifier.
QUESTION 6: For EACH of the 3 subsets in the test set, design:
A k-NN collaborative filter to predict the ratings of the movies in the test subset (i.e Popular, Unpopular or High-Variance) and evaluate each of the three models’ performance using 10-fold cross validation:
• Sweep k (number of neighbors) from 2 to 100 in step sizes of 2, and for each k compute the average RMSE obtained by averaging the RMSE across all 10 folds. Plot average RMSE (Y-axis) against k (X-axis). Also, report the minimum average RMSE.
• Plot the ROC curves for the k-NN collaborative filters for threshold values [2.5, 3, 3.5, 4]. For each of the plots, also report the area under the curve (AUC) value.
We provide you with the following hints:
• For each value of k, split the dataset into 10 pairs of training and test sets.
(trainset 1, testset 1), (trainset 2, testset 2), . . . , (trainset 10, testset 10)
The following documentation might be useful for the splitting:
http://surprise.readthedocs.io/en/stable/getting_started.html#use- cross- validation- iterators
• For each pair of (trainset, testset):
– Train the collaborative filter on the train set
– Write a trimming function that takes as input the test set and outputs a trimmed test set
– Predict the ratings of the movies in the trimmed test set using the trained collaborative filter – Compute the RMSE of the predictions in the trimmed test set
• Compute the average RMSE by averaging across all the 10 folds.
• For the ROC plotting, split the dataset into 90% for training and 10% for testing. The functions described in the documentation
below might be useful http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
Model-based collaborative filtering
In model-based collaborative filtering, models are developed using machine learning algorithms to predict users’ rating of unrated items. Some examples of model-based methods include decision trees, rule-based models, bayesian methods, and latent factor models. In this project, we will explore latent factor based models for collaborative filtering.
5.1 Latent factor based collaborative filtering
Latent factor based models can be considered as a direct method for matrix completion. It estimates the missing entries of the rating matrix R, to predict what items a user will most probably like other than the ones they have rated. The basic idea is to exploit the fact that significant portions of the rows and columns of the rating matrix are correlated. As a result, the data has built-in redundancies and the rating matrix R can be approximated by a low-rank matrix. The low-rank matrix provides a robust estimation of the missing entries.
The method of approximating a matrix by a low-rank matrix is called matrix factorization. The matrix factorization problem in latent factor based model can be formulated as an opti- mization problem given by 4
minimize (rij −(UVT)ij)2 (4)
In the above optimization problem, U and V are matrices of dimension m × k and n × k respectively, where k is the number of latent factors. However, in the above setting it is assumed that all the entries of the rating matrix R is known, which is not the case with sparse rating matrices. Fortunately, latent factor model can still find the matrices U and V even when the rating matrix R is sparse. It does it by modifying the cost function to take only known rating values into account. This modification is achieved by defining a weight matrix W in the following manner:
1,rij isknown Wij = 0, rij is unknown
Then, we can reformulate the optimization problem as
minimize Wij(rij −(UVT)ij)2 (5)
Since the rating matrix R is sparse, the observed set of ratings is very small. As a result, it might cause over-fitting. A common approach to address this problem is to use regulariza- tion. The optimization problem with regularization is given by equation 6. The regularization parameter λ is always non-negative and it controls the weight of the regularization term.
minimize Wij(rij −(UVT)ij)2 +λ∥U∥2F +λ∥V∥2F (6)
There are many variations to the unconstrained matrix factorization formulation (equation 6) depending on the modification to the objective function and the constraint set. In this project, we will explore two such variations:
• Non-negative matrix factorization (NMF)
• Matrix factorization with bias (MF with bias)
5.2 Non-negative matrix factorization (NMF)
Non-negative matrix factorization may be used for ratings matrices that are non-negative. As we have seen in the lecture, the major advantage of this method is the high level of inter- pretability it provides in understanding the user-item interactions. The main difference from other forms of matrix factorization is that the latent factors U and V must be non-negative. Therefore, optimization formulation in non-negative matrix factorization is in 7:
by equation 8
minimize Wij(rij −(UVT)ij)2 +λ∥U∥2F +λ∥V∥2F
i=1 j=1 subjectto U≥0,V≥0
There are many optimization algorithms like stochastic gradient descent (SGD), alternating least-squares (ALS),etc for solving the optimization problem in 7. Since you are familiar with the SGD method, we will not describe it here. Instead, we will provide the motivation and main idea behind the ALS algorithm. SGD is very sensitive to initialization and step size. ALS is less sensitive to initialization and step size, and therefore a more stable algorithm than SGD. ALS also has a faster convergence rate than SGD. The main idea in ALS, is to keep U fixed and then solve for V. In the next stage, keep V fixed and solve for U. In this algorithm, at each stage we are solving a least-squares problem.
Although ALS has a faster convergence rate and is more stable, we will use SGD in this project. The main reason behind this is based on the fact that the python package that we will be using to design the NMF-based collaborative filter only has the SGD implementation. This choice would have no effect on the performance of
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com