BU.450.760 Technical Document T7.1 – RecSys Prof.
R Implementation of a Movie RecSys
This is a companion document for the script "S7.1 R Implementation of a Movie RecSys.R".
1. Install + load “recommenderlab” package
The software that we will use to implement the RecSys is contained in the package "recommenderlab." The first step in our analysis is to install and load it. Line 19 below is the command that installs the package. Currently (in the code below), this instruction is commented out on the assumption that the package has already been installed. If this is not the case for your machine, you should uncomment and run line 19. Line 20 loads the package, that is, it activates the software in the R environment. Having loaded the package, R will be able to execute instructions that use functions programmed under the package.
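A minimal sketch of this step is shown below (the package name is taken from the text; uncomment the install line the first time you run the script on a given machine).

# install.packages("recommenderlab")   # run once per machine, then comment out again
library(recommenderlab)                # load the package into the R session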
2. Loading the ratings data
Our RecSys implementation utilizes movie rating data. These data are contained in "D7.1 Movie Ratings.csv". The screenshot below is taken from the Excel rendition of the data. Userid indexes the users who provide the ratings, movieid indexes the movies ("D7.2 Movie Titles.csv" lists the title associated with each entry of movieid), and rating records the rating provided. Thus, for example, userid number 1 rated movie 31 with a 2.5. The same individual rated movie 1029 with a 3 and movie 1129 with a 2. As will be shown below, there is a large number of ratings (over 100,000) available for our analysis. The variable timestamp reflects the time at which the rating was provided. This variable will not play a role in our analysis (we could remove it).
Having familiarized ourselves with the structure of the available data, the next step is to load it into R. This step is accomplished by line 26. After running this line, the data frame that we are calling "ds" contains our data. The screenshot of the environment shown below illustrates the size of our data: 100,004 observations of 4 variables.
The methods contained in the "recommenderlab" package do not directly apply to data frames like ds. Instead, these methods operate on objects that have been declared as a rating matrix. In other words, we need to notify R that an object over which we are going to perform RecSys methods is indeed an object over which those methods work. This is what we do in line 27 – we declare the data frame "ds" as a "realRatingMatrix" (ie, a rating matrix in which ratings are real numbers). The newly declared object is named "rm". All our operations will now be performed on rm.
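The following sketch illustrates these two steps, assuming the first three columns of the csv are userid, movieid, and rating (the timestamp column is dropped before the coercion; the course script may handle it slightly differently).

ds <- read.csv("D7.1 Movie Ratings.csv")     # load the raw ratings data
rm <- as(ds[, 1:3], "realRatingMatrix")      # declare the user/item/rating triplets as a rating matrix
rm                                           # prints the dimensions of the rating matrix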
3. Brief data inspection
It is always a good idea (one may even say, necessary) to familiarize ourselves with the data. We have already done some of this by looking directly at the csv file in the Excel viewer. We can also do this by inspecting the ratings matrix.
In line 31 we use the "head" command to do this. While its behavior varies somewhat depending on what the analyzed object is, the "head" command generally lists the information associated with the first few observations. Note that here we are deliberately expressing the ratings matrix as a data frame (that is, we are going back to the data's original structure). The screenshot below shows the output obtained when running line 31: the information associated with the first 6 observations (all of which belong to the individual associated with userid number 1).
A perhaps more informative rendition of the data is provided by the command of line 32. Instead of expressing the data as a data frame, we are printing it as a ratings matrix. The benefit of this approach is that it is easier to appreciate how small the set of movies each user actually rates is. In what we get to see printed, it is easy to see that the vast majority of user/movie pairs correspond to "missing" rating entries (as in the lecture slides, rows correspond to users and columns to movies).
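A sketch of the two inspection approaches described above (the exact subsetting used in the course script may differ):

head(as(rm, "data.frame"))        # first few observations, back in data-frame form
as(rm, "matrix")[1:10, 1:10]      # a corner of the ratings matrix; most entries are NA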
One other important aspect of the data is the distribution of ratings. This distribution is easy to inspect through a histogram. The code of line 33 does this. Note that the histogram command "histogram()" is applied to a vector containing all ratings, which are extracted from rm via "getRatings(rm)". Alternatively, we could have fetched the data directly from the raw data, as "histogram(ds$rating)". The figure below shows the result. Ratings range between 0 and 5 and are not restricted to natural numbers – they also include decimal ratings (eg, the 2.5 we saw for userid number 1).
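A sketch of this histogram; getRatings() extracts all non-missing ratings from rm as a single vector, and histogram() is provided by the lattice package (loaded explicitly here in case it is not already attached).

library(lattice)                  # provides the histogram() command
histogram(getRatings(rm))         # distribution of all observed ratings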
The last descriptive analysis that we will perform aims at computing the number of ratings provided by each individual (line 34). Again we use a histogram, but we feed it a different rendition of the data. In particular, we feed the histogram a vector containing, for each individual, the number of provided ratings. The input "rowCounts(rm)" generates this object (ie, a count of the non-missing entries in each row of rm). The second argument used for the "histogram" command in line 34, "breaks=500", simply instructs R to draw this histogram with 500 bars. We have used this option to illustrate the heterogeneity in the data, but also the capabilities built into the "histogram" command.
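A sketch of the ratings-per-user histogram described above:

histogram(rowCounts(rm), breaks = 500)   # number of non-missing ratings per user, drawn with 500 bars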
4. Creating an evaluation scheme
A significant advantage of the "recommenderlab" package is that it allows us to estimate/fit and evaluate the predictive performance of several RecSys specifications through a small number of instructions. Specifically, we can do this by relying on an "evaluation scheme."
For the purposes at hand, an "evaluation scheme" in practice corresponds to an instruction by which we "tell" R to evaluate a list of recommenders using a training/validation data split. This instruction is written in line 43 of the code below. Here, we are "telling" R to fit a list of recommenders (the list will be specified later) to the ratings contained in rm (first input). The second input indicates that the evaluation is carried out through a training/validation data split; the third argument, that 60% of the data should be allocated to the training sample. From the lecture slides, recall that the training/validation split is implemented at the individual level. That is, the split is produced by successively allocating each individual's entire set of available ratings to either the training or the validation sample.
The fourth input of evaluationScheme, "given=5", is the number of ratings of each individual assigned to the validation sample that will be used to generate predictions. That is, for each individual in the validation sample, the evaluation protocol will choose 5 random ratings (in lecture slide 27, the parameter "given" is assumed to equal 2 – the two randomly selected ratings shown in yellow for each individual). Using these 5 ratings, the evaluated recommenders will generate predictions for the individual's remaining ratings, from which a prediction error is then computed. In other words, "given" controls the number of ratings that will play the role of "X".
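A minimal sketch of the evaluation scheme described above (the object name "scheme" is an assumption; the course script may use a different one):

scheme <- evaluationScheme(rm,
                           method = "split",   # single training/validation split
                           train  = 0.6,       # 60% of the data goes to the training sample
                           given  = 5)         # 5 ratings per validation user play the role of "X"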
Having specified the evaluation scheme, the next step is to craft a list of recommenders to evaluate. We are specifying this list in lines 48-53 and storing it as an object that we are calling "compared_recommenders". In addition to UBCF (line 51), we consider two additional recommenders solely for benchmarking: "random items" (makes random recommendations) and "popular items" (recommends the most popular items). Although both of these recommenders provide recommendations that vary across individuals, they can hardly be thought of as "personalized". Indeed, for "random items" recommendations vary from one individual to another because recommendations are picked at random. In turn, the "popular items" recommender recommends to each individual the most popular (ie, highest rated) items that the individual has not previously rated. Thus, in this last case, differences in recommendations across individuals come solely from the fact that some individuals may have already rated some of the most popular items.
As you can see in line 52, we could very easily include an IBCF recommender in our evaluation. The provided code excludes this recommender (by commenting out the line) solely because of its high computational demands – it would take significant time to fit on our dataset. You can include it in the evaluation simply by removing the comment sign. A comprehensive list of recommenders included in the package can be obtained by typing:
recommenderRegistry$get_entries(dataType = "realRatingMatrix")
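A sketch of the list of compared recommenders described above (names follow the text; the IBCF entry is commented out because of its computational cost):

compared_recommenders <- list(
  "random items"  = list(name = "RANDOM",  param = NULL),   # random recommendations (benchmark)
  "popular items" = list(name = "POPULAR", param = NULL),   # most popular items (benchmark)
  "UBCF"          = list(name = "UBCF",    param = NULL)    # user-based collaborative filtering
  # , "IBCF"      = list(name = "IBCF",    param = NULL)    # uncomment to include IBCF
)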
It is worth noting that until now we have not executed any calculation – we have merely specified the "rules" for the evaluation and the objects to be evaluated. The actual evaluation is instructed by the code of line 57, which is self-explanatory (the console screenshot below shows what the printed output will look like). We are storing the evaluation results in an object that we are calling "eval_results."
The primary output of the evaluation is obtained by plotting this object. This is instructed in line 58. The resulting plot is shown on the next page. (The second input for "plot", "ylim = c(0,4)", controls the scale of the y axis – this input needs to be set by hand, aiming to maximize the figure's illustrative power.)
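A sketch of these two instructions; type = "ratings" requests rating-prediction error metrics (RMSE, MSE, MAE), which is what the plot discussed next displays.

eval_results <- evaluate(scheme, compared_recommenders, type = "ratings")   # fit and evaluate all recommenders
plot(eval_results, ylim = c(0, 4))                                          # ylim set by hand for readability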
This plot shows three different metrics describing predictive performance: root mean squared error (RMSE), mean squared error (MSE), and mean absolute error (MAE), which will coincide (ie, provide the same ranking) in the vast majority of cases. The metric that we have discussed in class – the total sum of squared errors (TSE) – essentially carries the same information as MSE and RMSE (TSE = N*MSE = N*(RMSE)^2, with N = number of observations). MAE differs from these in that it is based on the absolute value of prediction errors rather than their square. Since all of these metrics are based on essentially the same construct (ie, prediction errors), it is no surprise that they all result in the same ranking of recommenders.
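The relationships among these metrics can be checked with a small illustrative computation on a made-up vector of prediction errors (hypothetical numbers, not taken from our data). RMSE, MSE, and TSE are monotone transformations of one another; MAE uses absolute rather than squared errors but typically ranks recommenders the same way.

err  <- c(0.5, -1.2, 0.3, 2.0)    # hypothetical prediction errors
mse  <- mean(err^2)               # mean squared error
rmse <- sqrt(mse)                 # root mean squared error
mae  <- mean(abs(err))            # mean absolute error
tse  <- sum(err^2)                # equals length(err) * mse = length(err) * rmse^2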
The obtained results show that the "random items" recommender performs the worst. The "popular items" and UBCF algorithms perform about the same –both significantly better than "random items"– with a slight advantage for the former. (This likely stems from the fact that individuals in our sample all tend to rate popular items highly.) Given that the performance of the "popular items" and UBCF recommenders is so similar, our decision should be based on whether we would like to have personalized recommendations (UBCF) or recommendations that are largely the same for everyone ("popular items").
5. Producing recommendations for a chosen algorithm
Our results above show that the "popular items" and UBCF recommenders perform about the same, and thus our choice between them hinges on criteria beyond the magnitude of prediction errors. To illustrate the procedure, we will move forward assuming that we select UBCF. The value of choosing this recommender resides in the fact that it will produce personalized recommendations, as opposed to "popular items," which will produce largely the same ones for everyone.
To produce the personalized recommendations, we will start from scratch, ie, by first clearing the workspace and re-loading the data. These basic steps are performed by lines 64-66. Code line 69 below fits the selected recommender using the full rm dataset. The step is implemented by the "Recommender" command: its first input indicates the training data; the second, the algorithm to be applied. The fitted recommender is stored in an object that we are calling "UBCFrec".
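A minimal sketch of re-loading the data and fitting UBCF on the full rating matrix (object names follow the text; the workspace-clearing step is omitted here):

library(recommenderlab)
ds <- read.csv("D7.1 Movie Ratings.csv")       # re-load the raw ratings
rm <- as(ds[, 1:3], "realRatingMatrix")        # declare the rating matrix again
UBCFrec <- Recommender(rm, method = "UBCF")    # fit UBCF on the full dataset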
Lines 72-73 respectively extract the recommendations for users 1 and 10 (note: line 72 focuses on the ratings stored in row number 1 of matrix rm; line 73, on those stored in row number 10). The output produced by these code lines is shown below. Note that the movie recommendations that we will be making for each individual are quite different. (Recall that you can check the actual titles of recommended movies in file "D7.2 Movie Titles.csv".)
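A sketch of this extraction step; coercing the returned top-N list to a plain list prints the recommended movie ids for each user.

as(predict(UBCFrec, rm[1, ]), "list")    # personalized recommendations for user 1
as(predict(UBCFrec, rm[10, ]), "list")   # personalized recommendations for user 10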
We finalize our analysis by exporting the top-10 recommendations for each individual into a csv file. The value of doing this stems not only from the fact that it may be easier to inspect the recommendations in Excel, but also from the fact that many websites may be designed to operate from information contained in csv files.
The code below implements this task. Line 76 produces the recommendations using the "predict" command (note that with the third input we are requesting only 10 recommendations – R "understands" that these are the top-10). We store these recommendations in an object that we are calling "top10recs".
In line 77 we re-format the produced recommendations into matrix format, which is required when we want to export data into the csv format. The matrix "dd" that stores these results is written into a csv file called "top10recs.csv" in line 78. Once you run this line, the file "top10recs.csv" should appear in your working directory. The screenshot at the bottom of the page is taken after opening this file in Excel. Here, individuals vary across columns and recommendations across rows. Hence, the dimension of the matrix is 10 x N (= number of individuals) – a very "wide" matrix in our case.
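A sketch of these export steps; here sapply() turns the list of per-user recommendations into a 10 x N matrix with users in columns (the course script may use a different coercion to reach the same matrix).

top10recs <- predict(UBCFrec, rm, n = 10)                     # top-10 recommendations for every user
dd <- sapply(as(top10recs, "list"), function(x) x[1:10])      # 10 x N matrix; pads with NA if a user has fewer than 10
write.csv(dd, "top10recs.csv")                                # export to the working directory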