Midterm: Recommender System for Movies¶
(Note: This midterm assignment will have hidden test cases)¶
In this project, you will implement a recommender system for your classmates, professor and TAs based on the movie survey we have conducted. The movie preference file is at ./data/movie_preference.csv
Recommender System¶
The objective of a recommender system is to recommend relevant items to users based on their preferences. Recommender systems are prevalent in the digital space. For example, when you go shopping on Amazon, you will notice that Amazon recommends products on the front page before you even type anything in the search box. Similarly, when you go on YouTube, the top bar of YouTube is typically “videos recommended to you.” All these features are based on recommender systems.
What item to recommend to which user is arguably the most important business decision on many digital platforms. For instance, YouTube cannot control which videos users upload to it. It cannot control which videos users like to watch. Moreover, since watching videos is free, YouTube cannot change the price of its items. It does not have inventory either, since each video can be viewed any number of times. In this case, what can YouTube control? In other words, what differentiates a good video streaming service from a bad one? The answer is the recommender system.
Types of Recommender Systems¶
There are three types of recommender systems. In this project, we will implement the first one and a simple version of the third.
Popularity-based Recommendation¶
The most obvious system is popularity-based recommendation. This model recommends to a user the most popular items that the user has not previously consumed. In the movie setting, we recommend the movie that the most users have watched and liked. In other words, this system utilizes the “wisdom of the crowds.” It usually provides good recommendations for most people. Since it is easy to implement, people normally use popularity-based recommendation as a baseline. Note: this system is not personalized. If two users have both not watched Movie A and Movie A is the most popular one, both of them will be recommended Movie A.
Content-based Recommendation¶
This recommender system leverages the data of one customer’s historical actions. It first uses a set of features to describe an item (for example, for movies, we can use the movie’s director, main actor, main actress, genre, etc.). When a user comes in, the system recommends the movies that are closest, in terms of these features, to the movies that the user has consumed and liked before. For instance, if a user likes action movies from Nolan the most, this system will recommend another action movie from Nolan that this user has not consumed. Note: we will not implement this system in this project since it requires knowledge of supervised learning. We will come back to this topic at the end of this semester.
Collaborative Filtering Recommendation¶
The last type of recommender system is called collaborative filtering. This approach uses the memory of previous users’ interactions to compute user similarities based on the items they have interacted with (user-based approach) or to compute item similarities based on the users that have interacted with them (item-based approach).
A typical example of this approach is User Neighbourhood-based CF, in which the top-N most similar users (usually computed using Pearson correlation) are selected for a user and used to recommend items that those similar users liked but the current user has not interacted with yet.
Step-0 Read-in the preference file¶
The first exercise is to read in the movie preference csv file (you need to use a relative path).
The function returns two things:
1. A dictionary where the key is the username and the value is a vector of (-1, 0, 1) entries that indicates the user’s preferences across movies (in the order of the csv file).
2. A list of strings that contains movie names. (The order of movie names should be the same as the order in the original csv file)
Note 1: Your result should exactly match the results from the assert statements. This means you should pay attention to extra space, newline, etc.
Note 2: If there are two records with the same name, use the first record from the person.
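The rule in Note 2 can be sketched on a small in-memory CSV rather than the real file. The column layout (a `Name` column followed by one column per movie, with cells already coded as -1, 0, or 1) and the helper name `parse_preferences` are assumptions made for illustration; the actual survey file may be laid out differently and may need extra cleaning.

```python
import csv
import io

def parse_preferences(csv_text):
    """Parse survey rows into (movies, preference), keeping each person's first record."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    movies = header[1:]  # assumed layout: first column is the name, the rest are movies
    preference = {}
    for row in reader:
        name = row[0]
        if name in preference:  # Note 2: a later record with the same name is ignored
            continue
        preference[name] = [int(cell) for cell in row[1:]]
    return movies, preference

# Hypothetical sample data; "Jake" appears twice and only the first row is kept.
sample = "Name,Inception,Coco\nJake,1,-1\nDennis,-1,0\nJake,0,0\n"
movies, preference = parse_preferences(sample)
print(movies)      # ['Inception', 'Coco']
print(preference)  # {'Jake': [1, -1], 'Dennis': [-1, 0]}
```

Checking `name in preference` before storing a row is what makes the first record win; overwriting unconditionally would keep the last one instead.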
In [ ]:
def read_in_movie_preference():
    """Read the movie data, and return a
    preference dictionary."""
    preference = {}
    movies = []
    # YOUR CODE HERE
    raise NotImplementedError()
    return [movies, preference]
In [ ]:
[movies, preference] = read_in_movie_preference()
assert len(movies) == 20
In [ ]:
[movies, preference] = read_in_movie_preference()
assert movies == ['The Shawshank Redemption', 'The Godfather',
                  'The Dark Knight', 'Star Wars: The Force Awakens',
                  'The Lord of the Rings: The Return of the King',
                  'Inception', 'The Matrix', 'Avengers: Infinity War',
                  'Interstellar', 'Spirited Away', 'Coco', 'The Dark Knight Rises',
                  'Braveheart', 'The Wolf of Wall Street', 'Gone Girl', 'La La Land',
                  'Shutter Island', 'Ex Machina', 'The Martian', 'Kingsman: The Secret Service']
In [ ]:
[movies, preference] = read_in_movie_preference()
assert preference["Jacob Scheinman"] == [1, 1, 1, 1, 1, 1, 1, 1, -1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1]
assert preference["Ziqing Ouyang"] == [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
Step-1 Popularity-based Ranking¶
Step-1.1 Compute the ranking of most popular movies¶
Your next task is to take the movie preference data and compute the popularity ranking of movies from the most popular to the least popular. You should return a list where each element represents the popularity ranking of the movies. The order of the list should reflect the order of the movie names in the dataframe.
To compute a movie’s popularity, first count how many times people have liked movies in the entire data set, across all movies. Then count how many times people have disliked movies in the entire data set, across all movies.
Assume that people have liked movies A times in the entire data set and disliked movies B times in the entire data set. The popularity of a movie is then defined as Num_of_People_Like_the_Movie - (A / B) * Num_of_People_Dislike_the_Movie
(We use A/B to normalize the weights of likes and dislikes because if one type of reaction is rare, it deserves more weight. For example, suppose all movies together are liked 1000 times but disliked only once. Then a dislike is a much stronger signal of a movie’s quality than a like.)
Your function should return:
1. A dictionary where the keys are movie names and the values are the corresponding movie popularity scores.
2. A list of movie names sorted from the most popular to the least popular (i.e., in descending order of popularity score). For example, if ‘The Shawshank Redemption’ is the second most popular movie, the second element in the list should be ‘The Shawshank Redemption’.
3. **A** and **B** as defined above.
Note: You may want to use prior functions to help you read data inside this function
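The popularity formula can be illustrated on a tiny, made-up data set. This is a sketch rather than the reference solution: the function name `popularity_scores`, the three-movie data, and the user "Sam" are hypothetical. It assumes 1 means liked, -1 means disliked, and that B is greater than zero.

```python
def popularity_scores(movies, preference):
    """Return (scores, ranked, A, B) using likes - (A / B) * dislikes per movie."""
    likes = [0] * len(movies)
    dislikes = [0] * len(movies)
    for votes in preference.values():
        for i, vote in enumerate(votes):
            if vote == 1:
                likes[i] += 1
            elif vote == -1:
                dislikes[i] += 1
    total_likes = sum(likes)        # A
    total_dislikes = sum(dislikes)  # B, assumed to be non-zero here
    weight = total_likes / total_dislikes
    scores = {m: likes[i] - weight * dislikes[i] for i, m in enumerate(movies)}
    # Most popular first; ties keep the original movie order because sorted() is stable.
    ranked = sorted(movies, key=lambda m: scores[m], reverse=True)
    return scores, ranked, total_likes, total_dislikes

movies = ["Inception", "Coco", "The Dark Knight"]
preference = {"Jake": [1, -1, 0], "Dennis": [-1, 0, 1], "Sam": [1, 1, 1]}
scores, ranked, A, B = popularity_scores(movies, preference)
print(A, B)     # 5 2
print(scores)   # {'Inception': -0.5, 'Coco': -1.5, 'The Dark Knight': 2.0}
print(ranked)   # ['The Dark Knight', 'Inception', 'Coco']
```

Here A / B = 2.5, so the single dislike on Inception outweighs its two likes. The assignment does not say how ties should be broken, so the stable sort is only one possible choice.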
In [ ]:
def movies_popularity_ranking():
    movie_popularity = {}
    movie_popularity_rank = []
    total_likes = 0
    total_dislikes = 0
    # YOUR CODE HERE
    raise NotImplementedError()
    return movie_popularity, movie_popularity_rank, total_likes, total_dislikes
In [ ]:
movie_popularity, movie_popularity_rank, total_likes, total_dislikes = movies_popularity_ranking()
assert total_likes == 1300
assert total_dislikes == 236
In [ ]:
movie_popularity, movie_popularity_rank, total_likes, total_dislikes = movies_popularity_ranking()
assert round(movie_popularity["The Shawshank Redemption"], 2) == 89.02
assert round(movie_popularity["Avengers: Infinity War"], 2) == 191.14
In [ ]:
movie_popularity, movie_popularity_rank, total_likes, total_dislikes = movies_popularity_ranking()
assert movie_popularity_rank == ['Star Wars: The Force Awakens',
                                 'Avengers: Infinity War',
                                 'The Dark Knight',
                                 'The Lord of the Rings: The Return of the King',
                                 'Interstellar',
                                 'The Wolf of Wall Street',
                                 'The Dark Knight Rises',
                                 'Spirited Away',
                                 'La La Land',
                                 'The Martian',
                                 'Kingsman: The Secret Service',
                                 'Coco',
                                 'Inception',
                                 'The Godfather',
                                 'The Matrix',
                                 'Gone Girl',
                                 'Shutter Island',
                                 'Ex Machina',
                                 'The Shawshank Redemption',
                                 'Braveheart']
Step-1.2 Recommendation¶
You will then implement a recommendation function. This function takes in a user’s name and returns a string: the name of the movie that this user has not watched and that has the best popularity ranking (i.e., the lowest ranking number). The movie is recommended only if its popularity score is higher than the average popularity score of the movies that this user has watched (regardless of whether he/she liked or disliked them).
If the user name does not exist, this function should return “Invalid user.”
If the user has watched all movies, this function should return “Unfortunately, no new movies for you.”
If the unwatched movies all have lower popularity scores than the average score of the movies watched by this user, this function should return “Unfortunately, no new movies for you.”
Note: Again, you may want to use prior functions to help you read data and rank movies inside this function
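The decision rule above can be sketched on made-up inputs. The helper name `recommend_popular`, the scores, and the users are hypothetical. Two points are assumptions, not stated in the text: a vote of 0 is taken to mean "not watched", and a user who has watched nothing skips the average check and simply gets the top-ranked movie.

```python
NO_NEW = "Unfortunately, no new movies for you."

def recommend_popular(name, movies, preference, scores, ranked):
    """Recommend the best-ranked unwatched movie if it beats the user's watched average."""
    if name not in preference:
        return "Invalid user."
    votes = dict(zip(movies, preference[name]))
    watched = [m for m in movies if votes[m] != 0]
    unwatched = [m for m in ranked if votes[m] == 0]  # ranked is most popular first
    if not unwatched:
        return NO_NEW
    best = unwatched[0]
    if watched:  # assumption: with nothing watched there is no average to compare to
        average = sum(scores[m] for m in watched) / len(watched)
        if scores[best] <= average:  # "higher than the average" read as strictly higher
            return NO_NEW
    return best

movies = ["Inception", "Coco", "The Dark Knight"]
preference = {"Jake": [1, -1, 0], "Dennis": [-1, 0, 1], "Sam": [1, 1, 1]}
scores = {"Inception": -0.5, "Coco": -1.5, "The Dark Knight": 2.0}
ranked = ["The Dark Knight", "Inception", "Coco"]

print(recommend_popular("Jake", movies, preference, scores, ranked))    # The Dark Knight
print(recommend_popular("Dennis", movies, preference, scores, ranked))  # Unfortunately, no new movies for you.
print(recommend_popular("Ghost", movies, preference, scores, ranked))   # Invalid user.
```

Jake's watched average is -1.0, so The Dark Knight (2.0) qualifies. Dennis's watched average is 0.75, and his only unwatched movie, Coco (-1.5), falls below it.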
In [ ]:
def Recommendation(name):
    recommended_movie = ""
    # YOUR CODE HERE
    raise NotImplementedError()
    return recommended_movie
In [ ]:
assert Recommendation("Jiaxu Rong") == 'Star Wars: The Force Awakens'
assert Recommendation("Nobody") == 'Star Wars: The Force Awakens'
In [ ]:
assert Recommendation("Dennis Zhang") == 'The Lord of the Rings: The Return of the King'
In [ ]:
assert Recommendation("Test Student 2") == 'Invalid user.'
Step-2.1 Cosine Similarity¶
Let us then use collaborative filtering to find the recommendation.
First, we need to get the cosine similarity between users. Again, we can use the preference data that we get in Step 0. In that case, each person is represented by a vector of (0, 1, -1) entries. Cosine similarity in our case is the dot product of the two preference vectors divided by the product of the magnitudes of the two preference vectors. In other words, if person A has preference vector A, and person B has preference vector B, their cosine similarity is equal to
$$ \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^n A_iB_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}$$
If a person has not watched any movies, then the cosine similarity between this person and any other person is defined as 0. For more information on cosine similarity, see the Wikipedia page on cosine similarity.
For example, the following two vectors represent Dennis’ and Jake’s preferences over 3 movies.

| | Inception | Coco | The Dark Knight |
|---|---|---|---|
| Jake | 1 | -1 | 0 |
| Dennis | -1 | 0 | 1 |

In this case, Dennis and Jake’s cosine similarity is equal to
$$ \frac{1 \cdot (-1) + (-1) \cdot 0 + 0 \cdot 1}{\sqrt{1^2+(-1)^2+0^2} \cdot \sqrt{(-1)^2+0^2+1^2}} = \frac{-1}{2} = -0.5$$
Your task is to write a similarity function that takes in two names and returns the cosine similarity between these two people. If one or both names do not exist in the database, return 0.
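As a quick check of the formula, the Jake and Dennis example can be reproduced with a few lines of plain Python. The helper name `cosine_similarity` is hypothetical and works on raw vectors instead of names; it is a sketch of the computation, not the required `Similarity` function.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:  # a person who has not watched any movie
        return 0.0
    return dot / (norm_a * norm_b)

jake = [1, -1, 0]
dennis = [-1, 0, 1]
print(round(cosine_similarity(jake, dennis), 2))     # -0.5
print(cosine_similarity([0, 0, 0], dennis))          # 0.0
```

The explicit zero-norm check avoids a division by zero and matches the convention that a person with no watched movies has similarity 0 with everyone.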
In [ ]:
def Similarity(name_1, name_2):
    """Given two names and preference, get the similarity
    between two people"""
    cosine = 0
    # YOUR CODE HERE
    raise NotImplementedError()
    return cosine
In [ ]:
assert round(Similarity("Test Student", "Nobody"), 2) == 0.17
assert round(Similarity("Test Student", "DJZ2"), 2) == -0.27
In [ ]:
assert round(Similarity("Test Student", "Test Student 2"), 2) == 0
Step-2.2 Movie Soulmate¶
Your next task is to find the movie soulmate for a person. In order to find a person’s movie soulmate, you will compute the cosine similarity between this person and every other person in the data set. You will then return the person who has the highest cosine similarity with the focal person. If two people have the same cosine similarity with the focal person, break the tie by the length of their names (the person with the shorter name will be the soulmate). If the focal person does not exist in the database, return an empty string as the soulmate name.
Your function will return two things:
1. the name of the soulmate
2. the largest cosine similarity
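The selection and tie-break rule can be sketched as a single `min` over a composite key. The helper `pick_soulmate`, the names, and the similarity values are made up for illustration. The input is assumed to map every other person (the focal person already excluded) to a precomputed cosine similarity, and returning -100 for an empty input is an assumption borrowed from the skeleton's initial value.

```python
def pick_soulmate(similarities):
    """Pick the name with the highest similarity; among ties, the shorter name wins."""
    if not similarities:
        return "", -100  # assumption: mirrors the default value in the skeleton below
    # Negating the similarity makes min() prefer the highest value first,
    # then the shortest name among equal similarities.
    best = min(similarities, key=lambda n: (-similarities[n], len(n)))
    return best, similarities[best]

sims = {"Alexandra": 0.8, "Sam": 0.8, "Dennis": -0.5}
print(pick_soulmate(sims))  # ('Sam', 0.8)
```

The tie check relies on exact floating-point equality. Two similarities that differ only by rounding noise would not count as tied in this sketch.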
In [ ]:
def Movie_Soul_Mate(name):
    """Given a name, get the person that has the highest cosine
    similarity with this person."""
    soulmate = ""
    cosine_similarity = -100
    # YOUR CODE HERE
    raise NotImplementedError()
    return soulmate, cosine_similarity
In [ ]:
soulmate, cosine_similarity = Movie_Soul_Mate("Q")
assert soulmate == 'Yunong Tian'
assert round(cosine_similarity, 2) == 0.75
In [ ]:
soulmate, cosine_similarity = Movie_Soul_Mate("Test Student")
assert soulmate == 'Test Student Long Name'
assert round(cosine_similarity, 2) == 0.80
In [ ]:
soulmate, cosine_similarity = Movie_Soul_Mate("Yunong Tian")
assert soulmate == 'Andy Mu'
assert round(cosine_similarity, 2) == 0.81
Step-2.3 Memory-based Collaborative Filtering Recommendation¶
After finding a person’s movie soulmate, we can construct a (very preliminary) collaborative filtering recommendation. In our recommendation system, for a focal person, we first find his or her soulmate. We then find all the movies that the focal person has not watched but the soulmate has watched and liked. Among these movies, we recommend the one with the highest popularity rank as defined in Step 1.1 and 1.2.
Again,
If the user name does not exist, this function should return “Invalid user.”
If the person has watched all the movies, return “Unfortunately, no new movies for you.”
If there are no movies watched and liked by the soulmate but not watched by the focal person, then return the movie (or string) that would be returned in Step 1.2.
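The rules above can be sketched as follows on made-up data. The function `recommend_from_soulmate` and its parameters are hypothetical: the soulmate is passed in directly, `ranked` is assumed to list movies from most to least popular, a vote of 0 is assumed to mean "not watched", and `fallback` stands in for whatever the Step 1.2 function would return.

```python
def recommend_from_soulmate(name, soulmate, movies, preference, ranked, fallback):
    """Recommend the most popular movie the soulmate liked and the user has not watched."""
    if name not in preference:
        return "Invalid user."
    mine = preference[name]
    if 0 not in mine:  # every movie already watched
        return "Unfortunately, no new movies for you."
    theirs = preference.get(soulmate)
    if theirs is not None:
        index = {movie: i for i, movie in enumerate(movies)}
        for movie in ranked:  # most popular first
            i = index[movie]
            if mine[i] == 0 and theirs[i] == 1:
                return movie
    return fallback(name)  # no candidate from the soulmate: defer to Step 1.2

movies = ["Inception", "Coco", "The Dark Knight"]
preference = {"Jake": [1, -1, 0], "Dennis": [-1, 0, 1], "Sam": [1, 1, 1]}
ranked = ["The Dark Knight", "Inception", "Coco"]
fallback = lambda name: "step 1.2 result for " + name  # placeholder, not the real function

print(recommend_from_soulmate("Jake", "Dennis", movies, preference, ranked, fallback))
# The Dark Knight
print(recommend_from_soulmate("Dennis", "Jake", movies, preference, ranked, fallback))
# step 1.2 result for Dennis
```

Iterating over `ranked` rather than over `movies` means the first match is automatically the most popular candidate, so no separate sort is needed.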
In [ ]:
def Recommendation2(name):
    recommended_movie = ""
    # YOUR CODE HERE
    raise NotImplementedError()
    return recommended_movie
In [ ]:
assert Recommendation2("Test Student") == 'The Martian'
assert Recommendation2("Test Student Long Name") == 'The Lord of the Rings: The Return of the King'
In [ ]:
assert Recommendation2("Test Student Long Name") == 'The Lord of the Rings: The Return of the King'