Fill in your name and netid in the following lines
In [2]:
netid = "xxx"
name = "xxx yyy"
print(netid)
print(name)

xxx
xxx yyy
In [3]:
Collaboration_comment = "N/A" # Replace with names of collaborators
Citation_comment = "N/A" # Replace with a description of resources used, or specific things you got from office hours
Credit_comment = "N/A" # (Optional) Replace with special credit you would like to give for GitHub/Piazza contributions
In [ ]:

HW3, Movies¶
You will write your homework in this notebook. You should probably create another notebook to test out your solutions. After you are finished, make sure that your notebook can be run. Run your notebook, and submit the results. Do not delete the notebook cells that we supplied. Add as many cells as you need.

The following lines import the modules we will use. Do not import any others.
In [4]:
import pickle
import hashlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

1. Create a DataFrame by reading in movies.csv¶
In [ ]:

2. Keep only the following columns:¶
title, id, rtAllCriticsRating, rtAllCriticsNumReviews
Feel free to give them more useful names. Produce something like this. It does not have to be exactly like this. This is just an example. Anything you can use to solve the rest will suffice.
In [ ]:

3. Make the ratings numeric.¶
Check the dtypes of your DataFrame. If the movie ratings are strings, force them to be numeric (say float64). Remove all movies with null ratings and all movies that have no reviews. Give evidence that you did this successfully.
Hint: look up the Pandas function to_numeric. It gives one way of doing this. There are others.
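As an illustration of the hint, here is a minimal sketch on made-up data. The column names `rating` and `num_reviews`, the toy values, and the `\N` missing-value marker are all assumptions for the example, not the contents of movies.csv.

```python
import pandas as pd

# Toy data (hypothetical values); "\\N" stands in for a missing-value marker.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": ["7.5", "\\N", "0", "8.1"],
    "num_reviews": ["12", "3", "0", "40"],
})

# errors="coerce" turns anything unparseable into NaN instead of raising.
toy["rating"] = pd.to_numeric(toy["rating"], errors="coerce")
toy["num_reviews"] = pd.to_numeric(toy["num_reviews"], errors="coerce")

# Drop null ratings, then drop movies that have no reviews.
cleaned = toy.dropna(subset=["rating"])
cleaned = cleaned[cleaned["num_reviews"] > 0]

# Evidence that the conversion and filtering worked.
print(cleaned.dtypes)
print(len(cleaned))
```

Printing the dtypes and the row count before and after is one simple way to give evidence that the cleanup did what was intended.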
In [ ]:

Sanity check: you should now have 9312 movies in your table.¶
Show us that you do.
In [ ]:

4. Show us the 10 movies with the highest ratings, and the 10 with the lowest ratings.¶
We show the top and bottom 3.
In [ ]:

In [ ]:
movieNumA.nsmallest(10, "rtAllCriticsRating")

5. Create a DataFrame from movie_actors.csv.¶
In [ ]:

Apply head() to it. You should get something like the following. The movieID is the same as in our other file. We will identify the actors by their actorIDs. We will ignore the ranking, which is the order in which the actors were listed.
In [ ]:

6. Merge these two DataFrames together on movieID.¶
This will take a little work.
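A rough sketch of a merge where the key columns have different names in the two tables. The frames `movies_toy` and `cast_toy` and their values are hypothetical, and we are only assuming that a name mismatch is part of the work involved; other issues, such as mismatched dtypes, may also need attention.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables.
movies_toy = pd.DataFrame({"id": [1, 2, 3], "rating": [7.0, 5.5, 9.0]})
cast_toy = pd.DataFrame({
    "movieID": [1, 1, 2, 4],
    "actorID": ["a", "b", "a", "c"],
})

# left_on/right_on lets the key columns carry different names.
# An inner merge keeps only movies present in both tables.
merged_toy = cast_toy.merge(
    movies_toy, left_on="movieID", right_on="id", how="inner"
)
print(merged_toy)
```

Here movie 4 (no rating row) and movie 3 (no cast row) both drop out of the result.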
In [ ]:

In [ ]:

7. The best actors.¶
For each actor, compute the average rating of the movies in which that actor appears. View this as a rating of that actor. Find the 10 actors with the highest ratings. We give the top 3.

mitsuhiro_mori 9.6
shiro_osaka 9.6
toru-abe 9.6
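The per-actor average can be sketched with a groupby on toy data. The frame `merged_toy` and its column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical merged table: one row per (actor, movie) pair.
merged_toy = pd.DataFrame({
    "actorID": ["a", "a", "b", "c", "c", "c"],
    "rating": [8.0, 6.0, 9.0, 4.0, 5.0, 6.0],
})

# Average rating of the movies each actor appears in.
actor_rating = merged_toy.groupby("actorID")["rating"].mean()

# The n actors with the highest averages (n=2 for this toy table).
top = actor_rating.nlargest(2)
print(top)
```

Note that actor `b` tops this toy list on the strength of a single movie.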
In [ ]:

Some of these actors might look great just because they appeared in only one movie, and it was good. We now restrict to actors that appeared in many movies.

8. Restricting to popular actors.¶
For each actor, find the number of movies in which that actor appeared. Create a new DataFrame containing only actors who appeared in at least 37 movies. This should give you 101 actors and 4500 rows.
Hint: one approach uses the Pandas method isin . Another approach produces counts of movies per actor.
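A small sketch combining both ideas in the hint, on toy data. It assumes each row is a distinct (movie, actor) pair; the names `cast_toy`, `min_movies`, and the threshold of 2 are illustrative (the assignment uses 37).

```python
import pandas as pd

# Hypothetical table: one row per (movie, actor) pair.
cast_toy = pd.DataFrame({
    "movieID": [1, 2, 3, 1, 2, 3],
    "actorID": ["a", "a", "a", "b", "b", "c"],
})

min_movies = 2  # toy threshold; the assignment uses 37

# Count rows (movies) per actor, then keep the actors over the threshold.
counts = cast_toy["actorID"].value_counts()
popular_ids = counts[counts >= min_movies].index

# isin builds a boolean mask selecting rows whose actor is popular.
popular = cast_toy[cast_toy["actorID"].isin(popular_ids)]
print(popular["actorID"].nunique(), len(popular))
```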
In [ ]:

9. Restricting the movies.¶
Now, create a DataFrame that only contains those popular actors and only those movies in which at least two of them appeared. This should leave you with 1032 movies and a DataFrame with 2499 lines. This is the DataFrame we will use for the rest of this homework.
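One possible way to keep only movies with at least two rows is a grouped count broadcast back onto the rows. This is a sketch on toy data; `popular_toy` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical table already restricted to popular actors.
popular_toy = pd.DataFrame({
    "movieID": [1, 1, 2, 3, 3, 3],
    "actorID": ["a", "b", "a", "a", "b", "c"],
})

# transform("count") returns, for every row, the number of rows
# that share its movieID, aligned with the original index.
per_movie = popular_toy.groupby("movieID")["actorID"].transform("count")
kept = popular_toy[per_movie >= 2]

# Check: every remaining movie appears at least twice.
print(kept["movieID"].value_counts().min())
print(kept["movieID"].nunique(), len(kept))
```

The same `value_counts().min()` check applied to the actor column gives the minimum number of appearances per actor.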
In [ ]:

10. Confirm that each movie appears at least twice in this table.¶
Also compute the minimum number of times that an actor appears in this table. We got 5.
In [ ]:

11. Within this set find the 10 highest and lowest rated actors by giving them a rating equal to the average of the ratings of the movies in which they appear.¶
In [ ]:

Rating by Least Squares¶
The drawback of the above rating scheme is that it does not distinguish each actor’s individual contribution to a movie. We would like to do that by assigning a rating to each actor so that the rating of every movie is the average of the ratings of actors in it.
We can’t do that exactly, but we can try to come close with least squares.

12. Building the matrix¶
Construct a matrix whose rows are indexed by the movies and whose columns are indexed by the actors in the last DataFrame. I don’t mean “indexed” in a formal sense. Rather, it should have 101 columns and 1032 rows. It should have a nonzero in row r and column c when actor c appears in movie r. In this case, the entry should be the reciprocal of the number of actors in that movie. This way, multiplying by a vector encoding the ratings of actors will compute the average rating of the actors in each movie.
Hint: it is possible to do this using either DataFrame operations or Dictionaries. Some might find the Pandas iterrows or itertuples method, or the Python function zip, to be useful.
In case you cannot construct this matrix, we will create a file that you can use to include it. This will allow you to complete the rest of the problem set (at some loss of points).
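To make the construction concrete, here is a sketch of the dictionary-and-zip approach on a tiny example with 3 movies and 3 actors. The frame `df_toy`, the lookup dictionaries, and the sorted ordering of rows and columns are all assumptions; any consistent ordering would do.

```python
import numpy as np
import pandas as pd

# Hypothetical table: one row per (movie, actor) pair.
df_toy = pd.DataFrame({
    "movieID": [10, 10, 20, 20, 20, 30, 30],
    "actorID": ["a", "b", "a", "b", "c", "b", "c"],
})

# Fix an ordering of movies (rows) and actors (columns).
movie_ids = sorted(df_toy["movieID"].unique().tolist())
actor_ids = sorted(df_toy["actorID"].unique().tolist())
row_of = {m: r for r, m in enumerate(movie_ids)}
col_of = {a: c for c, a in enumerate(actor_ids)}

# Mark each (movie, actor) appearance with a 1.
A = np.zeros((len(movie_ids), len(actor_ids)))
for m, a in zip(df_toy["movieID"], df_toy["actorID"]):
    A[row_of[m], col_of[a]] = 1.0

# Divide each row by its actor count so nonzeros become reciprocals.
A = A / A.sum(axis=1, keepdims=True)
print(A)
```

Every row then sums to 1, so `A @ x` is the average of the entries of `x` for the actors in each movie. Keeping `movie_ids` and `actor_ids` around makes it possible to map rows and columns back to names later.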
In [ ]:

13. Make a vector of the movie ratings, indexed in the same way as the rows of the matrix.¶
In [ ]:

14. Solve the least squares problem to find the ratings for actors.¶
Then, compute the mean of the squares of the entries of the vector $$Ax - b$$. To test how good this is, also compute the mean of the squares of b minus the mean of b.
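A minimal sketch of the two quantities using `np.linalg.lstsq`, which is one of several ways to solve a least squares problem in NumPy. The small matrix `A` and vector `b` here are made up for illustration.

```python
import numpy as np

# Toy averaging matrix (rows sum to 1) and toy movie ratings.
A = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [1 / 3, 1 / 3, 1 / 3],
])
b = np.array([7.0, 5.0, 6.0, 6.5])

# Least squares solution x minimizing ||Ax - b||.
x, residuals, rank, sing_vals = np.linalg.lstsq(A, b, rcond=None)

# Mean squared error of the fit.
mse_fit = np.mean((A @ x - b) ** 2)

# Baseline: mean squared error when guessing the mean of b for every movie.
mse_baseline = np.mean((b - b.mean()) ** 2)

print(mse_fit, mse_baseline)
```

Because every row of `A` sums to 1, giving all actors the same rating `b.mean()` reproduces the baseline guess exactly, so on the data used for fitting `mse_fit` cannot exceed `mse_baseline`. The interesting question is how much smaller it is.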
In [ ]:

15. According to these ratings, who are the 10 best and worst actors (out of the 101 we are considering)?¶
In [ ]:

In [ ]:

Should we believe these top (and bottom) 10 lists?¶
We will divide the movies into a training set and a test set. We will use the training set to assign ratings to actors, and then see how well they explain the ratings of movies in the test set. We will use your netid to determine which movies go in which set.

16. Map movies to train and test.¶
Create a pandas Series or a list or an array containing the movie ids. The following function assigns True or False to a movie id. Apply it to all of the movie ids for the 101 actors we are considering. Verify that the number of True values is about 87.5%. You can do this by computing the sum of a boolean vector.
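The 87.5% figure comes from the rule itself: a movie goes to the training set when the first hex digit of its hash is below 14, which holds for 14 of the 16 possible digits. The sketch below checks the fraction with a hypothetical stand-in, `toy_is_train`, which mimics that rule on made-up ids. It is not the supplied function, and it encodes the string to bytes because `hashlib.md5` expects bytes under Python 3.

```python
import hashlib
import numpy as np

def toy_is_train(movie_id, key="xxx"):
    # Hypothetical stand-in: hash key + id, look at the first hex digit.
    text = key + str(movie_id)
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(digest[0], 16) < 14

# Made-up movie ids for illustration.
movie_ids = np.arange(1, 2001)
mask = np.array([toy_is_train(m) for m in movie_ids])

# Summing a boolean array counts the True values; the mean is the fraction.
n_train = mask.sum()
frac = mask.mean()
print(n_train, frac)
```

The same boolean mask can then be used to select training rows with `mask` and test rows with `~mask`.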
In [ ]:
# DO NOT CHANGE THIS CODE!
def is_train(movie_id):
    hash_object = hashlib.md5(netid + str(movie_id))
    hex_hash = hash_object.hexdigest()
    return int(hex_hash[0],16) < 14

is_train('1')
In [ ]:

17. Compute the Least Squares Solution on the training data.¶
Then, report the top and bottom 10 actors under this new rating vector.
In [ ]:

18. Evaluate how the least squares solution on train performs on test.¶
This time, compute the mean squared error on the ratings of movies in the test set, and compare it to the mean squared error if we just guessed the average rating for those movies.
In [ ]:

19. How stable were your top 10 and bottom 10?¶
Give any observations you can make about the difference between dividing into train and test, and when we run on the whole data. That is, what do you think of these actor ratings that we created? Give your answer in a Markdown cell.
In [ ]:

In [ ]:
