DS 111, Prof Jones-Rooy, Prof Wood

Homework 5

This homework is due on Thursday, Dec. 8, by 12pm (noon). Please complete this assignment in
your own document and then upload it as a single attached PDF to Brightspace > Assignments
> Homework 5. Email submissions and/or submissions in any format other than PDF will not be
accepted.

Throughout your homework you must (a) clearly identify the question you are answering and (b)
provide all executed code you used to generate your answers. Failure to do either (a) or (b) will
result in no credit for that question. This homework is worth 30 points (one point per sub-question).
Late homework will be graded down, no exceptions.

Some of the questions refer to external resources, which you can find attached on Brightspace >
Assignments > Homework 5 (where you found this assignment).

Note that the course academic honesty policy applies to every homework, including this one. You
are welcome and encouraged to use external resources for the coding portions of the homework, but
please be sure to cite your sources. These citations should include a link to the webpage you used
and a note about how you used that resource. If you use code from a website, you are expected to
acknowledge that source.

Throughout this homework, we will use a dataset related to movies and the Bechdel test. To pass
the Bechdel test, a movie must have at least two named women who have a conversation with each
other at some point, and that conversation must not be about a male character. If these criteria are
not met, the movie fails the Bechdel test. You can read more about the Bechdel test and data in
the article “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women” published at
FiveThirtyEight (also available as a resource for this assignment). A description of each variable
in the dataset can be found in the bechdel description.txt resource for this assignment.

1. In this question we will load and inspect our data.

(a) Load the dataset bechdel.csv and display the first 5 rows. It is okay if not all of the
columns are visible.

(b) What is the unit of analysis in this dataset?

(d) The rated column provides the rating of the film, with options for “PG” (parental guid-
ance suggested), “PG-13” (parents strongly cautioned), “R” (restricted), and “other”
(none of the above). Report the number of films with each rating in the dataset.

(e) The dataset has two sources: BechdelTest.com and The-Numbers.com. The website
BechdelTest.com relies on voluntary contributions, where any person can submit a form
indicating whether a movie passes the Bechdel test or not. Users can then agree or
disagree with the assessment. The website The-Numbers.com provides data on movie
finances. To be included in the bechdel.csv dataset, a movie must appear on both
websites. Do you think the dataset is representative of the population of movies? Briefly
explain why or why not.

(f) The dataset includes many missing values (NaN). In particular, there are many missing
values in the runtime feature. Given that the run time of a movie is usually reported,
missing runtime values may indicate that other features reported for these movies are
not reliable. Drop the rows where runtime contains a missing value. Show your code
for dropping these rows and display the shape of the final dataset.
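
The inspection steps in this question can be sketched in pandas as shown below. This is a minimal illustration on a tiny made-up table, not the real bechdel.csv: the column names rated and runtime follow the question text, while the title column and all values are hypothetical. With the real file, the dataframe would presumably come from pd.read_csv("bechdel.csv") instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in for bechdel.csv; every value here is invented for illustration.
# With the real file you would likely start from: df = pd.read_csv("bechdel.csv")
df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],          # hypothetical column
    "rated": ["PG", "R", "PG-13", "R", "other", "R"],
    "runtime": [95.0, np.nan, 120.0, 101.0, np.nan, 88.0],
})

# Display the first 5 rows.
print(df.head())

# Count how many films carry each rating.
rating_counts = df["rated"].value_counts()
print(rating_counts)

# Drop only the rows where runtime is missing, then report the new shape.
df_clean = df.dropna(subset=["runtime"])
print(df_clean.shape)
```

Passing subset=["runtime"] keeps rows that have missing values in other columns, which matches the instruction to drop rows based on runtime alone.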

2. In this question, we will perform multiple linear regression using the statsmodels package.

(a) Use the statsmodels package to estimate a multiple regression evaluating the effect on
imdb rating of four features: budget, drama, sci fi, and romcom. Display the output.

(b) How much of the variance in imdb rating is explained by our features?

(d) Holding budget constant, a 1-unit increase in romcom is associated with what kind of
change in imdb rating?

(e) Suppose, for now, that the relationship between budget and imdb rating is causal. How
would increasing the budget for a movie by $100 million change the imdb rating?

3. Now that we’ve gotten to know our data a little bit, we will use scikit-learn and a train/test
split to see how well our model – using the same DV and IVs as Q2 – can predict a movie’s
IMDB rating.

(a) Based on the statsmodels output from Q2, do you expect that the four features (budget,
drama, romcom, sci fi) will do a good or bad job at predicting IMDB rating for out-of-
sample data? Briefly explain why or why not.

(b) Create an X dataframe with just four features: budget, drama, romcom, and sci fi.
Create a Y dataframe (or series) with just the imdb rating variable. Display the first
five rows of each.

(c) Create a train/test split where 20% of the data is held out for testing. Use a random
seed of 135 (i.e., set the random seed to this value). To show your code has worked,
display the first 5 rows of the X training data and report the number of observations in
the X training data and X test data.

(d) Use SKLearn to train a linear regression using only the training data. Display the
intercept and coefficients for your trained model (coefficients do not need to be labeled).

(e) Compare the coefficients estimated by both regression models. How does the coefficient
for “drama” change (if at all) when it is estimated in Q2 (using statsmodels and the
full dataset) versus when it is estimated in Q3 (using SKLearn and only the training
data)?

(f) Use your trained SKLearn regression model to predict IMDB ratings for the hold-out set
of X test data. Report the first 10 predicted values.

(g) Display a scatterplot in which the horizontal axis shows the actual value of the Y test
data and the vertical axis shows the predicted Y values for the X test data. Label your
horizontal axis “Actual IMDB rating” and your vertical axis “Predicted IMDB rating”.

(h) Calculate and display the Root Mean Squared Error (RMSE) for this model. Provide a
brief interpretation of what this means in terms of the “average error” of the model.

(i) Reflecting on any of the analyses conducted above, do you feel that this model does a
good or poor job of predicting the IMDB rating of movies?
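
The train/test workflow described in this question can be sketched with scikit-learn as follows. The data are again simulated with assumed column names (imdb_rating, sci_fi), so only the sequence of calls carries over to the real dataset, not the numbers. The scatterplot step is omitted here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Simulated data with 100 rows; column names are assumptions.
rng = np.random.default_rng(1)
n = 100
X = pd.DataFrame({
    "budget": rng.uniform(1e6, 2e8, n),
    "drama": rng.integers(0, 2, n),
    "romcom": rng.integers(0, 2, n),
    "sci_fi": rng.integers(0, 2, n),
})
y = pd.Series(6.5 + 0.4 * X["drama"] + rng.normal(0, 0.8, n), name="imdb_rating")

# Hold out 20% of the rows for testing, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=135
)
print(len(X_train), len(X_test))  # 80 20

# Fit on the training rows only.
reg = LinearRegression().fit(X_train, y_train)
print(reg.intercept_, reg.coef_)

# Predict for the held-out rows and compute the root mean squared error.
y_pred = reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)
```

RMSE is in the same units as the outcome, so it can be read as a rough typical size of the prediction error in rating points, with larger errors weighted more heavily than smaller ones.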

4. In this next question, we will use KNN to classify whether movies pass or fail the Bechdel
test.

(a) First, let’s get a better sense of the balance of classes in our data (i.e. how many
observations of each class we have). Display the count of each value present in the
bechdel column.

(b) If we built a classifier that always guessed that a movie failed the Bechdel test, what
percentage of the time would the classifier be correct? In other words, what percentage
of movies fail the Bechdel test?

(c) Let’s build a classifier using the following 9 features: year, budget, domgross, intgross,
imdb rating, romcom, drama, action, sci fi. Create an X dataframe with just
these 9 columns and display the first 5 rows.

(d) Now, rescale (or normalize) this X data so that each feature has a mean of 0 and a
standard deviation of 1. Important: You should only rescale the continuous features
(e.g. year, budget, domgross, intgross) and not the binary features. Display the
first 5 rows of this normalized data.

(e) We’ll want to be able to see how well our classifier performs out-of-sample. Create a
train/test split, setting Y to be the bechdel column of the dataframe. Here, use 20% of
the data for testing and set the random state to 321. Display (at least) the first 3 rows
of X training data.

(f) Next, we’ll want to determine the number of neighbors k to consider for our KNN
classifier. For values of k from 1–30 (inclusive), calculate the accuracy of a KNN classifier.
Display your results by creating a plot with k values on the horizontal axis and the
corresponding accuracy on the vertical axis.

(g) Based on your analysis, choose a reasonable value of k. Train a KNN classifier that
considers this number of neighbors and predict Y values (i.e. whether the movies in
your test data pass or fail the Bechdel test) for your test data. Display the first 5
predictions for bechdel.

(h) Use actual and predicted Y values to calculate and display the confusion matrix for
your model. The matrix may be displayed without labels, but it will show the classes
in alphabetical order (FAIL, PASS; upper left corner is “FAIL-FAIL”). As with the
examples in lecture, the rows will indicate the actual values and the columns will indicate
the predicted values. How many movies that passed the Bechdel test (“True PASS”)
were predicted to have failed the Bechdel test?

(i) Use the actual and predicted Y values to display the full classification report. What
does the recall for the classification “PASS” suggest about our model?

(j) Reflecting on the analysis above, do you feel like this model does a good or poor job of
predicting whether a movie passes or fails the Bechdel test?
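
The classification steps in this question can be sketched with scikit-learn as shown below. Several things here are assumptions: the data are simulated with random labels, only three of the nine features are included to keep the example short, and accuracy for each k is measured on the test split (the question does not say which split to use). The accuracy plot is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Simulated features and random FAIL/PASS labels, for illustration only.
rng = np.random.default_rng(2)
n = 300
X = pd.DataFrame({
    "year": rng.integers(1970, 2014, n).astype(float),
    "budget": rng.uniform(1e6, 2e8, n),
    "drama": rng.integers(0, 2, n),
})
y = pd.Series(np.where(rng.random(n) < 0.55, "FAIL", "PASS"), name="bechdel")

# Accuracy of a classifier that always guesses FAIL.
baseline = (y == "FAIL").mean()
print(baseline)

# Standardize only the continuous columns; leave binary columns untouched.
continuous = ["year", "budget"]
X_scaled = X.copy()
X_scaled[continuous] = StandardScaler().fit_transform(X[continuous])

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=321
)

# Test-set accuracy for k = 1, ..., 30.
accuracies = {}
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies[k] = knn.score(X_test, y_test)

# One simple choice of k: the value with the highest accuracy.
best_k = max(accuracies, key=accuracies.get)
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Rows are actual classes, columns are predicted classes (FAIL, PASS).
cm = confusion_matrix(y_test, y_pred, labels=["FAIL", "PASS"])
print(cm)
missed_pass = cm[1, 0]  # actual PASS that were predicted FAIL
print(classification_report(y_test, y_pred, zero_division=0))
```

Comparing the classifier's accuracy with the always-FAIL baseline is a useful sanity check: a model that barely beats the baseline is adding little beyond the class balance.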
