
Cologne Center for Comparative Politics
Introduction to Quantitative Analysis
Bruno Castanho Silva

Bonus points assignment #2 – Predictive Modeling

This assignment can give you up to 6 bonus points, on top of the total maximum of 100 for this course. The deadline is on December 11, 23:55 CET. NB! The deadline changed because the data was not available!

This assignment will involve predictive modeling. These are applications of statistics where we are less interested in making inferences about the relations between variables than in predicting the values of observations on a given dependent variable. For an intro, read pages 1-9 of the book An Introduction to Statistical Learning with Applications in R. There's a link to the book in the readings. For a refresher on classification and logistic regression, take a look at pages 127-138.

For this task, you have a dataset from the site Kickstarter. This is a crowdfunding platform where anyone can post a project and say how much money they want to raise for it. Users can choose to support a project they like by donating money to it. Each project sets its own deadline for raising the amount asked for. If it raises that much by the deadline, the project is funded. If it does not, it is not funded at all.

The question is: can we predict which projects will be funded (1) or not (0), given project characteristics?

The data has 70000 observations. Each one is a Kickstarter project launched between 2009 and 2015. The outcome variable is final_status. It takes the value 1 if the project was funded, and 0 if it wasn't.

This exercise involves creating a model that gives accurate predictions on whether projects will be funded or not. That is, you should regress final_status on any variables in the dataset (including all of them if you want), or on combinations of variables (quadratic terms, interactions, …), so that the predictions are as accurate as possible. Accuracy is judged by the area under the ROC curve (AUC). The closer that value is to 1, the better the predictions.
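As a rough illustration of how an AUC can be computed, the sketch below uses the rank-based formula (the Mann-Whitney statistic rescaled to the 0-1 range) in base R. The function name `auc_score` and the four toy observations are made up for this example; they are not part of the assignment or the dataset, and the grading may well use a different implementation.

```r
# Rank-based AUC: the probability that a randomly chosen funded project (1)
# gets a higher predicted value than a randomly chosen unfunded one (0).
# `auc_score` is a hypothetical helper name used only for this sketch.
auc_score <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  r <- rank(predicted)  # ties receive average ranks
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example with made-up numbers
actual    <- c(0, 0, 1, 1)
predicted <- c(0.1, 0.4, 0.35, 0.8)
auc_score(actual, predicted)  # 0.75
```

Here three of the four possible (funded, unfunded) pairs are ordered correctly, which gives 3/4 = 0.75.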

The easiest way of doing that is with a logistic regression, but you can choose any technique from the Introduction to Statistical Learning book.

You should submit R code with your final model. It doesn't have to be in R Markdown; a plain R script is enough. It should include the code to run the model, along these lines:

my.model <- glm(final_status ~ goal + disable_communication + backers_count + app, data = data, family = "binomial")
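The template above can be extended with quadratic terms and interactions through R's formula syntax. The sketch below is only an illustration: the data frame `toy` is simulated (the real Kickstarter file is not reproduced here), the object name `my.model2` is hypothetical, and the particular terms chosen are arbitrary rather than a recommended specification.

```r
# Simulated stand-in data; the coefficients used to generate it are invented.
set.seed(1)
n <- 500
toy <- data.frame(
  goal          = round(exp(rnorm(n, mean = 8, sd = 1))),
  backers_count = rpois(n, lambda = 30),
  app           = rbinom(n, size = 1, prob = 0.2)
)
toy$final_status <- rbinom(
  n, size = 1,
  prob = plogis(0.15 * (toy$backers_count - 30) - 0.0001 * toy$goal)
)

# I(x^2) adds a quadratic term.
# a * b expands to a + b + a:b (both main effects plus their interaction).
my.model2 <- glm(
  final_status ~ goal + backers_count * app + I(backers_count^2),
  data = toy, family = "binomial"
)
names(coef(my.model2))
```

Wrapping the square in `I()` matters: inside a formula, `^` otherwise has a special meaning (crossing of terms) rather than arithmetic exponentiation.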

The dataset you have is a sample from the total. I have kept 23000 observations aside. I'll use the model you generated to make predictions about the observations in this new sample: were they funded or not? The better your model performs on this sample (i.e., the more accurate its predictions), the better your grade.
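Applying a fitted model to observations it has not seen can be sketched with `predict()` and its `newdata` argument. Everything below is an assumption for illustration: `make_toy` is a hypothetical helper that simulates data, and `holdout` merely stands in for the set-aside sample, which is not available here.

```r
# Hypothetical helper that simulates projects with a few of the variables
# named in the assignment; the generating coefficients are invented.
make_toy <- function(n) {
  d <- data.frame(
    goal          = round(exp(rnorm(n, mean = 8, sd = 1))),
    backers_count = rpois(n, lambda = 30),
    app           = rbinom(n, size = 1, prob = 0.2)
  )
  d$final_status <- rbinom(
    n, size = 1,
    prob = plogis(0.15 * (d$backers_count - 30) - 0.0001 * d$goal)
  )
  d
}

set.seed(2)
train   <- make_toy(700)  # plays the role of the data you were given
holdout <- make_toy(230)  # plays the role of the set-aside sample

fit <- glm(final_status ~ goal + backers_count + app,
           data = train, family = "binomial")

# type = "response" returns predicted probabilities of being funded;
# the default (type = "link") would return log-odds instead.
p_holdout <- predict(fit, newdata = holdout, type = "response")
head(p_holdout)
```

For `predict()` to work, the new data must contain every variable used in the model formula, under the same names.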

A useful tool for making sure your model performs well in new data is the validation set approach, which is described in pp. 176-178 of Introduction to Statistical Learning with R.
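A minimal sketch of the validation set approach follows, assuming a random half-and-half split: fit the model on one part of the data, then measure the AUC on the part that was held back. The data frame `toy` is simulated and all object names are hypothetical; with the real data you would split your own data frame instead.

```r
# Simulated stand-in data; the generating coefficients are invented.
set.seed(3)
n <- 1000
toy <- data.frame(
  goal          = round(exp(rnorm(n, mean = 8, sd = 1))),
  backers_count = rpois(n, lambda = 30),
  app           = rbinom(n, size = 1, prob = 0.2)
)
toy$final_status <- rbinom(
  n, size = 1,
  prob = plogis(0.15 * (toy$backers_count - 30) - 0.0001 * toy$goal)
)

# Randomly assign half of the rows to training, the rest to validation.
train_idx <- sample(n, size = n / 2)
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

# Fit on the training half only.
fit <- glm(final_status ~ goal + backers_count + app,
           data = train, family = "binomial")

# Predict probabilities for the validation half, which the model never saw.
p_valid <- predict(fit, newdata = valid, type = "response")

# Rank-based AUC on the validation half.
pos <- valid$final_status == 1
r   <- rank(p_valid)
auc_valid <- (sum(r[pos]) - sum(pos) * (sum(pos) + 1) / 2) /
  (sum(pos) * sum(!pos))
auc_valid
```

Comparing `auc_valid` across candidate specifications gives a rough idea of which one is likely to generalize better. The exact value depends on the random split, so it changes with the seed.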

The variables in the dataset are:

Goal = amount of money in USD that the project expects to raise
Disable_communication = whether the project manager disabled communication with users during the fundraising
Backers_count = how many people backed the project
Daysup = how many days it was up from creation until the deadline
Daysup2 = how many days it was up from the launch (i.e., when people can start donating) until the deadline
Year = in what year was the project launched?
Month = in what month was the project launched?
All the other variables are dummies, on whether the project keywords included that word (1) or not (0). For example, a project that has a value of 1 in app mentions in its keywords the word “app”. We may assume the person wants to raise funds to develop, well, an app…