Homework 3¶
You will work with a crowdfunding dataset. If you are not familiar with crowdfunding, please take a look at the following Wikipedia page: https://en.wikipedia.org/wiki/Crowdfunding. Your goal is to predict whether a project is successfully funded. For the definition of each variable, refer to ‘data_dictionary.xlsx’. You can also refer to ‘crowdfunding_variables.’
Import Libraries¶
Import a few libraries you think you’ll need.
Get the Data¶
Read in the original_file.csv file and set it to a data frame called crowdfunding.
Check the head of crowdfunding
Use info() and describe() to summarize the data.
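The loading and inspection steps above might look roughly like the sketch below. It reads a tiny in-memory stand-in rather than the real original_file.csv, and the column project_id is a hypothetical placeholder; goal and balance are the only names taken from the assignment.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for original_file.csv (the real file has more columns).
toy_csv = io.StringIO(
    "project_id,goal,balance\n"
    "1,1000,1200\n"
    "2,5000,300\n"
    "3,2500,\n"
)

# With the real data this would be: pd.read_csv("original_file.csv")
crowdfunding = pd.read_csv(toy_csv)

print(crowdfunding.head())      # first rows of the frame
crowdfunding.info()             # dtypes and non-null counts per column
print(crowdfunding.describe())  # summary statistics for numeric columns
```

The non-null counts from info() are a quick way to spot columns with many missing values before the pre-processing step.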
Discuss important aspects of your crowdfunding data. This is an open-ended question.
Pre-Processing Steps¶
The main reason for this step is to focus on the variables that the user can control while setting up the project.
Setting the scope: we consider only projects that have raised at least 250. Use the balance variable.
To make things easier, drop the following three variables that have lots of missing values: ‘funding_successful_at’, ‘visits_to_crt_edt_pgs’, ‘partner_id’.
Drop all the projects with any missing values.
How many campaigns remain?
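One possible way to chain the pre-processing steps above is sketched here on a hypothetical four-row frame. Only the column names come from the assignment; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real frame comes from original_file.csv.
crowdfunding = pd.DataFrame({
    "goal": [1000, 5000, 2500, 800],
    "balance": [1200, 100, 300, 900],
    "funding_successful_at": [None, None, None, None],
    "visits_to_crt_edt_pgs": [np.nan, 3, np.nan, 1],
    "partner_id": [np.nan, np.nan, 7, np.nan],
    "featured": [1, 0, np.nan, 0],
})

# Scope: keep only projects that raised at least 250.
crowdfunding = crowdfunding[crowdfunding["balance"] >= 250]

# Drop the three sparse columns first, then any row that still has a missing value.
sparse_cols = ["funding_successful_at", "visits_to_crt_edt_pgs", "partner_id"]
crowdfunding = crowdfunding.drop(columns=sparse_cols).dropna()

n_campaigns = len(crowdfunding)
print(n_campaigns)
```

Dropping the sparse columns before calling dropna() matters: in the opposite order, almost every row would be discarded because of those three columns.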
Feature Engineering¶
First convert the following variables to date/time variables:
funding_started_at
funding_ends_at
Use the pandas function ‘to_datetime’ to do this.
Next create Day of Week (DOW) and Month of Year (MOY) variables by using the variable transformed from ‘funding_started_at’.
Check that you successfully transformed the two variables and created the DOW and MOY variables.
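A minimal sketch of the date conversion and the two derived variables, assuming the dates are stored as strings. The two example rows are invented; the column names follow the assignment.

```python
import pandas as pd

# Hypothetical two-row example with string timestamps.
crowdfunding = pd.DataFrame({
    "funding_started_at": ["2015-01-05 10:00:00", "2015-03-14 08:30:00"],
    "funding_ends_at": ["2015-02-04 10:00:00", "2015-04-13 08:30:00"],
})

# Convert both columns from strings to datetime64.
for col in ["funding_started_at", "funding_ends_at"]:
    crowdfunding[col] = pd.to_datetime(crowdfunding[col])

# dt.dayofweek codes Monday as 0 and Sunday as 6; dt.month runs from 1 to 12.
crowdfunding["funding_started_at_DOW"] = crowdfunding["funding_started_at"].dt.dayofweek
crowdfunding["funding_started_at_MOY"] = crowdfunding["funding_started_at"].dt.month

print(crowdfunding.dtypes)
print(crowdfunding[["funding_started_at_DOW", "funding_started_at_MOY"]])
```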
Campaign Duration & Success Measures¶
• duration: the organizer-specified fundraising duration in days
• win: a binary indicator of whether the campaign achieved its fundraising goal.
• calculate the natural logarithm for goal and duration
Construct the following variable: duration. Duration is the time period in days that the organizer specified. In other words, it is the difference in days between the date the funding started and the date it ended.
Construct a binary indicator of success, ‘win’, that equals 1 if the campaign achieved its fundraising goal and 0 otherwise.
This will be our target variable.
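The two constructed variables could be computed as in the sketch below. Note the assumption: the assignment does not say which columns define success, so this sketch treats a campaign as a win when balance is at least goal. Check ‘data_dictionary.xlsx’ before relying on that rule. The rows are invented.

```python
import pandas as pd

# Hypothetical two-row example; dates are already datetime values.
crowdfunding = pd.DataFrame({
    "funding_started_at": pd.to_datetime(["2015-01-05", "2015-03-14"]),
    "funding_ends_at": pd.to_datetime(["2015-02-04", "2015-04-28"]),
    "goal": [1000, 5000],
    "balance": [1200, 300],
})

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days.
crowdfunding["duration"] = (
    crowdfunding["funding_ends_at"] - crowdfunding["funding_started_at"]
).dt.days

# ASSUMPTION (not stated in the assignment): success means balance >= goal.
crowdfunding["win"] = (crowdfunding["balance"] >= crowdfunding["goal"]).astype(int)

print(crowdfunding[["duration", "win"]])
```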
log-transform variables¶
We will log-transform two variables, goal and duration. First, check that they are right-skewed by drawing histograms.
Import the math package for the log transformation.
Calculate the natural logarithm for goal and duration.
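As a rough illustration, the log transformation can be applied with math.log as below. The values are invented, and the names ln_goal and ln_duration are taken from the predictor list used later in the assignment. Skewness is printed as a numeric stand-in for the histogram check.

```python
import math

import pandas as pd

# Hypothetical values chosen to be strongly right-skewed.
crowdfunding = pd.DataFrame({
    "goal": [1000, 5000, 250000],
    "duration": [30, 45, 60],
})

# math.log is the natural logarithm; it requires strictly positive inputs.
crowdfunding["ln_goal"] = crowdfunding["goal"].apply(math.log)
crowdfunding["ln_duration"] = crowdfunding["duration"].apply(math.log)

# Sample skewness before and after the transformation.
# With matplotlib installed, crowdfunding["goal"].hist() would draw the histogram.
print(crowdfunding[["goal", "ln_goal"]].skew())
```

numpy's np.log(crowdfunding["goal"]) gives the same result in vectorized form; math.log is used here only because the assignment asks for the math package.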
We will explore our data a bit.
Which two states have the most projects? Draw a plot to answer this question.
Which months have the most projects? Which months have the highest success rate? Draw a plot to answer each question. Use the month in which a campaign (i.e., funding) started.
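The tables behind these plots could be built as sketched below. The column name state is hypothetical, since the assignment does not give the name of the state variable, and the six rows are invented.

```python
import pandas as pd

# Hypothetical toy data; "state" is an assumed column name.
crowdfunding = pd.DataFrame({
    "state": ["CA", "NY", "CA", "TX", "NY", "CA"],
    "funding_started_at_MOY": [1, 1, 3, 3, 3, 7],
    "win": [1, 0, 1, 0, 1, 0],
})

# Project counts per state, largest first; keep the top two.
top_states = crowdfunding["state"].value_counts().head(2)

# Per start month: number of projects and share of successful ones.
by_month = crowdfunding.groupby("funding_started_at_MOY")["win"].agg(
    projects="count", success_rate="mean"
)

print(top_states)
print(by_month)

# With matplotlib installed, bar charts can be drawn with, for example:
#   top_states.plot(kind="bar")
#   by_month["success_rate"].plot(kind="bar")
```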
Converting categorical features¶
We will generate dummy variables from the day-of-week variable, ‘funding_started_at_DOW’. Why can’t we use the variable in its current form? What is the problem?
It is a nominal variable: its values are labels with no inherent order or magnitude. We cannot say that a day coded 6 is greater than a day coded 1. Thus, we can’t use it directly in a model. Instead, we should convert it to dummy variables.
Now generate dummy variables from the DOW variable.
Do the same thing to convert another categorical variable, ‘funding_started_at_MOY’, to several dummies. How many dummies should we generate and use in the following models?
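A minimal sketch of the dummy encoding with pandas get_dummies. The toy frame covers all seven days but only seven months, so it is not representative of the real data. Using drop_first=True removes one reference level per variable, which yields the same naming as the predictor list given later in the assignment (DOW_1 to DOW_6, MOY_2 to MOY_12).

```python
import pandas as pd

# Hypothetical toy data: one row per day-of-week code.
crowdfunding = pd.DataFrame({
    "funding_started_at_DOW": [0, 1, 2, 3, 4, 5, 6],
    "funding_started_at_MOY": [1, 2, 3, 4, 5, 6, 7],
})

# One 0/1 column per category; the lowest level of each variable is dropped
# to serve as the reference category and avoid perfect collinearity.
crowdfunding = pd.get_dummies(
    crowdfunding,
    columns=["funding_started_at_DOW", "funding_started_at_MOY"],
    drop_first=True,
    dtype=int,
)

print(crowdfunding.columns.tolist())
```

A variable with k categories needs only k - 1 dummies, because the omitted category is implied when all the others are zero.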
You have finished cleaning your data and generating the necessary variables. Please check your final data again by using head(), info(), and describe().
Examine the summary statistics of ‘win’. What is the rate of successful projects?
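Because ‘win’ is coded 0/1, its mean is the share of successful projects. A small sketch on invented values:

```python
import pandas as pd

# Hypothetical target values.
crowdfunding = pd.DataFrame({"win": [1, 0, 0, 1, 0]})

# describe() reports the mean, which equals the success rate for a 0/1 variable.
print(crowdfunding["win"].describe())

success_rate = crowdfunding["win"].mean()
print(success_rate)
```

If the rate is far from 0.5, the classes are imbalanced, and accuracy alone may be a misleading measure of model quality.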
Now let’s create the model data for the logistic regression model. Suppose that you are using the following set of variables as predictors: ['ln_goal', 'ln_duration', 'facebook_url', 'imdb_url', 'twitter_url', 'youtube_url', 'website_url', 'featured', 'enable_drcc', 'enable_payp', 'all_or_nothing', 'funding_started_at_DOW_1', 'funding_started_at_DOW_2', 'funding_started_at_DOW_3', 'funding_started_at_DOW_4', 'funding_started_at_DOW_5', 'funding_started_at_DOW_6', 'funding_started_at_MOY_2', 'funding_started_at_MOY_3', 'funding_started_at_MOY_4', 'funding_started_at_MOY_5', 'funding_started_at_MOY_6', 'funding_started_at_MOY_7', 'funding_started_at_MOY_8', 'funding_started_at_MOY_9', 'funding_started_at_MOY_10', 'funding_started_at_MOY_11', 'funding_started_at_MOY_12']
The focus in predictive analysis should be on variables that are under the control of a campaign organizer. In theory, you should therefore not include variables that a campaign organizer cannot control in your predictive model. Which variable(s), if any, cannot be controlled by a campaign organizer?
Great! Our data is ready for our model!
Building a Logistic Regression model¶
Let’s start by splitting our data into a training set and test set. We set ‘test_size’ to 0.3 and ‘random_state’ to 101.
Train Test Split¶
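The split described above can be sketched as follows. The frame model_data is a hypothetical name and holds random stand-in values with only two of the predictors; the test_size and random_state values are the ones given in the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned data: 100 random rows.
rng = np.random.default_rng(0)
model_data = pd.DataFrame({
    "ln_goal": rng.normal(8, 1, 100),
    "ln_duration": rng.normal(3.5, 0.3, 100),
    "win": rng.integers(0, 2, 100),
})

X = model_data[["ln_goal", "ln_duration"]]  # predictors
y = model_data["win"]                       # target

# 30% of the rows go to the test set; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

print(X_train.shape, X_test.shape)
```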
Training and predicting¶
Import the ‘LogisticRegression’ class and follow the standard steps from the lecture.
LogisticRegression(solver='liblinear')
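A sketch of the fit-and-predict steps, assuming the lecture follows the usual scikit-learn pattern. The data are synthetic and the variable names logmodel, predictions, and probabilities are illustrative choices, not names required by the assignment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: two hypothetical predictors, target driven by the first one plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) < 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Create the model, fit it on the training set, then predict on the test set.
logmodel = LogisticRegression(solver="liblinear")
logmodel.fit(X_train, y_train)

predictions = logmodel.predict(X_test)                # hard 0/1 labels
probabilities = logmodel.predict_proba(X_test)[:, 1]  # estimated P(win = 1)

print(predictions[:10])
```

The predicted probabilities are kept separately because the ROC curve and AUC are computed from scores, not from hard labels.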
Evaluation¶
You can get the confusion matrix.
You can check precision, recall, and F1-score using the classification report.
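The two evaluation calls are illustrated below on eight hand-made labels, so the numbers can be checked by hand. The labels are invented and unrelated to the crowdfunding data.

```python
from sklearn.metrics import classification_report, confusion_matrix, precision_score

# Hypothetical true labels and model predictions.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, predictions)
print(cm)

# Precision, recall, and F1-score for each class.
print(classification_report(y_test, predictions))

# Precision for class 1 = TP / (TP + FP).
precision_1 = precision_score(y_test, predictions)
print(precision_1)
```

Here precision for class 1 is 3 / (3 + 1) = 0.75: of the four projects predicted to succeed, three actually did.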
Describe the results briefly. In particular, discuss the precision for class 1.
Finally, you can get the ROC curve and AUC score.
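A minimal sketch of the ROC and AUC computation with scikit-learn, using four invented labels and scores. In the real workflow the scores would be the predicted probabilities of class 1 on the test set.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of class 1.
y_test = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(y_test, probabilities)

# Area under the ROC curve: 0.5 is chance level, 1.0 is perfect ranking.
auc = roc_auc_score(y_test, probabilities)
print(auc)

# With matplotlib installed, the curve can be drawn with:
#   plt.plot(fpr, tpr); plt.plot([0, 1], [0, 1], linestyle="--")
```

The AUC equals the share of (successful, unsuccessful) pairs in which the successful project receives the higher score; here three of the four pairs are ranked correctly.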
How can we improve the performance? Can we add more variables? What else can we do? This is an open-ended question.