Problem Set 4
In [ ]:
import numpy as np
import pandas as pd
import random
import sklearn.model_selection as model_selection
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict, cross_validate
from sklearn import linear_model
import matplotlib.pyplot as plt
%matplotlib inline
Problem A: Import 10000 records from the NYC yellow cab data set
Import 10000 records from the NYC yellow cab data set. We’re going to be looking at the relationship between trip distance and tip amount. Do passengers who travel further tip better than others?
Our first task is to import and clean the data. Import the first 10000 rows of the Jan 2017 yellow cab NYC trip data into a pandas data frame. Create a view on trip_distance and tip_amount. Plot trip_distance vs. tip_amount in a scatter plot.
The data set can be found here: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
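As a rough illustration of the import step, the sketch below reads a limited number of rows and only two columns with `pandas.read_csv`. The in-memory CSV, its values, the `VendorID` column, and the name `sample` are all made up for the example; only `trip_distance` and `tip_amount` come from the task description.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file: a tiny in-memory table with
# made-up values, so the sketch runs without downloading anything.
sample_csv = io.StringIO(
    "VendorID,trip_distance,tip_amount\n"
    "1,3.30,2.00\n"
    "1,0.90,0.00\n"
    "2,1.10,1.45\n"
    "2,7.50,4.00\n"
)

# nrows caps the number of data rows read; usecols keeps only the two
# columns of interest.
sample = pd.read_csv(sample_csv, nrows=3, usecols=["trip_distance", "tip_amount"])
print(sample.shape)
```

With the real file, the same call would presumably take the file name and `nrows=10000`, and `sample.plot.scatter(x="trip_distance", y="tip_amount")` is one way to draw the scatter plot.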
In [ ]:
# IMPLEMENT ME (Please make sure you name the dataframe as ‘df1’ and don’t modify ‘df1’ in later parts.)
# (Read your data from “yellow_tripdata_2017-01.csv”.)
Problem B: Remove the people who tipped nothing
People who tipped nothing aren’t very useful for our needs. We are interested in how much passengers tipped when they were generous enough to do so. Similarly, we don’t care about tips on very short rides. Remove all records with a tip of zero (or less), and remove all records where the trip distance was less than 0.5 miles. Plot the relationship between the trip distance and the tip amount for the remaining records. Call the resulting data frame ‘df2’.
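One possible way to express this kind of filter is a boolean mask, sketched below on a made-up frame. The names `trips`, `mask`, and `filtered` and all values are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real trip data.
trips = pd.DataFrame({
    "trip_distance": [0.3, 0.5, 2.0, 4.1, 6.0],
    "tip_amount":    [1.0, 0.0, 2.5, -1.0, 3.0],
})

# Keep rows with a strictly positive tip AND a distance of at least 0.5 miles.
mask = (trips["tip_amount"] > 0) & (trips["trip_distance"] >= 0.5)

# .copy() returns an independent frame, so the original stays unmodified.
filtered = trips[mask].copy()
print(len(filtered))
```

Building a new frame rather than filtering in place keeps the unfiltered data available for later comparison.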
In [ ]:
# IMPLEMENT ME (Please make sure you name the dataframe as ‘df2’ and don’t modify ‘df2’ in later parts.)
Problem C: Fit a linear regression to the resulting trip_distance vs. tip_amount data
In [ ]:
X = df2['trip_distance'].values.reshape(-1,1)
y = df2['tip_amount']
In [ ]:
# IMPLEMENT ME
Problem D: Look at the regression coefficients
Print the regression coefficients (intercept and slope) and, in a markdown cell, comment on their values. Do they make sense? What do they tell you?
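To show where the intercept and slope live on a fitted scikit-learn model, here is a minimal sketch on synthetic data generated from a known line (tip = 1.0 + 0.5 × distance). The data and the names `distance`, `tip`, `intercept`, and `slope` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data from a known line, so the fit can be checked by eye.
distance = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
tip = 1.0 + 0.5 * distance.ravel()

model = LinearRegression().fit(distance, tip)

# intercept_ is a scalar for a 1-D target; coef_ is an array with one entry
# per feature, so index it to get a plain number rather than an array.
intercept = float(model.intercept_)
slope = float(model.coef_[0])
print(round(intercept, 3), round(slope, 3))
```

Read this way, the intercept is the predicted tip at zero distance and the slope is the predicted change in tip per additional mile.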
In [2]:
# IMPLEMENT ME (Save the value of intercept as ‘a’ and the value of slope as ‘b’.
# Make sure you only save the value of slope as ‘b’ instead of an array.)
Problem E: Evaluate the model
Just because the model fits the data set doesn’t mean it’s very predictive. Run a 10-fold cross-validation and compute the average mean absolute deviation across the folds. Show the result. In a markdown cell, comment on what that number means in terms of predictive accuracy.
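A sketch of 10-fold cross-validation on synthetic data follows. It assumes that "mean absolute deviation" here means the mean absolute error between predicted and actual tips, and it uses scikit-learn's `neg_mean_absolute_error` scorer. The data, the noise level, and the names `folds` and `mean_mae` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: a known line plus Gaussian noise (illustrative only).
rng = np.random.default_rng(0)
distance = rng.uniform(0.5, 10.0, size=200).reshape(-1, 1)
tip = 1.0 + 0.5 * distance.ravel() + rng.normal(0.0, 0.5, size=200)

# Shuffled 10-fold split; random_state makes the shuffle reproducible.
folds = KFold(n_splits=10, shuffle=True, random_state=5)

# The scorer returns negated errors (higher is better), so flip the sign.
neg_mae = cross_val_score(
    LinearRegression(), distance, tip,
    cv=folds, scoring="neg_mean_absolute_error",
)
mean_mae = -neg_mae.mean()
print(len(neg_mae), mean_mae > 0)
```

The result is in the same units as the target, so on tip data it would read as the typical number of dollars by which a prediction misses.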
In [ ]:
random.seed(5)
# IMPLEMENT ME (Save the computed average mean absolute deviation as ‘c’.)