```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
library(lubridate)
library(caret)
library(rpart)
library(rpart.plot)
opts_chunk$set(echo = TRUE)
```


## 1. Business Problem

Yourcabs can train a model to predict drivers' cancellation behavior. A driver either cancels or does not, which makes this a typical binary classification problem that can be addressed with supervised machine learning methods such as decision trees.

I will fetch the data and test several classification methods to find the best model. If the model has strong predictive power, it could help Yourcabs.

There are at least two possible benefits.

Firstly, with the right model, Yourcabs can identify the variables that contribute to drivers' cancellations. Insights from the model could then inform the next iteration of the application: new features could be added to the service to influence those contributing variables.

Secondly, if the model can accurately predict an individual driver's behavior, it can be integrated into the application. When a driver is very likely to cancel, an intervention could be arranged to reduce his or her incentive to cancel.

## 2. Prepare Data

Read data:

```{r message=F}
data_raw <- read_csv('Taxi-cancellation-case.csv')
```

### 2.1 Unimportant Variables

After careful evaluation, the following variables are considered unimportant:

- row#, user_id: not useful in a classification model
- from_long, to_long, from_lat, to_lat: other geographic variables are more relevant
- to_date: cancelled orders have no such value, and the arrival timestamp is not important

### 2.2 Missing Values

The following variables are dropped because they have too many missing values:

- to_city_id: 96.6% missing values
- package_id: 92.5% missing values, and its meaning is unclear
- from_city_id: 63% missing values

### 2.3 Categorical Variables

Bar plots are used to examine the distribution of the categorical variables.

```{r}
categorical_col_names <- c('Car_Cancellation', 'mobile_site_booking',
                           'online_booking', 'travel_type_id',
                           'vehicle_model_id', 'from_area_id', 'to_area_id')
data_raw %>%
  select(all_of(categorical_col_names)) %>%
  gather('variable', 'value') %>%
  mutate(value = ifelse(is.na(value), 'N.A.', value)) %>%
  ggplot(aes(x = value)) +
  geom_bar() +
  facet_wrap(. ~ variable, scales = "free")
```

After examining the bar plots, the following decisions are made:

- from_area_id and to_area_id are dropped, because they have too many levels and the observations are spread widely across them.
- vehicle_model_id's levels other than the most frequent one are lumped into a single level: other.
- The other categorical variables are converted to factors.
- Car_Cancellation's levels are relabeled as Not Cancel and Cancel to make the tree plot easier to read.

Code to preprocess the categorical variables:

```{r}
categorical_var <- c('Car_Cancellation', 'mobile_site_booking',
                     'online_booking', 'travel_type_id', 'vehicle_model_id')
df_1 <- data_raw %>%
  mutate_at(categorical_var, factor) %>%
  mutate(Car_Cancellation = fct_recode(Car_Cancellation,
                                       "Not Cancel" = "0",
                                       "Cancel" = "1")) %>%
  mutate(vehicle_model_id = fct_lump(vehicle_model_id, n = 1))
```

### 2.4 Datetime Variables

From personal experience, date and time are very likely to influence a driver's decision. The following variables are derived from the raw data:

- from_wday: the weekday of the taxi's supposed departure date
- from_time: the hour and minute of the taxi's supposed departure time
- time_before_trip: the number of hours between booking and departure

Unluckily, the raw datetime variables are _dirty_: 60% of the values are badly formatted, and they become missing values when converted to datetime objects.

However, I insist on keeping these variables, because rpart can deal with missing data. According to the rpart documentation, missing values are handled with surrogate splits, which reduces their effect to a certain degree.

When it comes to model comparison, we could always fit different models and keep the best one.
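To illustrate how rpart's surrogate handling behaves, here is a minimal self-contained sketch on toy data (not the Yourcabs dataset; all variable names here are invented). A share of the primary splitting variable is made missing, yet rows can still be routed down the tree because rpart falls back on surrogate splits or the majority direction:

```r
library(rpart)

set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)            # correlated with x1, a surrogate candidate
y  <- factor(ifelse(x1 > 0, "Cancel", "Not Cancel"))
x1[sample(n, 100)] <- NA                 # inject 20% missing values

# usesurrogate = 2: use surrogates, and send remaining NAs in the majority direction
fit <- rpart(y ~ x1 + x2, method = "class",
             control = rpart.control(usesurrogate = 2, maxsurrogate = 5))

# rows with missing x1 still receive non-missing class predictions
preds <- predict(fit, data.frame(x1 = NA, x2 = c(-1, 1)), type = "class")
```

This is only a sketch of the mechanism; the real model below relies on the same default surrogate behavior without any extra tuning.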

Here is the code to handle the datetime variables:

```{r, warning=F}
datetime_var <- c('from_date', 'booking_created')
df_2 <- df_1 %>%
  mutate_at(datetime_var, ymd_hm) %>%
  # difftime with explicit units so the result is always in hours
  mutate(time_before_trip = as.numeric(difftime(from_date, booking_created,
                                                units = 'hours'))) %>%
  mutate(from_wday = weekdays(from_date),
         from_time = hour(from_date) + minute(from_date) / 60) %>%
  select(-from_date, -booking_created)
```

### 2.5 Data After Cleaning

As a final step, only variables needed by the model are kept:

```{r}
kept_var <- c(categorical_var, 'time_before_trip', 'from_wday', 'from_time')
df_clean <- df_2 %>%
  select(all_of(kept_var))

kable(df_clean[1:6, ],
      format = 'latex',
      digits = 0) %>%
  kableExtra::kable_styling(latex_options = 'scale_down')
```

## 3. Classification Tree

### 3.1 Partition

60% of the data are used as training data:

```{r}
set.seed(1)
train_index <- sample(c(1:dim(df_clean)[1]), dim(df_clean)[1] * 0.6)
train_df <- df_clean[train_index, ]
valid_df <- df_clean[-train_index, ]
```

### 3.2 Balancing Training Data

The response variable is very imbalanced. Therefore, the training data is balanced with the upSample() method:

```{r}
train_upsample <- upSample(train_df, train_df$Car_Cancellation)
# upSample() appends a Class column; drop it to keep the original columns
train_upsample <- train_upsample[, -dim(train_upsample)[2]]
```

### 3.3 Model Fitting

The model is fitted and pruned. Only 4 predictor variables are left. Code and plot:

```{r}
fit_ct <- rpart(Car_Cancellation ~ ., data = train_upsample,
                method = "class",
                control = rpart.control(minsplit = 1000,
                                        minbucket = 100,
                                        xval = 4))
prp(fit_ct, type = 1, extra = 1, split.font = 1, varlen = -10,
    box.col = ifelse(fit_ct$frame$var == "<leaf>", 'gray', 'white'))
```

### 3.4 Predictors

Four predictors are found:

- Is the booking made online?
- Is the booking made on the mobile site?
- Is the time before the trip longer than 6.3 hours?
- Is the departure day a Saturday, Sunday or Thursday?

IS THE BOOKING NOT MADE ONLINE? That is, is the booking made by phone call? If it is, the order is less likely to be cancelled. This actually makes sense. The dataset is collected mostly between 2001 and 2012, when online booking meant PC booking. Someone who books by phone presumably has a more urgent need for a taxi than someone sitting in front of a PC.

Given that the booking is made by phone call, IS THE BOOKING NOT MADE ON THE MOBILE SITE? This one correlates highly with BOOKING ONLINE, since BOOKING BY MOBILE SITE is a subset of BOOKING ONLINE.

The next two nodes are more interesting.

Given that the booking is made online, IS THE TIME BEFORE DEPARTURE LONGER THAN 6.3 HOURS? If so, the driver is more likely to cancel. This makes sense: a longer waiting time means higher uncertainty, and more things could happen to the driver such that he or she could not or would not fulfill the promise.

Given that the time before departure is longer than 6.3 hours, IS THE DEPARTURE DAY A SATURDAY, SUNDAY OR THURSDAY? If so, the driver tends not to cancel. Longer waiting hours seem more endurable on the weekend. This one is tricky.
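A complementary way to check which predictors drive a tree is rpart's built-in variable importance. The sketch below uses invented toy data (the names online and hours only mimic the real predictors); on the actual fit one would inspect fit_ct$variable.importance the same way.

```r
library(rpart)

set.seed(7)
# toy data whose label depends on two of the three variables
df <- data.frame(online = factor(sample(0:1, 300, replace = TRUE)),
                 hours  = runif(300, 0, 24),
                 noise  = rnorm(300))
df$cancel <- factor(ifelse(df$online == 1 & df$hours > 6.3,
                           "Cancel", "Not Cancel"))

fit <- rpart(cancel ~ ., data = df, method = "class")

# named numeric vector: each variable's total contribution to the splits
# (including credit for acting as a surrogate)
imp <- fit$variable.importance
```

Sorting this vector gives a quick ranking that should agree with what the tree plot shows.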

## 4. Confusion Matrix

```{r}
pred_train <- predict(fit_ct, train_df, type = "class")
pred_valid <- predict(fit_ct, valid_df, type = "class")
train_conf_ma <- confusionMatrix(pred_train, train_df$Car_Cancellation)
valid_conf_ma <- confusionMatrix(pred_valid, valid_df$Car_Cancellation)
```

The training data set's confusion matrix:

```{r}
train_conf_ma$table
train_conf_ma$overall[1]
```

The training data set has an accuracy of 66.7%. This level of accuracy is not good. Besides, when the model predicts a Cancel, it is right only 13% of the time. That's poor.

The validation data set's confusion matrix:

```{r}
valid_conf_ma$table
valid_conf_ma$overall[1]
```

The validation data set has an accuracy of 66.5%. The accuracies of the two data sets are close, so at least the model does not have bad out-of-sample performance.

## 5. Lift Plot

```{r}
valid_df$pred_value <- as.integer(pred_valid == 'Cancel')
lift_obj <- lift(relevel(Car_Cancellation, ref = 'Cancel') ~ pred_value,
                 data = valid_df)
plot(lift_obj, main = 'Lift Plot')
```

The curve only starts to show a clear gain after 30% of the samples are tested, and 40% of the samples must be tested to find 60% of the cancellations. Judging by this lift plot, the model is not good.

Can the model be used in practice? Hardly. An accuracy of 66% is not far from a 50% guess, and if we used this model in practice we would overestimate the probability of cancellation.

I suspect the fitting problems come from:

1. The imbalanced nature of the dataset
2. The missing values generated by the datetime operations

To improve the model, we could:

- Increase the sample size
- Reduce the number of missing values or badly formatted values
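To make the lift plot's meaning concrete, the underlying cumulative gains can be computed by hand. This is a base-R sketch on made-up scores and labels (score and truth are invented, 1 = Cancel): sort cases from highest to lowest predicted score, then track what fraction of all cancellations has been found after testing each fraction of the sample.

```r
# toy predicted scores and true labels (1 = Cancel)
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
truth <- c(1,   1,   0,   1,   0,   0,   1,   0)

ord <- order(score, decreasing = TRUE)          # test the highest scores first
cum_found  <- cumsum(truth[ord]) / sum(truth)   # fraction of cancellations found
pct_tested <- seq_along(truth) / length(truth)  # fraction of samples tested

# after testing the top 50% of samples, 75% of the cancellations are found
cum_found[4]
```

A good model pushes this curve well above the diagonal early on; the flat start observed in the plot above is exactly what a weak ranking looks like in these terms.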