EM 623 – FINAL PROJECT
Student: Rush Kirubi
Semester: Fall 2017
Instructor: Dr. Carlo Lipizzi
Business Understanding
Bike sharing is increasingly popular in major cities across the globe. One estimate puts the count
at well over 100 programs across 125 cities (Shaheen, Guzman, & Zhang, 2010). In the
New York/New Jersey area, we have Citi Bike, whose ubiquitous blue bikes are seen in Jersey
City and across the boroughs of New York City. Hoboken is situated in the same area but uses a
different public rental bike system known as Hudson Bike Share (herein, HBS). And just as trip
data associated with Citi Bike has been publicly released, Hudson Bike Share has picked up the
same initiative (NextBike, n.d.).
The mobile application provided by HBS allows real-time viewing of bike stations, showing the
number of bikes currently available. This is useful at the moment a bike is needed, but just as a
commuter is interested in the timing of a train the following day, a user wants to know whether
bikes will be available at a bike station of their choice at some future time. We thus seek to build
a forecasting system that predicts bike demand for a station and thereby helps a user assess bike
availability for a future date.
Data Understanding
Unlike Citi Bike, whose trip data dates back to 2013 (Citi Bike, n.d.), HBS is relatively new, and
the trip data available is limited to 2016. The variables available include the trip id, start time,
stop time, bike id, trip duration, the "to" and "from" station ids, and the "to" and "from" station
names. 12.83% of the "from" station ids and 15.33% of the "to" station ids are missing. The full
dataset has 151,103 records.
Metadata
According to the providers of the dataset, a valid recorded trip is one that lasts longer than one
minute and begins and ends at a valid HBS station. Below are descriptive statistics for trip
duration: the median is about 8.78 minutes while the mean is 32.62 minutes, suggesting a heavily
right-skewed distribution. There are clearly outliers that may be due to data glitches, as the
shortest recorded trip is 0.01 minutes and the longest is 33,161.58 minutes.
Trip Duration – Descriptive Statistics
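The checks above can be sketched in pandas. The column names below are hypothetical stand-ins for the actual HBS export, and the tiny frame only illustrates the operations, not the real figures:

```python
import pandas as pd

# Hypothetical column names; the actual HBS export may label them differently.
trips = pd.DataFrame({
    "trip_duration_min": [0.01, 5.2, 8.78, 12.4, 33161.58],
    "from_station_id": [101, None, 103, 104, None],
    "to_station_id": [None, 102, 103, None, 105],
})

# Share of missing station ids, analogous to the 12.83% / 15.33% reported above.
pct_missing_from = trips["from_station_id"].isna().mean() * 100
pct_missing_to = trips["to_station_id"].isna().mean() * 100

# Descriptive statistics for trip duration; a mean far above the
# median signals a heavily right-skewed distribution with outliers.
stats = trips["trip_duration_min"].describe()
print(stats[["min", "50%", "mean", "max"]])
```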
There are 245 bikes, but the number of unique stations differs between "from" stations and "to"
stations, at 44 and 46 respectively. We would expect the two station counts to be the same. Upon
further inspection, we find bike stations included in the data that do not belong to Hoboken, or to
the US for that matter:
Station name outliers
In addition, from the HBS REST API, we have stations that have been recently added in 2017
but were not in the 2016 dataset:
• 600 Harrison
• 6th Street
• Bloomfield Street
• Clinton Street 1000
• Washington Street 230
• Washington Street 600
• Washington Street 601
Next is a bar graph of the distribution of where bike trips start across stations.
Hoboken Terminal, where the NJ Transit and PATH train stations are located, is understandably
the station seeing the highest demand.
What is true of other bike sharing systems is also true of HBS: there is some correlation between
bike trips and the weather, whether it is a workday, and even the hour a trip starts (Kaggle,
2014). Collecting all the weather data for Hoboken for 2016 at hourly granularity, and flagging
weekends and holidays, we obtain the following correlation matrix:
Correlation Matrix
The number of bikes taken is most strongly (positively) correlated with temperature, and the
graph below shows further evidence of that: the hotter summer months see significantly higher
bike demand than the much colder winter months.
In addition, bikes taken is inversely related to the amount of precipitation and to wind speed,
and directly related to temperature and visibility. The higher the humidity (hum), the lower the
bike demand, though that negative correlation is weak. Interestingly, pressure also has a negative
correlation with bikes taken. There is high multicollinearity amongst precipitation, pressure and
temperature, making one or more of them candidates for removal.
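Computing such a matrix is a one-liner on the merged frame. The sketch below uses synthetic weather columns (the names `temp`, `hum`, `windspeed` and the generating relationship are illustrative assumptions, not the real Hoboken data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
temp = rng.normal(20, 8, n)

# Illustrative stand-in for the merged hourly weather + trips frame.
weather = pd.DataFrame({
    "temp": temp,
    "hum": rng.uniform(30, 90, n),
    "windspeed": rng.uniform(0, 25, n),
    # bikes_taken rises with temperature plus noise, mimicking the observed pattern
    "bikes_taken": 2 * temp + rng.normal(0, 10, n),
})

# Pairwise Pearson correlations; the bikes_taken column ranks each feature.
corr = weather.corr()
print(corr["bikes_taken"].sort_values(ascending=False))
```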
Data Preparation
We begin by eliminating observations whose stations we have no historical data for. As pointed
out earlier, the remaining stations are those within Hoboken. It is important to note that the
nature of our problem does not require the trip duration variable, which is subsequently dropped;
variables pertaining to when trips start are retained, while those pertaining to when trips end are
dropped. Given the multicollinearity concerns and the difficulty of forecasting precipitation
(usually only a probability of rain is available, not the amount that will fall), we also drop the
variables pressure, visibility, wind speed and precipitation.
Whether it is a holiday and whether it is a workday are two variables created by checking
observed federal holidays in the United States for 2016 (state holidays were not considered) and
by differentiating between weekdays and weekends.
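One way to derive these two flags, sketched here with pandas' built-in federal holiday calendar (the source does not say which holiday list was used, so this calendar is an assumption):

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

dates = pd.date_range("2016-01-01", "2016-12-31", freq="D")
holidays = USFederalHolidayCalendar().holidays(start=dates.min(), end=dates.max())

df = pd.DataFrame({"date": dates})
df["is_holiday"] = df["date"].isin(holidays).astype(int)
# A workday is a weekday (Mon-Fri) that is not an observed federal holiday.
df["is_workday"] = ((df["date"].dt.dayofweek < 5) & (df["is_holiday"] == 0)).astype(int)
```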
The station names and the hour a bike trip starts are categorical variables that are converted to
binary (dummy) variables; the start hour is derived by pivoting operations. Understandably,
dummy encoding raises multicollinearity concerns because of the degrees of freedom in play, but
because we are implementing a decision tree, which is insensitive to multicollinearity, this is not
an issue.
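The conversion itself can be done with `pd.get_dummies`; the station and column names below are illustrative:

```python
import pandas as pd

trips = pd.DataFrame({
    "from_station": ["Hoboken Terminal", "City Hall", "Hoboken Terminal"],
    "start_hour": [8, 17, 8],
})

# get_dummies expands each category into its own 0/1 column.
encoded = pd.get_dummies(trips, columns=["from_station", "start_hour"])
print(encoded.columns.tolist())
```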
As for the target variable, we have hinted at using bikes_taken at a given hour, though a few
adjustments are needed. For a more precise prediction, it would have been nice to have the
number of bikes available in a given hour alongside the bikes_taken variable we derived; the
difference between the two would reveal stations prone to shortages in particular situations. But
that variable is not offered. We therefore create a binary target, available, where 1 denotes a bike
station having more than one bike and 0 denotes a station having one bike or none. This warrants
explanation: in the author's experience, when only one bike is available at a station, more often
than not there is a problem with it: either the onboard computer does not work, or there are
mechanical issues such as a punctured tire or a loose-fitting chain.
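The thresholding step is straightforward once a per-station, per-hour bike count exists; the frame below is a hypothetical stand-in for that derived count:

```python
import pandas as pd

# Hypothetical per-station, per-hour bike counts; the real count would be
# derived from the trip records.
station_hours = pd.DataFrame({
    "station": ["Hoboken Terminal", "City Hall", "Church Square"],
    "bikes": [7, 1, 0],
})

# available = 1 only when more than one bike is present; a lone bike is
# treated as effectively unavailable because it is often mechanically faulty.
station_hours["available"] = (station_hours["bikes"] > 1).astype(int)
```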
After cleaning, our data now has 64,361 observations.
Modeling
We split the data into two parts: 67% for training and 33% for testing. Our target variable is
binary, so this is a classification task. As mentioned earlier, we use a Classification and
Regression Tree (CART), or simply, a decision tree. Below are the hyperparameters for our
decision tree.
Notably, the max_leaf_nodes hyperparameter acts as a regularization term; without it, our
decision tree is prone to overfitting. Before observing how the trained decision tree performs on
the test set, we run it through five-fold cross-validation on the training dataset to assess its
viability first.
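The split, regularized tree, and cross-validation can be sketched with scikit-learn. The synthetic data and the particular `max_leaf_nodes` value are illustrative assumptions, not the project's actual settings:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared feature matrix and binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# 67/33 train/test split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# max_leaf_nodes caps tree growth, acting as a regularizer against
# overfitting; the value 20 is illustrative only.
tree = DecisionTreeClassifier(max_leaf_nodes=20, random_state=42)

# Five-fold cross-validation on the training set before touching the test set.
scores = cross_val_score(tree, X_train, y_train, cv=5)
print(scores.mean())
```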
Here is a snapshot of the trained decision tree limited to five levels. The most significant feature
happens to be temperature:
Decision tree snapshot
Evaluation
On a 5-fold cross-validation of the training set, performance is as follows:
Accuracy is stable at around 64%, so we are confident in the stability of our selected
hyperparameters. Let us see how the model performs on the test set:
Again, accuracy is about 64%. Though not as high as we would like, it matches the performance
on the training set, which means there is no overfitting: the model generalizes well to unseen
data. The recall score is higher than the precision score by about 20 percentage points, meaning
that the model separates true positives from false negatives much better than it separates true
positives from false positives. The confusion matrix below aids this inspection:
Confusion Matrix
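These metrics come directly from scikit-learn. The predictions below are made-up toy values chosen so that recall exceeds precision, echoing the pattern described above; they are not the project's actual results:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Illustrative labels and predictions; the real ones come from the fitted tree.
y_test = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 1, 1, 1, 0, 0, 1])

print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))   # TP / (TP + FP)
print(recall_score(y_test, y_pred))      # TP / (TP + FN)
print(confusion_matrix(y_test, y_pred))  # rows: true class, cols: predicted
```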
The ROC curve shows that the decision tree is not highly accurate, as the red line does not hug
the x or y axes, but it is better than random, as the curve sits distinctly above the 45-degree line.
ROC Curve
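The curve and its area can be computed from predicted class probabilities. The probabilities below are illustrative placeholders for `tree.predict_proba(X_test)[:, 1]`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative labels and class-1 probabilities; not the project's values.
y_test = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_prob = np.array([0.2, 0.4, 0.6, 0.8, 0.3, 0.7, 0.4, 0.6])

# fpr/tpr trace the ROC curve; AUC summarizes it in one number,
# where 0.5 is random guessing and 1.0 is perfect separation.
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
print(auc)
```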
Deployment
With our decision tree model ready, we ideally need an interface for specifying the required
inputs and making a prediction. A web application is a natural choice, as it is available on
devices of different form factors, from desktop PCs to mobile phones.
The web application can be found at: http://rushkirubi.pythonanywhere.com/
There was a trade-off between automating the retrieval of weather data and letting the user
specify these inputs. The former is much easier on the user, as fewer fields need to be completed.
Yet it requires imagination on the part of the architect: should the forecast be limited to a day? A
few days? A few weeks? There are a number of possibilities, and thus it was reasonable to leave
these fields open, so users can check predictions for different situations to their heart's content.
Beyond the variables dropped before modelling, inspection shows that our decision tree does not
use all the features provided to it, so those can be pruned as well. The first illustration is a
screenshot of the web application that receives input; the next shows the result. Both the binary
target and its predicted probability are displayed. The web application is built on Flask, a
third-party Python web framework useful for building prototypes. In the Python world, pickling
an object means serializing it and saving it to disk to be loaded and used when required; this is a
go-to option for deploying machine learning models. But the simplicity of the decision tree lends
itself to conversion into a series of nested if-else statements modularized into a function, and this
implementation was used instead (StackOverflow, 2016).
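One way to obtain those rules is scikit-learn's `export_text`, which dumps a fitted tree in a human-readable form that can be transcribed into a plain Python function. The data and feature names below are synthetic placeholders, not the HBS features:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with three hypothetical features.
X, y = make_classification(n_samples=200, n_features=3, n_informative=2,
                           n_redundant=0, random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)

# export_text renders the fitted tree as indented if-else style rules,
# avoiding the need to pickle the model for deployment.
rules = export_text(tree, feature_names=["temp", "hum", "hour"])
print(rules)
```

Each branch of the printed rules maps directly onto one nested `if`/`else` in a hand-written prediction function.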
Get a prediction
Example of a Prediction
Bibliography
Citi Bike. (n.d.). Retrieved December 3, 2017, from Citi Bike:
https://www.citibikenyc.com/system-data
Kaggle. (2014). Bike Sharing Demand. Retrieved from Kaggle:
https://www.kaggle.com/c/bike-sharing-demand
NextBike. (n.d.). Hudson Bike Share – Trip Data. Retrieved November 28, 2017, from Hudson
Bike Share: https://hudsonbikeshare.com/data/
Shaheen, S., Guzman, S., & Zhang, H. (2010). Bikesharing in Europe, the Americas, and Asia:
Past, Present, Future. Transportation Research Record: Journal of the Transportation
Research Board, 2143, 159-167.
StackOverflow. (2016, September 29). How to extract the decision rules from scikit-learn
decision-tree? Retrieved from
https://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree