

EM 623 – FINAL PROJECT

Student: Rush Kirubi

Semester: Fall 2017

Instructor: Dr. Carlo Lipizzi

Business Understanding

Bike sharing is increasingly becoming popular in major cities across the globe. One estimate is

that there are well over 100 programs in 125 cities (Shaheen, Guzman, & Zhang, 2010). In the

New York/New Jersey area, we have Citi Bike, whose ubiquitous blue bikes are seen in Jersey

City and across the boroughs of New York City. Hoboken is situated in the same area but uses a

different public rental bike system known as Hudson Bike Share (herein, HBS). And just as trip data associated with Citi Bike has been released publicly, so has the same initiative been picked up by Hudson Bike Share (NextBike, n.d.).

The mobile application provided by HBS allows real-time viewing of bike stations, showing the

number of bikes that are available. While useful at the moment a bike is needed, this offers no help with planning ahead: just as one checks the timing of a train commute for the following day, a user wants to know whether bikes will be available at a bike station of their choice. We thus seek to build a forecasting system that can forecast bike demand for a station and thereby help a user assess bike availability for a future date.

Data Understanding

Unlike Citi Bike, whose trip data dates back to 2013 (Citi Bike, n.d.), HBS is relatively new, and

the trip data available is limited to 2016. The variables available include the trip id, start time,

stop time, bike id, trip duration, to and from station ids, and to and from station names. 12.83%

of the “from station ids” and 15.33% of the “to station ids” are missing. The total dataset has

151,103 records.
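The missing-value percentages quoted above can be computed directly from the trip table. Below is a minimal sketch with a toy eight-row frame standing in for the 151,103-record file; the column names are assumptions, not necessarily the vendor's exact schema.

```python
import numpy as np
import pandas as pd

# Toy trip records standing in for the full HBS file; column names
# are assumed for illustration.
trips = pd.DataFrame({
    "trip_id": range(8),
    "from_station_id": [1, 2, np.nan, 4, 5, np.nan, 7, 8],
    "to_station_id": [2, np.nan, 3, np.nan, np.nan, 6, 7, 1],
})

# Share of missing values per station-id column, as a percentage
missing_pct = trips[["from_station_id", "to_station_id"]].isna().mean() * 100
print(missing_pct.round(2))
```

On the real dataset the same two-line check yields the 12.83% and 15.33% figures reported above.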

Metadata

According to the providers of the dataset, a recorded trip is valid if it lasts longer than a minute and begins and ends at a valid HBS station. Below are descriptive statistics for trip duration: the median is about 8.78 minutes while the mean is 32.62 minutes, suggesting a heavily right-skewed distribution. There are clearly outliers that may be due to data glitches, as the shortest recorded trip is 0.01 minutes and the longest is 33,161.58 minutes.

Trip Duration – Descriptive Statistics
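The mean-versus-median gap is easy to reproduce on a small hypothetical sample. The values below are made up to mimic the pattern in the table (typical short trips plus a few glitchy extremes), not taken from the dataset.

```python
import pandas as pd

# Hypothetical durations in minutes: mostly short trips plus one
# extreme glitch-like value.
durations = pd.Series([0.01, 5.2, 7.9, 8.8, 9.1, 12.4, 30.0, 33161.58])

print(durations.describe())

# A mean far above the median is the signature of heavy right skew.
print("mean exceeds median:", durations.mean() > durations.median())
```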

There are 245 bikes, but the numbers of unique "from stations" and "to stations" differ: 44 and 46, respectively. We would expect the station counts to be the same; upon further inspection, we find bike stations included in the data that do not belong to Hoboken, or to the US for that matter:

Station name outliers

In addition, from the HBS REST API, we have stations that have been recently added in 2017

but were not in the 2016 dataset:

• 600 Harrison

• 6th Street

• Bloomfield Street

• Clinton Street 1000

• Washington Street 230

• Washington Street 600

• Washington Street 601

Next is a bar graph of the distribution of where bike trips start across stations:

Hoboken Terminal, where the NJ Transit and PATH train stations are located, is understandably the station seeing the highest demand.

What is true of other bike sharing systems is also true of HBS. Chiefly, bike trips correlate with the weather, with whether it is a workday, and even with the hour a trip starts (Kaggle, 2014). Collecting all the weather data for Hoboken for 2016 at hourly granularity, and separating weekends and holidays, we obtain the following correlation matrix:

Correlation Matrix

The number of bikes taken is most positively correlated with temperature. The graph below offers further evidence of this: the hotter summer months see significantly higher bike demand than the much colder winter months.

In addition, the number of bikes taken is inversely related to the amount of precipitation and the wind speed, and directly related to temperature and visibility. The higher the humidity (hum), the lower the bike demand, though this negative correlation is weak. Interestingly, pressure also has a negative correlation with bikes taken. There is high multicollinearity among precipitation, pressure, and temperature, making one or more of them candidates for removal.
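A correlation matrix of this kind is one `corr()` call on the joined hourly frame. The sketch below uses seeded synthetic weather stand-ins (the coefficients in the data-generating line are invented for illustration), not the real 2016 Hoboken weather.

```python
import numpy as np
import pandas as pd

# Synthetic hourly weather stand-ins, seeded for reproducibility;
# the relationship below is invented to mimic a temperature-driven demand.
rng = np.random.default_rng(0)
temp = rng.uniform(0, 35, 500)            # temperature
hum = rng.uniform(20, 100, 500)           # humidity
bikes_taken = 2.0 * temp - 0.05 * hum + rng.normal(0, 5, 500)

hourly = pd.DataFrame({"temp": temp, "hum": hum, "bikes_taken": bikes_taken})
corr = hourly.corr()
print(corr.round(2))
```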

Data Preparation

We begin by eliminating observations whose stations we do not have historical data for. As pointed out earlier, the stations that remain are those in Hoboken. It is important to note that the nature of our problem does not necessitate the trip duration variable, which is subsequently dropped; variables pertaining to when trips start are retained, and those pertaining to when trips end are dropped. Given the multicollinearity concerns and the difficulty of forecasting precipitation (usually only the probability of rain is available, not the amount that will fall), we drop the variables pressure, visibility, wind speed, and precipitation.

Whether it is a holiday and whether it is a workday are two variables created by checking observed federal holidays in the United States (state holidays were not considered) and by differentiating weekdays from weekends for 2016.
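A minimal sketch of these two calendar features follows. The holiday list is a hand-entered set of observed 2016 US federal holidays, an assumption for illustration rather than the report's actual code.

```python
import pandas as pd

# Three sample trip start times around Independence Day 2016.
trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2016-07-04 08:00",   # Independence Day (a Monday)
        "2016-07-05 08:00",   # an ordinary Tuesday
        "2016-07-09 10:00",   # a Saturday
    ])
})

# Observed 2016 US federal holidays (hand-entered; assumed for this sketch).
federal_2016 = pd.to_datetime([
    "2016-01-01", "2016-01-18", "2016-02-15", "2016-05-30",
    "2016-07-04", "2016-09-05", "2016-10-10", "2016-11-11",
    "2016-11-24", "2016-12-26",
])

dates = trips["start_time"].dt.normalize()
trips["holiday"] = dates.isin(federal_2016).astype(int)
# A workday is a weekday that is not a holiday.
trips["workday"] = ((trips["start_time"].dt.dayofweek < 5)
                    & (trips["holiday"] == 0)).astype(int)
print(trips)
```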

The station names and the hour a bike trip starts are categorical variables that are converted to binary (dummy) variables; the hour a trip starts is derived through pivoting operations. Understandably, this encoding raises further multicollinearity concerns due to the degrees of freedom in play, but because we intend to implement a decision tree, this is not an issue.
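The dummy conversion is a one-liner with pandas; column and prefix names below are assumed for illustration.

```python
import pandas as pd

# Toy frame with the two categoricals discussed above.
df = pd.DataFrame({
    "station": ["Hoboken Terminal", "City Hall", "Hoboken Terminal"],
    "hour": [8, 17, 8],
})

# One binary column per station name and per hour value.
encoded = pd.get_dummies(df, columns=["station", "hour"], prefix=["stn", "hr"])
print(encoded.columns.tolist())
```

Because a decision tree is used downstream, the linearly dependent dummy columns are harmless and no reference category needs to be dropped.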

As for the target variable, we have hinted at using bikes_taken at a given hour, though we need to make a few adjustments. For an even more precise prediction, it would have been ideal to have the number of bikes available in a given hour alongside the bikes-taken variable we derived; the difference between the two would reveal stations prone to shortages in particular situations. But this variable is not offered. We therefore create a binary target, available, where 1 denotes a bike station having more than one bike and 0 denotes a station having one bike or none. This threshold warrants explanation: from the author's experience, when only one bike is available at a station, more often than not there is a problem with it, either the onboard computer does not work or there are mechanical issues such as a punctured tire or a loose-fitting chain.

After cleaning, our data has 64,361 observations.
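The thresholding step itself is simple; the sketch below shows it on made-up per-station, per-hour counts (the report derives its counts from the trip data).

```python
import pandas as pd

# Hypothetical bike counts; only the thresholding logic is of interest.
counts = pd.DataFrame({"bikes": [0, 1, 2, 5]})

# available = 1 when more than one bike is present, else 0.
counts["available"] = (counts["bikes"] > 1).astype(int)
print(counts)
```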

Modeling

We split the data into two parts: 67% for training and 33% for testing. Our target variable is binary, so this is a classification task. As mentioned earlier, we use a Classification and Regression Tree (CART), or simply, a decision tree. Below are the hyperparameters for our decision tree.

Notably, the max_leaf_nodes hyperparameter acts as a regularization term, whose absence would make our decision tree prone to overfitting. Before observing how the trained decision tree performs on the test set, we run it through a five-fold split of the training dataset to assess its viability first.
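The split, regularized tree, and five-fold check can be sketched as follows. The data here is synthetic, and max_leaf_nodes=20 is an illustrative value, not the report's actual setting.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in features with one informative signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# 67/33 split, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# max_leaf_nodes caps tree growth and so acts as regularization.
tree = DecisionTreeClassifier(max_leaf_nodes=20, random_state=42)

# Five-fold viability check on the training data before touching the test set.
cv_scores = cross_val_score(tree, X_train, y_train, cv=5)
tree.fit(X_train, y_train)

print("CV accuracy:", cv_scores.mean().round(3))
print("Test accuracy:", round(tree.score(X_test, y_test), 3))
```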

Here is a snapshot of the trained decision tree limited to five levels. The most significant feature

happens to be temperature:

Decision tree snapshot

Evaluation

On a 5-fold cross-validation of the training set, performance is as follows:

Accuracy is stable at around 64%. We are thus confident in the stability of our selected hyperparameters; let us see how the model performs on the test set:

Again, accuracy is about 64%. Though not as high as we would like, it matches the performance on the training set, which means there is no overfitting and the model generalizes well to unseen data. The recall score is higher than the precision score by about 20 percentage points, meaning the model distinguishes true positives from false negatives much better than it distinguishes true positives from false positives. The confusion matrix below aids this inspection:

Confusion Matrix
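The recall-above-precision pattern can be illustrated on a small set of made-up labels (these are not the report's predictions, merely chosen to mirror the behaviour described).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels with more false positives than false negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

cm = confusion_matrix(y_true, y_pred)   # rows = actual, cols = predicted
print(cm)
print("precision:", round(precision_score(y_true, y_pred), 3))
print("recall:", round(recall_score(y_true, y_pred), 3))
```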

The ROC curve shows that the decision tree is not highly accurate, as the red curve does not hug the axes, but it is better than random, as it bows distinctly upward, away from the 45-degree line.

ROC Curve
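The "better than random" judgment corresponds to an area under the ROC curve above 0.5. A minimal sketch with hypothetical predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities; an AUC above 0.5 means
# the curve lies above the 45-degree chance line.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_prob = np.array([0.2, 0.4, 0.6, 0.7, 0.5, 0.3, 0.9, 0.1])

auc = roc_auc_score(y_true, y_prob)
print("AUC:", auc)
```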

Deployment

With our decision tree model ready, we ideally need an interface to specify the required inputs and make a prediction. A web application lends itself as a natural choice, as it is available on devices of different form factors, from desktop PCs to mobile phones.

The web application can be found at: http://rushkirubi.pythonanywhere.com/

There was a trade-off between automating the retrieval of weather data and letting the user specify these inputs. The former makes things easier for the user, as fewer fields need to be completed. Yet it requires imagination on the part of the architect: should the forecast be limited to a day? A few days? A few weeks? There are a number of possibilities, and thus it was reasonable to leave these fields open to the user, who can check predictions for different situations to their heart's content.

Beyond the variables we dropped before modelling, inspection shows that our decision tree does not use all of the features provided to it, so the unused ones can be pruned as well. The first illustration is a screenshot of the web application that receives input; the next illustration shows the result. Both the binary target and its respective probability are shown. The web application is built on Flask, a third-party Python web framework useful for building prototypes. In the Python world, pickling an object means serializing it and saving it to disk to be loaded and used when required; this is a go-to option for deploying machine learning models. But the simplicity of the decision tree allows it to be converted into a series of nested if-else statements that can be modularized into a function, and this implementation was used instead (StackOverflow, 2016).
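One way to read off the learned rules before transcribing them into a plain nested if/else function is scikit-learn's export_text utility. The tree and features below are synthetic stand-ins, not the report's model.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny illustrative tree on synthetic data.
X, y = make_classification(n_samples=100, n_features=3, n_informative=2,
                           n_redundant=1, random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(X, y)

# Print the learned split rules; each root-to-leaf path can then be
# hard-coded as an if/else chain inside an ordinary Python function.
rules = export_text(tree, feature_names=["f0", "f1", "f2"])
print(rules)
```

Shipping such a hand-transcribed function avoids bundling a pickled model (and its scikit-learn dependency) with the Flask app.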

Get a prediction

Example of a Prediction

Bibliography
Citi Bike. (n.d.). Retrieved December 3, 2017, from Citi Bike: https://www.citibikenyc.com/system-data

Kaggle. (2014). Bike Sharing Demand. Retrieved from Kaggle: https://www.kaggle.com/c/bike-sharing-demand

NextBike. (n.d.). Hudson Bike Share – Trip Data. Retrieved November 28, 2017, from Hudson Bike Share: https://hudsonbikeshare.com/data/

Shaheen, S., Guzman, S., & Zhang, H. (2010). Bikesharing in Europe, the Americas, and Asia: Past, Present, Future. Transportation Research Record: Journal of the Transportation Research Board, 2143, 159–167.

StackOverflow. (2016, September 29). How to extract the decision rules from scikit-learn decision-tree? Retrieved from https://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree