Bikespector

Machine Learning for Shared Mobility

The forecast is made possible using public databases provided by the state of North Rhine-Westphalia as part of its open-data initiative.

84% prediction rate

10,000 bikes

We planned for Bikespector to predict how many bikes are most likely to be available at a given time and place. We also wanted users to be able to adjust the radius around the pick-up location to exactly the distance they are willing to walk to pick up a bike.

The development addressed questions such as: What factors influence bike availability? And is it even possible to make forecasts based on the available sources of data? We also had to decide on an appropriate structure for the forecast pipeline, a matching data-science approach, and the model offering the highest-quality forecasts.

Rapid prototyping in a data-science context requires a focus on solutions that deliver great impact with little implementation effort. Coming up with a prototype very early on allowed us to maintain an integrated view of Bikespector, while its simplicity gave us room for new ideas and approaches.

We collected bike-sharing data from the nextbike API for roughly three months. This table shows the factors that influence bike availability the most.

Bikespector: the first prototype

Imagine a map of Cologne at a given time. The bike-sharing data provides the exact locations of all available bikes. To reduce the complexity of the data, we discretized the space into a hexagonal grid. This meant each hexagonal cell could be regarded as a standalone forecast unit. If a user were to choose a location and radius on the map, the forecasts of all hexagonal cells within the radius would be added together.
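The grid idea can be illustrated with a minimal sketch. It assumes bike positions have already been projected to planar coordinates in metres, uses pointy-top hexagons in axial coordinates, and counts a cell as "within the radius" when its centre lies inside the circle. The function names, the cell size, and that inclusion rule are illustrative assumptions, not details taken from the Bikespector implementation.

```python
import math
from collections import Counter


def _round_axial(q, r):
    """Round fractional axial coordinates to the nearest hexagonal cell."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Re-derive the coordinate with the largest rounding error so q + r + s == 0.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)


def point_to_cell(x, y, size):
    """Map a planar point (metres) to the axial (q, r) index of its cell."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return _round_axial(q, r)


def cell_center(q, r, size):
    """Return the planar centre (metres) of cell (q, r)."""
    return (size * math.sqrt(3) * (q + r / 2), size * 1.5 * r)


def count_bikes_per_cell(bike_positions, size):
    """Discretize exact bike positions into per-cell counts."""
    return Counter(point_to_cell(x, y, size) for x, y in bike_positions)


def forecast_within_radius(cell_forecasts, x, y, radius, size):
    """Add up the forecasts of all cells whose centre lies within the radius."""
    total = 0.0
    for (q, r), value in cell_forecasts.items():
        cx, cy = cell_center(q, r, size)
        if math.hypot(cx - x, cy - y) <= radius:
            total += value
    return total


# Hypothetical per-cell forecasts (made-up numbers, 100 m cell size).
forecasts = {(0, 0): 2.0, (1, 0): 3.0, (5, 5): 10.0}
nearby = forecast_within_radius(forecasts, x=0, y=0, radius=200, size=100)
```

With these made-up numbers, `nearby` adds the forecasts of cells `(0, 0)` and `(1, 0)`, whose centres lie within 200 m of the origin, and ignores the distant cell `(5, 5)`.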

This trick left us with a simple data-science problem: the number of factors incorporated for bike availability in each grid cell was now limited to “Time” and “Day of Week” (see the table above). Although excluding “Latitude” and “Longitude” considerably reduced the degree of nonlinearity, bike availability still varied drastically across “Time”.
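To make the reduced problem concrete, the sketch below turns timestamped bike counts for a single cell into training rows with just these two factors. The encoding (fractional hour of day, weekday as 0 for Monday through 6 for Sunday) and the function names are assumptions for illustration; the source does not say how “Time” and “Day of Week” were encoded.

```python
from datetime import datetime


def to_features(timestamp):
    """Encode a timestamp as (hour of day as a float, weekday with Monday = 0)."""
    return (timestamp.hour + timestamp.minute / 60, timestamp.weekday())


def build_training_rows(snapshots):
    """Turn (timestamp, bikes_in_cell) pairs for one cell into features X and targets y."""
    X, y = [], []
    for timestamp, bikes_in_cell in snapshots:
        X.append(to_features(timestamp))
        y.append(bikes_in_cell)
    return X, y


# Hypothetical snapshots for one grid cell (made-up counts).
snapshots = [
    (datetime(2024, 1, 1, 8, 30), 4),   # a Monday morning
    (datetime(2024, 1, 6, 22, 0), 1),   # a Saturday evening
]
X, y = build_training_rows(snapshots)
```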

First we brought structure to the problem with simple statistical procedures and then we applied machine learning methods.

Julian Pohl, Data Scientist, denkwerk

Additionally, the amount of data available for an individual cell was now relatively low. Furthermore, we were more interested in predicting bike availability than in understanding it. These considerations led us to a more flexible machine-learning model. When we cross-validated a variety of models, the random forest, a combination of bagging and decision-tree learning, reached the highest forecast accuracy.
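A model comparison of this kind might look roughly like the following sketch, which uses scikit-learn. Everything specific here is an assumption for illustration: the synthetic data standing in for one cell's history, the choice of linear regression as the competing model, the five-fold split, and mean absolute error as the metric. The source only states that several models were cross-validated and that the random forest performed best on the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for one cell's history (NOT real Bikespector data).
hours = rng.integers(0, 24, size=600)
weekdays = rng.integers(0, 7, size=600)
# Made-up pattern: availability follows a daily cycle and is higher on weekends.
expected = 3 + 2 * np.cos(2 * np.pi * hours / 24) + (weekdays >= 5)
X = np.column_stack([hours, weekdays])
y = rng.poisson(expected)

# Candidate models; the linear baseline is an assumed comparison partner.
candidates = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Score every candidate on the same folds so the errors are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mae = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()  # scikit-learn reports negated errors

best_name = min(cv_mae, key=cv_mae.get)
```

`cv_mae` then holds one cross-validated error per model and `best_name` the model with the lowest error. For data with a strong time ordering, a time-aware split such as `TimeSeriesSplit` may be a better fit than shuffled folds.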

For the application prototype, we deployed the best-performing forecast model with an interface to the Bikespector front end.