Research projects are an important part of our work at denkwerk. Bikespector was born during one of these very projects. We began with a question: What can we achieve with machine learning, using little effort?
In the field of shared mobility, there exist numerous mobility-control systems. What they can’t do, though, is make predictions. Whether car or bike sharing, you see what’s available right now. How will things look over the next few days? You don’t know. That makes it impossible to plan your travel. And it’s also where Bikespector steps in. It provides information about how many bikes will most likely be available at a desired time at a queried location. The forecast is made possible using public databases provided by the state of North Rhine-Westphalia as part of its open-data initiative.
Our plan was for Bikespector to predict how many bikes will most likely be available at a given time and place. We also wanted users to be able to adjust the radius around the pick-up location, so that it matches exactly the distance they are willing to walk to pick up a bike.
The development addressed questions such as: What factors influence bike availability? And is it even possible to make forecasts based on the available sources of data? Equally important were questions about an appropriate structure for the forecast pipeline, a suitable data-science approach, and the model that would deliver the highest-quality forecasts.
Rapid prototyping in a data-science context requires a focus on solutions with great impact and little effort needed for implementation. Coming up with a prototype very early on made it possible for us to maintain an integrated view of Bikespector while its simplicity gave us room for new ideas and approaches.
We drew bike-sharing data from the nextbike API for roughly three months. This table shows the factors that influence bike availability the most.
Imagine a map of Cologne at a given time. The bike-sharing data provides the exact locations of all available bikes. To reduce the complexity of the data, we discretized the space into a hexagonal grid. This meant each hexagonal cell could be regarded as a standalone forecast unit. If a user were to choose a location and radius on the map, the forecasts of all hexagonal cells within the radius would be added together.
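To make the grid idea concrete, here is a minimal sketch of how such a discretization could work. It is not the actual Bikespector code. It assumes that bike positions have already been projected from latitude/longitude to planar coordinates in metres, that cells are pointy-top hexagons addressed by axial coordinates (q, r), and that a cell counts as "within the radius" when its center does. All function names and the cell size are hypothetical.

```python
import math
from collections import Counter


def _hex_round(q, r):
    """Round fractional axial coordinates to the nearest hexagonal cell."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Recompute the component with the largest rounding error so that
    # the cube-coordinate constraint q + r + s == 0 still holds.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr


def point_to_cell(x, y, size):
    """Map planar coordinates (metres) to the axial (q, r) id of a cell.

    `size` is the distance from a cell's center to one of its corners.
    """
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return _hex_round(q, r)


def cell_center(q, r, size):
    """Return the planar coordinates of the center of cell (q, r)."""
    return size * math.sqrt(3) * (q + r / 2), size * 1.5 * r


def count_bikes_per_cell(bike_positions, size):
    """Count how many bikes fall into each hexagonal cell."""
    return Counter(point_to_cell(x, y, size) for x, y in bike_positions)


def forecast_within_radius(cell_forecasts, x, y, radius, size):
    """Add up the forecasts of all cells whose center lies within `radius`."""
    total = 0.0
    for (q, r), value in cell_forecasts.items():
        cx, cy = cell_center(q, r, size)
        if math.hypot(cx - x, cy - y) <= radius:
            total += value
    return total
```

With a per-cell forecast dictionary such as `{(0, 0): 2.0, (1, 0): 3.0}`, a call like `forecast_within_radius(forecasts, 0, 0, radius=200, size=100)` sums every cell whose center is at most 200 metres from the chosen point. A production system would more likely rely on an established geospatial indexing library than on hand-written grid math.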
This trick left us with a simple data-science problem: The number of incorporated factors for bike availability in each grid cell was now limited to “Time” and “Day of Week” (see the table above). Although the degree of nonlinearity was considerably reduced by excluding “Latitude” and “Longitude”, the bike availability still varied drastically across “Time”.
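As an illustration, the per-cell training data for such a problem could be laid out as follows. This is a sketch under assumptions: the snapshot format (timestamp, cell id, bike count) and the function name `build_training_rows` are hypothetical and not taken from the project.

```python
from collections import defaultdict
from datetime import datetime


def build_training_rows(snapshots):
    """Group availability snapshots into per-cell training rows.

    `snapshots` is an iterable of (timestamp, cell_id, bike_count) tuples
    (an assumed format). Each cell gets its own list of rows of the form
    (hour_of_day, day_of_week, bike_count), where Monday is 0.
    """
    rows = defaultdict(list)
    for timestamp, cell_id, bike_count in snapshots:
        hour_of_day = timestamp.hour + timestamp.minute / 60.0
        rows[cell_id].append((hour_of_day, timestamp.weekday(), bike_count))
    return dict(rows)
```

Each cell then has its own small table with two input columns and one target column, which is what allows every cell to be treated as a standalone forecast unit.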
Additionally, the amount of available data for an individual cell was now relatively low. Furthermore, we were more interested in predicting bike availability than in understanding it. These arguments drove us to a more flexible machine-learning model. When we cross-validated a variety of models, one of them reached the highest forecast accuracy: the random forest, a combination of bagging and decision-tree learning.
For the application prototype, we deployed the best-performing forecast model with an interface to the Bikespector front end.
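To show what "a combination of bagging and decision-tree learning" means in practice, here is a deliberately small, dependency-free sketch of a random-forest regressor with k-fold cross-validation. It is not the model used in Bikespector. In a real project one would normally use a library implementation instead. The training data is synthetic, and all names, parameters, and default values are assumptions made for illustration.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def _sse(values):
    m = _mean(values)
    return sum((v - m) ** 2 for v in values)


def _best_split(rows, features):
    """Find the (feature, threshold) pair with the lowest squared error."""
    best, best_err = None, float("inf")
    for f in features:
        for t in sorted({x[f] for x, _ in rows})[1:]:
            left = [y for x, y in rows if x[f] < t]
            right = [y for x, y in rows if x[f] >= t]
            err = _sse(left) + _sse(right)
            if err < best_err:
                best, best_err = (f, t), err
    return best


def grow_tree(rows, rng, max_depth=4, min_size=4, max_features=None):
    """Grow a regression tree; a leaf is a number, a node a 4-tuple."""
    ys = [y for _, y in rows]
    if max_depth == 0 or len(rows) < min_size or len(set(ys)) == 1:
        return _mean(ys)
    n = len(rows[0][0])
    # Consider a random subset of features at each split (all by default).
    features = rng.sample(range(n), max_features or n)
    split = _best_split(rows, features)
    if split is None:
        return _mean(ys)
    f, t = split
    left = [r for r in rows if r[0][f] < t]
    right = [r for r in rows if r[0][f] >= t]
    return (f, t,
            grow_tree(left, rng, max_depth - 1, min_size, max_features),
            grow_tree(right, rng, max_depth - 1, min_size, max_features))


def predict_tree(node, x):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] < t else right
    return node


def fit_forest(rows, n_trees=10, seed=0, **tree_kwargs):
    """Bagging: train each tree on a bootstrap sample of the rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]
        forest.append(grow_tree(sample, rng, **tree_kwargs))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return _mean([predict_tree(tree, x) for tree in forest])


def cross_val_mae(rows, k=5, seed=0, **forest_kwargs):
    """Mean absolute error estimated with k-fold cross-validation."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    errors = []
    for i, test_fold in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        forest = fit_forest(train, seed=seed, **forest_kwargs)
        errors += [abs(predict_forest(forest, x) - y) for x, y in test_fold]
    return _mean(errors)


def make_synthetic_rows(weeks=2, seed=1):
    """Synthetic data for one cell: ((hour, weekday), bike_count)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(weeks):
        for day in range(7):
            for hour in range(24):
                base = 8 if 9 <= hour < 17 and day < 5 else 2
                rows.append(((hour, day), base + rng.choice([-1, 0, 1])))
    return rows
```

A typical use would be `forest = fit_forest(make_synthetic_rows())` followed by `predict_forest(forest, (12, 2))` for a Wednesday at noon. Running `cross_val_mae` with different settings, or against other model types, mirrors the kind of comparison described above, in which the candidate with the lowest cross-validated error is selected.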