The Simply Statistics blog’s “Sunday Statistics Roundups” are always an interesting read. These posts bring together links about statistics in the news, contests, conferences, and data sets. This week’s roundup had a link to a really interesting data set on the bike trip histories from Capital Bikeshare.
As you may imagine, Capital Bikeshare is a bike-sharing program in Washington, DC. Members of this service can go to any bike station (locations here), borrow a bike, and return it to any station they like. Membership plans vary in length from three days to a year.
I imagine Capital Bikeshare uses its data to understand where new stations should be put, which stations need more bike parking spaces, and how often to maintain bicycles. While they aren’t providing full access to their data (which is good, because it would be creepy to be able to track individual people’s locations and trips), there is still a lot to be learned from the data they released.
The data sets are broken up into three-month intervals. The first data were collected over the last quarter of 2010, and data are available through the first three months of 2013. The variables included in each data set are the duration of the trip, the start and end dates of the trip, the start station, the end station, the bike number, and the membership type.
There are all sorts of questions you could try to answer with these data:
- Are some stations more likely to be destinations? Are some stations more likely to be points of departure?
- Are short-term members more likely to return their bike to the station they originated from than long-term members?
- Does trip duration vary by season and membership type?
- Does the distance between the start and end stations vary by season and membership type?
- What station attributes (such as proximity to a Metro station or bike-parking capacity) are associated with greater station utilization?
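As a rough illustration of how the first two questions might be approached, here is a minimal Python sketch. The trips are made up, and the field names (`start`, `end`, `member`) and membership labels are my own assumptions rather than the headers in the actual Capital Bikeshare files. It counts departures and arrivals per station and computes, for each membership type, the share of trips that end where they started.

```python
from collections import Counter

# Made-up toy trips. The field names and membership labels are assumptions
# for illustration, not the headers used in the real Capital Bikeshare files.
trips = [
    {"start": "A", "end": "B", "member": "long_term"},
    {"start": "A", "end": "A", "member": "short_term"},
    {"start": "B", "end": "A", "member": "long_term"},
    {"start": "A", "end": "C", "member": "short_term"},
    {"start": "C", "end": "C", "member": "short_term"},
    {"start": "B", "end": "C", "member": "long_term"},
]

# How often each station is a point of departure or a destination.
departures = Counter(t["start"] for t in trips)
arrivals = Counter(t["end"] for t in trips)


def round_trip_share(records):
    """Fraction of trips, per membership type, that end at the start station."""
    totals = Counter()
    round_trips = Counter()
    for t in records:
        totals[t["member"]] += 1
        if t["start"] == t["end"]:
            round_trips[t["member"]] += 1
    return {member: round_trips[member] / totals[member] for member in totals}


shares = round_trip_share(trips)
print("Departures:", departures.most_common())
print("Arrivals:", arrivals.most_common())
print("Round-trip share by membership type:", shares)
```

With the real files, the same counting would apply once the records are read in (for example with `csv.DictReader`) and the keys are swapped for whatever the actual column headers turn out to be.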
You could also use these data to form a network (which would be really cool to visualize) in which the stations are nodes and the strength of the connection from station A to station B is the number of trips that start at A and end at B. Thinking of this as network data, I would be interested to know:
- Does this network have similar characteristics to other kinds of networks, such as social networks, other transportation networks (like flights between airports), or neural networks?
- What would be the best way to model this network?
- Can we use the given network to predict links to new stations that get introduced to the bikeshare system?
- If you only had information on a subset of the bicycles, would you be able to infer the rest of the network?
- If you were to look at the network formed from a weekend day, how would that vary from a network formed during a weekday?
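To make the network idea concrete, here is a small Python sketch under some assumptions of mine: each trip is reduced to a made-up (start station, end station, start date) tuple, and "weekend" simply means Saturday or Sunday. The edge weight from A to B is the number of trips starting at A and ending at B, as described above, and the weekday and weekend networks are built by filtering the trips before counting.

```python
from collections import Counter
from datetime import date

# Made-up toy trips: (start station, end station, start date).
# The tuple layout is an assumption for illustration only.
trips = [
    ("A", "B", date(2013, 1, 7)),  # Monday
    ("A", "B", date(2013, 1, 8)),  # Tuesday
    ("B", "A", date(2013, 1, 8)),  # Tuesday
    ("A", "C", date(2013, 1, 5)),  # Saturday
    ("C", "A", date(2013, 1, 6)),  # Sunday
    ("A", "B", date(2013, 1, 6)),  # Sunday
]


def is_weekend(day):
    """True for Saturday (weekday() == 5) and Sunday (weekday() == 6)."""
    return day.weekday() >= 5


def edge_weights(records):
    """Directed edge weights: number of trips from each start to each end."""
    return Counter((start, end) for start, end, _ in records)


all_edges = edge_weights(trips)
weekday_edges = edge_weights(t for t in trips if not is_weekend(t[2]))
weekend_edges = edge_weights(t for t in trips if is_weekend(t[2]))

print("All trips:", dict(all_edges))
print("Weekdays:", dict(weekday_edges))
print("Weekends:", dict(weekend_edges))
```

A `Counter` keyed by (start, end) pairs is about the simplest representation of a weighted directed network. For actual modeling or visualization, these weights could be handed to a dedicated graph library.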
The Capital Bikeshare data could be great to use in the classroom. In particular, they could be used to show the difference between a population mean and a sample mean, and to illustrate the variability of a sample mean. Students could generate their own questions and answer them using a representative subset of the data. Though there are only a few variables, there are many, many questions we could attempt to answer!
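Here is one way the population-versus-sample idea might be sketched in Python. The "population" below is synthetic: 10,000 made-up trip durations drawn from an exponential distribution with a mean of about 15 minutes, which is purely my assumption and not a claim about the real data. The sketch repeatedly draws samples of 50 trips and compares the sample means with the population mean.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the classroom demo is repeatable

# Synthetic "population" of trip durations in minutes (made up, not real data).
population = [rng.expovariate(1 / 15) for _ in range(10_000)]
population_mean = statistics.mean(population)
population_sd = statistics.pstdev(population)


def sample_mean(data, n, rng):
    """Mean of a simple random sample of size n, drawn without replacement."""
    return statistics.mean(rng.sample(data, n))


# Draw many samples of 50 trips and record each sample's mean.
sample_means = [sample_mean(population, 50, rng) for _ in range(1_000)]

print("Population mean:", round(population_mean, 2))
print("Average of the sample means:", round(statistics.mean(sample_means), 2))
print("SD of individual durations:", round(population_sd, 2))
print("SD of the sample means:", round(statistics.stdev(sample_means), 2))
```

Each sample mean differs a little from the population mean, but the sample means vary far less than individual trip durations do. Students could swap in the real duration column and try different sample sizes to see how that spread changes.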