The recent opening of Data.gov has led to a number of discussions about data formats, feeds, the kinds of data offered, which agencies are or are not participating, etc. One key aspect is very easily overlooked, but it is really essential: what is being published is actual data. Original, raw, unprocessed, undigested, naked data. Everything else is secondary.
A lot of data comes from tables in reports. It has become something of a custom in recent years for government agencies to publish the tables in their reports as Excel files. I really don’t get the point of doing this, as these are the exact same numbers that can be found in the report anyway. And usually, it’s only a handful of numbers per table. Here is a recent example from a crime report in the UK:
This table, called “Who Are The Burglars?”, appears with others, like “What Do They Steal?” and “How Do They Get In?” These are all relevant questions, of course, but what if I want to draw new connections? Perhaps male burglars steal different things than female ones? Do younger people steal more? Is there a difference in how they get in between the different age groups, sexes, and repeat offenders?
None of that is possible with this data, because it’s not raw. It has been pre-digested and what we get is a neatly arranged report. There’s nothing wrong with a well-prepared report, of course, but providing the source data would make the report a starting point for further exploration, rather than the end product of the analysis.
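To make the difference concrete, here is a small sketch of what raw, per-incident records would allow. The records and field names below are invented for illustration only; they are not taken from the UK report. With one row per incident, any pair of attributes can be cross-tabulated, whereas a published summary table only gives the totals for one attribute at a time.

```python
from collections import Counter

# Invented toy records, for illustration only (NOT data from the UK report).
# Each dict stands for one raw, per-incident row.
incidents = [
    {"sex": "M", "age_group": "under 25", "stolen": "electronics", "entry": "door"},
    {"sex": "M", "age_group": "25 and over", "stolen": "cash", "entry": "window"},
    {"sex": "F", "age_group": "under 25", "stolen": "jewellery", "entry": "door"},
    {"sex": "M", "age_group": "under 25", "stolen": "electronics", "entry": "window"},
    {"sex": "F", "age_group": "25 and over", "stolen": "cash", "entry": "door"},
]


def crosstab(records, row_key, col_key):
    """Count how often each (row_key, col_key) combination occurs."""
    return Counter((r[row_key], r[col_key]) for r in records)


# What a report table typically contains: totals for a single attribute.
by_sex = Counter(r["sex"] for r in incidents)

# What only the raw records allow: new combinations of attributes.
by_sex_and_item = crosstab(incidents, "sex", "stolen")
by_age_and_entry = crosstab(incidents, "age_group", "entry")

for (sex, item), count in sorted(by_sex_and_item.items()):
    print(f"{sex:2} {item:12} {count}")
```

The per-attribute totals in `by_sex` can always be computed from the raw rows, but the combined counts cannot be recovered from the totals alone.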
A good example of how things should be done is the U.S. Census Public Use Microdata Samples (PUMS). The data format is not exactly a joy to work with: it’s fixed width, with 00 meaning something different than 000 or 0, but it’s possible to parse it. The metadata is in PDF files, and not exactly directly accessible. But it’s data. Hundreds of dimensions. Millions of data points. Beautiful, delicious, raw data. Exactly the stuff we need to use our visualization tools on, to explore and find interesting new connections.
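Parsing a fixed-width format like this can be sketched in a few lines. The layout below is made up for illustration: the field names and column positions are hypothetical, not the actual PUMS record layout, which has to be taken from the Census documentation. The one choice that matters is keeping every field as a string, so that codes like 00, 000, and 0 stay distinct instead of all collapsing into the number zero.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    start: int  # 0-based column where the field begins (inclusive)
    end: int    # 0-based column where the field ends (exclusive)


# Hypothetical layout for illustration; NOT the real PUMS record layout.
LAYOUT = [
    Field("serial", 0, 7),
    Field("age", 7, 9),
    Field("sex", 9, 10),
    Field("occupation", 10, 13),
]


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width line into named fields, kept as strings."""
    return {f.name: line[f.start:f.end] for f in layout}


def parse_records(lines, layout=LAYOUT):
    """Parse an iterable of lines, skipping empty ones."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield parse_record(line, layout)


# Two made-up lines in the hypothetical layout above.
sample = [
    "0000001342000\n",
    "0000002071012\n",
]
records = list(parse_records(sample))
print(records[0]["occupation"])  # the code is kept as the string "000"
```

Converting fields to numbers is best left as a separate, per-field step once the code book says which fields are quantities and which are categorical codes.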
Data.gov has inspired people around the world, and particularly in the UK. The UK government is currently trying to figure out how to build a kind of data.gov.uk, and how to make that as useful as possible. This is an interesting process to watch, since there wasn’t much public participation in data.gov’s design. There are many valid questions about data formats, feeds, etc., but the key issue is really the rawness of the data. Everything else can be handled with tools and libraries.
Quite a bit of the current data on data.gov is in shapefiles. The format is controlled by a single vendor (ESRI), although its specification is published, and the geographical component of these files isn’t always that interesting. There are a number of libraries that can open these files, though, so it is quite easy to extract the tabular data and get it into a CSV file or similar. The same is true for some of the XML files, which would make more sense as CSV (and vice versa). Even the missing RSS feed was taken care of by Sunlight Labs.
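As a rough sketch of how little is needed to get at the tabular part: the attribute table of a shapefile lives in a companion .dbf file in dBASE format, which is simple enough to read with the Python standard library alone. The function names here are my own, the encoding is an assumption, and the code ignores field types and memo fields; in practice, one of the existing shapefile libraries is the more robust choice.

```python
import csv
import io
import struct


def read_dbf(data, encoding="latin-1"):
    """Return (field_names, rows) from the bytes of a dBASE-style .dbf file.

    Minimal sketch: all values are returned as stripped strings, and the
    encoding is an assumption that may not hold for every file.
    """
    # Header: record count (uint32) at offset 4, followed by the
    # header length and the record length (uint16 each).
    num_records, header_len, record_len = struct.unpack_from("<IHH", data, 4)

    # Field descriptors are 32 bytes each, start at offset 32, and the
    # list is terminated by a 0x0D byte.
    fields = []
    pos = 32
    while data[pos] != 0x0D:
        raw_name, _ftype, length = struct.unpack_from("<11sc4xB", data, pos)
        name = raw_name.split(b"\x00", 1)[0].decode(encoding)
        fields.append((name, length))
        pos += 32

    rows = []
    for i in range(num_records):
        start = header_len + i * record_len
        record = data[start:start + record_len]
        if record[:1] == b"*":  # first byte is the deletion flag
            continue
        offset = 1
        row = []
        for _name, length in fields:
            row.append(record[offset:offset + length].decode(encoding).strip())
            offset += length
        rows.append(row)
    return [name for name, _length in fields], rows


def dbf_to_csv(data):
    """Convert the bytes of a .dbf file into CSV text with a header row."""
    names, rows = read_dbf(data)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return out.getvalue()
```

Given a shapefile's attribute file (say, a hypothetical `counties.dbf`), reading it in binary mode and passing the bytes to `dbf_to_csv` yields CSV text that any spreadsheet or visualization tool can open.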
Another question that has come up is whether such a site should be an index or a repository: link to the data hosted on a variety of websites, or provide it in a central place, perhaps with a common data model, bulk downloads, and even an API. I believe that data.gov is choosing the right middle ground here. Linking to data is much easier than hosting it, and far easier than trying to get it all into one data collection, a project that is guaranteed to lead to endless discussions about the correct taxonomy, data model, etc.
We all have our own data models and ideas about the data, so any overarching, be-all and end-all data model that data.gov could choose would match nobody else’s. It really makes more sense to leave that to the user. Of course, it would be great to be able to query all the data in one central place. I just don’t think it’s going to happen. And given the choice between the current data.gov and the perfect model in five years, I know what I would pick.
But despite all the flaws, the inconsistencies, and things that could be done but haven’t been, there is one key component that makes it all viable: it’s data. Real, raw, original data. Data we can use and melt and recast and analyze. That’s the kind of data that is worth going to all the trouble for. If we don’t get that kind of data, all the other issues are moot.