Is a data journalist someone who unearths the data, who finds the insights in the data, or who finds the right way to communicate the data visually? The answer is, of course, all three. But let’s tease them apart and look at each separately.
Unearthing the Data
First, the data has to be found. And finding, in the journalism context, doesn’t always mean just scouring the web. There may be sources nobody else has access to. Data may have to be wrangled out of the government’s hands with Freedom of Information Act (FOIA) or similar requests. That data then might need to be cleaned, processed, sometimes even made machine-readable in the first place.
Cleaning data is not easy, and can be incredibly time-consuming and error-prone. It requires good knowledge of data cleaning tools, scripting languages, optical character recognition (OCR), and the common pitfalls of different data formats and types. It can be difficult to verify that cleaning the data has not inadvertently destroyed or skewed it in some way.
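One modest way to guard against that is to compute the same summary before and after cleaning and compare the two. The sketch below is only an illustration under assumptions: the `clean` step, the `region` and `count` column names, and the three-row sample are all made up for this example and do not come from any real dataset.

```python
import csv
import io


def summarize(rows, numeric_field):
    """Return the row count, missing-value count, and total for one numeric column."""
    values = []
    missing = 0
    for row in rows:
        raw = (row.get(numeric_field) or "").strip()
        if raw == "":
            missing += 1
        else:
            # Tolerate thousands separators so raw and cleaned data compare fairly.
            values.append(float(raw.replace(",", "")))
    return {"rows": len(rows), "missing": missing, "total": sum(values)}


def clean(rows, numeric_field):
    """Hypothetical cleaning step: trim whitespace and drop thousands separators."""
    cleaned = []
    for row in rows:
        new_row = {key: (value or "").strip() for key, value in row.items()}
        new_row[numeric_field] = new_row[numeric_field].replace(",", "")
        cleaned.append(new_row)
    return cleaned


# Invented sample data, for illustration only.
RAW_CSV = """region,count
 North ,3
South,"1,200"
East,
"""

raw_rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
before = summarize(raw_rows, "count")
after = summarize(clean(raw_rows, "count"), "count")

# If cleaning dropped rows, lost values, or shifted the total, this fails loudly.
assert before == after, f"cleaning changed the data: {before} != {after}"
```

A check like this will not catch every problem, but it makes silent row loss or a shifted total much harder to miss.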
One example I am familiar with is the work Sarah Cohen and her colleagues did for their Careless Detention series in The Washington Post in 2008. They collected data on the deaths of immigration detainees (using FOIA requests, etc.) and, by plotting what they classified as questionable deaths on a simple map, were able to see regional patterns in the data. The resulting story was based on the people rather than the data, but the data led to the story.
Finding the Insights in the Data
Once the data has been found, it needs to be analyzed to find out what it actually contains. Most of the time, there is nothing interesting in it. But in those cases where a discovery is made, it can make for a great story.
The skills required here are quite different from those needed for digging up the data. The key skill is data analysis: statistics, data exploration, hypothesis testing, and so on. It can also require domain knowledge, such as economics when the story is about unemployment.
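To make "hypothesis testing" a little more concrete, here is a minimal sketch of a permutation test, one of several ways to ask whether a difference between two groups could plausibly be chance. The function name, its parameters, and the numbers are assumptions chosen for illustration; none of this is taken from the stories discussed here.

```python
import random


def permutation_test(group_a, group_b, trials=10_000, seed=0):
    """Estimate how often randomly relabeled groups differ in mean at least as much as observed."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a

    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = 0
    for _ in range(trials):
        # Shuffle the pooled values and split them into two groups of the original sizes.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance so floating-point rounding does not hide exact ties.
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / trials


# Invented numbers, for illustration only.
p_value = permutation_test([12, 15, 11, 14, 13], [18, 21, 19, 22, 20])
print(f"Estimated p-value: {p_value:.4f}")
```

A small estimated p-value suggests the gap is unlikely to come from random grouping alone. It says nothing about why the gap exists, which is where the domain knowledge comes in.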
One of my favorite examples is Hannah Fairfield’s Driving Safety in Fits and Starts in The New York Times two years ago. The data had to be collected, but was publicly available. The key part that made for the story, however, was finding explanations for the patterns that emerged (and also a very compelling visual representation).
Communicating the Insight
Given the insight, getting that across should be easy, right? Well, no. Assuming so is the big mistake many academics make, and communication is where you can see the most impressive work.
Perhaps my favorite example of that is Jonathan Corum’s whale/lunge feeding graphic (discussed in his phenomenal Tapestry talk last year). The data was collected by scientists, and they had already created a chart for a paper. Corum’s insight was to put something back into the graph that the scientists had left out: the depth the whale was diving to. Perhaps this was obvious to the scientists, but certainly not to the readers of The New York Times.
A more recent example was on the behavior of dogs in different settings. The data again came from a paper, which even included almost the exact same chart that was used in the NY Times piece. But only almost. The key differences are what turn a boring bar chart into an interesting, readable one: color, spacing, and some cute drawings. Enough for Kaiser Fung to use it as an argument that visualization can be worth paying for.
However, Greg McInerny argues that the NY Times version loses some important elements of the chart, namely the statistical significance of the differences. He also proposes some alternative designs that retain most of the stylistic improvements, while adding a bit more information.
Either way, the key part here is finding an interesting story like a gemstone, pulling it out of the surrounding material, and making it shine. It doesn’t always have to start from the raw data, though.
What Makes A Data Journalist?
All these examples are pieces of data journalism. Not all involved visualization. Not all entailed digging for data. Not all even required finding the insight yourself.
What data journalism requires, then, is a broad mix of skills and instincts. Not all are necessarily needed in all cases. But you never know which ones a story will require. Many of the technical and math skills are still rather unusual among the people typically working in journalism. That makes this new direction so interesting but also so problematic: how do we know if we can trust the work produced? Alberto Cairo is skeptical, and wants data journalism to up its standards.
But in a way, data journalism is the logical extension of what journalism has been all along: collecting facts and data, understanding the implications, finding the story, and reporting it. The tools and materials are changing. But soon, all journalism will be data journalism in one form or another.
One response to “What is Data Journalism?”
Good article, but it didn’t quite solve the question.
Alex Graul tweeted yesterday (https://twitter.com/alexgraul/status/494112383522971649) that “this is what real data journalism looks like,” referring to this New York Times article (http://atwar.blogs.nytimes.com/2014/07/24/in-ukraine-spent-cartridges-offer-clues-to-violence-fueled-by-soviet-surplus), in which journalists collected and analyzed samples of cartridges used in the fighting in Ukraine.
I greatly appreciate the collection of primary data. Original data journalism can come out of it, and you hint at it in the middle of your blog post. I wonder, though, at which point we enter “data journalism” territory. This sample (fewer than 80 cartridges) seems quite small and inconclusive, as the author repeats several times. The analysis is barely numerical. Is it the size of the data set that makes an article “data journalism”? Or is it the analytical methods applied to the data set?
At the other extreme, there is a call for rigorous methods of data analysis. I’m all for rigor, but we can’t hold data journalism to the standards of scientific research. Scientific journals already exist, and they are read by a tiny audience, require massive resources, and often produce arcane findings. It seems to me that we need to be more lenient with journalism and not expect the highest standards of data and statistical analysis.
In short, a definition of data journalism needs to say what is enough and what is too much to qualify.