What Means Mean

[Interactive visualization: Percentile Bands for Unemployment Data in 380 U.S. Metro Areas]

The visualization above shows the unemployment rate in 380 metro areas in the U.S. from January 2003 to June 2013 (data from the Bureau of Labor Statistics). Each of these is itself an average, but the overall mean is also shown as a heavier line. Mouse over to see individual metro areas highlighted.

As you explore, you will see many small and large patterns that the average, or mean, completely misses. You can see some outliers with very high unemployment, the effect of Hurricane Katrina, seemingly random spikes, etc. (click these links to highlight them in the visualization above, click again to turn the highlight off). That is part of the function of the mean: it averages away small changes. That can be a desired effect, but it is often glossed over when numbers like unemployment rates are reported. Does a small change in the average taken over 200 million people really mean much? Worse yet, does no change mean that nothing happened?

How do you account for the large variation in this data, though? One way is to include a range based on percentiles. The most obvious would be to report the range from smallest to largest value. That does tend to be very sensitive to outliers, however, which may or may not be desirable. Instead, perhaps a narrower range should be reported that covers most of the data, with the extreme values treated separately. But which one?

Percentiles are one of the simplest ideas in statistics: sort the data values, then pick the ones you want depending on their location in that list (as a fraction of the length of the list). The value in the middle is the 50th percentile, also known as the median. The value one quarter of the way into the list is the 25th percentile, etc. Picking the range of values from the 25th to the 75th percentile selects half the data (dropping the bottom and top quarters); this is also called the interquartile range.
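The idea really is this simple. Here is a minimal sketch of the nearest-rank approach described above, using made-up values (the function name and the sample numbers are illustrative, not from the actual unemployment dataset):

```python
def percentile(values, p):
    """Return the p-th percentile (0-100) of values by sorting them and
    picking the value at the corresponding fraction of the list."""
    ordered = sorted(values)
    # Position in the sorted list, as a fraction of its length.
    idx = round((p / 100) * (len(ordered) - 1))
    return ordered[idx]

data = [3.1, 4.7, 5.2, 6.0, 8.9, 12.4, 5.5, 4.2, 7.1]
median = percentile(data, 50)                        # the 50th percentile
iqr = (percentile(data, 25), percentile(data, 75))   # interquartile range
```

Real implementations (e.g., NumPy's `percentile`) interpolate between neighboring values rather than snapping to the nearest rank, but the sort-then-pick idea is the same.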

A common way of looking at data is to drop the top and bottom 5%, which leaves the range from 5% to 95% (clicking these links will change the settings of the visualization above). That removes quite a bit of the range, though. Is 1% to 99% better? How about the interquartile range? Talking about percentiles in the abstract is one thing, but seeing how much data, and how much of the range of values, is ignored is quite another.
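The bands in the visualization can be computed by taking percentiles not over one series, but across all series at each point in time. A rough sketch, assuming each metro area's rates are stored as a list of monthly values (the function, the cutoffs, and the tiny example data are hypothetical):

```python
def bands(series, low=5, high=95):
    """Given a dict mapping series name -> list of monthly values,
    return the (low, high) percentile band for each month, computed
    across all series using nearest-rank percentiles."""
    months = len(next(iter(series.values())))
    lo_band, hi_band = [], []
    for m in range(months):
        # All series' values for this month, sorted.
        column = sorted(s[m] for s in series.values())
        n = len(column)
        lo_band.append(column[round(low / 100 * (n - 1))])
        hi_band.append(column[round(high / 100 * (n - 1))])
    return lo_band, hi_band
```

Plotting `lo_band` and `hi_band` as a filled region under the mean line gives exactly the kind of percentile band shown above, and changing `low` and `high` answers the "but which one?" question interactively.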

Calculating percentiles requires additional data. For unemployment, some is available broken down by metro area, sector, and a number of demographic variables. In other cases, that data is often not easy to find or simply not available. But whenever possible, we need to demand more context than a single number. A simple mean without such context is meaningless.