Anscombe's Quartet
Visualization may not be as precise as statistics, but it provides a unique view onto data that can make it much easier to discover interesting structures than numerical methods. Visualization also provides the context necessary to make better choices and to be more careful when fitting models. Anscombe's Quartet is a case in point, showing that four datasets that have identical statistical properties can indeed be very different.
Arguing for Graphics in 1973
In 1973, Francis J. Anscombe published a paper titled "Graphs in Statistical Analysis." The idea of using graphical methods had been established relatively recently by John Tukey, but there was evidently still a lot of skepticism. Anscombe first lists some notions that textbooks were "indoctrinating" people with, like the idea that "numerical calculations are exact, but graphs are rough."
He then presents a table of numbers. It contains four distinct datasets (hence the name Anscombe's Quartet), each with statistical properties that are essentially identical: the mean of the x values is 9.0, the mean of the y values is 7.5, and all four have nearly identical variances, correlations, and regression lines (to at least two decimal places).
I: x | I: y | II: x | II: y | III: x | III: y | IV: x | IV: y |
---|---|---|---|---|---|---|---|
10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58 |
8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76 |
13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 |
9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84 |
11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47 |
14.0 | 9.96 | 14.0 | 8.10 | 14.0 | 8.84 | 8.0 | 7.04 |
6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25 |
4.0 | 4.26 | 4.0 | 3.10 | 4.0 | 5.39 | 19.0 | 12.50 |
12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56 |
7.0 | 4.82 | 7.0 | 7.26 | 7.0 | 6.42 | 8.0 | 7.91 |
5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89 |
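The claim that the four datasets share their summary statistics is easy to check. The following is a minimal sketch in plain Python, with the values copied from the table; the names `QUARTET` and `summarize` are made up for this illustration and do not come from Anscombe's paper. It computes the means, sample variances, correlation, and least-squares regression line for each dataset.

```python
import math

# Anscombe's four datasets, copied from the table above.
# Datasets I, II, and III share the same x values.
X_SHARED = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
X_FOUR = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]

QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                     7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                      6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                       6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X_FOUR, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                    5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample variances, correlation, and the OLS line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "corr": sxy / math.sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


for name, (xs, ys) in QUARTET.items():
    stats = summarize(xs, ys)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Printed side by side, the four rows of numbers are close to indistinguishable, which is exactly why the table alone gives no hint of how different the datasets are.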
But when plotted, they suddenly appear very different.
While dataset I looks like many well-behaved datasets that have clean and well-fitting linear models, the others are not served nearly as well. Dataset II does not follow a linear relationship; dataset III does, but the linear regression is thrown off by an outlier. It would be easy to fit a correct linear model, if only the outlier were spotted and removed before doing so. Dataset IV, finally, does not fit any kind of linear model, but the single outlier keeps the alarm from going off.
How do you find out which model can be applied? Anscombe's answer is to use graphs: looking at the data immediately reveals a lot of the structure, and makes the analyst aware of "pathological" cases like dataset IV. Computers are not limited to running numerical models, either.
"A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding."
What is an Outlier?
In addition to showing how useful a clear look at the data can be, Anscombe also raises an interesting question: what, exactly, is an outlier? He describes an education study in which he examined per-capita expenditures for public schools in the 50 U.S. states and the District of Columbia. Alaska is a bit of an outlier, so it moves the regression line away from the mainstream. The obvious response would be to remove Alaska from the data before computing the regression. But then, another state will be an outlier. Where do you stop?
Anscombe argues that the correct answer is to show not only the regression with Alaska, but also how much Alaska contributes and what happens when it is removed. The tools here, again, are graphical representations. Not only the actual data need to be shown, but also the distances from the regression line (the residuals) and other statistics that help judge how well the model fits. It seems like an obvious thing to do, but it presumably was not the norm in the 1970s, and I can imagine that it still is not always done.
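To make the idea concrete, here is a rough sketch that uses dataset III from the quartet as a stand-in, since the school-expenditure data is not reproduced in this post. It fits the regression line, computes the residuals, finds the point with the largest absolute residual, and refits without it. The helper names `fit_line` and `residuals` are invented for this example, and picking "the largest residual" is just one simple, assumed way of flagging a candidate outlier, not Anscombe's procedure.

```python
# Dataset III from Anscombe's Quartet (values from the table earlier in the post).
XS = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
YS = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(xs, ys, intercept, slope):
    """Vertical distance of each point from the fitted line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Fit with all points and look at the residuals.
intercept_all, slope_all = fit_line(XS, YS)
res = residuals(XS, YS, intercept_all, slope_all)

# Flag the point with the largest absolute residual (an assumed, simple rule).
worst = max(range(len(res)), key=lambda i: abs(res[i]))

# Refit with that single point left out.
xs_without = XS[:worst] + XS[worst + 1:]
ys_without = YS[:worst] + YS[worst + 1:]
intercept_without, slope_without = fit_line(xs_without, ys_without)

print("flagged point:", (XS[worst], YS[worst]))
print("with all points:    y = %.2f + %.2f x" % (intercept_all, slope_all))
print("without that point: y = %.2f + %.2f x" % (intercept_without, slope_without))
```

For dataset III, the flagged point is (13, 12.74), and leaving it out moves the fitted line from roughly y = 3.00 + 0.50x to roughly y = 4.01 + 0.35x. Reporting both fits, along with the residuals, is the kind of presentation Anscombe is asking for.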
Scientific Paper or Blog Posting?
Besides the content, what is remarkable about the paper is its tone: Anscombe states his opinions and uses some fairly strong language (by today's standards). He calls the floating-point notation computers produce when printing out numbers "abominable," talks about textbooks "indoctrinating" students, and does not back up his claims with a lot of data (e.g., there's no study showing that people cannot infer the structure of the datasets from merely reading the table).
I have seen similar things in papers from that time and earlier. Some of this would be shot down by reviewers today, and never make it into a published paper. It's almost as if these papers were partly scientific paper, partly blog posting to vent some steam or argue a position.
Anscombe also describes his preference for APL (an ancient programming language), but assures the reader who can program in FORTRAN or PL/1 that he'll be able to produce graphs. Mere users of statistical packages, however, were out of luck in the early '70s. The paper closes with a call for action:
Posted by Robert Kosara on February 13, 2011. Filed under criticism.