Anscombe's Quartet

Visualization may not be as precise as statistics, but it provides a unique view onto data that can make it much easier to discover interesting structures than numerical methods. Visualization also provides the context necessary to make better choices and to be more careful when fitting models. Anscombe's Quartet is a case in point, showing that four datasets that have identical statistical properties can indeed be very different.

Arguing for Graphics in 1973

In 1973, Francis J. Anscombe published a paper titled, Graphs in Statistical Analysis. The idea of using graphical methods had been established relatively recently by John Tukey, but there was evidently still a lot of skepticism. Anscombe first lists some notions that textbooks were "indoctrinating" people with, like the idea that "numerical calculations are exact, but graphs are rough."

He then presents a table of numbers. It contains four distinct datasets (hence the name Anscombe's Quartet), each with statistical properties that are essentially identical: the mean of the x values is 9.0, mean of y values is 7.5, they all have nearly identical variances, correlations, and regression lines (to at least two decimal places).

I		II		III		IV
x	y	x	y	x	y	x	y
10.0	8.04	10.0	9.14	10.0	7.46	8.0	6.58
8.0	6.95	8.0	8.14	8.0	6.77	8.0	5.76
13.0	7.58	13.0	8.74	13.0	12.74	8.0	7.71
9.0	8.81	9.0	8.77	9.0	7.11	8.0	8.84
11.0	8.33	11.0	9.26	11.0	7.81	8.0	8.47
14.0	9.96	14.0	8.10	14.0	8.84	8.0	7.04
6.0	7.24	6.0	6.13	6.0	6.08	8.0	5.25
4.0	4.26	4.0	3.10	4.0	5.39	19.0	12.50
12.0	10.84	12.0	9.13	12.0	8.15	8.0	5.56
7.0	4.82	7.0	7.26	7.0	6.42	8.0	7.91
5.0	5.68	5.0	4.74	5.0	5.73	8.0	6.89

But when plotted, they suddenly appear very different.

While dataset I appears like many well-behaved datasets that have clean and well-fitting linear models, the others are not served nearly as well. Dataset II does not have a linear correlation; dataset III does, but the linear regression is thrown off by an outlier. It would be easy to fit a correct linear model, if only the outlier were spotted and removed before doing so. Dataset IV, finally, does not fit any kind of linear model, but the single outlier makes keeps the alarm from going off.

How do you find out which model can be applied? Anscombe's answer is to use graphs: looking at the data immediately reveals a lot of the structure, and makes the analyst aware of "pathological" cases like dataset IV. Computers are not limited to running numerical models, either.

A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.

What is an Outlier?

In addition to showing how useful a clear look onto data can be, Anscombe also raises an interesting question: what, exactly, is an outlier? He describes a study on education, where he studied per-capita expenditures for public schools in the 50 U.S. states and the District of Columbia. Alaska is a bit of an outlier, so it moves the regression line away from the mainstream. The obvious response would be to remove Alaska from the data before computing the regression. But then, another state will be an outlier. Where do you stop?

Anscombe argues that the correct answer is to show both the regression with Alaska, but also how much it contributes and what happens when it is removed. The tool here, again, are graphical representations. Not only the actual data needs to be shown, but also the distances from the regression line (the residuals), and other statistics that help judge how well the model fits. It seems like an obvious thing to do, but presumably was not the norm in the 1970s, and I can imagine that it still not always is.

Scientific Paper or Blog Posting?

Besides the content, what is remarkable about the paper is its tone: Anscombe states his opinions and uses some fairly strong language (by today's standards). He calls the floating-point notation computers produce when printing out numbers "abominable," talks about textbooks "indoctrinating" students, and does not back up his claims with a lot of data (e.g., there's no study showing that people cannot infer the structure of the datasets from merely reading the table).

I have seen similar things in papers from that time and earlier. Some of this would be shot down by reviewers today, and never make it into a published paper. It's almost as if these papers were partly scientific paper, partly blog posting to vent some steam or argue a position.

Anscombe also describes his preference for APL (an ancient programming language), but assures the reader who can program in FORTRAN or PL/1 that he'll be able to produce graphs. Mere users of statistical packages, however, were out of luck in the early '70s. The paper closes with a call for action:

Posted by Robert Kosara on February 13, 2011. Filed under criticism.

Comments

Martin Theus says…

"Anscombe argues that the correct answer is to show both the regression with Alaska, but also how much it contributes and what happens when it is removed"
This is maybe the most important concept one has to understand. Using graphical means to understand data - preferable interactive - will make the optional choice more natural. If you are (only) looking for THE one correct/optimal model, you will probably miss a lot which contributes to the dataset.

Kyle Hailey says…

Good write up on an important example of the power of visualizing of quantitative data graphically. The concept of using graphics was actually a hard sell 10-15 years ago. At the time Anscombe’s quartet was the most important example I had when arguing for graphics in performance analytics and dashboards. Now days the tides are changing with graphical visualizations everywhere. I now have engineers now chastising me when I provide textual data without any graphics! It's a refreshing change. Even though tides are changing for the better, Anscombe’s Quartet will always be a powerful example of the insights that can be gleamed from graphical visualization of quantitative data. Thanks for the write up.
Kyle Hailey

Gurbaksh says…

if we have to find any relationship or correlation between these variable that differentiate all four. which one do you think will be better?

Anscombe's Quartet ​

Arguing for Graphics in 1973 ​

What is an Outlier? ​

Scientific Paper or Blog Posting? ​

Anscombe's Quartet

Arguing for Graphics in 1973

What is an Outlier?

Scientific Paper or Blog Posting?