eagereyes

Visualization and Visual Communication

Robert Kosara / February 13, 2011

Anscombe’s Quartet

Visualization may not be as precise as statistics, but it provides a unique view onto data that can make it much easier to discover interesting structures than numerical methods. Visualization also provides the context necessary to make better choices and to be more careful when fitting models. Anscombe’s Quartet is a case in point, showing that four datasets that have identical statistical properties can indeed be very different.

Arguing for Graphics in 1973

In 1973, Francis J. Anscombe published a paper titled Graphs in Statistical Analysis. The idea of using graphical methods had been established relatively recently by John Tukey, but there was evidently still a lot of skepticism. Anscombe first lists some notions that textbooks were “indoctrinating” people with, like the idea that “numerical calculations are exact, but graphs are rough.”

He then presents a table of numbers. It contains four distinct datasets (hence the name Anscombe’s Quartet), each with statistical properties that are essentially identical: the mean of the x values is 9.0, the mean of the y values is 7.5, and all four have nearly identical variances, correlations, and regression lines (to at least two decimal places).

     I            II            III           IV
   x      y      x      y      x      y      x      y
10.0   8.04   10.0   9.14   10.0   7.46    8.0   6.58
 8.0   6.95    8.0   8.14    8.0   6.77    8.0   5.76
13.0   7.58   13.0   8.74   13.0  12.74    8.0   7.71
 9.0   8.81    9.0   8.77    9.0   7.11    8.0   8.84
11.0   8.33   11.0   9.26   11.0   7.81    8.0   8.47
14.0   9.96   14.0   8.10   14.0   8.84    8.0   7.04
 6.0   7.24    6.0   6.13    6.0   6.08    8.0   5.25
 4.0   4.26    4.0   3.10    4.0   5.39   19.0  12.50
12.0  10.84   12.0   9.13   12.0   8.15    8.0   5.56
 7.0   4.82    7.0   7.26    7.0   6.42    8.0   7.91
 5.0   5.68    5.0   4.74    5.0   5.73    8.0   6.89
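The claim about matching statistics is easy to check. The following is a minimal sketch in plain Python, not anything from Anscombe's paper: the names `QUARTET` and `summarize` are made up for this illustration, and the variance is assumed to be the sample variance (dividing by n - 1).

```python
from math import sqrt

# The four datasets from the table above. Datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample variances, correlation and least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),  # assumption: sample variance
        "var_y": syy / (n - 1),
        "corr": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


for name, (xs, ys) in QUARTET.items():
    stats = summarize(xs, ys)
    print(name, " ".join(f"{key}={value:.2f}" for key, value in stats.items()))
```

Running it prints one line of summary statistics per dataset, which can be compared directly against the values quoted above.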

But when plotted, they suddenly appear very different.
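For readers without a plotting library at hand, even a crude text plot is enough to see the differences. This is only a sketch: `ascii_scatter`, the grid size, and the axis ranges are arbitrary choices made for this illustration, and a real analysis would use a proper charting tool.

```python
# The four datasets from the table above. Datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def ascii_scatter(xs, ys, width=40, height=12, x_range=(3, 20), y_range=(2, 14)):
    """Return a text scatter plot in which each data point is drawn as '*'."""
    grid = [[" "] * width for _ in range(height)]
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    for x, y in zip(xs, ys):
        col = round((x - x_lo) / (x_hi - x_lo) * (width - 1))
        row = round((y - y_lo) / (y_hi - y_lo) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return "\n".join("|" + "".join(line).rstrip() for line in grid)


for name, (xs, ys) in QUARTET.items():
    print(f"Dataset {name}")
    print(ascii_scatter(xs, ys))
    print("+" + "-" * 40)
```

All four plots share the same axis ranges, so their shapes can be compared directly.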

While dataset I appears like many well-behaved datasets that have clean and well-fitting linear models, the others are not served nearly as well. Dataset II does not have a linear correlation; dataset III does, but the linear regression is thrown off by an outlier. It would be easy to fit a correct linear model, if only the outlier were spotted and removed before doing so. Dataset IV, finally, does not fit any kind of linear model, but the single outlier keeps the alarm from going off.
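To see how much the one stray point in dataset III matters, the line can be fitted twice, once with all points and once without the outlier. The sketch below makes an assumption that should be read with care: it simply treats the point with the largest absolute residual as the outlier. That works for this dataset but is not a general rule for finding outliers. The helper name `fit_line` is invented for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Dataset III from the table above.
X3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

intercept, slope = fit_line(X3, Y3)
residuals = [y - (intercept + slope * x) for x, y in zip(X3, Y3)]

# Assumption: the point with the largest absolute residual is "the" outlier.
worst = max(range(len(X3)), key=lambda i: abs(residuals[i]))
x_kept = [x for i, x in enumerate(X3) if i != worst]
y_kept = [y for i, y in enumerate(Y3) if i != worst]
intercept_kept, slope_kept = fit_line(x_kept, y_kept)

print(f"all points:      y = {intercept:.2f} + {slope:.3f} x")
print(f"suspect point:   x = {X3[worst]}, y = {Y3[worst]}")
print(f"without suspect: y = {intercept_kept:.2f} + {slope_kept:.3f} x")
```

The remaining ten points lie almost exactly on a straight line, with a noticeably flatter slope than the fit through all eleven.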

How do you find out which model can be applied? Anscombe’s answer is to use graphs: looking at the data immediately reveals a lot of the structure, and makes the analyst aware of “pathological” cases like dataset IV. Computers are not limited to running numerical models, either.

A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.

What is an Outlier?

In addition to showing how useful a clear look at the data can be, Anscombe also raises an interesting question: what, exactly, is an outlier? He describes a study of per-capita expenditures for public schools in the 50 U.S. states and the District of Columbia. Alaska is a bit of an outlier, so it moves the regression line away from the mainstream. The obvious response would be to remove Alaska from the data before computing the regression. But then, another state will be an outlier. Where do you stop?

Anscombe argues that the correct answer is to show both the regression with Alaska, but also how much it contributes and what happens when it is removed. The tools here, again, are graphical representations. Not only do the actual data need to be shown, but also the distances from the regression line (the residuals) and other statistics that help judge how well the model fits. It seems like an obvious thing to do, but it presumably was not the norm in the 1970s, and I can imagine that it is still not always done.
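The school expenditure data is not reproduced in this post, so the sketch below uses dataset IV from the quartet as a stand-in to illustrate the same idea: report the residuals of the full fit, and also what happens to the regression when each point is left out in turn. The function names are invented for this illustration, and leave-one-out refitting is just one simple way to show a point's contribution.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope), or None if all x values are equal."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # a vertical stack of points has no regression slope
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out(xs, ys):
    """Refit the line once per point, each time with that point left out."""
    return [fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
            for i in range(len(xs))]


# Dataset IV from the quartet, used here in place of the school data.
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
Y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

full = fit_line(X4, Y4)
residuals = [y - (full[0] + full[1] * x) for x, y in zip(X4, Y4)]
loo = leave_one_out(X4, Y4)

for x, y, residual, fit in zip(X4, Y4, residuals, loo):
    slope = "undefined" if fit is None else f"{fit[1]:.3f}"
    print(f"x={x:>2} y={y:5.2f} residual={residual:+.2f} slope without it: {slope}")
```

The point at x = 19 has a residual of essentially zero, because the fitted line passes right through it, so a residual plot alone would not flag it. Leaving it out, however, makes the regression undefined, which shows that this single point determines the slope entirely.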

Scientific Paper or Blog Posting?

Besides the content, what is remarkable about the paper is its tone: Anscombe states his opinions and uses some fairly strong language (by today’s standards). He calls the floating-point notation computers produce when printing out numbers “abominable,” talks about textbooks “indoctrinating” students, and does not back up his claims with a lot of data (e.g., there’s no study showing that people cannot infer the structure of the datasets from merely reading the table).

I have seen similar things in papers from that time and earlier. Some of this would be shot down by reviewers today, and never make it into a published paper. It’s almost as if these papers were partly scientific paper, partly blog posting to let off some steam or argue a position.

Anscombe also describes his preference for APL (an ancient programming language), but assures the reader who can program in FORTRAN or PL/1 that he’ll be able to produce graphs. Mere users of statistical packages, however, were out of luck in the early ’70s. The paper closes with a call to action:

The user is not showered with graphical displays. He can get them only with trouble, cunning and a fighting spirit. It’s time that was changed.

Indeed.


Francis J. Anscombe, Graphs in Statistical Analysis. The American Statistician, vol. 27, no. 1, pp. 17–21, 1973.

Robert Kosara is Data Visualization Developer at Observable. Before that, he was Research Scientist at Tableau Software (2012–2022) and Associate Professor of Computer Science (2005–2012). His research focus is the communication of data using visualization. In addition to blogging, Robert also runs and tweets.

Comments

  1. Martin Theus says

    February 14, 2011 at 12:51 pm

    “Anscombe argues that the correct answer is to show both the regression with Alaska, but also how much it contributes and what happens when it is removed”

    This is maybe the most important concept one has to understand. Using graphical means to understand data – preferably interactive – will make the optional choice more natural. If you are (only) looking for THE one correct/optimal model, you will probably miss a lot which contributes to the dataset.

  2. Kyle Hailey says

    June 7, 2012 at 11:30 am

    Good write up on an important example of the power of visualizing quantitative data graphically. The concept of using graphics was actually a hard sell 10-15 years ago. At the time Anscombe’s quartet was the most important example I had when arguing for graphics in performance analytics and dashboards. Nowadays the tide is changing, with graphical visualizations everywhere. I now have engineers chastising me when I provide textual data without any graphics! It’s a refreshing change.
    Even though the tide is changing for the better, Anscombe’s Quartet will always be a powerful example of the insights that can be gleaned from graphical visualization of quantitative data. Thanks for the write up.

    – Kyle Hailey

  3. Gurbaksh says

    August 20, 2019 at 2:41 am

    If we have to find a relationship or correlation between these variables that differentiates all four, which one do you think would be better?

Copyright © 2006–2022 Robert Kosara · All original materials are available under CC-BY-SA