How to Get Excited About Standard Datasets

It can be hard to get excited about the standard datasets that we keep using to show how visualization and statistics work. But if that's the case for you, it's not the datasets' fault, it's yours! Here’s how to keep that spark going!

Cars

What could be more interesting than cars? I mean, come on – they’re cars! And I’m not talking about boring Priuses or self-driving cars or any of that newfangled stuff. No, these are from the time when cars were still cars: the 1970s and early 80s. That’s what the cars dataset is all about (there are, it turns out, lots of car-related datasets, but there’s only one true cars). Real cars. Manly cars.

So yeah, cars. Like, from the 1970s. Look at them! All those cylinders (whatever those are)! Four and six and even eight cylinders! Crazy! Also weight and mileage and stuff. Who knew they had those in the 70s?

You can learn fascinating things, like that heavier cars have lower mileage – who knew? Or that more cylinders mean lower mileage. I know, somebody should really tell those car makers about that. Even acceleration is correlated with weight. You can’t make this stuff up!

Cars just never get old. I mean, cars. Who doesn’t love cars? Cars, cars, cars…

Iris

If the cars dataset seems a bit dated, surely the iris data will answer your burning questions. Who hasn’t stared at an iris plant and gone crazy trying to decide whether it’s an iris setosa, versicolor, or maybe even virginica? It’s the stuff that keeps you up at night for days at a time.

Luckily, the iris dataset makes that super easy. All you have to do is measure the length and width of your particular iris’s petal and sepal, and you’re ready to rock! What’s that, you still can’t decide because the classes overlap? Well, but at least now you have data!

Actually, it turns out that this data is even older than the cars! It's from a 1936 paper! They sure knew their irises in the 30s. And it's not like plants change all that much in 80 years.

Titanic

Of all the datasets, the Titanic data is clearly the most dramatic. Who isn't obsessed with the disaster that happened over 100 years ago? Who hasn’t seen the movie that came out in 1997, which is, uh, just over 20 years ago now? I mean, who over the age of 40, of course (millennials don’t know anything, as usual)?

Well, the data is fascinating either way. You can see how people in the first class did much better than those in the second and third classes! Fascinating insights that you would never have guessed! And the crew mostly died too. It's almost as if wealth bought you survival. Of course, by now they're all dead so it's not like it matters anymore.

Isn't it amazing how much you can learn from just four variables, though! It doesn't even matter who all those people were, they're just numbers now anyway. They've all turned into data.

Love the Classics

The classic datasets are fine. If they bore you, maybe it’s you who’s boring? If they don’t interest you, maybe you have the wrong interests? Generations of students have learned to love them, and so will you!

Posted by Robert Kosara on March 21, 2018.