eagereyes

Visualization and Visual Communication

MTurk IDs Are Not Anonymous

Robert Kosara / May 5, 2016

The worker IDs Amazon’s Mechanical Turk gives you may look pretty random and anonymous, but they can reveal personally identifiable information. They need to be removed from datasets, especially when those datasets are shared or published.

Like many things, I learned this the hard way. Or I would have, had Steve Haroz not caught it in the data for an upcoming EuroVis paper.

Here’s an example of a worker profile with way too much information about the person’s location and reviews of Amazon purchases. You also often get partial or even full names, and can start guessing their hobbies, etc.

Many studies are run on MTurk these days, because it’s convenient. And sharing the resulting data is clearly the way to go. Just make sure you replace the Worker IDs with some random identifier before doing so.
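As a rough illustration of that replacement step, here is a minimal Python sketch. It assumes the results have already been loaded as a list of dictionaries and that the ID column is called `WorkerId`; both are assumptions, so adjust them to your own data. The function name and the sample IDs are made up for the example.

```python
import secrets


def pseudonymize_worker_ids(rows, id_field="WorkerId"):
    """Return copies of rows with each worker ID replaced by a random token.

    The same worker always gets the same token, so repeated participation
    stays visible, but the token has no connection to the Amazon account.
    `id_field` is an assumed column name; change it to match your data.
    """
    mapping = {}      # original worker ID -> random token (never publish this)
    used = set()      # tokens handed out so far, to rule out collisions
    cleaned = []
    for row in rows:
        worker_id = row[id_field]
        if worker_id not in mapping:
            token = "P" + secrets.token_hex(8)
            while token in used:
                token = "P" + secrets.token_hex(8)
            used.add(token)
            mapping[worker_id] = token
        new_row = dict(row)              # leave the input rows untouched
        new_row[id_field] = mapping[worker_id]
        cleaned.append(new_row)
    return cleaned


# Made-up example data; these are not real worker IDs.
rows = [
    {"WorkerId": "A1EXAMPLE00001", "answer": "3"},
    {"WorkerId": "A1EXAMPLE00002", "answer": "5"},
    {"WorkerId": "A1EXAMPLE00001", "answer": "4"},
]
cleaned_rows = pseudonymize_worker_ids(rows)
```

The tokens are drawn at random rather than derived from the ID (for example by hashing), so nobody can recompute them from a known worker ID. The mapping lives only inside the function and is thrown away when it returns.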

Another pitfall is that once you’ve checked that data into git (and shared it on GitHub), you need to recreate the repository to erase any trace of it. Just deleting or overwriting a file in a repository isn’t enough, because the old version is still in the history. There are ways to rewrite history in git, but you had better know git really well to use them. Nuking the repository and recreating it with the cleaned data is the safer bet.
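One way to avoid getting into that situation is to check files for leftover IDs before committing them. The sketch below is only a heuristic: it assumes worker IDs look like a capital A followed by 11 to 14 uppercase letters or digits, which matches the IDs I have seen but is not a documented format. The function name and sample text are made up.

```python
import re

# Assumption: worker IDs are "A" plus 11 to 14 uppercase letters or digits.
# This is a guess at the format, not something Amazon documents.
WORKER_ID_PATTERN = re.compile(r"\bA[0-9A-Z]{11,14}\b")


def find_possible_worker_ids(text):
    """Return the sorted, de-duplicated strings in text that resemble worker IDs."""
    return sorted(set(WORKER_ID_PATTERN.findall(text)))


# Made-up CSV contents; the first data row still carries an ID-like string.
sample = "id,answer\nA1EXAMPLE00001,3\nP3f9a,5\n"
suspicious = find_possible_worker_ids(sample)
```

An empty result does not prove the file is clean, and a non-empty one may contain false positives, so treat it as a prompt to look more closely rather than as a guarantee.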

Why Amazon ties the worker IDs to people’s accounts is a bit of a mystery to me. I guess they never expected people to start sharing those IDs, since doing studies isn’t exactly their main use case. It’s still odd, since Amazon otherwise tries to keep the workers anonymous as much as possible (you’re not allowed to ask them certain questions, etc.).

Robert Kosara is Senior Research Scientist at Tableau Software, and formerly Associate Professor of Computer Science. His research focus is the communication of data using visualization. In addition to blogging, Robert also runs and tweets.

Comments

  1. Eytan Adar says

    May 5, 2016 at 9:08 am

    Yes… see this for a longer analysis: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2228728

  2. Monika M. Wahi, MPH, CPH says

    May 7, 2016 at 4:59 am

    Thank you for putting us on notice. I am an epidemiologist, but I recently helped a communications colleague pilot a survey (n=60) on opinions of the Republican Party on MTurk, which I had never used.

    We used the feature in SurveyMonkey that creates an anonymous link on the web, and set it not to collect IP addresses, so the survey dataset was indeed de-identified. However, for the workers to get paid on MTurk, each survey needs a unique Survey ID, which the workers then enter into MTurk to get their payment. This is apparently much easier to do in Qualtrics than in SurveyMonkey.

    Qualtrics advice: http://brentcurdy.net/qualtrics-tutorials/link/

    SurveyMonkey advice: http://nicholasnicoletti.com/blog/2015/06/survey-monkey-and-mechanical-turk-the-verification-code/

    I had a bear of a time in SurveyMonkey, but finally figured it out with the advice above. We were limited to sending 20 per batch, and so we did 3 batches for n=60.

    I now realize from this post that I could have re-identified the SurveyMonkey answers by crosslinking the Survey ID in the SurveyMonkey data with info on Amazon Turk about who completed which survey. For this reason, I think if you do it this way using SurveyMonkey, you can say to the IRB that you promise to not try to re-identify the people by going back to your MTurk account and cross-linking the Survey IDs.

    I just want to add: I had never used MTurk for surveys, and was wary of it for two reasons: 1) a biased but uncharacterized convenience sample, and 2) no way to calculate a response rate (no denominator) so hard to quantify selection bias. The colleague and I found (given our topic) that a preponderance of the n=60 were Democrats who were lower income, and this will have a profound effect on the results of surveys on most topics, I believe. I strongly advise against using MTurk for anything but exploratory research that is taken with a grain of salt (as my colleague was doing).

  3. Chris Fuccione says

    May 10, 2016 at 8:20 pm

    Do a search for your worker/requester ID on Google. It will be an eye opener.

Copyright © 2006–2019 Robert Kosara · All original materials are available under CC-BY-SA