The Simple Way to Scrape an HTML Table: Google Docs

Robert Kosara / November 15, 2009


Raw data is the best data, but a lot of public data can still only be found in tables rather than as directly machine-readable files. One example is the FDIC’s List of Failed Banks. Here is a simple trick to scrape such data from a website: Use Google Docs.

The table on that page is even relatively nice because it includes some JavaScript to sort it. But a large table with close to 200 entries is still not exactly the best way to analyze that data.

After some digging around (and even considering writing my own throw-away extraction script), I remembered having read something about Google Docs being able to import tables from websites. And indeed, it has a very useful function called ImportHtml that will scrape a table from a page.

To extract a table, create a new spreadsheet and enter the following expression in the top left cell:

=ImportHtml(URL, "table", num)

URL here is the URL of the page (between quotation marks), “table” is the element to look for (Google Docs can also import lists), and num is the number of the element, in case there are more on the same page (which is rather common for tables). The latter supposedly starts at 1, but I had to use 0 to get it to pick up the correct table on the FDIC page.
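
Putting it together for the FDIC list mentioned above, the formula would look something like this (note the quotation marks around the URL and the index of 0):

=ImportHtml("http://www.fdic.gov/bank/individual/failed/banklist.html", "table", 0)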

Once this is done, Google Docs retrieves the data and inserts it into the spreadsheet, including the headers. The last step is to download the spreadsheet as a CSV file.

This is very simple and quick, and a much better idea than writing a custom script. Of course, the real solution would be to offer all data as a CSV file in addition to the table to begin with. But until that happens, we will need tools like this to get the data into a format that is actually useful.

Filed Under: Data

Robert Kosara is Senior Research Scientist at Tableau Software, and formerly Associate Professor of Computer Science. His research focus is the communication of data using visualization.


Comments

  1. Jon Peltier says

    November 15, 2009 at 10:44 am

    It took under a minute to set up a web query in Excel to extract the data. Data menu > Import External Data > New Web Query. Enter the URL, then select the table you want imported. Whenever you want, you can click on the imported table. The External Data toolbar pops up, and you can click on the icon with the exclamation point to update the query. It’s a simple matter to save the sheet with the data as a CSV.

    Reply
    • Kevin Stirtz says

      August 13, 2015 at 5:00 pm

      Hi Jon – Just curious if Excel can scrape multiple URLs this way. Thanks

      Reply
  2. Enrico Poli says

    November 15, 2009 at 11:15 am

     

    This is a very nice trick. For more complex data extraction needs, I’ve been playing with Open Dapper (http://www.dapper.net/open/) and I found it quite powerful: it’s able to extract data from almost any page showing any kind of regularity – not just tables – and export it at least as CSV or RSS (and you can access the RSS from a URL which stays live, updating the feed as the original page updates, etc.)

    (I’ve searched the site for “Dapper” and the search engine returns no results, so I thought it could be useful to point it to you – thanks for this blog and for your work!)

    Reply
  3. Tony Hirst says

    November 15, 2009 at 12:06 pm

    Hi

    Some time ago I posted a walkthrough of how to scrape data from Wikipedia with Google docs, and then annotate the CSV output of the scraped data with geodata in Yahoo Pipes:

    Data Scraping Wikipedia with Google Spreadsheets

    Reply
  4. Robert Kosara says

    November 15, 2009 at 1:59 pm

    Excel for the Mac doesn’t have that function, though. 

    Reply
  5. Robert Kosara says

    November 15, 2009 at 2:02 pm

    I haven’t seen that one before, I’ll have to try it out.

    Reply
  6. Robert Kosara says

    November 15, 2009 at 2:04 pm

    That’s probably where I saw this. I actually did this a while ago and couldn’t find the page where I had seen the ImportHtml trick when I wrote the posting. I’ll add a link.

    Reply
  7. Tim says

    November 15, 2009 at 4:59 pm

    I would just use Yahoo Pipes for this. Very easy (with a small investment of time to learn), and very flexible too. Great tool for scraping.

    Reply
  8. derek says

    November 17, 2009 at 7:51 am

    I would be very grateful for a reliable way to scrape PDF tables. 

    Reply
    • ftr says

      September 17, 2012 at 3:03 pm

      I OCR them with FineReader.

      Reply
    • alexgerdom says

      April 27, 2015 at 5:07 am

      Tabula ( http://tabula.Technology/ ) may work for you. I haven’t had too much time to play around with it yet, but it shows some promise. The PDF would need to be OCR’d first, and they say it still struggles with headers. But if you have a really dense table that needs scraping, I’m sure there are worse things out there.

      Reply
  9. Bill Dedman says

    November 22, 2009 at 11:32 am

    Why not copy and paste this table into Excel? The rule: if it’s a table, just paste it.

    Reply
  10. Robert Kosara says

    November 23, 2009 at 1:31 pm

    The good thing about HTML is that tables are very clearly structured in the markup. That’s not the case in PDF, where it’s just stuff that happens to line up and maybe lines that are drawn in-between. Best option is probably to copy&paste from the PDF into Excel or another spreadsheet app.

    Reply
  11. D Kelly O'Day says

    December 14, 2009 at 9:25 pm

    Let’s combine your climate data and web scraping posts.

    Here’s an example where I scraped climate science data for a visualization of the IR absorption properties of 5 greenhouse gases.

    I wanted to scrape the spectrum data from 5 NIST Chemistry Webbook data web pages and generate this chart automatically.

    Jon Peltier is right about Excel’s external data capabilities; however, Jon would need a VBA procedure to retrieve the data and reproduce my 5 panel chart.

    Bill Dedman’s suggestion about using Excel’s cut and paste approach would be quite time consuming, and Bill would have a challenge generating the 5 panel chart.

    Here’s a link to my post; it includes a link to my R script, available on Google Docs.

    http://chartsgraphs.wordpress.com/2009/12/07/understanding-the-science-of-co2%E2%80%99s-role-in-climate-change-3-%E2%80%93-how-green-house-gases-trap-heat/

    Reply
  12. Jay says

    December 17, 2009 at 8:06 pm

    I’d use something like HTML::TableContentParser or HTML::TableExtract, plus a cron job if I needed to keep it up to date.

    Reply
  13. David G. says

    April 30, 2012 at 3:44 pm

    Awesome feature, great tutorial!

    Thanks

    —
    androidgeeky.weebly.com

    Reply
  14. ftr says

    June 3, 2012 at 7:57 am

    For Firefox users there is the Table2Clipboard extension that works well here, at https://addons.mozilla.org/fr/firefox/addon/dafizilla-table2clipboard/

    And there is TableTools2, too, at http://datatables.net/forums/discussion/3726/tabletools-2-released/p1

    Reply
  15. kristineh says

    June 21, 2012 at 2:18 am

    This is a great website, but you have some syntax errors:

    =ImportHtml(URL, "table", num)

    should be:
    =ImportHtml(URL; "table"; num)

    Reply
  16. Christopher says

    August 24, 2012 at 2:02 pm

    I tried the method you described, I keep getting a parse error. Any idea what I am doing wrong?

    Reply
  17. JD Markland says

    September 17, 2012 at 9:31 am

    The table to clipboard method is by far the easiest. Thanks for the great posts!

    Reply
  18. TechView says

    January 28, 2013 at 10:45 am

    You could have used MS Excel’s “Data –> From Web” option to fetch tables from web sites. It’s easier and workable, and it can also be updated automatically every time the spreadsheet opens. Cheers…

    Reply
  19. Mark Bullock says

    February 22, 2013 at 11:00 am

    @TechView – this doesn’t work for Office for Mac 2011.
    The work-around is to construct a query and then “Get External Data”, “Run Saved query”.

    Reply
  20. Kevin says

    March 25, 2013 at 5:14 pm

    Hi, good post, how do you find the table number in this example? tks

    Reply
  21. Kevin says

    March 25, 2013 at 5:25 pm

    @ Jon Peltier –> How do I do this in Excel 10? If I use this http://www.fdic.gov/bank/individual/failed/banklist.html
    as an example, it does not work – it just imports the whole page and not just the table the way Google Docs does.
    So Google Docs is a lot better from what I have seen…

    Reply
  22. Kevin Lyda (@lyda) says

    July 27, 2013 at 5:11 am

    Just curious, why all the defensive posts from Excel users?

    Reply
  23. wguteamsmith says

    December 4, 2013 at 7:10 am

    I’m a newbie with complex ideas and no programming skills. I need to extract table data from an internal web portal page. I can’t find the table id. I’m in Firefox and viewed the frame source. All I can find is a table class. Thoughts? Ideas? If I can make this work, I’ve just saved myself hours per week.

    Reply
  24. arthur huppe says

    January 13, 2014 at 9:58 am

    Awesome. I bought a list of directories and I would like to import it to my google doc account. This article helped me. Thank you so much.

    Reply
  25. Alexandru Cobuz says

    June 22, 2014 at 12:47 pm

    I didn’t even know that Excel was capable of this. Great tip!
    Ha, even a beginner could scrape some data with this.

    Reply
  26. Jason Pittman says

    August 7, 2014 at 11:15 pm

    Another good way to import data into a spreadsheet is from JSON data. If you use json-csv.com you can upload text or enter a URL and a spreadsheet will be produced.

    Reply
  27. Alex says

    September 3, 2014 at 5:49 am

    Data menu > Import External Data > New Web Query

    Just found this today… I was doing it by hand for years… thanks man!

    Reply
  28. tbanwell says

    September 4, 2014 at 7:05 am

    This was a great help for me; using it to pull cattle price data. Using ‘Table 0’ was the tip that saved me. Thanks!

    Reply
  29. Bob says

    October 17, 2014 at 4:49 am

    Minor error – the code needs quotes around the URL, like this:
    =ImportHtml("http://example.com/", "table", 0)

    Reply
  30. Tom says

    October 21, 2014 at 2:57 am

    Thanks, definitely useful – I didn’t know Google Docs had this feature.
    However, I approached this issue differently, making my own Add-In and UDF functions so that there would be no need to write VBA, as in the Google example:

    http://blog.tkacprow.pl/excel-tools/excel-scrape-html-add/

    Getting an href is easy:
    =GetElementByRegex("[URL]";"href=""(.*?)"")

    A more elaborate example like yours above, getting a table’s contents in Excel, is also not that complicated and just requires nesting the UDFs:
    =GetRegex(
    GetRegex(
    GetElementByRegex("[URL]";"([^""]*?)";0);
    "(?:(.|\n)*?)";0);
    "(?:(.|\n)*?)";0)

    gets the first cell contents of the first row and the first table. Changing the "0" to higher numbers will get the next rows, cells, etc., so the formulas can be dragged like other Excel formulas :).

    Hope this will interest you!

    Reply
  31. Grant Kaye says

    January 15, 2015 at 12:15 pm

    Hi Robert, thanks for this brief tutorial. I’m wondering if your technique can be applied to automatically gather weather sensor data once a day at a specific time and feed the data into a Google Doc spreadsheet? For example: http://weather.uwyo.edu/cgi-bin/wyowx.fcgi?TYPE=SFLIST&DATE=CURRENT&HOUR=CURRENT&UNITS=A&&STATION=PACV

    Reply
    • Robert Kosara says

      January 15, 2015 at 7:34 pm

      No, because the table on that page is just raw text, not HTML. I don’t know if something like import.io would work, but worth a try. You’d need to extract the text table and then run that through something that can parse tab-separated data, like Google Refine or Data Wrangler.

      Reply
  32. Juan J. Pérez says

    January 16, 2015 at 8:55 am

    I was wondering if there is a way to scrape only certain rows?

    Reply
    • Juan J. Pérez says

      January 16, 2015 at 9:03 am

      Also, can I select a column only when a cell matches a criterion? For example, I’m scraping data from the fire rescue website; on that table there are some rows I don’t want on my page, and I only want calls from a certain unit number. http://www.orangecountyfl.net/EmergencySafety/FireRescueActiveCalls.aspx#.VLlBL8bnjKA

      Reply
      • Robert Kosara says

        January 16, 2015 at 9:14 am

        Not on import, but once the data is imported, you can sort, filter, etc. in Google Docs. Click a column header to highlight it, then pick Data->Filter from the menu. When you then click on the first row in that column, you get a dropdown that lets you filter, etc.
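
        If you prefer a formula, Google Sheets also has a FILTER function that can do this on the sheet where the data lands. A hypothetical example, assuming the imported calls sit in columns A through G and the unit number is in column E (adjust the ranges and the unit value to your actual layout):

        =FILTER(A2:G, E2:E = "E30")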

        Reply
  33. Craig says

    July 8, 2015 at 10:54 am

    The page I want to scrape has a submit button on it. Is there a way to get the form to submit and use this method? The default form information is all I need from the table.

    Reply
  34. Nate says

    July 22, 2015 at 7:31 am

    Robert,
    This is a great resource, thank you. I’ve been trying to grab the table data from this website: http://som.yale.edu/faculty-research/our-centers-initiatives/international-center-finance/data/stock-market-confidence-indices/united-states-valuation-index

    The Google Sheets function
    =importHTML("http://www.spindices.com/indices/real-estate/sp-case-shiller-20-city-composite-home-price-index","table",0)
    returns the #N/A “imported content is empty” error, as does changing the index to 1, or 2, etc.

    I thought I might be able to use the importXML function, and the HTML from the site I’m attempting to retrieve data from shows:

    Valuation Index – Institutional
    Date        Index Value    Standard Error
    Oct 1989    80.58          3.36

    And so on.

    I included the start of the data I want from the first row for Oct 1989. Since the table doesn’t appear to have an id, I’ve tried "//table/", "//td", "//td/", "//div/" and can’t get the function to return anything, so it appears I’m missing some fundamental concept. This can’t be that hard. Do you have any suggestions?

    Reply
  35. SGIII says

    September 19, 2015 at 4:41 pm

    Just wanted to chime in to say how helpful this is, Robert.

    Reply
  36. aestuehler says

    October 30, 2015 at 6:27 am

    I have read all the posts and tried a variety of approaches, but I am not able to extract the data I am looking for with these methods. The website appears to not have a table or list behind the data, so that may be the problem – I am not sure. I am looking to extract all of the camps listed in the New England states (CT, ME, MA, NH, RI, VT) from this site: find.acacamps.org. Using the Excel approach returns no data (after I select the table created after the search). Google Sheets just returns the text on the screen from either a table (0 – 4) or list.

    Any other suggestions please?

    Reply
  37. Bob G Grieger says

    January 24, 2016 at 3:28 pm

    I captured the first table OK. How do I automatically capture the next 50 pagination sheets?

    Thanks for this article.

    Reply
  38. Josh says

    February 27, 2016 at 7:31 am

    I think I’m having the same issue as Bob above. I’m pulling stats from here: http://www.cbssports.com/collegebasketball/stats/teamsort/NCAAB/DEFENSIVE/regularseason/yearly

    The default is for the page to load the first 50ish rows and then you have to select an option at the bottom to turn to the next page or view all records (neither of which are at a different URL). Any way to use this or other tools to work around this issue?

    Thanks!

    Reply
    • Holden says

      November 16, 2016 at 4:39 am

      It is a manual process. Get the url of the 50-100 result set and use it in the next cell or sheet of the G sheets.

      I too would be interested if there was a simpler method.

      Reply
  39. SK says

    March 26, 2016 at 1:09 am

    Brilliant one ! Thank you :)

    I improvised a little on it. My challenge was to have information extracted from multiple URLs present on one URL.

    So with the help of data miner I got all the URLs extracted into one Google spreadsheet; then, using ImportHtml, instead of a URL I gave the cell address where the URLs are, to fetch the data.
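
    For example (assuming the extracted URLs start in cell A2), the formula for each row would be something like:

    =ImportHtml(A2, "table", 0)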

    This has worked well for me.

    The only challenge is that the email field comes in as protected, whereas I can see the email address on the webpage. This is still not resolved. Any views?

    Reply
  40. glennhenshaw says

    September 20, 2016 at 7:31 pm

    This is good to know.

    You can also use Python’s pandas library:

    1) highlight and copy the table
    2) import pandas as pd
    3) table = pd.read_clipboard()  # parses the copied table into a DataFrame

    Reply
  41. astapramuditya says

    October 31, 2017 at 7:44 am

    Thanks!

    Reply
  42. Peter says

    November 4, 2017 at 2:56 pm

    Hi,

    Do you know of a method (maybe using a script) to load the content of a website when it is loaded dynamically (Ajax)?

    Thanks,
    Peter

    Reply
  43. Alexandro Trese says

    December 15, 2017 at 10:42 pm

    Just saved me hours of time, thanks!

    Reply
  44. Jennifer Jennifer says

    January 14, 2018 at 10:26 am

    You mean Google Sheets. Google Docs is a word processor, like MS Word.

    Reply
  45. Jérémy says

    April 18, 2018 at 4:45 am

    That’s what I was looking for! So simple and so helpful. Thank you!

    Reply
