The Simple Way to Scrape an HTML Table: Google Docs

Google Docs used for scraping web tables

Raw data is the best data, but a lot of public data can still only be found in tables rather than as directly machine-readable files. One example is the FDIC's List of Failed Banks. Here is a simple trick to scrape such data from a website: Use Google Docs.

The table on that page is even relatively nice because it includes some JavaScript to sort it. But a large table with close to 200 entries is still not exactly the best way to analyze that data.

I first tried dabbledb for this task, and it worked in principle. The only problem was that it only extracted 17 rows for some reason. I have no idea what the issue was, but I didn't want to invest the time to figure it out.

After some digging around and even considering writing my own throw-away extraction script, I remembered having read something about Google Docs being able to import tables from websites. And indeed, it has a very useful function called ImportHtml that will scrape a table from a page.

To extract a table, create a new spreadsheet and enter the following expression in the top left cell: =ImportHtml(URL, "table", num). URL here is the URL of the page (between quotation marks), "table" is the element to look for (Google Docs can also import lists), and num is the number of the element, in case there are more on the same page (which is rather common for tables). The latter supposedly starts at 1, but I had to use 0 to get it to pick up the correct table on the FDIC page.

Once this is done, Google Docs retrieves the data and inserts it into the spreadsheet, including the headers. The last step is to download the spreadsheet as a CSV file.

This is very simple and quick, and a much better idea than writing a custom script. Of course, the real solution would be to offer all data as a CSV file in addition to the table to begin with. But until that happens, we will need tools like this to get the data into a format that is actually useful.

---

Comments

Excel's pretty easy too

Anonymous's picture

It took under a minute to set up a web query in Excel to extract the data. Data menu > Import External Data > New Web Query. Enter the URL, then select the table you want imported. Whenever you want, you can click on the imported table. The External Data toolbar pops up, and you can click on the icon with the exclamation point to update the query. It's a simple matter to save the sheet with the data as a CSV.

Not on the Mac

Robert Kosara's picture

Excel for the Mac doesn't have that function, though. 

  This is a very nice trick.

Anonymous's picture

 

This is a very nice trick. For more complex data extraction needs, I’ve been playing with Open Dapper (http://www.dapper.net/open/) and I found it quite powerful: it’s able to extract data from almost any page showing any kind of regularity – not just tables – and export it at least as csv or rss (and you can access the rss from an url which stays live, updating the feed as the original page update etc.)

(I’ve searched the site for “Dapper” and the search engine returns no results, so I thought it could be useful to point it to you – thanks for this blog and for your work!)

Interesting

Robert Kosara's picture

I haven't seen that one before, I'll have to try it out.

Worked example of using Google docs as a table scraper

Anonymous's picture

Hi

Some time ago I posted a walkthrough of how to scrape data from Wikipedia with Google docs, and then annotate the CSV output of the scraped data with geodata in Yahoo Pipes:

Data Scraping Wikipedia with Google Spreadsheets

That page looks familiar

Robert Kosara's picture

That's probably where I saw this. I actually did this a while ago and couldn't find the page where I had seen the ImportHtml trick when I wrote the posting. I'll add a link.

I would just use yahoo pipes

Anonymous's picture

I would just use yahoo pipes for this. Very easy (with a small investment of time to learn), and very flexible too. Great tool for scraping.

PDF

Anonymous's picture

I would be very grateful for a reliable way to scrape PDF tables. 

PDF is much less structured

Robert Kosara's picture

The good thing about HTML is that tables are very clearly structured in the markup. That's not the case in PDF, where it's just stuff that happens to line up and maybe lines that are drawn in-between. Best option is probably to copy&paste from the PDF into Excel or another spreadsheet app.

Why not Excel

Anonymous's picture

Why not copy and paste this table into Excel. The rule: If it's a table, just paste it.

Scrapping Climate Data

Anonymous's picture

Let's combine your climate data and web scrapping posts.

Here's an example where I scrapped climate science data for a visualization of the IR absorption properties of 5 greenhouse gases.

 

I wanted to scrap the spectrum data from 5 NIST Chemistry Webbook data web pages and generate this chart automatically.

Jon Peltier is right about Excel's external data capabilities, however, Jon would need a VBA procedure to retreive the data and reproduce my 5 panel chart.

Bill Dedman's suggestion about using Excel's cut and paste approach would be quite time consuming and Bill would have a challenge generating the 5 panel chart.

Here's a link to my post, it includes a link to my R script, available on Google docs.

http://chartsgraphs.wordpress.com/2009/12/07/understanding-the-science-o...

 

 

I would have gone with the script

Anonymous's picture

using something like HTML::TableContentParser or HTML::TableExtract and a cron job if i needed to keep it up to date.