
Tom Meagher



Not another blinking PDF!

There are few things more frustrating than trying to get a simple spreadsheet from a government agency and being told by a spokesman that the data (say, a table of property tax rates in municipalities that was clearly created in Excel) only exists in a locked PDF format.

For the sake of expediency, you sometimes just have to bite the bullet and try to wrestle the data free from the PDF’s clutches so that you can gently guide it into a more useful spreadsheet.

It’s not always easy, but here are a few resources to do just that:

Dan Nguyen at ProPublica crafted this very helpful and comprehensive guide to the various strategies for unlocking data from PDFs.

I’ve also had some success with CometDocs, a free site that converted documents with surprisingly high accuracy on one particularly labor-intensive project.

If you’re not afraid of installing a simple command line program, I’ve also had some luck with PDF2text. Here’s a nice tutorial from IRE (looks like the link may be temporarily broken), as well as a guide to automating the conversions so you aren’t bothered by the pesky command line.
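For a rough idea of what that kind of automation can look like, here is a minimal Python sketch. It is not taken from the guides mentioned above. It assumes the command line program in question is the `pdftotext` utility that ships with Xpdf and Poppler, and that it is already installed and on your PATH. The function names are hypothetical helpers made up for illustration. The approach is to convert each PDF with the `-layout` option, which tries to keep table columns lined up, and then split each line wherever two or more spaces appear in a row.

```python
import csv
import re
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path):
    """Return the pdftotext command line for one file.

    The -layout option asks pdftotext to preserve the physical layout
    of the page, so table columns stay separated by runs of spaces.
    """
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(folder):
    """Convert every PDF in `folder` to a .txt file next to it.

    Assumes the pdftotext program is installed; raises an error if a
    conversion fails.
    """
    converted = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        subprocess.run(build_command(pdf_path, txt_path), check=True)
        converted.append(txt_path)
    return converted


def split_columns(line):
    """Split one line of layout text into cells on runs of 2+ spaces."""
    return re.split(r"\s{2,}", line.strip())


def text_to_rows(text):
    """Turn layout text into a list of rows, skipping blank lines."""
    return [split_columns(line) for line in text.splitlines() if line.strip()]


def write_csv(rows, csv_path):
    """Write the rows to a CSV file that a spreadsheet program can open."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
```

To use it, you might call `convert_folder("pdfs")` on a folder of PDFs, read each resulting text file, and pass its contents through `text_to_rows` and `write_csv`. Splitting on double spaces is only a heuristic: it works for tidy tables, but cells that contain wide gaps or columns that sit close together will still need cleanup by hand.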

I’ve heard good things about the commercial software DeskUnPDF, but I haven’t had an opportunity to use it myself.

When you don’t have the time or patience to negotiate with an agency to get the data in the format you want, give these solutions a try.

If you have other techniques for culling data from PDFs, please share them.

Good luck!