Way off topic: extract table data from PDF

bep · July 6, 2015, 7:34am

Yes, this is me abusing this forum with off topic questions, but you are so helpful!

A customer of mine for the last 15 years asked me if I could dig up a piece of software that could take his personal bank statements in PDF (Norwegian banks only hand out the very recent transaction history in CSV; the archive is in PDF) and convert them to CSV or Excel.

He had already spent some time looking himself, and the software out there that works is either very expensive or an online service (which is no good for confidential data).

So I thought I'd take a stab at it myself, and with my Hugo-driven Go skills I thought a pdftocsv application would be useful to many.

So I took this little library from Go’s Russ Cox:

And this little test case file:

And I managed to extract the plain text outside the tables (and other plain text PDFs), but the rest is gibberish. There are obvious parts of the PDF format, encodings, etc. that I do not understand – and will have to read up on. But I'm throwing it out here to see if anyone has any relevant experience and hints in this area.

I have looked at http://tabula.technology/ which seems to do what I want, but it is a monster web app to do a small simple task.

dplesca · July 6, 2015, 8:03am

Just a shot in the dark, but have you thought of maybe using pdf.js to render the pdfs in html and maybe try to parse them as you’d do an html file?

lotrfan · July 7, 2015, 1:16am

You might try pdftotext, bundled with the Poppler PDF library

I’ve found that the -layout option can give decent results, which might be machine-parseable:

pdftotext -layout input.pdf output.txt

If you want to write something in pure Go (or any other language), you’ll probably have to learn a fair bit of the PDF spec. Some random links that may or may not be helpful:

http://www.planetpdf.com/developer/article.asp?ContentID=navigating_the_internal_struct
http://stackoverflow.com/questions/88582/structure-of-a-pdf-file
http://superuser.com/questions/256997/browse-internal-pdf-structure
and the spec itself: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf. In my experience with the spec, it’s pretty annoying to navigate unless you’re already partly (mostly) familiar with the structure of PDFs :(. (Ironically, it’s a PDF document.)

It’ll also be helpful to have a graphical viewer of the structure. I used iText RUPS when I dealt with PDFs a year or two ago. There are some others mentioned in the third link above.

bep · July 7, 2015, 6:00am

Thanks @lotrfan – that info will be very useful. I have tried pdftotext with both -layout and -html, and it loses too much of the tabular info to be useful in this case.

Weston_McNamee · July 17, 2015, 5:54am

@bep - Simply calling out that this thread is way off topic and that you are abusing the forum doesn’t make it right to do so. I would highly recommend using StackOverflow for this type of thing. There are many more people there who would be able to help.

In the spirit of helping out: here’s an article about a PowerShell cmdlet that outputs the contents of a PDF as plain text. From there, you can continue to use PowerShell to sanitize the text and convert it to CSV, or whatever else you want, as you now have plain text to work with.
http://www.beefycode.com/post/convertfrom-pdf-cmdlet.aspx
