Way off topic: extract table data from PDF

Yes, this is me abusing this forum with off topic questions, but you are so helpful!

A customer of mine for the last 15 years asked me if I could dig up a piece of software that could take his personal bank statements in PDF (Norwegian banks only hand out the very recent transaction history in CSV; the archive is in PDF) and convert them to CSV or Excel.

He had already done some looking himself, and the software out there that works is either very expensive or an online service (which is no good for confidential data).*

So I thought I'd take a stab at it myself, and with my Hugo-driven Go skills I thought a pdftocsv application would be useful to many.

So I took this little library from Go’s Russ Cox:

And this little test case file:

And I managed to extract the plain text outside the tables (and from other plain-text PDFs), but the rest is gibberish. There are obviously parts of the PDF format, encodings, etc. that I do not understand and will have to read up on. But I'm throwing it out here to see if anyone has any relevant experience and hints in this area.

  • I have looked at http://tabula.technology/ which seems to do what I want, but it is a monster web app to do a small simple task.

Just a shot in the dark, but have you thought of maybe using pdf.js to render the PDFs in HTML and then parsing them as you would an HTML file?

You might try pdftotext, bundled with the Poppler PDF library.

I’ve found that the -layout option can give decent results, which might be machine-parseable:

pdftotext -layout input.pdf output.txt
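If the -layout output keeps the columns visually apart, turning it into CSV can be fairly mechanical. Below is a minimal Go sketch, not a tested converter: it assumes columns are separated by runs of two or more spaces, that no cell is empty, and that a semicolon is an acceptable delimiter (amounts may contain commas). The names `splitColumns` and `columnGap` are made up for this example.

```go
package main

import (
	"bufio"
	"encoding/csv"
	"log"
	"os"
	"regexp"
	"strings"
)

// columnGap matches a run of two or more spaces. The assumption here is
// that layout-preserved text separates table columns this way, while
// words inside a single cell are separated by one space only.
var columnGap = regexp.MustCompile(` {2,}`)

// splitColumns turns one line of layout-preserved text into fields.
// It returns nil for blank lines (including lines holding only a
// page-break form feed).
func splitColumns(line string) []string {
	line = strings.TrimSpace(line)
	if line == "" {
		return nil
	}
	return columnGap.Split(line, -1)
}

func main() {
	w := csv.NewWriter(os.Stdout)
	w.Comma = ';' // assumption: amounts may contain commas

	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		fields := splitColumns(sc.Text())
		if fields == nil {
			continue
		}
		if err := w.Write(fields); err != nil {
			log.Fatal(err)
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}

	w.Flush()
	if err := w.Error(); err != nil {
		log.Fatal(err)
	}
}
```

Saved as main.go, it could be run as `go run main.go < output.txt > output.csv`. An empty cell in the middle of a row would shift the later fields to the left, so this only helps when the table is dense; anything more robust would need the column positions themselves.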

If you want to write something in pure Go (or any other language), you’ll probably have to learn a fair bit of the PDF spec. Some random links that may or may not be helpful:

It’ll also be helpful to have a graphical viewer of the structure. I used iText RUPS when I dealt with PDFs a year or two ago. There are some others mentioned on the third link above.

Thanks @lotrfan, that info will be very useful. I have tried pdftotext with both -layout and -html, and it loses too much of the tabular info to be useful in this case.

@bep - Simply calling out that this thread is way off topic and that you are abusing the forum doesn’t make it right to do so. I would highly recommend using StackOverflow for this type of thing. There are many more people there who would be able to help.

In the spirit of helping out: here’s an article about a PowerShell cmdlet that outputs the contents of a PDF as plain text. From there, you can continue to use PowerShell to sanitize and convert to CSV, or whatever else you want, as you now have plain text to work with.