This is the title of a great talk at the last chaos communication congress in berlin (27c3).
When writing my own pdf parser for a homework assignment that I put way too much ambition into, I encountered all of what is mentioned in that talk and was also able to realize how bad the situation really is. I just deleted three paragraphs of this post where I started to rant about how frickin bad the pdf format is. How unworkable it is and how literally impossible to perfectly implement. But instead, just watch the video of the talk and make sure to remember that it is even worse than Julia Wolf is able to make clear in 1h she was given.
So after I stopped myself from spreading another wave of my pdf hate over the internets lets look at the issue at hand:
This is the snippet I uncompressed from the pdf to (just by chance) find the number I was looking for. The 000000 piece was actually containing the number I needed.
6 0 obj
/DA (/Arial 14 Tf 0 g)
/Rect [ 243.249 176.784 382.489 210.297 ]
/BG [ 0.75 0.75 0.75 ]
/P 4 0 R
/N 7 0 R
So let me say: WTF? My bank not only requires me to resort to one specific pdf implementation (namely the acrobat reader by adobe) but also requires me to pay to a US based company first to have an operating system that reader software works on? Or am I really supposed to go through the raw pdf source by hand?? Bleh...
Also, dont ask for my code - it's super dirty and unreadable. Instead look at the mupdf project. It supplies a renderer which is massively superior to poppler in terms of speed (even suitable for embedded devices) and comes with a program called pdfclean which does the same thing my program did so that I was able to get the number I needed.