Many of the accounting systems supported by us require some kind of bills import. Nowadays, soft copy bills have become the norm, and represents an additional opportunity for automation, eliminating much of the tedious manual keying of the past. Unfortunately, that soft copy often comes in the form of PDF files. And sometimes, the PDF writer has bugs.
The first step to reading a PDF file is to look for the xref
table.
We can look for the following tokens to pick up the byte offset of the xref
table:
startxref 186843 %%EOF
Unfortunately, every bill had the same flaw: the offset to the xref
is
4 bytes before the actual table, which is right smack in the middle of the
endobj
before the xref
:
44 0 obj << /Count 1 /First 7 0 R /Last 7 0 R >> endobj 2 0 obj <</Type/Catalog/Pages 10 0 R/Outlines 44 0 R/PageMode/UseOutlines>> endobj xref 0 117 0000000000 65535 f 0000186008 00000 n
The middle of endobj
gives rise to a valid token obj
, which
confuses the situation thoroughly for the user, to say the least: Invalid Token?!?
We relaxed the PDF import parser to ignore leading invalid tokens, and the system began chugging through the backlog.