Introduction

Many of the accounting systems we support require some kind of bills import. Nowadays, soft copy bills have become the norm, and they represent an additional opportunity for automation, eliminating much of the tedious manual keying of the past. Unfortunately, those soft copies often come in the form of PDF files. And sometimes, the PDF writer has bugs.

Inside the PDF File

The first step in reading a PDF file is to locate the xref (cross-reference) table. Its byte offset is recorded near the end of the file, between the startxref keyword and the %%EOF marker:

startxref
186843
%%EOF
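As a minimal sketch of this lookup, assuming the whole file is already in memory as bytes, the trailer can be read as shown here. The function name read_startxref and the 1024-byte tail size are illustrative choices, not part of our importer.

```python
import re


def read_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset declared after the last 'startxref' keyword."""
    tail = data[-tail_size:]
    # Take the last occurrence: a file may contain more than one trailer.
    pos = tail.rfind(b"startxref")
    if pos < 0:
        raise ValueError("startxref not found near end of file")
    match = re.match(rb"startxref\s+(\d+)\s+%%EOF", tail[pos:])
    if match is None:
        raise ValueError("malformed startxref trailer")
    return int(match.group(1))


# Hypothetical file ending, mirroring the trailer shown above.
sample = b"...rest of file...\nstartxref\n186843\n%%EOF\n"
print(read_startxref(sample))
```

The returned number is only what the PDF writer claims. Nothing here checks that an xref table actually starts at that position.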

Unfortunately, every bill had the same flaw: the declared offset pointed 4 bytes before the actual xref table, landing right smack in the middle of the endobj keyword that precedes it:

44 0 obj
<< /Count 1 /First 7 0 R /Last 7 0 R >>
endobj
2 0 obj
<</Type/Catalog/Pages 10 0 R/Outlines 44 0 R/PageMode/UseOutlines>>
endobj
xref
0 117
0000000000 65535 f
0000186008 00000 n

Starting in the middle of endobj, the parser reads obj, which is a valid token in its own right but not one that belongs where an xref table is expected. The result is thoroughly confusing for the user, to say the least: Invalid Token?!?
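The off-by-4 effect can be reproduced on a small buffer. This is only an illustration: the buffer is made up, token_at is a hypothetical helper, and it assumes a single line-feed byte between endobj and xref, so that 4 bytes back lands on the "o" of "obj".

```python
def token_at(data: bytes, offset: int) -> bytes:
    """Return the whitespace-delimited token that starts at offset."""
    parts = data[offset:].split(None, 1)  # split on ASCII whitespace
    return parts[0] if parts else b""


# Made-up fragment shaped like the end of the bill shown above.
body = b"2 0 obj\n<</Type/Catalog>>\nendobj\nxref\n0 117\n"

actual = body.index(b"xref")  # where the table really starts
declared = actual - 4         # what the buggy writer recorded

print(token_at(body, actual))    # the keyword a parser expects
print(token_at(body, declared))  # the stray token it actually reads
```

With CRLF line endings the same 4-byte error would land on different bytes, so the stray token depends on how the writer terminates lines.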

Quick Solution

We relaxed the PDF import parser to skip any invalid tokens that precede the xref table, and the system began chugging through the backlog.
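One way such a relaxation could look is sketched here. This is not our production parser: find_xref and the 64-byte search window are assumptions made for the example, and it only handles classic xref tables, not the cross-reference streams introduced in PDF 1.5.

```python
import re


def find_xref(data: bytes, declared: int, max_skip: int = 64) -> int:
    """Return the offset of the 'xref' keyword at or shortly after declared."""
    window = data[declared:declared + max_skip]
    # Walk whitespace-delimited tokens, ignoring any that precede 'xref'.
    for match in re.finditer(rb"\S+", window):
        if match.group() == b"xref":
            return declared + match.start()
    raise ValueError("no xref keyword near the declared offset")


# Made-up fragment with the declared offset 4 bytes too early.
body = b"2 0 obj\n<</Type/Catalog>>\nendobj\nxref\n0 117\n"
actual = body.index(b"xref")
print(find_xref(body, actual - 4) == actual)
```

Keeping the window small is a deliberate choice: it tolerates a slightly wrong offset without letting the parser wander off and accept an unrelated xref keyword elsewhere in the file.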

Last updated on 5 Oct 2023
