The PDF file format is a wonderful thing, except when it isn’t. Today, I explain the discovery, origin, and resolution of a recent “bug” in PDFTextStream, and provide a gentle introduction to how text is encoded in PDF files along the way.
Character encoding schemes are among the hardest computer science topics to understand, both in concept and in implementation. An encoding is the bit of code that connects an otherwise arbitrary number (say, 107) to a character (a lower-case ‘k’, in the case of ASCII). It all seems so simple, right? Here’s a stream of integers that you might find in a binary file somewhere:
[104, 101, 108, 108, 111, 32, 116, 104, 101, 114, 101]
and you just have a lookup table somewhere that connects each integer value to its corresponding character:
['h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e']
Ah, if only it were that simple. Well, sometimes it is, say 90-95% of the time, but it’s that last 5-10% that makes the world interesting, isn’t it?
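For what it's worth, the simple case really is only a few lines. Here's a minimal Python sketch of that lookup, assuming plain ASCII; the names codes and ascii_table are mine, purely for illustration:

```python
# The stream of integers from the example above.
codes = [104, 101, 108, 108, 111, 32, 116, 104, 101, 114, 101]

# A lookup table for the 128 ASCII codes. Python's built-in chr() already
# knows this mapping, so the table is spelled out only to make it visible.
ascii_table = {code: chr(code) for code in range(128)}

# Connect each integer to its corresponding character.
chars = [ascii_table[code] for code in codes]
text = "".join(chars)

print(chars)
print(text)
```

The trouble starts when the table you are handed is not the one the file's author had in mind, which is more or less the story of the rest of this post.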
P.S.: That’s all I’m going to say about character encoding in the general sense — the rest of this post is related to PDF-specific stuff. If you’d like to read a good introduction to character encoding, and Unicode in particular, check out this Joel on Software post that gives a 5,000-foot-level view of how Unicode works.
We hate bugs. They really irritate us, and I’m no exception. So, when we stumbled across a particular PDF that PDFTextStream appeared to not handle properly, we were suitably…irritated.
On its face, the bug seems relatively simple. There’s a particular PDF file that is written in Icelandic. Here’s a screenshot of the PDF as shown by a PDF viewer:
See the differences? In the PDF file, there are a number of eth characters that look like this: ð. (Icelandic has a couple of characters that few other languages use; this is one of them.) Now look at the text that PDFTextStream extracted: all of those eth characters are gone, replaced by right-angle brackets (‘>’)!
You can see why we might be irritated.
We pride ourselves on PDFTextStream producing very accurate output of text extracted from PDF files, so we take a problem like this very seriously. We looked at all of the output, checked out the PDF file, looked at its internals, and concluded that PDFTextStream had a bug that was affecting Icelandic characters specifically (which I wrote about here). After all, this was the first time we’d run across this particular issue, and it didn’t seem to occur in connection with any of our other international/Unicode test PDFs (including some in Russian, French, Spanish, etc.).
Little did we know that PDFTextStream’s behaviour in this case was not only correct, but that the “problem” was actually caused by the PDF file being malformed in a very particular way.
PDF Text Encoding Primer: You’ll need a seat for this…
To understand why this is happening, you’ll need to know a little bit about how text encodings work in PDF files. By no means is this information complete; if you’re really interested, there’s a 1172-page bedtime story (the PDF v1.5 specification) that explains it all (or as much as the good folks at Adobe could remember).
To get us started, here’s a pictorial depiction of how text is represented in a PDF document:
Conceptually, it’s relatively simple. All text in a PDF file is stored as a sequence of character codes. In addition, every PDF file contains a character encoding for every font that it uses. The character encoding is essentially a dictionary: it links every character code used in the PDF file to a corresponding glyph code. Those glyph codes are then passed on to a font program, which is a set of specialized routines that know how to draw glyphs. (A glyph is a particular manifestation of a character or symbol; how the letter ‘a’ is drawn on your screen is one glyph, for example.) So a PDF viewer starts with the character codes actually contained in the PDF file, uses the character encoding to look up the matching glyph codes, and then asks the font program for the drawing instructions that go with those glyph codes. With those in hand, it can draw the text on a computer screen or send it to a printer.
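To make that pipeline concrete, here's a toy Python model of it. Everything in it is hypothetical: the character codes, the glyph codes, and the strings standing in for drawing routines are invented for illustration, and a real font program is far more involved than a dictionary.

```python
# Toy model only: the codes and "drawing instructions" below are made up.

# Character codes as they might be stored for a run of text.
character_codes = [1, 2, 3, 3, 4]

# The font's character encoding: character code -> glyph code.
character_encoding = {1: 104, 2: 101, 3: 108, 4: 111}

# Stand-in for the font program: glyph code -> drawing instructions.
font_program = {
    104: "draw the glyph for h",
    101: "draw the glyph for e",
    108: "draw the glyph for l",
    111: "draw the glyph for o",
}


def render(codes, encoding, font):
    """Follow each character code to its glyph code, then to its drawing instructions."""
    return [font[encoding[code]] for code in codes]


instructions = render(character_codes, character_encoding, font_program)
for step in instructions:
    print(step)
```

Note that the viewer never needs to know which character a glyph code "means"; it only needs the encoding and the font program to agree with each other.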
Where things get complicated is in the translation between character codes and glyph codes. In almost every case, the glyph codes specified by character encodings correspond to Unicode character IDs. Such IDs are very standard, and PDFTextStream (or any other library that might attempt to read text out of a PDF) can readily use the stream of those glyph codes as the effective text content of a PDF document. However, very rarely (this Icelandic PDF is the first PDF document we’ve come across that has this peculiarity), those glyph codes don’t correspond to Unicode character IDs, which leads to the wrong characters being output as the text content of the PDF document.
Confused? Don’t worry, it’s not the simplest of things to grok. Simply put, in the case of the Icelandic PDF, the in-force font program associates the glyph code for a right-angle bracket with the eth glyph, and the character encoding in the PDF file reflects that.
If that doesn’t sound right, it isn’t. Technically, a glyph code provided by a character encoding should select the glyph in the font program for the Unicode character that the glyph code itself represents. So, PDFTextStream outputs the wrong character because the PDF file is malformed (to match the malformed font program).
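Here's a small Python sketch of that mismatch, under the simplified model described above. It is not PDFTextStream's actual code, and the character codes and dictionaries are invented; the only real values are the Unicode code points (62 for ‘>’, 240 for ð). The example word is the Icelandic "með".

```python
ETH = "\u00f0"   # ð, Unicode code point 240
GT_CODE = 62     # Unicode code point of '>'

# Hypothetical character codes for the word "með".
character_codes = [1, 2, 3]

# Malformed setup: the encoding sends character code 3 to the glyph code
# for '>', and the font program draws an eth for that glyph code.
malformed_encoding = {1: ord("m"), 2: ord("e"), 3: GT_CODE}
malformed_font = {ord("m"): "m", ord("e"): "e", GT_CODE: ETH}


def what_viewer_draws(codes, encoding, font):
    """A viewer follows the encoding into the font program and draws what it finds."""
    return "".join(font[encoding[code]] for code in codes)


def what_extractor_reads(codes, encoding):
    """A text extractor treats each glyph code as a Unicode code point."""
    return "".join(chr(encoding[code]) for code in codes)


drawn = what_viewer_draws(character_codes, malformed_encoding, malformed_font)
extracted = what_extractor_reads(character_codes, malformed_encoding)
print(drawn)
print(extracted)
```

The viewer and the extractor are both behaving consistently here. They disagree only because the file and its font share a private convention that nothing outside the font program can see.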
So, after we figured this out, as a last check of our theory, we turned to Google. We originally stumbled upon the Icelandic PDF file on the Internet, so a few searches later, we managed to get it to show up in the results of a Google search. Google has this nifty ‘View as HTML’ link next to most of its PDF search results; clicking on that link brought up Google’s extract of the text of the PDF. Here’s a screenshot of that HTML view:
Ah, and there are those right-angle brackets again! So, even Google’s text extraction utility falls prey to the encoding problems in the semi-faulty PDF file, and so will every other PDF text extraction library. Once we saw this, we decided to put the issue to bed.
The bottom line is that, because there is no notion of a ‘valid’ PDF file, there will always be some vanishingly small percentage of PDFs that don’t follow the PDF file specification, or even widely held conventions. Unfortunately, that means that extracting text out of PDF documents will always be an imperfect art.