In my last experiment with text extraction, I implemented text extraction via callback. Never mind that for this experiment. But also: don’t try this at home, kids; I am a trained professional.
I was recently asked if we could extract the text from an example document in a way that more closely matched the original layout of the document.
In general, this is not a great idea. PDF gives you a two-dimensional canvas on which to place text glyphs at whatever size or angle you choose. Plain text, in contrast, is really a one-dimensional format, even though in a specialized application known as a ‘text editor’, it can be viewed in a two-dimensional rendering. Back in the day, you could also pipe plaintext to a printer port and have it rendered on a dot-matrix printer. But that use case is long obsolete.
However, while not a great idea, it is a somewhat interesting technical challenge. WordFinder extracts words along with position quadrilaterals and style transitions. In theory, that information could be used to approximate the original text layout, if that layout isn’t too fancy.
Let’s try this
You basically need to divide the page into a grid, where each cell of the grid can hold exactly one text character. The question is how big the cell should be. If the grid has too few cells because the cell size is too big, then text will end up overlapping. If the grid has too many cells because the cell size is too small, the text will be too spread out, and you will lose the resemblance to the original layout.
I experimented with using the smallest font size as the basis for my cell size. But then I switched to using the most common font size (as determined by the code below).
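Picking the most common font size can be sketched roughly as follows. This is a minimal illustration in Python, not the code from my experiment: it assumes the extracted words have already been reduced to hypothetical `(text, font_size)` pairs, and it weights each size by character count, which is my own assumption about what "most common" should mean.

```python
from collections import Counter

def most_common_font_size(words):
    """Return the font size that covers the most characters.

    `words` is a list of hypothetical (text, font_size) pairs; weighting
    by character count is an assumption made for this sketch.
    """
    counts = Counter()
    for text, size in words:
        # Round so that 9.96pt and 10.0pt land in the same bucket.
        counts[round(size, 1)] += len(text)
    return counts.most_common(1)[0][0]

words = [("Title", 14.0), ("body", 10.0), ("text", 10.0),
         ("here", 10.0), ("note", 9.0)]
body_size = most_common_font_size(words)
```

Here the 10pt body text outweighs the single 14pt title word, so `body_size` comes out as 10.0.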
However, an implicit assumption in this text layout scenario is that the plain text will be displayed using a non-proportional font, where each character has the same width. Most text in a PDF is going to use proportional fonts, so it will take up less space on the page than the non-proportional rendering. As a compromise, I use the most common font size for my vertical dimension and the next smaller size for my horizontal dimension.
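As a rough sketch of that compromise (again in Python, with hypothetical names, and assuming font sizes in points on a page measured in points), the cell and grid dimensions could be derived like this:

```python
import math

def cell_size(font_sizes, most_common):
    """Return (cell_width, cell_height) in points.

    Height is the most common font size; width is the next smaller size
    found on the page, falling back to the same size if there is none.
    """
    smaller = [s for s in set(font_sizes) if s < most_common]
    cell_width = max(smaller) if smaller else most_common
    return cell_width, most_common

def grid_dimensions(page_width, page_height, cell_width, cell_height):
    """Return (columns, rows) needed to cover the whole page."""
    return (math.ceil(page_width / cell_width),
            math.ceil(page_height / cell_height))

cell_w, cell_h = cell_size([14.0, 10.0, 10.0, 9.0, 3.0], 10.0)
cols, rows = grid_dimensions(612.0, 792.0, cell_w, cell_h)  # US Letter
```

With 10pt as the most common size and 9pt as the next smaller one, a US Letter page works out to a grid of 68 columns by 80 rows.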
Once we have our grid dimensioned, it is time to start dividing our words into lines of text. We are basically treating the grid as a sparse matrix of rows. We insert each word into its calculated row, keeping each row sorted horizontally.
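The row assignment might look something like the sketch below. It assumes each word has been reduced to a hypothetical `(text, x, y)` triple, where `x` and `y` are the left edge and some vertical reference point of the word's quad in PDF coordinates (origin at the bottom left). Which corner of the quad to use is an assumption here, not something dictated by WordFinder.

```python
import bisect
from collections import defaultdict

def place_words(words, page_height, cell_height):
    """Group words into grid rows, each row sorted by x position.

    Returns a dict mapping row index (0 = top of page) to a sorted list
    of (x, text) pairs. Rows with no words are simply absent, which is
    what makes this a sparse matrix.
    """
    rows = defaultdict(list)
    for text, x, y in words:
        # PDF y grows upward; flip it so row 0 is the top of the page.
        row = int((page_height - y) // cell_height)
        # insort keeps the row ordered left to right as words arrive.
        bisect.insort(rows[row], (x, text))
    return rows

words = [("world", 100.0, 700.0), ("Hello", 72.0, 700.0),
         ("Next", 72.0, 688.0)]
rows = place_words(words, page_height=792.0, cell_height=10.0)
```

Even though "world" arrives before "Hello", row 9 ends up holding them in reading order, and "Next" lands one row down in row 10.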
The payoff for the sorting is that we can then lay out each line fairly easily. We calculate the grid position of each word in the line compared to the current line position. We insert the required number of spaces if there is a gap. Then we insert the word (minus any soft hyphen), and repeat until we are done. Easy peasy.
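A minimal sketch of that per-line loop, assuming each row is a list of hypothetical `(x, text)` pairs already sorted by `x`. The single separating space added when a word would otherwise collide with the previous one is my own addition for this sketch, not part of the description above.

```python
SOFT_HYPHEN = "\u00ad"

def layout_line(row_words, cell_width):
    """Render one grid row as a string of monospaced text."""
    line = ""
    for x, text in row_words:
        column = int(x // cell_width)   # grid column where the word starts
        gap = column - len(line)        # distance from the current position
        if gap > 0:
            line += " " * gap
        elif line and not line.endswith(" "):
            # Assumption: keep overlapping words apart with one space.
            line += " "
        line += text.replace(SOFT_HYPHEN, "")
    return line

row = [(72.0, "Hello"), (126.0, "wor\u00adld")]
text_line = layout_line(row, cell_width=9.0)
```

With a 9pt cell width, "Hello" starts at column 8 and "world" at column 14, so the line gets eight leading spaces and a one-space gap between the words. When a word is wider in monospace than it was in the PDF, the next word simply gets pushed to the right, which is where the bleeding between columns comes from.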
Evaluating the output
So let’s look at what we’ve got:
There are a few things to note here:
The title phrase on line 14 has horizontal gaps not seen in the original. The original title used 14pt text, and we essentially placed it in the text output as 9pt text.
There are a number of vertical gaps (lines 21, 27, and 33) between lines of text, not present in the original file. This is basically an artifact of the text not matching up to the grid we’ve calculated. We could possibly test the coordinates of the lines above or below to see if there really should be a gap between those lines or not. But if we take them out, we are effectively shortening the height of the page.
The original text was justified on both sides, sometimes achieving that with hyphens. Our little routine re-joins the hyphenated words like self-ref-erential, in-cluded, and mul-ti-paragraph. This makes the text output ragged on the right and left, as there is now a gap where the second half of the word was. This could be fixed up further, possibly by adding the same Word index to two lines if it has two sets of Quads that match up to separate rows.
For the second half of the page, we now see a second column emerge which wasn’t visible in the top half because it consisted of an image (of another obsolete piece of equipment from the prior millennium). The gap between the two columns is smaller than in the original document. That could indicate that the factor I chose to account for proportional-to-non-proportional expansion was too small. There is also more bleeding of the first column into the second column due to rejoined hyphenated words.
The reason why this sort of layout is a folly is seen on lines 49-51. In the original document, these lines from the first column do not line up with the lines of text of the second column, but are offset from each other. Into the same grid, they must go.
Speaking of follies, the second page has a nice one:
In the original, figure 2 was an image of a computer keyboard (sorry Alexa, not yet obsolete). Someone had apparently run some sort of OCR process on the image, because the key, button, and light labels came through as (3pt, hidden) text. You can, uhm, recognize the keyboard layout from the pure text rendering. But it does show the very real limits of trying to use plaintext to represent a two-dimensional layout.
Any questions about this experiment? Comment below or contact us.