It was my privilege to represent Datalogics at the PDF Hackathon that our partner company Callas Software organized in Berlin on April 11 and 12. This was Callas’s second Hackathon; the first focused on Callas’s new and impressive pdfChip technology, but this iteration was broadened in scope to any PDF-related topic.
With most of the attendees being Callas customers, the topics they wanted to hack tended towards pre-press issues with Callas tools. But a small group of topics caught my fancy: an issue with email-to-PDF conversion causing images to be split up, a request for how to mask an image with a vector path, and an interesting request from a gentleman from a Finnish newspaper. He wanted to extract an article from one or more PDFs for a reprint service, given an XML file that describes where the components of the article are on the page. All other ancillary material had already been destroyed, as archiving the InDesign files generated on a daily basis would overwhelm their resources.
My initial thought was to wonder whether InDesign might be including Article and Bead information (a PDF v1.1 feature; section 8.3.2 of the v1.7 PDF Reference) in the document when it generated the PDF, and whether that could be used to extract the relevant article. Alas, a small bit of sleuthing revealed that this little-used PDF feature is seemingly not used by the product most likely to populate that information in a PDF.
Turning to the example XML file and examining the elements and attributes it contained, we convinced ourselves, because it would be much easier for us if it were true, that its article coordinates were likely to be desktop publishing points (1/72 in.). That lasted until I noticed that the xgeometry coordinate in our sample XML file was beyond the right edge of our sample PDF. Oops. Our newspaper man then turned to InDesign to determine the position of the article on the page in points, and I turned to the XML coordinates to work out how to convert them to match those point-based coordinates.
Once I had figured out those formulas (yay for algebra!), we fed them into the JavaScript program for pdfChip that Olaf put together, which created a new PDF page that essentially clipped away the rest of the page contents to display only the article.
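The actual formulas depend on the newspaper's XML format and are not reproduced here. As a rough sketch of the kind of algebra involved, assuming the XML units relate to points by a simple scale and offset along each axis, two known reference pairs (an XML value and the matching position in points read from InDesign) are enough to solve for both. The function name `fitLinear` and all of the numbers below are made up for illustration.

```javascript
// Hypothetical helper: solve pt = scale * xml + offset from two reference
// pairs (xmlA -> ptA, xmlB -> ptB). Assumes the relationship is linear.
function fitLinear(xmlA, ptA, xmlB, ptB) {
  if (xmlA === xmlB) {
    throw new Error("Reference values must differ to solve for a scale");
  }
  const scale = (ptB - ptA) / (xmlB - xmlA);
  const offset = ptA - scale * xmlA;
  // Return a converter that maps any XML coordinate to points.
  return (xml) => scale * xml + offset;
}

// Illustrative values only, not taken from the real sample files.
const xmlToPointsX = fitLinear(100, 60, 300, 160);
console.log(xmlToPointsX(200));
```

A separate converter would be fitted for the vertical axis, since the scale and origin need not match the horizontal one.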
Other team members worked on a JavaScript routine to calculate the minimum-size page/bounding box if the article had more than one element on the page. Sadly, this code did not make it into the final program: our one sample PDF/XML file set had only one element, so it wasn’t required for what we could demonstrate.
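The team's routine is not shown here, but the idea can be sketched as a union of rectangles. This is a minimal version, assuming each element has already been converted to points and is described by a lower-left corner plus a width and height; the function name `unionBoxes` and the field names `x`, `y`, `width`, and `height` are assumptions, not the names used in the hackathon code.

```javascript
// Hypothetical sketch: smallest rectangle enclosing every element box.
// Each box is { x, y, width, height } in points, with (x, y) the lower-left corner.
function unionBoxes(boxes) {
  if (boxes.length === 0) {
    throw new Error("At least one box is required");
  }
  let left = Infinity;
  let bottom = Infinity;
  let right = -Infinity;
  let top = -Infinity;
  for (const b of boxes) {
    left = Math.min(left, b.x);
    bottom = Math.min(bottom, b.y);
    right = Math.max(right, b.x + b.width);
    top = Math.max(top, b.y + b.height);
  }
  return { x: left, y: bottom, width: right - left, height: top - bottom };
}

// Illustrative values only: a headline box and a body-text box.
const articleBox = unionBoxes([
  { x: 10, y: 20, width: 100, height: 50 },
  { x: 60, y: 10, width: 100, height: 30 },
]);
console.log(articleBox);
```

With a single element, the result is simply that element's own box, which is why the sample file set did not need the routine.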
During a lull, I took a short break to put together a quick DLE program demonstrating how to mask an image with a vector path, which I’ll discuss in a follow-up article.