Sample of the Week:
Every so often I’ll see the same kind of question pop up across multiple forums that I monitor. I can tell that they’re not all coming from the same person or company; it’s more like something has just entered the zeitgeist. Such was the case a couple of weeks ago. For some reason, I saw several questions asking how to search a PDF file for a set of terms, highlight them, and then extract the pages that contain any one of the terms into a separate document. This is actually pretty easy using the Datalogics PDF Java Toolkit with the PMMService class.
The PMMService class supports several manipulations of a PDF document. You can easily insert, delete, merge, overlay, and extract pages from a PDF file, and it handles all of the complex assembly operations and page tree balancing through the PMMOptions class.
To solve the problem of collecting pages that contain search terms into a new document, we need to start by locating the words in question. Then we create the highlight annotations and add those pages to the list of pages to be extracted.
After generating the appearances for the highlight annotations, we can then extract the pages, creating a new document in the process.
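The page-selection part of those steps can be sketched without the Toolkit at all. The snippet below is only an illustration under stated assumptions: it treats each page as an already-extracted plain string, matches terms case-insensitively, and uses hypothetical names (PageSelector, pagesContainingAnyTerm) that are not part of the Datalogics API. Locating the words and creating the highlight annotations would use the Toolkit's own services.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Toolkit-independent sketch of the page-selection step.
 * Assumption: page text has already been extracted into plain strings.
 * PageSelector and pagesContainingAnyTerm are hypothetical names, not Toolkit API.
 */
final class PageSelector {

    private PageSelector() {}

    /**
     * Returns the zero-based indices of pages whose text contains at least
     * one of the terms, in document order, with each page listed only once.
     */
    static List<Integer> pagesContainingAnyTerm(List<String> pageTexts, List<String> terms) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i).toLowerCase(Locale.ROOT);
            for (String term : terms) {
                // Skip empty terms, which would otherwise match every page.
                if (!term.isEmpty() && text.contains(term.toLowerCase(Locale.ROOT))) {
                    matches.add(i);
                    break; // one hit is enough; never add the same page twice
                }
            }
        }
        return matches;
    }
}
```

Because a page is added at most once no matter how many terms it contains, the resulting list is exactly the set of pages to hand over for extraction.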
The great thing about the PMMService.extractPages() method is that you can pass in an array of discontiguous PDFPage objects to be extracted. This allows the extraction to happen in one call rather than in multiple calls, each of which might require a page range. Because the extraction happens in one call, the resulting PDF is likely to be smaller and more efficient: resources shared by several pages, which would otherwise be duplicated, only need to be added once.
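To see what the single call saves, here is a small sketch of the bookkeeping a range-only API would force on you. It is purely illustrative: PageRanges and toContiguousRanges are hypothetical names, and nothing here calls the Toolkit. It groups a sorted list of page indices into contiguous runs, each of which would need its own extraction call.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustration only: the grouping a range-based extraction API would require.
 * PageRanges and toContiguousRanges are hypothetical names, not Toolkit API.
 */
final class PageRanges {

    private PageRanges() {}

    /**
     * Groups ascending, duplicate-free page indices into inclusive
     * {first, last} runs of consecutive pages.
     */
    static List<int[]> toContiguousRanges(List<Integer> sortedPages) {
        List<int[]> ranges = new ArrayList<>();
        for (int page : sortedPages) {
            int[] last = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
            if (last != null && page == last[1] + 1) {
                last[1] = page; // extend the current run
            } else {
                ranges.add(new int[] {page, page}); // start a new run
            }
        }
        return ranges;
    }
}
```

For matches on pages 0, 1, 4, 8, and 9, this yields three runs (0 to 1, 4 to 4, and 8 to 9), so a range-only approach would take three calls where passing the page array takes one.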
To get started working with PDF, download this Gist and request an evaluation copy of the Datalogics PDF Java Toolkit.