Sample of the Week:
One of the most useful features of PDF for navigating long documents is bookmarks. Bookmarks let you move quickly from one part of a document to another, and when the PDF is optimized for “Fast Web View,” Adobe Reader can skip downloading the pages in between. Bookmarks make browsing PDF files far more efficient for both the user and the internet connection.
Many interactive PDF authoring tools, such as Microsoft Word or Adobe InDesign, will create bookmarks for you automatically based on headings and styles, but most PDF library tools don’t unless they are also creating the PDF file. As a result, there are still a lot of PDF files that were created without bookmarks and could really benefit from having them, but adding bookmarks to an existing PDF can be a bit tricky.
OK, that’s not exactly true. Just adding bookmarks is easy; discovering their destinations is the tricky part. Fortunately, the ReadingOrderTextExtractor class in the Datalogics PDF Java Toolkit makes it easy to find paragraphs that match criteria marking them as a heading or subheading, and to use the information it provides to create a proper bookmark destination.
In this example input file, the top-level headings (H1) are in 21-point MinionPro-Bold, the H2 headings are in 18-point MyriadPro-Bold, and the H3 headings are in 14-point MyriadPro-Bold.
Knowing this, we can iterate through each paragraph of the PDF file and use this style information to locate headings and then set the destinations of the bookmarks to the coordinates of the words we find with those styles. By extracting the text in reading order, we know that the bookmarks will also be in the correct order and nested properly regardless of the length of the document.
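To make "the coordinates of the words" a little more concrete, here is a minimal sketch of how a destination point could be picked for a heading. It does not use the toolkit: `WordBox` and `HeadingDestination` are hypothetical names standing in for whatever position data the extractor's words expose, and the sketch assumes all words of the heading sit on the same page.

```java
import java.util.List;

/** Illustrative helper, not part of the toolkit: finds where a heading bookmark should point. */
final class HeadingDestination {

    /** Hypothetical stand-in for a word's bounding box in default PDF user space (y grows upward). */
    static final class WordBox {
        final double left, bottom, right, top;

        WordBox(double left, double bottom, double right, double top) {
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.top = top;
        }
    }

    /** Returns {left, top}, the upper-left corner of the box enclosing all words of the heading. */
    static double[] upperLeft(List<WordBox> headingWords) {
        if (headingWords.isEmpty()) {
            throw new IllegalArgumentException("a heading needs at least one word");
        }
        double left = Double.POSITIVE_INFINITY;
        double top = Double.NEGATIVE_INFINITY;
        for (WordBox word : headingWords) {
            left = Math.min(left, word.left); // leftmost edge of the heading
            top = Math.max(top, word.top);    // highest edge, since y grows upward
        }
        return new double[] {left, top};
    }
}
```

Pointing the bookmark at the upper-left corner means the viewer scrolls so the heading appears at the top of the window, rather than landing in the middle of it.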
In the Gist below, I’ve set up ranges of sizes for the heading fonts we are interested in; this should make it easier to modify the code to fit your particular needs.
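Separately from the Gist, the idea of matching on size ranges can be sketched in plain Java. The class and method names here are illustrative, and the one-point tolerance around each size is an assumption rather than the values the sample uses; only the fonts and nominal sizes come from the example file described above.

```java
/** Illustrative classifier, not part of the toolkit: maps a font name and size to a heading level. */
final class HeadingClassifier {

    /** One heading style: a font-name fragment plus an inclusive point-size range. */
    static final class HeadingStyle {
        final int level;
        final String fontNameFragment;
        final double minSize;
        final double maxSize;

        HeadingStyle(int level, String fontNameFragment, double minSize, double maxSize) {
            this.level = level;
            this.fontNameFragment = fontNameFragment;
            this.minSize = minSize;
            this.maxSize = maxSize;
        }

        boolean matches(String fontName, double fontSize) {
            // contains() rather than equals(): subset fonts carry a prefix such as "ABCDEF+".
            return fontName.contains(fontNameFragment)
                    && fontSize >= minSize && fontSize <= maxSize;
        }
    }

    // Ranges bracket the sample file's 21, 18 and 14 pt headings; the +/- 1 pt margin is an assumption.
    private static final HeadingStyle[] STYLES = {
        new HeadingStyle(1, "MinionPro-Bold", 20.0, 22.0),
        new HeadingStyle(2, "MyriadPro-Bold", 17.0, 19.0),
        new HeadingStyle(3, "MyriadPro-Bold", 13.0, 15.0),
    };

    /** Returns 1, 2 or 3 for a heading style, or 0 when the text is not a heading. */
    static int levelFor(String fontName, double fontSize) {
        for (HeadingStyle style : STYLES) {
            if (style.matches(fontName, fontSize)) {
                return style.level;
            }
        }
        return 0;
    }
}
```

Keeping the styles in one table means adapting the sketch to a different document is a matter of editing three lines, not the matching logic.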
The first step in reading the text of the PDF is to set up the text extractor. For the text extractor to interpret the text correctly, it needs to know what fonts are available. Once you’ve loaded them, setting up the ReadingOrderTextExtractor class is easy.
From there you can easily iterate over each paragraph.
Each “paragraph” is an ArrayList of ArrayLists, which are the sentences, each of which is in turn an ArrayList of Word objects. The Word objects can be used to discover the position of the word on the page, and we can use the characters in the Word to discover its font name and size. Once we’ve determined we have a Heading paragraph, we can add the bookmark in its appropriate place in the bookmark tree and set its destination.
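Finding the "appropriate place in the bookmark tree" is the part that benefits most from a concrete example. The sketch below keeps a stack of currently open headings so that each new heading nests under the nearest shallower one. It builds a plain in-memory tree; `OutlineBuilder` and `Node` are hypothetical names, and a real implementation would create the toolkit's own bookmark objects at the point where a `Node` is created here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative outline builder, not part of the toolkit: nests headings by level in reading order. */
final class OutlineBuilder {

    /** Hypothetical bookmark node: a title plus the page and point it should jump to. */
    static final class Node {
        final int level;
        final String title;
        final int pageIndex;
        final double left;
        final double top;
        final List<Node> children = new ArrayList<>();

        Node(int level, String title, int pageIndex, double left, double top) {
            this.level = level;
            this.title = title;
            this.pageIndex = pageIndex;
            this.left = left;
            this.top = top;
        }
    }

    private final Node root = new Node(0, "root", -1, 0, 0);
    private final Deque<Node> open = new ArrayDeque<>();

    OutlineBuilder() {
        open.push(root);
    }

    /** Adds a heading; headings must arrive in reading order, with level 1 as the top level. */
    void addHeading(int level, String title, int pageIndex, double left, double top) {
        if (level < 1) {
            throw new IllegalArgumentException("heading level must be 1 or greater");
        }
        // Close every open heading at the same or a deeper level; the root (level 0) always stays.
        while (open.peek().level >= level) {
            open.pop();
        }
        Node node = new Node(level, title, pageIndex, left, top);
        open.peek().children.add(node); // parent is the nearest shallower heading
        open.push(node);
    }

    Node root() {
        return root;
    }
}
```

Because the text arrives in reading order, a single pass is enough: an H3 that follows an H2 becomes its child, and the next H1 closes everything beneath the previous one.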
The images below are before and after screen captures of the input file being displayed in Adobe Acrobat DC.
You can see that an entire tree of bookmarks has been added to the PDF file perfectly reflecting the heading styles and how they are nested. To get started with adding bookmarks to PDF files, download this Gist and request an evaluation copy of The Datalogics PDF Java Toolkit.