Using jsoup with the Datalogics PDF Java Toolkit

Sample of the Week:

This is not an article about converting HTML to PDF… though that’s exactly what the Gist referenced in it does. Instead, I’m using jsoup to read in a tagged file format, HTML in this case, and lay out its text on a PDF page using the Talkeetna component of the Datalogics PDF Java Toolkit. It is no coincidence that the text model in Talkeetna bears a striking resemblance to HTML and CSS, or that the Talkeetna class names correspond to the terms described in the “Tagged PDF” section of the ISO 32000 Reference.
What You Need to Know First:
The Datalogics PDF Java Toolkit contains a basic page layout component that makes it easy to create new PDF files from your own input. The LayoutEngine was designed from the ground up to allow developers to leverage their existing knowledge of HTML and CSS to create new PDF documents. The Element in Talkeetna is similar to the Element in HTML.
In Talkeetna, the subclasses of Element are used together to lay out a page of text. The Element subclasses are divided into three categories:

  • Grouping Elements: This category consists of the single class Div. A Div is used to group other Divs, Paragraphs or Headings into vertically laid out blocks on the page.
  • Block-Level Structure Elements: This category includes Paragraph and Heading. Like Divs, they can be vertically laid out on the page as blocks; however, unlike Divs, they contain Spans of text.
  • Inline-Level Structure Elements: This category consists of the single class Span. A Span is a run of horizontally laid out text. Several Spans may be contained in a single Paragraph or Heading.

Sound familiar? There’s more…
Any Element can carry a Style. A Style describes the appearance of the Element it is attached to, including the typeface, text decoration, margins, etc. These are specified as properties of the Style. Styles are optional: by default, an Element uses the Style of the Element that contains it. Thus a Span without a Style uses the Style of its containing Paragraph, a Paragraph without a Style uses the Style of its containing Div, and so on. If none of the containers has a Style, a ‘default’ Style provided in the LayoutEngine is used. Any property in a Style may be ‘null’, in which case the corresponding property from the containing Element’s Style is used; for example, this allows a Heading to use the typeface of its containing Div but apply a larger font size. Some of the Style properties are specified using the Length class, which associates a value with a Dimension. Dimensions can be absolute or relative to the Style’s font size.
This should also sound familiar to anyone who understands CSS.
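To make the inheritance rule concrete, here is a minimal sketch of how a ‘null’ property can fall back to the containing Element’s Style, and how a relative length can be resolved against the font size. The classes SketchStyle, SketchElement and SketchLength are hypothetical stand-ins written for this illustration; they are not the Talkeetna classes, and the real API may look quite different.

```java
// Hypothetical stand-ins for illustration only; these are NOT the Talkeetna classes.

/** A style whose properties may be null, meaning "inherit from the container". */
class SketchStyle {
    final String typeface;
    final Double fontSize; // in points

    SketchStyle(String typeface, Double fontSize) {
        this.typeface = typeface;
        this.fontSize = fontSize;
    }
}

/** An element that optionally carries a style and knows its container. */
class SketchElement {
    final SketchElement parent; // null for the outermost element
    final SketchStyle style;    // null when the element carries no style

    SketchElement(SketchElement parent, SketchStyle style) {
        this.parent = parent;
        this.style = style;
    }

    /** Walks outward until some style sets a typeface; falls back to the defaults. */
    String resolveTypeface(SketchStyle defaults) {
        for (SketchElement e = this; e != null; e = e.parent) {
            if (e.style != null && e.style.typeface != null) {
                return e.style.typeface;
            }
        }
        return defaults.typeface;
    }

    /** Same walk for the font size; the defaults are assumed to be fully populated. */
    double resolveFontSize(SketchStyle defaults) {
        for (SketchElement e = this; e != null; e = e.parent) {
            if (e.style != null && e.style.fontSize != null) {
                return e.style.fontSize;
            }
        }
        return defaults.fontSize;
    }
}

/** A length that is either absolute (points) or a multiple of the font size. */
class SketchLength {
    final double value;
    final boolean relativeToFontSize;

    SketchLength(double value, boolean relativeToFontSize) {
        this.value = value;
        this.relativeToFontSize = relativeToFontSize;
    }

    double toPoints(double fontSize) {
        return relativeToFontSize ? value * fontSize : value;
    }
}

class StyleCascadeDemo {
    public static void main(String[] args) {
        SketchStyle defaults = new SketchStyle("Times", 12.0);
        // The Div sets only the typeface; the Heading sets only the font size.
        SketchElement div = new SketchElement(null, new SketchStyle("Helvetica", null));
        SketchElement heading = new SketchElement(div, new SketchStyle(null, 24.0));
        SketchElement span = new SketchElement(heading, null); // no style of its own

        System.out.println(span.resolveTypeface(defaults) + " " + span.resolveFontSize(defaults));
        System.out.println(new SketchLength(1.5, true).toPoints(span.resolveFontSize(defaults)));
    }
}
```

In this sketch the Span ends up with the Div’s typeface and the Heading’s font size, which is the Heading example from the paragraph above, and a relative length of 1.5 resolves to one and a half times that font size.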
Using jsoup with the Datalogics PDF Java Toolkit
While I’m using it very simplistically, jsoup is actually a very sophisticated HTML parser. If you know the jQuery and CSS selection methods, you can get started without much of a learning curve.
To use jsoup to lay out a PDF page, we simply read in the HTML Element by Element, create the Talkeetna counterpart of each one, and use Spans where appropriate to change styles mid-paragraph.
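The core of that translation is deciding which Talkeetna class each HTML tag corresponds to. The sketch below shows one way to write that decision down. It is not taken from the Gist: the TagMapping class and its method are hypothetical, and treating inline tags such as b and em as Spans is my assumption. It uses plain strings so that it runs without either library; in a real program the tag names would come from jsoup, for example via Element.tagName().

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration; not part of jsoup or Talkeetna.
class TagMapping {

    /**
     * Returns the name of the Talkeetna Element subclass that plays the same
     * role as the given HTML tag, or empty when there is no direct counterpart.
     */
    static Optional<String> talkeetnaCounterpart(String htmlTag) {
        switch (htmlTag.toLowerCase(Locale.ROOT)) {
            case "div":
                return Optional.of("Div");        // grouping element
            case "p":
                return Optional.of("Paragraph");  // block-level element
            case "h1":
            case "h2":
            case "h3":
            case "h4":
            case "h5":
            case "h6":
                return Optional.of("Heading");    // block-level element
            case "span":
            case "b":
            case "strong":
            case "i":
            case "em":
                return Optional.of("Span");       // inline run, assumed to carry its own Style
            default:
                return Optional.empty();          // e.g. lists, which need special handling
        }
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"div", "h2", "p", "em", "ul"}) {
            System.out.println(tag + " -> "
                    + talkeetnaCounterpart(tag).orElse("(no direct counterpart)"));
        }
    }
}
```

Returning an empty Optional for unknown tags keeps the caller honest: it has to decide explicitly what to do with anything the layout component cannot represent directly.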

One of the current limitations of Talkeetna is that it doesn’t do lists. I got lucky, though: none of the items in either the ordered or the unordered list wraps onto multiple lines, so I just prepend the character I need. It works for this example, but don’t expect it to work for everything.
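The prepending trick can be sketched in a few lines. This is an illustration of the idea rather than the code in the Gist: the ListPrefixer class is hypothetical, and the choice of a bullet character and an “n. ” number format is my assumption. Each resulting string would then be laid out as ordinary paragraph text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of jsoup or Talkeetna.
class ListPrefixer {

    /**
     * Fakes list markers by prepending a number (ordered) or a bullet
     * (unordered) to the text of each item. Only looks right when every
     * item fits on one line, because wrapped lines would not be indented.
     */
    static List<String> prefixItems(List<String> items, boolean ordered) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            String marker = ordered ? (i + 1) + ". " : "\u2022 ";
            result.add(marker + items.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("Parse the HTML", "Build the Elements");
        prefixItems(items, true).forEach(System.out::println);
        prefixItems(items, false).forEach(System.out::println);
    }
}
```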
You can use this concept to read just about any tagged file format and create a simple PDF from that source using the Elements in Talkeetna.
To get started working with PDF, download this Gist and request an evaluation copy of the Datalogics PDF Java Toolkit.
