Along with JSON-formatted output, PDF Alchemist 3.0 introduced a feature for users who want to refine the data they extract from PDF: you can now supply an XSLT stylesheet to transform the XML-formatted output. With XSLT, you can tailor a conversion so that it extracts exactly the data you specify. PDF Alchemist is compatible with XSLT 1.0.
Briefly, XSLT is a language for transforming XML into a new document according to rules declared in a stylesheet (an .xslt file). Each rule matches a pattern of nodes, located with XPath queries against the original XML, and describes the output to produce for those nodes. The resulting document can be new XML, or it can be a different format altogether, such as plain text or HTML.
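To make the idea of rules and XPath queries concrete, here is a minimal sketch of a stylesheet that produces plain text. The element name "heading" is a hypothetical placeholder, not a documented PDF Alchemist tag; substitute whatever element names appear in your own XML output.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Produce plain text rather than XML. -->
  <xsl:output method="text"/>

  <!-- Rule for the document root: visit every "heading" element
       (hypothetical name), wherever it appears in the input. -->
  <xsl:template match="/">
    <xsl:apply-templates select="//heading"/>
  </xsl:template>

  <!-- Rule for each matched node: write its text, then a line break. -->
  <xsl:template match="heading">
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Anything the root rule does not select is simply left out of the output, which is what makes XSLT useful for filtering.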
What does this mean for our users? Because PDF Alchemist already identifies the elements in a PDF and tags them in its XML output, you can select exactly the content you wish to preserve. Here are a few examples:
- Select only table data and output it as CSV (comma-separated values).
- Extract only paragraphs, ignoring visual elements and other data.
- Rename the default XML tags to match a custom schema.
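As an illustration of the last item, renaming tags is commonly done with an identity transform plus one override per element. The sketch below assumes the input contains elements named "paragraph" and that the target schema calls them "para"; both names are hypothetical and should be replaced with the tags in your own output and schema. It also assumes the input elements are not in an XML namespace.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Identity rule: by default, copy every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Override: write each "paragraph" element (hypothetical name)
       as "para", keeping its attributes and children. -->
  <xsl:template match="paragraph">
    <para>
      <xsl:apply-templates select="@*|node()"/>
    </para>
  </xsl:template>
</xsl:stylesheet>
```

The override wins over the identity rule because a template that matches a specific element name has a higher default priority than one that matches node().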
Let’s take a look at the first example. PDF Alchemist ships with several sample PDFs, one of which is Table_Image.pdf. It should come as little surprise that this document consists of an image of a table.
We could leverage PDF Alchemist’s OCR capability to extract data from this document into a format we can parse, such as JSON or XML. However, let’s suppose I wish to get my hands dirty analyzing this data in a spreadsheet. CSV is a file format that I can open in my favorite spreadsheet app, and its relative simplicity means that I can write a fairly lightweight stylesheet.
To get the data from PDF into CSV, I must first obtain the contents of each cell. Using a series of templates, I can match a pattern of elements in the XML and get all “cell” elements in each row. For each cell, the stylesheet writes an opening double quotation mark, then the cell’s value with any embedded quotation marks doubled, then a closing double quotation mark. If more cells follow in the row, it adds a comma. At the end of each row, it inserts a new line.
Here is an excerpt from such a stylesheet:
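What follows is a minimal sketch of a stylesheet that implements the steps just described, not necessarily the one that ships as xmltocsv-tables.xslt. It assumes that table rows appear as "row" elements with "cell" children and that these elements are not in an XML namespace; only "cell" is named in this article, so check the other names against your own XML output. Because XSLT 1.0 has no built-in string replacement function, embedded quotation marks are doubled with a small recursive named template.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Visit only table rows; all other content is ignored.
       The element name "row" is an assumption. -->
  <xsl:template match="/">
    <xsl:apply-templates select="//row"/>
  </xsl:template>

  <!-- One CSV line per row: write each cell, then a line break. -->
  <xsl:template match="row">
    <xsl:apply-templates select="cell"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Wrap the cell value in double quotes and add a comma
       unless this is the last cell in the row. -->
  <xsl:template match="cell">
    <xsl:text>"</xsl:text>
    <xsl:call-template name="double-quotes">
      <xsl:with-param name="text" select="normalize-space(.)"/>
    </xsl:call-template>
    <xsl:text>"</xsl:text>
    <xsl:if test="position() != last()">
      <xsl:text>,</xsl:text>
    </xsl:if>
  </xsl:template>

  <!-- Recursively replace each embedded " with "". -->
  <xsl:template name="double-quotes">
    <xsl:param name="text"/>
    <xsl:choose>
      <xsl:when test="contains($text, '&quot;')">
        <xsl:value-of select="substring-before($text, '&quot;')"/>
        <xsl:text>""</xsl:text>
        <xsl:call-template name="double-quotes">
          <xsl:with-param name="text"
              select="substring-after($text, '&quot;')"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$text"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Quoting every field, rather than only those that contain commas or quotation marks, keeps the stylesheet short and is still accepted by common spreadsheet applications.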
With the stylesheet complete, I run PDF Alchemist using the following command:
pdfalchemist Samples\Table_Image.pdf C:\Users\cgreen\Documents\tablestocsv -outputFormat xml -outputFilename sample.csv -ocrMode replace -xsltStylesheetPath Samples\xmltocsv-tables.xslt
By running with these options, the following occurs:
- During conversion, OCR detects text within the image and writes it to XML.
- The stylesheet is applied to write a new file.
- The output file is named “sample.csv”. Files produced by a transformation have no file extension by default, so the “-outputFilename” option is essential if you want the file to carry its extension as soon as it is written.
The result is a file that can be opened in a spreadsheet application and worked with directly.
Here is a view of the document in a text editor:
XSLT transformations have many applications, some far more complex than the example above. We hope that users take full advantage of this additional flexibility when working with their converted PDFs.