In January of this year, Matt Kuznicki wrote an article about Intelligently Extracting Content From PDF Files, based on a presentation he gave at CodeMash 2016. Since then we have been working with our customers to make PDF Alchemist even better. Today we are releasing only a small portion of the updates, and there are certainly more to come!
PDF Alchemist 2.1.3 packages up fixes for a few things that drove some of our early users crazy while they were just trying to evaluate earlier versions of PDF Alchemist. Items like these (missing documentation, poor error message descriptions, or even the name of the PDF Alchemist download) are easy to miss and seem so small that you wouldn’t think they affect anyone’s experience, but they do! We at Datalogics want everyone to see and feel the time and energy we put into developing our products, and we want you to find them as easy to use as possible.
The reliability of our products matters to us as well, which is why we have also started a project to improve the detection of tables in a PDF so that they are output correctly in the HTML created by PDF Alchemist. Detecting tables in a PDF that does not contain any structure information is difficult, and it is a problem we have heard about from a number of evaluators and customers of PDF Alchemist. What we have learned so far, though, is that we are going to break things while we improve the overall product. Version 2.1.3 contains only a small improvement in table detection and output. Let’s take a look at a simple example.
Here is one of our sample files, a phony Profit and Loss statement we have been using for testing. With PDF Alchemist 2.1.0 or earlier, the first row below the column headings contains 3 lines of text in the first cell.
With PDF Alchemist 2.1.3 and later, the same file produces slightly different output. The first cell in the first row below the column headings is now split into multiple cells, as it would have been if the input file had more lines drawn to clearly separate the rows.
However you use PDF Alchemist, this change will help it produce better table output. The structure of the HTML now more closely matches how we would read the original PDF, and with that structure present in the HTML, screen readers and data analysis tools are less likely to need special handling to understand the content.
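To make the benefit for data analysis tools concrete, here is a minimal sketch in Python that reads table rows using only the standard library. The two HTML fragments are hypothetical and written by hand for illustration; they are not actual PDF Alchemist output, and the labels and figures in them are invented. The `TableRows` class and `table_rows` helper are likewise illustrative names, not part of any product.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []
        elif tag == "br" and self._cell is not None:
            # Keep line breaks inside a cell visible as newlines.
            self._cell.append("\n")

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_rows(html):
    """Return a list of rows, each row a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup (assumption): three lines of text merged into one row.
MERGED = (
    "<table><tr><td>Income<br>Sales<br>Services</td>"
    "<td>100<br>60<br>40</td></tr></table>"
)

# Hypothetical markup (assumption): the same content with one row per line.
SPLIT = (
    "<table>"
    "<tr><td>Income</td><td>100</td></tr>"
    "<tr><td>Sales</td><td>60</td></tr>"
    "<tr><td>Services</td><td>40</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    print(table_rows(MERGED))
    print(table_rows(SPLIT))
```

With the merged fragment, a consumer gets a single row whose cells contain embedded newlines and has to split and realign them itself. With the split fragment, each label arrives already paired with its own value, so no extra handling is needed.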
Let us know if you have run into similar issues while converting PDF to HTML by upvoting this suggestion on our feedback forum! When you upvote the suggestion, you are automatically subscribed to notifications any time we update it with our progress. If you have test files that you would like to share with us, you can upload them there as well!
Release notes for PDF Alchemist 2.1.3 (aggregate of release notes since last public release)
- Improved separation of table rows when text is indented in the first cell of a row
- Folder that PDF Alchemist extracts to now contains the version number of PDF Alchemist
- Updated documentation for possible return values from processPDF function
- Automatically create an output directory if no output directory is specified
- Provide helpful error message if a PDF cannot be processed because it requires a password
- Evaluation version: Missing or invalid license error is reported before all other errors
- Add documentation for processPdf and processPdf2Epub functions in PDFAlchemist.h
- Fix encoding issues in PDFAlchemist.h