Datalogics has just released updated versions of the Adobe PDF Library (APDFL) products, including a major enhancement to the OCR support included with the Datalogics Extensions to APDFL.
The initial release of the OCR support was made available with APDFL version 15 in February 2019. This capability was added in response to user requests, and it used version 3.05 of Tesseract, the open-source OCR engine originally developed by HP. With this initial release, Datalogics Extensions provided character recognition in seven languages. We chose these from the languages Tesseract supports to keep the deliverable package small while covering what we expected to be the most requested languages.
What’s the Big Deal?
The Tesseract open-source project is now sponsored by Google, and the Tesseract version 4 recognition engine differs significantly from v3 in the way it “thinks”. Tesseract 4 uses long short-term memory (LSTM), a recurrent neural network (RNN) architecture used in deep learning. With this machine learning approach, Tesseract has substantially improved its ability to recognize text.
More to Come
This past week, we delivered updated OCR support with the Datalogics Extensions for APDFL 18, with the core Tesseract engine upgraded to version 4.1.1. While the base deliverable still supports the same seven languages, developers will soon be able to add support for up to 114 languages and 37 scripts without a new deliverable from Datalogics. This will include Chinese, Japanese, and Korean text recognition, which our customers in East Asia have requested.
While Tesseract 4 is available now with the Datalogics Extensions for APDFL 18, we are continuing to extend and enhance the OCR capabilities. Upcoming additions include image pre-processing capabilities and APIs to de-skew and de-speckle scanned images that were inadvertently rotated or picked up artifacts during scanning. That’s not all; below you will see some visual examples of the improvements we’ve added. These enhancements will further increase character recognition accuracy and are expected to roll out with a Datalogics Extensions update in Q2 2021.
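To give a feel for what de-speckling involves, here is a minimal sketch in plain Python of one common approach, a median filter, applied to a tiny grayscale image. This is an illustration only and not the Datalogics Extensions API: the function name `despeckle`, the `radius` parameter, and the list-of-rows image format are all assumptions made for the example, and the shipped implementation may use a different technique entirely.

```python
from statistics import median


def despeckle(image, radius=1):
    """Return a median-filtered copy of a grayscale image.

    `image` is a list of equal-length rows of pixel intensities (0-255).
    Hypothetical helper for illustration; not part of any Datalogics API.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Gather the neighbourhood around (x, y), clipped at the borders.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # The median discards isolated outliers such as a single dark speck.
            row.append(int(median(window)))
        result.append(row)
    return result


# A white 5x5 "page" with one dark speck in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
cleaned = despeckle(page)
```

Because an isolated speck is always outnumbered by its clean neighbours, the median replaces it with the background value, while larger features such as character strokes tend to survive. The input image is left unmodified.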
Coming Soon to PDF Alchemist as Well
PDF Alchemist users will also see improved character recognition soon. The updates to the OCR support in Datalogics Extensions for APDFL will be rolled into the PDF Alchemist product after they have shipped with APDFL. Stay tuned for more!