Here at Datalogics, we are continuously innovating and providing our customers with more value to better assist them with their PDF document needs. Over the past few months, we’ve added Optical Character Recognition (OCR) support to many of our products. We are excited to announce that OCR support is now available within the Java and .NET interfaces of the Adobe PDF Library. We’ve combined the power of the Adobe PDF Library with Tesseract (a widely used open source OCR engine) to allow users to access and process the data and text within images.
One of the most common use cases for OCR is preparing documents for searching or extracting their data for use in another process. With our OCR APIs, the text within images becomes accessible without changing the look of the input document. Let’s walk through some of the key components of the API using .NET. You can view the full code by visiting our public sample GitHub repository.
Setting the PageSegmentationMode to Automatic lets the OCR engine choose how to segment the page for text detection. The Performance parameter offers several levels for trading speed against accuracy. In this case, we are selecting the mode that produces the best accuracy, which is a common choice when you are unsure of the quality of your input document. The OCRParams will default to English; you’ll need to use the Languages parameter to select other languages. Multiple languages can be selected at the same time.
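To make these settings concrete, here is a minimal Java sketch of the configuration described above. Every type in it (`OcrParamsSketch`, `OcrParams`, and the enums) is a hypothetical stand-in that only mirrors the names used in this post. It is not the Adobe PDF Library API, and the `SINGLE_BLOCK` and `FASTER` values are assumed placeholders for the alternatives; the real classes and signatures are in the sample repository.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-ins that only mirror the names used in this post.
// They are NOT the Adobe PDF Library classes.
final class OcrParamsSketch {
    enum PageSegmentationMode { AUTOMATIC, SINGLE_BLOCK } // SINGLE_BLOCK is an assumed alternative
    enum Performance { FASTER, BEST_ACCURACY }            // FASTER is an assumed alternative
    enum Language { DUTCH, ENGLISH, FRENCH, GERMAN, ITALIAN, PORTUGUESE, SPANISH }

    static final class OcrParams {
        // Defaults other than the English language are assumptions of this sketch.
        PageSegmentationMode pageSegmentationMode = PageSegmentationMode.AUTOMATIC;
        Performance performance = Performance.FASTER;
        // English is the default language, as described in the text.
        Set<Language> languages = EnumSet.of(Language.ENGLISH);

        // Replaces the current selection; several languages may be active at once.
        void setLanguages(Language first, Language... rest) {
            languages = EnumSet.of(first, rest);
        }
    }

    // Builds the configuration discussed above: automatic segmentation, best accuracy.
    static OcrParams bestAccuracyParams() {
        OcrParams params = new OcrParams();
        params.pageSegmentationMode = PageSegmentationMode.AUTOMATIC;
        params.performance = Performance.BEST_ACCURACY;
        return params;
    }
}
```

On this stand-in, calling `setLanguages(Language.ENGLISH, Language.GERMAN)` would enable both languages at once, which is the multi-language case mentioned above.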
Once the OCREngine is configured, we can loop through the content of the document, identify the images, and apply the OCR processing to each one.
The image object is replaced by a form, which contains the original image and the identified text laid out behind it. Once this step is complete, the resulting document can be saved and it will contain the original content and the identified text.
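The loop and the replacement step can be sketched in Java as follows. This is an illustration under stated assumptions, not the library’s implementation: `Element`, `Text`, `Image`, `Form` and `addTextToImages` are hypothetical stand-ins for the page content model, and the `recognizer` function stands in for the configured OCR engine.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins for a page's content elements; NOT the Adobe PDF Library types.
final class OcrWalkSketch {
    interface Element {}

    static final class Text implements Element {
        final String value;
        Text(String value) { this.value = value; }
    }

    static final class Image implements Element {
        final String name;
        Image(String name) { this.name = name; }
    }

    // A form groups child elements; children listed first are treated as drawn first.
    static final class Form implements Element {
        final List<Element> children;
        Form(Element... children) { this.children = Arrays.asList(children); }
    }

    // Walks the content once and swaps each image for a form holding the
    // recognized text followed by the original image, so the text sits behind it.
    // Returns the number of images that were replaced.
    static int addTextToImages(List<Element> content, Function<Image, String> recognizer) {
        int replaced = 0;
        for (int i = 0; i < content.size(); i++) {
            Element element = content.get(i);
            if (element instanceof Image) {
                Image image = (Image) element;
                Text recognized = new Text(recognizer.apply(image));
                content.set(i, new Form(recognized, image)); // same position, same visible image
                replaced++;
            }
        }
        return replaced;
    }
}
```

Because the form keeps the original image and sits at the same position in the content, the page looks unchanged while the recognized text becomes available for search and extraction.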
As an added benefit, the .NET and Java interfaces currently support Dutch, English, French, German, Italian, Portuguese and Spanish, with Chinese, Japanese and Korean to be added shortly. Try it out yourself by requesting a free evaluation, and feel free to take a look at our full sample code for Java and .NET (which includes how to start this process from an image rather than a PDF) under the OpticalCharacterRecognition section inside Sample_Source.