Datalogics has recently added OCR (Optical Character Recognition) support to the .NET and Java interfaces of our Adobe PDF Library. Available on Windows and Linux for many languages written in Latin characters, OCR support allows users to recover text from images in PDFs. So, why would you want to consider doing this?
The Challenges With Pictures
Images and pictures are an important part of information transfer and archiving. Whether they come from smartphone pictures of receipts, scans of paper documents, or newspaper archives on film, important information is often communicated through images of letters and words rather than actual text. Many pictures within PDF files are only pictures, which leaves the information in them unavailable for indexing and out of reach of most programs.
Occasionally, we will find “searchable images” inside PDFs: pictures whose text can be searched by PDF viewers. Searchable images go a long way toward turning pictures of information into actual, accessible data. This is accomplished with OCR, which applies machine vision and reading techniques to images. OCR “reads” pictures much as a human does, by recognizing individual letters and combining them back into words.
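To make the "combining letters back into words" step concrete, here is a minimal Java sketch. It is a simplified illustration and does not show how the OCR engine in the Adobe PDF Library works internally. The `GlyphGrouper` and `Glyph` names, and the idea of splitting words on a fixed horizontal gap, are assumptions made for this example only. It assumes each letter on a single line of text has already been recognized and has a known left and right edge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: hypothetical types, not part of the Adobe PDF Library API.
final class GlyphGrouper {

    /** One recognized character with the horizontal extent of its bounding box. */
    static final class Glyph {
        final char ch;
        final double left;
        final double right;

        Glyph(char ch, double left, double right) {
            this.ch = ch;
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Joins the glyphs of a single text line into words. A new word starts when
     * the gap between one glyph's right edge and the next glyph's left edge
     * exceeds maxGap.
     */
    static List<String> groupIntoWords(List<Glyph> glyphs, double maxGap) {
        // Read left to right, regardless of the order the glyphs were detected in.
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(g -> g.left));

        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double previousRight = 0.0;

        for (Glyph g : sorted) {
            // A wide gap after at least one letter closes the current word.
            if (current.length() > 0 && g.left - previousRight > maxGap) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.ch);
            previousRight = g.right;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        List<Glyph> line = List.of(
            new Glyph('P', 0, 8), new Glyph('D', 9, 17), new Glyph('F', 18, 25),
            new Glyph('f', 40, 44), new Glyph('i', 45, 47), new Glyph('l', 48, 50),
            new Glyph('e', 51, 58));
        System.out.println(groupIntoWords(line, 5.0)); // prints [PDF, file]
    }
}
```

A real OCR engine also has to detect lines, handle varying font sizes, and correct misread characters, so a fixed gap threshold like this is only a starting point for understanding the idea.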
Enhancing the PDF Process
With our recent enhancements, users of the Adobe PDF Library SDK distributed by Datalogics can take their PDF workflows in new directions, including:
Creating PDFs with searchable images: with OCR, users creating PDFs can recover text from images as they import those images into a PDF document. Adding text along with images at document creation unlocks more capabilities for “born-digital” documents. Read-aloud, information interchange, and long-term usability are all better supported when PDFs contain machine-readable text alongside their images.
Enhancing existing PDFs with searchable images: existing PDFs with information locked inside images can be processed with OCR. Each image keeps its place on the page, and a searchable text layer is placed underneath it. This enables searching and copying text from these images without changing the appearance of the PDF.
Replacing pictures with text: for files where you know the important content in the pictures is text, you can take the OCR process one step further. With our easy-to-use PDF .NET and Java APIs, you can not only recover text from pictures, you can also remove the pictures and keep only the text. This leads to smaller files, faster processing, and more usable data.
Recovering Information From PDFs
At its heart, PDF is a container for various types of information, both textual and visual. Optical character recognition in the Datalogics interfaces for the Adobe PDF Library can help you transform pictures into useful text, enhancing the usability and value of new and existing PDF files.
If you are interested in transforming PDF files into machine-readable information and responsive HTML, note that the OCR capabilities discussed above are shared with Datalogics PDF Alchemist. PDF Alchemist takes information retrieval and recovery even further, transforming visually oriented PDF files into reflowed, restructured XML and HTML suited to information processing tools and workflows.
Whether your interest is in creating better PDF files, improving your existing PDFs, or making better use of the PDF files you have, Datalogics has the technology for you! Feel free to request your free evaluation today.