Cracking the Code: Adding OCR to a PDF with Adobe PDF Library
What Problem Does OCR Solve?
When a document is scanned, photographed, or exported as a flattened image, the result is a PDF that looks like it contains text but holds no actual text data. Search engines cannot index it. Automation pipelines cannot extract fields from it. Users cannot copy and paste from it.
This is one of the most common document processing problems in enterprise environments. Think of a law firm processing thousands of legacy contracts, a healthcare provider digitizing patient intake forms, or a financial institution ingesting paper-based statements. In every case, the scanned PDF is a dead end until OCR runs against it.
Optical Character Recognition (OCR) solves this by analyzing the image data on each page, identifying characters, and embedding the recognized text into the PDF. The visual appearance of the document does not change. What changes is that the document now contains a real, machine-readable text layer that supports search, copy, extraction, and accessibility.
Use OCR when you need to:
• Make scanned documents full-text searchable
• Extract field data from forms, receipts, or invoices
• Prepare PDFs for archival in a searchable format such as PDF/A
• Feed document content into downstream NLP or data processing pipelines
• Support accessibility requirements that depend on actual text content
The Sample Code: What Is Available on GitHub
The Datalogics .NET OCR sample repository includes two distinct samples that demonstrate different OCR scenarios in C# using Adobe PDF Library:
AddTextToDocument
Takes a scanned PDF (or any PDF containing image-based pages) and runs OCR across the entire document. For each page that contains image content, the engine places a hidden text layer behind the original image. The output is a PDF that looks identical to the input but is now fully searchable and selectable.
AddTextToSinglePage
Applies the same OCR process to a single page rather than the full document. This is useful when you only need to process specific pages, or when building a workflow that selects pages based on content type before applying OCR.
Both samples follow the same core pattern: configure the OCR engine, iterate over the page content, detect image elements, replace each image with a form that combines the original image and the recognized text, then save the document.
How It Works: Step-by-Step Code Walkthrough
Step 1: Initialize the Library and Open the Document
using (Library lib = new Library())
{
    Document doc = new Document("scanned-input.pdf");
The Library object initializes Adobe PDF Library (APDFL). All APDFL operations must occur within the scope of an active Library instance, which is why the using block opened here is not closed until Step 4.
Step 2: Configure the OCR Engine
OCRParams ocrParams = new OCRParams();
ocrParams.PageSegmentationMode = PageSegmentationMode.Automatic;
ocrParams.Performance = Performance.BestAccuracy;
OCREngine ocrEngine = new OCREngine(ocrParams);
Two key parameters control OCR behavior:
PageSegmentationMode determines how the engine breaks down a page before recognizing characters. Automatic is the right default for most documents because it lets the engine decide whether the page is a single column, multi-column, table, or mixed layout. If you know your input is always single-column text, SingleColumn mode can improve speed.
Performance controls the tradeoff between speed and accuracy. BestAccuracy produces the most reliable results and is the recommended setting when input document quality is unknown. BestSpeed is appropriate for high-volume pipelines where documents are consistently clean and high-resolution.
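The guidance for these two settings can be condensed into a small decision helper. This is a minimal sketch, not library code: SegmentationChoice, PerformanceChoice, and Choose are hypothetical names that only mirror the PageSegmentationMode and Performance values described above, so the block compiles without the library installed.

```csharp
using System;

// Hypothetical mirrors of the two settings discussed above; in a real project
// these would map onto the library's PageSegmentationMode and Performance values.
public enum SegmentationChoice { Automatic, SingleColumn }
public enum PerformanceChoice { BestAccuracy, BestSpeed }

public static class OcrSettingsSketch
{
    // Picks settings from what is known about the input:
    // - layout known to be single-column -> SingleColumn, otherwise Automatic
    // - input known to be clean and high-resolution -> BestSpeed, otherwise BestAccuracy
    public static (SegmentationChoice Segmentation, PerformanceChoice Performance) Choose(
        bool knownSingleColumn, bool knownCleanHighRes)
    {
        var segmentation = knownSingleColumn
            ? SegmentationChoice.SingleColumn
            : SegmentationChoice.Automatic;
        var performance = knownCleanHighRes
            ? PerformanceChoice.BestSpeed
            : PerformanceChoice.BestAccuracy;
        return (segmentation, performance);
    }

    public static void Main()
    {
        // Nothing is known about the input, so the conservative defaults are chosen.
        var choice = Choose(knownSingleColumn: false, knownCleanHighRes: false);
        Console.WriteLine($"{choice.Segmentation}, {choice.Performance}");
    }
}
```

When nothing is known about the input, the helper falls back to Automatic and BestAccuracy, the same combination the sample configures.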
The OCRParams object defaults to English. To process documents in other languages, or multilingual documents, use the Languages parameter:
ocrParams.Languages = new List<LanguageSetting>
{
    new LanguageSetting(LanguageCode.English),
    new LanguageSetting(LanguageCode.French)
};
Adobe PDF Library's OCR capability is built on Tesseract, which supports a broad range of languages. Multiple languages can be active simultaneously.
Step 3: Iterate Pages and Apply OCR to Image Elements
for (int pageNum = 0; pageNum < doc.NumPages; pageNum++)
{
    Page page = doc.GetPage(pageNum);
    Content content = page.Content;
    for (int index = 0; index < content.NumElements; index++)
    {
        Element e = content.GetElement(index);
        if (e is Datalogics.PDFL.Image)
        {
            // Build a Form that holds the original image plus the recognized text
            Form form = ocrEngine.PlaceTextUnder((Image)e, doc);
            // Replace the image element with the Form
            content.RemoveElement(index);
            content.AddElement(form, index - 1);
        }
    }
    // Commit the modified content back to the page
    page.UpdateContent();
}
The core operation here is PlaceTextUnder. It takes an image element and returns a PDF Form object. That Form contains two layers: the original image (so the document looks unchanged) and the recognized text positioned beneath it (invisible but present in the text data). The original image element is removed and replaced with this composite Form.
This approach preserves the exact visual appearance of the input document while adding a complete text layer.
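The remove-then-add index arithmetic in the loop is easy to misread. The sketch below models the page content as a plain List<string> and assumes, as the index - 1 argument in the sample suggests, that AddElement inserts the new element after the index it is given. AddAfter and ReplaceAt are hypothetical helpers written for illustration; they are not part of Adobe PDF Library.

```csharp
using System;
using System.Collections.Generic;

public static class ReplaceInPlaceSketch
{
    // Hypothetical stand-in for Content.AddElement, under the assumption that
    // it inserts the item AFTER position addAfter. Passing -1 inserts at the start.
    public static void AddAfter(List<string> content, string item, int addAfter)
    {
        content.Insert(addAfter + 1, item);
    }

    // Mirrors the sample: remove the element at index, then add the
    // replacement after index - 1 so it takes over the vacated slot.
    public static void ReplaceAt(List<string> content, int index, string replacement)
    {
        content.RemoveAt(index);
        AddAfter(content, replacement, index - 1);
    }

    public static void Main()
    {
        var content = new List<string> { "path", "image", "text" };
        ReplaceAt(content, 1, "form(image+ocr)");
        // The form now sits where the image was; neighbours keep their order.
        Console.WriteLine(string.Join(", ", content));
    }
}
```

Under that assumption the element count is unchanged after the swap, so the loop counter moves on to the next element without skipping or revisiting anything.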
Step 4: Save the Output
    doc.Save(SaveFlags.Full, "searchable-output.pdf");
}
The output PDF is written to disk. The file visually matches the input. The difference is in the document structure: pages that contained only images now also contain a hidden text layer that makes the content searchable, selectable, and extractable.
What to Expect From the Output
After running either OCR sample, the output PDF will:
• Look identical to the input. No visual changes are made to the page layout, images, or formatting.
• Be fully text-searchable. Open the output in any PDF viewer and use Ctrl+F / Cmd+F to search for words from the original scanned content.
• Support text selection and copy/paste. Users can highlight and copy recognized text directly from the PDF.
• Support programmatic text extraction. Tools like Adobe PDF Library's text extraction APIs can now pull the text content from the document.
• Retain the original file structure. Bookmarks, annotations, metadata, and other document features are preserved.
If accuracy is lower than expected on a specific document, the most common causes are low-resolution source images (below 150 DPI), heavy background noise or toner artifacts, or unusual fonts. Increasing scan resolution before OCR processing typically has the largest single impact on recognition quality.
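To check a scan against these thresholds before running OCR, the effective resolution can be estimated from the image's pixel width and the width it occupies on the page. The following is a minimal sketch using only the .NET base library. ScanResolutionSketch and its members are hypothetical names, and the pixel and point values would come from whatever image inspection your pipeline already performs. It relies on the default PDF unit of 72 points per inch.

```csharp
using System;

public static class ScanResolutionSketch
{
    // Thresholds taken from the article: 150 DPI minimum, 300 DPI recommended.
    public const double MinimumDpi = 150.0;
    public const double RecommendedDpi = 300.0;

    // Default PDF user space uses 72 points per inch.
    public static double PointsToInches(double points)
    {
        return points / 72.0;
    }

    // Effective DPI = pixels along one edge / physical length of that edge in inches.
    public static double EffectiveDpi(int pixels, double inches)
    {
        if (pixels <= 0) throw new ArgumentOutOfRangeException(nameof(pixels));
        if (inches <= 0) throw new ArgumentOutOfRangeException(nameof(inches));
        return pixels / inches;
    }

    // Buckets a resolution against the two thresholds above.
    public static string Classify(double dpi)
    {
        if (dpi < MinimumDpi) return "below minimum";
        if (dpi < RecommendedDpi) return "acceptable";
        return "recommended";
    }

    public static void Main()
    {
        // A 2550-pixel-wide image placed across a 612-point (8.5 inch) page.
        double dpi = EffectiveDpi(2550, PointsToInches(612));
        Console.WriteLine($"{dpi} DPI: {Classify(dpi)}");
    }
}
```

Pages classified as "below minimum" are candidates for rescanning or preprocessing rather than for a straight OCR pass.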
Running the Samples
Clone or download the repository, then navigate to the sample you want to run:
OpticalCharacterRecognition/
    AddTextToDocument/
    AddTextToSinglePage/
Each sample directory contains a .csproj file. Install the package via NuGet:
dotnet add package Datalogics.PDFL
Then build and run:
cd AddTextToDocument
dotnet build
dotnet run
On first run, the library will prompt for a free trial activation key. You can obtain one at datalogics.com/adobe-pdf-library. Alternatively, set the key in code before instantiating the library:
Library.LicenseKey = "xxxx-xxxx-xxxx-xxxx";
using (Library lib = new Library()) { ... }
The output PDF is written to the directory the application runs from.
FAQ: PDF OCR with Adobe PDF Library
What types of PDFs need OCR?
Any PDF where the text content exists only as a raster image rather than embedded text data. This includes scanned documents, photographed pages, PDFs exported from certain legacy applications, and any PDF where you cannot select or search text. You can verify this quickly by trying to select text with your cursor. If you cannot, the PDF needs OCR.
Does OCR change how the document looks?
No. The PlaceTextUnder method places the recognized text as an invisible layer beneath the original image. The document's visual appearance is unchanged. Any reader that renders the PDF will continue to show the original scanned image.
What languages does the OCR engine support?
Adobe PDF Library's OCR capability is built on the Tesseract engine, which supports over 100 languages. You can process multilingual documents by specifying multiple language codes in OCRParams.Languages. English is the default if no language is specified.
What image quality is required for good OCR results?
A minimum resolution of 150 DPI is recommended, with 300 DPI producing significantly better results. Documents with heavy background noise, low contrast, or unusual fonts may need preprocessing before OCR. In most real-world workflows involving modern scanners, 300 DPI scans processed at Performance.BestAccuracy produce reliable results without preprocessing.
Is there a difference between AddTextToDocument and AddTextToSinglePage?
AddTextToDocument processes every page in the PDF that contains image elements. AddTextToSinglePage applies OCR to a specific page. Use the single-page variant when building pipelines that selectively process pages, or when you need to test OCR behavior on a specific page before running a full document job.
Can I use OCR output with text extraction APIs?
Yes. Once OCR has been applied, the PDF contains a standard text layer. Adobe PDF Library's text extraction functions, as well as other PDF text extraction tools, will be able to read the recognized content from the document.
Does this work with PDF/A or other PDF standards?
Adobe PDF Library supports PDF/A creation and conversion. If your workflow requires archival-compliant output, you can apply OCR and then convert or save the result as PDF/A. The text layer added by OCR is compatible with PDF/A requirements.
What if only some pages in my document need OCR?
The sample code iterates over all pages and applies OCR only to elements that are of type Image. Existing text elements are not altered, although any image on a page that also contains embedded text is still passed to the OCR engine. If you need finer control, the single-page sample demonstrates how to target a specific page index.
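When pages are chosen by a person or an upstream system, they usually arrive as 1-based page numbers, while the page loop in the walkthrough counts from 0. The following is a minimal sketch of that conversion using only the .NET base library. PageSelectionSketch and ToPageIndices are hypothetical names, and silently dropping out-of-range values is one possible policy, not the samples' behavior.

```csharp
using System;
using System.Collections.Generic;

public static class PageSelectionSketch
{
    // Converts 1-based page numbers to sorted, de-duplicated 0-based indices,
    // skipping any value that falls outside 1..numPages.
    public static List<int> ToPageIndices(IEnumerable<int> pageNumbers, int numPages)
    {
        var indices = new SortedSet<int>();
        foreach (int number in pageNumbers)
        {
            if (number >= 1 && number <= numPages)
            {
                indices.Add(number - 1);
            }
        }
        return new List<int>(indices);
    }

    public static void Main()
    {
        // Pages 3 and 1 are valid for a 5-page document; 12 is out of range
        // and the duplicate 3 is collapsed.
        List<int> indices = ToPageIndices(new[] { 3, 1, 3, 12 }, 5);
        Console.WriteLine(string.Join(", ", indices));
    }
}
```

Each resulting index can then be passed to the same per-page logic shown in Step 3 in place of the full page loop.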
Next Steps
Review the full sample code in the Datalogics GitHub repository and request a free trial of Adobe PDF Library to run the samples in your own environment.
For related capabilities, see the .NET text extraction samples, which demonstrate how to read the text layer from a PDF once OCR has been applied.