Extract Text from PDFs: Basics, Methods, and Resources for Developers

Published September 19, 2025

Whether you're working with scanned documents, complex layouts, or encoded text, PDF text extraction requires the right tools and techniques to get clean, usable content. In this post, we’ll cover the basics of PDF text extraction, explore different methods, and walk through our text extraction code samples and their use cases to help you find the right solution for your needs.

Common reasons to extract text from a PDF include:

  • To index the text
  • To analyze the text
    • with or without information about the positioning/style of the text

Main Elements of PDF Text Extraction

Text Extraction Techniques

  • Unicode Text Extraction: PDFs can store text using Unicode encoding, which allows for extracting text with proper character encoding.
  • OCR (Optical Character Recognition): In cases where the PDF contains scanned images or non-selectable text, OCR technology may be used to recognize and extract text from images.

Use Cases

  • Content Analysis: Text extraction is often used for content analysis, allowing applications to analyze and process the textual information within PDF documents.
  • Data Extraction: Extracting structured data from PDFs, such as tables or form data, is a common use case.
  • Searchable Archives: Making the content of PDFs searchable by extracting text allows users to find specific information within large collections of documents.

Challenges

  • Text Layout: PDFs can store text in a way that reflects the visual layout of the document, which can pose challenges in maintaining the original formatting during extraction.
  • Image-based Text: Some PDFs may contain text as images, requiring OCR for accurate text extraction.
  • Encrypted PDFs: Encrypted PDFs may require decryption before text extraction can take place.

Programming Interfaces

Most programming languages have libraries or APIs that facilitate PDF text extraction. These libraries provide methods to parse the PDF structure and extract its text content.

PDF Text Extraction Methods and Code Samples

The Adobe PDF Library SDK can handle nearly any use case for PDF text extraction. Below are examples of the code samples we have available for text extraction, along with their use cases. Please note: the samples discussed here are available for .NET/C#, and the snippets quoted below use C++ syntax; other languages are available on our GitHub homepage, including Java/Maven, Kotlin, and more.

Extracting Page Text

The TextExtract sample pulls text from a PDF file and exports it to a text file (TXT). It opens a PDF file called Constitution.PDF and creates an output file called TextExtract-untagged-out.txt. The export file includes page number references, and the text is produced using standard Times Roman encoding. The program also checks whether the original PDF file is tagged or untagged and includes a provision for working with tagged documents. Tagging is used to make PDF files accessible to people who are blind or have low vision. To learn more about PDF accessibility, check out our blog, PDFs are for Everyone: Understanding PDF/UA.

Extracting PDF Forms Data

The ExtractAcroFormFieldData sample shows how to extract text from the AcroForm fields in a PDF document. This is useful for those who work with fillable PDF forms and need to export the text in the AcroForm fields as a .JSON file that can be opened in a text editor or web browser.

Here’s what that portion of the code looks like:

const char *DEF_INPUT = "../../../../Resources/Sample_Input/ExtractAcroFormFieldData.pdf";  // Input document (PDF)
const char *DEF_OUTPUT = "ExtractAcroFormFieldData-out.json";  // Output document (JSON)

APDFLDoc inAPDoc(DEF_INPUT, true);

// This array will hold the JSON stream that we will print to the output JSON file.
json result = json::array();

// Create the TextExtract object
TextExtract textExtract(inAPDoc.getPDDoc());
std::vector extractedText = textExtract.GetAcroFormFieldData();

Searching for Patterns

The ExtractTextByPatternMatch sample searches the text of a document for patterns, such as phone numbers, and writes the matches to a .TXT file. For example, phone numbers in the U.S. are typically written as ###-###-####, but that format varies worldwide. The sample provides named patterns, so you can use ‘PHONE_PATTERN’ in the code instead of writing out the regular expression ((1-)?(\()?\d{3}(\))?(\s)?(-)?\d{3}-\d{4}) yourself.

Here’s how that looks in the context of the code:

const char *DEF_INPUT = "../../../../Resources/Sample_Input/ExtractTextByPatternMatch.pdf";  // Input document (PDF)
const char *DEF_OUTPUT = "ExtractTextByPatternMatch-out.txt";  // Output document (TXT)

// This sample will look for text that matches a phone number pattern
const char *DEF_PATTERN = regexPattern[PHONE_PATTERN];

You can also search for text in Chinese, Japanese, and Korean (CJK) with the ExtractCJKTextByPatternMatch code sample. With more than 1.5 billion people speaking those languages (and growing), organizations must be able to extract tens of thousands of distinct characters correctly.

Extract Text by Regions

The ExtractTextByRegion sample extracts text from a specific region of a page in a PDF document and saves it to a .TXT file. For example, a company with thousands of invoices that all place the invoice number in the same spot can pull those numbers from that region, and the IRS could use the same approach to pull Social Security numbers from the relevant section of its 1040 forms.

Extract Text from Multiple Regions

The ExtractTextFromMultiRegions sample processes the PDF files in a folder, extracts text from multiple specific regions of their pages, and saves the text to a .CSV file. For example, this sample can create a single file with the invoice numbers, dates, order numbers, customer IDs, and totals from all the invoices in the folder, so you have all the data you need in one view.

Consolidating Annotations

PDFs can contain thousands of annotations, and the ExtractTextFromAnnotations sample shows how to pull that information out and save it to a separate JSON file. For example, contract negotiations may include comments and questions that have been accepted or rejected, and this sample can extract that data. Here's a look at a section of the code:

const char *DEF_INPUT = "../../../../Resources/Sample_Input/sample_annotations.pdf";  // Input document (PDF)
const char *DEF_OUTPUT = "ExtractTextFromAnnotations-out.json";  // Output document (JSON)

json textObject = json::object();
textObject["annotation-type"] = extractedText[textIndex].type;
textObject["annotation-text"] = extractedText[textIndex].text;
result.push_back(textObject);

Style Preservation

The ExtractTextPreservingStyleAndPositionInfo sample extracts all text from the PDF and writes it to a .JSON file together with information about each piece of text, such as its position, style, color, and font size, so the styling can be preserved.

Additional Text-based Use Cases and Code Samples

Aside from the text extraction samples above, we also have related samples if you need to add text elements to your PDFs. All of these can be found in GitHub under the "Text" folder.

Add Glyphs

A glyph is the specific shape a font uses to draw a character: a letter, number, punctuation mark, or other symbol. Adding glyphs directly lets a PDF show symbols or characters that are not available in standard text fonts. Glyphs are most common in technical documents, mathematical papers, or specialized industries where unique symbols are required (design, print media, finance, and legal to name a few). Designers especially might add glyphs to enhance the visual appeal of a PDF because glyphs can be used as decorative elements, bullets, or icons to improve readability and design consistency. Additionally, some languages have characters or symbols that aren't supported by common fonts. Adding specific glyphs ensures accurate representation of these characters, which is crucial for preserving the meaning and readability of the text.

Add Unicode Text

Unicode provides a comprehensive character encoding standard that supports virtually all written languages, including those with non-Latin scripts such as Chinese, Arabic, or Devanagari. By using Unicode, a PDF can include text in multiple languages, ensuring accurate representation and readability. Unicode also ensures that the text appears consistently across different devices, operating systems, and PDF viewers, making it super accessible. This is particularly important in global communications where the same document might be viewed on various platforms. Bonus points to Unicode for being searchable as well!

Add Vertical Text

Vertical writing is a traditional method for some languages, especially East Asian languages like Chinese, Japanese, and Korean. These languages often have texts that are read from top to bottom, right to left. Adding vertical text to a PDF is essential for documents written in these languages to preserve their cultural and linguistic norms.

Regex Extract Text

Regular expressions (regex) allow for highly precise and flexible search patterns. You can search for variations of a phrase, such as different spellings, word forms, or even partial matches that might not be caught by a simple keyword search. When working with large documents or multiple PDFs, regex can quickly find all instances of a phrase, even if it's embedded in different contexts. This saves time compared to manually searching through the document.

Regex PDF Text Search

This sample searches for phrases or text patterns in a PDF input document. It supplies sample regular expressions for finding phone numbers, email addresses, or URLs, and you can use them or create your own. You can search the entire PDF document or provide a page range for your search. The program generates an output PDF document that matches the input file except that the search results appear highlighted. You can enter the name of the input file you plan to use and the name of the output file. The sample uses PDDocTextFinder to find instances of a phrase or pattern in a PDF input document.

But wait, there's more! You can also use APDFL to do the following with text in a PDF:

  • Unicode Text - Illustrates working with Unicode text, adding text in several languages to a PDF page.
  • Text Select Enum - Uses the PDTextSelect feature to show how to select the text within a defined area of a PDF page.
  • Text Search - Illustrates how to find and highlight every example of a specific word in an input PDF document.
  • Hello Japan - This sample program is effectively a version of Hello World, except that when run it generates a PDF document with the text “Hello Japan,” using Japanese Kanji characters.

Streamline Your Development Workflow with Adobe PDF Library

Check out the Datalogics GitHub Repository for more information on Adobe PDF Library and samples for the creation, modification and management of text in PDF documents. Start a free trial and discover how our PDF SDK can help you minimize your development time.
