PDF Text Sample Code: Adding and Extracting Text
Text is arguably one of the most important elements of a PDF. When the text doesn't look or work correctly, it can derail your whole workflow. A good PDF tool helps by giving you many options for adding, editing, and manipulating PDF text. Adobe PDF Library is one of those tools, so let's look at what you can do with text in the Adobe PDF Library SDK by walking through what our code samples have to offer.
Add Glyphs
Glyphs are the individual shapes a font uses to draw characters: letters, numbers, punctuation marks, and other symbols. Adding glyphs directly lets you place specific symbols or characters that are not available in standard text fonts. This is most common in technical documents, mathematical papers, or specialized industries where unique symbols are required (design, print media, finance, and legal, to name a few). Designers especially might add glyphs to enhance the visual appeal of a PDF, because glyphs can be used as decorative elements, bullets, or icons to improve readability and design consistency. Additionally, some languages have characters or symbols that aren't supported by common fonts. Adding specific glyphs ensures accurate representation of these characters, which is crucial for preserving the meaning and readability of the text.
Add Unicode Text
Unicode provides a comprehensive character encoding standard that supports virtually all written languages, including those with non-Latin scripts such as Chinese, Arabic, or Devanagari. By using Unicode, a PDF can include text in multiple languages, ensuring accurate representation and readability. Unicode also helps the text appear consistently across different devices, operating systems, and PDF viewers, which makes the document more accessible. This is particularly important in global communications, where the same document might be viewed on various platforms. Bonus points to Unicode: the text stays searchable as well!
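To make the "one standard, many scripts" idea concrete, here is a minimal Python sketch using only the standard library. It is not the Adobe PDF Library API; the names `SAMPLES`, `describe`, and `utf16be_hex` are made up for illustration. It lists the Unicode code point and official name of each character, and shows the UTF-16BE bytes for the same text (UTF-16BE being one of the encodings the PDF format allows for text strings such as document metadata).

```python
import unicodedata

# Illustrative greetings in three non-Latin scripts (hypothetical sample data).
SAMPLES = {
    "Chinese": "你好",
    "Arabic": "مرحبا",
    "Devanagari": "नमस्ते",
}


def describe(text):
    """Return a (code point, Unicode name) pair for every character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


def utf16be_hex(text):
    """Return the UTF-16BE encoding of text as an uppercase hex string."""
    return text.encode("utf-16-be").hex().upper()


if __name__ == "__main__":
    for script, greeting in SAMPLES.items():
        print(script, utf16be_hex(greeting))
        for code_point, name in describe(greeting):
            print("   ", code_point, name)
```

Every character, whatever its script, gets one stable code point, which is what lets the same text be stored, searched, and copied reliably.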
Add Vertical Text
Vertical writing is a traditional method for some languages, particularly East Asian languages like Chinese, Japanese, and Korean. These languages often have texts set in columns that are read from top to bottom, with the columns running from right to left. Adding vertical text to a PDF is essential for documents written in these languages to preserve their cultural and linguistic norms.
Extract AcroForm Field Data
AcroForm fields in PDFs are often used for collecting user input, such as in forms, surveys, or applications. Extracting this text allows organizations to gather and analyze the collected data, whether for statistical analysis, customer feedback, or decision-making processes. In scenarios where a large number of forms need to be processed, such as in government agencies or large enterprises, extracting text from AcroForm fields enables automated data processing. This can streamline workflows, reduce manual entry errors, and increase efficiency by integrating the extracted data into databases, spreadsheets, or other software systems.
Text Extract (Tagged and Untagged PDFs)
Extracting text from both tagged and untagged PDF documents serves various purposes, depending on the type of PDF and the user's needs. Tagged PDFs are designed to be more accessible to users with disabilities, such as those who rely on screen readers. Extracting text from tagged PDFs ensures that the content can be presented in a way that maintains its semantic meaning and structure, making it easier to understand and navigate. In situations where you only need the raw text from a document, such as for data entry, analysis, or reference, extracting text from an untagged PDF is a straightforward way to obtain the necessary information.
Extract Text By Pattern Match
Regular expressions can be used to search for specific types of data within a large text, such as email addresses, phone numbers, dates, URLs, or social security numbers. This is useful when you need to extract certain information from a document or dataset quickly. In web development or data entry, regex is often used to validate whether the input matches a specific pattern, ensuring that users provide data in the correct format (e.g., a valid email address or ZIP code).
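As a rough illustration of pattern matching on text that has already been extracted, here is a small Python sketch. It uses only the standard `re` module, not the Adobe PDF Library API, and the pattern set (`PATTERNS`) and the `extract_by_pattern` helper are hypothetical. The patterns are deliberately simple and would need tightening for production use.

```python
import re

# Deliberately simple example patterns; real-world data needs stricter ones.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract_by_pattern(text):
    """Return {pattern name: list of matching substrings} for the given text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


if __name__ == "__main__":
    page_text = (
        "Contact jane.doe@example.com or call 555-123-4567 "
        "before 2024-03-15."
    )
    for name, matches in extract_by_pattern(page_text).items():
        print(name, matches)
```

Keeping the patterns in a named dictionary makes it easy to add your own without touching the extraction logic.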
Extract PDF Text By Region
In documents with a consistent layout, such as forms, invoices, or reports, specific information (like names, dates, or totals) may always appear in the same location. Extracting text from these regions allows for efficient data collection without manually sifting through the entire document. For documents where only certain sections are relevant (e.g., signatures, addresses, or specific paragraphs), extracting text from a particular region ensures that only the necessary data is captured and processed.
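The idea behind region extraction can be sketched in a few lines of Python. This is not the Adobe PDF Library API: the `Word` type, the coordinates, and the helper names are assumptions for illustration. It presumes some extractor has already produced words with bounding boxes in PDF-style coordinates (origin at the bottom left), and it keeps only the words that fall entirely inside a rectangle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word plus its bounding box (hypothetical extractor output)."""
    text: str
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


def words_in_region(words, region):
    """Keep the words whose boxes lie fully inside region (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = region
    return [
        w for w in words
        if w.x0 >= rx0 and w.y0 >= ry0 and w.x1 <= rx1 and w.y1 <= ry1
    ]


def text_in_region(words, region):
    """Join the region's words in reading order: top to bottom, left to right."""
    inside = sorted(words_in_region(words, region), key=lambda w: (-w.y0, w.x0))
    return " ".join(w.text for w in inside)


# Made-up page content and a made-up rectangle around the invoice total.
PAGE_WORDS = [
    Word("Invoice", 72, 720, 120, 732),
    Word("#1042", 125, 720, 160, 732),
    Word("Total:", 400, 100, 440, 112),
    Word("$250.00", 445, 100, 500, 112),
]
TOTAL_REGION = (390, 90, 520, 120)

if __name__ == "__main__":
    print(text_in_region(PAGE_WORDS, TOTAL_REGION))
```

Requiring a word to sit fully inside the rectangle is one design choice; counting words that merely overlap the rectangle is an equally valid one when layouts shift slightly between documents.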
Extract Text From Annotations
Sometimes you may need to extract text from the annotations in a PDF document and save the text to a JSON file, since JSON is a format widely used in APIs and data exchange. Once annotation data is stored as JSON, it can easily be integrated into other software systems for further processing, reporting, or archival purposes. If you need to consolidate all the notes that were added as annotations in your PDFs, this sample can help you pull out that data for easy viewing and/or printing.
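For a sense of what the JSON side of this might look like, here is a minimal Python sketch using the standard `json` module. It does not use the Adobe PDF Library API, and the record fields (`page`, `type`, `text`) are an assumed shape rather than the sample's actual output format. It takes annotation records that have already been read from a PDF, drops the ones with no text, and serializes the rest.

```python
import json


def annotations_to_json(annotations):
    """Serialize annotation records that carry text into a JSON string."""
    records = [
        {"page": a["page"], "type": a["type"], "text": a["text"].strip()}
        for a in annotations
        if a.get("text", "").strip()  # skip annotations without any text
    ]
    # ensure_ascii=False keeps non-Latin annotation text readable in the file.
    return json.dumps(records, indent=2, ensure_ascii=False)


# Hypothetical records, standing in for what an extractor might return.
SAMPLE_ANNOTATIONS = [
    {"page": 1, "type": "Text", "text": "  Check this total. "},
    {"page": 2, "type": "Highlight", "text": ""},
    {"page": 3, "type": "FreeText", "text": "Approved"},
]

if __name__ == "__main__":
    print(annotations_to_json(SAMPLE_ANNOTATIONS))
```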
Extract Text From Multi-Regions
This sample processes example invoice PDF files in a folder that share the same page layout, extracts text from specific regions of their pages, and saves the text to a CSV file. All the invoice numbers, dates, order numbers, customer IDs, and totals from the invoices in the folder are saved in convenient CSV format.
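The CSV step can be pictured with a short Python sketch built on the standard `csv` module. This is not the sample's code or the Adobe PDF Library API: the column names in `FIELDS`, the file names, and the `invoices_to_csv` helper are assumptions, and the per-region text is presumed to have been extracted already.

```python
import csv
import io

# Assumed column layout: one column per extracted region, plus the file name.
FIELDS = ["file", "invoice_number", "date", "order_number", "customer_id", "total"]


def invoices_to_csv(extracted):
    """Turn {file name: {region name: text}} into CSV text, one row per file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for file_name in sorted(extracted):
        row = {"file": file_name}
        row.update({k: v.strip() for k, v in extracted[file_name].items()})
        writer.writerow(row)  # regions missing from a file are left blank
    return buffer.getvalue()


# Made-up region text for two invoices that share a layout.
SAMPLE_INVOICES = {
    "inv_002.pdf": {
        "invoice_number": "1002", "date": "2024-01-09",
        "order_number": "PO-81", "customer_id": "C-40", "total": "$89.99",
    },
    "inv_001.pdf": {
        "invoice_number": "1001", "date": "2024-01-05",
        "order_number": "PO-77", "customer_id": "C-12", "total": " $1,250.00 ",
    },
}

if __name__ == "__main__":
    print(invoices_to_csv(SAMPLE_INVOICES))
```

Using `csv.DictWriter` rather than joining strings by hand means values that contain commas, such as a total of $1,250.00, are quoted correctly.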
Regex Extract Text
Regular expressions (regex) allow for highly precise and flexible search patterns. You can search for variations of a phrase, such as different spellings, word forms, or even partial matches that might not be caught by a simple keyword search. When working with large documents or multiple PDFs, regex can quickly find all instances of a phrase, even if it's embedded in different contexts. This saves time compared to manually searching through the document.
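Here is a small Python sketch of what "variations of a phrase" can mean in practice. It relies only on the standard `re` module rather than the Adobe PDF Library API, and the pattern and the `find_variants` helper are illustrative choices. One pattern covers both the American and British spellings of "color" plus a few word forms, ignoring case.

```python
import re

# "colou?r" accepts both spellings; the optional group accepts common endings.
COLOR_VARIANTS = re.compile(r"\bcolou?r(?:s|ed|ing)?\b", re.IGNORECASE)


def find_variants(text):
    """Return (start offset, matched text) for every variant found in text."""
    return [(m.start(), m.group()) for m in COLOR_VARIANTS.finditer(text)]


if __name__ == "__main__":
    sample = "The Color palette uses coloured tiles; colours and coloring vary."
    for offset, word in find_variants(sample):
        print(offset, word)
```

A plain keyword search for "color" would miss "coloured" and "colours" here, while the word boundaries keep the pattern from matching unrelated words such as "colorful".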
Regex PDF Text Search
This sample searches for phrases or text patterns in a PDF input document. It supplies sample regular expressions for finding phone numbers, email addresses, or URLs, and you can use them or create your own. You can search the entire PDF document or provide a page range for your search. The program generates an output PDF document that matches the input file, except that the search content appears highlighted. You can specify the name of the input file you plan to use and the name of the output file. The sample uses PDDocTextFinder to find instances of a phrase or pattern in a PDF input document.
But wait, there's more!
You can also use Adobe PDF Library (APDFL) to do the following with text in a PDF:
- Unicode Text - Illustrates working with Unicode text, adding text in several languages to a PDF page.
- Text Select Enum - Uses the PDTextSelect feature to show how to select the text within a defined area of a PDF page.
- Text Search - Illustrates how to find and highlight every example of a specific word in an input PDF document.
- Hello Japan - This sample program is effectively a version of Hello World, except that when run it generates a PDF document with the text “Hello Japan,” using Japanese Kanji characters.
So, there you have it - all the cool stuff you can do with text in APDFL! But don't take our word for it - start a free trial of Adobe PDF Library today to see for yourself.