Recover editable text from PDFs
Datalogics PDF Alchemist is a new C/C++ SDK for intelligently extracting text and images from PDFs and exporting them to HTML5 or EPUB. It employs sophisticated techniques to identify and reconstruct “text flows” within the PDF. These text flows are often lost in PDFs, yet they are vital for repurposing the information locked within the PDF.
This short demo walks through using PDF Alchemist to convert a PDF with various formatting (images, tables, underlining, etc.) into HTML.
Display PDF content as reflowable text on mobile devices
In certain instances, such as proofing page layout of a PDF flyer or brochure, page fidelity is important. In other situations, proofing the content of a PDF is more important. Consider:
Besides making the content easier to access, the HTML output is generally smaller than the PDF, which can improve performance on mobile devices.
Reconstruct source documents when the original was lost
In the business world, it’s not uncommon to “lose” source documents: an old product datasheet, an old report or white paper, etc. And sometimes there’s a need to update those: update terms and conditions on a contract, translate a white paper to another language, etc.
While touching up text in Acrobat is possible, it is not effective for larger edits. The output from PDF Alchemist can be easily loaded into a word processor or desktop publishing application for editing.
Address accessibility requirements
Many organizations are required to provide documentation in an “accessible” manner, for example to conform to the US Section 508 accessibility guidelines. This often means delivering documentation in a format compatible with screen readers or other assistive technology. The key to this is being able to identify “text flow” information (i.e., to programmatically “read” the text of a document in a logical order, as a human would).
PDF Alchemist can recover these text flows as reflowable HTML. Once the text flow information is recovered, it can be used to help create PDF/UA documents or final-form HTML output.
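To make “reading the text in a logical order” concrete, here is a minimal Python sketch that walks reflowable HTML in document order and collects its text blocks, roughly the way a screen reader would step through them. It uses only the Python standard library and does not call PDF Alchemist. The `SAMPLE` markup, the `BLOCK_TAGS` set, and the `reading_order` helper are illustrative assumptions, not actual PDF Alchemist output or API.

```python
from html.parser import HTMLParser

# Assumption: these tags mark the end of one block of text in reading order.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "td", "th"}


class ReadingOrderParser(HTMLParser):
    """Collect text blocks in the order they appear in the HTML source."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished text blocks, in document order
        self._current = []  # text fragments of the block being read

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self._current.append(data.strip())

    def handle_endtag(self, tag):
        # Closing a block-level tag ends the current block; inline tags
        # such as <u> leave the block open.
        if tag in BLOCK_TAGS and self._current:
            self.blocks.append(" ".join(self._current))
            self._current = []


def reading_order(html):
    """Return the text blocks of `html` in source (reading) order."""
    parser = ReadingOrderParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical sample markup; not actual PDF Alchemist output.
SAMPLE = """
<h1>Product Datasheet</h1>
<p>The widget ships in <u>two</u> sizes.</p>
<ul><li>Small</li><li>Large</li></ul>
"""

if __name__ == "__main__":
    for block in reading_order(SAMPLE):
        print(block)
```

Because reflowable HTML already stores text in flow order, a simple top-to-bottom walk is enough here. Doing the same against raw PDF content would first require reconstructing that order. Note that this sketch joins fragments with a single space, so punctuation that directly follows an inline tag would pick up an extra space.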