Tables in PDFs

Mauricio asked 2 years ago

I am testing pdfalchemist.exe to extract text from a PDF file that contains tables. The conversion to HTML just writes the table as an image; the conversion to XML shows the text from the table, but:

  • it loses the line breaks;
  • it loses the column alignment when there are empty columns.

Is there a way to address these two issues?

Datalogics Staff replied 2 years ago

What version of PDF Alchemist are you using and on what platform?
I assume the problems are specific to one PDF file. Can you describe the type of table that has this problem (size, complexity)? Does it span pages?

Datalogics Staff replied 2 years ago

Also, are you using any of the OCR options? Can you show us the XML output for a line or two of the table?

1 Answer
Best Answer
Datalogics Staff answered 2 years ago

Please review the documentation for the -purpose parameter. Note that HTML and EPUB output default to “balanced”, which may write tables as images to better preserve the original appearance. XML output uses “indexing” as its default, which preserves text for searching/indexing workflows, but the output might differ significantly from the PDF's appearance.
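
As a rough illustration of overriding the default, the Python sketch below only assembles a command line that passes -purpose explicitly. The -purpose parameter and the values “balanced” and “indexing” come from the answer above; the argument order (options first, then input and output paths), the helper name build_command, and the file names are assumptions made for illustration, so check them against the PDF Alchemist documentation before running anything.

```python
from typing import List


def build_command(input_pdf: str, output_path: str, purpose: str,
                  exe: str = "pdfalchemist.exe") -> List[str]:
    """Return an argument list that sets -purpose explicitly.

    Assumption: options come before the input and output paths. The real
    argument order should be confirmed in the PDF Alchemist documentation.
    """
    return [exe, "-purpose", purpose, input_pdf, output_path]


if __name__ == "__main__":
    # Hypothetical file names. Requesting "indexing" for HTML output instead
    # of the "balanced" default is the experiment suggested by the answer.
    cmd = build_command("tables.pdf", "tables.html", "indexing")
    print(" ".join(cmd))
```

The list form can be handed to subprocess.run once the argument order has been verified; whether “indexing” keeps the table as text in HTML output for a given file still has to be tested.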
