In the summer of 2016, Joel Geraci started a series of blog posts about optimizing (or repurposing, as he calls it) PDF files. He talks about optimizing PDF files to serve a specific purpose, then dives into details on image downsampling, font subsetting, and general object clean-up. If you haven’t read these and would like to know more about the exciting world of PDF optimization, I’d suggest you take a look.
Since those posts were written, and quite honestly before that, we’ve seen quite a few questions about optimizing PDF files, and specifically about reducing PDF file size. A lot of our customers and prospects seem to be running into the same problem: some PDF files take up a lot of space. That makes them slow to send over a network and expensive to store. In the past, we would create samples that targeted a specific problematic subset of PDF files, but then it occurred to us that this problem requires a more complete solution. Fast forward to September of 2016: the Datalogics PDF optimization tool is released, available with a copy of the Adobe PDF Library. Its main goal is to reduce file size, making PDF files more convenient to send over networks and to keep in long-term storage. So how does the tool work? To answer that question, I’ll first have to go over the most common reasons PDF files can be larger than they actually need to be.
Why do some PDF files take more space than others?
PDF files that seemingly don’t have much content can end up taking a huge amount of space. Why is that? We see this question fairly often. In the general case, the cause is one of two things: images or fonts (there are really more than two, but for convenience we will discuss the reasons in these two general categories). Images can be uncompressed or poorly compressed, and their pixel density can be out of proportion with the purpose of the document. There can also be multiple copies of the same image just sitting in a document, taking up space. Similarly, embedded fonts can be incorrectly subsetted or not subsetted at all, and a document can even contain multiple copies of the exact same font. Let’s talk about each of these problems separately, along with the solution we have for you.
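To put a number on "pixel density out of proportion," here is a back-of-the-envelope sketch in Python. The page size, the resolutions, and the `raw_image_bytes` helper are illustrative assumptions of mine, not measurements from a real file or part of any library.

```python
def raw_image_bytes(width_in, height_in, dpi, channels=3, bits_per_channel=8):
    """Uncompressed size, in bytes, of an image covering the given area."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels * bits_per_channel // 8


# A full US Letter page (8.5 x 11 in) stored as 8-bit RGB.
print_quality = raw_image_bytes(8.5, 11, dpi=600)   # meant for a printing press
screen_quality = raw_image_bytes(8.5, 11, dpi=150)  # plenty for on-screen reading

print(f"600 dpi: {print_quality / 1e6:.1f} MB uncompressed")
print(f"150 dpi: {screen_quality / 1e6:.1f} MB uncompressed")
print(f"ratio:   {print_quality // screen_quality}x")
```

Dropping the resolution by a factor of four in each direction shrinks the raw pixel data by a factor of sixteen, before any compression is applied. That is why a single oversized scan can dominate the size of an otherwise small document.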
Image downsampling is not a new problem. Solutions for image downsampling and compression are widely available. The real issue here is taking images out of the PDF document while retaining the correct colors, then inserting the downsampled and compressed images back into the PDF document while preserving the correct color space and transparency. This difficult task is handled by the Adobe Color Engine working under the hood of the Adobe PDF Library.
The image above is an original test image from the Altona test suite version 1.2. It demonstrates “insert color jargon here,” which we won’t talk about in this article. For details on what this image is meant to test, take a look at the Altona document here. Pages 10 and 11 go in depth about color consistency, duotone images, spot colors, and overprint. The bottom left image has been optimized using the Adobe PDF Library’s optimization tool. The color space and transparency are identical to the input image. The bottom right image shows the same input PDF optimized by a different optimization tool. While the PDF file size has been reduced, the colors look oversaturated. On top of that, the spot colors are not represented correctly.
As PDF files are created, a number of fonts get embedded in them. This is a standard step in the creation process, and it ensures PDF files retain their visual appearance across platforms that might be missing those fonts. This is great, but fonts can sometimes take up a lot of space, even more so if the tool used decides to embed multiple copies of the exact same font. This happens for a multitude of reasons, one of which is merging PDF documents. How do we fix that? To start, we can use the PDF optimization tool to subset a fully embedded font. We will go over each glyph and retain only the ones the document actually needs. This is a complicated task, and not one that most tools out there complete correctly. Mishaps with third-party tools are common. They range from words missing characters to the whole document looking like it’s in a different language.
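The core idea of subsetting can be sketched in a few lines of Python. This is a toy model under loud assumptions: the "font" is a plain dict from character to fake outline bytes, and `subset_font` is a hypothetical helper of mine, not the Adobe PDF Library API. Real fonts add glyph indices, composite glyphs, and encoding tables that all have to stay consistent, which is exactly where the mishaps above come from.

```python
def subset_font(glyphs, text):
    """Keep only the glyphs that the text actually uses.

    `glyphs` is a hypothetical stand-in for an embedded font:
    a dict mapping each character to its outline data.
    """
    used = set(text)
    missing = used - set(glyphs)
    if missing:
        # Silently dropping these is how words end up missing characters.
        raise KeyError(f"font has no glyph for: {sorted(missing)}")
    return {ch: data for ch, data in glyphs.items() if ch in used}


# Toy "font": 95 printable ASCII glyphs, each with 200 bytes of fake outline data.
full_font = {chr(code): bytes(200) for code in range(32, 127)}
subset = subset_font(full_font, "Hello, PDF!")

full_size = sum(len(data) for data in full_font.values())
subset_size = sum(len(data) for data in subset.values())
print(f"{len(full_font)} glyphs, {full_size} bytes -> "
      f"{len(subset)} glyphs, {subset_size} bytes")
```

In this toy case the short string only needs 10 of the 95 glyphs, so the subset carries roughly a tenth of the outline data. The savings on a real font with thousands of glyphs can be far larger.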
Object cleanup and stream compression
At the beginning of the article, I mentioned that a PDF file can contain duplicate font and image streams. The PDF optimization tool will go over those and remove any duplicates. Furthermore, it will compress all uncompressed streams.
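Here is a minimal sketch of both steps in Python, assuming the streams have already been pulled out of the file into a dict keyed by object number. The `dedupe_and_compress` function and the sample data are hypothetical and only illustrate the idea. They are not how the Adobe PDF Library does it. A real optimizer would also compare each stream’s dictionary (color space, dimensions, filters) before calling two streams identical, and would rewrite every reference to point at the surviving copy. The compression uses zlib, the same deflate format that sits behind the FlateDecode filter in PDF.

```python
import hashlib
import zlib


def dedupe_and_compress(streams):
    """Drop byte-identical streams and compress the survivors.

    Returns (compressed, remap): `compressed` maps each kept object id to
    its deflate-compressed bytes, and `remap` maps every original object id
    to the id of the stream that now stands in for it.
    """
    first_seen = {}  # content digest -> id of the first stream with that content
    compressed = {}
    remap = {}
    for obj_id, data in streams.items():
        digest = hashlib.sha256(data).digest()
        if digest not in first_seen:
            first_seen[digest] = obj_id
            compressed[obj_id] = zlib.compress(data, 9)
        remap[obj_id] = first_seen[digest]
    return compressed, remap


# Hypothetical input: the same logo embedded twice, plus a small content stream.
logo = b"\x00\x7f\xff" * 4000
streams = {1: logo, 2: b"BT /F1 12 Tf (Hello) Tj ET", 3: logo}

compressed, remap = dedupe_and_compress(streams)
before = sum(len(data) for data in streams.values())
after = sum(len(data) for data in compressed.values())
print(f"kept objects {sorted(compressed)}, remap {remap}")
print(f"{before} bytes -> {after} bytes")
```

Hashing the contents keeps the duplicate check cheap even for large images, since each stream is read once and compared by digest instead of byte by byte against every other stream.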
PDF optimization is a fairly involved process. While we have a lot of knobs and levers to fine-tune it, we also have a sensible set of defaults. This makes it easier to get started right out of the box and optimize PDF files.
If you have any questions about PDF Optimization, leave us a comment below.