Archiving is obviously near and dear to us here at Datalogics because of our deep commitment to PDF technologies. I thought it would be interesting to take us on the road to PDF from an archiving perspective.
Before computers, the only option for archiving documents was physical media. Paper and microfilm are prime examples of physical archives. Though microfilm archives might be a bit before my time, I’ve seen, and had the “pleasure” of experiencing, paper archives first hand.
For hundreds of years, paper was the only way to keep documents. Paper archives are delicate, and need a steady climate-controlled environment if they are to have any expectation of longevity. Even a small leak from a faucet can have a detrimental impact. In fact, it is recommended that a stable temperature no higher than 70°F, and a stable relative humidity between a minimum of 30% and a maximum of 50%, be used in locations where paper archives are stored. But even that will not ensure that your documents are fit for use when needed.
Backup offsite copies are almost impossible with hard copy archives, which means they are at risk due to disasters like floods, fires, etc. I know someone whose birth records were located only in one small local office in Greece that burnt down – imagine the struggles after that.
There is also the expense around training personnel and maintaining the archive, and the even bigger expense of the square footage necessary to hold all that paper. Another drawback of physical archives is access control. Imagine this situation: you need a copy of your original birth certificate. You go to the archive to get one. A person retrieves it for you and you get a precious paper copy. Then they misfile it, and now your birth certificate is lost. When manual labor is involved, mistakes can happen.
Microfiche and microfilm are legacy technologies that were used instead of, or in addition to, paper. These archives are infrequently accessed, and frankly, out of sight and out of mind. Large quantities of vital business and government records, many of them one of a kind, still exist solely on microfiche and microfilm. Deterioration, bending, and tearing can happen, and as the archives begin to deteriorate, they give off a vinegar smell, which means the film is becoming brittle and can start breaking apart. In addition, these archives are very susceptible to environmental conditions. High temperatures and high humidity can encourage fungus growth, and once that happens, there is no way to recover the image.
A large portion of these problems were resolved when TIFF came out in 1986. TIFF was specifically designed as a common scanned image file format. As one of the first available electronic formats specifically tailored to black and white documents, it quickly caught on with the archiving community. It solved a lot of the problems revolving around physical archives because, after all, it is a physical representation of the paper original. Electronic files are much easier to store than paper ones. They can also be backed up and accessed remotely.
As great as TIFF is for archiving, it too comes with drawbacks. The latest TIFF release, TIFF 6.0, came out in 1992. Color is still not well supported, as it was an afterthought to what was initially a black and white only format. As such, color compression in TIFF is not great, and certainly not up to date. TIFF is not an open standard. It is at the mercy of the company that owns it, which is not great for long term archiving. There are also multiple third-party extensions of TIFF that are often incompatible with each other.
In addition to TIFF, there are other image file formats that might be used for archiving purposes. There could be an archive with GIF, BMP, JPEG or JPEG2000 formats – all of which have issues similar to TIFF.
PDF, another document format, came out in 1993. Its main goal was to reproduce documents accurately and reliably across platforms. While it is also a physical representation of the paper document, PDF is much more advanced than image formats. It can contain actual, searchable text rather than just an image of text. As PDF matured, more features were added to it, including digital signatures, versioning, and the ability to contain other files. This made the format much more versatile for archiving than image formats. Then in 2008, Adobe released PDF as an open ISO standard, guaranteeing its longevity.
PDF now has a sub-format specifically tailored for, and used by, the archiving community: PDF/A. The PDF/A standard was released in 2005, using some of the experience TIFF gave to the archiving community and building on it. Like PDF itself, PDF/A is also an open ISO standard. Its main goal is that files are preserved in a way that will allow them to be opened 50 or 100 years from when they are created. PDF/A files are self-contained and they don’t allow for external resources. The fonts, images, and even files that would usually be external to a document are all built into one package. This ensures the longevity of the file. In addition, the PDF/A standard is up to date, and it continues to get updates. PDF/A-1 was released in 2005, followed by PDF/A-2 in 2011 and PDF/A-3 in 2012.
The important thing is that PDF/A files will look the same as they did the day they were created, and they can be searchable. Other benefits that come from the PDF standard itself are the ability to digitally sign a file as well as file versioning. PDF files allow for multiple different compression algorithms, allowing the right compression to be selected for the file, resulting in a small file footprint. Of course, you can always duplicate the archive so data is never lost, which is the ultimate goal of an archive. With all this in mind, it is hard to find issues with PDF/A for archiving.
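For the curious, a PDF/A file announces which part of the standard it claims to follow in its XMP metadata, using the `pdfaid:part` and `pdfaid:conformance` properties. Below is a rough sketch of how that claim could be read from an XMP packet that has already been pulled out of a file as text. The function name `pdfa_claim` is made up for this post, the sketch assumes the usual `pdfaid` prefix is used, and it only reports what the file claims. It does not validate that the file actually conforms.

```python
import re


def pdfa_claim(xmp: str):
    """Return the (part, conformance) a file claims in its XMP, or None.

    Hypothetical helper for illustration only: it reads the claim and
    performs no validation of the PDF itself.
    """

    def prop(name: str):
        # Element form: <pdfaid:part>1</pdfaid:part>
        elem = re.search(rf"<pdfaid:{name}>\s*([^<\s]+)\s*</pdfaid:{name}>", xmp)
        if elem:
            return elem.group(1)
        # Attribute form: pdfaid:part="1"
        attr = re.search(rf"""pdfaid:{name}\s*=\s*["']([^"']+)["']""", xmp)
        return attr.group(1) if attr else None

    part = prop("part")
    if part is None:
        return None  # no PDF/A claim found in this packet
    return part, prop("conformance")


if __name__ == "__main__":
    sample = (
        '<rdf:Description xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">'
        "<pdfaid:part>1</pdfaid:part>"
        "<pdfaid:conformance>B</pdfaid:conformance>"
        "</rdf:Description>"
    )
    print(pdfa_claim(sample))
```

A regular expression keeps the sketch short. A real tool would parse the XMP as XML and resolve the namespace properly, then hand the file to a PDF/A validator.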
The chart below is a summary of file types used for archives that will be held for less than 10 years, versus items that need to be maintained for more than 10 years. I’d like to thank the folks in libraries for thinking long and hard on this subject to come up with guidelines.
| File type | Formats suitable for archives intended to last more than 10 years | Formats suitable for use less than 10 years |
|---|---|---|
| Text | Plain text (.txt, .c, .cpp, .m, etc.) coded as ASCII, UTF-8, or UTF-16 using byte order mark | PDF with embedded fonts; plain text (.txt, .c, .cpp, .m, etc.) (ISO 8859-1 coded) |
| Raster image | TIFF (uncompressed); JPEG2000 (lossless compression) | JPEG (lossy compression) |
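The plain text entries in the chart hinge on how a file is encoded, which is not always obvious from the file name. Here is a small sketch of how one might sort text files into those buckets by looking at the raw bytes. The function name `classify_text_encoding` and its return labels are invented for this example, and the check is a heuristic: bytes that are not ASCII or valid UTF-8 and carry no byte order mark are simply reported as unknown, which is where ISO 8859-1 files would land.

```python
import codecs

# Byte order marks, longest first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def classify_text_encoding(data: bytes) -> str:
    """Heuristically label the encoding of a plain text file's bytes."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    if data.isascii():
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Could be ISO 8859-1 or another legacy encoding; the bytes
        # alone cannot tell us which one.
        return "unknown"


if __name__ == "__main__":
    print(classify_text_encoding(b"plain old text"))
    print(classify_text_encoding("caf\u00e9".encode("utf-8")))
    print(classify_text_encoding("caf\u00e9".encode("latin-1")))
```

Files that come back as unknown are the ones worth converting to UTF-8 before they go into an archive meant to last.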
A lot of the traditional archive forms can be converted to PDF/A – paper archives can be scanned, and TIFF can be converted directly to PDF/A. If you are maintaining a traditional archive, I’d highly advise looking into converting it to PDF/A. We have multiple products that can help you with that – Adobe PDF Library and the PDF Java Toolkit.
PDF/A has multiple conformance levels, and if your organization does not require a specific one, picking the right conformance level can feel like a challenging task. To help with that, I’ll review the different PDF/A conformance levels, what they mean, and when to use them in my next post.