I have been meaning to blog about this for a while. File this under “Better Late Than Never”.
What’s the best archival format for your important documents? In a previous post I suggested parchment might be, but that’s impractical. All joking aside, the issue is a serious one for anyone who is going to spend the better part of their life writing and needs reliable access to this material.
James King of Adobe has a blog, Inside PDF, that’s well worth checking out. Besides covering ISO 32000, the ISO standard based on PDF 1.7, James has made an interesting case for PDF/A as an archival format.
Part of the case is a case against XML-based alternatives such as ODF and OOXML. One problem is false advertising: while they do contain XML subfiles, they are in fact ZIP archives that contain binary files besides. Not only false advertising, but false promises as well:
There is what I think is a rather technically shallow belief that XML files are easier to work with and will survive the passing of time, even great periods of time, better than other formats. The text held within XML files can usually be viewed with any generic text editor and I guess that gives people a warm feeling that it will therefore also be easier to retrieve with a program. Fair enough. But what is glossed over way too much is that that text is enveloped within XML for [something]. (See my earlier blog entry.) The envelopes (schemas) offered by ODF and OOXML are different. Different enough that a simple program cannot extract just the raw text from either. And is that all I really want from a document in the future, the raw text. Because when you get to the layout and the images and the color space definitions and the fonts, these things do not lend themselves well to XML and are often stored within the ZIP archives as binary data. So tell me again where the advantage to XML is for this purpose?
To these two criticisms, let me add a third. The structure encoded in the XML subfiles of ODF and OOXML is not the logical structure of the document but the functions of the word processor. But that’s not what needs preserving.
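The container point is easy to check for yourself. Here is a minimal sketch in Python that opens a ZIP container and sorts its members into XML and everything else. The archive it inspects is a toy built in memory that only mimics the layout of an ODF file (it is not a valid document), and the function names are my own, not part of any ODF or OOXML tooling.

```python
import io
import zipfile


def classify_members(archive):
    """Split the members of a ZIP container into XML and non-XML names."""
    xml_members, other_members = [], []
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            if name.lower().endswith(".xml"):
                xml_members.append(name)
            else:
                other_members.append(name)
    return xml_members, other_members


def build_toy_container():
    """Build an in-memory ZIP that only mimics an ODF layout (assumption)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", "<doc><p>Hello, archive.</p></doc>")
        zf.writestr("Pictures/figure.png", b"\x89PNG placeholder bytes")
    buffer.seek(0)
    return buffer


xml_members, other_members = classify_members(build_toy_container())
print("XML members:", xml_members)
print("Other members:", other_members)
```

Pointing `classify_members` at a real `.odt` or `.docx` file on disk should work the same way, since `zipfile.ZipFile` accepts a path as well as a file object.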
Rob Weir at An Antic Disposition, not surprisingly, had an alternative view. Rob observes that not all goals one might have in archiving are well served by PDF. Reflection on these goals raises a number of questions, none of which are answered by PDF:
- What was the nature of collaboration that lead to this document? How many people worked on it? Who contributed what?
- How did the document evolve from revision to revision?
- In the case of a spreadsheet, what was the underlying model and assumptions? In other words, what are the formulas behind the cells?
- In the case of a presentation, how did the document interact with embedded media such as audio, animation, video?
- How was technology used to create this document? In what way did the technology help or impede the author’s expression? (Note that researchers in the future may be as interested in the technology behind the document as the contents of the document itself.)
Nevertheless, Rob is not blind to the attractions of PDF, only sensitive to the way it offers a partial solution to the problem of archiving. In the end he entertains a hybrid approach:
An intriguing idea is whether we can have it both ways. Suppose you are in an ODF editor and you have a “Save for archiving…” option that would save your ODF document as normal, but also generate a PDF version of it and store it in the zip archive along with ODF’s XML streams. Then digitally sign the archive along with a time stamp to make it tamper-proof. You would need to define some additional access conventions, but you could end up with a single document that could be loaded in an ODF editor (in read-only mode) to allow examination of the details of spreadsheet formulas, etc., as well as loaded in a PDF reader to show exactly how it was formated.
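To make the hybrid idea concrete, here is a rough sketch in Python of the packaging step only. It appends a PDF rendering to an existing ZIP container and computes a SHA-256 fingerprint of the result. Several things here are assumptions of mine: the member name `archive/rendering.pdf` is hypothetical and not defined by ODF, the container and PDF bytes are placeholders, and a plain hash is only a stand-in for the signed time stamp Rob describes, not a digital signature. A real ODF package would most likely also need its manifest updated to list the new member.

```python
import hashlib
import io
import zipfile

PDF_MEMBER = "archive/rendering.pdf"  # hypothetical name, not defined by ODF


def add_pdf_rendering(container_bytes, pdf_bytes, member=PDF_MEMBER):
    """Return a copy of the ZIP container with the PDF appended as a member."""
    buffer = io.BytesIO(container_bytes)
    with zipfile.ZipFile(buffer, "a") as zf:  # "a" appends to the archive
        zf.writestr(member, pdf_bytes)
    return buffer.getvalue()


def fingerprint(data):
    """SHA-256 digest of the archive: a stand-in for signing, not a signature."""
    return hashlib.sha256(data).hexdigest()


# Toy container standing in for a saved document (placeholder content).
original = io.BytesIO()
with zipfile.ZipFile(original, "w") as zf:
    zf.writestr("content.xml", "<doc><p>Quarterly figures</p></doc>")

pdf_bytes = b"%PDF-1.7 placeholder bytes"
hybrid = add_pdf_rendering(original.getvalue(), pdf_bytes)
digest = fingerprint(hybrid)

with zipfile.ZipFile(io.BytesIO(hybrid)) as zf:
    members = zf.namelist()
    stored_pdf = zf.read(PDF_MEMBER)

print(members)
print(digest)
```

Any later change to the archive changes the digest, which is the tamper-evidence half of the proposal. Proving who produced the archive and when would need a real signature and time stamp on top of this.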
There is a third way.
It may not be the final answer (especially given its current incarnation). But it has the advantages of both approaches and the deficits of neither: structural markup of plain text files kept under version control. Plain text is the lingua franca of computers and will remain that way for the foreseeable future. Any given file will remain editable, but any given commit will be preserved. Moreover, the version control system will preserve a wealth of metadata about the development of the document, the contribution of collaborators, etc. There are choices in implementation concerning both the markup (be it LaTeX, ConTeXt, or XML variants such as DocBook) and the version control system (be it Subversion, Git, or Mercurial). None are perfect. But for now, my bet is on the third way.
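As a small illustration of the metadata a version control system keeps, here is a sketch in Python, assuming Git is the system in use, that answers one of Rob's questions: who contributed to a given file, and how often. The function names are my own, and the sample names are made-up data. `authors_of` shells out to `git log` and so needs Git installed and a real repository, while `tally_authors` only parses the output and runs anywhere.

```python
import subprocess
from collections import Counter


def tally_authors(log_output):
    """Count commits per author from `git log --format=%an` output."""
    names = [line.strip() for line in log_output.splitlines() if line.strip()]
    return Counter(names)


def authors_of(path, repo="."):
    """Ask git for the author of every commit that touched `path`."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--follow", "--format=%an", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return tally_authors(result.stdout)


# Made-up sample of what the git command above might print for one file.
sample_log = "Alice\nBob\nAlice\n"
tally = tally_authors(sample_log)
for author, commits in tally.most_common():
    print(author, commits)
```

In a real repository, something like `authors_of("chapter1.tex")` would give the same kind of tally for an actual manuscript file (the file name is hypothetical). The revision-by-revision history is there too, through `git log -p` on the same path.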