The Power of Plain Text
A plain text file is a computer file. Like all computer files it consists of a sequence of arbitrary bits. Two features distinguish a plain text file from a binary file:
- In plain text, the bits are used to represent unformatted textual characters. These include alphanumeric characters (numerals and letters) as well as control characters such as tabs, line breaks, and carriage returns. They are unformatted because there is no built-in representation of bold or italic characters.
- A large number of applications (almost all that deal with text in some way) can understand and operate on plain text, whereas only a limited number of applications can understand and operate on any particular binary file.
The virtues of plain text are many. Here are just a few:
- Portability: Plain text is remarkably portable. It is the lingua franca of the computer world. Applications on all the major platforms, Windows, Linux, Mac OS X, can understand and operate on plain text files.
- Stability: Plain text files are less prone to corruption than binary files. A corrupt binary file easily results in total data loss, whereas a corrupt plain text file can be recovered at least in part. Moreover, a particular binary format associated with a particular application can change over time, also resulting in data loss.
- Human Readability: Plain text files are human readable—they are, after all, representations of textual characters. Most binary files are not, at least without the aid of a special application.
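The point about partial recovery can be illustrated with a small Python sketch. The sample sentence and the three damaged bytes are invented for illustration; real corruption can of course look different.

```python
# A sample sentence, stored as UTF-8 bytes (made up for this sketch).
original = "Plain text survives damage surprisingly well.".encode("utf-8")

# Simulate corruption by overwriting three bytes in the middle with
# values that are never valid in UTF-8.
damaged = original[:11] + b"\xff\xfe\xff" + original[14:]

# errors="replace" swaps each undecodable byte for U+FFFD and keeps going,
# so everything around the damage is still readable.
recovered = damaged.decode("utf-8", errors="replace")
print(recovered)
```

Only the overwritten characters are lost; the rest of the sentence comes through intact.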
These features, and others, make plain text an excellent archival format for writing.
And its Perils
The problem is that in these post-lapsarian times, there is no such thing as plain text anymore. Should the encoding be ASCII or UTF-8? Should it include a BOM or not? Hard-wrapped or soft-wrapped?
ASCII or UTF-8?
ASCII is an acronym for American Standard Code for Information Interchange. It is a character encoding, a representation of textual characters, based on the English alphabet. It contains 95 printable characters:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
If you bothered to count, you noticed that there are only 94 visible characters. That’s because the space character counts as a printable character. Plain text files originally used only ASCII characters. The fact that only English characters were used reflects the cultural context in which this encoding was developed. But what about accented characters? Or Hiragana?
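If you would rather not count by hand, a few lines of Python can do it. This is a minimal sketch that relies only on the fact that the printable ASCII characters occupy the code points 0x20 (space) through 0x7E (tilde); the variable names are my own.

```python
# Printable ASCII runs from 0x20 (space) to 0x7E (tilde), inclusive.
printable = [chr(code) for code in range(0x20, 0x7F)]

# Drop the space to get the characters you can actually see.
visible = [ch for ch in printable if ch != " "]

print(len(printable))  # 95
print(len(visible))    # 94
print("".join(visible))
```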
UTF-8 is an encoding of Unicode. The Unicode standard was developed to represent the characters of all human languages. UTF-8 is a superset of ASCII: every ASCII character is encoded as the same single byte, and many more characters are available besides.
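The superset relationship is easy to check in Python. The sketch below uses sample strings of my own choosing and a hypothetical helper, `is_ascii_only`, that is not part of any library.

```python
ascii_text = "plain text"
accented = "caf\u00e9"                     # "café", with a precomposed é
hiragana = "\u3072\u3089\u304c\u306a"      # "ひらがな"

# ASCII-only text produces exactly the same bytes under both encodings.
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Characters outside ASCII take more than one byte each in UTF-8.
print(len(accented), len(accented.encode("utf-8")))   # 4 5
print(len(hiragana), len(hiragana.encode("utf-8")))   # 4 12


def is_ascii_only(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as ASCII (hypothetical helper)."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


print(is_ascii_only(ascii_text.encode("utf-8")))  # True
print(is_ascii_only(accented.encode("utf-8")))    # False
```

One consequence is that an existing ASCII file is already a valid UTF-8 file, so switching costs nothing for old English-only text.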
The short answer to the encoding question is UTF-8. Keep all your text files in UTF-8. For some considerations in its favor, see Allan Odgaard’s post on the TextMate Blog. For more on encodings, see Joel Spolsky’s post.
BOM or no BOM?
BOM is an acronym for Byte Order Mark. Inserted at the beginning of the file, it was designed to signal byte order; in UTF-8, where byte order is not an issue, it serves only to mark the file's encoding. It is unnecessary, since UTF-8 files can usually be recognized without it. Moreover, some *nix utilities choke on files with an embedded BOM.
Short answer, then, no BOM.
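If you inherit files that already carry a BOM, it is simple to detect and strip. Here is a minimal Python sketch using the standard `codecs` module; the two function names are my own and the sample bytes are invented.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes begin with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

print(has_utf8_bom(with_bom))                  # True
print(strip_utf8_bom(with_bom))                # b'hello'

# Decoding as plain "utf-8" keeps the BOM as an invisible first character;
# the "utf-8-sig" codec removes it while decoding.
print(len(with_bom.decode("utf-8")))           # 6
print(len(with_bom.decode("utf-8-sig")))       # 5
```

That invisible extra character is the sort of thing that trips up tools expecting a file to start with, say, `#!`.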
Soft Wrap or Hard Wrap?
Word wrapping is a feature of most text editors, the applications dedicated to generating plain text. It confines the text to the viewable window, allowing it to be read without any horizontal scrolling. Hard wrapping inserts actual line breaks into the file at a fixed width, that is, after a specific number of characters. In contrast, soft wrapping leaves the file untouched and lets the displayed text reflow as the window is resized.
Short answer, if your editor offers you the choice, use soft wrapping.
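To see what hard wrapping does to the file itself, here is a small Python sketch using the standard `textwrap` module. The sample paragraph and the width of 40 are arbitrary choices of mine, and the unwrapping step assumes a single paragraph with single spaces between words.

```python
import textwrap

# One paragraph stored as a single long line, as a soft-wrapping editor keeps it.
paragraph = (
    "Hard wrapping stores real line breaks in the file, while soft wrapping "
    "leaves each paragraph as one long line and lets the editor decide where "
    "to break it on screen."
)

# Hard wrap at 40 columns: newline characters are now part of the text.
hard_wrapped = textwrap.fill(paragraph, width=40)
print(hard_wrapped)

# Undo the hard wrap by collapsing every run of whitespace to one space.
unwrapped = " ".join(hard_wrapped.split())
print(unwrapped == paragraph)  # True
```

A soft-wrapped paragraph stays one line in the file, so tools like diff treat it as a single unit; a hard-wrapped one changes on many lines whenever you edit a word near the top.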
Unformatted
Plain text is unformatted. For writers, however, formatting can carry semantic significance. Its loss will be felt when writing a complex document in plain text.
Short answer, if you need to represent semantic or logical structure in your plain text document, you need to use some kind of markup, be it Markdown, HTML, LaTeX, reStructuredText, or what have you. More on markup in subsequent posts.
Conclusion
The power of plain text far exceeds its perils, even in these post-lapsarian times. I heartily recommend it to you. And remember, geeks prefer plain text.