LaTeX and Word Count Revisited

In an earlier post I described a TextMate command for determining the word count for a LaTeX document. Simply using

wc -w

gives an inflated estimate since it will include LaTeX commands. One alternative is to use detex:

detex mydocument.tex | wc -w

This strips the LaTeX commands before counting words, but it assumes that you are not using commands that insert text. Consider a document with the following command:

\newcommand{\yo}{To Whom It May Concern}
\yo

The detex method would not count these five words. For this reason, I proposed an alternative method: typesetting the LaTeX document, then converting the resulting PDF to text with ps2ascii and counting the result with wc:

ps2ascii mydocument.pdf | wc -w

There are two problems with this approach. First, ps2ascii is sloooow. Second, it gives an inflated estimate, since ps2ascii:

  • substitutes spaces for ligatures it cannot handle
  • preserves hyphenation, with the result that a word hyphenated across a line break counts as two words, not one
  • includes page numbers and headers and footers
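The hyphenation effect is easy to reproduce. Here is a minimal sketch, assuming the converted text ends a line with a hyphen wherever a word was broken; the function name join_hyphenated is my own and is not part of ps2ascii or any other tool mentioned here:

```sh
#!/bin/sh
# join_hyphenated: hypothetical helper that glues a line ending in "-"
# to the line that follows it, so a word split across a line break is
# counted once. Genuine compounds such as "well-known" that happen to
# break at the hyphen lose the hyphen, but still count as one word.
join_hyphenated() {
  awk '
    /-$/ { sub(/-$/, ""); printf "%s", $0; next }  # hold the line open
    { print }
  '
}

# Raw count: "hyphen-" and "ation" are counted separately.
printf 'The hyphen-\nation problem\n' | wc -w                     # 4 words

# After joining, the split word is counted once.
printf 'The hyphen-\nation problem\n' | join_hyphenated | wc -w   # 3 words
```

Something like ps2ascii mydocument.pdf | join_hyphenated | wc -w would then remove this one source of inflation, though not the ligature or page number problems.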

A different conversion utility, pdftotext, fares better:

pdftotext -enc UTF-8 -nopgbrk mydocument.pdf - | wc -w

It is much faster and does not preserve hyphenation. Ligatures, page numbers, and headers and footers still present a problem, but the estimate is much better.
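The page number part of the problem can be reduced with a simple filter. This is only a sketch, assuming that each page number ends up alone on its own line in the converted text; the function name strip_page_numbers is hypothetical, and the filter will also drop any genuine line that consists of nothing but digits:

```sh
#!/bin/sh
# strip_page_numbers: hypothetical filter that drops lines containing
# only digits (optionally surrounded by whitespace). Headers, footers,
# and ligature problems are left untouched.
strip_page_numbers() {
  grep -v -E '^[[:space:]]*[0-9]+[[:space:]]*$'
}

# A bare "12" on its own line is removed before counting.
printf 'Some text here\n12\nMore text\n' | strip_page_numbers | wc -w   # 5 words
```

It would sit between the conversion and the count, as in pdftotext ... | strip_page_numbers | wc -w.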

A better approach would be to use TeX to generate the word count. There is a shell script available on CTAN that does just this—wordcount. Unfortunately, it is no longer actively maintained (someone reported to me that its author had died), and it failed on several of my LaTeX documents.

Googling has revealed a couple of Perl scripts for determining the word count of LaTeX documents: TeXcount.pl and texWordCount.pl. These provide reasonable though different word count estimates.

Testing these methods on a longish document yielded the following results:

Program        Word count
detex          15061
ps2ascii       15763
pdftotext      15427
TeXcount       15299
texWordCount   15148
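To put the spread in perspective, the differences can be worked out against a single baseline. The sketch below takes the texWordCount figure as the baseline, which is my choice for illustration; the function name relative_spread is hypothetical:

```sh
#!/bin/sh
# relative_spread: reads "name count" lines on stdin and prints each
# count's difference from the baseline given as $1, both in words and
# as a percentage of the baseline.
relative_spread() {
  awk -v base="$1" '{
    d = $2 - base
    printf "%s %+d %+.2f%%\n", $1, d, 100 * d / base
  }'
}

relative_spread 15148 <<'EOF'
detex 15061
ps2ascii 15763
pdftotext 15427
TeXcount 15299
EOF
# detex -87 -0.57%
# ps2ascii +615 +4.06%
# pdftotext +279 +1.84%
# TeXcount +151 +1.00%
```

On this document, then, every method lands within about four percent of the texWordCount figure, and all but ps2ascii within two percent.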

As far as I can determine, texWordCount gave the best estimate. Here is a command that uses it (be sure to correct the path):

perl -T /path/to/texWordCount.pl "$TM_FILEPATH"

In TextMate’s Bundle Editor the settings should be:

  • Save: Nothing
  • Input: None
  • Output: Show as Tool Tip
  • Activation: Key Equivalent ⌃ ⇧ N
  • Scope Selector: text.tex.latex
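Since the output is shown as a tool tip, it may help if the command reports a wrong path clearly instead of failing with a Perl error. A minimal sketch, assuming a hypothetical TEXWORDCOUNT environment variable for the script location (neither the variable nor the function name comes from TextMate or texWordCount); the /path/to placeholder still has to be corrected:

```sh
#!/bin/sh
# count_latex_words: hypothetical wrapper around texWordCount.pl.
# TEXWORDCOUNT is an assumed variable name; if it is unset, the
# placeholder path below is used and must be edited by hand.
count_latex_words() {
  file="$1"
  script="${TEXWORDCOUNT:-/path/to/texWordCount.pl}"

  if [ ! -f "$file" ]; then
    echo "No such file: $file" >&2
    return 1
  fi
  if [ ! -f "$script" ]; then
    echo "texWordCount.pl not found at: $script" >&2
    return 1
  fi

  perl -T "$script" "$file"
}

# In the TextMate command body the call would be:
#   count_latex_words "$TM_FILEPATH"
```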
