In an earlier post I described a TextMate command for determining the word count for a LaTeX document. Simply using
wc -w
gives an inflated estimate since it will include LaTeX commands. One alternative is to use detex:
detex mydocument.tex | wc -w
This strips the LaTeX commands before counting words, but it assumes that none of your commands insert text. Consider a document with the following command:
\newcommand{\yo}{To Whom It May Concern}
\yo
The detex method would not count these five words. For these reasons, I proposed an alternative method: typesetting the LaTeX document, then converting the resulting PDF to text with ps2ascii, and then using wc:
ps2ascii mydocument.pdf | wc -w
There are two problems with this approach. First, ps2ascii is sloooow. Second, it gives an inflated estimate since ps2ascii:
- substitutes spaces for ligatures it cannot handle
- preserves hyphenation, with the result that hyphenated words count as two words, not one
- includes page numbers and headers and footers
A different conversion utility, pdftotext, fares better:
pdftotext -enc UTF-8 -nopgbrk mydocument.pdf - | wc -w
It is much faster and does not preserve hyphenation. Ligatures, page numbers, headers, and footers are still a problem, but the estimate is much better.
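One of the remaining problems, page numbers, can be reduced with a crude filter. The following is only a sketch under a stated assumption: that a page number shows up in the converted text as a line containing nothing but digits. The function name is hypothetical, and the heuristic will also drop any legitimate line that happens to be a bare number.

```shell
# Count the words on stdin, skipping lines that hold nothing but a number.
# Assumption: bare page numbers appear as digit-only lines in the text.
count_words_skipping_page_numbers() {
  grep -Ev '^[[:space:]]*[0-9]+[[:space:]]*$' | wc -w | tr -d ' '
}

# Intended use (the trailing "-" makes pdftotext write to standard output):
#   pdftotext -enc UTF-8 -nopgbrk mydocument.pdf - | count_words_skipping_page_numbers
```

This does nothing for ligatures or for running headers and footers, so the result is still an overestimate.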
A better approach would be to use TeX to generate the word count. There is a shell script available on CTAN that does just this—wordcount. Unfortunately, it is no longer actively maintained (someone reported to me that its author had died), and it failed on several of my LaTeX documents.
Googling revealed a couple of Perl scripts for determining the word count of LaTeX documents: TeXcount.pl and texWordCount.pl. These provide reasonable, though different, word count estimates.
Testing these methods on a longish document yielded the following results:
| Program | Word count |
|---|---|
| detex | 15061 |
| ps2ascii | 15763 |
| pdftotext | 15427 |
| TeXcount | 15299 |
| texWordCount | 15148 |
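To put the spread in the table in perspective, here is a small sketch that prints each count's difference from a chosen baseline. The function name is made up for illustration, and treating the texWordCount figure as the baseline is just a convenient choice.

```shell
# Word counts from the table above, one "program count" pair per line.
results='detex 15061
ps2ascii 15763
pdftotext 15427
TeXcount 15299
texWordCount 15148'

# Print: program, difference from the baseline ($1), difference in percent.
diff_from_baseline() {
  awk -v base="$1" '{ d = $2 - base; printf "%s %d %.1f\n", $1, d, 100 * d / base }'
}

printf '%s\n' "$results" | diff_from_baseline 15148
```

Against that baseline, ps2ascii comes out 615 words (about 4.1%) high, and detex comes out 87 words low.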
As far as I can determine, texWordCount gave the best estimate. Here is a command that uses it (be sure to correct the path):
perl -T /path/to/texWordCount.pl "$TM_FILEPATH"
In TextMate’s Bundle Editor the setting should be:
- Save: Nothing
- Input: None
- Output: Show as Tool Tip
- Activation: Key Equivalent ⌃ ⇧ N
- Scope Selector: text.tex.latex
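With Save set to Nothing, the command can be run on a document that has never been saved, in which case, as far as I know, TM_FILEPATH is empty. A slightly more defensive version of the command might look like the sketch below. The function name and the messages are hypothetical, and the script path is the same placeholder as above, so correct it before use.

```shell
# Hypothetical wrapper around texWordCount.pl.
# $1 is the file to count; TextMate supplies it as $TM_FILEPATH.
latex_word_count() {
  script=/path/to/texWordCount.pl   # placeholder path: correct it
  if [ -z "$1" ]; then
    echo "Save the document before counting words."
    return 1
  fi
  if [ ! -r "$script" ]; then
    echo "texWordCount.pl not found at $script"
    return 1
  fi
  perl -T "$script" "$1"
}
```

The command body would then end with `latex_word_count "$TM_FILEPATH"`, so that either message appears in the tool tip instead of a Perl error.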