GutenMark Wordlists Page
Attractively formatting Project Gutenberg texts

What are wordlists and namelists?
What are they good for?
What wordlists are available?
Massaging the wordlists
Configuring

What are wordlists and namelists?

They are simply lists of words or names in a given language, prepared in a format required by GutenMark.

What are they good for?

In reformatting Project Gutenberg etexts, there are many features of the text that GutenMark has a relatively easy time interpreting, because interpreting them is simply a matter of transforming data present within the etext into another form. In many other cases, however, the data needed for attractive formatting has been discarded (or at least reduced) during creation of the etext, and hence is no longer present in the etext. In these cases, GutenMark has to work a lot harder. It tries its best to recreate this data from whatever general knowledge (not specific to the text) may be available. Some knowledge of this kind can be obtained from wordlists and namelists, and can be applied to the following problems:

The ALL-CAPS style of italicizing. [NOTE: GutenMark can support this feature without wordlists, but using wordlists makes the support better.] Many PG etexts represent italicized words in all-capital letters, as in "I can't believe SHE did that!" In print, this should be rendered as "I can't believe she did that!", with "she" in italics. Unfortunately, it is not obvious how the italicized word should be capitalized or, for that matter, whether it should be italicized at all. Consider the following examples: "I can't believe JOHN did that!" and "I can't believe NASA did that!" These should be rendered "I can't believe John did that!" and "I can't believe NASA did that!" GutenMark handles this by means of wordlists/namelists. Since the wordlists contain the word "she", but not "She" or "SHE", GutenMark knows that SHE must be converted to she. Since the wordlists contain "John" but not "JOHN", JOHN is converted to John. Finally, since the wordlists actually contain the word "NASA", GutenMark understands that it should be left unchanged.
Italicizing foreign words. In English, it is correct (though it seems to me to be increasingly rare) to italicize foreign words and phrases, such as "junta" in "the military's junta was unsuccessful." Obviously, wordlists can be used to locate these non-English words.
Restoration of diacritical marks. Most PG etexts discard diacritical marks. For example, the name "Schrödinger" would normally appear in a PG etext as "Schrodinger." By consulting the wordlists, the correct form is known and can be restored.
Hyphenation. [NOTE: This feature is not yet implemented.] PG etexts contain so-called "hard carriage-returns" at the ends of the text lines, which GutenMark is forced to remove in order to re-justify the paragraphs. If a word in the PG etext was broken across two lines by hyphenation (a practice that PG does not recommend), then a false hyphen might appear in the re-justified etext. For example, if one text line ended with "in-" and the next began with "visible", then the marked-up HTML would contain "in-visible." GutenMark can examine the wordlists to determine that "invisible" is a correct form, and that the hyphen can therefore be removed.
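
The lookups described in the list above can be sketched in a few lines of Python. This is only an illustration of the idea, not GutenMark's actual code: the tiny in-memory WORDS list and the function names are hypothetical, and the hyphen-joining function shows a feature that is not yet implemented in GutenMark itself.

```python
import unicodedata

# Hypothetical stand-in for the contents of real wordlists and namelists.
WORDS = ["she", "John", "NASA", "MacMurray", "invisible", "Schrödinger"]


def strip_marks(word):
    """Remove diacritical marks, e.g. 'Schrödinger' -> 'Schrodinger'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Map the all-caps form of each listed word to the form found in the list.
CASE_INDEX = {}
for listed in WORDS:
    CASE_INDEX.setdefault(listed.upper(), listed)

# Map the mark-free form of each accented word to the accented form.
MARK_INDEX = {strip_marks(w): w for w in WORDS if strip_marks(w) != w}


def restore_case(token):
    """Give an ALL-CAPS token the capitalization found in the wordlist."""
    return CASE_INDEX.get(token.upper(), token)


def restore_diacritics(token):
    """Put back diacritical marks when the wordlist knows an accented form."""
    return MARK_INDEX.get(token, token)


def join_hyphenated(first, second):
    """Drop a line-break hyphen only if the joined word is in the wordlist."""
    joined = first.rstrip("-") + second
    return joined if joined.upper() in CASE_INDEX else first + second
```

With this sample list, restore_case turns SHE into she and JOHN into John and leaves NASA alone, while join_hyphenated rejoins "in-" and "visible" but keeps the hyphen in a pair it does not recognize.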

What wordlists are available?

I have not created any wordlist data myself (except as indicated), have no copyrights on the wordlists, and am not in a position to grant licenses for them. I believe that all of the wordlists available from the GutenMark website should be freely usable (to the extent described below), but if you have information to the contrary, please inform me. In each case, the data for creating the wordlist was available for free download from the Internet, and was then massaged by software utilities (available in the GutenMark distribution) to transform the wordlist into a GutenMark-compatible format. This being the case, you could easily download the data from the original source, and process it yourself into the required format.

Description	Original Source	Transformation Utilities (see below)	Apparent Status of Source Data
My own special English words. (Things that annoyed me by not being in english.words.gz.)	Right here!	n/a	GPL.
My own special non-English words. (Things that annoyed me by not being in the non-English wordlists.)	Right here!	n/a	GPL.
U.S. namelist	dist.all.last dist.female.first dist.male.first	names_english	This data is from the U.S. Census Bureau, and seemingly available under the Freedom of Information Act.
U.S. placenames	Numerous files from U.S. Geological Survey	USGS sort NoDups	Public domain.
Non-U.S. placenames	Numerous files from the National Imagery and Mapping Agency	NIMA sort NoDups	No copyright or licensing restrictions
French namelist	Francais-GUTenberg-v1.0.tar.gz	ispell -e string2line	GPL
English wordlist	ispell-enwl-3.1.20.tar.gz	n/a	Free, but refer to the documentation for restrictions.
French wordlist	Francais-GUTenberg-v1.0.tar.gz	ispell -e string2line	GPL
Older, smaller, German wordlist, old spelling rules (german.words.gz)	hk2-deutsch.tar.gz	ispell -e hk2_deutsch	Seemingly free, but I can't be 100% sure from the docs. This was bundled with my SuSE Linux distribution.
Newer, bigger, German wordlist, new spelling rules (german2.words.gz)	igerman	ispell -e hk2_deutsch	GPL.
Latin wordlist	dictpage.txt	words197	Free, though it's hard to infer this with certainty from the docs. Here is the assurance I received when inquiring directly of the author.
Italian wordlist	ispell-it2000.tgz	ispell -e string2line	GPL
Spanish wordlist	espa~nol.tar.gz	ispell -e espa~nol_filter	GPL
Norwegian wordlist	ispell-norsk-2.0.tar.gz	make norsk	GPL
Gaelic wordlist	ispell-gaeilge-1.0.tar.gz	ispell -e string2line	GPL
Danish wordlist	ispell-da-1.4.21.tar.gz	ispell -e string2line	GPL
Swedish wordlist	iswedish-1.2.1.tar.gz	ispell -e string2line	GPL
Finnish wordlist	finnish.dict.bz2 finnish.large.aff.bz2	ispell -e string2line	GPL

Massaging the wordlists

As mentioned above, you don't need to use the wordlists provided with GutenMark. I provide these simply as a convenience. Alternatively, you could download the original datasets from their creators and massage them with GutenMark-provided utilities to get the necessary wordlists. You could even produce completely new GutenMark wordlists for unsupported languages or other purposes.

The format of a GutenMark wordlist is simple:

It is a plain text file (ASCII plus the 8-bit characters listed in the table below), which has been compressed with the GNU gzip program.
It contains a line for each word. The lines can't contain any whitespace, or anything other than the word itself.
The words should be capitalized as follows: If a word must be in all-caps, like "NASA", then put it in all-caps. If the word requires some special capitalization, such as "John" or "MacMurray", then capitalize it accordingly. For normal words that are usually in lower-case, but are capitalized at the beginnings of sentences, use all lower-case.
The words can contain any character in the following table, but not leading or trailing apostrophes. The table includes both numerical codes (for non-ASCII characters) and the characters themselves, but the characters may or may not appear correctly, depending on your browser and its settings:

' (apostrophe)		173: (soft hyphen)
A	a	192: À	217: Ù	224: à	249: ù
B	b	193: Á	218: Ú	225: á	250: ú
C	c	194: Â	219: Û	226: â	251: û
D	d	195: Ã	220: Ü	227: ã	252: ü
E	e	196: Ä	221: Ý	228: ä	253: ý
F	f	197: Å	222: Þ	229: å	254: þ
G	g	198: Æ	223: ß	230: æ	255: ÿ
H	h	199: Ç		231: ç
I	i	200: È		232: è
J	j	201: É		233: é
K	k	202: Ê		234: ê
L	l	203: Ë		235: ë
M	m	204: Ì		236: ì
N	n	205: Í		237: í
O	o	206: Î		238: î
P	p	207: Ï		239: ï
Q	q	208: Ð		240: ð
R	r	209: Ñ		241: ñ
S	s	210: Ò		242: ò
T	t	211: Ó		243: ó
U	u	212: Ô		244: ô
V	v	213: Õ		245: õ
W	w	214: Ö		246: ö
X	x
Y	y	216: Ø		248: ø
Z	z
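
If you build a wordlist yourself, a small checker for the rules above may save some trouble. The following is a minimal sketch rather than a GutenMark utility: the function names are my own, and it assumes that the numerical codes in the table are ISO-8859-1 (Latin-1) byte values.

```python
import gzip

# Characters from the table: apostrophe, soft hyphen (173), A-Z, a-z,
# and codes 192-255 except 215 and 247, which the table omits.
ALLOWED = set("'\u00ad")
ALLOWED.update(chr(c) for c in range(ord("A"), ord("Z") + 1))
ALLOWED.update(chr(c) for c in range(ord("a"), ord("z") + 1))
ALLOWED.update(chr(c) for c in range(192, 256) if c not in (215, 247))


def check_word(word):
    """Return a description of the problem, or None if the word looks fine."""
    if not word:
        return "empty line"
    if word.startswith("'") or word.endswith("'"):
        return "leading or trailing apostrophe"
    bad = sorted(set(word) - ALLOWED)
    if bad:
        return "disallowed characters: " + repr("".join(bad))
    return None


def check_wordlist(path):
    """Yield (line_number, word, problem) for each bad line in a gzipped list."""
    # Assumption: the file is Latin-1 encoded, one word per line.
    with gzip.open(path, "rt", encoding="latin-1", newline="\n") as handle:
        for number, line in enumerate(handle, start=1):
            word = line.rstrip("\n")
            problem = check_word(word)
            if problem:
                yield number, word, problem
```

Calling list(check_wordlist("mylist.words.gz")) on a hypothetical file of your own returns an empty list when every line passes. Whitespace, digits, and stray carriage returns are all reported as disallowed characters.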

Unfortunately, the process of creating a wordlist will not be easy for most people, and since it varies from case to case it cannot be described in detail here. It will be easiest for those with programming experience, and such knowledge is assumed in the next couple of paragraphs.

Most of the existing wordlists were created from language databases for the *nix spell-checker program called "ispell". Ispell databases don't contain wordlists as such, but do contain word data and so-called "affix" files. By combining these two, with the ispell '-e' command-line switch, a wordlist can be produced. Some existing ispell databases don't incorporate diacritical marks directly, but expect them to be encoded by some funky sequence of characters. For example, 'Schrödinger' might appear as 'Schro"dinger'. GutenMark wordlists must contain the former rather than the latter.

For this reason and others, ispell wordlists always need some additional post-processing to be acceptable to GutenMark. The post-processing for the existing GutenMark wordlists is performed by various little utility programs (listed in a table above) provided by GutenMark. All of the utilities are simple command-line filters. For more info, I fear you must look at the actual source code for the utilities. Fortunately, this source code is quite simple.

It's unlikely that anyone will actually want to create a wordlist. But if you do, you might want to tell me about it, so that I can post the wordlist here for download.

Configuring

The appropriateness and search-order of the various wordlists depends somewhat on the etexts being formatted. In general, you want to search them in the following order:

Namelists for the language the etext is in.
Namelists for other languages.
Wordlists for the language the etext is in.
Wordlists for other languages, in order of decreasing probability of finding words from that language within the etext.
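
The reason the order matters is that the first list containing a word wins. The sketch below illustrates this with made-up data; the list names, their contents, and the find_word function are all hypothetical and say nothing about how GutenMark is implemented internally.

```python
def find_word(token, ordered_lists):
    """Return (list_name, listed_form) for the first list that knows the token."""
    for name, words in ordered_lists:
        # Try the token as written, then lower-case, then with an initial capital.
        for candidate in (token, token.lower(), token.capitalize()):
            if candidate in words:
                return name, candidate
    return None


# Hypothetical lists for an English etext, in the order recommended above.
SEARCH_ORDER = [
    ("English names", {"John", "Paris"}),
    ("French names", {"Jean", "Paris"}),
    ("English words", {"she", "pain"}),
    ("French words", {"chez", "pain"}),
]
```

Here "pain" is found in the English wordlist before the French one is ever consulted, so it is treated as a native word, while "chez" is found only in the French wordlist and is treated as foreign. Reversing the last two entries would change the answer for "pain".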

Obviously, the default search order may not be appropriate for the kinds of etexts you are converting. For example, your etexts may not be in English, or they may be more likely to contain Latin than French. You can change the search order by modifying the file GutenMark.cfg. You can do this in any text editor, and the way in which you have to change the file will be obvious to you upon inspection. The configuration file can contain various named 'profiles', and each profile can incorporate different language wordlists or different search orders for the wordlists, and can designate each directory as being "native" or "foreign." The desired profile can be chosen with GutenMark command-line switches.

When running GutenMark as a command-line program, GutenMark.cfg is by default located in the directory from which you run GutenMark. (This can be overridden with the "--config" command-line option in later versions.) When running GutenMark under the control of GUItenMark, the configuration file is by default in the "GutConfigs" subdirectory of the installation directory. (This can optionally be overridden within GUItenMark.) The wordlists are also usually located in the same directory as the program, but need not be if the configuration file is edited appropriately. If in GutenMark.cfg you specify the wordlists by pathname (i.e., by filename plus directory), GutenMark will look for the wordlist files only in the exact locations you have specified. On the other hand, if in GutenMark.cfg you specify the wordlists only by filename without directory, GutenMark will look for the wordlists first in the current directory and then in the directory containing the GutenMark executable. The default configuration file does not contain directory names.
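
The lookup rule for wordlist files can be restated as a short Python sketch. It only mirrors the behavior described in the paragraph above; the function name and its parameters are hypothetical, and GutenMark itself is not written this way.

```python
import os


def locate_wordlist(entry, executable_dir, current_dir=None):
    """Return the path of the wordlist named by a config entry, or None."""
    if current_dir is None:
        current_dir = os.getcwd()
    if os.path.dirname(entry):
        # The entry includes a directory: look only in that exact location.
        candidates = [entry]
    else:
        # Bare filename: current directory first, then the executable's directory.
        candidates = [
            os.path.join(current_dir, entry),
            os.path.join(executable_dir, entry),
        ]
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None
```

For example, a bare entry such as english.words.gz is found in the executable's directory only if no file of that name exists in the current directory.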

Important note: Prior to version 20020721, there was a bug in which wordlists specified in GutenMark.cfg without directories were sought only in the current directory. Therefore, if wordlists were put into the executable directory as recommended by the installation instructions, they could not be found without modifying the default configuration file to include the full pathnames of the wordlists. In versions 20020721 and later, this problem has been fixed.
