Attractively formatting Project Gutenberg texts
What are wordlists and namelists?
What are they good for?
What wordlists are available?
Massaging the wordlists
Configuring
|
|
Utilities (see below) |
|
|
|
|
GPL. |
|
|
|
GPL. |
|
dist.female.first dist.male.first |
|
This data is from the U.S. Census Bureau, and seemingly available under the Freedom of Information Act. |
|
|
sort NoDups |
Public domain. |
|
|
sort NoDups |
No copyright or licensing restrictions |
|
|
string2line |
GPL |
|
|
|
Free, but refer to the documentation for restrictions. |
|
|
string2line |
GPL |
(german.words.gz) |
|
hk2_deutsch |
Seemingly free, but I can't be 100% sure from the docs. This was bundled with my SuSE Linux distribution. |
|
|
hk2_deutsch |
GPL. |
|
|
|
Free, though it's hard to infer this with certainty from the docs. Here is the assurance I received when inquiring directly of the author. |
|
|
string2line |
GPL |
|
|
espa~nol_filter |
GPL |
|
|
norsk |
GPL |
|
|
string2line |
GPL |
|
|
string2line |
GPL |
|
|
string2line |
GPL |
|
finnish.large.aff.bz2 |
string2line |
GPL |
The format of a GutenMark wordlist is simple:
(apostrophe) |
(soft hyphen) |
Unfortunately, the process of creating a wordlist will not be easy for most people, and since it varies from case to case it cannot be described in detail here. It will be easiest for those with programming experience, and such knowledge is assumed in the next couple of paragraphs.
Most of the existing wordlists were created from language databases for the *nix spell-checker program called "ispell". Ispell databases don't contain wordlists as such, but do contain word data and so-called "affix" files. By combining these two, with the ispell '-e' command-line switch, a wordlist can be produced. Some existing ispell databases don't incorporate diacritical marks directly, but expect them to be encoded by some special sequence of characters. For example, 'Schrödinger' might appear as 'Schro"dinger'. GutenMark wordlists must contain the former rather than the latter.
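To make the diacritical-mark problem concrete, here is a minimal Python sketch of the kind of conversion involved. It is not one of the GutenMark utilities. The mapping table is an assumption based only on the 'Schro"dinger' example: it treats a vowel followed by a double quote as that vowel with an umlaut, and real ispell databases may use other conventions.

```python
import re

# Assumed mapping, for illustration only: a vowel followed by a double quote
# stands for that vowel with an umlaut. Real ispell databases may differ.
UMLAUTS = {
    'a"': "ä", 'o"': "ö", 'u"': "ü",
    'A"': "Ä", 'O"': "Ö", 'U"': "Ü",
}

# Matches exactly the two-character sequences listed in UMLAUTS.
_ENCODED = re.compile(r'[aouAOU]"')


def decode_diacritics(word: str) -> str:
    """Replace each encoded sequence with the corresponding accented letter."""
    return _ENCODED.sub(lambda match: UMLAUTS[match.group(0)], word)


if __name__ == "__main__":
    print(decode_diacritics('Schro"dinger'))  # prints: Schrödinger
```

Words that contain none of the encoded sequences pass through unchanged, so a function like this could be applied to every line of a generated wordlist.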
For this reason and others, ispell wordlists always need some additional post-processing to be acceptable to GutenMark. The post-processing for the existing GutenMark wordlists is performed by various little utility programs (listed in a table above) provided by GutenMark. All of the utilities are simple command-line filters. For more info, I fear you must look at the actual source code for the utilities. Fortunately, this source code is quite simple.
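For readers who have not written a command-line filter before, the following Python sketch shows the general shape of one: it reads lines, changes them, and returns the result. The function name `sorted_unique` is hypothetical, and the behaviour (drop blank lines and duplicates, then sort by character code) is an assumption chosen for illustration. It does not reproduce any of the actual GutenMark utilities.

```python
from typing import Iterable, List


def sorted_unique(lines: Iterable[str]) -> List[str]:
    """Return the distinct non-empty words from `lines`, in sorted order.

    Sorting is by Unicode code point, which is an assumption; a real
    wordlist may need a different ordering.
    """
    # Strip surrounding whitespace (including the newline) from every line.
    words = {line.strip() for line in lines}
    # Blank lines collapse to the empty string, which is not a word.
    words.discard("")
    return sorted(words)


if __name__ == "__main__":
    sample = ["zebra\n", "apple\n", "zebra\n", "\n"]
    for word in sorted_unique(sample):
        print(word)
```

To use it as a real filter, one would pass `sys.stdin` as the argument and write each returned word to standard output.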
It's probably unlikely that anyone will actually want to create a wordlist. But if you do, you might want to tell me about it, so that I can post the wordlist here for download.
When GutenMark runs as a command-line program, GutenMark.cfg is located by default in the directory from which you run GutenMark. (Later versions let you override this with the "--config" command-line option.) When GutenMark runs under the control of GUItenMark, the configuration file is by default in the "GutConfigs" subdirectory of the installation directory. (This can optionally be overridden within GUItenMark.)
The wordlists are usually located in the same directory as the program, but need not be if the configuration file is edited appropriately. If GutenMark.cfg specifies the wordlists by pathname (i.e., by filename plus directory), GutenMark looks for the wordlist files only in the exact locations you have specified. If GutenMark.cfg instead specifies the wordlists only by filename, without directory, GutenMark looks for them first in the current directory and then in the directory containing the GutenMark executable. The default configuration file does not contain directory names.
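The lookup rule for wordlists can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not GutenMark's actual code. The function names and the example filename `english.words` are hypothetical.

```python
import os
from typing import List, Optional


def candidate_paths(name: str, cwd: str, exe_dir: str) -> List[str]:
    """List the places to look for a wordlist, in search order."""
    if os.path.dirname(name):
        # A directory was given, so only that exact location is tried.
        return [name]
    # Bare filename: current directory first, then the executable's directory.
    return [os.path.join(cwd, name), os.path.join(exe_dir, name)]


def find_wordlist(name: str, cwd: str, exe_dir: str) -> Optional[str]:
    """Return the first candidate that exists as a file, or None."""
    for path in candidate_paths(name, cwd, exe_dir):
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    # "english.words" is a made-up filename used only for this example.
    for path in candidate_paths("english.words", os.getcwd(), "/opt/gutenmark"):
        print(path)
```

Keeping the search order in its own function makes the rule easy to check without touching the file system.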
Important note: Prior to version 20020721, there was a bug in which wordlists specified in GutenMark.cfg without directories were sought only in the current directory. Therefore, if wordlists were put into the executable directory as recommended by the installation instructions, they could not be found without modifying the default configuration file to include the full pathnames of the wordlists. In versions 20020721 and later, this problem has been fixed.