GutenMark Wordlists Page
Attractively formatting Project Gutenberg texts



Contents

What are wordlists and namelists?
What are they good for?
What wordlists are available?
Massaging the wordlists
Configuring

What are wordlists and namelists?

They are simply lists of words or names in a given language, prepared in a format required by GutenMark.


What are they good for?

In reformatting Project Gutenberg etexts, there are many features of the text that GutenMark has a relatively easy time interpreting, because interpreting them is simply a matter of transforming data present within the etext into another form.  In many other cases, however, the data needed for attractive formatting was discarded (or at least reduced) when the etext was created, and hence is no longer present in the etext.  In these cases, GutenMark has to work a lot harder.  It tries its best to recreate the missing data from whatever general knowledge (not specific to the text) is available.  Some knowledge of this kind can be obtained from wordlists and namelists.

What wordlists are available?

I have not created any wordlist data myself (except as indicated), have no copyrights on the wordlists, and am not in a position to grant licenses for them.  I believe that all of the wordlists available from the GutenMark website should be freely usable (to the extent described below), but if you have information to the contrary, please inform me.  In each case, the data for creating the wordlist was available for free download from the Internet, and was then massaged by software utilities (available in the GutenMark distribution) to transform the wordlist into a GutenMark-compatible format.  This being the case, you could easily download the data from the original source, and process it yourself into the required format.
 
| Description | Original Source | Transformation Utilities (see below) | Apparent Status of Source Data |
|---|---|---|---|
| My own special English words.  (Things that annoyed me by not being in english.words.gz.) | Right here! | n/a | GPL. |
| My own special non-English words.  (Things that annoyed me by not being in the non-English wordlists.) | Right here! | n/a | GPL. |
| U.S. namelist | dist.all.last, dist.female.first, dist.male.first | names_english | This data is from the U.S. Census Bureau, and seemingly available under the Freedom of Information Act. |
| U.S. placenames | Numerous files from U.S. Geological Survey | USGS, sort, NoDups | Public domain. |
| Non-U.S. placenames | Numerous files from the National Imagery and Mapping Agency | NIMA, sort, NoDups | No copyright or licensing restrictions. |
| French namelist | Francais-GUTenberg-v1.0.tar.gz | ispell -e, string2line | GPL |
| English wordlist | ispell-enwl-3.1.20.tar.gz | n/a | Free, but refer to the documentation for restrictions. |
| French wordlist | Francais-GUTenberg-v1.0.tar.gz | ispell -e, string2line | GPL |
| Older, smaller German wordlist, old spelling rules (german.words.gz) | hk2-deutsch.tar.gz | ispell -e, hk2_deutsch | Seemingly free, but I can't be 100% sure from the docs.  This was bundled with my SuSE Linux distribution. |
| Newer, bigger German wordlist, new spelling rules (german2.words.gz) | igerman | ispell -e, hk2_deutsch | GPL. |
| Latin wordlist | dictpage.txt | words197 | Free, though it's hard to infer this with certainty from the docs.  Here is the assurance I received when inquiring directly of the author. |
| Italian wordlist | ispell-it2000.tgz | ispell -e, string2line | GPL |
| Spanish wordlist | espa~nol.tar.gz | ispell -e, espa~nol_filter | GPL |
| Norwegian wordlist | ispell-norsk-2.0.tar.gz | make, norsk | GPL |
| Gaelic wordlist | ispell-gaeilge-1.0.tar.gz | ispell -e, string2line | GPL |
| Danish wordlist | ispell-da-1.4.21.tar.gz | ispell -e, string2line | GPL |
| Swedish wordlist | iswedish-1.2.1.tar.gz | ispell -e, string2line | GPL |
| Finnish wordlist | finnish.dict.bz2, finnish.large.aff.bz2 | ispell -e, string2line | GPL |


Massaging the wordlists

As mentioned above, you don't need to use the wordlists provided with GutenMark.  I provide them simply as a convenience:  alternatively, you could download the original datasets from their creators and massage them with the GutenMark-provided utilities to get the necessary wordlists.  You could even produce completely new GutenMark wordlists for unsupported languages or other purposes.

The format of a GutenMark wordlist is simple:


'  (apostrophe)          173: (soft hyphen)
A  a   192: À   217: Ù   224: à   249: ù
B  b   193: Á   218: Ú   225: á   250: ú
C  c   194: Â   219: Û   226: â   251: û
D  d   195: Ã   220: Ü   227: ã   252: ü
E  e   196: Ä   221: Ý   228: ä   253: ý
F  f   197: Å   222: Þ   229: å   254: þ
G  g   198: Æ   223: ß   230: æ   255: ÿ
H  h   199: Ç            231: ç
I  i   200: È            232: è
J  j   201: É            233: é
K  k   202: Ê            234: ê
L  l   203: Ë            235: ë
M  m   204: Ì            236: ì
N  n   205: Í            237: í
O  o   206: Î            238: î
P  p   207: Ï            239: ï
Q  q   208: Ð            240: ð
R  r   209: Ñ            241: ñ
S  s   210: Ò            242: ò
T  t   211: Ó            243: ó
U  u   212: Ô            244: ô
V  v   213: Õ            245: õ
W  w   214: Ö            246: ö
X  x
Y  y   216: Ø            248: ø
Z  z
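
If the character table above is read as the set of characters a wordlist entry may contain (an assumption on my part, since the surviving text does not say so explicitly), a candidate wordlist could be checked with something like the following Python sketch.  The names ALLOWED and invalid_chars are hypothetical and are not part of GutenMark.

```python
import string

# Assumption: the table lists every character permitted in a wordlist entry:
# apostrophe, soft hyphen (code 173), A-Z, a-z, and codes 192-255
# except 215 and 247, which the table leaves out.
ALLOWED = set("'" + "\u00ad" + string.ascii_letters)
ALLOWED.update(chr(code) for code in range(192, 256) if code not in (215, 247))

def invalid_chars(word):
    """Return the characters of `word` that do not appear in the table."""
    return [ch for ch in word if ch not in ALLOWED]

for candidate in ["Schrödinger", 'Schro"dinger', "don't"]:
    bad = invalid_chars(candidate)
    print(candidate, "ok" if not bad else "rejected: %r" % bad)
```

A check like this is only useful as a quick sanity test before handing a new wordlist to GutenMark.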
       

Unfortunately, the process of creating a wordlist will not be easy for most people, and since it varies from case to case it cannot be described in detail here.  It will be easiest for those with programming experience, and such knowledge is assumed in the next couple of paragraphs.

Most of the existing wordlists were created from language databases for the *nix spell-checker program called "ispell".  Ispell databases don't contain wordlists as such, but do contain word data and so-called "affix" files.  By combining these two with the ispell '-e' command-line switch, a wordlist can be produced.  Some existing ispell databases don't incorporate diacritical marks directly, but expect them to be encoded by some funky sequence of characters.  For example, 'Schrödinger' might appear as 'Schro"dinger'.  GutenMark wordlists must contain the former rather than the latter.
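
As a rough illustration of the kind of clean-up this implies, the Python sketch below replaces vowel-plus-double-quote sequences with the corresponding umlauted characters.  Only the 'o"' case comes from the example above; the rest of the mapping and the function names are assumptions made for illustration, and real ispell databases may use other conventions.  This is not the code of any GutenMark utility.

```python
# Illustrative mapping only.  The text above gives just 'o"' -> 'ö'; the
# other entries assume the same vowel-plus-double-quote convention.
UMLAUTS = {
    'a"': "ä", 'o"': "ö", 'u"': "ü",
    'A"': "Ä", 'O"': "Ö", 'U"': "Ü",
}

def decode_diacritics(word):
    """Replace each encoded sequence in `word` with its accented character."""
    for sequence, accented in UMLAUTS.items():
        word = word.replace(sequence, accented)
    return word

def decode_lines(lines):
    """Decode an iterable of wordlist lines, dropping trailing newlines."""
    return [decode_diacritics(line.rstrip("\n")) for line in lines]

print(decode_diacritics('Schro"dinger'))
```

Words that contain none of the encoded sequences pass through unchanged, so the same function can be run over an entire expanded wordlist.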

For this reason and others, ispell wordlists always need some additional post-processing to be acceptable to GutenMark.  The post-processing for the existing GutenMark wordlists is performed by various little utility programs (listed in a table above) provided by GutenMark.  All of the utilities are simple command-line filters.  For more info, I fear you must look at the actual source code for the utilities.  Fortunately, this source code is quite simple.
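
To give a feel for what a simple command-line filter of this kind looks like, here is a minimal Python sketch that reads words one per line, sorts them, and drops duplicates, which is roughly what the sort and NoDups steps in the table appear to do.  The one-word-per-line input, the code-point sort order, and the function names are all assumptions; the real utilities are separate programs, and their source code remains the authority.

```python
import io
import sys

def sort_unique(lines):
    """Return the non-empty words in `lines`, sorted by code point, without duplicates."""
    words = {line.strip() for line in lines}
    words.discard("")  # ignore blank lines
    return sorted(words)

def run_filter(src, dst):
    """Read words from `src` and write the cleaned list to `dst`, one per line."""
    for word in sort_unique(src):
        dst.write(word + "\n")

# As a real filter this would be called as run_filter(sys.stdin, sys.stdout).
# Here an in-memory buffer stands in for standard output.
demo = io.StringIO()
run_filter(["zebra\n", "apple\n", "zebra\n", "\n"], demo)
print(demo.getvalue(), end="")
```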

It's unlikely that anyone will actually want to create a wordlist.  But if you do, you might want to tell me about it, so that I can post the wordlist here for download.


Configuring

The appropriateness and search order of the various wordlists depend somewhat on the etexts being formatted.  The default search order may not be appropriate for the kinds of etexts you are converting.  For example, your etexts may not be in English, or they may be more likely to contain Latin than French.  You can change the search order by modifying the file GutenMark.cfg.  You can do this in any text editor, and the changes needed will be obvious upon inspection.  The configuration file can contain various named 'profiles'; each profile can incorporate different language wordlists or different search orders for the wordlists, and can designate each directory as being "native" or "foreign."  The desired profile can be chosen with GutenMark command-line switches.

When running GutenMark as a command-line program, GutenMark.cfg is by default located in the directory from which you run GutenMark.  (This can be overridden with the "--config" command-line option in later versions.)  When running GutenMark under the control of GUItenMark, the configuration file is by default in the "GutConfigs" subdirectory of the installation directory.  (This can optionally be overridden within GUItenMark.)  The wordlists are also usually located in the same directory as the program, but need not be if the configuration file is edited appropriately.  If in GutenMark.cfg you specify the wordlists by pathname (i.e., by filename plus directory), GutenMark will look for the wordlist files only in the exact locations you have specified.  On the other hand, if in GutenMark.cfg you specify the wordlists only by filename without directory, GutenMark will look for the wordlists first in the current directory and then in the directory containing the GutenMark executable.  The default configuration file does not contain directory names.
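
The lookup rules just described can be sketched in a few lines of Python.  This is only a model of the documented behavior, not GutenMark's actual code; the function names and the example directories are hypothetical.

```python
import os

def candidate_paths(name, current_dir, exe_dir):
    """Return the locations to try, in order, for one wordlist entry."""
    if os.path.dirname(name):
        # Entry includes a directory: only that exact location is tried.
        return [name]
    # Bare filename: current directory first, then the executable's directory.
    return [os.path.join(current_dir, name), os.path.join(exe_dir, name)]

def find_wordlist(name, current_dir, exe_dir):
    """Return the first candidate that exists as a file, or None."""
    for path in candidate_paths(name, current_dir, exe_dir):
        if os.path.isfile(path):
            return path
    return None

# Hypothetical directories, for illustration only.
for entry in ["english.words.gz", os.path.join("lists", "german.words.gz")]:
    print(entry, "->", candidate_paths(entry, "work", "gutenmark"))
```

Under this model, an entry with a directory component is never searched for anywhere else, which matches the statement that such wordlists are sought only in the exact locations specified.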

Important note:  Prior to version 20020721, there was a bug in which wordlists specified in GutenMark.cfg without directories were sought only in the current directory.  Therefore, if wordlists were put into the executable directory as recommended by the installation instructions, they could not be found without modifying the default configuration file to include the full pathnames of the wordlists.  In versions 20020721 and later, this problem has been fixed.


©2001-2002,2008 Ronald S. Burkey.  Last updated 04/21/2008 by RSB.  Contact me.