GutenMark
Wordlists Page
Attractively formatting Project
Gutenberg texts
Contents
What are wordlists and namelists?
What are they good for?
What wordlists are available?
Massaging the wordlists
Configuring
What are wordlists and namelists?
They are simply lists of words or names in a given language, prepared in
a format required by GutenMark.
What are they good for?
In reformatting Project Gutenberg etexts, there are many features of the
text that GutenMark has a relatively easy time interpreting, because
interpreting them is simply a matter of transforming data present within
the etext into another form. In many other cases, however, the data
needed for attractive formatting has simply been discarded -- or at least,
reduced -- during creation of the etext, and hence is no longer present
in the etext. In this case, GutenMark has to work a lot harder.
It tries its best to recreate this data from whatever general knowledge
(not specific to the text) may be available. Some knowledge
of this kind can be obtained from wordlists and namelists, and can be applied
to the following problems:
-
The ALL-CAPS style of italicizing. [NOTE: GutenMark
can support this feature without wordlists, but using wordlists makes the
support better.] Many PG etexts represent italicized words
in all-capital letters, as in "I can't believe SHE did that!" In
print, this should be rendered as "I can't believe she did that!", with
"she" in italics. Unfortunately, it is not obvious how the emphasized word
should be capitalized, or for that matter, whether it should be italicized
at all. Consider the following examples: "I can't believe JOHN did that!"
and "I can't believe NASA did that!" These should be rendered "I can't
believe John did that!" (with "John" in italics) and "I can't believe NASA
did that!" (unchanged). GutenMark handles this by means of
wordlists/namelists. Since the wordlists
contain the word "she", but not "She" or "SHE", GutenMark knows
that SHE must be converted to she. Since the wordlists contain
"John" but not "JOHN", JOHN is converted to John. Finally,
since the wordlists actually contain the word "NASA", GutenMark
understands that it should be left unchanged.
-
Italicizing foreign words. In English, it is correct (though
it seems to me to be increasingly rare) to italicize foreign words and
phrases, such as "junta" in "the military's junta was unsuccessful."
Obviously, wordlists can be used to locate these non-English words.
-
Restoration of diacritical marks. Most PG etexts discard diacritical
marks. For example, the name "Schrödinger" would normally appear
in a PG etext as "Schrodinger." By consulting the wordlists, GutenMark
can determine the correct form and restore it.
-
Hyphenation. [NOTE: This feature is not yet implemented.]
PG etexts contain so-called "hard carriage-returns" at the ends of the
text lines, which GutenMark is forced to remove in order to re-justify
the paragraphs. If a word in the PG etext was broken across two lines
by hyphenation (a practice that PG does not recommend), then a false hyphen
might appear in the re-justified etext. For example, if one text
line ended with "in-" and the next began with "visible", then the marked-up
HTML would contain "in-visible." GutenMark can examine the
wordlists to determine that "invisible" is a correct form, and that the
hyphen can therefore be removed.
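To make the lookups in the list above concrete, here is a minimal Python sketch. It is not GutenMark's actual code: the function names, the tiny sample list used to try it out, and the treatment of words that appear in no list are all assumptions made purely for illustration.

```python
import unicodedata


def build_index(words):
    """Map each lowercased word to the set of capitalizations listed for it."""
    index = {}
    for word in words:
        index.setdefault(word.lower(), set()).add(word)
    return index


def resolve_all_caps(token, index):
    """Return (text, italicize) for an ALL-CAPS token found in an etext."""
    forms = index.get(token.lower(), set())
    if token in forms:
        # The all-caps spelling is itself listed (e.g. "NASA"): leave it alone.
        return token, False
    if forms:
        # Listed only with another capitalization (e.g. "she", "John"):
        # treat the capitals as emphasis and restore a listed form.
        return sorted(forms)[0], True
    # Assumption: an unlisted word is lowercased and treated as emphasized.
    return token.lower(), True


def join_line_break_hyphen(head, tail, index):
    """Join "in-" + "visible" into "invisible" if the joined word is listed."""
    joined = head.rstrip("-") + tail
    if joined.lower() in index:
        return joined
    return head + tail  # otherwise keep the hyphen


def strip_marks(word):
    """Remove combining diacritical marks, e.g. "ö" becomes "o"."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_diacritic_index(words):
    """Map each stripped spelling to the listed spelling that has the marks."""
    listed = set(words)
    restored = {}
    for word in words:
        plain = strip_marks(word)
        # Skip ambiguous cases where the unmarked spelling is a word too.
        if plain != word and plain not in listed:
            restored.setdefault(plain, word)
    return restored
```

With a sample list of "she", "John", "NASA" and "invisible", `resolve_all_caps` restores SHE and JOHN to the listed capitalization and flags them for italics, while NASA passes through untouched. Letters such as "ø" or "ß" have no combining-mark decomposition, so `strip_marks` leaves them alone; a real implementation would need an explicit table for those.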
What wordlists are available?
I have not created any wordlist data myself (except as indicated),
have no copyrights on the wordlists, and am not in a position to grant
licenses for them. I believe that all of the wordlists available
from the GutenMark website should be freely usable (to the extent
described below), but if you have information to the contrary, please inform
me.
In each case, the data for creating the wordlist was available for free
download from the Internet, and was then massaged by software utilities
(available in the GutenMark distribution) to transform the wordlist
into a GutenMark-compatible format. This being the case, you
could easily download the data from the original source, and process it
yourself into the required format.
Description | Original Source | Transformation Utilities (see below) | Apparent Status of Source Data
My own special English words. (Things that annoyed me by not being in english.words.gz.) | Right here! | n/a | GPL.
My own special non-English words. (Things that annoyed me by not being in the non-English wordlists.) | Right here! | n/a | GPL.
U.S. namelist | dist.all.last, dist.female.first, dist.male.first | names_english | This data is from the U.S. Census Bureau, and seemingly available under the Freedom of Information Act.
U.S. placenames | Numerous files from U.S. Geological Survey | USGS, sort, NoDups | Public domain.
Non-U.S. placenames | Numerous files from the National Imagery and Mapping Agency | NIMA, sort, NoDups | No copyright or licensing restrictions
French namelist | Francais-GUTenberg-v1.0.tar.gz | ispell -e, string2line | GPL
English wordlist | ispell-enwl-3.1.20.tar.gz | n/a | Free, but refer to the documentation for restrictions.
French wordlist | Francais-GUTenberg-v1.0.tar.gz | ispell -e, string2line | GPL
Older, smaller, German wordlist, old spelling rules (german.words.gz) | hk2-deutsch.tar.gz | ispell -e, hk2_deutsch | Seemingly free, but I can't be 100% sure from the docs. This was bundled with my SuSE Linux distribution.
Newer, bigger, German wordlist, new spelling rules (german2.words.gz) | igerman | ispell -e, hk2_deutsch | GPL.
Latin wordlist | dictpage.txt | words197 | Free, though it's hard to infer this with certainty from the docs. Here is the assurance I received when inquiring directly of the author.
Italian wordlist | ispell-it2000.tgz | ispell -e, string2line | GPL
Spanish wordlist | espa~nol.tar.gz | ispell -e, espa~nol_filter | GPL
Norwegian wordlist | ispell-norsk-2.0.tar.gz | make, norsk | GPL
Gaelic wordlist | ispell-gaeilge-1.0.tar.gz | ispell -e, string2line | GPL
Danish wordlist | ispell-da-1.4.21.tar.gz | ispell -e, string2line | GPL
Swedish wordlist | iswedish-1.2.1.tar.gz | ispell -e, string2line | GPL
Finnish wordlist | finnish.dict.bz2, finnish.large.aff.bz2 | ispell -e, string2line | GPL
Massaging the wordlists
As mentioned above, you don't need to use the wordlists provided on the
GutenMark download page. They are provided simply as a convenience for you:
alternatively, you could download the original datasets from their creators
and massage them with GutenMark-provided utilities to get the necessary
wordlists. Or you could even produce completely new GutenMark
wordlists for unsupported languages or other purposes.
The format of a GutenMark wordlist is simple:
-
It is an ASCII text file, which has been compressed with the GNU gzip program.
-
It contains a line for each word. The lines can't contain any whitespace,
or anything other than the word itself.
-
The words should be capitalized as follows: If a word must
be in all-caps, like "NASA", then put it in all-caps. If the word
requires some special capitalization, such as "John" or "MacMurray", then
capitalize it accordingly. For normal words that are usually in lower-case,
but are capitalized at the beginnings of sentences, use all lower-case.
-
The words can contain any character in the following table, but not leading
or trailing apostrophes. The table includes both numerical
codes (for non-ASCII characters) and the characters themselves, but the
characters may or may not appear correctly, depending on your browser and
its settings:
' (apostrophe) | | 173: (soft hyphen) | | |
A | a | 192: À | 217: Ù | 224: à | 249: ù
B | b | 193: Á | 218: Ú | 225: á | 250: ú
C | c | 194: Â | 219: Û | 226: â | 251: û
D | d | 195: Ã | 220: Ü | 227: ã | 252: ü
E | e | 196: Ä | 221: Ý | 228: ä | 253: ý
F | f | 197: Å | 222: Þ | 229: å | 254: þ
G | g | 198: Æ | 223: ß | 230: æ | 255: ÿ
H | h | 199: Ç | | 231: ç |
I | i | 200: È | | 232: è |
J | j | 201: É | | 233: é |
K | k | 202: Ê | | 234: ê |
L | l | 203: Ë | | 235: ë |
M | m | 204: Ì | | 236: ì |
N | n | 205: Í | | 237: í |
O | o | 206: Î | | 238: î |
P | p | 207: Ï | | 239: ï |
Q | q | 208: Ð | | 240: ð |
R | r | 209: Ñ | | 241: ñ |
S | s | 210: Ò | | 242: ò |
T | t | 211: Ó | | 243: ó |
U | u | 212: Ô | | 244: ô |
V | v | 213: Õ | | 245: õ |
W | w | 214: Ö | | 246: ö |
X | x | | | |
Y | y | 216: Ø | | 248: ø |
Z | z | | | |
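The format rules above are mechanical enough to check in a few lines. The following Python sketch validates words against the character table and writes a gzip-compressed list. It rests on two assumptions that are mine and are not stated on this page: that the numerical codes in the table are single-byte ISO-8859-1 values, and that sorting and de-duplicating the words is harmless. The function names are likewise hypothetical.

```python
import gzip

# Assumption: codes such as 192 or 255 are single-byte ISO-8859-1 values.
ENCODING = "latin-1"

# Apostrophe, soft hyphen (173), A-Z, a-z, and codes 192-255 except
# 215 and 247, which are absent from the table.
ALLOWED = set("'\u00ad")
ALLOWED.update(chr(c) for c in range(ord("A"), ord("Z") + 1))
ALLOWED.update(chr(c) for c in range(ord("a"), ord("z") + 1))
ALLOWED.update(chr(c) for c in range(192, 256) if c not in (215, 247))


def is_valid_word(word):
    """True if the word uses only listed characters and has no edge apostrophe."""
    if not word or word.startswith("'") or word.endswith("'"):
        return False
    return all(ch in ALLOWED for ch in word)


def write_wordlist(path, words):
    """Write the valid words, one per line, to a gzip-compressed file."""
    # Sorting and removing duplicates is a choice made by this sketch.
    kept = sorted({w for w in words if is_valid_word(w)})
    with gzip.open(path, "wt", encoding=ENCODING, newline="\n") as out:
        for word in kept:
            out.write(word + "\n")
    return kept


def read_wordlist(path):
    """Read a gzip-compressed wordlist back into a list of words."""
    with gzip.open(path, "rt", encoding=ENCODING) as src:
        return [line.rstrip("\n") for line in src]
```

Note that an ordinary hyphen is rejected, because only the soft hyphen appears in the table. Capitalization is left exactly as supplied, so "NASA", "John" and "she" must already follow the capitalization rules given above.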
Unfortunately, the process of creating a wordlist will not be easy for
most people, and since it varies from case to case it cannot be described
in detail here. It will be easiest for those with programming experience,
and such knowledge is assumed in the next couple of paragraphs.
Most of the existing wordlists were created from language databases
for the *nix spell-checker program called "ispell". (Click here
for more information.) Ispell databases don't contain wordlists as
such, but do contain word data and so-called "affix" files. By combining
these two, with the ispell '-e' command-line switch, a wordlist can be
produced. Some existing ispell databases don't incorporate
diacritical marks directly, but expect them to be encoded by some funky
sequence of characters. For example, 'Schrödinger' might appear as
'Schro"dinger'. GutenMark wordlists must contain the former rather than
the latter.
For this reason and others, ispell wordlists always need some additional
post-processing to be acceptable to GutenMark. The post-processing
for the existing GutenMark wordlists is performed by various little
utility programs (listed in a table above) provided by GutenMark.
All of the utilities are simple command-line filters. For more info,
I fear you must look at the actual source code for the utilities.
Fortunately, this source code is quite simple.
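To give a feel for what such a filter involves, here is a small Python sketch of the kind of post-processing described above. It is not one of the GutenMark utilities. The mapping table is hypothetical and covers only the 'o"' style of encoding mentioned earlier, and I am assuming that the expanded ispell output may carry several whitespace-separated words on one line. Each real dictionary has its own conventions, so check its documentation first.

```python
# Hypothetical mapping: a vowel followed by a double quote stands for an umlaut.
UMLAUTS = {
    'a"': "ä", 'o"': "ö", 'u"': "ü",
    'A"': "Ä", 'O"': "Ö", 'U"': "Ü",
}


def decode_word(word):
    """Replace each coded sequence with the accented character it stands for."""
    for coded, plain in UMLAUTS.items():
        word = word.replace(coded, plain)
    return word


def filter_lines(lines):
    """Yield one decoded word at a time from lines of whitespace-separated words."""
    for line in lines:
        for word in line.split():
            yield decode_word(word)


def run_filter(src, dst):
    """Copy words from src to dst, one per line, decoding as we go."""
    for word in filter_lines(src):
        dst.write(word + "\n")
```

Calling `run_filter(sys.stdin, sys.stdout)` after `import sys` turns this into a command-line filter that could sit at the end of a pipeline.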
It's unlikely that anyone will actually want to create a wordlist.
But if you do, you might want to tell me about it, so that I can post
the wordlist here for download.
Configuring
The appropriateness and search-order of the various wordlists depend somewhat
on the etexts being formatted. In general, you want to search them
in the following order:
-
Namelists for the language the etext is in.
-
Namelists for other languages.
-
Wordlists for the language the etext is in.
-
Wordlists for other languages, in order of decreasing probability of finding
words from that language within the etext.
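The ordering above can be expressed as a simple sort followed by a first-match lookup. The Python sketch below is only an illustration. The dictionary keys ("name", "kind", "language", "words") and the sample list names are invented for the example and say nothing about how GutenMark.cfg or GutenMark itself represents the lists.

```python
def order_wordlists(lists, native_language):
    """Sort list descriptors into the recommended search order."""
    def rank(entry):
        is_wordlist = entry["kind"] == "wordlist"  # namelists sort first
        is_foreign = entry["language"] != native_language
        return (is_wordlist, is_foreign)
    # sorted() is stable, so foreign wordlists keep the caller's order,
    # which should already be by decreasing probability.
    return sorted(lists, key=rank)


def classify(word, ordered_lists, native_language):
    """Return (list name, "native" or "foreign") for the first list with the word."""
    for entry in ordered_lists:
        if word in entry["words"]:
            if entry["language"] == native_language:
                return entry["name"], "native"
            return entry["name"], "foreign"
    return None, "unknown"
```

Because the first match wins, a word found in a native list is never treated as foreign, even if a foreign list also contains it.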
Obviously, the default search order may not be appropriate for the kinds of
etexts you are converting. For example, your etexts may not be in
English, or they may be more likely to contain Latin than French.
You can change the search order by modifying the file GutenMark.cfg.
You can do this in any text editor, and the way in which you have
to change the file will be obvious to you upon inspection. The configuration
file can contain various named 'profiles', and each profile can incorporate
different language wordlists or different search orders for the wordlists,
and can designate each directory as being "native" or "foreign."
The desired profile can be chosen with GutenMark command-line switches.
GutenMark.cfg is generally located in the directory from which you run
GutenMark. (This can be overridden with the "--config" command-line
option in later versions.) The wordlists are also usually in this
same directory, but need not be if the configuration file is edited appropriately.
If in GutenMark.cfg you specify the wordlists by pathname -- i.e., by filename
plus directory -- GutenMark will look for the wordlist files
only in the exact locations you have specified. On the other hand,
if in GutenMark.cfg you specify the wordlists only by filename without
directory, GutenMark will look for the wordlists first in the current
directory and then in the directory containing the GutenMark executable.
The default configuration file does not contain directory names.
Therefore, if you set up your system as recommended -- with the GutenMark
executable, the wordlists, and GutenMark.cfg all in the same directory
-- then the unmodified default configuration file can find all of the wordlists.
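The lookup rule just described can be summarized in a few lines of Python. This is a sketch of the documented behavior only, not GutenMark's source code, and the function names are hypothetical.

```python
import os


def candidate_paths(spec, current_dir, executable_dir):
    """List the places to look for a wordlist named in the configuration."""
    if os.path.dirname(spec):
        # A directory was given: only that exact location is tried.
        return [spec]
    # Bare filename: the current directory first, then the executable's.
    return [
        os.path.join(current_dir, spec),
        os.path.join(executable_dir, spec),
    ]


def find_wordlist(spec, current_dir, executable_dir):
    """Return the first candidate path that exists, or None if none does."""
    for path in candidate_paths(spec, current_dir, executable_dir):
        if os.path.isfile(path):
            return path
    return None
```

With a bare filename such as english.words.gz, a copy in the current directory takes precedence over one beside the executable.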
Important note: Prior to version
20020721, there was a bug in which wordlists specified in GutenMark.cfg
without directories were sought only in the current directory.
Therefore, if wordlists were put into the executable directory as
recommended by the installation instructions, they could not be found without
modifying the default configuration file to include the full pathnames
of the wordlists. In versions 20020721 and later, this problem has
been fixed.
©2001-2002 Ronald S. Burkey. Last updated
07/21/02 by RSB. Contact me.