GutenMark
Features
Attractively formatting Project Gutenberg texts
Here are some of the things GutenMark does when converting etexts:
- Tries to deduce the title and author.
- Identifies the Project Gutenberg "fine print" header and, by default,
removes it. At your option, it can instead retain the header, but it
does not attempt to reformat it. A retained header appears in a
fixed-width font, unlike the remainder of the text.
- Usually, a PG etext begins with items like title pages, tables of
contents, notes from the person who created the etext, and so forth.
These materials differ in format from etext to etext, and follow no
obvious rules. GutenMark tries to identify this section, which it titles
"Prefatory Materials", and performs only minor reformatting on it.
- Adds "smart quotes".
- Adds headings to chapters, sections, etc.
- Identifies paragraphs and joins together the lines of each paragraph,
so that word wrapping can be used. Paragraphs are right justified by
default.
- Distinguishes word-wrapped areas from verse.
- PG etexts are highly inconsistent in their handling of italicized
text. Many etexts simply discard that information. Others mark
italicized text in some way, but the marking differs from etext to
etext, or even within a single text. All PG or newsgroup italicizing
styles I'm aware of are handled:
  - _italicized_
  - <i>italicized</i>
  - /italicized/
  - ~italicized~
  - ~~italicized~~
  - <italicized>
  - *italicized*
  - _/italicized/_
  - _*italicized*_
  - */italicized/*
  - _*/italicized/*_
  - /:italicized:/
  - |:italicized:|
  - ITALICIZED
- GutenMark automatically italicizes certain words like "etc.", "viz.",
"i.e.", and so on. When wordlists are used, it by default italicizes all
words that it can identify as being in a foreign language (that is, a
language other than the native language of the etext), with some
exceptions such as proper names.
- When wordlists with built-in soft hyphens are used (presently, only
the Norwegian wordlist), text can be automatically hyphenated when (or
if) the HTML is converted to PostScript. Alternatively, post-processing
software (like html2ps) may be able to use TeX hyphenation files.
- Locates ends of sentences and colons, so that they can be followed by
two spaces rather than one. Automatically recognizes that honorifics, as
in "Mr. Smith", aren't ends of sentences, and that sentences may be in
quotations. It also recognizes that constructs like "929 N. Durello" are
not the ends of sentences.
- Handles dangling hyphens at the ends of lines, so that they are not
followed by spurious spaces.
- Can usually mark up centered lines. (Though Project Gutenberg frowns
on centered text, a lot of folks use it anyhow.)
- There are no practical limitations in terms of file sizes.
- Only a minuscule subset of HTML is used, so the marked-up files should
have maximum portability.
- Traditionally, PG etexts have used so-called "7-bit" ASCII, but lately
a number of "8-bit" ASCII texts have shown up. These 8-bit files more
accurately represent the diacritical marks found in non-English texts.
For example, 'ü' in an 8-bit etext shows up merely as 'u' in a 7-bit
etext. GutenMark is able to handle both.
- GutenMark can also, to some extent, restore the diacritical marks
which are not present at all in 7-bit ASCII etexts. For example, if it
encounters the word "role" in a 7-bit English-language ASCII text, the
word will be converted to "rôle".
- LaTeX support has been added, providing an alternative to HTML
output.
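
To make the italic-marker item in the list above more concrete, here is
a minimal Python sketch of how paired markers such as _word_, /word/ or
|:word:| could be normalized to HTML italics. It is not GutenMark's own
code, and it does not claim to mirror GutenMark's heuristics. The names
MARKER_PAIRS, CONTENT and italicize are hypothetical, and the rule for
what may appear inside an italic span is an assumption made for
illustration. The <i>...</i> form is already HTML and is left alone, and
the ambiguous <italicized> and ALL-CAPS styles are deliberately skipped.

```python
import re

# Paired markers from the list above, longest first, so that
# "_*/word/*_" is consumed before the shorter "_word_" or "/word/" forms.
MARKER_PAIRS = [
    ("_*/", "/*_"),
    ("_/", "/_"),
    ("_*", "*_"),
    ("*/", "/*"),
    ("/:", ":/"),
    ("|:", ":|"),
    ("~~", "~~"),
    ("_", "_"),
    ("/", "/"),
    ("~", "~"),
    ("*", "*"),
]

# Assumption (not from GutenMark): an italic span starts and ends with a
# letter or digit, and otherwise holds only letters, digits, spaces,
# commas, apostrophes and hyphens.
CONTENT = r"([^\W_](?:(?:[^\W_]|[ ,'-])*[^\W_])?)"


def _compile(opener, closer):
    # The opener must not directly follow a word character or "<", and the
    # closer must not be directly followed by a word character. This keeps
    # "either/or", "snake_case" and already-emitted "</i>" tags untouched.
    return re.compile(
        r"(?<![\w<])" + re.escape(opener) + CONTENT + re.escape(closer) + r"(?!\w)"
    )


PATTERNS = [_compile(opener, closer) for opener, closer in MARKER_PAIRS]


def italicize(text):
    """Replace each recognized marker pair with <i>...</i>."""
    for pattern in PATTERNS:
        text = pattern.sub(r"<i>\1</i>", text)
    return text
```

With these assumptions, italicize("He said ~~yes~~ and /no/.") returns
"He said <i>yes</i> and <i>no</i>.", while "either/or" passes through
unchanged. Trying the longest marker pairs first is the main design
choice: it prevents a compound marker such as _*/word/*_ from being
split into nested italics by the shorter patterns.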
©2001-2002,2008 Ronald S. Burkey.
Last updated 06/01/2008. Contact me.