GutenMark Usage Page
Attractively formatting Project Gutenberg texts



Contents

GutenMark Tutorial
Printing or converting to PDF, via HTML
Printing or converting to PDF, via LaTeX
Manual Tweaking

GutenMark Tutorial

GutenMark is a command-line utility, so you have to use it from the Win32 "MS-DOS Prompt" or from the Linux/UNIX/BSD/MacOS-X command shell:
GutenMark [options] [inputfile [outputfile ]]
For example,
GutenMark tomco10.txt tomco10.html
Other possibilities are to use the program in "filter" style:
GutenMark tomco10.txt > tomco10.html
     or
GutenMark < tomco10.txt > tomco10.html
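The two-argument form lends itself to scripting. As a rough sketch, the shell function below converts every .txt file in a directory. The name gm_convert_dir is hypothetical, and the sketch assumes that GutenMark is on your PATH and that each HTML file should be written next to its source etext.

```shell
# Hypothetical helper: run GutenMark on every *.txt file in a directory.
# Assumes GutenMark is on the PATH; outputs are written next to the inputs.
gm_convert_dir() {
    dir=${1:-.}
    for txt in "$dir"/*.txt; do
        # Skip the literal pattern when the directory has no .txt files.
        [ -e "$txt" ] || continue
        html=${txt%.txt}.html
        GutenMark "$txt" "$html" || echo "GutenMark failed on $txt" >&2
    done
}

# Usage: gm_convert_dir /path/to/etexts
```

A failure on one etext is reported on stderr and the loop moves on to the next file.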
GutenMark is intended to be fully automatic, but there are quite a few command-line options that are available anyway.
 
Option
Description
--config=path
(20020713 and later.)  Allows you to specify an alternate GutenMark configuration file.  GutenMark looks for its configuration file in various locations:
  1. It looks for the file specified by --config, if any; if this fails, then
  2. it looks for GutenMark.cfg in the current directory; if this fails, then
  3. the full pathname of the executable program is modified by appending ".cfg" (and removing ".exe", in Windows).  Usually, the GutenMark program is called "GutenMark" (or "GutenMark.exe" in Windows), so this results in using GutenMark.cfg from the same directory in which the GutenMark executable is stored; if this fails, then
  4. no configuration file is used.  Wordlists and namelists can still be used but there is much less control over them.  Any wordlists/namelists in the no-config-file case must be stored in the current directory, and must have a name like "*.names.gz", "*.places.gz", or "*.words.gz".  Furthermore, wordlists will be processed in whatever order they are found.
The principal use of configuration files is to allow GutenMark to locate wordlists/namelists.  Note that the default configuration file distributed with GutenMark assumes that the wordlists/namelists are in the current directory.  If this is not so, then the configuration file should be modified to indicate exact pathnames for the wordlists/namelists.
--debug
(20011122 and later.)  Creates a log file, GutenMark.log, from which certain internal operations of GutenMark can be examined.  It also causes the files GutenMark.native.gz and GutenMark.foreign.gz to be created; these are wordlists containing only the words that actually appear in the source file.  These supplementary output files are useful only for developers.
--first-capital
(20011209 and later.)  By default, if GutenMark finds the first word of a chapter in ALL-CAPS, it simply fixes the capitalization of the word.  With the '--first-capital' option, it instead allows the first word of the chapter to remain in ALL-CAPS. (However, it does not convert such a word to ALL-CAPS if it is not already.)  Cannot be used with the '--first-italics' switch.
--first-italics
(20011209 and later.)  By default, if GutenMark finds the first word of a chapter in ALL-CAPS, it does not italicize the word as it would with other ALL-CAPS words.  With the '--first-italics' option, it treats the first word of the chapter just like all other words and italicizes it if it had been in ALL-CAPS.   Cannot be used with the '--first-capital' switch.
--force-numeric
--force-symbolic
(20011210 and later.)  You can use this switch if your browser displays special characters (like soft hyphens) in a funky way.  Explanation:  special characters are encoded in HTML either symbolically (such as "&lsquo;" for a left single-quote) or numerically (such as "&#8216;" for a left single-quote).   The symbolic form is easier for people who want to read the raw HTML (rather than just viewing it in a browser) and is perfectly correct -- but browsers are more consistent at supporting the numeric form.  By default, the numeric form is used; the '--force-symbolic' switch changes to the symbolic form instead, and is recommended if you intend to add manual markup to the HTML.  (The '--force-numeric' switch actually has no use in any released version, and is present only for development purposes, but it doesn't hurt to use it.)
--help
Displays a list of the available options.
--latex
(20011126 and later.)  Thanks to Joe Cherry, GutenMark can create LaTeX output as an alternative to HTML.   In early versions this was quite buggy, but in versions 20020616 and later it has become satisfactory for most texts except those with a large amount of verse.
--no-diacritical
(20011125 and later.)  By default, GutenMark restores diacritical marks in words for which there is no native equivalent without diacritical marks.  For example, suppose the word "Fraulein" appears in an English-language etext.  This is not an English word.  In fact, it is not a word in any language.  The correct (German) word is "Fräulein".  This is a systematic problem that appears through almost all PG etexts.  GutenMark will notice this kind of thing, and try to restore the word to its proper form.  (This is a separate issue from italicizing the word as foreign -- see below.)  You can turn this feature off with the '--no-diacritical' command-line switch.
--no-foreign
(20011125 and later.)  By default, GutenMark attempts to italicize foreign words -- i.e., words not in the native language of the etext.  The '--no-foreign' command-line switch turns this feature off.
--no-justify
Outputs paragraphs in ragged-right format.  The default format is right justified.  This option is useful if the htmldoc utility is used to convert HTML to Postscript because htmldoc is (or has been) buggy in regard to right justification.  Or, I guess, if you just prefer ragged-right text. 
--no-mdash
(20011109 and later.)  By default, GutenMark replaces constructs like "--" with an mdash character.  This looks better when printed, but most browsers do a very poor job of rendering mdashes, so that HTML looks better with the original dashes in place.  The "--no-mdash" command-line option turns off the mdash conversion.
--no-parskip
(20020615 and later.)  Used only with "--latex".  By default, paragraphs are not indented but are separated with blank lines.  When "--no-parskip" is used, LaTeX indents the paragraphs and does not separate them with blank lines.
--no-prefatory
(20020118 and later.)  This option does not affect how a browser displays the HTML, but if the HTML is post-processed by html2ps, it causes the entire "prefatory area" to be deleted from the Postscript.
--page-breaks
(20020118 and later.)  This option does not affect how a browser displays the HTML, but if the HTML is post-processed by html2ps, it causes a page break to be inserted prior to each section heading.
--profile=name
(20011122 and later.)  GutenMark uses wordlists and namelists to help it perform various tasks (such as identifying which words are in the native language of the etext and which are foreign).  A configuration file, GutenMark.cfg, lists the wordlists and defines their search ordering and native/foreign status.  The configuration file can contain multiple named profiles, perhaps representing different native languages.  The default profile is named 'english', but alternate profiles can be selected using the '--profile' command-line option.  If the specified profile is not found in the configuration file, GutenMark uses all wordlists and namelists it can find, in the following order:  namelist for name language, all other namelists, wordlist for name language, all other wordlists.  Note that using all wordlists and namelists can be quite time consuming, so defining a custom profile is generally a better idea.  The configuration file, as distributed, contains profiles "english" (using a small set of wordlists), "none" (using no wordlists), and "english_all" (using all wordlists).
--single-space
(20011210 and later.)  By default two blank spaces are used between sentences or after colons, which is standard editorial practice (at least, in American English).  By user request, the '--single-space' command-line switch has been added to reduce this to a single blank space instead.
--yes-header
(20011112 and later.)  By default, GutenMark removes Project Gutenberg's file header from the HTML output, in order to ensure conformance with PG requirements.  The "--yes-header" command-line option causes the PG header to be retained.  You need to read the PG header and evaluate for yourself whether retention of the header is legal or desirable for your application.  (Removal of the header is guaranteed to be legal.)
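
The configuration-file search order described under --config can be sketched in shell.  This only illustrates the documented order; it is not GutenMark's own code.  The function name gm_find_config and its two arguments are hypothetical.

```shell
# Hypothetical helper: print the configuration file that the documented
# search order would select, or nothing if none is found.
#   $1 = value given to --config (may be empty)
#   $2 = full pathname of the GutenMark executable
gm_find_config() {
    cfg_opt=$1
    exe=$2
    # 1. The file named by --config, if any.
    if [ -n "$cfg_opt" ] && [ -f "$cfg_opt" ]; then
        printf '%s\n' "$cfg_opt"
    # 2. GutenMark.cfg in the current directory.
    elif [ -f ./GutenMark.cfg ]; then
        printf '%s\n' ./GutenMark.cfg
    # 3. Executable pathname with ".exe" removed and ".cfg" appended.
    elif [ -f "${exe%.exe}.cfg" ]; then
        printf '%s\n' "${exe%.exe}.cfg"
    fi
    # 4. Otherwise print nothing: no configuration file is used.
}

# Usage: gm_find_config "" /usr/local/bin/GutenMark
```

An empty result corresponds to the no-config-file case, in which only wordlists/namelists in the current directory are picked up.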


Printing or converting to PDF, via HTML

Another thing you might want to do, of course, is to make a hardcopy of the reformatted etext.  You can do this by printing directly from your browser, but the typical browser does not do a great job of making the HTML (however well it has been created) print like a book.  Several options are available, such as loading the HTML into Microsoft Word, and printing it from there.  A better method is to use one of the freely available  HTML-to-Postscript conversion utilities to create a Postscript or PDF version of the book.  This is, perhaps, easier if you are a Linux/BSD user than if you are a Windows user.  To create the various PDF samples that appear on this website, I used (on Linux) the free utility html2ps, along with various custom configuration files that you can get from my download page.

Here is what the complete sequence of steps looked like, in Linux, for converting the sample etext to PDF format:

# Create HTML from the PG etext.
GutenMark bldhb10.txt bldhb10.html
# Create 8.5"x5.5" Postscript, hyphenated, from the HTML.
html2ps -H -f half12schoolbook.rc bldhb10.html > bldhb10.ps
# Create PDF from the Postscript.
ps2pdf bldhb10.ps
Or, in Linux, we could simply have printed it rather than creating PDF, by replacing the final command with
# Print the Postscript file.
lpr bldhb10.ps
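
If you convert more than one etext, the three steps can be wrapped in a small function.  This is only a sketch: the name gm_html_to_pdf is hypothetical, it assumes that GutenMark, html2ps, and ps2pdf are all on your PATH, and it passes ps2pdf an explicit output name so the PDF lands beside the Postscript file.

```shell
# Hypothetical wrapper around the three steps above.
#   $1 = etext name without extension, e.g. bldhb10
#   $2 = html2ps configuration file (default: half12schoolbook.rc)
gm_html_to_pdf() {
    base=$1
    rc=${2:-half12schoolbook.rc}
    # Stop at the first failing step so a broken intermediate file is not reused.
    GutenMark "$base.txt" "$base.html" &&
    html2ps -H -f "$rc" "$base.html" > "$base.ps" &&
    ps2pdf "$base.ps" "$base.pdf"
}

# Usage: gm_html_to_pdf bldhb10
```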
Another interesting thing you can do is to print in booklet format -- two pages on the front and two pages on the back of standard letter-sized paper, with the pages reordered so the whole mess can be folded or cut into half-letter sized pages.  This can be done with the freely available PSUtils tools.  In Linux, you'd replace the ps2pdf step with this:
# Form the Postscript pages into a "signature":
psbook bldhb10.ps signature.ps
# Combine the pages 2-up.
pstops "2:0L@1.0(8.5in,0)+1L@1.0(8.5in,5.5in)" signature.ps booklet.ps
# Pull off the odd-numbered 2-up pages, in reverse order.
psselect -o -r booklet.ps frontsides.ps
# Pull off the even-numbered 2-up pages, in normal order.
psselect -e booklet.ps backsides.ps
# Print it.
lpr frontsides.ps
# ... feed the paper back into the printer, then print the back sides ...
lpr backsides.ps
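
To see why the pages must be reordered, it helps to print the conventional page order for a single folded signature.  The sketch below computes that order in plain shell arithmetic.  The function name booklet_order is hypothetical, and I am assuming (not verifying) that psbook pads to a multiple of four pages and orders a single signature in this way.

```shell
# Hypothetical illustration: print the page order for one folded signature.
# Page numbers larger than the real page count stand for blank padding pages.
booklet_order() {
    pages=$1
    # Round the page count up to a multiple of 4 (one sheet holds 4 pages).
    total=$(( (pages + 3) / 4 * 4 ))
    lo=1
    hi=$total
    order=""
    # Each sheet takes two pages from the end and two from the beginning.
    while [ "$lo" -lt "$hi" ]; do
        order="$order $hi $lo $((lo + 1)) $((hi - 1))"
        lo=$((lo + 2))
        hi=$((hi - 2))
    done
    printf '%s\n' "${order# }"
}

# Usage: booklet_order 8    prints: 8 1 2 7 6 3 4 5
```

Reading the output in pairs gives the 2-up sides: for eight pages, 8|1 and 2|7 are the front and back of the outer sheet, and 6|3 and 4|5 are the front and back of the inner sheet.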

Printing or converting to PDF, via LaTeX

While LaTeX-based output from GutenMark lags that of HTML-based output in terms of maturity, in versions 20020616 and later it has become pretty satisfactory.  And because the ability to produce visually attractive printouts or PDF from LaTeX is so much better than from HTML, it is definitely worth considering the use of the LaTeX option rather than the HTML default.

While there are free tools for TeX/LaTeX processing in Windows, I am sadly not familiar with them.  I would be happy to hear from anyone with advice in this area.

In Linux, fortunately, it quite often happens that TeX/LaTeX tools are provided free with the Linux distribution, and perhaps even automatically installed.  Here, for example, is what the processing for the sample etext looks like on my iMac running SuSE Linux:

# Convert the etext to LaTeX format.
GutenMark/LinuxPPC/GutenMark --latex bldhb10.txt bldhb10.tex
# Convert the LaTeX to PDF format.
pdflatex bldhb10.tex
The various other tricks mentioned in the previous section (creating booklets and so forth) can be accomplished quite easily, simply by converting the PDF to Postscript and vice-versa.
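
As a sketch of that round trip, the function below goes from etext to LaTeX to PDF and then back to Postscript, ready for the PSUtils steps of the previous section.  The name gm_latex_to_ps is hypothetical, and the use of pdf2ps (a script shipped with Ghostscript) is my assumption; any PDF-to-Postscript converter should serve.

```shell
# Hypothetical wrapper: etext -> LaTeX -> PDF -> Postscript.
# Run it from the directory containing the etext, because pdflatex writes
# its output into the current directory.
#   $1 = etext name without extension, e.g. bldhb10
gm_latex_to_ps() {
    base=$1
    GutenMark --latex "$base.txt" "$base.tex" &&
    pdflatex "$base.tex" &&
    # Assumption: pdf2ps is available; it is not named in the text above.
    pdf2ps "$base.pdf" "$base.ps"
}

# Usage: gm_latex_to_ps bldhb10
```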


Manual Tweaking

GutenMark aims to provide a completely automatic system for formatting Project Gutenberg etexts.  At the same time, GutenMark is not perfect.  Consequently, depending on your purpose in creating the formatted texts, you may wish to improve the results with some manual tweaking of the HTML or LaTeX.

How to perform this tweaking is a matter of taste.  While it is possible to edit either HTML or LaTeX in an ordinary text-editing program, it is certainly much easier to do so in a WYSIWYG editor.  Quite often, all that is required is a quick scan through the text to catch anything that really leaps out as objectionable.   For editing HTML, I personally prefer freeware like Netscape/Mozilla.  But there are many available choices, obviously, including even Microsoft Word.

[Screenshot of bldhb10.tex in LyX.]

For tweaking LaTeX, I'd suggest the free program LyX (www.lyx.org), which is a near-WYSIWYG editing environment.  I've provided a screenshot -- which you can enlarge -- of LyX being used to edit LaTeX created from the sample etext.

With that in mind, here's a list of things that I find objectionable in GutenMark HTML output, roughly in descending order of importance.  I would hazard a guess that only the first two items are truly objectionable to most people.

  1. GutenMark does not produce a title page, copyright notice, etc.
  2. GutenMark is not perfect at deducing section headings.  The most common problem is lines that are falsely marked as headings when they are actually normal text.  This does not happen in most documents, but does happen in some documents.
  3. GutenMark is not perfect at distinguishing between prose and verse.  This can result in verse that is falsely formatted as a justified paragraph or, more commonly, as a ragged-right prose paragraph with shorter-than-average lines.  This commonly happens only a few times within a document, and is often not noticeable to the average reader.  With LaTeX, verse is more objectionable in appearance than with HTML.
  4. GutenMark is not perfect at distinguishing between native-language text and foreign text.  This commonly manifests itself either as proper names that are incorrectly identified as foreign words (and hence are italicized), or else as individual words in foreign phrases that are not identified as being foreign.  The latter problem results in occasional multi-word italicized foreign phrases having a few words that are not italicized.
A note for those who want to work directly with the HTML, rather than through a WYSIWYG editor:  the earliest versions of GutenMark produced quite ugly (and sometimes invalid) HTML.  Later versions (say, 20011216 and later) do a much better job of producing HTML readable by humans.  If you want to work directly with the HTML, consider using GutenMark's --force-symbolic command-line switch.
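
When scanning the raw HTML for the problems listed above, a couple of grep one-liners can narrow the search.  The helpers below are only a sketch with hypothetical names.  They assume that headings are marked with the standard h1 through h6 elements and italics with i or em elements, which you should confirm against a sample of your own output.  Italics that span a line break are not caught.

```shell
# Hypothetical review helpers for GutenMark's HTML output.
# Assumption: headings use <h1>..<h6> and italics use <i> or <em>.

# List every line containing a heading tag, with its line number,
# so falsely marked headings can be spotted quickly.
gm_list_headings() {
    grep -n -i -E '<h[1-6][ >]' "$1"
}

# Count each distinct italicized phrase, most frequent first, so proper
# names wrongly treated as foreign words stand out.
gm_list_italics() {
    grep -o -i -E '<(i|em)>[^<]*</(i|em)>' "$1" |
        sed -E 's/<[^>]*>//g' |
        sort | uniq -c | sort -rn
}

# Usage: gm_list_headings bldhb10.html
#        gm_list_italics bldhb10.html
```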


©2001-2002 Ronald S. Burkey.  Last updated 07/13/02 by RSB.