GutenMark Command-Line Usage Page
Attractively formatting Project Gutenberg texts

GutenMark Tutorial
Printing or converting to PDF, via HTML
Printing or converting to PDF, via LaTeX

Overview
Installation of LaTeX Tools and Fonts in Windows
Using the LaTeX Tools and Fonts
The Thing About LaTeX ...

Manual Tweaking

GutenMark Tutorial

GutenMark is a command-line utility, so you have to use it from the Win32 "MS-DOS Prompt" or from the Linux or Mac OS X command shell:

GutenMark [options] [inputfile [outputfile]]

(Depending on your computer's setup, it may be necessary to say "./GutenMark" rather than just "GutenMark".)

For example,

GutenMark tomco10.txt tomco10.html

Other possibilities are to use the program in "filter" style:

GutenMark tomco10.txt > tomco10.html
or
GutenMark < tomco10.txt > tomco10.html
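
When several etexts need converting, the two-filename form shown above can be wrapped in a loop. The following is only a sketch: the function name convert_all and the GUTENMARK environment variable are assumptions made for this illustration and are not part of GutenMark itself.

```shell
# Hypothetical helper (not part of GutenMark): convert every .txt file
# in a directory to .html, using the "GutenMark input output" form.
# GUTENMARK is an assumed override, e.g. GUTENMARK=./GutenMark, or
# GUTENMARK=echo to preview the commands without running them.
convert_all() {
    dir=${1:-.}
    for src in "$dir"/*.txt; do
        [ -e "$src" ] || continue        # no .txt files: the glob stays literal
        out=${src%.txt}.html             # tomco10.txt -> tomco10.html
        "${GUTENMARK:-GutenMark}" "$src" "$out" || return 1
    done
}
```

Calling "convert_all ." processes the current directory. The loop stops at the first file for which the program reports failure, so that a problem is noticed rather than buried in a long run.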

GutenMark is intended to be fully automatic, but there are quite a few command-line options that are available anyway.

Option	Description
--author=name	(20020808 and later.) Allows you to specify the name of the author of the etext. (By default, GutenMark will try to deduce the author's name from the etext itself.) If the author's name contains any blank space (and I bet it does), then you'll want to quote it. For example: GutenMark "--author=Mark Twain" tsawyer.txt tsawyer.html. This is, perhaps, a more interesting option for LaTeX output than for HTML output.
--caps-ok	(20020809 and later.) By default, GutenMark automatically converts ALL-CAPS to italics (e.g., ALL-CAPS becomes all-caps in italics). Many other types of markup are converted to italics as well, of which the best-known is probably _underscores_ (which becomes underscores in italics). These various methods can be intermixed within a document. However, it is undesirable to mix-and-match ALL-CAPS with the other methods, because if explicit markup such as _underscore_ is present, it is usually intended that ALL-CAPS remain ALL-CAPS. GutenMark has a rudimentary means of determining for itself when this situation arises, and automatically suppresses the conversion of ALL-CAPS. However, this detection method is not perfect, and you can manually eliminate the ALL-CAPS conversion by using the "--caps-ok" command-line switch.
--config=path	(20020713 and later.) Allows you to specify an alternate GutenMark configuration file. GutenMark looks for its configuration file in the following order: (1) the file specified by --config, if any; (2) GutenMark.cfg in the current directory; (3) the full pathname of the executable program with ".cfg" appended (and ".exe" removed, in Windows). Usually, the GutenMark program is called "GutenMark" (or "GutenMark.exe" in Windows), so the third step results in using GutenMark.cfg from the same directory in which the GutenMark executable is stored. If all of these fail, then no configuration file is used. Wordlists and namelists can still be used in that case, but there is much less control over them: they must be stored in the current directory, must have a name like ".names.gz", ".places.gz", or ".words.gz", and will be processed in whatever order they are found. The principal use of configuration files is to allow GutenMark to locate wordlists/namelists. Note that the default configuration file distributed with GutenMark assumes that the wordlists/namelists are in the current directory. If this is not so, then the configuration file should be modified to indicate exact pathnames for the wordlists/namelists. Important note: For installations performed by one of the installer programs (version April 2008 or later), the executables are stored in the "binary" subdirectory of the installation directory, while the configuration file is installed in the "GutConfigs" subdirectory of the installation directory. For such an installation, use of the "--config" switch is therefore mandatory when running GutenMark from the command line, since the configuration file is no longer present in any of the directories searched by default.
--debug	(20011122 and later.) Creates a log file, GutenMark.log, from which certain internal operations of GutenMark can be examined. It also causes the files GutenMark.native.gz and GutenMark.foreign.gz to be created; these are wordlists containing only the words that actually appear in the source file. These supplementary output files are useful only for developers. Note: The log files are created in the directory containing the source text.
--first-capital	(20011209 and later.) By default, if GutenMark finds the first word of a chapter in ALL-CAPS, it simply fixes the capitalization of the word. With the '--first-capital' option, it instead allows the first word of the chapter to remain in ALL-CAPS. (However, it does not convert such a word to ALL-CAPS if it is not already.) Cannot be used with the '--first-italics' switch.
--first-italics	(20011209 and later.) By default, if GutenMark finds the first word of a chapter in ALL-CAPS, it does not italicize the word as it would with other ALL-CAPS words. With the '--first-italics' option, it treats the first word of the chapter just like all other words and italicizes it if it had been in ALL-CAPS. Cannot be used with the '--first-capital' switch.
--force-numeric --force-symbolic	(20011210 and later.) You can use this switch if your browser displays special characters (like soft hyphens) in a funky way. Explanation: special characters are encoded in HTML either symbolically (such as "&lsquo;" for a left single-quote) or numerically (such as "&#8216;" for a left single-quote). The symbolic form is easier for people who want to read the raw HTML (rather than just viewing it in a browser) and is perfectly correct, but browsers are more consistent at supporting the numeric form. By default, the numeric form is used; the '--force-symbolic' switch changes to the symbolic form instead, and is recommended if you intend to add manual markup to the HTML. (The '--force-numeric' switch actually has no use in any released version, and is present only for development purposes, but it doesn't hurt to use it.)
--help	Displays a list of the available options.
--latex	(20011126 and later.) Thanks to Joe Cherry, GutenMark can create LaTeX output as an alternative to HTML. In early versions this was quite buggy, but in versions 20020616 and later it has become quite satisfactory for most texts, except those with a large amount of verse.
--latex-sections	(20020805 and later.) Normally, each heading detected by GutenMark is treated as the beginning of a chapter (with the LaTeX "\chapter" markup). If '--latex-sections' is used, then they are treated as sections (LaTeX "\section") instead. An additional distinction is that for chapters, GutenMark adds markup to change the page headers on a chapter-by-chapter basis, whereas for sections it does not. For these reasons, chapters (the default) are more useful if the text is subdivided into a small number of large chunks, whereas sections are more useful if the text is subdivided into a large number of small chunks. The LaTeX may be manually edited afterward, if desired, to group sections into chapters. Of course, it would be more useful if GutenMark could recognize these distinctions for itself, and intermix chapter headings and section headings within the same document. Perhaps someday it will be able to do so.
--mdash-size=N	(20021122 and later. Updated 20030105.) The LaTeX construct for an mdash is "---". With the --mdash-size switch, an mdash is changed to any number of hyphens that you like. However, with any value other than 3, you are going against what LaTeX wants, and therefore may experience problems associated with inappropriate line-breaks and/or poor addition of whitespace.
--no-diacritical	(20011125 and later.) By default, GutenMark restores diacritical marks in words for which there is no native equivalent without diacritical marks. For example, suppose the word "Fraulein" appears in an English-language etext. This is not an English word. In fact, it is not a word in any language. The correct (German) word is "Fräulein". This is a systematic problem that appears throughout almost all PG etexts. GutenMark will notice this kind of thing, and try to restore the word to its proper form. (This is a separate issue from italicizing the word as foreign; see below.) You can turn this feature off with the '--no-diacritical' command-line switch.
--no-foreign	(20011125 and later.) By default, GutenMark attempts to italicize foreign words—i.e., words not in the native language of the etext. The '--no-foreign' command-line switch turns this feature off.
--no-justify	Outputs paragraphs in ragged-right format. The default format is right justified. This option is useful if the htmldoc utility is used to convert HTML to Postscript because htmldoc is (or has been) buggy in regard to right justification. Or, I guess, if you just prefer ragged-right text.
--no-mdash	(20011109 and later.) By default, GutenMark replaces constructs like "--" with an mdash character. This looks better when printed, but most browsers do a very poor job of rendering mdashes, so that HTML looks better with the original dashes in place. The "--no-mdash" command-line option turns off the mdash conversion.
--no-parskip	(20020615 and later.) Used only with "--latex". By default, paragraphs are not indented but are separated with blank lines. When "--no-parskip" is used, LaTeX indents the paragraphs and does not separate them with blank lines.
--no-prefatory	(20020118 and later.) This option does not affect the HTML as far as a browser is concerned, but if the HTML is post-processed by html2ps, it causes the entire "prefatory area" to be deleted from the Postscript.
--no-toc	(20020808 and later.) Removes the table of contents that is otherwise added to LaTeX output by default.
--page-breaks	(20020118 and later.) This option does not affect the HTML as far as a browser is concerned, but if the HTML is post-processed by html2ps, it causes a page break to be inserted prior to each section heading.
--profile=name	(20011122 and later.) GutenMark uses wordlists and namelists to help it perform various tasks (such as identifying which words are in the native language of the etext and which are foreign). A configuration file, GutenMark.cfg, lists the wordlists and defines their search ordering and native/foreign status. The configuration file can contain multiple named profiles, perhaps representing different native languages. The default profile is named 'english', but alternate profiles can be selected using the '--profile' command-line option. If the specified profile is not found in the configuration file, GutenMark uses all wordlists and namelists it can find, in the following order: the namelist for the language given by name, all other namelists, the wordlist for the language given by name, all other wordlists. Note that using all wordlists and namelists can be quite time consuming, so defining a custom profile is generally a better idea. The configuration file, as distributed, contains profiles "english" (using a small set of wordlists), "none" (using no wordlists), and "english_all" (using all wordlists).
--ron	(20021225 and later, LaTeX only.) Groups various settings that I (RSB) find personally useful. Includes --latex, --no-parskip, and --mdash-size=3. Also provides various settings that normally you're expected to provide by manually editing the LaTeX: it sets the page size to 5.5"x8.5", the margins to 0.75" (but greater along the spine for alternating even/odd pages), and the font to New Century Schoolbook.
--single-space	(20011210 and later.) By default two blank spaces are used between sentences or after colons, which is standard editorial practice (at least, in American English). By user request, the '--single-space' command-line switch has been added to reduce this to a single blank space instead.
--title=name	(20020808 and later.) Allows you to specify the title of the etext. (By default, GutenMark will try to deduce the title from the etext itself.) If the title contains any blank space (and I bet it does), then you'll want to quote it. For example: GutenMark "--title=Tom Sawyer" tsawyer.txt tsawyer.html. This is, perhaps, a more interesting option for LaTeX output than for HTML output.
--yes-header	(20011112 and later.) By default, GutenMark removes Project Gutenberg's file header from the HTML output, in order to ensure conformance with PG requirements. The "--yes-header" command-line option causes the PG header to be retained. You need to read the PG header and evaluate for yourself whether retention of the header is legal or desirable for your application. (Removal of the header is guaranteed to be legal.)
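
The configuration-file search order described under --config can be easier to follow when written out. The sketch below only mirrors that description in shell; the function names config_candidates and first_config are invented for this illustration, and GutenMark's own internal logic may differ in detail.

```shell
# Illustration only: these helpers are hypothetical, not part of GutenMark.
# Print the configuration-file candidates in the documented search order.
#   $1 = value given to --config (may be empty)
#   $2 = full pathname of the GutenMark executable
config_candidates() {
    explicit=$1
    exe=$2
    if [ -n "$explicit" ]; then
        printf '%s\n' "$explicit"          # 1. the --config file, if any
    fi
    printf '%s\n' "./GutenMark.cfg"        # 2. GutenMark.cfg in the current directory
    printf '%s\n' "${exe%.exe}.cfg"        # 3. executable path, ".exe" removed, ".cfg" appended
}

# Print the first candidate that exists as a file; print nothing if
# none exists (the "no configuration file" case).
first_config() {
    config_candidates "$1" "$2" | while IFS= read -r candidate; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            break
        fi
    done
}
```

For an installer-based setup, the executable lives in "binary" and the configuration file in "GutConfigs", so neither the second nor the third candidate exists unless you run GutenMark from a directory that contains a GutenMark.cfg. That is why --config must be given explicitly in that case.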

Printing or converting to PDF, via HTML

Another thing you might want to do, of course, is to make a hardcopy of the reformatted etext. You can do this by printing directly from your browser, but the typical browser does not do a great job of making the HTML (however well it has been created) print like a book. Several options are available, such as loading the HTML into Microsoft Word, and printing it from there. A better method is to use one of the freely available HTML-to-Postscript conversion utilities to create a Postscript or PDF version of the book. This is, perhaps, easier if you are a Linux/BSD user than if you are a Windows user. To create the various PDF samples that appear on this website, I used (on Linux) the free utility html2ps, along with various custom configuration files that you can get from my download page.

Here is what the complete sequence of steps looked like, in Linux, for converting the sample etext to PDF format:

# Create HTML from the PG etext.
GutenMark bldhb10.txt bldhb10.html
# Create 8.5"x5.5" Postscript, hyphenated, from the HTML.
html2ps -H -f half12schoolbook.rc bldhb10.html > bldhb10.ps
# Create PDF from the Postscript.
ps2pdf bldhb10.ps

Or, in Linux, we could simply have printed it rather than creating PDF, by replacing the final command with

# Print the Postscript file.
lpr bldhb10.ps

Another interesting thing you can do is to print in booklet format -- two pages on the front and two pages on the back of standard letter-sized paper, with the pages reordered so the whole mess can be folded or cut into half-letter sized pages. This can be done with the freely available PSUtils tools. In Linux, you'd replace the ps2pdf step with this:

# Form the Postscript pages into a "signature":
psbook bldhb10.ps signature.ps
# Combine the pages 2-up.
pstops "2:0L@1.0(8.5in,0)+1L@1.0(8.5in,5.5in)" signature.ps booklet.ps
# Pull off the odd-numbered 2-up pages, in reverse order.
psselect -o -r booklet.ps frontsides.ps
# Pull off the even-numbered 2-up pages, in normal order.
psselect -e booklet.ps backsides.ps
# Print it.
lpr frontsides.ps
# ... feed the paper back into the printer ...
lpr backsides.ps
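
To see why the pages have to be reordered before being combined 2-up, it may help to work out the order for a small booklet. The function below is a sketch of the usual single-signature ordering (last page beside the first, second beside the next-to-last, and so on). The name booklet_order is invented for this illustration, and psbook's actual output may differ in details such as how it handles multiple signatures.

```shell
# Illustration only: print the page order for one folded signature of N pages.
# N is rounded up to a multiple of 4; padding pages are shown as "blank".
booklet_order() {
    n=$1
    total=$(( (n + 3) / 4 * 4 ))
    lo=1
    hi=$total
    order=""
    while [ "$lo" -lt "$hi" ]; do
        # Front of a sheet: last remaining page, then first remaining page.
        # Back of the same sheet: the next page, then the next-to-last page.
        for p in "$hi" "$lo" $((lo + 1)) $((hi - 1)); do
            if [ "$p" -gt "$n" ]; then p=blank; fi
            order="$order $p"
        done
        lo=$((lo + 2))
        hi=$((hi - 2))
    done
    echo "${order# }"
}

booklet_order 8    # prints: 8 1 2 7 6 3 4 5
```

Read in pairs, the result for 8 pages gives the two sides of each sheet: 8 and 1 on the front of the first sheet, 2 and 7 on its back, 6 and 3 on the front of the second sheet, and 4 and 5 on its back.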

Printing or converting to PDF, via LaTeX

Overview

While LaTeX-based output from GutenMark lags that of HTML-based output in terms of maturity, in versions 20020616 and later it has become pretty satisfactory. And because the ability to produce visually attractive printouts or PDF from LaTeX is so much better than from HTML, it is definitely worth considering the use of the LaTeX option rather than the HTML default.

Around 200 Project Gutenberg etexts which have been converted into a pretty, printable format by means of GutenMark's LaTeX abilities can be seen here. Be aware, though, that some of these texts are considerably more complex than the typical PG text, and therefore required non-trivial manual editing to appear in the form you see them. For an expert LaTeX user, the amount of manual editing probably would range from less than 5 minutes to about an hour for the linked texts. But no matter how you look at it, the LaTeX approach is more suitable for someone who wants to create a permanent archive of high-quality texts than for someone who wants to quickly throw together something to read on a Palm Pilot. A pseudo-WYSIWYG editor such as LyX is very helpful in this process and is much easier for the novice than directly editing the LaTeX files. See www.lyx.org. LyX is available for free, for Linux, Windows, and Mac OS X.

In Linux, fortunately, it quite often happens that TeX/LaTeX tools and LyX are provided free with the Linux distribution, and perhaps even automatically installed. Here, for example, is what the processing for the sample etext looks like on my iMac running SuSE Linux:

# Convert the etext to LaTeX format.
GutenMark/LinuxPPC/GutenMark --latex bldhb10.txt bldhb10.tex
# Convert the LaTeX to PDF format.
pdflatex bldhb10.tex

The various other tricks mentioned in the previous section (creating booklets and so forth) can be accomplished quite easily, simply by converting the PDF to Postscript and vice-versa.

I am less familiar with using LaTeX in the Windows environment, but it is very possible to do so. Here is a mini-howto on installing LaTeX in Win32 from GutenMark user John Wells that you may find useful. My own recommendations are provided in the next section.

Installation of LaTeX Tools and Fonts in Windows

I've recently discovered and would highly recommend the TeX Live distribution, which is a set of LaTeX tools that are easily installable (or even operable without installation from a "live DVD") on Linux, Windows, Mac OS X, and other computing platforms. With TeX Live, you can actually follow exactly the instructions given in the preceding section for processing LaTeX files, regardless of whether you're using Windows or Linux.

You can get TeX Live for free at www.tug.org. It's a big download, but in my view it's worth it since you get all the tools and lots of fonts, for Linux, Windows, Mac OS X, etc., all in one big pile. In fact, I think it's even worth it in Linux (which usually has easier ways of getting TeX/LaTeX than this), simply for all the extra fonts. But anyway, here are the instructions for Windows. I've supplied a lot of words, but the steps are actually pretty simple. You're going to need at least 3 Gbytes of free space on your hard disk to follow them.

Download TeX Live from the link above. What you get is a zipfile (at this writing, it's a 935 megabyte file called "texlive2007-live-20070212.iso.zip") that you can unzip using the normal methods one uses for such stuff in Windows. On my Windows XP box, for example, if the file was downloaded to the Desktop, I'd right-mouse-click on it and say to "Extract all".
After unzipping, you have an ISO9660 file (at this writing, it's a 1.8 gigabyte file called "texlive2007-live-20070212.iso"). After you have this file, you can delete the zipfile if you're brave and are short of disk space. The ISO9660 file is itself a kind of archive containing all of the files needed to run or to install TeX Live. What you do with the ISO9660 file depends on whether you have a DVD burner.

If you have a DVD burner, burn the file to a blank DVD. [If you don't know what I mean by this, here's more explanation: I don't mean copy it to a DVD, but rather to make a DVD whose contents are an exact copy of the ISO9660 file. The method for doing this differs from one type of DVD-writing software to the next, so I can't give you step-by-step instructions for it. In the software I use, for example, there's a menu option called "Burn DVD ISO image". In general, if you are dragging-and-dropping the file, you're probably doing the wrong thing. After the DVD has been burned, the contents of an incorrectly burned DVD will simply be a single file called "texlive2007-live-20070212.iso". The contents of a correctly burned DVD, on the other hand, would be things like folders ("bin", "perltl", ...) and files ("00LIVE.TL", "autorun.inf", ...).]

If you don't have a DVD burner, you may be able to mount the ISO9660 file as a virtual disk. There's commercial software that lets you do this, or to extract files from the ISO9660 file, but I've never used it; you can find out about that software by clicking on the file with a mouse, and allowing Microsoft to instruct you as to how to find such software. In Windows XP, my suggestion is to use a free but unsupported solution from Microsoft itself, called "Virtual CD-ROM Control Panel". You may have to google to find the download for this. At this writing, it can be found by following the links on support.microsoft.com. What you get from this is a self-extracting file that (after it self-extracts) provides three files, one of which is a README file. The README contains exceedingly simple instructions which allow you to mount the ISO9660 file (texlive2007-live-20070212.iso) as a virtual disk having its own drive letter.

For the sake of argument, I'll suppose that the DVD or the virtual disk (depending on which of the options you chose above) can be accessed as drive letter "Z:"; this won't be right for your specific system, but it makes it easier to talk about. If you navigate into the Z: drive, you'll find (among other things) a folder called "setuptl", and in that folder you'll find a program called "tlpmgui.exe". This is the installation program. Run it!
The installation program provides you with a pretty complex-looking configuration screen. Just ignore it, because all of the defaults are probably okay for you anyway. Just click the "INSTALL" button, and let it crank. It will ask you several times about installing other stuff like Ghostscript or Perl; just let it do so. (Personally, I got messages saying the Ghostscript installation had failed, but everything I tried seemed to work later anyway.) After installation is complete, you don't need the DVD or virtual disk any longer, unless you want to add or subtract things from your TeX Live installation.

Using the LaTeX Tools and Fonts

Suppose that you've processed (say) Project Gutenberg etext "hfinn11.txt" with GutenMark, with settings that create LaTeX rather than HTML. For the sake of argument, we'll suppose that the resulting output file is called "hfinn11.tex". LaTeX contains various tools for processing this file, but the most likely thing you'd want to do with it is to create a PDF file from it. That's as simple as pi(e):

Open up a command line, and use the "cd" command to navigate to the folder containing "hfinn11.tex". If, for example, the file is in "My Documents", then at the command line use the command "cd My Documents". (To make sure you're in the right place, use the command "dir" to see a list of all the files, and "hfinn11.tex" should be among them.)
Use the command "pdflatex hfinn11.tex" and wait for it to crank out the result. That's all there is to it, and you should be able to view "hfinn11.pdf" afterwards with Acrobat Reader.

The Thing About LaTeX ...

... is that it doesn't stop there. You'll now have the urge to do other things, such as adjust the paper size, change the fonts, and so on. LaTeX happens to be a textual format, so you can open up hfinn11.tex with any text editor or word processor and have a look at it or edit it. At first it will seem weird, but it's not that difficult to get the hang of. I would, however, be lying if I said I knew the best way to get started with it. One way to get started is to look at the documentation that comes with the TeX Live distribution. If you navigate to the installation disk (the Z: drive which is the DVD or virtual disk we installed from earlier), you might want to look at the file "readme.en.html" in your browser, and particularly the link to TeX FAQ near the bottom of that file. Another especially good resource, to my thinking, is the catalog of fonts by Palle Jørgensen, because it not only shows you examples of most fonts available in TeX Live, but also tells you how to use them in your LaTeX files, which is information all-too-often lacking in other resources.

It may also help to examine some of the pre-prepared etexts I've provided at the GutenMark website. Although only the PDF files for etexts are emphasized there, there is also a relatively obscure link that lets you get the LaTeX files (and illustrations) for all of the etexts found there. Because these files were mainly for my personal use, they are compressed/archived using tools that are uncommon in Microsoft Windows systems. For example, for Einstein's Relativity: The Special and General Theory, the LaTeX is available as a file called "EinsteinRelativity.tex.gz" and the figures are available as "EinsteinRelativity-figures.tar.gz". In this case, ".gz" indicates a compressed format created by a program called gzip, while ".tar" indicates an archive which contains several other files, created by a program called tar. These two programs are not normally provided with Microsoft Windows, but there are commercial programs such as WinZip that can decode the formats. Also, the real gzip and tar programs have been ported to Windows and can be obtained for free from gnuwin32.sourceforge.net/packages/gtar.htm and gnuwin32.sourceforge.net/packages/gzip.htm respectively. In both cases, you simply download and run a setup program. The distinction between the commercial programs and the "real" programs is that the commercial programs cost money, but are presumably easier to use for many people.

With tar and gzip, if installed with the default settings, you could unpack the archives from a command line as follows:

Navigate to the folder containing the downloaded files by using the "cd" command.
Use the commands "c:\mingw\bin\gzip -d EinsteinRelativity.tex.gz" and "c:\mingw\bin\gzip -d EinsteinRelativity-figures.tar.gz" to uncompress the LaTeX file and the graphics archive. Use the command "c:\mingw\bin\tar -xf EinsteinRelativity-figures.tar" to unpack the graphics archive.
Use the command "pdflatex EinsteinRelativity.tex" to create a PDF file, or use a text editor to view and modify EinsteinRelativity.tex.
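
On Linux or Mac OS X, where gzip and tar are normally already on the command path, the same two-step unpacking can be wrapped in a small helper. This is only a sketch; the function name unpack_tar_gz is invented for this illustration.

```shell
# Hypothetical helper: uncompress NAME.tar.gz with gzip, then unpack the
# resulting NAME.tar with tar, extracting into the current directory.
unpack_tar_gz() {
    archive=$1                        # e.g. EinsteinRelativity-figures.tar.gz
    gzip -d "$archive" || return 1    # replaces NAME.tar.gz with NAME.tar
    tar -xf "${archive%.gz}"          # unpacks the files stored in NAME.tar
}
```

With this defined, "unpack_tar_gz EinsteinRelativity-figures.tar.gz" performs both steps. Note that "gzip -d" replaces the compressed file with the uncompressed one, so keep a copy of the download if you want to retain the original.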

Manual Tweaking

GutenMark aims to provide a completely automatic system for formatting Project Gutenberg etexts. At the same time, the problem GutenMark is trying to solve is quite difficult, and neither GutenMark nor I (the programmer) am perfect. Consequently, depending on your purpose in creating the formatted texts, you may desire to improve the results with some manual tweaking of the HTML or LaTeX.

How to perform this tweaking is a matter of taste. While it is possible to edit either HTML or LaTeX in an ordinary text-editing program, it is certainly much easier to do so in a WYSIWYG editor. Quite often, all that is required is a quick scan through the text to catch anything that really leaps out as objectionable. For editing HTML, I personally prefer freeware like Netscape/Mozilla. But there are many available choices, obviously, including even Microsoft Word.

For tweaking LaTeX, I'd suggest the free program LyX (www.lyx.org), which is a near-WYSIWYG editing environment. I've provided a screenshot—which you can enlarge by clicking on it—of LyX being used to edit LaTeX created from the sample etext.

With that in mind, here's a list of things that I find objectionable in GutenMark HTML output, roughly in descending order of importance. I would hazard a guess that only the first two items are truly objectionable to most people.

GutenMark does not produce a title page, copyright notice, etc.
GutenMark is not perfect at deducing section headings. The most common problem is lines that are falsely marked as headings when they are actually normal text. This does not happen in most documents, but does happen in some documents.
GutenMark is not perfect at distinguishing between prose and verse. This can result in verse that is falsely formatted as a justified paragraph or, more commonly, as a ragged-right prose paragraph with shorter-than-average lines. This commonly happens only a few times within a document, and is often not noticeable to the average reader. With LaTeX, verse is more objectionable in appearance than with HTML.
GutenMark is not perfect at distinguishing between native-language text and foreign text. This commonly manifests itself either as proper names that are incorrectly identified as foreign words (and hence are italicized), or else as individual words in foreign phrases that are not identified as being foreign. The latter problem results in occasional multi-word italicized foreign phrases having a few words that are not italicized.

For those who want to work directly with the HTML, rather than through a WYSIWYG editor: the earliest versions of GutenMark produced quite ugly (and sometimes invalid) HTML. Later versions (say, 20011216 and later) do a much better job of producing HTML readable by humans. If you want to work directly with the HTML, consider using GutenMark's --force-symbolic command-line switch.
