GutenMark Frequently Asked Questions
Attractively formatting Project Gutenberg texts

Ladders, by Lynnie Rothan

Why Did It Take So Long to Get a GUI Front-End and an Installer Program?

Well, these details cause more of a strain on a developer (who doesn't actually need them himself) than you might imagine.  The sudden availability of both the GUItenMark front-end and the Windows and Linux installer programs is actually the result of only a few days of work, because work from the related but separate project called "I'm Cross!" was leveraged.  I mention it because if you happen to be a software developer, "I'm Cross!" may be of some interest to you.  (If you're not a software developer, it would be of no interest at all.)

What's with the cute graphic?

It's a scaled-down image of the painting Ladders, courtesy of the artist Lynn Rothan.  For me, it represents the fact that while Project Gutenberg has given us a solid foundation by providing etexts, there's still some climbing to be done before reaching an exciting experience for the reader, and hopefully GutenMark is one of the ladders.  And, it's pretty!  Check out the artist's website at

How much does GutenMark cost?

It's free, licensed to you under the GPL.

What if I want to use GutenMark in a commercial application?

For example, printing on demand, or setting up a website for automatically prettifying PG texts.  Fine, do it!  I'll be happy to encourage this in any way I can, if you let me know what you're trying to do.  Again, consult the GPL if you are thinking of distributing the GutenMark program itself.  Also, you will want to carefully read the Project Gutenberg "fine print" to determine if what you want to do is acceptable to Project Gutenberg.

Why use GutenMark rather than another text-to-html markup utility?

Although GutenMark is a text-to-HTML markup tool, it is not a general-purpose utility.  It is designed to correct the deficiencies of books that are in plain-vanilla ASCII format, and (specifically) Project Gutenberg etexts.  The goal is 100% automatic publishable-quality markup; in other words, to produce books that look as if they had been published.  General-purpose tools are not really suited for this, but it doesn't hurt to try them.  Let me know if any of them do a really good job, and I'll take a look at them.  Here are some samples from the most promising general-purpose text-to-HTML and text-to-PDF tools I've found.  To get an apples-to-apples comparison, all samples have been converted to PDF, using the Times font.  (In viewing them, ignore things like goofed-up page headings, because these are my fault from not wanting to spend a lot of time figuring the tools out.)  Remember:  No manual markup or editing has taken place.

Can GutenMark be used to make PDF for my non-PG etexts?

Yes and no:

What's the status of GutenMark?

GutenMark has reached the stage of being pretty suitable for personal use.   It should also be very useful for anyone intending to manually mark up a PG etext in HTML, since it does most of the work for you.  For a commercial printing operation—e.g., a print-on-demand service—GutenMark can use some improvement.  For a list of things that GutenMark can't do (or perhaps, can't do well), look at the buglist.

I have other open-source projects in addition to GutenMark, and I am cycling through them.  In other words, GutenMark development proceeds in spurts.

Modern versions of GutenMark seem to work much more slowly than the very early versions.

That's true.  GutenMark is now able to use wordlists and namelists to help it work more intelligently, but this intelligence comes at the cost of speed.  Processing the wordlists isn't really dependent on the etext file's size—in other words, it adds roughly the same amount of time for big etexts as for small etexts—and so the speed difference seems more obvious, and is more objectionable, if you process a small test etext.  Here's a rough speed comparison made on my 450 MHz iMac, processing the 400 Kbyte etext file TMOTB10.TXT.
Time         Wordlists used
25 seconds   (the default) American names, English, French, German, Italian, Latin, Spanish
4 seconds    (no wordlists listed)
66 seconds   American names, Danish, English, Finnish, French, Gaelic, German (2), Italian, Latin, Norwegian (monstrously big), Spanish, Swedish

I find this acceptable, but if you don't, here are a few things you can do about it, short of getting a faster computer.  :-)  And besides, 450 MHz is dog-slow in modern terms.

(NOTE:  Since the Q/A above was written, the available wordlists have increased somewhat:  as of 12/23/01, they contain about 4 million words, 12 Mbytes compressed, 45 Mbytes uncompressed.  I assume that the benchmarks using the complete set of wordlists would slow down proportionally.)
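
The fixed-cost point above can be made concrete with a little arithmetic.  The following is only a sketch.  It assumes that the 4-second benchmark figure is a run without wordlists (the text does not say so explicitly), that processing time without wordlists scales linearly with etext size, and that the wordlist overhead is constant.  The function name estimate_seconds is hypothetical.

```python
# Figures from the benchmark above (400 Kbyte etext, 450 MHz iMac).
ETEXT_KBYTES = 400
SECONDS_DEFAULT_WORDLISTS = 25
# ASSUMPTION: the 4-second figure is a run without any wordlists.
SECONDS_NO_WORDLISTS = 4


def estimate_seconds(etext_kbytes):
    """Return (total seconds, fraction spent on wordlists).

    Fixed-overhead model: the wordlist cost is constant, and the rest
    of the processing time grows linearly with the etext size.
    """
    per_kbyte = SECONDS_NO_WORDLISTS / ETEXT_KBYTES
    overhead = SECONDS_DEFAULT_WORDLISTS - SECONDS_NO_WORDLISTS
    total = overhead + per_kbyte * etext_kbytes
    return total, overhead / total


for size in (40, 400, 4000):
    total, share = estimate_seconds(size)
    print(f"{size:5d} Kbytes: about {total:.1f} s, {share:.0%} on wordlists")
```

Under those assumptions, a 40 Kbyte test file would take about 21.4 seconds, with roughly 98% of that spent on wordlists, while a 4000 Kbyte etext would take about 61 seconds, with roughly 34% spent on wordlists.  That is why the slowdown is most noticeable on small test files.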

Why does GutenMark discard the Project Gutenberg file header?

Or:  Is this even legal?  Personally, I have mixed feelings on this.  I'd prefer to retain the header, on the grounds of giving credit where it's due, but I'd also like to delete the header, on the grounds that it's ugly, ugly, ugly.

Speaking legalistically, if you refer to the Project Gutenberg standard file header (an example of which may be seen here), under the section titled DISTRIBUTION UNDER "PROJECT GUTENBERG-tm", you'll note that Project Gutenberg specifically requires the header (and all other references to PG) to be removed if the etext has been changed.  It's unclear whether GutenMark changes the etext sufficiently to trigger this clause, but in any case removal of the header is always allowed.  Therefore, the default is to remove the header.  You can retain the PG header with GutenMark's "--yes-header" command-line option.  If you do so, please keep in mind that complying with PG's requirements is entirely your responsibility.
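
For readers curious what "removing the header" amounts to, here is a minimal sketch.  It is not GutenMark's actual code.  It assumes the header ends at a line beginning with "*END*THE SMALL PRINT!", as in older PG etexts, and the helper name strip_pg_header and its keep_header parameter (standing in for the "--yes-header" behavior) are hypothetical.

```python
import re

# ASSUMPTION: the PG header ends at a line such as
# "*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END*".
# This pattern is illustrative and is not taken from GutenMark's source.
HEADER_END = re.compile(r"^\*END\*THE SMALL PRINT!.*$", re.MULTILINE)


def strip_pg_header(etext, keep_header=False):
    """Return the etext without its PG header (hypothetical helper)."""
    if keep_header:
        return etext
    match = HEADER_END.search(etext)
    if match is None:
        # No recognizable header: leave the text untouched.
        return etext
    # Drop everything up to the marker line, plus the blank lines after it.
    return etext[match.end():].lstrip("\r\n")


sample = (
    "**The Project Gutenberg Etext of A Sample Book**\n"
    "(header text)\n"
    "*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END*\n"
    "\n"
    "CHAPTER I\n"
)
print(strip_pg_header(sample))
```

Etexts whose headers end with a different marker would need a different pattern, so a real implementation has to recognize more than this one form.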

Why does the HTML produced by GutenMark look funny in my browser?

Each browser tends to have its own individual quirks that limit how accurately it can display HTML.  In other words, browsers (even very popular ones that I won't name) don't necessarily follow the HTML standards as closely as you might like.  You can check some of your own browser's capabilities by looking at the following table.
What your browser displays
long dash (em-dash)
short dash (en-dash)
soft hyphen
should be in­vis­i­ble
curly left double-quote
curly right double-quote
curly left single-quote
curly right single-quote

And, there's an additional problem.  HTML allows two separate ways of representing special symbols (like those in the table above), the numeric way and the symbolic way, and your browser's quirks may be different in the numeric mode than in the symbolic mode!  By default, GutenMark uses the numeric mode, because browsers tend to support it better.  To use the symbolic mode instead, use the '--force-symbolic' command-line switch.  This may or may not work differently with your browser, but will definitely produce more readable raw HTML if additional manual markups are going to be made.
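
To make the numeric-versus-symbolic distinction concrete, the sketch below prints both forms for the symbols in the table above.  It uses Python's standard html.entities module rather than anything from GutenMark, and it assumes decimal numeric references (hexadecimal ones are equally valid HTML).  The helper names are hypothetical.

```python
from html.entities import codepoint2name

# The symbols from the table above, keyed by description.
SYMBOLS = {
    "long dash (em-dash)": "\u2014",
    "short dash (en-dash)": "\u2013",
    "soft hyphen": "\u00ad",
    "curly left double-quote": "\u201c",
    "curly right double-quote": "\u201d",
    "curly left single-quote": "\u2018",
    "curly right single-quote": "\u2019",
}


def numeric_entity(ch):
    """Numeric form, e.g. the em-dash becomes &#8212; (decimal assumed)."""
    return f"&#{ord(ch)};"


def symbolic_entity(ch):
    """Symbolic (named) form, e.g. the em-dash becomes &mdash;."""
    return f"&{codepoint2name[ord(ch)]};"


for label, ch in SYMBOLS.items():
    print(f"{label:26} {numeric_entity(ch):8} {symbolic_entity(ch)}")
```

Both forms denote the same character, so any difference in what you see comes from the browser.  The symbolic names are simply easier to recognize when editing the raw HTML by hand.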

Remember, though, that the goal of the GutenMark project is to produce good-looking printouts, or good-looking PDF-based online displays, and only secondarily to produce good-looking browser-based online displays.

Why does GutenMark use custom "wordlists"?

Or:  I already have a spelling dictionary installed on my computer, and this wastes precious disk space!  GutenMark is designed to be very portable, and the types of spelling dictionaries available differ greatly from one computer platform to the next.  Most GutenMark wordlists are derived from the spelling dictionaries of the ispell program, which are installed on many Linux computers.  However, even on a Linux platform, some of these ispell dictionaries have technical deficiencies from GutenMark's standpoint.  For some languages (such as Latin), no comprehensive ispell dictionary has been previously available.  Certainly there is no ispell dictionary of personal and geographical proper names.  Therefore, the choice has been made to produce a set of custom wordlists used only by GutenMark, even if it has the unfortunate side effect of increased download times and disk usage.  Besides, disk space isn't all that precious these days, and is getting less precious all of the time.

What about the crackers?

Interestingly, most regular users of GutenMark don't seem to download the wordlists at all.  A large percentage of the folks who do download the wordlists seem to be "crackers", i.e., people who cause mayhem by breaking into computer systems.  A surprising number of GutenMark wordlist downloads seem to be made by people engaged in stealing passwords and credit-card numbers from pornography websites.  People sometimes ask me "What are you going to do about that?!!!" and seem surprised when I say "Nothing."  All of the GutenMark wordlists are available elsewhere on the web anyhow, and the crackers know where to find them.

What if I want to mirror the GutenMark website?

Swell!  Just make sure you use the material as-is, without change. Let me know about it, and I'll provide a link.

©2001,2002,2008 Ronald S. Burkey.  Last updated 06/01/2008 by RSB.  Contact me.