GutenMark Bug/Issue List
Attractively formatting Project Gutenberg texts

GutenMark has no formal bug-tracking system (the level of community interest not having justified it as of yet), but here's a simple table which I'll use to record outstanding issues (including any you tell me about), and their resolutions.
#            Date posted   Status   Description of bug or issue
115          03/19/04   Thinking ...   Branko Collin notes that headings terminated by periods are not recognized as headings.  This is not so much a new bug (GutenMark deliberately interprets lines terminated by periods as non-headings) as the discovery that some etexts really do have chapter headings like this, which I had been hoping wasn't true.
114          03/19/04   Fixed 03/20/04.   Branko Collin has pointed out that the output HTML has end-tags in places where there are no start-tags.  Yikes!  03/20/04:  Whew!  It turns out that this problem only occurs if GutenMark can't find any chapter headings---i.e., only in the "prefatory" area.
113          02/21/04   Fixed   Jeff Rich has pointed out that the files created by GutenSplit have HTML headers only for the first two files, but lack them for all succeeding files.
112          08/07/03   To do   With --latex, or perhaps with --latex --no-foreign, it is possible for the occasional "\textit" to appear as "\texti".  An example is etext02/11001008.txt.  (Thanks to Rodrigo Fonseca.)
111          01/05/03   Fixed   LaTeX:  Sigh!  The reason I've had so much trouble with implementing mdashes (see numerous problem reports below) is that I've been using the wrong LaTeX construct for it all along.  It should simply be "---".
110          12/24/02   Fixed, partially.   LaTeX:  (Actually, not a bug, but rather a problem noted in LyX's ability to import LaTeX.)  In importing LaTeX constructs like "\ \ \ " or "\mbox{----}\mbox{----}", LyX (1.2) will arbitrarily insert line-feeds between the LaTeX commands.  For example, "\ \ \ " after importing becomes "\ ", "\ ", "\ " on three separate lines.  This is very inconvenient, since it results either in extra linefeeds being inserted in the output, or else in LaTeX which is illegal.  In most cases nothing can be done.
     However, for the specific case of "\ " commands which lead lines, it should be fairly harmless to replace successive "\ " (not the first in a chain, though) with "~".
109          12/23/02   Fixed, partially   LaTeX:  The construct "\mbox{--}" doesn't work fully as expected, because it prevents LaTeX from hyphenating the words preceding or following the mdash and from breaking the line after the mdash; the lack of these things can goof up the spacing and force a lot of manual hyphenating.  The construct "\mbox{--}\linebreak[1]" cleans up a lot of this, but cannot be imported properly into LyX.
108          12/16/02   Fixed (I hope)   Thomas Klausner reports that GutenMark won't compile in NetBSD because the GLOB_ABORTED and GLOB_ABEND constants are defined identically there, messing up a switch statement in the GlobErrorMessage function.
107          11/28/02   To-do   In moon10.txt, there is a point (the text preceding "On the receipt") where "\end{quotation" is generated rather than "\end{quotation}" with --latex.
106          11/23/02   Under consideration.   Numerous suggestions/points have been made by Ben FrantzDale.  For the moment, I'll quote his email, and parse it into individual issues later as required:
     Bug #102 and #101: Modern typographic style actually goes against LaTeX's (and TeX's) defaults. It is now discouraged to put extra space after sentences. (See The Elements of Typographic Style by Bringhurst, page 28). In LaTeX this just means putting \frenchspacing in the preamble. Doing this would eliminate bug #102 entirely. As for bug #101, I've seen it recommended to type Mr. Soandso as Mr.~Soandso, thereby preventing a line break and increasing readability. This may or may not be useful here.
     Bug #91: I suspect you are using \emph{} for italics. I think if you used {\em} you might be able to span paragraphs. (If not that, I know there's some other way to set the font to italic that would work.)
     As for bug #59, it's a tough question.
     My understanding is that modern typography prefers spaced en-dashes -- like this -- to em-dashes.  However, the Gutenberg books are old and therefore might be best reproduced with full em-dashes---like this.
     Bug #40: For what it's worth, I believe British style uses single quotes for quotes, and double quotes for quotes within quotes.
     Bug #11: I'm not sure which problem is happening, but if people are using "-" for dashes in etexts, that should be fixed in the etext, presumably.
105          11/23/02   Figured out.   (Relates to PDF files I've created, rather than to GutenMark itself.)  Acrobat Reader 5.x under Windows 98 displays some spurious error messages, though, as I understand it, it does display the files more-or-less properly.  (For me, this doesn't happen in Windows 98, but does happen in MacOS-X.)  Thanks to John de Longpre for this report.
     Workaround:  This apparently occurs when PDF is generated using the tools latex/dvips/ps2pdf (which happens to be the default in LyX).  It apparently does not happen if pdflatex is used instead (which is an alternate option in LyX).  Using pdflatex has an additional advantage, in that it seems to work reliably when mathematical stuff like the degree symbol, superscripts, and subscripts appear in the text.
104          11/20/02   Fixed.   The name "Thé" appears in the wordlist NonUS.places.gz.  This can cause global replacement of "The" by "Thé" if diacriticals are turned on.
     Note:  The "fix" for this involves downloading a new special.words.gz, which now has the word "The" added to it.  However, you must have an appropriate GutenMark.cfg file (such as the standard one distributed with GutenMark) for this to work.  Otherwise, NonUS.places.gz will (by default) be treated as "native" words and special.words.gz will (by default) be treated as "foreign"; hence "Thé" (being native) will override "The" (being foreign).
103          09/08/02   Fixed 20021122.   In LaTeX output, special characters like hyphens and single or double quotes are removed from page headings and TOC entries.
102          09/08/02   Partially fixed in version 20021122.   Things like the following are treated as ends-of-sentences in LaTeX output, and consequently sometimes have too much blank space in them:
     "What!" exclaimed Tom.
     "What?" asked Tom.
     Workaround:  This is very easily handled by editing the LaTeX output with a text editor and using a global search-and-replace, as follows.  (Note that some of the strings end with a space, and that they contain two right-hand single-quotes.)
     Replace !''  by !''\ 
     Replace ?''  by ?''\ 
     Replace !''$ by !''\
     Replace \?''$ by ?''\
     The latter two replacements depend on having an editor that can deal with regular expressions.
101          09/05/02   Partially fixed in version 20021120.   Occasionally, an honorific (such as "Mr.") is not recognized and is therefore treated as the end of a sentence.  I suspect, but have not confirmed, that this occurs only at the ends of lines.
     Workaround:  This is very easily handled by editing the LaTeX output with a text editor and using a global search-and-replace, as follows.  (Note that many of the strings end with a space.)
     Replace Mr.  by Mr.\ 
     Replace Mrs.  by Mrs.\ 
     Replace Mr.$ by Mr.\
     Replace Mrs.$ by Mrs.\
     The latter two replacements depend on having an editor that can deal with regular expressions.
100          08/26/02   To-do.   Under some undefined circumstances, instead of converting all headings to chapter headings, GutenMark will (roughly speaking) alternate chapter headings with section headings.  This isn't really noticeable in HTML but is very irksome in LaTeX.
99           08/26/02   Closed.   Automatic elimination of ALL-CAPS conversion, which applies when some minimum number of other types of italics delimiters is found, is fooled by the fact that a text with ALL-CAPS italicizing will often have to contain constructs like "_I_".
98           08/11/02   To-do.   I've encountered a text in which the chapter headings are indicated as in the following example:
     (blank line)
     2
     (blank line)
     To the Death!
     (blank line)
     This fools GutenMark into thinking there's no actual chapter break.  (The example file is "Tarzan the Terrible".)
97           08/11/02   To-do.   The "--first-capital" switch appears to be broken.
96           08/10/02   To-do.   Well, apparently after many, many years of putting the huge PG header at the beginning of the text, this is now being replaced with a small header and a huge footer.  GutenMark doesn't handle this in an aesthetic way.
95           08/09/02   To-do.   For LaTeX output, LaTeX special-characters appearing in titles and author names will not be correctly interpreted.  (This doesn't seem to actually occur in practice, but it certainly could happen.)
94           08/09/02   Closed 08/09/02.   In etexts where italicizing has been indicated with techniques such as _underscores_, text which appears in ALL-CAPS usually does so because it actually appeared that way in the original text.  GutenMark, however, allows mixing-and-matching of different italicizing techniques, and will attempt to convert both to italics.  Instead, it should detect the dominant italicizing technique (such as underscores) and, if that is not the all-caps technique, should leave all-caps text untouched.
93           08/04/02   Closed 08/09/02.   Some constructions of the form [_an example_], which should become [an example], are misinterpreted.  I presume that this applies to some other combinations of punctuation/italicizing.  An example is latda10.txt.
92           08/03/02   Closed 08/04/02.   In LaTeX, there's some kind of unfortunate interaction if an em-dash appears within an italicized phrase, due to the addition of "-%".  The resulting LaTeX cannot be parsed. ... Later:  It's actually much worse:  the "-%" caused text to be deleted, which caused the problem mentioned above.
     Fortunately, this problem was only introduced on 08/03 (not merely first reported then), so it's unlikely many people will be affected.
91           08/03/02   To-do   In LaTeX, if italics span a carriage return, the LaTeX will be illegal because the carriage return will be treated as an end-of-paragraph.  This has been fixed in some cases, but not in others.
     Temporary workaround:  When running latex, an error will be flagged -- usually as something to do with too many '{' or '}' characters, or as a runaway argument.  Simply edit the input text file at the indicated point:  end the italics at the end of the line, and restart the italics at the beginning of the next line.
90           08/03/02   Not true, I think.   In LaTeX, texts containing '{' may not be treated correctly.
89           08/03/02   Closed   In LaTeX, honorifics and abbreviations are being treated as ends of sentences, resulting in a very large amount of spacing (due to justification) being added after them.
88           08/03/02   Closed   In LaTeX, the soft-hyphens I add for HTML at the ends of em-dashes have the opposite effect from that intended in some cases.
87           08/03/02   Closed   When chapter names are given in the text with a trailing period (as in "CHAPTER I."), the trailing period should be eliminated in LaTeX page headings.  An example text is pklvr10.txt.
86           08/03/02   To-do   When converting all-caps to italics (as in THIS IS A PHRASE THAT SHOULD BE ITALICS), a lone 'A' will be mishandled (as in this is A Phrase that should be italics).  Notice that in some cases the succeeding word may also be capitalized.  I've seen this in LaTeX, but I assume it applies to HTML also.  An example text is pklvr10.txt.
85           07/25/02   Closed, but read the notes! (Fixed 07/25/02.)   Encountering a line longer than 255 characters in the input file will cause corruption in the portions of the output file that follow.  (Note that the PG formatting guidelines specify a maximum line length of, I believe, 70.)  An example is the line beginning "I shall hear the bell ring ..." in the file mollf10.txt.  Thanks to Curtis Weyant for pointing this out.
     Note:  The bug has not really been fixed; the limit has simply been increased from 255 to 16383 characters.  To me, this seems more than adequate.  The "fix" means that the corruption disappears, but it does not mean that the abnormal line is formatted acceptably.  Please recognize that GutenMark applies formatting rules that are consistent with Project Gutenberg guidelines.  The less the input file conforms to those guidelines, the less acceptable the result.  The only true fix for this problem is to submit a bug report to the Project Gutenberg folks for any funky file in which this kind of problem occurs.  The proper fix is to break up the abnormally long line with hard carriage-returns.  GutenMark now provides various explanatory messages both on the console and in the error-log about this condition.
84           07/22/02   Closed   The Win32 problems fixed in PR #83 turn out really only to have been fixed in Windows 98.  In Win2K (for example), they are still not fixed.
83           07/21/02   Closed   The configuration file (and hence the wordlists) is not found if the GutenMark executable is located via the PATH.  This problem exists in both Linux and Win32, and therefore probably on all platforms.  Thanks to John Wells and George Russell for reporting this problem.
82           07/14/02   Closed   Changes in constants used by glob.h cause the program not to compile in some newer versions of FreeBSD.
81           07/14/02   To-do   The Win32 version and *nix versions do not agree in their treatments of the initial line of the sample etext.  (But they do treat the remainder of the sample etext identically.)
80           07/13/02   Closed 07/21/02   The documentation I've provided about configuring wordlists has been wrong and misleading.  I have stupidly stated that wordlists are supposed to be stored in the directory containing the GutenMark executable and configuration files.
     While they can be stored there, the default configuration file assumes rather that they are stored in the current directory, and consequently none of the wordlists will be found (using the default configuration file) if the program is not run from within the directory where it resides.
     Workaround:  Please edit the configuration file to show the exact pathnames of the wordlists/namelists.
79           07/10/02   Closed 07/14/02.   In Win32, if GutenMark is not run from within the directory where it lives, then its configuration file may not be found.  If the configuration file is found, then wordlists won't be found.  Several problems conspire to produce this effect:
     Win32 automatically adds an extension of ".EXE" when it reports the program name to GutenMark (as argv[0]), causing the configuration file to be GutenMark.EXE.cfg rather than GutenMark.cfg.
     Win32 automatically butchers the directory names in reporting them to GutenMark (as argv[0]) by shortening them to 8 characters as things like "GUTENM~1".
     My Win32 version of glob seems not to work except within the current directory.
     Finally -- not really a bug, but nevertheless a related issue of which many users won't be aware -- the wordlists listed within the configuration file must have their full pathnames rather than the relative pathnames found in the default configuration file.
     Thanks to John Wells for reporting this problem.
78           06/16/02   To-do.   In LaTeX, verse is rendered poorly (relative to the way it is rendered in HTML).  If paragraphs are not indented (the default), there is an extra blank line in between every line of verse.  If paragraphs are indented (--no-parskip), these blank lines don't appear, but if the verse is the first thing in the chapter the first verse line is not aligned with the others.
77           06/16/02   Closed.   In LaTeX output, italicized text inserted by GutenMark -- for example, all-caps converted to upper&lower case italics -- is omitted.
76           06/15/02   Closed 06/16/02.   LaTeX page headings show incorrect chapter names.
75           01/24/02   Needs investigation.   (Thanks to Curtis Weyant.)  There is apparently a problem (e.g., lkhst10.txt) when the first lines of paragraphs are not indented, but the subsequent lines are; these are treated as verse by GutenMark.  (Yikes!  I never saw such a thing before.)
74           01/24/02   Under consideration   (Suggestion thanks to Curtis Weyant.)  Provision might be made for a list of words which are never capitalized, except at the beginnings of sentences.
73           01/24/02   To-do.   (Suggestion thanks to Curtis Weyant.)  Conversion of ALL-CAPS headings to upper/lower case (perhaps as a command-line option) would be useful.
72           12/28/01   Closed 08/10/02.   For non-PG etexts, the same means of deducing title and author cannot be used as for PG etexts.  Currently, non-PG title and author are left blank.  Note:  Eventually "fixed" by adding the --title and --author switches.
71           12/27/01   To do.   For OCR'd text that hasn't been proofread well, it is common to find that the OCR software has inserted a '~' character wherever it does not recognize a character.  If this is the first character in a word, it will toggle italics mode on (see issue #64).  Therefore, for the special case of ~italicizing~, GutenMark needs to look for a trailing ~ before toggling italics on.
70           12/27/01   Closed   GutenMark does not work with otherwise-suitable plain-vanilla ASCII etexts that don't have a PG header/footer.
69           12/20/01   Probably needs AI.   In ytagn10.txt, there is a section titled '273'.  Not surprisingly, this isn't recognized as a section heading.
68           12/20/01   Probably needs AI.   In ytagn10.txt, for the first time, we see a section that has subsections.  GutenMark marks the first as a sub-heading, but cannot distinguish any of the rest from normal text.
67           12/20/01   We'll see ...   In ytagn10.txt, we find "o^" and "e^", presumably intended to be 'ô' and 'ê'.  I'll have to find this same construction in other files before applying a fix in GutenMark for it.
     For reasons I don't quite grasp at this moment, this etext also encodes 'ç' as character #135, which doesn't correspond to anything in any character encoding I'm familiar with.
66           12/18/01   Probably impossible currently   (See also issue #32.)  There are many characters which don't appear in the HTML 4.0 character-entity set at all.  Consider, for example, the 6 different regional encodings used by NIMA, as compared to the HTML 4.0 entities.  While there is a substantial (or complete, in some cases) overlap for characters 'a'-'z', 'A'-'Z', and 192-255, there are also many characters simply missing.  This is probably not an issue for English-language (or at least, American) readers, but still ...
     Various issues make this very difficult.  Probably, Unicode is necessary.  Even where browsers have fairly good Unicode support, equal support is not available in the HTML-to-PostScript conversion (if used).  Then, too, adding Unicode support within GutenMark would be a pretty substantial undertaking ...
65           12/18/01   Closed   The simple categorization of wordlists as "foreign" or "native" needs to be made more subtle.  This is most easily understood in terms of the French namelist.  In an English text, French names would need to be treated as "native" if encountered by themselves, but as "foreign" in the context of a foreign phrase.  Currently, they could only be treated as one or the other (not both) on the basis of the GutenMark.cfg file.  The same principle, of course, applies to any proper names (people, places, etc.).  Resolution:  The fix applied for problem #27 should fix this as well.
64           12/16/01   Closed   The following additional emphasizing markups (beyond those already supported) were mentioned on the gutvol-d newsgroup.  Whether any or all of them are used in PG etexts, I can't say, but I guess they should be supported:
     *emphasized*
     ~emphasized~
     _/emphasized/_
     _*emphasized*_
     */emphasized/*
     _*/emphasized/*_
     /:emphasized:/
     |:emphasized:|
63           12/16/01   Closed   In automatic conversion of 7-bit ASCII to 8-bit ASCII, the HTML may contain 8-bit codes rather than HTML character entities.
62           12/16/01   To do   A couple of cases (thdvn10.txt) in which the program is fooled into treating verse as a blockquote:
     Typo in which one line of a stanza does not begin with a capital.
     A verse beginning with "----".
61           12/16/01   May be impossible   Blockquotes in which the volunteer has used abnormally short lines are indistinguishable from verse, and hence are not wrapped.  Numerous examples appear in thdvn10.txt.
60           12/16/01   Closed   Found numerous instances (in thdvn10.txt) in which blockquotes with leading or trailing lines that were indented oddly would be treated as centered text rather than as blockquotes.
59           12/15/01   To-do   Question:  should mdashes surrounded by whitespace be normalized by removing the whitespace?
58           12/15/01   Closed   Found cases in wuthr10.txt in which mdashes at the ends of paragraphs would appear after the paragraph's closing tag.  Apparently introduced when dealing with issue #42.
57           12/15/01   Closed   Found instances in benhr10.txt in which centered paragraphs were begun, but had no closing tags.
56           12/13/01   To-do   Normally, "I" is not italicized.  However, if part of an all-caps phrase, like "I AM THE LIGHT", it should be.
55           12/13/01   Possible   Line drawings may now be recognizable (see issue #50), but they are merely converted to a fixed-width font, and not to an attractive drawing with lines that join up nicely.  NOTE:  Some browsers (like Mozilla) do support Unicode line-drawing characters, but html2ps doesn't currently support them.
54           12/11/01   Closed   In benhr10.txt, the name "Iras" incorrectly turns into "irás".
53           12/11/01   Closed   In benhr10.txt, footnotes are preceded and followed by a short line of dashes.  These are now incorrectly joined together with the footnote.  In other words
     --------------
     * This is my footnote
     --------------
     turns into
     -------------- * This is my footnote --------------
52           12/11/01   Closed   Another strange artifact in benhr10.txt:  a messed-up price list near the phrase beginning "From separate sheets he then read".
51           12/10/01   Closed   In benhr10.txt, there are 3-4 instances in which you get things like this:  VALERIUS turns into Valerius.
50           12/10/01   Closed   Line drawings with dashes and vertical bars appear in benhr10.txt.  (Search for "Gesius".)  They are completely bogus after conversion.
49           12/10/01   Closed   An empty paragraph can be opened but not closed under some circumstances at the end of a file.  Actually, this seems to happen in almost every file.
48           12/10/01   Closed   When the PG header is discarded, there can be a closing tag without an opening tag.
47           12/09/01   Possible                                                               Consider alternate output formats:  DocBook, XML, or RTX.  (Thanks to Craig Morehouse.)
46           12/09/01   May be impossible                                                      When "dialect" is used -- i.e., when the author has simply made up a lot of new words to express how something sounds -- there is a rather high probability that the made-up words match some words in a foreign language, and hence are rendered as italicized.  A similar problem occurs if the author has simply made up names.
45           12/08/01   Possible                                                               Consider the use of Cascading Style Sheets for the HTML.  (Thanks to Terence Tan.)
44           12/08/01   Closed                                                                 Add a command-line switch to allow single spaces between sentences and after colons.  (Thanks to Terence Tan.)
43           12/08/01   To-do                                                                  Investigate the feasibility of using the HTML tags  and  rather than opening/closing quotes.  (Thanks to Terence Tan.)
42           12/08/01   Closed                                                                 The HTML created by GutenMark is ugly, resulting in less readable source HTML:
     Newlines may appear before closing tags rather than after them.
     Upper/lower case of tags and entities is inconsistent.
     Things like

would be preferable to