Attractively formatting Project Gutenberg texts
GutenMark has no formal bug-tracking system (the level of community interest not having justified it as of yet), but here's a simple table which I'll use to record outstanding issues (including any you tell me about), and their resolutions.
# | Date posted | Status | Description |
---|---|---|---|
115 | 03/19/04 | Thinking ... | Branko Collin notes that headings which are terminated by periods are not recognized as headings. This is not so much a new bug (GutenMark deliberately interprets lines terminated by periods as non-headings) as the discovery that some etexts actually have chapter headings like this, which I had been hoping wasn't true. |
114 | 03/19/04 | Fixed 03/20/04. | Branko Collin has pointed out that the output HTML has </font> end-tags in places where there are no <font> start-tags. Yikes! 03/20/04: Whew! It turns out that this problem only occurs if GutenMark can't find any chapter headings, i.e., only in the "prefatory" area. |
113 | 02/21/04 | Fixed | Jeff Rich has pointed out that the files created by GutenSplit have HTML headers only for the first two files, but then lack them for all succeeding files. |
112 | 08/07/03 | To do | With --latex, or perhaps with --latex --no-foreign, it is possible for the occasional "\textit" to appear as "\texti". An example is etext02/11001008.txt. (Thanks to Rodrigo Fonseca.) |
111 | 01/05/03 | Fixed | LaTeX: Sigh! The reason I've had so much trouble with implementing mdashes (see numerous problem reports below) is that I've been using the wrong LaTeX construct for it all along. Should simply be "---". |
110 | 12/24/02 | Fixed, partially. | LaTeX: (Actually, not a bug, but rather a problem noted in LyX's ability to import LaTeX.) In importing LaTeX constructs like "\ \ \ " or "\mbox{----}\mbox{----}", LyX (1.2) will arbitrarily insert line-feeds between the LaTeX commands. For example, "\ \ \ " after importing has a line-feed between successive "\ " commands. This is very inconvenient, since it results either in extra line-feeds being inserted in the output, or else in LaTeX which is illegal. In most cases nothing can be done. However, for the specific case of "\ " commands which lead lines, it should be fairly harmless to replace successive "\ " (not the first in a chain, though) with "~". |
109 | 12/23/02 | Fixed, partially | |
108 | 12/16/02 | Fixed (I hope) | |
107 | 11/28/02 | To-do | In moon10.txt, there is a point (the text preceding "On the receipt") where "\end{quotation" is generated rather than "\end{quotation}" with --latex. |
106 | 11/23/02 | Under consideration. | Numerous suggestions/points have been made by Ben FrantzDale. For the moment, I'll quote his email, and parse it into individual issues later as required: Bug #102 and #101: Modern typographic style actually goes against LaTeX's (and TeX's) defaults. It is now discouraged to put extra space after sentences. (See The Elements of Typographic Style by Bringhurst, page 28). In LaTeX this just means putting \frenchspacing in the preamble. Doing this would eliminate bug #102 entirely. As for bug #101, I've seen it recommended to type Mr. Soandso as Mr.~Soandso, thereby preventing a line break and increasing readability. This may or may not be useful here. Bug #91: I suspect you are using \emph{} for italics. I think if you used {\em} you might be able to span paragraphs. (If not that, I know there's some other way to set the font to italic that would work.) As for bug #59, it's a tough question. My understanding is that modern typography prefers spaced en-dashes -- like this -- to em-dashes. However, the Gutenberg books are old and therefore might be best reproduced with full em-dashes---like this. Bug #40: For what it's worth, I believe British style uses single quotes for quotes, and double quotes for quotes within quotes. Bug #11: I'm not sure which problem is happening, but if people are using "-" for dashes in etexts, that should be fixed in the etext, presumably. |
105 | 11/23/02 | Figured out. | Workaround: This apparently occurs when PDF is generated using the tools latex/dvips/ps2pdf (which happens to be the default in LyX). It apparently does not happen if instead pdflatex is used (which is an alternate option in LyX). Using pdflatex has an additional advantage, in that it seems to work reliably when mathematical stuff like the degree symbol, superscripts, and subscripts appear in the text. |
104 | 11/20/02 | Fixed. | Note: The "fix" for this involves downloading a new special.words.gz, which now has the word "The" added to it. However, you must have an appropriate GutenMark.cfg file (such as the standard one distributed with GutenMark) for this to work. Otherwise, NonUS.places.gz will (by default) be treated as "native" words and special.words.gz will (by default) be treated as "foreign"; hence "Thé" (being native) will override "The" (being foreign). |
103 | 09/08/02 | Fixed 20021122. | |
102 | 09/08/02 | Partially fixed in version 20021122. | Things like the following are treated as ends-of-sentences in LaTeX output, and consequently sometimes have too much blank space in them: Replace !'' by !''\ (backslash, space). The latter two replacements depend on having an editor that can deal with regular expressions. |
101 | 09/05/02 | Partially fixed in version 20021120. | Occasionally, an honorific (such as "Mr.") is not recognized and is therefore treated as the end of a sentence. I suspect, but have not confirmed, that this occurs only at the ends of lines. Workaround: This is very easily handled by editing the LaTeX output with a text editor and using a global search-and-replace, as follows. (Note that many of the strings end with a space.) Replace "Mr. " by "Mr.\ ". The latter two replacements depend on having an editor that can deal with regular expressions. |
100 | 08/26/02 | To-do. | Under some undefined circumstances, instead of converting all headings to chapter headings, GutenMark will (roughly speaking) alternate chapter headings with section headings. This isn't really noticeable in HTML but is very irksome in LaTeX. |
99 | 08/26/02 | Closed. | |
98 | 08/11/02 | To-do. | I've encountered a text in which the chapter headings are indicated as in the following example: (blank line). This fools GutenMark into thinking there's no actual chapter break. (The example file is "Tarzan the Terrible".) |
97 | 08/11/02 | To-do. | The "--first-capital" switch appears to be broken. |
96 | 08/10/02 | To-do. | Well, apparently after many, many years of putting the huge PG header at the beginning of the text, this is now being replaced with a small header and a huge footer. GutenMark doesn't handle this in an aesthetic way. |
95 | 08/09/02 | To-do. | For LaTeX output, LaTeX special-characters appearing in titles and author names will not be correctly interpreted. (This doesn't seem to actually occur in practice, but it certainly could happen.) |
94 | 08/09/02 | Closed 08/09/02. | |
93 | 08/04/02 | Closed 08/09/02. | |
92 | 08/03/02 | Closed 08/04/02. | |
91 | 08/03/02 | To-do | In LaTeX, if italics span a carriage return, the LaTeX will be illegal because the carriage return will be treated as an end-of-paragraph. This has been fixed in some cases, but not in others. Temporary workaround: when running latex, an error will be flagged -- usually as something to do with too many '{' or '}' characters, or as a runaway argument. Simply edit the input text file at the indicated point: end the italics at the end of the line, and restart the italics at the beginning of the next line. |
90 | 08/03/02 | Not true, I think. | |
89 | 08/03/02 | Closed | |
88 | 08/03/02 | Closed | |
87 | 08/03/02 | Closed | |
86 | 08/03/02 | To-do | When converting all-caps to italics (as in THIS IS A PHRASE THAT SHOULD BE ITALICS), a lone 'A' will be mishandled (as in this is A Phrase that should be italics). Notice that in some cases the succeeding word may also be capitalized. I've seen this in LaTeX, but I assume it applies to HTML also. An example text is pklvr10.txt. |
85 | 07/25/02 | Closed, but ! | (Fixed 07/25/02.) Encountering a line longer than 255 characters in the input file will cause corruption in the portions of the output file that follow. (Note that the PG formatting guidelines specify a maximum line length of, I believe, 70.) An example is the line beginning "I shall hear the bell ring ..." in the file mollf10.txt. Thanks to Curtis Weyant for pointing this out. Note: |
84 | 07/22/02 | Closed | |
83 | 07/21/02 | Closed | |
82 | 07/14/02 | Closed | |
81 | 07/14/02 | To-do | The Win32 version and *nix versions do not agree in their treatments of the initial line of the sample etext. (But do treat the remainder of the sample etext identically.) |
80 | 07/13/02 | Closed 07/21/02 | |
79 | 07/10/02 | Closed 07/14/02. | |
78 | 06/16/02 | To-do. | In LaTeX, verse is rendered poorly (relative to the way it is rendered in HTML). If paragraphs are not indented (the default), there is an extra blank line in between every line of verse. If paragraphs are indented (--no-parskip), these blank lines don't appear, but if the verse is the first thing in the chapter the first verse line is not aligned with the others. |
77 | 06/16/02 | Closed. | |
76 | 06/15/02 | Closed 06/16/02. | |
75 | 01/24/02 | Needs investigation. | (Thanks to Curtis Weyant.) There is apparently a problem (e.g., lkhst10.txt) when the first lines of paragraphs are not indented, but the subsequent lines are; these are treated as verse by GutenMark. (Yikes! I never saw such a thing before.) |
74 | 01/24/02 | Under consideration | (Suggestion thanks to Curtis Weyant.) Provision might be made for a list of words which are never capitalized, except at the beginnings of sentences. |
73 | 01/24/02 | To-do. | (Suggestion thanks to Curtis Weyant.) Conversion of ALL-CAPS headings to upper/lower case (perhaps as a command-line option) would be useful. |
72 | 12/28/01 | Closed 08/10/02. | |
71 | 12/27/01 | To do. | For OCR'd text that hasn't been proofread well, it is common to find that the OCR software has inserted a '~' character wherever it does not recognize a character. If this is the first character in a word, it will toggle italics mode on (see issue #64). Therefore, for the special case of ~italicizing~, GutenMark needs to look for a trailing ~ before toggling italics on. |
70 | 12/27/01 | Closed | |
69 | 12/20/01 | Probably needs AI. | In ytagn10.txt, there is a section titled '273'. Not surprisingly, this isn't recognized as a section heading. |
68 | 12/20/01 | Probably needs AI. | In ytagn10.txt, for the first time, we see a section that has subsections. GutenMark marks the first as a sub-heading, but cannot distinguish any of the rest from normal text. |
67 | 12/20/01 | We'll see ... | In ytagn10.txt, we find "o^" and "e^", presumably intended to be 'ô' and 'ê'. I'll have to find this same construction in other files before applying a fix in GutenMark for it. For reasons I don't quite grasp at this moment, this etext also encodes 'ç' as character #135, which doesn't correspond to anything in any character encoding I'm familiar with. |
66 | 12/18/01 | Probably impossible currently | (See also issue #32.) There are many characters which don't appear in the HTML 4.0 character-entity set at all. Consider, for example, the 6 different regional encodings used by NIMA, as compared to the HTML 4.0 entities. While there is a substantial (or complete, in some cases) overlap for characters 'a'-'z', 'A'-'Z', and 192-255, there are also many characters simply missing. This is probably not an issue for English-language (or at least, American) readers, but still ... Various issues make this very difficult. Probably, Unicode is necessary. Even where browsers have fairly good Unicode support, equal support is not available in the HTML-to-Postscript conversion (if used). Then, too, adding Unicode support within GutenMark would be a pretty substantial undertaking ... |
65 | 12/18/01 | Closed | |
64 | 12/16/01 | Closed | |
63 | 12/16/01 | Closed | |
62 | 12/16/01 | To do | A couple of cases (thdvn10.txt) in which the program is fooled into treating verse as a blockquote: |
61 | 12/16/01 | May be impossible | Blockquotes in which the volunteer has used abnormally short lines are indistinguishable from verse, and hence are not wrapped. Numerous examples appear in thdvn10.txt. |
60 | 12/16/01 | Closed | |
59 | 12/15/01 | To-do | Question: should mdashes surrounded by whitespace be normalized by removing the whitespace? |
58 | 12/15/01 | Closed | |
57 | 12/15/01 | Closed | |
56 | 12/13/01 | To-do | Normally, "I" is not italicized. However, if part of an all-caps phrase, like "I AM THE LIGHT", it should be. |
55 | 12/13/01 | Possible | Line drawings may now be recognizable (see issue #50), but they are merely converted to a fixed-width font, and not to an attractive drawing with lines that join up nicely. NOTE: Some browsers (like Mozilla) do support Unicode line-drawing characters, but html2ps doesn't currently support them. |
54 | 12/11/01 | Closed | |
53 | 12/11/01 | Closed | |
52 | 12/11/01 | Closed | |
51 | 12/10/01 | Closed | |
50 | 12/10/01 | Closed | |
49 | 12/10/01 | Closed | |
48 | 12/10/01 | Closed | |
47 | 12/09/01 | Possible | Consider alternate output formats: DocBook, XML, or RTX. (Thanks to Craig Morehouse.) |
46 | 12/09/01 | May be impossible | When "dialect" is used -- i.e., when the author has simply made up a lot of new words to express how something sounds -- there is a rather high probability that the made-up words match some words in a foreign language, and hence are rendered as italicized. A similar problem occurs if the author has simply made up names. |
45 | 12/08/01 | Possible | Consider the use of Cascading Style Sheets for the HTML. (Thanks to Terence Tan.) |
44 | 12/08/01 | Closed | |
43 | 12/08/01 | To-do | Investigate the feasibility of using the HTML tags <q> and </q> rather than opening/closing quotes. (Thanks to Terence Tan.) |
42 | 12/08/01 | Closed | |
41 | 12/08/01 | Closed | |
40 | 12/08/01 | To-do | Need to check that texts in which single-quotes are used systematically in place of double-quotes (such as wuthr10.txt) are handled correctly. |
39 | 12/05/01 | Closed | |
38 | 12/05/01 | To-do | ALL-CAPS Roman numerals may or may not be handled correctly. |
37 | 12/04/01 | To-do | For people who actually want to view HTML output in their browser, most HTML files currently output will be too large. There needs to be a command-line option to break the file into smaller files, perhaps at chapter headings. |
36 | 12/03/01 | Closed | |
35 | 12/03/01 | To-do | Require a more-sensible installation procedure, with fewer manual steps. |
34 | 12/03/01 | To-do | There is an appearance of "--" not converted to emdash in bldhb10.html. It may involve a sequence such as "- -". |
33 | 12/01/01 | Closed | |
32 | 12/01/01 | Possible | Addition of diacriticals and ligatures (such as the oe ligature), which don't fit into the 8-bit subset of the HTML 4.0 character set, to the wordlists. |
31 | 12/01/01 | Closed | |
30 | 12/01/01 | To-do | Lists of proper names should be provided for more languages, particularly Latin. |
29 | 12/01/01 | Partially handled, for single-word placenames. Full treatment to-do. | Geographical references should not be italicized unless in ALL-CAPS, and should be capitalized properly in this case. Since many placenames are multi-word, this cannot be completely handled by the wordlist mechanism. |
28 | 12/01/01 | Ongoing | All existing wordlists, particularly Latin and German, require improvement. |
27 | 12/01/01 | Closed | |
26 | 12/01/01 | To-do | Automatic detection of text native language, rather than relying on command-line parameter. |
25 | 12/01/01 | To-do | Language-profile should be used to modify the type of quotation marks. |
24 | 11/26/01 | Fixed 06/15/02 | |
23 | 11/26/01 | Closed | |
22 | 11/26/01 | Closed | |
21 | 11/26/01 | Fixed 06/15/02 | |
20 | 11/26/01 | Fixed 06/15/02. | |
19 | 11/26/01 | Fixed 06/16/02. | |
18 | Antiquity | Possible. | Bullets. I haven't seen many bullets in PG etexts, but I'm sure GutenMark won't handle them. |
17 | Antiquity | May be impossible | Illustrations. Well, PG etexts don't have illustrations. But still ... |
16 | Antiquity | May be impossible within HTML. May need A.I. | Spacing in verse or dramatic scripts. Verse and scripts (like plays) are depicted in a variable-width font, and this may result in incorrect alignment among successive lines. Consider the following example, that might appear in a play, in which several characters respond simultaneously to another character: ( Nonsense! | You're not serious! I'm leaving! { What! | Not a chance! ( That's crazy talk! The intention of the person creating the etext was clearly that a single large left-hand brace should precede the text at the right. GutenMark, however, will not only not add a large brace, but will jumble up the spacing so that it doesn't even look as good as it does here. |
15 | Antiquity | May need A.I. | Attributions. By this, I mean quotes which are set off from the surrounding text, and which are followed by the author's name (which is supposed to be at the far right of the quotation). Actually, GutenMark's treatment of this case seems to be not unreasonable, but it needs improvement to be professional. |
14 | Antiquity | May need A.I. | Detection and treatment of double-column verse. I'm not sure this appears in any actual Gutenberg text, but I know that it does appear in certain books that have been partially converted to PG, such as Burton's Arabian Nights. |
13 | Antiquity | Ongoing | Improvement of table-detection and treatment, as in FLYMC10.TXT. |
12 | Antiquity | To-do | Dealing with things like "right-" when appearing at the end of the line, as (for example) in the phrase "this happens with both the right- and left-hand versions." GutenMark would treat this as "this happens with both the right-and left-hand versions." |
11 | Antiquity | To-do | Systematic misuse of "-" where "--" was actually intended. |
10 | Antiquity | To-do | Removal of false hard-hyphens. For example, suppose one line of the etext ended with "soft-", and the next line began with "hyphen". Should this be treated as "soft-hyphen" or as "softhyphen"? |
9 | Antiquity | May need A.I. | Footnotes/endnotes. Innumerable footnote/endnote styles appear in PG etexts. Here are some cases I've found: |
8 | Antiquity | May need A.I. | Restoration of Greek transliterated to Latin, back into Greek. In some PG etexts, Greek text is simply discarded (and obviously cannot be recovered). In other cases it has been transliterated to Latin characters, but there are various schemes for doing so, and these are seldom specified. Furthermore, the transliterated text is often not marked in any way as being Greek. |
7 | Antiquity | May need A.I. | Restoration of missing currency symbols, particularly Pound (£) and Yen (¥). |
6 | Antiquity | To-do | Restoration of Spanish inverted exclamation points (¡) and question marks (¿). |
5 | Antiquity | To-do | The ability to recognize and italicize book titles should be added, along with a database of book titles in various languages. |
4 | Antiquity | To-do | Determination of Title/Author should be improved by using PG header data rather than just the first line of the file. |
3 | Antiquity | Ongoing | Recognition of verse vs. normal paragraph text needs improvement. |
2 | Antiquity | Ongoing | Identification of "prefatory" section needs improvement. |
1 | Antiquity | Ongoing | Identification of section headings needs improvement. |
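
To make issue #71 above a little more concrete, here is a minimal sketch of the "look for a trailing ~ before toggling italics on" idea. It is not GutenMark's code (GutenMark does not work this way internally, as far as this page says); the function name `mark_tilde_italics`, the regular expression, and the choice of `<i>` tags are all assumptions made purely for illustration, and Python is used only for brevity.

```python
import re

# Hypothetical pattern, not taken from GutenMark: a '~' at the start of a word
# only opens italics if a closing '~' ends a word later on the same line.
TILDE_ITALICS = re.compile(r"(?<!\S)~([^\s~](?:[^~\n]*[^\s~])?)~(?!\w)")


def mark_tilde_italics(line: str) -> str:
    """Wrap ~paired~ spans in <i>...</i>; leave any unpaired '~' untouched."""
    return TILDE_ITALICS.sub(r"<i>\1</i>", line)


if __name__ == "__main__":
    # A properly paired span is italicized.
    print(mark_tilde_italics("He kept ~italicizing~ everything."))
    # A stray OCR '~' with no closing partner is left alone.
    print(mark_tilde_italics("The ~uick brown fox"))
```

A stray '~' left in place like this would still need separate handling (or proofreading), but at least it would no longer switch italics on for the rest of the paragraph.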