Attractively formatting Project Gutenberg texts
I really appreciate those who have contributed features or bug fixes to GutenMark, but I still haven't provided any systematic means for you to do so. If you have any such changes in hand, I'd suggest communicating them directly to me.
In its current incarnation, GutenMark is certainly an unattractive mass of spaghetti. Hopefully, this can be corrected in the future. However, there is some underlying order to the thing, and that order is best understood in terms of the passes which GutenMark makes on various files.

GutenMark (both the Windows and Linux versions) is developed on a Linux workstation, using cross-compilers installed via the I'm Cross! project.
- Generate text-file wordlist. A sorted list of all distinct words within the input text file is created and maintained in memory. Each different capitalization pattern represents a distinct word. The list serves several purposes, such as allowing all-caps words to be converted to italics ("Are you SURE you want to do that?"). (A minimal sketch of such a list appears after this list of passes.)
- Read spelling dictionaries ("wordlists" and "namelists"). GutenMark does not attempt to correct spelling, but it needs comprehensive lists of proper words. This allows it to do several things (or to do some things better), including handling of all-caps italicizing (see above), restoration of diacritical marks to 7-bit ASCII text, and italicizing of foreign words. None of that occurs at this point in the program, however; all that happens right now is that the text-file wordlist (see above) is updated with information derived from the spelling dictionaries. The wordlists and namelists can be huge: as of 11/20/01, they are about 25M uncompressed, or 6M in the gzipped format in which they are distributed. GutenMark can use the namelists/wordlists directly in gzipped format (see the gzip-reading sketch after this list).
- Global analysis of the input text file. The input text file is read, line by line, until we can determine where the Project Gutenberg file header ends and where the actual text begins. In between these is generally what I call a "prefatory" area, where you can find editing notes from the PG volunteers, a rudimentary title page, perhaps a table of contents, and so on. The contents of the prefatory area are highly variable, but the analysis will attempt to interpret parts of it as a table of contents, because this information sometimes helps later in locating section headings. Also, by examining the very first line of the file, the analysis attempts to determine the title of the work and the author. None of this information is used immediately; it is merely stored for later. (The header-end and title/author heuristics are sketched after this list.)
- Line analysis. GutenMark then reads the entire file line by line and records various useful information about each line, such as: Is it all caps? Is it shorter than usual? Does it begin with a capital letter? Is it indented unusually? Does it appear to be a quote? And so on. None of this information is used immediately; it is merely stored for later. Since the number of lines in the input file can be quite large, and the information being stored is non-trivial, a temporary file is used to store the data. (A possible per-line record is sketched after this list.)
- Body markup. This pass again reads the entire input file (as well as the line-data file created in the step above) and determines all of the necessary markup. It does not actually perform any markup; instead, it creates a temporary file storing data about the desired markup. Each markup record stored in this temporary file contains an indicator of the type of markup, as well as the offset in the input file at which the markup is supposed to occur. These records are in order of increasing offset; if the records are out of order, GutenMark will not produce correct HTML. (The record layout is sketched after this list.) At any rate, the input file is again read line by line, and two phases of analysis are applied to each line:
- The "line data" stored previously is combined heuristically in very complex ways to determine things like this: Is there a paragraph break? Is the line centered? Is it part of a section heading? Is it a line of verse? Should the line be right-justified or ragged-right? Given the general inconsistencies in the preparation of PG etexts, these are much more difficult questions to answer than they may seem.
- The line is then examined character by character to determine things like this: Should smart quotes be added? Should em-dashes be added? Should the text be italicized? Again, these are difficult questions to answer with certainty.
- Actual markup. Only after all of this analysis has been accomplished is actual HTML output created. The input text file is again read, this time character by character, and compared to the temporary file of markup records created in the step above. Each character is either output directly, removed, or altered according to the markup records, and additional markup may be inserted as well. Since creation of HTML is confined to this small section of code, it's theoretically possible to change the output-file format merely by changing this portion of the code; that's how LaTeX support was added relatively easily. (The emit loop is sketched below.)
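The passes above imply a handful of small data structures and loops, sketched below. None of these sketches is GutenMark's actual code, and every name in them is made up for illustration. First, the distinct-word list from the wordlist pass: since each capitalization pattern counts as a separate word, a case-sensitive comparison is exactly what is wanted, so a plain sorted array with strcmp serves as a minimal model.

```c
/* Minimal sketch of a case-sensitive distinct-word list (illustrative,
   not GutenMark's actual structures).  "Whale", "WHALE", and "whale"
   are three separate entries, so plain strcmp is the right comparator. */
#include <stdlib.h>
#include <string.h>

static int CompareWords(const void *a, const void *b)
{
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* Sort the collected words once, then do binary-search lookups. */
void SortWordlist(char **words, size_t count)
{
    qsort(words, count, sizeof *words, CompareWords);
}

int WordlistContains(char **words, size_t count, const char *word)
{
    return bsearch(&word, words, count, sizeof *words, CompareWords) != NULL;
}
```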
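The spelling dictionaries can be read directly in their distributed gzipped form with zlib's gz* routines, which also handle plain uncompressed files transparently. This sketch assumes one word per line and a link against zlib (-lz); the function name is illustrative.

```c
/* Sketch of reading a gzipped wordlist line by line with zlib. */
#include <string.h>
#include <zlib.h>

int LoadWordlist(const char *path)
{
    char line[256];
    gzFile in = gzopen(path, "rb");
    if (in == NULL)
        return -1;
    while (gzgets(in, line, (int)sizeof line) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';     /* strip the newline */
        /* ... merge the word into the in-memory text-file wordlist ... */
    }
    gzclose(in);
    return 0;
}
```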
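For the global-analysis pass, the fragment below shows the flavor of the header-end and title/author heuristics. The banner strings and the ", by " split reflect common conventions in older etexts; GutenMark's real analysis has to be far more tolerant of variation, so treat these patterns as placeholders rather than the actual rules.

```c
/* Sketch of spotting the end of the PG boilerplate header and splitting a
   first line of the form "...Etext of <Title>, by <Author>".  The leading
   boilerplate is left attached to the title; stripping it is omitted. */
#include <string.h>

int IsHeaderEnd(const char *line)
{
    return strstr(line, "*END*THE SMALL PRINT!") != NULL ||
           strstr(line, "*** START OF") != NULL;
}

void GuessTitleAuthor(char *firstLine, char **title, char **author)
{
    char *by = strstr(firstLine, ", by ");
    *title = firstLine;
    *author = NULL;
    if (by != NULL) {
        *by = '\0';          /* terminate the title portion */
        *author = by + 5;    /* skip ", by "                */
    }
}
```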
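The line-analysis pass can be modeled as one fixed-size record per input line, appended to a temporary file so that memory use stays bounded no matter how long the text is. The fields below are a guess at the attributes mentioned above, not GutenMark's actual record.

```c
/* Illustrative per-line record, spilled to a temporary file (tmpfile()). */
#include <stdio.h>

typedef struct {
    long     offset;             /* where the line starts in the input file */
    int      length;             /* line length in characters               */
    int      indent;             /* leading-whitespace count                */
    unsigned allCaps        : 1;
    unsigned startsUpper    : 1;
    unsigned shorterThanAvg : 1;
    unsigned looksQuoted    : 1;
} LineInfo;

int SaveLineInfo(FILE *tempFile, const LineInfo *info)
{
    return fwrite(info, sizeof *info, 1, tempFile) == 1 ? 0 : -1;
}
```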
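The body-markup pass can likewise be modeled as a stream of (offset, type) records written in non-decreasing offset order; a small guard makes the ordering requirement explicit. The enumeration below is illustrative, not GutenMark's actual set of markup types.

```c
/* Illustrative markup-record stream: each record names an action and the
   input-file offset where it applies.  Records must come out in order of
   increasing offset, or the final HTML pass will be wrong. */
#include <stdio.h>

typedef enum {
    MARKUP_PARA_BREAK,
    MARKUP_HEADING_START, MARKUP_HEADING_END,
    MARKUP_ITALIC_START,  MARKUP_ITALIC_END,
    MARKUP_VERSE_LINE
} MarkupType;

typedef struct {
    long       offset;           /* byte offset in the input text file */
    MarkupType type;
} MarkupRecord;

int EmitMarkup(FILE *tempFile, long *lastOffset, const MarkupRecord *rec)
{
    if (rec->offset < *lastOffset)
        return -1;               /* out of order: would corrupt the output */
    *lastOffset = rec->offset;
    return fwrite(rec, sizeof *rec, 1, tempFile) == 1 ? 0 : -1;
}
```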
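Finally, the actual-markup pass reduces to a character-by-character copy that consults the record stream. Reusing the MarkupRecord sketch above, the loop below also shows why confining the HTML strings to one small helper makes an alternative back end (such as LaTeX) a localized change; EmitTag() and its tags are hypothetical.

```c
/* Sketch of the final pass, building on the MarkupRecord sketch above.
   Whenever the current input offset matches the next record, the matching
   tag is emitted before the character itself is copied (a real pass may
   also drop or alter characters, e.g. for smart quotes). */
#include <stdio.h>

static void EmitTag(FILE *out, MarkupType type)
{
    switch (type) {
    case MARKUP_PARA_BREAK:   fputs("\n<p>", out); break;
    case MARKUP_ITALIC_START: fputs("<i>", out);   break;
    case MARKUP_ITALIC_END:   fputs("</i>", out);  break;
    default:                  break;
    }
}

void EmitOutput(FILE *in, FILE *out, const MarkupRecord *recs, size_t count)
{
    size_t next = 0;
    long offset = 0;
    int c;
    while ((c = fgetc(in)) != EOF) {
        while (next < count && recs[next].offset == offset)
            EmitTag(out, recs[next++].type);
        fputc(c, out);
        offset++;
    }
}
```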
I don't have any clear definition of what "near-term" means. But I do have other projects, and GutenMark is competing with those other projects for my time.
I have recently discovered the Stanford Natural Language Parser (NLP) project. This is a Java program that can accept text and parse it into parts of speech, and specifically can be used to detect whether a given string of text is a sentence. I believe that this principle can be used to improve detection of headings—which at present is the principal bugaboo of GutenMark—and conceivably help with other problem areas such as detection of verse. This approach seems more likely to be of value than the neural network approach about which I previously speculated in this space. At any rate, code related to this has already been added to GutenMark, but hasn't yet reached the point of being useful. These ideas are described further in the source file StanfordPass.c.
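As a rough illustration of the idea, a short, capitalized line that does not parse as a complete sentence is a much stronger heading candidate than one that does. The predicate below merely stands in for whatever verdict the external (Java) Stanford parser would supply; neither function is real GutenMark code.

```c
/* Hypothetical sketch only: LineIsSentence() represents the Stanford
   parser's verdict, obtained however the real integration ends up working. */
int LineIsSentence(const char *line);

int LooksLikeHeading(const char *line, int isShort, int isTitleOrAllCaps)
{
    if (!isShort || !isTitleOrAllCaps)
        return 0;                       /* headings are short and capitalized */
    return !LineIsSentence(line);       /* ...and usually aren't sentences    */
}
```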