RIP: Console-based text indexing, retrieval, and browsing
Frequently Asked Questions
Check out the artist's website if you think the painting looks cool.
Also, the source code was originally written for a 16-bit compiler, Borland's Turbo C 2.x, but has been ported to a 32-bit compiler, GNU gcc. There are numerous differences between these systems that are difficult to overcome in a program of any complexity. The primary difficulties are that the integer datatypes differ (int and unsigned are 16 bits in Turbo C but 32 bits in GNU gcc) and that Turbo C's "console I/O" functionality is completely missing in gcc, so it has been mimicked with the ncurses library. Rather than extensively rewriting RIP to overcome these limitations, I chose instead to write a general-purpose library (TurboC) that could be used to port any Turbo C program (not merely RIP) with minimal rewriting. An unfortunate side effect is that the code has thereby become more confusing, in particular because the int and unsigned datatypes appear to be 32-bit but have been made 16-bit by macro substitution.
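To make the mechanics concrete, here is a minimal sketch of the two porting tricks just described. It assumes the compatibility header works roughly like this; the identifiers (gotoxy, checksum) and the exact macro definitions are illustrative, not the actual TurboC library code.

```c
#include <stdint.h>
#include <ncurses.h>   /* must be included before the macros below */

/* (1) Make "int" and "unsigned" mean 16 bits under gcc, as they did
 * under Turbo C. This is a sketch of the idea only: defining keywords
 * as macros is a blunt instrument (compound spellings like "unsigned
 * char" break), so a real compatibility header has to be more careful. */
#define int      int16_t
#define unsigned uint16_t

/* (2) Mimic a Turbo C console-I/O routine with ncurses. Turbo C's
 * gotoxy() takes 1-based (column, row); ncurses' move() takes 0-based
 * (row, column), so the shim swaps and shifts the arguments. */
static void gotoxy(int x, int y)
{
    move(y - 1, x - 1);
}

/* A Turbo C-era routine that relies on 16-bit arithmetic: with the
 * macros above, "sum" wraps modulo 65536 under gcc, just as the
 * original did on a 16-bit DOS compiler. */
unsigned checksum(const char *buf, int len)
{
    unsigned sum = 0;
    int i;
    for (i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}
```

Presumably the macros live in a header included by every source file, so the ported code can keep declaring plain int and unsigned while getting Turbo C's 16-bit semantics.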
Before creating RIP in 1996, I tried as hard as I could to find an existing (free) system that had all of the characteristics I wanted. A couple of years later, I did find (and purchase) a commercial system that worked quite well (www.dtsearch.com), but I wouldn't characterize it as free -- nor even as "affordable" for anyone but an enthusiast. It's a little ironic that in 2002, exactly two days after reviving this project and creating the RIP website, I came across a notice of a quite acceptable GPL'd indexing/retrieval system, called Namazu, that has most of the characteristics I want.
Anyhow, I became curious and decided to compare the two systems. (If I come across any other alternatives, I'll post a comparison with them also.) For a test, I've indexed the Project Gutenberg year 2000 etexts (i.e., the etexts added just in the year 2000, and not the complete set of etexts as of 2000), from which I've removed the Human Genome Project files (which aren't really text files). This leaves a set of 498 etexts totalling 199 megabytes uncompressed. Considering RIP's age, and the fact that it's a 16-bit application, I'm pretty pleased with the results of the comparison.
By the way, don't treat this as a full feature-by-feature comparison of the systems being examined. The test involves just the specific application that RIP was designed for, whereas general-purpose indexing systems (such as Namazu) have many features that RIP lacks.
| Indexing system | Test conditions | Resulting database size | Time taken to index the database | Comments |
|---|---|---|---|---|
| RIP, UNIX | 450 MHz iMac (PowerPC) with 320M RAM, running Linux | 166M | 14 minutes, including compression | (None) |
| RIP, MS-DOS | 500 MHz Pentium 3 with 128M RAM, running Windows 98 under VMware on Linux | 166M | 31 minutes, including compression | Since emulated file operations under VMware are very slow, the indexing would presumably have run much faster on a native Win32 machine. |
| Namazu | 450 MHz iMac (PowerPC) with 320M RAM, running Linux | 76M (projected 133M if all files had been indexed) | 78 minutes (projected 137 minutes if all files had been indexed); the text files were also pre-compressed with gzip before indexing, and that processing time is not included in the 78 (137) minutes | Unfortunately, Namazu rejected 78 files, comprising about 43% of the database, as being "too big"; that 43% was therefore not indexed, which is why "projected" numbers appear in these results. |