Ron's Indexing Program (RIP)
Home Page
Console-based text indexing, retrieval, and browsing


"Reflections" by Lynn Rothan

Contents

RIP in a nutshell
Why was RIP created?
Pros and Cons of RIP
Using RIP
Compiling RIP
Future plans

RIP in a nutshell

RIP is a command-line program for efficiently indexing and retrieving very large (multi-gigabyte) plain-text databases to which new text files are often added (but old ones relatively seldom changed).  RIP is no longer under development as such, but is a perfectly fine program that works quite well at what it was designed for.  Just in case it might prove useful for somebody, I've decided to release it as free software under the GNU General Public License (GPL).


Why was RIP created?

In 1996, I was interested in using the etexts (such as those of Project Gutenberg) that were becoming widely available on the Internet, but I wanted the ability to browse these etexts in a much less restrained fashion than was possible with existing methods.  Basically, I wanted to be able to do full-text searches on the entire set of such etexts, and to quickly and conveniently hop back and forth between etexts at will.

The way online etexts were used in 1996 (and still in 2002, as far as I can tell) is that if you know the author or title of the work, you can download the etext and read it.  The problem is that you often don't know what you want to read.  What I wanted to do, in contrast, was a kind of free association using etexts.  Say, for example, that I took it into my head to read about Henry Tree; and while doing so, something about Granville-Barker caught my eye; and then while reading about Granville-Barker I saw something that made me want to read about the city of Birmingham.  For this kind of free association, you need to be able to find what you need and get to it fast.  Web search engines simply didn't work very well for this.

So, my idea was to download all the etexts I could find, from everywhere on the web, onto my own local hard disk, and then to index and browse these etexts using my own software (RIP).  At the peak, in 1997, my text database contained 9000 etexts and was about 3 GB in size.  RIP worked fine -- just as I had intended -- but the daily effort of seeking out and downloading all of the etexts became too great for me, particularly at the slow download speeds then available, so I lost interest and gave up the project.  In other words, the dreary task of collecting the etexts overcame my pleasure in reading them.  (It would be much easier today, because Project Gutenberg has grown so much, and seems to be absorbing so many of the other available etexts, that I would simply stick to Project Gutenberg and ignore all the others.)

RIP was originally written for MS-DOS, but I've ported it to *NIX (and improved it a little in the process) because nobody is interested in MS-DOS programs today.  I've personally used only the MS-DOS, Linux 'x86, and Linux PPC versions of the program, but I presume that it can be made to work easily with FreeBSD and other reasonably pure unices.  Mac OS X may prove a bit more of a challenge, I'm afraid.

It is somewhat interesting, after the passage of several years, to compare RIP against other similar systems.


Pros and Cons of RIP

RIP has many cool features for solving the problem I've described above, but also some drawbacks which have become apparent with the passage of the eons.
 
Advantages:

  * Has its own built-in text browser.
  * The database text is compressed.
  * The index created from the text database is very small.  (Typically, compressed text plus index came to around 75% of the uncompressed text size.)
  * Works with limited computer resources, and particularly (since it was originally an MS-DOS program) limited RAM.  (Of course, these days, who cares?)
  * Search time is very fast, as contrasted with a linear search.  Within a single directory or CDROM database segment, the search time is about log N.  (The total search time is proportional to the number of directories or CDROMs used, but this is generally a small number that isn't increasing very fast, and hence isn't really an important limitation.  See the sketch after this list.)
  * In adding new text files, only the new files need to be indexed; existing text files don't need to be reindexed.
  * The text database can span multiple directories, or even multiple media.  (Typically, I grew a directory until the database plus index reached the size of a CDROM.  Then I would freeze that chunk of the database, burn it onto a CDROM, and begin a new directory for new text files.  Each directory or CDROM could be used in standalone fashion, or could be integrated to form a single database.)
  * Compression/indexing of the text database occurs in place.  In other words, if you have N bytes of text data, you don't need an extra N or 2N bytes of disk space to compress/index it; the original files are removed in the process of indexing.  (You do need some extra space for temporary indexes, but the indexes are small relative to the text.)
  * There is no practical limit on the size of the database.
  * A kooky feature of the text browser that I particularly like -- though it doesn't actually depend on any search/retrieval functionality -- is its ability to display random text from the database.  This is a great way to browse through text when you don't know what you're interested in looking at.

Disadvantages:

  * It is not, as it stands, something which can simply be plunked into a cgi directory and used as the basis of a web-enabled search service.
  * The text browser is not a classy, graphics-based GUI.
  * The text-compression format is a weird one I invented for myself, and doesn't correspond to any of the standard compression/decompression utilities.
  * The text-compression method is language-specific, and was tailored to English, so its performance suffers for other languages.  (Though, all things considered, it doesn't work too badly.)
  * Command-line only.  (Well, there's a Win32 browser, but I don't like it as well as the text-mode one.  And you could strip off the user interface and use the indexing, search, and retrieval functions by themselves.)
  * Search time isn't fast compared to indexing systems that have been optimized purely for speed.  Achieving some of the goals in the "advantages" list (such as the small index) involved tradeoffs against search speed.  (But the speed was fine in 1996-1997, when I was using the program daily, and computers are a lot faster now than they were then!)
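To make the "about log N" claim concrete, here is a minimal sketch of a segmented, sorted term index: each directory or CDROM segment carries its own index searched with a binary search, and a query simply loops over the segments, for roughly (number of segments) * log N total work.  This is only an illustration of the idea -- the struct names and layout here are my own inventions for this page, not RIP's actual on-disk format.

    /* Sketch only: a segmented, sorted term index (NOT RIP's real format). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        const char *term;    /* an indexed word */
        long        offset;  /* where its text lives within the segment */
    } IndexEntry;

    typedef struct {
        const char       *name;     /* e.g., a directory or CDROM label */
        const IndexEntry *entries;  /* sorted by term */
        size_t            count;
    } Segment;

    static int compare_term(const void *key, const void *elem)
    {
        return strcmp((const char *)key, ((const IndexEntry *)elem)->term);
    }

    /* Look a term up in every segment: about log N work per segment. */
    static void lookup(const char *term, const Segment *segs, size_t nsegs)
    {
        for (size_t s = 0; s < nsegs; s++) {
            const IndexEntry *hit = bsearch(term, segs[s].entries,
                                            segs[s].count, sizeof(IndexEntry),
                                            compare_term);
            if (hit != NULL)
                printf("%s: \"%s\" at offset %ld\n",
                       segs[s].name, term, hit->offset);
        }
    }

    int main(void)
    {
        /* Entries must be kept sorted for the binary search to work. */
        static const IndexEntry cd1[] = {
            { "birmingham", 1024 }, { "granville", 2048 }, { "tree", 4096 }
        };
        static const Segment database[] = {
            { "cdrom1", cd1, sizeof cd1 / sizeof cd1[0] }
        };
        lookup("granville", database, 1);
        return 0;
    }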


Using RIP

Consult the detailed instructions here.  As an overview, though, here's what happens:
  1. You have to collect the etexts you're interested in -- in plain-vanilla ASCII format -- and store them on your local hard disk.  You'll want to put them all in just a few directories.  (RIP doesn't help you to actually acquire the etexts.)
  2. Compress/index the etexts.  These etexts will now be accessible only by using RIP.  But you can always easily undo the process later to completely restore the original etexts, if you choose to do so.  In other words, it's presumed that you intend to more-or-less permanently store the etexts in RIP format.
  3. Browse/search the etexts.  RIP contains not only a search engine, but also a reader program which you can use to view the etexts.  Both are rather primitive aesthetically, but serve their purpose.
  4. If you want to add new etexts later, repeat the process.  Only the changed etexts will be indexed, so the process will be pretty fast.  (The sketch below illustrates the idea.)
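As an illustration of step 4, here is the general shape of incremental indexing, with the bookkeeping reduced to a toy in-memory manifest of already-indexed files.  The helper names (already_indexed, compress_and_index) are hypothetical stand-ins, not RIP's actual functions, and the directory scan assumes a POSIX system.

    /* Sketch of incremental indexing: only files absent from the manifest
       get processed; previously indexed files are left alone entirely. */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    /* Toy manifest; in real life this would be read from the index. */
    static const char *manifest[] = { "alice.txt", "hamlet.txt" };

    static int already_indexed(const char *name)
    {
        for (size_t i = 0; i < sizeof manifest / sizeof manifest[0]; i++)
            if (strcmp(manifest[i], name) == 0)
                return 1;
        return 0;
    }

    static void compress_and_index(const char *name)
    {
        /* Stand-in for the real compress/index-in-place step. */
        printf("indexing new etext: %s\n", name);
    }

    static void index_new_files(const char *dir)
    {
        DIR *d = opendir(dir);
        struct dirent *entry;
        if (d == NULL)
            return;
        while ((entry = readdir(d)) != NULL) {
            if (strstr(entry->d_name, ".txt") == NULL)
                continue;                    /* crude plain-text filter */
            if (already_indexed(entry->d_name))
                continue;                    /* skip old files entirely */
            compress_and_index(entry->d_name);
        }
        closedir(d);
    }

    int main(void)
    {
        index_new_files(".");
        return 0;
    }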
If you want to keep things really, really simple, though,  just mindlessly follow these instructions:

In *NIX ...

In MS-DOS or Win32 ...

Compiling RIP

In *NIX ...

I hope that the rip executable can be built on any system (*NIX or not) with the GNU gcc compiler and an ncurses library.  The necessary files are in the "ripUnix" directory.  You also need to download and install a copy of the TurboC library.  To build rip, the GNU make program is required.  Simply run make (possibly gmake on FreeBSD systems).  After the build is complete, copy the executable rip and its data file (rip_allf.dat) into some directory (say, ~/MyEtexts), and run the program from there.

I've done this only on the following systems:

Linux 'x86 (SuSE 7.2)
Linux PPC (SuSE 7.3)
FreeBSD (4.2)
There may be complications on other *NIX systems of which I'm not aware.  By the way, if you're interested in looking at rip's source code -- or in modifying the code -- the use of the TurboC library somewhat complicates this, in ways explained in the FAQ.

In MS-DOS or Win32 ...

I can't imagine why you'd want to do this, but here are some notes just in case.  All of the necessary files are contained in the "ripMSDOS" directory.

RIP is compiled with Borland's Turbo C 2.0, and the zip file you download contains both a Turbo C project file (rip.prj) and a configuration file (rip.tc).  Turbo C 2.0 is an ancient DOS-based compiler, but is terrific in that it has an IDE and debugger built into it.  You can currently download it (actually, version 2.01) for free from Borland's "museum" website, though you may have to undergo a (free) registration process to do so.  (I'd put a copy here, but the notices on Borland's website specifically forbid doing so.)  Why Turbo C 2.0?  Well, it's what I used to develop the original version of the program, even though it was already long obsolete at that time, because it worked and I was used to it.  To run the program, both the executable (RIP.EXE) and its data file (RIP_ALLF.DAT) must be together in the same directory, and you must run the program from within that directory.

If you want to use some other compiler (such as Mingw32, or Borland's free C++ compiler v.5, or Borland C++Builder, or even -- God forbid! -- Visual C++), there are some problems you'll have to watch out for:

  1. I always used unsigned char by default (in other words, everywhere you encounter char, it really means unsigned char), so you'll have to set this as the compiler default.
  2. Turbo C 2.0 used 16-bit int and unsigned, and 32-bit long.  Modern 32-bit compilers tend to use 32-bit int/unsigned and 32- or 64-bit long.  This will very likely break the code, and you'll have to fix it somehow.
  3. There are several Turbo-centric things in the code.  The "#pragma -sig" directives peppered throughout the code are merely there to eliminate a "loss of significant digits" warning from the compiler, and can be removed or ignored.  The _stklen variable controls the allocation of stack space, and can simply be removed.  The various operations associated with conio.h are the real sticking points if you depart from using a Borland compiler.  (One possible approach to items 1-3 is sketched after this list.)
  4. ... and many other little quirks described ad nauseam on the TurboC library website.
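To make items 1-3 a little more concrete, here is one speculative way to shim the code on a modern C compiler with ncurses available.  The tc_* and rip_* names are my own, for illustration only -- they are not identifiers from RIP's source -- and this is a sketch of the general approach, not a drop-in fix.

    /* Speculative porting shim for items 1-3; not part of RIP itself. */
    #include <stdint.h>
    #include <curses.h>   /* ncurses supplies getch() and friends on *NIX */

    /* Item 1: make the "unsigned char by default" assumption explicit. */
    typedef unsigned char uchar;

    /* Item 2: pin down Turbo C's 16-bit int and 32-bit long. */
    typedef int16_t  tc_int;
    typedef uint16_t tc_unsigned;
    typedef int32_t  tc_long;

    /* Item 3: minimal conio.h-style replacements built on ncurses.
       Call rip_conio_init() once at startup and endwin() at exit. */
    static void rip_conio_init(void)
    {
        initscr();            /* enter curses mode */
        cbreak();             /* raw-ish keyboard input, like conio */
        noecho();             /* don't echo keystrokes */
        keypad(stdscr, TRUE); /* decode arrow/function keys */
    }

    static int rip_getch(void)
    {
        return getch();       /* blocking single-key read */
    }

    static int rip_kbhit(void)
    {
        int ch;
        nodelay(stdscr, TRUE);   /* make getch() non-blocking */
        ch = getch();
        nodelay(stdscr, FALSE);
        if (ch == ERR)
            return 0;            /* no key waiting */
        ungetch(ch);             /* push the key back for rip_getch() */
        return 1;
    }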

Future plans

I have no immediate plans to further develop RIP as such, though some cleanup of the application will probably take place.  I believe that in today's environment, a good, portable GUI is a necessity for a program of this kind.  Further, I believe that it is now feasible (and more efficient in many cases) to use a standard compression format such as gzip or zip, rather than the RIP compression format.  (Not to mention the fact that a variety of "search engine" type applications are already available in the open-source software world.)

These objections together imply that a replacement for RIP is more reasonable (for my purposes) than modifications to it.  What I would do today, to accomplish the same purpose, is the following: provide a portable Win32/UNIX/MacOS GUI; stick just to Project Gutenberg etexts, rather than scanning the entire web; keep the files in zip format (as provided by PG); and simply provide my own indexing rather than my own compression.

If/when such a replacement is available, I'll put a link to it here.


©2002 Ronald S. Burkey.  Last updated 04/20/02 by RSB.  Contact me.