Console-based text indexing, retrieval, and browsing
RIP in a nutshell
Why was RIP created?
Pros and Cons of RIP
Using RIP
Compiling RIP
Future plans
The way online etexts were used in 1996 (and still in 2002, as far as I can tell) is that if you know the author/title of the work, then you can download the etext and read it. The problem is that you often don't know what you want to read. What I wanted to do, in contrast, was a kind of free association using etexts. Say, for example, that I took it into my head to read about Henry Tree; and while doing so, something about Granville-Barker caught my eye; and then while reading about Granville-Barker I saw something that made me want to read about the city of Birmingham. For this kind of free association, you need to be able to find what you need and get to it fast. Web search engines simply didn't work very well for this.
So, my idea was to download all etexts I could find, from everywhere on the web, onto my own local hard disk; and then, to index and browse these etexts using my own software (RIP). At the peak, in 1997, my text database contained 9000 etexts and was about 3 GB in size. RIP worked fine -- just as I had intended -- but the effort of daily seeking out and downloading all of the etexts just became too great for me, particularly at the slow download speeds then available, so I lost interest and gave up the project. In other words, the dreary task of collecting the etexts overcame my pleasure in reading them. (It would be much easier today, because Project Gutenberg has grown so much, and seems to be absorbing so many of the other available etexts, that I would simply stick to Project Gutenberg and ignore all of the others.)
RIP was originally written for MS-DOS, but I've ported it to *NIX (and improved it a little in the process) because nobody is interested in MS-DOS programs today. I've personally used only the MS-DOS, Linux 'x86, and Linux PPC versions of the program, but I presume that it can be made to work easily with FreeBSD and other reasonably pure unices. Mac OS X may prove a bit more of a challenge, I'm afraid.
It is somewhat interesting after the passage of several years to compare RIP against other similar systems.
Advantages | Disadvantages |
---|---|
| | It is not, as it stands, something which can simply be plunked into a cgi directory and used as the basis of a web-enabled search service. |
Has its own built-in text browser. | The text browser is not a classy, graphics-based GUI. |
The database text is compressed. | The text-compression format is a weird one I invented for myself, and doesn't correspond to any of the standard compression/decompression utilities. |
The index created from the text database is very small. (Typically, compressed text plus index was around 75% of the uncompressed text size.) | The text-compression method is language-specific, and was tailored to English. Its performance suffers for other languages. (Though, all things considered, it doesn't work too badly.) |
Works with limited computer resources, and particularly (since it was originally an MS-DOS program) limited RAM. (Of course, these days, who cares?) | Command-line only. (Well, there's a Win32 browser, but I don't like it as well as the text-mode one. And you could strip off the user interface and use the indexing, search, retrieval functions by themselves.) |
Search time is very fast, as contrasted to a linear search. Within a single directory or CDROM database segment, the search time grows roughly as log N. (The total search time is proportional to the number of directories or CDROMs used, but this is generally a small number that isn't increasing very fast, and hence isn't really an important limitation.) | Search time isn't fast, though, compared to indexing systems that have been optimized just for speed. Achieving some of the other goals in the "advantages" column (such as "small index") meant trading away some search speed. (But the speed was fine in 1996-1997, when I was using the program daily, and computers are a lot faster now than they were then!) |
In adding new text files, only the new files need to be indexed; existing text files don't need to be reindexed. | |
The text database can span multiple directories, or even multiple media. (Typically, I grew a directory until the database + index reached the size of a CDROM. Then I would freeze that chunk of the database, burning it onto a CDROM, and begin a new directory for new text files. Each directory or CDROM could be used in a standalone fashion, or could be integrated to form a single database.) | |
Compression/indexing of the text database occurs in place. In other words, if you have N bytes of text data, you don't need any extra disk space to compress/index it. The original files are removed in the process of indexing. (Actually, you do need some extra space for temporary indexes, but the indexes are small relative to the text; the point is that you don't need an extra N bytes or 2N bytes, or whatever.) | |
There is no practical limit on the size of the database. | |
A kooky feature of the text browser that I particularly like -- though it doesn't actually depend on any search/retrieval functionality -- is to be able to display random text from the database. This is a great way to browse through text when you don't know what you're interested in looking at. | |
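
The segmented-database idea in the table above can be illustrated with a small sketch. This is not RIP's actual code or index format (neither is described on this page); the `Segment` class, the `search` function, and the sample file names are all hypothetical. It only shows why lookup within one segment can be roughly log N (a binary search over a sorted word list), why total search time grows with the number of segments, and why adding a new segment leaves the old indexes untouched.

```python
from bisect import bisect_left


class Segment:
    """One frozen chunk of the database (a directory or CDROM) with its own index.

    Hypothetical structure for illustration only; not RIP's on-disk format.
    """

    def __init__(self, name, documents):
        self.name = name
        postings = {}
        for doc_id, text in documents.items():
            # Record each distinct word once per document.
            for word in set(text.lower().split()):
                postings.setdefault(word, []).append(doc_id)
        # Sorted word list, with a parallel list of document lists.
        self.words = sorted(postings)
        self.postings = [sorted(postings[w]) for w in self.words]

    def lookup(self, word):
        """Binary search over the sorted word list: about log N comparisons."""
        key = word.lower()
        i = bisect_left(self.words, key)
        if i < len(self.words) and self.words[i] == key:
            return self.postings[i]
        return []


def search(segments, word):
    """Query every segment in turn; cost is proportional to the segment count."""
    hits = []
    for seg in segments:
        hits.extend((seg.name, doc_id) for doc_id in seg.lookup(word))
    return hits


# Two independent segments; building cd2 never touches cd1's index.
cd1 = Segment("cd1", {
    "tree.txt": "Henry Tree managed a theatre",
    "barker.txt": "Granville-Barker staged Shaw",
})
cd2 = Segment("cd2", {
    "brum.txt": "The city of Birmingham had a repertory theatre",
})
print(search([cd1, cd2], "theatre"))
```

Each segment here can be searched on its own (`cd1.lookup(...)`) or combined with others through `search`, which mirrors the standalone-or-integrated use of directories and CDROMs described in the table.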
In *NIX ...
I hope that the rip executable can be built on any system (*NIX or not) with a GNU gcc compiler and an ncurses library. The necessary files are in the "ripUnix" directory. You also need to download and install a copy of the TurboC library. To build rip, the GNU make program is required. Simply run make (possibly gmake on FreeBSD systems). After the build is complete, copy the executable rip and its data-file (rip_allf.dat) into some directory (say, ~/MyEtexts), and run the program from there.
I've done this only on the following systems:
Linux 'x86 (SuSE 7.2)
Linux PPC (SuSE 7.3)
FreeBSD (4.2)
There may be complications on other *nix systems of which I'm not aware. By the way, if you're interested in looking at rip's source code -- or in modifying the code -- the use of the TurboC library somewhat complicates this, in ways explained in the FAQ.
In MS-DOS or Win32 ...
I can't imagine why you'd want to do this, but here are some notes just in case. All of the necessary files are contained in the "ripMSDOS" directory.
RIP is compiled with Borland's Turbo C 2.0, and the zip file you download contains both a Turbo C project file (rip.prj) and configuration file (rip.tc). Turbo C 2.0 is an ancient DOS-based compiler, but is terrific in that it has an IDE and a debugger built into it. You can currently download it (actually, version 2.01) for free from Borland's "museum" website, though you may have to undergo a (free) registration process to do so. (I'd put a copy here, but the notices on Borland's website specifically forbid doing so.) Why Turbo C 2.0? Well, it's what I used to develop the original version of the program, even though it was already long-obsolete at that time, because it worked and I was used to it. To run the program, both the executable (RIP.EXE) and its data file (RIP_ALLF.DAT) must be together in the same directory, and you must run the program from within this directory.
If you want to use some other compiler (such as Mingw32, or Borland's free C++ compiler v.5, or Borland C++Builder, or even -- God forbid! -- Visual C++) there are some problems you'll have to watch out for:
These objections together imply that a replacement of RIP is more reasonable (for my purposes) than modifications to it. What I would do today, to accomplish the same purpose, is the following: provide a portable Win32/UNIX/MacOS GUI; stick just to Project Gutenberg etexts, rather than scanning the entire web; keep the files in zip format (as provided by PG); and simply provide my own indexing rather than my own compression.
If/when such a replacement is available, I'll put a link to it here.
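
To make the "zip files plus my own indexing" idea a little more concrete, here is a minimal sketch of what the indexing half might look like. It is purely illustrative: no replacement program exists yet, the function name `index_zip` and the labels are invented for the example, and it assumes each archive holds plain-text members ending in ".txt" that can be read as Latin-1. It uses only Python's standard zipfile module, so the text stays compressed in the original archive and only the index is new.

```python
import io
import re
import zipfile
from collections import defaultdict


def index_zip(label, source, index=None):
    """Add the words of every .txt member of one zip archive to a word index.

    label  -- a name for the archive (hypothetical; e.g. its file name)
    source -- a path or a file-like object accepted by zipfile.ZipFile
    index  -- an existing index to extend, so new archives can be added
              without reindexing archives that were processed earlier
    """
    if index is None:
        index = defaultdict(set)
    with zipfile.ZipFile(source) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".txt"):
                continue  # skip anything that is not a plain-text etext
            # Assumption: Latin-1 text; this decoding never raises an error.
            text = archive.read(member).decode("latin-1")
            for word in re.findall(r"[a-z]+", text.lower()):
                index[word].add((label, member))
    return index


# Build a tiny in-memory archive so the example runs without any files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("tree.txt", "Henry Tree and Granville-Barker")
index = index_zip("demo.zip", buffer)
print(sorted(index["tree"]))
```

Passing the same `index` to later calls extends it with further archives, which keeps the "only new files need to be indexed" property of the original program.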