Shelved the RCS parser for a while. I cranked out as much performance as it can possibly get (without making the code look *really* horrible). Overall, it is somewhere between 10 and 12 times faster than when I started. For small RCS files, it is comparable to forking off rlog and parsing the result. For large files, though, rlog/parse is faster. I think the next step is to use something like mxTextTools or a custom RCS file tokenizer. The internal architecture is set up as a "token stream" feeding the parser, which should make it easy to swap in different stream implementations.
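To make that concrete, here's a minimal sketch of the token-stream idea (the names and interface are illustrative, not the actual parser code): the parser only ever asks the stream for the next token, so a chunked-read stream, an mmap stream, or an mxTextTools-based stream could be swapped in behind the same interface.

```python
class TokenStream:
    """The interface the parser codes against: just get() and match()."""
    def get(self):
        """Return the next token, or None at end of file."""
        raise NotImplementedError
    def match(self, token):
        """Consume the next token, insisting that it equals `token`."""
        t = self.get()
        if t != token:
            raise RuntimeError('expected %r, saw %r' % (token, t))

class StringTokenStream(TokenStream):
    """Trivial implementation: tokenize everything up front.
    A faster stream implementation can replace this without touching
    the parser at all."""
    def __init__(self, text):
        self.tokens = iter(text.split())
    def get(self):
        return next(self.tokens, None)

def parse_head(stream):
    """A tiny taste of the parser side: read the RCS "head" phrase."""
    stream.match('head')
    return stream.get().rstrip(';')
```

The point isn't this particular tokenizer (splitting on whitespace is too naive for real RCS data); it's that the parser half stays fixed while the stream half gets optimized.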
I tried using mmap, but it was no faster than just reading the darned thing into memory (in 100k chunks). It is simply that the algorithm is not I/O bound, so using mmap to optimize the I/O doesn't help at all.
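For reference, the two I/O strategies look roughly like this (a sketch; the function names are mine). Both hand the same bytes to the tokenizer, and since the time goes to parsing rather than I/O, neither wins.

```python
import mmap

def read_chunked(path, chunk_size=100 * 1024):
    """Read the whole file into memory in 100k chunks."""
    pieces = []
    with open(path, 'rb') as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            pieces.append(chunk)
    return b''.join(pieces)

def read_mapped(path):
    """Map the file instead; saves some copying, but not parse time."""
    with open(path, 'rb') as fp:
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m)
```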
Over the weekend, I've been working on revamping Subversion's build system. We currently use automake. It is a total dog, and some parts of it are actually a bit hard to deal with. I've tossed out automake and recursive makes, replacing them with a single top-level makefile. The inputs to that makefile are generated by a Python script. Net result: ./configure produces a Makefile from Makefile.in, and that Makefile includes build-outputs.mk. build-outputs.mk is generated by the Python script when we create the distribution tarballs (so end users don't need Python just to build; this is similar to how automake uses Perl, but the outputs are portable).
The resulting build process is much faster. ./configure is also going to be speedy, since we only need to process one Makefile.in. In addition, automake creates a billion "sed" replacements within configure, then applies all of those to all the files; we'll be reducing the replacements to just a couple dozen. With the reduced file count, it should scream. We also drop automake's time-consuming step of producing Makefile.in from Makefile.am; my Python script executes in just 2 seconds of wall clock time. That includes examining all the directories to find .c files to include in the build.
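The directory scan is nothing fancy; it amounts to a one-pass walk like this sketch (not the actual script):

```python
import os

def find_sources(root):
    """Walk the tree once, collecting .c files per directory.
    Returns {directory: [sorted .c filenames]}."""
    sources = {}
    for dirpath, dirnames, filenames in os.walk(root):
        c_files = sorted(f for f in filenames if f.endswith('.c'))
        if c_files:
            sources[dirpath] = c_files
    return sources
```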
I've got make all, install, and clean working. I still need to do distclean and extraclean, debug the "make check" target, and then handle dependency generation. For the latter, the Python script will just open the files and look for #include lines. This will be much more portable than automake's reliance on gcc-specific features. Oh, and we also get rid of automake's reliance on gmake.
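The dependency scan really can be that dumb and still work. A sketch of the idea (it only picks up quoted includes, on the assumption that the project's own headers are what matter for rebuild dependencies, while angle-bracket system headers can be ignored):

```python
import re

# Match lines like:  #include "svn_types.h"   (whitespace-tolerant)
_INCLUDE_RE = re.compile(r'^\s*#\s*include\s+"([^"]+)"')

def scan_deps(path):
    """Return the quoted header names a source file includes."""
    deps = []
    with open(path) as fp:
        for line in fp:
            m = _INCLUDE_RE.match(line)
            if m:
                deps.append(m.group(1))
    return deps
```

No compiler gets involved, which is the whole portability win: this works the same no matter which cc the end user has.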
Nice all around...