corps: bespoke text codecs
Happy 12/12/12, peoples! In honor of repeated subsequences, today I'm happy to release a new set of compression tools, Corps.
Corps is a toolkit for generating custom text codecs, specialized to particular bodies of text. You give it an example corpus to analyze, and Corps can generate codecs based on what it finds.
I say "probably", because in reality you don't know what substrings are most common. Corps uses the Re-Pair algorithm to build up an optimal token set. This algorithm treats all characters in the input as tokens, and recursively creates composite tokens from the most common pair of adjacent tokens, repeating the process until there are no more repeated pairs of tokens. For full details, see "Off-Line Dictionary-Based Compression" by Larsson and Moffat.
Special-purpose codecs can also provide some interesting properties, such as the ability to decompress a substring of the original text.
Corps is written for Guile version 2.0. See http://gnu.org/s/guile for more information on Guile and how to install it on your system.
To build Corps from git, do:
git clone git://gitorious.org/corps/corps.git cd corps ./autogen.sh && ./configure && make && make check
You can install using make install, but it's often more convenient to just run corps uninstalled, using the env script.
Corps includes a simple command-line tool called corps. For full documentation, run corps help. You can run it uninstalled by prefixing the ./env from the build tree.
As an example, say you want to build a database of wikipedia pages on cheese. Let's fetch a page:
$ curl -s 'http://en.wikipedia.org/wiki/Maroilles_(cheese)' > cheese $ ls -l cheese -rw-r--r-- 1 wingo wingo 43123 Dec 12 15:12 cheese
Now we analyze it to determine common substrings:
./env corps extract-all cheese > cheese-tokens
This generates a list of (string,frequency) pairs. An extract-all token set is usually quite large. We can pare it down to something manageable, the 500 most common substrings:
./env corps extract-subset cheese-tokens 500 cheese > 500-tokens
With this dictionary, we can huffman-code the page:
$ ./env corps encode -t 500-tokens -m huffman cheese cheese.huff $ ls -l cheese.huff -rw-r--r-- 1 wingo wingo 18060 Dec 12 16:09 cheese.huff
Well that's pretty cool -- it's less than half the size of the source text. We can also pare down this set of tokens to an appropriate number of tokens for a bytecode, and try doing a byte-encode of the file:
$ ./env corps extract-subset 500-tokens 254 cheese > 254-tokens $ ./env corps encode -t 254-tokens cheese > cheese.bc $ ls -l cheese.bc -rw-r--r-- 1 wingo wingo 22260 Dec 12 16:19 cheese.bc
It's larger than the huffman-code, not only because of the smaller dictionary, but also because a bytecode is less dense. In practice though a bytecode is good enough while being very fast, so we'll continue with the bytecode.
Now let's generate a C encoder and decoder for this token set.
$ ./env corps generate-c-byte-encoder 254-tokens > encoder.inc.c $ ./env corps generate-c-byte-decoder 254-tokens > decoder.inc.c $ cp ~/src/corps/corps/decoder.c . $ cp ~/src/corps/corps/encoder.c . $ gcc -o decoder -O3 decoder.c $ gcc -o encoder -O3 encoder.c $ ls -l encoder decoder -rwxr-xr-x 1 wingo wingo 13192 Dec 12 16:23 decoder -rwxr-xr-x 1 wingo wingo 31048 Dec 12 16:23 encoder
Nice! We could use the corps tool to decode cheese.bc, but to vary things up we can use our zippy C decoder:
$ ./decoder < cheese.bc > cheese.out $ cmp cheese.out cheese && echo 'excellent!' excellent!
The performance of the C encoder is pretty good:
$ for ((x=0;x<1000;x++)) do cat cheese >> megacheese; done $ time gzip -c < megacheese > megacheese.gz real 0m1.418s user 0m1.396s sys 0m0.016s $ time ./encoder < megacheese > megacheese.bc real 0m0.523s user 0m0.480s sys 0m0.044s $ ls -l megacheese* -rw-r--r-- 1 wingo wingo 43123000 Dec 12 17:03 megacheese -rw-r--r-- 1 wingo wingo 22370002 Dec 12 17:04 megacheese.bc -rw-r--r-- 1 wingo wingo 11519311 Dec 12 17:04 megacheese.gz
Gzip gets better compression results for many reasons, but our quick-and-dirty bytecode compressor does well enough and is quite fast. The decoder is also quite good:
$ time ./decoder < megacheese.bc > /dev/null real 0m0.179s user 0m0.160s sys 0m0.016s $ time gunzip -c < megacheese.gz > /dev/null real 0m0.294s user 0m0.284s sys 0m0.008s
Amusingly, for this text, gzipping the bytecoded file has quite an impact:
$ gzip -c < megacheese.bc > megacheese.bc.gz $ ls -l megacheese* -rw-r--r-- 1 wingo wingo 43123000 Dec 12 17:03 megacheese -rw-r--r-- 1 wingo wingo 22370002 Dec 12 17:04 megacheese.bc -rw-r--r-- 1 wingo wingo 175246 Dec 12 17:12 megacheese.bc.gz -rw-r--r-- 1 wingo wingo 11519311 Dec 12 17:04 megacheese.gz
It so happens that byte-compressing the original text allows it to fit within the default gzip "window size" of 32 KB, letting gzip detect the thousandfold duplication of the source text. As a codec that works on bytes, gzip tends to work quite well on bytecoded files, and poorly on bit-coding schemes like huffman codes. A gzipped bytecoded file is usually smaller than a gzipped raw file and smaller than a gzipped huffman-coded file.