8 Jun 2008 robertc   » (Master)

So, the last lazyweb question I asked had good results. Time to try again:

What's a good python-accessible, cross-platform and trivially installable (think windows users), flexible (we have plain text, structured data, etc., and a back-end storage area which is only accessible via the bzr VFS in the general case), fast (upwards of 10^6 documents) text index system?

pylucene fails the trivially installable test (apt-cache search lucene -> no python bindings), and the bindings are reputed to be SWIG :(. xapian might be a candidate, though from the reading I have done so far I have a suspicion that SWIG is there as well, and we'll have to implement our own BackEndManager subclass back in python. That might be tricky: my experience with python bindings is that folk tend to think of trivial consumers only, not of python providing core parts of the system :(.

So I'm hoping there is a Better Answer just lurking out there...

Updates: sphinx looks possible, but is about the same as xapian: it will need a custom storage backend. Google Desktop is out (apart from anything else, there is no way to change the location where documents are stored, nor any indication of a python api to control what is indexed).

It looks like I need to be considerably clearer :). I'm looking for something to index historical bzr content, such that indices can be reused in a broad manner (e.g. index a branch on your webserver), are specific to a branch/repository (so you don't get hits for e.g. the working tree of a branch), have a programmatic API (so that the bzr client can manage all of this), and require no daemon (low barrier to entry/non-admin users).
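To make the shape of that requirement a bit more concrete, here is a minimal sketch of the kind of thing I mean: an in-process inverted index whose only contact with storage is a small put/get interface, so the storage could in principle be anything VFS-like. Everything in it is an assumption for illustration: DictTransport, BranchIndex and their method names are hypothetical, the in-memory dict merely stands in for a real storage layer, and a toy like this would not scale to 10^6 documents.

```python
import json
import re
from collections import defaultdict


class DictTransport:
    """Hypothetical stand-in for VFS-like storage: bytes stored by relative path."""

    def __init__(self):
        self._files = {}

    def put_bytes(self, path, data):
        self._files[path] = data

    def get_bytes(self, path):
        return self._files[path]


class BranchIndex:
    """Toy inverted index (term -> document ids) persisted through a transport.

    Runs entirely in-process: no daemon, and all I/O goes through the
    transport object handed to the constructor.
    """

    def __init__(self, transport, name="text.index"):
        self._transport = transport
        self._name = name
        self._postings = defaultdict(set)

    @staticmethod
    def _terms(text):
        # Lower-case and split on non-word characters.
        return re.findall(r"\w+", text.lower())

    def add(self, doc_id, text):
        for term in self._terms(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        # AND query: return ids of documents containing every query term.
        terms = self._terms(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], ()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result

    def save(self):
        data = {term: sorted(ids) for term, ids in self._postings.items()}
        self._transport.put_bytes(self._name, json.dumps(data).encode("utf-8"))

    @classmethod
    def load(cls, transport, name="text.index"):
        index = cls(transport, name)
        data = json.loads(transport.get_bytes(name).decode("utf-8"))
        for term, ids in data.items():
            index._postings[term].update(ids)
        return index
```

The point of the sketch is the boundary, not the index: because the index is keyed to whatever transport it is given, one index per branch/repository falls out naturally, and anyone who can read that location can reuse it.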
