3 Sep 2005 Ankh   » (Master)

Chromatic, I'm with Tim Bray: stopwords are a bug, not a feature. I admit, as I say that, that my own text retrieval package, lq-text, supports stop words: sometimes the bug is in limited disk and memory.

I found, though, that even if you eliminate stop words, remembering where a stop word was eliminated, but not which one, can be a useful compromise. Hence, lq-text can distinguish "printed in The Times" from "printed times".
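To make the compromise concrete, here is a minimal sketch in Python. It is not lq-text's actual implementation: the stop list, the `GAP` marker and both function names are assumptions made up for illustration. Each eliminated stop word leaves an anonymous placeholder, so phrase matching still knows that something stood there.

```python
# Hypothetical stop list and placeholder; not taken from lq-text.
STOP_WORDS = frozenset({"a", "an", "in", "of", "the"})
GAP = "*"  # records that a stop word was dropped here, but not which one


def index_tokens(text):
    """Lower-case and split on whitespace, replacing each stop word with GAP."""
    return [GAP if word in STOP_WORDS else word for word in text.lower().split()]


def contains_phrase(doc_tokens, query_tokens):
    """True if query_tokens occurs as a contiguous run inside doc_tokens."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )
```

With this, a query for "printed in The Times" matches a document containing that phrase but not one containing only "printed times". The price is that "printed in a times" would match as well, since the identity of each stop word is gone.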

Stemming tends to conflate senses: you might have a document in which recording is common, and another in which records is common, and you can no longer distinguish them. This may or may not matter to you, of course.
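A toy illustration of the conflation, in Python. The `crude_stem` function below is a deliberately naive suffix stripper invented for this example, not the Porter stemmer or any real library's algorithm, and the two sample documents are made up.

```python
from collections import Counter


def crude_stem(word):
    """Naive suffix stripper for illustration only (hypothetical, not Porter)."""
    for suffix in ("ing", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def term_counts(text, stem=False):
    """Count terms in text, optionally stemming each word first."""
    words = text.lower().split()
    return Counter(crude_stem(w) if stem else w for w in words)


doc_a = "recording recording studio"
doc_b = "records records studio"
```

Unstemmed, the two documents have different term counts. Stemmed, both reduce to two occurrences of "record" and one of "studio", so an index built on stems cannot tell them apart.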

I hope you are familiar with the work of the late Gerard Salton's group at Cornell on document similarity.

One way to improve perceived performance is to pre-compute things. I found that vector cosine measures were much more useful if you used phrases rather than words, and that way you can also eliminate a lot of potential document pairs and make the work much faster.
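A rough sketch of both points, in Python. It assumes "phrases" means adjacent word pairs, and it takes one reading of the pruning remark: two documents that share no phrase have a cosine of zero, so only pairs that meet in some posting list need scoring. The function names and sample texts are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def phrases(text, n=2):
    """Count the n-word phrases in text (n=1 gives a plain word vector)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * \
        math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def candidate_pairs(vectors):
    """Return only the document pairs that share at least one term."""
    postings = defaultdict(set)
    for doc_id, vec in vectors.items():
        for term in vec:
            postings[term].add(doc_id)
    pairs = set()
    for ids in postings.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "the cat sat on the mat",
    "b": "the cat sat on the sofa",
    "c": "mat the on sat cat the",
}
phrase_vectors = {doc_id: phrases(text) for doc_id, text in docs.items()}
```

Documents "a" and "c" contain exactly the same words, so their word-level cosine is 1, yet they share no two-word phrase. `candidate_pairs(phrase_vectors)` therefore leaves only the pair ("a", "b") to be scored.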

What I did was to treat each new document as a query against the indexed corpus before adding it. But this was more than ten years ago, when I was hoping to get involved in TREC.
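The query-before-adding idea might look something like the following Python sketch. The `Corpus` class, its `threshold` parameter and the use of word-pair vectors are all assumptions for illustration, not a description of what lq-text did.

```python
import math
from collections import Counter


class Corpus:
    """Hypothetical corpus that pre-computes similar documents at insert time."""

    def __init__(self):
        self.vectors = {}   # doc_id -> phrase count vector
        self.similar = {}   # doc_id -> list of (other_id, score)

    @staticmethod
    def _vector(text):
        words = text.lower().split()
        return Counter(zip(words, words[1:]))  # adjacent word pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(count * b[term] for term, count in a.items() if term in b)
        norm = math.sqrt(sum(c * c for c in a.values())) * \
            math.sqrt(sum(c * c for c in b.values()))
        return dot / norm if norm else 0.0

    def add(self, doc_id, text, threshold=0.1):
        """Score the new document against the corpus, then add it."""
        query = self._vector(text)
        hits = []
        for other_id, vec in self.vectors.items():
            score = self._cosine(query, vec)
            if score >= threshold:
                hits.append((other_id, score))
        hits.sort(key=lambda hit: -hit[1])
        # Record the relationship in both directions before indexing.
        self.similar[doc_id] = hits
        for other_id, score in hits:
            self.similar[other_id].append((doc_id, score))
        self.vectors[doc_id] = query
        return hits
```

The scoring cost is paid once, when a document arrives, so a later "find similar documents" request is just a lookup in `similar`. A real system would query an inverted index rather than scan every stored vector as this sketch does.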

