Advogato: Blog for Ankh

Chromatic, I'm with Tim Bray: stopwords are a bug, not a feature. I admit, as I say that, that my own text retrieval package, lq-text, supports stop words: sometimes the bug is in limited disk and memory.

I found, though, that even if you eliminate stop words, remembering where a stop word was eliminated, but not which one, can be a useful compromise. Hence, lq-text can distinguish "printed in The Times" from "printed times".

Stemming tends to conflate senses: you might have a document in which recording is common, and another in which records is common, and you can no longer distinguish them. This may or may not matter to you, of course.

I hope you are familiar with the work by the late Gerald Salton's group at Cornell in document similarity.

One way to improve perceived performance can be to pre-compute things. I found that vector cosine differences were much more useful if you used phrases than words, but you can eliminate a lot of potential docuent pairs and make the work much faster that way too.

What I did was to treat each new document as a query against the indexed corpus before adding it. But this was more than ten years ago, when I was hoping to get involved in TREC.

Liam

3 Sep 2005 Ankh » (Master)