I found, though, that even if you eliminate stop words, remembering where a stop word was eliminated, but not which one, can be a useful compromise. Hence, lq-text can distinguish "printed in The Times" from "printed times".
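A minimal sketch of that compromise, assuming a tiny hypothetical stop list and whitespace tokenising (this is an illustration of the idea, not lq-text's actual code or index format). Stop words are dropped from the index, but word positions still count them, so the gap they leave is remembered:

```python
# Hypothetical stop list, for illustration only.
STOP_WORDS = {"a", "an", "in", "of", "on", "the"}

def index_terms(text):
    """Return (position, word) pairs; stop words are dropped but still use up a position."""
    terms = []
    for pos, word in enumerate(text.lower().split()):
        if word not in STOP_WORDS:
            terms.append((pos, word))
    return terms

def phrase_matches(doc_terms, query_terms):
    """True if the query's kept words occur in the document with the same gaps between them."""
    if not query_terms:
        return False
    base, first = query_terms[0]
    wanted = [(pos - base, word) for pos, word in query_terms]
    present = set(doc_terms)
    return any(
        all((start + off, word) in present for off, word in wanted)
        for start, w in doc_terms
        if w == first
    )

doc = index_terms("It was printed in The Times yesterday")
print(phrase_matches(doc, index_terms("printed in The Times")))  # gap of two stop words
print(phrase_matches(doc, index_terms("printed times")))         # no gap
```

Because only the position of the gap is kept, a query such as "printed on a Times" would match as well; that is the price of not storing which stop word was removed.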
Stemming tends to conflate senses: you might have a document in which "recording" is common, and another in which "records" is common, and once both are stemmed you can no longer distinguish them. This may or may not matter to you, of course.
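To make the conflation concrete, here is a small sketch using a deliberately crude suffix stripper. `crude_stem` is a hypothetical stand-in, not Porter's algorithm or any particular stemmer:

```python
from collections import Counter

def crude_stem(word):
    """Strip a trailing 'ing' or 's' (toy rule, for illustration only)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

doc_a = "recording recording studio".split()
doc_b = "records records office".split()

# Before stemming the two documents have different dominant terms.
print(Counter(doc_a).most_common(1), Counter(doc_b).most_common(1))

# After stemming both are dominated by the same term, "record".
stem_a = Counter(crude_stem(w) for w in doc_a)
stem_b = Counter(crude_stem(w) for w in doc_b)
print(stem_a.most_common(1), stem_b.most_common(1))
```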
I hope you are familiar with the work of the late Gerard Salton's group at Cornell on document similarity.
One way to improve perceived performance is to pre-compute things. I found that vector cosine differences were much more useful if you used phrases rather than words, and using phrases also lets you eliminate a lot of potential document pairs, which makes the work much faster.
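A rough sketch of both points, under some assumptions: "phrases" are taken here to be adjacent word pairs, the vectors are raw counts with no weighting, and the measure computed is cosine similarity (a distance would be one minus it). The function names are made up for the example. Pairs of documents that share no phrase at all are never compared:

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def phrases(text, n=2):
    """Count the n-word phrases in a text (assumed here: adjacent word pairs)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def candidate_pairs(vectors):
    """Return only the document pairs that share at least one phrase."""
    postings = defaultdict(set)
    for doc_id, vec in vectors.items():
        for phrase in vec:
            postings[phrase].add(doc_id)
    pairs = set()
    for ids in postings.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

docs = {
    "a": "the cat sat on the mat",
    "b": "the cat sat on the rug",
    "c": "stock prices fell sharply today",
}
vectors = {doc_id: phrases(text) for doc_id, text in docs.items()}
for x, y in sorted(candidate_pairs(vectors)):
    print(x, y, round(cosine(vectors[x], vectors[y]), 3))
```

With single words, common terms make almost every pair a candidate; with phrases far fewer pairs share anything, so most of the pairwise comparisons never have to be done.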
What I did was to treat each new document as a query against the indexed corpus before adding it. But this was more than ten years ago, when I was hoping to get involved in TREC.
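The query-before-add step might look something like the sketch below. The `Corpus` class and its shared-word count are hypothetical stand-ins, since the scoring actually used is not described here; the point is only the ordering, where the new document is looked up against the existing index first and indexed afterwards:

```python
from collections import defaultdict

class Corpus:
    """Toy inverted index that reports similar documents at insertion time."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of documents containing it

    def add(self, doc_id, text):
        words = set(text.lower().split())
        # Query first: count words shared with each already-indexed document.
        overlap = defaultdict(int)
        for w in words:
            for other in self.postings.get(w, ()):
                overlap[other] += 1
        # Only then add the new document to the index.
        for w in words:
            self.postings[w].add(doc_id)
        # Best matches first; ties broken by document id.
        return sorted(overlap.items(), key=lambda kv: (-kv[1], kv[0]))

corpus = Corpus()
print(corpus.add("d1", "vector space model"))
print(corpus.add("d2", "vector space retrieval"))
print(corpus.add("d3", "cooking pasta"))
```

Each document's neighbours are then known as soon as it is added, so nothing needs to be recomputed when someone later asks for similar documents.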
Liam