2 Sep 2005 chromatic

Secrets of Contextual Analysis:

I'm analyzing the content of some documents in order to find potential correlations between them. Breaking each document into individual words, stemming those words, and throwing out the stopwords gave me some 18,000 unique words from a 600-document corpus, with over 40% of the words appearing only once and almost 80% appearing fewer than ten times.
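For concreteness, the counting step looks roughly like this sketch (Python, using NLTK's Porter stemmer as a stand-in for whatever stemmer is actually in play; the stopword list here is just a trimmed example, and load_documents() is a hypothetical helper, not anything real):

    import re
    from collections import Counter
    from nltk.stem import PorterStemmer

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}  # trimmed example list
    stemmer = PorterStemmer()

    def significant_words(text):
        """Lowercase, split into words, drop stopwords, and stem what remains."""
        words = re.findall(r"[a-z']+", text.lower())
        return [stemmer.stem(w) for w in words if w not in STOPWORDS]

    def corpus_counts(documents):
        """Count how often each stemmed word appears across the whole corpus."""
        counts = Counter()
        for doc in documents:
            counts.update(significant_words(doc))
        return counts

    # The distribution figures quoted above come from counts like these:
    # counts = corpus_counts(load_documents())   # load_documents() is hypothetical
    # singletons = sum(1 for c in counts.values() if c == 1)
    # rare = sum(1 for c in counts.values() if c < 10)
    # print(len(counts), singletons / len(counts), rare / len(counts))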

I knew my existing stopword list was insufficient, but I really don't want to pick out the top 1,000 or 2,000 useful words from a list of 18,000 by hand, especially because this is a test corpus of perhaps 7% of the actual corpus.
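One alternative to hand-picking would be to prune by document frequency instead, keeping only words that appear in some minimum number of documents. Roughly (min_docs is an arbitrary threshold to tune, not anything I've settled on):

    from collections import Counter

    def prune_by_document_frequency(tokenized_docs, min_docs=3):
        """Return the set of words appearing in at least min_docs documents."""
        doc_freq = Counter()
        for doc in tokenized_docs:
            doc_freq.update(set(doc))   # count each word once per document
        return {word for word, df in doc_freq.items() if df >= min_docs}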

Now I'm starting to wonder whether some of the lexical analysis modules would be useful in picking out only the nouns (unstemmed) and verbs (stemmed) from a document, rather than treating every word in a document as significant. The correlation algorithm appears sound, but if I can throw out lots of irrelevant data, I can improve both the performance and the utility of the application.
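What I have in mind looks something like this sketch, using NLTK's tokenizer and part-of-speech tagger as a stand-in for whichever module I end up with (the Penn Treebank tags and the stem-verbs-only choice are assumptions here, not decisions):

    import nltk
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def nouns_and_verbs(text):
        """Keep only nouns (as-is) and verbs (stemmed) from a document."""
        tokens = nltk.word_tokenize(text)
        kept = []
        for word, tag in nltk.pos_tag(tokens):
            if tag.startswith("NN"):            # nouns, kept unstemmed
                kept.append(word.lower())
            elif tag.startswith("VB"):          # verbs, stemmed
                kept.append(stemmer.stem(word.lower()))
        return kept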

Any thoughts?
