6 Dec 2010 salmoni   » (Master)

Long time no post! I've been very busy with family and work and haven't had much time for anything else. If there are no objections, I was thinking of reposting some of my UX writing here. It's informational rather than commercial and might be of use.

As for open source, I've been working a lot with Infomap lately for natural language processing. I had problems with Semantic Vectors, namely how slowly it does comparisons between terms. I had an idea for an automated information architecture creator, but Semantic Vectors was too slow for it. Infomap is much faster, so I will try to use that, even though I know it has been superseded by Semantic Vectors.

Plus, Infomap being written in C means it is accessible from Python, whereas Semantic Vectors being in Java means either going through Jython (and learning lots of new things, which I don't have time for) or going through a very awkward translation process.

With my first run using SV, I generated an information map much like the one that results from a card sort. The card sort took weeks to prepare, perform and analyse - and a lot of staff time. Mine ran in a few hours and got results that weren't entirely dissimilar to the human version. There were some odd surprises, but those came from the corpus (I used Wikipedia at the time), which by nature focuses on particular topics rather than general language. So the results were generally quite good, with one or two startling exceptions.

But integrating SV with a Python backend is too hard, so back to Infomap. I just need to figure out how to do semantic comparisons of terms in Infomap.
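For anyone wondering what a "semantic comparison of terms" looks like in practice, here is a minimal sketch in plain Python. It assumes the term vectors have already been pulled out of the model somehow and uses cosine similarity, which is a common choice in vector-space models. The function names and the three-dimensional toy vectors are made up for illustration; none of this is Infomap's own interface.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(term, vectors, top_n=3):
    """Rank every other term by its cosine similarity to `term`."""
    target = vectors[term]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy vectors; a real model would have far more dimensions.
toy_vectors = {
    "invoice": [0.9, 0.1, 0.0],
    "billing": [0.8, 0.2, 0.1],
    "holiday": [0.0, 0.1, 0.9],
}
ranked = most_similar("invoice", toy_vectors)
```

Here `ranked` lists "billing" ahead of "holiday" for the term "invoice", which is the kind of grouping an automated information architecture would build on.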

Infomap was a job to get going. The first problem was not having the appropriate symlink to a DB library and a header file. Once that was rectified, I had to ensure the BLOCKSIZE constant was set to a figure larger than the number of words in the longest document. It defaults to 1 million, but the longest document in the corpus was 1.25 million words. There was no warning about this, and I left the program building its model for over a week before finding the problem. Once BLOCKSIZE was fixed, the model was analysed and built in under 2 hours on an Asus 701 netbook!
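A check like the following before starting the build would have saved me that week. It is only a rough sketch: it assumes the documents are already loaded as a list of strings and counts words by splitting on whitespace, which may not match how Infomap itself counts them, so leave some headroom. The function names are my own.

```python
def longest_document_length(documents):
    """Return the word count of the longest document (0 for an empty corpus)."""
    # Assumption: whitespace splitting is a good enough approximation of words.
    return max((len(doc.split()) for doc in documents), default=0)


def blocksize_is_sufficient(documents, blocksize=1_000_000):
    """True if `blocksize` is strictly larger than the longest document."""
    return longest_document_length(documents) < blocksize


# Tiny stand-in corpus for illustration only.
sample_corpus = ["one two three", "one two three four five"]
longest = longest_document_length(sample_corpus)
```

With the default of 1 million and a longest document of 1.25 million words, `blocksize_is_sufficient` would return False, which is the case that bit me.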

I remember when LSA used to take days...

So in the spirit of openness and the basis of this endeavour being in open source software, I will publish results here to ensure everyone is totally bored.
