Older blog entries for salmoni (starting at number 589)

I've been having lots of fun lately with Gensim, a Python framework for vector space modelling. It includes fun stuff like latent semantic analysis, latent Dirichlet allocation and other goodies. Allied with NLTK, this makes a very formidable Python-based NLP framework.
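
As a minimal illustration of the vector space idea behind latent semantic analysis (using plain numpy rather than Gensim's API; the term-document matrix here is a made-up toy): documents become columns of a term-document matrix, and a truncated SVD yields low-rank semantic vectors for them.

```python
import numpy

# Toy term-document count matrix (4 terms x 3 documents), invented for
# illustration only.
A = numpy.array([[1, 0, 1],
                 [1, 1, 0],
                 [0, 1, 1],
                 [0, 0, 1]], dtype=float)

# Thin SVD: A = U S V^T
U, s, Vt = numpy.linalg.svd(A, full_matrices=False)

# Keep the top-k latent dimensions; each row of doc_vectors is one
# document's coordinates in the reduced "semantic" space.
k = 2
doc_vectors = (numpy.diag(s[:k]) @ Vt[:k]).T
print(doc_vectors.shape)  # (3, 2)
```

Similar documents end up with nearby vectors in this reduced space, which is what makes the comparisons below possible.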

My task is sorting newsgroup posts into their correct groups, and I've achieved a reasonable level of accuracy (0.92), which isn't bad given that classification depends entirely upon content. However, most analyses show lower accuracies (0.70+): decent, but not far enough from chance performance to be taken seriously. There are a few ways to improve this, though, and I'm conducting an enormous number of experiments to build an effective mental model of how vector space models work.
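
To illustrate the kind of content-based sorting involved (a sketch only: the vocabulary, documents and group names are invented, and this is far simpler than a real Gensim pipeline), one simple approach classifies a post by cosine similarity against a per-group centroid vector.

```python
import numpy

def bow(tokens, vocab):
    # Bag-of-words vector over a fixed vocabulary
    v = numpy.zeros(len(vocab))
    for t in tokens:
        if t in vocab:
            v[vocab[t]] += 1
    return v

# Made-up vocabulary and training posts for two made-up groups
vocab = {w: i for i, w in enumerate(["disk", "drive", "game", "score"])}
groups = {
    "comp": [["disk", "drive"], ["drive", "disk", "disk"]],
    "rec":  [["game", "score"], ["score", "game", "game"]],
}

# One centroid (mean bag-of-words vector) per group
centroids = {g: numpy.mean([bow(d, vocab) for d in docs], axis=0)
             for g, docs in groups.items()}

def classify(tokens):
    v = bow(tokens, vocab)
    def cos(a, b):
        return numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))
    return max(centroids, key=lambda g: cos(v, centroids[g]))

print(classify(["disk", "game", "drive"]))  # comp
```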

This is all the beginning of constructing a relevance engine which I'm sure will be useful to some people.

Great fun!

This is a list of things that have to be done to get Infomap working on a modern Linux distribution (tried on Ubuntu 10.10).

* BLOCKSIZE in preprocessing/preprocessing_env.h : this must be set to at least the word count of the longest document in the corpus. If any document has more words than BLOCKSIZE, the building of the model will hang.

* Install libgdbm-dev with Synaptic or apt-get. Infomap needs one of its header files, and without it, Infomap will not compile (it won't pass ./configure).

* Not finding ndbm.h : this all happens in /usr/include. Either symlink the header:

ln -s gdbm-ndbm.h ndbm.h

or just copy gdbm-ndbm.h to /usr/include/ndbm.h. Infomap will not compile (not pass ./configure) without this.

After that, it should go through configure, make, and make install without trouble.
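
Putting the steps above together as one setup sketch (Ubuntu package name and header path as described above; run the build commands from the Infomap source directory):

```shell
# Install the GDBM development headers (Synaptic works too)
sudo apt-get install libgdbm-dev

# Give Infomap the ndbm.h it expects
cd /usr/include
sudo ln -s gdbm-ndbm.h ndbm.h

# Back in the Infomap source tree, after raising BLOCKSIZE in
# preprocessing/preprocessing_env.h if your corpus needs it:
./configure
make
sudo make install
```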

This is the code for CompareTerms (assuming associate prints each term's context vector as whitespace-separated numbers - check your build's output format):

import subprocess

import numpy

def compare_terms(term1, term2):
    # Call associate once per term. The command is passed as one single
    # string with shell=True - see the note below about AND versus quote
    # searches.
    out1 = subprocess.check_output("associate -q %s" % term1, shell=True)
    out2 = subprocess.check_output("associate -q %s" % term2, shell=True)
    vec1 = numpy.array([float(x) for x in out1.split()])
    vec2 = numpy.array([float(x) for x in out2.split()])
    # Dot product of the two context vectors
    product = numpy.sum(vec1 * vec2)
    return product

This produces an association score between two terms.

When calling this, the 'args' string that calls associate must be formatted as a single string, not as a list of arguments handed to Popen. This is important when sending more than one term: otherwise, associate will treat the terms as a quote search rather than an AND search.
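
To see the difference, here is a runnable demonstration that uses a tiny Python one-liner as a stand-in for associate (so it runs anywhere): passed as one string through the shell, two terms arrive as two separate arguments; passed pre-joined inside a Popen-style list, they arrive as one.

```python
import subprocess
import sys

# Stand-in for associate: prints how many command-line arguments it got.
probe = 'import sys; print(len(sys.argv) - 1)'

# One single string via the shell: the shell splits "dog cat" into two
# arguments - the AND-search case.
two_args = subprocess.check_output(
    '%s -c "%s" dog cat' % (sys.executable, probe), shell=True)

# The same terms pre-joined as one list element arrive as ONE argument -
# the quote-search case.
one_arg = subprocess.check_output(
    [sys.executable, '-c', probe, 'dog cat'])

print(two_args.strip(), one_arg.strip())  # b'2' b'1'
```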

Long time no post! I've been very busy with family and work and not had much time to do stuff. If there are no objections, I was thinking of reposting some of my UX stuff here. It's not commercial but informational and might be of use.

As for open source, I've been working a lot on Infomap lately for natural language processing. I had some failures using Semantic Vectors, namely the speed at which it does comparisons between terms. I had an idea for an automated information architecture creator, but the speed was too slow. Infomap is much faster, so I will try to use that - even though I know it's been superseded by Semantic Vectors.

Plus, Infomap being written in C means it is easily accessible from Python, whereas Semantic Vectors being in Java means going through Jython (and learning lots of new things I don't have time for) or a very awkward translation process.

With my first run using SV, I generated an information map much like that resulting from a card sort. The card sort took weeks to prepare, perform and analyse - and a lot of staff time. Mine ran in a few hours and got results that weren't entirely dissimilar to the human version. There were some odd surprises but that was because of the corpus (Wikipedia was what I used at the time) which by nature has a focus on particular topics as opposed to general language. This meant that the results were generally quite good but with one or two startling exceptions.

But the difficulty in integrating it with a Python backend is too hard, so back to Infomap. I just need to figure out how to do semantic comparisons of terms in Infomap.

It was a job to get going. The first problem was a missing symlink to a DB library and a header file. Once that was rectified, I had to ensure the BLOCKSIZE constant was set to a figure larger than the word count of the longest document. It defaults to 1 million, but the longest document in my corpus was 1.25 million words. Without this, I got no warning and left the program building its model for over a week before finding the problem. Once fixed, the model was built in under 2 hours on an Asus 701 netbook!

I remember when LSA used to take days...

So in the spirit of openness and the basis of this endeavour being in open source software, I will publish results here to ensure everyone is totally bored.

20 Apr 2009 (updated 20 Apr 2009 at 08:30 UTC) »

I have a LinkedIn profile here. Advogatoans are welcome to add me to their network.

Edit: This entry was already turning up in Google's search results less than 2 hours after writing it. I think it was spidered 15 minutes ago.

19 Apr 2009 (updated 19 Apr 2009 at 07:03 UTC) »

Life is going well in NZ. My job is enjoyable - thoroughly so - and I'm learning lots every day. Very little open source work done lately as I need to check the T&Cs of my contract to see if I'm okay. I'm sure there is no problem but I need to check first.

Our application for permanent residence here is going well. I submitted our expression of interest back on 21 March and we were successful on 6th April which is quite quick really. I was expecting it to take a few months. I'm still waiting for the ITA form to come through by post which seems to be taking some time. I'm guessing that receiving it is really the long part of the process.

I hope it comes through quickly as my wife and daughter are still in the Philippines and I'm missing them so much. We could apply for a visitors visa for her, but we have other obligations which need to be met in the immediate future (too much detail to go into here). Still, we chat every day by email and video chat. I've even managed to play games with Louise by webcam which ranks as a good achievement. It's not the same as being with her but it's the best I can do right now.

Well I made it! I'm in New Zealand working for Westpac as an interaction designer for their website. All good fun! The work seems really cool and I have so many ideas to implement.

In other news, I've been exploring neural networks to predict currency markets and found a modicum of success (though nothing that translates into a prediction system I could make money from). I've been using bpnn in Python. Python slows things down a lot but allows interactive analysis. I tried updating bpnn to use numpy but found my new version to be significantly slower (e.g., 1.5 seconds for the original against over 5 seconds for the numpy version), which is odd. Is it worth releasing the code?
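
For illustration, a hedged sketch of the kind of vectorised feed-forward pass involved in a numpy port of a backpropagation network; the layer sizes and random weights here are made up, not taken from bpnn itself.

```python
import numpy

rng = numpy.random.RandomState(0)

x = rng.rand(3)        # 3 input units
W1 = rng.rand(4, 3)    # input -> hidden weights
W2 = rng.rand(1, 4)    # hidden -> output weights

# One forward pass, whole layers at a time instead of nested Python
# loops over individual units (tanh activations, as in bpnn).
hidden = numpy.tanh(W1 @ x)
output = numpy.tanh(W2 @ hidden)
print(output.shape)  # (1,)
```

Whether this actually beats the loop version depends on network size: for the tiny networks bpnn typically builds, numpy's per-call overhead can easily swamp the vectorisation gain, which may explain the slowdown described above.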

captchas

Just an idea for a new captcha system. How about this: access a large photographic resource (e.g., Flickr, Picasa) and show the user a thumbnail of a picture. The user then has to guess one of the tags belonging to the picture. If they do, they pass.

This won't stand up to brute force (I imagine some tags are quite common, so a little research would produce a frequency-based word list that could beat it), mistakes would be expected (e.g., the photo owner might add random tags that make no sense to anyone else), and human-based captcha solving will easily get around it, but it's something to consider. Another attack would be to take the thumbnail and compare it against a database of Flickr pictures, though realistically that's a large job. I wonder if Flickr's API can do that? To defend against this, the thumbnail could be altered somehow (e.g., desaturated, colour balance changed) so that in machine terms the images are different but in human terms they mean the same thing.
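
The core check being proposed is simple; as a minimal sketch (function name and tags are invented for illustration), the user passes if their guess matches any of the picture's tags, ignoring case and surrounding whitespace:

```python
def check_captcha(guess, tags):
    # Pass if the normalised guess matches any normalised tag
    return guess.strip().lower() in {t.strip().lower() for t in tags}

# Made-up tags for one photo
tags = ["Beach", "sunset", "holiday 2008"]
print(check_captcha("  SUNSET ", tags))  # True
print(check_captcha("mountain", tags))   # False
```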

It's been a while since I posted - work, baby and travel issues have taken all my time and it's only now getting back to normal.

I'm still in the Philippines, still running an interaction design company, and having such a time that my dreams consist of writing algorithms (they are to calm me down).

The big program is still being developed and we're reconsidering UI toolkits. wxPython is wonderful and powerful, but it moves very fast and doesn't offer what I really need (an embedded browser for rich interactive experiences, i.e., JavaScript). I understand there are bindings to WebKit and Gecko, but these are not complete or reliable enough for production code. We're also considering XULRunner, which does offer this but has a less rich basic widget set.

My daughter, Louise Rhiannon Masaredo Salmoni was born at 1.52pm yesterday. Both mother and baby are doing well.

picture of my daughter

I'm totally in love with her: My life has just changed for ever...

Busy, busy, busy.

But not much to show for it. My wife and I are expecting our first child next Friday. This will be a nervous week indeed.

For my statistics project, I wrote a Python module to import SPSS files and was wondering whether anyone would be interested in it if I released it as open source. It's one piece of code that would greatly benefit from community testing. So far, it works on the SPSS files I have without problem, but SPSS has added extra things to the format over the years. Cleverly (or rather obviously, but nice to know they've done it), older versions of the software can still read the new formats; they just ignore the extra bits. My software does the same: it ignores all the extra bits, though I suspect there may be cases it misses completely. For example, the architecture: I believe mine only reads one endianness.
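
The endianness point is easy to demonstrate with the stdlib struct module: the same four bytes decode to very different integers depending on which byte order the reader assumes, which is why a one-endian reader can silently misparse a file written on the other architecture.

```python
import struct

# Four example bytes, as a binary file format might store a small integer
data = b"\x02\x00\x00\x00"

little, = struct.unpack("<i", data)  # little-endian 32-bit int
big, = struct.unpack(">i", data)     # big-endian 32-bit int

print(little, big)  # 2 33554432
```

A robust reader typically sniffs a known field at the start of the file and picks "<" or ">" accordingly before parsing the rest.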

But it could be useful for some people. There are already FOSS versions in R and PSPP (I think the R version came from the PSPP code) but they are in C and a Python version might be useful. I wonder if SciPy has it? Currently, it can import via COM, but that is Windows only so of limited application. A pure Python module would have no such restrictions. Scientists using Python might appreciate being able to cut one more string to SPSS so I think I will release it. If anyone here is interested, let me know and it could be the spur that motivates me to release it!

I've started a usability consultancy officially instead of doing work ad hoc. This should be fun as I have to learn marketing very quickly indeed. In the Philippines, I think there is one independent consultant who is serious about the work (ie, has advertising) and a few others who seem to do it as a sideline. However, in nearby Hong Kong, there are two that I can find: Apogee and Customer Input. Looking through Google adwords shows that there is only a small market in HK compared to say the UK. However, we will be operating internationally so the location is of less importance. It does create some difficulties in terms of meeting the clients, but for general applications, I can easily get a good sample of users of varying abilities. It's also about time to put my remote testing experience to, erm, the test.

I'm also toying with the idea of joining the UPA, for whom I gave a talk a while ago, but I need to check whether I'd get my money's worth. It could be good for being noticed by potential customers.
