Right now I've got a system that produces data from word interactions, but it's too slow for the dataset it needs to run on (1.7 megs). As it stands, it can't get through even a tenth of that in four hours.
I'm working on ways to speed it up, but I have a feeling they're few and far between. My first try is a binary encoding of which letters of the alphabet a word contains, because then I can apply a single operation to two encodings and find out whether the words have any letters in common. If two words don't have any letters in common, they sure as hell don't need a lengthy regex comparison.
Could be that I'm duplicating effort here and the regex engine does this already, but I don't think so. Even if it does, it's doing it six times or so, and I'm just trying to do it once.
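For the record, here's roughly what the letter encoding could look like. This is a minimal sketch in Python, not the actual code: the names `letter_mask` and `candidate_pairs` are made up for illustration, and it assumes plain a-z words and that a bitwise AND is the "operation" used to test for shared letters.

```python
from itertools import combinations


def letter_mask(word):
    """Return a 26-bit integer with bit i set if the word contains letter i (a = 0)."""
    mask = 0
    for ch in word.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask


def candidate_pairs(words):
    """Yield only the pairs of words that share at least one letter.

    Each mask is computed once per word; the per-pair check is a single AND.
    """
    masked = [(w, letter_mask(w)) for w in words]
    for (w1, m1), (w2, m2) in combinations(masked, 2):
        if m1 & m2:  # non-zero means at least one letter in common
            yield w1, w2


if __name__ == "__main__":
    for pair in candidate_pairs(["sky", "blue", "lake"]):
        print(pair)
```

Only the pairs that come out of `candidate_pairs` would go on to the regex comparison. How much this saves depends entirely on how many pairs in the real data turn out to share no letters.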
The other side of this problem is that the hash insertion is taking longer and longer for each successive word. With a hundred, it's a non-issue. With over a billion... I'm thinking I'll need a database. I've got MySQL on this machine, and if I really had to I could install something else, so I'll noodle that around, implement it, run it on my test data, and see what happens. If it looks appreciably faster, I'll throw it at a tenth of the full dataset and see what that looks like.
And, if it blazes through that in an hour or less, then we'll look at running the whole dataset through it over the weekend or something.
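A rough sketch of what moving the results out of the in-memory hash might look like. Big assumptions here: it uses Python's built-in `sqlite3` as a stand-in for MySQL so it runs without a server, and the table layout (`word_a`, `word_b`, `score`) and the function names are hypothetical, since the real shape of the data isn't settled.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store; pass a file path to keep results on disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS interactions ("
        " word_a TEXT NOT NULL,"
        " word_b TEXT NOT NULL,"
        " score INTEGER NOT NULL,"  # hypothetical payload column
        " PRIMARY KEY (word_a, word_b))"
    )
    return conn


def store_batch(conn, rows):
    """Insert many (word_a, word_b, score) rows in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO interactions (word_a, word_b, score)"
            " VALUES (?, ?, ?)",
            rows,
        )


if __name__ == "__main__":
    conn = open_store()
    store_batch(conn, [("sky", "lake", 1), ("blue", "lake", 2)])
    print(conn.execute("SELECT COUNT(*) FROM interactions").fetchone()[0])
```

Batching the inserts into one transaction is the part most likely to matter for speed; whether it beats the hash on the test data is exactly what the trial run is for.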