6 Oct 2009 ingvar   » (Master)

Idleness is the source of many bugs. I suppose. It's certainly the cause for a lot of code I write.

Lately, I have been pondering hashing of octet vectors. Well, actually, I've been pondering the hashing of strings, but these days strings need mangling to octet vectors before one can reasonably reason about them as numbers (in my case, turning a vector of characters into a possibly- longer (well, possibly-more-elemented) vector of (UNSIGNED- BYTE 8), containing a UTF-8 encoding of the character string).

But, what's a hash without an idea how well it performs? So, with a pluggable (or at least semi-pluggable) hash algo, what I now do is run it across a dictionary (in my case it is /usr/dict/words, containing a British word list) and see how many collisions one gets.

The next step is, of course, to try to figurte out what is and isn't bad, as far as collissions go. From memory, the last attempt say something like 600 collisions out of just under 70k words. I'd have to run a simulation to say if it's well above, well below or around what I'd statistically expect.

Next after that will have to be concatenations of words, though that MAY take a bit longer to test-run.

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!