1 Jul 2002 sab39   » (Master)

Chicago: You can do much better than the three-fold increase in filesize that your solution implies. Off the top of my head, I can see an immediate way to increase the filesize by just twofold-plus-a-small-constant with almost the same benefit as your solution. Since this is entirely thought up in a couple of minutes by someone who's not well-versed in error correction mathematics, I guarantee it's possible to do much better than my solution, too. So don't go ahead and implement my solution - rather take the existence of my solution as a heads-up that there are a lot of ways to improve, and read some literature on the subject to find out what the state of the art is. I'd be interested to know myself how close I was able to get to that!

First, represent a as 10 and b as 01. By using this technique, you can identify any single-bit error, but not correct it - if you get 11, you know that's wrong, but you can't tell whether it's supposed to be a or b. (Choosing 01 and 10 rather than 00 and 11 protects against the kind of error where the channel is dead and always reports zeros, or ones). After reading this full stream, the recipient will have a result that contains three states: definitely a, definitely b, or unknown (we deliberately ignore the possibility of getting errors in two consecutive bits; we can catch that case later).

Then, after transmitting your entire stream, send a checksum of the correct data. Use a well-known checksumming algorithm like md5 that is known to have good properties, rather than rolling your own, which almost certainly won't work as well. Transmit the checksum using the same 10 / 01 encoding as the rest of the content.

Here's how the recipient of the data can figure out the corrected data from the original stream in conjunction with the checksum: Assume each "definitely known" bit is correct, but try both values for each unknown bit (in both the data and the checksum). Calculate the checksum of the data and see if it matches the checksum that was sent. This algorithm is O(2^n) in the number of "unknowns" found, but that number is expected to be low. If the number is too high to calculate in a reasonable amount of time, the transmission can be flagged as corrupt and the user can try again. If all has gone well, though, there should be exactly one combination of values for the unknown bits that leads to the checksum matching. Voila, you've now got your correct sequence of 'a's and 'b's.

There are a couple of ways that this can fail to give a corrected stream, but even those cases can be identified. One is mentioned above - too many "unknowns". Another is if there were errors in two consecutive bits, turning a b into a "definite a". That situation will be flagged because no combination of values for the "unknowns" will result in a matching checksum. The other possibility is the astronomical possibility that an incorrect sequence of 'a's and 'b's would generate a correct checksum. With a good checksumming algorithm, that should be absolutely impossible without also producing a too-large number of errors (and hence "unknowns"), so the transfer would be flagged as corrupt anyway.

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!