Chicago: You can do much better than the three-fold increase in
filesize that your solution implies. Off the top of my head, I can see an
immediate way to increase the filesize by just twofold-plus-a-small-constant
with almost the same benefit as your solution. Since this is entirely thought up
in a couple of minutes by someone who's not well-versed in error correction
mathematics, I guarantee it's possible to do much better than
my
solution, too. So don't go ahead and implement my solution - rather take the
existence of my solution as a heads-up that there are a lot of ways to improve,
and read some literature on the subject to find out what the state of the art
is. I'd be interested to know myself how close I was able to get to that!
First, represent a as 10 and b as 01. By using this technique, you can
identify any single-bit error, but not correct it - if you get
11, you know that's wrong, but you can't tell whether it's supposed to be a or
b. (Choosing 01 and 10 rather than 00 and 11 protects against the kind of error
where the channel is dead and always reports zeros, or ones). After reading this
full stream, the recipient will have a result that contains three
states: definitely a, definitely b, or unknown (we deliberately ignore the
possibility of getting errors in two consecutive bits; we can catch that case
later).
Then, after transmitting your entire stream, send a checksum of the correct
data. Use a well-known checksumming algorithm like md5 that is known to have
good properties, rather than rolling your own, which almost certainly won't work
as well. Transmit the checksum using the same 10 / 01 encoding as the rest of
the content.
Here's how the recipient of the data can figure out the corrected data from the
original stream in conjunction with the checksum: Assume each "definitely known"
bit is correct, but try both values for each unknown bit (in both the
data and the checksum). Calculate the checksum of the data and see if it matches
the checksum that was sent. This algorithm is O(2^n) in the number of "unknowns"
found, but that number is expected to be low. If the number is too high to
calculate in a reasonable amount of time, the transmission can be flagged as
corrupt and the user can try again. If all has gone well, though, there should
be exactly one combination of values for the unknown bits that leads to the
checksum matching. Voila, you've now got your correct sequence of 'a's and 'b's.
There are a couple of ways that this can fail to give a corrected stream, but
even those cases can be identified. One is mentioned above - too many
"unknowns". Another is if there were errors in two consecutive bits, turning a
b into a "definite a". That situation will be flagged because no
combination of values for the "unknowns" will result in a matching checksum. The
other possibility is the astronomical possibility that an incorrect
sequence of 'a's and 'b's would generate a correct checksum. With a good
checksumming algorithm, that should be absolutely impossible without also
producing a too-large number of errors (and hence "unknowns"), so the transfer
would be flagged as corrupt anyway.