(modified somewhat from my undeadly comment to the OpenSSH 4.0 release and Damien mentioning he foobed the md5 on the release notes, wanted this to get out to a wider potential reading audience, I and I don't really have a suitable personal blog so this will do)
IANACBIMPOOF (I am not a cryptographer, but I may play one on fora)
http://cryptography.hyperlink.cz/md5/MD5_collisions.pdf is worth taking a look at. I realize it's recent (March 5th), but gives an example of finding a full md5 collision in 8 hours on a notebook, and they're predicting that time will go down once Wang et al release their actual speed up method (perhaps the prediction of 2 minutes is overstating it, but you never know). That said, getting a collision on meaningful substitutes (e.g. a backdoored OpenSSH) might be another challenge, but I doubt it's going to be too much harder if speed keeps increasing.
At any rate, I was wondering - why provide just one type of hash (e.g. just md5) if you are releasing something? Why not also provide a sha-1, or even several different hash types? As we witness more hashes fall victim to improved collision attacks (and there will _always_ be collisions anyways because that's the nature of a frigging hash), it's understandable that any individual (md5, sha-1, crc32, whatever) hash will have possible meaningful collisions. However, finding a meaningful collision for _several_ different hashes simultaneously, I would posit is probably very unlikely.
It could make a damn fine fun challenge to break or open up a new science problem. I can just envision future math assignments where the teacher is telling their students to find the Lowest Common Denominator for crc32, and md5 values.
Maybe I'm wrong about that in some cases, as I know sha-1 is based off of previous work from md5, so maybe any sha-1 collision also results in an md5 collision. But I highly doubt that, since such a property would undoubtedly have been noticed and mentioned by the researchers breaking this stuff. Or at least one would hope such an obvious check would have been noticed in such research, I haven't seen mention of it. If anyone has examples let me know, by no means have I read every paper on the subject. Regardless, it wouldn't be hard to check, just take an example of a two different files wherein they both have the same md5, and then see if they both have the same sha-1 (or crc32, or ripe-md or whatever-the-hell-hash you want to have as well). Does anyone have two files that have an md5 collision I could test against? Would be simple enough to perform the test if I had the files, but IANAC.... stuff above.
Also, it should be noted (and Jose thankfully reminded me of this at RSAConf when we were discussing hashes briefly) that the OpenBSD ports system already provides several different hashes on distfiles. Just check a /usr/ports/blah/blah/distinfo yourself and you'll see something like this:
$ more distinfo MD5 (nmap-3.81.tgz) = 9b32f74e2f6999e4f7668a24f2a1ea85 RMD160 (nmap-3.81.tgz) = d57533f1bf614541dd0cdfcf0f14b257d26a28c9 SHA1 (nmap-3.81.tgz) = 9d1ce1ab3e097ce5d61078fd4bc713f9b701fa1c SIZE (nmap-3.81.tgz) = 1846196
So, since OpenBSD does it in the ports system already, maybe we as an entire security community should look to add it to our release methodologies as well? (See update below on this - while several different hash values are provided, only one is currently checked in the OpenBSD ports system - checking more than one must be done manually).
Put another way - given the properties of hashes, any one hash is likely to fail, but many hashes all failing together in the same way at the same time is very unlikely.
Another proposal for trying to skirt the problems with hashes, rather than just invent a new hash that hasn't been "broken" is to do what bittorrent does. Take _multiple_ hashes over parts of the file rather than just one hash of the whole file. Again, it becomes highly unlikely that one could generate a different file that would be split myriad times and have the same crc's for each split piece.
As an example, a recent torrent of the hitbSecConf2004 vids I leeched had over 7000 pieces - and afaik, in the .torrent each file sub-piece has its own hash value listed. The .torrent files aren't plaintext so I can't verify this easily, I'm basing this understanding upon what I've seen written up about bittorrent. Assuming my understanding is roughly accurate however, in the case of bittorrent in the example provided instead of finding one collision, the attacker needs to generate over 7000, one for each file sub-piece. Even if using something not cryptographically sound or very resource intensive with easily found collisions such as crc32, that becomes a tougher problem.
Of course, this is speculative, maybe it's not that hard. If an attacker is smart enough to put the meaningful change in only a small number of sections, then possibly he would only need to create one CRC collision for that sub-piece, or several for the several sections, and the rest he could leave untouched, and they'll all generate the same hash. I don't know for sure but it's a thought, and moreover already an implementation that I think will prove itself to be rather robust against hash collision attacks that keep improving, even as bittorrent's chosen hash itself will undoubtedly fall prey to smarter researchers over time.
So one problem, two possible ways of dealing with it that can, and are already in use today. In other words - other people should start using these techniques NOW to afford protections, rather than sitting around waiting for some silver-bullet sha-1 replacing hash to be approved, which undoubtedly will also crumble over time. I mean, that's not to say that right now there aren't already other hashes we could be using for which there aren't such attacks for - and by all means we could start using those right now to as a preventative. But given the properties of hashes, it's probably just a matter of time and researcher attention before other algorithms fall victim to more efficient collision generation techniques. So rather than put all our eggs in one basket, or foresake hashes which are still useful the majority of the time, we can just get creative with how we use our existing tools.
Updates and Corrections!
Jolan informs me that even though the OpenBSD ports tree records several different hashes in distinfo, it only checks one. So in order to really make this work for OpenBSD's ports system in the manner in which I'm discussing, the user needs to manually check against the other hash types. Currently that means that the disk would need to be read for each distinct hash as well, so that can obviously be time consuming. In my personal experience, the biggest bottleneck in generating hashes is disk reads - so if one were to check a file with several hash types, it would be wise to design a system in which data is only read from the disk once, even if that data is being fed to several algorithms.
Jolan asks: "got code?"
Chris Palmer correctly points out:"CRC" is not a synonym for "hash", and certainly not "cryptographic hash"
I had some more time sooner than I thought and removed the abusive CRC references Chris mentioned, thanks! I'm not as concerned about the distinction for cryptographically strong hashes actually - just as long as people use hashes that aren't all suceptible to the same collision weaknesses at the same time.