To what extent do cryptographic hash (MD5, SHA1, etc.) collision rates matter for fixity checking?

+1 vote

Given the probabilities of hash value collisions, are MD5 hashes sufficient for ensuring file fixity? Or should SHA1 or SHA2 be used? Or should folks be recording all three? I am interested in the issues to consider both for tamper resistance and for simply knowing whether what I have is what I think I have.

asked Mar 3, 2014 by tjowens (2,360 points)
Are you interested in the consequences of hash collisions (which are always a possibility) or in the consequences of susceptibility to collision attacks?
I think it's both. To clarify, people like to say "MD5 is broken" and in the process imply that it's no good for any use. My question is most directly aimed at clarifying when MD5 is good enough and when it is necessary to use something else, given that digital preservation is not the intended primary use of these cryptographic hash functions.

1 Answer

+1 vote
If these functions (MD5, SHA1, SHA2) behaved as designed, collisions would be a non-issue.
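To put a rough number on "non-issue": a minimal sketch of the standard birthday-bound approximation, assuming an ideal hash whose outputs are uniformly distributed. It covers accidental collisions only and says nothing about deliberate attacks. The function name and the one-billion-file collection size are illustrative assumptions.

```python
import math

def collision_probability(num_files: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_files distinct inputs
    share a digest, for an ideal hash with hash_bits bits of output.

    Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(b+1)).
    """
    pairs = num_files * (num_files - 1) / 2
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-pairs / 2.0 ** hash_bits)

# Hypothetical collection of one billion files.
for name, bits in [("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)]:
    print(f"{name}: {collision_probability(10**9, bits):.3g}")
```

For a 128-bit digest and a billion files this comes out on the order of 10^-21, which is why accidental collisions are not the practical worry; deliberate ones are.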

However, MD5 is broken: it no longer meets its design strength, and there are techniques for creating colliding pairs of files at will. (See http://en.wikipedia.org/wiki/MD5#Collision_vulnerabilities .) There's not yet a published way to match an arbitrary target MD5. That would be a "chosen preimage" attack (usually called a second-preimage attack), in which any given file, not influenced by the attacker, could be matched. But if the attacker can influence prefixes or ranges of the original file, they could be withholding, or could later create, other files with a matching MD5. I wouldn't be surprised if a chosen-preimage attack became possible in a few years. As referenced in the Wikipedia article, knowledgeable sources have been recommending against MD5's use since 1996.

There are rising concerns about SHA1. It is similar in construction to MD5 and published results suggest it is not as strong as its designers intended it to be. No collisions/collision-forcing techniques on full SHA1 are yet published, but they might emerge in a few years. (They're occasionally rumored as being imminent.) So it's not a good choice for any new systems. NIST recommends against SHA1's use by federal agencies where collision-resistance is important. (http://csrc.nist.gov/groups/ST/hash/policy.html)

The SHA2 (SHA256, SHA512, etc) family is still believed safe... but because of certain similarities to MD5 & SHA1, the SHA3 competition was created and settled on another new hash function ("Keccak"), with a different internal construction.

At this point, I would only use MD5 or SHA1 if backward compatibility were required, and wherever possible record a later, stronger function instead. SHA2 appears safe for the foreseeable future, but if you were planning for 'decades' you might want to record SHA3 as well.
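Recording several digests does not require reading the file several times. Here is a minimal sketch using Python's standard hashlib, assuming Python 3.6 or later so that SHA3 is available. The function name, default algorithm list, and chunk size are my own illustrative choices, not a recommendation from any standard.

```python
import hashlib

def file_digests(path, algorithms=("md5", "sha256", "sha3_256"),
                 chunk_size=1 << 20):
    """Return {algorithm: hex digest} for one file, reading it only once."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        # Feed each chunk to every hasher so disk I/O is paid a single time.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Keeping MD5 in the list covers backward compatibility with existing manifests, while the newer digests are what you would rely on going forward.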

Another interesting newer function is 'Blake2', a faster variant of one of the losing SHA3 competition finalists (Blake), with other interesting tricks for parallel or tree-based computation. Where performance is a concern, or where verification of file ranges without the cost of a full-file scan might be interesting, it deserves consideration as a young, experimental option.
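To illustrate range verification, here is a rough sketch that records one BLAKE2b digest per fixed-size chunk, so a single range can be re-checked without scanning the whole file. Note that this is a plain flat list of chunk digests, not BLAKE2's built-in tree mode. The chunk size and function names are hypothetical, and it assumes Python 3.6 or later for hashlib.blake2b.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical chunk size (4 MiB)

def chunk_digests(path, chunk_size=CHUNK_SIZE):
    """Return a list of BLAKE2b hex digests, one per chunk of the file."""
    digests = []
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digests.append(hashlib.blake2b(chunk, digest_size=32).hexdigest())
    return digests

def verify_chunk(path, index, expected_hex, chunk_size=CHUNK_SIZE):
    """Re-hash only chunk number `index` and compare it to the stored digest."""
    with open(path, "rb") as f:
        f.seek(index * chunk_size)
        chunk = f.read(chunk_size)
    return hashlib.blake2b(chunk, digest_size=32).hexdigest() == expected_hex
```

The trade-off is a larger manifest (one digest per chunk), and the chunk list itself still needs protecting, for example by hashing the list as a whole.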

Summary: Don't count on MD5 to detect changes by a motivated malicious entity. Don't count on SHA1 for this for much longer. SHA2 is safe, but if planning for decades of safety or facing performance/throughput concerns, also consider other functions.
answered Mar 7, 2014 by gojomo (160 points)
Great summary. I'd add a couple of thoughts:

1. This illustrates why you really wouldn't want to use a single hash for de-duplication or replication, particularly if you work in an environment (e.g. academia) where someone might intentionally be generating collisions for research purposes.

2. There's an important distinction between future-proofing a live storage service and future-proofing something where only the hashes will be easily obtained in the future (e.g. referencing a dataset in an academic paper). A storage service can, and should, be designed to allow new hashes to be added by re-hashing existing data. That isn't trivial, but it should be a completely automated task, and basically free if you can roll it into active scrubbing.

3. Performance is largely a non-issue unless you have an unusual setup with very fast SSDs and very limited CPU. I did some benchmarks about a year ago (https://groups.google.com/d/msg/digital-curation/3RabkVjSw84/6Dz6vckPqnUJ). Interestingly, rerunning the same test on the current MacBook Air shows the speeds have improved by at least a factor of two, even without the AES-NI support from newer versions. If you have a network, spinning disks, etc., hashing is basically free, so I'd recommend including SHA-512 now.
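If you want to check the performance point on your own hardware, a quick sketch along these lines measures in-memory hashing throughput with Python's hashlib. This is not the benchmark linked above; the function name, buffer sizes, and algorithm list are illustrative assumptions, and the numbers will vary by machine and Python build.

```python
import hashlib
import time

def throughput_mb_per_s(algorithm, total_mb=64, block_mb=1):
    """Hash total_mb of zero bytes in memory and return MB/s for `algorithm`."""
    block = bytes(block_mb * 1024 * 1024)
    blocks = total_mb // block_mb
    hasher = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(blocks):
        hasher.update(block)
    hasher.digest()
    elapsed = time.perf_counter() - start
    return blocks * block_mb / elapsed

if __name__ == "__main__":
    for name in ("md5", "sha1", "sha256", "sha512"):
        print(f"{name}: {throughput_mb_per_s(name):.0f} MB/s")
```

Compare the results with the sustained read speed of your disks or network: if hashing is faster than the storage can deliver data, the choice of algorithm is not the bottleneck.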