
For a general purpose hash table that aims for both high performance and correctness, when, if ever, does it make sense to assume that hash equality implies logical equality?


To lay some ground rules for the question, assume either that all inputs are non-malicious or that a cryptographic-quality hash function is used. In other words, hash collisions occur only by chance or cannot feasibly be generated intentionally (as far as is publicly known). However, make no assumptions about the number of input bits or output bits. Let's also assume that all logically equal entities have equal hashes (i.e. ignore the possibility of an equality operator with unusual behavior).

Based on several internet sources, including the Stack Exchange network, the chance of a 256-bit hash collision is so small that it can be considered impossible. For the SHA-2 family in particular, not a single collision is publicly known. Nonetheless, the major hash table implementations I'm familiar with do check for logical equality. Is this a sensible design choice, or does it cater to an irrational fear? If the answer is more subtle, what factors affect it? (e.g. number of input bits, hash function used, program context?)

The implementations I'm most familiar with and interested in are from C++ (GNU's libstdc++ and LLVM's libc++) and Python. If the 64-bit hashes used by the C++ standard library affect the answer, please explain.

Praxeolitic

1 Answer


In hash tables (assuming a single fixed key type for simplicity), we distinguish "the hash function", which remains the same over the lifetime of the table and often even across tables, from the function which maps the hash (i.e., the output of "the hash function") to a location/bucket in the table. The latter function depends on the exact capacity of the hash table, which changes over time. It's virtually always as simple as hash mod table_capacity, but this doesn't matter for our purposes.
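To make the split concrete, here is a minimal C++ sketch of the two separate mappings (the names are mine and purely illustrative, assuming a std::string key):

    #include <cstddef>
    #include <functional>
    #include <string>

    // "The hash function": stable over the table's lifetime, independent of capacity.
    std::size_t hash_of(const std::string& key) {
        return std::hash<std::string>{}(key);
    }

    // The bucket mapping: depends on the table's current capacity.
    std::size_t bucket_of(std::size_t hash, std::size_t capacity) {
        return hash % capacity;
    }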

Now, even if you use a 256-bit hash function in your hash table (which would be quite wasteful), it doesn't help you with collisions, because the collisions that matter to a hash table are two keys being mapped to the same bucket, not two keys having the same hash pre-bucket-mapping. Of course, a hash collision does imply a bucket collision, but when there's a big gap between the table size and the hash size, the odds of two different hashes mapping to the same bucket are quite good.
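A toy demonstration of that gap, using 64-bit values standing in for wider hashes (the numbers are arbitrary):

    #include <cassert>
    #include <cstdint>

    int main() {
        const std::uint64_t capacity = 1024;        // a modest table size
        const std::uint64_t h1 = 0x123456789abcdef0;
        const std::uint64_t h2 = h1 + capacity;     // a different hash...
        assert(h1 != h2);                           // ...so no hash collision here...
        assert(h1 % capacity == h2 % capacity);     // ...yet the very same bucket
    }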

Thus, to have the collision resistance of a 256-bit hash function carry over to a hash table, the hash table would need 2^256 buckets. To call this completely impractical would be a contender for the understatement of the year.
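For scale: by the standard birthday bound, inserting n keys into m buckets produces at least one bucket collision with probability roughly 1 − e^(−n(n−1)/(2m)). With, say, a million buckets, a collision is already more likely than not after about 1,200 insertions, regardless of how wide the underlying hash is.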

So you won't be able to actually skimp on collision resolution, one of the main sources of complexity and slowness in hash tables (either by being too complex to be efficient or too simplistic to be effective). You can still store the full hash and compare it instead of comparing the keys logically (a rough sketch of this follows the list below). But this has several downsides:

  • In many use cases, keys are small, significantly smaller than 256 bits = 32 bytes. So even the comparison itself may not be faster (assuming the logical comparison boils down to something as efficient as the hash comparison, which is true in many important cases).
  • Regardless of the key size, 32 bytes of hash per bucket is a lot to store. On today's computer architectures, smaller is faster (and very significantly so) due to the high latency of memory and the resulting reliance on fast but very small caches.
  • Let's not forget the cost of computing the hash. While modern cryptographic hashes are quite fast in absolute terms, insecure hashes can be even faster and there is a continued arms race to write ever faster hash functions that are still adequate in terms of collision probability. And you need to hash at least once on every lookup.
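To make the idea concrete, here is a rough sketch of what "store the full hash and compare it instead of the keys" might look like; this is illustrative, not taken from any real implementation:

    #include <array>
    #include <cstdint>
    #include <string>

    struct Entry {
        std::array<std::uint64_t, 4> hash256;  // 32 bytes of hash per entry, as discussed above
        std::string key;                       // the key is still needed, e.g. for iteration and rehashing
        int value;                             // payload (type chosen arbitrarily)
    };

    // Lookup trusts "hashes equal => keys equal" and never compares keys logically.
    bool probe_matches(const Entry& e, const std::array<std::uint64_t, 4>& probe_hash) {
        return e.hash256 == probe_hash;
    }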

In short, I would expect this design decision to be slower in many common use cases than one using a shorter, insecure hash, and it would not remove any of the other complexity and design decisions involved in a hash table implementation. It's an interesting idea, but it doesn't seem viable for a general purpose hash table.

  • This is a good point but a hash table could still benefit from storing hashes even when there are bucket collisions. – Praxeolitic Oct 08 '16 at 10:41
  • @Praxeolitic Sorry, I completely forgot about that angle while I was writing my answer. I expanded it with why I don't think "store SHA-256 hashes and compare them instead of the keys" will result in a faster hash table. –  Oct 08 '16 at 11:43
  • The .net implementation stores the full hashes of the keys (32 bits) and uses them as an early-out for comparisons. – CodesInChaos Oct 08 '16 at 12:17
  • @CodesInChaos The crucial difference is that it's just 32 bits, so none of the points I made apply: it's as cheap or cheaper to compare as virtually any key type, it doesn't take a lot of valuable cache space away, and the hash function can be arbitrary. Exploiting the "equal => hashes equal" invariant to avoid a full comparison when there's a cheaper alternative is good, but it's not what this question is about. –  Oct 08 '16 at 14:19
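A minimal sketch of the early-out pattern described in the comments above (illustrative names, not the actual .NET implementation):

    #include <cstddef>
    #include <string>

    struct Entry {
        std::size_t cached_hash;  // small (e.g. 32 or 64 bits): cheap to store and compare
        std::string key;
        int value;
    };

    // Unequal hashes cannot mean equal keys, so the full key comparison
    // only runs when the cheap hash comparison has already passed.
    bool probe_matches(const Entry& e, std::size_t probe_hash, const std::string& probe_key) {
        return e.cached_hash == probe_hash && e.key == probe_key;
    }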