
So I've been looking around for a suitable hash function for a hashmap. The best ones I've found so far are xxHash and maybe Murmur after that, but the comparisons I've seen exclude CRC32C, which can be sped up greatly by running three 64-bit crc32c instructions in parallel on SSE4.2-capable CPUs. CRC32 by itself seems quite outdated and is mostly used for things like zip or Ethernet, but from what I've heard, CRC32C outperforms other hashing algorithms.

Is there any reason not to use CRC32C for a hashmap? For example: collisions, the hash not being 64-bit (I don't think that matters, because it gets reduced to a bucket index anyway), or it being slower on smaller data?
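For reference, this is roughly the kind of hash I have in mind: a minimal sketch built on the SSE4.2 `_mm_crc32_u64` / `_mm_crc32_u8` intrinsics. It's the plain serial form, not the 3-way parallel trick, and the function name and seed parameter are just placeholders I made up:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <nmmintrin.h>  // SSE4.2 CRC32 intrinsics; compile with -msse4.2

// Minimal sketch of a CRC32C-based hash using the hardware instruction.
// This is the serial form, not the 3-way parallel variant; the function
// name and the seed parameter are illustrative only.
inline uint32_t crc32c_hash(const void* data, size_t len, uint32_t seed = 0) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    uint64_t crc = 0xFFFFFFFFu ^ seed;            // standard CRC32C init, xor'd with a seed
    while (len >= 8) {                            // consume 8 bytes per instruction
        uint64_t chunk;
        std::memcpy(&chunk, p, sizeof chunk);
        crc = _mm_crc32_u64(crc, chunk);
        p += 8;
        len -= 8;
    }
    uint32_t crc32 = static_cast<uint32_t>(crc);  // byte-at-a-time for the tail
    while (len--) {
        crc32 = _mm_crc32_u8(crc32, *p++);
    }
    return crc32 ^ 0xFFFFFFFFu;                   // standard final inversion
}
```

For a string key you'd call something like `crc32c_hash(key.data(), key.size())` and let the map reduce the 32-bit result to a bucket index.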

Thanks for the help

Niels
  • Personally, I would approach this problem by surveying which hash functions are used by the standard libraries of popular programming languages. They probably did their research, and I would just bootleg it :D However, be wary of older standard libraries, which might have made a then-optimal choice that has since become outdated (and couldn't be changed because of backwards-compatibility needs). Swift, Go and Rust might be good modern candidates to look into. They have open development processes, so you can probably find some discussion of how they made their pick. – Alexander Nov 19 '22 at 15:51
  • @Alexander they might have chosen it because SSE4.2 wasn't as widespread then as it is now (99.05% in the Steam hardware survey), and I don't think the remaining 0.95% is worth targeting, though other languages might want to. A good example of an outdated choice is C++'s unordered map, which apparently uses a suboptimal layout (Boost's flat map is better) because it was chosen back in 2005. Swift & Rust use SipHash. Unsure about Go, but SipHash seems slower than hardware CRC32C. – Niels Nov 19 '22 at 16:00
  • One thing not to forget is that the performance of a hash function in a hashed collection doesn't just come down to the time it takes to compute the hash, but also the quality of its output. Regardless of whether you're using separate chaining, double hashing, or whatever other collision-handling scheme, you'll probably find that handling collisions takes more time than computing hash functions. Hopping around in RAM is much slower than `for` loops going brrrrrrrr. – Alexander Nov 19 '22 at 16:04
  • What are your goals for this hash? What kind of data will you be hashing? Typical suggestions include SipHash when you're concerned about HashDoS attacks, and FNV-1a for hashing short trusted strings. CRC32 is slow compared to many fast hash functions and needs a large lookup table to be somewhat efficient. See also the epic answer to [Which hashing algorithm is best for uniqueness and speed?](https://softwareengineering.stackexchange.com/a/145633). CRC32C might be better, but the strength of CRCs is error detection, not achieving a good bit distribution. See also https://rurban.github.io/smhasher/ – amon Nov 19 '22 at 16:05
  • @Alexander CRC32C has way fewer collisions than a lot of other algorithms, though. – Niels Nov 19 '22 at 19:07
  • @amon I think hashmap keys will primarily be strings. But I can't guarantee there won't be any large strings, so I'm not sure FNV-1a is good enough there. If it is good enough for strings, I could use it for string keys. The bigger point is that if you use any POD struct as a key, the map should know how to hash it by default. I will still allow custom hash functions, of course. CRC32C is 15x faster than plain CRC32, but the LUT is indeed still big, so for small data it's probably not a good call. – Niels Nov 19 '22 at 19:11
  • @Niels FNV-1a isn't *bad* for large strings (though the cost of multiplication becomes noticeable eventually); it just has practically no initialization overhead and is therefore quite efficient for short strings. I also like that it's a pretty good fast hash function that can be implemented in only 4 lines of code (see the sketch after these comments). Nevertheless, it fails various statistical randomness tests (like any non-crypto function). If CRC32C is that fast, then it will likely outperform FNV in any relevant metric. The LUT should be fine, other than messing up CPU caches. I'd still recommend a HashDoS-resistant default. – amon Nov 19 '22 at 19:55
  • @amon I like FNV too for the same reason. If it's good enough for strings then I might still leave it in for strings and provide an alternative for hashmaps that need it. For the default I might have to look at either xxh64 or SipHash. xxh64 seems to have 0 collisions in a test with an English dictionary (on a crypto Stack Exchange post). With HashDoS you mean when multiple keys are made to collide on purpose so they get sorted into the same bucket, right? FNV seems to have 51 collisions vs 44 (SHA-1) or 45 (MD5) on the dictionary. CRC32C seems to have 62, so that's not great. – Niels Nov 19 '22 at 21:58
  • @Niels When I talk about “strings” in the context of hash functions, I mean variable-length inputs, not necessarily ASCII or Unicode text. You can use such a hash function for any kind of data that you can represent as a sequence of bytes. I'm not sure, though, whether the number of collisions on a dictionary is (a) relevant and (b) statistically significant. For HashDoS, the question isn't whether collisions are likely, but whether they can be induced maliciously. The resulting collisions could force a hash table to do an O(n) search through all entries, instead of the O(1) expected performance. – amon Nov 19 '22 at 22:58
  • @amon I guess that depends on how you define bad performance. If each of those 50 collisions only involved 1 other item, you'd only have to compare 1 element before you know it's the other one. So at what point do you declare it sufficient protection? It seems C++ uses MurmurHash on GCC, btw. So I think I'll use FNV for small strings (<10) for performance, then once they get bigger I'm thinking of xxh64. I'm unsure how secure that one is against attacks, though... https://aras-p.info/blog/2016/08/02/Hash-Functions-all-the-way-down/ – Niels Nov 19 '22 at 23:25
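For reference, the FNV-1a function discussed in the comments really is tiny. Here is a minimal sketch of the 64-bit variant (the constants are the standard FNV-1a offset basis and prime; the function name is just for illustration):

```cpp
#include <cstddef>
#include <cstdint>

// Minimal sketch of 64-bit FNV-1a, the "4 lines of code" hash mentioned above.
inline uint64_t fnv1a_64(const void* data, size_t len) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    uint64_t hash = 14695981039346656037ull;   // FNV-1a offset basis
    for (size_t i = 0; i < len; ++i) {
        hash ^= p[i];                          // xor in the next byte
        hash *= 1099511628211ull;              // multiply by the FNV prime
    }
    return hash;
}
```

Like the CRC32C sketch in the question, it accepts any byte sequence, which matches the point above about “strings” really meaning variable-length inputs.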

0 Answers