1635

Which hashing algorithm is best for uniqueness and speed? Example (good) uses include hash dictionaries.

I know there are things like SHA-256 and such, but these algorithms are designed to be secure, which usually means they are slower than algorithms that are less unique. I want a hash algorithm designed to be fast, yet remain fairly unique to avoid collisions.

Earlz
  • 22,658
  • 7
  • 46
  • 60
  • 12
    For what purpose, security or other? – Orbling Feb 19 '11 at 00:05
  • 34
    @Orbling, for implementation of a hash dictionary. So collisions should be kept to a minimal, but it has no security purpose at all. – Earlz Feb 19 '11 at 00:06
  • 5
    Note that you will need to expect at least *some* collisions in your hash table, otherwise the table will need to be enormous to be able to handle even a relatively small number of keys... – Dean Harding Feb 19 '11 at 01:14
  • 2
    what is the input to your hash function? A simple byte array? – stackoverflowuser2010 Feb 19 '11 at 01:27
  • 3
    @stacko For the moment, I'm only concerned with strings. – Earlz Feb 19 '11 at 03:23
  • 2
    If uniqueness (low chance of collision) is all you care about, then any of the usual functions (SHA family) with a long-bit output should do. As for determining which is the fastest, use the newest OpenSSL source compiled with all the architecture speedups made specifically for your platform, then test it with 'openssl speed' as mentioned in here http://www.madboa.com/geek/openssl/#benchmark-speed – Marcin Feb 19 '11 at 17:43
  • 3
    Uniqueness is a side-effect of security, though. Secure Hash algorithms are supposed to be suitable for message authentication. – MSalters Aug 04 '11 at 11:25
  • 1
    Take a look at Google's sparsehash template library. http://code.google.com/p/sparsehash/?redir=1 – nohros Apr 23 '12 at 14:10
  • 2
    Why would a CRC not do the job? – whatsisname Apr 23 '12 at 15:18
  • 1
    @nohros Unfortunately since CityHash's minimum hash size is 64 bits, it's too large to be used in practice. It also depends on SSE4.2, which limits its ability to be deployed in practice. – Ian Boyd Apr 26 '12 at 14:48
  • 5
    @IanBoyd: City Hash "too large to be used in practice"? Can you explain? Nothing forces you to use the whole 64-bit result. Also, it doesn't require SSE 4.2 for normal City Hash. There is an (incompatible) variant, CityHashCrc, which requires SSE 4.2 and produces a longer hash, but you can ignore that if it's inappropriate for your needs. – John Bartholomew Apr 28 '12 at 20:28
  • 2
    You can always find a pathological case for any algorithm. –  Apr 29 '12 at 09:50
  • 3
    @ThorbjørnRavnAndersen Except the ones with very strict bounds. They tend to not usually be the interesting algorithms though, as they're often more costly than the ones that are cheaper in some cases… – Donal Fellows May 26 '12 at 18:43
  • 1
    @DonalFellows care to share an example? –  May 26 '12 at 19:19
  • 24
    Great post! Could you also check Yann Collet's xxHash (creator of LZ4), which is twice as fast as Murmur? Homepage: http://code.google.com/p/xxhash/ More info: http://fastcompression.blogspot.fr/2012/04/selecting-checksum-algorithm.html?spref=tw –  May 29 '12 at 12:35
  • 1
    @user1249: It's a long time ago, so I don't have the code any more and my memory might be dodgy. I needed a hash algorithm for a structure where each of the members of the structure could only take a relatively small number of values, so I constructed a dictionary that mapped each member to a bit pattern; ORing them together (with the right shifts) would give me a value which fit in 32 bits. That's the easiest way of doing a perfect hashing function I know, but it was *totally* dependent on the data model I was working with; even minor tweaks would break it utterly. – Donal Fellows Sep 29 '13 at 10:43
  • 1
    I am curious as to the intent of said hash. Depending on your needs, security is the opposite of speed. Fast means easier to brute force, while hashes such as scrypt work to the opposite. I've found that SHA512 is generally good enough, and depending on the settings scrypt can be overkill. Also, as always, one should salt one's hashes, and add random seed bytes into encrypted streams to ensure greater uniqueness. – Tracker1 Oct 14 '13 at 16:31
  • 2
    @Tracker1 it's been a while, but this question was not at all about security. This was for hashes in a hash table, not for hashing passwords – Earlz Oct 14 '13 at 19:29
  • 27
    @zvrba Depends on the algorithm. bcrypt is designed to be slow. – Izkata Oct 31 '13 at 19:20
  • I can't post an answer here, so I'll write a comment. A property not cited so far, and important in preventing DoS attacks, is the predictability of collisions. That is why some cryptographic researchers built [SipHash](https://131002.net/siphash/), which happens to also be fast. – Xavier Combelle Apr 04 '14 at 13:34
  • @XavierCombelle that doesn't always matter though, especially for local processes that do not interact with the world – Earlz Apr 04 '14 at 19:00
  • 3
    @Izkata `bcrypt` is not a hashing algorithm. It is a key derivation scheme that can be used with a hashing algorithm. – haneefmubarak Sep 06 '14 at 06:46
  • 2
    I'd like to see results for xxHash too! – Alnitak Nov 27 '15 at 12:12
  • 2
    Soo... we're actually asking for a well distributed hash function (rather than unique)? I spent a good few minutes getting confused by the answers until I figured this out. – Stephen Feb 10 '16 at 03:43
  • `lh_strhash` from the openssl library is also one to consider. In my tests it beat *murmur3*, *fnv1* and *djb2* for 16-char string hashing. (which will encompass an overwhelming majority of dictionary words) [**see lhash.c**](https://raw.githubusercontent.com/openssl/openssl/master/crypto/lhash/lhash.c) – David C. Rankin Aug 06 '16 at 00:31
  • I wrote a program to test all algorithms supported by your PHP installation: http://paste.debian.net/872031/ (on my PHP 7.0.11 installation, that's 46 different algos) – hanshenrik Oct 13 '16 at 11:02
  • We can compute SHA-256/Whirlpool/etc. on a GPU or even an FPGA. Such things are already implemented and you can find them on GitHub =) – puchu Dec 07 '17 at 20:48
  • 3
    There's a lot of confusion and misinformation here. "unique" is the wrong concept; you want even distribution among your buckets. Message digests achieve "uniqueness" by producing lots of bits, which are useless for hash tables. And most cryptographic hashes (sipHash excepted) are way too slow ... people recommending SHA512 are way off base. xxHash and sipHash are good choices ... neither is included in the accepted answer, which is now quite out of date. (And the test is bad ... there should be collisions.) – Jim Balter Oct 20 '18 at 15:46
  • All decent hash functions that were actually designed as hash functions have good distributions, so the accepted answer (which is now quite out of date) mostly peers into the wrong hole. The only thing that matters is speed, but that depends on what the inputs look like -- meaning their lengths. Here's the comparison you want for the stated purpose of a hash function for a hash table: https://aras-p.info/blog/2016/08/09/More-Hash-Function-Tests/ – Jim Balter Oct 21 '18 at 00:37
  • One shouldn't assume that SHA is only a security hash; git also uses it to avoid collisions: https://stackoverflow.com/questions/28792784/why-does-git-use-a-cryptographic-hash-function In fact, we shouldn't even use it for passwords, since there are dedicated password hashing algorithms now (Argon2) – pdem Apr 29 '19 at 15:51

11 Answers

2888

I tested some different algorithms, measuring speed and number of collisions.

I used three different key sets:

  • A list of 216,553 English words, in lowercase (the "Lowercase" column in the table below)
  • The numbers "1" to "216553", as strings (think ZIP codes; the "Numbers" column)
  • 216,553 random ("type 4") GUIDs (the "Random UUID" column)

For each corpus, the number of collisions and the average time spent hashing were recorded.

I tested:

  • Murmur2
  • FNV-1 and FNV-1a
  • DJB2 and DJB2a
  • SDBM
  • SuperFastHash
  • CRC32
  • LoseLose

Results

Each result contains the average hash time and the number of collisions found:

Hash           Lowercase      Random UUID  Numbers
=============  =============  ===========  ==============
Murmur            145 ns      259 ns          92 ns
                    6 collis    5 collis       0 collis
FNV-1a            152 ns      504 ns          86 ns
                    4 collis    4 collis       0 collis
FNV-1             184 ns      730 ns          92 ns
                    1 collis    5 collis       0 collis▪
DJB2a             158 ns      443 ns          91 ns
                    5 collis    6 collis       0 collis▪▪
DJB2              156 ns      437 ns          93 ns
                    7 collis    6 collis       0 collis▪▪
SDBM              148 ns      484 ns          90 ns
                    4 collis    6 collis       0 collis▪▪▪
SuperFastHash     164 ns      344 ns         118 ns
                   85 collis    4 collis   18742 collis
CRC32             250 ns      946 ns         130 ns
                    2 collis    0 collis       0 collis
LoseLose          338 ns        -             -
               215178 collis

Notes:

Do collisions actually happen?

Yes. I started writing my test program to see if hash collisions actually happen - and are not just a theoretical construct. They do indeed happen:

FNV-1 collisions

  • creamwove collides with quists

FNV-1a collisions

  • costarring collides with liquid
  • declinate collides with macallums
  • altarage collides with zinke
  • altarages collides with zinkes

Murmur2 collisions

  • cataract collides with periti
  • roquette collides with skivie
  • shawl collides with stormbound
  • dowlases collides with tramontane
  • cricketings collides with twanger
  • longans collides with whigs

DJB2 collisions

  • hetairas collides with mentioner
  • heliotropes collides with neurospora
  • depravement collides with serafins
  • stylist collides with subgenera
  • joyful collides with synaphea
  • redescribed collides with urites
  • dram collides with vivency

DJB2a collisions

  • haggadot collides with loathsomenesses
  • adorablenesses collides with rentability
  • playwright collides with snush
  • playwrighting collides with snushing
  • treponematoses collides with waterbeds

CRC32 collisions

  • codding collides with gnu
  • exhibiters collides with schlager

SuperFastHash collisions

  • dahabiah collides with drapability
  • encharm collides with enclave
  • grahams collides with gramary
  • ...snip 79 collisions...
  • night collides with vigil
  • nights collides with vigils
  • finks collides with vinic

Randomnessification

The other subjective measure is how randomly distributed the hashes are. Mapping the resulting HashTables shows how evenly the data is distributed. All the hash functions show good distribution when mapping the table linearly:

[Image: hash table occupancy maps for each function, linear layout]

Or as a Hilbert Map (XKCD is always relevant):

[Image: hash table occupancy maps, Hilbert curve layout]

Except when hashing number strings ("1", "2", ..., "216553") (for example, zip codes), where patterns begin to emerge in most of the hashing algorithms:

SDBM:

[Image: SDBM occupancy map for the Numbers key set]

DJB2a:

[Image: DJB2a occupancy map for the Numbers key set]

FNV-1:

[Image: FNV-1 occupancy map for the Numbers key set]

All except FNV-1a, which still looks pretty random to me:

[Image: FNV-1a occupancy map for the Numbers key set]

In fact, Murmur2 seems to have even better randomness with Numbers than FNV-1a:

[Image: Murmur2 occupancy map for the Numbers key set]

When I look at the FNV-1a "number" map, I think I see subtle vertical patterns. With Murmur I see no patterns at all. What do you think?
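If you want to reproduce these maps, here is a minimal sketch of the idea in C: each pixel is one slot of the internal array, white if empty, dark if occupied. The 512×512 table size, FNV-1a, and the plain-PGM output are choices made for this sketch, not necessarily what produced the images above.

#include <stdint.h>
#include <stdio.h>

#define SIDE 512   /* the map is SIDE x SIDE slots */

static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

int main(void)
{
    static unsigned char used[SIDE * SIDE];  /* zero-initialized occupancy map */
    char key[16];

    for (int i = 1; i <= 216553; i++) {      /* the "Numbers" key set */
        sprintf(key, "%d", i);
        used[fnv1a(key) % (SIDE * SIDE)] = 1;
    }

    /* Plain PGM, maxval 1: occupied slots print as 0 (black), empty as 1. */
    printf("P2\n%d %d\n1\n", SIDE, SIDE);
    for (int y = 0; y < SIDE; y++) {
        for (int x = 0; x < SIDE; x++)
            printf("%d ", used[y * SIDE + x] ? 0 : 1);
        putchar('\n');
    }
    return 0;
}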


The extra ▪ marks in the table denote how bad the randomness is, with Murmur2 and FNV-1a being among the best, and LoseLose being the worst:

      Murmur2: .
       FNV-1a: .
        FNV-1: ▪
         DJB2: ▪▪
        DJB2a: ▪▪
         SDBM: ▪▪▪
SuperFastHash: .
          CRC: ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
     Loselose: ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
                                        ▪
                                 ▪▪▪▪▪▪▪▪▪▪▪▪▪
                        ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
          ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪

I originally wrote this program to decide if I even had to worry about collisions: I do.

And then it turned into making sure that the hash functions were sufficiently random.

FNV-1a algorithm

The FNV-1 hash comes in variants that return 32-, 64-, 128-, 256-, 512-, and 1024-bit hashes.

The FNV-1a algorithm is:

hash = FNV_offset_basis
for each octetOfData to be hashed
    hash = hash xor octetOfData
    hash = hash * FNV_prime
return hash

Where the constants FNV_offset_basis and FNV_prime depend on the return hash size you want:

Hash Size  
===========
32-bit
    prime: 2^24 + 2^8 + 0x93 = 16777619
    offset: 2166136261
64-bit
    prime: 2^40 + 2^8 + 0xb3 = 1099511628211
    offset: 14695981039346656037
128-bit
    prime: 2^88 + 2^8 + 0x3b = 309485009821345068724781371
    offset: 144066263297769815596495629667062367629
256-bit
    prime: 2^168 + 2^8 + 0x63 = 374144419156711147060143317175368453031918731002211
    offset: 100029257958052580907070968620625704837092796014241193945225284501741471925557
512-bit
    prime: 2^344 + 2^8 + 0x57 = 35835915874844867368919076489095108449946327955754392558399825615420669938882575126094039892345713852759
    offset: 9659303129496669498009435400716310466090418745672637896108374329434462657994582932197716438449813051892206539805784495328239340083876191928701583869517785
1024-bit
    prime: 2^680 + 2^8 + 0x8d = 5016456510113118655434598811035278955030765345404790744303017523831112055108147451509157692220295382716162651878526895249385292291816524375083746691371804094271873160484737966720260389217684476157468082573
    offset: 1419779506494762106872207064140321832088062279544193396087847491461758272325229673230371772250864096521202355549365628174669108571814760471015076148029755969804077320157692458563003215304957150157403644460363550505412711285966361610267868082893823963790439336411086884584107735010676915

See the main FNV page for details.

All my results are with the 32-bit variant.
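As a reference point, here is a direct C translation of that pseudocode for the 32-bit variant (a minimal sketch, not the Delphi code the benchmarks actually ran):

#include <stdint.h>
#include <stddef.h>

/* 32-bit FNV-1a: xor the byte in first, then multiply by the FNV prime. */
uint32_t fnv1a_32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t hash = 2166136261u;      /* FNV_offset_basis for 32 bits */

    for (size_t i = 0; i < len; i++) {
        hash ^= p[i];
        hash *= 16777619u;            /* FNV_prime for 32 bits */
    }
    return hash;
}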

FNV-1 better than FNV-1a?

No. FNV-1a is all-around better. There were more collisions with FNV-1a when using the English word corpus:

Hash    Word Collisions
======  ===============
FNV-1   1
FNV-1a  4

Now compare lowercase and uppercase:

Hash    lowercase word Collisions  UPPERCASE word collisions
======  =========================  =========================
FNV-1   1                          9
FNV-1a  4                          11

In this case, FNV-1a isn't "400%" worse than FNV-1, only 20% worse.

I think the more important takeaway is that there are two classes of algorithms when it comes to collisions:

  • collisions rare: FNV-1, FNV-1a, DJB2, DJB2a, SDBM
  • collisions common: SuperFastHash, Loselose

And then there's how evenly distributed the hashes are:

  • outstanding distribution: Murmur2, FNV-1a, SuperFastHash
  • excellent distribution: FNV-1
  • good distribution: SDBM, DJB2, DJB2a
  • horrible distribution: Loselose

Update

Murmur? Sure, why not


Update

@whatshisname wondered how a CRC32 would perform; I added its numbers to the table.

CRC32 is pretty good. Few collisions, but slower, and with the overhead of a 1 KB lookup table (sketched below).
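For reference, that lookup table is the usual 256 entries of 32 bits. A sketch of the common table-driven, reflected CRC-32 (polynomial 0xEDB88320, the widespread zlib-style variant; not necessarily the exact implementation I tested):

#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];   /* 256 * 4 bytes = the 1 KB of overhead */

static void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[i] = c;
    }
}

static uint32_t crc32(const unsigned char *buf, size_t len)
{
    uint32_t c = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        c = crc_table[(c ^ buf[i]) & 0xFFu] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;
}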

Snip all erroneous stuff about CRC distribution - my bad


Up until today I was going to use FNV-1a as my de facto hash-table hashing algorithm. But now I'm switching to Murmur2:

  • Faster
  • Better randomnessification of all classes of input

And I really, really hope there's something wrong with the SuperFastHash algorithm I found; it's too bad to be as popular as it is.

Update: From the MurmurHash3 homepage on Google:

(1) - SuperFastHash has very poor collision properties, which have been documented elsewhere.

So I guess it's not just me.

Update: I realized why Murmur is faster than the others. MurmurHash2 operates on four bytes at a time. Most algorithms are byte by byte:

for each octet in Key
   AddTheOctetToTheHash

This means that as keys get longer Murmur gets its chance to shine.
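To make that concrete, here is a sketch of the MurmurHash2 word-at-a-time loop (following Austin Appleby's public-domain algorithm; it assumes a little-endian CPU, uses memcpy for unaligned-safe reads, and is not the exact code used in these tests):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

uint32_t murmur2_sketch(const void *key, size_t len, uint32_t seed)
{
    const uint32_t m = 0x5bd1e995;
    const int r = 24;
    const unsigned char *data = (const unsigned char *)key;
    uint32_t h = seed ^ (uint32_t)len;

    while (len >= 4) {                    /* consume one 32-bit word per step */
        uint32_t k;
        memcpy(&k, data, 4);              /* unaligned-safe load */
        k *= m; k ^= k >> r; k *= m;      /* mix the word */
        h *= m; h ^= k;                   /* fold it into the hash */
        data += 4; len -= 4;
    }
    switch (len) {                        /* handle the 1-3 trailing bytes */
    case 3: h ^= (uint32_t)data[2] << 16; /* fall through */
    case 2: h ^= (uint32_t)data[1] << 8;  /* fall through */
    case 1: h ^= data[0];
            h *= m;
    }
    h ^= h >> 13; h *= m; h ^= h >> 15;   /* final avalanche */
    return h;
}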


Update

GUIDs are designed to be unique, not random

A timely post by Raymond Chen reiterates the fact that "random" GUIDs are not meant to be used for their randomness. They, or a subset of them, are unsuitable as a hash key:

Even the Version 4 GUID algorithm is not guaranteed to be unpredictable, because the algorithm does not specify the quality of the random number generator. The Wikipedia article for GUID contains primary research which suggests that future and previous GUIDs can be predicted based on knowledge of the random number generator state, since the generator is not cryptographically strong.

Randomness is not the same as collision avoidance, which is why it would be a mistake to try to invent your own "hashing" algorithm by taking some subset of a "random" GUID:

int HashKeyFromGuid(Guid type4uuid)
{
   //A "4" is put somewhere in the GUID.
   //I can't remember exactly where, but it doesn't matter for
   //the illustrative purposes of this pseudocode
   int guidVersion = ((type4uuid.D3 & 0x0f00) >> 8);
   Assert(guidVersion == 4);

   return (int)GetFirstFourBytesOfGuid(type4uuid);
}

Note: Again, I put "random GUID" in quotes, because it's the "random" variant of GUIDs. A more accurate description would be Type 4 UUID. But nobody knows what type 4, or types 1, 3 and 5 are. So it's just easier to call them "random" GUIDs.

All English Words mirrors

Glorfindel
  • 3,137
  • 6
  • 25
  • 33
Ian Boyd
  • 24,484
  • 1
  • 17
  • 16
  • 2
    It's interesting that FNV-1a seems to have better randomness than FNV-1 (at least, based on the "number string" results), yet has more collisions on the word list you tested. I wonder if this is just a weird coincidence, but your sample size seems to suggest otherwise. – Daniel B Apr 24 '12 at 06:31
  • @DanielB I think it's safe to say that all the serious hashing algorithms will have collisions, but less than a 0.01% chance of a collision. The rest is just luck. – Ian Boyd Apr 24 '12 at 13:32
  • 3
    As a random-dot stereogram enthusiast, I had to try viewing your FNV-1a Numbers image as a stereogram. And yes, it is indeed still quite a repetitive pattern for that use case. Treating the image as flowing left to right then top to bottom, I see significant repetition at 805 pixels (is that 805 bits?). 805 = 23 * 7 * 5. I see no repetition in Murmur2. =) – Will Apr 30 '12 at 21:42
  • @maaartinus On the downside, as a practical matter (non-cryptographic) hash functions are usually used to implement a hash table. Hash tables are usually keyed by a *string*. Trying to pick out the correct 4 bytes of a GUID so you can convert it to an integer, so you can convert it to a string, adds needless complexity. And it also reduces the usefulness (e.g. if I want to find the item for CustomerGUID "{B5225B82-F3DE-41E7-A70B-91F5AB1F9709}") – Ian Boyd May 04 '12 at 13:30
  • 6
    For random input and an ideal hashing function, you expect 5.5 collisions with N=216553 and a 32-bit word size (d=2^32) by [N-d-d*((d-1)/d)^N](http://en.wikipedia.org/wiki/Birthday_problem#Collision_counting). The probability of no collisions in this ideal case is [p ≃ ((d-1)/d)^(N*(N-1)/2)](http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem) ≃ 0.00426 (0.426%). Sheer luck that CRC32 had low collisions. So when collisions aren't significantly above 5.5 and input is sufficiently random, only speed matters. (For consecutive integers; low collisions are desired). – dr jimbob May 25 '12 at 14:42
  • 83
    It would be really interesting to see how SHA compares, not because it's a good candidate for a hashing algorithm here but it would be really interesting to see how any cryptographic hash compares with these made for speed algorithms. – Michael May 25 '12 at 15:09
  • @IanBoyd: I did a Java CityHash port at one point, and it was faster than my coworkers' Murmur3 implementation. Could we possibly just see the low 32 bits of CityHash benchmarked? – Louis Wasserman May 25 '12 at 17:10
  • 3
    @Michael: I'd expect SHA to have no collisions and a perfectly random looking distribution. Anything else would be unacceptable from a cryptographic POV. – Secure May 26 '12 at 07:37
  • One addition: CRC can be used as a hash function, but it wasn't designed for this purpose. Mandatory reading: A painless guide to CRC error detection algorithms (http://www.ross.net/crc/download/crc_v3.txt) – Secure May 26 '12 at 07:45
  • Care to add Phil Bagwell's hash to the bench? ( described in my answer below) – Viktor Klang May 29 '12 at 10:53
  • 18
    A new hash by the name of 'xxHash', by Yann Collet, was doing the rounds recently. I'm always suspicious of a new hash. It would be interesting to see it in your comparison, (if you aren't tired of people suggesting random hashes they've heard of to be added...) – th_in_gs May 29 '12 at 11:51
  • 1
    The result is x86 specific. Please note that the performance of murmur2 are worse on ARM for instance. I don't know if this is relevant for the question, but still important. – deadalnix May 29 '12 at 13:50
  • This is Phil Bagwell's hash: //32-bit version, Scala, but should be self-explanatory import java.lang.Integer.reverseBytes def sbhash(i:Int) = reverseBytes(i * 0x9e3775cd) * 0x9e3775cd – Viktor Klang May 29 '12 at 14:55
  • 7
    Indeed. The performance numbers announced by the xxHash project page look impressive, maybe too much to be true. Well, at least, it's an open-source project : http://code.google.com/p/xxhash/ – ATTracker May 29 '12 at 16:23
  • 1
    I'm really curious about how the SHA family would fare. If performance is not incredibly worse, the guarantee of no collisions is enough to make it a better choice, since you're virtually never going to be bottlenecked by hashing bandwidth. – Ricardo Tomasi May 29 '12 at 23:11
  • How about testing a true random function for collisions and randomness for comparison? – Alex R May 30 '12 at 02:27
  • 15
    Hi Ian, my Delphi implementation of SuperFastHash is correct. When implementing I created a test set in C and Delphi to compare the results of my implementation and the reference implementation. There are no differences. So what you see is the actual badness of the hash... (That is why I also published a MurmurHash implementation: http://landman-code.blogspot.nl/2009/02/c-superfasthash-and-murmurhash2.html ) – Davy Landman Jun 04 '12 at 08:58
  • 1
    Great answer. One thing, though, that I think must be stressed, is that while MurmurHash2 does seem to be the best, doesn't mean it's the *only* hash function that should *ever* be used. FNV-1a has one significant advantage over Murmur, that being ease of implementation (two constants and a loop of two simple operations), and it is still *good* at the things MurmurHash is *better* at. – KeithS Sep 12 '12 at 19:08
  • 1
    One question; what would an implementation of FNV-1a look like that used 4 bytes at a time? How well would it do at these tests? – KeithS Sep 12 '12 at 19:11
  • Some analysis/graphs can be found at Pluto Scarab — Hash Functions http://home.comcast.net/~bretm/hash/ , or at the MurmurHash old pages: https://sites.google.com/site/murmurhash/avalanche (from https://sites.google.com/site/murmurhash/statistics ) – Yann Droneaud Mar 29 '13 at 10:47
  • Another study of hash speed is http://playcontrol.net/opensource/LuaHashMap/benchmarks.html but that may also be sensitive to the hash table rebuild strategy. – Donal Fellows Sep 29 '13 at 10:46
  • 4
    https://news.ycombinator.com/item?id=6547734 suggests that [SipHash](https://131002.net/siphash/) would be a valuable addition to this answer. – Joey Oct 14 '13 at 15:42
  • 2
    @IanBoyd: This is one of the best answers I've seen on any stackexchange site. Bravo. :). Is there any chance of you posting all the code publicly? I'd like to add a few things to see for myself... Thanks. – JimR Oct 14 '13 at 19:19
  • 1
    @JimR i could, but it's Delphi, using various code files of helper routines and classes that i've collected over the years. You're better off to start over in C# or something. The real work was implementing the algorithms. It would be a lot of pain to cull down the relevant files into the required routines; and even then it was only designed for Delphi 7 - using an ide from this decade would require code-rework. In other words: no. :) – Ian Boyd Oct 17 '13 at 01:18
  • 1
    @IanBoyd: Ahh well, I guess I'll put off looking for that Delphi 7 Enterprise box in the garage... :) – JimR Oct 17 '13 at 01:58
  • Which CRC32 have you used? Apparently there is at least IEEE, Castagnoli and Koopman. – Jan Matějka Oct 23 '13 at 23:24
  • @yaccz i chewed whatever gum i found laying around. [This one i think it was](http://www.delphitips.net/2008/01/01/crc32-cyclic-redundency-check/) – Ian Boyd Oct 24 '13 at 12:25
  • 2
    I'm a bit late to this party but I was wondering how md5 compares with that list. I've found it to be a bit faster than crc, but I don't know how it compares in terms of collisions to the others. – Ian1971 Jan 07 '14 at 12:24
  • 1
    If I'm not incorrect, it's `SuperFastHash collisions` not `SuperFastCash collisions`, right? – Lucas Feb 26 '14 at 03:56
  • 2
    Excellent post! What surprises me is how well good old CRC32 performed. I have always shied away from this for "serious" applications as I thought (due to age and simplicity) it would not perform as well as more modern algorithms. – James Anderson Feb 27 '14 at 01:22
  • 66
    Is the poster aware this is not just an awesome answer - this is the world's de facto reference resource on the subject? Anytime I need to deal with hashes, that solves my issue so fast and authoritatively that I don't ever need anything else. – MaiaVictor May 29 '14 at 11:53
  • 3
    I tested a bit more, and basically CRC32 is by far the fastest on a modern CPU if you use zlib (just macports uses the slow SW fallback) or the use the SSE4.2 CRC32-C function manually. This needs 1 cycle per op, with 3 wait cycles. zlib provides on most platforms access to the HW-accelerated function which is optimized to run in parallel (to avoid the 3 wait cycles) for larger keys. There are several more really fast hash function variants, tested at: http://www.sanmayce.com/Fastest_Hash/index.html#TRIADvsCRC https://github.com/rurban/perl-hash-stats https://github.com/rurban/Perfect-Hash – rurban Aug 20 '14 at 05:57
  • @IanBoyd The list of words -- http://www.sitopreferito.it/html/all_english_words.html seems to be unavailable. Do you have a local copy lying around? – Kedar Mhaswade Sep 11 '14 at 20:54
  • 1
    @KedarMhaswade [archive.org](https://web.archive.org/web/20070221060514/http://www.sitopreferito.it/html/all_english_words.html) has it. – Ian Boyd Sep 12 '14 at 13:41
  • murmur3 is also available now. Would you please run the same test on see how many collision are there? https://github.com/PeterScott/murmur3 – harshitgupta Sep 12 '14 at 18:06
  • 1
    I need some extra information. I've tried making a test-program, but it fails miserably. I get 20000+ collisions at best. So I'd like to ask: `1: How many buckets do you use ?` and `2: What's your definition of a collision ?`. It may be my code that's completely off, but my own definition of a collision is that a hash (after modulation) matches another hash (after modulation). I've also tried comparing the hashes before modulation; but still 20000+ of my collisions at best. Do you count collisions each time a word has the same hash, or do you count them just the first time ? –  Nov 09 '14 at 23:07
  • @IanBoyd - as an aside, do you have a high-res version of the Hilbert Map above? This would made a beautiful addition to this geek's wall art. I'd be willing to pay you for it. – cciotti Dec 01 '14 at 19:19
  • @cciotti I was about to say that i can do that. But then i remembered: that map **is** 1:1 resolution. There are `216,553` words in the English language, and you see `524,288` pixels (where white is the unfilled hash map). There are no more pixels to show! Were you thinking you'd just like to be able to see a high resolution Hilbert color gradient? 16:10 aspect ratio and 4,000 pixels high? – Ian Boyd Dec 01 '14 at 19:43
  • @IanBoyd - That makes sense. I was hoping to find something high-res that I could print at a reasonable size. I have just the spot and was considering a DNA marker print but I like this too. – cciotti Dec 01 '14 at 19:58
  • @mcfinnigan "Costarring liquid" – Christian Stewart Jan 12 '15 at 18:21
  • In the page https://web.archive.org/web/20070221060514/http://www.sitopreferito.it/html/all_english_words.html OUTSOURCING and OUTSOURCINGS each repeat twice. Did you account for this? Or did you just used another version of this database without this bug? – user502144 Feb 06 '15 at 14:22
  • @IanBoyd, your answer is really impressive, like many others have noted, and I'm as curious as the next guy, but I also like to get shit done, rather than spending my whole day reading about the intricacies of various hashing algorithms ;) Could you please add a *Conclusion* section or a *TL;DR*? It sounds like maybe Murmur2 is the "best", or maybe Murmur3? It's hard to tell without studying the whole thing. I realize "best" may depend on the situation, but some kind of overview would make your answer much more useful. – Ian Dunn May 23 '15 at 16:05
  • 2
    You are a hero. – FrontierPsycho Jun 25 '15 at 12:59
  • 2
    @IanDunn It's tucked away in there; but i decided on **MurMur2**. I wasn't able to implement MurMur3 easily, and so it wasn't tested. Originally i was going to use FNV-1, but changed my mind to Murmur2. – Ian Boyd Jun 25 '15 at 13:27
  • The 64-bit SuperFastHash Pascal implementation is incorrect. It shouldn't use Cardinal(), which will truncate the pointer in 64-bit. – justyy Sep 10 '15 at 21:30
  • The table is very confusing to me. _"The extra * in the above table denotes how bad the randomness is. With FNV-1a being..."_ but the table is _under_ the statement and I don't understand what the '*' mean there... – Noein Nov 07 '15 at 18:39
  • @Noein The `▪` denotes relative *badness*, with FNV-1a having no "badness", so there is no `▪` next to it. Meanwhile **LoseLose** is so bad that the `▪` spill out over the edge of the chart onto the floor. – Ian Boyd Nov 09 '15 at 18:08
  • For the randomness aspect of the two most evenly hashed algorithms, you might look into the Serial Correlation Coefficient (Knuth, pp. 64–65). Applying [ent](http://www.fourmilab.ch/random/)'s random analysis on this as a bit stream (rather than byte stream) could be an interesting addition to the analysis. –  Nov 09 '15 at 19:23
  • 1
    @IanBoyd But FNV-1 has a point, FNV-1_a_ has no points, and _CRC_ has a lot of points while SuperFastHash has none. There's the confusion :/ – Noein Dec 02 '15 at 16:54
  • I wrote and refined [my own hash function](https://github.com/jbruchon/jodyhash) to replace md5 in [jdupes](https://github.com/jbruchon/jdupes) with something faster; it was specifically designed to reflect basic CPU instructions and use as few operations as possible. I ran the All English Words (upper and lower case, with OUTSOURCING[S] deduped) and number sequence tests to check collisions: 1 pair collided for UC, no collisions for LC or numbers. My other tests in my actual programs show very low collision rates as well; I'd be interested in more thorough testing and commentary as seen here! – Jody Bruchon Jul 21 '16 at 02:44
  • Would you mind putting your code for comparison up on GitHub or someplace that would accept contributions? It would be nice to have this in an easy place, and there's a few additional hashes that I wouldn't mind seeing added myself. – b4hand Aug 15 '16 at 16:55
  • Your answer is VERY comprehensive, HOWEVER: I don't see anything (I used Ctrl+F) about cores or threads. What is your hardware spec for the analysis? Perhaps it's due to my limited understanding of cryptology, but wouldn't the CPU type, # of cores, multithreading, compiler opts, and memory type all have non-linear effects on these algos? Or is it not affecting your current test sets (or affecting them proportionally)? – Nick Aug 28 '16 at 16:54
  • 1
    @Nick Fortunately i wasn't really investigating the hand-optimized assembly multi-core versions of the algorithms. I was testing the actual algorithms as i implemented them, as they'd actually be used in the language i wrote them in. It's also fortunate that multiple cores wouldn't help hashing your typical 10 character string - the string can already fit inside one 64-byte cache line inside one core. And modern CPUs are already so pipelined that it's able to do a lot. – Ian Boyd Aug 30 '16 at 15:05
  • 2
    @Nick I ***highly*** recommend Eric Brumer's talk [Native Code Performance and Memory: The Elephant in the CPU](https://channel9.msdn.com/Events/Build/2013/4-329). He explains how memory access is everything. Memory is the bottleneck, not processing power. 99% of your CPU die is dedicated to getting stuff out of memory. – Ian Boyd Aug 30 '16 at 15:07
  • @IanBoyd - thanks for the clarification - there i went again, getting all excited thinking about all these use cases, and as it turns out, my use case is pretty similar to yours - "short" string hasing :) I've added the link to my list of todos. Hopefully i'll have time on the weekend! – Nick Aug 30 '16 at 16:43
  • What about SipHash because of HashDoS mentioned at https://en.wikipedia.org/wiki/MurmurHash#Attacks ? – Jeroen Wiert Pluimers Nov 03 '16 at 20:24
  • @Nick Even if memory weren't the bottleneck, hash functions usually are difficult or impossible to parallelize unless they produce final hashes larger than the machine word size (in which case they MIGHT be able to use more cores to compute partial hashes and then recombine them at the end) but even then some pretty specific restrictions would have to be placed on the hash function to force it to be parallelizable. In general most machine-word sized non-secure hash functions are so fast already that the effort is not worth it. – Jody Bruchon Dec 30 '16 at 21:19
  • How do you generate those **randomness pixel graphs**? I tried calculating SBDM for increasing numbers, then dividing the output of SDBM by 0xFFFFFFFF, and check if its over or under .5 but its almost always under. It looks much worse than your graph. MurMurHash3 works great with my method though. This is my output for SDBM: http://i.imgur.com/MMQuYxY.png – bryc Mar 07 '17 at 04:37
  • 1
    @bryc Each pixel is an entry in the internal array. If the array slot is empty, it's drawn as white. If the array slot is filled: it's drawn with a color. So then i just create an image that is sqrt(arrayLength) wide and tall. It also helps that, at the time, i was using a prime number probe. I've since switched to linear probe - cause cache hits. – Ian Boyd Mar 07 '17 at 16:06
  • You should have published a IEEE paper with these much information. ;) – Prasanth Jaya May 03 '17 at 13:42
  • @IanBoyd Would love seeing Adler32 added to this list for comparison. I had been using it for String hashing and saw a related post on a sister site from Adler himself pointing out it should not be used that way. Also, I bet I'm not alone in wishing to see this posted elsewhere as an updated blog type post with more details included for those interested. – mikebabcock Nov 21 '17 at 20:57
  • @IanBoyd Asking @user124114's question: `1: How many buckets do you use ? and 2: What's your definition of a collision ?`, he tried the experiment and got 20000+ collisions at best. `Do you count collisions each time a word has the same hash, or do you count them just the first time ?` – Ninja May 30 '18 at 19:19
  • 1
    And about supposed faster xxHash? http://cyan4973.github.io/xxHash/ – Peter Krauss Jun 17 '18 at 00:11
  • I wonder how the PJW/ELF hash goes. It's standard in UNIX world, so I'd hope the collision rates are not too bad. – paulotorrens Jul 10 '18 at 19:19
  • 1
    Now that you've got these measurements, it's worth noting for readers that the tradeoff of speed and collisions changes depending on the use case. Every use case will pay some cost A for the non-collision case, and a different cost B for the collision case. The total cost of a sequence of N accesses then, with H being the cost of the hash itself and c the collision rate, is `N*(H + (1-c)A + cB)`, and you want to choose your hash to minimize that. – Phil Miller Jul 10 '18 at 22:44
  • Any chance to add Bob Jenkins hash to the comparison? That (one-at-a-time) is the one I use. (I just found this thread form the reddit post) – BeniBela Jul 11 '18 at 12:39
  • 1
    I bet the collision between `codding` and `gnu` for CRC32 is an easter egg! – Yves Aug 08 '18 at 09:29
  • 3
    Since the question is about a hash function for a hash table, this is largely irrelevant .. collisions necessarily occur when you fold the hash into a relatively small number of buckets, and some functions will do poorly on some hash table sizes. And the best hash functions for hash tables, such as xxHash, sipHash, and Bob Jenkins' SpookyHash aren't included. People asking for SHA and MD5 are in the wrong ballpark for speed, so maybe include them just to demonstrate that. – Jim Balter Oct 20 '18 at 16:11
  • 1
    Here's the comparison you really want: https://aras-p.info/blog/2016/08/09/More-Hash-Function-Tests/ – Jim Balter Oct 21 '18 at 00:31
  • @Yves If you look at the definition and history of CRC32, you will see that cannot possibly be the case. Good thing for you that "I bet" is just an emphatic way to state unfounded beliefs, rather than a commitment to making a payment. – Jim Balter Oct 24 '18 at 03:59
  • @PhilMiller Collisions happen upon table insertion, and the costs are internal to the hash table implementation. They don't depend on the "use case", nor do they apply to "accesses". Further, for any given hash table implementation, the costs A and B are not constant--they depend on the key and the prior content of the table--so your formula is wrong. The right way to formulate it is with worst case, best case, and average costs for insertion, deletion and query. Use cases should pick implementations accordingly. – Jim Balter Oct 24 '18 at 04:25
  • @JimBalter Hash collisions can absolutely impose costs on lookups as well as insertions. For the most trivial illustration, consider separate chaining where entries that collide require added traversal of a list to locate. – Phil Miller Oct 24 '18 at 04:42
  • @PhilMiller I didn't say that collisions can't impose costs on lookups; of course they can. But you wrote "The total cost of a sequence of N accesses then ... is [formula]", which should read "insertions", not "accesses". The cost of accesses is different from the cost of insertions, and again are *not constant* -- the cost of traversal of a chain making that obvious. And you ignored the rest of my points. I won't respond further. – Jim Balter Oct 24 '18 at 06:56
  • 1
    Who would guess that snushing playwrighting could cause collisions... – Apollo Jul 23 '19 at 09:20
  • Dictionary words plumless and buckeroo are also CRC32 collisions based on https://www.lammertbies.nl/comm/info/crc-calculation.html Wondering how cityhash would do. – Nick Aug 02 '19 at 13:15
  • @IanBoyd It would be nice if you could include Java String Hashcode algorithm for comparison. :) – Vigneshwaran Jan 27 '20 at 10:59
  • This is an awesome answer, and deserves all its upvotes. In addition to the numbers on compute time and collisions, I would like to see an indication of implementation complexity. In a constrained situation, such as a SQLite query in which you can't define your own functions, which of the hash functions listed here can be practically implemented? – LarsH Mar 19 '20 at 14:43
  • As someone else attempted to say, the FNV-1a diagram does have a pattern to it, it just doesn't repeat on the same frequency as the width of your graph. Do you see the faint diagonal line patterns in the results? If you shift your line length around I believe you will see the pattern become much more pronounced. – Jason Jun 09 '20 at 23:47
  • @Jason I do see them; they're on an angle of around 25°. I'm sure a computer science or math major could write a 4ᵗʰ paper on why the diagonals are there, and come up with the algebraic equation for the exact angle. In the meantime: MurMur2 – Ian Boyd Jun 10 '20 at 01:17
  • I wonder if there's some fast fourier transform people should be doing on scatterplot data to pick the right dimensions. – Jason Jun 10 '20 at 19:02
  • Really great research! Sadly that there are no numbers for MD5 – Bogdan Mart Aug 02 '20 at 15:49
  • 1
    @BogdanMart I was only doing it to implement a hash table; 32-bit keys. A 128-bit key would be excessive - and i promise you: there would have been zero collisions. – Ian Boyd Aug 02 '20 at 17:06
  • Oh, thank you for letting me know! I'm just looking to hash disk sectors, so was choosing between md5 and murmur3 128bit. I think would use murmur now. – Bogdan Mart Aug 02 '20 at 17:42
  • 1
    This is one of the best answers I've ever seen. – BadHorsie Apr 22 '21 at 15:30
  • 1
    I think, this answer is a little bit misleading. It shows that DJB2 function performs almost as good as Murmur. However, DJB2 is *very bad* on short strings. Example collisions: `Liz`/`MHz`, `Bon`/`COM`, `Rey`/`SEX`, etc. See my answer for quite a different picture: https://stackoverflow.com/a/69812981/5407270 – Andriy Makukha Nov 04 '21 at 07:24
75

If you want to create a hash map from an unchanging dictionary, you might want to consider perfect hashing (https://en.wikipedia.org/wiki/Perfect_hash_function). During the construction of the hash function and hash table, you can guarantee, for a given dataset, that there will be no collisions.
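As a toy illustration of the idea (hand-made for a three-word "dictionary"; real generators such as gperf automate this for larger key sets):

#include <stdio.h>
#include <string.h>

/* For the fixed key set {"red", "green", "blue"}, hashing by
   (strlen(key) - 3) happens to map the three keys to 0, 2, 1 with no
   collisions - a minimal perfect hash for exactly this dataset. */
static const char *slots[3] = { "red", "blue", "green" };

static int toy_perfect_hash(const char *key)
{
    return (int)(strlen(key) - 3);  /* "red"->0, "blue"->1, "green"->2 */
}

int main(void)
{
    const char *key = "green";
    int i = toy_perfect_hash(key);
    /* One string comparison confirms membership; no probing needed. */
    printf("%s -> slot %d (%s)\n", key, i,
           strcmp(slots[i], key) == 0 ? "hit" : "miss");
    return 0;
}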

Damien
  • 751
  • 5
  • 2
  • 4
    Here's more about (minimal) Perfect Hashing http://burtleburtle.net/bob/hash/perfect.html including performance data, although it doesn't use the most current processor etc. – Ellie Kesselman May 29 '12 at 12:24
  • 6
    It's pretty obvious, but worth pointing out that in order to guarantee no collisions, the keys would have to be the same size as the values, unless there are constraints on the values the algorithm can capitalize on. – devios1 Apr 04 '13 at 20:34
  • I improved gperf and provide a nice frontend to most perfect hash generators at https://github.com/rurban/Perfect-Hash It's not yet finished, but already better than the existing tools. – rurban Aug 20 '14 at 06:05
  • @devios: I've recently learned that there are several hash table algorithms that guarantee no collisions, even when you use long strings as keys, strings much longer than the hash table index values generated by the hash function, without any constraints on those strings. See http://cs.stackexchange.com/questions/477/for-what-kind-of-data-are-hash-table-operations-o1 . – David Cary Jun 08 '15 at 17:11
  • Exactly how large is this dataset supposed to be? Can it support sizes between 750 GB and 50 terabytes or so? – MarcusJ Oct 13 '15 at 02:53
  • 4
    @devios1 Your statement is meaningless. First, the values in a hash table, perfect or not, are independent of the keys. Second, a perfect hash table is just a linear array of values, indexed by the result of function that has been crafted so that all the indices are unique. – Jim Balter Oct 20 '18 at 16:29
  • 1
    @MarcusJ Perfect hashing is usually used with less than 100 keys, but take a look at http://cmph.sourceforge.net/ ... still far short of your range. – Jim Balter Oct 20 '18 at 16:35
  • 2
    @DavidCary Nothing at your link supports your claim. Possibly you have confused O(1) with "no collisions", but they aren't at all the same thing. Of course, perfect hashing guarantees no collisions, but it requires that all the keys are known in advance and that there are relatively few of them. (But see the link to cmph above.) – Jim Balter Oct 20 '18 at 16:38
  • 1
    @JimBalter I believe it comes down to how you define a collision. If your criteria is an O(1) complexity lookup, yes you can still achieve O(1) lookup on a hash table with collisions. They achieve this by having a finite number of O(1) stage lookups, thus still being O(1). However it's mathematically impossible to reduce a larger space of possible values to a smaller space of values without some redundancy/overlap. By value I meant the original key and by key I meant the hashed key, not the value you are storing in the hashmap. Sorry for the confusing wording. – devios1 Oct 20 '18 at 18:06
  • 1
    @devios1 Again, "Possibly you have confused O(1) with "no collisions", but they aren't at all the same thing". A collision is a collision. O(1) is a different matter entirely. And as I noted, you don't understand how perfect hashing works ... read my comment again. There is no "hashed key" ... or rather, the "hash" is just an index, due to the careful construction of the hash function. It's not worth my time to discuss this further. Over and out. – Jim Balter Oct 20 '18 at 18:27
  • @JimBalter: I clearly should have explicitly stated "several hash table algorithms -- such as cuckoo hashing and [dynamic perfect hashing](https://en.wikipedia.org/wiki/Dynamic_perfect_hashing) -- that guarantee that none of the stored data keys collides". Those algorithms do not require that all the keys are known in advance, and they guarantee that none of the stored items have a collision when they are looked up again. – David Cary Oct 24 '18 at 01:17
  • @DavidCary "guarantee that none of the stored data keys collides" -- They don't and can't guarantee that. e.g., https://en.wikipedia.org/wiki/Cuckoo_hashing, "However, when both cells are already full, it will be necessary to move other keys to their second locations (or back to their first locations) to make room for the new key. " As for "they guarantee that none of the stored items have a collision when they are looked up again" -- collisions happen on insertion, not lookup. You may have meant something not utterly incoherent, but I have no reason to care. And we're off topic. Bye. – Jim Balter Oct 24 '18 at 03:48
40

Here is a list of hash functions, but the short version is:

If you just want to have a good hash function and cannot wait, djb2 is one of the best string hash functions I know. It has excellent distribution and speed on many different sets of keys and table sizes.

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
Dean Harding
  • 19,871
  • 3
  • 51
  • 70
  • 9
    Actually djb2 is zero-sensitive, as most such simple hash functions are, so you can easily break such hashes. It has a bad bias, too many collisions and a bad distribution; it breaks on most smhasher quality tests: See https://github.com/rurban/smhasher/blob/master/doc/bernstein His cdb database uses it, but I wouldn't use it with public access. – rurban Aug 20 '14 at 06:03
  • 4
    DJB is pretty bad from a performance and distribution standpoint. I wouldn't use it today. – Conrad Meyer Oct 30 '16 at 21:32
  • 1
    @ConradMeyer I'd bet, DJB can be sped up by a factor of three just like in [this question of mine](https://stackoverflow.com/q/21745619/581205) and then it'd probably beat most usable algorithms. Concerning the distribution, I agree. A hash producing collisions even for two letter strings can't be really good. – maaartinus Jun 04 '17 at 00:42
  • Guys, I have doubts. You are saying `djb2` is bad, but the test results of the accepted answer show it is good. – Leopoldo Sanczyk Oct 31 '19 at 04:43
  • You might at least use a sensible prime that produces less collisions instead of 33. https://stackoverflow.com/a/2816747/21499 – Dr. Hans-Peter Störr Jun 22 '20 at 20:17
32

CityHash by Google is the algorithm you are looking for. It is not good for cryptography but is good for generating unique hashes.

Read the blog for more details and the code is available here.

CityHash is written in C++. There also is a plain C port.

About 32-bit support:

All the CityHash functions are tuned for 64-bit processors. That said, they will run (except for the new ones that use SSE4.2) in 32-bit code. They won't be very fast though. You may want to use Murmur or something else in 32-bit code.

bruceg
  • 103
  • 4
Vipin Parakkat
  • 429
  • 4
  • 4
30

I've plotted a short speed comparison of different hashing algorithms when hashing files.

The individual plots differ only slightly in the reading method and can be ignored here, since all files were stored in a tmpfs. Therefore, the benchmark was not I/O-bound, if you are wondering.

Algorithms include: SpookyHash, CityHash, Murmur3, MD5, SHA{1,256,512}.

Conclusions:

  • Non-cryptographic hash functions like Murmur3, CityHash and Spooky are pretty close together. One should note that CityHash may be faster on CPUs with SSE 4.2's CRC instruction, which my CPU does not have. SpookyHash was in my case always a tiny bit ahead of CityHash.
  • MD5 seems to be a good tradeoff when using cryptographic hash functions, although SHA256 may be more secure against the collision vulnerabilities of MD5 and SHA1.
  • The complexity of all algorithms is linear - which is really not surprising since they work blockwise. (I wanted to see if the reading method makes a difference, so you can just compare the rightmost values).
  • SHA256 was slower than SHA512.
  • I did not investigate the randomness of the hash functions. But here is a good comparison of the hash functions that are missing in Ian Boyd's answer. This points out that CityHash has some problems in corner cases.

The source used for the plots:

Anirvan
  • 105
  • 4
Sahib
  • 411
  • 4
  • 3
  • 2
    The linear scale graph cuts off the y-axis label which says what quantity it is plotting. I guess it probably would be "time in seconds", same as the logarithmic scale. It's worth fixing. – Craig McQueen Dec 13 '15 at 23:24
24

I know there are things like SHA-256 and such, but these algorithms are designed to be secure, which usually means they are slower than algorithms that are less unique.

The assumption that cryptographic hash functions are more unique is wrong, and in fact it can be shown to be often backwards in practice. In truth:

  1. Cryptographic hash functions ideally should be indistinguishable from random;
  2. But with non-cryptographic hash functions, it's desirable for them to interact favorably with likely inputs.

Which means that a non-cryptographic hash function may well have fewer collisions than a cryptographic one for a "good" data set: a data set of the kind it was designed for.

We can actually demonstrate this with the data in Ian Boyd's answer and a bit of math: the Birthday problem. The formula for the expected number of colliding pairs if you pick n integers at random from the set [1, d] is this (taken from Wikipedia):

n - d + d * ((d - 1) / d)^n

Plugging n = 216,553 and d = 2^32 we get about 5.5 expected collisions. Ian's tests mostly show results around that neighborhood, but with one dramatic exception: most of the functions got zero collisions in the consecutive numbers tests. The probability of choosing 216,553 32-bit numbers at random and getting zero collisions is about 0.43%. And that's just for one function—here we have five distinct hash function families with zero collisions!
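Those two numbers are easy to reproduce; a minimal C sketch of the two formulas above (the constants are the ones from Ian's tests):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 216553.0;
    double d = 4294967296.0;  /* 2^32 */

    /* Expected colliding pairs: n - d + d*((d-1)/d)^n */
    double expected = n - d + d * pow((d - 1.0) / d, n);

    /* Probability of zero collisions: ((d-1)/d)^(n*(n-1)/2) */
    double p_none = pow((d - 1.0) / d, n * (n - 1.0) / 2.0);

    printf("expected collisions: %.2f\n", expected);  /* ~5.5    */
    printf("P(no collisions):    %.4f\n", p_none);    /* ~0.0043 */
    return 0;
}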

So what we're seeing here is that the hashes that Ian tested are interacting favorably with the consecutive numbers dataset—i.e., they're dispersing minimally different inputs more widely than an ideal cryptographic hash function would. (Side note: this means that Ian's graphical assessment that FNV-1a and MurmurHash2 "look random" to him in the numbers data set can be refuted from his own data. Zero collisions on a data set of that size, for both hash functions, is strikingly nonrandom!)

This is not a surprise because this is a desirable behavior for many uses of hash functions. For example, hash table keys are often very similar; Ian's answer mentions a problem MSN once had with ZIP code hash tables. This is a use where collision avoidance on likely inputs wins over random-like behavior.

Another instructive comparison here is the contrast in the design goals between CRC and cryptographic hash functions:

  • CRC is designed to catch errors resulting from noisy communications channels, which are likely to be a small number of bit flips;
  • Crypto hashes are designed to catch modifications made by malicious attackers, who are allotted limited computational resources but arbitrarily much cleverness.

So for CRC it is again good to have fewer collisions than random in minimally different inputs. With crypto hashes, this is a no-no!

sacundim
  • 4,748
  • 1
  • 19
  • 16
20

The SHA algorithms (including SHA-256) are designed to be fast.

In fact, their speed can be a problem sometimes. In particular, a common technique for storing a password-derived token is to run a standard fast hash algorithm 10,000 times (storing the hash of the hash of the hash of the hash of the ... password).

#!/usr/bin/env ruby
require 'securerandom'
require 'digest'
require 'benchmark'

def run_random_digest(digest, count)
  v = SecureRandom.random_bytes(digest.block_length)
  count.times { v = digest.digest(v) }
  v
end

Benchmark.bmbm do |x|
  x.report { run_random_digest(Digest::SHA256.new, 1_000_000) }
end

Output:

Rehearsal ------------------------------------
   1.480000   0.000000   1.480000 (  1.391229)
--------------------------- total: 1.480000sec

       user     system      total        real
   1.400000   0.000000   1.400000 (  1.382016)
yfeldblum
  • 1,532
  • 10
  • 9
  • 61
    It's relatively fast, sure, *for a cryptographic hashing algorithm*. But the OP just wants to store values in a hashtable, and I don't think a cryptographic hash function is really appropriate for that. – Dean Harding Feb 19 '11 at 01:10
  • 6
    The question brought up (tangentially, it now appears) the subject of the cryptographic hash functions. That's the bit I am responding to. – yfeldblum Feb 22 '11 at 13:14
  • 18
    Just to put people off the idea of "In particular, a common technique for storing a password-derived token is to run a standard fast hash algorithm 10,000 times" -- while common, that's just plain stupid. There are algorithms designed for these scenarios, e.g., `bcrypt`. Use the right tools. – TC1 Oct 14 '13 at 13:19
  • 3
    Cryptographic hashes are designed to have a high throughput, but that often means they have high setup, teardown, `.rodata` and/or state costs. When you want an algorithm for a hashtable, you usually have very short keys, and lots of them, but do not need the additional guarantees of a cryptographic hash. I use a tweaked Jenkins’ one-at-a-time myself. – mirabilos Dec 06 '13 at 13:57
  • Thanks for the amazingly extensive work! Could you perhaps make the source for that available? I'd like to try out something, and it'd be generally helpful for people implementing hash algorithms. – Dr. Hans-Peter Störr Jun 14 '17 at 09:25
  • @DeanHarding (and anyone upvoting that comment): when dealing with untrusted user input, you should *definitely* use cryptographic hash functions, to protect yourself from denial-of-service attacks. Your comment reflects the accepted wisdom of 2011, but since then, basically everyone has switched their hash functions to crypographically-secure hash functions, because people figured out good ways of DoS-attack hash tables using insecure hash functions, e.g. to trigger worst-case performance (e.g. O(n²) on much data rather than the average O(n)) by dropping everything into one bucket. – Chris Morgan Nov 08 '17 at 02:17
  • 1
    @ChrisMorgan: rather than using a cryptographically secure hash, HashTable DoS can be solved much more efficiently using hash randomization, so that on every run of the program, or even per hashtable, the data doesn't get grouped into the same bucket every time. – Lie Ryan Nov 12 '17 at 17:16
15

Use SipHash. It has many desirable properties:

  • Fast. An optimized implementation takes around 1 cycle per byte.

  • Secure. SipHash is a strong PRF (pseudorandom function). This means that it is indistinguishable from a random function (unless you know the 128-bit secret key). Hence:

    • No need to worry about your hash table probes becoming linear time due to collisions. With SipHash, you know that you will get average-case performance on average, regardless of inputs.

    • Immunity to hash-based denial of service attacks.

    • You can use SipHash (especially the version with a 128-bit output) as a MAC (Message Authentication Code). If you receive a message and a SipHash tag, and the tag is the same as that from running SipHash with your secret key, then you know that whoever created the hash was also in possession of your secret key, and that neither the message nor the hash have been altered since.
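A sketch of how a hash table might use it (the siphash64 prototype is a stand-in for whichever SipHash-2-4 implementation you link against, and getentropy is just one way to obtain random bytes; both are assumptions here, not part of SipHash itself):

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>   /* getentropy(); availability varies by platform */

/* Stand-in prototype: link against any real SipHash-2-4 implementation. */
uint64_t siphash64(const void *data, size_t len, const uint8_t key[16]);

static uint8_t table_key[16];

/* Draw a fresh 128-bit key once per process, so an attacker cannot
   precompute a set of inputs that all land in the same bucket. */
void hash_table_init(void)
{
    getentropy(table_key, sizeof table_key);
}

size_t bucket_for(const void *key, size_t len, size_t nbuckets)
{
    return (size_t)(siphash64(key, len, table_key) % nbuckets);
}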

Demi
  • 826
  • 7
  • 18
  • 2
    Isn't SipHash overkill unless you need security? Requires a 128-bit key which is just a glorified hash seed. Not to mention MurmurHash3 has 128-bit output and SipHash only has a 64-bit output. Obviously the larger digest has a lower collision chance. – bryc Apr 02 '18 at 00:37
  • 1
    @bryc The difference is that SipHash will continue to be well-behaved, even on malicious input. A hash table based on SipHash can be used for data from potentially hostile sources, and can use an algorithm such as linear probing that is very sensitive to the details of the hash function. – Demi Apr 02 '18 at 01:21
  • 2
    Siphash (and related newer prng style functions) is my default choice for security. For performance, xxhash is hard to beat. There’s tons of bad hashing advice on the internet, even in the discussions here. Good performance on random or semi random inputs is meaningless. What is the worst case performance, with real world inputs? What is the result with malicious inputs? Your hash table will eventually become an attack vector. – Frank Hileman Feb 15 '21 at 17:33
10

It depends on the data you are hashing. Some hashes work better with specific data, like text. Some hashing algorithms were specifically designed to be good for specific data.

Paul Hsieh once made a fast hash. He lists source code and explanations. But it was already beaten. :)

Peter Mortensen
  • 1,050
  • 2
  • 12
  • 14
user712092
  • 1,412
  • 10
  • 18
6

Java uses this simple multiply-and-add algorithm:

The hash code for a String object is computed as

 s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the i-th character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)

There are probably much better ones out there but this is fairly widespread and seems to be a good trade-off between speed and uniqueness.
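Evaluated iteratively (Horner's rule) rather than with explicit powers, the scheme is one multiply and add per character. A C sketch that matches Java's String.hashCode for ASCII input (Java itself operates on UTF-16 code units):

#include <stdint.h>
#include <stddef.h>

/* s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], evaluated as
   ((s[0]*31 + s[1])*31 + s[2])*31 + ... with wrapping 32-bit math. */
int32_t java_string_hash(const char *s, size_t n)
{
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = 31u * h + (unsigned char)s[i];
    return (int32_t)h;
}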

gnat
  • 21,442
  • 29
  • 112
  • 288
biziclop
  • 3,351
  • 21
  • 22
  • 14
    I wouldn't use the exact same one used here, as it's still *relatively* easy to produce collisions with this. It's *definitely* not terrible, but there are much better ones out there. And if there's no significant reason to be compatible with Java, it should *not* be chosen. – Joachim Sauer Apr 23 '12 at 12:51
  • 7
    If you still choose this way of hashing for some reason, you could at least use a better prime like 92821 as a multiplicator. That reduces collisions much. http://stackoverflow.com/a/2816747/21499 – Dr. Hans-Peter Störr Jul 01 '14 at 06:30
  • 3
    You might as well use FNV1a instead. It's also a simple multiplication-based hash, but uses a larger multiplier, which disperses the hash better. – bryc Jan 15 '19 at 12:13
  • 1
    You don't want to do `s[0]*31^3 + s[1]*31^2 + s[2]*31 + s[3]`. Avoid the power operator (^) and do it this way: `((s[0]*31 + s[1])*31 + s[2])*31 + s[3]`. – Leopoldo Sanczyk Oct 31 '19 at 04:51
  • 1
    @LeopoldoSanczyk Yes, in the code it is (and should be) done iteratively, it was just easier to understand in a closed formula. – biziclop Nov 01 '19 at 18:16
5

First of all, why do you need to implement your own hashing? For most tasks you should get good results with data structures from a standard library, assuming there's an implementation available (unless you're just doing this for your own education).

As far as actual hashing algorithms go, my personal favorite is FNV.

Here's an example implementation of the 32-bit version in C:

#include <stdint.h>
#include <stddef.h>

uint32_t FNV_hash(const void* dataToHash, size_t length)
{
  const unsigned char* p = (const unsigned char*) dataToHash;
  uint32_t h = 2166136261u;            /* 32-bit FNV offset basis */
  size_t i;

  for (i = 0; i < length; i++)
    h = (h * 16777619) ^ p[i];         /* FNV-1: multiply, then xor */

  return h;
}
  • 3
    The FNV-1a variant is slightly better with randomness. Swap the order of the `*` and `^`: `h = (h * 16777619) ^ p[i]` ==> `h = (h ^ p[i]) * 16777619` – Ian Boyd Apr 23 '12 at 15:04