
FNV-1, Murmur2, and DJB2 are examples of non-cryptographic hashing functions used in actual applications (see Which hashing algorithm is best for uniqueness and speed?). These are all similar in that they have an inner loop that computes the result by using simple operations such as XOR or bit shifting.

Perhaps these algorithms are truly the best available; I don't know. But Pearson hashing, in particular, seems to be rarely considered, even though it would appear to be the best choice when nothing is known about the key domain, because it was designed to spread out (randomize) its output well for any domain.

This means (for character-string keys) that whether the keys are single letters (a small domain) or strings of length 128 (a much larger domain), the hashed values are spread out well, that is, distributed essentially randomly over the range of the function. Such a good random distribution is desirable because it can be expected to reduce the number of collisions for a random selection of keys (again, assuming the key distribution has no special characteristics).
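
As a point of reference for "reducing collisions": if n keys are hashed by a perfectly uniform (random) function into a range of m possible values, the expected number of colliding pairs is roughly n(n-1)/(2m). For a dictionary-sized key set hashed into a 32-bit range, that works out to only a handful of expected collisions (the figure of about 5.5 mentioned in the comments below).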

Pearson hashing accomplishes this by using a 256-entry array that contains a random permutation having a single cycle. To clarify what I mean, here is a four-entry array specifying such a permutation of the value list [0, 1, 2, 3]:

0:  2
1:  3
2:  1
3:  0
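
For illustration, here is a sketch in C of one way such a single-cycle table could be built. Sattolo's variant of the Fisher-Yates shuffle always produces a permutation consisting of a single cycle; the function name and the use of rand() are only for this example:

```c
#include <stdint.h>
#include <stdlib.h>

/* Fill T[0..255] with a random permutation of 0..255 that forms a single
   cycle. Sattolo's variant of the Fisher-Yates shuffle (swap positions
   i and j with j strictly less than i) guarantees the single-cycle
   property. rand() is used only to keep the sketch short. */
static void build_single_cycle_table(uint8_t T[256])
{
    for (int i = 0; i < 256; i++)
        T[i] = (uint8_t)i;

    for (int i = 255; i > 0; i--) {
        int j = rand() % i;          /* 0 <= j < i, never j == i */
        uint8_t tmp = T[i];
        T[i] = T[j];
        T[j] = tmp;
    }
}
```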

Pearson hashing scans the key string. For each byte (it's fine for a character to span multiple bytes), it XORs that byte with the running hash value, then uses the result to index its 256-entry permutation table; the table entry becomes the new running hash value. Since the table has 256 entries, it can handle any byte value. To scale up to a larger range, such as a 16- or 32-bit range, the inner loop is performed more than once. Since the algorithm uses one XOR and one simple array lookup for each byte of the key, it is probably the fastest algorithm possible, particularly if implemented in assembly language. And the size of a Pearson hashing function should be small, not much more than the 256 bytes used by its table. Except in very tiny embedded applications, I can't imagine 256 bytes of memory being an objection to the use of Pearson hashing.
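
To make this concrete, here is a minimal sketch of that inner loop in C (the function name and parameters are only illustrative; T is assumed to be a 256-entry permutation table such as the one built above):

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal 8-bit Pearson hash: one XOR and one table lookup per key byte.
   T must be a permutation of the values 0..255. */
static uint8_t pearson8(const uint8_t *key, size_t len, const uint8_t T[256])
{
    uint8_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = T[h ^ key[i]];
    return h;
}
```

For example, pearson8((const uint8_t *)"hello", 5, T) returns a value in 0..255; widening the output beyond one byte requires repeating the loop, as described above.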

Some examples of Pearson function implementations are given at https://gist.github.com/imdario/4758192 .

I would be very interested to see an analysis of Pearson hashing as compared to other more commonly-used algorithms, such as the three listed above, with input sets such as an English dictionary. I would expect it to do very well.

  • What's the question? Sum it up. – RubberDuck Jun 11 '16 at 15:51
  • The question was stated clearly: "I would be very interested to see an analysis of Pearson hashing as compared to other more commonly-used algorithms..." It is not a yes or no question. It requires the kind of programming that was done in Question 49550. I post this as a question in case someone can answer it more easily than I can. – David Spector Jun 11 '16 at 16:06
  • That's not a question. It's a request for a scholarly article to be written here. – RubberDuck Jun 11 '16 at 16:07
  • Was Question 49550 "scholarly"? Good answers were already written here. What is wrong with "scholarly"? Education is the basis of all practical work in software engineering, which is the topic of this Forum. – David Spector Jun 11 '16 at 16:11
  • This isn't a forum @DavidSpector. It's a question and answer site. Your post had the potential to be an excellent question, but you need to narrow it down and ask an answerable question. Good questions have answers. – RubberDuck Jun 11 '16 at 17:14
  • Well, first I wanted to ask the OP to add Pearson Hashing in a comment to his great answer. But I didn't have enough points or whatever to post a comment. Next, I didn't have enough points to give an answer to that question. So I did the next possible thing, which was to ask a question. Why all the objections? – David Spector Jun 11 '16 at 20:27
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/41048/discussion-between-david-spector-and-rubberduck). – David Spector Jun 11 '16 at 20:54
  • possible duplicate of [Which hashing algorithm is best for uniqueness and speed?](http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed) – gnat Jun 22 '16 at 08:15
  • 1. Not a duplicate of other postings, because they do not include Pearson Hashing. I posted the question because Pearson Hashing is fundamentally different from other commonly-used functions. Its unique use of a permutation table reduces collisions no matter what the set of keys. I was hoping that the person who performed the other tests at that "possible duplicate" posting would want to add Pearson Hashing to the list of functions tested. – David Spector Jun 23 '16 at 13:09
  • 2. This is not a discussion forum and I was not calling for a discussion but actual testing. I have never seen Pearson Hashing given a fair trial, only for the reason that the common algorithms are more familiar. Actual comparison testing would reveal whether it is actually superior to the other functions, which actually form one family with one underlying but parameterized algorithm. – David Spector Jun 23 '16 at 13:12
  • 3. Even if Pearson Hashing is slower than other techniques, for very large hash tables, its better collision avoidance might be desired in practical use. How can we know without actual statistics? Pearson Hashing will never have a chance to catch on until it is compared with other functions. – David Spector Jun 23 '16 at 13:14
  • Did you benchmark Pearson hashing? I'd expect it to be slower than fast crypto hashes on modern computers and much slower than murmur, cityhash, etc. One idea is to use the AES sbox on a machine that supports AES-NI. That way your Pearson hash might achieve okay performance. – CodesInChaos Jul 21 '16 at 07:51
  • It is hardly fair to compare Pearson hashing (which does not have hardware acceleration) to an s-box based hashing function (on a machine that has hardware acceleration). – David Spector Jul 21 '16 at 21:28
  • @DavidSpector 1) My suggestion was using the AES s-box for your pearson hash so its performance won't suck as much as a normal implementation. 2) The expected value of collisions for an ideal hash is 5.5 in the linked question. Apart from SuperFastHash and LoseLose all hashes are already close to that. The only way you can produce statistically significantly better results than that is exploiting knowledge of the domain (e.g. CRC32 guarantees the absence of collisions among strings which only differ by 4 consecutive bytes) and not by making the hash more random. – CodesInChaos Jul 22 '16 at 16:20
  • 1) I'm not sure that standard hardware can do an xor and an 8-bit table lookup. If it can, then, yes, that would be an efficient implementation for Pearson. 2) I agree that exploitation of the domain characteristics to avoid collisions should always be done, and will always be superior to randomness. However, in the many cases using a general string key, where there are no special characteristics, best randomness (as in Pearson) implies best collision avoidance. – David Spector Jul 23 '16 at 18:48

2 Answers


I don't have a practical comparison between Pearson hashing and the other common suggestions, but I can highlight some assumptions you're making that aren't necessarily true and which might explain why it isn't as popular as you seem to expect:

  • You state that having good distribution of small keys throughout the entire range is just as important as good distribution of larger keys, but this is not necessarily true. In practical applications, small keys are rare, and cannot occur with any non-trivial frequency in large data sets simply because there are only a small number of possible small keys. We only care about optimal performance for large data sets, as small data sets can be processed quickly enough in any case.

  • You assert that performance will be good due to the simplicity of the algorithm, but it doesn't really seem that simple to me. For a 32-bit hash (which is the smallest that's really useful) it requires 8 operations per byte (see the sketch after this list). Compare this to Murmur's 6 operations per 4-byte word, and it is clearly not going to be competitive. Even a single-byte output at 2 operations per byte is unlikely to be as fast as Murmur.
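
To make that operation count concrete, here is a sketch (in C, with an illustrative function name) of one common widening scheme: run the 8-bit Pearson hash several times, perturbing only the first step, and concatenate the result bytes. Each pass costs one XOR and one table lookup per key byte, so a 32-bit result costs roughly eight operations per key byte:

```c
#include <stddef.h>
#include <stdint.h>

/* Widen Pearson hashing to 32 bits by running the 8-bit hash four times,
   perturbing only the first step, and concatenating the four result
   bytes. Each pass costs one XOR and one table lookup per key byte,
   so the 32-bit result costs roughly 8 operations per key byte.
   T must be a permutation of 0..255; details are illustrative only. */
static uint32_t pearson32(const uint8_t *key, size_t len, const uint8_t T[256])
{
    uint32_t result = 0;

    if (len == 0)
        return 0;

    for (int j = 0; j < 4; j++) {
        /* Perturb the first byte so each pass yields an independent byte. */
        uint8_t h = T[(key[0] + j) & 0xFF];
        for (size_t i = 1; i < len; i++)
            h = T[h ^ key[i]];
        result = (result << 8) | h;
    }
    return result;
}
```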

Jules
  • 1. Pearson Hashing is just as good for any size keys, and any key distributions. Its performance is worth testing for very large key sets. 2. Collision avoidance must be compared with execution time for actual key sets that will occur at runtime. There is no other way to choose an algorithm that is likely to be best in practice. One application may need collision avoidance, another may need fast execution time. – David Spector Jun 23 '16 at 13:16
  • You misunderstand my point. All of the popular hash algorithms have these characteristics for large keys; I sincerely doubt you'll see noticeably better results than Murmur's (for example) because if you look at the randomization statistics for that hash you will find that it is, by any reasonable measure, almost entirely "random". There really isn't much that can be gained. And as for trading execution time for "collision avoidance", note that collisions are inevitable in any case and you must therefore provide a backup mechanism. As long as your backup mechanism is not ... – Jules Jun 24 '16 at 22:12
  • ... orders of magnitude slower, the low frequency of collisions with any reasonable hashing algorithm in comparison with the total number of times hashing is required means that even if you do manage to improve the number of collisions radically (say by halving the number) in any reasonable application the cost of hashing is going to be *much* higher than the cost of occasionally resolving a collision. With an algorithm that's around 5-6 times slower, you are *highly* unlikely to find an application where it is worth the cost. – Jules Jun 24 '16 at 22:16
  • Good points, but we won't know for sure until we have actual large-scale statistical testing, similar to what was accomplished for the more popular algorithms. I'm not being unreasonable in calling for inclusion of Pearson Hashing. Arguments against testing reasonable and unique algorithms are at root arguments in favor of ignorance. Let's not do that. – David Spector Jun 25 '16 at 23:30
  • David: It sounds like you're not asking a question so much as pushing for an algorithm you personally like. If you want to test it, feel free to test it and post the results. No one is going to stop you. – Jody Bruchon Jul 21 '16 at 10:56

A comprehensive test suite and review of modern hash functions can be found in the SMHasher repository of Reini Urban. According to the quality measurements there, the 64-bit Pearson hash is somewhat worse than the 64-bit Murmur2 hash, but is almost 17 times slower!

Hash function  |  Speed  | A.Bias | Sparse Bias |  Cycl.Bias | 2-byte Bias |
============================================================================
Spooky64       | 9747.47 |   0.8% |      0.362% |     0.167% |      0.128% |
Murmur3_32     | 2413.88 |   0.8% |      0.548% |     0.106% |      0.201% |
Murmur2_64     | 4882.95 |  12.6% |      1.121% |     0.150% |      0.364% |
Pearson64      |  287.95 |  12.8% |      1.541% |     6.881% |      6.794% |
FNV64          |  791.82 | 100.0% |     99.988% |    94.450% |     99.837% |
DJB2           |  791.82 | 100.0% |  Collisions | Collisions |  Collisions |

So, currently something like SpookyHash would be the best option to use due to the high speed and high quality of its hash values. If not Spooky, then Murmur3, which beats the Pearson hash in every meaningful way.

  • Thank you! This is very encouraging for the use of the Spooky family over the Pearson family, but I have two perhaps small reservations. First is that the code for Spooky is much longer; this must be considered minor due to the ready availability of memory/storage space. The second is that I'm not sure how the testing was done: are the data sets the same for all tests? I would say that the code for Spooky is more general than for Pearson, without the inherent table length limit of 256, which is a big improvement. Its number of inner-loop operations looks to be of the same order. – David Spector Nov 23 '21 at 13:13
  • @DavidSpector, yes, it's the same test set for all hash functions. The tests were specifically devised to measure avalanching performance, in other words, how "random" the hashes look and whether they have more collisions than theoretically expected. – Andriy Makukha Nov 23 '21 at 16:26
  • @DavidSpector, if you are concerned about binary size, then you can use Murmur3 or even MurmurOAAT hashes. They are quite small even for embedded environments. Regarding code size, however, it should not be your concern, as long as it compiles to a small and optimized binary. – Andriy Makukha Nov 23 '21 at 16:31