142

In code dealing with hash tables, I often find the constant 0x9e3779b9 or sometimes 0x9e3779b1. For example

hash = n * 0x9e3779b1 >>> 24

Why is this particular value used?

amalloy
  • 142
  • 7
bkgs
  • 1,003
  • 2
  • 6
  • 5
  • 3
    Some specific citations of code with this constant would be helpful. I found it [in SQLite](https://sqlite.org/src/file/src/hash.c), for example. – dan04 Dec 17 '19 at 23:00

3 Answers3

238

0x9e3779b9 is the integral part of the Golden Ratio's fractional part 0.61803398875… (sqrt(5)-1)/2, multiplied by 2^32.

Hence, if φ = (sqrt(5)+1)/2 = 1.61803398875 is the Golden Ratio, the hash function calculates the fractional part of n * φ, which has nice scattering properties. To convince yourself, just create a scatter plot of (n, n*c-FLOOR(n*c)) in your favourite spreadsheet, replacing c with φ, e, π, etc. Some interesting real-life issues when getting it wrong are described in https://lkml.org/lkml/2016/4/29/838.

This method is often referred to as "Golden Ratio Hashing", or "Fibonacci Hashing" and was popularised by Donald Knuth (The Art of Computer Programming: Volume 3: Sorting and Searching). In number theoretical terms, it mostly boils down to the Steinhaus Conjecture (https://en.wikipedia.org/wiki/Three-gap_theorem) and the recursive symmetry of the fractional parts of the multiples of the Golden Ratio φ.

Occasionally, you may also see 0x9e3779b1, which is the prime closest to 0x9e3779b9 (and appears to be a bit of "cargo cult" as this is not a modular hash). Similarly, 0x9e3779b97f4a7c15 and 0x9e3779b97f4a7c55 are the 64 bit equivalents of these numbers.

32f
  • 1,196
  • 1
  • 6
  • 4
  • 11
    btw contrary to your answer, Linus claims in your link that $\phi$ is no better than $e$ or $\pi$, or any constant that "has a reasonable bit distribution". – stewbasic Dec 17 '19 at 00:58
  • 2
    It's never a modular hash, which would be a hash of the form `x % m`. It's a multiplicative hash, and it's still a multiplicative hash if you use an other constant than Knuth *recommended*. –  Dec 17 '19 at 09:20
  • 1
    @stewbasic Just like the choice of this method itself (as correctly pointed out in another answer), "reasonable" probably depends on what is to be achieved: When dealing with subsets of small diameter, Golden Ratio may be good to you, even for large shift counts. If your inputs are randomly distributed over 1..2^32, any odd number might do. And if your sequence is multiples of 340573321, then 0x9e3779b9 well… I guess you already figured that ;-) – 32f Dec 17 '19 at 15:10
  • 1
    @harold correct! AFAICT, choosing a prime instead of the next odd value is a useful enhancement only if you consider working with computers not running on modular binary arithmetic. And yes, I understood the question as asking why that particular constant was recommended! – 32f Dec 17 '19 at 15:11
  • 2
    @user353249 Hash functions of the form `hash(x)=x*magicConstant` are indeed modular hashes -- the modulus is 2^w where w is the word size. But those of the form `hash(x)=(x*magicConstant)>>>shift`, such as the sort the OP mentioned, are not modular hashes. But then they're not multiplicative either. They lack both properties, thanks to the shift which loses some of the low bits. – Rosie F Dec 17 '19 at 16:08
  • 10
    The nice scattering properties do not actually come from the golden ratio. There is nothing inherently magical about the golden ratio other than people _trying_ to find something divine in it for over half a millenium. That constant isn't any worse than thousand others, though, so one can as well use it. The "magic" of this class of hash functions lies in the combination of multiplying with a reasonably large odd constant, and then shifting/rotating to the right. Which will distribute bits in _both_ directions (up and down). – Damon Dec 17 '19 at 16:44
  • 2
    @stewbasic: I agree it's not a great link, but Linus Torvalds is not exactly known for being competent at computer algorithms. Donald Knuth, on the other hand, is considered one of the foremost experts in the world, having literally wrote THE book on the subject. – BlueRaja - Danny Pflughoeft Dec 18 '19 at 13:26
  • 6
    Reading the actual Knuth source states the following (page 517pp in my third edition), paraphrased: There exists a theorem for all irrational numbers that shows the good scattering properties but some irrational numbers lead to more uniformly distributed sequences and Exercise 9 proves that the golden ratio leads to the "most uniformly distributed sequences". So yes the claim that pi or e would work just as well is unproven afaics. Although I guess they might fall into the same category as phi (have fun whoever wants to prove or disprove that though). – Voo Dec 18 '19 at 17:59
  • Knuth also goes on to say that the theorem suggests A to be "the nearest integer phi^-1*w" relative prime to w (which I guess is where the "pick prime" thing comes from) and then blows holes into that argument. At which point I'll admit I'm just not motivated enough to follow the thought process that leads to the multiplier that "can be recommended". – Voo Dec 18 '19 at 18:05
  • 8
    @Damon That's not true. **The golden ratio *is* special.** It's the "most irrational" number! And this is directly connected to its good equidistribution properties. See https://mathoverflow.net/a/304860. – user76284 Dec 19 '19 at 03:30
33

The other answers explain the intent behind those magic numbers, which is probably what you wanted to know. However one could say that where "they come from" is from bad programming practices. Magic numbers are bad, and they should never be used. Constants such as those mentioned should be given proper descriptive variable names, and perhaps even comments should be added to where they are defined. Then, every appearance of the values in the code should be in the form of the named variable. Where this the case in the codes where you met those values, you would not have been preplexed by their intent in the first place.

example:

Bad example - uses magic numbers

hash = n * 0x9e3779b1

Better example - with comments and meaningful variable

# Golden Ratio constant used for better hash scattering
# See https://softwareengineering.stackexchange.com/a/402543 
GOLDEN_RATIO = 0x9e3779b1
hash = n * GOLDEN_RATIO
squarecandy
  • 103
  • 2
isilanes
  • 479
  • 3
  • 3
  • 31
    "Magic numbers are bad, and should never be used." Absolute general statements are almost always false. See https://crypto.stackexchange.com/a/76301 for some good reasons AES uses magic numbers. – Dessa Simpson Dec 16 '19 at 19:53
  • 17
    @DuncanXSimpson That linked answer is mostly wrong, and is clearly written by someone who doesn't code. The first point about parameterisation is kind of correct, but this answer doesn't suggest that. As for the other two points in the linked answer, they are simply flat wrong. All modern programming languages provide some mechanism for setting constant values, and most compilers additionally can optimise unchanging variable values. The linked code from the linked question even uses constant values with explanatory comments, in defiance of that incorrect answer. – Graham Dec 16 '19 at 22:56
  • 24
    @Graham I think you and Duncan are using the term "magic number" differently and talking past each other. You and isilanes are talking about how the algorithm is expressed in code; specifically whether the source code makes clear the reason for using a particular value. This has no impact on behaviour. The linked question (which perhaps misuses the phrase "magic number") is about changing the behaviour of an encryption algorithm, making those values parameters supplied as input rather than constants. With this understanding, the three points in that answer make sense. – stewbasic Dec 17 '19 at 00:47
  • 16
    "Absolute general statements are almost always false." Amen to that. For example the absolute general statement that absolute general statements are false is false (as you probably realized before sending the comment and added "almost" to not make your answer ironic). Some things are just always wrong, for example magic numbers (meaning unexplained arbitrary values) in code :) – isilanes Dec 17 '19 at 07:42
  • 7
    only the Sith deal in absolutes! – NKCampbell Dec 17 '19 at 14:52
  • 6
    There is no particular point assigning the constant a name if it only occurs in one place (which is quite likely). A comment describing the source is good though. – Martin Bonner supports Monica Dec 17 '19 at 15:32
  • 1
    @DuncanXSimpson "Absolute general statements are almost always false." - one word less and this statement would become self-descriptive. Great truth though. – Błażej Michalik Dec 18 '19 at 00:24
  • 3
    @isilanes I've seen people following this "truth" about magic numbers beyond the edge of the cliff, where incrementing by one became adding a value of a constant `ONE`. Let's be reasonable: common sense is named that way for a reason. Magic numbers are *almost* always bad. – Błażej Michalik Dec 18 '19 at 00:28
  • 2
    @BłażejMichalik I've seen people take it even further and define a function "int add(int a, int b) { return a+b;}" because everything you do more than once should be turned into a function... – jwenting Dec 18 '19 at 04:32
  • @BłażejMichalik I've also seen constants named `ONE` and `ZERO` whose value was absolutely not 1 and 0, but it was a different context for that :). But I guess you could say that "magic numbers" are just numbers that gives no clue where they come from, so it really boils down to each case. – bracco23 Dec 18 '19 at 08:59
  • 1
    @BłażejMichalik, but "magic number" (magic value) is not "any number (value) that is not abstracted to a variable". A magic value (as I have used and stated) is an "unexplained arbitrary value". And the definition does not imply that just turning it into a variable solves the problem. "next = current + 1" is not a case of magic number. "price = 1 * DOOR_PRICE" is, because "1" there should rather be the variable "amount_of_doors", regardless of how often you expect to use it. And neither case is "solved" by naming ONE=1. – isilanes Dec 18 '19 at 10:04
  • @isilanes I didn't said that turning it to variable does make it "not a magic number". Unless the code is about algebraic structures, one would always be able to reason that `1` in `x + 1` is magic. Why `1`? Why not `11`? I know, it's an unsatisfactory evasion of the original argument, but the logic holds. – Błażej Michalik Dec 18 '19 at 10:11
  • 1
    @BłażejMichalik, I didn't say you did :) I was commenting on defining 1 as "ONE", to agree with you that doing so would be misguided. 1 in x+1 can be magic, yes. If there is any reason why the meaning would not be apparent, then a comment might be in order. There is also a difference between 1 meaning the number one (as in i=i+1), and 1 as in "amount of spoons", which happens to be one. The former might need explaining, but probably should not be abstracted to a variable. The latter should always be abstracted to a variable (IMHO), regardless of how obvious or constant one thinks it is. – isilanes Dec 18 '19 at 11:13
5
In code dealing with hash tables, I often find the constant 0x9e3779b9 or sometimes 0x9e3779b1

The other answer correctly explained why this value is used. However, if you often find this constant, what you may not be realizing that you often find code vulnerable to hash flooding attacks.

There are two strategies against hash flooding attacks:

  1. Use a secure hash function having a secret random seed. Your hash function doesn't have a secret random seed. Murmurhash3_32 has a secret random seed, but it has seed-independent multicollisions due to the small internal state. The best hash function having near cryptographical security and still nearly acceptable performance is probably SipHash. Unfortunately, it is slow, although not as slow as SHA512 etc.

  2. Use a hash function that is quick to calculate (such as the hash function you found, or Murmurhash3_32), and make each hash bucket into the root of a balanced binary search tree. So, an ordinary separately chained hash table has each bucket as a linked list, which is slow if lots of values hash to the same bucket. By making it a balanced binary search tree such as AVL tree or red-black tree, you still have guaranteed worst-case performance.

My opinion is that (2) is better because SipHash is so slow. Also, in operating system kernel space there may not be enough entropy to create a secret random seed early in boot-up stage, so in kernel space you may not have the ability to create random numbers early in bootup.

Hash tables are widely misused. It is easy to bring down many systems to a practical halt merely by sending lots of values that hash to the same bucket.

juhist
  • 2,579
  • 10
  • 14
  • Note that a random seed can be re-seeded every time the hash table grows -- thus even if at boot-up a kernel has no entropy to draw from, so long as the attack comes later there is no problem. – Matthieu M. Dec 16 '19 at 12:38
  • @MatthieuM. That's assuming you accept the latency of a hash table growth operation. In many cases, kernels cannot accept that and set the hash table size to a fixed value based on the memory amount in the computer. – juhist Dec 16 '19 at 12:39
  • 3
    If that's a problem, it's also solvable: linear re-hashing which triggers on number of collisions would allow reseeding -- you'd just need a double look-up during the transition period. In general, I do prefer linear re-hashing, however hash map implementations are generally tuned for throughput over latency :( – Matthieu M. Dec 16 '19 at 12:46
  • Amen on easy to calculate. Generally mod is not easy easy as a substring or high order truncation. Security and seeding are generally irrelevant if you are only balancing queues or generating a synthetic key. It may never *need* to be calculated again. – mckenzm Dec 18 '19 at 22:34