5

Forgive me if this is a noob question - My CS education is a somewhat incomplete

Basically, I need a way to hash an input, so that someone seeing the output doesn't see the original input value. Security is not an issue, and I need to be able to make this unique across about 1 million inputs.

Originally I was using a hashcode() just fine, but when I started getting to around 1 million inputs, I had very high collision.

Can someone help me conceptually understand where I should look to approach this problem? Should I be trying to write my own function to do this? I thought I could just find a better hash function, like hashcode(), but I only find stuff about writing my own.

Eric S.
  • 207
  • 1
  • 6
  • 2
    you'll need only ~40 bits in the hash output, (unless I remember the birthday paradox wrong) – ratchet freak Oct 27 '14 at 21:30
  • 2
    Just use SHA-2. –  Oct 27 '14 at 21:33
  • Java `hashCode()` is _not_ guaranteed to be unique. The built-in function relies on memory addresses, but on 64 bit systems has to lose data to fit into 32 bits which can result in collisions. For custom `hashCode()` implementations (e.g. `java.lang.String`), you will have collisions. As ratchetfreak mentioned, 32 bits is not enough output for your purpose. Delnan has the right idea: use a better-quality hashing algorithm with many more bits in the output. Related reading: [What is a good 64bit hash function in Java for textual strings?](http://stackoverflow.com/q/1660501/439793) –  Oct 27 '14 at 21:36
  • 1
    See the answers here: http://stackoverflow.com/q/5531455/1345223 –  Oct 27 '14 at 21:37
  • @ratchetfreak at 40 bits there is about 1 colliding pair among those 1 million inputs. Since 20 bits are easily enumerable you can easily reject those hash functions that include a collision. – CodesInChaos Oct 27 '14 at 21:39
  • One alternative is using a specialized library to compute a perfect hash function. But I have my doubts that the result will be superior to a simple hash plus collision handling in most practical applications. – CodesInChaos Oct 27 '14 at 21:40
  • @CodesInChaos unless he needs a stateless hash function – ratchet freak Oct 27 '14 at 21:40
  • 1
    As an aside, give [Which hashing algorithm is best for uniqueness and speed?](http://programmers.stackexchange.com/a/145633/40980) a look. While its lowercase sample size is smaller than yours (only ~200k entries), the ideas and concepts within the post may be of use in evaluating solutions. –  Oct 27 '14 at 21:42
  • @ratchetfreak If you can precompute a perfect hash function, you can precompute a hashtable. For some parameters/inputs perfect hashfunctions offer better performance, but I think it's rarely worth the bother. – CodesInChaos Oct 27 '14 at 21:43
  • @ratchetfreak note: Java's hashcode is 32 bits (hashcode returns an int - not a long ([javadoc](http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode())). So even if it was a great hash function (incidentally, optimized for speed rather than uniqueness), it still wouldn't be 40 bits. No matter what, Java's built in hashcode (and presumably HashMap) aren't going to be sufficient for the task of low collision rate with 1M+ inputs. –  Oct 27 '14 at 22:05
  • @JimmyHoffa Don't believe this is really a duplicate. The question in that case does not mention the requirement for remaining collision-free with a large number of strings, which changes the nature of the problem. – Periata Breatta Oct 28 '14 at 02:49
  • 2
    @PeriataBreatta the answer in that question goes into *extreme* detail of analysis on the uniqueness and distribution of various hashing algorithms. It should have more than enough information to help the asker here choose an algorithm with a good distribution. – Jimmy Hoffa Oct 28 '14 at 06:04
  • @JimmyHoffa Yes, but as discussed in the answer there, all of that analysis is performed on the 32-bit variants of the algorithms in question, and as ratchetfreak mentions above, 32 bits of hash output is not sufficient to solve the problem posed in this question. – Periata Breatta Oct 30 '14 at 10:37

0 Answers0