Randomized Hash function with no collisions

Question

Related to the question Which hashing algorithm is best for uniqueness and speed?

Is there a way to create a hash function, or find one, whose hash length depends completely on the input length, has an adjustable hash character set (Since it must be 1-to-1 function the input must comply to this character set constraint as well).

Moreover the most important part is that the hash function generates as random as possible strings. So for example it would be possible to get the following two different results.

Hash(aaaa)->blue // for character set a-z

and

Hash(aaaz)->pink  // for character set a-z

More examples:

Hash(aa@12a)->dakj@4   // for character set a-z, @, 0-9

and

Hash(a2#46Ww)->@3#0Pdw // for character set a-z A-Z, !-), 0-9

Notice the character length between input and output and the character sets.

What I thought so far was a probability distribution function, there are random distribution functions but I am not sure how to get there.

http://en.wikipedia.org/wiki/Probability_distribution

http://en.wikipedia.org/wiki/Normal_distribution

@delnan that is a very interesting way to think about it. But a permutation won't map data, it will just permute a random string. — iordanis, Apr 26 '14 at 18:31
It's a permutation of the space of possible strings. In other words, a permutation of `a, b, c, ..., z, aa, bb, ..., zzz` might be `hro, z, , fas, ..., mfx`, so that in function form `f(a) = hro`, `f(b) = z`, `f(zzz) = mfz`. It's a bijective mapping of strings. Block ciphers are commonly considered a family of permutations/bijective maps (one for each key), though with constraints that differ slightly from yours. — , Apr 26 '14 at 18:35
A function with no collisions and the output length depending on the input length seems more like a true _encryption_ algorithm than a _hash_ algorithm to me. Something like AES would meet your requirement. — , Apr 26 '14 at 23:45

score 4 · Accepted Answer · answered Apr 26 '14 at 18:55

4

This is not a hash function. It's either an encoding function or an encryption function.

Usually when I've encountered a problem of this form, it is because I have some sort of IDs that I want to "look" random. Assuming I don't care about security, I'll normally create an algorithm which works as follows:

Map my data to numbers (usually, by treating a string in a character set containing X characters as a baseX number).
Map my numbers to new numbers. Eric Lippert suggests using Multiplicative Inverses for this purpose. Skip32 encoding (AKA Skip32 encryption, but it's not secure) also works for this purpose.
Reverse the mapping in step 1.

To decode, perform the same steps in reverse.

answered Apr 26 '14 at 18:55

Brian

4,480
1
22
37

Interesting idea. But how random/adjustable can this algorithm be? Also is it guaranteed to be one-to-one? – iordanis Apr 26 '14 at 19:09
I'm not sure what "how random" means, but both algorithms "look" random. Skip32 is harder to reverse (neither is secure against reversal). Skip32 can be adjusted to produce different numbers by changing the key. Multiplicative Inverses can be adjusted to produce different numbers by selecting different multiplicative inverses (and by selecting different ranges). The range of Multiplicative inverses can be selected very precisely, whereas the range of Skip32 encoding is always 32bit. – Brian Apr 26 '14 at 19:19

score 1 · Answer 2 · answered Apr 26 '14 at 21:54

I'd use a perfect hash function to map your set of n inputs to a unique integer < n, then use that as an index into either an array of outputs (eg a list of n colours) or map it to a unique string in your character set (e.g. convert the number to base 26 and render each base-26 digit as that letter of the alphabet).

score 0 · Answer 3 · answered Jun 07 '18 at 08:15

It seems to me that Format Preserving Encryption is exactly what you're looking for, except for maybe the security properties implied by "encryption". Its wikipedia page notes that

For such finite domains, and for the purposes of the discussion below, the cipher is equivalent to a permutation of N integers {0, ... , N−1} where N is the size of the domain.

For your case, 4-letter words with x possibilities for each character would be coded as integers by treating them as base-x numbers. Then the permutation would map that input to a distinct output, which would be decoded back into a string by reversing the base-x transform.

It seems that the best way to implement such a permutation is to apply an M-bit permutation to the input value, and if the result is N or more to apply it again to the result until the result is less than N, also called cycle walking. This works best when N is slightly less than 2^M.

There are ways to use Feistel Networks to adapt wide ciphers to produce arbitrary bit-width permutations, but this requires iterating the network to produce a good permutation.

Randomized Hash function with no collisions

3 Answers3

Linked