
I am trying to understand HAMTs and am now unsure what to do when you run into a hash conflict. All I understand so far is that you create a list and append the keys to it, but I don't understand it any deeper than that. I would like to know (ideally for the HAMT case, but any hash table would do) how you resolve collisions, i.e. what exactly you do.

I would like to implement a hashtable / HAMT sort of thing, but I don't conceptually grasp how collisions are handled. This matters even more for the HAMT implementation, which is also hard to understand.

Essentially, what I would like to do is create a binary trie where the key of some data is hashed and the hash is used as the trie key. This works fine except for collisions; I'm not sure what to do when two keys collide.

user10869858
  • In case of a collision you create a linked list containing the value and hash key. If a collision is detected you walk through the linked list and compare the keys. – Marco Mar 10 '19 at 12:05
  • It seems there must be a more optimal approach than comparing the keys directly, but I don't know. – user10869858 Mar 10 '19 at 12:58

1 Answer


A hashtable is an array of buckets, where the bucket index corresponds to the hash values of the keys (modulo the array length).
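For example, a minimal sketch of that index computation in Python (illustrative names, using Python's built-in hash for demonstration):

```python
# Minimal sketch: mapping a key to a bucket index.
num_buckets = 16                     # length of the bucket array
key = "example"
index = hash(key) % num_buckets      # hash value modulo the array length
```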

As collisions do happen (two different keys mapping to the same bucket), you need a strategy to deal with that. Here are two popular strategies.

  • Using a list ("separate chaining"): you design a bucket data structure that supports multiple key/value pairs, e.g. a linked list of pairs. In case of a collision (the list isn't empty), you simply add a second/third/... pair to the list (see the sketch right after this list).
  • Using the next free bucket ("open addressing" with linear probing): each bucket supports a single key/value pair. If the "correct" bucket (according to the hash value) is already occupied, you search linearly for the next free one and place the key/value pair there (a sketch follows further below).
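Here is a minimal sketch of the first strategy (separate chaining) in Python, with illustrative names; each bucket is simply a list of (key, value) pairs:

```python
class ChainedHashTable:
    """Sketch of a hash table using separate chaining: a list of pairs per bucket."""

    def __init__(self, num_buckets=16):
        # Each bucket starts out as an empty list of (key, value) pairs.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                    # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))         # empty bucket or collision: just append

    def get(self, key):
        bucket = self.buckets[self._index(key)]
        for k, v in bucket:                 # walk the chain and compare keys
            if k == key:
                return v
        raise KeyError(key)
```

Note that the hash only selects the bucket; inside the bucket the keys still have to be compared for equality, which is exactly the key comparison described in the comments under the question.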

With both strategies, the performance degrades to a linear search when you have a significant number of collisions. The second strategy tends to degrade faster, because occupied buckets form clusters, and every key that hashes into a cluster extends it further.

If you are in a memory-tight situation, you might prefer the second one, as it doesn't have the overhead of a list per bucket.

With the second approach, it can be tricky to implement correctly, especially the remove action: you can't simply mark a bucket as empty again, because that would break the linear search for other keys that were placed beyond it.
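Here is a minimal sketch of the second strategy (open addressing with linear probing), again with illustrative names. It uses a "tombstone" marker for removal, which is one common way to keep the probe sequences intact; treat this as an assumed design choice, not the only possible one:

```python
# Sentinel markers (hypothetical names) for the two special slot states.
_EMPTY = object()      # slot has never held a key
_DELETED = object()    # tombstone: slot held a key that was later removed

class ProbingHashTable:
    """Sketch of open addressing with linear probing and tombstone-based removal."""

    def __init__(self, num_buckets=16):
        self.keys = [_EMPTY] * num_buckets
        self.values = [None] * num_buckets

    def _slots(self, key):
        # Probe sequence: start at the "correct" bucket, then walk linearly, wrapping around.
        start = hash(key) % len(self.keys)
        for i in range(len(self.keys)):
            yield (start + i) % len(self.keys)

    def put(self, key, value):
        for i in self._slots(key):
            if self.keys[i] is _EMPTY or self.keys[i] == key:
                # First truly empty slot, or the key itself (overwrite its value).
                # For simplicity this sketch never reuses tombstone slots.
                self.keys[i] = key
                self.values[i] = value
                return
        raise RuntimeError("table is full; a real implementation would resize here")

    def get(self, key):
        for i in self._slots(key):
            if self.keys[i] is _EMPTY:
                break                          # an empty slot ends the probe sequence
            if self.keys[i] is not _DELETED and self.keys[i] == key:
                return self.values[i]
        raise KeyError(key)

    def remove(self, key):
        for i in self._slots(key):
            if self.keys[i] is _EMPTY:
                break
            if self.keys[i] is not _DELETED and self.keys[i] == key:
                # Leave a tombstone rather than marking the slot empty; otherwise
                # the probe sequences of keys stored beyond this slot would break.
                self.keys[i] = _DELETED
                self.values[i] = None
                return
        raise KeyError(key)
```

A real implementation would also resize the array when it gets too full and occasionally clean out accumulated tombstones; both are left out here to keep the sketch short.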

Ralf Kleberhoff
    You don't *have* to use a list. You can use any data structure you want. It's just that for a well-designed hash function, and properly implemented resizing, collisions should be relatively rare, so that the simplest possible solution is good enough. Or, put another way, if collisions are so common that the performance of the list becomes a problem, you are better off investing resources to fix the underlying problem of too many collisions rather than add the complexity of, say, an AVL tree to your hash table. – Jörg W Mittag Mar 10 '19 at 13:48