12

I want to implement a hash table using binary search trees to reduce the search complexity in the separate chaining process from O(n) (using a linked list) to O(log n) (using a BST). Can this be done, and if yes, then how? It would be easier to understand if the solution is step by step, implementing the logic.

I want to reduce the search time in the hash table (built using separate chaining), but at the same time I do not want the insert time to increase. For my project I cannot change the hash function to reduce collisions. But due to scalability, collisions are happening. I am trying to find a workaround so that I can somehow get the best access and insert time in case a collision occurs... i.e. to manage the current state of things rather than restructure the entire algorithm. If it does not pan out then I will have to restructure. So any ideas?

  • Hash tables and Binary Search Trees are *different* containers. So you can't do what you suggest (or you are making a terminological mistake). – Basile Starynkevitch May 02 '15 at 11:51
  • I guess you could put a hash/value pair in each node in a tree...but that would be either a bad hash table or a bad binary tree. Without some clarification on why you want to do this at all and what you want the end result to be capable of, I'm not sure this is really answerable. – Ixrec May 02 '15 at 11:55
  • if i understand correctly you want to use a BST to hold the HashTable buckets? and have a some sort of this bit go left that bit go right? please elaborate what are you trying to do and why? – AK_ May 02 '15 at 12:54
  • @AK_: Yup, something of that sort, as you said. I want to handle the collisions using a binary search tree. I corrected my question a bit to make it clearer. – Aviral May 02 '15 at 13:07
  • you mean this: http://stackoverflow.com/questions/18931126/how-to-implement-a-hash-table-using-a-bst or do you want to use a search tree instead of the array (you'll lose the O(1) thing)? – AK_ May 02 '15 at 13:15
  • A binary search tree implies that the data has an ordering. –  May 02 '15 at 14:09
  • @AK_: Yup, I guess so... except in the link you gave, they are discussing some book and its lines. All I wanted to do was to reduce the search complexity in the separate chaining process of a hash table from O(n) (using a linked list) to O(log n) (using a BST). Diving deep and working on the solution. Fingers crossed... If you or anyone figures it out then do post the solution. – Aviral May 02 '15 at 14:29
  • @MichaelT: Yup, the data in the separate chaining of the hash table has an order or can be given an order. But I want to reduce the search time in the separate chaining of the hash table. – Aviral May 02 '15 at 14:32
  • Note that comes with the *penalty* of O(n log n) for *every* insert then. In general, when you have a hash table that starts to get too full (and you have chains longer than you can tolerate), you rebuild the hash. If you regularly encounter chains longer than 3 or 4, something is wrong. –  May 02 '15 at 14:49
  • @MichaelT: Very true. That's what the problem is... I want to reduce the search time in the hash table (built using separate chaining), but at the same time I do not want the insert time to increase. For my project I cannot change the hash function to reduce collisions. But due to scalability, collisions are happening. I am trying to find a workaround so that I can somehow get the best access and insert time in case a collision occurs... i.e. to manage the current state of things rather than restructure the entire algorithm. If it does not pan out then I will have to restructure. So any ideas? – Aviral May 02 '15 at 15:00
  • There are a [myriad of variations](http://en.wikipedia.org/wiki/Hash_table) on the hash table for collision reduction, open addressing, and dynamic resizing of the table. Which one fits your requirements is something that you'll need to look into. Your current approach is covered under [Separate chaining with other structures](http://en.wikipedia.org/wiki/Hash_table#Separate_chaining_with_other_structures) –  May 02 '15 at 15:02
  • @MichaelT : Thanks for sharing the links. Will look into them to find a solution. – Aviral May 02 '15 at 15:07
  • @Aviral - why are collisions happening? Do objects have the same hashcode, or are they going into the same bucket? If the latter, then pre-sizing your table to be much larger than needed may be appropriate. If the former, I suggest replacing your `HashMap` by a `TreeMap` *and measuring the difference in performance*. I think you might be surprised. – kdgregory May 02 '15 at 15:21
  • @Aviral I've tried rewriting your question using the many clarifications from your comments, and voted to reopen. Feel free to revert or edit further if you think I went too far. – Ixrec May 02 '15 at 16:54
  • Java 8's `HashMap` will use a [balanced tree](http://openjdk.java.net/jeps/180) for chaining in certain situations, this improves performance in some worst-case situations. – ggovan May 02 '15 at 17:28
  • @Aviral: Are collisions happening because (possibility #1) there are hotspots (very high spikes of data all having the same "key value" - the key used as the sole search criteria, and therefore would have caused collisions regardless of hash function), or (possibility #2) because the hash function did not have enough entropy to effectively use the space of hash table? Also, by any chance did you come up with this idea after reading about [Cuckoo hashing](http://en.wikipedia.org/wiki/Cuckoo_hashing)? – rwong May 02 '15 at 18:49
  • @Aviral: you need to diagnose the cause of collision, and then include that in the question. (You are expected to research this in depth before posting.) Otherwise, whatever answers given here will not help you solve the particular cause of collision you are encountering. – rwong May 02 '15 at 18:55

2 Answers

12

What you are asking for is possible given your constraints.

Analysis

A hash table's strength is its fast lookup and insertion speed. To get that speed, one must forsake any semblance of order in the table: i.e. entries are all jumbled up. A list is acceptable to use as a table entry because while traversal is O(n), the lists tend to be short assuming the hash table is sufficiently large and the objects stored in the table are hashed using a good quality hashing algorithm.
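To make the baseline concrete, here is a minimal Java sketch of separate chaining with linked-list buckets (the class name, fixed bucket count of 16, and lack of resizing or deletion are all simplifications for this example, not a production design):

```java
import java.util.LinkedList;

// Minimal separate-chaining hash table with linked-list buckets.
// Lookup within a bucket is a linear O(n) scan, which is fine as
// long as buckets stay short.
class ListChainedTable<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry<K, V>>[] buckets = new LinkedList[16];

    private int indexFor(K key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; }  // replace existing
        }
        buckets[i].add(new Entry<>(key, value));                 // O(1) insert
    }

    public V get(K key) {
        int i = indexFor(key);
        if (buckets[i] == null) return null;
        for (Entry<K, V> e : buckets[i]) {                       // O(n) bucket scan
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```

The linear scan in `get` is exactly the O(n) cost the question wants to avoid when buckets grow long.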

A binary search tree (BST) has fast insertion and lookup at O(log2 n), provided it stays reasonably balanced. It also imposes a restriction on the elements it stores: there must be some way to order the elements. Given two elements A and B stored in the tree, it must be possible to determine if A comes before B or if they have equivalent order.
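In Java terms, that ordering requirement is typically expressed through `Comparable`. A minimal sketch of an unbalanced BST (illustrative names; O(log₂ n) holds only on average here, since there is no rebalancing):

```java
// Minimal unbalanced BST: compareTo() supplies the ordering that a
// BST requires of its elements. Keys that compare equal replace the
// stored value.
class Bst<K extends Comparable<K>, V> {
    private static class Node<K, V> {
        K key;
        V value;
        Node<K, V> left, right;
        Node(K key, V value) { this.key = key; this.value = value; }
    }

    private Node<K, V> root;

    public void put(K key, V value) {
        root = put(root, key, value);
    }

    private Node<K, V> put(Node<K, V> n, K key, V value) {
        if (n == null) return new Node<>(key, value);
        int cmp = key.compareTo(n.key);
        if (cmp < 0)      n.left  = put(n.left, key, value);   // A before B: go left
        else if (cmp > 0) n.right = put(n.right, key, value);  // A after B: go right
        else              n.value = value;                     // equivalent order: replace
        return n;
    }

    public V get(K key) {
        Node<K, V> n = root;
        while (n != null) {
            int cmp = key.compareTo(n.key);
            if (cmp < 0) n = n.left;
            else if (cmp > 0) n = n.right;
            else return n.value;
        }
        return null;
    }
}
```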

A hash table imposes no such ordering restriction; instead, its elements must have two properties. First, there must be a way to determine if they are equivalent; second, there must be a way to calculate a deterministic hash code. Order is not a requirement.

If your hash table elements do have an order, then you can use a BST as a hash table entry to hold objects with the same hash code (collisions). Because a BST has O(log2 n) lookup and insertion, the worst case for the whole structure (hash table plus BST) is O(log n), which is technically better than the O(n) worst case of using a list as a table entry. Depending on the BST implementation it will require more storage than a list (two child pointers per node instead of one next pointer), but likely not much more.

Please note that in real-world situations a BST normally brings nothing to the table as a hash bucket: its overhead and behavior buy little, which is why the theoretically poor performance of a list is acceptable. In other words, the hash table compensates for the list's weakness by placing few items in each list (bucket). However, the problem here specifically states that the hash table cannot increase in size, and collisions are more frequent than is typical in a hash table.

Implementation

I am not going to put a full implementation here because, honestly, it is not really necessary, and you did not specify a language anyway.

What I would do is simply copy whatever standard hash table your language's standard library contains into a new class, then change the table bucket type from a list to a tree. Depending on the language and its standard library this may be a very trivial thing to do.

Normally I would not advocate copy and paste coding like this. However, it is an easy way to get a battle-tested data structure very quickly.
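That said, here is a rough sketch of the idea in Java, reusing `java.util.TreeMap` (the standard library's red-black tree) as the bucket type rather than writing a tree from scratch. The class name and fixed bucket count are invented for this example, and keys are assumed to be `Comparable` — the extra ordering requirement discussed above:

```java
import java.util.TreeMap;

// Separate chaining where each bucket is a balanced tree instead of a
// list: lookups inside a crowded bucket cost O(log n) rather than O(n).
// Keys must be hashable AND Comparable.
class TreeChainedTable<K extends Comparable<K>, V> {
    @SuppressWarnings("unchecked")
    private final TreeMap<K, V>[] buckets = new TreeMap[16];

    private int indexFor(K key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) buckets[i] = new TreeMap<>();
        buckets[i].put(key, value);   // O(log n) within the bucket
    }

    public V get(K key) {
        int i = indexFor(key);
        return buckets[i] == null ? null : buckets[i].get(key);  // O(log n)
    }
}
```

Using `TreeMap` keeps the bucket balanced automatically, which matters precisely in the scenario described: a fixed-size table where buckets are expected to get crowded.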

  • In asymptotic terms, using a binary tree for collision handling doesn't change expected performance of a hash table *provided* that the hash table already did the usual tricks to achieve amortized O(1) performance anyway. Resizing the hashtable to ensure good performance means that the expected items per bucket (the size of binary trees) is expected to be small too, so you end up with the same expected amortized O(1) either way. Even for worst case - without any balancing constraint specified, worst case performance for a binary tree is that it ends up behaving like a linked list anyway. –  May 02 '15 at 19:43
  • @Steve314 Keep in mind that the problem is there are a lot of collisions, so he expects a bucket to contain more items than a hash table normally would. –  May 02 '15 at 19:45
  • Good point - e.g. for a constant-sized hash table with unbounded data, the asymptotic performance of the hash table is the same as the asymptotic performance of the collision handling - the hash table only changes the constant factors. –  May 02 '15 at 19:49
  • @Steve314 right, essentially if the hash table cannot effectively limit the number of elements in each bucket, the asymptotic performance degrades into whatever sub-data structure is used in each bucket. I added a paragraph to my answer to make this clear. –  May 02 '15 at 19:51
8

Using a binary tree for collision handling in a hash table isn't just possible - it has been done.

Walter Bright is best known as the inventor of the D programming language, but he also wrote an ECMAScript variant called DMDScript. In the past, a headline claim of DMDScript (or possibly an ancestor - I seem to remember the name DScript) was that its hash tables tended to outperform those in a lot of similar languages. The reason: collision handling using binary trees.

I don't remember exactly where this is from, but the trees used were naive binary trees, with no balancing scheme (not AVL, red-black or anything similar). That makes sense: assuming the hash table itself gets resized when it becomes overfull, and you don't hit absurdly improbable rates of hash collisions, the binary trees should always be small. Basically, the worst case is still the same as using a linked list for collision handling (except you pay the price of two pointers per node instead of one), but the average case reduces the amount of searching within each hash bucket.
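That worst-case claim is easy to demonstrate: feed a naive tree keys in sorted order and it degenerates into a linked list. A small illustrative sketch (names invented for this example):

```java
// Demonstrates that a naive (unbalanced) BST degenerates into a linked
// list when keys arrive in sorted order: every node grows only a right
// child, so depth equals the number of keys.
class NaiveTreeDepth {
    static class Node {
        int key;
        Node left, right;
        Node(int k) { key = k; }
    }

    static Node insert(Node n, int key) {
        if (n == null) return new Node(key);
        if (key < n.key) n.left = insert(n.left, key);
        else if (key > n.key) n.right = insert(n.right, key);
        return n;
    }

    static int depth(Node n) {
        return n == null ? 0 : 1 + Math.max(depth(n.left), depth(n.right));
    }

    public static void main(String[] args) {
        Node sorted = null, mixed = null;
        for (int k = 0; k < 100; k++) sorted = insert(sorted, k);  // worst case
        int[] keys = {50, 25, 75, 12, 37, 62, 87, 6, 18, 31};
        for (int k : keys) mixed = insert(mixed, k);               // typical case
        System.out.println(depth(sorted));  // 100: behaves like a list
        System.out.println(depth(mixed));   // 4: roughly log2 of 10 keys
    }
}
```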