-8

I have a very large ArrayList in Java, and I often need to check whether it contains a particular value. This has proven very slow.

Then I discovered that you can use a hash-based data structure instead, because apparently a method like containsKey() is O(1).

Nice result - but how is this achieved? Clearly it doesn't go key by key checking for a match.

I imagine it is similar to arrays (see: Why is the complexity of fetching a value from an array O(1)?) - however, I understand it for arrays, since you only have to do some arithmetic to get the address of the desired data. But I am not quite sure how this is applied to hash tables.

Saturn
  • 3,887
  • 9
  • 30
  • 40
  • @gnat I can see a close answer there: that the hash table computes the hashcode into a number that corresponds to an array's index - but what if two keys just so happen to generate the same hashcode? Surely that wouldn't be `O(1)`? – Saturn Aug 04 '15 at 06:30
  • 1
    what you describe is known as [collision resolution](https://en.wikipedia.org/wiki/Hash_table#Collision_resolution) and properly implemented hash tables maintain it so that it remains O(1) – gnat Aug 04 '15 at 06:34
  • 1
    Just as a courtesy reminder, a question such as this may be flagged by future potential employers or technical hiring reviewers as a sign of lacking basic algorithmic knowledge. I said this because it had happened to other people. Make sure to maintain some cyberspace separation between your P.SE profile and your professional profile (e.g. by not mentioning it from the other site profile). (That said, I understand that not everyone needs employment from a traditional, risk-averse software company.) – rwong Aug 04 '15 at 06:49
  • 4
    did you even look at the Wikipedia article for hash table? If so, what didn't you understand? – kevin cline Aug 04 '15 at 07:28
  • 1
    @kevincline no I didn't, obviously :) - otherwise why would I have asked this question? Also, what does it matter that Wikipedia has the answer? If the question is not a duplicate then it should be fine to post it, regardless of how trivial it may be. Heck, many good questions on Stack Exchange have answers in Wikipedia so I don't see the problem. You make it sound like trivial/easy questions are not okay, when they clearly are. – Saturn Aug 04 '15 at 14:41
  • I guess a question title that won't get closed would be "How do I get a strong intuition about different approaches and performance tradeoffs of implementing sets and associative containers?" The candidates are: bit array, unsorted array, sorted array, binary search tree, hash table. – rwong Aug 05 '15 at 04:06
  • The way hash functions work has to do with [the "image" (codomain) of a mathematical function](https://en.wikipedia.org/wiki/Image_(mathematics)#Image_of_a_function). – rwong Aug 05 '15 at 04:08
  • @gnat: "and properly implemented hash tables maintain it so that it remains O(1)" There is nothing that a general `Map` can do to avoid collisions. Hash codes are provided by the objects themselves, and there can always be collisions due to 1) the class implementing `hashCode` that way, or 2) even if the class implements it well, elements can be chosen with the same hash code. – user102008 Aug 05 '15 at 08:22
  • @user102008 I didn't write (nor did I mean to) that the way to keep it O(1) is to avoid collisions. See the link referred to [in my prior comment](http://programmers.stackexchange.com/questions/291744/how-can-the-containskey-method-of-a-java-hash-table-be-o1?noredirect=1#comment604662_291744) – gnat Aug 05 '15 at 08:24
  • @gnat: It is impossible for it to be O(1) in the worst case – user102008 Aug 05 '15 at 08:34
  • @user102008 with that, I agree. Again, the article referred to in my prior comment covers this as well – gnat Aug 05 '15 at 08:37
  • @gnat: No. You said "and properly implemented hash tables maintain it so that it remains O(1)", which makes it sound like the hash table can be "implemented" in some smart way so that it's always O(1). But that's wrong. – user102008 Aug 05 '15 at 08:41
  • @user102008 yeah, the part of my comment you quoted has inaccurate wording – gnat Aug 05 '15 at 08:42
  • 2
    How a hash works is not trivial but it is well established and documented. Why post a generic question without doing any research? – paparazzo Aug 05 '15 at 10:11
  • @Frisbee how an array works is not trivial but it is well established, and yet http://programmers.stackexchange.com/questions/252407/why-is-the-complexity-of-fetching-a-value-from-an-array-be-o1?lq=1. The same for many, many other Stack Exchange questions. You are merely trying to expose this question as particularly bad, ignoring the fact that several (good) questions were asked under the same circumstances. That is laughable. – Saturn Aug 05 '15 at 14:54
  • @Frisbee An additional example, and my favourite, is this: ***"What does the Spring Framework do?"*** (http://programmers.stackexchange.com/questions/92393/what-does-the-spring-framework-do-should-i-use-it-why-or-why-not) - are you really going to tell me this question is not generic? That it's not possible to find out by Googling it? Yet, it is still a good and useful question. – Saturn Aug 05 '15 at 14:59
  • @Frisbee Or this: ***Why do we need private variables?*** (http://programmers.stackexchange.com/questions/143736/why-do-we-need-private-variables). It's a good question, but under your logic, it is not worth posting because clearly the benefits of private variables are well-established and well-documented. Anyone with a little searching would have immediately found out the pros and cons quickly. – Saturn Aug 05 '15 at 15:06
  • 3
    @Omega I don't have the meta links on hand, but citing old links won't really prove your point. Old questions, whether they're good or bad, are grandfathered and left alone. The crux of your question is "how is this achieved?" Doing the bare minimum of research for this question would have you go to a search engine and put in "How is Hash table lookup O(1)?" The first result on google is [this](http://stackoverflow.com/questions/2771368/can-hash-tables-really-be-o1) – Shaz Aug 05 '15 at 15:55
  • 1
    @Ryan probably you meant something like [this one](http://meta.stackexchange.com/tags/broken-windows/info) – gnat Aug 05 '15 at 16:16
  • @Omega There are many questions on SO that I think are poor, and this is one of them. Other bad questions do not make this a good question. This question clearly lacks research. – paparazzo Aug 05 '15 at 16:25
  • 1
    Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackexchange.com/rooms/26645/discussion-on-question-by-omega-how-can-the-containskey-method-of-a-java-hash). – yannis Aug 06 '15 at 08:52

3 Answers

9

The reason hash access is O(1) is actually quite different from the reason array access is.

Arrays are contiguous memory areas. To read element 15, you only have to multiply the element size by 15 and add the start address. Both addition and multiplication are O(1) (it takes just as long to add two big numbers as two small numbers), and since computers provide O(1) memory access, the overall complexity is still O(1).
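
As a rough illustration of that arithmetic (Java does not expose raw addresses, so this is purely conceptual):

```java
// Conceptually, reading array[index] is one multiply, one add, and one
// memory load. Java hides raw addresses, but the arithmetic looks like this:
final class ArrayAddressing {
    static long elementAddress(long baseAddress, long index, long elementSize) {
        return baseAddress + index * elementSize;  // O(1) no matter how big the array is
    }
}
```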

A hash works quite differently. It stores things at predictable places, but those indexed places are not user-visible. A string that you put into a hashtable isn't stored at an address you specify; instead, the table takes the content of that string and computes a fitting address by hashing that content to a number taken from a small set of possibilities. Then it stores the value at that place, and if you ask for that key again, it re-computes the hashed value and looks up that cell.

Since the set of possible hash values is smaller than the set of possible keys, you can have collisions, so that you might have to spend a little more time to find the right value when more than one of them has been put into the same bucket, but it can be shown that this happens infrequently and doesn't affect the overall amortized complexity, which is still O(1).

So you see that an array can find things fast because you tell it where to load from; a hash table returns things fast because it knows where it put them, and can reconstruct that computation efficiently.
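
To make "reconstruct that computation" concrete, here is a minimal sketch of that lookup cycle; the class and method names are invented for illustration, and real implementations such as java.util.HashMap are considerably more sophisticated:

```java
import java.util.ArrayList;
import java.util.List;

// A deliberately tiny, illustrative "hash set" in the spirit of the answer
// above -- not the real java.util.HashSet/HashMap code. It shows the key
// point: lookup re-runs the same index computation that storage used.
final class TinyHashSet<K> {
    private final List<List<K>> buckets = new ArrayList<>();

    TinyHashSet(int capacity) {
        for (int i = 0; i < capacity; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // The "fitting address": hash the key down to one of the bucket indices.
    private int indexFor(K key) {
        return Math.floorMod(key.hashCode(), buckets.size());
    }

    void add(K key) {
        if (!contains(key)) {
            buckets.get(indexFor(key)).add(key);
        }
    }

    boolean contains(K key) {
        // Jump straight to the one bucket the key must be in, then scan it.
        for (K candidate : buckets.get(indexFor(key))) {
            if (candidate.equals(key)) {
                return true;
            }
        }
        return false;
    }
}
```

The crucial property is that indexFor is the same cheap computation at insert time and at lookup time, so contains never has to look at more than one bucket.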

gnat
  • 21,442
  • 29
  • 112
  • 288
Kilian Foth
  • 107,706
  • 45
  • 295
  • 310
  • 1
    "but it can be shown that this happens infrequently and doesn't affect the overall amortized complexity, which is still O(1)." That's not what "amortized" means. Amortized considers what happens over multiple operations to the *same* data structure *in the worst case*. The amortized complexity here is O(n). – user102008 Aug 05 '15 at 08:17
  • 3
    @user102008 no, that's not what "amortized" means. To quote from [wikipedia](https://en.wikipedia.org/wiki/Amortized_analysis), "[u]nlike other analyses that focus on an algorithm's run time in a worst case scenario, amortized analysis examines how an algorithm will perform in practice or on average". The amortized complexity is O(1) because you are unlikely to see the worst case performance more than occasionally in a real scenario, so the average performance is O(1). – Jules Aug 05 '15 at 12:08
  • @Jules: Yes, what I said is exactly what "amortized" means. Read any CS source and it will say exactly that. Just because somebody edited the Wikipedia article to make it wrong does not change what it means. – user102008 Aug 05 '15 at 21:37
  • @user102008 presumably you consider [this](http://www.cs.cornell.edu/courses/cs312/2008sp/lectures/lec20.html) not to be a "CS source" then? It clearly shows an analysis of the amortized complexity of hashtable lookup to be O(1). – Jules Aug 05 '15 at 22:53
  • @Jules: I said "CS source" with respect to the definition of "amortized", which the source clearly says means average of a sequence of operations on the same data structure, for a particular set of inputs. You cannot average over different inputs. – user102008 Aug 06 '15 at 00:49
  • @Jules: And about hash table lookup being amortized O(1), the article first assumes that you have a good hash function and the operations are O(1), and then does an amortized analysis of the cost of adding an element considering resizing. It is NOT saying that in the worst case the amortized cost is O(1). In the worst case it is amortized O(n). – user102008 Aug 06 '15 at 00:51
  • For a particular set of inputs - but not assuming worst case inputs. – Jules Aug 06 '15 at 12:10
  • Granted, it is possible to talk about "average-case amortized complexity" and "worst-case amortized complexity", etc., but without explicit qualification "amortized complexity" generally means "worst-case amortized complexity". – user102008 Aug 06 '15 at 23:09
  • In any case, both the "worst-case amortized complexity" and "worst-case non-amortized complexity" for lookup in this hash table is O(n), and both the "average-case amortized complexity" and "average-case non-amortized complexity" for lookup in this hash table is O(1), so the mention of "amortized" is pointless in any case. Amortization cannot affect the complexity of lookup because lookup is a non-mutating operation -- one lookup should not have any effect on the complexity of a future lookup -- so the cost redistribution of amortization cannot come into play. – user102008 Aug 06 '15 at 23:10
2

In Java, every object has a hashCode() method that returns a 32-bit integer value, which is always the same for the same object. In the simplest version, a hashtable just contains an array of size 2^32, and a key-value pair is stored at the index corresponding to the hash code of the key. Since array access by index is O(1), hashtable access by key (for storing or retrieving) can also be O(1).

Of course it is a bit more complex in reality. First, you can always have collisions, that is, two different objects giving the same hash code. So the items are not stored directly in the array; rather, each array index contains a "bucket", which is an ordinary list of key-value pairs. (In the Java hashtable the buckets are implemented as linked lists.) You have to search through the bucket to find the item, and this search will be O(n), but unless your hashtable contains an extreme number of items (or your hash algorithm is bad), the items will be distributed pretty evenly across the array, and each bucket will contain only a few items. (Only one in the best case.)
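
To picture what such a linked-list bucket can look like, here is a hedged sketch; the names BucketSketch, Entry and search are made up, and this is not the actual java.util.HashMap source:

```java
final class BucketSketch {

    // One node in a bucket's chain: a key-value pair plus a link to the
    // next entry whose key hashed to the same array index.
    static final class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;

        Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // Walking one chain is the only linear part of a lookup; with a decent
    // hash function the chain holds just a handful of entries.
    static <K, V> V search(Entry<K, V> head, K key) {
        for (Entry<K, V> e = head; e != null; e = e.next) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;   // not in this bucket, hence not in the table at all
    }
}
```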

Second, you will not initially create an array of size 2^32, since that would be a crazy waste of space. Instead, you initially create a smaller array, where each entry maps to multiple hash codes. This of course leads to a higher risk of collisions. You keep track of the number of entries, and when it reaches a certain threshold you double the size of the array and then re-distribute the items. Of course this also has a performance cost. There is a design tradeoff in deciding when to resize the array: the bigger the array relative to the number of items, the fewer the collisions and hence the better the performance, but also the more wasted space.
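
And a possible shape for that resize step, written as another method that could live in the BucketSketch class from the previous sketch (the 0.75 threshold mirrors a commonly used load factor, but the exact value here is only illustrative):

```java
// Double the bucket array once it gets "too full" and re-bucket every
// entry, because each key's index depends on the current array length.
static <K, V> Entry<K, V>[] resizeIfNeeded(Entry<K, V>[] buckets, int size) {
    if (size <= buckets.length * 0.75) {
        return buckets;                              // still under the threshold
    }
    @SuppressWarnings("unchecked")
    Entry<K, V>[] bigger = new Entry[buckets.length * 2];
    for (Entry<K, V> head : buckets) {
        for (Entry<K, V> e = head; e != null; e = e.next) {
            int i = Math.floorMod(e.key.hashCode(), bigger.length);
            bigger[i] = new Entry<>(e.key, e.value, bigger[i]);   // prepend to new chain
        }
    }
    return bigger;
}
```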

So finding an item is O(n) in the worst case, where all items happen to end up in the same bucket, but O(1) in the common case, given a well-behaved hash function. (Since anybody can override hashCode(), this is of course not guaranteed: if you write int hashCode() { return 17; }, you get worst-case performance consistently.) And if the number of items grows larger than the number of possible hash codes, the buckets themselves keep growing and again you get O(n) lookup. On 32-bit systems you would run out of memory before this ever happened, but with 64-bit memory it could theoretically be an issue.
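
The degenerate hashCode() mentioned above is easy to demonstrate; in this throwaway example (the class names are invented), every key collides into the same bucket:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the degenerate case: a key type whose hashCode() is
// constant, so every instance lands in the same bucket.
public class WorstCaseDemo {

    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }

        @Override
        public int hashCode() { return 17; }          // every key collides

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
    }

    public static void main(String[] args) {
        Map<BadKey, String> map = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            map.put(new BadKey(i), "value-" + i);
        }
        // The hash no longer narrows the search: containsKey has to compare
        // against keys in one giant bucket. (Recent JDKs soften the blow by
        // turning oversized buckets into balanced trees, but the O(1) average
        // case is gone either way.)
        System.out.println(map.containsKey(new BadKey(99_999)));
    }
}
```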

Adding an item is also O(1) in the common case, but O(n) if the add triggers a resize of the array. However, the aggregate cost of the resize operations is predictable and proportional to the number of items, so the amortized cost of adds is still O(1). This is not the case for the worst case with lookups: if we are unlucky and all items end up in the same bucket, every lookup will have worst-case performance, and there is no way to amortize this cost.

Of course both the worst case and the common or average case may be relevant. In a real-time system, it is pretty important to know the worst-case performance of an operation. For most business applications, the average case is the more important metric.

JacquesB
  • 57,310
  • 21
  • 127
  • 176
1

When talking about measurements (even abstract measurements such as "algorithmic complexity") you should always specify exactly what you are measuring; otherwise, what you say is completely meaningless.

In this particular case, you are simply saying "hash tables are O(1)", but you are not saying what exactly you are measuring.

In particular, accessing a value by key in a (properly designed) hash table has

  • worst-case step complexity of O(n) (or more precisely, worst-case step complexity of whatever data structure is used for the "bucket", which usually is a simple linked list)
  • amortized worst-case step complexity of O(1)

In other words, your whole confusion is due to the fact that you are talking about the worst-case and the others are talking about the amortized worst-case, but except for @Kilian Foth nobody bothered to mention that.

The argument is similar to the one for why adding an element to a dynamically-sized array is O(n) worst-case and O(1) amortized worst-case. @JacquesB explains how this amortization works for hash tables.
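
For comparison, here is a minimal sketch of that dynamically-sized array argument (illustrative, not the java.util.ArrayList internals): most adds are a single store, the occasional add pays O(n) to copy into a doubled array, and the doubling schedule keeps the total copying work proportional to the number of adds.

```java
import java.util.Arrays;

// Minimal growable array: most adds are a single store, an occasional add
// pays O(n) to copy into a doubled array, and the doubling keeps the total
// copy cost proportional to the number of adds (amortized O(1) per add).
final class GrowableIntArray {
    private int[] data = new int[4];
    private int size = 0;

    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2);   // the rare O(n) step
        }
        data[size++] = value;                              // the usual O(1) step
    }

    int get(int index) {
        return data[index];
    }
}
```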

Jörg W Mittag
  • 101,921
  • 24
  • 218
  • 318
  • 1
    "amortized worst-case step complexity of O(1)" No. The amortized worst-case step complexity for lookup for hashed data structures in Java is O(n). – user102008 Aug 05 '15 at 08:36
  • 1
    "@JacquesB explains how this amortization works for hash tables." No he doesn't. – user102008 Aug 05 '15 at 08:38
  • No matter what data structure you are looking at, amortization cannot affect the complexity of lookup because lookup is a non-mutating operation, so one lookup should not have any effect on the complexity of a future lookup. Therefore, the fact that you have different complexities for amortized and non-amortized for the same case indicates that something is definitely wrong. – user102008 Aug 06 '15 at 23:03