I've been doing a bit of research on the subject. I know unordered_sets are hash tables, where the key and value are one and the same. What I'd like to know is how the compiler figures out where in the table each object belongs.
-
By hashing the key/value. I believe the precise mechanism of mapping hash to bucket is implementation-dependent but I haven't looked into it. What are you trying to accomplish? – Lightness Races in Orbit Jan 08 '16 at 20:09
-
3It sounds like you're asking how hash tables work in general, rather than unordered_sets specifically. Is that the case? Also, "the key and the value are one and the same" doesn't sound right; what do you mean by that? – Ixrec Jan 08 '16 at 20:12
-
@moonman239 that sort of question may be best suited to the relevant GCC mailing list then - could try to decode [the libstdc++ `unordered_set` source](https://gcc.gnu.org/onlinedocs/gcc-4.6.3/libstdc++/api/a01107_source.html) (really, [this](https://gcc.gnu.org/onlinedocs/gcc-4.6.3/libstdc++/api/a00903_source.html)) too – Lightness Races in Orbit Jan 08 '16 at 20:25
-
@LightnessRacesinOrbit I've changed my mind - I actually use Clang in my day-to-day work. – moonman239 Jan 08 '16 at 20:30
-
Heh, well, same story for libc++ – Lightness Races in Orbit Jan 08 '16 at 20:38
-
C++ is just a specification. It specifies the requirements that library functions and types must have (including time complexity of operations). C++ library implementations (of which there are many) are free to implement it however they like as long as it satisfies the requirements in the language specification. The programmer generally shouldn't rely on any details of the implementation except those required by the language. If you are asking how a particular implementation implements it, or how you might implement it yourself, you should say so in your question. – user102008 Jan 09 '16 at 01:49
1 Answers
An unordered_set
is specified quite a bit more "tightly" than most people seem to realize, so it's difficult (if possible at all) to implement it as anything except a hash table that resolves hash collisions by chaining.
In particular, the standard requires that the hash table consist of buckets, that each bucket contain items that hashed to the same value, that the size of a bucket be obtainable with constant complexity, and that traversing the items in a bucket be possible with linear complexity.
Those requirements are trivial to meet if you use collision chaining--and difficult or impossible to meet otherwise.
As such, the basic data structure would typically look something like this (ignoring some extra template parameters we don't care about right now):
template <class T>
class unordered_set {
std::vector<std::vector<T> > data;
// ...
The insert
member might look vaguely like this:
pair<iterator, bool>
insert(T t) {
auto raw_pos = hash(t); // hash the input
auto pos = raw_pos % data.size(); // reduce the range to that of the table
auto &bucket = data[pos];
// try to find existing item:
auto existing = std::find(bucket.begin(), bucket.end(), t);
// if there was no existing item, add the new one
if (existing == bucket.end()) {
bucket.push_back(t);
return {bucket.back(), true};
}
// if there was an existing item, return iterator and false to indicate we didn't add
return {existing, false};
}
Note that this definitely isn't an attempt at anything similar to production code--just a general sketch of basic logic. And when I say "basic", I mean it--a real implementation needs considerably more logic, such as to re-size the table in case the new addition increases the load factor beyond the specified limit. In fact, some of the code isn't just simplified, it's technically wrong as it is right now (but fixing it would make the code more complex without really adding much).

- 44,385
- 5
- 89
- 162
-
This is a very nice explanation. My only remark is that the iterator you return doesn't work for traversing the entire set but rather a single bucket. And `bucket.back()` gives you a reference and no iterator at all. I don't know how to fix that without making the answer needlessly complicated, though. Maybe just say it. – 5gon12eder Jan 09 '16 at 04:20
-