22

Let's have this C# class (it would be almost the same in Java)

public class MyClass {
   public string A {get; set;}
   public string B {get; set;}

   public override bool Equals(object obj) {
        var item = obj as MyClass;

        if (item == null || this.A == null || item.A == null)
        {
            return false;
        }
        return this.A.equals(item.A);
   }

   public override int GetHashCode() {
        return A != null ? A.GetHashCode() : 0;
   }
}

As you can see, equality of two instances of MyClass depends on A only. So there can be two instances that are equal, but holding different piece of information in their B property.

In a standard collection library of many languages (including C# and Java, of course) there is a Set (HashSet in C#), which a collection, which can hold at most one item from each set of equal instances.

One can add items, remove items and check if the set contains an item. But why is it impossible to get a particular item from the set?

HashSet<MyClass> mset = new HashSet<MyClass>();
mset.Add(new MyClass {A = "Hello", B = "Bye"});

//I can do this
if (mset.Contains(new MyClass {A = "Hello", B = "See you"})) {
    //something
}

//But I cannot do this, because Get does not exist!!!
MyClass item = mset.Get(new MyClass {A = "Hello", B = "See you"});
Console.WriteLine(item.B); //should print Bye

The only way to retrieve my item is to iterate over the whole collection and check all items for equality. However, this takes O(n) time instead of O(1)!

I haven't found any language that supports get from a set so far. All "common" languages I know (Java, C#, Python, Scala, Haskell...) seem to be designed in the same way: you can add items, but you cannot retrieve them. Is there any good reason why all these languages do not support something that easy and obviously useful? They cannot be just all wrong, right? Are there any languages that do support it? Maybe retreiving a particular item from a set is wrong, but why?


There are a few related SO questions:

https://stackoverflow.com/questions/7283338/getting-an-element-from-a-set

https://stackoverflow.com/questions/7760364/how-to-retrieve-actual-item-from-hashsett

vojta
  • 338
  • 1
  • 2
  • 7
  • I have run up against the same thing in C# .NET and I don't think there is a way around it. The just don't expect it to be used that way. – paparazzo Nov 01 '16 at 08:15
  • @Paparazzi Well, it could be added so easily. They could have done it ages ago, at least in one language. There must be a good reason why they did not. Maybe one should use another collection in these situations...? – vojta Nov 01 '16 at 08:26
  • The *could* do a lot of things. What Microsoft does is out of our control. You could probably access the source code and add it. I would be happy if you could just enumerate the bucket. – paparazzo Nov 01 '16 at 08:39
  • 12
    C++ `std::set` supports object retrieval, so not all "common" languages are like you describe. – Angew is no longer proud of SO Nov 01 '16 at 08:47
  • 17
    If you claim (and code) that "equality of two instances of MyClass depends on A only" then another instance that has the same A value and different B effectively *is* "that particular instance", since you yourself defined that they're equal and the differences in B don't matter; the container is "allowed" to return the other instance since it's equal. – Peteris Nov 01 '16 at 10:46
  • 6
    That's like asking why you can't use a screwdriver to put a nail into the wall. It's a different tool for a different task. – Roman Reiner Nov 01 '16 at 10:59
  • @RomanReiner Quote (112 upvotes) by KyleM from SO: _There absolutely could be a point to getting the element. What if you wish to update some of the element's values after it has already been added to the Set? For example, when .equals() does not use all of the fields, as the OP specified. A less efficient solution would be to remove the element and re-add it with its values updated._ – vojta Nov 01 '16 at 11:04
  • 1
    @RomanReiner See http://stackoverflow.com/a/18380755/3899583 – vojta Nov 01 '16 at 11:07
  • Objective-C `NSSet` has it as well... – vojta Nov 01 '16 at 12:45
  • 7
    True story: in Java, many `Set` implementations are just `Map` on the inside. – corsiKa Nov 01 '16 at 16:51
  • 10
    *speaking to person A*: "Hi, can you bring Person A right here please" – Bradley Thomas Nov 01 '16 at 17:27
  • @BradThomas Eh, not really... There is some difference between identity and equality you might have overlooked... – vojta Nov 01 '16 at 18:46
  • 5
    @vojta, but equality doesn't get "a particular item", which is what you said you wanted... it gets ANY item that would happen to match the equality criterion. Sure that's just one item in a set. But then imagine you had two different sets. You could ask for the same item from each, and end up getting two different items back. – Bradley Thomas Nov 01 '16 at 19:21
  • 1
    Sets are for answering questions of membership, or intersection, or union, etc. If you want to do something off the beaten path, roll your own data structure! Build on top of HashTable and it should be pretty easy - write a Set-like wrapper with the additional Get function you need. I once needed a HashTable that also supported indexes, so I implemented a HashTable with nodes that were doubly linked in a list. Rather than complain about how the developers didn't forsee your arcane use-case, roll it yourself and learn something new! – Blackhawk Nov 01 '16 at 20:43
  • (see the Wikipedia entry on [Sets](https://en.wikipedia.org/wiki/Set_(mathematics)) for information on the kinds of things Sets are useful for) – Blackhawk Nov 01 '16 at 20:54
  • 7
    This breaks reflexivity (`a == b` always true) in case `this.A == null`. The `if (item == null || this.A == null || item.A == null)` test is "overdone" and checks to much, possibly in order to create artificially "high-quality" code. I see this kind of "overchecking" and being overly correct all the time on Code Review. – usr Nov 01 '16 at 21:37
  • 2
    General side note: Equality (and HashCode) in a Set should usually not depend on mutable state. What do you expect to happen when `A` gets changed after insertion? – Hulk Nov 02 '16 at 12:59
  • Agree with @usr (except I would describe reflexivity as "(`x.Equals(x)` always true)", or similar). Since the `.A` property has compile-time type `string`, the end of the method should just be `return this.A == item.A;` which handles null strings well. – Jeppe Stig Nielsen Nov 02 '16 at 13:38
  • To add this method to the [`HashSet<>` source code](https://referencesource.microsoft.com/#System.Core/System/Collections/Generic/HashSet.cs), you would simply do the same as in the existing `Contains` method, except where it says `return true;` you would just `return m_slots[i].value;`. – Jeppe Stig Nielsen Nov 02 '16 at 14:23
  • If your object consists of a `Key` part and a `Value` part, you should consider using `KeyValuePair`, and your concept of equality and presence (contains) will need to specify which part to apply. For example, a `Dictionary` will have `ContainsKey` and `Contains`. If you generalize the concept to a table-like container that has two columns (first column: A, second column: B), you could potentially have a `ContainsValue` lookup method as well. – rwong Nov 02 '16 at 18:08
  • Not part of a language, but available in a library. http://stackoverflow.com/questions/12670292/get-an-item-from-a-java-set – Donald Raab May 15 '17 at 03:18

7 Answers7

66

The problem here isn't that HashSet lacks a Get method, it's that your code makes no sense from the perspective of the HashSet type.

That Get method is effectively, "get me this value , please", to which the .NET framework folk would sensibly reply, "eh? You already have that value <confused face />".

If you want to store items and then retrieve them based on matching another slightly different value, then use Dictionary<String, MyClass> as you can then do:

var mset = new Dictionary<String, MyClass>();
mset.Add("Hello", new MyClass {A = "Hello", B = "Bye"});

var item = mset["Hello"];
Console.WriteLine(item.B); // will print Bye

The information of equality leaks from the encapsulated class. If I wanted to change the set of properties involved in Equals, I would have to change the code outside MyClass...

Well yes, but that's because MyClass runs amok with the principle of least astonishment (POLA). With that equality functionality encapsulated, it is completely reasonable to assume that the following code is valid:

HashSet<MyClass> mset = new HashSet<MyClass>();
mset.Add(new MyClass {A = "Hello", B = "Bye"});

if (mset.Contains(new MyClass {A = "Hello", B = "See you"})) 
{
    // this code is unreachable.
}

To prevent this, MyClass needs to be clearly documented as to its odd form of equality. Having done that, it's no longer encapsulated and changing how that equality works would break the open/closed principle. Ergo, it shouldn't change and therefore Dictionary<String, MyClass> is a good solution for this odd requirement.

Robert Harvey
  • 198,589
  • 55
  • 464
  • 673
David Arno
  • 38,972
  • 9
  • 88
  • 121
  • IMO this is not correct from the OOP point of view. The information of equality leaks from the encapsulated class. If I wanted to change the set of properties involved in `Equals`, I would have to change the code outside `MyClass`: the dictionary declaration etc. – vojta Nov 01 '16 at 08:54
  • 2
    @vojta, In that case, use `Dictionary` as it will then fetch the value based on a key that uses `MyClass.Equals`. – David Arno Nov 01 '16 at 08:58
  • Yes, that is the solution proposed in related SO answers. – vojta Nov 01 '16 at 09:00
  • @vojta, I've updated my answer to better explain why the problem lies with the code, rather than the lack of `HashSet.Get`. – David Arno Nov 01 '16 at 09:27
  • 8
    I would use a `Dictionary` supplied with an appropriate `IEqualityComparer`, and pull the equivalence relation out of `MyClass` Why does `MyClass` have to know about this relation over its instances? – Caleth Nov 01 '16 at 09:35
  • @Caleth, that is indeed an excellent solution. You should post that as an answer, for - IMO at least - it is probably the best C# solution to this requirement. – David Arno Nov 01 '16 at 09:43
  • Thank you for the edit, I see what you mean, but have a look at this well accepted SO answer: http://stackoverflow.com/a/18380755/3899583, please. I think your answer and that answer are just opposite. – vojta Nov 01 '16 at 12:30
  • 16
    @vojta and the comment there: "*meh. Overriding the implementation of equals so that non-equal objects are "equal" is the problem here. Asking for a method that says "get me the the identical object to this object", and then expect a non-identical object to be returned seems crazy and easy to cause maintenance problems*" is spot on. That's often the problem with SO: seriously flawed answers get upvoted by folk who haven't thought through the implicants of their desire for a quick fix to their broken code... – David Arno Nov 01 '16 at 12:35
  • 6
    @DavidArno: kind of inevitable though for as long as we persist in using languages that distinguish between equality and identity ;-) If you want to canonicalise objects that are equal but not identical then you need a method that says not "get me the identical object to this object", but "get me the canonical object that is equal to this object". Anyone who thinks that HashSet.Get in these languages necessarily would mean "get me the identical object" is already gravely in error. – Steve Jessop Nov 01 '16 at 13:04
  • 4
    This answer has many blanket statements such as `...reasonable to assume...`. All of this might be true in 99% of the cases but still the ability to retrieve an item from a set can come handy. Real world code cannot always adhere to POLA etc. principles. For example, if you are deduplicating strings case-insensitively you might want to get the "master" item. `Dictionary` is a workaround, but it costs perf. – usr Nov 01 '16 at 21:35
  • 2
    +1 for `` since not everyone is able to view `¯\_(ツ)_/¯` properly. – Num Lock Nov 02 '16 at 12:06
24

You already have the item that is "in" the set - you passed it as the key.

"But it isn't the instance that I called Add with" - Yes, but you specifically claimed that they were equal.

A Set is also a special case of a Map|Dictionary, with void as the value type (well the useless methods are not defined, but that doesn't matter).

The data structure you are looking for is a Dictionary<X, MyClass> where X somehow gets the As out of the MyClasses.

The C# Dictionary type is nice in this regard, as it allows you to supply an IEqualityComparer for the keys.

For the example given, I would have the following:

public class MyClass {
   public string A {get; set;}
   public string B {get; set;}
}

public class MyClassEquivalentAs : IEqualityComparer<MyClass>{
   public override bool Equals(MyClass left, MyClass right) {
        if (Object.ReferenceEquals(left, null) && Object.ReferenceEquals(right, null))
        {
            return true;
        }
        else if (Object.ReferenceEquals(left, null) || Object.ReferenceEquals(right, null))
        {
            return false;
        }
        return left.A == right.A;
   }

   public override int GetHashCode(MyClass obj) {
        return obj?.A != null ? obj.A.GetHashCode() : 0;
   }
}

Used thusly:

var mset = new Dictionary<MyClass, MyClass>(new MyClassEquivalentAs());
var bye = new MyClass {A = "Hello", B = "Bye"};
var seeyou = new MyClass {A = "Hello", B = "See you"};
mset.Add(bye);

if (mset.Contains(seeyou)) {
    //something
}

MyClass item = mset[seeyou];
Console.WriteLine(item.B); // prints Bye
svick
  • 9,999
  • 1
  • 37
  • 51
Caleth
  • 10,519
  • 2
  • 23
  • 35
  • There are a number of situations where it may be advantageous for code which has an object matching the key, to replace it with a reference to the object used as the key. For example, if many strings are known to match a string in a hashed collection, replacing references to all those strings with references to the one in the collection could be a performance win. – supercat Nov 01 '16 at 21:09
  • @supercat today that is achieved with a `Dictionary`. – MikeFHay Nov 01 '16 at 23:30
  • @MikeFHay: Yeah, but it does seem a little inelegant to have to store each string reference twice. – supercat Nov 01 '16 at 23:41
  • 2
    @supercat If you mean an *identical* string, that's just string interning. Use the built in stuff. If you mean some kind of "canonical" representation (one that *cannot* be achieved using simple case changing techniques, etc.), that sounds like you basically need an index (in the sense DBs use the term). I don't see a problem with storing each "non-canonical form" as a key that maps to a canonical form. (I think this applies equally well if the "canonical" form is not a string.) If this isn't what you're talking about, then you've completely lost me. – jpmc26 Nov 02 '16 at 02:41
  • 1
    Custom `Comparer` and `Dictionary` is a pragmatic solution. In Java, the same can be achieved by `TreeSet` or `TreeMap` plus custom `Comparator`. – Markus Kull Nov 02 '16 at 12:00
  • @jpmc26: Built-in string-interning mechanisms keeps a string around *forever*. Further, the kinds of situations which would make it useful to consolidate duplicate references to string objects are even more useful with some other kinds of immutable objects, but I used strings as an example because everyone knows what they are. – supercat Nov 02 '16 at 15:22
  • @supercat Is this an example of what you mean: `string someValue = getFromExternal(); someValue = Collection.getOrInsert(someValue);` ? – Caleth Nov 02 '16 at 15:51
19

Your problem is that you have two contradictory concepts of equality:

  • actual equality, where all fields are equal
  • set membership equality, where only A is equal

If you would use the actual equality relation in your set, the problem of retrieving a particular item from the set doesn't arise – to check whether an object is in the set, you already have that object. It is therefore never necessary to retrieve a particular instance from a set, assuming you are using the correct equality relation.

We could also argue that a set is an abstract data type that is defined purely by the S contains x or x is-element-of S relation (“characteristic function”). If you want other operations, you're not actually looking for a set.

What happens quite often – but what is not a set – is that we group all objects into distinct equivalence classes. The objects in each such class or subset are only equivalent, not equal. We can represent each equivalence class through any member of that subset, and it then becomes desirable to retrieve that representing element. This would be a mapping from the equivalence class to the representative element.

In C#, a dictionary can use an explicit equality relation, I think. Otherwise, such a relation can be implemented by writing a quick wrapper class. Pseudocode:

// The type you actually want to store
class MyClass { ... }

// A equivalence class of MyClass objects,
// with regards to a particular equivalence relation.
// This relation is implemented in EquivalenceClass.Equals()
class EquivalenceClass {
  public MyClass instance { get; }
  public override bool Equals(object o) { ... } // compare instance.A
  public override int GetHashCode() { ... } // hash instance.A
  public static EquivalenceClass of(MyClass o) { return new EquivalenceClass { instance = o }; }
}

// The set-like object mapping equivalence classes
// to a particular representing element.
class EquivalenceHashSet {
  private Dictionary<EquivalenceClass, MyClass> dict = ...;
  public void Add(MyClass o) { dict.Add(EquivalenceClass.of(o), o)}
  public bool Contains(MyClass o) { return dict.Contains(EquivalenceClass.of(o)); }
  public MyClass Get(MyClass o) { return dict.Get(EquivalenceClass.of(o)); }
}
amon
  • 132,749
  • 27
  • 279
  • 375
  • "retrieve a particular instance from a set" I think this would convey what you mean more directly if you changed "instance" to "member." Just a minor suggestion. =) +1 – jpmc26 Nov 02 '16 at 02:47
7

But why is it impossible to get a particular item from the set?

Because that's not what sets are for.

Let me rephrase the example.

"I have a HashSet that I want store MyClass objects in and I want to be able to get them by using the property A that equals the object's property A".

If replace "HashSet" with "Collection", "objects" with "Values" and "property A" with "Key", the sentence becomes:

"I have a Collection that I want to store MyClass Values in and I want to be able to get them by using the Key that equals the object's Key".

What's being described is a Dictionary. The actual question being asked is "Why can't I treat HashSet as a Dictionary?"

The answer is that they are not used for the same thing. The reason to use a set is to guarantee the uniqueness of it's individual contents, otherwise you could just a use a List or an array. The behavior being described in the question is what a Dictionary is for. All the language designers didn't screw up. They don't provide a get method because if you have the object and it's in the set, they are equivalent, which means you would be "getting" an equivalent object. Arguing that HashSet should be implemented in such a manner that you can "get" non-equivalent objects that you've defined to be equal is a non-starter when the languages provide other data structures that allow you to do that.

A note on the OOP and equality comments/answers. It's okay to have the key of the mapping be a property/member of the stored value in a Dictionary. For example: having a Guid as the key and also the property that is used for the equals method is perfectly reasonable. What's not reasonable is having different values for the rest of the properties. I find that if I'm heading in that direction, I probably need to rethink my class structure.

Old Fat Ned
  • 147
  • 3
6

As soon as you override equals you better override hashcode. As soon as you have done this your "instance" should never change internal state again.

If you do not override equals and hashcode VM object identity is used to determine equality. If you put this object into a Set you are able to find it again.

Changing a value of an object that is used to determine equality will lead to untraceability of this object in hash based structures.

So a Setter on A is dangerous.

Now you do not have B that does not participate in equality. The problem here is semantically not technically. Because technically changing B is neutral to the fact of equality. Semantically B has to be something like a "version" flag.

The point is:

If you have two objects that equal A but not B you have a assumption that one of these objects is newer than the other one. If B has no version information this assumption is hidden in your algorithm WHEN you decide to "overwrite/update" this object in a Set. This source code location where this happens may not be obvious so a developer will have a hard time to identify the relation between object X and object Y that differs from X in B.

If B has version information you expose the assumption that previously was only implicitly derivable from the code. Now you can see, that object Y is a newer version of X.

Think about yourself: Your identity remains your whole life, maybe some properties change (e.g. color of your hair ;-)). Sure you can assume that if you have two photos, one with brown hair one with grey hair, that you may be younger on the photo with brown hair. But maybe you have colored your hair? The problem is: YOU may know that you colored your hair. May others? To put this into a valid context you have to introduce the property age (version). Then you you are semantically explicit and unambigious.

To avoid the hidden operation "replacing old against new object" a Set should not have a get-Method. If you want a behaviour like this, you have to make it explicit by removing the old object and adding the new object.

BTW: What should it mean if you pass in an object that equals the object you want to get? That does not make sense. Keep your semantics clean and don't do this although technically nobody will hinder you.

oopexpert
  • 769
  • 4
  • 7
  • 8
    "As soon as you override equals you better override hashcode. As soon as you have done this your "instance" should never change internal state again." That statement is worth +100, right there. – David Arno Nov 02 '16 at 08:41
  • +1 for pointing out the dangers of equality and hashcode depending on mutable state – Hulk Nov 02 '16 at 14:05
3

Specifically in Java, HashSet was initially implemented using a HashMap anyway, and just ignoring the value. So the initial design didn't anticipate any advantage in providing a get method to HashSet. If you want to store and retrieve a canonical value among various objects that are equal, then you just use a HashMap yourself.

I haven't kept up to date with such implementation details, so I can't say whether this reasoning still applies in full in Java, let alone in C# etc. But even if HashSet were reimplemented to use less memory than HashMap, in any case it would be a breaking change to add a new method to the Set interface. So it's quite a lot of pain for a gain not everyone sees as worth having.

Steve Jessop
  • 5,051
  • 20
  • 23
  • Well, in Java it *could* be possible to provide a `default`-implementation to do this in a non-breaking way. It just doesn't seem a terribly useful change. – Hulk Nov 02 '16 at 14:11
  • @Hulk: I may be wrong, but I think any default implementation would be heinously inefficient, since as the questioner says, "The only way to retrieve my item is to iterate over the whole collection and check all items for equality". So good point, you can do it in a backward-compatible way, but adding a gotcha that the resulting get function only guarantees to run in `O(n)` comparisons even if the hash function is giving good distribution. Then implementations of `Set` that override the default implementation in the interface, including `HashSet`, could give a better guarantee. – Steve Jessop Nov 02 '16 at 19:30
  • Agreed - I don't think it would be a good idea. There would be precedences for this kind of behavior though - [List.get(int index)](https://docs.oracle.com/javase/8/docs/api/java/util/List.html#get-int-) or - to pick a default implementation added recently [List.sort](https://docs.oracle.com/javase/8/docs/api/java/util/List.html#sort-java.util.Comparator-). Maximum complexity guarantees are provided by the interface, but some implementations may do a lot better than others. – Hulk Nov 03 '16 at 09:21
2

There is a major language whose set has the property you want.

In C++, std::set is an ordered set. It has a .find method that looks for the element based on the ordering operator < or binary bool(T,T) function you provide. You can use find to implement the get operation you want.

In fact, if the bool(T,T) function you provide has a specific flag on it (is_transparent), you can pass in objects of a different type for which the function has overloads for. That means you do not have to stick the "dummy" data intomtye second field, just ensure the ordering operation you use can order between the lookup and set-contained types.

This permits an efficient:

std::set< std::string, my_string_compare > strings;
strings.find( 7 );

where my_string_compare understands how to order integers and strings without first converting the integer to a string (at a potential cost).

For unordered_set (the hash set of C++), there is no equivalent transparent flag (yet). You must pass in a T to an unordered_set<T>.find method. It could be added, but hashes require == and a hasher, unlike ordered sets which just require an ordering.

The general pattern is that the container will do the lookup, then give you an "iterator" to that element within the container. At which point you can get the element within the set, or delete it, etc.

In short, not all languages' standard containers have the flaws you describe. The C++ standard library's iterator-based containers do not, and at least some of the containers existed before any of the other languages you described, and the ability to do a get even more efficiently than how you describe has even been added. There is nothing wrong with your design, or wanting that operation; the designers of the Sets you are using simply did not provide that interface.

C++ standard containers where designed to cleanly wrap the low-level operations of the equivalent hand-rolled C code, which was designed to match how you might write it efficiently in assembly. Its iterators are an abstraction of C-style pointers. The languages you mention have all moved away from pointers as a concept, so they did not use the iterator abstraction.

It is possible that the fact that C++ does not have this flaw is an accident of design. The iterator-centric path means that to interact with an item in an associative container you first get an iterator to the element, then you use that iterator to talk about the entry in the container.

The cost is that there are iteration invalidation rules that you have to track, and some operations require 2 steps instead of one (which makes client code noisier). The benefit is that the robust abstraction permits more advanced use than the ones the API designers had in mind originally.

Yakk
  • 2,121
  • 11
  • 10