9

I'm tempted to create a final class CaseInsensitiveString implements CharSequence.

This would allow us to define variables and fields of this type, instead of using a regular String. We can also have e.g. a Map<CaseInsensitiveString, ?>, a Set<CaseInsensitiveString>, etc.

What are some of the pros and cons of this approach?

gnat

6 Answers

27

Case insensitivity is a property of the comparison, not of the object (*). Depending on the context, you will want to compare the same string either case-sensitively or case-insensitively.

(And you open a whole can of worms: what counts as a case-insensitive comparison depends on the language, since i is upper-cased as İ in Turkish, and even on the context, since ß can be upper-cased as SS or SZ in German depending on the word and the dialect.)
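To make the locale point concrete, here is a small Java sketch. The class name `LocaleCaseDemo` and the helper `upper` are made up for illustration; the only library calls are `String.toUpperCase(Locale)` and `String.toLowerCase(Locale)`.

```java
import java.util.Locale;

class LocaleCaseDemo {
    // Upper-cases s under the casing rules of the given locale.
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Turkish rules: "i" upper-cases to dotted capital I (U+0130).
        System.out.println(upper("i", turkish).equals("\u0130"));          // true
        // Locale-neutral rules: "i" upper-cases to plain "I".
        System.out.println(upper("i", Locale.ROOT).equals("I"));           // true
        // Java upper-cases sharp s (U+00DF) to "SS", so the length changes.
        System.out.println(upper("\u00DF", Locale.ROOT));                  // SS
        // Lower-casing under Turkish rules yields a dotless i, so this differs.
        System.out.println("TITLE".toLowerCase(turkish).equals("title"));  // false
    }
}
```

So a single "ignore case" equality baked into a string type would have to pick one of these rule sets for every caller.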

(*) It can be a property of the object containing the string, but that is somewhat different from being a property of the string itself. You can have a class that has no state except a string, and comparing two instances of that class will use a case-insensitive comparison of the string. But that class won't be a general-purpose string: it won't provide methods expected of a general-purpose string, and it will provide methods that a general-purpose string doesn't have. This class won't be called CaseInsensitiveString but PascalIdentifier, or whatever describes it properly. And BTW, the case-independent comparison algorithm will most probably be dictated by the class's purpose and be locale-independent.
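A minimal sketch of what such a purpose-specific class might look like in Java. Everything here is an assumption for illustration: the fields, the choice to fold with `Locale.ROOT`, and the premise that identifiers are ASCII-only.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical value type: an identifier in a language whose names ignore case.
final class PascalIdentifier {
    private final String text;  // spelling as written, kept for display
    private final String key;   // folded form used for equality and hashing

    PascalIdentifier(String text) {
        this.text = text;
        // Assumption: identifiers are ASCII, so locale-neutral folding suffices.
        this.key = text.toUpperCase(Locale.ROOT);
    }

    @Override
    public boolean equals(Object o) {
        // Compares only with its own type, never with a plain String.
        return o instanceof PascalIdentifier
                && key.equals(((PascalIdentifier) o).key);
    }

    @Override
    public int hashCode() {
        return key.hashCode();  // derived from the same folded key as equals
    }

    @Override
    public String toString() {
        return text;            // original spelling is preserved
    }

    public static void main(String[] args) {
        Set<PascalIdentifier> declared = new HashSet<>();
        declared.add(new PascalIdentifier("WriteLn"));
        System.out.println(declared.contains(new PascalIdentifier("WRITELN"))); // true
        System.out.println(declared.iterator().next());                        // WriteLn
    }
}
```

Because `equals` and `hashCode` both derive from the folded key, the type works in a `HashSet` or as a `HashMap` key while still printing the spelling the user typed.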

AProgrammer
  • 1
    So would you recommend a `TreeSet` using `String.CASE_INSENSITIVE_ORDER` over a `HashSet`? Note that using `TreeSet` means `O(log n)` for `contains`. Moreover, this comparator is _inconsistent_ with `equals`, meaning that the resulting `TreeSet` doesn't obey the general `Set` contract (i.e. it may `contains(x)`, even though it has no element that is `equals` to `x`). – polygenelubricants Jul 06 '11 at 10:53
  • Since the mid-90s, the generic hash tables I've designed take both a hash function and an equality function as generic parameters, with a default deduced from the key type. (If that isn't the case for those provided by the Java library, I'll risk the explanation that they were designed by someone more familiar with OO programming than with generic programming. Tying the type strongly to those operations is something you need to do in OOP, but it is a code smell in GP.) – AProgrammer Jul 06 '11 at 11:42
  • @AProgrammer The Java collections use the `equals()` implementation on each object. There is a default implementation, which any object can override. I don't think you can define the hash, but then I've never tried to - the tables always worked fine without worrying about it (one reason I like Java over C++ :)). – Michael K Jul 06 '11 at 13:06
  • @Michael, ISTR that you can override hashCode (except perhaps in some specific cases; my knowledge of Java is partial and outdated). Using the comparison and hashing provided by the class as the default is good (even more so if the language can supply a suitable default in the class), but that doesn't change the fact that preventing the use of anything else is bad style when you are designing generic components. – AProgrammer Jul 06 '11 at 15:56
  • @AProgrammer Ah yes, forgot that hashtables and the like use that :) However, I'm not sure I understand what you mean by "preventing to use something else." Those methods are in Object; every class has them. Is that not generic enough? (Of course, if OOP and first class functions weren't supposedly "mutually exclusive" this wouldn't be an issue...) – Michael K Jul 06 '11 at 16:00
  • @Michael, You may want to change the equality for some use (just look at the OP question for an example), and when you do that, you have to change the hash as well. – AProgrammer Jul 06 '11 at 16:32
  • +1 for the first sentence. – back2dos Jul 10 '11 at 10:45
  • 1
    @AProgrammer - I disagree with "Case insensitivity is a property of the comparison, not of the object", and with the "maybe the object but not the string" proviso. This may describe how things are, but the question is about a proposed *change* to how things are. In modulo 3 arithmetic, 2 is shorthand for { ..., -4, -1, 2, 5, 8, 11, ... }. The notation represents an abstraction, but isn't the same thing as the abstraction. Why can't 'H' represent the abstraction { 'h', 'H' }? Characters don't exist in the computer's memory at all - whether a code represents 'H' or { 'h', 'H' }, it's an abstraction. –  Jul 10 '11 at 22:17
  • 1
    @AProgrammer - on the second paragraph, I probably agree though. At the very least, it would imply English case-insensitive strings, Turkish case-insensitive strings, etc etc. A class with subclasses or an i18n option, IOW. And then you get the double dispatch issue (how to compare two case-insensitive strings with different language options). I guess that's back to "property of the comparison". Damn! –  Jul 10 '11 at 22:26
  • I don't buy this "depends on the context" pov. You can say that there's no point of using generics for type safety by declaring a `List` or a `List`, because the type of objects that the list contains depends on how you use it, and not the property of the list itself (which of course, is not a very defensible claim). Besides, arguing that it depends on the context actually makes the case stronger for making this property intrinsic and invariant of the type, because it's so much simpler precisely when you DON'T care about the context. – polygenelubricants Jul 17 '11 at 07:01
  • @polygenlubricants, I don't understand your remark. I stated one position on case-insensitive strings in my answer: you won't use a case-insensitive general-purpose string. Either you'll use a general-purpose string and a contextually chosen comparison, or you'll write a far more specialized class that you won't name string (I've expanded the point in the answer). In some comments, I stated that I consider it bad style in generic programming for a generic component that depends on an equivalence relation to require that this equivalence be the equality relation provided by the class. – AProgrammer Jul 19 '11 at 11:47
7

Just off the top of my head:

Pros:

  • Makes a lot of code self-documenting, e.g:
    • bool UserIsRegistered(CaseInsensitiveString Username)
  • May streamline comparisons
  • May remove the potential for comparison bugs

Cons:

  • Might be a waste of time
    • people can just convert regular strings to lowercase if they need case-insensitive comparisons
  • Using it for front-end code will cause capitalization problems
    • For example, if you use CaseInsensitiveString to store a username, even though it makes sense to have case-insensitive back-end comparisons, the front-end code will display the user's name as "bob smith" or "BOB SMITH"
  • If your code base already uses regular strings, you will have to go back and change them or live with inconsistency
Maxpm
  • 4
    Depending on the implementation, your second "Cons" point doesn't have to be valid - you can implement CaseInsensitiveString to store case-sensitively and merely override the comparison operators. – tdammers Jul 06 '11 at 06:00
  • 1
    @tdammers: if the CaseInsensitiveString is stored with its case and the comparison operator is overridden, it reinforces @AProgrammer's point that the comparison operator could have been decoupled from the string object. – rwong Jul 10 '11 at 12:09
  • 3
    @tdammers - some things already work similarly. Windows filesystems preserve case, for example, but are case insensitive for comparisons. It's not a bad system, but can cause confusion when you want to "rename" something to change the case. Basically, you still sometimes need case-sensitive comparison to avoid making bad judgements about whether a rename is making a genuine change - and if there's one special case, maybe there's others too. –  Jul 10 '11 at 22:23
  • @rwong: I agree. Best thing would be explicit case-insensitive comparisons where needed. However, sometimes you want strings to behave like SQL strings (with a CI collation), and then preserving case on storage but ignoring case on comparison would be the closest match. – tdammers Jul 11 '11 at 05:50
4

CaseInsensitiveString is not a bad idea, depending on your use, as long as you don't expect it to work together with String.

You may convert a CaseInsensitiveString to a String, or vice versa, and that's all you should do.

Problems arise if you try to do something like this:

class CaseInsensitiveString {
  private String value;

  @Override
  public boolean equals(Object o) {
    // .....
    if (o instanceof String) {
      // Problematic: treats a plain String as an equal peer.
      return value.equalsIgnoreCase((String) o);
    }
    return false;
  }
}

You are doomed to fail if you try to make your CaseInsensitiveString cooperate with a normal String, because you will violate the symmetry and transitivity requirements of equals() (and other contracts).
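The broken symmetry is easy to reproduce. The sketch below uses a hypothetical class name, `LooseCaseInsensitiveString`, and fills in the elided parts of the snippet above with guesses, so treat it as an illustration rather than the exact class under discussion.

```java
import java.util.Locale;

// Hypothetical class reproducing the flawed equals() described above.
final class LooseCaseInsensitiveString {
    private final String value;

    LooseCaseInsensitiveString(String value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof LooseCaseInsensitiveString) {
            return value.equalsIgnoreCase(((LooseCaseInsensitiveString) o).value);
        }
        if (o instanceof String) {  // the problematic branch
            return value.equalsIgnoreCase((String) o);
        }
        return false;
    }

    @Override
    public int hashCode() {
        // Assumption: ASCII content, where this folding matches equalsIgnoreCase.
        return value.toLowerCase(Locale.ROOT).hashCode();
    }

    public static void main(String[] args) {
        LooseCaseInsensitiveString cis = new LooseCaseInsensitiveString("Hello");
        String lower = "hello";
        String upper = "HELLO";

        System.out.println(cis.equals(lower));    // true
        System.out.println(lower.equals(cis));    // false: symmetry is broken
        // upper "equals" cis and cis "equals" lower, yet:
        System.out.println(upper.equals(lower));  // false: transitivity is broken
    }
}
```

`String.equals` returns true only for another `String` with the same characters, and `String` is final, so nothing on the wrapper's side can repair the reverse comparison.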

However, ask yourself: in which cases do you really need this CaseInsensitiveString, where String.CASE_INSENSITIVE_ORDER is not suitable? I bet not many. I am sure there are cases that justify having this special class, but ask yourself first.
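For reference, this is roughly how the comparator-based alternative looks with the standard collections. The class and method names (`CaseInsensitiveCollections`, `newNameSet`) and the sample data are invented for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class CaseInsensitiveCollections {
    // A sorted set of plain Strings whose membership test ignores case.
    static Set<String> newNameSet() {
        return new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Set<String> names = newNameSet();
        names.add("Bob Smith");
        System.out.println(names.contains("BOB SMITH"));  // true
        System.out.println(names.add("bob smith"));       // false, already present
        System.out.println(names);                        // [Bob Smith]

        Map<String, Integer> logins = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        logins.merge("Alice", 1, Integer::sum);
        logins.merge("ALICE", 1, Integer::sum);
        System.out.println(logins);                       // {Alice=2}
    }
}
```

The trade-offs raised in the comments still apply: lookups are O(log n) rather than hashed, the comparator is inconsistent with `equals`, and `String.CASE_INSENSITIVE_ORDER` is documented as not taking locale into account.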

Adrian Shum
2

Explicitly creating types in your domain/model is very good practice. As Maxpm said, it is self-documenting. Another big plus: people can't (by accident) pass the wrong input. The only downside is that it may scare off junior (and even some mid-level) programmers.

Ivo Limmen
1

A CaseInsensitiveString class and its helpers add a lot of code, and they will make everything less readable than the String.toLowerCase() method.

CaseInsensitiveString varName1 = new CaseInsensitiveString("HeLLo");
//... a lot of lines here
CaseInsensitiveString varName2 = new CaseInsensitiveString("Hello");
//... a lot of lines here
if (varName1.equals(varName2)) ...

is more complex, less self documenting, and less flexible than

String varName1 = "HeLLo";
//... a lot of lines here
String varName2 = "Hello";
//... a lot of lines here
if (varName1.toLowerCase().equals(varName2.toLowerCase())) ...
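As a side note on the second form, `toLowerCase()` without arguments uses the JVM's default locale, which ties back to the Turkish-i issue raised in another answer. A hedged sketch of two explicit variants follows; the class and method names are made up for illustration.

```java
import java.util.Locale;

class CompareIgnoringCase {
    // The lower-casing approach, with the locale pinned explicitly.
    static boolean sameByLowerCase(String a, String b) {
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
    }

    // The built-in comparison, which avoids creating two temporary strings.
    static boolean sameByEqualsIgnoreCase(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        String varName1 = "HeLLo";
        String varName2 = "Hello";
        System.out.println(sameByLowerCase(varName1, varName2));         // true
        System.out.println(sameByEqualsIgnoreCase(varName1, varName2));  // true
    }
}
```

Either way the comparison stays visible at the call site, which is the readability argument being made here.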
Ando
0

The most widely used formats and languages on the web, such as XML and JavaScript, are case sensitive. In terms of performance, it is always best to use the most appropriate function/property/object for each case.

If you are dealing with structured formats such as XML or JS, case sensitivity matters, and it is much faster to use the system libraries.

If you are dealing with data in a database, then, as mentioned above, the database's indexing should handle case-sensitive or case-insensitive strings.

If you are handling data on the fly, it is important to work out the cost of the necessary conversion for each string, since the strings will probably have to be compared or sorted somehow.

Alper TÖR