58

It seems to me that many bigger C++ libraries end up creating their own string type. In the client code you either have to use the one from the library (QString, CString, fbstring etc., I'm sure anyone can name a few) or keep converting between the standard type and the one the library uses (which most of the time involves at least one copy).

So, is there a particular misfeature or something wrong about std::string (just like auto_ptr semantics were bad)? Has it changed in C++11?

Tamás Szelei
  • 35
    It's called "Not Invented Here syndrome". – Cat Plus Plus Jun 05 '12 at 14:07
  • 1
    I think this belongs more in Programmers.se, not having a definite answer. – David Thornley Jun 05 '12 at 14:09
  • I've edited the title of your post to be less subjective. – John Dibling Jun 05 '12 at 14:25
  • 12
    @CatPlusPlus QString and CString both predated std::string. – Gort the Robot Jun 05 '12 at 15:36
  • 10
    @Cat Plus Plus: This syndrome does not seem to affect the Java String class. – Giorgio Jun 05 '12 at 18:06
  • 21
    @Giorgio: Java programmers are too busy inventing workarounds for language deficiencies to worry about string classes (Android reinvented String, by the way). – Cat Plus Plus Jun 06 '12 at 00:26
  • @Giorgio because the Java String is `final`. If not, the same would have happened there. –  Jun 06 '12 at 01:08
  • 6
    `final` only stops you from deriving something from `String`. It doesn't stop you from making another class that does the same stuff. – Gort the Robot Jun 06 '12 at 01:44
  • 9
    @Giorgio: That's probably because Java's hard-coded syntactic support for `java.lang.String` (lack of operator overloading, etc.) would make it a pain to use anything else. – Mechanical snail Jun 06 '12 at 04:13
  • 7
    ___"writing string classes is one of the more popular indoor sports among C++ programmers" — [P.J. Plauger](http://www.drdobbs.com/article/print?articleID=184403044)___ (Note the date, though. In fact, [Ben had it right](http://programmers.stackexchange.com/a/151620/1512) this is mostly for historical reasons.) – sbi Jun 06 '12 at 07:43
  • 1
    It used to be that writing a string class was always one of the first "C++ 101" assignments and/or "how to write OO code" tutorial examples. – Gort the Robot Jun 06 '12 at 17:02
  • 1
    @CatPlusPlus Android reinvented `String`? Been developing for Android for years, but no idea what you mean. – Malcolm Jan 03 '16 at 15:00
  • @Malcolm I'm sure I know what I meant 4 years ago – Cat Plus Plus Jan 04 '16 at 07:45
  • @CatPlusPlus Can't hurt to ask, especially considering so many people upvoted the comment, I'm very curious about that point. – Malcolm Jan 04 '16 at 10:40
  • It's not always Not Invented Here syndrome; often it is "I don't trust it" syndrome, or "I will do it better" syndrome, or "does too much", or "does too little", or people just hate that it has std in it and don't want to use more layers of abstraction. std::string isn't perfect, it has plenty of stds. – Dmytro Sep 04 '16 at 20:33

7 Answers

58

Most of those bigger C++ libraries were started before std::string was standardized. Others include additional features that were standardized late, or still not standardized, such as support for UTF-8 and conversion between encodings.

If those libraries were implemented today, they would probably choose to write functions and iterators that operate on std::string instances.
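As a rough illustration of that approach, here is a minimal sketch of a free function that adds one UTF-8-aware operation on top of a plain std::string instead of wrapping it in a new class. The name `utf8_length` is made up for this example, and the function assumes its input is already well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string that is assumed to hold well-formed UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that does
// not match that pattern starts a new code point.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For plain ASCII the result equals `s.size()`; for text with multi-byte characters it is smaller, which is exactly the distinction std::string itself does not make.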

Ben Voigt
  • 5
    Support for UTF-8 has been standardized since C++98, but in such an inconvenient and partially implementation-defined way that almost nobody seems able to use it. – AProgrammer Jun 05 '12 at 14:16
  • 10
    @AProgrammer: `char` is guaranteed to be large enough to hold any UTF-8 codepoint. AFAIK, that's the only "support" that C++98 provided. – Ben Voigt Jun 05 '12 at 16:00
  • 4
    @AProgrammer: That support is really quite useless. – DeadMG Jun 05 '12 at 17:15
  • 2
    @Ben There’s no such thing as a UTF-8 code point. I’m not sure how UTF-8 is formally defined but it’s somehow in terms of *octets*. So there’s even less support than that, since the only guarantee C++ makes is that a char is large enough to contain an octet, which, yes, happens to be big enough to hold the encoding unit of UTF-8 but that’s more by happenstance than by design. – Konrad Rudolph Jun 06 '12 at 08:43
  • The support is part of the locale system. You need one locale supporting Unicode. Such a locale will use `wchar_t` to store Unicode code points and `char` to store the multibyte representation (the sane choice is to use UTF-8). The support is as complete as for any other encoding. The whole usage model has so many problems that I won't start listing them, but it exists and was probably following the common practice at the time it was standardized (i.e. for C95), even if I wonder how much foresight was needed to see it was already past its best years. – AProgrammer Jun 06 '12 at 09:54
  • 1
    @Konrad: Right, I meant code unit, not codepoint. – Ben Voigt Jun 06 '12 at 12:50
  • 5
    @AProgrammer That locale is arguably broken since `wchar_t` is *not* big enough to represent all Unicode code points. Furthermore, there was this whole discussion about [UTF-16 considered harmful](http://programmers.stackexchange.com/q/102205/2366) where the very compelling argument was made that [UTF-8 should be used exclusively](http://utf8everywhere.org/) … – Konrad Rudolph Jun 06 '12 at 12:54
  • 1
    UTF-8 fitting isn't happenstance either, it's an explicit requirement (in C++11): "A byte is at least large enough to contain any member of the basic execution character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined." -- section 1.7 `[intro.memory]` – Ben Voigt Jun 06 '12 at 14:31
  • 6
    @KonradRudolph, it is not the locale system which is broken there (the definition of wchar_t is "wide enough for any supported character set"); systems having committed to a 16 bits wchar_t did at the same time commit to not supporting Unicode. Well, the culprit is Unicode which first guaranteed that it would never use codepoints needing more than 16 bits, then systems committing to a 16 bits wchar_t, then unicode switching to need more than 16 bits. – AProgrammer Jun 06 '12 at 15:14
  • @BenVoigt We were talking about C++98 though. Sure, C++11 offers much more in terms of Unicode support. – Konrad Rudolph Jun 06 '12 at 15:33
  • @Konrad: Conversely, UTF-8 gained prominence over UTF-7 and UTF-16 exactly because major programming languages such as C++98 guaranteed their built-in string types could accommodate octets. Still not an accident. And C++11 respects the connection by guaranteeing that UTF-8 will continue to fit. – Ben Voigt Jun 06 '12 at 15:41
42

String is C++'s big embarrassment.

For the first 15 years you don't provide a string class at all, forcing every compiler on every platform and every user to create their own.

Then you make something that's confused about whether it's supposed to be a full string manipulation API or just an STL char container, with some algorithms that duplicate the ones on a std::vector and others that are different.

An obvious string operation like replace() or mid() involves such a mess of iterators that you need to introduce a new 'auto' keyword to keep the statement fitting on a single page, which leads most people to give up on the whole language.
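For reference, here is a minimal sketch of what those two operations can look like when written against the index-based half of the std::string interface rather than the iterator-based half. The helper names `replace_all` and `mid` are hypothetical; std::string provides only the building blocks (`find`, `replace`, `substr`), not these functions.

```cpp
#include <cstddef>
#include <string>

// Replaces every occurrence of `from` in `text` with `to`, in place.
void replace_all(std::string& text, const std::string& from, const std::string& to) {
    if (from.empty()) {
        return;  // an empty pattern would match at every position
    }
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text so it is not rescanned
    }
}

// BASIC-style mid(): up to `count` characters starting at index `start`.
// Returns an empty string instead of throwing when `start` is out of range.
std::string mid(const std::string& text, std::size_t start, std::size_t count) {
    if (start >= text.size()) {
        return std::string();
    }
    return text.substr(start, count);
}
```

That both helpers have to be written by hand, and that the same job could also be done through iterators, is arguably the duplication the answer is complaining about.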

And then you have unicode 'support' and std::wstring that is just arghh.....

< rant off > thank you - I'm feeling much better now.

Martin Beckett
  • @DeadMG: C++ existed for 15 years before it was first standardized. – dan04 Jun 05 '12 at 20:11
  • 12
    @DeadMG - yes and it was standardised in 1998, 15 years after it was invented and 6 years after even MSFT were using it. Yes iterators are a useful way of making an array and list look the same, do you think they are an obvious way to do string manipulation? – Martin Beckett Jun 05 '12 at 20:27
  • 3
    C with Classes was invented in 1983. Not C++. The only Standard libraries are those determined by Standard- which, strangely enough, can only happen once you have a Standard, so the earliest possible date for *any* Standard library is 1998. And iterators could be considered exactly equal to indexes, but strongly typed. I'm all for the fact that iterators suck compared to ranges, but that's not really specific to `std::string`. The lack of a String class in 1983 does not justify having more of them now. – DeadMG Jun 05 '12 at 21:04
  • 1
    In a way, there's been a C++ standard library since 1989. But back then, it was just called the C standard library. – dan04 Jun 05 '12 at 22:07
  • 8
    I thought iostreams were C++s big embarrassment... – Doug T. Jun 06 '12 at 00:20
  • 1
    @DougT.: Let's face it, there are not many coding practices from before about 1998 that are not big embarrassments. – DeadMG Jun 06 '12 at 00:45
  • 18
    @DeadMG People were using something called "C++" for many years prior to 1998. I wrote my first program using something called "C++" in 1985. If you want to say that this isn't "real" C++, that's fine, but prior to this, we were writing code and had to get a string class from somewhere. Once we had these legacy codebases, we couldn't exactly throw them out or rewrite from scratch when we got a standard. Now what *should* have happened is that there should have been a string class that came with cfront. – Gort the Robot Jun 06 '12 at 01:48
  • @StevenBurnap: If you use a language prior to Standardization, then you get what you ask for- a language which is still in development and incomplete. CFront did not have many of the language features which make `std::string` possible, like exceptions. – DeadMG Jun 06 '12 at 20:38
  • 8
    @DeadMG - If nobody used a language until it had ISO cert then no language would ever be used since it would never get to ISO. There is no ISO standard for x86 assembler but I'm happy to use the platform – Martin Beckett Jun 06 '12 at 20:41
  • 2
    @MartinBeckett: Oh, I'm not saying that it should not have been used. What I am saying is that when you use a language in such a fashion, then there are certain things to be expected, and one of them is that the authors haven't finished with it quite yet. – DeadMG Jun 06 '12 at 20:43
  • @DeadMG I'm not denying that. At the time, no one expected perfection. We just wrote our own string classes all the while whining that there wasn't one that came with the compiler. Which is, of course, the answer to the question of "why are there so many string classes?" – Gort the Robot Jun 06 '12 at 22:02
  • @StevenBurnap: It is, however, more than a tad misleading to say that CFront should have had `std::string` when there was no way it could possibly have supported such a thing. – DeadMG Jun 07 '12 at 06:42
  • Not `std::string` exactly, but certainly a string class. – Gort the Robot Jun 07 '12 at 16:23
  • Is `std::string` good to use in modern C++17? – Aaron Franke Feb 27 '19 at 09:46
33

Actually... there are several issues with std::string, and yes it gets a bit better in C++11, but let's not get ahead of ourselves.

QString and CString are part of old libraries, therefore they existed prior to C++ being standardized (much like the SGI STL). They thus had to create a class.

fbstring addresses very specific performance concerns. The Standard prescribes an interface and minimum algorithmic complexity guarantees; whether an implementation actually ends up being fast is a Quality of Implementation detail. fbstring has specific optimizations (storage-related ones, or a faster find, for example).

Other concerns that were not raised here (in no particular order):

  • in C++03 it is not mandatory that the storage be contiguous, making interoperability with C potentially difficult. C++11 fixes this.
  • std::string is encoding-unaware and has no special code for UTF-8, so it's easy to store a UTF-8 string in it and corrupt it inadvertently
  • std::string interface is bloated, many methods could have been implemented as free-functions and many are duplicated to conform both to an index-based interface and an iterator-based interface.
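To make the first point concrete, here is a minimal sketch of the C interoperability that guaranteed contiguity allows from C++11 onwards. `c_fill_greeting` is a made-up stand-in for a C API that writes into a caller-supplied buffer; it is not part of any real library.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a C function that fills a caller-supplied buffer
// and returns the number of bytes written (0 if the buffer is too small).
std::size_t c_fill_greeting(char* buffer, std::size_t capacity) {
    const char message[] = "hello from C";
    const std::size_t length = sizeof(message) - 1;  // exclude the NUL
    if (capacity < length) {
        return 0;
    }
    std::memcpy(buffer, message, length);
    return length;
}

std::string read_greeting() {
    std::string result(64, '\0');  // make room for the C function to write into
    // Since C++11 the characters are stored contiguously, so &result[0] is a
    // valid char* to result.size() writable bytes.
    const std::size_t written = c_fill_greeting(&result[0], result.size());
    result.resize(written);  // trim to what was actually written
    return result;
}
```

Under C++03 the usual workaround was to let the C function write into a `std::vector<char>` and copy the result into a string afterwards.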
Matthieu M.
  • 5
    Re concern #1 -- C++03 21.3.6/1 guarantees that `c_str()` returns a pointer to contiguous storage, which provides for some C-interoperability. However you cannot modify the pointed-to data. Typical workarounds include using a `vector`. – John Dibling Jun 05 '12 at 14:22
  • @JohnDibling: Yes, and there is another limitation: it could incur a copy in newly allocated storage (the Standard does not say it shall not). Of course C++11 does not prevent copying either, but since you can simply do `&s[0]` it does not matter any longer :) – Matthieu M. Jun 05 '12 at 14:38
  • 1
    @MatthieuM.: The pointer obtained via `&s[0]` may not point to a NUL-terminated string (unless `c_str()` has been called since the last modification). – Ben Voigt Jun 05 '12 at 16:01
  • @BenVoigt: I believe even if `c_str()` has been called it may not be null-terminated as `c_str()` could (though it is unlikely) use another buffer. – Matthieu M. Jun 05 '12 at 17:34
  • 2
    @Matthieu: Another buffer is not allowed. "`c_str()` Returns: A pointer `p` such that `p + i == &operator[](i)` for each `i` in `[0,size()]`". – Ben Voigt Jun 05 '12 at 17:36
  • @BenVoigt: Ah good! Somehow this had completely escaped my notice. `std::string` is now less broken, it'll soon be as good as a `std::vector`! – Matthieu M. Jun 05 '12 at 18:27
  • @MatthieuM.: Yes, well, there still is no requirement that `&s[size()] == &s[0] + size()` until after `c_str()` or `data()` is called... – Ben Voigt Jun 05 '12 at 19:02
  • 3
    What's also worth noting is that nobody in their right mind uses MFC anymore, so it's hard to argue that CString is a string class in modern C++. – DeadMG Jun 05 '12 at 21:05
  • @MatthieuM.: you say that "it's easy to store a UTF-8 string in it and corrupt it inadvertently". Why and how can it be corrupted by storing a UTF-8 string in it? Could you give an example that does this? – Destructor Feb 21 '16 at 17:09
  • @PravasiMeet: in a `std::string`, you can store the [pile of poo](http://www.fileformat.info/info/unicode/char/1f4a9/index.htm) character U+1F4A9 in 4 bytes: 0xF0 0x9F 0x92 0xA9. You can then obtain a substring: 0xF0 0x9F 0x92; this, unfortunately, is no longer valid UTF-8 because the code point is incomplete. When a string implementation allows splitting/truncating in the middle of a code point, it's close to being useless for manipulating text (unless you can ensure that the text in question consists solely of ASCII or of whatever single-byte, locale-specific encoding you use). – Matthieu M. Feb 21 '16 at 17:18
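The truncation described in that last comment can be reproduced with a small check. This is only a sketch: `has_broken_utf8_tail` is a hypothetical helper that inspects the end of the string, not a full UTF-8 validator.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` ends with an incomplete or malformed UTF-8 sequence.
// Only the tail is inspected; the rest of the string is not validated.
bool has_broken_utf8_tail(const std::string& s) {
    std::size_t i = s.size();
    std::size_t continuation = 0;
    // Walk back over trailing continuation bytes (10xxxxxx), at most three.
    while (i > 0 && continuation < 3 &&
           (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++continuation;
    }
    if (i == 0) {
        return continuation > 0;  // continuation bytes with no lead byte
    }
    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    std::size_t expected = 0;  // continuation bytes the lead byte announces
    if (lead < 0x80) {
        expected = 0;
    } else if ((lead & 0xE0) == 0xC0) {
        expected = 1;
    } else if ((lead & 0xF0) == 0xE0) {
        expected = 2;
    } else if ((lead & 0xF8) == 0xF0) {
        expected = 3;
    } else {
        return true;  // not a valid lead byte
    }
    return continuation != expected;
}
```

With `std::string poo = "\xF0\x9F\x92\xA9";`, the check passes for `poo` but fails for `poo.substr(0, 3)`: `substr` counts bytes and happily cuts the code point in half.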
9

Apart from the reasons posted here there is also another one: binary compatibility. Library writers have no control over which std::string implementation you are using and whether it has the same memory layout as theirs.

std::string is a template, so its implementation is taken from your local STL headers. Now imagine that you are locally using some performance-optimised STL version, fully compatible with the standard. For example, you may have chosen to introduce a static buffer in each std::string to reduce the number of dynamic allocations and cache misses. As a result, the memory layout and/or size of your implementation differs from the library's.

If only the layout is different, some std::string member function calls on instances passed from the library to the client or the other way around may fail, depending on which members were shifted.

If the size is different as well, all library types having a std::string member will appear to have a different sizeof when checked in the library and in the client code. Data members following the std::string member will have their offsets shifted as well, and any direct access or inline accessor called from the client will return rubbish, despite "looking OK" when debugging the library itself.

Bottom line: if the library and the client code are compiled against different std::string versions, they will link just fine, but the result may be some nasty, hard-to-understand bugs. If you change your std::string implementation, all libraries exposing STL members have to be recompiled to match the client's std::string layout. And because programmers want their libraries to be robust, you'll rarely see std::string exposed anywhere.

To be fair, this applies to all STL types. IIRC they don't have a standardised memory layout.
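One common way around this, sketched here under the assumption that the library is free to choose its own interface, is to let only built-in types cross the library boundary and to build the std::string entirely on the client side. `lib_get_name` and `get_name` are hypothetical names used for illustration only.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// --- Library side: only built-in types appear in the interface. ---
// Hypothetical function. Copies at most `capacity` bytes of the name into
// `buffer` (if non-null) and always returns the full length of the name.
std::size_t lib_get_name(char* buffer, std::size_t capacity) {
    static const char name[] = "example-library";
    const std::size_t length = sizeof(name) - 1;  // exclude the NUL
    if (buffer != nullptr && capacity > 0) {
        std::memcpy(buffer, name, length < capacity ? length : capacity);
    }
    return length;
}

// --- Client side: the std::string comes from the client's own headers. ---
std::string get_name() {
    std::string result(lib_get_name(nullptr, 0), '\0');  // first call: query the length
    if (!result.empty()) {
        lib_get_name(&result[0], result.size());         // second call: fill the buffer
    }
    return result;
}
```

The price is the extra copy mentioned in the question, but the library no longer cares which std::string layout the client was compiled against.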

gwiazdorrr
  • 2
    You must be a *nix programmer. C++ binary compatibility is not equal on all platforms, and specifically on Windows NO classes containing data members are portable between compilers. – Ben Voigt Jun 06 '12 at 12:53
  • (I mean except POD types, and even then explicit packing requirements are needed) – Ben Voigt Jun 06 '12 at 15:43
  • 1
    Thanks for input, although I'm not talking different compiler, I'm talking different STL. – gwiazdorrr Jun 07 '12 at 09:29
  • 1
    +1: ABI is a huge reason to roll your own version of a compiler supplied class. For that alone, I wish this were the accepted answer. – Thomas Eding Jul 02 '14 at 18:10
6

There are many answers to the question but here are some:

  1. Legacy. Many string libraries and classes were written PRIOR to the existence of std::string.

  2. For compatibility with code in C. std::string is C++-only, whereas there are other string libraries that work with both C and C++.

  3. To avoid dynamic allocations. The library std::string uses dynamic allocation and may not be suitable for embedded systems, interrupt or real-time related code, or for low-level functionality.

  4. Templates. The library std::string is based on templates. Until fairly recently a number of C++ compilers had poorly performing or even buggy template support. Unfortunately, I work in an industry that uses a lot of custom tools and one of our toolchains from a major player in the industry doesn't "officially" 100% support C++ (with buggy stuff being templates et al).

There are probably many more valid reasons as well.
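As an illustration of point 3, here is a minimal sketch of a fixed-capacity string whose storage lives entirely inside the object, so it never touches the heap. `FixedString` is a made-up class for this example, not taken from any particular library, and it silently truncates input that does not fit.

```cpp
#include <cstddef>
#include <cstring>

// Fixed-capacity string: all storage is inside the object, so no dynamic
// allocation ever happens. Input that does not fit is truncated.
template <std::size_t Capacity>
class FixedString {
public:
    FixedString() : size_(0) { data_[0] = '\0'; }

    explicit FixedString(const char* text) : size_(0) {
        data_[0] = '\0';
        append(text);
    }

    // Appends as much of `text` as fits; returns false if it was truncated.
    bool append(const char* text) {
        const std::size_t wanted = std::strlen(text);
        const std::size_t room = Capacity - size_;
        const std::size_t copied = wanted < room ? wanted : room;
        std::memcpy(data_ + size_, text, copied);
        size_ += copied;
        data_[size_] = '\0';  // keep the buffer NUL-terminated for C APIs
        return copied == wanted;
    }

    const char* c_str() const { return data_; }
    std::size_t size() const { return size_; }

private:
    char data_[Capacity + 1];  // +1 for the terminating NUL
    std::size_t size_;
};
```

On a toolchain with the weak template support described in point 4, the capacity could be hard-coded as a constant instead of being a template parameter.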

svick
Adisak
  • 2
    "Fairly recently" meaning "It's been a decade since even Visual Studio had pretty reasonable support for them"? – DeadMG Jun 06 '12 at 20:48
  • @DeadMG - Visual Studio is not the only non-compliant compiler in the world. I work in video games and we are often working on custom compilers for unreleased hardware platforms (happens every few years in the console cycles or as new hardware appears). "Fairly recently" means today: right now certain compilers don't support templates well. I can't be specific without violating NDAs, but I am currently working on a platform with custom toolchains where C++ support, especially template compliance, is considered to be "experimental". – Adisak Jun 12 '12 at 21:32
4

It's mostly about Unicode. The Standard support for Unicode is abysmal at best, and everyone has their own Unicode needs. For example, ICU supports every Unicode functionality you could ever want, behind the most disgusting automatically-generated-from-Java interface you could possibly imagine, and if you're on Unix being stuck with UTF-16 may well not be your idea of a good time.

In addition, many people need differing levels of Unicode support; not everyone needs the complex text layout APIs and such things. So it's easy to see why numerous string classes exist: the Standard one pretty much sucks and everybody has different needs from the new ones, with nobody managing to create a single class that provides broad Unicode support cross-platform with a pleasant interface.

In my opinion, this is mostly the fault of the C++ Committee for not correctly providing support for Unicode. In 1998 or 2003 that was maybe understandable, but not in C++11. Hopefully in C++17 they will do better.

DeadMG
-4

It's because every programmer has something to prove and feels the need to create their own awesome, faster string class for their one, awesome function. It's usually a little superfluous and leads to all kinds of extra string conversions in my experience.

Chad Stewart
  • 7
    Were this true I'd expect to see a similar number of String implementations in languages like Java where a good implementation has been available all along. – Bill K Jun 05 '12 at 17:40
  • @BillK the Java String is final, so you have to put new functionality elsewhere. –  Jun 06 '12 at 01:10
  • And my point is, even with it being final, in 20 years I've never seen anyone write a custom string implementation (well, I did attempt one to improve string concatenation performance, but it turns out Java is MUCH smarter at string+string than you'd imagine) – Bill K Jun 06 '12 at 01:46
  • 2
    @Bill: That might have to do with a different culture. C++ attracts those who want to understand the low-level details. Java attracts those who just want to get the job done using someone else's building blocks. (Note that this is not a statement about any specific individual choosing to use either language, but about the languages' respective design goals and culture) – Ben Voigt Jun 06 '12 at 15:45