Why are strings so slow?

Question

Ever since my very first programming class in high school, I've been hearing that string operations are slower — i.e. more costly — than the mythical "average operation." Why makes them so slow? (This question left intentionally broad.)

If you know that these "average operations" are mythical, can you at least tell us what some of them are? Given that you're asking such a vague question, it's hard to trust your assertion that these unspecified operations truly are mythical. — seh, Oct 09 '10 at 14:49
@seh, unfortunately, I actually can't answer that. The few times I've actually asked people what strings are slower than, they just kind of shrug and say "they're just slow." Besides, if I had more specific information, this would be a question for SO, not Programmers; it's already kinda borderline. — Pops, Oct 09 '10 at 22:37
What is the point ? If told strings are actually slow, will you stop using them ? — Tulains Córdova, Nov 07 '12 at 16:40
Forget it. If someone tells you such nonsense, the counterquestion is: "Really? Are they? Should we use an int-array then?" — Ingo, Nov 07 '12 at 23:10

score 48 · Accepted Answer · answered Oct 09 '10 at 05:36

48

"The average operation" takes place on primitives. But even in languages where strings are treated as primitives, they're still arrays under the hood, and doing anything involving the whole string takes O(N) time, where N is the length of the string.

For example, adding two numbers generally takes 2-4 ASM instructions. Concatenating ("adding") two strings requires a new memory allocation and either one or two string copies, involving the entire string.

Certain language factors can make it worse. In C, for example, a string is simply a pointer to a null-terminated array of characters. This means that you don't know how long it is, so there's no way to optimize a string-copying loop with fast move operations; you need to copy one character at a time so you can test each byte for the null terminator.

answered Oct 09 '10 at 05:36

Mason Wheeler

82,151
24
234
309

4

And certain languages make it much better: Delphi's encoding of the string length at the beginning of the array makes string concatenation very fast. – Frank Shearar Oct 09 '10 at 05:53
@Frank Shearar: How does this make string concatenation faster? – gablin Oct 09 '10 at 08:35
@gablin: Because you can immediately jump to the end of the string. If the structure containing the array (Delphi's String is not internally a plain array of characters: it has copy-on-write semantics and a bunch of other stuff) has sufficient free space at the end, you can immediately start appending characters, possibly simply by copying a block of memory. – Frank Shearar Oct 09 '10 at 09:04
@Frank Shearar: This only helps if the array containing the string has excess space left to fit the other string. I assumed that the array did not. Of course, you could arrange the strings as a linked list of array blocks. But then you wouldn't need the know the size of the first string either... – gablin Oct 09 '10 at 09:52
4

@gablin: It also helps by making the string copying itself a lot faster. When you know the size up front, you don't have to copy one byte at a time and check each byte for a null terminator, so you can use the full size of any register, including the SIMD ones, for data movement, making it up to 16 times faster. – Mason Wheeler Oct 09 '10 at 12:54
Last paragraph is wrong, you can easily make a struct that keeps track of the length of string, for example see `strbuf` in the git source code. – alternative Oct 09 '10 at 14:53
4

@mathepic: Yeah, and that's fine for as far as it will take you, but when you start interacting with libc or other external code, it expects a `char*`, not a `strbuf`, and you're back to square 1. There's only so much you can do when a bad design is baked into the language. – Mason Wheeler Oct 09 '10 at 15:26
If you have a `strbuf *str` then `str->buf` gives `char *` to use with external libraries. Please stop complaining about C; your FUD is really annoying. – alternative Oct 09 '10 at 15:54
1

@Mason Wheeler: Oh yeah. Didn't think about that one. Thanks! – gablin Oct 09 '10 at 16:03
6

@mathepic: Of course the `buf` pointer's there. I never meant to imply that it's not available; rather, that it's *necessary.* Any code that doesn't know about your optimized-but-nonstandard string type, including things as fundamental as *the standard library*, still has to fall back on the slow, unsafe `char*`. You can call that FUD if you want to, but that doesn't make it not true. – Mason Wheeler Oct 09 '10 at 19:07
3

No, in C a string is, by definition, "a contiguous sequence of characters terminated by and including the first null character". Pointers are used to manipulate strings; they are not themselves strings. – Keith Thompson Feb 29 '12 at 00:55
7

People, there's a Joel Spolsky column about Frank Shearer's point: [Back to Basics](http://www.joelonsoftware.com/articles/fog0000000319.html) – user16764 Feb 29 '12 at 04:06

score 14 · Answer 2 · edited Apr 12 '17 at 07:31

^{This is an old thread and I think that the other answers are great, but overlook something, so here's my (late) 2 cents.}

Syntactic Sugar-Coating Hides Complexity

The problem with strings is that they are second class citizens in most languages, and are in fact most of the time not really a part of the language specification itself: they are a library-implemented construct with some occasional syntactic sugar-coating on the top to make them less of a pain to use.

The direct consequence of this is that the language hides a very large part of their complexity away from your sight, and you pay for the sneaky side-effects because you grow into the habit of considering them like a low-level atomic entity, just like other primitive types (as explained by the top-voted answer and others).

Implementation Details

Good Ol' Array

One of the elements of this underlying "complexity" is that most string implementations would resort to using a simple data-structure with some contiguous memory space to represent the string: your good ol' array.

This makes good sense, mind you, as you want the access to the string as a whole to be fast. But that implies potentially dreadful costs when you want to manipulate this string. Accessing an element in the middle might is fast if you know what index you are after, but looking for an element based on a condition isn't.

Even returning the size of string might be costly, if your language doesn't cache the string's length and needs to run through it to count characters.

For similar reasons, adding elements to your string will prove costly as you'll most likely need to re-allocate some memory for this operation to occur.

So, different languages take different approaches to these issues. Java, for instance, took the liberty of making its strings immutable for some valid reasons (caching length, thread-safety) and for its mutable counterparts (StringBuffer and StringBuilder) will choose to allocate size using larger-sized chunks to not need to allocate every time, but rather hope for best case scenarios. It generally works well, but the down-side is to sometimes pay for memory impacts.

Unicode Support

Also, and again this is due to the fact that the syntactic sugar coating of your language hides this from you to play nice, you often don't think it terms of unicode support (especially for as long as you don't really need it and hit that wall). And some languages, being forward thinking, do not implement strings with underlying arrays of simple 8-bit char primitives. They baked in UTF-8 or UTF-16 or what-have-you support for you, and the consequence is a tremendously larger memory consumption, which is often times not needed, and a larger processing time to allocate memory, process the strings, and implement all the logic that goes hand in hand with manipulating code points.

The results of all this, is that when you do something equivalent in pseudo-code to:

hello = "hello,"
world = " world!"
str = hello + world

It may not be - despite all the best efforts the language developers put in to have them behave as you'd except - a simple as:

a = 1;
b = 2;
shouldBeThree = a + b

As a follow-up, you may want to read:

Is it Inefficient to concatenate strings one at a time?

I just realized this is the best answer because the mythical statement can be applied to anything like RSA encryption is slow. The only reason for string being put in this embarrassing spot is because the plus operator provided for strings in most languages, which makes newbies not aware of the cost behind the operation. — Codism, Nov 07 '12 at 15:45
@Abel: thanks, seemed to me the was room for more generic details. — haylem, Nov 07 '12 at 17:03
@Codism: thanks, glad you liked it. I do indeed think this can be applied to many cases where it's just a matter of complexity being hidden (and of us not paying that much attention to lower-level details anymore until we finally need to because we hit a bottleneck or brickwall of some sort). — haylem, Nov 07 '12 at 17:04

score 1 · Answer 3 · answered Feb 28 '12 at 23:59

The phrase "average operation" is probably shorthand for a single operation of a theoretical Random-Access Stored-Program machine. This is the theoretical machine it's customary to use to analyse the running time of various algorithms.

The generic operations are normally taken to be load, add, subtract, store, branch. Maybe also read, print and halt.

But most string operations require several of these fundamental operations. For example, duplicating a string normally requires a copying operation, and hence a number of operations which is proportional to the length of a string (that is, it's "linear"). Finding a substring inside another string also has linear complexity.

score 1 · Answer 4 · answered Feb 29 '12 at 00:34

It completely depends on the operation, how strings are represented, and what optimizations exist. If strings are 4 or 8 bytes in length (and aligned), they wouldn't necessarily be slower - many operations would be just as fast as primitives. Or, if all strings have a 32-bit or 64-bit hash, many operations would also be just as fast (though you pay the hashing cost up front).

It also depends on what you mean by "slow". Most programs will process strings plenty fast for what is needed. String comparisons might not be as fast as comparing two ints, but only profiling will reveal what "slow" means to your program.

score 0 · Answer 5 · answered Oct 09 '10 at 05:39

0

Let me answer your question with a question. Why does saying a string of words take longer than saying a single word?

answered Oct 09 '10 at 05:39

ChaosPandion

6,305
2
33
33

2

It doesn't necessarily. – user16764 Feb 29 '12 at 07:11
3

Supercalifragilisticexpialidocious – Spoike Apr 30 '12 at 14:04
s/word/syllable/g – Caleb Nov 07 '12 at 18:48
Let me answer your question-answer with a question: why don’t you say what your answer is meant to mean? It is, after all, far from clear how it can be interpreted as applying to some run-time system. – PJTraill Feb 15 '16 at 11:58

Why are strings so slow?

5 Answers5

Syntactic Sugar-Coating Hides Complexity

Implementation Details

Good Ol' Array

Unicode Support

Linked