As an interview question, it's usually asked just about the technical bits of doing an in-place swap of 8-bit items to reverse their order (regardless of what characters those might actually represent).
At the same time, especially if you're interviewing a relatively senior person, you could at least hope to hear some questions about the specification and the exact form of the input. Even if you direct them back to the simple case of just swapping 8-bit items, knowing whether or not they think in broader terms than that may be valuable.
If you do have to deal with a broad range of inputs, you just about have to think in terms of a "stack", a bit like a network stack. You have to build your software in a number of layers, each of which applies a fairly specific set of transforms in a specific order. This lets you keep each part of the transformation simple enough that you can keep it under control, and stand a reasonable chance of making it meet its requirements.
I'll outline one possibility that I have found at least somewhat workable. I'm the first to admit that there may be others who have better ideas though. At least to me, this seems a bit like brute-force engineering, with little real elegance.
You normally want to start by converting any other representation to UCS-4 (aka UTF-32). For this, you'd generally prefer to rely on input from the user rather than attempting to figure out the encoding on your own. In some cases, you can be sure a particular sequence of octets does not follow the rules of a particular encoding scheme, but you can rarely (if ever) be sure that it does follow a particular one.
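In Python terms, that decoding step might look something like this (a sketch; `decode_input` is just a name I'm using here, and the encoding is taken from the caller rather than guessed):

```python
# Decode raw octets into a sequence of Unicode code points.
# The encoding should come from the caller (or reliable metadata),
# not from guesswork: an invalid byte sequence can *disprove* an
# encoding, but a valid one never proves the encoding is right.
def decode_input(octets: bytes, encoding: str) -> str:
    # errors="strict" makes invalid sequences raise UnicodeDecodeError
    # instead of being silently replaced with U+FFFD.
    return octets.decode(encoding, errors="strict")
```

For example, `decode_input(b"\xc3\x81", "utf-8")` yields the single code point U+00C1, while a truncated sequence such as `b"\xc3"` raises `UnicodeDecodeError`.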
The next step is optional: you can normalize the input to one of the four Unicode normalization forms. In this case, you'd probably want to apply the "NFKC" transformation: compatibility decomposition followed by canonical composition. This will (where possible) convert combining diacritical forms (such as the U+0301 that Jon mentioned) into single code points (e.g., an "A" followed by a U+0301 would be converted to "Latin capital A with acute", U+00C1).
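Python's `unicodedata` module exposes exactly this transformation, so the step is a one-liner:

```python
import unicodedata

# "A" followed by COMBINING ACUTE ACCENT (U+0301) composes under NFKC
# into the single code point U+00C1, LATIN CAPITAL LETTER A WITH ACUTE.
decomposed = "A\u0301"
composed = unicodedata.normalize("NFKC", decomposed)
assert composed == "\u00c1"
assert len(composed) == 1
```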
You then walk through all the characters from beginning to end, breaking the string into actual (user-perceived) characters -- and if there are (still) combining diacritic marks, keeping them with the characters they modify. The result of this will typically be an index of the actual characters in the string, such as the position and length of each.
You then reverse the order of those complete characters, typically by using the index you created in the previous step.
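A sketch of these two steps in Python (stdlib only). Note this is a deliberate simplification: treating "base plus trailing nonzero-combining-class code points" as one character covers the diacritic case discussed here, but full grapheme clustering per UAX #29 handles more (ZWJ sequences, Hangul jamo, etc.):

```python
import unicodedata

def reverse_preserving_marks(s: str) -> str:
    """Reverse s while keeping combining marks attached to their base.

    Simplified approximation of grapheme segmentation: any code point
    with a nonzero combining class is treated as a mark belonging to
    the preceding base character.
    """
    clusters = []  # index of "actual characters": (start, length) pairs
    for i, ch in enumerate(s):
        if clusters and unicodedata.combining(ch) != 0:
            # Extend the current cluster to cover this combining mark.
            start, length = clusters[-1]
            clusters[-1] = (start, length + 1)
        else:
            clusters.append((i, 1))
    # Walk the index backwards to reverse the order of whole clusters.
    return "".join(s[start:start + length]
                   for start, length in reversed(clusters))
```

So `reverse_preserving_marks("abcA\u0301")` gives `"A\u0301cba"` -- the acute accent stays with its "A" instead of migrating onto the "c".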
You then (again, optionally) apply another Unicode normalization, such as NFD (canonical decomposition). This will turn the aforementioned "Latin capital A with acute" back into two code points -- a "Latin capital A" and a "combining acute accent". If your input happened to contain a U+00C1 to start with, however, it would be converted into two code points as well.
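Again via `unicodedata`, and this makes the second point concrete: NFD decomposes a precomposed character regardless of how it entered the string:

```python
import unicodedata

# Canonical decomposition splits precomposed characters back apart:
# U+00C1 "LATIN CAPITAL LETTER A WITH ACUTE" becomes
# U+0041 "A" followed by U+0301 COMBINING ACUTE ACCENT.
nfd = unicodedata.normalize("NFD", "\u00c1")
assert nfd == "A\u0301"
assert len(nfd) == 2
```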
You then encode the sequence of UCS-4 code points into the desired encoding (UTF-8, UTF-16, etc.).
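In Python this final step is just `str.encode` with the target codec:

```python
# Encode the reversed code-point sequence into the target encoding.
text = "A\u0301cba"
utf8_bytes = text.encode("utf-8")        # U+0301 becomes 0xCC 0x81
utf16_bytes = text.encode("utf-16-le")   # 2 bytes per BMP code point
```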
Note that the Unicode normalization steps can/will change the number of code points needed to store the string, so if you include those, you can no longer plan on the result string fitting into the original storage. Obviously enough, the resulting code points may not correspond directly to the input code points either.
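To see that size change concretely -- normalization alters both the code-point count and the encoded byte count:

```python
import unicodedata

s = "A\u0301"                            # two code points
nfc = unicodedata.normalize("NFC", s)    # one code point: U+00C1
assert len(s) == 2 and len(nfc) == 1

# The UTF-8 byte lengths differ too:
assert len(s.encode("utf-8")) == 3       # "A" (1 byte) + U+0301 (2 bytes)
assert len(nfc.encode("utf-8")) == 2     # U+00C1 is 2 bytes in UTF-8
```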