When is it beneficial to use encodings other than UTF-8? Aside from dealing with pre-Unicode documents, that is. And more importantly, why isn't UTF-8 the default in most languages? That is, why do I often need to set it explicitly?
-
[Almost never](http://utf8everywhere.org/). – Mar 31 '14 at 07:45
-
Quite often. And I really don't think this is a duplicate. – david.pfx Apr 02 '14 at 02:18
-
tell that to alllll the people who voted it to be a duplicate – Electric Coffee Apr 02 '14 at 13:54
-
@david.pfx: this is *also* too broad. – Martijn Pieters Apr 03 '14 at 07:22
-
@MartijnPieters: I know most of those people and I respect their opinions. While I wouldn't argue against closing the question, I'm still not sure it's that close a duplicate. – david.pfx Apr 03 '14 at 10:07
-
@david.pfx: it's all the new people that you (and the rest of the community) don't know that are the problem for such questions. Everyone and their granny will wade in with their opinion, and you never end up with a clear answer that we can vote on. Then there are the tool merchants who all come in to promote their solution as the better choice, etc. That's why this *class* of question is off-topic. – Martijn Pieters Apr 03 '14 at 10:10
-
@david.pfx and Electric Coffee - There are 3 options to consider. 1) Open a question on Meta to discuss this question's closing reason, whether it's a duplicate, or ways to improve it. 2) Drop into [chat] on The Whiteboard and have the same conversation. 3) Flag for moderator consideration. Keep in mind that mods prefer to see options 1 or 2 before a post is flagged, as that allows the community to participate instead of unilateral mod action. – Apr 03 '14 at 10:37
-
I won't be flagging it for reopen. I find it an interesting topic, but not one to dwell on. – david.pfx Apr 03 '14 at 13:22
2 Answers
For an external encoding (i.e., an encoding of things not inside your program) it is very hard to beat UTF-8: it supports every character your users might ever reasonably need, and it is widely supported across OSes and tools. (The one exception is file names, where you must use the platform's conventions if you want any interoperability at all. Fortunately, many platforms now use UTF-8 for file names too, so the warning is moot there.)
For an internal encoding, things are more complex. The issue is that a character in UTF-8 is not a constant number of bytes, which makes all sorts of operations rather more complex than you might hope. In particular, indexing into a string by character (a very common operation when doing string processing!) changes from an O(1) operation into an O(N) operation, and that can be a very significant performance issue. There are a number of possible workarounds, such as using a rope data structure or converting the string into a fixed-width character format (typically ASCII, ISO 8859-1, UTF-16 or UTF-32, depending on the maximum Unicode value of the characters in the string). The problems that plague such formats (limited character support and/or endianness issues) don't actually apply here, because you only apply a transformation where it is meaningful and you are only using it as an internal encoding.
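To make the cost concrete, here is a minimal Python sketch (the sample string and helper function are purely illustrative): finding the n-th code point in UTF-8 bytes means scanning lead bytes from the start, while a byte index stays O(1) but lands on raw bytes rather than characters.

```python
def nth_codepoint_utf8(data: bytes, n: int) -> str:
    """Return the n-th code point of valid UTF-8 bytes: an O(N) scan,
    because each character occupies 1 to 4 bytes."""
    def width(lead: int) -> int:
        if lead < 0x80:   # 0xxxxxxx: 1-byte sequence (ASCII)
            return 1
        if lead < 0xE0:   # 110xxxxx: 2-byte sequence
            return 2
        if lead < 0xF0:   # 1110xxxx: 3-byte sequence
            return 3
        return 4          # 11110xxx: 4-byte sequence

    i = 0
    for _ in range(n):                  # skip the first n characters
        i += width(data[i])
    return data[i:i + width(data[i])].decode("utf-8")

b = "naïve café".encode("utf-8")   # 10 characters, 12 bytes
print(b[3])                        # O(1), but yields a raw byte (175), not a character
print(nth_codepoint_utf8(b, 9))    # 'é' -- had to walk past the 9 characters before it
```

Fixed-width internal formats exist precisely to turn that scan back into an array lookup.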
Don't think that you can get away with writing that internal encoding to disk or handing it to another program. It might be “convenient”, but it's a problem waiting to happen; send/store the data as UTF-8.
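A minimal Python sketch of that boundary rule (the file name is made up): encode on the way out, decode on the way in, regardless of the in-memory form.

```python
text = "résumé – München"

# Encode explicitly where the data leaves the program...
with open("out.txt", "w", encoding="utf-8") as f:
    f.write(text)

# ...and decode explicitly where it comes back in.
with open("out.txt", encoding="utf-8") as f:
    assert f.read() == text
```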
And don't forget that there's a lot of legacy data out there, far too much to dismiss. Of particular concern are various East Asian languages, which have complex encodings that are potentially quite a bit shorter than UTF-8, resulting in less pressure to convert, but there are many other issues lurking even in Western systems. (I don't want to know what is happening in major bank databases…)
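For instance (a small Python sketch with an illustrative string), Japanese text costs two bytes per character in Shift-JIS but three in UTF-8, so converting inflates the data by half:

```python
s = "日本語テキスト"                  # "Japanese text": 7 characters
print(len(s.encode("shift_jis")))   # 14 bytes, 2 per character
print(len(s.encode("utf-8")))       # 21 bytes, 3 per character
```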

-
Depending on the kind of processing, even UTF-32 doesn't provide a fixed-width encoding. For example, accented characters in non-Latin scripts (and uncommon accented characters in Latin scripts) are represented by a sequence of multiple Unicode codepoints (multiple UTF-32 'characters'). – Bart van Ingen Schenau Mar 31 '14 at 09:27
-
I think you mean UTF-16, Bart. UTF-32 can store any Unicode value in its 32-bit integer. There are, however, composed and decomposed notations for special characters, which means I can store a character and its diacritic as two separate values. But that's part of Unicode rather than of the Unicode data encoding. – dj bazzie wazzie Mar 31 '14 at 10:42
-
@BartvanIngenSchenau: I think that is no longer correct. Since RFC 3629 in 2003, UTF-8/16/32 are all limited to 4 bytes per code point. – david.pfx Apr 01 '14 at 00:13
-
@djbazziewazzie actually Unicode allows unlimited use of combining characters, which destroys any claim of guaranteed O(1) access, even with UTF-32. Look at the interesting question [How does Zalgo text work?](https://stackoverflow.com/q/6579844) for one case where it's being used. While one could argue whether such usage is really useful, it's still valid according to the specification (at least as far as I know), and therefore O(1) access to UTF-32 is nothing but a myth if you want to stay universal. – AliciaBytes Apr 01 '14 at 00:29
-
-1 for the constant time access myth. Almost all string processing can be done equally fast and conveniently using any arbitrary unit (e.g. bytes or code units) for indexing and sequential iteration (from start and end). And as others point out, UTF-32 only gets you O(1) access to code points, but another important (I'd say *more important*) notion of character is the grapheme cluster, and in that regard UTF-32 gets you nowhere. See also: http://utf8everywhere.org/#myths – Apr 01 '14 at 06:37
-
@david.pfx: The Zalgo text in the link from @RaphaelMiedl is an extreme example, but that was exactly what I was referring to. – Bart van Ingen Schenau Apr 01 '14 at 06:40
-
@Raphael Miedl: Thanks, that is exactly what I'm looking for. It's a character sequence used to represent another character. Bart's post implied that it was at the data level and that there are characters in the 21-bit Unicode range that need multiple 32-bit code points to represent one character. UTF-32 is therefore a fixed-width Unicode encoding; that was my point. – dj bazzie wazzie Apr 01 '14 at 10:10
-
@BartvanIngenSchenau: Ah, I see what you mean. According to http://en.wikipedia.org/wiki/Universal_Character_Set_characters there are 249,764 assigned code points, and the terms code point and character are more or less interchangeable. You were talking about 'characters including composed characters', of which there would seem to be arbitrarily many. Obviously the former can fit in 32 bits and the latter cannot. – david.pfx Apr 01 '14 at 10:43
-
@david.pfx: The meaning of 'character' is context dependent. Sometimes it means the same as codepoint, but sometimes it means the same as a grapheme cluster (that is what most non-programmers would identify as a character in a displayed/printed text, i.e. a base character combined with all its diacritical marks). – Bart van Ingen Schenau Apr 01 '14 at 10:53
-
@BartvanIngenSchenau: I agree entirely, they're easily confused. Sometimes it means the same as a code point (that's what most programmers would identify as a character in a program, whatever can fit in a 32-bit wchar_t). – david.pfx Apr 01 '14 at 12:55
-
@delnan You oversimplify. There's a surprisingly large number of algorithms that require indexing into a string at arbitrary offsets and which don't have an easy transformation into streaming form. I say this because I maintain a library where we had to fix this; users were complaining that their code was terribly slow with large strings (yeah, because their O(N) code was now O(N²)!) Going round telling users “you're holding it wrong” when things used to work is just a way to make people upset with you. – Donal Fellows Apr 01 '14 at 13:16
-
The key things: __Use UTF-8 for external encodings__ (including database content). You can use other encodings internally to a program (and for _some_ algorithms you get a substantial performance boost if you do so). Normalising your strings (to either NFC or NFD, but not both) can be a _very_ good idea. – Donal Fellows Apr 01 '14 at 13:23
-
@DonalFellows Could you give an example? Note that I encourage constant time indexing, but with byte indices rather than code point indices. What problem requires constant time access to the ith *code point*, yet not to the ith *grapheme cluster*, and can't make do with the ith *byte*? UTF-8 is no panacea for bad algorithms, but I don't know any problems where it prevents good algorithms. – Apr 01 '14 at 16:44
The answer is that UTF-8 is by far the best general-purpose data interchange encoding, and is almost mandatory if you are using any of the other protocols that build on it (mail, XML, HTML, etc).
However, UTF-8 is a multi-byte encoding and relatively new, so there are lots of situations where it is a poor choice. Here are a few.
Internal encoding in Windows/C/C++/C#/Java/Objective-C. These environments do not internally support UTF-8 (or any multibyte encoding); strings are ANSI, UCS-2, or UTF-16, depending on the environment (see the sketch below).
Legacy code, especially C/C++. Strings are typically ANSI, ISO 8859, UTF-16 or UTF-32.
Legacy data. There are vast mountains of textual data already encoded in some 8-bit format, including various code pages, JIS, etc.
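To illustrate the first case: in a UTF-16 environment, a string's length is counted in 16-bit code units, not characters. A small Python sketch of the mismatch (using `utf-16-le` to stand in for those runtimes' internal form):

```python
s = "𝄞 clef"                            # U+1D11E lies outside the BMP
print(len(s))                           # 6 code points
print(len(s.encode("utf-16-le")) // 2)  # 7 code units: the clef needs a surrogate pair
```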
The remaining cases involve the use of text files. They will likely remain an issue as long as plain old text files remain popular. The point is that text files do not declare their encoding, so the reader and the writer both have to make assumptions. Yes, there is something called a Byte Order Mark, but it is neither required nor recommended for UTF-8 files, so any file containing 8-bit characters is of uncertain encoding.
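A small Python sketch of that uncertainty (the sample bytes are illustrative): the same bytes are legal text under several 8-bit code pages and simply invalid as UTF-8, and only a BOM, when present, settles the question:

```python
data = "café".encode("cp1252")   # b'caf\xe9'
print(data.decode("cp1252"))     # café -- the writer's intent
print(data.decode("cp850"))      # cafÚ -- a plausible but wrong guess
# data.decode("utf-8")           # raises UnicodeDecodeError: 0xE9 is not valid here

# A UTF-8 BOM disambiguates; the utf-8-sig codec strips it on read:
bom = "\ufeffcafé".encode("utf-8")   # starts with b'\xef\xbb\xbf'
print(bom.decode("utf-8-sig"))       # café
```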
Here are some cases where text files have little reason to allow or use UTF-8.
Software tools. Tools like sed, awk, tr, etc. may or may not work with UTF-8. It's often easier not to try.
Compilers. Most computer languages are defined in terms of 7-bit ASCII and read plain text files from disk, with special tricks for extended characters.
Log files, simple protocols, embedded systems. Sometimes 7- or 8-bit ASCII is just the easiest.
Not always needed. Most European languages can be encoded in code page 850 or 1252, with possible savings in space and simpler code.
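A rough measure of that saving (a Python sketch; the string is illustrative):

```python
s = "Größenwahn"                  # 10 characters, all within cp1252 and cp850
print(len(s.encode("cp1252")))    # 10 bytes
print(len(s.encode("utf-8")))     # 12 bytes: ö and ß take two bytes each
```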
I confidently expect that many of these will go away over time, but until then they are real reasons to avoid UTF-8 in certain situations.

-
@whatsisname: __Relatively__ new. ASCII is from 1963, code pages from around 1981. UTF-8 only reached 50% of the web in 2008. Plenty of older stuff out there. – david.pfx Apr 01 '14 at 00:05
-
I'm not sure what you're getting at with #3 and #5. Are you saying languages require source files to be 7 bit ASCII? Then that should be just another example of legacy systems/data/protocols. And for #5, unless you're referring to *legacy* systems, you should elaborate why it's "easiest". Also, #6 doesn't make sense to me. Even if it's not "needed" you can simplify everything, improve user experience, and rule out several kinds of data mangling by simply using UTF-8 whenever possible. That it's not *needed* should make no difference since using UTF-8 should be at least as easy, if not easier. – Apr 01 '14 at 06:45
-
@delnan: See edits. I know it won't satisfy everyone, but these are genuine answers to the question as put. – david.pfx Apr 01 '14 at 10:27
-
ASCII was being merrily misunderstood back in the early 1990s. Our printer thought `#` should render as `£`, which made C code rather “different” to read… – Donal Fellows Apr 01 '14 at 13:24
-
@DonalFellows: When the Brits adopted the teletype and ASCII (1960s) they had no use for dollar but needed a pound (currency) symbol, so they took shift-3 and the corresponding ASCII code. Your printer was simply putting on a British accent. – david.pfx Apr 02 '14 at 02:17
-
@DonalFellows: That was the GB variant of ISO-646, not ASCII. And, well... after all, it was a pound sign, wasn't it? – ninjalj Mar 09 '15 at 22:43
-
@ninjalj: It's a long time ago, but I think you'll find ISO-646 was a retrofit to unify a series of national standards. GB was using "British ASCII" and then standardised it as BS 4730 before ISO-646. After all, what use is a computer that can't print your national currency symbol? – david.pfx Mar 11 '15 at 00:28