432

I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?"

Why do I ask this question?

How many programmers are aware of the fact that UTF-16 is actually a variable-length encoding? By this I mean that there are code points outside the BMP that are represented as surrogate pairs, and so take more than one 16-bit element.
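
For illustration, here is a minimal C++ sketch (the function name is made up for this example) of how a code point above U+FFFF is split into a surrogate pair; the two resulting 16-bit code units are what a UTF-16 string actually stores:

    #include <cstdint>
    #include <cstdio>

    // Encode a code point above U+FFFF as a UTF-16 surrogate pair.
    void to_surrogate_pair(uint32_t codePoint, uint16_t& high, uint16_t& low)
    {
        uint32_t v = codePoint - 0x10000;           // 20 significant bits remain
        high = uint16_t(0xD800 + (v >> 10));        // high (lead) surrogate
        low  = uint16_t(0xDC00 + (v & 0x3FF));      // low (trail) surrogate
    }

    int main()
    {
        uint16_t hi, lo;
        to_surrogate_pair(0x1D11E, hi, lo);         // U+1D11E MUSICAL SYMBOL G CLEF
        std::printf("U+1D11E -> 0x%04X 0x%04X\n", (unsigned)hi, (unsigned)lo); // 0xD834 0xDD1E
        return 0;
    }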

I know; lots of applications, frameworks and APIs use UTF-16, such as Java's String, C#'s String, Win32 APIs, Qt GUI libraries, the ICU Unicode library, etc. However, with all of that, there are lots of basic bugs in the processing of characters outside the BMP (characters that should be encoded using two UTF-16 elements).

For example, try to edit one of these characters:

  • 𝄞 (U+1D11E) MUSICAL SYMBOL G CLEF
  • 𝕥 (U+1D565) MATHEMATICAL DOUBLE-STRUCK SMALL T
  • 𝟢 (U+1D7F6) MATHEMATICAL MONOSPACE DIGIT ZERO
  • 𠂊 (U+2008A) Han Character

You may miss some, depending on what fonts you have installed. These characters are all outside of the BMP (Basic Multilingual Plane). If you cannot see these characters, you can also try looking at them in the Unicode Character reference.

For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:

  • Opera has problems with editing them (deleting requires 2 presses of backspace)
  • Notepad can't deal with them correctly (deleting requires 2 presses of backspace)
  • File name editing in Windows dialogs is broken (deleting requires 2 presses of backspace)
  • All Qt 3 applications can't deal with them - they show two empty squares instead of one symbol.
  • Python encodes such characters incorrectly when they are used directly: u'X' != unicode('X', 'utf-16') on some platforms when X is a character outside the BMP.
  • Python 2.5 unicodedata fails to get the properties of such characters when Python is compiled with UTF-16 Unicode strings.
  • StackOverflow seems to remove these characters from the text if they are edited directly as Unicode characters (these characters are shown using HTML Unicode escapes).
  • WinForms TextBox may generate an invalid string when limited with MaxLength.

It seems that such bugs are extremely easy to find in many applications that use UTF-16.
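
To illustrate the class of bug, here is a hypothetical C++ sketch (not code from any of the applications above; the function names are made up): a "backspace" handler that treats UTF-16 as fixed-width removes one code unit and leaves an unpaired surrogate behind, while a surrogate-aware one removes the whole pair:

    #include <string>

    // Buggy: treats UTF-16 as fixed-width and always drops one code unit.
    void backspace_naive(std::u16string& s)
    {
        if (!s.empty())
            s.pop_back();                  // may leave an unpaired high surrogate behind
    }

    // Surrogate-aware: drops the whole pair when the last unit is a trail surrogate.
    void backspace_aware(std::u16string& s)
    {
        if (s.empty()) return;
        bool trail = s.back() >= 0xDC00 && s.back() <= 0xDFFF;
        s.pop_back();
        if (trail && !s.empty() && s.back() >= 0xD800 && s.back() <= 0xDBFF)
            s.pop_back();
    }

    int main()
    {
        std::u16string s = u"a\U0001D11E";   // 'a' + MUSICAL SYMBOL G CLEF (one surrogate pair)
        backspace_naive(s);                  // s now ends with a lone 0xD834: invalid UTF-16
        return 0;
    }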

So... Do you think that UTF-16 should be considered harmful?

Artyom
  • I tried copying the characters to a filename and tried to delete them and had no problems. Some Unicode characters read right to left and keyboard input handling sometimes changes to accommodate that (depending on the program used). Can you post the numeric codes for the specific characters you are having trouble with? –  Jun 26 '09 at 17:30
  • 1
    Have you tried to work with them in Notepad and see how this works? For example, edit a file name containing this character, put the cursor to the right of the character and press backspace. You'll see that in both Notepad and the file-name editing dialog it requires two presses of "backspace" to remove this character. –  Jun 27 '09 at 07:50
  • 17
    The double backspace behavior is mostly intentional http://blogs.msdn.com/michkap/archive/2005/12/21/506248.aspx –  Jun 27 '09 at 10:56
  • 64
    Not really correct. Let me explain: if you write "שָׁ", a compound character that consists of "ש", "ָ" and "ׁ" (vowel marks), then removing each one of them is logical: you remove one code point when you press "backspace" and remove the whole character including the vowels when you press "del". But you never produce an **illegal** state of the text -- illegal code points. Thus, the situation where you press backspace and get illegal text is incorrect. –  Jun 27 '09 at 12:43
  • Are you referring to how sin and shin are composed of two code points, and by deleting the code-point for the dot you get an "illegal" character? –  Jun 29 '09 at 16:39
  • 3
    No, you get "vowelless" writing. It is totally legal. More than that, in most cases vowel marks like these (shin/sin dots) are almost never written unless they are required to clarify something that is not obvious from context. For example, שׁם and שׂם are two different words, but from context you know which one the vowelless שם means. –  Jun 29 '09 at 17:24
  • 41
    CiscoIPPhone: If a bug is "reported several different times, by many different people", and then a couple years later a developer writes on a dev blog that "Believe it or not, the behavior is mostly intentional!", then (to put it mildly) I tend to think it's probably not the best design decision ever made. :-) Just because it's intentional doesn't mean it's not a bug. –  Mar 18 '10 at 01:18
  • 3
    For the record, I don't have problems with any of these characters in Apple's TextEdit.app (which uses Cocoa and thus UTF-16), but trying to insert them in Emacs (which uses a variant of UTF-8 internally) produces garbage. I do think that such bugs are not the fault of the character encoding, but of the lack of competence of the programmers involved. –  Aug 15 '10 at 08:20
  • 1
    BTW, I've just checked editing these letters; they don't give me any problems, either in Opera or in Windows 7. Opera seems to edit them properly, and so does Notepad. A file with these letters in the name has been created successfully. – Malcolm Dec 31 '10 at 14:00
  • 1
    @Malcolm, first, there is no problem creating such files - the question is about editing them. I've tested on XP; maybe MS fixed this issue in 7. Take a look at how backspace works: do you need to hit it once or twice? –  Dec 31 '10 at 14:52
  • 1
    Once. I specially checked for this issue, and in Windows 7 the problem with the characters beyond BMP seems to be gone. Maybe this problem had been solved even in Vista. – Malcolm Jan 01 '11 at 00:45
  • 2
    @Malcolm - even though it does not make UTF-16 less harmful :-) –  Jan 01 '11 at 08:21
  • 1
    Well, I don't think that mere existence of crappy implementations indicates harmfulness of the standard at all. :p This is just an update on the current situation: how problematic characters beyond BMP in Windows (and Opera) are now. – Malcolm Jan 01 '11 at 14:35
  • 145
    Great post. UTF-16 is indeed the "worst of both worlds": UTF8 is variable-length, covers all of Unicode, requires a transformation algorithm to and from raw codepoints, restricts to ASCII, and it has no endianness issues. UTF32 is fixed-length, requires no transformation, but takes up more space and has endianness issues. So far so good, you can use UTF32 internally and UTF8 for serialization. But UTF16 has no benefits: It's endian-dependent, it's variable length, it takes lots of space, it's not ASCII-compatible. The effort needed to deal with UTF16 properly could be spent better on UTF8. – Kerrek SB Jun 09 '11 at 11:38
  • 1
    UTF-8 has the same caveats as UTF-16. Buggy UTF-16 handling code exists; although probably less than buggy UTF-8 handling code (most code handling UTF-8 thinks it's handling ASCII, Windows-1252, or 8859-1) – Ian Boyd Aug 12 '11 at 00:10
  • 26
    @Ian: UTF-8 *DOES NOT* have the same caveats as UTF-8. You cannot have surrogates in UTF-8. UTF-8 does not masquerade as something it’s not, but most programmers using UTF-16 are using it wrong. I know. I've watched them again and again and again and again. – tchrist Aug 15 '11 at 19:44
  • @tchrist UTF-16 can sometimes require more than 16-bits to represent a single code-point, UTF-8 can sometimes require more than 8-bits to represent a single code-point. UTF-16 can sometimes use multiple code points to represent a single character, UTF-8 can sometimes use multiple code points to represent a single character. `U+0061 U+0301 U+0317` forms one character: `á̗`. When converted to UTF-8 the byte sequence (without the BOM) is `61 CC 81 CC 97`. When converted to UTF-16 the byte sequence (without the BOM) is `61 00 01 03 17 03`. Same caveats. – Ian Boyd Aug 15 '11 at 21:21
  • 17
    @Ian You are welcome to spout off the theory all you want: it’s wasted on me. I teach this stuff myself. I can promise you that the UTF-16 problems are everywhere. These people can’t even get code points right. No one using UTF-8 ever screws that up. It’s these damned two-bit use-to-be-UCS2 P.O.S. UTF-16 interfaces that screw people up. That is the real world. That is the calibre of the average UTF-16 programmer out there. What are you some Microsoft apologist or something? It’s a screwed-up choice that has caused endless misery in this world: you can’t make a silk purse from a sow’s ear. – tchrist Aug 15 '11 at 22:30
  • 1
    i don't see how someone fresh to a subject can be stymied simply because it is named "UTF-16". Yet if you change the name to "UTF-8" it becomes obvious and intuitive. – Ian Boyd Aug 16 '11 at 00:00
  • 16
    @Ian: You have listed *common* caveats, that doesn't mean that the two have the *same* caveats. UTF-16 has more: It has endianness issues, and it does not contain ASCII as a subset. Those two alone make a huge difference. – Kerrek SB Aug 16 '11 at 12:25
  • 1
    @Kerrek S: In terms of writing code to handle caveats, endian order is not an issue for programmers. Take me, for example, as a programmer who is dealing with UTF-8 and UTF-16: multiple-character diacritics, the BMP and surrogate pairs are (still) difficult to handle. Endian order is trivial. UTF-16 not containing an ASCII subset? What is UTF-16 missing? ASCII has `ACK` (0x06), UTF-8 has `ACK` (0x06), UTF-16 has ACK (0x0006). – Ian Boyd Aug 16 '11 at 13:19
  • 6
    @Ian Boyd: Just so you know, you're arguing with the author of http://98.245.80.27/tcpc/OSCON2011/gbu.pdf See also http://98.245.80.27/tcpc/OSCON2011/index.html – Christoffer Hammarström Aug 18 '11 at 20:15
  • 18
    Also, UTF-8 doesn't have the problem because everyone treats it as a variable width encoding. The reason UTF-16 has the problem is because everyone treats it like a fixed width encoding. – Christoffer Hammarström Aug 18 '11 at 20:22
  • @Christoffer Hammarström You can't blame one non-fixed width encoding for being non-fixed width, while embracing another non-fixed width encoding because it's non-fixed width. – Ian Boyd Aug 18 '11 at 20:57
  • UTF-32 good enough for you? –  Aug 18 '11 at 22:09
  • 3
    Tell me about it, I've been shouting this at my stupid Windows programming colleagues for years. The only safe encodings are UTF32 and UTF8 (as long as people don't treat it as a fixed-length encoding). –  Aug 19 '11 at 02:35
  • 2
    Can you elaborate on the assertion that "Python encodes such characters incorrectly"? How would you even write this into a file? AFAIK, Python cannot read files whose encoding is not a superset of ASCII (at byte level). – Ringding Aug 19 '11 at 09:31
  • 3
    Another example: JavaScript's `charCodeAt` selects UTF-16 words, not Unicode characters. This arguably isn't a bug, but applications that assume charCodeAt works on Unicode characters will be broken. – Joey Adams Aug 21 '11 at 00:13
  • i think [this](http://meyerweb.com/eric/comment/chech.html) link provides some useful context to your question, though not related to an answer – Ryathal Dec 07 '11 at 13:57
  • I think @Ringding is right; the Python example seems flawed. In Python 2, `unicode('', 'utf-16')` takes the bytes of `''` (which are in the source file's encoding, typically UTF-8) and decodes them as UTF-16; that obviously goes wrong. – Fred Foo Apr 20 '12 at 22:00
  • 9
    Please enjoy the grand summary of the popular POV at: http://www.utf8everywhere.org/ – Pavel Radzivilovsky Apr 20 '12 at 20:26
  • I'd rather see UTF-8 as well; but I have to say, I've seen just as many people who have said "my char strings are now UTF-8" and not dealt with the problems therein at all. – Billy ONeal Apr 27 '12 at 15:59
  • 13
    http://www.utf8everywhere.org/ – alex Apr 30 '12 at 03:59
  • @larsmans: That's because using regular quotes like that tells the interpreter that there's bytes inside. If you use `u''`, it should work correctly. Python2 has the "helpful" feature of automatically encoding/decoding between utf8/ascii in some situations. In python3 your example works because the quotes denote a type that contains unicode codepoints, not bytes. – Daenyth Apr 30 '12 at 22:05
  • 1
    @tchrist: "UTF-8 DOES NOT have the same caveats as UTF-8." But surely UTF-8 has EXACTLY the same caveats that UTF-8 does? (Sorry, I couldn't help myself...) – Teemu Leisti Jul 16 '13 at 11:29
  • 2
    http://www.theregister.co.uk/2013/10/04/verity_stob_unicode/ – Pavel Radzivilovsky Oct 30 '13 at 12:54
  • 2
    I am not sure what the caveats of UTF-8 are, but at least those caveats (if they exist) are a lot more visible than with UTF-16, because a non-ASCII result will look broken immediately. – Eonil Nov 15 '13 at 21:39

20 Answers

339

This is an old answer.
See UTF-8 Everywhere for the latest updates.

Opinion: Yes, UTF-16 should be considered harmful. The very reason it exists is because some time ago there used to be a misguided belief that widechar is going to be what UCS-4 now is.

Despite the "anglo-centrism" of UTF-8, it should be considered the only useful encoding for text. One can argue that source codes of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed. But when they do, text is not only for human readers.

On the other hand, UTF-8 overhead is a small price to pay while it has significant advantages. Advantages such as compatibility with unaware code that just passes strings with char*. This is a great thing. There are few useful characters which are SHORTER in UTF-16 than they are in UTF-8.

I believe that all other encodings will die eventually. This means that MS-Windows, Java, ICU, python will stop using it as their favorite. After long research and discussions, the development conventions at my company ban using UTF-16 anywhere except OS API calls, and this despite the importance of performance in our applications and the fact that we use Windows. Conversion functions were developed to convert always-assumed-UTF8 std::strings to native UTF-16, which Windows itself does not support properly.

To people who say "use what is needed where it is needed", I say: there's a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise. In particular, I think adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++0x. What must be demanded from STL implementations though is that every std::string or char* parameter be considered Unicode-compatible.

I am also against the "use what you want" approach. I see no reason for such liberty. There's enough confusion on the subject of text, resulting in all this broken software. Having said the above, I am convinced that programmers must finally reach consensus on UTF-8 as the one proper way. (I come from a non-ASCII-speaking country and grew up on Windows, so I'd be the last one expected to attack UTF-16 on religious grounds).

I'd like to share more information on how I do text on Windows, and what I recommend to everyone else for compile-time checked Unicode correctness, ease of use and better cross-platform portability of the code. The suggestion substantially differs from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research of these recommendations resulted in the same conclusion. So here goes:

  • Do not use wchar_t or std::wstring in any place other than at the point adjacent to APIs accepting UTF-16.
  • Don't use _T("") or L"" UTF-16 literals (These should IMO be taken out of the standard, as a part of UTF-16 deprecation).
  • Don't use types, functions or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
  • Yet, keep _UNICODE always defined, to avoid char* strings passed to WinAPI getting silently compiled
  • std::strings and char* anywhere in the program are considered UTF-8 (if not said otherwise)
  • All my strings are std::string, though you can pass char* or a string literal to convert(const std::string &).
  • Only use Win32 functions that accept widechars (LPWSTR). Never those which accept LPTSTR or LPSTR. Pass parameters this way:

    ::SetWindowTextW(Utils::convert(someStdString or "string literal").c_str())
    

    (The policy uses conversion functions below.)

  • With MFC strings:

    CString someoneElse; // something that arrived from MFC. Converted as soon as possible, before passing any further away from the API call:
    
    std::string s = str(boost::format("Hello %s\n") % Convert(someoneElse));
    AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK);
    
  • Working with files, filenames and fstream on Windows:

    • Never pass std::string or const char* filename arguments to the fstream family. MSVC STL does not support UTF-8 arguments, but has a non-standard extension which should be used as follows:
    • Convert std::string arguments to std::wstring with Utils::Convert:

      std::ifstream ifs(Utils::Convert("hello"),
                        std::ios_base::in |
                        std::ios_base::binary);
      

      We'll have to manually remove the conversion when MSVC's attitude to fstream changes.

    • This code is not multi-platform and may have to be changed manually in the future
    • See fstream unicode research/discussion case 4215 for more info.
    • Never produce text output files with non-UTF8 content
    • Avoid using fopen() for RAII/OOD reasons. If necessary, use _wfopen() and WinAPI conventions above.

// For interface to win32 API functions
std::string convert(const std::wstring& str, unsigned int codePage /*= CP_UTF8*/)
{
    // One possible implementation (sketch, no error handling): WideCharToMultiByte.
    int size = WideCharToMultiByte(codePage, 0, str.c_str(), (int)str.size(),
                                   NULL, 0, NULL, NULL);
    std::string result(size, '\0');
    WideCharToMultiByte(codePage, 0, str.c_str(), (int)str.size(),
                        &result[0], size, NULL, NULL);
    return result;
}

std::wstring convert(const std::string& str, unsigned int codePage /*= CP_UTF8*/)
{
    // One possible implementation (sketch, no error handling): MultiByteToWideChar.
    int size = MultiByteToWideChar(codePage, 0, str.c_str(), (int)str.size(), NULL, 0);
    std::wstring result(size, L'\0');
    MultiByteToWideChar(codePage, 0, str.c_str(), (int)str.size(), &result[0], size);
    return result;
}

// Interface to MFC
std::string convert(const CString &mfcString)
{
#ifdef UNICODE
    return Utils::convert(std::wstring(mfcString.GetString()));
#else
    return mfcString.GetString();   // This branch is deprecated.
#endif
}

CString convert(const std::string &s)
{
#ifdef UNICODE
    return CString(Utils::convert(s).c_str());
#else
    Exceptions::Assert(false, "Unicode policy violation. See W569"); // This branch is deprecated as it does not support unicode
    return s.c_str();   
#endif
}
Pavel Radzivilovsky
  • 5
    I would like to add a little comment. Most Win32 "ASCII" functions receive strings in the local (ANSI code page) encoding. For example, std::ifstream can accept a Hebrew file name if the locale encoding is a Hebrew one like 1255. All that is needed to support these encodings on Windows is for MS to add a UTF-8 code page to the system. This would make life much simpler: all "ASCII" functions would be fully Unicode capable. –  Dec 08 '09 at 15:13
  • FWIW the AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK) example should probably really have been a call to a wrapper of that function that accepts std::string(s). Also, the Assert(false) in the functions toward the end should be replaced with static assertions. – Assaf Lavie Dec 09 '09 at 03:38
  • 39
    I can't agree. The advantages of utf16 over utf8 for many Asian languages completely dominate the points you make. It is naive to hope that the Japanese, Thai, Chinese, etc. are going to give up this encoding. The problematic clashes between charsets are when the charsets mostly seem similar, except with differences. I suggest standardising on: fixed 7bit: iso-irv-170; 8bit variable: utf8; 16bit variable: utf16; 32bit fixed: ucs4. –  Dec 09 '09 at 15:24
  • 82
    @Charles: thanks for your input. True, some BMP characters are longer in UTF-8 than in UTF-16. But, let's face it: the problem is not in bytes that BMP Chinese characters take, but the software design complexity that arises. If a Chinese programmer has to design for variable-length characters anyway, it seems like UTF-8 is still a small price to pay compared to other variables in the system. He might use UTF-16 as a compression algorithm if space is so important, but even then it will be no match for LZ, and after LZ or other generic compression both take about the same size and entropy. –  Dec 09 '09 at 18:04
  • 32
    What I basically say is that simplification offered by having One encoding that is also compatible with existing char* programs, and is also the most popular today for everything is unimaginable. It is almost like in good old "plaintext" days. Want to open a file with a name? No need to care what kind of unicode you are doing, etc etc. I suggest we, developers, confine UTF-16 to very special cases of severe optimization where a tiny bit of performance is worth man-months of work. –  Dec 09 '09 at 18:08
  • 5
    Well, if I had to choose between UTF-8 and UTF-16, I would definitely stick to UTF-8 as it has no BOM, is ASCII-compliant and has the same encoding scheme for any plane. But I have to admit that UTF-16 is simpler and more efficient for most BMP characters. There's nothing wrong with UTF-16 except the psychological aspects (mostly fixed-size isn't fixed size). Sure, one encoding would be better, but since both UTF-8 and UTF-16 are widely used, they have their advantages. – Malcolm Feb 20 '10 at 00:10
  • 10
    @Malcolm: UTF-8, unfortunately, has a BOM too (0xEFBBBF). As silly as it looks (no byte order problem with a single-byte encoding), this is true, and it is there for a different reason: to signal that this is a UTF stream. I have to disagree with you about BMP efficiency and UTF-16 popularity. It seems that the majority of UTF-16 software does not support it properly (ex. all of the Win32 API - which I am a fan of) and this is inherent; the easiest way to fix these seems to be to switch them to another encoding. The efficiency argument is only true for a very narrow set of uses (I use Hebrew, and even there it is not). –  Feb 20 '10 at 06:53
  • 1
    Well, what I meant is that you don't have to worry about byte order. UTF-8 can have a BOM indeed (it is actually the U+FEFF BOM character encoded in 3 bytes), though it's neither required nor recommended according to the standard. As for the APIs, I think the problem is that they were designed when surrogate pairs were either non-existent yet, or not really adopted. And when something gets patched up, it's never as good as redesigning from scratch. The only (painful) way is to drop any backwards compatibility and redesign the APIs. Should they switch to UTF-8 in the process, I don't know. – Malcolm Feb 20 '10 at 21:46
  • 2
    @Malcolm, I think the natural way of this redesign is thru changing existing ANSI APIs. This way existing broken programs will unbreak (see my answer). This adds to the argument: UTF-16 must die. –  Mar 07 '10 at 08:14
  • 3
    I'm sorry, I didn't really get the idea why the transition to UTF-8 should be less painful. I also think that the inconsistency in C++ makes it worse. Say, Java is very specific about characters: char[] is no more than a char array, String is a string and Character is a character. Meanwhile, C++ is a mess with all the new stuff added to an existing language. To my mind, they should've abandoned any backwards compatibility and designed C++ in a way that doesn't allow mixing up structural programming and OOP, or Unicode and other encodings. Not that I want to start a holy war, that's merely my opinion. – Malcolm Mar 07 '10 at 16:59
  • 7
    UTF-8's disadvantage is NOT a small price to pay at all. Looking for any character is an O(n) operation, and other more complex operations can be far far worse than with UTF-16. Also UTF-8 is variable-length, just as UTF-16, so what's the point? UTF-8 was designed for storage and interoperability with ASCII. UTF-16 is the preferred way to store strings in memory, as anything outside the BMP is incredibly rare (you're writing in Klingon?). With a little trick, storing characters outside of the BMP in a hash or map, UTF-16 can have constant processing time. – Mircea Chirea Mar 17 '10 at 14:18
  • 2
    @iconiK: non-english BMP is also quite rare. Consider all program sources and markup languages. One should have very good reasons to use UTF-16. See what is going on in Linux world wrt unicode to measure the price of breaking changes. –  Mar 17 '10 at 16:02
  • 17
    Linux has had a specific requirement when choosing to use UTF-8 internally: compatibility with Unix. Windows didn't need that, and thus when the developers implemented Unicode, they added UCS-2 versions of almost all functions handling text and made the multibyte ones simply convert to UCS-2 and call the other ones. They later replaced UCS-2 with UTF-16. Linux on the other hand kept to 8-bit encodings and thus used UTF-8, as it's the proper choice in that case. – Mircea Chirea Mar 17 '10 at 17:56
  • 4
    you may wish to read my answer again. Windows does not support UTF-16 properly to date. Also, the reason for choosing UCS-2 was different. Again, see my answer. For linux, I believe the main reason was compatibility not with unix but with existing code - for instance, if your ANSI app copies files, getting names from command arguments and calling system APIs, it will remain completely intact with UTF-8. Isn't that wonderful? –  Mar 17 '10 at 20:08
  • 7
    @Pavel: The bug you linked to (Michael Kaplan's blog entry) has long been resolved by now. Michael said in the post already that it's fixed in Vista and I can't reproduce it on Windows 7 as well. While this doesn't fix legacy systems running on XP, saying that »there is still no proper support« is plain wrong. – Joey Apr 02 '10 at 13:37
  • @Johannes: [1] many thanks for the info. [2] IMO a programmer, today, should be able to write programs that support windows XP. It is still a popular one, and I don't know of a windows update that fixes it. –  Apr 03 '10 at 10:19
  • 4
    Well, the program works just fine; it just has a little trouble dealing with astral planes, but that's an OS issue, not one with your program. It's like asking that current versions of Uniscribe are backported to old OSes that people on XP can enjoy a few scripts that would render improperly before. It's not something MS does. Besides, XP is almost a decade old by now and supporting it becomes a major burden in some cases (see for example the reasoning why Paint.NET will require Vista with its 4.0 release). Mainstream support for that OS has already ended, too; only security bugs are fixed now – Joey Apr 03 '10 at 11:22
  • 2
    Still not convincing to use UTF-16 for in-memory presentation of strings on windows :) I wish Windows7 guys would extend their support of already existing #define of CP_UTF8 instead.. –  Apr 03 '10 at 17:50
  • 3
    @Pavel Radzivilovsky: I fail to see how your code, using UTF-8 everywhere, will protect you from bugs in the Windows API? I mean: You're copying/converting strings for all calls to the WinAPI that use them, and still, if there is a bug in the GUI, or the filesystem, or whatever system handled by the OS, the bug remains. Now, perhaps your code has a specific UTF-8 handling functions (search for substrings, etc.), but then, you could have written them to handle UTF-16 instead, and avoid all this bloated code (unless you're writing cross-platform code... There, UTF-8 could be a sensible choice) – paercebal Sep 04 '10 at 12:20
  • 34
    @Pavel Radzivilovsky: BTW, your writings about *"I believe that all other encodings will die eventually. This means that MS-Windows, Java, ICU, python will stop using it as their favorite."* and *"In particular, I think adding wchar_t to C++ was a mistake, and so are the unicode additions to C++0x."* are either quite naive or very very arrogant. And this is coming from someone coding at home with a Linux and who is happy with the UTF-8 chars. To put it bluntly: **It won't happen**. – paercebal Sep 04 '10 at 12:28
  • 2
    @paercebal: If majority of the code is API calls, this is a very simple code. Typically, majority of code dealing with strings is libraries that treat them as cookies, and they are optimized for. Hence, the bloating argument fails. As for the 'favorite utf16' for ICU and python, this is very questionable: these tools use UTF-16 internally, and changing it as a part of the evolution is the easiest. Can happen on any major release, coz it doesn't break the interfaces. –  Sep 04 '10 at 21:59
  • 10
    In ICU we already see more and more UTF-8 interfaces and optimizations. However, UTF-16 works perfectly well, and makes complicated lookup efficient, more than with UTF-8. We will not see ICU drop UTF-16 internally. UTF-16 in memory, UTF-8 on the wire and on disk. All is good. –  Oct 06 '10 at 17:33
  • 2
    @Steven, It looks like differentiating between wire and RAM is not a small thing as it may seem. BTW, comparison is cheaper with UTF8. I agree that ICU is certainly a major player on this market, and there's no need to "drop" support of anything. The simplification of application design and testing with UTF8 is exactly what will, in my humble opinion, drive UTf-16 to extinction, and the sooner the better. –  Oct 08 '10 at 14:50
  • 3
    @Pavel Radzivilovsky I meant, drop UTF-16 as the internal processing format. Can you expand on 'not a small thing'? And, anyways, UTF-16/UTF-8/UTF-32 have a 1:1:1 mapping. I'm much more interested in seeing non-Unicode encodings die. As far as UTF-8 goes for simplification, you say "they can just pass strings as char*"- right, and then they assume that the char* is some ASCII-based 8-bit encoding. Plenty of errors creep in when toupper(), etc, is used on UTF-8. It's not wonderful, but it is helpful. –  Oct 08 '10 at 15:46
  • 1
    @Steve First and foremost I agree about non-unicode. There's no argument about that. Practically, it already happened, they are already dead, in this exact sense: any non-unicode operation on a string is considered a bug just like any other software bug, or a 'text crime' in my company's slang. It is true that char* is misleading many into unicode bugs as well. Good luck with toupper() on a UTF-8 string, or, say, with assuming that ICU toupper does not change the number of characters (as in German eszett converting to SS). After the standard has been established, there's no more excuse for bugs. –  Oct 09 '10 at 22:55
  • 7
    @Steve, 2; and then we come to a more subtle thing, which is everything around human engineering and safety and designing proper way of work for a developer to do less and for the machine to do more. This is exactly where UTF-16 doesn't fit. Most applications do not reverse or even sort strings. Most often strings are treated as cookies, such as a file name here and there, concatenated here and there, embedded programming languages such as SQL and other really simple transformations. In this world, there's very little reason to have different format in RAM than on the wire. –  Oct 09 '10 at 23:00
  • 1
    @Pavel: "...that widechar is going to be what UCS-4 now is." this is incorrect in general since widechar is not fixed to be 2 bytes in size, unless you restrict yourself to Windows. You should write "UCS-2" instead of "widechar". –  Dec 10 '10 at 10:16
  • 2
    @ybungalobill Right; I should edit this. In fact, I will do this when wchar_t is standardized to hold one Unicode character. –  Dec 11 '10 at 22:51
  • 10
    @Pavel: In fact your sentence is just wrong, because wchar_t is not meant to be UTF-16, it has absolutely no connection to UTF-16, and it is already UCS-4 on some compilers. wchar_t is a C++ type that is actually (quote from the standard) "a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales". So the only problem here is your system (windows) that doesn't have a UCS-4 locale. –  Dec 12 '10 at 09:36
  • 9
    I'm mostly impressed with how this long rant completely fails in arguing its point, at least outside the narrow world of having to deal with UTF-16 in C and pointers. That might be considered dangerous, but that is if anything C's fault, not UTF-16. – Lennart Regebro Dec 21 '10 at 00:49
  • 7
    Well, as I mentioned earlier, I didn't find this post very convincing either. This post goes into details of handling UTF-16 in certain APIs or languages. If the software doesn't handle the standard properly, that's a problem. But what's wrong with the encoding itself anyway? If some software implements only half of the standard, that's not the standard's problem. – Malcolm Dec 29 '10 at 16:56
  • 4
    There are so many things wrong in the bullets that they can't even be captured in a comment. But probably the most dangerous one is to store UTF-8 in std::string in a Windows environment. Problem is, everything in the Windows world assumes that char* are strings in the current system code page. Use one wrong API on that string, and you are assured of many hours of debugging. The other problem is the religious recommendation for UTF-8 no matter what. "there's a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise" is pushed, but no advantage is given. –  May 28 '11 at 08:54
  • 3
    Wow, that's a really insightful comment. I'll start converting all my apps to UTF-8 right now. Thanks! –  Jun 11 '11 at 14:00
  • 6
    @Mihai there *is* this advantage. you’ll start noticing it when you don’t do it and get cryptic runtime encoding exceptions nobody can possibly understand nor track back to its source. python 3 has made the jump, and guess what: the frequent encoding issues i had in python 2 magically disappeared completely. – flying sheep Aug 04 '11 at 06:34
  • 1
    @flying sheep: And the reason you had problems in Python2 and not in Python3 is that Py3 is much stricter with distinguishing between Unicode and bytes. Apart from how you encode literal strings in your source file (and that can be changed to any other encoding if wanted without problems), you can't - or more exactly you SHOULDN'T - detect what encoding python is using internally. When communicating with the rest of the world you have to specify the encoding anyhow to avoid problems (otherwise you get a platform specific encoding). – Voo Aug 15 '11 at 15:21
  • 1
    Yeah, I know. I love it this way: Input gets converted to Strings while reading it and the error happens there (if any), and not anywhere else in the code. Makes us guess less where the fuck it sneaked in. (Just like Scala’s `Option` keeps Scala programmers from encountering `NullPointerException`s: An implicit, error-prone process is replaced by a deliberate choice I make) – flying sheep Aug 15 '11 at 15:49
  • 5
    Sorry but our programming languages should hide the encoding from us as an implementation detail. We need datatypes that logically represent unicode characters without worrying about how they were stored on the harddrive. –  Aug 18 '11 at 15:52
  • 1
    I'll agree to use a variable-length encoding for text when I am given an O(1) way of accessing a random character from a string. UCS2 and UCS4 have this nice ability, and are therefore surely better suited to internal uses than UTF-{8,16}? – holdenweb Aug 19 '11 at 05:35
  • 13
    @holdenweb: You cannot access a random character even in UCS-4; remember that character != codepoint. Moreover, there is absolutely *no* application where accessing the nth character makes sense. All access to text is always sequential. – ybungalobill Aug 21 '11 at 12:19
  • 1
    @Pavel: Why doesn't the answer contain an implementation for convert()? – Gili Mar 22 '12 at 02:32
  • 11
    Guys, thank you all for the feedback, both positive and negative! The discussion has inspired me and friends to publish a clear manifesto on the subject. Please enjoy, share and support: http://www.utf8everywhere.org/ – Pavel Radzivilovsky Apr 20 '12 at 20:25
  • 1
    _T("") is not part ofthe standard, it is MS/Windows stuff. – ysdx Apr 20 '12 at 21:46
  • @PavelRadzivilovsky There is a memory leak in ::SetWindowTextW(Utils::convert(someStdString or "string literal").c_str()) ..., isn't there? – kiewic Jan 22 '13 at 05:40
  • There is no leak - the memory is freed properly by the destructor of std::wstring – Pavel Radzivilovsky Jan 22 '13 at 12:32
  • 1
    Python has switched away from UCS-2 (which was only used for windows builds): http://python.org/dev/peps/pep-0393/ – Tobu Mar 25 '14 at 11:12
  • The very start of this answer **is wrong.** The whole point of UTF-16 is to encode 21-bit code points into a 16-bit word space. – Michael Shaw Mar 25 '14 at 18:46
  • I strongly disagree with the "Don't use `TCHAR`" part, because TCHAR is actually the **key** for switching from "ANSI" and UTF-16 to UTF-8 with less pain. Create `MessageBoxU8` as a different function from both `MessageBoxA` and `MessageBoxW` and do the same for all other string functions, and that way you can develop new programs that support UTF-8 without breaking the old ones that don't. – Medinoc Oct 01 '14 at 13:54
  • Dear Medinoc, please consider that one of the major points of the utf8everywhere.org manifesto is to say that you should never be switching between ANSI and any unicode encoding or support ANSI in the first place. There should be no non-unicode-aware programs written, compiled or tested. – Pavel Radzivilovsky Oct 08 '14 at 19:14
  • The footnote on Python should mention the https://docs.python.org/3/library/io.html#io.StringIO API, as that is the preferred way to manipulate large amounts of text that nevertheless still fit into memory. The fact that it exposes a file like seek/tell API without providing support for O(1) code point indexing lets it default to avoiding the memory cost of providing the latter, and also provide a better foundation for the kinds of cursor based algorithms needed to manipulate graphemes and characters rather than code points. – ncoghlan May 07 '15 at 05:37
  • @ncoghlan, I would appreciate if you formulate the text to insert. Also, if you are into Python - it's high time to write to Python authors to change their internal string implementation to the better, UTF-8 way. – Pavel Radzivilovsky Jun 01 '15 at 15:32
  • I'm the author of http://developerblog.redhat.com/2014/09/09/transition-to-multilingual-programming-python/, and "UTF-8 everywhere" is a proposal that assumes every piece of software in the world is a POSIX program designed to manipulate streaming data. It's a wonderful design choice in that domain, but far more questionable elsewhere. The Python ecosystem, by contrast, also encompasses Windows, JVM and CLR based programming, and array oriented programming in addition to stream processing, with a suite of battle tested text manipulation algorithms optimised to run on fixed width encodings. – ncoghlan Jun 28 '15 at 04:05
  • 1
    As far as the text to insert goes: If you are manipulating text data in Python, and don't need O(1) code point indexing, then you should be using https://docs.python.org/3/library/io.html#io.StringIO as your data storage API, not `str`. https://docs.python.org/3/library/io.html#io.BytesIO is an alternative option for storing UTF-8 data directly in its encoded form. – ncoghlan Jun 28 '15 at 04:12
156

Unicode codepoints are not characters! Sometimes they are not even glyphs (visual forms).

Some examples:

  • Roman numeral codepoints like "ⅲ". (A single character that looks like "iii".)
  • Accented characters like "á", which can be represented as either a single combined character "\u00e1" or a character and separated diacritic "\u0061\u0301".
  • Characters like Greek lowercase sigma, which have different forms for middle ("σ") and end ("ς") of word positions, but which should be considered synonyms for search.
  • Unicode discretionary hyphen U+00AD, which might or might not be visually displayed, depending on context, and which is ignored for semantic search.

The only ways to get Unicode editing right is to use a library written by an expert, or become an expert and write one yourself. If you are just counting codepoints, you are living in a state of sin.
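
For a concrete illustration, a minimal C++ sketch: both strings below display as "á", yet they contain different numbers of code points, so code that equates code points with characters already disagrees with itself before surrogates even enter the picture:

    #include <cstdio>
    #include <string>

    int main()
    {
        // Precomposed: U+00E1 LATIN SMALL LETTER A WITH ACUTE (one code point).
        std::u32string precomposed = U"\u00E1";
        // Decomposed: U+0061 'a' followed by U+0301 COMBINING ACUTE ACCENT (two code points).
        std::u32string decomposed  = U"\u0061\u0301";

        // Both render as the same user-perceived character, yet the counts differ.
        std::printf("precomposed: %zu code points\n", precomposed.size()); // 1
        std::printf("decomposed:  %zu code points\n", decomposed.size());  // 2
        return 0;
    }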

  • 19
    This. Very much this. UTF-16 can cause problems, but even using UTF-32 throughout can (and will) still give you issues. – bcat Dec 24 '10 at 00:48
  • 11
    What is a character? You can define a code point as a character and get by pretty much just fine. If you mean a user-visible glyph, that’s something else. – tchrist Aug 11 '11 at 14:54
  • 7
    @tchrist sure for allocating space that definition is fine, but for anything else? Not so much. If you handle a combining character as a sole character (ie for a delete or "take first N characters" operation) you'll get strange and wrong behavior. If a code point has only meaning when combined with at least another you can't handle it on its own in any sensible manner. – Voo Aug 15 '11 at 15:28
  • 2
    bmp and astral is not the real issue. But allowing accented characters to have a separate diacritic is **horrible**. The guy who even thought of that idea deserved to have his head flushed in the toiletbowl – Pacerier Nov 24 '11 at 07:31
  • 2
    This doesn't really answer the question. You have to deal with variable-length graphemes regardless of which encoding you use. However, variable-length code points are purely *additional* complexity that UTF-8 and UTF-16 have that UTF-32 doesn't. – dan04 Dec 14 '11 at 04:00
  • 6
    @Pacerier, this is late to the party, but I have to comment on that. Some languages have very large sets of potential combinations of diacritics (c.f. Vietnamese, i.e. mệt Δ‘α»«). Having combinations rather than one character per diacritic is very helpful. – asthasr Apr 20 '12 at 21:23
  • 21
    a small note on terminology: *codepoints* **do** correspond to *unicode characters*; what Daniel is talking about here are *user-perceived characters*, which correspond to *unicode grapheme clusters* – Christoph Apr 21 '12 at 11:58
  • 6
    I agree with this answer. Internationalization is like software security : hard. There is all sort of edge case that you must account for, and the complexity is inherent to the domain. UTF-8 is great but its not a magic solution to make the pain go away. – Laurent Bourgault-Roy Dec 19 '12 at 18:44
  • 2
    True, but do you really care about "counting"? When do you have to count codepoints at all? You may have to determine the number of glyphs for rendering. But where is the use-case that requires counting codepoints? (we tend to think that this is important, because we so often need the number of bytes of a buffer..). However, detecting & iterating codepoints is important for all kinds of text processing, but counting, no. – frunsi Jan 29 '13 at 22:16
54

There is a simple rule of thumb on what Unicode Transformation Form (UTF) to use:

  • utf-8 for storage and communication
  • utf-16 for data processing
  • you might go with utf-32 if most of the platform API you use is utf-32 (common in the UNIX world)

Most systems today use utf-16 (Windows, Mac OS, Java, .NET, ICU, Qt). Also see this document: http://unicode.org/notes/tn12/

Back to "UTF-16 as harmful", I would say: definitely not.

People who are afraid of surrogates (thinking that they transform Unicode into a variable-length encoding) don't understand the other (way bigger) complexities that make mapping between characters and a Unicode code point very complex: combining characters, ligatures, variation selectors, control characters, etc.

Just read this series here http://www.siao2.com/2009/06/29/9800913.aspx and see how UTF-16 becomes an easy problem.
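
As a sketch of how mechanical the surrogate-pair part actually is (illustrative code, with a made-up helper name), iterating code points over a UTF-16 string looks roughly like this; the combining characters, ligatures and variation selectors mentioned above remain hard in any encoding:

    #include <cstdint>
    #include <string>
    #include <vector>

    // Decode a UTF-16 string into code points, pairing surrogates as we go.
    // (Sketch only: a real decoder would also report unpaired surrogates.)
    std::vector<uint32_t> codePoints(const std::u16string& s)
    {
        std::vector<uint32_t> out;
        for (size_t i = 0; i < s.size(); ++i) {
            char16_t u = s[i];
            if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size()) {
                char16_t v = s[i + 1];
                if (v >= 0xDC00 && v <= 0xDFFF) {
                    out.push_back(0x10000 + ((uint32_t(u) - 0xD800) << 10) + (v - 0xDC00));
                    ++i;                     // consumed two code units
                    continue;
                }
            }
            out.push_back(u);                // BMP code unit == code point
        }
        return out;
    }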

  • Would upvote twice if I could. – Andrey Tarantsov Dec 08 '10 at 02:18
  • 26
    Please add some examples where UTF-32 is common in the UNIX world! – maxschlepzig Jan 13 '11 at 13:39
  • 48
    No, you do not want to use UTF-16 for data processing. It's a pain in the ass. It has all the disadvantages of UTF-8 but none of its advantages. Both UTF-8 and UTF-32 are clearly superior to the vicious hack previously known as Mrs UTF-16, whose maiden name was UCS-2. – tchrist Aug 11 '11 at 14:18
  • 34
    Just yesterday I found a bug in the Java core String class’s `equalsIgnoreCase` method (also others in the String class) that would never have been there had Java used either UTF-8 or UTF-32. There are millions of these sleeping bombshells in any code that uses UTF-16, and I am sick and tired of them. UTF-16 is a vicious pox that plagues our software with insidious bugs forever and ever. It is clearly harmful, and should be deprecated and banned. – tchrist Aug 11 '11 at 14:53
  • 7
    @tchrist Wow so a non-surrogate aware function (because it was written when there were none and is sadly documented in such a way that makes it probably impossible to adapt - it specifies .toUpperCase(char)) will result in the wrong behavior? You're aware that a UTF-32 function with an outdated code point map wouldn't handle this any better? Also the whole Java API handles surrogates not especially well and the more intricate points about Unicode not at all - and with the later the used encoding wouldn't matter at all. – Voo Aug 15 '11 at 15:32
  • 5
    There is absolutely no reason to use UTF-16 for data processing. – niXar Aug 18 '11 at 20:02
  • 8
    -1: An unconditional `.Substring(1)` in .NET is a trivial example of something that breaks support for all of non-BMP Unicode. *Everything* that uses UTF-16 has this problem; it's too easy to treat it as a fixed-width encoding, and you see problems too rarely. That makes it an actively harmful encoding if you want to support Unicode. – Roman Starkov Dec 04 '12 at 15:24
  • The problems mentioned in this thread are just sloppy programming, IMHO. Programmers should be able to deal with corner cases correctly, and programs should have unit tests to detect any such errors often. No one blames the concept of linked lists for the (alleged) difficulty of inserting/removing at the ends; the problem of surrogates in UTF-16 should be no different. – musiphil Jun 18 '14 at 19:35
  • @musiphil Carpenters should be able to handle corner cases too, but you don't see many of them building dodecahedrons or using 2x4x4x4s to do their framing. An intelligent programmer/carpenter will pick the tools and materials that will make their job easier unless specifications dictate otherwise. – technosaurus Jul 15 '14 at 21:28
43

Yes, absolutely.

Why? It has to do with exercising code.

If you look at these codepoint usage statistics on a large corpus by Tom Christiansen you'll see that trans-8bit BMP codepoints are used several orders of magnitude more than non-BMP codepoints:

 2663710 U+002013 ‹–›  GC=Pd    EN DASH
 1065594 U+0000A0 ‹ ›  GC=Zs    NO-BREAK SPACE
 1009762 U+0000B1 ‹±›  GC=Sm    PLUS-MINUS SIGN
  784139 U+002212 ‹−›  GC=Sm    MINUS SIGN
  602377 U+002003 ‹ ›  GC=Zs    EM SPACE

 544 U+01D49E ‹𝒞›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL C
 450 U+01D4AF ‹𝒯›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL T
 385 U+01D4AE ‹𝒮›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL S
 292 U+01D49F ‹𝒟›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL D
 285 U+01D4B3 ‹𝒳›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL X

Take the TDD dictum: "Untested code is broken code", and rephrase it as "unexercised code is broken code", and think how often programmers have to deal with non-BMP codepoints.

Bugs related to not dealing with UTF-16 as a variable-width encoding are much more likely to go unnoticed than the equivalent bugs in UTF-8. Some programming languages still don't guarantee to give you UTF-16 instead of UCS-2, and some so-called high-level programming languages offer access to code units instead of code-points (even C is supposed to give you access to codepoints if you use wchar_t, regardless of what some platforms may do).
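
A minimal C++ sketch of why such bugs stay unnoticed (illustrative, deliberately naive code with made-up names): truncating to a fixed number of code units mangles UTF-8 on any accented letter, but mangles UTF-16 only when a non-BMP character shows up:

    #include <string>

    // The same naive mistake in two encodings: truncate to n code units.
    std::string    take8 (const std::string& s,    size_t n) { return s.substr(0, n); }
    std::u16string take16(const std::u16string& s, size_t n) { return s.substr(0, n); }

    int main()
    {
        // UTF-8: "é" is two bytes (C3 A9), so cutting after 4 bytes breaks on everyday text.
        std::string    cafe  = "caf\xC3\xA9";        // "café" in UTF-8
        std::string    bad8  = take8(cafe, 4);       // ends mid-character: noticed immediately

        // UTF-16: "é" is one code unit; only a non-BMP character exposes the same bug.
        std::u16string clef  = u"caf\u00E9 \U0001D11E";
        std::u16string bad16 = take16(clef, 6);      // splits the surrogate pair: rarely noticed
        return 0;
    }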

ninjalj
  • 16
    "Bugs related to not dealing with UTF-16 as a variable-width encoding are much more likely to go unnoticed than the equivalent bugs in UTF-8." This is the core of the issue, and hence, the correct answer. – Sean McMillan Aug 19 '11 at 13:01
  • 3
    Precisely. If your UTF-8 handling is borked, it'll be immediately obvious. If your UTF-16 handling is borked, you'll only notice if you put in uncommon Han characters or math symbols. – Mechanical snail Aug 01 '12 at 07:26
  • 1
    Very true, but on the other hand, what are unit tests for if you should depend on luck to find bugs on less frequent cases? – musiphil Jun 18 '14 at 21:09
  • @musiphil: so, when was the last time you created a unit test for non-BMP characters? – ninjalj Jun 19 '14 at 21:14
  • @ninjalj: I could be a text processing programmer, or a device driver programmer, or a graphics programmer; my answer to the specific question of yours shouldn't affect the truth/usefulness of the general statement that unit tests should be used to test corner cases. – musiphil Jun 19 '14 at 21:49
  • 1
    To elaborate on my earlier statement: even with UTF-8, you cannot be assured that you have covered all cases after only seeing some working examples. Same with UTF-16: you need to test whether your code works both with non-surrogates and with surrogates. (Someone could even argue that UTF-8 has at least four major cases while UTF-16 has only two.) – musiphil Jun 19 '14 at 22:00
40

I would suggest that thinking UTF-16 might be considered harmful says that you need to gain a greater understanding of unicode.

Since I've been downvoted for presenting my opinion on a subjective question, let me elaborate. What exactly is it that bothers you about UTF-16? Would you prefer if everything was encoded in UTF-8? UTF-7? Or how about UCS-4? Of course certain applications are not designed to handle every single character code out there - but they are necessary, especially in today's global information domain, for communication across international boundaries.

But really, if you feel UTF-16 should be considered harmful because it's confusing or can be improperly implemented (unicode certainly can be), then what method of character encoding would be considered non-harmful?

EDIT: To clarify: Why consider improper implementations of a standard a reflection of the quality of the standard itself? As others have subsequently noted, merely because an application uses a tool inappropriately, does not mean that the tool itself is defective. If that were the case, we could probably say things like "var keyword considered harmful", or "threading considered harmful". I think the question confuses the quality and nature of the standard with the difficulties many programmers have in implementing and using it properly, which I feel stem more from their lack of understanding how unicode works, rather than unicode itself.

  • 33
    -1: How about addressing some of Artyom's objections, rather than just patronising him? –  Jun 26 '09 at 16:12
  • 8
    BTW: When I started writing this article I almost wanted to title it "Should the Joel on Software article on Unicode be considered harmful?", because there are **many** mistakes. For example: the UTF-8 encoding takes up to 4 bytes and not 6. Also it does not distinguish between UCS-2 and UTF-16, which are really different -- and actually cause the problems I talk about. –  Jun 26 '09 at 16:12
  • 1
    My point is that those character points are designed and implemented for specific tasks. The "bugs" you describe are no different than the "bugs" one would encounter if you attempted to give input outside the scope of any application. –  Jun 26 '09 at 16:21
  • 2
    I agree with the last edit. The simplest example: we still use C and C++ though both languages use pointers and thus are not safe. – Malcolm Jun 26 '09 at 16:40
  • 32
    Also, it should be noted that when Joel wrote that article, the UTF-8 standard WAS 6 bytes, not 4. RFC 3629 changed the standard to 4 bytes several months AFTER he wrote the article. Like most anything on the internet, it pays to read from more than one source, and to be aware of the age of your sources. The link wasn't intended to be the "end all be all", but rather a starting point. –  Jun 26 '09 at 16:42
  • 1
    Actually, the problem is not with the standard. It is 100% OK. In fact, there are good implementations that work with UTF-16: ICU, Java Swing etc. But the problem is that there are too many **basic** bugs in the processing of surrogate pairs when working with UTF-16; as such, you should probably never pick UTF-16 as the internal encoding of new applications... Because there are **lots** of real-life examples where UTF-16's nature causes big trouble: even Stack Overflow can't deal with them –  Jun 27 '09 at 08:07
  • Not to try and flog a dead horse here, but if you shouldn't pick utf-16 as the reasonable standard, what should you pick? I'm interested in your perspective on what an acceptable alternative would be. For instance, a lot of my work involves working with ancient languages (greek, aramaic, hebrew, syriac, etc), and work a lot with these oddball unicode characters, so I'm constantly having to transition documents between utf-8, 16 and 32. –  Jun 29 '09 at 16:46
  • 7
    I would pick: utf-8 or utf-32, which are, respectively, a variable-length encoding in almost all cases (including the BMP) or a fixed-length encoding always. –  Jul 12 '09 at 06:50
  • 2
    Artyom, SO doesn't NEED to use UTF-16, since UTF-8 is the de facto standard for storage and communication of text, while UTF-16 is the de facto standard for processing of text. I don't know of any web page using UTF-16, and it wouldn't be really bold to do so, especially since a really popular language has no Unicode support: PHP (and UTF-16 isn't really easy to deal with; UTF-8 is the standard encoding in most Linux installs, where PHP is commonly run). – Mircea Chirea Mar 17 '10 at 14:31
  • 18
    @iconiK: Don’t be silly. **UTF-16 is absolutely not the *de facto* standard for processing text.** Show me a programming language more suited to text processing than Perl, which has always (well, for more than a decade) used abstract characters with an underlying UTF-8 representation internally. Because of this, every Perl program automatically handles all Unicode without the user having to constantly monkey around with idiotic surrogates. The length of a string is its count in code points, not code units. Anything else is sheer stupidity putting the backwards into backwards compatibility. – tchrist Aug 11 '11 at 14:50
  • 4
    @tchrist Great example! Because Perl handles combined characters just as badly as any other well-known programming language out there (its re module can handle them more sensibly, but it's probably not the only one in that regard). `abU+0063U+0327cde` - a substring containing the first 3 characters really shouldn't return abc, you know. So it doesn't handle "all of unicode perfectly", and the parts it does handle have to do with how it implemented its string library and not with what encoding is used. – Voo Aug 15 '11 at 15:48
  • 3
    @Voo: I have no idea what you are talking about. Oh wait, maybe I do. The [Unicode::GCString](http://search.cpan.org/perldoc?Unicode::GCString) class has strings made of grapheme clusters instead of code points. `perl -CS -MUnicode::GCString -le 'print Unicode::GCString::->new("ab\x{63}\x{327}cde")->substr(0, 3)'` will dutifully print out **abç** (which is `abc\x{327}`). Isn’t that all you want? Piece of cake. – tchrist Aug 15 '11 at 19:29
  • 1
    @tchrist Nice, but that's not the default string implementation in perl is it? (at least not for my version!) Nobody argues that additional libraries can solve the problem correctly, but I don't see why they'd have a harder time if the codepoints were encoded with UTF-16, 8 or 32. – Voo Aug 16 '11 at 21:13
  • 1
    What about `foreach (var c in myString)` in something like .NET? How often do you remember that this is UTF-16 you're dealing with? Let me guess: almost never. – Roman Starkov Dec 04 '12 at 15:31
  • @tchrist: The fact that Perl programs automatically handles Unicode (up to some degree) has little to do with the fact that Perl uses UTF-8 internally; it would still hold if Perl used UTF-16 or UTF-32 internally, as there's nothing in UTF-16 or UTF-32 that hinders correct Unicode handling. – musiphil Jun 18 '14 at 21:05
37

There is nothing wrong with the UTF-16 encoding. But languages that treat the 16-bit units as characters should probably be considered badly designed. Having a type named 'char' which does not always represent a character is pretty confusing. Since most developers will expect a char type to represent a code point or character, much code will probably break when exposed to characters beyond the BMP.

Note however that even using UTF-32 does not mean that each 32-bit code point will always represent a character. Due to combining characters, an actual character may consist of several code points. Unicode is never trivial.

BTW, there is probably the same class of bugs with platforms and applications which expect characters to be 8-bit and are fed UTF-8.

JacquesB
  • 57,310
  • 21
  • 127
  • 176
  • 12
    In Java's case, if you look at their timeline (http://www.java.com/en/javahistory/timeline.jsp), you see that the primary development of String happened while Unicode was 16 bits (it changed in 1996). They had to bolt on the ability to handle non-BMP code points, thus the confusion. – Kathy Van Stone Jun 26 '09 at 17:40
  • 10
    @Kathy: Not really an excuse for C#, though. Generally, I agree that there should be a `CodePoint` type, holding a single code point (21 bits), a `CodeUnit` type, holding a single code unit (16 bits for UTF-16), and a `Character` type would ideally have to support a complete grapheme. But that makes it functionally equivalent to a `String` ... – Joey Apr 02 '10 at 13:43
  • 1
    This answer is almost two years old, but I can't help but comment on it. "Having a type named 'char' which does not always represent a character is pretty confusing." And yet people use it all the time in C and the like to represent integer data that can be stored in a single byte. – JAB Jun 06 '11 at 15:53
  • And I've seen a *lot* of C code that doesn't handle character encoding correctly. – dan04 Aug 18 '11 at 23:06
  • 1
    C# has a different excuse: it was designed for Windows, and Windows was built on UCS-2 (it's very annoying that even today Windows APIs cannot support UTF-8). Plus, I think Microsoft wanted Java compatibility (.NET 1.0 had a Java compatibility library, but they dropped Java support very quickly--I'm guessing this is due to Sun's lawsuit against MS?) – Qwertie May 01 '12 at 00:05
20

My personal choice is to always use UTF-8. It's the standard on Linux for nearly everything. It's backwards compatible with many legacy apps. There is a very minimal overhead in terms of extra space used for non-Latin characters vs the other UTF formats, and there is a significant savings in space for Latin characters. On the web, Latin languages reign supreme, and I think they will for the foreseeable future. And to address one of the main arguments in the original post: nearly every programmer is aware that UTF-8 will sometimes have multi-byte characters in it. Not everyone deals with this correctly, but they are usually aware, which is more than can be said for UTF-16. But, of course, you need to choose the one most appropriate for your application. That's why there's more than one in the first place.
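
As a rough illustration of that space trade-off, here is a hedged Java sketch (the sample strings are arbitrary, and the exact byte counts naturally depend on the text):

import java.nio.charset.StandardCharsets;

public class EncodedSizes {
    public static void main(String[] args) {
        // ASCII is half the size in UTF-8, Cyrillic comes out about the same, CJK is larger in UTF-8.
        String[] samples = { "hello world", "привет мир", "こんにちは世界" };
        for (String s : samples) {
            System.out.printf("%s%nUTF-8:  %d bytes%nUTF-16: %d bytes%n",
                    s,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16LE).length);
        }
    }
}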

  • 3
    UTF-16 is simpler for anything inside the BMP; that's why it is used so widely. But I'm a fan of UTF-8 too; it also has no problems with byte order, which works to its advantage. – Malcolm Jun 26 '09 at 16:57
  • @Malcolm: UTF-16 also has no problems with byte order as it requires a BOM which specifies the order :-) – Joey Apr 02 '10 at 14:17
  • 2
    Theoretically, yes. In practice there are such things as, say, UTF-16BE, which means UTF-16 in big endian without a BOM. This is not something I made up, this is an actual encoding allowed in ID3v2.4 tags (ID3v2 tags suck, but are, unfortunately, widely used). And in such cases you have to define endianness externally, because the text itself doesn't contain a BOM. UTF-8 is always written one way and it doesn't have such a problem. – Malcolm Apr 02 '10 at 15:33
  • 23
    No, UTF-16 is not simpler. It is harder. It misleads and deceives you into thinking it is fixed width. All such code is broken and all the more so because you don't notice until it's too late. CASE IN POINT: I just found yet another stupid UTF-16 bug in the Java core libraries yesterday, this time in String.equalsIgnoreCase, which was left in UCS-2 braindeath buggery, and so fails on 16/17ths of valid Unicode code points. How long has that code been around? No excuse for it to be buggy. UTF-16 leads to sheer stupidity and an accident waiting to happen. Run screaming from UTF-16. – tchrist Aug 11 '11 at 14:42
  • 3
    @tchrist One must be a very ignorant developer to not know that UTF-16 is not fixed length. If you start with Wikipedia, you will read the following at the very top: "It produces a variable-length result of either one or two 16-bit code units per code point". Unicode FAQ says the same: http://www.unicode.org/faq//utf_bom.html#utf16-1. I don't know, how UTF-16 can deceive anybody if it is written everywhere that it is variable length. As for the method, it was never designed for UTF-16 and shouldn't be considered Unicode, as simple as that. – Malcolm Aug 13 '11 at 10:30
    @Malcolm: You write *"One must be a very ignorant developer to not know that UTF-16 is not fixed length."* Well, welcome to the real world! At least 19/20 of them know about it **at best** extremely nebulously, and cannot even process a string by code points to save their lives. That is the reality. I know, because I've tested them on it. Until Java deprecates the whole char botchup you will always have crappy code full of BMP idiocies. All APIs need to be by int-32 code point, and the un-Unicode-Character char-16 versions deprecated into oblivion. Really they do. – tchrist Aug 15 '11 at 19:40
  • 2
    @tchrist Do you have a source for your statistics? Though if good programmers are scarce, I think this is good, because we become more valuable. :) As for the Java APIs, char-based parts may eventually get deprecated, but this is not a guarantee that they won't be used. And they definitely won't be removed for compatibility reasons. – Malcolm Aug 16 '11 at 08:29
  • 1
    @rmeador I would like to kill a widespread myth: UTF-8 is actually NOT compatible with any 8-bit based encoding. This is a fact that everyone in Europe should know. UTF-8 is backward compatible with US-ASCII, nothing more. – user877329 Jul 13 '13 at 09:35
18

Well, there is an encoding that uses fixed-size symbols. I certainly mean UTF-32. But 4 bytes for each symbol is too much wasted space, so why would we use it in everyday situations?

To my mind, most problems appear from the fact that some software fell behind the Unicode standard and was not quick to correct the situation. Opera, Windows, Python, Qt - all of them appeared before UTF-16 became widely known or even came into existence. I can confirm, though, that in Opera, Windows Explorer, and Notepad there are no problems with characters outside the BMP anymore (at least on my PC). But anyway, if programs don't recognise surrogate pairs, then they don't use UTF-16. Whatever problems arise from dealing with such programs, they have nothing to do with UTF-16 itself.

However, I think that the problems of legacy software with only BMP support are somewhat exaggerated. Characters outside the BMP are encountered only in very specific cases and areas. According to the Unicode official FAQ, "even in East Asian text, the incidence of surrogate pairs should be well less than 1% of all text storage on average". Of course, characters outside the BMP shouldn't be neglected, because a program is not Unicode-conformant otherwise, but most programs are not intended for working with texts containing such characters. That's why, if they don't support them, it is unpleasant, but not a catastrophe.

Now let's consider the alternative. If UTF-16 didn't exist, then we wouldn't have an encoding which is well-suited for non-ASCII text, and all the software created for UCS-2 would have to be completely redesigned to remain Unicode-compliant. The latter most likely would only slow Unicode adoption. Also we wouldn't have been able to maintain compatibility with text in UCS-2 the way UTF-8 does in relation to ASCII.

Now, putting aside all the legacy issues, what are the arguments against the encoding itself? I really doubt that developers nowadays don't know that UTF-16 is variable length; it is written everywhere, starting with Wikipedia. UTF-16 is much less difficult to parse than UTF-8, if someone pointed out complexity as a possible problem. Also it is wrong to think that it is easy to mess up determining string length only in UTF-16. If you use UTF-8 or UTF-32, you still should be aware that one Unicode code point doesn't necessarily mean one character. Other than that, I don't think that there's anything substantial against the encoding.

Therefore I don't think the encoding itself should be considered harmful. UTF-16 is a compromise between simplicity and compactness, and there's no harm in using what is needed where it is needed. In some cases you need to remain compatible with ASCII and you need UTF-8, in some cases you want to work with Han ideographs and conserve space using UTF-16, in some cases you need universal representations of characters using a fixed-length encoding. Use what's more appropriate, just do it properly.
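
To show how little machinery the variable-length part of UTF-16 actually requires, here is a hedged Java sketch of the surrogate-pair arithmetic (the standard library already offers Character.toCodePoint and Character.toChars; this merely spells out the math):

public class SurrogateMath {
    // Combine a lead (high) and trail (low) surrogate into a code point.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Split a supplementary code point into its surrogate pair.
    static char[] encode(int codePoint) {
        int v = codePoint - 0x10000;
        return new char[] { (char) (0xD800 + (v >> 10)), (char) (0xDC00 + (v & 0x3FF)) };
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1D11E);        // MUSICAL SYMBOL G CLEF
        System.out.printf("U+1D11E -> %04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD1E
        System.out.printf("back to U+%04X%n", decode(pair[0], pair[1]));           // U+1D11E
    }
}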

Malcolm
  • 369
  • 4
  • 14
  • 1
    If a program uses UTF-16, shouldn't it be used "correctly"? –  Jun 26 '09 at 16:20
  • 2
    Certainly. But that doesn't mean that if someone can use something incorrectly, we shouldn't use it at all, right? – Malcolm Jun 26 '09 at 16:22
  • 21
    That's a rather blinkered, Anglo-centric view, Malcolm. Almost on a par with "ASCII is good enough for the USA - the rest of the world should fit in with us". – Jonathan Leffler Jun 26 '09 at 16:22
  • 28
    Actually I'm from Russia and encounter cyrillics all the time (including my own programs), so I don't think that I have an Anglo-centric view. :) Mentioning ASCII is not quite appropriate, because it's not Unicode and doesn't support specific characters. UTF-8, UTF-16, UTF-32 support the very same international character sets, they are just intended for use in their specific areas. And this is exactly my point: if you use mostly English, use UTF-8, if you use mostly cyrillics, use UTF-16, if you use ancient languages, use UTF-32. Quite simple. – Malcolm Jun 26 '09 at 16:36
  • 6
    But you might not know in advance if your application needs to handle characters outside the BMP, if the application accepts data like names. For example, some Asian names might be written with characters outside of the BMP. –  Jun 26 '09 at 21:28
  • 2
    Not true, Asian scripts like Japanese, Chinese or Arabic belong to BMP also. BMP itself is actually very large and certainly large enough to include all the scripts used nowadays, it's not like it includes only European scripts or something. No, if you are really going to encounter non-BMP characters, you'll almost definitely know it. – Malcolm Jun 27 '09 at 09:52
  • 1
    @Malcolm: The issue is more complex than that. See eg. http://www.jbrowse.com/text/unij.html –  Jun 29 '09 at 16:42
  • And what did I write wrong? Plane 2 contains only rare or historic symbols, and all other characters fit into the BMP and thus don't need surrogate pairs. – Malcolm Jun 29 '09 at 18:31
  • 2
    @Malcolm: The issue is that some people apparently have names containing these rare symbols, even though they do not otherwise occur in regular language. –  Jul 02 '09 at 15:50
  • There is, but it's not really a problem specific to Unicode, since standard encodings also don't include these characters. People use homophones and other ways to write such names, and that can be done in any encoding, including Unicode. Probably there are serious difficulties even with inputting rare symbols, so the situation doesn't happen all of a sudden, and users won't be surprised if the program fails to accept them correctly. – Malcolm Jul 02 '09 at 17:57
  • 16
    "Not true, Asian scripts like Japanese, Chinese or Arabic belong to BMP also. BMP itself is actually very large and certainly large enough to include all the scripts used nowadays" This is all so wrong. BMP contains 0xFFFF characters (65536). Chinese alone has more than that. Chinese standards (GB 18030) has more than that. Unicode 5.1 already allocated more than 100,000 characters. –  Jul 24 '09 at 08:11
  • 2
    It does, but characters outside BMP are not for everyday use, they can be used, for example, for old texts or to write names with rare hieroglyphs in them. And all characters that are commonly used fit into BMP. – Malcolm Aug 12 '09 at 17:57
  • 12
    @Malcolm: "BMP itself is actually very large and certainly large enough to include all the scripts used nowadays" Not true. At this point Unicode already allocated about 100K characters, way more than the BMP can accommodate. There are big chunks of Chinese characters outside the BMP. And some of them are required by GB-18030 (mandatory Chinese standard). Others are required by (non-mandatory) Japanese and Korean standards. So if you try to sell anything in those markets, you need beyond-BMP support. –  Sep 25 '09 at 21:41
  • 4
    If BMP is *that* far from having enough capacity to write normally in Chinese, how do they manage to write in such encodings as GBK or GB 2312? It is clear that support of other planes would be useful, but nonetheless. – Malcolm Sep 26 '09 at 13:59
  • 2
    All the currently used languages in the world fit in the BMP, in 64k code points. Anything outside of the BMP is not for current use of the language; it's for old characters, for old languages, for exotic characters, or even Klingon. If Chinese and/or Japanese and/or Korean need characters out of the BMP, how did they handle this before Unicode was widely adopted? Nearly all the encodings used in Asia were variable-length, using 8 or 16 bits per character. – Mircea Chirea Mar 17 '10 at 14:25
  • 2
    You DO NOT want Klingon users to be angry at you ;-) –  Feb 13 '11 at 18:43
  • Why would they be angry? – Malcolm Feb 13 '11 at 19:42
  • 8
    Anything that uses UTF-16 but can only handle narrow BMP characters is not actually using UTF-16. It is buggy and broken. The premise of the OP is sound: UTF-16 is harmful, because it leads naïve people into writing broken code. Either you can handle Unicode text, or you can't. If you cannot, then you are picking a subset, which is just as stupid as ASCII-only text processing. – tchrist Aug 11 '11 at 14:46
  • 2
    @tchrist Actually I think that most problems appear from the dated software which was designed for UCS-2. If you implement the standard today, you will be almost certainly aware that UTF-16 has a concept of surrogate pairs, since it is written about almost everywhere, starting with Wikipedia. And if you are an ignorant developer, you may not implement support for characters outside BMP in UTF-8 as well. Or even treat every text as if each byte represented only one character. – Malcolm Aug 11 '11 at 17:02
  • 1
    @Malcolm: I have never seen a Go or Perl programmer get BMP screwups, and those languages both use UTF-8 internally. In contrast, in every language that uses UTF-16 I have seen people make major screwups on BMP stuff everywhere I look. Where have you actually seen BMP screwups with UTF8-based programming languages? That seems utterly bizarre. Yes, you can do UTF-16 right, if you are genius smart and working on ICU. But for regular people, UTF-8 and UTF-32 are way less prone to error. – tchrist Aug 11 '11 at 20:30
  • @tchrist This is all too general. First of all, define what is a "BMP screwup" and where you look for them. Also, I've already said that there's a huge difference between something that was designed for UTF-16 and something that was designed for UCS-2 and then switched to UTF-16. And if we put aside API and language deficiencies, I don't really see what's so terribly difficult in handling surrogate pairs, especially in comparison with UTF-8. – Malcolm Aug 12 '11 at 09:07
  • 1
    @Malcolm: The BMP screwup was iterating through a Java String a `char` at a time instead of code point at a time in the `String` class's `equalsIgnoreCase` method. The code was never updated for UTF-16 so was stuck in UCS-2 brain damage, so does the wrong thing on anything outside the BMP. Plus it was using casemapping not casefolding, which is bound to get it into trouble. "", "", and "" are all casewise equal to each other, but the dumb UCS-2 Java method was too stupid to know that. ICU gets this right, and they use UTF-16 also, but not stupidly. – tchrist Aug 12 '11 at 09:17
  • 2
    Yes, equalsIgnoreCase() compares chars, not codepoints, and it is stated in the docs. I certainly agree that if this method compared strings using code points, it would be much simpler. But this isn't a problem of UTF-16 itself, it is a problem of a platform which was originally designed for UCS-2 - exactly what I'm talking about. – Malcolm Aug 12 '11 at 10:14
  • 1
    @Malcolm: I don't believe that merely documenting something not to work on 16/17ths of the Unicode space is acceptable. The correct solution is to make it do so. – tchrist Aug 12 '11 at 13:45
  • Consider it non-Unicode, that would clear the confusion. Historically this method is simply not meant to work reliably in conditions where I18N is required, it is not even locale-aware. Java has different facilities to do that. And I have to remind you that we're off the topic, still talking about subtleties of APIs, not about the UTF-16 itself. – Malcolm Aug 12 '11 at 16:08
  • 2
    @Malcolm: Unicode is very much not just about i18n!! It is about dealing with text sanely. There are innumerable non-ASCII code points that go into purely English text, of which the most common are EN DASH and NO-BREAK SPACE, with the curly single- and double-quotes not far behind. Also, casing is not supposed to need to be locale aware; yes, I am very familiar with the Turkic I issue, but that is an exception case, so to speak. Unicode also has algorithms for line breaking of text, collation (sorting) of text, etc, that works way better than not using them even on pure English. Mythbusted! – tchrist Aug 15 '11 at 19:38
  • @tchrist I never said that Unicode and I18N are the same, though thanks for the info. – Malcolm Aug 16 '11 at 08:46
  • 4
    -1 because you're defending using UTF-16 by denying the core purpose of Unicode: to unify all encoding. Having rare cases and basically ignoring them is a recipe for failure. – niXar Aug 18 '11 at 20:04
  • 1
    @niXar If some software is ignoring surrogate pairs, then it isn't using UTF-16 and therefore is non-Unicode. You wouldn't call a program which works only with ASCII as supporting UTF-8, right? I think you misunderstood my point a bit, maybe I should clarify this in the answer. I say that there's nothing wrong with the UTF-16, but there's a lot of legacy software, which was designed for UCS-2, and therefore happens to be mostly compliant with UTF-16. But *mostly* compliant is not compliant, and this causes problems. I also doubt that if UTF-16 didn't exist at all, it would get fixed sooner. – Malcolm Aug 18 '11 at 21:17
  • 1
    And I say there's plenty wrong with UTF-16, even ignoring UCS-2. – niXar Oct 05 '11 at 13:03
  • 1
    @Malcolm: +1: a programmer's claiming to support UTF-16 but actually supporting no more than UCS-2 is not UTF-16's fault at all. – musiphil Jun 18 '14 at 21:20
16

Years of Windows internationalization work, especially in East Asian languages, might have corrupted me, but I lean toward UTF-16 for internal-to-the-program representations of strings, and UTF-8 for network or file storage of plaintext-like documents. UTF-16 can usually be processed faster on Windows, though, so that's the primary benefit of using UTF-16 in Windows.

Making the leap to UTF-16 dramatically improved the adequacy of average products handling international text. There are only a few narrow cases when the surrogate pairs need to be considered (deletions, insertions, and line breaking, basically) and the average-case is mostly straight pass-through. And unlike earlier encodings like JIS variants, UTF-16 limits surrogate pairs to a very narrow range, so the check is really quick and works forward and backward.
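
A hedged Java sketch of that check (the helper is mine, not a library routine): because lead and trail surrogates occupy disjoint, easily tested ranges, a scanner can tell from any single code unit whether it has landed in the middle of a character and resynchronize in either direction:

public class Resync {
    // Move from an arbitrary UTF-16 index back to the start of the code point containing it.
    static int alignToCodePointStart(CharSequence s, int index) {
        // A trail surrogate (0xDC00-0xDFFF) can only be the second half of a pair,
        // so stepping back a single unit is always enough to resynchronize.
        if (index > 0 && Character.isLowSurrogate(s.charAt(index))
                && Character.isHighSurrogate(s.charAt(index - 1))) {
            return index - 1;
        }
        return index;
    }

    public static void main(String[] args) {
        String s = "x\uD834\uDD1Ey"; // "x", U+1D11E as a surrogate pair, "y"
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("index %d -> code point starts at %d%n", i, alignToCodePointStart(s, i));
        }
    }
}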

Granted, it's roughly as quick in correctly-encoded UTF-8, too. But there are also many broken UTF-8 applications that incorrectly encode surrogate pairs as two UTF-8 sequences. So UTF-8 doesn't guarantee salvation either.

IE has handled surrogate pairs reasonably well since 2000 or so, even though it typically converts them from UTF-8 pages to an internal UTF-16 representation; I'm fairly sure Firefox has got it right too, so I don't really care what Opera does.

UTF-32 (aka UCS4) is pointless for most applications since it's so space-demanding, so it's pretty much a nonstarter.

JasonTrue
  • 9,001
  • 1
  • 32
  • 49
  • 6
    I didn't quite get your comment on UTF-8 and surrogate pairs. Surrogate pairs is only a concept that is meaningful in the UTF-16 encoding, right? Perhaps code that converts directly from UTF-16 encoding to UTF-8 encoding might get this wrong, and in that case, the problem is incorrectly reading the UTF-16, not writing the UTF-8. Is that right? – Craig McQueen Jun 27 '09 at 23:54
  • 11
    What Jason's talking about is software that deliberately implements UTF-8 that way: create a surrogate pair, then UTF-8 encode each half separately. The correct name for that encoding is CESU-8, but Oracle (e.g.) misrepresents it as UTF-8. Java employs a similar scheme for object serialization, but it's clearly documented as "Modified UTF-8" and only for internal use. (Now, if we could just get people to READ that documentation and stop using DataInputStream#readUTF() and DataOutputStream#writeUTF() inappropriately...) –  Jun 28 '09 at 14:35
  • AFAIK, UTF-32 is still a variable length encoding, and not equal to UCS4, which is a specific range of code points. – Eonil Nov 15 '13 at 21:41
  • @Eonil, UTF-32 will only ever be distinguishable from UCS4 if we have a Unicode standard that features something like a UCS5 or larger. – JasonTrue Nov 15 '13 at 21:49
  • @JasonTrue Still, only the results are equal coincidentally, not guaranteed by design. The same thing happened with 32-bit memory addressing, Y2K, UTF-16/UCS-2. Or do we have any guarantee of that equality? If we have, I would gladly use that. But I don't want to write *possibly breakable* code. I am writing character-level code, and the lack of a guaranteed way to transcode between UTF <-> code point is bugging me a lot. – Eonil Nov 15 '13 at 22:02
  • @JasonTrue I really want some guaranteed way badly. If you have any clue please help me to share that guarantee. – Eonil Nov 15 '13 at 22:02
  • I think you haven't read the standards document. That's the closest thing there is to a guarantee. It says: "UCS-4. UCS-4 stands for "Universal Character Set coded in 4 octets." It is now treated simply as a synonym for UTF-32, and is considered the canonical form for representation of characters in 10646." http://www.unicode.org/versions/Unicode6.2.0/appC.pdf – JasonTrue Nov 15 '13 at 22:16
  • There's nothing complicated about transforming between UTF32 and UTF8, and it's only a tiny bit complicated to transform between UTF16 and UTF32. The code to do so reliably was included in the Unicode standard since at least 2.0 or so. – JasonTrue Nov 15 '13 at 22:21
  • @JasonTrue Thanks for that! I really needed it. I am sorry for not fully reading Unicode standard. Now I am on Unicode website, and see repeating mentions about that equality. – Eonil Nov 15 '13 at 23:33
16

UTF-8 is definitely the way to go, possibly accompanied by UTF-32 for internal use in algorithms that need high performance random access (but that ignores combining chars).

Both UTF-16 and UTF-32 (as well as their LE/BE variants) suffer from endianness issues, so they should never be used externally.

  • 9
    Constant time random access is possible with UTF-8 too, just use code units rather than code points. Maybe you need real random code point access, but I've never seen a use case, and you're just as likely to want random grapheme cluster access instead. –  Aug 06 '10 at 07:32
15

UTF-16? definitely harmful. Just my grain of salt here, but there are exactly three acceptable encodings for text in a program:

  • ASCII: when dealing with low level things (e.g. microcontrollers) that can't afford anything better
  • UTF8: storage in fixed-width media such as files
  • integer codepoints ("CP"?): an array of the largest integers that are convenient for your programming language and platform (decays to ASCII in the limit of low resources). Should be int32 on older computers and int64 on anything with 64-bit addressing. (A minimal sketch of this representation follows the list.)

  • Obviously interfaces to legacy code use what encoding is needed to make the old code work right.
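
A minimal Java sketch of that integer-code-point representation (the helper names are mine; Java is used only because it conveniently exposes both views of a string):

import java.util.Arrays;

public class CodePointArray {
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();       // one int per code point, surrogate pairs already combined
    }

    static String fromCodePoints(int[] cps) {
        return new String(cps, 0, cps.length); // String(int[] codePoints, int offset, int count)
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb";           // 4 UTF-16 code units, 3 code points
        int[] cps = toCodePoints(s);
        System.out.println(Arrays.toString(cps));           // [97, 119070, 98]
        System.out.println(fromCodePoints(cps).equals(s));  // true
    }
}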

David X
  • 141
  • 1
  • 4
  • Unicode guarantees there will be no codepoints above `U+10FFFF`. You are talking about UTF-32/UCS-4 (they are identical). If you are thinking about speed, 32->64 is not like 16->32; int64 is not faster on 64-bit processors. –  Jun 09 '10 at 07:52
  • 4
    @simon buchan, the `U+10ffff` max will go out the window when (not if) they run out of codepoints. That said, using int32 on a p64 system for speed is probably safe, since I doubt they'll exceed `U+ffffffff` before you're forced to rewrite your code for 128 bit systems around 2050. (That is the point of "use the largest int that is convenient" as opposed to "largest available" (which would probably be int256 or bignums or something).) – David X Jun 10 '10 at 02:59
  • 1
    @David: Unicode 5.2 encodes 107,361 codepoints. There are 867,169 unused codepoints. "when" is just silly. A Unicode codepoint is *defined* as a number from 0 to 0x10FFFF, a property which UTF-16 depends upon. (Also 2050 seems much too low an estimate for 128 bit systems when a 64-bit system can hold the entirety of the Internet in its address space.) –  Jun 11 '10 at 06:07
  • @Simon, yes, I was thinking 2050 sounded a bit low for either ETA, my point was that yes, "when" is silly, but it *will* happen. My point in the original answer, however, was to use an array of ints of whatever size is needed for the largest codepoint you expect to handle. (And yes, I did forget that most p64 systems still use int32 as a primary integer type. I'm not sure why.) – David X Jun 12 '10 at 00:04
  • 3
    @David: Your "when" was referring to running out of Unicode codepoints, not a 128-bit switch which, yes, will be in the next few centuries. Unlike memory, there is no exponential growth of characters, so the Unicode Consortium has *specifically* guaranteed they will *never* allocate a codepoint above `U+10FFFF`. This really is one of those situations when 21 bits *is* enough for anybody. –  Jun 13 '10 at 02:53
  • 10
    @Simon Buchan: At least until first contact. :) –  Oct 18 '10 at 17:38
  • 3
    Unicode used to guarantee that there would be no code points above U+FFFF too. – Shannon Severance Oct 04 '13 at 18:47
  • 1
    Perhaps they haven't considered the exponential growth of smileys covering an ever-subtler range of emotions :mournfulwithachanceofjoy: – Rob Grant Aug 28 '14 at 07:11
13

Unicode defines code points up to 0x10FFFF (1,114,112 codes); all applications running in a multilingual environment dealing with strings/file names etc. should handle that correctly.

UTF-16: covers only 1,112,064 codes, although those at the end of the Unicode range are from planes 15-16 (Private Use Area). It cannot grow any further in the future except by breaking the UTF-16 concept.

UTF-8: theoretically covers 2,216,757,376 codes. The current range of Unicode codes can be represented by a sequence of at most 4 bytes. It does not suffer from the byte order problem, and it is "compatible" with ASCII.

UTF-32: theoretically covers 2^32 = 4,294,967,296 codes. Currently it is not variable length encoded and probably will not be in the future.

Those facts are self-explanatory. I do not understand advocating general use of UTF-16. It is variable length encoded (cannot be accessed by index), it has problems covering the whole Unicode range even at present, byte order must be handled, etc. I do not see any advantage except that it is natively used in Windows and some other places. Even so, when writing multi-platform code it is probably better to use UTF-8 natively and make conversions only at the end points, in a platform dependent way (as already suggested). When direct access by index is necessary and memory is not a problem, UTF-32 should be used.
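
A hedged Java sketch of that "convert only at the end points" discipline (the file names are hypothetical): decode once at the input boundary, work with the language's native strings in between, and encode once on the way out, never relying on the platform default charset:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EdgeConversion {
    public static void main(String[] args) throws IOException {
        Path in  = Path.of("input.txt");   // hypothetical input, assumed to be UTF-8
        Path out = Path.of("output.txt");

        // Decode exactly once, at the boundary, stating the charset explicitly.
        String text = Files.readString(in, StandardCharsets.UTF_8);

        String processed = text.toUpperCase();

        // Encode exactly once on the way out, again stating the charset explicitly.
        Files.writeString(out, processed, StandardCharsets.UTF_8);
    }
}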

The main problem is that many programmers dealing with Windows Unicode = UTF-16 do not even know, or ignore, the fact that it is variable length encoded.

The way it is usually done on *nix platforms is pretty good: C strings (char *) are interpreted as UTF-8 encoded, and wide C strings (wchar_t *) as UTF-32.

  • 7
    Note: UTF-16 does cover all of Unicode, as the Unicode Consortium decided that 0x10FFFF is the top of the Unicode range, defined UTF-8's maximal 4-byte length, and explicitly excluded the range 0xD800-0xDFFF from the valid code point range; this range is used for the creation of surrogate pairs. So any valid Unicode text can be represented with each one of these encodings. Also, about growing into the future: it doesn't seem that 1 million code points will be insufficient in any far future. –  Jan 21 '11 at 15:06
  • Exactly, all the encodings cover all the code points; and as for the lack of available codes, I don't see how this can be possible in the foreseeable future. Most supplementary planes are still unused, and even the used ones aren't full yet. So given the total sizes of the known writing systems left, it is very possible that most planes will never be used, unless they start to use code points for something different than writing systems. By the way, UTF-8 can theoretically include 6-byte sequences, so it can represent even more code points than UTF-32, but what's the point? – Malcolm Jan 23 '11 at 14:16
  • Malcolm: Not all encodings cover all codepoints. UCS-2 is if you will the fixed-size subset of UTF-16; it only covers the BMP. – Kerrek SB Jun 09 '11 at 11:34
  • 7
    @Kerrek: Incorrect: UCS-2 is not a valid Unicode encoding. All UTF-* encodings by definition can represent any Unicode code point that is legal for interchange. UCS-2 can represent far fewer than that, plus a few more. Repeat: UCS-2 is not a valid Unicode encoding, any moreso than ASCII is. – tchrist Aug 11 '11 at 14:33
  • @_tchrist: You're right, UCS-2 isn't an encoding, it's a subset. In that sense, *all* encodings for Unicode must by definition be able to represent all Unicode codepoints. Fair point. – Kerrek SB Aug 11 '11 at 14:58
  • 1
    "I do not understand advocating general use of **Utf-8**. It is variable length encoded (can not be accessed by index)" – Ian Boyd Aug 11 '11 at 15:35
  • 9
    @Ian Boyd, the need to access a string's individual character in a random access pattern is incredibly overstated. It is about as common as wanting to compute the diagonal of a matrix of characters, which is super rare. **Strings are virtually always processed sequentially,** and since accessing UTF-8 char N+1 given that you are at UTF-8 char N is O(1), there is no issue. There is surpassingly little need to make random access of strings. Whether you think it is worth the storage space to go to UTF-32 instead of UTF-8 is your own opinion, but for me, it is altogether a non-issue. – tchrist Aug 11 '11 at 20:38
  • 2
  • @tchrist, I will grant you strings are virtually always processed sequentially if you include reverse iteration as "sequential" and stretch that a little further to include comparing the trailing end of a string to a known string. Two very common scenarios are truncating whitespace from the end of strings and checking the file extension at the end of a path. – Andy Dent May 13 '12 at 14:16
11

Add this to the list:

The presented scenario is simple (even simpler as I will present it here than it was originally!):

1. A WinForms TextBox sits on a Form, empty. It has MaxLength set to 20.

2. The user types into the TextBox, or maybe pastes text into it.

3. No matter what you type or paste into the TextBox, you are limited to 20, though it will sympathetically beep at text beyond the 20 (YMMV here; I changed my sound scheme to give me that effect!).

4. The small packet of text is then sent somewhere else, to start an exciting adventure.

Now this is an easy scenario, and anyone can write this up, in their spare time. I just wrote it up myself in multiple programming languages using WinForms, because I was bored and had never tried it before. And with text in multiple actual languages because I am wired that way and have more keyboard layouts than possibly anyone in the entire freaking universe.

I even named the form Magic Carpet Ride, to help ameliorate the boredom.

This did not work, for what it's worth.

So instead, I entered the following 20 characters into my Magic Carpet Ride form:

0123401234012340123

Uh oh.

That last character is U+20000, the first Extension B ideograph of Unicode (aka U+d840 U+dc00, to its close friends who he is not ashamed to be disrobed, as it were, in front of)....


And now we have a ball game.

Because when TextBox.MaxLength talks about

Gets or sets the maximum number of characters that can be manually entered into the text box.

what it really means is

Gets or sets the maximum number of UTF-16 LE code units that can be manually entered into the text box and will mercilessly truncate the living crap out of any string that tries to play cutesy games with the linguistic character notion that only someone as obsessed as that Kaplan fellow will find offensive (geez he needs to get out more!).

I'll try and see about getting the document updated....

Regular readers who remember my UCS-2 to UTF-16 series will note my unhappiness with the simplistic notion of TextBox.MaxLength and how it should handle at a minimum this case where its draconian behavior creates an illegal sequence, one that other parts of the .Net Framework may throw a

  • System.Text.EncoderFallbackException: Unable to translate Unicode character \uD850 at index 0 to specified code page.

exception if you pass this string elsewhere in the .Net Framework (as my colleague Dan Thompson was doing).

Now okay, perhaps the full UCS-2 to UTF-16 series is out of the reach of many.
But isn't it reasonable to expect that TextBox.Text will not produce a System.String that won't cause another piece of the .Net Framework to throw? I mean, it isn't like there is a chance in the form of some event on the control that tells you of the upcoming truncation where you can easily add the smarter validation -- validation that the control itself does not mind doing. I would go so far as to say that this punk control is breaking a safety contract that could even lead to security problems if you can class causing unexpected exceptions to terminate an application as a crude sort of denial of service. Why should any WinForms process or method or algorithm or technique produce invalid results?

Source : Michael S. Kaplan MSDN Blog
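
Not part of the quoted post, but the fix it argues for is small wherever it lives: never cut a string between a lead and a trail surrogate. A hedged sketch of the idea in Java (the control above is .NET, so this only illustrates the principle, not the framework code):

public class SafeTruncate {
    // Truncate to at most maxUnits UTF-16 code units without splitting a surrogate pair.
    static String truncate(String s, int maxUnits) {
        if (s.length() <= maxUnits) {
            return s;
        }
        int cut = maxUnits;
        // If the cut would land between a high and a low surrogate, back up one unit.
        if (cut > 0 && Character.isHighSurrogate(s.charAt(cut - 1))
                && Character.isLowSurrogate(s.charAt(cut))) {
            cut--;
        }
        return s.substring(0, cut);
    }

    public static void main(String[] args) {
        String s = "0123401234012340123" + "\uD840\uDC00"; // 19 digits plus U+20000 as a surrogate pair
        String t = truncate(s, 20);
        System.out.println(t.length()); // 19, not 20 -- no lone surrogate is left behind
    }
}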

Yuhong Bao
  • 131
  • 1
  • 3
  • Thanks, very good link! I've added it to the issues list in the question. –  Dec 21 '10 at 06:28
9

I wouldn't necessarily say that UTF-16 is harmful. It's not elegant, but it serves its purpose of backwards compatibility with UCS-2, just like GB18030 does with GB2312, and UTF-8 does with ASCII.

But making a fundamental change to the structure of Unicode in midstream, after Microsoft and Sun had built huge APIs around 16-bit characters, was harmful. The failure to spread awareness of the change was more harmful.

dan04
  • 3,748
  • 1
  • 24
  • 26
  • 8
    UTF-8 is a superset of ASCII, but UTF-16 is NOT a superset of UCS-2. Although almost a superset, a correct encoding of UCS-2 into UTF-8 results in the abomination known as CESU-8; UCS-2 doesn't have surrogates, just ordinary code points, so they must be translated as such. The real advantage of UTF-16 is that it's easier to upgrade a UCS-2 codebase than a complete rewrite for UTF-8. Funny, huh? –  Aug 06 '10 at 07:28
  • 1
    Sure, technically UTF-16 isn't a superset of UCS-2, but when were U+D800 to U+DFFF ever *used* for anything except UTF-16 surrogates? – dan04 Aug 17 '10 at 18:51
  • 2
    Doesn't matter. Any processing other than blindly passing through the bytestream requires you to decode the surrogate pairs, which you can't do if you're treating it as UCS-2. –  Aug 29 '10 at 13:02
6

UTF-16 is the best compromise between handling and space, and that's why most major platforms (Win32, Java, .NET) use it for internal representation of strings.

Nemanja Trifunovic
  • 6,815
  • 1
  • 26
  • 34
  • 31
    -1 because UTF-8 is likely to be smaller or not significantly different. For certain Asian scripts UTF-8 is three bytes per glyph while UTF-16 is only two, but this is balanced by UTF-8 being only one byte for ASCII (which does often appear even within Asian languages in product names, commands and such things). Further, in the said languages, a glyph conveys more information than a Latin character so it is justified for it to take more space. –  Mar 18 '10 at 02:47
  • Thanks for the downvote, but I still don't get which part of the "best compromise between handling and space" you consider wrong. Note the word "compromise". Or maybe you don't believe that Win32, Java and .NET (also ICU, btw) use UTF-16 internally? – Nemanja Trifunovic Mar 18 '10 at 18:16
  • 32
    I would not call combining the worst sides of both options a good compromise. –  Mar 23 '10 at 15:36
  • 1
    It is the *best* of both worlds: it is pretty easy to handle, unlike UTF-8, and does not take nearly as much memory as UTF-32. – Nemanja Trifunovic Mar 23 '10 at 17:03
  • 18
    It's not easier than UTF-8. It's variable-length too. – luiscubal Mar 25 '10 at 17:50
  • 3
    It is variable-length, but way easier than UTF-8. With UTF-16, the only thing to look at is the surrogate pairs; a UTF-8 code point can be encoded as anywhere between 1 and 4 bytes, plus you need to take care of things such as overlong sequences, etc. Look at this code to see what UTF-8 decoding looks like in C++: http://utfcpp.svn.sourceforge.net/viewvc/utfcpp/v2_0/source/utf8/core.h?revision=111&view=markup – Nemanja Trifunovic Mar 25 '10 at 20:10
  • 36
    Leaving debates about the benefits of UTF-16 aside: What you cited is *not* the reason for Windows, Java or .NET using UTF-16. Windows and Java date back to a time where Unicode was a 16-bit encoding. UCS-2 was a reasonable choice back then. When Unicode became a 21-bit encoding migrating to UTF-16 was the best choice existing platforms had. That had nothing to do with ease of handling or space compromises. It's just a matter of legacy. – Joey Apr 02 '10 at 14:13
  • @Johannes: It is a matter of legacy in case of Win32 and Java, but not .NET and especially not Python 3. – Nemanja Trifunovic Apr 02 '10 at 15:37
  • 10
    .NET inherits the Windows legacy here. – Joey Apr 02 '10 at 16:19
  • That's why I said "especially not Python 3", but it would have been perfectly feasible to implement even .NET strings as UTF-8. Of course, interop with Win32 is easier with UTF-16 strings. – Nemanja Trifunovic Apr 02 '10 at 17:51
  • Python3 and PHP6 are probably a case of "me too", and we all know how well that went with PHP6. – ninjalj Aug 01 '11 at 17:07
  • 3
    It does not in practice take a great deal more in UTF-8 than in UTF-16. [See this case-study.](http://stackoverflow.com/questions/6883434/at-all-times-text-encoded-in-utf-8-will-never-give-us-more-than-a-50-file-size/6884648#6884648) Python3 is still unacceptably dodgy because you cannot rely on a wide build, so you never know how to count characters, or whether it takes "." or ".." to match one in a regex. Look at languages that have always used UTF-8, like Go and Perl, and you will see that they have none of the endless insanity of UTF-16. I just found another Java CORE UTF-16 bug yesterday. – tchrist Aug 11 '11 at 14:39
  • 5
    @tchrist, you will be pleased to know that Python 3.3 fixes its internal representation of Unicode. See http://docs.python.org/py3k/whatsnew/3.3.html#pep-393 – Mark Ransom Oct 17 '12 at 14:47
6

I've never understood the point of UTF-16. If you want the most space-efficient representation, use UTF-8. If you want to be able to treat text as fixed-length, use UTF-32. If you want neither, use UTF-16. Worse yet, since all of the common (basic multilingual plane) characters in UTF-16 fit in a single code unit, bugs that assume that UTF-16 is fixed-length will be subtle and hard to find, whereas if you try to do this with UTF-8, your code will fail fast and loudly as soon as you try to internationalize.
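
A hedged Java sketch of exactly that kind of subtle bug (the helper is deliberately naive): reversing by 16-bit units passes every BMP test and silently produces ill-formed UTF-16 the first time an astral character shows up:

import java.nio.charset.StandardCharsets;

public class SubtleBug {
    // Reverse by 16-bit code units -- the classic "UTF-16 is fixed width" assumption.
    static String reverseByUnit(String s) {
        char[] out = new char[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[s.length() - 1 - i] = s.charAt(i);
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(reverseByUnit("hello"));   // "olleh" -- every BMP test passes
        String astral = "\uD834\uDD1E!";              // U+1D11E MUSICAL SYMBOL G CLEF, then "!"
        String broken = reverseByUnit(astral);        // the surrogate pair is now back to front
        // Round-tripping through UTF-8 exposes the damage: lone surrogates are unencodable
        // and get replaced, so the G clef is gone.
        byte[] utf8 = broken.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // "!??"
    }
}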

dsimcha
  • 17,224
  • 9
  • 64
  • 81
6

Since I cannot yet comment, I post this as an answer, since it seems I cannot otherwise contact the authors of utf8everywhere.org. It's a shame I don't automatically get the comment privilege, since I have enough reputation on other stackexchanges.

This is meant as a comment to the Opinion: Yes, UTF-16 should be considered harmful answer.

One little correction:

To prevent one from accidentally passing a UTF-8 char* into ANSI-string versions of Windows-API functions, one should define UNICODE, not _UNICODE. _UNICODE maps functions like _tcslen to wcslen, not MessageBox to MessageBoxW. Instead, the UNICODE define takes care of the latter. For proof, this is from MS Visual Studio 2005's WinUser.h header:

#ifdef UNICODE
#define MessageBox  MessageBoxW
#else
#define MessageBox  MessageBoxA
#endif // !UNICODE

At the very minimum, this error should be corrected on utf8everywhere.org.

A suggestion:

Perhaps the guide should contain an example of explicit use of the Wide-string version of a data structure, to make it less easy to miss/forget it. Using Wide-string versions of data structures on top of using Wide-string versions of functions makes it even less likely that one accidentally calls an ANSI-string version of such a function.

Example of the example:

WIN32_FIND_DATAW data; // Note the W at the end.
HANDLE hSearch = FindFirstFileW(widen("*.txt").c_str(), &data);
if (hSearch != INVALID_HANDLE_VALUE)
{
    FindClose(hSearch);
    MessageBoxW(nullptr, data.cFileName, nullptr, MB_OK);
}
Jelle Geerts
  • 101
  • 1
  • 1
  • Agreed; thanks! We will update the document. The document still needs more development and adding information about databases. We are happy to receive contributions of wordings. – Pavel Radzivilovsky Feb 14 '14 at 12:41
  • @PavelRadzivilovsky `_UNICODE` is still there :( – cubuspl42 Apr 18 '14 at 12:19
  • Thanks for reminding us. cubus, Jelle, would you like to be added as users to our SVN? – Pavel Radzivilovsky Apr 25 '14 at 13:15
  • @Pavel Sure, would appreciate it! – Jelle Geerts Apr 25 '14 at 13:45
  • @JelleGeerts: I apologize for this delay. You could always contact us by our emails (linked from the manifesto) or Facebook. We are easy to find. Although I believe we fixed the issue you brought here (and I credited you there), the whole UTF-8 vs UTF-16 debates are still relevant. If you have more to contribute feel free to contact us through those private channels. – ybungalobill Jun 24 '15 at 19:03
5

Someone said UCS4 and UTF-32 were the same. Not so, but I know what you mean. One of them is an encoding of the other, though. I wish they'd thought to specify endianness from the first so we wouldn't have the endianness battle fought out here too. Couldn't they have seen that coming? At least UTF-8 is the same everywhere (unless someone is following the original spec with 6-byte sequences).

If you use UTF-16 you have to include handling for multibyte chars. You can't go to the Nth character by indexing 2N into a byte array. You have to walk it, or have character indices. Otherwise you've written a bug.

The current draft spec of C++ says that UTF-32 and UTF-16 can have little-endian, big-endian, and unspecified variants. Really? If Unicode had specified that everyone had to do little-endian from the beginning then it would have all been simpler. (I would have been fine with big-endian as well.) Instead, some people implemented it one way, some the other, and now we're stuck with silliness for nothing. Sometimes it's embarrassing to be a software engineer.
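
For what it's worth, sniffing the byte order from a BOM is mechanical; here is a hedged Java sketch (the helper name is mine), with the usual caveat that a BOM-less stream still needs the byte order agreed on out of band:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16Bom {
    // Guess the UTF-16 byte order from a leading BOM; returns null if there is none.
    static Charset detect(byte[] data) {
        if (data.length >= 2) {
            if ((data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) return StandardCharsets.UTF_16BE;
            if ((data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: the byte order has to be specified externally
    }

    public static void main(String[] args) {
        byte[] le = { (byte) 0xFF, (byte) 0xFE, 0x41, 0x00 }; // BOM followed by "A" in UTF-16LE
        System.out.println(detect(le));                       // UTF-16LE
    }
}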

  • Unspecified endianness is supposed to include a BOM as the first character, used for determining which way the string should be read. UCS-4 and UTF-32 indeed are the same nowadays, i.e. a numeric UCS value between 0 and 0x10FFFF stored in a 32 bit integer. –  Oct 20 '10 at 23:34
  • 5
    @Tronic: Technically, this is not true. Although UCS-4 can store any 32-bit integer, UTF-32 is forbidden from storing the non-character code points that are illegal for interchange, such as 0xFFFF, 0xFFFE, and the all the surrogates. UTF is a transport encoding, not an internal one. – tchrist Aug 11 '11 at 14:30
  • Endianness issues are unavoidable as long as different processors continue to use different byte orders. However, it might have been nice if there were a "preferred" byte order for file storage of UTF-16. – Qwertie May 01 '12 at 00:16
  • Even though UTF-32 is fixed-width for _code points_, it is not fixed-width for _characters_. (Heard of something called "combining characters"?) So you can't go to the N'th _character_ simply by indexing 4N into the byte array. – musiphil Jun 19 '14 at 05:05
2

I don't think it's harmful if the developer is careful enough.
And they should accept this trade-off if they know it well, too.

As a Japanese software developer, I find UCS-2 large enough, and limiting the space apparently simplifies the logic and reduces runtime memory, so using UTF-16 under the UCS-2 limitation is good enough.

There are filesystems and other applications which assume code points and bytes to be proportional, so that the raw code point count can be guaranteed to fit in some fixed size storage.

One example is NTFS and VFAT specifying UCS-2 as their filename storage encoding.

If those examples really want to extend to support UCS-4, I could agree to using UTF-8 for everything anyway, but fixed length has good points like:

  1. can guarantee the size by length (data size and code point count are proportional)
  2. can use the encoding number for hash lookup
  3. non-compressed data is reasonably sized (compared to UTF-32/UCS-4)

In the future, when memory/processing power is cheap even in embedded devices, we may accept the device being a bit slow for extra cache misses or page faults and extra memory usage, but this won't happen in the near future I guess...

holmes
  • 111
  • 2
  • 3
    For those reading this comment, it's worth noting that UCS-2 is not the same thing as UTF-16. Please look up the differences to understand. – mikebabcock Dec 19 '12 at 19:39
1

"Should one of the most popular encodings, UTF-16, be considered harmful?"

Quite possibly, but the alternatives should not necessarily be viewed as being much better.

The fundamental issue is that there are many different concepts in play: glyphs, characters, codepoints and byte sequences. The mapping between each of these is non-trivial, even with the aid of a normalization library. (For example, some characters in European languages that are written with a Latin-based script are not written with a single Unicode codepoint. And that's at the simpler end of the complexity!) What this means is that to get everything correct is quite amazingly difficult; bizarre bugs are to be expected (and instead of just moaning about them here, tell the maintainers of the software concerned).
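
A small Java sketch of just how many answers "how long is this string?" can have, using a precomposed versus a decomposed "é" (the counts shown hold for this sample only):

import java.text.BreakIterator;
import java.text.Normalizer;

public class HowLong {
    // Count user-perceived characters (grapheme clusters).
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int n = 0;
        while (it.next() != BreakIterator.DONE) n++;
        return n;
    }

    public static void main(String[] args) {
        String precomposed = "caf\u00E9";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
        String decomposed  = "cafe\u0301";  // "e" followed by U+0301 COMBINING ACUTE ACCENT

        System.out.println(precomposed.equals(decomposed));  // false: different code point sequences
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                .equals(precomposed));                        // true after normalization

        System.out.println(decomposed.length());                                // 5 UTF-16 code units
        System.out.println(decomposed.codePointCount(0, decomposed.length()));  // 5 code points
        System.out.println(graphemes(decomposed));                              // 4 perceived characters
    }
}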

The only way in which UTF-16 can be considered to be harmful as opposed to, say, UTF-8 is that it has a different way of encoding code points outside the BMP (as a pair of surrogates). If code wishes to access or iterate by code point, that means it needs to be aware of the difference. OTOH, it does mean that a substantial body of existing code that assumes "characters" can always fit into a two-byte quantity (a fairly common, if wrong, assumption) can at least continue to work without rebuilding it all. In other words, at least you get to see those characters that aren't being handled right!

I'd turn your question on its head and say that the whole damn shebang of Unicode should be considered harmful and everyone ought to use an 8-bit encoding, except I've seen (over the past 20 years) where that leads: horrible confusion over the various ISO 8859 encodings, plus the whole set of ones used for Cyrillic, and the EBCDIC suite, and… well, Unicode for all its faults beats that. If only it wasn't such a nasty compromise between different countries' misunderstandings.

Donal Fellows
  • 6,347
  • 25
  • 35
  • Knowing our luck, in a few years we'll find ourselves running out of space in UTF-16. Meh. – Donal Fellows Aug 21 '11 at 15:57
  • 3
    The fundamental issue is that text is deceptively hard. No approach to representing that information in a digital way can be uncomplicated. It's the same reason that dates are hard, calendars are hard, time is hard, personal names are hard, postal addresses are hard: whenever digital machines intersect with human cultural constructs, complexity erupts. It's a fact of life. Humans do not function on digital logic. – Aristotle Pagaltzis May 07 '12 at 01:16