4

Unicode seems to be becoming more and more ubiquitous these days, if it isn't already, but I have to wonder whether there are any domains where Unicode isn't the best implementation choice. Are there any languages or scripts for which Unicode won't work well, or won't work at all? Are there any technical reasons to use a different system entirely (other than working with legacy systems)? Naturally, I would assume that the answer is to always use Unicode. Am I wrong?

Daniel Wolfe
  • 141
  • 1
  • 5

3 Answers

8

The only time I would avoid Unicode is in an embedded system where the requirements specifically state the system only needs to support a single code page (or ASCII).

Software is almost too easy to reuse. Whether it's a public project that ends up being used in ways the author never envisioned, or a corporate project that some suit repurposes, you never know when and where software will be reused. With our global Internet, people of all languages may have a use for your software, and it should support languages such as Chinese that are in widespread use and require Unicode to work well.

Embedded systems (a category in which I do NOT include smartphones) are the only domain I can think of that would resist the trend of software being used in diverse locations.

Edit: I just realized I did not really specify why I would avoid Unicode in those situations, even though the answer is fairly obvious. While some combinations of characters and encodings take up no more space than an 8-bit character (e.g. English text in UTF-8), not all do. This can increase storage requirements, especially for characters that must use multiple bytes (e.g. Chinese, used by over a billion people). Furthermore, decoding Unicode and transforming it into a glyph on a user interface requires additional code for which an embedded system may not have memory. If I had to develop a routine to transform ASCII characters into glyphs, it would likely be a fairly small lookup table, and would not involve decoding a variable-length character into a code plane with thousands of glyphs.
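
To make that concrete, here is a minimal sketch of what such an ASCII-only glyph routine might look like; the 5x8 column format, the font5x8 table, and the ascii_glyph helper are illustrative assumptions rather than code from any real firmware. The whole mapping is a bounds check plus an array index, and the font table stays well under half a kilobyte.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative 5x8 column-major font covering printable ASCII (0x20-0x7E).
 * 95 glyphs x 5 bytes = 475 bytes of flash; most rows elided for brevity. */
static const uint8_t font5x8[95][5] = {
    { 0x00, 0x00, 0x00, 0x00, 0x00 },   /* 0x20 ' ' */
    { 0x00, 0x00, 0x5F, 0x00, 0x00 },   /* 0x21 '!' */
    /* ... remaining glyphs ... */
};

/* Mapping a character to its bitmap is a bounds check plus an array index;
 * there is no variable-length decoding and no large code-point table. */
static const uint8_t *ascii_glyph(char c)
{
    if (c < 0x20 || c > 0x7E)
        c = '?';                         /* substitute for anything unprintable */
    return font5x8[c - 0x20];
}

int main(void)
{
    const uint8_t *g = ascii_glyph('!');
    printf("middle column of '!': 0x%02X\n", (unsigned)g[2]);   /* 0x5F */
    return 0;
}
```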

  • 1
    Another factor is that even if one managed to store a recognizable version of every glyph in only 32 bytes each (16x16 dot matrix), the space required to hold a font with 110,000 characters might for many applications be orders of magnitude larger than the space required for everything else combined. – supercat Apr 10 '14 at 19:29
  • 1
    @supercat that is perhaps a more explicit form of my last sentence, where I mention a code plane with thousands of glyphs. –  Apr 10 '14 at 19:39
  • 1
    I interpreted your main point as describing the difficulty of converting a sequence of bytes to a code point. My point was that even if one can decipher a sequence of code points, that's rather pointless unless one invests in a huge font. – supercat Apr 10 '14 at 19:43
  • I think we agree, it is just a matter of wording. –  Apr 10 '14 at 19:48
2

ASCII generally uses less memory than Unicode, and doesn't require special encodings. I would imagine that, if you're building a small embedded system for an English user, where the amount of memory is constrained, then Unicode might actually be an impediment.

Unicode is intended for larger systems, where you might want to leave open the possibility of later adapting the software to other human languages and character sets.
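
As a rough illustration of how the choice of encoding affects storage (the byte sequences below are hand-written examples, not the output of any particular library): text that is pure ASCII is byte-for-byte identical in UTF-8, an accented Latin character costs one extra byte in UTF-8, and a UTF-16-style encoding doubles the size of ASCII text.

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The same short word stored under different encodings (hand-encoded). */
    const char ascii_or_utf8[] = "hello";         /* 5 bytes in ASCII and UTF-8 alike      */
    const char utf8_accented[] = "h\xC3\xA9llo";  /* "héllo": the é takes 2 bytes in UTF-8 */
    const char utf16le[] = "h\0e\0l\0l\0o\0";     /* UTF-16LE: 2 bytes per ASCII character */

    printf("ASCII / UTF-8 \"hello\": %zu bytes\n", strlen(ascii_or_utf8));       /* 5  */
    printf("UTF-8 \"h\xC3\xA9llo\":        %zu bytes\n", strlen(utf8_accented)); /* 6  */
    printf("UTF-16LE \"hello\":     %zu bytes\n", sizeof utf16le - 1);           /* 10 */
    return 0;
}
```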

Robert Harvey
  • 198,589
  • 55
  • 464
  • 673
  • 7
    Unicode doesn't take twice the space of ASCII. Certain encodings of Unicode (e.g. UTF-16/UCS-2) might generally take twice the space of ASCII to represent ASCII characters. Other encodings (e.g. UTF-8) do not have that penalty (though at the cost of using more storage for other characters, like those of the Chinese and Japanese languages). – Justin Cave Apr 10 '14 at 17:57
  • @JustinCave: fixed. – Robert Harvey Apr 10 '14 at 17:59
  • I'm not sure that I follow. If the only characters you are encoding are in the 7-bit ASCII character set, 7-bit ASCII and UTF-8 are bit-for-bit identical. So you can use UTF-8 with no penalty of any kind in that case. I'm not sure why UTF-8 would be any more of a "special encoding" than 7-bit ASCII. – Justin Cave Apr 10 '14 at 18:03
  • 2
    I think the point is that Unicode _may_ take more space than an 8 bit encoding using a code page. Without more information on what characters are being encoded, we cannot say _how much more_ space would be required, even if that value is zero bytes. –  Apr 10 '14 at 18:04
  • 1
    These days, I think "larger systems" would include everything from smart phones on up. I suspect you could use language statistics to determine rough extra memory required per language for UTF-8, but I don't know if anyone's done it. For western languages, it will be small because only a minority of characters are accented. – Gort the Robot Apr 10 '14 at 18:07
  • Agree with Justin, there is a 1-to-1 mapping between ASCII and UTF-8. If you're expecting only to handle ASCII, you pay no penalty even if you use UTF-8, and if your assumption is wrong your code is compatible with other languages for free. – Doval Apr 10 '14 at 18:07
  • Folks, do keep in mind that we're talking about a few megabytes of *total storage,* and sometimes just a few kilobytes or even less. You may be able to store the Unicode, but are you going to have room for the encoding/decoding software? Why would you take that penalty for potentially zero benefit? – Robert Harvey Apr 10 '14 at 18:09
  • I think UTF-8 should be considered the default these days, though, with ASCII seen as only a special case due to unusual memory restrictions. – Gort the Robot Apr 10 '14 at 18:20
  • @StevenBurnap: A 96-character 5x8 font will take 480 bytes to store. Even a 2048-character set (a tiny fraction of Unicode) will take 10K. Many embedded systems, if they support anything beyond 256 characters, would have to pick and choose which characters to support; in most cases, there would be little reason for picking more than 256 particular code points to support. – supercat Apr 10 '14 at 19:34
  • These days 10K is an unusual memory restriction. – Gort the Robot Apr 10 '14 at 19:48
  • @JustinCave In reference to your first comment, "extra storage" for Chinese is a bit of a red herring. In real text, the average Chinese string has fewer characters than any other language. In the two apps I deal with, the utf-8 strings files for Chinese are actually smaller than those for English because of this. (It is still an issue for Korean and Japanese, though, as those languages tend to use more characters per string than Chinese.) – Gort the Robot Apr 10 '14 at 19:51
  • @StevenBurnap Don't Japanese glyphs generally correspond to 2-3 (English) character syllables? I would've expected them to break even or the overhead to be relatively low. – Doval Apr 10 '14 at 20:41
  • For us, the Chinese translations end up being substantially smaller than English in utf-8 while the Japanese translations end up being a bit bigger. – Gort the Robot Apr 10 '14 at 22:15
1

Network protocols. They're almost always defined in 7-bit ASCII. I suppose that falls under "legacy systems."

If it makes you feel better, imagine that SMTP's HELO, EHLO, DATA commands are binary. They're all 4 characters long for a reason.
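
One commonly cited reason: a four-byte command fits in a single 32-bit word, so a C server can dispatch on it with one integer comparison, with no string handling and certainly no Unicode decoding. The sketch below is illustrative only (the CMD macro and dispatch function are my own names, not from any real SMTP implementation), and it glosses over case folding and length checks.

```c
#include <stdint.h>
#include <stdio.h>

/* Pack four command bytes into one 32-bit value so dispatch is a single
 * integer compare. Illustrative only; a real parser would also fold case
 * and verify that at least four bytes were received. */
#define CMD(a, b, c, d) (((uint32_t)(uint8_t)(a) << 24) | ((uint32_t)(uint8_t)(b) << 16) | \
                         ((uint32_t)(uint8_t)(c) << 8)  |  (uint32_t)(uint8_t)(d))

static int dispatch(const char *line)
{
    uint32_t cmd = CMD(line[0], line[1], line[2], line[3]);

    switch (cmd) {
    case CMD('H', 'E', 'L', 'O'):
    case CMD('E', 'H', 'L', 'O'): return 1;   /* greeting               */
    case CMD('D', 'A', 'T', 'A'): return 2;   /* start of message body  */
    case CMD('Q', 'U', 'I', 'T'): return 3;   /* close the session      */
    default:                      return 0;   /* unrecognized command   */
    }
}

int main(void)
{
    printf("%d\n", dispatch("EHLO example.org\r\n"));  /* prints 1 */
    return 0;
}
```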

HTTP's GET, the URL, and all of the HTTP headers are also in ASCII.

DNS is most definitely ASCII.

Almost all of the network protocols are implemented first in C and intended for very fast processing. The technical reason to avoid Unicode here is that the communicating processes are really just exchanging raw binary data. The "words" in the protocol are just there to make it readable for people with a network sniffer, and so that people can do very basic testing using telnet or netcat.

In these situations Unicode conversion is almost always a waste of time.

Pretty much the only place I can think of where Unicode awareness would be useful in something like web servers and proxies is case-insensitive rewrite rules. Case-sensitive rules don't need it, because UTF-8 matches just fine without decoding.
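
For instance, an exact (case-sensitive) match can be done directly on the raw UTF-8 bytes with ordinary byte-string functions, since equal byte sequences mean equal code-point sequences; the path and pattern below are invented purely for illustration.

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Invented example: match a UTF-8 pattern against a UTF-8 request path.
     * No decoding step is needed for exact, case-sensitive matching. */
    const char *path    = "/caf\xC3\xA9/men\xC3\xBC.html";  /* "/café/menü.html" */
    const char *pattern = "/caf\xC3\xA9/";                  /* "/café/"          */

    if (strstr(path, pattern) != NULL)
        puts("rewrite rule matches without any Unicode decoding");

    return 0;
}
```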

Personally, I don't believe in case-insensitive processing in network servers or file systems. If the request fails to find a resource, bounce it to an error-handler script that can mess around trying to guess what the user was really after. That keeps things fast and very simple in the common case.

I would ask "Would you define a new network protocol to use Unicode?" except I am afraid I know the answer. I see people writing nasty, nasty JSON and XML "protocols" all the time. And whoever decided to transfer binary data inside XML as Base64 should probably be shot, drawn, quartered, drowned and then buried alive. "Oooh, gigabit networks! I'll just expand everything hugely and make it impossible to use zero-copy!" Bonus points for then compressing the XML for transfer. Bloat it out, pack it down so you can unpack it and unbloat it. All while making three copies, some of which expand the XML and Base64 data into 32-bit UCS-4 "wide characters". Probably in Java, so you can use an extra gig of RAM just because.

Zan Lynx
  • 1,300
  • 11
  • 14
  • Isn't Unicode supported in URLs? I'm deciding whether or not to make use of this. – Dogweather May 07 '18 at 23:36
  • 1
    @Dogweather: I do most of my work on back-end stuff, so my answer is that no, URLs don't use Unicode at all. HTTP uses bytes. Just bytes. Make them whatever you want. Don't assume they are Unicode, although they probably are. Also domains: those can't use anything except 7-bit ASCII. There's an encoding for Unicode domain names where they start with "xn--"; look up Punycode. – Zan Lynx May 08 '18 at 16:30
  • 1
    @Dogweather: What web browsers and HTML require is different from what goes out on the network. There you have percent-encoding and the like. And I'm not sure what the rules are on Unicode in URL links, i.e. whether the browser percent-encodes them or puts them on the wire as raw UTF-8. – Zan Lynx May 08 '18 at 16:33
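
To illustrate the percent-encoding mentioned above: a browser typically encodes a non-ASCII path as UTF-8 bytes and then percent-encodes any byte outside the unreserved ASCII set before it goes on the wire. The sketch below is a simplified illustration of that idea; the percent_encode function and its reduced character set are assumptions of mine, not part of any standard library.

```c
#include <stdio.h>
#include <string.h>

/* Rough sketch: percent-encode every byte outside a simplified "unreserved"
 * ASCII set, which is roughly what happens to a UTF-8 path on the wire. */
static void percent_encode(const char *in, char *out, size_t outsize)
{
    static const char unreserved[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~/";
    size_t o = 0;

    for (; *in != '\0' && o + 4 < outsize; in++) {
        unsigned char c = (unsigned char)*in;
        if (strchr(unreserved, (char)c) != NULL)
            out[o++] = (char)c;                 /* pass unreserved bytes through */
        else
            o += (size_t)snprintf(out + o, outsize - o, "%%%02X", (unsigned)c);
    }
    out[o] = '\0';
}

int main(void)
{
    char buf[64];
    percent_encode("/caf\xC3\xA9", buf, sizeof buf);  /* the UTF-8 bytes of "/café" */
    puts(buf);                                        /* prints "/caf%C3%A9"        */
    return 0;
}
```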