28

At work I come across a lot of Japanese text files in Shift-JIS and other encodings. It causes many mojibake (unreadable character) problems for all computer users. Unicode was intended to solve this sort of problem by defining a single character set for all languages, and the UTF-8 serialization is recommended for use on the Internet. So why doesn't everybody switch from Japanese-specific encodings to UTF-8? What issues with or disadvantages of UTF-8 are holding people back?

EDIT: The W3C lists some known problems with Unicode; could this be a reason too?

Mechanical snail
  • 902
  • 5
  • 12
Nicolas Raoul
  • 1,062
  • 1
  • 11
  • 20
  • Actually, more and more popular sites are in UTF-8; examples are ニコニコ動画 and はてな – Ken Li Jun 08 '11 at 08:35
  • 8
    Why doesn't everybody switch from ISO-8859-1 to UTF-8? – ysdx Jun 08 '11 at 11:35
  • 2
    It's mentioned in passing [here](http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/) that SHIFT-JIS -> UTF-8 conversion isn't lossless, which would be a major reason to continue using SHIFT-JIS where it's already in use. I found that ostensible factoid surprising, though, so I was hoping one of the answers here might go into more detail or at least provide a source for the claim, but none of them do. – Kyle Strand Jun 20 '16 at 17:08
  • 1
    @KyleStrand see https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode – Ludwig Schulze Mar 23 '17 at 06:33
  • @LudwigSchulze Thanks. Still not a lot of detail, but at least an official source... – Kyle Strand Mar 23 '17 at 06:55

5 Answers

29

In one word: legacy.

Shift-JIS and other encodings were used before Unicode became available/popular, since they were the only way to encode Japanese at all. Companies invested in infrastructure that supported only Shift-JIS. Even if that infrastructure now supports Unicode, they are still stuck with Shift-JIS for various reasons, ranging from it-works-so-don't-touch-it through encoding-what? to migrating-all-existing-documents-is-too-costly.

There are many Western companies that are still using ASCII or Latin-1 for the same reasons, only nobody notices, since it rarely causes a problem.

deceze
  • 2,215
  • 1
  • 19
  • 19
  • 9
    Japanese software industry... slower than dirt at utilizing new software/standards. – Mark Hosang Jun 08 '11 at 08:11
  • 3
    @Mark Truer words were ne'er spoken! (I'm working in/with Japanese IT... -_-;;) – deceze Jun 08 '11 at 08:21
  • 5
    True, but Western companies have the excuse that our legacy software is full of hard-coded assumptions that 1 byte = 1 character, which makes the transition to UTF-8 harder than for Asians who have long had to write MBCS-clean code. – dan04 Nov 14 '11 at 07:36
  • @MarkHosang I confirm that your statement is 100% correct (I work for a Japanese company in Tokyo) – MD TAREQ HASSAN Apr 01 '19 at 04:28
10

These are the reasons that I remember were given for not making UTF-8 or another Unicode representation the default character encoding for the scripting language Ruby, which is mainly developed in Japan:

  • Reason 1: Han unification. The character sets (not sure if "alphabets" would be correct here) used in China, Korea, and Japan are all related and have evolved from a common history (I'm not sure about the details). The Unicode consortium decided to spend only a single code point to encode all variants (Chinese, Japanese, and Korean) of what is historically the same character, even if its appearance differs in all three languages. Their reasoning is that appearance should be determined by the font used to display the text.

Apparently, this reasoning is perceived to be as ridiculous by Japanese users as it would be to argue to English readers that, because the Latin alphabet developed from the Greek alphabet, it is sufficient to have only a single code point for Greek alpha "α" and Latin "a", and to let the appearance be decided by the font in use. (The same for "β" = "b", "γ" = "g", etc.)

(Note that I would not be able to include Greek characters here on Stack Exchange if that were the case.)

  • Reason 2: Inefficient character conversions. Converting characters between Unicode and legacy Japanese encodings requires lookup tables, i.e. there is no simple computation from a Unicode code-point value to a legacy code-point value and vice versa. There is also some loss of information when converting, because not all code points in one encoding have a unique representation in the other (the sketch below illustrates both reasons).
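Both reasons are mechanical enough to demonstrate. Here is a minimal Python sketch (standard library only; the sample characters are my own illustrative choices, not from the answer). It shows that Greek and Latin letters kept separate code points while regional variants of an ideograph were unified, and that Shift-JIS conversion is table-driven, with the JIS and Microsoft tables disagreeing on some mappings; this is the class of problem described in the Microsoft article linked in the question comments.

```python
import unicodedata

# Reason 1: Latin "a" and Greek alpha are separate code points...
print(unicodedata.name("a"))       # LATIN SMALL LETTER A
print(unicodedata.name("\u03b1"))  # GREEK SMALL LETTER ALPHA
# ...but regional variants of an ideograph share one "unified" code point:
print(unicodedata.name("\u6d77"))  # CJK UNIFIED IDEOGRAPH-6D77

# Reason 2: conversion needs mapping tables, and the tables disagree.
# The same byte pair decodes differently under the JIS table
# ("shift_jis") and Microsoft's variant ("cp932"):
print(b"\x81\x60".decode("shift_jis"))  # U+301C WAVE DASH
print(b"\x81\x60".decode("cp932"))      # U+FF5E FULLWIDTH TILDE

# A round trip across the two variants is lossy:
wave_dash = b"\x81\x60".decode("shift_jis")
try:
    wave_dash.encode("cp932")  # cp932 has no mapping for U+301C
except UnicodeEncodeError as err:
    print("conversion failed:", err)
```

The failed round trip at the end is exactly the kind of information loss the answer describes.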

More reasons may have been given that I do not remember anymore.

Ludwig Schulze
  • 201
  • 2
  • 3
  • 1
    It appears that as of 2.0 Ruby did adopt UTF-8 as the default. But Han unification seems to be a really important wrinkle (and [quite controversial issue](https://news.ycombinator.com/item?id=8041288)) in the world of Unicode that apparently doesn't get enough attention, since I've never heard of it before. – Kyle Strand Mar 23 '17 at 06:56
  • 1
    And here is a Wikipedia article on the Han unification issue: https://en.wikipedia.org/wiki/Han_unification That indeed seems to be a valid issue, great answer! Also, loss of data would be a good reason. – spbnick Oct 17 '17 at 13:06
8

deceze's answer has a very strong element of truth to it, but there is another reason why Shift-JIS and others are still in use: UTF-8 is horrifically inefficient for some languages, mostly in the CJK set. Shift-JIS is, IIRC, a two-byte-wide encoding, whereas UTF-8 typically needs three bytes and occasionally even four per character for CJK and similar scripts.
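The width difference is easy to check. A minimal sketch (the three-character sample string is my own choice, not from the answer):

```python
text = "日本語"  # three kanji, all within the Basic Multilingual Plane

print(len(text.encode("shift_jis")))  # 6  (2 bytes per character)
print(len(text.encode("utf-8")))      # 9  (3 bytes per character)
```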

JUST MY correct OPINION
  • 4,002
  • 1
  • 23
  • 22
  • 8
    While that is true, there's always the alternative of UTF-16, which may be as efficient as Shift-JIS. I'd also argue that the headache of dealing with different encodings far outweighs the slight increase in size in this day and age. To put it another way, I have never heard the argument of efficiency *for* Shift-JIS by anybody still using it. ;-) – deceze Jun 08 '11 at 09:32
  • 6
    I've heard the efficiency issue used as an excuse for sloth and inertia, though. – JUST MY correct OPINION Jun 08 '11 at 09:42
  • 1
    Hehe, same result then. :o) – deceze Jun 08 '11 at 09:45
  • 2
    UTF-16 makes basic ASCII characters [of which there are a sizable number in e.g. HTML] twice as large. As I understand it, this ends up actually making UTF-16 even worse than UTF-8 for Japanese webpages. – Random832 Jun 08 '11 at 13:00
  • Wouldn't that depend on how much English is on the Japanese web pages? – JUST MY correct OPINION Jun 08 '11 at 13:22
  • 3
    @JUST My correct OPINION: Try "View Source" or the equivalent. Assuming all the actual text is in Japanese, there's likely to be a lot of keywords and the like that were derived from English, and are represented in ASCII. – David Thornley Jun 08 '11 at 15:48
  • 1
    I agree with the efficiency principle, which is why my browser proposes "deflate/gzip" as compression methods to the server. And those are so standard that they are hardware-accelerated in today's routers... – Matthieu M. Jun 08 '11 at 20:06
  • 6
    This sounds to me like a justification we find **afterwards**. I am pretty sure efficiency has close to absolutely nothing to do with the status quo. To me it's just inertia and legacy. I also think it has to do with the fact that most code produced by Japanese programmers is for other Japanese people, so they don't even feel the need to use something like Unicode. – Julien Guertault Jun 09 '11 at 01:15
  • 2
    The claim in this answer is false. For ideographic languages, it's 50% larger in uncompressed form than UTF-16 or legacy DBCS encodings, but not significantly larger when compressed. In no way is this "horrifically inefficient". And compared to other languages, either UTF-8 or DBCS CJK is extremely efficient just because of the density of information per character. To compare, many Indic languages require 3 bytes per character and have many (like 10+) character words for things that could be represented with 1-3 characters (3-9 bytes) in Japanese. – R.. GitHub STOP HELPING ICE Jun 21 '18 at 00:58
2

Count string size/memory usage amongst the primary reasons.

In UTF-8, East Asian languages frequently need 3 or more bytes per character. On average, they need 50% more memory than with UTF-16, which itself is already less efficient than the native encodings.
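That 50% figure can be reproduced directly for pure kana/kanji text; a quick sketch with an illustrative sample string:

```python
text = "ウィキペディアへようこそ"  # 12 kana characters

utf8 = len(text.encode("utf-8"))       # 36 bytes: 3 per character
utf16 = len(text.encode("utf-16-le"))  # 24 bytes: 2 per character, no BOM
print(utf8 / utf16)                    # 1.5, i.e. 50% more memory
```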

The other main reason would be legacy, as pointed out by deceze.

Denis de Bernardy
  • 3,913
  • 21
  • 18
2

Legacy and storage size, as others said, but there is one more thing: Katakana characters.

It takes only one byte to represent Katakana characters in Shift-JIS, so Japanese text including Katakana takes less than 2 bytes per character (1.5 for a 50/50 mix), making Shift-JIS somewhat more efficient than UTF-16 (2 bytes/char), and much more efficient than UTF-8 (3 bytes/char).
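This is easy to verify, with the caveat (raised in the comment below) that only half-width katakana are single-byte in Shift-JIS; a quick sketch:

```python
half = "ｱ"   # U+FF71 HALFWIDTH KATAKANA LETTER A
full = "ア"  # U+30A2 KATAKANA LETTER A

print(len(half.encode("shift_jis")))  # 1 byte
print(len(full.encode("shift_jis")))  # 2 bytes
print(len(half.encode("utf-8")))      # 3 bytes
print(len(full.encode("utf-8")))      # 3 bytes
```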

Cheap storage should have made this a much smaller problem, but apparently not.

azheglov
  • 7,177
  • 1
  • 27
  • 49
  • 1
    this is not quite correct. *Half-width katakana* are 1 byte long, but they're quite uncommon and are used only on limited terminals that support only a 1-byte charset. Normal katakana are still 2 bytes long – phuclv Apr 18 '20 at 12:47