
I would imagine the reason was fast, array-like access to the character at an index, but some characters won't fit into 16 bits, so that wouldn't work...

So if you have to handle special cases anyway, why not just use UTF-8?
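The premise in the question can be seen directly in Java (a minimal demo of my own, not from the post): once a supplementary character appears, `char`-based indexing no longer lines up with characters.

```java
// Demonstrates that String.length() counts 16-bit char units, not characters:
// a supplementary code point (here an emoji) occupies two surrogate units.
public class Utf16Demo {
    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // "a😀b" — the emoji needs a surrogate pair
        System.out.println(s.length());                      // 4 char units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println((int) s.charAt(1));               // 55357: a lone high surrogate, not a character
    }
}
```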

haylem
zduny
  • I know this is an old question, but for anyone else searching this topic I would like to point out the false premise here. UTF-16 works as a 16-bit array *most of the time*; all the String type needs to do is track whether a 4-byte code point is ever present. – Frank Sep 13 '22 at 05:57
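The scheme Frank's comment hints at could be sketched like this (a hypothetical illustration of mine, not how `java.lang.String` is actually implemented): scan once for surrogate units and fall back to slower code-point logic only when one is found.

```java
// Sketch: a string wrapper could cache a "contains supplementary characters"
// flag at construction time; Character.isSurrogate detects both halves of a pair.
public class SurrogateCheck {
    static boolean hasSupplementary(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasSupplementary("plain ASCII"));        // false: fast 16-bit indexing is safe
        System.out.println(hasSupplementary("math: \uD835\uDC00")); // true: needs code-point handling
    }
}
```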

2 Answers


Because it used to be UCS-2, which was a nice fixed-length 16-bit encoding. Of course, 16 bits turned out not to be enough, so UTF-16 was retrofitted in on top.
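The retrofit this answer describes is visible in the standard library (my own example): code points above U+FFFF, which did not exist under the UCS-2 model, are split into two surrogate `char` units, while BMP code points still take a single unit.

```java
// Character.toChars shows the UTF-16 encoding of a code point:
// supplementary code points become a high/low surrogate pair.
public class Retrofit {
    public static void main(String[] args) {
        int cp = 0x1F600; // 😀, outside the 16-bit range
        char[] units = Character.toChars(cp);
        System.out.println(units.length);                                // 2
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D83D DE00
        // BMP code points still take one unit, exactly as under UCS-2:
        System.out.println(Character.toChars(0x0041).length);            // 1 ('A')
    }
}
```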

stkent
DeadMG
  • 6
    Here is a quote from the [Unicode FAQ](http://www.unicode.org/faq//utf_bom.html): `Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.` At the time of Java release UTF-16 hasn't yet appeared, and UTF-8 was not a part of Unicode standard. – Malcolm Nov 08 '12 at 02:26

For the most part, for the sake of plain and simple future-proofing. Whether that was a misguided reason and the wrong way to go about it is a different question.

You can see some of the reasons behind their design decisions in this document about the 2004 switch to Java 5 and UTF-16, which explains some of the shortcomings as well: Supplementary Characters in the Java Platform. See also Why does the Java ecosystem use different encodings throughout their stack?.

For more details on the pitfalls of using UTF-16, and why UTF-8 is likely to be a better option in general, see Should UTF-16 be considered harmful? and the UTF-8 Everywhere manifesto.
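One concrete pitfall from those discussions, shown in Java (my example, not from the answer): `char`-based slicing can cut a surrogate pair in half, silently producing an invalid string.

```java
// Naive substring splits the emoji's surrogate pair; the safe version
// advances by code points with offsetByCodePoints.
public class SplitPitfall {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00!"; // "😀!"

        String broken = s.substring(0, 1); // keeps only the high surrogate
        System.out.println(broken.codePointAt(0) == 0xD83D); // true: an unpaired surrogate

        int end = s.offsetByCodePoints(0, 1); // index just past the first *code point*
        String ok = s.substring(0, end);      // "😀", a complete pair
        System.out.println(ok.codePointCount(0, ok.length())); // 1
    }
}
```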

haylem
  • 9
    +1 for linking to the "Should UTF-16 be considered harmful?" question. I recently discovered the [UTF-8 Everywhere manifesto](http://utf8everywhere.org/) and I believe I am now pretty thoroughly convinced. For what it's worth, although Java got it wrong, I'm pretty convinced that Windows did much much worse. – Daniel Pryden Nov 07 '12 at 17:32
  • 5
    Well, it's not a surprise that Windows got it *more wrong*: They made the switch to Unicode earlier, so they had fewer correct choices and less experience. Java got later, got it *more right*, but still somewhat wrong. Now *both* have to live with old, incorrect-in-the-general-sense APIs that they have to keep supporting. – Joachim Sauer Nov 08 '12 at 07:44
  • 4
    That's life in the software world, you have to make choices without having all the data, and when you're wrong you get to live with the consequences for a long time. :-) – Brian Knoblauch Nov 09 '12 at 14:45
  • 2
    I wonder what the performance implications would have been of making `string` a "special" type in Java (much like `Array` is), rather than having `String` be an "ordinary" class which holds a reference to an "ordinary" array containing the actual characters. Depending upon how a string is generated, UTF-8, UTF-16, or even UTF-32 may be the most efficient way of storing it. I don't think there's any particularly efficient way for an "ordinary" class `String` to handle multiple formats, but a "special" type with JVM support could. – supercat Feb 26 '14 at 23:35
  • @supercat: I don't exactly have a precise answer for that, but I've got [a related SO answer](http://stackoverflow.com/a/4402560/453590) for that. :) Doesn't really address the special type approach, but discusses the potential gain of having streamlined strings. – haylem Feb 27 '14 at 08:51
  • @haylem: I've looked some into the inner workings of Java and the JVM, and I think efficient support for multiple kinds of string would require some features that are common in hardware but not supported by the JVM. For example, on many hardware platforms there would be no difficulty having an array which could either be accessed as an array of 200 long, 400 int, 800 short, or 1600 byte, and an efficient string type could really benefit from using such a thing, but a JVM array would only be able to support one usage directly and would have to emulate the rest. – supercat Mar 02 '14 at 23:25
  • @haylem: Many approaches for efficiently handling different usage patterns of string would require a means of quickly accessing some arrays as a mixture of primitive types; such ability may be the difference between efficient handling of compressed strings versus inefficient handling. – supercat Mar 02 '14 at 23:27