16
  1. Is UTF-16 fixed-width or variable-width? I got different results from different sources:

    From http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF:

    UTF-16 stores Unicode characters in sixteen-bit chunks.

    From http://en.wikipedia.org/wiki/UTF-16/UCS-2:

    UTF-16 (16-bit Unicode Transformation Format) is a character encoding for Unicode capable of encoding 1,112,064[1] numbers (called code points) in the Unicode code space from 0 to 0x10FFFF. It produces a variable-length result of either one or two 16-bit code units per code point.

    From the first source:

    UTF-8 also has the advantage that the unit of encoding is the byte, so there are no byte-ordering issues.

    Why doesn't UTF-8 have a byte-order problem? It is variable-width, and one character may take more than one byte, so I would think byte order could still be a problem.

Thanks and regards!

Tim
  • This great article [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://joelonsoftware.com/articles/Unicode.html) will help answer all your questions about Unicode and UTF. – Sorceror Jul 23 '11 at 14:19

2 Answers

15

(1) What does byte sequence mean, an array of char in C? Is UTF-16 a byte sequence, or what is it then? (2) Why does a byte sequence have nothing to do with variable length?

You seem to be misunderstanding what endian issues are. Here's a brief summary.

A 32-bit integer takes up 4 bytes. Now, we know the logical ordering of these bytes. If you have a 32-bit integer, you can get the high byte of this with the following code:

uint32_t value = 0x8100FF32;
uint8_t highByte = (uint8_t)((value >> 24) & 0xFF); //Now contains 0x81

That's all well and good. Where the problem begins is how various hardware stores and retrieves integers from memory.

In Big Endian order, a 4 byte piece of memory that you read as a 32-bit integer will be read with the first byte being the high byte:

[0][1][2][3]

In Little Endian order, a 4 byte piece of memory that you read as a 32-bit integer will be read with the first byte being the low byte:

[3][2][1][0]

If you take a pointer to a 32-bit value and reinterpret it as a pointer to bytes, you can do this:

uint32_t value = 0x8100FF32;
uint32_t *pValue = &value;
uint8_t *pHighByte = (uint8_t*)pValue;
uint8_t highByte = pHighByte[0]; //Now contains... ?

C and C++ don't say which byte you get here; it depends on how the machine lays the value out in memory. On a big-endian system it will be 0x81, and on a little-endian system it will be 0x32.

If you have a pointer to a memory address, you can read that address as a 32-bit value, a 16-bit value, or an 8-bit value. On a big endian machine, the pointer points to the high byte; on a little endian machine, the pointer points to the low byte.
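
To see which kind of machine you're running on, a minimal sketch (assuming uint32_t is four bytes) is to store a known value and look at the byte at its lowest address:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t value = 0x8100FF32;
    uint8_t firstByte = *(uint8_t*)&value; //the byte at the lowest address of 'value'

    if (firstByte == 0x81)
        printf("big endian: the high byte comes first in memory\n");
    else if (firstByte == 0x32)
        printf("little endian: the low byte comes first in memory\n");
    return 0;
}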

Note that this is all about how values are laid out in memory when they are read and written. It has nothing to do with arithmetic done purely on values: the first version of the code, which uses shifting and masking rather than inspecting memory, will always get the high byte regardless of the machine's byte order.

The issue arises when you start reading byte streams, such as from a file.

16-bit values have the same issues as 32-bit ones; they just have 2 bytes instead of 4. Therefore, a file could contain 16-bit values stored in big endian or little endian order.

UTF-16 is defined as a sequence of 16-bit values. Effectively, it is a uint16_t[]. Each individual code unit is a 16-bit value. Therefore, in order to properly load UTF-16, you must know what the endian-ness of the data is.
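
For example, a loader that pulls UTF-16 out of a byte stream has to assemble each 16-bit code unit from two bytes. A minimal sketch (the helper name is just for illustration, and the endianness of the data is assumed to be known already):

#include <stdint.h>

//Assemble one UTF-16 code unit from two raw bytes read from a file or stream.
//Which byte is the high one depends on the declared endianness of the data,
//not on the machine running this code.
uint16_t read_code_unit(const uint8_t bytes[2], int bigEndian)
{
    if (bigEndian)
        return (uint16_t)((bytes[0] << 8) | bytes[1]); //UTF-16BE
    else
        return (uint16_t)((bytes[1] << 8) | bytes[0]); //UTF-16LE
}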

UTF-8 is defined as a sequence of 8-bit values. It is a uint8_t[]. Each individual code unit is 8 bits in size: a single byte.

Now, both UTF-16 and UTF-8 allow multiple code units (16-bit or 8-bit values) to combine to form a Unicode codepoint (a "character", though that's not quite the right term; it's a simplification). The order of the code units that form a codepoint is dictated by the UTF-16 and UTF-8 encodings themselves.

When processing UTF-16, you read a 16-bit value, doing whatever endian conversion is needed. Then, you detect if it is a surrogate pair; if it is, then you read another 16-bit value, combine the two, and from that, you get the Unicode codepoint value.
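
That surrogate-pair step is just arithmetic on the two code units. A sketch (the function name is illustrative; both units are assumed to already be read and byte-swapped as needed):

#include <stdint.h>

//Combine a UTF-16 surrogate pair into a Unicode codepoint.
//'high' must be in 0xD800-0xDBFF and 'low' in 0xDC00-0xDFFF.
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000u + (((uint32_t)(high - 0xD800u) << 10) | (uint32_t)(low - 0xDC00u));
}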

When processing UTF-8, you read an 8-bit value. No endian conversion is possible, since a code unit is a single byte. If that first byte denotes a multi-byte sequence, you read however many additional bytes the lead byte dictates; the order of those bytes within the sequence, just like the order of surrogate pairs in UTF-16, is defined by UTF-8 itself.
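
A sketch of that decode step (illustrative only; validation of continuation bytes and overlong forms is omitted):

#include <stddef.h>
#include <stdint.h>

//Decode one codepoint from a UTF-8 sequence. The lead byte alone says how
//many continuation bytes follow; their order is fixed by UTF-8 itself, so
//no endian handling is needed. Returns the number of bytes consumed,
//or 0 for an invalid lead byte.
size_t decode_utf8(const uint8_t *s, uint32_t *codepoint)
{
    if (s[0] < 0x80) {                      //1-byte sequence (ASCII)
        *codepoint = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     //2-byte sequence
        *codepoint = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    } else if ((s[0] & 0xF0) == 0xE0) {     //3-byte sequence
        *codepoint = ((uint32_t)(s[0] & 0x0F) << 12) |
                     ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    } else if ((s[0] & 0xF8) == 0xF0) {     //4-byte sequence
        *codepoint = ((uint32_t)(s[0] & 0x07) << 18) |
                     ((uint32_t)(s[1] & 0x3F) << 12) |
                     ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0;
}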

So there can be no endian issues with UTF-8.

Nicol Bolas
10

Jeremy Banks' answer is correct as far as it goes, but didn't address byte ordering.

When you use UTF-16, most characters are stored in a single two-byte code unit - but when that code unit is written to a disk file, in what order do you store the constituent bytes?

As an example, the CJK character for "water" (水) has the UTF-16 encoding 6C34 in hexadecimal. When you write that as two bytes to disk, do you write it "big-endian" (the two bytes are 6C 34), or do you write it "little-endian" (the two bytes are 34 6C)?

With UTF-16, both orderings are legitimate, and you usually indicate which one the file has by making the first word in the file a Byte Order Mark (BOM), which for big-endian encoding is FE FF, and for little-endian encoding is FF FE.
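
A reader can check those first two bytes before interpreting anything else. A minimal sketch (the function name is just for illustration):

#include <stdint.h>

//Inspect the first two bytes of a UTF-16 file.
//Returns 1 for big-endian (FE FF), 0 for little-endian (FF FE),
//and -1 if there is no BOM and the byte order must come from elsewhere.
int utf16_byte_order(const uint8_t bom[2])
{
    if (bom[0] == 0xFE && bom[1] == 0xFF) return 1;  //UTF-16BE
    if (bom[0] == 0xFF && bom[1] == 0xFE) return 0;  //UTF-16LE
    return -1;
}

With the "water" example above, the bytes following the BOM would be 6C 34 in a big-endian file and 34 6C in a little-endian one.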

UTF-32 has the same problem, and the same solution.

UTF-8 doesn't have this problem, because its code unit is a single byte, and the order of the bytes within a multi-byte sequence is fixed by the encoding itself. For instance, the letter "P" is always encoded as the single byte 50, and the replacement character (U+FFFD) is always encoded as the three bytes EF BF BD, in that order.
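
To make that concrete, here is a small sketch that prints the bytes of a string holding those two characters:

#include <stdio.h>
#include <string.h>

int main(void)
{
    //UTF-8 bytes always appear in one fixed order, defined by the encoding:
    //"P" (U+0050) is the byte 50; U+FFFD is the bytes EF BF BD.
    const char *text = "P\xEF\xBF\xBD";
    for (size_t i = 0; i < strlen(text); i++)
        printf("%02X ", (unsigned char)text[i]);
    printf("\n");                           //prints: 50 EF BF BD
    return 0;
}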

Some programs put a three-byte indicator (EF BB BF) at the start of a UTF-8 file, and that helps distinguish UTF-8 from similar encodings like ASCII, but that's not very common except on MS Windows.

Bob Murphy
  • Thanks! (1) The letter "P" is just one byte in UTF-8 - why is the replacement character added to its code? (2) In UTF-8, there are other characters that take more than one byte. Why isn't the byte order between the bytes of each such character a problem? – Tim Jul 23 '11 at 00:59
  • @Tim: (1) You don't add the replacement character to the code for P. If you see the bytes 50 EF BF BD, that's two characters - a "P", and a replacement character. – Bob Murphy Jul 23 '11 at 01:53
  • (2) You always write and read the three bytes of the "replacement character" as EF BF BD, in that order. There would only be a byte-ordering issue if you could also write the "replacement character" as BD BF EF - but you can't; that sequence of three bytes would be something other than a "replacement character". – Bob Murphy Jul 23 '11 at 01:55
  • @Tim: You might want to work through http://en.wikipedia.org/wiki/UTF-8. It's really quite good, and if you can understand all of it and the other Unicode-related Wikipedia pages, I think you'll find you have no more questions about it. – Bob Murphy Jul 23 '11 at 02:00
  • The reason that UTF-8 has no problem with byte order is that the encoding is defined *as a byte sequence*, and that there are no variations with different endianness. It has nothing to do with variable length. – starblue Jul 23 '11 at 05:59
  • @starblue: Thanks! (1) What does byte sequence mean, an array of char in C? Is UTF-16 a byte sequence, or what is it then? (2) Why does a byte sequence have nothing to do with variable length? – Tim Jul 23 '11 at 15:07
  • Internally a UTF-16 string is a sequence of 16 bit words. The Unicode standard specifies three different ways to map it to a byte sequence, UTF-16 (big-endian or little-endian with byte order mark), UTF-16LE (little-endian without byte order mark) and UTF-16BE (big-endian without byte order mark). See chapter 3 of the Unicode standard for the details: http://www.unicode.org/versions/latest/ch03.pdf – starblue Jul 24 '11 at 06:36
  • @Tim No, the problems with byte order have nothing to do with variable length; they arise from the fact that there are different mappings from the internal representation to the external byte sequence for UTF-16 and UTF-32. These were introduced to make it more efficient for hardware to use these representations, basically by allowing them to use their internal representation directly, without conversion. You could easily have a little-endian variant of UTF-8, too, but as this would provide no advantage and would just add more confusion, it is not done, thankfully. – starblue Jul 24 '11 at 06:46
  • This answer talks about byte ordering on disk (LE vs. BE) as being something different from the order in memory. It is always the case that an OS's representation on disk is the same as it is in memory. A BOM applies specifically to text files, not data files. OS specific files that contain text often won't have BOM's, ex.: many NT logfiles. Of course, datafiles (like registry related files et al.) usually don't. – Astara Dec 01 '20 at 22:39
  • @Astara IIRC, my use of disk files in this answer was intended as an illustration. I didn't say it, but the ideas also carry over to other media that might store Unicode text, such as memory and network protocols. It's true that OS specific files that contain text often won't have BOM's. They can also be compressed, encrypted, stored in unusual encodings, etc., because the OS vendor controls the formats of those files. – Bob Murphy Dec 02 '20 at 21:15
  • However, my experience across many operating systems is that it's not "always the case that an OS's representation on disk is the same as it is in memory." Operating systems that have historically run on CPUs with different default endianisms often do this to preserve compatibility of data formats across CPU architectures. I know specific examples of this with macOS and (now defunct) SunOS, and while I can't think of any offhand, it wouldn't surprise me if there were some for Linux and/or Solaris. – Bob Murphy Dec 02 '20 at 21:15