I'm going to be writing an application that is pure HTML5 and JS with an MVC.net back-end. We have .resx files that get compiled to .js files as resources for the HTML5 application. The application has to work in English and in Chinese, which I understand to mean that we need to use UTF-16 everywhere.

Does anyone have experience using UTF-16 for such a task, or know of any best practices for it?

maxfridbe
  • Only use UTF-16 when working with `string` and `char`. Use UTF-8 for output. The only unusual problem is that UCS-2 != UTF-16, since Chinese has some codepoints that require two code units (i.e. one codepoint that consists of two `char`s) – CodesInChaos Mar 14 '13 at 15:51
  • Related: [Should UTF-16 be considered harmful?](http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful) –  Mar 14 '13 at 15:53
  • Your understanding is incorrect. You do not need to use UTF-16 everywhere. UTF-8 represents all Unicode characters, is more appropriate for a web app, and is arguably less likely to have Chinese-specific breakage than UTF-16. – comingstorm Mar 14 '13 at 16:54
  • I agree. After reading about endianness and variable-length encoding, UTF-8 makes more sense overall. I was just looking for feedback. One of the devs on this project had recommended UTF-16, but after reading about it I found no supporting reason. – maxfridbe Mar 14 '13 at 16:58
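
The point about codepoints that need two code units can be seen directly in JavaScript, whose strings are sequences of UTF-16 code units. The following is only a minimal sketch: the helper names `codeUnitCount` and `codePointCount` are made up for illustration, and U+20000 is picked as one example of a Chinese character outside the Basic Multilingual Plane.

```javascript
// A common Chinese character inside the Basic Multilingual Plane (U+4E2D).
const bmp = "\u4E2D";
// A rarer character on a supplementary plane (U+20000, CJK Extension B).
const supplementary = "\u{20000}";

// Hypothetical helper: number of UTF-16 code units, which is what .length reports.
function codeUnitCount(text) {
  return text.length;
}

// Hypothetical helper: number of Unicode code points.
// Array.from iterates a string by code point, not by code unit.
function codePointCount(text) {
  return Array.from(text).length;
}

console.log(codeUnitCount(bmp), codePointCount(bmp));
console.log(codeUnitCount(supplementary), codePointCount(supplementary));

// The two code units of the supplementary character are a surrogate pair.
console.log(supplementary.charCodeAt(0).toString(16), supplementary.charCodeAt(1).toString(16));
```

Code that assumes one `char` (or one `.length` unit) per character works for the first string and silently miscounts the second, which is the UCS-2 versus UTF-16 pitfall mentioned above.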

1 Answer

Why do you have this understanding? Both encodings (UTF-8 and UTF-16) can encode all Unicode characters, by definition of being Unicode encodings.

Anyway, UTF-8 is more efficient for storage and transmission than UTF-16 in your case. The majority of the characters in your files will not be Chinese but markup and JS syntax. UTF-8 uses 1 byte for each of those whereas UTF-16 uses 2 bytes, hence UTF-8 wins.

For common Chinese characters UTF-8 needs 3 bytes and UTF-16 needs 2 bytes. Both need 4 bytes for the rarer characters on the supplementary planes. This gives a 33% saving for UTF-16 per common Chinese character.

UTF-8 uses 1 byte for any "programming character". `<div>` is 5 bytes in UTF-8 and 10 bytes in UTF-16. That is a 50% saving for UTF-8 per "programming character".
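
The byte counts above can be checked with a small script. This is a rough sketch, not a benchmark: `utf8Bytes` and `utf16Bytes` are hypothetical helper names, the mixed markup sample is invented for illustration, and the UTF-16 figure assumes no byte order mark. It relies on the standard `TextEncoder`, available in browsers and recent Node.js versions.

```javascript
// Hypothetical helper: byte length of a string when encoded as UTF-8.
function utf8Bytes(text) {
  return new TextEncoder().encode(text).length;
}

// Hypothetical helper: byte length when encoded as UTF-16 without a BOM.
// JavaScript strings are UTF-16 code units, and each code unit is 2 bytes.
function utf16Bytes(text) {
  return text.length * 2;
}

const samples = [
  "<div>",                          // markup only
  "\u4E2D",                         // one common Chinese character (U+4E2D)
  "\u{20000}",                      // one supplementary-plane character (U+20000)
  '<div class="title">中文</div>',  // invented sample: markup with a little Chinese
];

for (const sample of samples) {
  console.log(JSON.stringify(sample), "UTF-8:", utf8Bytes(sample), "UTF-16:", utf16Bytes(sample));
}
```

For the mixed sample, the markup dominates, so the UTF-8 total comes out smaller even though each of the two Chinese characters costs one extra byte.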

Esailija
  • I suppose after reading a lot of "Should UTF-16 be considered harmful?" I am a bit confused. My main fear is that I would have a UTF-8 document that cannot show a character properly in a web browser in China. – maxfridbe Mar 14 '13 at 16:11
  • @maxfridbe Why do you think that? Browsers are required to support UTF-8, but not UTF-16. This is in the HTML5 draft, I can find it. – Esailija Mar 14 '13 at 16:14