
http://babelstone.blogspot.com/2005/11/how-many-unicode-characters-are-there.html says there are about 1 million Unicode code points, around 240k of which are already assigned.

1 million > 240k > 65k

However,

http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.strings.chrw says that ChrW accepts only 65k characters.

Not only that, ChrW accepts an Integer. An Integer is 4 bytes in VB.NET, right? So it can store way more than 65k values.

The numbers do not match up, so what am I missing?
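For reference, here's a minimal VB.NET sketch of the mismatch (my own snippet; the exception behavior is taken from the linked ChrW documentation):

```vb
Module ChrWDemo
    Sub Main()
        ' &H20AC (the Euro sign, U+20AC) is inside the 65k range, so ChrW accepts it.
        Console.WriteLine(ChrW(&H20AC))

        ' &H1F600 (an emoji, U+1F600) fits easily in an Integer,
        ' but ChrW rejects anything above 65535 with an ArgumentException.
        Try
            Console.WriteLine(ChrW(&H1F600))
        Catch ex As ArgumentException
            Console.WriteLine("ChrW rejected &H1F600")
        End Try
    End Sub
End Module
```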

user4951

1 Answer


Chances are that this method only supports the Basic Multilingual Plane (BMP) of Unicode. That plane contains the lowest 64k code points and can be represented with a 16-bit data type.
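As a quick illustration (my own snippet, not part of the original answer): .NET's Char is a 16-bit UTF-16 code unit, so its range lines up exactly with the BMP.

```vb
Module BmpRange
    Sub Main()
        ' Char is a 16-bit type in .NET, so its range is exactly
        ' the BMP: U+0000 through U+FFFF.
        Console.WriteLine(Convert.ToInt32(Char.MinValue)) ' 0
        Console.WriteLine(Convert.ToInt32(Char.MaxValue)) ' 65535 (&HFFFF)
    End Sub
End Module
```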

There was a time when the BMP was all the Unicode standard defined, and at that time many languages and/or runtimes added "Unicode support". They thought that 16 bits would always be enough, and therefore "Unicode" equals "16-bit characters" in many places (even though this is wrong these days). To be fair: the Unicode Consortium also thought that 16 bits ought to be enough for everybody.

Unicode 2.0, however, introduced additional planes, and it became clear that 16 bits are no longer enough to represent every possible Unicode code point.
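To put numbers on that: Unicode defines 17 planes of 65,536 code points each, so 17 × 65,536 = 1,114,112 possible code points (U+0000 through U+10FFFF). That is the "roughly 1 million" figure from the question, while a 16-bit type only covers the first 65,536.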

The "solution" to this is usually to use UTF-16 instead of UCS-2. I'm not only faulting .NET for this: Java has fallen into the same trap, having a 16-bit char data type and now having to support String instances that need 2 "characters" to represent a single codepoint.

Joachim Sauer
  • So soon we will have chrVeryWide(someNumber)? Or is that function already there? – user4951 May 23 '12 at 08:10
  • @JimThio: I don't know .NET well enough to know if it's there, and a quick peek into the String class documentation you linked to doesn't reveal anything obvious. In Java there's now an `int` equivalent for most methods dealing with a `char`, where the word `Char`/`Character` is replaced by `Codepoint` in the name. So [`String.charAt()`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#charAt(int)) is replaced by [`String.codePointAt()`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#codePointAt(int)). – Joachim Sauer May 23 '12 at 09:26
  • yep, UTF-16 is a "historical accident" and should be considered broken. http://benlynn.blogspot.co.uk/2011/02/utf-8-good-utf-16-bad_07.html – gbjbaanb May 23 '12 at 13:19
  • @gbjbaanb: actually I'd argue that the historical accident is UCS-2. UTF-16 is an attempt to make the best of it. It's a "UCS-2 compatible encoding" in the same way that UTF-8 is an "ASCII compatible encoding". So UTF-16 is the middle ground: it's not perfect, but it allows you to continue using most of the tools that expect/handle UCS-2. UTF-8 is almost certainly the long-term solution to all the encoding problems. – Joachim Sauer May 23 '12 at 13:23
  • Note: Java fell into the trap because it was designed before UTF-16 existed. .NET, on the other hand, was designed far later, so I would say it was just a bad design choice that preferred compatibility with Windows over compatibility with the rest of the world. And yes, it's a bad design choice because, for instance, all ASP.NET does is convert its strings back and forth between UTF-16 and UTF-8... in addition to the higher memory consumption. – ybungalobill Jun 22 '12 at 11:31