90

In the event an alien invasion occurred and we were forced to support their languages in all of our existing computer systems, is UTF-8 designed in a way to allow for their possibly vast number of characters?

(Of course, we do not know if aliens actually have languages, if or how they communicate, but for the sake of the argument, please just imagine they do.)

For instance, if their language consisted of millions of newfound glyphs, symbols, and/or combining characters, could UTF-8 theoretically be expanded in a non-breaking way to include these new glyphs and still support all existing software?

I'm more interested in what happens if the glyphs far outgrow the current size limitations and require more bytes to represent a single glyph. If UTF-8 could not be expanded, would that prove that its single advantage over UTF-32 is simply the smaller encoding of lower code points?

  • 16
    "support their _languages_" (my emphasis)... How many? Are we sure the languages can be broken down to characters? Maybe the language is based on spatial relations. - see Ted Chiang "Story of Your Life", _Stories of Your Life and Others_. At best, this is simply a max-things-in-X-bytes question (off-topic). At worst, it's speculative nonsense. (not clear what you're asking) – Scant Roger Nov 24 '15 at 13:12
  • 6
    @ScantRoger The accepted answer does a fine job at answering the question as it was intended. – Qix - MONICA WAS MISTREATED Nov 24 '15 at 13:13
  • 11
    The accepted answer does a fine job of telling us the facts of UTF-8, UTF-16, and UTF-32. You could simply look this up on Wikipedia. As for "alien invasion", I don't see how the answer addresses it at all. – Scant Roger Nov 24 '15 at 13:17
  • 10
    Related (on Stack Overflow): [Is UTF-8 enough for all common languages?](http://stackoverflow.com/q/2438896/99456) – yannis Nov 24 '15 at 13:17
  • 2
    If they are alien, how do you know how they are going to communicate? As an example, how would you add Dolphin to UTF-8? – Karl Gjertsen Nov 24 '15 at 13:18
  • By the way: a hypothetical “UTF-8 v2” could host at most 70,936,234,112 characters. That is due to the internal structure of the encoding. `echo 'ibase=2;10000000+00100000*01000000+00010000*(01000000^10)+00001000*(01000000^11)+00000100*(01000000^100)+00000010*(01000000^101)+00000001*(01000000^110)' | bc` – Boldewyn Nov 24 '15 at 15:06
  • @Boldewyn could you explain that? Gets me a few errors in the terminal. What would impose this limit? – Qix - MONICA WAS MISTREATED Nov 24 '15 at 15:08
  • 1
    @Qix I'd love to, but the comments are a bit too narrow, and I can't answer a question on hold :(. Basically, section 3 of [RFC3629](https://tools.ietf.org/html/rfc3629#section-3). The first byte determines how many follow-up bytes mark up the character. This means UTF-8 can use at most ~8 bytes to represent one character (give or take). – Boldewyn Nov 24 '15 at 15:10
  • @Qix OK, I found that I missed the whole "use all 8 bytes" in the calculation above. See my answer for the in-depth calculation + correct value. – Boldewyn Nov 24 '15 at 16:22
  • 9
    Unicode does not support languages, it supports *characters* - glyphs used to represent meaning in written form. Many human languages do not have a script and hence cannot be supported by Unicode. Not to mention many animals communicate but don't have a written language. Communication by, say, illustrations or wordless comics cannot be supported by Unicode since the set of glyphs is not finite. By definition we don't know how aliens communicate, so your question is impossible to answer. If you just want to know how many distinct characters Unicode can support, you should probably clarify :) – JacquesB Nov 24 '15 at 16:41
  • 3
    I've removed several meta comments, this is _not_ the place for them. See [When should I comment? & When shouldn't I comment?](http://programmers.stackexchange.com/help/privileges/comment) for more details. If anyone is interested in further discussing the suitability of the question, the site's various policies, our editing culture & guidelines, or alien invasion, please do so on [Meta](http://meta.programmers.stackexchange.com/q/7756/25936) or [chat](http://chat.stackexchange.com/rooms/info/21/the-whiteboard). – yannis Nov 24 '15 at 17:46
  • It appears the [concept of aliens arriving on Earth and filling up our Unicode sets](http://unicode.org/mail-arch/unicode-ml/y2001-m05/0425.html) isn't all that unheard of ;) – Qix - MONICA WAS MISTREATED Sep 06 '16 at 07:27

5 Answers

109

The Unicode standard has lots of space to spare. The Unicode codepoints are organized in “planes” and “blocks”. Of 17 total planes, there are 11 currently unassigned. Each plane holds 65,536 characters, so there are realistically over 700,000 codepoints to spare for an alien language (unless we fill all of that up with more emoji before first contact). As of Unicode 8.0, only 120,737 code points have been assigned in total (roughly 10% of the total capacity), with roughly the same amount being unassigned but reserved for private, application-specific use. In total, 974,530 codepoints are unassigned.

UTF-8 is a specific encoding of Unicode, and is currently limited to four octets (bytes) per code point, which matches the limitations of UTF-16. In particular, UTF-16 can only address 17 planes via its surrogate mechanism. The original UTF-8 design allowed up to 6 octets per codepoint and could address 32,768 planes. In principle this 4-byte limit could be lifted, but that would break the current organizational structure of Unicode, and would require UTF-16 to be phased out – unlikely to happen in the near future considering how entrenched it is in certain operating systems and programming languages.
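
To make that byte math concrete, here is a small Python sketch (added for illustration, not part of the original answer) that counts the payload bits per UTF-8 sequence length:

# Lead byte of an n-byte sequence (n >= 2) keeps 7 - n free bits;
# each continuation byte (10xxxxxx) contributes 6 more.
def payload_bits(n_bytes):
    return 7 if n_bytes == 1 else (7 - n_bytes) + 6 * (n_bytes - 1)

for n in range(1, 7):
    print(n, "bytes ->", payload_bits(n), "bits, up to",
          hex(2 ** payload_bits(n) - 1))

# 4 bytes -> 21 bits (up to 0x1fffff, i.e. 32 planes), of which RFC 3629
# only allows up to 0x10ffff (17 planes) so UTF-16 can keep up;
# 6 bytes -> 31 bits, i.e. the 32,768 planes of the original design.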

The only reason UTF-16 is still in common use is that it's an extension to the flawed UCS-2 encoding which only supported a single Unicode plane. It otherwise inherits undesirable properties from both UTF-8 (not fixed-width) and UTF-32 (not ASCII compatible, waste of space for common data), and requires byte order marks to declare endianness. Given that despite these problems UTF-16 is still popular, I'm not too optimistic that this is going to change by itself very soon. Hopefully, our new Alien Overlords will see this impediment to Their rule, and in Their wisdom banish UTF-16 from the face of the earth.

amon
  • Thanks, I was wondering about the 4/6 octet thing. An [article](http://joelonsoftware.com/articles/Unicode.html) I was just reading mentioned it was 6, but that was written in 2003. – Qix - MONICA WAS MISTREATED Nov 24 '15 at 12:53
  • 7
    Actually, UTF-8 is limited to only a part of even the 4-byte-limit, in order to match UTF-16. Specifically, to 17/32 of it, slightly more than half. – Deduplicator Nov 24 '15 at 15:19
  • 1
    @Deduplicator Where did you get the 17/32 number from? – kasperd Nov 24 '15 at 16:37
  • @kasperd: Take a look at [the encoding](https://en.wikipedia.org/wiki/UTF-8#Description), and you will see that 4 codeunits mean 21 value-bits, which is 32 * 64K. – Deduplicator Nov 24 '15 at 16:42
  • @Deduplicator Now I get what you were saying. I misinterpreted your comment as saying 17/32 of the possible 32-bit code points. I should have read it as 17/32 of the code points that can be expressed with 4 bytes in UTF-8, which indeed means those that fit in 21 bits. – kasperd Nov 24 '15 at 16:54
  • 3
    The only real problem with UTF-16 is that a lot of software implements UCS-2 and says that it implements UTF-16. I know this sucks, but unless you have a time machine to wipe out UCS-2's existence, this can only be changed by fixing the legacy software. Complaining about it in every post about Unicode won't help. The question doesn't even mention UTF-16, so why does this answer talk about it even more than UTF-8? – Malcolm Nov 24 '15 at 21:56
  • 4
    @Malcolm Good point! The question asks “could UTF-8 theoretically be expanded in a non-breaking way”. My answer asserts that (1) this is really about Unicode rather than any specific encoding, and (2) the reason why UTF-8 (and Unicode itself) is so limited is because these limitations match those of UTF-16. These limitations will have to stick around in the Unicode ecosystem until UTF-16 is phased out (which realistically will never happen). – amon Nov 24 '15 at 22:01
  • 1
    @amon I don't think that "only about 1.2 million possible codepoints" is much of a limitation. And if we had to extend the UTF-8 specification to allow more than 4 bytes, we could just as well extend UTF-16 to allow a third code unit – Voo Nov 24 '15 at 22:08
  • 1
    While we're at it, you could steal (reassign) private use areas, though some people would be annoyed with you. I imagine that violates the Unicode Stability Policy, too. But if we were really desperate for more space, it is probably less drastic than dropping UTF-16, though it gives you a lot less space to play with. – Kevin Nov 24 '15 at 22:42
  • We can unlock UTF-8 to pull the long sequences back in and support up to 31 bits per character. – Joshua Nov 24 '15 at 22:58
  • 5
    Outside of Windows I know of no other OS where either the OS or the majority of programs on the OS use UTF16. OSX programs are typically UTF8, Android programs are typically UTF8, Linux are typically UTF8. So all we need is for Windows to die (it already is sort of dead in the mobile space) – slebetman Nov 25 '15 at 03:14
  • 25
    *Unless we fill all of that up with more emoji before first contact*... There you have it. The most significant threat to peaceful interaction with aliens is emoji. We're doomed. – rickster Nov 25 '15 at 06:03
  • 13
    @slebetman Not really. Anything JVM-based uses UTF-16 (Android as well, not sure why you say it doesn't), JavaScript uses UTF-16, and given that Java and JavaScript are the most popular languages, UTF-16 is not going anywhere anytime soon. – Malcolm Nov 25 '15 at 08:34
  • 1
    @slebetman: All Android Java code uses UTF-16, and most of its C++ code that supports Unicode afaik uses UTF-32. OS X and iOS use UTF-16 in the C, Objective-C and Swift Apple frameworks and UTF-32 in most normal C and C++ code with Unicode support. Most Linux code uses UTF-32 for Unicode, too. UTF-8 only makes sense when size is more important than processing performance, which means mainly in networking and when storing strings to disk. For the rest of the code, both UTF-16 and UTF-32 are strongly preferable over UTF-8 and are therefore what is normally used by most code on all OSes. – Kaiserludi Nov 25 '15 at 19:21
  • 5
    @Kaiserludi "Most linux code uses UTF32 for unicode", yeah, no. Seriously where the hell did you get that idea? There is not even a `wfopen` syscall or anything else, it's UTF8 all the way. Hell even Python and Java - both which define strings as UTF-16 due to historical reasons - do not store strings as UTF-16 except when necessary.. large memory benefits and no performance hits (and that despite the additional code to handle conversions - memory is expensive, CPU is cheap). Same goes for Android - the NDK's JString is UTF8, mostly because Google engineers are not insane. – Voo Nov 25 '15 at 22:19
  • @Kaiserludi: In C++ the Unicode library of choice -- ICU -- uses UTF-16 internally. – DevSolar Nov 26 '15 at 09:38
  • @DevSolar: Yes, because it started when there was UCS2. No way to throw it away, especially as there are users. Nowadays, it uses UTF-8 internally too. – Deduplicator Nov 26 '15 at 15:40
  • 1
    @Deduplicator: Sorry, but no. ICU uses 16bit `UChar`, both in the C API and as [internal buffer type](http://icu-project.org/apiref/icu4c/classicu_1_1UnicodeString.html#abbeabb99d92d2bfd07f67df82fec87e0) of the C++ `UnicodeString` class. And I don't know of any Linux-native, UTF-8 API that is fully Unicode compliant. I'm not talking about filenames etc., but actually *handling* text here. – DevSolar Nov 26 '15 at 16:06
  • @DevSolar Apart from ICU in its various incarnations there is no API that handles all the intricacies of Unicode that I know of (and I'm sure some things aren't even handled by ICU). But yes ICU mostly uses UTF-16 internally, because it's also a child of the 90s when UCS2 meant it made perfect sense to use it everywhere. But ICU is slowly moving towards UTF-8 - some of the new APIs (in c++) take a `icu::StringPiece` which is UTF-8 – Voo Nov 26 '15 at 20:52
  • 1
    @Voo Judging by [runtime/mirror/string.h](https://android.googlesource.com/platform/art/+/master/runtime/mirror/string.h), Android clearly stores string in UTF-16 internally. Same thing with [OpenJDK 8](http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/share/vm/classfile/javaClasses.cpp). So what you're saying about Java is untrue. But regardless it is much more important what the programmers are working with and not what Google decided to write for their own use, and the other programmers have to work with UTF-16 since this is how Java strings are represented. – Malcolm Nov 27 '15 at 11:49
  • @Malcolm I have no idea where you get the idea that I argued that Android uses anything but UTF-16 to represent java.lang.String. Now if you want to argue about what I actually said, namely that the NDKs JString is represented as UTF-8, feel free to do so. As for HotSpot, you might want to read JEP 254. – Voo Nov 27 '15 at 12:43
  • @Voo I linked the source file for the internal representations of strings in Android runtime, and it clearly doesn't use UTF-8. Maybe you meant something else by NDK string, in that case you should clarify that. As for JEP 254, it just says that when it's possible to store as ISO-8859-1, the strings should be stored as ISO-8859-1, but I wouldn't say this means they "do not store strings as UTF-16 except when necessary". It even explicitly says that storing strings as UTF-8 is a non-goal. – Malcolm Nov 27 '15 at 12:49
  • @Malcolm I said that HotSpot avoids using UTF-16 when possible, not that it uses UTF-8 (because well, it doesn't). And yeah I confused my terminology - Android uses modified UTF-8 in the NDK, but jstring is still the c++ wrapper around java.lang.String which yes uses UTF-16. – Voo Nov 29 '15 at 11:13
30

If UTF-8 is actually to be extended, we should look at the absolute maximum it could represent. UTF-8 is structured like this:

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

(shamelessly copied from the RFC.) We see that the first byte always controls how many follow-up bytes make up the current character.

If we extend it to allow up to 8 bytes, we get the additional non-Unicode representations:

111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Calculating the maximum possible representations that this technique allows, we come to

  10000000₂
+ 00100000₂ * 01000000₂
+ 00010000₂ * 01000000₂^2
+ 00001000₂ * 01000000₂^3
+ 00000100₂ * 01000000₂^4
+ 00000010₂ * 01000000₂^5
+ 00000001₂ * 01000000₂^6
+ 00000001₂ * 01000000₂^7

or in base 10:

  128
+  32 * 64
+  16 * 64^2
+   8 * 64^3
+   4 * 64^4
+   2 * 64^5
+   1 * 64^6
+   1 * 64^7

which gives us a maximum of 4,468,982,745,216 possible representations.
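
A quick Python sketch (added for illustration) confirms that arithmetic for this hypothetical 8-byte scheme:

# A lead byte with k leading 1-bits followed by a 0 leaves 7 - k free bits,
# and each of its k - 1 continuation bytes contributes 6 bits; the all-ones
# lead byte 11111111 starts the 8-byte form with 7 continuation bytes.
total = 128                                                      # 0xxxxxxx
total += sum(2 ** (7 - k) * 64 ** (k - 1) for k in range(2, 8))  # 2- to 7-byte forms
total += 64 ** 7                                                 # 8-byte form
print(total)  # 4468982745216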

So, if these roughly 4.5 trillion characters (or 4.5 billion, if you prefer the long scale) are enough to represent the alien languages, I am quite positive that we can, with minimal effort, extend the current UTF-8 to please our new alien overlords ;-)

Boldewyn
  • 8
    Currently UTF-8 is limited to code points up to 0x10FFFF - but that is only for compatibility with UTF-16. If there was a need to extend it, there is no ambiguity about how to extend it with code points up to 0x7FFFFFFF (that's 2³¹-1). But beyond that I have seen conflicting definitions. One definition I have seen has `111111xx` as a possible first byte followed by five extension bytes for a maximum of 2³² code points. But that is only compatible with the definition you mention for the first 2³¹ code points. – kasperd Nov 24 '15 at 16:49
  • 2
    Yes, [Wikipedia](https://en.wikipedia.org/wiki/UTF-8#Description) says something about UTF-16, when really they mean Unicode or ISO 10646 (depending on context). Actually, since RFC 3629, UTF-8 _is_ undefined beyond U+10FFFF (or `F4 8F BF BF` in UTF-8 bytes). So, everything I mention here beyond that is pure speculation. Of course, someone could think of other extensions, where a high first byte signifies some other structure following (and hopefully not destroying self sync in the process). I tried to complete the byte scheme to be as close to the real UTF-8 as possible, though. – Boldewyn Nov 24 '15 at 19:47
  • 4
    That's 4 trillion, not quadrillion. – Ypnypn Nov 25 '15 at 00:56
  • 1
    It's not strictly necessary for the number of following bytes to always be one less than the number of leading ones in the first byte. Perl actually supports (since 2000) an internal variant of UTF-8 where the 5, 6, and 7 byte forms are the same as this answer, but `FF` introduces a 13-byte code unit capable of storing 72 bits. Anything over 2^36 is uniformly *very* expensive, but it allows encoding a 64-bit int and then some. – hobbs Nov 25 '15 at 17:37
7

RFC3629 restricts UTF-8 to a maximum of four bytes per character, with a maximum value of 0x10FFFF, allowing a maximum of 1,112,064 code points. Obviously this restriction could be removed and the standard extended, but this would prove a breaking change for existing code that works to that limit.

From a data-file point of view, this wouldn't be a breaking change, as the encoding is self-describing: the lead byte of each sequence indicates, through its high bits, how many continuation bytes (each of the form 10xxxxxx) follow it. Even before RFC3629, the standard was limited to 31 bits, leaving the most significant bit of a 32-bit code point value unset.

Extending the standard beyond 0x10FFFF would, however, break UTF-8's compatibility with UTF-16, since UTF-16 cannot represent code points above that limit.
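
To illustrate the data-versus-code distinction, here is a small Python sketch (added here, not part of the original answer). A hypothetical 5-byte sequence keeps the lead-byte-plus-continuation-bytes shape, so byte-oriented tools can still skip over it, but a decoder that enforces RFC3629 rejects it:

# Hypothetical 5-byte sequence that would encode U+200000 under the old
# pre-RFC3629 scheme (111110xx followed by four 10xxxxxx bytes).
hypothetical = b"\xf8\x88\x80\x80\x80"

# Data level: every trailing byte still matches the 10xxxxxx pattern,
# so self-synchronization is preserved.
print(all(0x80 <= b <= 0xBF for b in hypothetical[1:]))  # True

# Code level: an existing RFC3629-conformant decoder refuses it.
try:
    hypothetical.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)  # e.g. "invalid start byte"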

David Arno
  • 5
    So in theory, the **data** would be backwards compatible, but the **code** wouldn't inherently be compatible with the modification to the standard? – Qix - MONICA WAS MISTREATED Nov 24 '15 at 12:28
  • 2
    @Qix, That's a valid point. Any existing UTF-8 file would naturally be compatible with, e.g., a maximum of 6 bytes to accommodate millions more code points, but many existing libraries designed to handle UTF-8 would likely not handle that extension. – David Arno Nov 24 '15 at 12:30
  • 4
    UTF-16 would break fatally. It can inherently only support code points up to 0x10FFFF. – gnasher729 Nov 24 '15 at 15:06
  • 1
    @gnasher729: Not as big an issue as you'd think. Pre-Unicode solved this via shift values (Shift JIS for Japanese). They'd simply mark a reserved/unused character (0xFFFD?) as a "shift character", that shifts the encoding into a more extended form. Probably UTF32. – Mooing Duck Nov 25 '15 at 17:45
4

Really, only two Unicode code-points could stand for infinitely many glyphs, if they were combining characters.

Compare, for example, the two ways that Unicode encodes the Korean Hangul alphabet: Hangul Syllables and Hangul Jamo. The character 웃 in Hangul Syllables is the single code-point C6C3, whereas in Hangul Jamo it is the three code-points 110B (ㅇ) 116E (ㅜ) 11BA (ㅅ). Obviously, using combining characters takes up vastly fewer code-points, but is less efficient for writing because more bytes are needed to write each character.

With this trick, there is no need to go beyond the number of code-points that can currently be encoded in UTF-8 or UTF-16.
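
As a concrete illustration (added here, using Python's unicodedata module), the syllable above is just the canonical composition of the three Jamo, and the precomposed form is indeed smaller on the wire:

import unicodedata

precomposed = "\uC6C3"                                  # 웃, Hangul Syllables block
decomposed = unicodedata.normalize("NFD", precomposed)  # conjoining Hangul Jamo

print([hex(ord(c)) for c in decomposed])  # ['0x110b', '0x116e', '0x11ba']
print(len(precomposed.encode("utf-8")))   # 3 bytes in UTF-8
print(len(decomposed.encode("utf-8")))    # 9 bytes in UTF-8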

I guess it comes down to how offended the aliens would be if their language happened to require many more bytes per message than earthly languages. If they don't mind, say, representing each of their millions of characters using a jumble of say, 100k combining characters, then there's no problem; on the other hand if being forced to use more bytes than earthlings makes them feel like second-class citizens, we could be in for some conflict (not unlike what we already observe with UTF-8).

Owen
  • This is only the case if the characters in the alien language are actually composed of a more limited set of graphemes. This might not be the case. – JacquesB Nov 25 '15 at 12:57
  • 1
    As far as I am aware there is no requirement that combining characters need to relate to individual graphemes. [Unicode FAQ](http://unicode.org/faq/char_combmark.html) is silent on this, but my impression is that it wouldn't be any harder for a layout engine to support combining sequences that are not sequences of graphemes, since in either case a precomposed glyph would be required. – Owen Nov 25 '15 at 14:33
  • How long do these aliens live, and how many characters not decomposable into graphemes can they learn during childhood? And does precomposed Hangul retain its byte advantage over decomposed Hangul even after gzip? – Damian Yerrick Jun 30 '16 at 15:28
-2

Edit: The question now says "millions of new characters". This makes it easy to answer:

No. UTF-8 is a Unicode encoding. Unicode has a codespace which allows 1,114,112 distinct codepoints, and fewer than a million of them are currently unassigned. So it is not possible to support millions of new characters in Unicode. By definition, no Unicode encoding can support more characters than Unicode itself defines. (Of course you can cheat by encoding a level further; any kind of data can be represented by just two characters, after all.)
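
The arithmetic behind that codespace figure is short (a sketch added for illustration):

# 17 planes of 65,536 code points each, minus the surrogate range,
# which can never encode characters.
planes = 17
codespace = planes * 0x10000   # 1,114,112 code points
surrogates = 0xE000 - 0xD800   # 2,048 surrogate code points
print(codespace)               # 1114112
print(codespace - surrogates)  # 1112064 usable scalar values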


To answer the original question:

Unicode does not support languages as such; it supports characters - symbols used to represent language in written form.

Not all human languages have a written representation, so not all human languages can be supported by Unicode. Furthermore, many animals communicate but don't have a written language. Whales, for example, have a form of communication which is complex enough to be called a language, but it does not have any written form (and cannot be captured by existing phonetic notation either). So not even all languages on Earth can be supported by Unicode.

Even worse is something like the language of bees. Not only does it not have a written form, it cannot meaningfully be represented in written form. The language is a kind of dance which basically points in a direction but relies on the current position of the sun. Therefore the dance only has informational value at the particular place and time where it is performed. A symbolic or textual representation would have to include information (location, position of the sun) which the language of bees currently cannot express.

Even a written or symbolic form of communication might not be possible to represent in Unicode. For example illustrations or wordless comics cannot be supported by Unicode since the set of glyphs is not finite. You will notice a lot of pictorial communication in international settings like an airport, so it is not inconceivable that a race of space-travelling aliens will have evolved to use a pictorial language.

Even if an alien race had a language with a writing system with a finite set of symbols, this system might not be possible to support in Unicode. Unicode expects writing to be a linear sequence of symbols. Music notation is an example of a writing system which cannot be fully represented in Unicode, because meaning is encoded in both choice of symbols and vertical and horizontal placement. (Unicode does support individual musical symbols, but cannot encode a score.) An alien race which communicated using polyphonic music (not uncommon) or a channel of communication of similar complexity, might very well have a writing system looking like an orchestral score, and Unicode cannot support this.

But let's, for the sake of argument, assume that all languages, even alien languages, can be expressed as a linear sequence of symbols selected from a finite set. Is Unicode big enough for an alien invasion? Unicode currently has fewer than a million unassigned codepoints. The Chinese language contains around a hundred thousand characters according to the most comprehensive Chinese dictionary (not all of them are currently supported by Unicode as distinct characters). So only ten languages with the complexity of Chinese would use up all of Unicode. On Earth we have hundreds of distinct writing systems, but luckily most are alphabetic rather than ideographic and therefore contain a small number of characters. If all written languages used ideograms like Chinese, Unicode would not even be big enough for Earth. The use of alphabets is derived from speech, which only uses a limited number of phonemes, but that is particular to human physiology. So even a single alien planet with only a dozen ideographic writing systems might exceed what Unicode can support. Now consider that these aliens may already have invaded other planets before Earth and included those planets' writing systems in the set of characters which have to be supported.

Expansion or modification of current encodings, or introduction of new encodings will not solve this, since the limitation is in the number of code points supported by Unicode.

So the answer is most likely no.

JacquesB
  • 6
    You're lacking in imagination. Dance choreographers have plenty of language and terminology they can use to describe and teach the dances the stage actors are to perform. If we were to learn what bees were communicating, we could definitely devise a written terminology for it. After all, most of our written languages today are an encoding of sound. Encoding movement isn't all that different from encoding sound. – whatsisname Nov 24 '15 at 19:43
  • @whatsisname: Indeed there exist several written notations for choreography, but these notations are typically two-dimensional (like an orchestral score) and therefore cannot be represented as a sequence of Unicode symbols. – JacquesB Nov 24 '15 at 19:51
  • 4
    Parts of this answer are good but to say "Not only does it not have a written form, it cannot possibly be represented in written form" is just plain wrong. Anything that conveys information can be reduced to bits, and anything reduced to bits can be transformed into pretty much any stream of characters you like. – Gort the Robot Nov 24 '15 at 21:17
  • 2
    @StevenBurnap True, but Unicode is more than just a sequence of bits. It is a way of interpreting those bits, that is fairly rigid. Yes, the Unicode character set could be expanded to represent anything from images to CNC instructions, but this would be a very different creature. – Owen Nov 24 '15 at 21:23
  • 4
    Keep in mind that what unicode symbols describe (in most languages) are patterns in the variation of air pressure, and that for most languages it actually does a fairly crappy job of actually matching those patterns. – Gort the Robot Nov 24 '15 at 21:28
  • @StevenBurnap That is a good point. There is a long history of illiterate cultures forcing their language into a written form upon contact with a literate culture. But in the case of the Celts and the Romans, the Romans were clearly dominant so it made sense for the Celts to adapt. But if the aliens were dominant, we might find the process going the other way, and Unicode (and all written scripts) going by the wayside. – Owen Nov 24 '15 at 21:32
  • @StevenBurnap: The dance of the honeybee communicates direction and distance relative to the current position of the sun and the beehive. You need to be at the exact time and place of the dance to understand the information. A textual or symbolic representation of the dance itself would not carry any information unless you also know where the dancing happened, the position of the sun at the time, etc. - but the language of bees cannot express that. Hence the language of bees is not abstract enough that a written form would be meaningful. – JacquesB Nov 24 '15 at 23:26
  • 3
    So you mean the sentence "fly 45 seconds with the sun 15 degrees to your left, then fly 10 seconds with the sun 10 degrees to your right" is impossible? It certainly requires the position of the sun at the time as context. – Gort the Robot Nov 25 '15 at 03:26
  • @StevenBurnap: What bees do is basically pointing in a certain direction and saying how long. You cannot represent that textually in a useful way without a way of describing where you have to start. But the language of bees cannot express that. – JacquesB Nov 25 '15 at 07:55
  • @JacquesB You should listen to your navigation system more often: "Drive 5 km straight, then turn left". Or even more complex are the instructions for airplanes. Or how do you think we steer spacecraft? We already have written concepts to express movement relative to the rotating plane of the solar system. And even in 3D, like for the Ulysses spacecraft and other sol-polar spacecraft. So in short: Yes, we can express the dance of the bees in characters; not to say that it's easy, but it's certainly possible to do that in UTF-8. – Angelo Fuchs Nov 25 '15 at 09:39
  • @AngeloFuchs: Of course *we* can, because we have a more advanced language than bees. But the language of bees can't. A GPS only works because it has a built-in map. If bees could express and communicate maps in addition to directions, they could have a useful written language. But they can't. (Yet!) – JacquesB Nov 25 '15 at 10:07
  • 2
    What's to say the bees *aren't* said alien overlords? – user Nov 25 '15 at 15:24