Questions tagged [unicode]

Unicode is intended to be a universal character set covering every character needed for written text, including all writing systems, technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
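As a rough illustration of the mapping above, the sketch below uses Python's built-in `ord` and `chr` to move between a character and its code point. The helper name `code_point_label` is hypothetical and only formats the number in the conventional U+XXXX style.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    # ord() gives the integer code point; pad to at least four hex digits.
    return f"U+{ord(ch):04X}"


# Characters from the list above, written as escapes to avoid ambiguity.
samples = ["A", "B", "C", "\u039b", "\u039c"]
labels = [code_point_label(ch) for ch in samples]

# chr() is the inverse: code point back to character.
round_trip = [chr(int(label[2:], 16)) for label in labels]

for ch, label in zip(samples, labels):
    print(label, ch)
```

The round trip through `chr` should return the original characters, which is what makes the code point usable as a unique reference.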

Unicode Transformation Formats

UTFs describe how to encode code points as byte representations. The most common forms are UTF-8 (which encodes code points as a sequence of one, two, three or four bytes) and UTF-16 (which encodes code points as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
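The rows of the table can be reproduced with Python's `str.encode`, as in this minimal sketch. The helper name `hex_bytes` is hypothetical, and the U+1F600 example is an addition that is not in the table; it is included only to show the four-byte case in both encodings.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Encode one character and show its bytes as spaced hex (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in ch.encode(encoding))


# Rows from the table: (code point, UTF-8 bytes, UTF-16 big-endian bytes).
rows = [
    (f"U+{ord(ch):04X}", hex_bytes(ch, "utf-8"), hex_bytes(ch, "utf-16-be"))
    for ch in ["A", "B", "C", "\u039b", "\u039c"]
]

# Extra case, not in the table: a code point above U+FFFF needs four bytes
# in UTF-8 and a surrogate pair (two 16-bit units) in UTF-16.
supplementary = "\U0001F600"
supp_utf8 = hex_bytes(supplementary, "utf-8")
supp_utf16 = hex_bytes(supplementary, "utf-16-be")

for row in rows:
    print(*row, sep="\t")
print("U+1F600", supp_utf8, supp_utf16, sep="\t")
```

Using the explicit `utf-16-be` codec avoids the byte order mark that Python's plain `utf-16` codec would prepend.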

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
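Two of the operations mentioned above, normalization and case mapping, are exposed by Python's standard library. The following is a minimal sketch rather than a full treatment; the helper name `same_text` is hypothetical, and the choice of NFC as the comparison form is an assumption made for the example.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # letter e followed by a combining acute accent

# The two spellings differ as code point sequences but normalize to the same text.
raw_equal = composed == decomposed
normalized_equal = same_text(composed, decomposed)

# Case mapping is not always one-to-one: sharp s uppercases to two letters.
upper_sharp_s = "\u00df".upper()
folded = "Stra\u00dfe".casefold()

print(raw_equal, normalized_equal, upper_sharp_s, folded)
```

Locale-sensitive sorting is not covered here; Python's standard library does not implement the Unicode Collation Algorithm directly.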

63 questions
432
votes
20 answers

Should UTF-16 be considered harmful?

I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?" Why do I ask this question? How many programmers are aware of the fact that UTF-16 is actually a variable…
Artyom
  • 2,079
  • 4
  • 17
  • 17
90
votes
5 answers

Would UTF-8 be able to support the inclusion of a vast alien language with millions of new characters?

In the event an alien invasion occurred and we were forced to support their languages in all of our existing computer systems, is UTF-8 designed in a way to allow for their possibly vast amount of characters? (Of course, we do not know if aliens…
89
votes
15 answers

Is it bad to use Unicode characters in variable names?

I recently tried to implement a ranking algorithm, AllegSkill, in Python 3. Here's what the maths looks like: No, really. This is then what I wrote: t = (µw-µl)/c # those are used in e = ε/c # multiple places. σw_new = (σw**2 * (1 -…
badp
  • 1,870
  • 1
  • 16
  • 21
46
votes
2 answers

Should UTF-8 CSV files contain a BOM (byte order mark)?

Our line-of-business software allows the user to save certain data as CSV. Since there are a lot of different formats (all called "CSV") in use in the wild, we are trying to decide what the "default format" should look like. Regarding line/field…
Heinzi
  • 9,646
  • 3
  • 46
  • 59
42
votes
8 answers

Why are there multiple Unicode encodings?

I thought Unicode was designed to get around the whole issue of having lots of different encodings due to a small address space (8 bits) in most of the prior attempts (ASCII, etc.). Why then are there so many Unicode encodings? Even multiple versions…
Matthew Scharley
  • 1,627
  • 13
  • 17
40
votes
3 answers

Why do we need to put N before strings in Microsoft SQL Server?

I'm learning T-SQL. From the examples I've seen, to insert text in a varchar() cell, I can write just the string to insert, but for nvarchar() cells, every example prefixes the strings with the letter N. I tried the following query on a table which…
qinking126
  • 541
  • 1
  • 5
  • 6
35
votes
2 answers

Unicode license

The Unicode Terms of Use state that any software that uses their data files (or a modification of them) should carry the Unicode license references. It seems to me that most Unicode libraries have functions to check whether a character is a digit, a…
Eric Grange
  • 403
  • 3
  • 9
34
votes
8 answers

Should character encodings besides UTF-8 (and maybe UTF-16/UTF-32) be deprecated?

A pet peeve of mine is looking at so many software projects that have mountains of code for character set support. Don't get me wrong, I'm all for compatibility, and I'm happy that text editors let you open and save files in multiple character…
Joey Adams
  • 5,535
  • 3
  • 30
  • 34
31
votes
2 answers

Why does Java use UTF-16 for internal string representation?

I would imagine the reason was fast, array-like access to the character at a given index, but some characters won't fit into 16 bits, so it wouldn't work... So if you have to handle special cases anyway, why not just use UTF-8?
zduny
  • 2,623
  • 2
  • 19
  • 24
28
votes
5 answers

What issues lead people to use Japanese-specific encodings rather than Unicode?

At work I come across a lot of Japanese text files in Shift-JIS and other encodings. It causes many mojibake (unreadable character) problems for all computer users. Unicode was intended to solve this sort of problem by defining a single character…
Nicolas Raoul
  • 1,062
  • 1
  • 11
  • 20
22
votes
1 answer

Why are there so many spaces and line breaks in Unicode?

Unicode has maybe 50 spaces \u0009\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000 and 6 line breaks not only…
maaartinus
  • 2,633
  • 1
  • 21
  • 29
18
votes
4 answers

Why exactly can't PHP have full unicode support?

Everybody knows that PHP has problems with Unicode. Version 6 is effectively abandoned because of Unicode implementation difficulties. But I wonder if anyone knows what the exact reasons are? Architecture/design problems, performance concerns,…
ts01
  • 1,171
  • 10
  • 17
16
votes
3 answers

Is it possible to write a generalized string reverse function that works for all localisations and string types?

I was just watching the Jon Skeet (with Tony the Pony) presentation from Dev-Days. Although "write a string reverse function" is coding interview 101 - I'm not sure that it's actually possible to write a general string reverse function, certainly…
Martin Beckett
  • 15,776
  • 3
  • 42
  • 69
16
votes
2 answers

Is UTF-16 fixed-width or variable-width? Why doesn't UTF-8 have byte-order problem?

Is UTF-16 fixed-width or variable-width? I got different results from different sources: From http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF: UTF-16 stores Unicode characters in sixteen-bit chunks. From…
Tim
  • 5,405
  • 7
  • 48
  • 84
14
votes
3 answers

A Unicode sentinel value I can use?

I am designing a file format and I want to do it right. Since it is a binary format, the very first byte (or bytes) of the file should not form valid textual characters (just like in the PNG file header). This allows tools that do not recognize the…
Daniel A.A. Pelsmaeker
  • 2,715
  • 3
  • 22
  • 27