Questions tagged [unicode]

Unicode is intended to be a universal character set covering every character required for written text, across all writing systems, technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

U+0041 A
U+0042 B
U+0043 C
...
U+039B Λ
U+039C Μ
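As a small illustration of the mapping above, the following Python sketch prints the code point of a few characters in the usual U+XXXX notation. The helper name `code_point_label` is made up for this example; `ord` is the built-in that returns a character's code point.

```python
# Hypothetical helper: format a single character's code point as U+XXXX.
def code_point_label(ch: str) -> str:
    return f"U+{ord(ch):04X}"

# "\u039b" and "\u039c" are the Greek capital letters lambda and mu from the list above.
text = "ABC\u039b\u039c"
labels = [code_point_label(ch) for ch in text]

for ch, label in zip(text, labels):
    print(label, ch)
```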

Unicode Transformation Formats

UTFs describe how to encode code points as byte representations. The most common forms are UTF-8 (which encodes code points as a sequence of one, two, three or four bytes) and UTF-16 (which encodes code points as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
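The rows of this table can be reproduced with Python's built-in codecs. This is only a sketch: the helper names `hex_bytes` and `encodings` are invented for the example, and U+1F600 is added as an assumption-free but extra sample to show the four-byte case in both encodings.

```python
# Hypothetical helper: render bytes as space-separated uppercase hex pairs.
def hex_bytes(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)

# Hypothetical helper: return (UTF-8, UTF-16 big-endian) byte strings for one character.
def encodings(ch: str) -> tuple:
    return hex_bytes(ch.encode("utf-8")), hex_bytes(ch.encode("utf-16-be"))

# U+0041 and U+039B come from the table; U+1F600 lies outside the 16-bit range,
# so UTF-16 needs a surrogate pair (four bytes) for it.
for ch in "A\u039b\U0001F600":
    utf8, utf16 = encodings(ch)
    print(f"U+{ord(ch):04X}  {utf8:<12} {utf16}")
```

The `utf-16-be` codec is used rather than plain `utf-16` so that no byte order mark is prepended to the output.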

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
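A rough sketch of two of these operations, normalization and case mapping, using Python's standard library. Note the assumption: `unicodedata.normalize` and the `str` case methods apply the locale-independent default rules only, so this does not cover locale-sensitive tailoring or sorting.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT is the decomposed form of U+00E9.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# NFC composes to one code point; NFD splits it back into two.
print(len(decomposed), len(composed))
print(unicodedata.normalize("NFD", composed) == decomposed)

# Case mapping is not always one-to-one: U+00DF (sharp s) uppercases to "SS",
# and casefold() gives a form intended for caseless comparison.
upper_sharp_s = "\u00df".upper()
folded = "Stra\u00dfe".casefold()
print(upper_sharp_s, folded)
```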

Latest Version of the Standard

Identifying Characters

63 questions

432

votes

20 answers

Should UTF-16 be considered harmful?

I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?" Why do I ask this question? How many programmers are aware of the fact that UTF-16 is actually a variable…

unicode

asked Jun 26 '09 at 16:04

Artyom

2,079
4
17
17

votes

5 answers

Would UTF-8 be able to support the inclusion of a vast alien language with millions of new characters?

In the event an alien invasion occurred and we were forced to support their languages in all of our existing computer systems, is UTF-8 designed in a way to allow for their possibly vast amount of characters? (Of course, we do not know if aliens…

unicode utf-8

asked Nov 24 '15 at 12:18

Qix - MONICA WAS MISTREATED

1,896
16
32

votes

15 answers

Is it bad to use Unicode characters in variable names?

I recently tried to implement a ranking algorithm, AllegSkill, to Python 3. Here's what the maths looks like: No, really. This is then what I wrote: t = (µw-µl)/c # those are used in e = ε/c # multiple places. σw_new = (σw**2 * (1 -…

naming unicode

asked Nov 01 '10 at 10:51

badp

1,870
1
16
21

votes

2 answers

Should UTF-8 CSV files contain a BOM (byte order mark)?

Our line-of-business software allows the user to save certain data as CSV. Since there are a lot of different formats (all called "CSV") in use in the wild, we are trying to decide what the "default format" should look like. Regarding line/field…

standards unicode csv file-formats

asked Jun 18 '18 at 07:36

Heinzi

9,646
3
46
59

votes

8 answers

Why are there multiple Unicode encodings?

I thought Unicode was designed to get around the whole issue of having lots of different encodings due to a small address space (8 bits) in most of the prior attempts (ASCII, etc.). Why then are there so many Unicode encodings? Even multiple versions…

unicode text-encoding

asked May 20 '11 at 05:22

Matthew Scharley

1,627
13
17

votes

3 answers

Why do we need to put N before strings in Microsoft SQL Server?

I'm learning T-SQL. From the examples I've seen, to insert text in a varchar() cell, I can write just the string to insert, but for nvarchar() cells, every example prefixes the strings with the letter N. I tried the following query on a table which…

sql sql-server character-encoding unicode

asked Jul 06 '12 at 14:47

qinking126

votes

2 answers

Unicode license

The Unicode Terms of Use state that any software that uses their data files (or a modification of them) should carry the Unicode license references. It seems to me that most Unicode libraries have functions to check whether a character is a digit, a…

unicode licensing

asked Sep 28 '12 at 07:02

Eric Grange

votes

8 answers

Should character encodings besides UTF-8 (and maybe UTF-16/UTF-32) be deprecated?

A pet peeve of mine is looking at so many software projects that have mountains of code for character set support. Don't get me wrong, I'm all for compatibility, and I'm happy that text editors let you open and save files in multiple character…

unicode utf-8 character-encoding

asked Jan 26 '11 at 03:32

Joey Adams

5,535
3
30
34

votes

2 answers

Why does Java use UTF-16 for internal string representation?

I would imagine the reason was fast, array-like access to the character at index, but some characters won't fit into 16 bits, so it wouldn't work... So if you have to handle special cases anyway, why not just use UTF-8?

java strings unicode

asked Nov 07 '12 at 13:40

zduny

2,623
2
19
24

votes

5 answers

What issues lead people to use Japanese-specific encodings rather than Unicode?

At work I come across a lot of Japanese text files in Shift-JIS and other encodings. It causes many mojibake (unreadable character) problems for all computer users. Unicode was intended to solve this sort of problem by defining a single character…

legacy unicode character-encoding

asked Jun 08 '11 at 06:36

Nicolas Raoul

1,062
1
11
20

votes

1 answer

Why are there so many spaces and line breaks in Unicode?

Unicode has maybe 50 spaces \u0009\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000 and 6 line breaks not only…

unicode

asked Jan 30 '11 at 01:12

maaartinus

2,633
1
21
29

votes

4 answers

Why exactly can't PHP have full unicode support?

Everybody knows that PHP has problems with Unicode. Version 6 is effectively abandoned because of Unicode implementation difficulties. But I wonder if anyone knows what the exact reasons are? Architecture/design problems, performance concerns,…

php open-source architecture language-design unicode

asked Dec 26 '10 at 13:15

ts01

1,171
10
17

votes

3 answers

Is it possible to write a generalized string reverse function that works for all localisations and string types?

I was just watching the Jon Skeet (with Tony the Pony) presentation from Dev-Days. Although "write a string reverse function" is coding interview 101 - I'm not sure that it's actually possible to write a general string reverse function, certainly…

algorithms strings unicode localization

asked Jul 26 '11 at 17:28

Martin Beckett

15,776
3
42
69

votes

2 answers

Is UTF-16 fixed-width or variable-width? Why doesn't UTF-8 have byte-order problem?

Is UTF-16 fixed-width or variable-width? I got different results from different sources: From http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF: UTF-16 stores Unicode characters in sixteen-bit chunks. From…

unicode character-encoding utf-8

asked Jul 22 '11 at 23:45

Tim

5,405
7
48
84

votes

3 answers

A Unicode sentinel value I can use?

I am designing a file format and I want to do it right. Since it is a binary format, the very first byte (or bytes) of the file should not form valid textual characters (just like in the PNG file header). This allows tools that do not recognize the…

unicode

asked Mar 13 '13 at 15:15

Daniel A.A. Pelsmaeker

2,715
3
22
27
