Questions tagged [utf-8]

For questions about UTF-8, a variable-width character encoding for Unicode.

UTF stands for Unicode Transformation Format. UTF-8 uses 8-bit units and represents each character with between one and four of them. This differs from UTF-16, which uses one or two 16-bit units, and UTF-32, which always uses a single 32-bit unit.
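
As a rough illustration of those sizes, the following Python sketch encodes a few sample characters and counts the resulting bytes. The helper name `encoded_lengths` and the sample characters are chosen here for illustration only; the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_lengths(ch):
    """Return the byte counts of ch in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # "-le" avoids a leading byte order mark
        len(ch.encode("utf-32-le")),
    )

# One sample per UTF-8 length: 1, 2, 3 and 4 bytes.
for ch in ["A", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The UTF-8 column grows from 1 to 4 bytes, the UTF-16 column is 2 bytes except for the emoji (4 bytes, two 16-bit units), and the UTF-32 column is always 4 bytes.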

A key advantage of UTF-8 is that it is backwards compatible with 7-bit ASCII. This means that if a given text file contains only the ASCII values 0 through 127, the UTF-8 encoded form of the file is byte-for-byte identical to its 7-bit ASCII encoded form.
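
A minimal Python sketch of this compatibility, assuming a hypothetical helper `ascii_and_utf8_match` and an arbitrary sample string: it checks that ASCII-only text yields the same bytes under both codecs, and that the bytes of non-ASCII characters all have the high bit set, so they cannot be mistaken for ASCII.

```python
def ascii_and_utf8_match(text):
    """True if ASCII-only text encodes to identical bytes in ASCII and UTF-8."""
    return text.encode("ascii") == text.encode("utf-8")

# All 128 ASCII code points, 0 through 127.
all_ascii = bytes(range(128)).decode("ascii")
print(ascii_and_utf8_match(all_ascii))

# Every byte of a multi-byte UTF-8 sequence is 0x80 or above,
# so it never collides with an ASCII value.
non_ascii_bytes = "é€😀".encode("utf-8")
print(all(b >= 0x80 for b in non_ascii_bytes))
```

Note that `ascii_and_utf8_match` raises `UnicodeEncodeError` if the text contains anything outside ASCII, since such text has no ASCII form to compare against.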

20 questions
580
votes
1 answer

Is the use of "utf8=✓" preferable to "utf8=true"?

I have recently seen a few URIs containing the query parameter "utf8=✓". My first impression (after thinking "mmm, looks cool") was that this could be used to detect a broken character encoding. So, is this a better way to resolve potential…
Gary
  • 24,420
  • 9
  • 63
  • 108
163
votes
5 answers

How to detect the encoding of a file?

On my filesystem (Windows 7) I have some text files (These are SQL script files, if that matters). When opened with Notepad++, in the "Encoding" menu some of them are reported to have an encoding of "UCS-2 Little Endian" and some of "UTF-8 without…
Marcel
  • 3,092
  • 3
  • 18
  • 19
119
votes
5 answers

What is the advantage of choosing ASCII encoding over UTF-8?

All characters in ASCII can be encoded using UTF-8 without an increase in storage (both require a byte of storage). UTF-8 has the added benefit of character support beyond "ASCII-characters". If that's the case, why would we ever choose ASCII…
Pacerier
  • 4,973
  • 7
  • 39
  • 58
90
votes
5 answers

Would UTF-8 be able to support the inclusion of a vast alien language with millions of new characters?

In the event an alien invasion occurred and we were forced to support their languages in all of our existing computer systems, is UTF-8 designed in a way to allow for their possibly vast amount of characters? (Of course, we do not know if aliens…
69
votes
6 answers

Should Latin-1 be used over UTF-8 when it comes to database configuration?

We are using MySQL at the company I work for, and we build both client-facing and internal applications using Ruby on Rails. When I started working here, I ran into a problem that I had never encountered before; the database on the production server…
Ten Bitcomb
  • 1,154
  • 1
  • 9
  • 14
34
votes
8 answers

Should character encodings besides UTF-8 (and maybe UTF-16/UTF-32) be deprecated?

A pet peeve of mine is looking at so many software projects that have mountains of code for character set support. Don't get me wrong, I'm all for compatibility, and I'm happy that text editors let you open and save files in multiple character…
Joey Adams
  • 5,535
  • 3
  • 30
  • 34
19
votes
4 answers

Why does UTF-8 waste several bits in its encoding?

According to the Wikipedia article, UTF-8 has this format:

First code point  Last code point  Bytes used  Byte 1    Byte 2    Byte 3  Byte 4
U+0000            U+007F           1           0xxxxxxx
U+0080            U+07FF           2           110xxxxx  10xxxxxx
U+0800            U+FFFF…
qbt937
  • 301
  • 2
  • 6
16
votes
2 answers

Is UTF-16 fixed-width or variable-width? Why doesn't UTF-8 have byte-order problem?

Is UTF-16 fixed-width or variable-width? I got different results from different sources: From http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF: UTF-16 stores Unicode characters in sixteen-bit chunks. From…
Tim
  • 5,405
  • 7
  • 48
  • 84
12
votes
3 answers

Should my source code be in UTF-8?

I feel that often you don't really choose what format your code is in. I mean most of my tools in the past have decided for me. Or I haven't really even thought about it. I was using TextPad on Windows the other day and as I was saving a file, it…
Parris
  • 241
  • 2
  • 8
8
votes
1 answer

Do C++'s iterator categories forbid writing a UTF-8 iterator adapter?

I've been working on a UTF-8 iterator adapter. By which, I mean an adapter that turns an iterator to a char or unsigned char sequence into an iterator to a char32_t sequence. My work here was inspired by this iterator I found online. However, as I…
Nicol Bolas
  • 11,813
  • 4
  • 37
  • 46
5
votes
1 answer

UTF-8 questions

When you encode a code point to code units based on UTF-8, then if the code point fits in 7 bits, the most significant bit is set to zero so that it tells you it is a character which is stored in 1 byte (or more precisely 7 bits). If the code point…
4
votes
2 answers

What steps can I take to avoid character encoding issues in a web application?

In previous web applications I've built, I've had issues with users entering exotic characters into forms which get stored strangely in the database, and sometimes appear different or double-encoded when retrieved from the database and displayed…
CFL_Jeff
  • 3,517
  • 23
  • 33
3
votes
2 answers

Are international UTF-8 e-mail addresses a thing or not?

RFC6530 defines the necessary steps for "international e-mail" (i.e., especially for UTF-8 e-mail addresses). Apparently Google adopted the RFC back in 2014 (source). Still, most validators I find on the web are having trouble with international…
D.R.
  • 231
  • 1
  • 5
2
votes
2 answers

Is there a good single-byte delimiter for use with UTF-8 strings that isn't a null terminator?

I'm looking for a quick way to split strings containing individual JSON payloads. Currently, I'm using newlines and searching for the newline ASCII character, but I figure if I start using utf-8 this could easily break. Is there any quick single…
2
votes
2 answers

How to detect client character encoding?

I programmed a telnet server in C, but I have a problem sending accented characters (é, è, à ...). The character encoding differs between telnet clients (Windows, Linux, PuTTY, ...). What can I do to detect…
ipStack
  • 121
  • 2