Questions tagged [utf-8]

For questions about the character encoding for Unicode.

UTF stands for Unicode Transformation Format. UTF-8 uses 8-bit code units and represents each character with one to four of them. This differs from UTF-16, which uses one or two 16-bit code units per character, and UTF-32, which uses a single 32-bit code unit.

A key advantage of UTF-8 is that it is backwards compatible with 7-bit ASCII. If a text file contains only the ASCII values 0 through 127, its UTF-8 encoded form is byte-for-byte identical to its 7-bit ASCII encoded form.
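
As a rough illustration of the sizes described above, the following Python sketch counts the bytes each encoding needs for a few sample characters and checks the ASCII compatibility claim. The helper name `encoded_lengths` and the sample strings are made up for this example.

```python
def encoded_lengths(ch: str) -> tuple:
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    # The "-le" variants fix the byte order, so Python does not prepend a BOM.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

# Sample characters chosen so that each needs a different UTF-8 length.
for ch in ["A", "é", "€", "😀"]:
    u8, u16, u32 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8={u8} bytes, UTF-16={u16} bytes, UTF-32={u32} bytes")

# ASCII-only text produces identical bytes in ASCII and in UTF-8.
ascii_text = "SELECT * FROM users;"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")
```

Only the character outside the Basic Multilingual Plane needs two 16-bit units in UTF-16, while UTF-32 always uses four bytes.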

Is the use of "utf8=✓" preferable to "utf8=true"?

I have recently seen a few URIs containing the query parameter "utf8=✓". My first impression (after thinking "mmm, looks cool") was that this could be used to detect a broken character encoding. So, is this a better way to resolve potential…

asked Oct 13 '12 at 11:57

Gary

163

votes

5 answers

How to detect the encoding of a file?

On my filesystem (Windows 7) I have some text files (these are SQL script files, if that matters). When opened with Notepad++, in the "Encoding" menu some of them are reported to have an encoding of "UCS-2 Little Endian" and some of "UTF-8 without…

file-systems character-encoding utf-8 notepad++

asked Feb 15 '13 at 09:45

Marcel

119

votes

5 answers

What is the advantage of choosing ASCII encoding over UTF-8?

All characters in ASCII can be encoded using UTF-8 without an increase in storage (both require one byte of storage). UTF-8 has the added benefit of character support beyond "ASCII-characters". If that's the case, why would we ever choose ASCII…

character-encoding utf-8 ascii

asked Jul 30 '11 at 13:08

Pacerier

votes

5 answers

Would UTF-8 be able to support the inclusion of a vast alien language with millions of new characters?

In the event an alien invasion occurred and we were forced to support their languages in all of our existing computer systems, is UTF-8 designed in a way to allow for their possibly vast amount of characters? (Of course, we do not know if aliens…

unicode utf-8

asked Nov 24 '15 at 12:18

Qix - MONICA WAS MISTREATED

votes

6 answers

Should Latin-1 be used over UTF-8 when it comes to database configuration?

We are using MySQL at the company I work for, and we build both client-facing and internal applications using Ruby on Rails. When I started working here, I ran into a problem that I had never encountered before; the database on the production server…

database mysql ruby-on-rails utf-8 ascii

asked Jan 30 '15 at 21:18

Ten Bitcomb

votes

8 answers

Should character encodings besides UTF-8 (and maybe UTF-16/UTF-32) be deprecated?

A pet peeve of mine is looking at so many software projects that have mountains of code for character set support. Don't get me wrong, I'm all for compatibility, and I'm happy that text editors let you open and save files in multiple character…

unicode utf-8 character-encoding

asked Jan 26 '11 at 03:32

Joey Adams

votes

4 answers

Why does UTF-8 waste several bits in its encoding

According to the Wikipedia article, UTF-8 has this format:

First code point | Last code point | Bytes used | Byte 1   | Byte 2   | Byte 3 | Byte 4
U+0000           | U+007F          | 1          | 0xxxxxxx |
U+0080           | U+07FF          | 2          | 110xxxxx | 10xxxxxx |
U+0800           | U+FFFF…

character-encoding utf-8 text-encoding

asked Nov 09 '14 at 19:50

qbt937
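
The bit layout quoted in the excerpt above can be sketched as a small hand-written encoder. This is only an illustration, not how any particular library does it; the function name `utf8_encode` is hypothetical, and the three- and four-byte patterns (which the excerpt truncates) follow the standard UTF-8 definition. The fixed prefix bits are what let a decoder tell lead bytes from continuation bytes.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point to UTF-8 by hand (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

# Cross-check the sketch against Python's built-in encoder.
for ch in ["A", "é", "€", "😀"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```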

votes

2 answers

Is UTF-16 fixed-width or variable-width? Why doesn't UTF-8 have a byte-order problem?

Is UTF-16 fixed-width or variable-width? I got different results from different sources: From http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF: UTF-16 stores Unicode characters in sixteen-bit chunks. From…

unicode character-encoding utf-8

asked Jul 22 '11 at 23:45

Tim
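
Both halves of this question can be checked with a short Python experiment. The sketch below assumes nothing beyond the standard codecs; the helper name `utf16_code_units` is made up for illustration. It shows that UTF-16 needs two 16-bit units for a character outside the Basic Multilingual Plane, and that the byte order of a 16-bit unit changes the output while UTF-8, which works in single bytes, has only one possible byte sequence.

```python
def utf16_code_units(s: str) -> int:
    """Count the 16-bit code units UTF-16 needs for the string."""
    return len(s.encode("utf-16-le")) // 2

bmp_char = "é"       # U+00E9, inside the Basic Multilingual Plane
astral_char = "😀"   # U+1F600, outside it, so UTF-16 uses a surrogate pair

print(utf16_code_units(bmp_char))      # 1
print(utf16_code_units(astral_char))   # 2

# The same 16-bit unit serializes differently depending on byte order...
print(bmp_char.encode("utf-16-le").hex())  # e900
print(bmp_char.encode("utf-16-be").hex())  # 00e9

# ...while UTF-8 is defined as a sequence of single bytes, so there is nothing to swap.
print(bmp_char.encode("utf-8").hex())      # c3a9
```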

votes

3 answers

Should my source code be in UTF-8?

I feel that often you don't really choose what format your code is in. I mean most of my tools in the past have decided for me. Or I haven't really even thought about it. I was using TextPad on Windows the other day and as I was saving a file, it…

coding-standards source-code character-encoding utf-8

asked Jun 13 '12 at 19:55

Parris

votes

1 answer

Do C++'s iterator categories forbid writing a UTF-8 iterator adapter?

I've been working on a UTF-8 iterator adapter. By which, I mean an adapter that turns an iterator to a char or unsigned char sequence into an iterator to a char32_t sequence. My work here was inspired by this iterator I found online. However, as I…

c++ c++11 unicode utf-8

asked Apr 01 '17 at 18:43

Nicol Bolas

votes

1 answer

UTF-8 questions

When you encode a code point to code units based on UTF-8, then if the code point fits in 7 bits, the most significant bit is set to zero so that it tells you it is a character which is stored in 1 byte (or more precisely 7 bits). If the code point…

unicode character-encoding text-encoding utf-8

asked Nov 15 '19 at 22:03

codepersonnel49

votes

2 answers

What steps can I take to avoid character encoding issues in a web application?

In previous web applications I've built, I've had issues with users entering exotic characters into forms which get stored strangely in the database, and sometimes appear different or double-encoded when retrieved from the database and displayed…

character-encoding utf-8

asked Apr 18 '12 at 14:28

CFL_Jeff

votes

2 answers

Are international UTF-8 e-mail addresses a thing or not?

RFC 6530 defines the necessary steps for "international e-mail" (i.e., especially for UTF-8 e-mail addresses). Apparently Google adopted the RFC back in 2014 (source). Still, most validators I find on the web are having trouble with international…

email utf-8

asked Nov 12 '18 at 20:07

D.R.

votes

2 answers

Is there a good single-byte delimiter for use with UTF-8 strings that isn't a null terminator?

I'm looking for a quick way to split strings containing individual JSON payloads. Currently, I'm using newlines and searching for the newline ASCII character, but I figure if I start using UTF-8 this could easily break. Is there any quick single…

parsing strings utf-8

asked Feb 21 '17 at 19:11

Mikey A. Leonetti
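
The concern in this question can be tested directly. In UTF-8, every byte of a multi-byte sequence has its high bit set, so the newline byte 0x0A should never appear inside an encoded non-ASCII character. The following is a minimal sketch under that assumption; the function name `split_payloads` and the sample payloads are invented for the example, and it relies on `json.dumps` escaping newlines inside strings.

```python
import json

def split_payloads(stream: bytes) -> list:
    """Split newline-delimited, UTF-8 encoded JSON payloads and decode each one."""
    return [json.loads(line.decode("utf-8")) for line in stream.split(b"\n") if line]

# Sample payloads, including non-ASCII text and an embedded newline in a string.
payloads = [{"name": "José"}, {"emoji": "😀"}, {"text": "line one\nline two"}]

# json.dumps writes the embedded newline as the two characters backslash and n,
# so the only raw 0x0A bytes in the stream are the delimiters.
stream = b"\n".join(
    json.dumps(p, ensure_ascii=False).encode("utf-8") for p in payloads
)

assert split_payloads(stream) == payloads

# Every byte of a multi-byte UTF-8 sequence is 0x80 or above, so none can equal 0x0A.
assert all(b >= 0x80 for ch in "é€😀" for b in ch.encode("utf-8"))
```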

votes

2 answers

How to detect client character encoding?

I programmed a telnet server in C, but I have a problem sending accented characters (é, è, à ...). The character encoding is different between the telnet clients (Windows, Linux, PuTTY, ...). What can I do to detect…

client-server character-encoding utf-8 ascii

asked Jan 15 '15 at 10:47

ipStack
