
I feel that often you don't really choose what format your code is in; most of my tools in the past have decided for me, or I haven't really even thought about it. I was using TextPad on Windows the other day, and as I was saving a file, it prompted me about ASCII, UTF-8/16, Unicode, and so on.

I am assuming that almost all code written is ASCII, but why should it be? Should we actually be using UTF-8 files for source code now, and why? I'd imagine this would be useful on multilingual teams. Are there standards associated with how multilingual teams name variables/functions/etc.?

Parris

3 Answers


The choice is not between ASCII and UTF-8. ASCII is a 7-bit encoding, and UTF-8 supersedes it: any valid ASCII text is also valid UTF-8. The problems arise when you use non-ASCII characters; for these you have to pick between UTF-8, UTF-16, UTF-32, and various 8-bit encodings (the ISO 8859 family, etc.).

The best solution is to stick to a strict ASCII character set, that is, just don't use any non-ASCII characters in your code. Most programming languages provide ways to express non-ASCII characters using only ASCII, e.g. "\u1234" to denote the Unicode code point U+1234. In particular, avoid non-ASCII characters in identifiers: even if they work correctly, people who use a different keyboard layout will curse you for making them type those characters.
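For instance, C (since C99) accepts universal character names in string literals. In the sketch below, assuming gcc or clang with their default UTF-8 execution charset, both literals produce the same bytes, but the first keeps the source file pure ASCII:

#include <stdio.h>

int main(void)
{
    const char *escaped = "\u03C0";  /* pi, written with an ASCII-only escape */
    const char *literal = "π";       /* the same character typed directly; needs a UTF-8 source file */

    printf("%s %s\n", escaped, literal);  /* prints: π π (on a UTF-8 terminal) */
    return 0;
}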

If you can't avoid non-ASCII characters, UTF-8 is your best bet. Unlike UTF-16 and UTF-32, it is a superset of ASCII, which means anyone who opens it with the wrong encoding still gets most of it right; and unlike the 8-bit codepages, it can encode just about every character you'll ever need, unambiguously, and it is available on every system, regardless of locale.

And then there is the encoding that your code processes, which doesn't have to be the same as the encoding of your source file. For example, I can easily write PHP in UTF-8 but set its internal multibyte encoding to, say, Latin-1. Because the PHP parser does not concern itself with encodings at all, but rather just reads byte sequences, my UTF-8 string literals will be misinterpreted as Latin-1. If I output these strings on a UTF-8 terminal, you won't see any difference, but string lengths and other multibyte operations (e.g. substr) will produce wrong results.
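The same class of mistake is easy to demonstrate in C, where strlen counts bytes, not characters. A minimal sketch, assuming a UTF-8 source file and gcc/clang's default UTF-8 execution charset:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *pi = "π";         /* one character, but two UTF-8 bytes: 0xCF 0x80 */

    printf("%zu\n", strlen(pi));  /* prints 2: strlen counts bytes, not characters */
    return 0;
}

Any code that assumes one byte per character will miscount in exactly the way the PHP example above describes.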

My rule of thumb is to use UTF-8 for everything; if you absolutely have to deal with other encodings, convert to UTF-8 as early as possible and back from UTF-8 as late as possible.
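On POSIX systems, the standard iconv API is one way to follow that rule. A hedged sketch that converts Latin-1 input to UTF-8 at the program boundary (error handling trimmed for brevity):

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "caf\xE9";    /* "café" as Latin-1 bytes */
    char out[16] = {0};
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out - 1;

    /* Convert at the boundary, then work in UTF-8 internally. */
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t) -1) { perror("iconv_open"); return 1; }
    iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    printf("%s\n", out);      /* "café", now valid UTF-8 */
    return 0;
}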

tdammers

Most IDEs will default to saving with UTF-8 encoding, and you should almost certainly choose UTF-8 over ASCII when given the option. This will help ensure you don't run into weird problems with internationalized text.

Oleksi
    You're making it seem as if ASCII vs. UTF-8 is a choice. When there are non-ASCII characters in a file, it isn't. When there are only ASCII characters, UTF-8 *is* ASCII. – Fred Foo Jun 13 '12 at 21:03
  • I wish Eclipse would adhere to this. As a first-year CS-ish student, my god has this been the cause of many headaches when working in groups with a mix of OS X, Windows, and Linux users. (For reference, it defaults to MacRoman on OS X and CP-1252 on Windows; I forget which one on Linux, but you can bet it's a different one.) – leflings Jun 14 '12 at 09:56
  • @leflings - probably a default environment encoding which currently is usually UTF-8. – Maciej Piechotka May 09 '14 at 19:19

Being able to type plain text into quoted strings or character constants in source code, and to see the actual character, is very nice. For example, the pi symbol 'π' or the ideograph '𠀊' is much nicer than the equivalent u'\u03C0' for pi and U'\U0002000A' for the ideograph.

It is possible to type and/or copy and paste these characters directly into source code, just as you would ASCII characters, in a decent editor.

I find concrete examples helpful for understanding things that verbal descriptions don't always drive home. Consider Unicode character constants typed directly into source code, as in the following brief snippet:

const unsigned char  ASCII_0X7E      = '~';    /* plain ASCII */
const unsigned short UNICODE_0X3C0   = u'π';   /* char16_t literal, U+03C0 */
const unsigned long  UNICODE_0X2000A = U'𠀊';  /* char32_t literal, U+2000A */
const unsigned long  UNICODE_0X2893D = U'𨤽';  /* char32_t literal, U+2893D */

The ASCII tilde character '~' can be stored in an ASCII or UTF-8 source file, but the Unicode characters cannot be represented in ASCII at all. The pi symbol 'π' is Unicode code point U+03C0 and is stored in UTF-8 as the two-byte sequence 0xCF 0x80. The ideographs at code points U+2000A and U+2893D require four-byte UTF-8 sequences.
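To see those byte sequences concretely, you can dump the bytes of the corresponding UTF-8 string literals. A small sketch, again assuming a C99-or-later compiler with a UTF-8 execution charset:

#include <stdio.h>

static void dump(const char *label, const unsigned char *s)
{
    printf("%s:", label);
    while (*s)
        printf(" 0x%02x", *s++);  /* print each UTF-8 byte in hex */
    printf("\n");
}

int main(void)
{
    dump("U+03C0 ", (const unsigned char *) "\u03C0");      /* 0xcf 0x80 */
    dump("U+2000A", (const unsigned char *) "\U0002000A");  /* 0xf0 0xa0 0x80 0x8a */
    return 0;
}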

In order for those characters to retain their intended values, and for the compiler to interpret them as intended, the source code needs to be saved in a format that supports the Unicode character set, such as UTF-8 or UTF-16. If saved as UTF-8, a decent compiler will interpret the values as intended, and a decent editor will load and display the characters properly.

As others have pointed out, if you simply do not have any characters in your source code outside the ASCII range, saving as UTF-8 results in a file that is byte-for-byte identical to an ASCII file, since UTF-8 is designed to coincide with ASCII over the ASCII range. As soon as you type any character outside the ASCII range, a decent editor will ask you to pick an encoding for the file. UTF-8 is a good choice, since it handles ASCII as-is and virtually every other character supported in your development environment.

Dan Hagler