119

All characters in ASCII can be encoded using UTF-8 without an increase in storage (both require one byte of storage).
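A quick check of that claim (Python here, purely as an illustration):

```python
# Illustrative only: every ASCII character encodes to the very same single byte
# under both encodings, so pure-ASCII text is byte-for-byte identical in UTF-8.
for ch in "Hello, ASCII! 123":
    assert ch.encode("ascii") == ch.encode("utf-8")   # identical bytes
    assert len(ch.encode("utf-8")) == 1               # still one byte per character
```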

UTF-8 has the added benefit of character support beyond "ASCII-characters". If that's the case, why would we ever choose ASCII encoding over UTF-8?

Is there a use case where we would choose ASCII instead of UTF-8?

Pacerier
  • 4,973
  • 7
  • 39
  • 58
  • 10
    To support legacy stuff... – fretje Jul 30 '11 at 13:39
  • 10
    I mean, UTF-8 *does* support ASCII as a legacy subset too, so even if you have to support legacy stuff, UTF-8 would work just fine, no other changes needed. – Pacerier Jul 30 '11 at 14:13
  • 3
    Maybe you've got to interoperate with a system that packs 8 ASCII characters into 7 bytes? People did _crazy_ stuff to fit things in. – Donal Fellows Jul 31 '11 at 11:42
  • 6
    Call me nuts, but I'd say security and stability. A character set without multi-byte sequences is a lot harder to break. Don't get me wrong, when human language support is important ASCII won't cut it. But if you're just doing some basic programming and can squeeze yourself into the native language the compiler and operating system were written for, why add the complexity? @Donal Fellows. Last I checked... ASCII *is* 7 bits. (anything with that extra bit just isn't ASCII and is asking for trouble) – ebyrob Apr 01 '14 at 13:37
  • 3
    @ebyrob I think Donal Fellows means bit packing 8 ascii symbols into 7 bytes, since each symbol is using 7 bits each ... 8*7=56 bits = 7 bytes. It would mean a special encode and decode function, just to save 1 byte of storage out of every 8. – dodgy_coder Feb 26 '15 at 07:08
  • No one mentioned embedded programming ... I'd imagine most microcontrollers only need to support the ASCII character set. Unicode would likely be too complex to implement and/or unnecessary. – dodgy_coder Feb 26 '15 at 07:10
  • 2
    @dodgy_coder again, to re-iterate. Just not having the extra complexity is sometimes a very good thing (even if the bits are the same). Embedded is one example where simplicity is good. Security and stability also remain important. If there's no concept in the system of a 4-byte character code, then it can't be used to wreck any applications. In UTF-8, you have to fully understand how those 4 byte (or longer) character codes will be handled. – ebyrob Apr 07 '15 at 14:22
  • @ebyrob UTF-8 sequences are at most 4 octets long, since Unicode 4.0 (2003-04) set U+10FFFF as the upper limit and overlong UTF-8 sequences are invalid. – Rhymoid Oct 28 '15 at 17:40
  • @Rhymoid, I think you `@` the wrong person... – Pacerier Dec 03 '15 at 23:17
  • Thanks! My comment was indeed a response to @DonalFellows – Rhymoid Dec 04 '15 at 20:34
  • [Characters, Symbols and the Unicode Miracle - Computerphile](https://youtu.be/MijmeoH9LT4). Describes character encoding issue then explains UTF-8 encoding scheme. – radarbob Feb 16 '18 at 22:44
  • In many cases, the code is simplified (immensely?) by restricting the input to ASCII. Many common and simple operations in ASCII are difficult, complex, or even impossible in Unicode. Example: character ranges in regular expression character classes. So, avoid Unicode if you don't need it. – jrw32982 Aug 29 '18 at 19:08

5 Answers

106

In some cases it can speed up access to individual characters. Imagine the string str = 'ABC' encoded in UTF-8 and in ASCII (and assume that the language/compiler/database knows about the encoding).

To access the third character ('C') of this string using the array-access operator featured in many programming languages, you would do something like c = str[2].

Now, if the string is ASCII encoded, all we need to do is fetch the third byte of the string.

If, however, the string is UTF-8 encoded, we must first check whether the first character is a one-byte or multi-byte character, then perform the same check on the second character, and only then can we access the third character. The performance difference grows with the length of the string.
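A minimal sketch of the difference (Python, with illustrative names; real runtimes and databases do this at a much lower level):

```python
# Pure ASCII: the N-th character is simply the N-th byte -- an O(1) lookup.
ascii_bytes = "ABC".encode("ascii")
print(chr(ascii_bytes[2]))                     # 'C'

# UTF-8: byte offsets and character offsets diverge once a multi-byte
# character appears, so we have to walk the string from the start -- O(n).
utf8_bytes = "żBC".encode("utf-8")             # 'ż' occupies two bytes in UTF-8

def utf8_char_width(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1          # plain ASCII
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4

def nth_char_utf8(data: bytes, n: int) -> str:
    """Return the n-th character (0-based) of UTF-8 encoded data by scanning."""
    i = 0
    for _ in range(n):
        i += utf8_char_width(data[i])          # skip one whole character
    return data[i:i + utf8_char_width(data[i])].decode("utf-8")

print(nth_char_utf8(utf8_bytes, 2))            # 'C', found only after the scan
```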

This is an issue, for example, in some database engines: to find the beginning of a column placed 'after' a UTF-8 encoded VARCHAR, the database needs to know not only how many characters are in the VARCHAR field, but also how many bytes each of them uses.

Mchl
  • 4,103
  • 1
  • 22
  • 23
  • 4
    If the database doesn't store both the "character count" *and* the "byte count", then I'd say it's got some problems... – Dean Harding Jul 31 '11 at 22:20
  • 1
    TBH I know no database that would store either... – Mchl Aug 01 '11 at 17:45
  • @Mchl: how do you imagine the database knows when it has reached the end of the string? – kevin cline Jan 11 '13 at 21:25
  • 2
    Usually by reaching 0x00 or 0x0000 – Mchl Feb 27 '13 at 10:11
  • 9
    @DeanHarding How does the character count tell you where the second character starts? Or should the database hold an index for each character offset too? Note: it isn't just 2 bytes per character, but could be up to 4 (or even 6) http://stackoverflow.com/questions/9533258/what-is-the-maximum-number-of-bytes-for-a-utf-8-encoded-character. (I think it's only utf-16 that had the really long abominations that could destroy your system) – ebyrob Apr 01 '14 at 13:44
  • Well, with ASCII you have a single well-defined concept of a *character*. With Unicode, things are more nuanced, and you probably used the wrong definition, especially if you aren't aware of that. – Deduplicator Feb 06 '18 at 01:38
10

If you're going to use only the US-ASCII (or ISO 646) subset of UTF-8, then there's no real advantage to one or the other; in fact, everything is encoded identically.

If you're going to go beyond the US-ASCII character set and use (for example) characters with accents, umlauts, etc., that are used in typical Western European languages, then there's a difference -- most of these characters can still be encoded with a single byte in ISO 8859, but will require two or more bytes when encoded in UTF-8. There are also, of course, disadvantages to ISO 8859: it requires that you use some out-of-band means to specify the encoding being used, and it only supports one of these languages at a time. For example, you can encode all the characters of the Cyrillic (Russian, Belorussian, etc.) alphabet using only one byte apiece, but if you need or want to mix those with French or Spanish characters (other than those in the US-ASCII/ISO 646 subset), you're pretty much out of luck -- you have to completely change character sets to do that.
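A small illustration of that trade-off (Python, with ISO 8859-1 and ISO 8859-5 chosen purely as examples):

```python
# 'é' fits in one byte in ISO 8859-1 but needs two bytes in UTF-8.
print(len("café".encode("iso-8859-1")))   # 4 bytes
print(len("café".encode("utf-8")))        # 5 bytes

# Cyrillic is one byte per character in ISO 8859-5, two bytes each in UTF-8.
print(len("мир".encode("iso-8859-5")))    # 3 bytes
print(len("мир".encode("utf-8")))         # 6 bytes

# ...but a single ISO 8859 code page cannot mix scripts:
try:
    "мир et café".encode("iso-8859-5")    # 'é' has no code point in ISO 8859-5
except UnicodeEncodeError as err:
    print("cannot mix alphabets in one code page:", err)
```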

ISO 8859 is really only useful for European alphabets. To support the writing systems used for Chinese, Japanese, Korean, Arabic, etc., you have to use some completely different encoding. Some of these (e.g., Shift JIS for Japanese) are an absolute pain to deal with. If there's any chance you'll ever want to support them, I'd consider it worthwhile to use Unicode just in case.

Jerry Coffin
  • 44,385
  • 5
  • 89
  • 162
8

Yes, there are still some use cases where ASCII makes sense: file formats and network protocols. In particular, for uses where:

  • You have data that's generated and consumed by computer programs, never presented to end users;
  • But which it's useful for programmers to be able to read, for ease of development and debugging.

By using ASCII as your encoding you avoid the complexity of multi-byte encoding while retaining at least some human-readability.

A couple of examples:

  • HTTP is a network protocol defined in terms of sequences of octets, but it's very useful (at least for English-speaking programmers) that these correspond to the ASCII encoding of words like "GET", "POST", "Accept-Language" and so on.
  • The chunk types in the PNG image format consist of four octets, but it's handy, if you're programming a PNG encoder or decoder, that IDAT means "image data" and PLTE means "palette" (see the sketch after this list).
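A hedged sketch of the point (Python; `example.png` is a hypothetical file name): walking a PNG file and printing its four-octet chunk types, which are deliberately printable ASCII:

```python
import struct

def list_png_chunks(path: str):
    """Yield (chunk_type, data_length) for each chunk in a PNG file (sketch only)."""
    with open(path, "rb") as f:
        assert f.read(8) == b"\x89PNG\r\n\x1a\n"    # fixed PNG signature
        while True:
            header = f.read(8)                      # 4-byte length + 4-byte type
            if len(header) < 8:
                break
            length, ctype = struct.unpack(">I4s", header)
            yield ctype.decode("ascii"), length     # type octets double as readable labels
            f.seek(length + 4, 1)                   # skip chunk data and CRC

# for name, size in list_png_chunks("example.png"):  # hypothetical file
#     print(name, size)                              # e.g. IHDR 13 ... IDAT ... IEND 0
```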

Of course you need to be careful that the data really isn't going to be presented to end users, because if it ends up being visible (as happened in the case of URLs), then users are rightly going to expect that data to be in a language they can read.

Gareth Rees
  • 1,449
  • 10
  • 9
  • 1
    Well said. It's a little ironic that HTTP, the protocol that transmits the most Unicode on the planet, only needs to support ASCII. (Actually, I suppose the same goes for TCP and IP: binary support, ASCII support... that's all you need at that level of the stack) – ebyrob Apr 01 '14 at 13:58
  • Not so ironic when you consider that HTTP came before UTF-8 – JoelFan May 31 '23 at 00:28
5

ANSI can be many things, most being 8-bit character sets in this regard (like code page 1252 under Windows).

Perhaps you were thinking of ASCII, which is 7-bit and a proper subset of UTF-8; i.e., any valid ASCII stream is also a valid UTF-8 stream.

If you were thinking of 8-bit character sets, one very important advantage would be that all representable characters are exactly 8 bits, whereas in UTF-8 they can be up to 24 bits.

  • Yes, I'm talking about the 7-bit ASCII set. Can you think of one advantage we would ever get from saving something as ASCII instead of UTF-8? (Since the 7-bit characters would be saved as 8 bits anyway, the file size would be exactly the same.) – Pacerier Jul 30 '11 at 14:13
  • 1
    If you have characters larger than unicode value 127, they cannot be saved in ASCII. –  Jul 30 '11 at 14:47
  • 1
    @Pacerier: **Any ASCII string is a UTF-8 string**, so there is **no difference**. The encoding routine *might* be faster depending on the string representation of the platform you use, although I wouldn't expect significant speedup, while you have a significant loss in flexibility. – back2dos Jul 30 '11 at 16:04
  • @Thor that is exactly why I'm asking whether saving as ASCII has any advantages at all – Pacerier Jul 30 '11 at 17:06
  • 5
    @Pacerier, if you save XML as ASCII you need to use e.g. &#160; for a non-breaking space. This takes more space, but makes your data more resistant against ISO-Latin-1 vs UTF-8 encoding errors. This is what we do as our underlying platform does a lot of invisible magic with characters. Staying in ASCII makes our data more robust. –  Jul 30 '11 at 17:29
  • Did you mean "up to 32 bits"? – JoelFan May 31 '23 at 00:28
2

First of all: your title used ANSI, while in the text you refer to ASCII. Please note that ANSI does not equal ASCII: ANSI incorporates the ASCII set, but the ASCII set is limited to the first 128 numeric values (0 - 127).

If all your data is restricted to ASCII (7-bit), it doesn't matter whether you use UTF-8, ANSI or ASCII, as both ANSI and UTF-8 incorporate the full ASCII set. In other words: the numeric values 0 up to and including 127 represent exactly the same characters in ASCII, ANSI and UTF-8.

If you need characters outside of the ASCII set, you'll need to choose an encoding. You could use ANSI, but then you run into the problem of all the different code pages: create a file on machine A and read it on machine B, and you may well get funny-looking text if those machines are set up to use different code pages, simply because numeric value nnn represents different characters in those code pages.
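A sketch of that failure mode (Python, with Windows code pages 1252 and 1251 picked purely as examples):

```python
# The same byte value maps to different characters under different code pages.
data = "café".encode("cp1252")                  # written on "machine A" (Western European)

print(data.decode("cp1252"))                    # café  -- read back with the right code page
print(data.decode("cp1251"))                    # cafй  -- same bytes, Cyrillic code page
print(data.decode("utf-8", errors="replace"))   # caf�  -- the bytes aren't valid UTF-8 either
```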

This "code page hell" is the reason the Unicode standard was defined. UTF-8 is but a single encoding of that standard; there are many more, UTF-16 among the most widely used as it is the native encoding for Windows.

So, if you need to support anything beyond the 128 characters of the ASCII set, my advice is to go with UTF-8. That way it doesn't matter which code page your users have set up their systems with, and you don't have to worry about it.

Marjan Venema
  • 8,151
  • 3
  • 32
  • 35
  • If I do not need to support beyond 128 chars, what is the advantage of choosing ASCII encoding over UTF-8 encoding? – Pacerier Jul 30 '11 at 17:16
  • Besides limiting yourself to those 128 chars? Not much. UTF-8 was specifically designed to cater for ASCII and most western languages that "only" need ANSI. You will find that UTF-8 will encode only a relatively small number of the higher ANSI characters with more than one byte. There is a reason most of the HTML pages use UTF-8 as a default... – Marjan Venema Jul 30 '11 at 18:57
  • 1
    @Pacerier, if you don't need encoding above 127, choosing ASCII may be worthwhile when you use some API to encode/decode, because UTF-8 needs additional bit checks to treat continuation bytes as part of the same character, which can take extra computation compared with pure ASCII, which just reads 8 bits without verification. But I only recommend ASCII if you really need a high level of optimization in very large computations and you know what you're doing with that optimization. If not, just use UTF-8. – Luciano Dec 19 '12 at 13:03