69

We are using MySQL at the company I work for, and we build both client-facing and internal applications using Ruby on Rails.

When I started working here, I ran into a problem what I had never encountered before; the database on the production server is set to Latin-1, meaning that the MySQL gem throws an exception whenever there is user input where the user copies & pastes UTF-8 characters.

My boss calls these "bad characters" since most of them are non-printable characters, and says that we need to strip them out. I've found a few ways to do this, but eventually we've ended up in a circumstance where a UTF-8 character was needed. Plus it's a bit of a hassle, especially since it seems like the only solution I ever read about for this issue is to just set the database to UTF-8 (makes sense to me).

The only argument that I've heard for sticking with Latin-1 is that allowing non-printable UTF-8 characters can mess up text/full-text searches in MySQL. Is this really true?

Are there other reasons one should use Latin-1 over UTF-8? It's my understanding that it is superior and becoming more ubiquitous.

djechlin
  • 2,212
  • 1
  • 15
  • 26
Ten Bitcomb
  • 1,154
  • 1
  • 9
  • 14
  • 2
    Latin is the default, but UTF-8 (Unicode) will allow any character to be stored like Chinese, Cyrillic, etc. Latin1 is just an extension of ASCII, which offers a very lmited (English specific) characters set. If app never goes outside the bounds of the US, Latin1 would be OK, but I would trend to UTF-8. – Jon Raynor Jan 30 '15 at 21:34
  • 4
    @jon LATIN-1 is *not* English specific. Spanish is contained perfectly there, as well as French if I'm not mistaken. – Darkhogg Jan 30 '15 at 23:48
  • 1
    This question would fit better on [database administrators](https://dba.stackexchange.com/). – Philipp Jan 31 '15 at 11:45
  • 4
    @Darkhog: Latin1 is indeed not specific for English, but it is essentially restricted to west-European alphabets. – Bart van Ingen Schenau Jan 31 '15 at 11:58
  • 1
    How much of you codebase would break when converting to UTF-8 ? – Pieter B Jan 31 '15 at 14:18
  • 4
    Latin-1 is a character set, UTF-8 is a character encoding. Comparing the two doesn't even make sense. Those are two completely different things. UTF-8 is in fact a valid encoding of Latin-1. – Jörg W Mittag Jan 31 '15 at 14:59
  • 16
    The only possible benefit from using Latin 1 rather than UTF-8 in a modern system is sabotage. That of course is only a benefit to the saboteur, and whoever their loyalties are to, not to the owners or developers of the system. – Jon Hanna Jan 31 '15 at 23:50
  • 3
    @Darkhogg, it doesn't *quite* contain French as it's missing the uppercase-Y-with-an-umlaut character, which shows up in a (very few) proper names. – Matt Krause Feb 01 '15 at 00:47
  • 13
    Too bad your database would not be able to hold the Euro symbol, or even my name (דותן). – dotancohen Feb 01 '15 at 08:39
  • 20
    user "copy and pastes" non-latin-1 characters? don't treat unicode as some irrelevant frivolous thing that only mischievous nerds care about. quite a lot of us _type_ characters that won't fit in latin-1 on a regular basis — i hear a lot of people speak non-european languages, even ♥ – Eevee Feb 01 '15 at 11:17
  • 4
    Be careful how you approach this. The answers are right, but you have to consider the political climate. Someone who cries "bad characters" when faced with an internationally accepted standard which took a consortium 5 years and *countless* man-hours -- that's the kind of person to whom you don't just want to provide a logical argument and assume it speaks for itself. You will need to take the time to build up an emotional argument as well. – Cort Ammon Feb 01 '15 at 17:42
  • 1
    The Latin-1 character set is fairly restrictive and doesn't even support half of the European languages. You can't spell the name of Paul Erdős in Latin-1. – Szabolcs Feb 01 '15 at 18:39
  • 1
    @CortAmmon: Not just emotional. But these things are already in production. So there is non zero financial cost as old systems need to be converted. Not just on the the DB but all the old code that touches the DB. – Martin York Feb 01 '15 at 20:26
  • 1
    He is your boss. Just make sure folks understand that it is his decision. – copper.hat Feb 02 '15 at 00:33
  • What does "your boss" have anything else to do with the technical matter of this question? Just to make the viewpoint sound ignorant before you ask the question? – djechlin Feb 02 '15 at 15:02
  • @xehpuk A visual representation of a grapheme that is internally represented by 8 bit accordingly to the UTF-8 format? – Alex Feb 02 '15 at 15:07
  • @Darkhogg: not quite: http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Languages_commonly_supported_but_with_incomplete_coverage (Spanish yes, but french not so much) – njzk2 Feb 02 '15 at 15:47
  • @JörgWMittag: Latin-1 is a character encoding. It is also a character set (a block in the unicode), but this is not what is considered in this question – njzk2 Feb 02 '15 at 15:48
  • 1
    @Ravensine: Regarding full-text search, there are tools for that, which will handle composed characters affinity, control characters, and case sensitivity. I don't know about MySQL, but SQLite has FTS: http://www.sqlite.org/fts3.html#section_1 – njzk2 Feb 02 '15 at 15:52
  • @njzk2: Thanks for pointing me to that. It turns out that MySQL also has FTS. Since it seems to ignore certain characters, it probably ignores non-printable control characters as well(which makes sense since I don't know why anyone would want to include those). – Ten Bitcomb Feb 02 '15 at 19:43
  • @zehpuk: You are right, since UTF-8 is not a character set. My statement was due to a lack of understanding. – Ten Bitcomb Feb 02 '15 at 19:45
  • It just feels like western-centric shortsightedness. At least half of population in the world use some language that is not Latin-1 compatible. The business is going to have massive troubles if you truly want to be inclusive. – xji Feb 03 '15 at 03:56
  • @Darkhogg - You are right, my comment should be amended to include limited character set and scratch Engligh specific. Thx for pointing that out. – Jon Raynor Feb 03 '15 at 15:51
  • @JörgWMittag - No, in the terminology of MySQL, both latin1 and utf8 are character sets. See https://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html Perhaps that is sloppy use of terminology by the designers of MySQL, but clearly this is what OP is asking about. – ToolmakerSteve Jul 01 '16 at 00:14

6 Answers6

136

Unicode is certainly difficult, and the UTF-8 encoding has a couple of inconvenient properties. However, UTF-8 has become the de-facto standard encoding on the web, surpassing ASCII, Latin-1, UCS-2 and UTF-16. Just use UTF-8 everywhere.

The most important reason why you should support Unicode is that you shouldn't make unnecessary assumptions about user input. I have no idea what your domain is, but things like Hebrew usernames, a blog post about China, a comment with Emoji, or simply well styled text – like “this” – should be possible… Oh, those were typographically correct quotation marks (“” rather than ""), en-wide dashes, and an ellipsis, which are characters that are common in English text, but not supported by ASCII or Latin-1. So not supporting other scripts isn't just a big f*ck you to other cultures, but sticking to Latin-1 doesn't even allow you to write proper English.

The notion that Unicode only allows “bad characters” is wrong. Yes, text is really complicated, and Unicode won't hide that from you. Your boss may be thinking about composed characters, where one base codepoint such as a is modified by subsequent codepoints that e.g. represent diacritics to form one visual character such as á. This doesn't really get into your way when trying to do searches if you do some kind of normalization. For example, you could store all text in the NFC form which collapses such compositions into their precomposed form if one is available. When doing searching, you could also strip all composing characters from the text, but this may substantially change their meaning in some languages.

Unicode also adds a lot of unprintable characters – but even ASCII has loads of them. Will you handle a NUL in the middle of a string? How about 0x1C, a “File Separator”? I've never seen half of those. Latin-1 adds a soft hyphen that indicates word break opportunities, but is otherwise invisible. Does that also break your full-text search? In other words, even ASCII and Latin-1 allow you to completely break your input if you assume it's all just printable text!

amon
  • 132,749
  • 27
  • 279
  • 375
  • 8
    From a database perspective, some of those characters are not/should not be allowed in a text type field (text/varchar/char/etc.). MySQL *does* allow null characters in these data types, but other databases like PostgreSQL do not. You're supposed to use BLOB (MySQL) or BYTEA (PostgreSQL) if you want to be able to store such characters. – cimmanon Jan 31 '15 at 17:24
  • 16
    "sticking to Latin-1 doesn't even allow you to write proper English" That's a good thing, otherwise unicode would be resisted even stronger. ;-) – Deduplicator Feb 01 '15 at 04:18
  • @cimmanon A blob contains binary data, not characters, right? – Paŭlo Ebermann Feb 01 '15 at 11:19
  • 3
    @PaŭloEbermann Embedded NUL characters means your data is a binary blob, not just a string. NULs was a strange example, since I believe UTF-8 avoids ever using a `\0` byte as part of a multi-byte encoding, to make sure non-UTF8-aware code doesn't stop in the middle of a string. – Peter Cordes Feb 01 '15 at 11:25
  • 7
    All unicode characters are printable -- you just need the correct font :-) – James Anderson Feb 02 '15 at 04:30
  • 4
    @JamesAnderson the font would then be wrong and broken. http://en.wikipedia.org/wiki/Unicode_control_characters – djechlin Feb 02 '15 at 15:08
  • AAA @ these are really the ASCII control characters and you can get fonts which do display them -- very useful for debugging and hex editors. – James Anderson Feb 03 '15 at 01:52
  • 1
    @amon: Might you expand on what "inconvenient properties" UTF-8 has over other unicode encodings? Because the only ones I know are slightly worse size for dense asian text than UTF-16, and multi-codeunit codepoints in contrast to UTF-32. And both points are offset completely: The UTF-16 one because of better size for anything favoring/featuring ASCII, like markup, source-code, protocols and most other languages to varying degrees. The UTF-32 one by it being far more bulkier, and unicode having multi-codepoint characters. – Deduplicator Feb 03 '15 at 02:53
  • 1
    @Deduplicator Well, you know your encoding pros and cons. The main drawback of UTF-8 is that it is a variable-length encoding. While this saves significant space in most cases, it makes certain operations more difficult. But in the context of MySQL + Rails, these details are already hidden from the programmer. UTF-8 wins over all other Unicode encodings (UTF-16, UTF-32) because they tend to waste more memory and UTF-16 is also variable length. In the context of MySQL, care has to be taken to pick the `utf8mb4` encoding, as the `utf8` encoding is as crippled as UCS-2 and only supports the BMP. – amon Feb 03 '15 at 09:44
64

I think beyond the technical question, your boss may not have the time to keep up to date on current standards.

Since his stance is not completely out to lunch, just out-dated, respect his position when discussing this matter (and you need to remember to discuss, not argue), and try to work through concerns he has with regards to UTF-8. I suspect the underlying issue is not a technical issue and may require some level of soft-skill negotiation.

Nelson
  • 861
  • 6
  • 11
  • 6
    I couldn't approve more. Actually I regret that in my own answer I completely overlooked the "human side", which in this issue might well be paramount. Wish I could upvote more than once :-) – LSerni Jan 31 '15 at 10:38
  • 2
    calling everything outside of latin-1 `bad character` and thinking these are `non-printable` is `just out-dated` to you? – njzk2 Feb 02 '15 at 15:50
  • 2
    The real issue is, "Is it a technical issue we are dealing with?" I don't believe the OP's boss went to school and was taught this, or read some technical manual/journal and came to that conclusion. I don't get the sense that the solution is strictly a technical solution. Ironically the comment shows exactly the heart of the issue; addressing this issue can be extremely offensive if done improperly. – Nelson Feb 02 '15 at 17:04
49

Which of us is right?

Once upon a time, your boss was. But as time goes by, things change. Nowadays, you are (but before running to your boss, be sure to read Nelson's answer too).

Old versions of MySQL, and old versions of mostly everything, dealt much better with the older Latin1/ISO-8859-1(5) than UTF8.

There is a reason why UTF8 has been created, evolved, and pushed mostly everywhere: if properly implemented, it works much better. There are some performance and storage issues stemming from the fact that a Latin1 character is 8 bits, while a UTF8 character may be from 8 to 32 bits long. So when planning VARCHAR you need to take this into account. And your search routines will be a tad slower. They will be able to do more things (e.g. searches with accent sensitivity or without. Can't do those in Latin1 without extensive work), but they will take a bit more time.

But on the other hand, storage is cheap, the realistic overhead on file sizes is less than 2-3%, computing power is also cheap and getting cheaper in good accord with Moore's Law; while your time and your customers' expectations definitely aren't.

You might have to worry for search tools etc. if you were the one to develop such tools. But you probably aren't. You use those tools; even those that were not completely UTF8 compliant yesterday (as the earlier MySQLs weren't), are today, or soon will be (e.g. MySQL with utf8mb4 support).

So by carefully planning and implementing UTF8 the right way (not slapping it over Latin1 as an afterthought) you can have code that is very reasonably future-proof, which, if you plan on ever doing business with any Asiatic country, is a Very Good Thing. And if you have no such plans, other people will have, and those people could be your customers, suppliers, or partners.

So when they start sending you UTF8 data, you'll have to set up a complicated thingamajig to convert to and fro Latin1, and deal with unsolvable cases.

When you factor in the budget the cost of several skirmishes against the evil mojibake ninjas, and consider that they are not going to go away - as you already discovered - then you'll realize that going UTF8 is not only simpler, it's going to be cheaper as well.

LSerni
  • 2,411
  • 15
  • 21
4

Some situations where restricting the character set only to ASCII may make sense is for limited choice fields, e.g. status fields, because you strictly control the values that can be there, and foreign key/references to external system, because there are rarely any reasons for them to have anything but alphanumeric characters and a few symbols.

For any other texts, just use UTF-8.

Lie Ryan
  • 12,291
  • 1
  • 30
  • 41
  • 2
    Doesn't MySQL have enums? – raptortech97 Feb 02 '15 at 06:55
  • 2
    And since ASCII is a subset of UTF8, just use UTF8 even then. – RemcoGerlich Feb 02 '15 at 10:00
  • @RemcoGerlich: I disagree that you could use UTF8 for those. In my view, external references are not text but opaque sequence of bytes. They have no charset except for notational convenience. If the sequence of bytes have an interpretation in certain charset, that is either the external system's or the application's domain, not the database's. – Lie Ryan Feb 02 '15 at 11:14
  • 3
    @LieRyan: I see that point, but then it shouldn't be ASCII either, probably some binary blob format or so. – RemcoGerlich Feb 02 '15 at 13:03
3

To begin with the answer, it doesn't matter, how your server is configured. The character encoding in MySQL could be configured per-column (means, same table could hold characters in multiple encodings, easy). I.e. my server (and a number of legacy databases in it) is configured for cp1251 by default for old clients that unable to set correct collation upon connect (different hardware clients), but main databases in production are all using UTF-8.

Speaking of "wasted space" - you can't realistically call important data a waste, can you? Storage space increase, however, will be different depending on the language your data is in. From insignificant (less than 1%) increase if your site is primarily in English and up to 100%, if it is mailny using characters outside the ASCII range. And even more, if you move firther east. Later UTF-8 (so-called UTF8mb4) specifications allow up to 4 bytes per code point.

And to "who's right"… Truth is, this is a social question more than it is technical. There could be valid reasons for specific server setups, but you must know the implications. But if you ask me, there's no reason to not use UTF-8. It's the one kind to rule all texts in the world.

AnrDaemon
  • 139
  • 1
  • MySQL will try to convert data in Database encoding before converting it to column encoding. If you have utf8 client, latin1 database and utf8 columnt, then text data can be lost. – Ivan Solntsev Feb 02 '15 at 15:50
  • Ivan, that is an entirely different question. The intereaction between character-set-client, character-set-server, character-set-connection, character-set-results is a long article in the MySQL documentation. And in case of per-column collation settings, "database collation" is column collation, and it is directly converted to character-set-result, ignoring database collation. – AnrDaemon Feb 06 '15 at 08:12
0

Just explain to him that UTF-8 is the default for web traffic. And any user can enter any valid unicode character in their browser.

Its just much easier to have utf-8/unicode all the way from front end to back end than to deal with the many and various issues that result from utf-8-> latin-1-> utf-8.

James Anderson
  • 18,049
  • 1
  • 42
  • 72