Is the use of "utf8=✓" preferable to "utf8=true"?

Question

I have recently seen a few URIs containing the query parameter "utf8=✓". My first impression (after thinking "mmm, looks cool") was that this could be used to detect a broken character encoding.

So, is this a better way to resolve potential problems with character encoding, or is it just a developer having fun with a hack?

I disagree. There are schemes out there that look like URNs and that take query parameters - such as Bitcoin. URIs are not confined to browsers. See http://en.wikipedia.org/wiki/URI_scheme. This question **may** also address the general case where character encoding is required when a browser accesses a protocol handler. — Gary, Oct 19 '12 at 08:29
Off topic, but OK. Here's my personal donation Bitcoin URI: bitcoin:1KzTSfqjF2iKCduwz59nv2uqh1W2JsTxZH?amount=0.5&label=Agile%20Stack. Notice that the scheme is essentially a URN with query parameters, but it hands off to a protocol handler. This kind of URI could probably benefit from the “utf8=✓” workaround as well. — Gary, Oct 22 '12 at 17:47

score 823 · Accepted Answer · edited Oct 18 '12 at 12:57


By default, older versions of IE (<=8) will submit form data in Latin-1 encoding if possible. By including a character that can't be expressed in Latin-1, IE is forced to use UTF-8 encoding for its form submissions, which simplifies various backend processes, for example database persistence.

If the parameter were instead utf8=true, it would not trigger UTF-8 encoding in these browsers, because every character in "true" can be expressed in Latin-1.
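The distinction the answer relies on can be checked with a short sketch. Python is used here purely for illustration (it is not part of the original answer), and `fits_latin1` is a hypothetical helper name. The sketch does not model IE's form-submission logic; it only shows that "true" is encodable in Latin-1 while the check mark is not, and what the check mark looks like once percent-encoded as UTF-8.

```python
from urllib.parse import quote


def fits_latin1(text: str) -> bool:
    """Return True if every character of `text` can be encoded as Latin-1.

    Hypothetical helper for illustration only.
    """
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# "true" is plain ASCII, so a Latin-1 submission could carry it unchanged.
print(fits_latin1("true"))

# The check mark (U+2713) has no Latin-1 representation.
print(fits_latin1("✓"))

# Percent-encoded form of the check mark when the query string is sent as UTF-8.
print(quote("✓", encoding="utf-8"))
```

A parameter value that fits in Latin-1 gives the browser no reason to switch encodings, which is why the choice of character matters and the parameter's name and truthiness do not.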

answered Oct 13 '12 at 12:45 by Gareth · edited by ChrisF

I can't quite see how it would save you from having to validate input, considering that you always have to consider malicious input for public interfaces. – Lars Viklund Oct 13 '12 at 13:07
@LarsViklund You have to secure against malicious input, sure. But this is, to a large degree, independent of character encoding weirdness. – Oct 13 '12 at 13:47
@LarsViklund I should have been clearer with my comment. I meant that the validation associated with character encoding is simplified, not bypassed. – Gary Oct 13 '12 at 13:48
@Lars Correct, it doesn't absolve you from having to check your input. But it does mean that encoding tweaks only become part of your security handling and don't taint the concept of your "standard processing" path – Gareth Oct 14 '12 at 10:08
Also see http://stackoverflow.com/questions/3222013/what-is-the-snowman-param-in-rails-3-forms-for/3348524#3348524. Apparently Ruby on Rails used to use a snowman character, and was changed to a checkmark which was less ambiguous but less funny. – Jack V. Oct 17 '12 at 10:06
Amusingly one of the first versions of this used the UTF-8 snowman ☃ to force the conversion. – tadman Oct 18 '12 at 17:07
How does the browser/parser know to evaluate this to true? Does anything other than false or 0 auto evaluate to true, or are there specific values which map to true and others to false? – JohnLBevan Oct 18 '12 at 19:21
@JohnLBevan it's ignored by the receiving end; it has done its job, which is to force the browser to send things in utf8 instead of latin1. I've also seen it as ie= (that's the 'pile of poo' code point, looks like it's not rendering in comments.) – cabbey Oct 18 '12 at 19:54
Ahh sorry, just reread the question & spotted that this is in the URI; not the html/xml. So presumably putting utf8=false would be meaningless - the parameter's only purpose is to act as a hack for ie. Thanks @cabbey. – JohnLBevan Oct 18 '12 at 20:08
@Gareth: Can you back up the statement that IE <= 8 forms do not support the document and/or form encoding? – hakre Oct 22 '12 at 13:00
_By default, older versions of IE (<=8)..._ strikes again! – Flash Jan 09 '13 at 08:03
That makes sense - instead of having to handle non-UTF8 you can just bomb out with an "Outdated browser not supported; please upgrade" error. – Demi Dec 18 '13 at 05:09
But isn't it true that the Latin1 charset (ISO-8859) is 100% compatible for storing in a UTF-8 datastore? Why would you need to *force conversion*? – Simon East May 18 '17 at 05:51
@SimonEast It's absolutely not true. For that to be true, of the two codepage layouts for [Latin1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout) and [UTF8](https://en.wikipedia.org/wiki/UTF-8#Codepage_layout) one would have to be a proper subset of the other. As it is, they only overlap between 0x20 and 0x7E. Any bytes outside of that range are ambiguous as to what they represent. – Gareth May 23 '17 at 11:25
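The incompatibility described in this last comment can be illustrated with a small sketch. Python is an assumption made for illustration, and `decodes_as_utf8` is a hypothetical helper name; nothing here comes from the original thread. It shows that the same text produces different bytes in the two encodings once a non-ASCII character is involved, and that misreading one as the other either fails or yields garbled text.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` is a valid UTF-8 byte sequence.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "café"
latin1_bytes = text.encode("latin-1")  # é is the single byte 0xE9
utf8_bytes = text.encode("utf-8")      # é is the two bytes 0xC3 0xA9

# The byte sequences differ as soon as a non-ASCII character appears.
print(latin1_bytes == utf8_bytes)

# Latin-1 bytes are not, in general, valid UTF-8.
print(decodes_as_utf8(latin1_bytes))

# UTF-8 bytes read as Latin-1 decode without error but produce mojibake.
print(utf8_bytes.decode("latin-1"))

# Pure ASCII text is the case where both encodings agree byte for byte.
print("abc".encode("latin-1") == "abc".encode("utf-8"))
```

This is why storing unconverted Latin-1 input in a UTF-8 datastore is only safe for ASCII-only text.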
