38

What caveats should I be aware of when localizing numbers in my front-end application?

Example: In Brazilian Portuguese (pt-BR) we separate thousands with dots and decimals with commas. In US English (en-US) it's the opposite. In pt-BR we group digits in thousands, the same as in en-US. But reading about Indian English (en-IN) today I came across this gem:

The Indian numbering system is preferred for digit grouping. When written in words, or when spoken, numbers less than 100,000/100 000 are expressed just as they are in Standard English. Numbers including and beyond 100,000 / 100 000 are expressed in a subset of the Indian numbering system.

https://en.wikipedia.org/wiki/Indian_English#Numbering_system

Which means:

1000000 units in pt-BR are formatted 1.000.000
1000000 units in en-US are formatted 1,000,000
1000000 units in en-IN are formatted 10,00,000
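
For what it's worth, the three outputs above can be reproduced in a browser or Node.js with the standard Intl.NumberFormat API. This is only a minimal sketch: the helper name formatIn is made up for illustration, and the results assume the runtime ships full locale data.

```typescript
// Format the same value under the three locales from the question.
// Intl.NumberFormat is the standard ECMAScript internationalization API.
const million = 1000000;

// Hypothetical helper, named here only for illustration.
function formatIn(locale: string, n: number): string {
  return new Intl.NumberFormat(locale).format(n);
}

const ptBR = formatIn("pt-BR", million); // dots, groups of three
const enUS = formatIn("en-US", million); // commas, groups of three
const enIN = formatIn("en-IN", million); // commas, Indian 3-2-2 grouping

console.log(ptBR); // "1.000.000"
console.log(enUS); // "1,000,000"
console.log(enIN); // "10,00,000"
```

Note that no grouping mask appears anywhere in the code; the grouping pattern comes from the locale data.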

Besides commas and dots and other specific separators, it seems that masking is also a valid concern.

What other caveats should I be aware of when localizing numbers in my front-end application? Especially if I'm showing numbers in non-Latin scripts?

Paul D. Waite
Machado
  • 3
    Gets even more interesting when dealing with money! :-) – Stephan Bijzitter Dec 06 '16 at 21:52
  • 4
    Not talking about the Martian numbering system which has the base 6 (two times 3 fingers) ;-) But Japanese has also a strangeness: man = 10.000 written as 1.0000, oku = 100.000.000 written in Japan as 1.0000.0000 and chō ... guess –  Dec 06 '16 at 23:14
  • @StephanBijzitter: But at least £sd isn't in use anymore ;-) – dan04 Dec 07 '16 at 00:17
  • 6
    Why do *you* have to worry about this? Can't you follow the OS settings? – Jan Doggen Dec 07 '16 at 08:11
  • 3
    @JanDoggen because that's one of the interesting problems of the Software Engineering domain, "how to properly present data to people". What I should be worried about when designing a system is the domain of this question. And I'm not even talking about money, as our friend Stephan said, nor date and time. Just raw numbers. – Machado Dec 07 '16 at 12:08
  • @dan04 Your spell checker burped. I think you meant "alas, £sd isn't in use anymore" – Mawg says reinstate Monica Dec 07 '16 at 13:13
  • 5
    @JanDoggen, this gets a lot more complex when dealing with online software. The user might be in India, on a US English computer, but reading a webpage in Brazilian Portuguese. Your server might be Chinese. Your app must understand what the user wants, regardless of what OS he/she is using, or where your server is. So your 1,000.00 dollars become 67.545,00 rupees: a US currency, converted at the local exchange rate, but displayed in Portuguese format. – noderman Dec 08 '16 at 02:12
  • 2
    In a word, don’t do it yourself. Rely on system-provided formatting functions that interact with the user’s settings. If you *are* doing that already, the caveats would be to make sure the proper fonts can be used and your layout has enough room, which is the same issue you have with all the content being displayed, not just the numbers. – JDługosz Dec 08 '16 at 06:16

6 Answers

87

Most programming languages and frameworks already have a sensible, working mechanism that you can use for this.

For example, the C# ecosystem has the System.Globalization namespace, which allows you to specify the Culture you want:

// requires: using System.Globalization;
Console.WriteLine(myMoneyValue.ToString("C", CultureInfo.GetCultureInfo("en-US")));

This is not something that you want to re-invent. Use the internationalization features provided by your favorite language or framework.
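
Since the question is about a front-end application, a rough browser-side counterpart of the C# call above might look like the following. It is a sketch using the standard Intl.NumberFormat API; the amount and the variable names are illustrative, not taken from any real code base.

```typescript
// Currency formatting: the locale decides symbol placement, separators
// and grouping; the currency code decides which currency is shown.
const amount = 1234.5;

const usd = new Intl.NumberFormat("en-US", {
  style: "currency",
  currency: "USD",
}).format(amount);

// The same US dollar amount presented to a Brazilian reader:
// the currency stays USD, but separators and symbol placement change.
const usdForBrazil = new Intl.NumberFormat("pt-BR", {
  style: "currency",
  currency: "USD",
}).format(amount);

console.log(usd); // "$1,234.50"
console.log(usdForBrazil);
```

The locale and the currency are independent inputs, which matters as soon as the user's display conventions and the currency being shown come from different countries.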

Robert Harvey
  • 3
    I'm aware of System.Globalization and other frameworks that handle this kind of complexity for me. What I don't know is what problems they're solving. For instance, several applications I see use specific masking in ToString, like .ToString("#,##0.00", locale), but that mask per se is invalid if I'm showing this number to an Indian person. So, besides "don't use specific masks", what else should I be aware of? – Machado Dec 06 '16 at 16:52
  • 7
    Nothing that I know of. If you use the framework properly, it should just work. There are certain, specific cases of internationalization problems, but building a comprehensive list of them is not something we do here. See [this example](http://haacked.com/archive/2012/07/05/turkish-i-problem-and-why-you-should-care.aspx/). – Robert Harvey Dec 06 '16 at 17:33
  • Each user would pick a culture they want to use as part of their user preferences, then it's just applying that culture variable to your display functions to display the data properly. If they don't like the current display you could even allow them to create their own custom formats. – Jon Raynor Dec 06 '16 at 18:35
  • 5
    This is the only correct answer: set your locale, then push your values through the i18n layer before displaying to the user and let the framework authors deal with it. This is true for numbers, currency values, translated strings, dates, everything. –  Dec 06 '16 at 19:55
  • 2
    Perfect answer. "Do not reinvent the wheel" is something that should always be taken into consideration while dealing with common problems like this one. It's a pity I can't upvote more than once. – BgrWorker Dec 07 '16 at 10:15
  • 3
    @Machado "For instance, several applications I see uses specific masking on ToString, like .ToString("#,##0.00", locale), but that mask per-se is invalid if I'm showing this number to an Indian person." -- It might not be clear, but note that the position of `,` in the format string is largely irrelevant and "#,0.00" would have the same effect. `,` simply means "use number group separators in the manner specified by the locale". – hvd Dec 07 '16 at 11:46
  • @hvd, I never knew that about the masking properties. Thanks! RobertHarvey, your example link is very good. BgrWorker/Snowman, I have a feeling I may have expressed myself badly in the question. Localization is a solved problem, but what I should be worried about when presenting data to the users may not be. Simple things such as formatting a number have several caveats, and I'm interested in these caveats. – Machado Dec 07 '16 at 12:12
  • For instance, when presenting a list of numbers in a table in pt-BR, it's usual for us to present the list right-aligned. Does the same principle apply to a right-to-left language? Or in that case do readers prefer the numbers left-aligned or centered? – Machado Dec 07 '16 at 12:17
  • Though this is the most upvoted answer, I'm not comfortable marking it as the correct one, as it doesn't address the question. I'm downvoting because of that. As an analogy Q&A: "What should I be concerned about when building a car?" - "Just buy one, it's a solved problem". – Machado Dec 08 '16 at 12:31
  • I guess you also have to worry about languages which might use a completely different character set to format numbers for formal documents, like 大字 in Japanese. – Hakanai Jun 14 '17 at 23:56
24

Some excellent answers here already, but they did not mention one thing which I think is important not to forget: make sure that wherever number formatting takes place, it is clear (or can be controlled) what the output is used for:

  • when it is for the user interface, the localized formatting must be applied

  • when the number is going to be written to a file, sent over the network, or used anywhere else it is needed in machine-readable form, make sure it is not formatted according to the current culture, but according to a fixed setting (for example, in the .NET environment, use InvariantCulture).

Otherwise you get problems when numbers are written or sent using culture A, and read or received using culture B.
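
To make the "culture A writes, culture B reads" failure concrete, here is a small sketch in TypeScript (chosen because the question is about a front end; the .NET case is analogous). The variable names are illustrative, and pt-BR merely stands in for any locale whose decimal separator is not a point.

```typescript
const measured = 1234.5;

// "Culture A": a writer that (wrongly) uses a UI formatter for storage.
const writtenByA = new Intl.NumberFormat("pt-BR").format(measured); // "1.234,5"

// "Culture B": a reader using the language's built-in parsing,
// which expects "." as the decimal separator.
const strictRead = Number(writtenByA);      // NaN: the text is rejected
const lenientRead = parseFloat(writtenByA); // 1.234: silently wrong

// Locale-independent round trip: String() and Number() never consult a locale.
const wire = String(measured); // "1234.5"
const restored = Number(wire); // 1234.5

console.log(writtenByA, strictRead, lenientRead, wire, restored);
```

The lenient read is arguably the worse outcome of the two, because nothing fails and the value is simply off by a factor of a thousand.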

In my experience, this is one of the biggest hurdles in doing proper localization of numbers: in an attempt to centralize the number formatting and conversion, people start to create general, reusable functions for the formatting, and then start to use them all over the place. However, as soon as one needs the numbers also in a machine-readable string format somewhere else in the program, two variants are needed: a localized and a non-localized formatting. This introduces a high risk of mixing up the two forms of conversion (especially when the developers' and testers' machines have default locale settings similar to the "fixed" setting used for non-UI formatting, but part of the user base does not).
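
One way to lower the risk of mixing up the two variants is to give them deliberately different names and signatures, so that every call site has to choose. The sketch below is only an illustration of that idea in TypeScript; the function names formatForDisplay, formatForMachine and parseFromMachine are hypothetical.

```typescript
// Two deliberately separate entry points, so a call site has to choose.

/** For human eyes only: the output depends on the given locale. */
function formatForDisplay(n: number, locale: string): string {
  return new Intl.NumberFormat(locale).format(n);
}

/** For files, logs and network payloads: never locale-dependent. */
function formatForMachine(n: number): string {
  if (!isFinite(n)) {
    throw new RangeError("cannot serialize a non-finite number");
  }
  return String(n);
}

/** Counterpart of formatForMachine; rejects text that Number() cannot read. */
function parseFromMachine(text: string): number {
  const n = Number(text);
  if (text.trim() === "" || isNaN(n)) {
    throw new SyntaxError("not a machine-formatted number: " + text);
  }
  return n;
}

console.log(formatForDisplay(1234567.5, "pt-BR")); // "1.234.567,5"
console.log(formatForMachine(1234567.5));          // "1234567.5"
```

Because formatForDisplay requires a locale and formatForMachine does not accept one, passing display text to parseFromMachine fails loudly instead of producing a wrong value.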

Addendum: this problem can become really nasty in situations where it is not clear beforehand if the number will be processed by a machine, or by a human (or both) later. For example, as part of the output of a log file. In such cases it is probably best to stick to the "neutral" standard of using no separator except the point as a decimal separator.

jscs
Doc Brown
  • 2
    And worse, in many modern programming languages the obvious/default functions in the standard library are "localised". So if the developer doesn't know or care about localisation, the resulting application is likely to be dysfunctional rather than merely ugly on foreign systems. – Peter Green Dec 07 '16 at 08:35
  • @PeterGreen: I guess it would be equally bad if the default behaviour would be not-localised, then if the developer doesn't know or care about localisation, he would screw up the UI. The only way around this might be to provide functions with no default behaviour, which force the dev to make a decision between "localised" and "not-localised". – Doc Brown Dec 07 '16 at 08:59
  • 4
    I disagree on equally bad. A tool that doesn't follow local numeric conventions in its UI is still going to be usable. A tool that fails to read its own data files or fails to talk to its server because of numeric convention mismatches is far more likely to be unusable. – Peter Green Dec 07 '16 at 09:17
  • 5
    An anecdote of this: The decimal separator for en-ZA changed between Win 7 and Win 8. Previously locally stored values started to fail to deserialize – Caleth Dec 07 '16 at 09:17
  • 1
    @PeterGreen: a tool that doesn't follow local numeric conventions in its UI **may** still be usable, or it **may** be fully unusable for certain use cases. I would be very careful about making such assumptions. The reason why so many devs get localization of numbers wrong is exactly that: making these kinds of assumptions. – Doc Brown Dec 07 '16 at 09:36
  • 1
    @DocBrown I have the most horrible legacy code to maintain that suffers from the standard library's localized integer/float parsing routines. I think it's fair to say that a program written without care for localization when the default routines for these jobs are non-localized **may** be unusable for some situations, but if the default routines are localized, the program **will always** be broken the moment it is executed on a computer where the global locale is not English. – Sebastian Redl Dec 07 '16 at 09:40
  • 1
    @SebastianRedl: fair enough. – Doc Brown Dec 07 '16 at 12:17
8

Proper localization is quite difficult. Most programming ecosystems have attempts at solutions for localization, but in my experience they are all more or less broken. I would therefore suggest:

  • Don't try to automate localization. It won't always work. It is difficult for you to spot the problems, and frustrating for your users.

  • Be consistent: don't mix different languages and formatting conventions, e.g. Brazilian-style decimal separators in English text.

  • Explicitly support a given set of locales. Work together with your translators to figure out proper formatting for dates and numbers. You will likely end up creating your own localization toolkit, though most (but not all) problems can be delegated to an existing library.

  • Make simple formatting choices configurable by each user: formats for dates and times, decimal separators, preferred currency, …. This is especially useful for travellers, expats, or other people that need to mix multiple locales or cultures independently of language.
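
As a rough idea of what per-user formatting choices could look like in a front end, here is a sketch that keeps the number conventions separate from the UI language and lets the user switch grouping off. The NumberPrefs shape and the formatWithPrefs helper are hypothetical; only Intl.NumberFormat and its useGrouping option are standard.

```typescript
// Hypothetical per-user preferences; the field names are illustrative.
interface NumberPrefs {
  locale: string;       // number conventions, independent of the UI language
  useGrouping: boolean; // some users prefer no digit grouping at all
}

function formatWithPrefs(n: number, prefs: NumberPrefs): string {
  return new Intl.NumberFormat(prefs.locale, {
    useGrouping: prefs.useGrouping,
  }).format(n);
}

// An expat reading a Portuguese UI but preferring US-style numbers.
const expat: NumberPrefs = { locale: "en-US", useGrouping: true };
// A user who finds digit grouping distracting.
const plain: NumberPrefs = { locale: "pt-BR", useGrouping: false };

console.log(formatWithPrefs(1234567.89, expat)); // "1,234,567.89"
console.log(formatWithPrefs(1234567.89, plain)); // "1234567,89"
```

Here the formatting itself is still delegated to the platform; only the choice of conventions is taken from the user rather than guessed.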

amon
  • 18
    Also be aware that a large number of users **hate** the convention that's deemed "correct for their locale", consider it a hideous legacy practice, and want no grouping at all, or a different sort of grouping. As such there should probably be options to turn it off or manually override it. – R.. GitHub STOP HELPING ICE Dec 06 '16 at 20:15
2

You can't be aware of all the caveats of languages. You are talking about numbers, but there are plurals, genders, collation. You need to know they exist and rely on extensive work performed by other people, most notably the ICU and CLDR projects.

Most modern languages implement some or all features of these projects, but even if they don't, reading about these projects will give you a good idea of what to look for.

http://site.icu-project.org

http://cldr.unicode.org

Update

The CLDR survey tool provides access to all patterns. That will show you how to format a number in a certain language and region. For example, Portuguese (Portugal):

http://st.unicode.org/cldr-apps/v#/pt_PT/Number_Formatting_Patterns/

And if you really want to check all data (and perhaps use it), you can download the CLDR in JSON format from GitHub:

https://github.com/unicode-cldr/cldr-json#cldr-json

More info about downloads here:

http://cldr.unicode.org/index/downloads

noderman
  • Thanks for the input, but I'm mostly interested in numbers by now. :) – Machado Dec 07 '16 at 12:14
  • Sure. I just edited the response to include a link to the survey tool, where you can narrow down your search. – noderman Dec 07 '16 at 21:11
  • I tried to change to Brazil, to check the differences, but it doesn't seem to enable visualization for that: http://st.unicode.org/cldr-apps/v#/pt_BR/Number_Formatting_Patterns/ Otherwise, the tool seems pretty good. – Machado Dec 07 '16 at 21:17
  • That's because Brazil is the root language. The survey tool is actually used for making changes to the CLDR data, so the roots require special accounts. You can go to GitHub and get all info directly: https://github.com/unicode-cldr/cldr-numbers-modern/tree/master/main Specifically, Brazil is here: https://github.com/unicode-cldr/cldr-numbers-modern/blob/master/main/pt/numbers.json – noderman Dec 08 '16 at 00:33
2

An important consideration: You should decide how much is enough. Because if you go down the rabbit hole of trying to localize perfectly, it will become increasingly complex.

Take a typical label like "You have selected n items." This reads wrong if there is only one item selected. The ugly but pragmatic solution is to write "You have selected n item(s)." But if you want to do it correctly, you need two different texts depending on n. If you try to do this in multiple locales it will quickly get really complex, since different languages have different grammar. Some languages have different plural forms for one, two and multiple items and so on. For this reason people in the know will always complain that existing localization frameworks are insufficient.
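
To illustrate how the number of required texts depends on the language, here is a small sketch using the standard Intl.PluralRules API, which maps a count to a plural category ("one", "few", "many", "other", and so on). The helper names and the message table are made up for illustration; a real application would load its messages from translation files.

```typescript
// Intl.PluralRules maps a count to a plural category for a given locale.
function pluralCategory(locale: string, n: number): string {
  return new Intl.PluralRules(locale).select(n);
}

// English needs only two variants of the label.
const englishLabels: Record<string, string> = {
  one: "You have selected {n} item.",
  other: "You have selected {n} items.",
};

function selectedLabel(n: number): string {
  const template = englishLabels[pluralCategory("en", n)] || englishLabels["other"];
  return template.replace("{n}", String(n));
}

console.log(selectedLabel(1)); // "You have selected 1 item."
console.log(selectedLabel(3)); // "You have selected 3 items."

// Russian distinguishes more categories, so it needs more variants.
console.log(pluralCategory("ru", 1));  // "one"
console.log(pluralCategory("ru", 2));  // "few"
console.log(pluralCategory("ru", 5));  // "many"
console.log(pluralCategory("ru", 21)); // "one"
```

The category only selects which text to use; writing a grammatically correct text for each category is still the translator's job.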

But you have to choose your battles, and decide what level of sophistication is sufficient. For many purposes a standard localization library for formatting numbers and dates should be sufficient.

JacquesB
  • This is solved by ICU (MessageFormat). The drawback is that the adoption of ICU in many languages is still weak. However, the developer still needs to construct the message in the right way. It is really more than the engineering aspect of it. http://userguide.icu-project.org/formatparse/messages – noderman Dec 08 '16 at 00:37
  • This is also solved by the more widely available [ngettext](https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html) function in GNU gettext, but the MessageFormat class appears to also solve some extra problems that ngettext doesn't. – hvd Dec 09 '16 at 11:33
0

Well, while I'm happy with all the answers here, I'm not really satisfied with each of them separately to mark one as the correct answer.

So far this is what we should be aware of when localizing numbers:

For humans:

  • Thousands separators do not always separate at thousands. See the Indian case in the question;
  • Thousands and decimal separators vary from culture to culture. In German thousands are split using spaces, for example, while in English it's commas and in Portuguese it's dots;
  • We don't have information on whether there's a relevant difference between left-to-right and right-to-left languages;
  • Provide a specific set of supported localizations and make it clear to your users;
  • Allow your users to change the default localization to one of the supported localizations and they'll be happy and send you cakes in gratitude, because you're a generous god. :) ;

For computers:

  • Remember that machines are not lenient and should always receive the same format when serializing and deserializing a number;
  • Stick with a single format for it;
  • Use the most minimal format that works. Avoid thousands separators; a decimal separator should be enough for serialization and deserialization.

For developers:

  • (as suggested by @hyde below): Use an existing library for localization;
  • If you can, use native testers and specify localization/internationalization test cases, otherwise trust the library;
  • Remember that localization is a mostly solved problem. Every major language has a library, native or external, that can localize numbers, dates and times.
Machado
  • 1
    Missing item: For developers: use existing library for localization. If you can, use native testers and specify localization/internationalization test cases, otherwise trust the library. – hyde Dec 08 '16 at 06:19