Best practices for internationalization: composed sentences?

Question

I am working on a project where clients are able to create objects in a database. Each of these objects has a description string that describes the object. Let's assume we are looking at an object that represents a car:

A red car manufactured by BMW with 62000 miles
A pickup-truck manufactured by Dodge from 2010
A car with 5 seats

The "car" class has different attributes, and not all of them are mandatory. For example:

type of car: car, sedan, pickup truck, SUV
mileage
brand
seats
year
number of previous owners

The description sentence should contain this information. E.g. if we know the number of seats, this information should be part of the sentence, otherwise it shouldn't. If we only had to do this in one language, it would not be too complicated. We would just need to analyze the structure of the sentence and compose it as follows:

A [{color}] {type of car} [manufactured by {brand name}] [with {miles} miles] [from {year}] [with {seats} seats]

Parts in [...] are included in the final sentence only if the attribute (in {...}) is set.
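
For a single language, the composition described above could be sketched in Ruby roughly as follows. The method name describe_car and the attribute keys are hypothetical, chosen only for illustration, and the sketch assumes that only the type of car is always present.

```ruby
# Hypothetical helper: builds the English description from a hash of
# optional attributes. Only :type is assumed to be mandatory.
def describe_car(attrs)
  parts = ["A"]
  parts << attrs[:color] if attrs[:color]
  parts << attrs.fetch(:type)
  parts << "manufactured by #{attrs[:brand]}" if attrs[:brand]
  parts << "with #{attrs[:miles]} miles" if attrs[:miles]
  parts << "from #{attrs[:year]}" if attrs[:year]
  parts << "with #{attrs[:seats]} seats" if attrs[:seats]
  parts.join(" ")
end

puts describe_car(color: "red", type: "car", brand: "BMW", miles: 62000)
# prints: A red car manufactured by BMW with 62000 miles
```

The fixed order of the fragments is exactly what stops this approach from carrying over to other languages.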

However, this project needs to support several languages, and we need a fast way of translating this. This means that we can't just translate "manufactured by" and all the other sentence elements into each language and then compose the sentence with the same structure, because different languages might have a very different sentence structure. Obviously, we could translate each combination of elements separately, but the effort quickly becomes too high, as the number of combinations can be huge (we have objects with 10 or more attributes).

What is the recommended way of dealing with this kind of scenario?

The project is implemented in Ruby on Rails, so I am ideally looking for an approach that supports this.

score 12 · Answer 1 · answered May 24 '19 at 15:03

If you want internationalization that isn't infuriating/weird/offensive/confusing to users, you do one of two things: you take the entire block out and have the copy translated in one go per item, or you remove the narrative sentence structure you propose that includes variable attributes so that you completely avoid differences in language structure.

The most common way this is done is that you have a general product blurb, written in natural language, which is translated into every language you support. It contains no information that varies by product class, so if you must mention some options, phrase them like "available in 2-door or 4-door models..." (but generally you avoid this when possible).

Then you use the attribute section as basically key-value pairs, so you have something like this:

Attribute: Value
Color: Red
Doors: 4
Year: 3000

The reason this is so popular is precisely because your proposed place-holder solution simply does not work across languages. You can beat your head against it and be surprised when native speakers of other languages don't appreciate your broken translation, or just hope they still are willing to buy it anyway, but natural language is hard and it is better to either do it right (fluent translator working with a single set block of text) or avoid the problem (key-value is not natural language anywhere, it is easier to scan and compare products with, and it allows you to perform word-level translation far more easily into the majority of languages you are likely to encounter).
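
A minimal Ruby sketch of the key-value approach, assuming a hypothetical in-code translation table (in a Rails project the labels and values would more likely live in per-locale resource files). The names TRANSLATIONS and attribute_lines and the German wording are illustrative assumptions, not part of any library.

```ruby
# Hypothetical translation tables: one set of labels and one set of
# translatable values per locale.
TRANSLATIONS = {
  en: {
    labels: { color: "Color", seats: "Seats", year: "Year" },
    values: { red: "Red" }
  },
  de: {
    labels: { color: "Farbe", seats: "Sitze", year: "Baujahr" },
    values: { red: "Rot" }
  }
}.freeze

# Returns one "Label: value" line per attribute that is set.
# Symbol values are looked up in the value table; numbers pass through.
def attribute_lines(attrs, locale)
  table = TRANSLATIONS.fetch(locale)
  attrs.compact.map do |key, value|
    label = table[:labels].fetch(key)
    shown = value.is_a?(Symbol) ? table[:values].fetch(value) : value
    "#{label}: #{shown}"
  end
end

puts attribute_lines({ color: :red, seats: 5, year: nil }, :de)
# prints:
# Farbe: Rot
# Sitze: 5
```

Each label and value is translated on its own, so no grammar has to be shared between the pieces.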

score 6 · Answer 2 · edited Jun 16 '20 at 10:01

Basic principles

The sentence structure and grammatical subtleties can indeed make the internationalization exercise difficult.

The core practices that you'll need to use in all the cases are:

Separate the text data from the code. Put it in a separate resource file that you can give to translators.
For numeric values, have some code that can make unit conversions (e.g. miles to kilometers).
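
As an illustration of the unit-conversion point, here is a small Ruby sketch. It assumes distances are stored in miles and that each locale is mapped to a display unit; the constant names, the locale mapping, and the method localized_distance are all hypothetical.

```ruby
KM_PER_MILE = 1.609344

# Hypothetical per-locale settings: which distance unit to display.
DISTANCE_UNIT = { en: :mi, fr: :km, de: :km }.freeze
UNIT_LABEL = { mi: "miles", km: "km" }.freeze

# Distances are assumed to be stored in miles and converted for display.
def localized_distance(miles, locale)
  unit = DISTANCE_UNIT.fetch(locale)
  value = unit == :km ? (miles * KM_PER_MILE).round : miles
  "#{value} #{UNIT_LABEL.fetch(unit)}"
end

puts localized_distance(62000, :en) # prints: 62000 miles
puts localized_distance(62000, :fr) # prints: 99779 km
```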

List of attributes

The easiest way is certainly to list the attributes and their values, as proposed in Brian's answer.

The drawback is that this presentation takes up a lot of space on the screen.

Classified style

Another alternative could be to create a classified-like string that simply concatenates the different values, without even naming the attributes. This works well if the values could not lead to confusion.

Here you need another good practice: for every error message or text assembly, do not hard-code the order of parameters, but use a multilingual string that contains named placeholders. The order of the placeholders will of course depend on the language and local usage.

English example               French example                  German example
---------------               --------------                  --------------
BMW car, red, 62000 miles     Berline BMW, rouge, 99000 km    BMW Pkw, rot, 99000 km
Dodge pickup-truck, 2010      Pick-up Dodge, 2010             Dodge Pick-up, 2010
Car, 5 seats                  Berline, 5 sièges               Pkw, 5 Sitze

{brand}{type}{miles}{seats}   {type}{brand}{miles}{seats}     {brand}{type}{miles}{seats}

It's more complex but there are a couple of advantages:

It's not difficult to generate;
It's much more compact, which makes it easy for a user to browse through a list;
It's much more accessible for visually impaired users who need to use a screen reader;
It's more voice-assistant friendly than a complex screen layout.
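
A rough Ruby sketch of the classified style. Because every attribute is optional, this version uses a per-locale ordered list of attribute names instead of a single template string, which is a simplification of the named-placeholder idea. HEAD_ORDER, TAIL_ORDER and classified are hypothetical names, and the values are assumed to be already translated and converted.

```ruby
# Hypothetical per-locale order of the leading words (brand and type).
HEAD_ORDER = {
  en: %i[brand type],
  fr: %i[type brand],
  de: %i[brand type]
}.freeze

# Remaining attributes, appended as a comma-separated list.
TAIL_ORDER = %i[color distance year seats].freeze

# values: hash of already-localized strings; missing attributes are skipped.
def classified(values, locale)
  head = HEAD_ORDER.fetch(locale).map { |key| values[key] }.compact.join(" ")
  tail = TAIL_ORDER.map { |key| values[key] }.compact
  line = [head, *tail].reject(&:empty?).join(", ")
  line.empty? ? line : line[0].upcase + line[1..-1]
end

puts classified({ brand: "BMW", type: "car", color: "red", distance: "62000 miles" }, :en)
# prints: BMW car, red, 62000 miles
puts classified({ brand: "BMW", type: "berline", color: "rouge", distance: "99000 km" }, :fr)
# prints: Berline BMW, rouge, 99000 km
```

Only the ordering table changes per language; the assembly code stays the same.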

Sentences

Generating full sentences is one level further in complexity. But depending on your requirements, this could be a must-have feature (e.g. generation of contracts, voice assistant applications, ...).

So here, in addition to the multilingual string with the placeholders in the right order, and unless you have some sophisticated grammar-aware text generator, you'll need to foresee more vocabulary, in order to cope with:

Singular (car) vs. plural (cars)
Grammatical gender: many languages have a gender. The car type in French could be masculine ("le pick-up") or feminine ("la berline"). In German you even have three genders (masculine, feminine and neuter). With such languages, the right order of words is not sufficient, and using the English word in a localisation API to find the localized equivalent is no longer a solution either. Here you must have text generation code that copes with these grammatical constraints (e.g. in French: "le pick-up bleu" vs. "la berline bleue")

For simple sentences, or self-description of objects, it is then sufficient to manage the multilingual resources with additional attributes:

type:  car      ->  French: gender=F;  (berline, S), (berlines, P)
type:  pick-up  ->  French: gender=M;  (pick-up, S), (pick-ups, P)
color: red      ->  French: (rouge, M, S), (rouge, F, S), (rouges, M, P), (rouges, F, P)
color: blue     ->  French: (bleu, M, S), (bleue, F, S), (bleus, M, P), (bleues, F, P)

So depending on the context, you may find out if it's singular or plural. Then you need to deduce the gender to be used (the type will tell you). Then for the remaining values, you'll pick the word with the known attributes combination.
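
The lookup just described could be sketched in Ruby like this. The lexicon tables and the method french_noun_phrase are hypothetical, and the sketch deliberately ignores further French rules (elision, adjectives that precede the noun, the treatment of zero), so it only covers the simple cases listed above.

```ruby
# Hypothetical French lexicon: nouns carry a gender and number forms,
# adjectives and articles are keyed by [gender, number].
NOUNS = {
  car:    { gender: :f, s: "berline", p: "berlines" },
  pickup: { gender: :m, s: "pick-up", p: "pick-ups" }
}.freeze

ADJECTIVES = {
  red:  { [:m, :s] => "rouge", [:f, :s] => "rouge", [:m, :p] => "rouges", [:f, :p] => "rouges" },
  blue: { [:m, :s] => "bleu", [:f, :s] => "bleue", [:m, :p] => "bleus", [:f, :p] => "bleues" }
}.freeze

DEFINITE_ARTICLE = {
  [:m, :s] => "le", [:f, :s] => "la", [:m, :p] => "les", [:f, :p] => "les"
}.freeze

# Number comes from the context (count), gender from the noun; the
# adjective and article are then picked by that combination.
def french_noun_phrase(type, color, count)
  number = count == 1 ? :s : :p
  noun = NOUNS.fetch(type)
  key = [noun[:gender], number]
  [DEFINITE_ARTICLE.fetch(key), noun.fetch(number), ADJECTIVES.fetch(color).fetch(key)].join(" ")
end

puts french_noun_phrase(:pickup, :blue, 1) # prints: le pick-up bleu
puts french_noun_phrase(:car, :blue, 1)    # prints: la berline bleue
```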

Caution: If you want to construct more sophisticated sentences, you'll quickly face huge complexity (e.g. using the car's self-description in a sentence can be extremely difficult in languages like German, which use declensions that may require every word to be fine-tuned depending on the grammatical function of the group). And every language might have different rules. In that case, an NLP translation service may be worth considering.

The larger the text block you give to your translation team, the better the results, and the smaller the text, the worse. It is already difficult enough to generate a well-formed natural-language sentence from arbitrary templated elements "plugged in" at run-time when you are working in a single language. It becomes a very, *very* hard problem when working with multiple languages and not for any one reason alone. Grammatical concord in inflected languages is staggeringly complex even for languages like English or Romance without case markings in nouns, adjectives, or articles. [....] — tchrist, May 25 '19 at 15:03
Getting even "simple" languages you know well right is hard, like ***a** choice* but ***an** option* in English. The moment you add a new target language for translation, you find problems you never dreamt of having to provide for in your existing model and so must extend that model's complexity. Consider how German distinguishes grammatical case not merely in pronouns like English or Romance but also in nouns, adjectives, and articles. So now you have to add case markings (nominative, accusative, dative, genitive, etc.) to your model. — tchrist, May 25 '19 at 15:12
Another easy example of extreme failure in templating can be seen when trying to translate *“You’ve chosen three incompatible options”* versus *“The three options you've chosen are incompatible”* into French. The French word you'll need for *chosen* is spelt differently in those two sentences because of the preceding direct object in one sentence but not the other. You can't know how to inflect that in the first scenario with a preceding direct object until you know the number and gender of the word for *option/options*. Countless complexities like these make this a nearly impossible problem. — tchrist, May 25 '19 at 15:19
@tchrist all your remarks are perfectly valid. I do not recommend an automated multilingual text generator that solves all the cases. This is far too difficult already in one language; huge research teams are working on automated translation tools, but still some languages remain resistant. What I advocate here is a simple class output that describes the object, at the level of simplicity of the question asked, in a narrowly defined domain. This is something that most programmers are able to do in their native languages. A little generalisation allows targeting more languages (not all) — Christophe, May 25 '19 at 15:28
