How do you mix left-to-right and right-to-left scripts without your files looking crazy?

Question

Say your native language is Hebrew, and you're working in a programming language like Python 3, which lets you put Hebrew in source code. Good for you! You've got a dict:

d = {'a': 1}

and you want to replace that a with some Hebrew. So you replace that single character:

d = {'א': 1}

Uh oh. Just by replacing one character, without making any other changes, your display went crazy. Everything from the Hebrew to the 1 is backward, and it's extremely non-obvious that this is even valid syntax (it is), let alone what it means.

Hebrew is intrinsically right-to-left, and even without any invisible control characters, Hebrew text will show up right-to-left. This also applies to certain "regular" characters in positions near Hebrew, as well as characters from a few other scripts. The details are complicated.
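To see which characters are involved, you can ask Python's unicodedata module for each character's bidirectional class. This is only a minimal sketch, using the dict line from above with the Hebrew letter written as an escape so the listing itself stays left-to-right; the variable names are illustrative.

```python
import unicodedata

# The line from the question; "\u05d0" is the Hebrew letter alef.
line = "d = {'\u05d0': 1}"

# Map every distinct character to its Unicode bidirectional class.
classes = {ch: unicodedata.bidirectional(ch) for ch in line}

for ch, bidi_class in classes.items():
    print(f"U+{ord(ch):04X}  {bidi_class:<3} {unicodedata.name(ch)}")

# The alef is class R (strong right-to-left), the d is class L (strong
# left-to-right), and the digit 1 is class EN (European number).
```

The quotes, braces, colon and spaces have weak or neutral classes, which is why their displayed position depends on the strong characters around them.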

How do you deal with this? You can't stick control characters into your source code to fix the display without breaking the code. Writing everything in hex escapes trades one kind of unreadability for another. Even if you resign yourself to naming everything with characters from the Basic Latin block and sticking all Hebrew strings in localization files, it's hard to avoid mixing right-to-left text with left-to-right.
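To illustrate the point about control characters, here is a small sketch. It assumes U+200E LEFT-TO-RIGHT MARK as the example control character (other bidi controls behave similarly as far as I know): inside a string literal the mark becomes part of the data, and outside a literal Python 3 refuses to compile the line.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK, an invisible bidi control character

key = "\u05d0"          # the Hebrew letter alef
marked_key = key + LRM  # looks identical on screen, but is a different string

print(key == marked_key)  # False
print(len(marked_key))    # 2

# Placing the mark outside a string literal makes the source invalid.
source = "d = {'a':" + LRM + " 1}"
try:
    compile(source, "<example>", "exec")
    compiles_outside_literal = True
except SyntaxError:
    compiles_outside_literal = False

print(compiles_outside_literal)  # False
```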

JSON or CSV with Hebrew in it will be garbled. If those localization files you shoved your strings into were supposed to be human-readable, well, they're probably not. What do you do?

I think this is related to your code editor or IDE. The logical order of mixed English/Hebrew text is not the problem; the problem exists only in the visual display. I put your two lines of code in Visual Studio 2015 and they displayed well, meaning the Hebrew character was displayed to the left of the 1. — Afshar Mohebi, Apr 09 '16 at 05:48
@afsharm: If you put in more Hebrew, does the Hebrew show up left-to-right or right-to-left? If it's left-to-right, your Hebrew is showing up backwards, and you're in the situation an English native would be if Visual Studio displayed their strings as `'.dlrow olleH'`. If it's right-to-left, your Visual Studio is doing something weird that's neither forced left-to-right nor the proper Unicode Bidirectional Algorithm. Either case has its own sources of confusion. — user2357112, Apr 09 '16 at 06:00
@afsharm: Your profile says Iran, so you're probably way more familiar with right-to-left text than I am, though. What does it look like when you type Persian in Visual Studio? (Or have I made a bad assumption somewhere?) — user2357112, Apr 09 '16 at 06:08
You guessed correctly. My native language is Persian, which is an RTL language just like Arabic and Hebrew. Visual Studio 2015 does not mess up single-language strings. See http://tinypic.com/r/2em2137/9 But Visual Studio is not smart enough to correctly display a string that contains both RTL and LTR text. — Afshar Mohebi, Apr 09 '16 at 06:20
Other editors may or may not have better support for RTL languages. For example, Sublime does not have good support for RTL scripts by default. — Afshar Mohebi, Apr 09 '16 at 06:22

Basile Starynkevitch · Answer 1 · 2016-04-09T06:18:17.867

AFAIK, this is mostly relevant when you use non-ASCII letters in identifiers (and perhaps comments) in your code.

If you discipline yourself to avoid that, e.g. if your code uses English-looking identifiers, keywords and comments, this is much less of an issue (and every software developer should be able to read English documentation and code). Then internationalization & localization of your application happens only in messages, notably literal strings.

You could then use a message catalog. For example, in C on POSIX, you would use gettext(3) and friends. The localized message catalog contains all the localized / internationalized variants of each message. If your application is only for Hebrew users (and that is not a big market), have Hebrew only in literal strings.

To be more specific, the hello world application would contain

#include <libintl.h>  /* gettext */
#include <stdio.h>    /* printf */

void say_hello(const char *towhom) {
  printf(gettext("hello %s"), towhom);
}

and your application would customize itself at startup by calling setlocale(3) with appropriate arguments.

See locale(7). Adapt all this to Python and your operating system. Many cross-platform frameworks (e.g. Qt) have extensive support for internationalization & localization.

Of course, there is the delicate issue of displaying Unicode strings. Most serious display and GUI libraries and toolkits (Qt, GTK, ...) are able to deal with mixed-language strings (e.g. displaying something containing Hebrew, English, Russian and Chinese).

For a broader view, read the wikipage on internationalization and localization of software.

A JSON file remains valid when it contains only ASCII characters, with any other characters (which would appear only in JSON strings) written as escapes such as \u05d0 (instead of א).
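In Python, the standard json module can produce either form. A minimal sketch (the variable names are mine): the default output is ASCII-only with \uXXXX escapes, and ensure_ascii=False keeps the Hebrew character as-is; both decode to the same data.

```python
import json

# Same dict as in the question, with the key written as an escape.
data = {"\u05d0": 1}

ascii_text = json.dumps(data)                         # ensure_ascii=True is the default
readable_text = json.dumps(data, ensure_ascii=False)  # keeps the Hebrew letter

print(ascii_text)     # {"\u05d0": 1}
print(readable_text)

# Both texts decode to the same Python object.
same_data = json.loads(ascii_text) == json.loads(readable_text) == data
print(same_data)      # True
```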

Perhaps you could find a good enough editor and customize it for your needs. I'm sure that you could find some Emacs submode (or else customize one) to cover the particular issue of having Hebrew literal strings in Python (while still having English-looking identifiers and comments).

BTW, I don't know what a Hebrew keyboard looks like, but most keyboard layouts can be configured so that typing ASCII letters (i.e. Latin ones) is faster than typing non-ASCII ones. So even for yourself, it could be better to type English-looking code.

Regarding JSON data, you should be able to configure your editor to show א when a string contains \u05d0 (otherwise use a JSON converter à la jq).

So I believe your real issue is to choose a good editor and configure it well enough (while having Hebrew only inside literal strings; in the rare case where a literal string needs to contain both Hebrew and English, split it into several pieces). I guess that both Emacs and Vim could be configured to fit your needs.
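As a sketch of the "split it into several pieces" idea in Python (the names and the sample text are only illustrative), each literal holds a single script and the pieces are joined at run time, so no source line mixes directions inside one string:

```python
# Each literal contains one script only.
hebrew_part = "שלום"
english_part = " world"

# Join the pieces at run time instead of mixing scripts in one literal.
message = hebrew_part + english_part

print(len(message))  # 10
```

How the combined string is displayed to the user is still up to the bidi support of the terminal or GUI toolkit.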

It's pretty lame to have to bring in a localization framework for a monolingual program, and you've still got the problem of data files being human-unreadable. Do you just accept that data formats intended for human-readability lose that property in the face of bidirectional text? — user2357112, Apr 09 '16 at 05:41
I would say yes, but I have never coded a monolingual program for non-ASCII things. I am myself not a native English speaker (I am French), but my code is always English-like. I have to force myself to code with French identifiers, and I almost never do that (the only special case is when I am writing the code only for one particular person who does not understand English well; this happens rarely: software developers need to be able to read English documentation). — Basile Starynkevitch, Apr 09 '16 at 05:43
