imagining a localized programming language

october 31, 2021

This is a draft from a few days ago I just finished up, so uhh spooky times yeah. I might post a horror film review later today.

⁂

People rightly claim that programming is very anglocentric. From a historical perspective it makes sense: the first language designers were American or British and so used English. From that point on, if you were a programmer who could create a new language you also spoke English, and you'd base your new language on English to fit in, to the point where basically all programming is in English. Even if you don't speak English in daily life you have to use English keywords and variable/function names, read English documentation, and presumably write your own code in English to fit in with the rest of the community.

Well, recently I've been thinking of a scheme that allows for localizing not only the text output when running a program, but the source code itself. It may be completely impractical, and the burden it places on programmers and translators probably makes it unlikely to be adopted even in multilingual communities, but it's an interesting thought experiment.

For the hypothetical language I'm going to be using Ada because 1) it's on my mind right now; 2) it uses a relatively small number of symbols, and the symbols are generally used in a similar manner as written text or math rather than being arbitrary (with a few exceptions like `=>` and `<>`); and 3) it uses descriptive, non-abbreviated keywords that are used for one or two purposes only (albeit with some exceptions *cough*`with`*cough*).

Symbols used in Ada

Keywords used in Ada

This would presumably make it relatively easy to come up with a localized version of Ada. For instance, the only “enclosure” symbols used in Ada are parentheses (`(` & `)`), quotes for strings (`"`) and character literals (`'`), and the brackets around goto labels (`<<` & `>>`, which are used very rarely). This means that to localize you just need a way to enclose quoted text (`«` & `»` for example), a way to enclose asides or related information, and some leftover symbols that can be assigned to goto labels. In general keywords and symbols have only one meaning and usage (with the exceptions of parentheses and `with`, which are both heavily overused in different contexts).
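Here's a rough sketch of what that substitution could look like, written in Python rather than Ada to keep it short. Everything in it is an assumption for illustration: the Spanish keyword table is a made-up fragment, `to_ada` is a hypothetical name, and it only handles `«…»` strings, not comments or character literals.

```python
import re

# Hypothetical Spanish keyword table (a real one would cover every Ada reserved word).
KEYWORDS_ES = {
    "procedimiento": "procedure",
    "es": "is",
    "inicio": "begin",
    "fin": "end",
    "nulo": "null",
}

# A token is a «...» string literal, a word, or any other single character.
TOKEN = re.compile(r"«[^»]*»|\w+|.", re.DOTALL)

def to_ada(source, keywords=KEYWORDS_ES):
    """Translate localized keywords and string delimiters into plain Ada."""
    out = []
    for tok in TOKEN.findall(source):
        if len(tok) >= 2 and tok.startswith("«") and tok.endswith("»"):
            # String literal: swap the delimiters, leave the text alone.
            # Ada escapes a double quote inside a string by doubling it.
            out.append('"' + tok[1:-1].replace('"', '""') + '"')
        else:
            # Ada is case-insensitive, so look keywords up in lowercase.
            out.append(keywords.get(tok.lower(), tok))
    return "".join(out)
```

Feeding it `procedimiento Hola es` gives back `procedure Hola is`, while a keyword that appears inside a `«…»` string is left untouched. Going the other direction would be the same loop with the table inverted.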

There, you have a localized programming language with relatively little effort. Well… except for identifiers and comments and documentation written by the programmer. This is a lot trickier, and relies a lot on effort on the part of the programmers. While you may have volunteers who are willing to translate the software's UI, it seems unlikely that they'd be willing to translate source code stuff that isn't directly visible to the end user. Maybe if your company or project has multilingual contributors they could be expected to write it in both languages at once, but in general the source code would probably only be in one language. At least the language keywords could be localized to the same language as the identifiers in this scheme.

Setting aside the issue of identifiers for the moment, let's look at how this could be used for actual compilable code. At some point we have to turn our localized version into real code that can actually be compiled. The simplest method would be something like multi-file RCS, where you “check out” a file, which localizes it from some opaque format, and when you're done editing you “check in” your changes, which converts the file back to the opaque format. Assume the opaque format is something that uses the native keywords and symbols so it can be compiled, but uses some opaque naming scheme for identifiers.
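As a minimal sketch of that round trip, assume the opaque naming scheme is just numbered placeholders plus a table giving each one a name per language. The table, the `id_0001` format, and the `check_out` / `check_in` functions are all hypothetical, and the sketch ignores string literals and comments entirely.

```python
import re

# Hypothetical name table: opaque identifier -> name in each language.
NAMES = {
    "id_0001": {"en": "Total_Price", "es": "Precio_Total"},
    "id_0002": {"en": "Quantity", "es": "Cantidad"},
}

WORD = re.compile(r"\w+")

def check_out(stored, lang, names=NAMES):
    """Opaque form -> editable form with identifiers in the given language."""
    def repl(match):
        word = match.group(0)
        # Words with no table entry (keywords, numbers) pass through unchanged.
        return names.get(word, {}).get(lang, word)
    return WORD.sub(repl, stored)

def check_in(edited, lang, names=NAMES):
    """Editable form in the given language -> opaque form."""
    reverse = {
        entry[lang].lower(): opaque
        for opaque, entry in names.items()
        if lang in entry
    }
    # Ada identifiers are case-insensitive, so match on lowercase.
    return WORD.sub(lambda m: reverse.get(m.group(0).lower(), m.group(0)), edited)
```

With this table, checking out `id_0001 := id_0002 * 3;` as `"es"` gives `Precio_Total := Cantidad * 3;`, and checking that back in returns the original line. The hard part this skips is where the table comes from: someone still has to supply a name in each language for every new identifier.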

The RCS method sounds tedious but it allows tool integration while still being usable for people without tool integration. You could have git hooks to check in files before a commit and check out new files after a pull. Or your editor could automatically check in and out files as you open and close buffers. Or you could manually check in and out files and use ed(1) and CVS.

Ideally we'd like error messages, as well as references to source code identifiers, to be localized into the programmer's preferred language. I don't know about localizing compiler messages, but Ada has supported Unicode identifiers since Ada 2005, which means that we could leave identifiers in a language of choice and translate only the keywords back to English.

This isn't anything I'm actually working on, and it's not likely to ever exist; if it did, it probably wouldn't look much like the form I'm proposing here. Still, it's an interesting thought experiment about something that seems pretty possible and would (presumably) make programming a lot more accessible outside the anglosphere.