Drupal is a lot easier to localize than it used to be, but it’s still hard work, partly because localization is intrinsically difficult and partly because of flaws in the design of Drupal’s internationalization approach.
In localizing a site there are four main areas to deal with:
- Human-readable formats for data items such as dates and numbers.
- The (generally short) text strings used commonly throughout the user interface, for such things as labels on form buttons.
- Major items of content such as blog articles, informational pages and static blocks.
- Theme content. To accommodate different local conventions – and quite often to make room for longer text strings – the layout of pages and their components may need to be adjusted.
This list is not exhaustive, and here I only want to talk about the second item (and to some extent the third). Drupal provides for the internationalization of such strings with the t() function, which is meant to be called whenever a module outputs an English text string. It takes as arguments a string for translation and, optionally, an array of values to be substituted into it. For example, you might call it like this:
t('My hat it has @count corners', array('@count'=>$number_of_corners));
The English result when $number_of_corners is 3 would be “My hat it has 3 corners” whereas the German translation might be “Mein Hut er hat 3 Ecken”.
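To make the substitution step concrete, here is a minimal sketch of the idea in plain PHP. It is not Drupal's implementation: sketch_t() and the $translations lookup table are hypothetical stand-ins so that the example runs without Drupal. The HTML escaping of @-prefixed values is my understanding of how t() treats such placeholders, simplified here.

```php
<?php
// Hypothetical stand-in for t(); NOT Drupal's actual implementation.
// $translations is an assumed lookup table: English string => translated string.
function sketch_t($string, array $args = array(), array $translations = array()) {
  // Translate the whole English string; fall back to English if no entry exists.
  $translated = isset($translations[$string]) ? $translations[$string] : $string;

  // Escape each value before substituting it for its placeholder.
  $safe = array();
  foreach ($args as $placeholder => $value) {
    $safe[$placeholder] = htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
  }

  // Nothing to substitute: return the (possibly translated) string as-is.
  return $safe ? strtr($translated, $safe) : $translated;
}

$german = array(
  'My hat it has @count corners' => 'Mein Hut er hat @count Ecken',
);

$number_of_corners = 3;
echo sketch_t('My hat it has @count corners', array('@count' => $number_of_corners)), "\n";
// prints "My hat it has 3 corners"
echo sketch_t('My hat it has @count corners', array('@count' => $number_of_corners), $german), "\n";
// prints "Mein Hut er hat 3 Ecken"
```

Note that the placeholder survives translation untouched, which is what lets a translator move it to wherever the target language needs it.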
If you are familiar with software localization such as that provided in Java, you’ll have spotted a problem right away with this function: it doesn’t make allowance for plurals. If you have a phrase that should vary depending on whether the value to be substituted is singular, plural or zero, then you must test for the different cases in your own code, rather than the t() function doing the job for you as the Java MessageFormat class does.
Edit: as the commenter below points out, the format_plural() function is designed to solve the problem of plurals and does so very well. Sorry, my mistake!
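For readers who haven't met it, Drupal's format_plural() is called as format_plural($count, $singular, $plural). The sketch below is a hypothetical stand-in, sketch_format_plural(), that mimics only the English behaviour so it can run without Drupal. The real function also hands the strings to the translation system, which this sketch makes no attempt to reproduce.

```php
<?php
// Hypothetical stand-in for the English-only behaviour of format_plural().
// The real Drupal function also passes the strings through translation.
function sketch_format_plural($count, $singular, $plural) {
  // English has two forms: exactly one, and everything else (including zero).
  $template = ($count == 1) ? $singular : $plural;
  // Only the plural template is expected to contain the @count placeholder.
  return strtr($template, array('@count' => (string) $count));
}

foreach (array(0, 1, 3) as $number_of_corners) {
  echo sketch_format_plural(
    $number_of_corners,
    'My hat it has 1 corner',
    'My hat it has @count corners'
  ), "\n";
}
// prints:
// My hat it has 0 corners
// My hat it has 1 corner
// My hat it has 3 corners
```

The singular string spells out "1" rather than using @count, which keeps it obvious to a translator that this is the singular form.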
Another criticism of the t() function is that it gives you no means of providing context that could be used to resolve ambiguity. Here’s the problem: suppose you use two modules that both output the word “Store”. One is an e-commerce module that uses the term for an online shop; the other manages some form of data repository and uses it as the label on a button that stores a record.
The t() function is unable to distinguish between these cases, so you can’t translate them differently even though they are quite different meanings of the word (one a noun and the other a verb, for a start). The module writer has to be alert to such possibilities, which is quite a heavy burden when design and coding are hard enough as it is. In my experience, programmers who are sensitive even to questions of English usage are rare enough, so expecting them to fully consider the needs of those who speak other languages is asking a lot.
So, how might the t() function be improved for better localization? An optional, more sophisticated version with similar power to Java’s MessageFormat class would be helpful, but for the problems arising from ambiguity I’d like to suggest a new function, called perhaps t1(). This would take two arguments, the first being a context and the second the string to translate. The context could then be used in localization whenever two cases need to be distinguished, so its value would need to be meaningful and guidelines would need to be developed. Perhaps module name + one of several predefined constant values would cover most common situations.
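As a rough illustration of the t1() idea, here is a sketch of a context-aware lookup. Everything in it is an assumption made for the example: the name sketch_t1(), the "module:role" context format, the German renderings, and the third parameter, which passes the translation table in explicitly only so the code is self-contained.

```php
<?php
// Hypothetical sketch of the proposed t1(): context first, then the string.
// $translations is an assumed table: context => (English string => translation).
// The empty-string context '' holds context-free translations.
function sketch_t1($context, $string, array $translations = array()) {
  // Prefer a translation registered for this exact context.
  if (isset($translations[$context][$string])) {
    return $translations[$context][$string];
  }
  // Otherwise fall back to a context-free translation, if there is one.
  if (isset($translations[''][$string])) {
    return $translations[''][$string];
  }
  // No translation at all: return the English source string.
  return $string;
}

// Assumed context naming: module name plus a predefined role.
$german = array(
  'commerce:noun'     => array('Store' => 'Geschäft'),
  'repository:action' => array('Store' => 'Speichern'),
);

echo sketch_t1('commerce:noun', 'Store', $german), "\n";     // prints "Geschäft"
echo sketch_t1('repository:action', 'Store', $german), "\n"; // prints "Speichern"
echo sketch_t1('forum:noun', 'Store', $german), "\n";        // prints "Store"
```

Falling back to a context-free entry means translators would only need to supply context-specific entries for the strings that are actually ambiguous.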
Any other thoughts on the subject?
(Footnote: the languages I have worked with include Japanese and Welsh, though I don’t speak either. Every situation has its own interesting challenges, not least the thorny one of character encoding, especially a few years ago when Unicode was less commonly supported.)