The current state of the Romani language in the Google Translate program and its prospects
A couple of days ago, Google announced it had expanded the language palette in Google Translate to include 110 more languages. One is the language traditionally spoken by Romani people.
It might surprise you to learn that the Romani language is in the top 3% of the world’s most-spoken languages. At the same time, however, it is the language of a marginalized minority whose language rights are rarely upheld – anybody who doubts this should ask how many children growing up in Romani have an opportunity to spend at least their lower primary education in their native language – and it is a language that is either being lost under pressure from various social factors, or was already lost at some point in the past, depending on the group of speakers. This rather large but endangered language is finally represented in a dignified way in the online public space. The symbolic value of the fact that we find the Romani language among the offerings from one of the most-used online translation programs is hard to overstate. It’s not surprising that enthusiasm for this news predominated among my Romani and Romani Studies friends from the Czech Republic and Slovakia when it was announced. However, those who have tested the program in more detail are starting to express doubts about its quality.
I have to agree with them. Currently, the Google translations into Romani are of quite limited use for Czech and Slovak Roma and others interested in Romani as it is spoken here, and I would not hesitate to characterize the translations from “our” Romani into Czech and other languages as mostly a lost cause. A simple sentence meaning “I have a dog”, which in different varieties of Czech and Slovak Romani can be either hin man rikono, hin man rukono, si ma žukel or hi man džukel, is translated into Czech by Google as meaning, respectively, “who is rich”, “who will buy”, “I feel like” and the meaningless “hi man jukele”. Since Google Translate works rather respectably with translations from Czech into larger languages (and vice versa), why is this the case?
To answer this question, one must bear in mind what the basic methodology of Google Translate is. Its translations, like the translations of most online translation programs today, are not based on predetermined descriptions of the vocabulary and grammatical rules of the language being translated, but come from an extensive corpus of texts, including bilingual ones, on which the translation model is trained. What is essential, therefore, is how much text data (and from which sources) the program has available to it for any given language. The quantity and selection of the Romani texts from which the program learns determine the accuracy and quality of its translations.
The linguistic consultants to Google are very well aware of the dialectal variations within languages, and Romani is even mentioned as an outstanding example of this in the company’s announcement. We learn there that the program was trained predominantly on texts from what is called Southern Vlax Romani (spoken by a significant proportion of Romani people, above all in the countries of the former Yugoslavia and elsewhere in the Balkans, and the variety of Romani used most online, according to Google), as well as texts from some other dialectal groups of Romani. However, Central Romani, used by most Romani speakers in the Czech Republic and Slovakia, is not among the groups explicitly mentioned. In this context, it is necessary to recall that the dialects of Romani are really quite distinct in terms of vocabulary and cannot just be considered deviations from some literary, standard variant of the language, as we sometimes consider the dialects of Czech, for example. There is no unified literary Romani or pan-Romani standard. The regional Romani standards which are gradually coming into being are based precisely on the different dialects of Romani. Romani Studies scholar Milena Hübschmannová spoke of this as a polycentric model of standardization.
The consequence of Google Translate using data from several Romani dialects is the significant lexical variation of its translations. If you want to translate the Czech word “voják” [soldier] in the singular into Romani, the program offers you xelavdo (an originally secret word meaning “rinsed out, washed away”), which is being used by the authors of some of the entries in the Romani version of Wikipedia. However, if you want to translate this word in the plural, you get ketani, which is a loan word from Old Romanian (cătană) characteristic of the Vlax dialects. The Google translation of “with a soldier” is e soldatosa, containing a form of the word soldato, a loan word from either modern Romanian or a dialect of Serbo-Croatian. Google’s translations into Romani are thus, generally speaking, dialectally inconsistent, and we also encounter sentence translations containing elements from several different dialects of Romani, elements which would never coexist in any actual Romani speech.
Let’s look at the other direction of translation, from Romani into Czech. The words for “soldier” which are commonly used in “our” Romani are unknown to Google Translate: the Eastern Slovak Romani slugaďis or slugadžis is translated as “služebník” [servant], the Central Slovak Romani and Southern Slovak Romani lukesto is translated as “rychlost” [speed], the Western Slovak Romani sasos is translated as “byl” [he was], lurdo from the indigenous, now extinct Czech Romani is translated as “hmotnost” [weight], and the Vlax Romani ketana, paradoxically, is not translated at all, but simply reproduced in the Czech output as ketana. The program also does not cope very well with interdialectal variants: While the Vlax form milaj is properly translated as meaning “léto” [summer], the forms ňilaj, linaj and ľinaj, ordinarily used in Central Romani, are translated as meaning “s pozdravem” [sincerely], “čára” [dash or line] and “snížit” [to decrease, lower, reduce]. This is not surprising: In the Romani texts on which the program was trained, these incorrectly translated words were either never present at all or were quite rare.
It is well known that programs such as Google Translate are the least successful at translating words on their own, or at translating conjugated or declined words out of context. To test the success of Google Translate in Romani when translating longer segments of text, I randomly chose 10 sentences from our dialectological corpus of Central Romani in the variety spoken by Romani people in a village near Prešov and had the program translate them. Here is the result:
The correct translation of each Romani sentence is on the left, while the machine translation from Google Translate is on the right:
- “All his children are baptized.” > “All his children are baptized.” [Translator’s Note: The adjective in Czech has a different ending in the machine version.]
- “He walks to work.” > “He enters professional work on foot.”
- “That food won’t make you full.” > “Black food does not cause hunger.”
- “The tree is next to the house.” > “The stream flows around the house.”
- “That person was a good musician.” > “That person has a good orator.”
- “I don’t go to mass on Sundays.” > “I haven’t walked in a week.”
- “Bend over for that money!” > “Banjo for all money!”
- “Anybody could say that.” > “Oda can say hociko.”
- “Is anybody home at noon?” > “Is anybody home po dilu?”
- “They buried him with his violin.” > “They changed him and put a lavutu in him.”
These sentences are ordered roughly according to how successful Google Translate was. Only the first sentence was translated in a completely adequate way. The second one at least captures the fundamental meaning of the sentence, although it requires a significant degree of tolerance from the user in terms of style. In the rest of the cases, Google Translate coped adequately with some parts of the sentences, but overall the meaning of the machine-translated sentences does not correspond to the original sentences. In sentences 7 through 10, Google Translate left the Romani words in the Czech translation or translated them just on the basis of their formal similarity (those words are in italics), and such translations will be hard to understand for people who do not speak “our” version of Romani. Again, this is not surprising: The sentences in the variant of Romani used here contain words which the program most probably never encountered in its training.
It is also well known that when Google Translate works between two relatively smaller languages such as Czech and Romani, it does so through English. This is confirmed by the example of the Romani sentence given above, hi man džukel [I have a dog]. The program does not know how to translate this into English, so it just reproduces a form of it with a different spelling, hi man jukel. In the subsequent translation into Czech, the now “English” words hi man are rendered as ahoj chlape [hi, man], while the “English” jukel is taken to be part of the greeting and is put into the vocative form jukele of a non-existent “Czech” word jukel. We also encounter English in this intermediary role when we translate in the other direction: The Czech phrases “děkuju ti” [Thank you, informal, singular] and “děkuju vám” [Thank you, either formal address in the singular or addressing more than one person] are both translated into (Vlax) Romani as nais tuqe, which is a form that cannot be used if we are thanking more than one person; tuqe, which is usually written tuke in “our” Romani, means “tobě” in Czech (you, informal, singular), not “vám” (you, plural, or formal address, singular). The reason is obvious: The English you expresses no difference in terms of number (or formality) and so the distinction which exists in both Czech and Romani is lost in translation.
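The loss of the singular/plural distinction can be illustrated with a minimal sketch of pivot translation. The two lookup tables below are toy assumptions built only from the phrases quoted in this article; they say nothing about how Google's system is actually implemented, and the function name pivot_translate is hypothetical.

```python
# Toy lookup tables built from the phrases quoted in the article.
# They are an illustrative assumption, not Google's data or method.
CZECH_TO_ENGLISH = {
    "děkuju ti": "thank you",   # informal singular
    "děkuju vám": "thank you",  # plural, or formal singular
}
ENGLISH_TO_ROMANI = {
    "thank you": "nais tuqe",   # singular form, as reported in the article
}


def pivot_translate(czech_phrase):
    """Translate Czech -> English -> Romani using the toy tables above."""
    english = CZECH_TO_ENGLISH[czech_phrase]
    return ENGLISH_TO_ROMANI[english]


if __name__ == "__main__":
    for phrase in CZECH_TO_ENGLISH:
        print(phrase, "->", pivot_translate(phrase))
```

Because both Czech phrases collapse into the same English string, the second step has no way of recovering which one was meant, whatever the Romani table contains.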
In light of the dialectal inconsistency of the Romani data in Google Translate, the inconsistency of its spelling can be seen as a somewhat marginal problem. The program works only with texts in the Latin alphabet, even though many varieties of Romani are usually written in Cyrillic and, more rarely, in other scripts. While the Romani translation of the Czech sentence “chci jít do města” [I want to go to town], kamav te ʒav anθ-o foro, is written in the exoticizing, reprehensible graphemes used in the late 20th century by the Romani Studies scholar Marcel Courthiade, which are used most often today in Romania only, Google’s Romani translation of the negative version of this sentence, “nechci jít do města” [I do not want to go to town], chi kamav te zhav ando foro, contains the English-inspired spellings of ch for č and zh for ž, which again are used primarily in the official writing of the Vlax dialect in Hungary.
What, then, does Google Translate offer to a speaker of “our” Romani languages and to those who are learning them, and what can we expect from it in the future?
The fact that the corpus Google used for Romani has not yet included a larger number of texts in any dialect of either Czech or Slovak Romani means that the program is suited at most to random, unsystematic familiarization with the grammatical structures and vocabularies of other varieties of Romani, mostly Vlax in terms of dialect and mostly Balkan in terms of geography. The program does not offer systematic recognition of the variability of Romani, which can be acquired only by studying the dialects of Romani in the many existing dictionaries and grammars or, naturally, directly from speakers of these dialects. The conclusion that is most essential to draw from all of the above is this: In the context of the Czech Republic and Slovakia, it would be more than foolish to want to use Google Translate during instruction, study, or for translation, whether official or otherwise.
The significant error rate when translating from “our” dialects of Romani into Czech and other languages can be eliminated over time rather easily by expanding the corpus in Google for Romani to include many texts in those dialects. However, such data enrichment in and of itself cannot contribute to the program’s usefulness, in a didactic sense or any other, for translation from Czech or Slovak into the Romani spoken here. Given that texts in different dialects of Romani are not differentiated within the corpus used by Google, the program will always offer the most frequent variants of constructions and words, so the translations will still offer dialectal hybrids, and that means it will remain practically unusable in the context of the Czech Republic and Slovakia.
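The mechanism behind such hybrids can be shown with a minimal sketch. It assumes, as a simplification, that a system trained on a pooled corpus prefers whichever form of a word is most frequent overall. The word forms and dialect labels come from this article; the counts are invented purely for illustration, and the function name pooled_choice is hypothetical.

```python
# Hypothetical counts of word forms in a pooled, dialect-blind corpus.
# The forms and dialect labels follow the article; the counts are invented.
POOLED_COUNTS = [
    # (concept, form, dialect, count)
    ("soldier", "ketana", "Vlax", 40),
    ("soldier", "slugaďis", "Central", 25),  # Eastern Slovak form, grouped under Central here
    ("summer", "milaj", "Vlax", 10),
    ("summer", "ňilaj", "Central", 30),
]


def pooled_choice(concept):
    """Return the most frequent (form, dialect) for a concept, ignoring dialect."""
    candidates = [
        (count, form, dialect)
        for c, form, dialect, count in POOLED_COUNTS
        if c == concept
    ]
    _count, form, dialect = max(candidates)
    return form, dialect


if __name__ == "__main__":
    for concept in ("soldier", "summer"):
        print(concept, "->", pooled_choice(concept))
```

With these invented counts, “soldier” comes out in its Vlax form and “summer” in its Central form, so a single output sentence would mix two dialects even though the Central data are present in the corpus.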
The diversity of the dialects of Romani in grammatical and lexical terms roughly corresponds to the diversity of the languages in the Slavic family. We can imagine an analogous hypothetical situation: Somebody who wanted to learn Czech or to translate a CV into Czech would look up “Slavic” in Google Translate, if the program did not offer Czech directly, would input an English sentence, and would get a translation including three Russian words, two Polish or Ukrainian words, and maybe one Bulgarian, or Czech, or Serbo-Croatian word.
The only practical solution to this problem is to divide the Romani corpus into smaller sections, i.e., subcorpora for the individual dialects of Romani, and basically to separate Romani, for the purposes of translation, into more than one language. Given the enormous diversity of Romani, it is not at all absurd to consider it not a single language with strong dialectal differentiation, but a group of a dozen or more closely related languages (which is how Glottolog, the brilliant catalogue of the languages of the world, conceives of Romani). However, such questions have not just a linguistic dimension, but also an essentially cultural one, and require difficult decisions which are political in nature, ideally to be made by elites who are Romani themselves.
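What such a division would change can be sketched as follows, assuming every text in the corpus carried a dialect label. The word forms are taken from this article, the counts are invented for illustration, and the record layout and the function name split_by_dialect are hypothetical.

```python
from collections import defaultdict

# Hypothetical dialect-labelled records: (dialect, concept, form, count).
# The forms follow the article; the counts are invented for illustration.
RECORDS = [
    ("Vlax", "soldier", "ketana", 40),
    ("Central", "soldier", "slugaďis", 25),
    ("Vlax", "summer", "milaj", 10),
    ("Central", "summer", "ňilaj", 30),
]


def split_by_dialect(records):
    """Build one small lexicon per dialect, keeping the most frequent form in each."""
    best = defaultdict(dict)  # dialect -> concept -> (form, count)
    for dialect, concept, form, count in records:
        current = best[dialect].get(concept)
        if current is None or count > current[1]:
            best[dialect][concept] = (form, count)
    # Drop the counts and return plain dialect -> concept -> form mappings.
    return {
        dialect: {concept: form for concept, (form, _count) in concepts.items()}
        for dialect, concepts in best.items()
    }


if __name__ == "__main__":
    for dialect, lexicon in split_by_dialect(RECORDS).items():
        print(dialect, lexicon)
```

Each subcorpus now yields only forms from its own dialect, which is the property a learner or translator working with one variety of Romani needs.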
If, on the other hand, the corpus in Google for Romani remains undivided in the future, we should not underestimate its potential for standardization through its dialectally hybrid translations. These might, among users of the program, create a notion of what an “international” Romani would sound like, a form that speakers could then try out in more global contexts, for example, during transnational communication online in Romani. For this reason, it would also be favorable for speakers of “our” Romani if texts in the Czech and Slovak dialects of Romani became part of the corpus for Google and helped shape the form of this potential online standard.
In conclusion, I would note one more thing. At the start of this article I mentioned that Czech and Slovak Roma mostly welcome the inclusion of Romani among the languages available through Google Translate. Voices saying the opposite are also being heard, mainly from Romani people abroad in some Vlax groups. According to them – as the Romani vlogger Florian from Romania says – Romani should stay a language that non-Roma do not understand, and they are afraid that this cryptic function of Romani is threatened by the existence of the program. I even noted the call of one Argentinian Romani linguist for Romani speakers to stop using Romani on social media entirely so as not to provide data for the program’s training. The Sinti, whose dialect of Romani is generally strictly understood by community members as a secret language, have little to worry about: The program cannot cope with Sinti at all, because the corpus at Google does not contain a significant amount of Sinti texts.