Wikipedia was founded with the aim of making knowledge freely available around the world — but right now, it’s mostly making it available in English. The English Wikipedia is the largest edition by far, with 5.5 million articles, and only 15 of the 301 editions have more than a million. The quality of those articles can vary drastically, with vital content often entirely missing. Two hundred and six editions are missing an article on the emotional state of happiness and just under half are missing an article on Homo sapiens.
It seems like the perfect problem for machine translation tools, and in January, Google partnered with the Wikimedia Foundation to solve it, incorporating Google Translate into the Foundation’s own content translation tool, which uses open-source translation software. But for the editors who work on non-English Wikipedia editions, the content translation tool has been more of a curse than a blessing, renewing debate over whether Wikipedia should be in the business of machine translation at all.
Available as a beta feature, the content translation tool lets editors generate a preview of a new article based on an automated translation from another edition. Used correctly, the tool can save valuable time for editors building out understaffed editions — but when it goes wrong, the results can be disastrous. One global administrator pointed to a particularly atrocious translation from English to Portuguese. What is “village pump” in the English version became “bomb the village” when put through machine translation into Portuguese.
“People take Google Translate to be flawless,” said the administrator, who asked to be referred to by their Wikipedia username, Vermont. “Obviously it isn’t. It isn’t meant to be a replacement for knowing the language.”
Those shoddy machine translations have become such a problem that some editions have created special admin rules just to stamp them out. The English Wikipedia community adopted a temporary “speedy deletion” criterion solely to allow administrators to delete “any page created by the content translation tool prior to 27 July 2016,” so long as no version exists in the page history that is not machine-translated. The name of this “exceptional circumstances” speedy deletion criterion is “X2. Pages created by the content translation tool.”
The Wikimedia Foundation, which administers Wikipedia, defended the tool when reached for comment, emphasizing that it is just one tool among many. “The content translation tool provides critical support to our editors,” a representative said, “and its impact extends even beyond Wikipedia in addressing the broader, internet-wide challenge of the lack of local language content online.”
That may be surprising if you’ve seen headlines in recent years about AI reaching “parity” with human translators. But those stories usually refer to narrow, specialized tests of machine translation’s abilities, and when the software is actually deployed in the wild, the limitations of artificial intelligence become clear. As Douglas Hofstadter, professor of cognition at Indiana University Bloomington, spelled out in an influential article on the topic, AI translation is shallow. It produces text with surface-level fluency that usually misses the deeper meaning of words and sentences. AI systems learn to translate by studying statistical patterns in large bodies of training data, which leaves them blind to nuances of language that appear only rarely in that data, and lacking the common sense of a human translator.
The result for Wikipedia editors is a major skills gap: machine translation requires close supervision by translators who themselves have a solid command of both the source and target languages. That’s a real problem for smaller Wikipedia editions that are already strapped for volunteers.
Guilherme Morandini, an administrator on the Portuguese Wikipedia, often sees users open articles in the content translation tool and immediately publish them to another language edition without any review. In his experience, the result is shoddy translation or outright nonsense, a disaster for the edition’s credibility as a source of information. Reached by The Verge, Morandini pointed to an article about Jusuf Nurkić as an example, machine translated into Portuguese from its English equivalent. The first line, “... é um Bósnio profissional que atualmente joga ...,” translates directly to “... is a professional Bosnian that currently plays ...,” as opposed to the English version’s “... is a Bosnian professional basketball player.”
The Indonesian Wikipedia community has gone so far as to formally request that the Wikimedia Foundation remove the tool from the edition. Based on that discussion thread, the Foundation appears reluctant to do so, and it has overruled community consensus in the past. Privately, some editors told The Verge they fear this could turn into a replay of the 2014 Media Viewer dispute, which caused significant distrust between the Foundation and the community-led editions it oversees.
Wikimedia described that response in more positive terms. “In response to community feedback, we made adjustments and received positive feedback that the adjustments we made were effective,” a representative said.
João Alexandre Peschanski, a professor of journalism at Faculdade Cásper Líbero in Brazil who teaches a course on Wikiversity, is another critic of the current machine translation system. Peschanski says “a community-wide strategy to improve machine learning should be discussed, as we might be losing efficiency by what I would say is a rather arduous translation endeavor.” Translation tools “are key,” and in Peschanski’s experience they work “fairly well.” The main problems, he says, stem from inconsistent templates used in articles. Ideally, those templates hold repetitive material needed across many articles or pages, often shared between language editions, making the text easier to parse automatically.
Peschanski views translation as an activity of reuse and adaptation, where reuse between language editions depends on whether content is present on another site. But adaptation means bringing a “different cultural, language-specific background” into the translation before continuing. A broader possible solution would be to enact some sort of project-wide policy banning machine translations without human supervision.
Most of the users The Verge interviewed for this article preferred to combine manual translation with machine translation, using the latter only to look up specific words. All agreed with Vermont’s statement that “machine translation will never be a viable way to make articles on Wikipedia, simply because it cannot understand complex human phrases that don’t translate between languages,” though most also agreed that it has its uses.
Faced with those obstacles, smaller projects may always have a lower standard of quality when compared to the English Wikipedia. Quality is relative, and unfinished or poorly written articles are impossible to stamp out completely. But that disparity comes with a real cost. “Here in Brazil,” Morandini says, “Wikipedia is still regarded as non-trustworthy,” a reputation that isn’t helped by shoddily done translations of English articles. Both Vermont and Morandini agree that, in the case of pure machine translation, the articles in question are better off deleted. In too many cases, they’re simply “too terrible to keep.”
James Vincent contributed additional reporting to this article.
Disclosure: Kyle Wilson is an administrator on the English Wikipedia and a global user renamer. He does not receive payment from the Wikimedia Foundation nor does he take part in paid editing, broadly construed.
5/30 9:22AM ET: Updated to include comment from the Wikimedia Foundation.