Volunteers develop first machine translation benchmarks for 30 African languages

[Article by Wiida Fourie-Basson]

From Khoekhoegowab to Igbo and Sepedi – these are only three of the low-resource languages in Africa that a group of over 400 volunteers from more than 20 African countries are targeting to address the lack of diversity in the field of natural language processing.

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. But while Africa has more than 2000 living languages, most of these have very little data, making it difficult to develop speech and language technologies relevant to the African context. Hence the term low-resource languages.

The Masakhane project is a grassroots initiative involving a virtual community of content creators, translators, curators, language technologists and evaluators – all with the mission of addressing the lack of geographic diversity in the field. It was established in 2019 by machine learning engineer Jade Abbott during the Deep Learning Indaba held in Kenya.

Now one of the project’s research papers, on a participatory research model for low-resourced machine translation, is one of two papers to have won the inaugural Wikimedia Foundation Research Award for 2020.

Stellenbosch University’s Dr Herman Kamper, a senior lecturer in the Department of Electrical and Electronic Engineering, and Elan van Biljon, an MSc student in Computer Science, were among the 45 co-authors on the paper “Participatory research for low-resourced machine translation: a case study in African languages”, published in the Findings of the Association for Computational Linguistics: EMNLP at the end of 2020.

Dr Kamper says they specifically contributed to a machine translation system for translating English to Afrikaans, while Elan also worked on English to Sepedi and Setswana translations: “For us, it was amazing to play a small part in such a big effort, working with people from across Africa as well as established researchers such as Julia Kreutzer from Google Research, who developed some of the core code”.

In NLP terms, most African languages are classified as “The Left Behinds”, “Scraping By” or “Hopeful”. Only a few, such as Afrikaans, Kiswahili and Yoruba, find themselves in the “Rising Stars” category. There is also a lack of NLP researchers in Africa. In 2018, only five out of the 2 695 affiliations of participants in the five major NLP conferences were from African institutions.

In the paper, they describe how a participatory approach has enabled them to set machine translation benchmarks for 30 African languages. This means that, for the first time, machine translation systems have been developed to translate from English into these different languages (the way Google Translate would translate sentences from English to German), setting a benchmark and making it possible for researchers to make further improvements.

From Nigeria, volunteers are translating their own writings, including personal religious stories and undergraduate theses, into Yoruba and Igbo. This is in an effort to ensure that accessible and representative data of their culture are used to train models.

In Namibia, Jade Abbott is hosting collaborative sessions with Damara speakers to collect and translate phrases in Khoekhoegowab that reflect Damara culture around traditional clothing, songs and prayers.

Another unique feature of the participatory approach is the human evaluation of the machine translation systems developed for these languages. For example, in 2020 eleven participants volunteered to evaluate translations in their mother tongue, often involving family or friends to determine the most correct translations. Within only ten days, they gathered a total of 707 evaluated translations covering Igbo, Nigerian Pidgin, Shona, Luo, Hausa, Kiswahili, Yoruba, Fon and Dendi. This was the first time that human evaluation of a machine translation system had been performed for these languages.

In its announcement, the Wikimedia Foundation says the paper and the Masakhane community have fundamentally changed the approach to the challenge of ‘low-resourced languages’ in Africa: “The research describes a novel approach for participatory research around machine translation for African languages. The authors show how this approach can overcome the challenges these languages face to join the Web and some of the technologies other languages benefit from today.”

The Wikimedia Foundation Research team established the Wikimedia Foundation Research Award of the Year in 2021 to recognize recent research that has the potential to have significant impact on the Wikimedia projects or research in this space.

Media enquiries

Masakhane project: e-mail: masakhanetranslation@gmail.com

Dr Herman Kamper: e-mail: kamperh@sun.ac.za

Department of Electrical and Electronic Engineering, Faculty of Engineering

Stellenbosch University