Don't Rule Out Monolingual Speakers: A Method For Crowdsourcing Machine Translation Data

About

High-performing machine translation (MT) systems can help overcome language barriers while making it possible for everyone to communicate and use language technologies in the language of their choice. However, such systems require large amounts of parallel sentences for training, and translators can be difficult to find and expensive. Here, we present a data collection strategy for MT which, in contrast, is cheap and simple, as it does not require bilingual speakers. Based on the insight that humans pay specific attention to movements, we use graphics interchange formats (GIFs) as a pivot to collect parallel sentences from monolingual annotators. We use our strategy to collect data in Hindi, Tamil and English. As a baseline, we also collect data using images as a pivot. We perform an intrinsic evaluation by manually evaluating a subset of the sentence pairs and an extrinsic evaluation by finetuning mBART on the collected data. We find that sentences collected via GIFs are indeed of higher quality.
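The pivot idea in the abstract can be illustrated with a minimal sketch: captions written independently by monolingual annotators are grouped by the GIF (or image) they describe, and captions in different languages that share a pivot are treated as candidate sentence pairs. The record layout, the function name `pair_by_pivot`, and the example captions are assumptions made for illustration; the abstract does not describe the authors' actual pairing or filtering procedure.

```python
from collections import defaultdict
from itertools import product


def pair_by_pivot(captions, src_lang, tgt_lang):
    """Pair captions that describe the same pivot item.

    captions: iterable of (pivot_id, lang, sentence) tuples (hypothetical layout).
    Returns a list of (src_sentence, tgt_sentence) candidate pairs.
    """
    # Group sentences first by pivot item, then by language.
    by_pivot = defaultdict(lambda: defaultdict(list))
    for pivot_id, lang, sentence in captions:
        by_pivot[pivot_id][lang].append(sentence)

    pairs = []
    for langs in by_pivot.values():
        # Cross every source caption with every target caption of the same pivot.
        # Pivots lacking one of the two languages contribute no pairs.
        pairs.extend(product(langs.get(src_lang, []), langs.get(tgt_lang, [])))
    return pairs


# Illustrative, made-up captions; not taken from the collected dataset.
captions = [
    ("gif_001", "en", "A dog jumps over a fence."),
    ("gif_001", "hi", "एक कुत्ता बाड़ के ऊपर से कूदता है।"),
    ("gif_002", "en", "A man waves his hand."),
]
pairs = pair_by_pivot(captions, "en", "hi")
```

Because the annotators never see each other's captions, pairs produced this way are only approximately parallel, which is why the abstract reports both a manual (intrinsic) and a finetuning-based (extrinsic) quality check.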

Rajat Bhatnagar, Ananya Ganesh, Katharina Kann • 2021

Related benchmarks

Task	Dataset	Result
Machine Translation (English to Hindi)	GIF (test)	BLEU 3.07	14
Machine Translation (English to Hindi)	Image (test)	BLEU 0.0195	14
Machine Translation (English to Hindi)	M20 (test)	BLEU 13.31	14
Machine Translation (English to Hindi)	All weighted average (test)	BLEU 0.0675	14
Machine Translation (Hindi to English)	GIF (test)	BLEU 16.09	14
Machine Translation (Hindi to English)	Image (test)	BLEU 12.14	14
Machine Translation (Hindi to English)	M20 (test)	BLEU 8.23	14
Machine Translation (Hindi to English)	All weighted average (test)	BLEU 10.28	14
Machine Translation	Tamil-English GIF	BLEU 9.27	10
Machine Translation	Tamil-English (All)	BLEU 6.47	10

Showing 10 of 14 rows
