# CipherDAug: Ciphertext-based Data Augmentation for Neural Machine Translation

## About
We propose a novel data augmentation technique for neural machine translation based on ROT-$k$ ciphertexts. ROT-$k$ is a simple letter-substitution cipher that replaces each letter in the plaintext with the $k$th letter after it in the alphabet. We first generate multiple ROT-$k$ ciphertexts, using different values of $k$, from the plaintext, which is the source side of the parallel data. We then leverage this enciphered training data along with the original parallel data via multi-source training to improve neural machine translation. Our method, CipherDAug, uses a co-regularization-inspired training procedure, requires no external data sources beyond the original training data, and uses a standard Transformer to outperform strong data augmentation techniques on several datasets by a significant margin. The technique combines easily with existing approaches to data augmentation and yields particularly strong results in low-resource settings.
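The enciphering step itself is straightforward. Below is a minimal sketch of ROT-$k$ and of generating several ciphertext copies of a source sentence; the function names and the choice of keys `(1, 2)` are illustrative, not the paper's exact implementation.

```python
import string


def rot_k(text: str, k: int) -> str:
    """Encipher text with ROT-k: replace each letter with the letter
    k positions later in the alphabet (wrapping around). Case is
    preserved and non-letter characters are left unchanged."""
    k %= 26
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    return text.translate(table)


def augment_source(sentence: str, keys=(1, 2)) -> list[str]:
    """Generate one ROT-k ciphertext per key. Each ciphertext is paired
    with the unchanged target sentence, enlarging the parallel data."""
    return [rot_k(sentence, k) for k in keys]


print(rot_k("Hello world", 1))  # → "Ifmmp xpsme"
print(augment_source("data"))   # → ["ebub", "fcvc"]
```

Because ROT-$k$ is a bijection on the alphabet, the enciphered source keeps the token-level structure of the original sentence, which is what lets the model align plaintext and ciphertext views of the same data during multi-source training.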
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Machine Translation | WMT En-De 2014 (test) | BLEU | 27.9 | 379 |
| Machine Translation | IWSLT De-En 2014 (test) | BLEU | 37.53 | 146 |
| Machine Translation | IWSLT En-De 2014 (test) | BLEU | 30.65 | 92 |
| Machine Translation | IWSLT De-En 14 | BLEU | 37.53 | 33 |
| Machine Translation | IWSLT17 En-Fr (test) | BLEU | 41.44 | 18 |
| Machine Translation | sk-en (test) | BLEU | 32.62 | 15 |
| Machine Translation | IWSLT Fr-En 2017 (test) | BLEU | 40.35 | 14 |
| Machine Translation | TED low-resource En-Sk (test) | BLEU | 24.61 | 7 |
| Machine Translation | TED Sk-En | BLEU | 32.62 | 5 |
| Machine Translation | TED En-Sk | BLEU | 24.61 | 4 |