# CipherDAug: Ciphertext-based Data Augmentation for Neural Machine Translation

## About
We propose a novel data augmentation technique for neural machine translation based on ROT-$k$ ciphertexts. ROT-$k$ is a simple letter-substitution cipher that replaces each letter in the plaintext with the $k$th letter after it in the alphabet. We first generate multiple ROT-$k$ ciphertexts, using different values of $k$, from the plaintext, which is the source side of the parallel data. We then leverage this enciphered training data along with the original parallel data via multi-source training to improve neural machine translation. Our method, CipherDAug, uses a co-regularization-inspired training procedure, requires no external data sources beyond the original training data, and uses a standard Transformer to outperform strong data augmentation techniques on several datasets by a significant margin. The technique combines easily with existing approaches to data augmentation and yields particularly strong results in low-resource settings.
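The enciphering step itself is straightforward. Below is a minimal sketch of ROT-$k$ and of generating several ciphertext copies of a source sentence; the function names and the choice of keys `(1, 2)` are illustrative, not the paper's exact implementation.

```python
import string


def rot_k(text: str, k: int) -> str:
    """Encipher text with ROT-k: replace each letter with the letter
    k positions later in the alphabet (wrapping around). Case is
    preserved and non-letter characters are left unchanged."""
    k %= 26
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    return text.translate(table)


def augment_source(sentence: str, keys=(1, 2)) -> list[str]:
    """Generate one ROT-k ciphertext per key. Each ciphertext is paired
    with the unchanged target sentence, enlarging the parallel data."""
    return [rot_k(sentence, k) for k in keys]


print(rot_k("Hello world", 1))  # → "Ifmmp xpsme"
print(augment_source("data"))   # → ["ebub", "fcvc"]
```

Because ROT-$k$ is a bijection on the alphabet, the enciphered source keeps the token-level structure of the original sentence, which is what lets the model align plaintext and ciphertext views of the same data during multi-source training.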
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Machine Translation | WMT En-De 2014 (test) | BLEU | 27.9 | 379 |
| Machine Translation | IWSLT De-En 2014 (test) | BLEU | 37.53 | 146 |
| Machine Translation | IWSLT En-De 2014 (test) | BLEU | 30.65 | 92 |
| Machine Translation | IWSLT De-En 14 | BLEU | 37.53 | 33 |
| Machine Translation | IWSLT17 En-Fr (test) | BLEU | 41.44 | 18 |
| Machine Translation | sk-en (test) | BLEU | 32.62 | 15 |
| Machine Translation | IWSLT Fr-En 2017 (test) | BLEU | 40.35 | 14 |
| Machine Translation | TED low-resource En-Sk (test) | BLEU | 24.61 | 7 |
| Machine Translation | TED Sk-En | BLEU | 32.62 | 5 |
| Machine Translation | TED En-Sk | BLEU | 24.61 | 4 |