DOCmT5: Document-Level Pretraining of Multilingual Language Models
About
In this paper, we introduce DOCmT5, a multilingual sequence-to-sequence language model pretrained with large-scale parallel documents. While previous approaches have focused on leveraging sentence-level parallel data, we aim to build a general-purpose pretrained model that can understand and generate long documents. We propose a simple and effective pretraining objective, Document reordering Machine Translation (DrMT), in which input documents that have been shuffled and masked must be translated. DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks, including over 12 BLEU points for seen-language-pair document-level MT, over 7 BLEU points for unseen-language-pair document-level MT, and over 3 ROUGE-1 points for seen-language-pair cross-lingual summarization. We achieve state-of-the-art (SOTA) results on the WMT20 De-En and IWSLT15 Zh-En document translation tasks. We also conduct extensive analysis of various factors in document pretraining, including (1) the effects of pretraining data quality and (2) the effects of combining monolingual and cross-lingual pretraining. We plan to make our model checkpoints publicly available.
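To make the DrMT objective concrete, the following is a minimal sketch of how one training pair might be constructed from a parallel document. It is an illustration under stated assumptions, not the authors' implementation: the function name `make_drmt_example`, the sentence-level shuffle, the whitespace tokenization, the 15% mask rate, and the T5-style `<extra_id_N>` sentinel tokens are all hypothetical choices that the abstract does not specify.

```python
import random


def make_drmt_example(src_sentences, tgt_sentences, mask_rate=0.15, seed=0):
    """Build one (input, target) pair in the spirit of DrMT.

    Assumptions (not taken from the paper): shuffling happens at the
    sentence level, tokens are whitespace-separated, and each run of
    masked tokens is replaced by a single T5-style sentinel token.
    """
    rng = random.Random(seed)

    # Step 1: shuffle the order of the source-language sentences.
    shuffled = list(src_sentences)
    rng.shuffle(shuffled)

    # Step 2: mask tokens, collapsing consecutive masked tokens
    # into a single sentinel.
    noisy = []
    sentinel_id = 0
    prev_masked = False
    for tok in " ".join(shuffled).split():
        if rng.random() < mask_rate:
            if not prev_masked:
                noisy.append(f"<extra_id_{sentinel_id}>")
                sentinel_id += 1
            prev_masked = True
        else:
            noisy.append(tok)
            prev_masked = False

    # Step 3: the target is the translated document in its original order,
    # so the model has to reorder, fill in, and translate at once.
    return " ".join(noisy), " ".join(tgt_sentences)


if __name__ == "__main__":
    src = ["Das ist gut.", "Es regnet heute.", "Wir gehen nach Hause."]
    tgt = ["That is good.", "It is raining today.", "We are going home."]
    model_input, model_target = make_drmt_example(src, tgt)
    print(model_input)
    print(model_target)
```

The target side is left untouched in this sketch, which is what distinguishes the objective from plain span corruption: the noise is applied only to the source document, and the supervision signal is the clean translation.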
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Document-Level Machine Translation | TED15 Zh-En 2010-2013 (test) | d-BLEU 31.4 | 16 |
| Document-Level Machine Translation | WMT20 De-En (test) | d-BLEU 44.73 | 12 |
| Cross-lingual Summarization | Wikilingua Ru-En Seen Languages GEM | ROUGE-1 33.56 | 10 |
| Cross-lingual Summarization | Wikilingua Tr-En Seen Languages GEM | ROUGE-1 37.66 | 10 |
| Cross-lingual Summarization | Wikilingua Vi-En Seen Languages GEM | ROUGE-1 33.29 | 10 |
| Cross-lingual Summarization | Wikilingua Es-En Seen Languages GEM | ROUGE-1 36.79 | 10 |
| Cross-lingual Summarization | Wikilingua Fr-En | ROUGE-1 36.28 | 9 |
| Cross-lingual Summarization | Wikilingua Id-En | ROUGE-1 35.15 | 9 |
| Cross-lingual Summarization | Wikilingua Hi-En | ROUGE-1 34.16 | 9 |
| Machine Translation | WMT20 JA-EN (test) | BLEU 19.17 | 8 |