Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization
About
This paper presents Z-Code++, a new pre-trained language model optimized for abstractive text summarization. The model extends the state-of-the-art encoder-decoder model using three techniques. First, we use a two-phase pre-training process to improve the model's performance on low-resource summarization tasks. The model is first pre-trained on text corpora for language understanding, and then continually pre-trained on summarization corpora for grounded text generation. Second, we replace the self-attention layers in the encoder with disentangled attention layers, where each word is represented by two vectors that encode its content and position, respectively. Third, we use fusion-in-encoder, a simple yet effective method for encoding long sequences in a hierarchical manner. Z-Code++ creates a new state of the art on 9 out of 13 text summarization tasks across 5 languages. Our model is parameter-efficient in that it outperforms the 600x larger PaLM-540B on XSum, and the fine-tuned 200x larger GPT3-175B on SAMSum. In zero-shot and few-shot settings, our model substantially outperforms the competing models.
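The fusion-in-encoder idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the chunking scheme, and the `local_encode`/`global_encode` callables are all hypothetical stand-ins for the model's local and global encoder layers.

```python
def fusion_in_encoder(tokens, chunk_size, local_encode, global_encode):
    """Hierarchical encoding sketch: encode fixed-size chunks locally,
    then fuse the concatenated chunk representations with a global pass.

    tokens        -- list of token representations (the long input sequence)
    chunk_size    -- number of tokens per local chunk (hypothetical parameter)
    local_encode  -- callable applied to each chunk independently
    global_encode -- callable applied once to the fused sequence
    """
    # Split the long input into contiguous chunks.
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    # Encode each chunk on its own (attention stays local, so cost stays
    # linear in the number of chunks rather than quadratic in sequence length).
    fused = []
    for chunk in chunks:
        fused.extend(local_encode(chunk))
    # A single global pass over the concatenated chunk outputs lets
    # information flow across chunk boundaries.
    return global_encode(fused)
```

The appeal of this scheme is that only the (shorter) global pass attends across the full sequence, which is what makes it practical for long-document tasks such as arXiv summarization.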
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 96.5 | 504 |
| Natural Language Understanding | GLUE (test) | SST-2 (Acc) | 97.9 | 416 |
| Summarization | XSum (test) | ROUGE-2 | 24.7 | 231 |
| Dialogue Summarization | SAMSum (test) | ROUGE-2 | 30.3 | 80 |
| Natural Language Generation | E2E (test) | ROUGE-L | 54 | 79 |
| Abstractive Dialogue Summarization | SAMSum (test) | ROUGE-L | 43.9 | 53 |
| Multi-document Summarization | Multi-News (test) | ROUGE-2 | 21.6 | 45 |
| Abstractive Summarization | XSum (test) | ROUGE-L | 33.6 | 44 |
| Summarization | Newsroom (test) | ROUGE-2 | 33.1 | 40 |
| Long Document Summarization | arXiv (test) | ROUGE-2 | 22.5 | 24 |