Efficient Attentions for Long Document Summarization
About
The quadratic computational and memory complexities of large Transformers have limited their scalability for long document summarization. In this paper, we propose Hepos, a novel efficient encoder-decoder attention with head-wise positional strides to effectively pinpoint salient information from the source. We further conduct a systematic study of existing efficient self-attentions. Combined with Hepos, we are able to process ten times more tokens than existing models that use full attentions. For evaluation, we present a new dataset, GovReport, with significantly longer documents and summaries. Results show that our models produce significantly higher ROUGE scores than competitive comparisons, including new state-of-the-art results on PubMed. Human evaluation also shows that our models generate more informative summaries with fewer unfaithful errors.
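The head-wise positional strides described above can be illustrated with a minimal sketch. The abstract does not spell out the exact rule, so the code assumes that with stride `s`, head `h` attends only to source positions `j` where `j % s == h % s`. The names `hepos_positions` and `strided_cross_attention` are hypothetical, the example handles one decoder query for one head, and it assumes the source has at least `stride` tokens.

```python
import math

def hepos_positions(head, src_len, stride):
    # Assumed offset rule: head h keeps source positions j with j % stride == h % stride.
    return list(range(head % stride, src_len, stride))

def strided_cross_attention(query, keys, values, head, stride):
    """Encoder-decoder attention for one decoder query and one head.

    query: list of d floats; keys and values: one vector per source token.
    Only the source positions assigned to this head are scored.
    """
    kept = hepos_positions(head, len(keys), stride)
    d = len(query)
    # Scaled dot-product scores over the kept positions only.
    scores = [sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(d) for j in kept]
    # Numerically stable softmax.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors at the kept positions.
    dim_v = len(values[0])
    return [sum(w * values[j][c] for w, j in zip(weights, kept)) for c in range(dim_v)]
```

Under this assumption each head scores roughly `src_len / stride` keys instead of all of them, while heads with different offsets together still cover every source position.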
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Summarization | arXiv (test) | ROUGE-1 | 48.24 | 161 |
| Summarization | PubMed (test) | ROUGE-1 | 48.12 | 107 |
| Document Summarization | GovReport (test) | ROUGE-1 | 56.9 | 50 |
| Long document summarization | arXiv (test) | ROUGE-2 | 20.3 | 24 |
| Abstractive Summarization | arXiv (test) | ROUGE-1 | 48.24 | 20 |
| Summarization | arXiv original (test) | ROUGE-1 | 48.24 | 18 |
| Document Summarization | GovReport | ROUGE-1 | 56.86 | 15 |
| Abstractive Summarization | PubMed (test) | ROUGE-1 | 48.12 | 11 |
| Long document summarization | PubMed (test) | ROUGE-1 | 47.93 | 7 |
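
For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated summary and a reference. The sketch below is a simplified illustration, not the official scorer: it tokenizes on whitespace, lowercases, and skips stemming. It returns an F1 value in [0, 1]; the table's scores are presumably the same quantity scaled to 0-100, though the page does not say so. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```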