Unsupervised Summarization Re-ranking
About
With the rise of task-specific pre-training objectives, abstractive summarization models like PEGASUS offer appealing zero-shot performance on downstream summarization tasks. However, the performance of such unsupervised models still lags significantly behind that of their supervised counterparts. As in the supervised setup, we observe very high variance in quality among the summary candidates these models generate, even though only one candidate is kept as the summary output. In this paper, we propose to re-rank summary candidates in an unsupervised manner, aiming to close the performance gap between unsupervised and supervised models. Our approach improves the unsupervised PEGASUS by up to 7.27% and ChatGPT by up to 6.86% relative mean ROUGE across four widely adopted summarization benchmarks, and achieves an average relative gain of 7.51% (up to 23.73%, from XSum to WikiHow) over 30 zero-shot transfer setups (fine-tuning on one dataset and evaluating on another).
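The abstract does not specify the scoring function used for re-ranking, so the following is only a minimal sketch of what unsupervised candidate re-ranking can look like, not the authors' method. It assumes a hypothetical criterion: each candidate is scored by a ROUGE-1-style unigram F1 against the *source document* (no reference summary is needed), and candidates are sorted best first. It also assumes that "mean ROUGE" means the average of ROUGE-1, ROUGE-2 and ROUGE-L, and that a relative gain is `(new - base) / base`. The names `unigram_f1`, `rerank`, `mean_rouge` and `relative_gain` are illustrative.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 over lowercased whitespace tokens (no stemming)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection: each shared token counts min(cand, ref) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rerank(source: str, candidates: list[str]) -> list[str]:
    """Sort candidates by similarity to the source document, best first.

    Hypothetical unsupervised criterion (an assumption, not the paper's):
    only the source is used, never a gold summary.
    """
    return sorted(candidates, key=lambda c: unigram_f1(c, source), reverse=True)


def mean_rouge(r1: float, r2: float, rl: float) -> float:
    """Assumed definition: plain average of ROUGE-1, ROUGE-2 and ROUGE-L."""
    return (r1 + r2 + rl) / 3


def relative_gain(base: float, new: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


if __name__ == "__main__":
    source = "the cat sat on the mat near the door"
    candidates = ["a dog ran in the park", "the cat sat on the mat", "cat near door"]
    for c in rerank(source, candidates):
        print(f"{unigram_f1(c, source):.3f}  {c}")
    print(relative_gain(20.0, 21.0))
```

Running the script prints the candidates in the order "the cat sat on the mat", "cat near door", "a dog ran in the park", followed by `5.0`. A pure source-overlap score like this one tends to favour long, extractive candidates, which is one reason a practical unsupervised re-ranker would likely combine several signals rather than rely on a single overlap measure.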
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Abstractive Summarization | XSum (test) | ROUGE-L | 14.93 | 44 |
| Abstractive Summarization | WikiHow | ROUGE-2 | 7.26 | 26 |
| Abstractive Summarization | Xsum | ROUGE-1 | 27.98 | 18 |
| Abstractive Summarization | CNN/DM | ROUGE-1 | 42.05 | 14 |
| Unsupervised abstractive summarization | CNN-DM (test) | ROUGE-1 | 39.76 | 12 |
| Summarization | CNN/DM human evaluation | Informational Content Score | 24 | 4 |
| Unsupervised abstractive summarization | WikiHow (test) | ROUGE-1 | 0.265 | 4 |
| Unsupervised abstractive summarization | SamSum (test) | ROUGE-1 | 28.91 | 4 |
| Abstractive Summarization | SamSum (test) | factCC | 96.28 | 2 |