HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization
About
Neural extractive summarization models usually employ a hierarchical encoder for document encoding, and they are trained using sentence-level labels, which are created heuristically using rule-based methods. Training the hierarchical encoder with these \emph{inaccurate} labels is challenging. Inspired by the recent work on pre-training transformer sentence encoders \cite{devlin:2018:arxiv}, we propose {\sc Hibert} (as shorthand for {\bf HI}erarchical {\bf B}idirectional {\bf E}ncoder {\bf R}epresentations from {\bf T}ransformers) for document encoding and a method to pre-train it using unlabeled data. We apply the pre-trained {\sc Hibert} to our summarization model, and it outperforms its randomly initialized counterpart by 1.25 ROUGE on the CNN/Dailymail dataset and by 2.0 ROUGE on a version of the New York Times dataset. We also achieve state-of-the-art performance on these two datasets.
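The two-level encoding idea described above — a sentence-level encoder that turns each sentence's tokens into one vector, followed by a document-level encoder that contextualizes those sentence vectors before scoring each sentence for extraction — can be sketched minimally. Below, a single-head self-attention layer stands in for the full Transformer blocks; all dimensions, the shared parameters, and the sigmoid scoring head are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence of vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
# Toy document: 3 sentences, each a sequence of 5 token embeddings.
doc = [rng.normal(size=(5, d)) for _ in range(3)]

# Illustrative simplification: one set of attention weights shared by both levels.
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

# Level 1: sentence encoder -> one vector per sentence (mean-pool attended tokens).
sent_vecs = np.stack([self_attention(S, Wq, Wk, Wv).mean(axis=0) for S in doc])

# Level 2: document encoder contextualizes sentence vectors against each other.
doc_ctx = self_attention(sent_vecs, Wq, Wk, Wv)

# Extractive head: one inclusion probability per sentence (random weights, illustrative).
w_cls = rng.normal(size=d)
probs = 1.0 / (1.0 + np.exp(-(doc_ctx @ w_cls)))
print(probs.shape)  # one score per sentence
```

In training, each per-sentence probability would be matched against the heuristic sentence-level label; pre-training instead fits the two encoders on unlabeled documents before this supervised step.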
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text Summarization | CNN/Daily Mail (test) | ROUGE-2 19.95 | 65 |
| Summarization | CNN/DM | ROUGE-1 42.37 | 56 |
| Extractive Summarization | CNN/Daily Mail (test) | ROUGE-1 30 | 36 |
| Extractive Summarization | NYT50 (test) | ROUGE-1 49.47 | 21 |
| Summarization | CNNDM full-length F1 (test) | ROUGE-1 42.37 | 19 |
| Summarization | CNN/Daily Mail full length (test) | ROUGE-1 42.37 | 18 |
| Extractive Summarization | CNN-DM (test) | ROUGE-1 42.37 | 18 |
| Document Classification | MIND (test) | Accuracy 0.8189 | 12 |
| Document Classification | IMDB (test) | Accuracy 52.96 | 10 |
| Summarization | PubMed Short | ROUGE-1 42.03 | 6 |
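Most results in the table are ROUGE scores. As a quick reference for reading them, ROUGE-1 F1 measures clipped unigram overlap between a generated summary and a reference summary; ROUGE-2 does the same over bigrams. A minimal ROUGE-1 F1 sketch (the function name and tokenization are ours, not from an official implementation):

```python
from collections import Counter

def rouge_1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 5 of 6 unigrams match (with "the" counted twice), so P = R = F1 = 5/6.
score = rouge_1_f1("the cat is on the mat", "the cat sat on the mat")
print(round(score, 4))  # 0.8333
```

Reported scores (e.g. 42.37 above) are conventionally this value scaled to 0–100 and averaged over the test set.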