Multimodal Abstractive Summarization for How2 Videos
About
In this paper, we study abstractive summarization for open-domain videos. Unlike traditional text news summarization, the goal is not so much to "compress" text information as to provide a fluent textual summary of information that has been collected and fused from different source modalities, in our case video and audio transcripts (or text). We show how a multi-source sequence-to-sequence model with hierarchical attention can integrate information from different modalities into a coherent output, compare various models trained with different modalities, and present pilot experiments on the How2 corpus of instructional videos. We also propose a new evaluation metric (Content F1) for the abstractive summarization task that measures the semantic adequacy of the summaries rather than their fluency, the latter being covered by metrics such as ROUGE and BLEU.
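To make the fusion mechanism concrete, the following is a minimal PyTorch sketch of hierarchical attention over two encoders (e.g. transcript text and video features): a first attention layer produces one context vector per modality, and a second attention layer weighs the modalities against each other. The class name, dot-product scoring, and dimensions here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAttention(nn.Module):
    """Two-level attention over per-modality encoder states (a sketch,
    not the paper's exact architecture)."""

    def __init__(self, dec_dim, enc_dims):
        super().__init__()
        # First level: one projection per modality; projected states serve
        # as both keys and values in this simplified version.
        self.proj = nn.ModuleList([nn.Linear(d, dec_dim) for d in enc_dims])
        # Second level: scores the per-modality context vectors.
        self.modality_score = nn.Linear(dec_dim, 1)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim)
        # enc_states: list of (batch, len_m, enc_dim_m), one per modality
        contexts = []
        for proj, states in zip(self.proj, enc_states):
            keys = proj(states)                          # (batch, len, dec_dim)
            scores = keys @ dec_state.unsqueeze(-1)      # (batch, len, 1)
            alpha = F.softmax(scores, dim=1)             # within-modality weights
            contexts.append((alpha * keys).sum(dim=1))   # (batch, dec_dim)
        ctx = torch.stack(contexts, dim=1)               # (batch, n_mod, dec_dim)
        beta = F.softmax(
            self.modality_score(torch.tanh(ctx + dec_state.unsqueeze(1))), dim=1
        )                                                # across-modality weights
        return (beta * ctx).sum(dim=1)                   # fused context vector

attn = HierarchicalAttention(dec_dim=512, enc_dims=[512, 2048])
text = torch.randn(4, 100, 512)    # transcript encoder states (toy shapes)
video = torch.randn(4, 50, 2048)   # video feature encoder states
fused = attn(torch.randn(4, 512), [text, video])  # -> (4, 512)
```

The fused vector would then feed the decoder at each step, letting the model shift weight between modalities as it generates the summary.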
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Abstractive Summarization | How2 (test) | Content F1 | 48.9 | 18 |
| Multimodal Abstractive Summarization | How2 (test) | ROUGE-1 | 60.3 | 13 |
| Multimodal Abstractive Text Summarization | How2 300h (test) | ROUGE-1 | 48.4 | 9 |
| Multimodal Summarization | How2 | ROUGE-1 | 60.3 | 6 |
| Video Summarization | How2 1.0 (test) | INF | 3.89 | 3 |
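The Content F1 entry above rewards overlap of content words between a system summary and the reference, ignoring function words, rather than the fluency-oriented matching of ROUGE or BLEU. The sketch below is a simplified bag-of-content-words approximation of that idea; the toy stopword list and the exact tokenization are illustrative stand-ins for the paper's actual filtering and alignment.

```python
from collections import Counter

# Toy stopword list for illustration; the real metric filters function
# words and task-specific stop phrases before scoring.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "and", "is", "this", "video"}

def content_f1(reference: str, hypothesis: str) -> float:
    """Bag-of-content-words F1 between a reference and a generated summary."""
    ref = Counter(w for w in reference.lower().split() if w not in STOPWORDS)
    hyp = Counter(w for w in hypothesis.lower().split() if w not in STOPWORDS)
    overlap = sum((ref & hyp).values())  # clipped content-word matches
    if not overlap:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Example: "tie" and "knot" match; function words are ignored.
print(round(content_f1("learn to tie a knot", "how to tie the knot"), 2))  # 0.67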