WikiVideo: Article Generation from Multiple Videos
About
We introduce the task of grounded article generation with the goal of creating a Wikipedia-style article from multiple diverse videos about real-world events -- from natural disasters to political elections -- where all the information in the article is supported by video evidence. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text while existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.
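The iterative interaction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract only states that a reasoning model and a VideoLLM interact iteratively. The function name `collaborative_generation`, the callable interfaces, the `"DONE"` stop signal, and the `max_rounds` parameter are all hypothetical.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
#   video_llm(video_id, prompt) -> text describing the video
#   reasoner(event, notes)      -> a follow-up question, or "DONE" to stop
#   writer(event, notes)        -> the final article text
VideoLLM = Callable[[str, str], str]
Reasoner = Callable[[str, List[str]], str]
Writer = Callable[[str, List[str]], str]


def collaborative_generation(
    event: str,
    videos: List[str],
    video_llm: VideoLLM,
    reasoner: Reasoner,
    writer: Writer,
    max_rounds: int = 3,
) -> str:
    """Alternate between querying videos and reasoning over the collected notes."""
    notes: List[str] = []
    prompt = f"Describe what this video shows about: {event}"
    for _ in range(max_rounds):
        # Ask the VideoLLM the current question for every video.
        for vid in videos:
            notes.append(f"[{vid}] {video_llm(vid, prompt)}")
        # Let the reasoning model decide whether more detail is needed.
        follow_up = reasoner(event, notes)
        if follow_up == "DONE":
            break
        prompt = follow_up
    # Compose the article from the accumulated, video-attributed notes.
    return writer(event, notes)
```

Tagging each note with its video identifier is one simple way to keep every statement traceable to a source video, in the spirit of the grounding requirement above.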
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Article Generation | MAGMaR oracle (leaderboard snapshot) | Human Preference Score | 3.09 | 15 |
| Multi-video Grounding and Retrieval | MAGMaR Oracle Track 2026 (val) | Human Evaluation Score | 3.088 | 11 |
| Citation Quality Evaluation | WikiVideo 1.0 (test) | CITEP R | 0.00e+0 | 4 |