VideoAgent: Personalized Synthesis of Scientific Videos
About
The technical complexity of research papers often limits their reach, necessitating more accessible formats like scientific videos to disseminate key insights through engaging narration. However, existing automated methods primarily focus on static posters or slide presentations that remain template-bound and linear. Shifting to audience-adaptive video synthesis requires addressing non-linear narrative orchestration and the joint synchronization of disparate multimodal assets. We introduce VideoAgent, a modular framework that redefines scientific video synthesis as an intent-driven planning problem. By decoupling content understanding from multimodal synthesis, VideoAgent adaptively interleaves static slides with dynamic animations to match the semantic density of the narration. We further propose SciVidEval, a benchmark evaluating multimodal quality and pedagogical utility through automated metrics and human knowledge transfer studies. Extensive experiments demonstrate that VideoAgent effectively conveys complex technical logic with high narrative fidelity and communicative impact.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video-Quiz Evaluation | SciVidEval | VLM-as-Judge Score | 99.5 | 10 |
| Visual Quality Evaluation | SciVidEval | VLM-as-Judge Score | 8.03 | 9 |
| Narration Quality Evaluation | SciVidEval | Perplexity (PPL) | 18.08 | 8 |
| Synchronization Evaluation | SciVidEval | CLIP Score | 0.635 | 7 |
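For readers unfamiliar with the Perplexity (PPL) metric listed above, the following is a minimal sketch of the standard definition: the exponential of the average negative log-likelihood per token. The function name `perplexity` and the example log-probabilities are illustrative assumptions; the source does not specify which language model or tokenization SciVidEval uses to score narration.

```python
import math


def perplexity(token_logprobs):
    """Return perplexity given per-token natural-log probabilities.

    Standard definition: exp of the mean negative log-likelihood.
    How SciVidEval obtains the log-probabilities is not stated in the
    source; any language model that scores tokens could supply them.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Hypothetical scores: every token assigned probability 0.25 by the model.
example_logprobs = [math.log(0.25)] * 4
print(perplexity(example_logprobs))
```

Lower values mean the scoring model finds the narration more predictable. In the hypothetical example, a model that assigns each token probability 0.25 yields a perplexity of about 4.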