VideoAgent: Personalized Synthesis of Scientific Videos

About

The technical complexity of research papers often limits their reach, necessitating more accessible formats like scientific videos to disseminate key insights through engaging narration. However, existing automated methods primarily focus on static posters or slide presentations that remain template-bound and linear. Shifting to audience-adaptive video synthesis requires addressing non-linear narrative orchestration and the joint synchronization of disparate multimodal assets. We introduce VideoAgent, a modular framework that redefines scientific video synthesis as an intent-driven planning problem. By decoupling content understanding from multimodal synthesis, VideoAgent adaptively interleaves static slides with dynamic animations to match the semantic density of the narration. We further propose SciVidEval, a benchmark evaluating multimodal quality and pedagogical utility through automated metrics and human knowledge transfer studies. Extensive experiments demonstrate that VideoAgent effectively conveys complex technical logic with high narrative fidelity and communicative impact.

Xiao Liang, Bangxin Li, Zixuan Chen, Hanyue Zheng, Zhi Ma, Di Wang, Cong Tian, Quan Wang • 2025

Related benchmarks

Task	Dataset	Result
Video-Quiz Evaluation	SciVidEval	VLM-as-Judge Score: 99.5	10
Visual Quality Evaluation	SciVidEval	VLM-as-Judge Score: 8.03	9
Narration Quality Evaluation	SciVidEval	Perplexity (PPL): 18.08	8
Synchronization Evaluation	SciVidEval	CLIP Score: 0.635	7

