Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning

About

Unified video models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: (1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and (2) the models' reasoning and editing capabilities are inherently disconnected, which prevents understanding from guiding the editing process. To address these issues, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Aware Video Editing (RAVE) and In-Context Video-to-Video Generation (ICVG), spanning diverse reasoning dimensions across both editing and generation scenarios. Building upon this foundation, we propose ReViSE, a self-reflective learning framework that harnesses the model's internal VLM to evaluate and refine its own generation during training. Unlike prior reward-based approaches that rely on external critics, ReViSE uses this internal VLM as a self-reflective evaluator whose differentiable feedback directly refines the generator's reasoning behavior. Extensive experiments on RVE-Bench show that ReViSE enhances editing accuracy and visual fidelity, outperforming the finetuned counterpart by 10% in Overall score on the RAVE subset, which demonstrates the effectiveness of the self-reflective differentiable reward.
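The idea of a differentiable reward can be illustrated with a deliberately tiny toy, which is not the ReViSE implementation. It assumes a one-parameter "generator", a frozen scoring function standing in for the internal evaluator, and a hand-derived gradient. All names (`generate`, `self_reward`, `reward_grad_theta`, `train`) and all constants are hypothetical. In a real system the evaluator would be a VLM and the gradient would come from automatic differentiation.

```python
import math


def generate(theta, x):
    """Toy 'generator' (hypothetical): one learnable parameter scales the input."""
    return theta * x


def self_reward(y, target):
    """Toy frozen 'evaluator' (hypothetical): score in (0, 1], maximal when y == target."""
    return math.exp(-(y - target) ** 2)


def reward_grad_theta(theta, x, target):
    """d(reward)/d(theta), obtained with the chain rule through the generator.

    reward = exp(-(y - target)^2) with y = theta * x, so
    d(reward)/d(theta) = reward * (-2 * (y - target)) * x.
    """
    y = generate(theta, x)
    return self_reward(y, target) * (-2.0 * (y - target)) * x


def train(theta=0.0, x=1.0, target=0.5, lr=0.5, steps=200):
    """Gradient ascent on the reward: the evaluator's score directly updates the generator."""
    for _ in range(steps):
        theta += lr * reward_grad_theta(theta, x, target)
    return theta


if __name__ == "__main__":
    theta = train()
    print("learned theta:", theta)
    print("final reward:", self_reward(generate(theta, 1.0), 0.5))
```

The point of the sketch is only that the reward signal reaches the generator through a gradient instead of through sampling-based policy updates. How ReViSE combines this signal with its other training objectives is not specified in the abstract.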

Xinyu Liu, Hangjie Yuan, Yujie Wei, Jiazheng Xing, Yujin Han, Jiahao Pan, Yanbiao Ma, Chi-Min Chan, Kang Zhao, Shiwei Zhang, Wenhan Luo, Yike Guo • 2025

Related benchmarks

Task                             | Dataset                                  | Metric        | Result | Rank
In-context Video Generation      | RVE-Bench Camera Reasoning               | ViCLIPT       | 0.2193 | 5
In-context Video Generation      | RVE-Bench Causal Reasoning               | ViCLIPT       | 0.2259 | 5
In-context Video Generation      | RVE-Bench Emotional Reasoning            | ViCLIPT       | 0.2181 | 5
In-context Video Generation      | RVE-Bench Commonsense Reasoning          | ViCLIPT Score | 0.226  | 5
Reasoning-Informed Video Editing | RVE-Bench Temporal Reasoning             | ViCLIPT Score | 0.1684 | 5
Reasoning-Informed Video Editing | RVE-Bench Causal Reasoning               | ViCLIPT Score | 0.1758 | 5
Reasoning-Informed Video Editing | RVE-Bench Spatial Reasoning              | ViCLIPT       | 0.1734 | 5
Reasoning-Informed Video Editing | RVE-Bench Commonsense Reasoning          | ViCLIPT       | 0.1826 | 5
Video Editing                    | Ditto-1M                                 | ViCLIPT       | 0.1877 | 5
Video Editing                    | Ditto-1M randomly selected 809 samples 1 | ViCLIP        | 0.1877 | 5
