InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction

About

Instruction-based video editing allows effective and interactive editing of videos using only instructions, without extra inputs such as masks or attributes. However, collecting high-quality training triplets (source video, edited video, instruction) is a challenging task. Existing datasets mostly consist of a limited number of low-resolution, short-duration source videos with unsatisfactory editing quality, which limits the performance of trained editing models. In this work, we present a high-quality Instruction-based Video Editing dataset with 1M triplets, namely InsViE-1M. We first curate high-resolution and high-quality source videos and images, then design an effective editing-filtering pipeline to construct high-quality editing triplets for model training. For a source video, we generate multiple edited samples of its first frame with different intensities of classifier-free guidance, which are automatically filtered by GPT-4o with carefully crafted guidelines. The edited first frame is propagated to subsequent frames to produce the edited video, followed by another round of filtering for frame quality and motion evaluation. We also generate and filter a variety of video editing triplets from high-quality images. With the InsViE-1M dataset, we propose a multi-stage learning strategy to train our InsViE model, progressively enhancing its instruction following and editing ability. Extensive experiments demonstrate the advantages of our InsViE-1M dataset and the trained model over state-of-the-art works. Code is available at https://github.com/langmanbusi/InsViE.
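The first-frame step described in the abstract (several edits at different classifier-free guidance intensities, then automatic filtering) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `edit_fn`, `judge_fn`, the guidance scale values, the `min_score` threshold, and the "keep the highest-scoring candidate" rule are all hypothetical stand-ins chosen for illustration. In the paper the judging role is played by GPT-4o with crafted guidelines, whose exact scoring scheme is not given here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Candidate:
    guidance_scale: float  # classifier-free guidance intensity used for this edit
    frame: object          # edited first frame (any image type)
    score: float           # quality score returned by the judge


def select_first_frame(
    source_frame: object,
    instruction: str,
    edit_fn: Callable[[object, str, float], object],   # hypothetical image editor
    judge_fn: Callable[[object, object, str], float],  # hypothetical automatic judge
    guidance_scales: Sequence[float] = (3.0, 5.0, 7.5),  # assumed values
    min_score: float = 0.5,                              # assumed threshold
) -> Optional[Candidate]:
    """Edit the first frame at several guidance scales and keep the best one.

    Returns None when no candidate reaches min_score, meaning the sample
    would be dropped from the dataset.
    """
    candidates = []
    for scale in guidance_scales:
        edited = edit_fn(source_frame, instruction, scale)
        score = judge_fn(source_frame, edited, instruction)
        candidates.append(Candidate(scale, edited, score))
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.score)
    return best if best.score >= min_score else None
```

A surviving candidate's frame would then be propagated to the remaining frames and pass through the second round of filtering for frame quality and motion, which this sketch does not cover.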

Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, Lei Zhang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Editing | OpenVE-Bench (test) | Overall Score | 3.25 | 16 |
| Instruction-Guided Video Editing | OpenVE-Bench 1.0 (full) | Overall Quality | 1.53 | 16 |
| Video Editing | DAVIS (first 33 frames) | Background MSE | 4.97e+3 | 14 |
| Video Object Retexturing | Pexels video dataset (test) | Background MSE | 5.45e+3 | 14 |
| Video Editing | EditVerse latest (full) | Editing Quality | 4.36 | 11 |
| Instruction-Guided Video Editing | OpenVE-Bench | Overall Score | 1.53 | 8 |
| Video Editing | OpenVE-Bench 1.0 (test) | Overall Score | 3.25 | 8 |
| Video Editing Evaluation | OpenVE-Bench Video Paris 1.0 | Overall Score | 2.62 | 8 |
| Video Editing | Video Editing (test) | VBench Quality | 79.19 | 6 |
| Reasoning-Informed Video Editing | RVE-Bench Spatial Reasoning | ViCLIPT | 0.1636 | 5 |
Showing 10 of 25 rows
