Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists
About
Recent advancements in video generation have spurred the development of video editing techniques, which can be divided into inversion-based and end-to-end methods. However, current video editing methods still face several challenges. Inversion-based methods, though training-free and flexible, are slow at inference, struggle with fine-grained editing instructions, and produce artifacts and jitter. End-to-end methods, which rely on edited video pairs for training, offer faster inference but often produce poor editing results because high-quality training video pairs are scarce. In this paper, to close this gap for end-to-end methods, we introduce Señorita-2M, a high-quality video editing dataset. Señorita-2M consists of approximately 2 million video editing pairs. It is built using four high-quality, specialized video editing models, each designed and trained by our team to achieve state-of-the-art editing results. We also propose a filtering pipeline to eliminate poorly edited video pairs. Furthermore, we explore common video editing architectures to identify the most effective structure based on current pre-trained generative models. Extensive experiments show that our dataset helps yield remarkably high-quality video editing results. More details are available at https://senorita-2m-dataset.github.io.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Generation Quality | PointBench | Success Rate (%) | 56.18 | 18 |
| Video Editing | EditVerseBench Appearance (test) | Pick Score | 19.69 | 12 |
| Video Editing | EditVerseBench 125 videos | CLIP Score | 98.9 | 11 |
| Video Editing | EditVerse latest (full) | Editing Quality | 6.45 | 11 |
| Video Editing | TGVE benchmark | Pick Score | 20.54 | 11 |
| Video Editing | EgoEditBench | VLM Score | 7.52 | 10 |
| Mask-based video object insertion | Internal (test) | MSE | 106.3 | 9 |
| Point-based video object insertion | DAVIS (test) | Acc Pos | 49.63 | 9 |
| Video Object Removal | BridgeRemoval-Bench | CLIP-T | 0.292 | 7 |
| Video Object Removal | DAVIS 2016 | CLIP-T | 0.2618 | 7 |