InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
About
Instruction-based video editing is a natural way to control video content with text, but adapting a video generation model into an editor usually appears to be data-hungry. At the same time, high-quality video editing data remains scarce. In this paper, we show that a video generation backbone can become a strong video editor without large-scale video editing data. We present InsEdit, an instruction-based editing model built on HunyuanVideo-1.5. InsEdit combines a visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA), which creates aligned video pairs in which edits can begin in the middle of a clip rather than only at the first frame. With video editing data on the order of only 100K samples, InsEdit achieves state-of-the-art results among open-source methods on our video instruction editing benchmarks. In addition, because our training recipe also includes image editing data, the final model supports image editing without any modification.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction-Guided Video Editing | OpenVE-Bench | Overall Score: 4.43 | 17 |
| Image Instruction Editing | GEdit | G-SC Score: 6.98 | 10 |
| Video Instruction Editing | InsEdit-Bench | Overall Score: 4.61 | 9 |