ASTRA: Let Arbitrary Subjects Transform in Video Editing
About
While existing video editing methods excel with single subjects, they struggle in dense, multi-subject scenes, frequently suffering from attention dilution and mask boundary entanglement that cause attribute leakage and temporal instability. To address this, we propose ASTRA, a training-free framework for seamless, arbitrary-subject video editing. Without requiring model fine-tuning, ASTRA precisely manipulates multiple designated subjects while strictly preserving non-target regions. It achieves this via two core components: a prompt-guided multimodal alignment module that generates robust conditions to mitigate attention dilution, and a prior-based mask retargeting module that produces temporally coherent mask sequences to resolve boundary entanglement. Functioning as a versatile plug-and-play module, ASTRA seamlessly integrates with diverse mask-driven video generators. Extensive experiments on our newly constructed benchmark, MSVBench, demonstrate that ASTRA consistently outperforms state-of-the-art methods. Code, models, and data are available at https://github.com/XWH-A/ASTRA.
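The two components described above can be pictured as a thin wrapper around any mask-driven generator. The sketch below is purely illustrative; every name (`align_prompts`, `retarget_masks`, `edit_video`, `EditCondition`) is a hypothetical stand-in, not ASTRA's actual API:

```python
from dataclasses import dataclass

# Hypothetical sketch of ASTRA's plug-and-play flow; all names are
# illustrative placeholders, not the real ASTRA interface.

@dataclass
class EditCondition:
    subject: str
    embedding: list  # stand-in for a multimodal prompt embedding

def align_prompts(subjects, edit_prompts):
    """Prompt-guided multimodal alignment (illustrative): build one
    robust condition per designated subject so attention is not
    diluted across many subjects."""
    return [EditCondition(s, [hash((s, p)) % 97])
            for s, p in zip(subjects, edit_prompts)]

def retarget_masks(initial_masks, num_frames):
    """Prior-based mask retargeting (illustrative): propagate each
    subject's mask across all frames so boundaries stay temporally
    coherent instead of entangling between subjects."""
    return [dict(initial_masks) for _ in range(num_frames)]

def edit_video(frames, subjects, edit_prompts, initial_masks, generator):
    """Wrap any mask-driven video generator: conditions guide the
    designated subjects, mask sequences protect non-target regions."""
    conditions = align_prompts(subjects, edit_prompts)
    mask_seq = retarget_masks(initial_masks, len(frames))
    return generator(frames, conditions, mask_seq)
```

Because the generator is passed in as a callable, this structure mirrors the paper's claim that ASTRA integrates with diverse mask-driven video generators without fine-tuning.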
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Editing | MSVBench (test) | Warp Error | 1.85 | 10 |
| Video Editing | LOVEU-TGVE 2023 | Warp-Err | 2.04 | 6 |
| Video Editing | Video Editing Dataset | CLIP-T Score | 27.23 | 3 |