ASTRA: Let Arbitrary Subjects Transform in Video Editing
About
While existing video editing methods excel with single subjects, they struggle in dense, multi-subject scenes, frequently suffering from attention dilution and mask boundary entanglement that cause attribute leakage and temporal instability. To address this, we propose ASTRA, a training-free framework for seamless, arbitrary-subject video editing. Without requiring model fine-tuning, ASTRA precisely manipulates multiple designated subjects while strictly preserving non-target regions. It achieves this via two core components: a prompt-guided multimodal alignment module that generates robust conditions to mitigate attention dilution, and a prior-based mask retargeting module that produces temporally coherent mask sequences to resolve boundary entanglement. Functioning as a versatile plug-and-play module, ASTRA seamlessly integrates with diverse mask-driven video generators. Extensive experiments on our newly constructed benchmark, MSVBench, demonstrate that ASTRA consistently outperforms state-of-the-art methods. Code, models, and data are available at https://github.com/XWH-A/ASTRA.
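The two components described above can be pictured as a thin wrapper around any mask-driven generator. The sketch below is purely illustrative; every name (`align_prompts`, `retarget_masks`, `edit_video`, `EditCondition`) is a hypothetical stand-in, not ASTRA's actual API:

```python
from dataclasses import dataclass

# Hypothetical sketch of ASTRA's plug-and-play flow; all names are
# illustrative placeholders, not the real ASTRA interface.

@dataclass
class EditCondition:
    subject: str
    embedding: list  # stand-in for a multimodal prompt embedding

def align_prompts(subjects, edit_prompts):
    """Prompt-guided multimodal alignment (illustrative): build one
    robust condition per designated subject so attention is not
    diluted across many subjects."""
    return [EditCondition(s, [hash((s, p)) % 97])
            for s, p in zip(subjects, edit_prompts)]

def retarget_masks(initial_masks, num_frames):
    """Prior-based mask retargeting (illustrative): propagate each
    subject's mask across all frames so boundaries stay temporally
    coherent instead of entangling between subjects."""
    return [dict(initial_masks) for _ in range(num_frames)]

def edit_video(frames, subjects, edit_prompts, initial_masks, generator):
    """Wrap any mask-driven video generator: conditions guide the
    designated subjects, mask sequences protect non-target regions."""
    conditions = align_prompts(subjects, edit_prompts)
    mask_seq = retarget_masks(initial_masks, len(frames))
    return generator(frames, conditions, mask_seq)
```

Because the generator is passed in as a callable, this structure mirrors the paper's claim that ASTRA integrates with diverse mask-driven video generators without fine-tuning.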
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Editing | MSVBench (test) | Warp Error | 1.85 | 10 |
| Video Editing | LOVEU-TGVE 2023 | Warp-Err | 2.04 | 6 |
| Video Editing | Video Editing Dataset | CLIP-T Score | 27.23 | 3 |