VideoMaMa: Mask-Guided Video Matting via Generative Prior
About
Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present the Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which is more robust on in-the-wild videos than the same model trained on existing matting datasets. These findings emphasize the importance of large-scale pseudo-labeled video matting data and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.
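At a high level, the pseudo-labeling pipeline refines cheap segmentation cues into matting annotations: obtain a coarse mask for each real-world clip, then convert it into a pixel-accurate alpha matte with the mask-to-matte model. The sketch below is illustrative only and is not the actual VideoMaMa API; `segment_video` and `mask_to_matte` are hypothetical placeholders standing in for an off-the-shelf segmenter (e.g., SAM2) and for VideoMaMa, respectively.

```python
from pathlib import Path
from typing import Callable

import numpy as np

# Hypothetical callables -- placeholders, not the released VideoMaMa interface.
# SegmentFn: frames (T, H, W, 3) uint8 -> coarse binary masks (T, H, W)
# MatteFn:   frames + coarse masks    -> alpha mattes (T, H, W), float in [0, 1]
SegmentFn = Callable[[np.ndarray], np.ndarray]
MatteFn = Callable[[np.ndarray, np.ndarray], np.ndarray]


def pseudo_label_clip(
    frames: np.ndarray,
    segment_video: SegmentFn,
    mask_to_matte: MatteFn,
) -> np.ndarray:
    """Turn one clip's coarse masks into per-frame pseudo alpha mattes."""
    coarse_masks = segment_video(frames)          # accessible segmentation cue
    alpha = mask_to_matte(frames, coarse_masks)   # mask-guided matting refinement
    return np.clip(alpha, 0.0, 1.0)


def build_pseudo_label_set(
    video_paths: list[Path],
    load_frames: Callable[[Path], np.ndarray],
    segment_video: SegmentFn,
    mask_to_matte: MatteFn,
    out_dir: Path,
) -> None:
    """Loop over real-world clips and store (frames, alpha) pairs as pseudo-labels."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in video_paths:
        frames = load_frames(path)  # (T, H, W, 3) uint8
        alpha = pseudo_label_clip(frames, segment_video, mask_to_matte)
        np.savez_compressed(out_dir / f"{path.stem}.npz", frames=frames, alpha=alpha)
```

In this reading, the resulting pairs form the pseudo-labeled training set (MA-V in the paper), on which a segmentation model such as SAM2 can then be fine-tuned for matting.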
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Matting | V-HIM60 Hard | MAD | 1.306 | 29 |
| Video Matting | YouTubeMatte 1920x1080 (test) | MAD | 0.934 | 20 |
| Video Matting | V-HIM60 Easy | MAD | 1.3446 | 4 |
| Video Matting | V-HIM60 Medium | MAD | 2.271 | 4 |
| Video Matting | V-HIM60 Hard | MAD | 2.6112 | 4 |
| Video Matting | YouTubeMatte 1920x1080 | MAD | 1.2695 | 4 |