Abstract
VideoMaMa converts coarse masks to accurate alpha mattes using pretrained video diffusion models, enabling zero-shot generalization and scalable pseudo-labeling for video matting.
Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present the Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.
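For readers who want a concrete picture of the pseudo-labeling pipeline, the sketch below shows how the two stages described in the abstract compose: coarse masks from a video segmenter are refined into alpha mattes, and the frame/mask/matte triplet forms one pseudo-label. All names in the snippet (`coarse_masks_from_segmenter`, `mask_to_matte`, `PseudoLabeledClip`) are hypothetical placeholders for illustration, not the authors' released code; in the actual pipeline the coarse masks would come from an off-the-shelf segmenter such as SAM2 and the refinement from the diffusion-based VideoMaMa model.

```python
# Minimal, illustrative sketch of a mask-to-matte pseudo-labeling pipeline.
# Every function below is a dummy stand-in, not the paper's API.

from dataclasses import dataclass
import numpy as np


@dataclass
class PseudoLabeledClip:
    frames: np.ndarray        # (T, H, W, 3) RGB frames, uint8
    coarse_masks: np.ndarray  # (T, H, W) binary segmentation masks
    alpha_mattes: np.ndarray  # (T, H, W) refined alpha values in [0, 1]


def coarse_masks_from_segmenter(frames: np.ndarray) -> np.ndarray:
    """Stand-in for a promptable video segmenter producing coarse masks."""
    return (frames.mean(axis=-1) > 127).astype(np.float32)  # dummy thresholding


def mask_to_matte(frames: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Stand-in for the diffusion-based mask-to-matte refinement step."""
    return masks.copy()  # a real model would recover soft boundaries here


def pseudo_label_clip(frames: np.ndarray) -> PseudoLabeledClip:
    """Coarse segmentation followed by matte refinement yields one pseudo-label."""
    masks = coarse_masks_from_segmenter(frames)
    alphas = mask_to_matte(frames, masks)
    return PseudoLabeledClip(frames=frames, coarse_masks=masks, alpha_mattes=alphas)


if __name__ == "__main__":
    clip = np.random.randint(0, 256, size=(8, 64, 64, 3), dtype=np.uint8)
    labeled = pseudo_label_clip(clip)
    print(labeled.alpha_mattes.shape)  # (8, 64, 64)
```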
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator (2025)
- EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition (2025)
- M3DDM+: An improved video outpainting by a modified masking strategy (2026)
- CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion (2025)
- Beyond Inpainting: Unleash 3D Understanding for Precise Camera-Controlled Video Generation (2026)
- Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views (2026)
- Repurposing Video Diffusion Transformers for Robust Point Tracking (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend