Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions
About
Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable data curation pipeline for annotation, with automatic verification and refinement that enforces semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Captioning | VDC | Short Accuracy | 28.8 | 28 |
| Audiovisual Video Captioning | UGC-VideoCap | Audio Score | 79.1 | 26 |
| Audiovisual Video Captioning | SALMONN 2 (test) | Miss Rate | 20.5 | 26 |
| Video Captioning Evaluation | VidCapBench AE | Overall Accuracy | 18.2 | 17 |
| QA performance by Gemini-2.5-Pro based on captions | Daily-Omni (test) | Daily-Omni QA Score | 61.2 | 13 |
| QA performance by Gemini-2.5-Pro based on captions | World-Sense (test) | World-Sense QA Score | 34 | 13 |
| Attribute-level Instruction Following | Attribute-level Instruction Following Evaluation Set | Acc (1 Attribute) | 52.3 | 7 |