OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
About
To extend the reinforcement-learning post-training paradigm to omni-modal models and jointly strengthen video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built on a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which enables efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
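The reordering proxy task above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `build_jigsaw_sample`, the `(video, audio)` clip representation, and the `mask_prob` parameter are all hypothetical, and it only shows how a shuffled puzzle with clip-level modality masking might be constructed, with the original permutation as the supervision target.

```python
import random

def build_jigsaw_sample(clips, mask_prob=0.3, seed=None):
    """Build one hypothetical OmniJigsaw-style training sample.

    `clips` is a chronologically ordered list of (video, audio) pairs.
    Returns the shuffled clips (with clip-level modality masking applied)
    and the permutation the model must recover.
    """
    rng = random.Random(seed)
    order = list(range(len(clips)))
    rng.shuffle(order)

    shuffled = []
    for i in order:
        video, audio = clips[i]
        # Clip-level modality masking: independently drop one modality
        # per clip, so ordering cannot be solved from a single stream
        # alone (countering the "bi-modal shortcut phenomenon").
        if rng.random() < mask_prob:
            if rng.random() < 0.5:
                video = None   # mask the visual stream for this clip
            else:
                audio = None   # mask the audio stream for this clip
        shuffled.append((video, audio))

    # Supervision target: for each shuffled position j, order[j] is the
    # clip's original chronological index; the model's reordering
    # prediction is rewarded against this permutation during RL.
    return shuffled, order
```

Setting `mask_prob=0` recovers plain joint modality integration, while masking entire samples rather than individual clips would correspond to sample-level modality selection.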
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Reasoning | Video-MME | Overall Performance | 73.1 | 39 |
| Video Reasoning | Video-Holmes | Accuracy | 52.53 | 37 |
| Temporal Reasoning | TempCompass | Accuracy | 72.34 | 33 |
| Video Understanding Reasoning | MLVU | Accuracy | 73.46 | 21 |
| Complex Reasoning | Video-TT | Accuracy | 46.5 | 19 |
| Audio Reasoning | MMAU-Pro | Average Score | 58.59 | 18 |
| Audio Reasoning | MMAU mini (test) | Average Score | 76.3 | 17 |
| Video Reasoning | AoT Bench | Accuracy | 68.9 | 15 |
| Video Reasoning | TUNA-Bench | Accuracy | 66.2 | 15 |
| Video Reasoning | MLVU (test) | Accuracy | 62.75 | 15 |