
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

About

To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
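The reordering task and the three modality strategies can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the `Clip` class, `make_puzzle`, `position_accuracy`, the 50/50 modality choice, and the rule that a masked clip keeps exactly one modality are all assumptions made for the example, and the abstract does not specify how the reinforcement learning reward is computed.

```python
import random
from dataclasses import dataclass


@dataclass
class Clip:
    index: int              # original chronological position (hidden from the model)
    has_video: bool = True  # whether the visual stream is shown
    has_audio: bool = True  # whether the audio stream is shown


def make_puzzle(num_clips, strategy="joint", seed=0):
    """Build a shuffled clip list and the ordering that restores chronology.

    strategy (hypothetical names for the three strategies in the abstract):
      "joint"  - every clip keeps both modalities
      "sample" - one modality is chosen for the whole puzzle
      "clip"   - each clip independently keeps exactly one modality (assumed rule)
    """
    rng = random.Random(seed)
    clips = [Clip(i) for i in range(num_clips)]

    if strategy == "sample":
        keep_video = rng.random() < 0.5
        for clip in clips:
            clip.has_video, clip.has_audio = keep_video, not keep_video
    elif strategy == "clip":
        for clip in clips:
            keep_video = rng.random() < 0.5
            clip.has_video, clip.has_audio = keep_video, not keep_video
    elif strategy != "joint":
        raise ValueError(f"unknown strategy: {strategy}")

    shuffled = clips[:]
    rng.shuffle(shuffled)
    # target[k] is the shuffled position of the k-th clip in chronological order
    target = sorted(range(num_clips), key=lambda pos: shuffled[pos].index)
    return shuffled, target


def position_accuracy(predicted, target):
    """Fraction of positions predicted correctly (an assumed scoring rule)."""
    if not target or len(predicted) != len(target):
        return 0.0
    return sum(p == t for p, t in zip(predicted, target)) / len(target)
```

Calling `make_puzzle(6, strategy="clip")` returns six shuffled clips, each exposing only video or only audio, together with the permutation a model would need to output. Under this masking rule, neighbouring clips may expose different modalities, so ordering them requires relating what is seen in one clip to what is heard in another, which is the kind of cross-modal integration the abstract describes.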

Yiduo Jia, Muzhi Zhu, Hao Zhong, Mingyu Liu, Yuling Xi, Hao Chen, Bin Qin, Yongjie Yang, Zhenbo Luo, Chunhua Shen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Reasoning | Video-MME | Overall Performance | 73.1 | 39 |
| Video Reasoning | Video-Holmes | Accuracy | 52.53 | 37 |
| Temporal Reasoning | TempCompass | Accuracy | 72.34 | 33 |
| Video Understanding Reasoning | MLVU | Accuracy | 73.46 | 21 |
| Complex Reasoning | Video-TT | Accuracy | 46.5 | 19 |
| Audio Reasoning | MMAU-Pro | Average Score | 58.59 | 18 |
| Audio Reasoning | MMAU mini (test) | Average Score | 76.3 | 17 |
| Video Reasoning | AoT Bench | Accuracy | 68.9 | 15 |
| Video Reasoning | TUNA-Bench | Accuracy | 66.2 | 15 |
| Video Reasoning | MLVU (test) | Accuracy | 62.75 | 15 |
Showing 10 of 15 rows
