Modeling Cross-vision Synergy for Unified Large Vision Model
About
Recent advances in large vision models (LVMs) have shifted from modality-specific designs toward unified architectures that jointly process images, videos, and 3D data. However, existing unified LVMs primarily pursue functional integration, while overlooking the deeper goal of cross-vision synergy: the ability to reason over complementary priors across visual modalities. To address this, we present PolyV, a unified LVM that achieves cross-vision synergy at both the architectural and training levels. Architecturally, PolyV adopts a sparse Mixture-of-Experts LVM coordinated by a dynamic modality router, allowing each expert to specialize in modality-specific priors while enabling bidirectional interaction and mutual refinement across modalities. Training-wise, a synergy-aware paradigm combines modality-specific pretraining with coarse-to-fine synergy tuning via knowledge distillation and object-/relation-level alignment. Extensive experiments on 10 benchmarks spanning image, video, and 3D understanding, including synergy-focused datasets requiring spatial or temporal priors, demonstrate that PolyV consistently outperforms existing models, achieving over 10% average improvement over its backbone. Overall, PolyV establishes a unified framework for synesthetic visual reasoning, advancing toward truly synergistic LVMs. Project page: https://sqwu.top/PolyV.
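To make the architectural idea concrete, here is a minimal sketch of generic top-k sparse Mixture-of-Experts routing. It is not PolyV's implementation: the abstract does not specify the router's inputs, the number of experts, or the gating function, so the class name `SparseMoE`, the linear experts, the softmax gate over the selected experts, and all dimensions below are illustrative assumptions.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class SparseMoE:
    """Hypothetical top-k sparse MoE layer with one linear map per expert."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one score per expert, computed from the token features.
        self.router = [[rng.gauss(0, 0.1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Experts: a dim x dim linear map each (assumed for illustration).
        self.experts = [[[rng.gauss(0, 0.1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def route(self, token):
        """Return (expert_index, gate_weight) pairs for the top-k experts."""
        logits = matvec(self.router, token)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i],
                        reverse=True)
        top = ranked[:self.top_k]
        gates = softmax([logits[i] for i in top])
        return list(zip(top, gates))

    def forward(self, token):
        """Mix the outputs of the selected experts; the rest are skipped."""
        out = [0.0] * len(token)
        for idx, gate in self.route(token):
            expert_out = matvec(self.experts[idx], token)
            out = [o + gate * y for o, y in zip(out, expert_out)]
        return out


moe = SparseMoE(dim=4, num_experts=3, top_k=2)
token = [1.0, 0.5, -0.2, 0.3]
print(moe.route(token))    # which experts were chosen, and their gate weights
print(moe.forward(token))  # gated mixture of the chosen experts' outputs
```

Only the `top_k` selected experts run for each token, which is what makes the layer sparse. A modality-aware router as described in the abstract would presumably also condition the routing scores on whether the token comes from an image, a video, or 3D data; that part is not modeled here.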
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D Question Answering | ScanQA (val) | METEOR | 23.1 | 217 |
| 3D Question Answering | SQA3D (test) | EM@1 | 64.8 | 98 |
| Image Understanding | 3DSRBench real 45 (test) | Average Score | 63.4 | 12 |
| Video Understanding | VSI-Bench 67 (test) | Average Score | 52.7 | 12 |
| Image Understanding | MMSI-Bench 68 (test) | Average Score | 31.7 | 12 |
| Image Understanding | MMStar 11 (test) | Average Score | 71.4 | 11 |
| Video Understanding | VideoMME w/o subtitles 21 (test) | Average Score | 69.6 | 11 |
| Video Understanding | CVBench 83 (test) | Average Score | 59.1 | 10 |
| 3D Question Answering | Open-EQA HM3D | LLM-Match Score | 63.4 | 3 |