Training-Free Multimodal Large Language Model Orchestration
About
Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free Large Language Model Orchestration (LLM Orchestration), a framework that integrates off-the-shelf modality experts into a unified multimodal input-output system without any additional gradient-based training for integration. LLM Orchestration comprises three components: (1) an LLM controller that infers user intent and emits explicit control tokens for expert selection and sequencing, enabling protocol-constrained and auditable routing; (2) a text-centric cross-modal memory that compresses multimodal evidence into structured records for lightweight retrieval and reuse, reducing redundant expert invocations across turns; and (3) a unified interaction layer that executes routing and memory decisions to support consistent modality transitions, full-duplex streaming, and interruption-aware dialogue. Across diverse multimodal benchmarks, LLM Orchestration achieves strong performance under standard evaluation constraints while keeping orchestration overhead low and remaining modularly upgradeable, offering a practical alternative to costly joint training for omni-modal systems.
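The interplay of the three components can be illustrated with a minimal Python sketch. It is not the authors' implementation: the control-token syntax `<call:EXPERT>argument</call>`, the class names `Memory` and `Orchestrator`, and the audit-log strings are all assumptions made for illustration, and the experts are plain text-returning callables standing in for real modality models. The sketch shows only the routing idea: control tokens are parsed from the controller's text output, unknown experts are rejected and logged, and a text-keyed memory avoids invoking the same expert twice on the same evidence.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical control-token syntax (an assumption, not taken from the paper).
CONTROL_TOKEN = re.compile(r"<call:(\w+)>(.*?)</call>", re.DOTALL)


@dataclass
class Memory:
    """Text-centric memory: expert outputs stored as text, keyed by (expert, argument)."""
    records: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def lookup(self, expert: str, arg: str) -> Optional[str]:
        return self.records.get((expert, arg))

    def store(self, expert: str, arg: str, text: str) -> None:
        self.records[(expert, arg)] = text


class Orchestrator:
    """Executes the routing decisions encoded in the controller's output."""

    def __init__(self, experts: Dict[str, Callable[[str], str]]):
        self.experts = experts            # name -> off-the-shelf expert (stubbed as a callable)
        self.memory = Memory()
        self.audit_log: List[str] = []    # one entry per routing decision

    def execute(self, controller_output: str) -> List[str]:
        results = []
        for name, arg in CONTROL_TOKEN.findall(controller_output):
            if name not in self.experts:
                # Protocol constraint: only registered experts may be called.
                self.audit_log.append(f"rejected:{name}")
                continue
            cached = self.memory.lookup(name, arg)
            if cached is None:
                cached = self.experts[name](arg)
                self.memory.store(name, arg, cached)
                self.audit_log.append(f"invoked:{name}")
            else:
                # Reuse the stored record instead of invoking the expert again.
                self.audit_log.append(f"reused:{name}")
            results.append(cached)
        return results
```

In this sketch a second turn that refers to the same image would be served from memory, and the audit log would record it as a reuse, not a new invocation. Streaming, interruption handling, and the controller LLM itself are out of scope here.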
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMStar | Accuracy | 69.37 | 407 |
| Long Video Understanding | LVBench | Accuracy | 50.27 | 218 |
| Multimodal Understanding | MME | Score | 1.92e+3 | 125 |
| Multi-modal Understanding | MMBench EN | Accuracy | 88.54 | 105 |
| Multi-modal Video Understanding | VideoMME | Accuracy | 65.58 | 64 |
| Omni-modal Understanding | WorldSense | Accuracy | 44.1 | 12 |