Training-Free Multimodal Large Language Model Orchestration

About

Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free Large Language Model Orchestration (LLM Orchestration), a framework that integrates off-the-shelf modality experts into a unified multimodal input-output system without additional gradient-based training for integration. LLM Orchestration comprises three components: (1) an LLM controller that infers user intent and emits explicit control tokens for expert selection and sequencing, enabling protocol-constrained and auditable routing; (2) a text-centric cross-modal memory that compresses multimodal evidence into structured records for lightweight retrieval and reuse, reducing redundant expert invocations across turns; and (3) a unified interaction layer that executes routing and memory decisions to support consistent modality transitions, full-duplex streaming, and interruption-aware dialogue. Across diverse multimodal benchmarks, LLM Orchestration achieves strong performance under standard evaluation constraints while maintaining low orchestration overhead and modular upgradeability, providing a practical alternative to costly joint training for omni-modal systems.
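
As a rough illustration of how the first two components could fit together, the sketch below parses control tokens emitted by a controller, checks them against the set of registered experts, and reuses cached text records instead of re-invoking an expert. Everything in it is an assumption made for illustration: the `<call:NAME>` token syntax, the `Memory` class, the `run_turn` function, and the stub experts are hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical control-token syntax (an assumption, not the paper's protocol).
TOKEN_RE = re.compile(r"<call:(\w+)>")


@dataclass
class Memory:
    """Text-centric store: expert outputs kept as text, keyed by (expert, input)."""
    records: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def get(self, expert: str, payload: str):
        return self.records.get((expert, payload))

    def put(self, expert: str, payload: str, text: str) -> None:
        self.records[(expert, payload)] = text


def parse_plan(controller_output: str, allowed: Set[str]) -> List[str]:
    """Return expert names in emission order; reject names outside the protocol."""
    plan = TOKEN_RE.findall(controller_output)
    unknown = [name for name in plan if name not in allowed]
    if unknown:
        raise ValueError(f"controller emitted unknown experts: {unknown}")
    return plan


def run_turn(
    controller_output: str,
    payload: str,
    experts: Dict[str, Callable[[str], str]],
    memory: Memory,
    log: List[Tuple[str, str]],
) -> List[str]:
    """Execute the routing plan, reusing memory records where they exist."""
    results = []
    for name in parse_plan(controller_output, set(experts)):
        cached = memory.get(name, payload)
        if cached is None:
            cached = experts[name](payload)   # invoke the modality expert
            memory.put(name, payload, cached)
            log.append(("invoke", name))      # audit trail of routing decisions
        else:
            log.append(("reuse", name))       # redundant invocation avoided
        results.append(cached)
    return results


# Stub experts standing in for real speech and vision models.
EXPERTS = {
    "asr": lambda p: f"transcript of {p}",
    "caption": lambda p: f"caption of {p}",
}
```

In this sketch, calling `run_turn("<call:asr><call:caption>", "clip.mp4", EXPERTS, Memory(), [])` would invoke both stubs once, and a later turn that requests `<call:caption>` for the same input with the same `Memory` would be served from the stored record. The log of `("invoke", name)` and `("reuse", name)` entries is one simple way to make routing auditable.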

Tianyu Xie, Yuexiao Ma, Yuhang Wu, Wang Chen, Jiayi Ji, Tat-Seng Chua, Xiawu Zheng, Rongrong Ji • 2025

Related benchmarks

Task	Dataset	Result
Multimodal Understanding	MMStar	Accuracy 69.37	511
Long Video Understanding	LVBench	Accuracy 50.27	267
Multimodal Understanding	MME	Score 1.92e+3	150
Multi-modal Understanding	MMBench EN	Accuracy 88.54	113
Multi-modal Video Understanding	VideoMME	Accuracy 65.58	64
Omni-modal Understanding	WorldSense	Accuracy 44.1	12
