Training-Free Multimodal Large Language Model Orchestration

About

Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free Large Language Model Orchestration (LLM Orchestration), a framework that integrates off-the-shelf modality experts into a unified multimodal input-output system without any additional gradient-based training for integration. LLM Orchestration comprises three components: (1) an LLM controller that infers user intent and emits explicit control tokens for expert selection and sequencing, enabling protocol-constrained and auditable routing; (2) a text-centric cross-modal memory that compresses multimodal evidence into structured records for lightweight retrieval and reuse, reducing redundant expert invocations across turns; and (3) a unified interaction layer that executes routing and memory decisions to support consistent modality transitions, full-duplex streaming, and interruption-aware dialogue. Across diverse multimodal benchmarks, LLM Orchestration achieves strong performance under standard evaluation constraints while maintaining low orchestration overhead and modular upgradeability, providing a practical alternative to costly joint training for omni-modal systems.
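As a rough illustration of how the first two components might fit together, the sketch below parses control tokens from a controller's output, dispatches to experts in order, and reuses text records from memory instead of re-invoking an expert. It is a minimal sketch under stated assumptions, not the authors' implementation: the `<route:...>` token syntax, the `MemoryRecord` fields, the `Orchestrator` class, and the stub experts are all hypothetical names chosen for this example, and the interaction layer (streaming, interruption handling) is omitted.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical control-token format: <route:expert_name>. The actual token
# vocabulary used by the paper is not given in this listing.
ROUTE_TOKEN = re.compile(r"<route:([a-z_]+)>")


@dataclass
class MemoryRecord:
    modality: str  # which expert produced the record
    source: str    # identifier of the raw input (e.g. a file name)
    summary: str   # text description standing in for the raw evidence


@dataclass
class Orchestrator:
    experts: Dict[str, Callable[[str], str]]
    memory: List[MemoryRecord] = field(default_factory=list)
    calls: int = 0  # number of real expert invocations, to show reuse

    def parse_routes(self, controller_output: str) -> List[str]:
        """Extract expert names in order, rejecting anything off-protocol."""
        names = ROUTE_TOKEN.findall(controller_output)
        unknown = [n for n in names if n not in self.experts]
        if unknown:
            raise ValueError(f"unknown experts requested: {unknown}")
        return names

    def run(self, controller_output: str, source: str) -> List[str]:
        """Run the routed experts in sequence, reusing memory when possible."""
        results = []
        for name in self.parse_routes(controller_output):
            cached = next(
                (r for r in self.memory
                 if r.modality == name and r.source == source),
                None,
            )
            if cached is None:
                self.calls += 1
                cached = MemoryRecord(name, source, self.experts[name](source))
                self.memory.append(cached)
            results.append(cached.summary)
        return results


# Stub experts standing in for off-the-shelf modality models.
experts = {
    "vision": lambda src: f"caption of {src}",
    "speech": lambda src: f"transcript of {src}",
}
orch = Orchestrator(experts)
first = orch.run("<route:vision><route:speech>", "clip.mp4")
second = orch.run("<route:vision>", "clip.mp4")  # served from memory
```

Because routing is expressed as explicit tokens checked against a fixed expert registry, every dispatch decision can be logged and validated, which is one plausible reading of the "protocol-constrained and auditable" property described above. The second call touches no expert, since the matching text record is already in memory.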

Tianyu Xie, Yuexiao Ma, Yuhang Wu, Wang Chen, Jiayi Ji, Tat-Seng Chua, Xiawu Zheng, Rongrong Ji • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMStar | Accuracy | 69.37 | 407 |
| Long Video Understanding | LVBench | Accuracy | 50.27 | 218 |
| Multimodal Understanding | MME | Score | 1.92e+3 | 125 |
| Multi-modal Understanding | MMBench EN | Accuracy | 88.54 | 105 |
| Multi-modal Video Understanding | VideoMME | Accuracy | 65.58 | 64 |
| Omni-modal Understanding | WorldSense | Accuracy | 44.1 | 12 |