
Rethinking Model Efficiency: Multi-Agent Inference with Large Models

About

Most vision-language models (VLMs) use a large language model (LLM) as the decoder, which generates response tokens sequentially through autoregression. The number of output tokens can therefore become the bottleneck for end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency across different components of VLMs on simulated data. The experiment shows that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. An empirical study on diverse real-world benchmarks confirms that a large model can achieve performance better than or comparable to that of a small model with significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference framework that keeps the large model's responses short but transfers key reasoning tokens from the small model when necessary. Comparisons on benchmark tasks show that reusing reasoning tokens from the small model helps the large model approach the performance it achieves with its own reasoning, which confirms the effectiveness of our proposal.
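The latency argument in the abstract can be illustrated with a minimal sketch. It assumes a simple additive cost model (a one-off prefill cost plus a constant per-token decode cost), which is a simplification and not necessarily the analysis used in the paper. The `ModelProfile` class, the function names, and all timing numbers are hypothetical, chosen only to show how a large model with a short answer can finish before a small model with a long one.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical latency profile; the numbers below are illustrative, not measured."""
    name: str
    prefill_ms: float           # one-off cost to encode the image and prompt
    decode_ms_per_token: float  # cost of each autoregressive decoding step


def end_to_end_latency_ms(model: ModelProfile, n_output_tokens: int) -> float:
    """Assumed additive model: prefill once, then one decode step per output token."""
    return model.prefill_ms + model.decode_ms_per_token * n_output_tokens


def break_even_tokens(small: ModelProfile, large: ModelProfile, n_large_tokens: int) -> float:
    """Small-model output length at which both models take the same time.

    If the small model needs more tokens than this, the large model
    (emitting n_large_tokens) is faster under the assumed cost model.
    """
    large_latency = end_to_end_latency_ms(large, n_large_tokens)
    return (large_latency - small.prefill_ms) / small.decode_ms_per_token


# Made-up profiles: the large model is 4x slower per token than the small one.
SMALL = ModelProfile("small-vlm", prefill_ms=50.0, decode_ms_per_token=10.0)
LARGE = ModelProfile("large-vlm", prefill_ms=200.0, decode_ms_per_token=40.0)

if __name__ == "__main__":
    # Large model answers directly in 10 tokens; small model reasons for 200 tokens.
    print(end_to_end_latency_ms(LARGE, 10))   # 600.0
    print(end_to_end_latency_ms(SMALL, 200))  # 2050.0
    print(break_even_tokens(SMALL, LARGE, 10))  # 55.0
```

Under these assumed numbers, the small model is only faster if it can answer in fewer than 55 tokens; any longer reasoning chain makes the large model with a 10-token answer the lower-latency choice.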

Sixun Dong, Juhua Hu, Steven Li, Wei Wen, Qi Qian • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| Object Hallucination Evaluation | POPE | -- | 1455 |
| Multimodal Understanding | MMBench | Accuracy 85.1 | 637 |
| Visual Question Answering | ChartQA | Accuracy 86 | 371 |
| Visual Question Answering | RealworldQA | Accuracy 71.6 | 179 |
| Visual Question Answering | InfoVQA | Accuracy 84.1 | 135 |
| Multimodal Understanding | MMBench | Latency 8.68e+3 | 16 |
| Document Visual Question Answering | InfoVQA (test) | Final Performance 82.9 | 5 |
| Multi-modal Understanding | MMBench (test) | Final Performance 85.5 | 5 |
| Object Hallucination Detection | POPE (test) | Final Performance Score 89.4 | 5 |
| Multi-modal Understanding | MMMU (test) | Final Performance 62 | 5 |
Showing 10 of 15 rows
