
Dual Latent Memory for Visual Multi-agent System

About

While Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, empirical evidence reveals a counter-intuitive "scaling wall": increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text-centric communication, where converting perceptual and thinking trajectories into discrete natural language inevitably induces semantic loss. To address this, we propose L$^{2}$-VMAS, a novel model-agnostic framework that enables inter-agent collaboration through dual latent memories. Specifically, we decouple perception from thinking while dynamically synthesizing the dual latent memories, and we introduce an entropy-driven proactive triggering mechanism that replaces passive information transmission with efficient, on-demand memory access. Extensive experiments across backbones, model sizes, and multi-agent structures demonstrate that our method effectively breaks the "scaling wall" with strong scalability, improving average accuracy by 2.7-5.4% while reducing token usage by 21.3-44.8%. Code: https://github.com/YU-deep/L2-VMAS.
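The abstract describes the triggering mechanism only at a high level. As a rough illustration of the idea, the sketch below shows one plausible way an entropy-driven trigger over dual latent banks could be wired up; it is not taken from the released repository, and every name here (DualLatentMemory, maybe_read_memory, the entropy threshold of 2.5) is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

class DualLatentMemory:
    """Toy stand-in for dual latent memories: one bank for perception
    latents, one for thinking latents (the split is an assumption)."""
    def __init__(self):
        self.perception, self.thinking = [], []

    def write(self, percept: torch.Tensor, thought: torch.Tensor):
        # Each agent deposits its latent states instead of text summaries.
        self.perception.append(percept)
        self.thinking.append(thought)

    def retrieve(self, query: torch.Tensor) -> torch.Tensor:
        # Attention-style read over all stored latents (one plausible choice).
        keys = torch.stack(self.perception + self.thinking)  # (N, d)
        attn = F.softmax(keys @ query, dim=0)                # (N,)
        return (attn.unsqueeze(-1) * keys).sum(dim=0)        # (d,)

def predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of the next-token distribution."""
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)

def maybe_read_memory(logits, memory, query, threshold=2.5):
    """Entropy-driven trigger: read from latent memory only when the
    agent's next-token distribution is uncertain, rather than passively
    receiving messages from other agents on every turn."""
    if predictive_entropy(logits).mean().item() > threshold:
        return memory.retrieve(query)
    return None  # confident enough: stay on the agent's own trajectory

# Minimal usage: write once, then gate reads by uncertainty.
mem = DualLatentMemory()
mem.write(torch.randn(16), torch.randn(16))
retrieved = maybe_read_memory(torch.randn(1, 100), mem, torch.randn(16))
```

In this sketch a single scalar threshold gates memory reads; the paper's actual trigger criterion and memory layout may well differ.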

Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, Shuicheng Yan • 2026

Related benchmarks

Task                        Dataset       Metric     Result   Rank
Multimodal Understanding    MMBench       Accuracy   88.8     367
Video Understanding         MVBench       Accuracy   73.1     247
Multimodal Understanding    MMStar        Accuracy   81.4     197
Visual Question Answering   RealworldQA   Accuracy   80.2     98
Visual Perception           BLINK         Accuracy   72.7     71
Long Video Understanding    LVBench       Accuracy   61.5     63
Multi-image Reasoning       MuirBench     Accuracy   77.2     48
