Dual Latent Memory for Visual Multi-agent System
About
While Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, empirical evidence reveals a counter-intuitive "scaling wall": increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text-centric communication, where converting perceptual and reasoning trajectories into discrete natural language inevitably induces semantic loss. To address this, we propose L$^{2}$-VMAS, a novel model-agnostic framework that enables inter-agent collaboration through dual latent memories. Specifically, we decouple perception from thinking while dynamically synthesizing the dual latent memories. Additionally, we introduce entropy-driven proactive triggering, which replaces passive information transmission with efficient, on-demand memory access. Extensive experiments across backbones, model sizes, and multi-agent structures demonstrate that our method effectively breaks the "scaling wall" with strong scalability, improving average accuracy by 2.7-5.4% while reducing token usage by 21.3-44.8%. Code: https://github.com/YU-deep/L2-VMAS.
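The entropy-driven triggering idea can be illustrated with a minimal sketch: an agent queries memory only when its next-token distribution is uncertain (high entropy), rather than passively receiving information every turn. The function names, the entropy threshold, and the use of a plain probability vector here are illustrative assumptions, not the framework's actual implementation.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_query_memory(probs, threshold=1.0):
    """Hypothetical trigger: access latent memory only when the agent is
    uncertain, i.e. when predictive entropy exceeds the threshold."""
    return token_entropy(probs) > threshold

# A confident prediction stays local; a near-uniform one triggers memory access.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(should_query_memory(confident))  # low entropy -> False
print(should_query_memory(uncertain))  # entropy = ln(4) ≈ 1.39 -> True
```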
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMBench | Accuracy | 88.8 | 367 |
| Video Understanding | MVBench | Accuracy | 73.1 | 247 |
| Multimodal Understanding | MMStar | Accuracy | 81.4 | 197 |
| Visual Question Answering | RealworldQA | Accuracy | 80.2 | 98 |
| Visual Perception | BLINK | Accuracy | 72.7 | 71 |
| Long Video Understanding | LVBench | Accuracy | 61.5 | 63 |
| Multi-image Reasoning | MuirBench | Accuracy | 77.2 | 48 |