
Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning

About

Offline multi-agent reinforcement learning (MARL) aims to learn the optimal joint policy from pre-collected datasets, which requires trading off maximizing global returns against mitigating distribution shift from the offline data. Recent studies use diffusion or flow generative models to capture complex joint policy behaviors among agents; however, they typically rely on multi-step iterative sampling, which reduces training and inference efficiency. Although later work improves sampling efficiency through techniques such as distillation, these methods remain sensitive to the behavior-regularization coefficient. To address these issues, we propose Value Guidance Multi-agent MeanFlow Policy (VGM$^2$P), a simple yet effective flow-based policy learning framework that enables efficient action generation with coefficient-insensitive conditional behavior cloning. Specifically, VGM$^2$P uses global advantage values to guide agent collaboration, treating optimal policy learning as conditional behavior cloning. Additionally, to improve policy expressiveness and inference efficiency in multi-agent scenarios, it leverages classifier-free guidance MeanFlow for both policy training and execution. Experiments on tasks with both discrete and continuous action spaces demonstrate that, even when trained solely via conditional behavior cloning, VGM$^2$P efficiently achieves performance comparable to state-of-the-art methods.
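To make the two key ingredients concrete, here is a minimal, illustrative sketch (not the authors' code) of how classifier-free guidance combines conditional and unconditional mean-velocity fields, and how a MeanFlow-style model generates an action in a single step via x1 = x0 + u(x0, r=0, t=1). The `mean_velocity` function below is a hypothetical stand-in for a trained network, and the condition plays the role of an advantage-value signal; all names and the toy dynamics are assumptions for illustration.

```python
import numpy as np

def mean_velocity(x, cond):
    # Hypothetical stand-in for a trained mean-velocity network:
    # pulls samples toward a condition-dependent target over [0, 1].
    target = 2.0 * cond          # toy rule: condition scales the goal
    return target - x            # average velocity from x to target

def cfg_mean_velocity(x, cond, w):
    """Classifier-free guidance: blend conditional and unconditional fields,
    u = u_uncond + w * (u_cond - u_uncond)."""
    u_cond = mean_velocity(x, cond)
    u_uncond = mean_velocity(x, np.zeros_like(cond))
    return u_uncond + w * (u_cond - u_uncond)

def one_step_sample(noise, cond, w=1.5):
    """MeanFlow-style single-step generation: x1 = x0 + u(x0, 0, 1)."""
    return noise + cfg_mean_velocity(noise, cond, w)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 2))   # noise in a toy 2-D "action" space
cond = np.ones((4, 2))             # advantage-style conditioning signal
actions = one_step_sample(x0, cond, w=1.0)
print(actions.shape)               # (4, 2)
```

With the guidance weight `w = 1.0` the blend reduces to the purely conditional field; larger `w` extrapolates away from the unconditional behavior, which is how guidance strength is tuned in practice. The single additive update is what removes the multi-step iterative sampling loop of standard diffusion/flow policies.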

Teng Pang, Zhiqiang Dong, Yan Zhang, Rongjian Xu, Guoqiang Wu, Yilong Yin · 2026

Related benchmarks

| Task | Dataset | Average Performance | Rank |
| --- | --- | --- | --- |
| Multi-agent continuous control | MA-MuJoCo 6Halfcheetah-Medium | 5.16e+3 | 16 |
| Multi-agent continuous control | MA-MuJoCo 6Halfcheetah-Expert | 4.90e+3 | 8 |
| Multi-agent continuous control | MA-MuJoCo 6Halfcheetah-MR | 4.07e+3 | 8 |
| Multi-agent continuous control | MA-MuJoCo 3Hopper-Medium | 2.01e+3 | 8 |
| Multi-agent continuous control | MA-MuJoCo 3Hopper-MR | 1.43e+3 | 8 |
| Multi-agent continuous control | MA-MuJoCo 3Hopper-ME | 3.37e+3 | 8 |
| Multi-agent continuous control | MA-MuJoCo 2Ant-Expert | 2.08e+3 | 8 |
| Multi-agent continuous control | MA-MuJoCo 2Ant-Medium | 1.43e+3 | 8 |
| Multi-agent continuous control | MA-MuJoCo 2Ant-MR | 1.31e+3 | 8 |
| Multi-agent continuous control | MA-MuJoCo 2Ant-ME | 1.97e+3 | 8 |

Showing 10 of 11 rows
