
Heterogeneous Agent Collaborative Reinforcement Learning

About

We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new Reinforcement Learning from Verifiable Reward (RLVR) problem that addresses the inefficiencies of isolated multi-agent on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional homogeneous teacher-to-student transfer. Building on this problem, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO with double rollouts by an average of 3.6% while using only half the rollout cost.
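To make the idea of sharing verified rollouts concrete, here is a minimal sketch that pools the verifiable rewards of two agents on the same prompt and normalizes each reward against the pooled group. It is an illustrative assumption, not HACPO: the normalization is a plain GRPO-style group baseline, the function name `pooled_group_advantages` and the agent labels are hypothetical, and the four mechanisms mentioned in the abstract are not modeled because they are not described here.

```python
from statistics import mean, pstdev


def pooled_group_advantages(rewards_by_agent, eps=1e-6):
    """Normalize each rollout's reward against the pool of all agents' rollouts.

    Hypothetical helper for illustration only; this is a GRPO-style
    group baseline, not the HACPO advantage estimator.
    """
    # Flatten every agent's verified rewards for one prompt into a shared group.
    pooled = [r for rewards in rewards_by_agent.values() for r in rewards]
    mu = mean(pooled)
    sigma = pstdev(pooled)  # population standard deviation of the pooled group
    # Each agent keeps its own rollouts but is scored against the shared baseline.
    return {
        agent: [(r - mu) / (sigma + eps) for r in rewards]
        for agent, rewards in rewards_by_agent.items()
    }


# Assumed toy data: binary verifier outcomes (1.0 = correct, 0.0 = incorrect).
rewards = {"agent_a": [1.0, 0.0], "agent_b": [1.0, 1.0]}
advantages = pooled_group_advantages(rewards)
```

In this toy setup both agents are scored against the same baseline, so a correct rollout receives the same positive advantage regardless of which agent produced it. A naive pool like this ignores capability gaps and policy distribution shift between agents, which is the problem the abstract says HACPO's tailored mechanisms are meant to address.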

Zhixia Zhang, Zixuan Huang, Gongxun Li, Huaiyang Wang, Chengyi Yuan, Xin Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban • 2026

Related benchmarks

Task                    Dataset                    Metric            Result  Rank
Mathematical Reasoning  MATH 500                   Accuracy          94.8    442
Mathematical Reasoning  MATH                       Accuracy          94.3    338
Mathematical Reasoning  Olympiad                   Accuracy          73.2    90
Mathematical Reasoning  OlympiadBench              Accuracy          46.7    82
Mathematical Reasoning  AMC 23                     Accuracy          95      81
Mathematical Reasoning  AIME 2025                  Acc               32.3    81
Mathematical Reasoning  Math Benchmarks Aggregate  Accuracy (Avg)    63      62
Mathematical Reasoning  Minerva                    Accuracy (@avg1)  42.3    57