
Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

About

While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain "closed-world" systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human-agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.
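
The inner-loop idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a hypothetical reward of 1 for a correct final answer minus a fixed penalty when the agent defers to the human, and it uses the standard GRPO step of normalizing each rollout's reward against the mean and standard deviation of its sampling group. The function names, the penalty value `defer_cost`, and the example rollouts are all illustrative assumptions; the abstract does not specify the actual reward.

```python
from statistics import mean, pstdev

def cost_aware_reward(correct: bool, deferred: bool, defer_cost: float = 0.3) -> float:
    """Hypothetical reward: 1 for a correct answer, minus a fixed cost for asking the human."""
    return (1.0 if correct else 0.0) - (defer_cost if deferred else 0.0)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each reward standardized within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled rollouts for the same problem, as (correct, deferred) pairs.
# These outcomes are made up for illustration.
group = [(True, False), (True, True), (False, False), (True, True)]
rewards = [cost_aware_reward(c, d) for c, d in group]
advantages = group_relative_advantages(rewards)

for (c, d), r, a in zip(group, rewards, advantages):
    print(f"correct={c!s:5} deferred={d!s:5} reward={r:.2f} advantage={a:+.3f}")
```

Under these assumptions, a rollout that solves the problem autonomously receives the largest advantage, a correct answer obtained by deferring receives a smaller one, and an autonomous failure receives a negative one, so the policy is pushed to defer only when deferring is worth its cost.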

Wei Yang, Defu Cao, Jiacheng Pang, Muyan Weng, Yan Liu • 2026

Related benchmarks

Task                      Dataset    Result           Rank
Language Understanding    MMLU       Accuracy 73.62   825
Math                      GSM8K      Accuracy 0.8986  206
Math Reasoning            AMC        Accuracy 35.83   95
Program synthesis         HumanEval  Accuracy 72.15   32
Quantitative mathematics  AIME       Accuracy 9.37    11
