MagicAgent: Towards Generalized Agent Planning

About

The evolution of Large Language Models (LLMs) from passive text processors to autonomous agents has established planning as a core component of modern intelligence. However, generalized planning remains elusive, hindered not only by the scarcity of high-quality interaction data but also by inherent conflicts across heterogeneous planning tasks. As a result, models excel at isolated tasks yet struggle to generalize, while existing multi-task training attempts suffer from gradient interference. In this paper, we present MagicAgent, a series of foundation models specifically designed for generalized agent planning. We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks, including hierarchical task decomposition, tool-augmented planning, multi-constraint scheduling, procedural logic orchestration, and long-horizon tool execution. To mitigate training conflicts, we propose a two-stage training paradigm comprising supervised fine-tuning followed by multi-objective reinforcement learning over both static datasets and dynamic environments. Empirical results demonstrate that MagicAgent-32B and MagicAgent-30B-A3B deliver superior performance, achieving accuracies of 75.1% on WorFBench, 55.9% on NaturalPlan, 57.5% on τ²-Bench, 86.9% on BFCL-v3, and 81.2% on ACEBench, as well as strong results on our in-house MagicEval benchmarks. These results substantially outperform existing sub-100B models and even surpass leading closed-source models.

Xuhui Ren, Shaokang Dong, Chen Yang, Qing Gao, Yunbin Zhao, Yongsheng Liu, Xinwei Geng, Xiang Li, Demei Yan, Yanqing Li, Chenhao Huang, Dingwei Zhu, Junjie Ye, Boxuan Yue, Yingnan Fu, Mengzhe Lv, Zezeng Feng, Boshen Zhou, Bocheng Wang, Xuanjing Huang, Yu-Gang Jiang, Tao Gui, Qi Zhang, Yunke Zhang • 2026

Related benchmarks

Task                             Dataset                                 Result                    Rank
Cross-Lingual Planning           ACEBench                                Score (En): 78.3          14
Hierarchical Task Decomposition  MagicEval-Plan Condition 3              Step Count: 97.5          14
Hierarchical Task Decomposition  MagicEval-Plan Context Inheritance 3    Step Score: 97.6          14
Multi-Constraint Scheduling      NaturalPlan                             Trip Success Rate: 48.6   14
Tool-Augmented Planning          BFCL V3                                 Live Success Rate: 84.1   14
Tool-Augmented Planning          MagicEval-Tool General                  Name Accuracy: 97.7       14
Tool-Augmented Planning          MagicEval-Tool Dependency               Name Acc: 98.5            14
Tool-Augmented Planning          MagicEval-Tool Condition                Name Accuracy: 95.4       14
Tool-Augmented Planning          MagicEval-Tool Context Inheritance      Name Accuracy: 98.8       14
Workflow Planning                WorFBench                               F1 Chain: 80.3            14
Showing 10 of 13 rows
