AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents

About

While Large Language Model (LLM)-based agents have shown remarkable potential for solving complex tasks, existing systems remain heavily reliant on large-scale models, leaving the capabilities of edge-scale models largely underexplored. In this paper, we present the first systematic study on training agentic models at the 4B-parameter scale. We identify three primary bottlenecks hindering the performance of edge-scale models: catastrophic forgetting during Supervised Fine-Tuning (SFT), sensitivity to reward signal noise during Reinforcement Learning (RL), and reasoning degradation caused by redundant information in long-context scenarios. To address these issues, we propose AgentCPM-Explore, a compact 4B agent model with high knowledge density and strong exploration capability. We introduce a holistic training framework featuring parameter-space model fusion, reward signal denoising, and contextual information refinement. Through deep exploration, AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet or DeepSeek-v3.2 on five benchmarks. Notably, AgentCPM-Explore achieves 97.09% accuracy on GAIA text-based tasks under pass@64. These results provide compelling evidence that the bottleneck for edge-scale models is not their inherent capability ceiling, but rather their inference stability. Based on our well-established training framework, AgentCPM-Explore effectively unlocks the significant, yet previously underestimated, potential of edge-scale models.
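The pass@64 figure in the abstract can be read with a small sketch of how a pass@k score is commonly computed. It assumes that a task counts as solved under pass@k when at least one of k sampled attempts is correct; the paper's exact evaluation protocol is not given on this page, so the function name `pass_at_k` and the toy data below are purely illustrative.

```python
from typing import Sequence


def pass_at_k(attempts: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks with at least one correct answer among the first k attempts.

    `attempts[i][j]` is True if the j-th sampled attempt on task i was correct.
    This is an assumed, simplified definition, not the paper's evaluation code.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not attempts:
        raise ValueError("need at least one task")
    solved = sum(1 for task in attempts if any(task[:k]))
    return solved / len(attempts)


# Hypothetical toy data: 4 tasks, 4 sampled attempts each.
toy = [
    [False, True, False, False],
    [False, False, False, False],
    [True, True, False, True],
    [False, False, False, True],
]

print(pass_at_k(toy, 1))  # 0.25
print(pass_at_k(toy, 4))  # 0.75
```

On this toy data the single-attempt score is much lower than the score over four attempts. A gap of that kind is what the abstract interprets as limited inference stability rather than a capability ceiling.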

Haotian Chen, Xin Cong, Shengda Fan, Yuyang Fu, Ziqin Gong, Yaxi Lu, Yishan Li, Boye Niu, Chengjun Pan, Zijun Song, Huadong Wang, Yesai Wu, Yueying Wu, Zihao Xie, Yukun Yan, Zhong Zhang, Yankai Lin, Zhiyuan Liu, Maosong Sun • 2026

Related benchmarks

Task | Dataset | Result | Rank
General AI Assistant Task Completion | GAIA Text-Only | Accuracy 0.639 | 15
Deep Information Search and Synthesis | xbench DeepSearch | Score 70 | 14
Web Browsing Competition | Browse Comp | Score 24.1 | 14
Expert-Level Question Answering | Humanity's Last Exam | Accuracy 19.1 | 14
Web Navigation Question Answering | WebWalker QA | Accuracy 68.1 | 13
Web Browsing Competition (Chinese) | Browse Comp ZH | Score 29.1 | 13
Fact Retrieval and Analysis | FRAMES | Accuracy 82.7 | 9
Agent Capability Evaluation | SEAL 0 | Score 40.5 | 9