
ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization

About

Large Language Model (LLM)-based web agents excel at knowledge-intensive tasks but face a fundamental conflict between the need for extensive exploration and the constraints of limited context windows. Current solutions typically rely on architectural modifications, e.g., internal memory tokens, which break compatibility with pre-existing agents and necessitate costly end-to-end retraining. To overcome these limitations, we introduce ReSum, a lightweight, plug-and-play paradigm that enables unbounded exploration by periodically invoking an external tool to condense interaction histories into compact summaries. Although this paradigm functions without training, standard agents are not inherently aligned to reason over such compressed contexts. To bridge this gap, we propose ReSum-GRPO, which adapts Group Relative Policy Optimization (GRPO) via advantage broadcasting to propagate final rewards across segmented trajectories, enabling credit assignment over long horizons. Extensive experiments show that ReSum achieves a 4.5% improvement over ReAct in training-free settings, with ReSum-GRPO yielding a further 8.2% gain. Notably, with only 1K training samples, a ReSum-enhanced 30B agent performs competitively with leading open-source models, demonstrating ReSum's effectiveness.
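The two mechanisms described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The names `run_resum`, `broadcast_advantages`, `step_fn`, `summarize_fn`, the character-based budget, and the `ANSWER:` stop marker are all assumptions made for illustration. The advantage formula is the usual group-normalized GRPO form, and how the paper normalizes it is also an assumption here.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class Segment:
    """One stretch of a trajectory between two summarization calls."""
    context: str                      # what the agent saw when the segment began
    steps: List[str] = field(default_factory=list)


def run_resum(
    question: str,
    step_fn: Callable[[str], str],        # hypothetical agent: history -> next action
    summarize_fn: Callable[[str], str],   # hypothetical external summary tool
    budget_chars: int,                    # stand-in for a token budget (assumption)
    max_steps: int,
) -> List[Segment]:
    """Roll out one trajectory, restarting from a summary when the budget is hit."""
    history = question
    segments = [Segment(context=history)]
    for _ in range(max_steps):
        action = step_fn(history)
        segments[-1].steps.append(action)
        history += "\n" + action
        if action.startswith("ANSWER:"):
            break
        if len(history) > budget_chars:
            # Condense the interaction history and continue from the compact state.
            history = question + "\n[summary] " + summarize_fn(history)
            segments.append(Segment(context=history))
    # Drop a trailing segment that was opened but never used.
    return [s for s in segments if s.steps]


def broadcast_advantages(
    group_rewards: List[float],
    group_segments: List[List[Segment]],
    eps: float = 1e-6,
) -> List[List[float]]:
    """Compute one group-normalized advantage per trajectory and copy it
    to every segment of that trajectory."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    result = []
    for reward, segments in zip(group_rewards, group_segments):
        advantage = (reward - mu) / (sigma + eps)
        result.append([advantage] * len(segments))
    return result
```

In this sketch every segment of a trajectory receives the same advantage, so a segment that ends in a summary rather than an answer still gets a learning signal from the final reward.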

Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, Jingren Zhou • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Interactive Tool-Use Agent Performance | tau2-Bench | Retail Performance Score | 70.4 | 102
Agentic Web Browsing | Browsecomp | Pass@1 | 18.8 | 47
Agentic Web Browsing | BrowseComp-ZH | Pass@1 | 27.3 | 44
Multi-turn tool-use interaction | Tau-Bench | Retail Success Rate | 69.6 | 35
Deep Research | Browsecomp | Pass@1 | 50.9 | 33
Deep Research | xbench | Accuracy | 11 | 30
Clinical Decision-Making | MIMIC Common IV (test) | Diagnoses Error | 0.1753 | 28
Long-context Reasoning | OOLONG trec_coarse | Score | 46 | 28
General AI Assistant Tasks | GAIA | Pass@1 Score | 51.5 | 26
Multi-turn tool-use interaction | VitaBench | Delivery Score | 53.8 | 20
Showing 10 of 25 rows
