
Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

About

We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from trajectories, while concatenating all trajectories exceeds the model's context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods, by up to 5.3% absolute on average and 10.3% on two deep research tasks, while adding minimal overhead, as the aggregation cost remains bounded by a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.
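
To make the idea of "trajectories as an environment" concrete, here is a minimal Python sketch. It is not the authors' implementation: the names `Trajectory`, `TrajectoryEnvironment`, `list_candidates`, `search`, and `majority_vote` are hypothetical, and plain substring matching stands in for whatever search tool AggAgent actually uses. It only illustrates the contrast the abstract draws between aggregating final answers alone and inspecting trajectory content on demand.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One parallel rollout: its step texts and its final answer (hypothetical schema)."""
    steps: list[str]
    final_answer: str


class TrajectoryEnvironment:
    """Hypothetical wrapper that exposes parallel rollouts through small tools."""

    def __init__(self, trajectories: list[Trajectory]):
        self.trajectories = trajectories

    def list_candidates(self) -> list[tuple[int, str]]:
        """Return (rollout index, final answer) pairs without any step text."""
        return [(i, t.final_answer) for i, t in enumerate(self.trajectories)]

    def search(self, query: str, max_hits: int = 5) -> list[tuple[int, int, str]]:
        """Case-insensitive substring search over the steps of all rollouts.

        Returns at most max_hits tuples of (rollout index, step index, step text),
        so the caller never has to load every trajectory into context at once.
        """
        needle = query.lower()
        hits: list[tuple[int, int, str]] = []
        for i, trajectory in enumerate(self.trajectories):
            for j, step in enumerate(trajectory.steps):
                if needle in step.lower():
                    hits.append((i, j, step))
                    if len(hits) == max_hits:
                        return hits
        return hits


def majority_vote(env: TrajectoryEnvironment) -> str:
    """Baseline that looks only at final answers and ignores trajectory content."""
    counts = Counter(answer for _, answer in env.list_candidates())
    return counts.most_common(1)[0][0]


# Toy rollouts invented for illustration; they are not data from the paper.
env = TrajectoryEnvironment([
    Trajectory(["search: capital of Australia",
                "page says Canberra is the capital"], "Canberra"),
    Trajectory(["search: largest city in Australia",
                "page says Sydney is the largest city"], "Sydney"),
    Trajectory(["guessing from memory"], "Sydney"),
])

voted = majority_vote(env)             # "Sydney": the vote follows the two unsupported rollouts
evidence = env.search("capital")       # two hits, both from rollout 0
```

In this toy case the answer-only vote picks the majority answer, while a targeted search surfaces the one rollout whose steps actually address the question. An aggregation agent would call such tools over several turns, which is what keeps its cost within a single agentic rollout.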

Yoonsang Lee, Howard Yen, Xi Ye, Danqi Chen • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Web Browsing | Browsecomp | Accuracy 71.33 | 52 |
| Logical reasoning | HLE | Accuracy 0.6 | 46 |
| Medical Reasoning | HealthBench Hard | Accuracy 28.06 | 41 |
| BrowseComp-Plus | BrowseComp+ | Accuracy 74.67 | 25 |
| HLE | HLE | Accuracy 54.19 | 25 |
| Long-horizon agentic task | Browsecomp | Performance 71.33 | 24 |
| Long-horizon agentic task | BrowseComp+ | Performance 77.33 | 24 |
| Long-horizon agentic task | HLE | Performance 60 | 24 |
| ResearchRubrics | RESEARCHRUBRICS | Accuracy 49.36 | 19 |
| DeepSearchQA | DeepSearchQA | Accuracy 66 | 19 |
Showing 10 of 19 rows