Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

About

We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from trajectories, while concatenating all trajectories exceeds the model's context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods, by up to 5.3% absolute on average and 10.3% on two deep research tasks, while adding minimal overhead, as the aggregation cost remains bounded by a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.

Yoonsang Lee, Howard Yen, Xi Ye, Danqi Chen • 2026
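
The abstract describes treating parallel trajectories as an environment that an aggregation agent explores through lightweight tools. The sketch below illustrates that idea only. It is not the authors' implementation: the names `Trajectory`, `TrajectoryEnvironment`, `list_candidates`, `search`, and `read_step` are hypothetical, and plain substring matching stands in for whatever search the real tools perform.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One parallel rollout: its tool-augmented steps and its final answer."""
    steps: list[str]
    final_answer: str


class TrajectoryEnvironment:
    """Hypothetical environment exposing rollouts through small tools.

    An aggregation agent would call these tools on demand instead of
    reading every trajectory in full, which keeps its context small.
    """

    def __init__(self, trajectories: list[Trajectory]):
        self.trajectories = trajectories

    def list_candidates(self) -> list[tuple[int, str]]:
        """Return (trajectory index, final answer) for every rollout."""
        return [(i, t.final_answer) for i, t in enumerate(self.trajectories)]

    def search(self, query: str, max_hits: int = 5) -> list[tuple[int, int, str]]:
        """Case-insensitive substring search over all steps (an assumption).

        Returns up to max_hits tuples of
        (trajectory index, step index, step text).
        """
        needle = query.lower()
        hits: list[tuple[int, int, str]] = []
        for i, traj in enumerate(self.trajectories):
            for j, step in enumerate(traj.steps):
                if needle in step.lower():
                    hits.append((i, j, step))
                    if len(hits) >= max_hits:
                        return hits
        return hits

    def read_step(self, traj_index: int, step_index: int) -> str:
        """Return the text of a single step from a single rollout."""
        return self.trajectories[traj_index].steps[step_index]


# Toy data, invented purely for illustration.
env = TrajectoryEnvironment([
    Trajectory(
        ["search: capital of Australia", "page says Canberra is the capital"],
        "Canberra",
    ),
    Trajectory(
        ["search: largest city in Australia", "page says Sydney is the largest city"],
        "Sydney",
    ),
])

candidates = env.list_candidates()   # compare final answers first
evidence = env.search("capital")     # then look for supporting steps
```

In this toy setup the two rollouts disagree, and the search tool lets the aggregator check which trajectory actually contains evidence for its answer before committing to one. Because each tool returns a small slice, the aggregator never needs all trajectories in context at once.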

Related benchmarks

Task	Dataset	Result
Web Browsing	Browsecomp	Accuracy 71.33	68
Logical reasoning	HLE	Accuracy 0.6	62
Long-horizon agentic task	HLE	Performance 60	41
Medical Reasoning	HealthBench Hard	Accuracy 28.06	41
BrowseComp-Plus	BrowseComp+	Accuracy 74.67	25
HLE	HLE	Accuracy 54.19	25
Long-horizon agentic task	Browsecomp	Performance 71.33	24
Long-horizon agentic task	BrowseComp+	Performance 77.33	24
ResearchRubrics	RESEARCHRUBRICS	Accuracy 49.36	19
DeepSearchQA	DeepSearchQA	Accuracy 66	19

Showing 10 of 19 rows
