Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

About

Existing Multi-Agent Systems (MAS) typically rely on homogeneous model configurations, failing to exploit the diverse expertise inherent in different post-trained architectures. We propose Team-of-Thoughts, a heterogeneous MAS framework that treats diverse models as specialized tools within an orchestrator-driven paradigm. Team-of-Thoughts introduces two novel components: (1) Orchestrator Calibration, which identifies models with superior coordination and synthesis capabilities, and (2) Agent Self-Assessment, a protocol where tool agents profile their own domain-specific strengths to guide selection. At inference, the orchestrator dynamically activates the most compatible agents based on these profiles to maximize capability coverage. Across five mathematical reasoning and code generation benchmarks, Team-of-Thoughts consistently outperforms individual models and existing MAS baselines. Notably, on AIME24 and LiveCodeBench, Team-of-Thoughts achieves 96.00% and 77.91% accuracy, respectively, significantly improving over homogeneous role-play baselines (80.00% and 65.93%).

Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao • 2026
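
The abstract says the orchestrator activates the most compatible agents based on self-assessed profiles to maximize capability coverage, but it does not specify the selection rule. The sketch below is one plausible reading, not the authors' method: it assumes each profile is a mapping from domain to a self-assessed score in [0, 1], and it picks agents greedily by marginal coverage gain. The names `select_agents`, `coverage`, the agent names, and the scores are all hypothetical.

```python
from typing import Dict, List

# Hypothetical self-assessed profiles: agent name -> {domain: score in [0, 1]}.
Profiles = Dict[str, Dict[str, float]]


def coverage(team: List[str], profiles: Profiles, domains: List[str]) -> float:
    """Sum, over domains, of the best self-assessed score available in the team."""
    return sum(
        max((profiles[agent].get(domain, 0.0) for agent in team), default=0.0)
        for domain in domains
    )


def select_agents(profiles: Profiles, domains: List[str], k: int) -> List[str]:
    """Greedily add the agent with the largest marginal coverage gain (assumed rule)."""
    team: List[str] = []
    candidates = sorted(profiles)  # sorted names give deterministic tie-breaking
    while len(team) < k:
        remaining = [a for a in candidates if a not in team]
        if not remaining:
            break
        best = max(remaining, key=lambda a: coverage(team + [a], profiles, domains))
        if coverage(team + [best], profiles, domains) <= coverage(team, profiles, domains):
            break  # no remaining agent improves coverage
        team.append(best)
    return team


# Illustrative data only; these scores are made up for the example.
profiles = {
    "agent_a": {"algebra": 0.9, "geometry": 0.4, "coding": 0.3},
    "agent_b": {"algebra": 0.5, "geometry": 0.8, "coding": 0.2},
    "agent_c": {"algebra": 0.2, "geometry": 0.3, "coding": 0.9},
}
domains = ["algebra", "geometry", "coding"]
team = select_agents(profiles, domains, k=2)
print(team)
```

With these made-up scores, the first pick is the agent with the highest total, and the second pick is the one that fills the largest remaining gap (coding) rather than the one with the next-highest total. A real orchestrator would presumably condition the choice on the task at hand; that part is left out here.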

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	AIME 2024 (test)	Accuracy 96.67	209
Code Generation	HumanEval+ (test)	--	132
Code Generation	MBPP Plus (test)	Accuracy 83.6	89
Code Generation	LiveCodeBench 2025/01/01 - 2025/05/01 v6 (test)	Accuracy 72.53	9
