
Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration

About

With the rapid advancement of post-training techniques for reasoning and information seeking, large language models (LLMs) can incorporate a large quantity of retrieved knowledge to solve complex tasks. However, the limited context window of LLMs obstructs scaling the amount of external knowledge input, preventing further improvement. Existing context window extension methods inevitably cause information loss. LLM-based multi-agent methods have emerged as a new paradigm for handling massive input in a distributed manner, but we identify two core bottlenecks in existing agent orchestration designs. In this work, we develop a multi-agent framework, ExtAgents, to overcome these bottlenecks and enable better scalability in inference-time knowledge integration without longer-context training. Benchmarked on our enhanced multi-hop question answering test, ∞Bench+, and on other public test sets including long survey generation, ExtAgents significantly improves performance over existing non-training methods given the same amount of external knowledge input, regardless of whether that input falls within or exceeds the context window. Moreover, the method maintains efficiency due to high parallelism. We believe further study of the coordination of LLM agents on increasing external knowledge input could benefit real-world applications.

Zijun Liu, Zhennan Wan, Peng Li, Ming Yan, Fei Huang, Yang Liu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multi-hop QA | HotpotQA (test val) | F1 Score | 59.7 | 11 |
| Multi-hop Question Answering | HotpotQA | Helmet Score | 1.86 | 11 |
| Multi-hop QA | En.QA | F1 | 38.2 | 8 |
| Multi-hop QA | Zh.QA | F1 Score | 48.2 | 8 |
| Multi-hop Question Answering | Zh.QA | Helmet Correctness Score | 1.1 | 8 |
| Multi-hop Question Answering | En.QA | Helmet Correctness Score | 1.2 | 8 |
| Question Answering | ∞Bench En.QA | F1 Score | 42.1 | 7 |
| Question Answering | ∞Bench Zh.QA | F1 Score | 49.1 | 7 |
| Long survey generation | AutoSurvey | LLM-as-a-Judge Score (1-10) | 7.63 | 2 |