Recursive Language Models

About

We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context and coding scaffolds (e.g., on GPT-5 by a median across the evaluated benchmarks of $26\%$ against compaction, $130\%$ against CodeAct with sub-calls, and $13\%$ against Claude Code) across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first model around the RLM. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by $28.3\%$ on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.
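To make the idea of recursive calls over snippets of a prompt concrete, here is a minimal sketch. It is not the authors' implementation: in an RLM the model itself writes code to inspect and split the prompt, whereas this sketch uses a fixed halving policy. The function `recursive_answer`, its parameters, and the prompt templates are all hypothetical, and `llm` stands for any callable that maps a prompt string to a response string.

```python
from typing import Callable


def recursive_answer(
    query: str,
    context: str,
    llm: Callable[[str], str],
    max_chars: int = 2000,   # assumed per-call context budget, in characters
    depth: int = 0,
    max_depth: int = 4,      # assumed recursion limit
) -> str:
    """Answer `query` over `context`, recursing on halves when it is too long."""
    # Base case: the snippet fits the budget, or the depth limit is reached.
    # At the depth limit the snippet is truncated, which silently drops text.
    if len(context) <= max_chars or depth >= max_depth:
        return llm(f"Context:\n{context[:max_chars]}\n\nQuestion: {query}")

    # Decompose: a fixed midpoint split stands in for model-chosen splits.
    # A naive split like this can cut through a relevant span.
    mid = len(context) // 2
    left = recursive_answer(query, context[:mid], llm, max_chars, depth + 1, max_depth)
    right = recursive_answer(query, context[mid:], llm, max_chars, depth + 1, max_depth)

    # Combine: one more call that only sees the two partial answers.
    return llm(f"Partial answers:\n1. {left}\n2. {right}\n\nQuestion: {query}")
```

No single call in this sketch receives more than `max_chars` characters of the original context, so the total input length is bounded by the recursion depth rather than by the model's context window.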

Alex L. Zhang, Tim Kraska, Omar Khattab • 2025

Related benchmarks

Task	Dataset	Result
Long-context Question Answering	LongBench (test)	--	69
Long-context Reasoning	OOLONG	Accuracy 63.8	37
Long-context Reasoning	OOLONG trec_coarse	Score 53	28
Coding Question Answering	CodeQA	Accuracy 62.1	27
Semantic Needle-In-A-Haystack	S-NIAH	Accuracy 52.4	27
Long-context Reasoning (Pairs)	OOL-Pairs	Accuracy 42.7	27
Code Question Answering	CodeQA	Latency (s) 98.7	27
Long-context Reasoning	OOLONG	Latency (s) 108.2	27
Long-context Reasoning	OOL-Pairs	Latency (s) 156.4	27
Long-context Retrieval	S-NIAH	Latency (s) 86.3	27

Showing 10 of 39 rows
