Recursive Language Models
About
We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context scaffolds across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first natively recursive language model. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by $28.3\%$ on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.
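To make the idea of recursing over snippets of a prompt concrete, here is a minimal sketch. It is not the authors' implementation: it hard-codes a split-in-half decomposition, whereas the paradigm described above lets the model itself decide programmatically how to examine and split the prompt. The names `recursive_answer`, `toy_llm`, and `max_chars` are hypothetical, and the `llm` argument stands for any function that maps a prompt string to a completion string.

```python
from typing import Callable

# Assumption: an "LLM" here is any callable from prompt text to completion text.
LLM = Callable[[str], str]


def recursive_answer(llm: LLM, query: str, context: str, max_chars: int = 2000) -> str:
    """Answer `query` over `context`, recursing while the context exceeds `max_chars`.

    Illustrative only: a fixed divide-and-conquer stand-in for the
    model-driven decomposition described in the abstract.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")

    # Base case: the snippet fits the assumed budget, so query the model directly.
    if len(context) <= max_chars:
        return llm(f"Context:\n{context}\n\nQuestion: {query}")

    # Recursive case: split the context in half and answer over each half.
    # Both halves are strictly shorter than `context`, so the recursion terminates.
    mid = len(context) // 2
    partials = [
        recursive_answer(llm, query, part, max_chars)
        for part in (context[:mid], context[mid:])
    ]

    # Combine step: ask the model to merge the partial answers.
    merged = "\n".join(f"Partial answer {i + 1}: {p}" for i, p in enumerate(partials))
    return llm(f"Combine these partial answers.\n{merged}\n\nQuestion: {query}")


def toy_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; it only reports the prompt size."""
    return f"(saw {len(prompt)} characters)"


if __name__ == "__main__":
    long_context = "lorem ipsum " * 1000
    print(recursive_answer(toy_llm, "What is this text about?", long_context, max_chars=500))
```

In this sketch no single model call ever sees more than `max_chars` characters of the original context, which is why the total input can exceed what one call could hold. The price is more calls: the number of base-case calls grows roughly linearly with the input length.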
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Long-context Question Answering | LongBench (test) | -- | -- | 69 |
| Long-context Reasoning | OOLONG | Accuracy | 63.8 | 37 |
| Long-context Reasoning | OOLONG trec_coarse | Score | 53 | 28 |
| Coding Question Answering | CodeQA | Accuracy | 62.1 | 27 |
| Semantic Needle-In-A-Haystack | S-NIAH | Accuracy | 52.4 | 27 |
| Long-context reasoning (Pairs) | OOL-Pairs | Accuracy | 42.7 | 27 |
| Code Question Answering | CodeQA | Latency (s) | 98.7 | 27 |
| Long-context Reasoning | OOLONG | Latency (s) | 108.2 | 27 |
| Long-context Reasoning | OOL-Pairs | Latency (s) | 156.4 | 27 |
| Long-context retrieval | S-NIAH | Latency (s) | 86.3 | 27 |