Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

About

We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
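The core idea, one shared block applied repeatedly so that depth becomes a test-time choice, can be illustrated with a toy sketch. This is not the authors' implementation: the vector size, the `tanh` update, the zero initial state, the re-injection of the input at every step, and all names (`recurrent_block`, `unroll`, `make_weights`) are assumptions made for illustration only.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def recurrent_block(state, embedded, w_state, w_input):
    """Apply the shared block once: mix the latent state with the embedded input."""
    pre = [a + b for a, b in zip(matvec(w_state, state), matvec(w_input, embedded))]
    return [math.tanh(p) for p in pre]


def unroll(embedded, w_state, w_input, num_steps):
    """Iterate the same block num_steps times; no new parameters are added per step."""
    state = [0.0] * len(embedded)  # assumption: start from an all-zero latent state
    for _ in range(num_steps):
        state = recurrent_block(state, embedded, w_state, w_input)
    return state


def make_weights(dim, scale, rng):
    """Hypothetical random weights standing in for trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]


rng = random.Random(0)
dim = 4
w_state = make_weights(dim, 0.2, rng)
w_input = make_weights(dim, 0.5, rng)
embedded = [0.3, -0.1, 0.8, 0.5]  # stand-in for an embedded token

# Same weights, different test-time depth.
shallow = unroll(embedded, w_state, w_input, num_steps=4)
deep = unroll(embedded, w_state, w_input, num_steps=32)
print("4 steps: ", [round(x, 4) for x in shallow])
print("32 steps:", [round(x, 4) for x in deep])
```

The parameter count is the same for `shallow` and `deep`; only the number of block applications, and therefore the compute spent, changes. In this toy setting the weights are deliberately small so that the latent state settles as steps increase, which is a property of the sketch and not a claim about the trained model.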

Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein · 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Commonsense Reasoning | WinoGrande | Accuracy | 59.4 | 1442 |
| Language Modeling | WikiText | PPL | 41.31 | 740 |
| Commonsense Reasoning | HellaSwag | Accuracy | 65.2 | 711 |
| Multitask Language Understanding | MMLU | Accuracy | 31.4 | 520 |
| Mathematical Reasoning | SVAMP | Accuracy | 54.8 | 403 |
| Language Modeling | Wiki | Perplexity (PPL) | 12.14 | 298 |
| Reading Comprehension | BoolQ | Accuracy | 69.8 | 228 |
| Language Modeling | The Pile | Perplexity | 6.29 | 129 |
| Mathematical Reasoning | GSM8K | EM | 32.6 | 123 |
| Reading Comprehension | DROP | F1 Score | 17.8 | 96 |
Showing 10 of 31 rows
