
Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

About

Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed: standard objectives such as Group Relative Policy Optimization (GRPO) assign credit only to the final latent state, creating a fundamental mismatch with the model's internal computation. To resolve this, we introduce RLTT (Reward Latent Thought Trajectories), a reinforcement learning framework that distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. In extensive experiments with Ouro-2.6B-Thinking under identical training and inference conditions, RLTT yields substantial improvements over GRPO on challenging mathematical reasoning benchmarks, improving accuracy by +14.4% on MATH-500, +16.6% on AIME24, and +10.0% on BeyondAIME. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs.

Jonathan Williams, Esin Tureci • 2026
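
As a rough illustration of the mechanism the abstract describes, the minimal PyTorch sketch below contrasts GRPO-style credit assignment (advantage applied only to the final latent state) with trajectory-level credit in the spirit of RLTT (the same group-relative advantage spread across every loop iteration). The names here (`trajectory_loss`, `logps_per_loop`) and the uniform weighting over loop steps are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Group-relative advantage: normalize each rollout's scalar reward
    # against the mean and std of its sampling group (as in GRPO).
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def final_state_loss(logps_per_loop: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # GRPO-style credit: only the log-probs tied to the final latent
    # state (last loop iteration) are weighted by the advantage.
    adv = grpo_advantages(rewards)                 # (G,)
    return -(logps_per_loop[:, -1] * adv).mean()

def trajectory_loss(logps_per_loop: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # Trajectory-level credit in the spirit of RLTT: the same advantage
    # is spread over every loop iteration's latent state, so intermediate
    # latent "thought" steps also receive reward signal. The uniform
    # spread is an assumption for illustration.
    adv = grpo_advantages(rewards)                 # (G,)
    return -(logps_per_loop * adv[:, None]).mean()

# logps_per_loop: (G, T) summed sequence log-probs decoded after each of
# the T loop iterations, for a group of G sampled rollouts.
G, T = 8, 4
logps_per_loop = torch.randn(G, T, requires_grad=True)
rewards = torch.randint(0, 2, (G,)).float()        # e.g. answer correctness
print(final_state_loss(logps_per_loop, rewards).item())
print(trajectory_loss(logps_per_loop, rewards).item())
```

In this sketch both losses use the same group-normalized advantage; the only difference is which latent steps receive gradient, which is precisely the mismatch the abstract attributes to GRPO.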

Related benchmarks

Task                              Dataset      Metric        Result   Rank
Code Generation                   MBPP         Accuracy (%)  64.6     146
Mathematical Reasoning            AIME24       Accuracy      33.3     130
Mathematical Reasoning            MATH 500     Accuracy      86       106
Scientific Reasoning              GPQA         Accuracy      38.4     55
Reasoning                         ARC-C        --            --       42
Mathematical Reasoning            Beyond AIME  Accuracy      16       32
Multi-domain Question Answering   MMLU-ST      Accuracy      89.6     8
