
Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

About

Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed: standard objectives such as Group Relative Policy Optimization (GRPO) assign credit only to the final latent state, creating a fundamental mismatch with the model's internal computation. To resolve this, we introduce RLTT (Reward Latent Thought Trajectories), a reinforcement learning framework that distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. In extensive experiments with Ouro-2.6B-Thinking under identical training and inference conditions, RLTT yields substantial improvements over GRPO on challenging mathematical reasoning benchmarks, improving accuracy by +14.4% on MATH-500, +16.6% on AIME24, and +10.0% on BeyondAIME. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs.

Jonathan Williams, Esin Tureci • 2026
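
As a rough illustration of the mechanism the abstract describes, the minimal PyTorch sketch below contrasts GRPO-style credit assignment (advantage applied only to the final latent state) with trajectory-level credit in the spirit of RLTT (the same group-relative advantage spread across every loop iteration). The names here (`trajectory_loss`, `logps_per_loop`) and the uniform weighting over loop steps are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Group-relative advantage: normalize each rollout's scalar reward
    # against the mean and std of its sampling group (as in GRPO).
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def final_state_loss(logps_per_loop: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # GRPO-style credit: only the log-probs tied to the final latent
    # state (last loop iteration) are weighted by the advantage.
    adv = grpo_advantages(rewards)                 # (G,)
    return -(logps_per_loop[:, -1] * adv).mean()

def trajectory_loss(logps_per_loop: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # Trajectory-level credit in the spirit of RLTT: the same advantage
    # is spread over every loop iteration's latent state, so intermediate
    # latent "thought" steps also receive reward signal. The uniform
    # spread is an assumption for illustration.
    adv = grpo_advantages(rewards)                 # (G,)
    return -(logps_per_loop * adv[:, None]).mean()

# logps_per_loop: (G, T) summed sequence log-probs decoded after each of
# the T loop iterations, for a group of G sampled rollouts.
G, T = 8, 4
logps_per_loop = torch.randn(G, T, requires_grad=True)
rewards = torch.randint(0, 2, (G,)).float()        # e.g. answer correctness
print(final_state_loss(logps_per_loop, rewards).item())
print(trajectory_loss(logps_per_loop, rewards).item())
```

In this sketch both losses use the same group-normalized advantage; the only difference is which latent steps receive gradient, which is precisely the mismatch the abstract attributes to GRPO.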

Related benchmarks

Task                              Dataset      Metric        Result   Rank
Code Generation                   MBPP         Accuracy (%)  64.6     146
Mathematical Reasoning            AIME24       Accuracy      33.3     130
Mathematical Reasoning            MATH 500     Accuracy      86       106
Scientific Reasoning              GPQA         Accuracy      38.4     55
Reasoning                         ARC-C        --            --       42
Mathematical Reasoning            Beyond AIME  Accuracy      16       32
Multi-domain Question Answering   MMLU-ST      Accuracy      89.6     8
