Hybrid Latent Reasoning via Reinforcement Learning

About

Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefit from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offer insights for future work in latent reasoning.

Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang• 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH	Accuracy58.6	338
Multi-hop Question Answering	Bamboogle	Exact Match29.6	128
Scientific Reasoning	ARC Challenge	Accuracy0.82	115
Open-domain Question Answering	TriviaQA	EM59.3	88
Open-domain Question Answering	Natural Questions (NQ)	Exact Match (EM)37.8	82
Mathematical Reasoning	MATH500	Accuracy60.2	41
Scientific Reasoning	MMLU STEM	Accuracy59	27
Mathematical Reasoning	GSM8K	Pass@185.97	21
Mathematical Reasoning	AMC23	Pass@147.5	21
Mathematical Reasoning	Average Reasoning Benchmarks	Pass@140.02	21

Showing 10 of 17 rows

Other info

Follow for update

@wizwand_team Discord