
TTRL: Test-Time Reinforcement Learning

About

This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge is estimating rewards during inference without access to ground-truth information. Although this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 211% on AIME 2024 using only unlabeled test data. Furthermore, although TTRL is supervised only by the maj@n metric, it consistently surpasses the maj@n upper limit of the initial model and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight its potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL

Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, Bowen Zhou • 2025
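
The abstract says that majority voting over sampled outputs can serve as a reward signal when no ground-truth label is available. The following is a minimal sketch of that idea, not the authors' implementation: the function name `majority_vote_rewards`, the binary 0/1 reward rule, and the assumption that final answers have already been extracted as strings are all illustrative choices that the abstract does not specify.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote_rewards(answers: List[str]) -> Tuple[str, List[float]]:
    """Hypothetical helper: pick the most frequent answer as a pseudo-label
    and give each sampled answer a reward of 1.0 if it matches, else 0.0."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns the answer with the highest count; ties are
    # resolved in favor of the answer that appeared first in the input.
    pseudo_label, _ = counts.most_common(1)[0]
    # Assumed reward rule: agreement with the majority answer.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Five answers sampled for one unlabeled question (made-up values).
samples = ["42", "41", "42", "7", "42"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

In an RL loop, rewards like these would stand in for the label-based rewards that are unavailable at test time. How the rewards feed into the policy update is outside what the abstract describes.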

Related benchmarks

Task                          Dataset          Result          Rank
Mathematical Reasoning        MATH             Accuracy 45.11  882
Instruction Following         AlpacaEval 2.0   --              507
Mathematical Reasoning        MATH 500         Accuracy 84.3   442
Mathematical Reasoning        AIME 2024        Accuracy 33     370
Medical Question Answering    MedMCQA          Accuracy 58.1   346
Mathematical Reasoning        MATH             Accuracy 47.2   338
Multi-hop Question Answering  HotpotQA (test)  --              255
Mathematical Reasoning        MATH 500         pass@1 85.7     239
Instruction Following         AlpacaEval       --              227
Mathematical Reasoning        AMC              Accuracy 65.95  221
Showing 10 of 112 rows
