Generalist Reward Models: Found Inside Large Language Models

About

The alignment of Large Language Models (LLMs) is critically dependent on reward models trained on costly human preference data. While recent work explores bypassing this cost with AI feedback, these methods often lack a rigorous theoretical foundation. In this paper, we discover that a powerful generalist reward model is already latently present within any LLM trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is theoretically equivalent to a reward function learned through offline inverse reinforcement learning. This connection allows us to directly elicit a high-quality reward signal from a base (pre-trained or supervised fine-tuned) model without any further training. Critically, we also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model. To the best of our knowledge, this is the first theoretical proof of the effectiveness of reinforcement learning for LLMs. Our experiments validate this theory, demonstrating that our method not only outperforms existing LLM-as-a-judge approaches but can also surpass explicitly trained reward models. These findings suggest that the reward modeling stage can be replaced by a principled method of eliciting the knowledge already captured during pre-training, heralding a more efficient, powerful, and scalable paradigm for aligning LLMs as well as multi-modal models.
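The abstract does not give the formula for the endogenous reward, so the following is only a minimal sketch under stated assumptions. It assumes the standard inverse soft Bellman form used in soft-Q inverse reinforcement learning, with the model's next-token logits treated as Q-values, temperature 1, no discounting, deterministic transitions (the next state is the prefix plus the chosen token), and a terminal value of zero. The function name `endogenous_rewards` and the toy random logits are hypothetical; a real setup would take the logits from a language model's forward pass.

```python
import numpy as np


def logsumexp(x):
    """Numerically stable log(sum(exp(x))) for a 1-D array."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))


def endogenous_rewards(logits, tokens):
    """Per-token rewards from next-token logits (assumed form, see lead-in).

    logits: array of shape (T, V); logits[t] are the next-token logits
            at the prefix of length t, read here as Q(s_t, .).
    tokens: length-T sequence of chosen token ids a_0 .. a_{T-1}.

    Assumed rule: r_t = Q(s_t, a_t) - V(s_{t+1}),
    with V(s) = logsumexp(Q(s, .)) and V(terminal) = 0.
    """
    logits = np.asarray(logits, dtype=float)
    T = len(tokens)
    rewards = np.empty(T)
    for t in range(T):
        q = logits[t, tokens[t]]
        # Soft value of the next state; zero once the sequence has ended.
        v_next = logsumexp(logits[t + 1]) if t + 1 < T else 0.0
        rewards[t] = q - v_next
    return rewards


# Toy data standing in for a model's logits (hypothetical values).
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))
tokens = [3, 1, 4, 1, 5]
r = endogenous_rewards(logits, tokens)

# Log-probabilities of the chosen tokens under softmax(logits).
logp = np.array([logits[t, a] - logsumexp(logits[t]) for t, a in enumerate(tokens)])
```

Under these assumptions the rewards telescope: their sum equals the sequence log-probability plus the soft value of the initial state, so `r.sum()` matches `logp.sum() + logsumexp(logits[0])`. This is one way to see why no additional training is needed to read such a signal off a base model.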

Yi-Chen Li, Tian Xu, Yang Yu, Xuqin Zhang, Xiong-Hui Chen, Zhongxiang Ling, Ningjing Chao, Lei Yuan, Zhi-Hua Zhou • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | Minerva | Pass@1 Accuracy | 42.6 | 289 |
| Mathematical Reasoning | AMC | Accuracy (ACC) | 60.9 | 215 |
| Mathematical Reasoning | AMC'23 (test) | Accuracy | 30 | 152 |
| Mathematical Reasoning | Olympiad | Accuracy | 0.468 | 134 |
| Mathematical Reasoning | AIME 24 | Pass@1 Accuracy | 30.5 | 128 |
| Mathematical Reasoning | AMC23 (test) | Pass@1 | 10 | 61 |
| Mathematical Reasoning | Math Reasoning Suite Average | Average Accuracy | 48.9 | 49 |
| Mathematical Reasoning | AIME 25 | Accuracy | 25.4 | 48 |
| Process-level Error Localization | PROCESSBENCH | GSM8K Accuracy | 35 | 44 |
| Mathematical Reasoning | MATH 500 | Pass@4 | 60.6 | 20 |
Showing 10 of 20 rows
