Prompt-Level Reward Specifications for Open-Ended Post-Training

About

Open-ended post-training benefits from rewards that make prompt-specific success conditions explicit, rather than relying only on post-hoc scalar scores. In instruction following, writing, and decision-support tasks, response quality depends on local requirements, holistic preferences, and explicit constraints, but existing reward methods often leave these criteria implicit or cover only narrowly verifiable cases. We propose a prompt-level reward specification framework that separates reward specification from reward computation. Given only prompts, our framework constructs reusable task-adaptive rubrics and executable hard-constraint checkers offline, making reward criteria explicit before training and reusable across rollouts. At scoring time, artifact-anchored rubric and code scores are combined with an independent global score for residual holistic quality, yielding a normalized hybrid reward over requirement satisfaction, holistic quality, and deterministic constraints. The framework requires no human preference annotations, reference answers, or a separately trained reward model. Experiments show that the resulting reward improves offline RM-style response ranking and supports online reinforcement learning across multiple open-ended benchmarks. Ablations further show that rubrics, global scoring, and executable verification provide complementary supervision.

Zijun Weng, Xiaohui Hu, Shuangyong Song, Yongxiang Li, Kaidong Yu, Xuanjing Huang • 2026
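The abstract does not spell out how the rubric, code, and global scores are combined, so the following is only a minimal Python sketch under stated assumptions. It assumes that each prompt gets an offline spec made of weighted rubric criteria plus executable checker functions. It also assumes that per-criterion rubric scores in [0, 1] and a global score on a 0-10 scale come from an LLM judge, which is stubbed out here as plain numbers. Finally, it assumes the three components are merged by a normalized weighted average with illustrative weights. The names `RubricItem`, `PromptSpec`, `rubric_score`, `code_score`, `hybrid_reward` and all default weights are hypothetical, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RubricItem:
    """Hypothetical rubric criterion: a natural-language requirement plus an importance weight."""
    criterion: str
    weight: float


@dataclass
class PromptSpec:
    """Hypothetical prompt-level reward spec, built once offline and reused for every rollout."""
    rubric: List[RubricItem]
    # Executable hard-constraint checkers: each takes a response and returns pass/fail.
    checkers: List[Callable[[str], bool]] = field(default_factory=list)


def rubric_score(spec: PromptSpec, item_scores: List[float]) -> float:
    """Weight-averaged rubric satisfaction; item_scores in [0, 1] are assumed to come from an LLM judge."""
    if len(item_scores) != len(spec.rubric):
        raise ValueError("need exactly one score per rubric item")
    total = sum(item.weight for item in spec.rubric)
    if total == 0:
        return 0.0
    return sum(item.weight * s for item, s in zip(spec.rubric, item_scores)) / total


def code_score(spec: PromptSpec, response: str) -> float:
    """Fraction of deterministic checkers the response passes (1.0 when the prompt has none)."""
    if not spec.checkers:
        return 1.0
    return sum(1 for check in spec.checkers if check(response)) / len(spec.checkers)


def hybrid_reward(
    spec: PromptSpec,
    response: str,
    item_scores: List[float],
    global_score: float,
    global_max: float = 10.0,
    w_rubric: float = 0.4,
    w_global: float = 0.3,
    w_code: float = 0.3,
) -> float:
    """Normalized weighted mix of rubric, global (holistic) and code scores; weights are illustrative."""
    g = min(max(global_score / global_max, 0.0), 1.0)  # clip the judge's global score into [0, 1]
    r = rubric_score(spec, item_scores)
    c = code_score(spec, response)
    return (w_rubric * r + w_global * g + w_code * c) / (w_rubric + w_global + w_code)


# Usage: an offline spec for "Write a haiku about autumn in under 20 words without the word 'leaf'."
spec = PromptSpec(
    rubric=[RubricItem("Evokes autumn imagery", 2.0), RubricItem("Follows a 5-7-5 syllable shape", 1.0)],
    checkers=[lambda r: len(r.split()) < 20, lambda r: "leaf" not in r.lower()],
)
response = "Crisp amber light\nmaples let go one by one\nthe path turns golden"
print(round(hybrid_reward(spec, response, item_scores=[1.0, 0.5], global_score=8.0), 3))  # → 0.873
```

In this sketch the code component is a pass fraction, so a response that breaks one of two constraints keeps partial credit. A stricter variant would zero the reward whenever any checker fails. The source does not say which behavior the framework uses.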

Related benchmarks

Task	Dataset	Metric	Result
Reward Modeling	RM-Bench	--	--	137
Writing	WritingBench	Score	81.4	104
Reward Modeling	RewardBench 2	Precise IF Score	71	90
Instruction Following	IFEval	Genuine-Followup Rate	87.5	65
Instruction Following	IFEval	--	--	49
Creative Writing	Creative Writing v3	Overall Rubric Score	83.3	44
General Language Capability	Aggregate IFEval, IFBench, Arena-Hard-v2.0, Creative Writing v3, WritingBench	Average Score	71.9	11
Pairwise Preference Comparison	150 prompt-response pairs	Win Rate	63.3333	9
Instruction Following	IFBench	Pr. (S)	57.3	8
Open-ended generation	Arena-Hard-v2.0	Score	47.8	8
