Prompt-Level Reward Specifications for Open-Ended Post-Training
About
Open-ended post-training benefits from rewards that make prompt-specific success conditions explicit, rather than relying only on post-hoc scalar scores. In instruction following, writing, and decision-support tasks, response quality depends on local requirements, holistic preferences, and explicit constraints, but existing reward methods often leave these criteria implicit or cover only narrowly verifiable cases. We propose a prompt-level reward specification framework that separates reward specification from reward computation. Given only prompts, our framework constructs reusable task-adaptive rubrics and executable hard-constraint checkers offline, making reward criteria explicit before training and reusable across rollouts. At scoring time, artifact-anchored rubric and code scores are combined with an independent global score for residual holistic quality, yielding a normalized hybrid reward over requirement satisfaction, holistic quality, and deterministic constraints. The framework requires no human preference annotations, reference answers, or a separately trained reward model. Experiments show that the resulting reward improves offline RM-style response ranking and supports online reinforcement learning across multiple open-ended benchmarks. Ablations further show that rubrics, global scoring, and executable verification provide complementary supervision.
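The separation of offline specification from scoring-time computation can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the `PromptSpec` container, the `hybrid_reward` function, the default weights, and the weighted-average combination rule are all hypothetical. The abstract states only that rubric, code, and global scores are combined into a normalized hybrid reward. The sketch also assumes that rubric and global scores arrive from some judge already scaled to [0, 1].

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class PromptSpec:
    """Hypothetical container for artifacts built offline from the prompt alone."""
    rubric: List[str]                      # task-adaptive rubric items
    checkers: List[Callable[[str], bool]]  # executable hard-constraint checks


def hybrid_reward(
    response: str,
    spec: PromptSpec,
    rubric_scores: Sequence[float],
    global_score: float,
    weights: Tuple[float, float, float] = (0.4, 0.3, 0.3),  # assumed, not from the paper
) -> float:
    """Combine rubric, code, and global scores into one reward in [0, 1].

    rubric_scores holds one judge score in [0, 1] per rubric item;
    global_score is an independent holistic score in [0, 1].
    """
    if len(rubric_scores) != len(spec.rubric):
        raise ValueError("expected one score per rubric item")

    w_rubric, w_code, w_global = weights
    parts = [(w_global, global_score)]

    if spec.rubric:
        # Requirement satisfaction: mean over rubric items.
        parts.append((w_rubric, sum(rubric_scores) / len(rubric_scores)))

    if spec.checkers:
        # Deterministic constraints: fraction of checkers that pass.
        passed = sum(1 for check in spec.checkers if check(response))
        parts.append((w_code, passed / len(spec.checkers)))

    # Normalize by the weights of the components that are actually present.
    total_weight = sum(w for w, _ in parts)
    return sum(w * s for w, s in parts) / total_weight


# The spec is built once per prompt and reused across rollouts.
spec = PromptSpec(
    rubric=["states a deadline", "uses a polite tone"],
    checkers=[
        lambda r: len(r.split()) <= 50,     # length constraint
        lambda r: r.strip().endswith("."),  # formatting constraint
    ],
)
reward = hybrid_reward(
    "Please send the report by Friday.",
    spec,
    rubric_scores=[1.0, 0.5],
    global_score=0.8,
)
```

Renormalizing over the components that are present keeps the reward in [0, 1] for prompts that have no executable constraints, which is one plausible reading of "normalized"; other schemes, such as gating the reward on hard-constraint failures, would be equally consistent with the abstract.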
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Reward Modeling | RM-Bench | -- | -- | 137 |
| Writing | WritingBench | Score | 81.4 | 74 |
| Instruction Following | IFEval | Genuine-Followup Rate | 87.5 | 65 |
| Instruction Following | IFEval | -- | -- | 45 |
| Reward Modeling | RewardBench 2 | Precise IF Score | 71 | 41 |
| Creative Writing | Creative Writing v3 | Overall Rubric Score | 83.3 | 32 |
| General Language Capability | Aggregate IFEval, IFBench, Arena-Hard-v2.0, Creative Writing v3, WritingBench | Average Score | 71.9 | 11 |
| Pairwise Preference Comparison | 150 prompt-response pairs | Win Rate | 63.3333 | 9 |
| Instruction Following | IFBench | Pr. (S) | 57.3 | 8 |
| Open-ended generation | Arena-Hard V2.0 | Score | 47.8 | 8 |