
Prompt-Level Reward Specifications for Open-Ended Post-Training

About

Open-ended post-training benefits from rewards that make prompt-specific success conditions explicit, rather than relying only on post-hoc scalar scores. In instruction following, writing, and decision-support tasks, response quality depends on local requirements, holistic preferences, and explicit constraints, but existing reward methods often leave these criteria implicit or cover only narrowly verifiable cases. We propose a prompt-level reward specification framework that separates reward specification from reward computation. Given only prompts, our framework constructs reusable task-adaptive rubrics and executable hard-constraint checkers offline, making reward criteria explicit before training and reusable across rollouts. At scoring time, artifact-anchored rubric and code scores are combined with an independent global score for residual holistic quality, yielding a normalized hybrid reward over requirement satisfaction, holistic quality, and deterministic constraints. The framework requires no human preference annotations, reference answers, or a separately trained reward model. Experiments show that the resulting reward improves offline RM-style response ranking and supports online reinforcement learning across multiple open-ended benchmarks. Ablations further show that rubrics, global scoring, and executable verification provide complementary supervision.
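The abstract describes the hybrid reward only at a high level, so the following is a minimal sketch of how a prompt-level specification and its scoring step might fit together, not the authors' implementation. The names `RewardSpec`, `rubric_score`, `code_score`, and `hybrid_reward`, the weighted-average aggregation, and the default weights are all assumptions made for illustration. The rubric judgments and the global score are taken as given inputs in [0, 1]; in practice they would presumably come from an LLM judge.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class RewardSpec:
    """Hypothetical prompt-level specification, built offline from the prompt alone."""
    rubric: Dict[str, float]  # criterion name -> importance weight
    checkers: List[Callable[[str], bool]] = field(default_factory=list)  # hard constraints


def rubric_score(spec: RewardSpec, judgments: Dict[str, float]) -> float:
    """Weighted mean of per-criterion judgments (each assumed to lie in [0, 1])."""
    total = sum(spec.rubric.values())
    if total == 0:
        return 0.0
    return sum(w * judgments.get(name, 0.0) for name, w in spec.rubric.items()) / total


def code_score(spec: RewardSpec, response: str) -> float:
    """Fraction of executable hard-constraint checkers that the response passes."""
    if not spec.checkers:
        return 1.0  # no constraints to violate
    return sum(1.0 for check in spec.checkers if check(response)) / len(spec.checkers)


def hybrid_reward(
    spec: RewardSpec,
    response: str,
    judgments: Dict[str, float],
    global_score: float,
    weights: Tuple[float, float, float] = (0.4, 0.3, 0.3),  # assumed, not from the paper
) -> float:
    """Normalized combination of rubric, global, and code scores (assumed linear form)."""
    w_rubric, w_global, w_code = weights
    combined = (
        w_rubric * rubric_score(spec, judgments)
        + w_global * global_score
        + w_code * code_score(spec, response)
    )
    return combined / (w_rubric + w_global + w_code)


# Example specification for a hypothetical decision-support prompt.
spec = RewardSpec(
    rubric={"covers_three_risks": 2.0, "actionable_recommendation": 1.0},
    checkers=[
        lambda text: len(text.split()) <= 50,               # length limit
        lambda text: text.startswith("Recommendation:"),    # required opening
    ],
)

response = "Recommendation: delay launch. Risks: cost, latency, churn."
reward = hybrid_reward(
    spec,
    response,
    judgments={"covers_three_risks": 1.0, "actionable_recommendation": 0.5},
    global_score=0.8,
)
print(round(reward, 4))
```

Because the specification is constructed once per prompt, the same `spec` object can score every rollout for that prompt; only the judgments and the global score change between responses. Keeping the three components separate also makes it straightforward to ablate one of them by setting its weight to zero.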

Zijun Weng, Xiaohui Hu, Shuangyong Song, Yongxiang Li, Kaidong Yu, Xuanjing Huang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Reward Modeling | RM-Bench | -- | 137
Writing | WritingBench | Score: 81.4 | 74
Instruction Following | IFEval | Genuine-Followup Rate: 87.5 | 65
Instruction Following | IFEval | -- | 45
Reward Modeling | RewardBench 2 | Precise IF Score: 71 | 41
Creative Writing | Creative Writing v3 | Overall Rubric Score: 83.3 | 32
General Language Capability | Aggregate (IFEval, IFBench, Arena-Hard-v2.0, Creative Writing v3, WritingBench) | Average Score: 71.9 | 11
Pairwise Preference Comparison | 150 prompt-response pairs | Win Rate: 63.3333 | 9
Instruction Following | IFBench | Pr. (S): 57.3 | 8
Open-ended generation | Arena-Hard V2.0 | Score: 47.8 | 8
Showing 10 of 11 rows
