
Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks

About

Reinforcement learning (RL) training of large language models (LLMs) on unverifiable tasks is challenging even when a reasonable-quality reference answer is available. We propose a constrained RL training framework that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model's token-level certainty of a reference answer under its chain-of-thought (CoT) prefix, and selectively emphasizes tokens with high cross-rollout variance, which we call reasoning-reflective tokens, that would otherwise be diluted by the bulk of low-variance tokens. The same variance signal also drives a filter that discards queries with insufficient signal for comparative learning. Rubric-gating complements R3 by operationalizing principled task criteria as hard accept/reject checks on final answers. Empirically, across four datasets spanning scientific writing, medicine, legal contracts, and finance, our framework outperforms strong baselines, achieves faster, more sample-efficient learning, and respects feasibility constraints.
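The variance-weighted reward and the query filter described above can be illustrated with a minimal sketch. This is not the authors' implementation. The input layout (`token_probs`), the proportional-to-variance weighting, the threshold `min_total_variance`, and the `gate` helper are all assumptions made for illustration; the abstract does not specify the exact functional forms.

```python
from statistics import pvariance


def r3_scores(token_probs, min_total_variance=1e-4):
    """Score each rollout by variance-weighted certainty of the reference answer.

    token_probs[i][j] is assumed to be the probability the model assigns to
    reference-answer token j when conditioned on the CoT of rollout i.
    Returns one score per rollout, or None if the query is filtered out.
    """
    n_tokens = len(token_probs[0])
    # Cross-rollout variance of each reference token's probability.
    variances = [
        pvariance([row[j] for row in token_probs]) for j in range(n_tokens)
    ]
    total = sum(variances)
    # Query filter (assumed form): with too little variance, the rollouts
    # cannot be meaningfully compared, so the query is discarded.
    if total < min_total_variance:
        return None
    # Assumed weighting: proportional to variance, normalized to sum to 1,
    # so high-variance ("reasoning-reflective") tokens dominate the score.
    weights = [v / total for v in variances]
    return [sum(w * p for w, p in zip(weights, row)) for row in token_probs]


def gate(scores, passes_rubric):
    """Hard rubric gate (assumed form): rejected rollouts get no score."""
    return [s if ok else None for s, ok in zip(scores, passes_rubric)]


# Two rollouts, three reference tokens; only the middle token varies.
scores = r3_scores([[0.9, 0.2, 0.8], [0.9, 0.8, 0.8]])
gated = gate(scores, [True, False])
```

In this toy input the first and third tokens have zero variance across rollouts, so the score is driven entirely by the middle token. A plain average over tokens would shrink the gap between the two rollouts, which is the dilution effect the abstract refers to.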

Yifei Xu, Tusher Chakraborty, Srinagesh Sharma, Leonardo Nunes, Swati Sharma, Kate Drakos Demopulos, Emre Kıcıman, Songwu Lu, Ranveer Chandra • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Financial Question Answering | FinQA | Accuracy | 68.4 | 30 |
| Question Answering | DROP (test) | ROUGE | 74.85 | 12 |
| Question Answering | MedicalQA (test) | ROUGE | 50.52 | 12 |
| Medical Reasoning | RaR Medicine | WR vs. Base | 57.6 | 8 |
| Natural Language Inference | ContractNLI | Macro-F1 | 84.5 | 8 |
| Reasoning evaluation | ParaRev | WR vs. Base | 63.7 | 8 |
