Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks

About

Reinforcement learning (RL) training of large language models (LLMs) on unverifiable tasks is challenging even when a reasonable-quality reference answer is available. We propose a constrained RL training framework that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model's token-level certainty of a reference answer under its chain-of-thought (CoT) prefix, and selectively emphasizes tokens with high cross-rollout variance, which we call reasoning-reflective tokens, that would otherwise be diluted by the bulk of low-variance tokens. The same variance signal also drives a filter that discards queries with insufficient signal for comparative learning. Rubric-gating complements R3 by operationalizing principled task criteria as hard accept/reject checks on final answers. Empirically, across four datasets spanning scientific writing, medicine, legal contracts, and finance, our framework outperforms strong baselines, achieves faster, more sample-efficient learning, and respects feasibility constraints.
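The three mechanisms named in the abstract (a variance-weighted token-level reward, a variance-based query filter, and rubric gating) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract gives no formulas, so the normalized-variance weighting, the `min_var` threshold, and all function names below are assumptions made for illustration. The input `logprobs[g][t]` is assumed to hold the log-probability the model assigns to reference-answer token `t` when conditioned on the CoT of rollout `g`.

```python
from statistics import pvariance


def token_variances(logprobs):
    """Variance of each reference token's log-prob across rollouts (hypothetical helper)."""
    return [pvariance(column) for column in zip(*logprobs)]


def r3_scores(logprobs, eps=1e-8):
    """One score per rollout: variance-weighted mean of reference-token log-probs.

    Assumption: weights are the cross-rollout variances normalized to sum to 1,
    so high-variance ("reasoning-reflective") tokens dominate the score.
    """
    variances = token_variances(logprobs)
    total = sum(variances)
    if total <= eps:
        # No token separates the rollouts; fall back to uniform weights.
        weights = [1.0 / len(variances)] * len(variances)
    else:
        weights = [v / total for v in variances]
    return [sum(w * lp for w, lp in zip(weights, row)) for row in logprobs]


def has_enough_signal(logprobs, min_var=1e-3):
    """Query filter: keep a query only if some token varies across rollouts.

    The threshold value is an illustrative assumption.
    """
    return max(token_variances(logprobs)) >= min_var


def apply_rubric_gate(scores, rubric_pass):
    """Hard accept/reject: rollouts whose final answer fails the rubric get None."""
    return [s if ok else None for s, ok in zip(scores, rubric_pass)]


# Three rollouts, three reference tokens; only the middle token varies.
logprobs = [
    [-0.1, -0.2, -0.1],
    [-0.1, -2.0, -0.1],
    [-0.1, -1.0, -0.1],
]
scores = r3_scores(logprobs)
gated = apply_rubric_gate(scores, [True, False, True])
```

In this toy group the first and third tokens are equally likely under every rollout, so they receive zero weight and the scores are decided entirely by the middle token. A plain average over all tokens would shrink the gap between rollouts, which is the dilution effect the abstract describes.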

Yifei Xu, Tusher Chakraborty, Srinagesh Sharma, Leonardo Nunes, Swati Sharma, Kate Drakos Demopulos, Emre Kıcıman, Songwu Lu, Ranveer Chandra • 2025

Related benchmarks

Task	Dataset	Result
Financial Question Answering	FinQA	Accuracy 68.4	30
Question Answering	DROP (test)	ROUGE 74.85	12
Question Answering	MedicalQA (test)	ROUGE 50.52	12
Medical Reasoning	RaR Medicine	WR vs. Base 57.6	8
Natural Language Inference	ContractNLI	Macro-F1 84.5	8
Reasoning evaluation	ParaRev	WR vs. Base 63.7	8

