
Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models

About

Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the two fundamental dimensions of an evaluation metric: goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary differential equation-based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers' D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.
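The evaluation above reports agreement with human judgments via Somers' D, an ordinal association measure suited to comparing a continuous metric score against ranked quality labels. As a minimal illustration (the data below is invented, not from the paper), SciPy's `scipy.stats.somersd` can compute it directly:

```python
# Illustrative sketch: measuring how well a trace-quality metric agrees
# with ordinal human judgments using Somers' D. The scores and ranks
# here are made-up examples, not results from MarODE.
from scipy.stats import somersd

human_ranks = [1, 2, 2, 3, 4, 5]                 # ordinal human quality labels
metric_scores = [0.10, 0.30, 0.20, 0.50, 0.55, 0.90]  # hypothetical metric output

res = somersd(human_ranks, metric_scores)
print(f"Somers' D = {res.statistic:.4f}, p = {res.pvalue:.4f}")
```

A value near 1 indicates the metric orders traces almost exactly as the human labels do; values near 0 (like several of the baseline results in the table below) indicate weak ordinal agreement.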

Arghodeep Nandi, Ojasva Saxena, Tanmoy Chakraborty • 2026

Related benchmarks

| Task | Dataset | Result (Somers' D) | Rank |
|---|---|---|---|
| Reasoning Quality Correlation Analysis | LIAR | 0.2769 | 45 |
| Reasoning Quality Correlation Analysis | PolitiFact | 0.2895 | 45 |
| Reasoning Quality Evaluation | EntailmentBank | 0.1773 | 15 |
| Reasoning Quality Evaluation | ProofWriter | 0.339 | 15 |
| Reasoning Quality Evaluation | StrategyQA | 0.2735 | 15 |
| Reasoning Quality Evaluation | GSM8K | 0.1858 | 11 |
| Reasoning Quality Correlation Analysis | Synthetic Reasoning Traces Aggregate | 0.2937 | 10 |
