ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning

About

Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality, among other traits, by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human-annotated and six programmatically perturbed diagnostics datasets, covering a diverse set of tasks that require reasoning skills, and show that ROSCOE can consistently outperform baseline metrics.

Olga Golovneva, Moya Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, Asli Celikyilmaz • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Reasoning Quality Correlation Analysis | LIAR | Somers' D: 0.1386 | 45 |
| Reasoning Quality Correlation Analysis | PolitiFact | Somers' D: 0.135 | 45 |
| Reasoning Quality Evaluation | ProofWriter | Somers' D: 0.2114 | 15 |
| Reasoning Quality Evaluation | EntailmentBank | Somers' D: 0.0301 | 15 |
| Reasoning Quality Evaluation | StrategyQA | Somers' D: 0.0887 | 15 |
| Reasoning Quality Evaluation | GSM8K | Somers' D: -0.022 | 11 |
| Reasoning Quality Correlation Analysis | Synthetic Reasoning Traces Aggregate | Somers' D: 0.1187 | 10 |
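
The results above are reported as Somers' D, an ordinal association statistic that ranges from -1 to 1. As a rough illustration of how such a value can be computed, the sketch below implements the standard pairwise definition: concordant pairs minus discordant pairs, divided by the number of pairs that are not tied on the independent variable. The function name `somers_d`, the toy data, and the choice to treat a binary error label as the independent variable are assumptions made for this example and are not taken from the listing above.

```python
from itertools import combinations


def somers_d(x, y):
    """Somers' D of y with respect to x: (C - D) / (pairs not tied on x)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = untied_x = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # pairs tied on x are excluded from the denominator
        untied_x += 1
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # pairs tied on y count toward the denominator only
    if untied_x == 0:
        raise ValueError("all x values are tied; Somers' D is undefined")
    return (concordant - discordant) / untied_x


# Hypothetical toy data: 1 means a human annotator flagged a reasoning error,
# and metric_score is an automatic score for the same reasoning chain.
has_error = [0, 0, 0, 1, 1, 1]
metric_score = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
d = somers_d(has_error, metric_score)
print(f"Somers' D = {d:.3f}")
```

With this orientation, a negative value means that chains flagged as erroneous tend to receive lower scores. A value near zero indicates little ordinal association between the score and the label.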