
Beyond Scalars: Evaluating and Understanding LLM Reasoning via Geometric Progress and Stability

About

Evaluating LLM reliability via scalar probabilities often fails to capture the structural dynamics of reasoning. We introduce TRACED, a framework that assesses reasoning quality through theoretically grounded geometric kinematics. By decomposing reasoning traces into Progress (displacement) and Stability (curvature), we reveal a distinct topological divergence: correct reasoning manifests as high-progress, stable trajectories, whereas hallucinations are characterized by low-progress, unstable patterns (stalled displacement with high curvature fluctuations). Leveraging these signatures, our probabilistic framework achieves competitive performance and superior robustness across diverse benchmarks. Crucially, TRACED bridges geometry and cognition by mapping high curvature to "Hesitation Loops" and displacement to "Certainty Accumulation", offering a physical lens to decode the internal dynamics of machine thought.
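
To make the two quantities concrete, here is a minimal Python sketch of how displacement and curvature could be measured on a reasoning trajectory. It is not the TRACED implementation: the abstract does not give the formulas, so everything here is an assumption, including the hypothetical function name `progress_and_curvature`, the choice to represent a trace as one vector per reasoning step, the use of net displacement divided by path length as the progress score, and the use of the mean turning angle between consecutive steps as a curvature proxy.

```python
import numpy as np


def progress_and_curvature(traj):
    """Illustrative geometry of a trajectory (assumed definitions, not TRACED's).

    traj: array-like of shape (T, d), one point per reasoning step, T >= 3.
    Returns (progress, curvature):
      progress  = net displacement / total path length, in [0, 1]
      curvature = mean turning angle (radians) between consecutive steps
    """
    traj = np.asarray(traj, dtype=float)
    steps = np.diff(traj, axis=0)                  # (T-1, d) step vectors
    step_len = np.linalg.norm(steps, axis=1)       # length of each step
    path_len = step_len.sum()

    # Progress: how far the trajectory ends from its start, relative to
    # how far it travelled in total. A stalled or looping trace scores low.
    net = np.linalg.norm(traj[-1] - traj[0])
    progress = net / path_len if path_len > 0 else 0.0

    # Curvature proxy: angle between each pair of consecutive step vectors.
    a, b = steps[:-1], steps[1:]
    denom = step_len[:-1] * step_len[1:]
    cos = np.einsum("ij,ij->i", a, b) / np.where(denom > 0, denom, 1.0)
    cos = np.where(denom > 0, cos, 1.0)            # zero-length step: no turn
    angles = np.arccos(np.clip(cos, -1.0, 1.0))
    return float(progress), float(angles.mean())


# A straight trajectory versus a closed loop.
straight = [[0, 0], [1, 0], [2, 0], [3, 0]]
loop = [[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]
print(progress_and_curvature(straight))
print(progress_and_curvature(loop))
```

Under these assumed definitions, the straight trajectory has progress 1 and curvature 0, while the closed square has progress 0 and a mean turning angle of π/2, which mirrors the high-progress, stable versus low-progress, unstable contrast described above.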

Xinyan Jiang, Ninghao Liu, Di Wang, Lijie Hu • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Reasoning | MATH | AUROC 0.8495 | 46 |
| Reasoning Quality Assessment | Social-IQA | AUROC 77.94 | 34 |
| Reasoning Quality Assessment | Understanding Fables | AUROC 0.7191 | 32 |
| Reasoning Quality Assessment | GPQA | AUROC 0.83 | 32 |
| Reasoning Quality Assessment | GSM8K | AUROC 80.61 | 32 |
| Reasoning Quality Assessment | TheoremQA | AUROC 0.873 | 32 |
| Reasoning | Theorem (test) | AUROC 87.3 | 2 |
| Reasoning | GSM8K (test) | AUROC 80.6 | 2 |
| Reasoning | MATH (test) | AUROC 0.749 | 2 |
| Reasoning | Fables (test) | AUROC 0.719 | 2 |