The Signal is in the Steps: Local Scoring for Reasoning Data Selection

About

Distilling long-form reasoning from teacher models into smaller students requires selecting which candidate solutions to train on. Recent work argues for selecting the responses to which the student model assigns the highest probability, i.e., favoring solutions that are "natural" to the student. However, we find that this approach works within a single teacher's outputs but fails when scaling to long reasoning traces from multiple diverse teachers. We identify a key cause: this approach scores entire solutions, but students generalize by recombining familiar reasoning steps, not by memorizing complete solutions. Full-trajectory scoring therefore optimizes the wrong target: it rewards global fluency, while the transferable signal lies in local step transitions. We propose Local Average Log Probability (LALP), which scores each reasoning step using only a small window of preceding context, measuring whether each step is justified by its immediate premises rather than whether the full response looks natural to the student. LALP enables two practical use cases: selecting the best teacher before fine-tuning, and curating training data from diverse teacher pools. Across math, coding, and science reasoning tasks, LALP consistently improves accuracy by a large margin when selecting the most natural solutions.
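
To make the idea of local scoring concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that steps are separated by blank lines, that the context window is counted in steps, and that the final score is the mean of per-step average log-probabilities; the abstract does not specify any of these details. The names `split_steps`, `lalp`, `select_top_k`, and `toy_logprob` are hypothetical, and `toy_logprob` is a stand-in for a real student model's token log-probabilities.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical scorer interface: given context tokens and step tokens,
# return one log-probability per step token. In practice this would
# wrap the student model; here it is only an assumed signature.
LogProbFn = Callable[[Sequence[str], Sequence[str]], List[float]]


def split_steps(solution: str) -> List[List[str]]:
    """Assumption: steps are separated by blank lines, tokens by whitespace."""
    return [s.split() for s in solution.split("\n\n") if s.strip()]


def lalp(steps: List[List[str]], logprob_fn: LogProbFn, window: int = 1) -> float:
    """Mean over steps of the average token log-prob of each step,
    where each step is conditioned only on the previous `window` steps."""
    if not steps:
        raise ValueError("solution has no steps")
    step_scores = []
    for i, step in enumerate(steps):
        # Local context: only the last `window` steps, not the full prefix.
        context = [tok for prev in steps[max(0, i - window):i] for tok in prev]
        token_lps = logprob_fn(context, step)
        step_scores.append(sum(token_lps) / len(token_lps))
    return sum(step_scores) / len(step_scores)


def select_top_k(solutions: List[str], logprob_fn: LogProbFn,
                 k: int, window: int = 1) -> List[str]:
    """Keep the k candidate solutions with the highest local score."""
    ranked = sorted(solutions,
                    key=lambda s: lalp(split_steps(s), logprob_fn, window),
                    reverse=True)
    return ranked[:k]


def toy_logprob(context: Sequence[str], step: Sequence[str]) -> List[float]:
    """Toy stand-in for a student model (illustration only): a token is
    'likely' if it already appeared in the context or earlier in the step."""
    seen = set(context)
    out = []
    for tok in step:
        out.append(math.log(0.9) if tok in seen else math.log(0.1))
        seen.add(tok)
    return out


candidates = ["a b\n\nb c\n\nc d", "a b\n\nx y\n\np q"]
best = select_top_k(candidates, toy_logprob, k=1)
```

In this toy setup the first candidate ranks higher because each of its steps reuses a token from the step immediately before it. Setting `window` to the number of steps conditions every step on the entire prefix, which is closer in spirit to the full-trajectory scoring that the abstract argues against.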

Hoang Anh Just, Myeongseob Ko, Ruoxi Jia • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 85.6 | 149
Mathematical Reasoning | OlympiadBench | Accuracy | 67.3 | 81
Mathematical Reasoning | CN Middle School 24 | Accuracy | 83.3 | 51
Science Reasoning | GPQA | GPQA Score | 69.4 | 27
Mathematical Reasoning | OlympiadBench | Accuracy | 50.14 | 18
Mathematical Reasoning | AIME 24 | Accuracy | 61.66 | 16
Mathematical Reasoning | AIME 25 | Accuracy | 49.16 | 16
Mathematical Reasoning | OlympicB | Accuracy | 49.26 | 16
Trajectory quality correlation analysis | Teacher-generated Reasoning Datasets Qwen-2.5-3B student | Spearman Correlation (Abs) | 0.72 | 13
Trajectory quality correlation analysis | Teacher-generated Reasoning Datasets Qwen-2.5-7B student | Spearman Correlation (Abs) | 0.55 | 13
