From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning

About

Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively-inspired framework that explicitly mirrors the two-stage human cognitive process. Specifically, Chain-of-Meta-Thought CoMT focuses supervised learning on abstract reasoning patterns without specific executions, enabling acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and ten benchmarks show 2.10% and 3.86% improvements in-distribution and out-of-distribution respectively over standard methods, while remaining highly robust to variations in teacher model selection, optimization methods, and symbolic perturbations.

Shaojie Wang, Liang Zhang• 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K (test)	Accuracy91.4	954
Mathematical Reasoning	GSM8K	Accuracy90.9	499
Mathematical Reasoning	SVAMP	Accuracy93.4	403
Mathematical Reasoning	SVAMP (test)	Accuracy93.2	293
Mathematical Reasoning	ASDIV	Accuracy0.917	268
Mathematical Reasoning	MAWPS	Accuracy97.9	241
Mathematical Reasoning	TabMWP	Accuracy74.3	203
Mathematical Reasoning	GSM-Hard	Solve Rate67.9	162

Showing 8 of 8 rows

Other info

Follow for update

@wizwand_team Discord