
From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning

About

Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively inspired framework that explicitly mirrors the two-stage human cognitive process. Specifically, Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific executions, enabling acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and eight benchmarks show 2.19% and 4.63% improvements in-distribution and out-of-distribution respectively over standard methods, while reducing training time by 65-70% and token consumption by 50%, demonstrating that aligning post-training with human cognitive principles yields not only superior generalization but also enhanced training efficiency.
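To make the CCRL idea concrete, here is a minimal sketch of what a confidence-aware step reward could look like. This is an illustration of the general principle described in the abstract, not the authors' implementation: the function names (`calibrated_step_reward`, `trajectory_reward`), the linear reward form, and the `penalty_scale` parameter are all assumptions for exposition.

```python
# Hypothetical sketch of a confidence-calibrated reward over intermediate
# reasoning steps, loosely following the CCRL idea from the abstract.
# All names and the reward formulation are illustrative assumptions.
from typing import List


def calibrated_step_reward(confidence: float, correct: bool,
                           penalty_scale: float = 2.0) -> float:
    """Reward one reasoning step.

    Correct steps are rewarded in proportion to their stated confidence;
    incorrect steps are penalized more heavily the more confident the
    model was, discouraging overconfident errors from cascading.
    """
    if correct:
        return confidence
    return -penalty_scale * confidence


def trajectory_reward(confidences: List[float],
                      correctness: List[bool]) -> float:
    """Sum the per-step rewards over a reasoning trajectory."""
    return sum(calibrated_step_reward(c, ok)
               for c, ok in zip(confidences, correctness))


# A trajectory whose final step is a high-confidence mistake is
# penalized harder than one with a low-confidence mistake.
r = trajectory_reward([0.9, 0.4, 0.95], [True, True, False])
print(r)  # 0.9 + 0.4 - 2.0 * 0.95 = -0.6
```

Under this shaping, a policy trained with RL is pushed to keep its stated step confidence calibrated to actual correctness, rather than being rewarded only on the final answer.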

Shaojie Wang, Liang Zhang • 2026

Related benchmarks

Task                    Dataset        Metric      Result  Rank
Mathematical Reasoning  GSM8K (test)   Accuracy    91.4    797
Mathematical Reasoning  SVAMP          Accuracy    93.4    368
Mathematical Reasoning  GSM8K          Accuracy    90.9    351
Mathematical Reasoning  SVAMP (test)   Accuracy    93.2    233
Mathematical Reasoning  ASDIV          Accuracy    91.7    221
Mathematical Reasoning  MAWPS          Accuracy    97.9    219
Mathematical Reasoning  GSM-Hard       Solve Rate  67.9    162
Mathematical Reasoning  TabMWP         Accuracy    74.3    157
