VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision

About

Supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has emerged as a crucial technique for enhancing the reasoning abilities of large language models (LLMs). However, the standard cross-entropy loss treats all tokens equally, ignoring their heterogeneous contributions across a reasoning trajectory. This uniform treatment leads to misallocated supervision and weak generalization, especially in complex, long-form reasoning tasks. To address this, we introduce Variance-Controlled Optimization-based REweighting (VCORE), a principled framework that reformulates CoT supervision as a constrained optimization problem. By adopting an optimization-theoretic perspective, VCORE enables a principled and adaptive allocation of supervision across tokens, thereby aligning the training objective more closely with the goal of robust reasoning generalization. Empirical evaluations demonstrate that VCORE achieves the strongest overall average performance, with especially clear gains on lower-capacity models. Across both in-domain and out-of-domain settings, VCORE achieves substantial performance gains on mathematical and coding benchmarks, using models from the Qwen3 series (4B, 8B, 32B) and LLaMA-3.1-8B-Instruct. Moreover, we show that VCORE serves as a more effective initialization for subsequent reinforcement learning, establishing a stronger foundation for advancing the reasoning capabilities of LLMs. The code will be released at https://github.com/coder-gx/VCORE.
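To make the contrast between uniform and reweighted token supervision concrete, here is a minimal sketch in plain Python. It is not VCORE's weighting rule, which the abstract does not specify. The function names (`token_log_probs`, `weighted_cross_entropy`), the toy logits, and the hand-picked weights are all assumptions for illustration. The sketch only shows that standard cross-entropy is the special case where every token gets the same weight.

```python
import math

def token_log_probs(logits, targets):
    """Log-probability of each target token under a softmax over its logits."""
    out = []
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        out.append(row[t] - log_z)
    return out

def weighted_cross_entropy(logits, targets, weights=None):
    """Weighted average of per-token negative log-likelihoods.

    weights=None means uniform weights, i.e. standard cross-entropy.
    """
    logp = token_log_probs(logits, targets)
    if weights is None:
        weights = [1.0] * len(logp)
    total = sum(weights)
    return -sum(w * lp for w, lp in zip(weights, logp)) / total

# Toy trajectory (assumed data): 3 tokens over a vocabulary of 4.
logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 0.0, 0.0],   # the model is maximally uncertain here
    [-1.0, 3.0, 0.2, 0.1],
]
targets = [0, 2, 1]

uniform = weighted_cross_entropy(logits, targets)
# Hand-picked weights (an assumption) that emphasize the uncertain token.
reweighted = weighted_cross_entropy(logits, targets, weights=[0.5, 2.0, 0.5])
```

With these toy numbers, putting more weight on the uncertain middle token raises the loss relative to the uniform case, so its gradient contribution grows. A principled scheme would choose the weights by solving an optimization problem rather than by hand.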

Xuan Gong, Senmiao Wang, Hanbo Huang, Ruoyu Sun, Shiyu Liang • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy: 94.16	192
Math Reasoning	AIME	Pass@1: 51.67	30
Multi-Task Reasoning	Average (2WikiMultiHop, MMLU, GSM8k) (in-distribution)	Accuracy: 41.29	29
Code Reasoning	LiveCodeBench (LCB)	Pass@1: 35.45	26
Math Reasoning	Olympiad	Accuracy: 68.55	24
Math Reasoning	R-Bench-T Math	Accuracy: 49.91	24
Math Reasoning	SuperGPQA SGPQA-1k Math	Accuracy: 45	24
Multi-Task Reasoning	Average Out-of-Domain	Accuracy (OOD): 45.93	24
Code Reasoning	OJBench	Accuracy: 9.48	24
Code Reasoning	R-Bench-T Code	Accuracy: 45.7	24

Showing 10 of 11 rows
