
Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations

About

The emergence of large reasoning models (LRMs) has transformed Natural Language Processing by excelling in complex tasks such as mathematical problem-solving and code generation. These models leverage chain-of-thought (CoT) processes, enabling them to emulate human-like reasoning strategies. However, the advancement of LRMs is hindered by the lack of comprehensive CoT datasets. Current resources often fail to provide extensive reasoning problems with coherent CoT processes distilled from multiple teacher models and do not account for multifaceted properties describing the internal characteristics of CoTs. To address these challenges, we introduce OmniThought, a large-scale dataset featuring 2 million CoT processes generated and validated by two powerful LRMs as teacher models. Each CoT process in OmniThought is annotated with novel Reasoning Verbosity (RV) and Cognitive Difficulty (CD) scores, which describe the appropriateness of CoT verbosity and cognitive difficulty level for models to comprehend these reasoning processes. We further establish a self-reliant pipeline to curate this dataset. Extensive experiments using Qwen2.5 models of various sizes demonstrate the positive impact of our proposed scores on LRM training effectiveness. Based on the proposed OmniThought dataset, we further train and release a series of high-performing LRMs, specifically equipped with stronger reasoning abilities and optimal CoT output length and difficulty level. Our contributions significantly enhance the development and training of LRMs for solving complex tasks.
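As a rough illustration of how the Reasoning Verbosity (RV) and Cognitive Difficulty (CD) annotations could be used when curating training data, the sketch below filters CoT samples to a band suited to a given student model. This is a minimal sketch under assumptions: the field names (`rv`, `cd`) and the score ranges are hypothetical and not taken from the dataset's actual schema.

```python
# Hypothetical sketch: select CoT samples whose Reasoning Verbosity (RV)
# and Cognitive Difficulty (CD) annotations suit a given student model.
# Field names ("rv", "cd") and the score ranges are assumptions,
# not the dataset's actual schema.

def filter_cot_samples(samples, rv_range=(3, 7), cd_max=6):
    """Keep samples whose RV falls within rv_range and whose CD
    does not exceed the student model's assumed capability (cd_max)."""
    lo, hi = rv_range
    return [s for s in samples
            if lo <= s["rv"] <= hi and s["cd"] <= cd_max]

corpus = [
    {"problem": "2+2?", "cot": "...", "rv": 2, "cd": 1},   # too terse
    {"problem": "AIME geometry", "cot": "...", "rv": 6, "cd": 8},  # too hard
    {"problem": "GSM8K word problem", "cot": "...", "rv": 5, "cd": 4},  # kept
]

selected = filter_cot_samples(corpus, rv_range=(3, 7), cd_max=6)
print(len(selected))  # → 1
```

A smaller student model would lower `cd_max` so it only sees reasoning traces it can comprehend, which mirrors the paper's finding that matching CoT difficulty to model capacity improves training effectiveness.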

Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang • 2025

Related benchmarks

Task                                    | Dataset                                                                        | Result               | Rank
----------------------------------------|--------------------------------------------------------------------------------|----------------------|-----
Mathematical Reasoning                  | OlympiadBench Math                                                             | Accuracy 74.9        | 84
Mathematical Reasoning                  | Omni-MATH                                                                      | Accuracy 59          | 68
Mathematical Reasoning                  | HMMT 2025                                                                      | Accuracy 35.8        | 38
Mathematical Reasoning                  | AIME 2025                                                                      | Accuracy 45.4        | 37
Multi-domain language model evaluation  | ODA benchmark suite (test)                                                     | General Accuracy 55.8| 21
Code Generation                         | Code domain benchmarks                                                         | HumanEval 91.5       | 16
Reasoning                               | Reasoning domain benchmarks (ARC-C, BBH, GPQA, CALM, KOR-BENCH)                | ARC-C Score 93.9     | 16
Mathematical Reasoning                  | Math domain benchmarks (GSM8K, MATH500, Omni-Math, Olympiad, AIME'24), standard (test) | GSM8K Accuracy 94.2 | 16
