Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets
About
The construction of Supervised Fine-Tuning (SFT) datasets is a critical yet under-theorized stage in the post-training of Large Language Models (LLMs): prevalent practices often rely on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. In this report, we propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework built on OpenDataArena (ODA), which leverages value-anchored rankings and multi-dimensional analysis to turn value benchmarking into feedback signals that guide dataset construction. We instantiate this methodology through two new datasets: **ODA-Math-460k**, a specialized mathematics reasoning dataset that uses a novel two-stage difficulty-aware pipeline to achieve State-of-the-Art (SOTA) results on benchmarks such as AIME and HMMT, and **ODA-Mixture (100k & 500k)**, a series of multi-domain instruction datasets built via an "Anchor-and-Patch" strategy that outperform significantly larger open-source baselines. Our empirical results demonstrate that ODA-driven datasets significantly improve both domain-specific reasoning and general utility while achieving superior data efficiency, validating a transition toward data-centric AI in which transparent evaluation serves as the primary engine for engineering high-quality training data.
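To make the two-stage difficulty-aware idea concrete, the Python sketch below shows one plausible shape of such a loop: stage one estimates per-sample difficulty from a reference model's failure rate, and stage two keeps a target difficulty band and ranks the survivors by a value score before taking a fixed budget. This is a minimal sketch under stated assumptions, not the released ODA pipeline; every name in it (`Sample`, `solve_once`, `score_value`, the band and budget defaults) is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    prompt: str
    response: str


def estimate_difficulty(sample: Sample,
                        solve_once: Callable[[str], bool],
                        attempts: int = 4) -> float:
    """Stage 1 (assumption): proxy difficulty as the reference model's
    failure rate over a few sampled attempts at the prompt."""
    failures = sum(not solve_once(sample.prompt) for _ in range(attempts))
    return failures / attempts


def curate(pool: List[Sample],
           solve_once: Callable[[str], bool],
           score_value: Callable[[Sample], float],
           band: Tuple[float, float] = (0.25, 0.90),
           budget: int = 460_000) -> List[Sample]:
    """Stage 2 (assumption): keep samples inside the target difficulty
    band, then rank survivors by an ODA-style value score and take the
    top `budget` items -- the value-anchored selection step."""
    lo, hi = band
    in_band = [s for s in pool
               if lo <= estimate_difficulty(s, solve_once) <= hi]
    in_band.sort(key=score_value, reverse=True)
    return in_band[:budget]
```

The same skeleton closes the loop when `score_value` is refreshed from benchmark feedback after each training round, so that selection pressure tracks measured value rather than fixed heuristics.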
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | OlympiadBench Math | Accuracy | 76.3 | 84 |
| Mathematical Reasoning | Omni-MATH | Accuracy | 66.9 | 68 |
| Mathematical Reasoning | HMMT 2025 | Accuracy | 45.4 | 38 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 63.3 | 37 |
| Multi-domain language model evaluation | ODA benchmark suite (test) | General Accuracy | 71.2 | 21 |
| General Language Understanding and Reasoning | General domain benchmarks (test) | DROP Score | 91.6 | 16 |
| Code Generation | Code domain benchmarks | HumanEval | 91.5 | 16 |
| Mathematical Reasoning | Math domain benchmarks (GSM8K, MATH500, Omni-Math, Olympiad, AIME'24), standard (test) | GSM8K Accuracy | 94.7 | 16 |
| Reasoning | Reasoning domain benchmarks (ARC-C, BBH, GPQA, CALM, KOR-BENCH) | ARC-C Score | 92.2 | 16 |