
Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets

About

The construction of Supervised Fine-Tuning (SFT) datasets is a critical yet under-theorized stage in the post-training of Large Language Models (LLMs): prevalent practices often rely on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. In this report, we propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework using OpenDataArena (ODA), which leverages value-anchored rankings and multi-dimensional analysis to turn value benchmarking into feedback signals that guide dataset construction. We instantiate this methodology through two new datasets: ODA-Math-460k, a specialized mathematics reasoning dataset built with a novel two-stage difficulty-aware pipeline that achieves state-of-the-art (SOTA) results on benchmarks such as AIME and HMMT, and ODA-Mixture (100k & 500k), a series of multi-domain instruction datasets built via an "Anchor-and-Patch" strategy that outperforms significantly larger open-source baselines. Our empirical results demonstrate that ODA-driven datasets significantly improve both domain-specific reasoning and general utility while achieving superior data efficiency, validating a transition toward data-centric AI in which transparent evaluation serves as the primary engine for engineering high-quality training data.
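The abstract names two curation techniques: a two-stage difficulty-aware pipeline (ODA-Math-460k) and an Anchor-and-Patch mixing strategy (ODA-Mixture). As a rough illustration of what such a loop could look like, here is a minimal Python sketch. All names here (the Sample fields, the value and difficulty scores, bucket thresholds, and quota scheme) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    response: str
    domain: str        # e.g. "math", "code", "general" (assumed labels)
    value: float       # value-anchored ranking score from ODA (assumed field)
    difficulty: float  # estimated difficulty in [0, 1] (assumed field)

def two_stage_select(pool: list[Sample], min_value: float,
                     quotas: dict[str, int]) -> list[Sample]:
    """Stage 1: keep only samples above a value threshold.
    Stage 2: fill per-difficulty-bucket quotas, highest value first."""
    stage1 = [s for s in pool if s.value >= min_value]

    def bucket(s: Sample) -> str:
        if s.difficulty < 0.33:
            return "easy"
        if s.difficulty < 0.66:
            return "medium"
        return "hard"

    selected: list[Sample] = []
    for name, quota in quotas.items():
        tier = sorted((s for s in stage1 if bucket(s) == name),
                      key=lambda s: s.value, reverse=True)
        selected.extend(tier[:quota])
    return selected

def anchor_and_patch(anchor: list[Sample], pool: list[Sample],
                     weak_domains: list[str], patch_size: int) -> list[Sample]:
    """Start from a strong anchor set, then 'patch' domains where
    benchmark feedback shows weakness with top-value extra samples."""
    patched = list(anchor)
    for domain in weak_domains:
        extras = sorted((s for s in pool if s.domain == domain),
                        key=lambda s: s.value, reverse=True)
        patched.extend(extras[:patch_size])
    return patched

# Example: bias a math set toward harder problems (AIME/HMMT-style),
# then patch domains flagged as weak by benchmark feedback.
math_set = two_stage_select(pool=[],  # candidate SFT samples
                            min_value=0.7,
                            quotas={"easy": 50_000, "medium": 150_000,
                                    "hard": 260_000})
mixture = anchor_and_patch(anchor=math_set, pool=[],
                           weak_domains=["code", "general"],
                           patch_size=20_000)
```

The design point the sketch tries to capture is the closed loop itself: evaluation scores feed back into selection, rather than the dataset being assembled once by heuristic aggregation.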

Xin Gao, Xiaoyang Wang, Yun Zhu, Mengzhang Cai, Conghui He, Lijun Wu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | OlympiadBench Math | Accuracy | 76.3 | 84 |
| Mathematical Reasoning | Omni-MATH | Accuracy | 66.9 | 68 |
| Mathematical Reasoning | HMMT 2025 | Accuracy | 45.4 | 38 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 63.3 | 37 |
| Multi-domain language model evaluation | ODA benchmark suite (test) | General Accuracy | 71.2 | 21 |
| General Language Understanding and Reasoning | General domain benchmarks (test) | DROP Score | 91.6 | 16 |
| Code Generation | Code domain benchmarks | HumanEval | 91.5 | 16 |
| Mathematical Reasoning | Math domain benchmarks (GSM8K, MATH500, Omni-Math, Olympiad, AIME'24), standard (test) | GSM8K Accuracy | 94.7 | 16 |
| Reasoning | Reasoning domain benchmarks (ARC-C, BBH, GPQA, CALM, KOR-BENCH) | ARC-C Score | 92.2 | 16 |
