Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets
About
The construction of Supervised Fine-Tuning (SFT) datasets is a critical yet under-theorized stage in the post-training of Large Language Models (LLMs): prevalent practices often rely on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. In this report, we propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework built on OpenDataArena (ODA), which leverages value-anchored rankings and multi-dimensional analysis to turn value benchmarking into feedback signals that guide dataset construction. We instantiate this methodology through two new datasets: **ODA-Math-460k**, a specialized mathematics reasoning dataset that uses a novel two-stage difficulty-aware pipeline to achieve State-of-the-Art (SOTA) results on benchmarks such as AIME and HMMT, and **ODA-Mixture (100k & 500k)**, a series of multi-domain instruction datasets built via an "Anchor-and-Patch" strategy that outperform significantly larger open-source baselines. Our empirical results demonstrate that ODA-driven datasets significantly improve both domain-specific reasoning and general utility while achieving superior data efficiency, validating a transition toward data-centric AI in which transparent evaluation serves as the primary engine for engineering high-quality training data.
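To make the two-stage difficulty-aware idea concrete, the Python sketch below shows one plausible shape of such a loop: stage one estimates per-sample difficulty from a reference model's failure rate, and stage two keeps a target difficulty band and ranks the survivors by a value score before taking a fixed budget. This is a minimal sketch under stated assumptions, not the released ODA pipeline; every name in it (`Sample`, `solve_once`, `score_value`, the band and budget defaults) is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    prompt: str
    response: str


def estimate_difficulty(sample: Sample,
                        solve_once: Callable[[str], bool],
                        attempts: int = 4) -> float:
    """Stage 1 (assumption): proxy difficulty as the reference model's
    failure rate over a few sampled attempts at the prompt."""
    failures = sum(not solve_once(sample.prompt) for _ in range(attempts))
    return failures / attempts


def curate(pool: List[Sample],
           solve_once: Callable[[str], bool],
           score_value: Callable[[Sample], float],
           band: Tuple[float, float] = (0.25, 0.90),
           budget: int = 460_000) -> List[Sample]:
    """Stage 2 (assumption): keep samples inside the target difficulty
    band, then rank survivors by an ODA-style value score and take the
    top `budget` items -- the value-anchored selection step."""
    lo, hi = band
    in_band = [s for s in pool
               if lo <= estimate_difficulty(s, solve_once) <= hi]
    in_band.sort(key=score_value, reverse=True)
    return in_band[:budget]
```

The same skeleton closes the loop when `score_value` is refreshed from benchmark feedback after each training round, so that selection pressure tracks measured value rather than fixed heuristics.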
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | OlympiadBench Math | Accuracy | 76.3 | 84 |
| Mathematical Reasoning | Omni-MATH | Accuracy | 66.9 | 68 |
| Mathematical Reasoning | HMMT 2025 | Accuracy | 45.4 | 38 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 63.3 | 37 |
| Multi-domain language model evaluation | ODA benchmark suite (test) | General Accuracy | 71.2 | 21 |
| General Language Understanding and Reasoning | General domain benchmarks (test) | DROP Score | 91.6 | 16 |
| Code Generation | Code domain benchmarks | HumanEval | 91.5 | 16 |
| Mathematical Reasoning | Math domain benchmarks (GSM8K, MATH500, Omni-Math, Olympiad, AIME'24), standard (test) | GSM8K Accuracy | 94.7 | 16 |
| Reasoning | Reasoning domain benchmarks (ARC-C, BBH, GPQA, CALM, KOR-BENCH) | ARC-C Score | 92.2 | 16 |