Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts

About

We present Autonomous Data Selection (AutoDS), a method that leverages base language models themselves as zero-shot "generative classifiers" to automatically curate high-quality mathematical texts. Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model's logits to determine whether a given passage is mathematically informative and educational. By integrating AutoDS into a continual pretraining pipeline, we substantially boost downstream performance on challenging math benchmarks (MATH, GSM8K, and BBH) while using far fewer tokens than previous methods. Empirically, our approach achieves roughly a twofold improvement in pretraining token efficiency over strong baselines, underscoring the potential of self-directed data selection in enhancing mathematical reasoning. We release our curated AutoMathText dataset to facilitate future research in automated domain-specific data curation. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
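The abstract says only that AutoDS scores a passage from a base model's logits. One way such a zero-shot "generative classifier" could work is sketched below, under stated assumptions: the model is prompted with a yes/no question about the passage, and the logits of the two answer tokens are turned into a score with a two-way softmax. The function names, the 0.6 threshold, and the example logits are all hypothetical illustrations, not taken from the paper or its released code.

```python
import math


def yes_no_score(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the YES and NO logits; returns the share assigned to YES.

    Assumption: the two logits come from a base language model asked whether
    a passage is mathematically informative and educational.
    """
    # Subtract the larger logit before exponentiating, for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def select_passages(scored, threshold=0.6):
    """Keep passages whose score is at least `threshold`.

    `scored` is an iterable of (text, logit_yes, logit_no) tuples.
    The default threshold is an arbitrary illustrative value.
    """
    return [text for text, ly, ln in scored if yes_no_score(ly, ln) >= threshold]


# Made-up logits, purely for illustration.
examples = [
    ("Proof that the square root of 2 is irrational", 3.1, 0.4),
    ("Weekend celebrity gossip roundup", -1.2, 2.5),
]
kept = select_passages(examples)
```

Because the score depends only on the difference between the two logits, it needs no labeled data and no separately trained filter, which matches the abstract's description of the approach. In practice the logits would be read from a model's next-token distribution rather than hard-coded.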

Yifan Zhang, Yifan Luo, Yang Yuan, Andrew C Yao • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Mathematical Reasoning | GSM8K | Accuracy | 45.41 | 983
Mathematical Reasoning | MATH | Accuracy | 16.14 | 643
Reasoning | BBH | Accuracy | 58.61 | 507
Commonsense Reasoning | PIQA 1.0 (test) | Accuracy | 82.21 | 48
Commonsense Reasoning | HellaSwag 1.0 (test) | Accuracy | 62.72 | 17
World Knowledge and Reading Comprehension | LM Evaluation Harness NQ, MMLU STEM, ARC, SciQ, LogiQA, BoolQ | NQ Accuracy | 29.06 | 15
Commonsense Reasoning | WinoGrande 1.0 (test) | Accuracy | 0.8003 | 15
