Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts

About

We present Autonomous Data Selection (AutoDS), a method that leverages base language models themselves as zero-shot "generative classifiers" to automatically curate high-quality mathematical texts. Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model's logits to determine whether a given passage is mathematically informative and educational. By integrating AutoDS into a continual pretraining pipeline, we substantially boost downstream performance on challenging math benchmarks (MATH, GSM8K, and BBH) while using far fewer tokens than previous methods. Empirically, our approach achieves roughly a twofold improvement in pretraining token efficiency over strong baselines, underscoring the potential of self-directed data selection in enhancing mathematical reasoning. We release our curated AutoMathText dataset to facilitate future research in automated domain-specific data curation. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
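To make the idea of a logit-based "generative classifier" concrete, here is a minimal sketch of one plausible scoring rule. It is an illustration under stated assumptions, not the authors' implementation. It assumes the base model is prompted with a yes/no question about the passage, and that the next-token logits for a "YES" and a "NO" token are read off and normalized against each other. The function names `lm_score` and `select_passages`, the YES/NO token choice, and the 0.6 threshold are all hypothetical.

```python
import math


def lm_score(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax: share of probability mass on YES versus NO.

    Assumption: the two logits come from a base language model's
    next-token distribution after a yes/no prompt about the passage.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def select_passages(scored, threshold=0.6):
    """Keep passages whose score meets a (hypothetical) threshold.

    `scored` is an iterable of (text, logit_yes, logit_no) tuples.
    """
    return [text for text, ly, ln in scored if lm_score(ly, ln) >= threshold]


if __name__ == "__main__":
    # Made-up logits, for illustration only.
    candidates = [
        ("Proof that sqrt(2) is irrational ...", 3.1, 0.4),
        ("Celebrity gossip roundup ...", -1.2, 2.5),
    ]
    print(select_passages(candidates))
```

Because the score depends only on two logits from a single forward pass, this kind of filter needs neither labeled data nor a separately trained classifier, which matches the motivation stated in the abstract.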

Yifan Zhang, Yifan Luo, Yang Yuan, Andrew C Yao • 2024

Related benchmarks

Task	Dataset	Metric	Result	
Mathematical Reasoning	GSM8K	Accuracy	45.41	1362
Mathematical Reasoning	MATH	Accuracy	16.14	882
Reasoning	BBH	Accuracy	58.61	672
Commonsense Reasoning	PIQA 1.0 (test)	Accuracy	82.21	48
Commonsense Reasoning	HellaSwag 1.0 (test)	Accuracy	62.72	17
World Knowledge and Reading Comprehension	LM Evaluation Harness NQ, MMLU STEM, ARC, SciQ, LogiQA, BoolQ	NQ Accuracy	29.06	15
Commonsense Reasoning	WinoGrande 1.0 (test)	Accuracy	0.8003	15
