SynQuE: Estimating Synthetic Dataset Quality Without Annotations

About

We introduce and formalize the Synthetic Dataset Quality Estimation (SynQuE) problem: ranking synthetic datasets by their expected real-world task performance using only limited unannotated real data. This addresses a critical open challenge in settings where real data is scarce due to collection costs or privacy constraints. We establish the first comprehensive benchmarks for this problem by introducing and evaluating proxy metrics that choose synthetic training data to maximize task performance on real data. These first proxy metrics for SynQuE adapt distribution- and diversity-based distance measures to our context via embedding models. To address the shortcomings of these metrics on complex planning tasks, we propose LENS, a novel proxy that leverages large language model reasoning. Our results show that SynQuE proxies correlate with real task performance across diverse tasks, including sentiment analysis, Text2SQL, web navigation, and image classification, with LENS consistently outperforming the other proxies on complex tasks by capturing nuanced characteristics. For instance, on text-to-SQL parsing, training on the top-3 synthetic datasets selected via SynQuE proxies raises accuracy from 30.4% to 38.4% (+8.1) on average compared to selecting data indiscriminately. This work establishes SynQuE as a practical framework for synthetic data selection under real-data scarcity and motivates future research on foundation-model-based data characterization and fine-grained data selection. We release our code.
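The selection loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the proxy used here (negative Euclidean distance between mean embeddings), the function names, the toy 2-D "embeddings", and the task-performance numbers are all assumptions chosen for illustration. It shows the three steps the abstract names: scoring each synthetic dataset against unannotated real data, keeping the top-k datasets, and checking how well the proxy ranking agrees with real task performance via Spearman correlation.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def proxy_score(synthetic: List[Vector], real: List[Vector]) -> float:
    """Hypothetical proxy: negative distance between mean embeddings.

    Higher means the synthetic set sits closer to the real data.
    This is a stand-in, not one of the paper's metrics.
    """
    return -math.dist(mean_vector(synthetic), mean_vector(real))


def rank_datasets(candidates: Dict[str, List[Vector]], real: List[Vector]) -> List[str]:
    """Return dataset names sorted from best to worst proxy score."""
    scores = {name: proxy_score(emb, real) for name, emb in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy data (assumed): embeddings of unannotated real samples and of
# three candidate synthetic datasets.
real_embeddings = [[0.0, 0.0], [1.0, 1.0]]
candidates = {
    "syn_a": [[0.4, 0.6], [0.6, 0.4]],
    "syn_b": [[2.0, 2.0], [3.0, 3.0]],
    "syn_c": [[1.0, 1.0], [1.0, 2.0]],
}

ranking = rank_datasets(candidates, real_embeddings)
top_k = ranking[:2]  # keep the k best-scoring datasets for training

# Made-up real-task performance per dataset, used only to evaluate the proxy.
task_performance = {"syn_a": 0.40, "syn_b": 0.30, "syn_c": 0.35}
names = list(candidates)
rho = spearman(
    [proxy_score(candidates[n], real_embeddings) for n in names],
    [task_performance[n] for n in names],
)
print(ranking, top_k, rho)
```

In practice the embeddings would come from an embedding model rather than hand-written vectors, and the mean-embedding distance could be swapped for any distribution- or diversity-based measure. The Spearman step needs measured task performance, so it is only available when benchmarking a proxy, not when selecting data.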

Arthur Chen, Victor Zhong • 2025

Related benchmarks

Task	Dataset	Metric	Result
Sentiment Classification	Twitter Financial News (test)	F1 Score	0.52	23
Image Classification	unmet-promise (Split 1)	Task Performance	57.3	9
Text2SQL	BIRD Computer Students	Execution Accuracy	48.3	9
Web Navigation	WebNav	Task Performance	26.5	9
Image Classification	unmet-promise (Split 2)	Accuracy	56.2	9
Image Classification	unmet-promise (Split 3)	Task Performance	60.2	9
Text2SQL	BIRD Movies	Execution Accuracy	44.7	9
Text2SQL	BIRD App Store	Execution Accuracy	36.3	9
Image Classification	ImageNet (Split 2)	Spearman Correlation	0.2	8
Web Navigation	WebVoyager	Spearman Correlation	0.15	8

Showing 10 of 16 rows
