UniPROT: Uniform Prototype Selection via Partial Optimal Transport with Submodular Guarantees

About

Selecting prototypical examples from a source distribution to represent a target data distribution is a fundamental problem in machine learning. Existing subset selection methods often rely on implicit importance scores, which can be skewed towards majority classes and lead to low-quality prototypes for minority classes. We present $\methodprop$, a novel subset selection framework that minimizes the optimal transport (OT) distance between a uniformly weighted prototypical distribution and the target distribution. While intuitive, this formulation leads to a cardinality-constrained maximization of a \emph{super-additive} objective, which is generally intractable to approximate efficiently. To address this, we propose a principled reformulation of the OT marginal constraints, yielding a partial optimal transport-based submodular objective. We prove that this reformulation enables a greedy algorithm with a $(1-1/e)$ approximation guarantee relative to the original super-additive maximization problem. Empirically, we showcase that enforcing uniform prototype weights in UniPROT consistently improves minority-class representation in imbalanced classification benchmarks without compromising majority-class accuracy. In both finetuning and pretraining regimes for large language models under domain imbalance, UniPROT enforces uniform source contributions, yielding robust performance gains. Our results establish UniPROT as a scalable, theoretically grounded solution for uniform-weighted prototype selection. Our code is publicly available at GitHub\footnote{Code: https://github.com/efficiency-learning/UniPROT}

Prateek Chanda, Prayas Agrawal, Karthik S. Gurumoorthy, Ganesh Ramakrishnan, Bamdev Mishra, Pratik Jawanpuria• 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH	Accuracy16.8	882
Mathematical Reasoning	SVAMP	Accuracy86.2	403
Mathematical Reasoning	Mathematics	Accuracy37.2	46
Mathematical Reasoning	MATH	Accuracy38.4	40
Mathematical Reasoning	NUMGLUE	Accuracy68.8	39
Mathematical Reasoning	SIMULEQ	Accuracy68.28	30
Mathematical Reasoning	MATHINSTRUCT Evaluation Suite In-domain and Out-of-domain	GSM8K Accuracy54.89	11
Natural Language Understanding	SuperGLUE	Accuracy (SST2)94.65	6

Showing 8 of 8 rows

Other info

Follow for update

@wizwand_team Discord