
From Points to Coalitions: Hierarchical Contrastive Shapley Values for Prioritizing Data Samples

About

How should we quantify the value of each training example when datasets are large, heterogeneous, and geometrically structured? Classical Data-Shapley answers in principle, but its $O(n!)$ complexity and point-wise perspective are ill-suited to modern scales. We propose Hierarchical Contrastive Data Valuation (HCDV), a three-stage framework that (i) learns a contrastive, geometry-preserving representation, (ii) organizes the data into a balanced coarse-to-fine hierarchy of clusters, and (iii) assigns Shapley-style payoffs to coalitions via local Monte-Carlo games whose budgets are propagated downward. HCDV collapses the factorial burden to $O(T \sum_{l} K_{l}) = O(T K_{\max} \log n)$, rewards examples that sharpen decision boundaries, and regularizes outliers through curvature-based smoothness. We prove that HCDV approximately satisfies the four Shapley axioms with surplus loss $O(\eta \log n)$, enjoys sub-Gaussian coalition deviation $\tilde{O}(1/\sqrt{T})$, and incurs at most $k\,\varepsilon_{\infty}$ regret for top-$k$ selection. Experiments on four benchmarks (tabular, vision, streaming, and a 45M-sample CTR task) plus the OpenDataVal suite show that HCDV lifts accuracy by up to +5 pp, slashes valuation time by up to 100x, and directly supports tasks such as augmentation filtering, low-latency streaming updates, and fair marketplace payouts.

Canran Xiao, Jiabao Dou, Zhiming Lin, Zong Ke, Liwei Hou • 2025
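
Stage (iii) of the abstract, Shapley-style payoffs for coalitions with budgets propagated downward, can be illustrated with a small two-level sketch. This is not the authors' implementation: the contrastive embedding and clustering stages are skipped (clusters are given by hand), the function names `mc_shapley` and `hierarchical_values` are hypothetical, and splitting a cluster's budget in proportion to its members' local Shapley estimates is an assumption made here for illustration.

```python
import random
from typing import Callable, Dict, FrozenSet, Hashable, List, Sequence, Tuple

Utility = Callable[[FrozenSet[Hashable]], float]


def mc_shapley(players: Sequence[Hashable], utility: Utility,
               num_permutations: int, rng: random.Random) -> Dict[Hashable, float]:
    """Permutation-sampling Shapley estimate: average marginal gain of each player."""
    values = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(num_permutations):
        rng.shuffle(order)
        coalition: set = set()
        prev = utility(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = utility(frozenset(coalition))
            values[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: v / num_permutations for p, v in values.items()}


def hierarchical_values(clusters: Dict[Hashable, List[Hashable]], utility: Utility,
                        num_permutations: int, seed: int = 0
                        ) -> Tuple[Dict[Hashable, float], Dict[Hashable, float]]:
    """Two-level valuation: clusters play a game first, then points inside each cluster."""
    rng = random.Random(seed)

    def cluster_utility(ids: FrozenSet[Hashable]) -> float:
        # A coalition of clusters is scored on the union of their points.
        return utility(frozenset(p for c in ids for p in clusters[c]))

    # Coarse level: one Shapley-style budget per cluster.
    budgets = mc_shapley(list(clusters), cluster_utility, num_permutations, rng)

    point_values: Dict[Hashable, float] = {}
    for c, members in clusters.items():
        # Fine level: a local game restricted to the cluster's own points.
        local = mc_shapley(members, utility, num_permutations, rng)
        total = sum(local.values())
        for p in members:
            # Assumed rule: rescale local estimates so they sum to the cluster budget.
            share = local[p] / total if abs(total) > 1e-12 else 1.0 / len(members)
            point_values[p] = budgets[c] * share
    return budgets, point_values


# Toy additive utility: each point contributes a fixed weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
clusters = {"c0": ["a", "b"], "c1": ["c", "d"]}
budgets, point_values = hierarchical_values(
    clusters, lambda s: sum(weights[p] for p in s), num_permutations=20)
```

With an additive utility every marginal contribution is exact, so the point values recover the weights and sum to the utility of the full dataset. With $T$ permutations, the coarse game costs on the order of $T K$ utility calls for $K$ clusters rather than enumerating orderings of all $n$ points.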

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | Fashion MNIST | Accuracy | 89.1 | 16 |
| Noisy label detection | IMDB embedding | NLD | 35 | 12 |
| Noisy label detection | CIFAR10 embedding | NLD | 16 | 12 |
| Dataset Removal | IMDB embedding | DR | 74 | 12 |
| Dataset Addition | bbc-embedding | DA Score | 9 | 12 |
| Noisy label detection | bbc-embedding | NLD Score | 0.21 | 12 |
| Dataset Addition | CIFAR10 embedding | DA Score | 0.12 | 12 |
| Dataset Removal | CIFAR10 embedding | DR | 57 | 12 |
| Dataset Removal | bbc-embedding | DR | 0.86 | 12 |
| Dataset Addition | IMDB embedding | DA Score | 30 | 12 |
Showing 10 of 18 rows
