
Close to Reality: Interpretable and Feasible Data Augmentation for Imbalanced Learning

About

Many machine learning classification tasks involve imbalanced datasets, which are often subject to over-sampling techniques aimed at improving model performance. However, these techniques are prone to generating unrealistic or infeasible samples. Furthermore, they often function as black boxes, lacking interpretability in their procedures. This opacity makes it difficult to assess their effectiveness and make necessary adjustments, and they may ultimately fail to yield significant performance improvements. To bridge this gap, we introduce Decision Predicate Graphs for Data Augmentation (DPG-da), a framework that extracts interpretable decision predicates from trained models to capture domain rules and enforce them during sample generation. This design ensures that over-sampled data remain diverse, constraint-satisfying, and interpretable. In experiments on synthetic and real-world benchmark datasets, DPG-da consistently improves classification performance over traditional over-sampling methods, while guaranteeing logical validity and offering clear, interpretable explanations of the over-sampled data.
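The abstract does not include code, but the core loop it describes — extract interpretable predicates, then accept only synthetic samples that satisfy them — can be sketched. Everything below is an illustrative assumption, not the paper's implementation: simple per-feature range predicates stand in for the paper's decision predicates, and a SMOTE-like interpolation with noise stands in for its sample generator.

```python
import random

def extract_predicates(minority):
    """Derive per-feature [min, max] range predicates from minority samples.
    (A stand-in for predicates extracted from a trained model's decision graph.)"""
    dims = len(minority[0])
    return [(min(x[d] for x in minority), max(x[d] for x in minority))
            for d in range(dims)]

def satisfies(sample, predicates):
    """Check that every feature value falls inside its predicate's range."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(sample, predicates))

def oversample(minority, n_new, predicates, sigma=0.2, seed=0, max_tries=10000):
    """SMOTE-like interpolation between minority pairs, plus Gaussian noise;
    candidates violating any predicate are rejected, so the synthetic data
    stays inside the feasible region the predicates describe."""
    rng = random.Random(seed)
    out, tries = [], 0
    while len(out) < n_new and tries < max_tries:
        tries += 1
        a, b = rng.sample(minority, 2)
        t = rng.random()
        cand = [ai + t * (bi - ai) + rng.gauss(0, sigma)
                for ai, bi in zip(a, b)]
        if satisfies(cand, predicates):  # enforce the extracted domain rules
            out.append(cand)
    return out

# Toy minority class with two features.
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.2]]
preds = extract_predicates(minority)        # [(1.0, 2.0), (2.0, 2.5)]
new_samples = oversample(minority, 5, preds)
```

The rejection step is where interpretability pays off: every synthetic point can be audited against human-readable rules, rather than trusting the generator blindly.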

Matheus Camilo da Silva, Gabriel Gustavo Costanzo, Andrea de Lorenzo, Sylvio Barbon Junior • 2026

Related benchmarks

Task                       Dataset         Metric     Result   Rank
Imbalanced Classification  abalone 19      F1-Score   62.7     25
Imbalanced Classification  Arrhythmia      F1-Score   86       25
Imbalanced Classification  coil 2000       F1-Score   61.9     25
Imbalanced Classification  ecoli           F1-Score   84.2     25
Imbalanced Classification  Isolet          F1-Score   88.4     25
Imbalanced Classification  oil             F1-Score   71.3     25
Imbalanced Classification  optical_digits  F1-Score   93.7     25
Imbalanced Classification  ozone_level     F1-Score   68.2     25
Imbalanced Classification  Scene           F1-Score   65.7     25
Imbalanced Classification  spectrometer    F1-Score   90       25

Showing 10 of 28 rows
