Trajectory-Based Difficulty Scoring for Reliable Learning on Tabular Data
About
Gradient-boosted trees achieve strong performance on tabular data, yet often leave a long tail of poorly predicted instances. We introduce a Trajectory-based Difficulty Score (TDS), an instance-level difficulty estimator for boosted ensembles derived from per-tree cumulative prediction trajectories. For each instance, we compute interpretable trajectory descriptors (e.g., variance, oscillation peaks, sign switches, and tail stability) and train a lightweight regression model to predict held-out loss. An empirical CDF calibrates the resulting signal into a score in $[0,1]$ that supports ranking hard cases. Across diverse tabular benchmarks and ensemble sizes, TDS exhibits strong rank correlation with error and outperforms established instance-hardness and uncertainty baselines on classification, while remaining competitive on regression. We then show how a single difficulty signal improves multiple data mining workflows: difficulty-driven active learning for label-efficient training, difficulty-thresholded selective prediction for improved risk-coverage trade-offs, and TDS-stratified (Mondrian) conformal prediction for more uniform conditional coverage. Finally, clustering high-TDS instances using SHAP attributions reveals coherent failure modes characterized by compact feature-value ranges, supporting error analysis and targeted data acquisition.
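To make the pipeline in the abstract concrete, here is a minimal sketch of two of its steps: computing trajectory descriptors for one instance and calibrating a raw difficulty signal with an empirical CDF. It is an illustration under stated assumptions, not the authors' implementation. The function names, the exact descriptor definitions (zero crossings for sign switches, strict local maxima for oscillation peaks, standard deviation over the last `tail` steps for tail stability), and the `tail` parameter are all hypothetical. The regression model that predicts held-out loss is omitted, so the raw scores passed to the calibrator are placeholder values.

```python
from bisect import bisect_right
from itertools import accumulate
from statistics import pstdev, pvariance


def trajectory_descriptors(increments, tail=3):
    """Summarize one instance's cumulative prediction trajectory.

    `increments` holds the per-tree contributions for a single instance
    (assumed to be already scaled by the learning rate). The descriptor
    definitions below are illustrative assumptions.
    """
    traj = list(accumulate(increments))  # cumulative prediction after each tree
    # Sign switches: how often the cumulative prediction crosses zero.
    signs = [1 if v > 0 else -1 for v in traj if v != 0]
    sign_switches = sum(a != b for a, b in zip(signs, signs[1:]))
    # Oscillation peaks: strict interior local maxima of the trajectory.
    peaks = sum(
        traj[i - 1] < traj[i] > traj[i + 1] for i in range(1, len(traj) - 1)
    )
    return {
        "variance": pvariance(traj),
        "sign_switches": sign_switches,
        "peaks": peaks,
        "tail_std": pstdev(traj[-tail:]),  # low value = stable tail
    }


def make_ecdf(reference_scores):
    """Return a function mapping a raw score to its empirical CDF value in [0, 1]."""
    ordered = sorted(reference_scores)
    n = len(ordered)
    return lambda raw: bisect_right(ordered, raw) / n


# A steadily converging trajectory versus an oscillating one.
easy = trajectory_descriptors([0.5, 0.4, 0.3, 0.2, 0.1])
hard = trajectory_descriptors([0.5, -0.9, 0.8, -0.7, 0.6])

# Placeholder raw scores standing in for the regressor's predicted losses.
ecdf = make_ecdf([0.1, 0.2, 0.4, 0.8])
score = ecdf(0.4)  # fraction of reference scores less than or equal to 0.4
```

Because the calibrated score is a rank within a reference set, it is comparable across models whose raw loss scales differ, which is what makes a single threshold usable for ranking or selective prediction.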
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Active Learning | Multiple Datasets (test) | AULC | 58.78 | 33 |
| Regression | Mean across regression datasets (val) | RMSE | 11.081 | 33 |
| Regression | Mean across regression datasets (test) | AULC | 13.899 | 33 |
| Active Learning | Multiple Datasets (val) | LL | 58.24 | 33 |
| Regression | Multiple Datasets | Pearson r | 0.239 | 15 |
| Regression | Average of Regression Datasets (Adult, WiDS, Bike Sharing, Cal. Housing) (test) | Coverage | 90.5 | 12 |
| Selective Prediction | Classification Datasets Average (test) | NAURC | 71.6 | 12 |
| Classification | Multiple Datasets | Pearson r | 0.375 | 12 |
| Selective Prediction | Regression Datasets Average (test) | NAURC | 0.486 | 9 |
| Classification | Average of Classification Datasets (Adult, WiDS, Bike Sharing, Cal. Housing) (test) | Covariance | 0.904 | 9 |