Trajectory-Based Difficulty Scoring for Reliable Learning on Tabular Data

About

Gradient-boosted trees achieve strong performance on tabular data, yet often leave a long tail of poorly predicted instances. We introduce a Trajectory-based Difficulty Score (TDS), an instance-level difficulty estimator for boosted ensembles derived from per-tree cumulative prediction trajectories. For each instance, we compute interpretable trajectory descriptors (e.g., variance, oscillation peaks, sign switches, and tail stability) and train a lightweight regression model to predict held-out loss. An empirical CDF calibrates the resulting signal into a score in $[0,1]$ that supports ranking hard cases. Across diverse tabular benchmarks and ensemble sizes, TDS exhibits strong rank correlation with error and outperforms established instance-hardness and uncertainty baselines on classification, while remaining competitive on regression. We then show how a single difficulty signal improves multiple data mining workflows: difficulty-driven active learning for label-efficient training, difficulty-thresholded selective prediction for improved risk-coverage trade-offs, and TDS-stratified (Mondrian) conformal prediction for more uniform conditional coverage. Finally, clustering high-TDS instances using SHAP attributions reveals coherent failure modes characterized by compact feature-value ranges, supporting error analysis and targeted data acquisition.
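The trajectory descriptors and the empirical-CDF calibration mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact descriptor definitions, so the ones below (direction reversals for oscillation peaks, zero crossings for sign switches, standard deviation of the last few values for tail stability) are assumptions, and the function names `trajectory_descriptors` and `ecdf_calibrate` are hypothetical. The lightweight regression model that maps descriptors to held-out loss is omitted. The calibrator would be applied to that model's raw output.

```python
from bisect import bisect_right
from statistics import pstdev, pvariance


def trajectory_descriptors(trajectory, tail=3):
    """Summarise one instance's cumulative prediction trajectory.

    trajectory[t] is the ensemble's cumulative prediction (for example the
    raw margin) after the first t + 1 trees. All definitions here are
    illustrative assumptions, not the paper's exact formulas.
    """
    # Per-tree increments of the cumulative prediction.
    steps = [b - a for a, b in zip(trajectory, trajectory[1:])]
    moves = [s for s in steps if s != 0]
    # Oscillation peak: the trajectory reverses direction.
    peaks = sum(1 for a, b in zip(moves, moves[1:]) if a * b < 0)
    # Sign switch: the cumulative prediction crosses zero.
    nonzero = [v for v in trajectory if v != 0]
    switches = sum(1 for a, b in zip(nonzero, nonzero[1:]) if a * b < 0)
    return {
        "variance": pvariance(trajectory),
        "oscillation_peaks": peaks,
        "sign_switches": switches,
        # Lower values mean the last `tail` predictions have settled.
        "tail_std": pstdev(trajectory[-tail:]),
    }


def ecdf_calibrate(reference_scores):
    """Build a function mapping a raw score to its empirical CDF value in [0, 1]."""
    ordered = sorted(reference_scores)
    n = len(ordered)

    def score(raw):
        # Fraction of reference scores that are <= raw.
        return bisect_right(ordered, raw) / n

    return score
```

Under these assumed definitions, a trajectory that alternates around zero, such as `[0.5, -0.4, 0.6, -0.3, 0.4, -0.2]`, yields more peaks, more sign switches, and a larger tail spread than one that converges monotonically. A calibrator built from reference raw scores then places any new raw score on a common $[0,1]$ scale, which is what makes thresholding and ranking comparable across datasets.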

Tomer Lavi, Bracha Shapira, Nadav Rappoport • 2026

Related benchmarks

Task	Dataset	Result
Active Learning	Multiple Datasets (test)	AULC 58.78	33
Regression	Mean across regression datasets (val)	RMSE 11.081	33
Regression	Mean across regression datasets (test)	AULC 13.899	33
Active Learning	Multiple Datasets (val)	LL 58.24	33
Regression	Multiple Datasets	Pearson r 0.239	15
Regression	Average of Regression Datasets (Adult, WiDS, Bike Sharing, Cal. Housing) (test)	Coverage 90.5	12
Selective Prediction	Classification Datasets Average (test)	NAURC 71.6	12
Classification	Multiple Datasets	Pearson r 0.375	12
Selective Prediction	Regression Datasets Average (test)	NAURC 0.486	9
Classification	Average of Classification Datasets (Adult, WiDS, Bike Sharing, Cal. Housing) (test)	Covariance 0.904	9
