Better by Default: Strong Pre-Tuned MLPs and Boosted Trees on Tabular Data

About

For classification and regression on tabular data, the dominance of gradient-boosted decision trees (GBDTs) has recently been challenged by often much slower deep learning methods with extensive hyperparameter tuning. We address this discrepancy by introducing (a) RealMLP, an improved multilayer perceptron (MLP), and (b) strong meta-tuned default parameters for GBDTs and RealMLP. We tune RealMLP and the default parameters on a meta-train benchmark with 118 datasets and compare them to hyperparameter-optimized versions on a disjoint meta-test benchmark with 90 datasets, as well as the GBDT-friendly benchmark by Grinsztajn et al. (2022). Our benchmark results on medium-to-large tabular datasets (1K–500K samples) show that RealMLP offers a favorable time-accuracy tradeoff compared to other neural baselines and is competitive with GBDTs in terms of benchmark scores. Moreover, a combination of RealMLP and GBDTs with improved default parameters can achieve excellent results without hyperparameter tuning. Finally, we demonstrate that some of RealMLP's improvements can also considerably improve the performance of TabR with default parameters.

David Holzmüller, Léo Grinsztajn, Ingo Steinwart • 2024
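
The protocol described in the abstract (pick one set of default parameters on meta-train datasets, then compare it against per-dataset tuning on disjoint meta-test datasets) can be sketched roughly as follows. This is only an illustration under stated assumptions: the configuration names, dataset names, and error values are hypothetical toy numbers, the aggregation is a plain arithmetic mean, and the "tuned" baseline is an oracle that picks the best configuration per dataset. The paper's actual search procedure and aggregation metric may differ.

```python
from statistics import mean

# Hypothetical errors[config][dataset] on meta-train datasets (toy numbers, not from the paper).
meta_train_errors = {
    "config_a": {"d1": 0.20, "d2": 0.10, "d3": 0.30},
    "config_b": {"d1": 0.15, "d2": 0.12, "d3": 0.25},
    "config_c": {"d1": 0.25, "d2": 0.08, "d3": 0.35},
}

# Hypothetical errors on disjoint meta-test datasets (toy numbers, not from the paper).
meta_test_errors = {
    "config_a": {"t1": 0.18, "t2": 0.22},
    "config_b": {"t1": 0.16, "t2": 0.24},
    "config_c": {"t1": 0.21, "t2": 0.20},
}


def pick_default(errors_by_config):
    """Return the config with the lowest mean error across all meta-train datasets."""
    return min(errors_by_config, key=lambda c: mean(errors_by_config[c].values()))


def default_vs_tuned(errors_by_config, default):
    """Compare one fixed default against an oracle that picks the best config per dataset."""
    datasets = list(errors_by_config[default])
    default_err = mean(errors_by_config[default][d] for d in datasets)
    tuned_err = mean(
        min(errors_by_config[c][d] for c in errors_by_config) for d in datasets
    )
    return default_err, tuned_err


# The default is chosen once on meta-train and never changed afterwards.
default_config = pick_default(meta_train_errors)
default_err, tuned_err = default_vs_tuned(meta_test_errors, default_config)
print(f"default={default_config} default_err={default_err:.3f} tuned_err={tuned_err:.3f}")
```

The point of the split is that the default is selected without seeing any meta-test dataset, so the gap between the two averages indicates how much per-dataset tuning would still add on unseen data.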

Related benchmarks

Task	Dataset	Metric	Result	
Binary Classification	TALENT (test)	Top-1 Accuracy	7.41	113
Binary Classification	TabArena	Elo Rating	1.49e+3	74
Multiclass Classification	TabArena Lite	Elo Rating	1.36e+3	63
Tabular Learning	TabArena	Elo	1.51e+3	54
Regression	TabArena Lite	Elo	1.72e+3	48
Multiclass Classification	TALENT	SGMε	10.7	42
Classification	Covertype	--	--	40
Tabular Prediction	TabArena all 51 datasets	Elo Rating	1.51e+3	38
Multiclass Classification	TALENT Multiclass (> 10 classes) Full (avg across datasets)	Rank	3.75	31
Regression	TALENT 100 datasets	Rank	8.2	28

