CatBoost: unbiased boosting with categorical features
About
This paper presents the key algorithmic techniques behind CatBoost, a new gradient boosting toolkit. Their combination leads to CatBoost outperforming other publicly available boosting implementations in terms of quality on a variety of datasets. Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms. In this paper, we provide a detailed analysis of this problem and demonstrate that proposed algorithms solve it effectively, leading to excellent empirical results.
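The categorical-feature technique described above can be illustrated with ordered target statistics: each example's category value is encoded using only the targets of examples that precede it in a random permutation, so an example's own target never leaks into its encoding. The sketch below is a minimal, hypothetical illustration of that idea (the function name, the prior weight `a`, and the smoothing formula are assumptions for exposition, not the library's actual API):

```python
import random

def ordered_target_statistics(categories, targets, prior=0.5, a=1.0, seed=0):
    """Illustrative sketch of ordered target statistics: encode a
    categorical feature so each example only 'sees' the targets of
    examples preceding it in a random permutation, avoiding the
    target leakage that causes prediction shift."""
    n = len(categories)
    perm = list(range(n))
    random.Random(seed).shuffle(perm)  # the permutation that defines "history"
    sums = {}    # running sum of targets per category value
    counts = {}  # running count of examples per category value
    encoded = [0.0] * n
    for idx in perm:
        c = categories[idx]
        s = sums.get(c, 0.0)
        k = counts.get(c, 0)
        # smoothed estimate from "past" examples only; a weights the prior
        encoded[idx] = (s + a * prior) / (k + a)
        # update history *after* encoding, so targets[idx] is excluded
        sums[c] = s + targets[idx]
        counts[c] = k + 1
    return encoded

enc = ordered_target_statistics(["x", "x", "y", "y"], [1, 0, 1, 0])
```

Note that the first example of each category (in permutation order) receives only the prior, since it has no history; CatBoost's ordered boosting applies the same permutation-driven principle to the gradient estimates themselves.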
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Tabular Classification | 75 Tabular Classification Datasets (test) | Accuracy | 72.64 | 89 |
| Tabular Regression | 52 Tabular Datasets (test) | NMAE | 0.158 | 85 |
| Classification | 33 datasets, missing rate <= 10% (test) | AUC | 86.42 | 65 |
| Classification | 10 datasets, missing rate > 10% (test) | AUC | 80.34 | 50 |
| Regression | CA Housing | RMSE | 0.4303 | 45 |
| Classification | HI | Accuracy | 0.564 | 45 |
| Classification | HE | Accuracy | 38.46 | 38 |
| Aggregate Tabular Benchmarking | Aggregate | Avg Rank | 7.44 | 33 |
| Binary Classification | Higgs (test) | AUC | 84.5425 | 30 |
| Tabular Data Classification | UCI machine learning repository, 21 datasets (test) | Median Rank | 14 | 29 |