PIDForest: Anomaly Detection via Partial Identification
About
We consider the problem of detecting anomalies in a large dataset. We propose a framework called Partial Identification, which captures the intuition that anomalies are easy to distinguish from the overwhelming majority of points by relatively few attribute values. Formalizing this intuition, we propose a geometric anomaly measure for a point that we call PIDScore, which measures the minimum density of data points over all subcubes containing the point. We present PIDForest, a random-forest-based algorithm that finds anomalies based on this definition. We show that it performs favorably in comparison to several popular anomaly detection methods across a broad range of benchmarks. PIDForest also provides a succinct explanation for why a point is labelled anomalous, by providing a set of features, and ranges for them, that are relatively uncommon in the dataset.
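To make the PIDScore definition concrete, the sketch below computes a brute-force "minimum density over subcubes containing the point" on a tiny dataset. It is an illustration under stated assumptions, not the PIDForest algorithm: the data are assumed to lie in the unit cube, candidate subcubes are restricted to axis-aligned boxes whose edges fall on a uniform grid, and density is taken as the fraction of points in a box divided by the box volume. The names `min_density_score`, `intervals_containing`, and the `grid` parameter are hypothetical.

```python
from itertools import product


def intervals_containing(value, cuts):
    """All (lo, hi) pairs from `cuts` with lo < hi and lo <= value <= hi."""
    return [(lo, hi) for lo in cuts for hi in cuts if lo < hi and lo <= value <= hi]


def min_density_score(point, data, grid=4):
    """Minimum density over grid-aligned boxes in [0, 1]^d that contain `point`.

    Assumption: density = (fraction of data inside the box) / (box volume).
    A low score means some box around the point is unusually empty.
    """
    cuts = [i / grid for i in range(grid + 1)]
    per_axis = [intervals_containing(v, cuts) for v in point]
    best = float("inf")
    # Exhaustive search: exponential in the dimension, fine only for toy data.
    for box in product(*per_axis):
        volume = 1.0
        for lo, hi in box:
            volume *= hi - lo
        inside = sum(
            all(lo <= row[j] <= hi for j, (lo, hi) in enumerate(box))
            for row in data
        )
        best = min(best, inside / (len(data) * volume))
    return best


# Toy data: a tight cluster near (0.1, 0.1) plus one isolated point.
data = [
    (0.10, 0.10), (0.12, 0.08), (0.15, 0.11),
    (0.09, 0.14), (0.11, 0.12), (0.90, 0.90),
]
outlier_score = min_density_score((0.90, 0.90), data)
inlier_score = min_density_score((0.10, 0.10), data)
print(outlier_score < inlier_score)
```

The isolated point receives the lower score because a large box that excludes the cluster on a single coordinate already contains almost no other data. The box attaining the minimum also names the features and ranges that make the point unusual, which is the kind of explanation the abstract describes.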
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Anomaly Detection | Shuttle | AUC | 0.864 | 39 |
| Anomaly Detection | Pageblocks | AUC-ROC | 0.851 | 32 |
| Anomaly Detection | Fraud | AUC-PR | 0.186 | 21 |
| Anomaly Detection | R8 | AUC-ROC | 88.1 | 10 |
| Anomaly Detection | COVER | AUC-ROC | 0.939 | 10 |
| Anomaly Detection | Exploits | AUC-ROC | 79.7 | 10 |
| Anomaly Detection | Analysis | AUC-ROC | 0.82 | 10 |
| Anomaly Detection | Backdoor | AUC-ROC | 0.808 | 10 |
| Anomaly Detection | DOS | AUC-ROC | 0.802 | 10 |