SMOTE: Synthetic Minority Over-sampling Technique
About
An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world datasets are predominantly composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that the same combination can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
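The abstract does not spell out how the synthetic examples are built. As SMOTE is commonly described, each new point is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours. The sketch below illustrates that idea together with plain random under-sampling of the majority class. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names `smote` and `undersample`, the parameters `n_per_sample`, `k`, and `seed`, the use of Euclidean distance, and the toy data are all hypothetical choices made for this example.

```python
import random


def smote(minority, n_per_sample, k=5, seed=0):
    """Return synthetic points interpolated between each minority sample
    and randomly chosen members of its k nearest minority neighbours.

    Assumes numeric features and Euclidean distance (illustrative choice).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for i, p in enumerate(minority):
        # Indices of the k nearest other minority samples to p.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sq_dist(p, minority[j]))
        neighbours = others[:k]
        for _ in range(n_per_sample):
            q = minority[rng.choice(neighbours)]
            gap = rng.random()  # position along the segment from p towards q
            synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    return synthetic


def undersample(majority, n_keep, seed=0):
    """Randomly keep at most n_keep majority samples, without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, min(n_keep, len(majority)))


# Toy data (made up for illustration): 4 minority and 40 majority points.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
majority = [[float(i), float(i % 5)] for i in range(40)]

synthetic = smote(minority, n_per_sample=2, k=2)
balanced_minority = minority + synthetic      # 4 original + 8 synthetic
kept_majority = undersample(majority, n_keep=20)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies inside the bounding box of the minority class. The ratio between `n_per_sample` and `n_keep` sets how strongly the two classes are rebalanced, which is the knob the abstract refers to when it combines over-sampling with under-sampling.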
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Node Classification | Cora (test) | Mean Accuracy | 71.8 | 687 |
| Node Classification | PubMed (test) | Accuracy | 68.5 | 500 |
| Node Classification | Cora standard (test) | Accuracy | 73.24 | 130 |
| Tabular Data Synthesis Fidelity | biodeg | KS Statistic (Mean) | 0.53 | 90 |
| Tabular Data Synthesis Fidelity | steel | KS Statistic (Mean) | 0.65 | 90 |
| Tabular Data Synthesis Fidelity | fourier | KS Fidelity | 0.89 | 88 |
| Tabular Data Synthesis Fidelity | PROTEIN | Mean KS Statistic | 0.88 | 88 |
| Node Classification | Cora | AUC | 93.2 | 65 |
| Tabular Data Synthesis Fidelity | Texture | KS Statistic (Mean) | 0.9 | 64 |
| Tabular Data Synthesis | fourier | Chi-squared Result | 0.42 | 48 |