Focused PU learning from imbalanced data

About

We propose a new method for learning from positive and unlabeled (PU) examples in highly imbalanced datasets. Many real-world problems, such as disease gene identification, targeted marketing, fraud detection, and recommender systems, are hard to address with machine learning methods because labeled data is limited. Often, the training data comprises positive and unlabeled instances, where the unlabeled set is dominated by negatives but also includes some positives. While PU learning is well studied, few methods address imbalanced settings or hard-to-detect positive examples that resemble negative ones. Our approach uses a focused empirical risk estimator that incorporates both positive and unlabeled examples to train binary classifiers. Empirical evaluations demonstrate state-of-the-art performance on imbalanced datasets under two labeling mechanisms: positives selected completely at random (SCAR) and positives selected at random (SAR). Beyond these controlled experiments, we demonstrate the value of the proposed method in the real-world application of financial misstatement detection.
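The two labeling mechanisms named in the abstract can be illustrated with a small simulation. This is a minimal sketch of the standard definitions, not the authors' experimental protocol: under SCAR every true positive is labeled with the same probability, while under SAR the labeling probability may differ per instance. The function names, the `label_frequency` and `propensity` parameters, and the logistic propensity used in the usage lines are all hypothetical choices made for illustration.

```python
import numpy as np


def scar_labels(y, label_frequency, rng):
    """SCAR: label each true positive with one constant probability."""
    y = np.asarray(y)
    s = np.zeros(y.shape, dtype=int)  # s = 1 means "labeled positive"
    pos = y == 1
    # Negatives are never labeled; positives are labeled independently.
    s[pos] = rng.random(pos.sum()) < label_frequency
    return s


def sar_labels(y, propensity, rng):
    """SAR: label each true positive with an instance-specific probability."""
    y = np.asarray(y)
    propensity = np.asarray(propensity, dtype=float)
    s = np.zeros(y.shape, dtype=int)
    pos = y == 1
    s[pos] = rng.random(pos.sum()) < propensity[pos]
    return s


# Usage on synthetic data (assumed setup: roughly 5% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (rng.random(1000) < 0.05).astype(int)

s_scar = scar_labels(y, 0.5, rng)
# Hypothetical propensity: positives with a larger first feature
# are more likely to be labeled.
propensity = 1.0 / (1.0 + np.exp(-X[:, 0]))
s_sar = sar_labels(y, propensity, rng)
```

In both cases the resulting training set consists of the labeled positives (`s == 1`) and an unlabeled pool (`s == 0`) that mixes the remaining positives with all negatives.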

Elias Zavitsanos, Georgios Paliouras • 2026

Related benchmarks

Task | Dataset | Result | Rank
Positive-Unlabeled Classification | 14 imbalanced datasets SCAR assumption macro-averaged | ROC AUC 0.87 | 36
Positive-Unlabeled Classification | 14 imbalanced datasets SAR - 75% labeled | ROC AUC 0.87 | 12
Positive-Unlabeled Classification | 14 imbalanced datasets SAR - 25% labeled | ROC AUC 0.8 | 12
Positive-Unlabeled Classification | 14 imbalanced datasets SAR - 50% labeled | ROC AUC 84 | 12
Financial Misstatement Detection | Financial Misstatement Dataset 2003-2014 (test) | R-precision 19.95 | 9
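
For readers unfamiliar with the R-precision metric reported for financial misstatement detection, here is a minimal sketch of its usual definition: rank all test instances by classifier score and measure the fraction of true positives among the top R, where R is the total number of true positives. The function name and inputs are hypothetical, and the function returns a fraction between 0 and 1; the page does not state the scale of the reported value.

```python
import numpy as np


def r_precision(y_true, scores):
    """Fraction of true positives among the R highest-scored instances,
    where R is the number of true positives in y_true."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    r = int(y_true.sum())
    if r == 0:
        raise ValueError("R-precision is undefined without positive examples")
    # Indices of the R highest scores (stable sort keeps ties in input order).
    top_r = np.argsort(-scores, kind="stable")[:r]
    return float(y_true[top_r].sum() / r)


# Two positives, one of which is ranked in the top two.
example = r_precision([1, 0, 1, 0, 0], [0.9, 0.8, 0.3, 0.2, 0.1])
```

Because the cutoff R adapts to the number of positives, the metric remains informative on highly imbalanced test sets where only a handful of instances are positive.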