
Privacy Violation Detection on PKU-SafeRLHF

87.5 Acc

Dual-agent

[Chart: axis ticks 68.78, 73.64, 78.5, 83.36; date shown: Dec 1, 2025]
Updated 4d ago

Evaluation Results

Date     Acc
2025.12  87.5  83.2  94    88.3  88    0.713  0.708
2025.12  87    82.5  94    87.9  89.8  0.736  0.75
2025.12  82    90    72    80    83.5  0.644  0.668
2025.12  82    90    72    80    87.8  0.716  0.707
2025.12  81.5  81.2  82    81.6  81.9  0.63   0.63
2025.12  79.5  86.4  70    77.3  80.5  0.601  0.585
2025.12  74.5  67.4  95    78.8  85.4  0.652  0.651
2025.12  70    86.4  49.5  63    70.6  0.452  0.452
2025.12  69.5  64    93.2  75.9  81.5  0.545  0.535